Data Sets
In this example, we explore how to customize the data sets that come with the canonical_sets
package. Note that they all inherit from BaseData,
and that much of the documentation can be found there.
[1]:
from canonical_sets.data import Adult
data = Adult()
We can use the default settings and then access the training, validation, and test data and the corresponding labels, all of which are attributes of the data object.
[2]:
data.train_data.head()
[2]:
| | Age | fnlwgt | Education-Num | Capital Gain | Capital Loss | Hours per week | Workclass+Federal-gov | Workclass+Local-gov | Workclass+Private | Workclass+Self-emp-inc | ... | Country+Portugal | Country+Puerto-Rico | Country+Scotland | Country+South | Country+Taiwan | Country+Thailand | Country+Trinadad&Tobago | Country+United-States | Country+Vietnam | Country+Yugoslavia |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.123288 | -0.950895 | 0.066667 | -1.0 | -1.000000 | 0.000000 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | -0.726027 | -0.621532 | 0.066667 | -1.0 | -1.000000 | -0.102041 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | -0.150685 | -0.874857 | -0.466667 | -1.0 | -1.000000 | -0.204082 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | -0.561644 | -0.787375 | 0.066667 | -1.0 | -1.000000 | -0.102041 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | -0.013699 | -0.694464 | 0.333333 | -1.0 | -0.318182 | -0.204082 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 104 columns
[3]:
data.train_labels.head()
[3]:
| | <=50K | >50K |
---|---|---|
0 | 1 | 0 |
1 | 1 | 0 |
2 | 0 | 1 |
3 | 1 | 0 |
4 | 1 | 0 |
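The labels are one-hot encoded, with one column per class. As a minimal sketch of how such rows map back to class names, the snippet below uses the five rows copied from the output above. The helper `decode_one_hot` is hypothetical and not part of canonical_sets.

```python
# Column names and rows copied from the label output shown above.
columns = ["<=50K", ">50K"]
rows = [
    [1, 0],
    [1, 0],
    [0, 1],
    [1, 0],
    [1, 0],
]


def decode_one_hot(row, names):
    """Return the name of the single column set to 1 (hypothetical helper)."""
    if sum(row) != 1:
        raise ValueError(f"expected exactly one active column, got {row}")
    return names[row.index(1)]


decoded = [decode_one_hot(row, columns) for row in rows]
print(decoded)  # ['<=50K', '<=50K', '>50K', '<=50K', '<=50K']
```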
We can change the proportion of the validation and test data. Note, however, that the Adult
data is a special case, as the test data is fixed and downloaded separately.
[4]:
data = Adult()
print(len(data.val_data))
data = Adult(val_prop=0.5)
print(len(data.val_data))
6033
15081
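The two printed lengths can be checked with a little arithmetic. The sketch below rests on inferences rather than documented behaviour: the pool size of 30162 rows is derived from the `val_prop=0.5` output, the default proportion is assumed to be 0.2, and the split size is assumed to be rounded up, as scikit-learn's `train_test_split` does for a fractional `test_size`. The function `val_size` is a hypothetical helper.

```python
import math


def val_size(pool_size, val_prop):
    """Validation size under the assumed round-up rule (hypothetical helper)."""
    return math.ceil(val_prop * pool_size)


# Inferred from the output above: 15081 rows correspond to half of the pool.
pool_size = 15081 * 2  # 30162 rows available before the validation split

# Assumption: the default val_prop is 0.2; this is not stated in the text.
assumed_default_val_prop = 0.2

print(val_size(pool_size, assumed_default_val_prop))  # 6033
print(val_size(pool_size, 0.5))                       # 15081
```

Both values match the lengths printed above, which is consistent with the assumptions but does not prove them.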
We can also control which features are selected.
[5]:
data = Adult(features=["Age", "Sex"])
data.train_data.head()
[5]:
| | Age | Sex+Female | Sex+Male |
---|---|---|---|
0 | 0.123288 | 0 | 1 |
1 | -0.726027 | 1 | 0 |
2 | -0.150685 | 0 | 1 |
3 | -0.561644 | 1 | 0 |
4 | -0.013699 | 0 | 1 |
We can also set the scaling of the numerical features and the separator placed between the name of a categorical feature and the category itself.
[6]:
from sklearn.preprocessing import StandardScaler
data = Adult(features=["Age", "Sex"], scaler=StandardScaler(), prefix_sep="*")
data.train_data.head()
[6]:
| | Age | Sex*Female | Sex*Male |
---|---|---|---|
0 | 1.489612 | 0 | 1 |
1 | -0.867884 | 1 | 0 |
2 | 0.729129 | 0 | 1 |
3 | -0.411595 | 1 | 0 |
4 | 1.109370 | 0 | 1 |
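The separator matters when a one-hot column name has to be split back into its feature and category. The following is a minimal, dependency-free sketch of that naming scheme. Both helpers are hypothetical and only mimic the column names seen in the outputs above (`+` in the default output, `*` here); they are not the package's implementation.

```python
def one_hot_name(feature, category, prefix_sep="+"):
    """Build a column name such as 'Sex+Male' (hypothetical helper)."""
    return f"{feature}{prefix_sep}{category}"


def split_name(column, prefix_sep="+"):
    """Recover (feature, category) by splitting on the first separator."""
    feature, sep, category = column.partition(prefix_sep)
    if not sep:
        raise ValueError(f"{column!r} does not contain {prefix_sep!r}")
    return feature, category


print(one_hot_name("Sex", "Male", prefix_sep="*"))  # Sex*Male
print(split_name("Sex*Male", prefix_sep="*"))       # ('Sex', 'Male')
print(split_name("Country+United-States"))          # ('Country', 'United-States')
```

Splitting on the first occurrence only works if the separator never appears in a feature name, which is one reason to pick a character that the data itself does not use.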
We can get the data without any preprocessing or (additional) splitting.
[7]:
data = Adult(preprocess=False)
data.train_data.head()
[7]:
| | Age | Workclass | fnlwgt | Education | Education-Num | Martial Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Country | Target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
We can also group categories together, for example all countries other than the US (using the previous, unprocessed data to make the listing easier).
[8]:
others = list(data.train_data.Country.unique())
others.remove("United-States")
groups = {"Country": dict.fromkeys(others, "Others")}
data = Adult(features=["Age", "Sex", "Country"], scaler=StandardScaler(), prefix_sep="*",
             groups=groups)
data.train_data.head()
[8]:
| | Age | Sex*Female | Sex*Male | Country*Others | Country*United-States |
---|---|---|---|---|---|
0 | 1.489612 | 0 | 1 | 0 | 1 |
1 | -0.867884 | 1 | 0 | 0 | 1 |
2 | 0.729129 | 0 | 1 | 0 | 1 |
3 | -0.411595 | 1 | 0 | 0 | 1 |
4 | 1.109370 | 0 | 1 | 0 | 1 |
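To make the shape of the `groups` argument explicit, here is a small sketch of what `dict.fromkeys` produces and how such a mapping could be applied to a column. The short country list is an illustrative subset, and the lookup that leaves unlisted categories unchanged is an assumption about the grouping behaviour, not the package's actual code.

```python
# Illustrative subset of the country values; the real column has many more.
countries = ["United-States", "Cuba", "Portugal", "Vietnam"]

others = [country for country in countries if country != "United-States"]

# dict.fromkeys maps every listed category to the same replacement label.
mapping = dict.fromkeys(others, "Others")
groups = {"Country": mapping}
print(groups)
# {'Country': {'Cuba': 'Others', 'Portugal': 'Others', 'Vietnam': 'Others'}}

# Assumption: categories missing from the mapping are kept as they are.
column = ["United-States", "Cuba", "United-States", "Vietnam"]
grouped = [groups["Country"].get(value, value) for value in column]
print(grouped)  # ['United-States', 'Others', 'United-States', 'Others']
```

The outer key names the feature and the inner dictionary maps each original category to its group, so several features can be grouped in one call by adding more outer keys.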