Data Sets

In this example, we explore how to customize the data sets that come with the canonical_sets package. Note that they all inherit from BaseData, so much of the relevant documentation is available there.

[1]:
from canonical_sets.data import Adult

data = Adult()

With the default settings, we have access to the training, validation, and test data and the corresponding labels, all of which are attributes of the data object.

[2]:
data.train_data.head()
[2]:
Age fnlwgt Education-Num Capital Gain Capital Loss Hours per week Workclass+Federal-gov Workclass+Local-gov Workclass+Private Workclass+Self-emp-inc ... Country+Portugal Country+Puerto-Rico Country+Scotland Country+South Country+Taiwan Country+Thailand Country+Trinadad&Tobago Country+United-States Country+Vietnam Country+Yugoslavia
0 0.123288 -0.950895 0.066667 -1.0 -1.000000 0.000000 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 -0.726027 -0.621532 0.066667 -1.0 -1.000000 -0.102041 0 0 1 0 ... 0 0 0 0 0 0 0 1 0 0
2 -0.150685 -0.874857 -0.466667 -1.0 -1.000000 -0.204082 0 0 1 0 ... 0 0 0 0 0 0 0 1 0 0
3 -0.561644 -0.787375 0.066667 -1.0 -1.000000 -0.102041 0 0 1 0 ... 0 0 0 0 0 0 0 1 0 0
4 -0.013699 -0.694464 0.333333 -1.0 -0.318182 -0.204082 0 0 1 0 ... 0 0 0 0 0 0 0 1 0 0

5 rows × 104 columns

[3]:
data.train_labels.head()
[3]:
<=50K >50K
0 1 0
1 1 0
2 0 1
3 1 0
4 1 0

We can change the proportion of the validation and test data. Note, however, that the Adult data is a special case: its test data is fixed and downloaded separately.

[4]:
data = Adult()
print(len(data.val_data))

data = Adult(val_prop=0.5)
print(len(data.val_data))
6033
15081
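
The two sizes printed above are consistent with a validation set carved out of a pool of 30,162 training rows, using a default proportion of 0.2 that is rounded up. Both the pool size and the default value are inferred from the output, not taken from the canonical_sets documentation, so the following is only a rough sketch of the arithmetic. The helper `split_sizes` is hypothetical and not part of the package.

```python
import math


def split_sizes(n_rows, val_prop):
    """Return (n_train, n_val) for a pool of n_rows.

    Assumption: the validation size is n_rows * val_prop rounded up,
    which matches the sizes printed above but is not confirmed by the
    canonical_sets documentation.
    """
    n_val = math.ceil(n_rows * val_prop)
    return n_rows - n_val, n_val


POOL = 30162  # inferred: 15081 / 0.5

print(split_sizes(POOL, 0.2))  # (24129, 6033)
print(split_sizes(POOL, 0.5))  # (15081, 15081)
```

A larger val_prop therefore shrinks the training data by the same number of rows that it adds to the validation data.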

We can also control which features are selected.

[5]:
data = Adult(features=["Age", "Sex"])
data.train_data.head()
[5]:
Age Sex+Female Sex+Male
0 0.123288 0 1
1 -0.726027 1 0
2 -0.150685 0 1
3 -0.561644 1 0
4 -0.013699 0 1

We can also choose how the numerical features are scaled, and which separator joins the name of a categorical feature to the category itself.

[6]:
from sklearn.preprocessing import StandardScaler

data = Adult(features=["Age", "Sex"], scaler=StandardScaler(), prefix_sep="*")
data.train_data.head()
[6]:
Age Sex*Female Sex*Male
0 1.489612 0 1
1 -0.867884 1 0
2 0.729129 0 1
3 -0.411595 1 0
4 1.109370 0 1
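
To make the two options above more concrete, here is a minimal sketch in plain Python of what they do to a column. StandardScaler centers a column on its mean and divides by its population standard deviation, and prefix_sep appears to be the string placed between the feature name and the category in the one-hot column names. The helpers `standardize` and `one_hot` are hypothetical illustrations, not the way canonical_sets implements its preprocessing.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(column, values, prefix_sep="+"):
    """One-hot encode values, naming columns '<column><prefix_sep><category>'."""
    categories = sorted(set(values))
    return [
        {f"{column}{prefix_sep}{c}": int(v == c) for c in categories}
        for v in values
    ]


# Toy input taken from the first rows of the unprocessed Adult data.
ages = [39, 50, 38, 53, 28]
sexes = ["Male", "Male", "Male", "Male", "Female"]

scaled_ages = standardize(ages)
encoded_sexes = one_hot("Sex", sexes, prefix_sep="*")

print(encoded_sexes[0])  # {'Sex*Female': 0, 'Sex*Male': 1}
print(encoded_sexes[4])  # {'Sex*Female': 1, 'Sex*Male': 0}
```

The scaled values differ from the ones in the output above because the sketch computes the mean and standard deviation from five rows only.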

We can get the data without any preprocessing or (additional) splitting.

[7]:
data = Adult(preprocess=False)
data.train_data.head()
[7]:
Age Workclass fnlwgt Education Education-Num Martial Status Occupation Relationship Race Sex Capital Gain Capital Loss Hours per week Country Target
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

We can also group categories together, for example all countries other than the US (using the previous data to make the listing easier).

[8]:
others = list(data.train_data.Country.unique())
others.remove("United-States")
groups = {"Country": dict.fromkeys(others, "Others")}

data = Adult(features=["Age", "Sex", "Country"], scaler=StandardScaler(), prefix_sep="*",
             groups=groups)
data.train_data.head()
[8]:
Age Sex*Female Sex*Male Country*Others Country*United-States
0 1.489612 0 1 0 1
1 -0.867884 1 0 0 1
2 0.729129 0 1 0 1
3 -0.411595 1 0 0 1
4 1.109370 0 1 0 1
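
The groups argument built in the cell above is a nested dictionary: the outer key is the feature name, and the inner dictionary maps each original category to the name of its group. The sketch below shows, in plain Python and on a toy list of countries, how such a mapping can be built and applied. It assumes that categories missing from the mapping are left unchanged, which matches the output above but is not confirmed by the canonical_sets documentation. The variable names are illustrative only.

```python
# Toy column; the real data has many more countries.
countries = ["United-States", "Cuba", "Mexico", "United-States", "Cuba"]

# Every category except the one we want to keep separate.
others = sorted(set(countries) - {"United-States"})

# Same shape as the `groups` argument in the cell above.
groups = {"Country": dict.fromkeys(others, "Others")}

# Apply the mapping; categories not listed keep their original name
# (assumption about the package's behavior).
mapping = groups["Country"]
grouped = [mapping.get(country, country) for country in countries]

print(groups)   # {'Country': {'Cuba': 'Others', 'Mexico': 'Others'}}
print(grouped)  # ['United-States', 'Others', 'Others', 'United-States', 'Others']
```

One-hot encoding the grouped column then yields only two columns, one for Others and one for United-States, as in the output above.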