Data Sets

In this example, we explore how to customize the data sets that come with the canonical_sets package. They all inherit from BaseData, whose documentation describes the available options in more detail.

[1]:

from canonical_sets.data import Adult

data = Adult()

With the default settings, the data object exposes the training, validation, and test data, along with the corresponding labels, as attributes.

[2]:

data.train_data.head()

[2]:

	Age	fnlwgt	Education-Num	Capital Gain	Capital Loss	Hours per week	Workclass+Private	...	Country+United-States
0	0.123288	-0.950895	0.066667	-1.0	-1.000000	0.000000	0	...	1
1	-0.726027	-0.621532	0.066667	-1.0	-1.000000	-0.102041	1	...	1
2	-0.150685	-0.874857	-0.466667	-1.0	-1.000000	-0.204082	1	...	1
3	-0.561644	-0.787375	0.066667	-1.0	-1.000000	-0.102041	1	...	1
4	-0.013699	-0.694464	0.333333	-1.0	-0.318182	-0.204082	1	...	1

5 rows × 104 columns

[3]:

data.train_labels.head()

[3]:

	<=50K	>50K
0	1	0
1	1	0
2	0	1
3	1	0
4	1	0

We can change the proportion of the validation and test data. Note, however, that the Adult data is a special case: its test data is fixed and downloaded separately.

[4]:

data = Adult()
print(len(data.val_data))

data = Adult(val_prop=0.5)
print(len(data.val_data))

6033
15081
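
The two lengths printed above are consistent with a validation split that takes a rounded-up fraction of the available training rows. The following is a minimal sketch of that arithmetic, not the package's implementation: the pool size of 30162 rows, the default proportion of 0.2, and the round-up rule are all assumptions, chosen only because they reproduce the printed values.

```python
import math

# Assumed values, inferred from the printed lengths (not taken from the package).
POOL_SIZE = 30162        # hypothetical number of rows available before the validation split
DEFAULT_VAL_PROP = 0.2   # hypothetical default for val_prop


def validation_size(n_rows: int, val_prop: float) -> int:
    """Number of validation rows if the split rounds up to a whole row."""
    return math.ceil(n_rows * val_prop)


print(validation_size(POOL_SIZE, DEFAULT_VAL_PROP))  # matches the first printed length
print(validation_size(POOL_SIZE, 0.5))               # matches the second printed length
```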

We can also control which features are selected.

[5]:

data = Adult(features=["Age", "Sex"])
data.train_data.head()

[5]:

	Age	Sex+Female	Sex+Male
0	0.123288	0	1
1	-0.726027	1	0
2	-0.150685	0	1
3	-0.561644	1	0
4	-0.013699	0	1

We can also set the scaler for the numerical features and the separator placed between the name of a categorical feature and the category itself.

[6]:

from sklearn.preprocessing import StandardScaler

data = Adult(features=["Age", "Sex"], scaler=StandardScaler(), prefix_sep="*")
data.train_data.head()

[6]:

	Age	Sex*Female	Sex*Male
0	1.489612	0	1
1	-0.867884	1	0
2	0.729129	0	1
3	-0.411595	1	0
4	1.109370	0	1
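
To make the two options concrete, here is a minimal pure-Python sketch of what they control. It is not the package's code: `standardize` and `one_hot` are hypothetical helpers, the input values are toy data, and the default separator `"+"` is an assumption inferred from the earlier column names. The standardization follows the definition used by scikit-learn's StandardScaler (zero mean, unit population standard deviation).

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(feature, values, prefix_sep="+"):
    """Build one indicator column per category, named feature + prefix_sep + category."""
    categories = sorted(set(values))
    return {
        f"{feature}{prefix_sep}{cat}": [int(v == cat) for v in values]
        for cat in categories
    }


# Toy values for illustration only.
ages = [39, 50, 38, 53, 28]
sexes = ["Male", "Male", "Male", "Male", "Female"]

scaled_ages = standardize(ages)
sex_columns = one_hot("Sex", sexes, prefix_sep="*")
print(list(sex_columns))  # column names use "*" as the separator
```

Changing the scaler only affects the numerical columns, while the separator only affects how the indicator columns are named.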

We can get the data without any preprocessing or (additional) splitting.

[7]:

data = Adult(preprocess=False)
data.train_data.head()

[7]:

	Age	Workclass	fnlwgt	Education	Education-Num	Martial Status	Occupation	Relationship	Race	Sex	Capital Gain	Hours per week	Country	Target
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

We can also group categories together, for example all countries other than the US (the unprocessed data from the previous cell makes it easy to list them).

[8]:

others = list(data.train_data.Country.unique())
others.remove("United-States")
groups = {"Country": dict.fromkeys(others, "Others")}

data = Adult(
    features=["Age", "Sex", "Country"],
    scaler=StandardScaler(),
    prefix_sep="*",
    groups=groups,
)
data.train_data.head()

[8]:

	Age	Sex*Female	Sex*Male	Country*United-States
0	1.489612	0	1	1
1	-0.867884	1	0	1
2	0.729129	0	1	1
3	-0.411595	1	0	1
4	1.109370	0	1	1
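
The shape of the groups argument can be easier to see on a small example. The sketch below uses a toy list of countries in place of data.train_data.Country.unique() and shows the nested dictionary that dict.fromkeys produces. The last step, which applies the mapping to a column, is only an assumption about how such a mapping could be used; it is not taken from the package.

```python
# Toy category list standing in for data.train_data.Country.unique().
countries = ["United-States", "Cuba", "Mexico", "India"]

# Every country except the US is mapped to the same group label.
others = [c for c in countries if c != "United-States"]
groups = {"Country": dict.fromkeys(others, "Others")}
print(groups)

# Hypothetical application of the mapping: unlisted categories keep their name.
column = ["United-States", "Cuba", "India", "United-States"]
mapping = groups["Country"]
grouped = [mapping.get(value, value) for value in column]
print(grouped)
```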