I have a dataframe of the form, df:
   cat_var_1 cat_var_2  num_var_1
0     Orange    Monkey         34
1     Banana       Cat         56
2     Orange       Dog         22
3     Banana    Monkey          6
..
Suppose the possible values of cat_var_1 in the dataset have the ratios {'Orange': 0.6, 'Banana': 0.4} and the possible values of cat_var_2 have the ratios {'Monkey': 0.2, 'Cat': 0.7, 'Dog': 0.1}.
How do I split the data into train, test and validation sets (60:20:20 split) such that the ratios of the categorical variables are preserved? In practice, there can be any number of such variables, not just two. Also, the exact ratios may clearly never be achieved in practice, but we would like them to be as close as possible.
I have looked into the StratifiedKFold method from sklearn, described in "How to split a dataset into training and validation set keeping ratio between classes?", but it is restricted to stratifying on one categorical variable only.
Additionally, I would be grateful if you could state the complexity of the solution you propose.
1 Answer
You can pass df.cat_var_1 + "_" + df.cat_var_2 as the y argument of StratifiedShuffleSplit.split(), so that each combination of categories is treated as one class.
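A minimal sketch of the combined-label idea follows. It is an illustration, not necessarily the exact code the answer had in mind: the sample data, the seeds, and the two-stage split (60% train first, then the remaining 40% halved into test and validation) are assumptions. Note that StratifiedShuffleSplit raises an error if any combined class has fewer than two rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical sample data with roughly the ratios from the question.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "cat_var_1": rng.choice(["Orange", "Banana"], size=n, p=[0.6, 0.4]),
    "cat_var_2": rng.choice(["Monkey", "Cat", "Dog"], size=n, p=[0.2, 0.7, 0.1]),
    "num_var_1": rng.integers(0, 100, size=n),
})

# One combined label per row, e.g. "Orange_Monkey".
strata = (df["cat_var_1"] + "_" + df["cat_var_2"]).to_numpy()

# Stage 1: 60% train, 40% held out. X is only a placeholder here.
outer = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_pos, rest_pos = next(outer.split(np.zeros(n), strata))

# Stage 2: halve the held-out 40% into test and validation.
inner = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
test_rel, val_rel = next(inner.split(np.zeros(len(rest_pos)), strata[rest_pos]))
test_pos, val_pos = rest_pos[test_rel], rest_pos[val_rel]

train, test, validate = df.iloc[train_pos], df.iloc[test_pos], df.iloc[val_pos]
print(len(train), len(test), len(validate))
```

With more than two categorical variables, the same approach should work by joining all of the relevant columns into the combined label.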
Alternatively, here is a method that uses DataFrame.groupby:
import random

import numpy as np
import pandas as pd

nrows = 10000
p1 = {'Orange': 0.6, 'Banana': 0.4}
p2 = {'Monkey': 0.2, 'Cat': 0.7, 'Dog': 0.1}

# Build sample columns with the target ratios, then shuffle them independently.
c1 = [key for key, val in p1.items() for i in range(int(nrows * val))]
c2 = [key for key, val in p2.items() for i in range(int(nrows * val))]
random.shuffle(c1)
random.shuffle(c2)

df = pd.DataFrame({"c1": c1, "c2": c2, "val": np.random.randint(0, 100, nrows)})

# Shuffle the row labels of each (c1, c2) group and cut them 60/20/20.
index = []
for key, idx in df.groupby(["c1", "c2"]).groups.items():
    arr = idx.values.copy()
    np.random.shuffle(arr)
    n_train = int(0.6 * len(arr))
    n_train_test = int(0.8 * len(arr))
    index.append(np.split(arr, [n_train, n_train_test]))

# Concatenate the per-group pieces into one index array per split.
idx_train, idx_test, idx_validate = list(map(np.concatenate, zip(*index)))
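Since the question asks for any number of categorical variables, the groupby approach above could be wrapped in a small helper. The sketch below is one possible way to do that; the function name stratified_split, its parameters, and the sample data are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

def stratified_split(df, cols, fracs=(0.6, 0.2, 0.2), seed=0):
    """Return one array of row labels per fraction, stratified on `cols`."""
    rng = np.random.default_rng(seed)
    cuts = np.cumsum(fracs)[:-1]              # cumulative cut points
    parts = [[] for _ in fracs]
    for _, idx in df.groupby(list(cols)).groups.items():
        arr = idx.to_numpy().copy()
        rng.shuffle(arr)                      # shuffle within the group
        bounds = (cuts * len(arr)).astype(int)
        for part, chunk in zip(parts, np.split(arr, bounds)):
            part.append(chunk)
    return [np.concatenate(p) for p in parts]

# Hypothetical sample data with roughly the ratios from the question.
rng = np.random.default_rng(1)
n = 10_000
df = pd.DataFrame({
    "c1": rng.choice(["Orange", "Banana"], size=n, p=[0.6, 0.4]),
    "c2": rng.choice(["Monkey", "Cat", "Dog"], size=n, p=[0.2, 0.7, 0.1]),
    "val": rng.integers(0, 100, size=n),
})
idx_train, idx_test, idx_validate = stratified_split(df, ["c1", "c2"])

# Compare the ratios of c2 in the full data and in the training part.
print(df["c2"].value_counts(normalize=True).round(3))
print(df.loc[idx_train, "c2"].value_counts(normalize=True).round(3))
```

On complexity: each row is grouped once, shuffled once within its group, and concatenated once, so the work should be roughly linear in the number of rows, plus whatever pandas spends ordering the distinct group keys. Because the cut points are truncated per group, the split sizes can fall a few rows short of exactly 60:20:20.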