Wednesday, March 7, 2018

How to achieve stratified K fold splitting for arbitrary number of categorical variables?

By Hường Hana, 12:00 PM. Tags: machine-learning, numpy, pandas, python, scikit-learn

I have a dataframe of the form, df:

       cat_var_1    cat_var_2    num_var_1
    0  Orange       Monkey       34
    1  Banana       Cat          56
    2  Orange       Dog          22
    3  Banana       Monkey       6
    ..

Suppose the possible values of cat_var_1 occur in the dataset with the ratios {'Orange': 0.6, 'Banana': 0.4}, and the possible values of cat_var_2 occur with the ratios {'Monkey': 0.2, 'Cat': 0.7, 'Dog': 0.1}.

How do I split the data into train, test and validation sets (60:20:20 split) such that the ratios of the categorical variables are preserved? In practice, there can be any number of these variables, not just two. Clearly, the exact ratios may not be achievable in practice, but we would like to get as near to them as possible.

I have looked into the StratifiedKFold method from sklearn, described in the question "How to split a dataset into training and validation set keeping ratio between classes?", but it is restricted to stratifying on a single categorical variable.

Additionally, I would be grateful if you could provide the complexity of the solution you achieve.

1 Answer

You can pass df.cat_var_1 + "_" + df.cat_var_2 as the y argument of StratifiedShuffleSplit.split(), so that every combination of category values is treated as a single class to stratify on.
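As a minimal sketch of that idea (not code from the original answer): the synthetic dataframe, the random seeds and the two-stage 60/40 then 50/50 split below are all assumptions made for illustration. StratifiedShuffleSplit only produces two parts per split, so a 60:20:20 split is approximated here by splitting twice on the same combined key.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic data (assumed for illustration) with the ratios from the question.
rng = np.random.default_rng(0)
n = 10000
df = pd.DataFrame({
    "cat_var_1": rng.choice(["Orange", "Banana"], size=n, p=[0.6, 0.4]),
    "cat_var_2": rng.choice(["Monkey", "Cat", "Dog"], size=n, p=[0.2, 0.7, 0.1]),
    "num_var_1": rng.integers(0, 100, size=n),
})

# One combined label per row; more columns can be appended the same way.
key = df["cat_var_1"] + "_" + df["cat_var_2"]

# Stage 1: 60% train, 40% held out, stratified on the combined label.
stage1 = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_pos, rest_pos = next(stage1.split(df, key))

# Stage 2: split the held-out 40% in half (20% test, 20% validation).
stage2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
test_rel, val_rel = next(stage2.split(rest_pos, key.iloc[rest_pos]))
test_pos, val_pos = rest_pos[test_rel], rest_pos[val_rel]

train, test, validate = df.iloc[train_pos], df.iloc[test_pos], df.iloc[val_pos]

# Compare the category ratios of one variable across the three parts.
for name, part in [("train", train), ("test", test), ("validate", validate)]:
    print(name, len(part), part["cat_var_1"].value_counts(normalize=True).round(3).to_dict())
```

Note that StratifiedShuffleSplit needs at least two rows in every combined class, so very rare combinations may have to be merged or handled separately.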

Alternatively, here is a method that uses DataFrame.groupby:

    import random

    import numpy as np
    import pandas as pd

    nrows = 10000
    p1 = {'Orange': 0.6, 'Banana': 0.4}
    p2 = {'Monkey': 0.2, 'Cat': 0.7, 'Dog': 0.1}

    # Build two independently shuffled columns with the requested ratios.
    c1 = [key for key, val in p1.items() for i in range(round(nrows * val))]
    c2 = [key for key, val in p2.items() for i in range(round(nrows * val))]
    random.shuffle(c1)
    random.shuffle(c2)

    df = pd.DataFrame({"c1": c1, "c2": c2, "val": np.random.randint(0, 100, nrows)})

    # Shuffle the row labels of every (c1, c2) group and cut each group 60:20:20.
    index = []
    for key, idx in df.groupby(["c1", "c2"]).groups.items():
        arr = idx.values.copy()
        np.random.shuffle(arr)
        cut1 = int(0.6 * len(arr))  # end of the first 60% of the group
        cut2 = int(0.8 * len(arr))  # end of the next 20% of the group
        index.append(np.split(arr, [cut1, cut2]))

    # Concatenate the per-group pieces into three arrays of row labels.
    idx_train, idx_test, idx_validate = list(map(np.concatenate, zip(*index)))
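Since the question asks about any number of categorical variables, the groupby approach can be wrapped in a small function. This is only a sketch: the name stratified_split, its parameters and the three-column example data are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

def stratified_split(df, cols, fractions=(0.6, 0.2, 0.2), seed=None):
    """Hypothetical helper: return one array of row labels per fraction,
    stratified on every combination of values in `cols`."""
    rng = np.random.default_rng(seed)
    # Cumulative cut points as fractions of a group, e.g. [0.6, 0.8].
    cuts = np.cumsum(fractions)[:-1]
    parts = [[] for _ in fractions]
    for labels in df.groupby(list(cols)).groups.values():
        arr = rng.permutation(labels.to_numpy())   # shuffled copy of the group's labels
        points = (cuts * len(arr)).astype(int)      # cut positions inside this group
        for part, chunk in zip(parts, np.split(arr, points)):
            part.append(chunk)
    return [np.concatenate(p) for p in parts]

# Example data with three categorical columns (assumed for illustration).
rng = np.random.default_rng(1)
n = 10000
data = pd.DataFrame({
    "c1": rng.choice(["Orange", "Banana"], size=n, p=[0.6, 0.4]),
    "c2": rng.choice(["Monkey", "Cat", "Dog"], size=n, p=[0.2, 0.7, 0.1]),
    "c3": rng.choice(["Red", "Blue"], size=n),
    "val": rng.integers(0, 100, size=n),
})

idx_tr, idx_te, idx_va = stratified_split(data, ["c1", "c2", "c3"], seed=0)

# Ratios of one variable in each part, next to the ratios in the full data.
print(data["c2"].value_counts(normalize=True).round(3).to_dict())
for idx in (idx_tr, idx_te, idx_va):
    print(len(idx), data.loc[idx, "c2"].value_counts(normalize=True).round(3).to_dict())
```

On complexity: the work is one grouping pass plus one shuffle per group, so it should scale roughly linearly with the number of rows. That is an estimate rather than a measured result. Because the cut points are truncated per group, groups with only a few rows can put nothing into the smaller parts, and the number of groups grows multiplicatively with each extra categorical variable.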
