I have a training set of 168 instances, 12 measurements each, and 2 possible categories (7 sick / 161 healthy). The data are arranged in a 168x13 table, where the 13th column is a categorical with either 'sick' or 'healthy.' I've been trying out the MATLAB R2015b Classification Learner app with 5- and 10-fold cross-validation and various trees/SVMs/KNNs.
I have 2 sets of features for 2 classes, and the number of instances in the 2nd class outnumbers the first class 10:1, which seems to shift the model toward a lower TPR and higher TNR. I can fix this by growing the first class with repeats of itself (class1=[class1;class1;class1]), but I'm wondering if this is a bad solution that leads to overfitting or something similar. Is there a better strategy? Or is it better to just pick the best model without doing anything to the smaller class?
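Duplicating the minority class is simple random oversampling; an alternative that avoids repeating rows is to make misclassifying the minority class more expensive via the classifier's misclassification-cost matrix. A minimal command-line sketch (the Classification Learner app in R2015b doesn't expose this directly), assuming the 168x12 measurement matrix is `X` and the labels are in `Y` — both names are placeholders:

```matlab
% Cost-sensitive tree: Cost(i,j) is the cost of predicting class j
% when the true class is i. Here missing a 'sick' case costs 10x
% more than flagging a healthy one (10:1 to mirror the imbalance).
order = {'healthy','sick'};
cost  = [0 1; 10 0];               % rows/cols follow the order above
mdl = fitctree(X, Y, 'ClassNames', order, 'Cost', cost, ...
               'CrossVal', 'on', 'KFold', 5);
kfoldLoss(mdl)                     % cross-validated misclassification rate
```

The same `'Cost'` name-value pair works with `fitcsvm` and `fitcknn`, so you can compare it against the duplication approach under the same cross-validation folds.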
1 Answer
Your dataset is imbalanced. You can either create more 'sick' data points (oversampling) or drop some of the 'healthy' data points (undersampling) to balance the classes. Try SMOTE and similar techniques for this. Also, decision trees would be a good start; try limiting the depth of the tree so it doesn't overfit the small minority class.
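SMOTE generates synthetic minority samples by interpolating between existing minority points rather than duplicating them. R2015b has no built-in SMOTE, so here is a bare-bones sketch of the idea, simplified to interpolate between random minority pairs instead of k-nearest neighbors; `minority` is assumed to be the 7x12 block of 'sick' rows:

```matlab
function synth = smoteSketch(minority, nNew)
% Generate nNew synthetic rows by linear interpolation between
% random pairs of minority-class samples (simplified SMOTE:
% random pairs rather than k-nearest neighbors).
n = size(minority, 1);
synth = zeros(nNew, size(minority, 2));
for i = 1:nNew
    a = minority(randi(n), :);           % base sample
    b = minority(randi(n), :);           % random partner
    synth(i, :) = a + rand() * (b - a);  % random point on segment a-b
end
end
```

For the depth limit, `fitctree` accepts `'MinLeafSize'` (and, in R2015b, `'MaxNumSplits'`) name-value pairs to keep the tree shallow.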