Monday, May 14, 2018

How to extract decision rules (feature splits) from an xgboost model in Python 3?


I need to extract the decision rules from my fitted xgboost model in Python. I am using version 0.6a2 of the xgboost library, and my Python version is 3.5.2.

My ultimate goal is to use those splits to bin variables according to them.

I have not come across any property of the model in this version that exposes the splits.

plot_tree gives me something similar, but it is a visualization of the tree rather than the raw split values.

I need something like https://stackoverflow.com/a/39772170/4559070, but for an xgboost model.

3 Answers

Answer 1

It is possible, but not easy. I would recommend using GradientBoostingClassifier from scikit-learn, which is similar to xgboost but has native access to the built trees.
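To illustrate the scikit-learn route, here is a minimal sketch of reading splits directly from a fitted GradientBoostingClassifier: each fitted tree exposes its structure through the tree_ attribute, so no text parsing is needed. (The model and data here are illustrative, chosen to mirror the xgboost example below.)

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
model = GradientBoostingClassifier(max_depth=2, n_estimators=2).fit(X, y)

splits = []
for stage in model.estimators_:      # one row of trees per boosting stage
    for tree in stage:               # one tree per class
        t = tree.tree_
        for node in range(t.node_count):
            # internal nodes have a left child; leaves are marked with -1
            if t.children_left[node] != -1:
                splits.append((t.feature[node], t.threshold[node]))
print(splits)
```

Each tuple is (feature_index, threshold), already numeric, which is convenient if the goal is binning.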

With xgboost, however, it is possible to get a textual representation of the model and then parse it:

```python
from sklearn.datasets import load_iris
from xgboost import XGBClassifier

# build a very simple model
X, y = load_iris(return_X_y=True)
model = XGBClassifier(max_depth=2, n_estimators=2)
model.fit(X, y)

# dump it to a text file
model.get_booster().dump_model('xgb_model.txt', with_stats=True)

# read the contents of the file
with open('xgb_model.txt', 'r') as f:
    txt_model = f.read()
print(txt_model)
```

This prints a textual description of 6 trees (2 boosting rounds, each consisting of 3 trees, one per class), which starts like this:

```
booster[0]:
0:[f2<2.45] yes=1,no=2,missing=1,gain=72.2968,cover=66.6667
    1:leaf=0.143541,cover=22.2222
    2:leaf=-0.0733496,cover=44.4444
booster[1]:
0:[f2<2.45] yes=1,no=2,missing=1,gain=18.0742,cover=66.6667
    1:leaf=-0.0717703,cover=22.2222
    2:[f3<1.75] yes=3,no=4,missing=3,gain=41.9078,cover=44.4444
        3:leaf=0.124,cover=24
        4:leaf=-0.0668394,cover=20.4444
...
```

Now you can, for example, extract all splits from this description:

```python
import re

# extract all patterns like "[f2<2.45]"; note the raw string and the
# escaped dot, so the backslashes reach the regex engine intact
splits = re.findall(r'\[f([0-9]+)<([0-9]+\.[0-9]+)\]', txt_model)
print(splits)
```

This yields a list of (feature_id, split_value) tuples, like:

```
[('2', '2.45'),
 ('2', '2.45'),
 ('3', '1.75'),
 ('3', '1.65'),
 ('2', '4.95'),
 ('2', '2.45'),
 ('2', '2.45'),
 ('3', '1.75'),
 ('3', '1.65'),
 ('2', '4.95')]
```

You can further process this list as you wish.
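Since the asker's stated goal is binning, here is a minimal standard-library sketch of one way to process the list: deduplicate the thresholds per feature into sorted bin edges, then assign a bin index with bisect. The splits list below is a hypothetical excerpt of the regex output above.

```python
from bisect import bisect_right
from collections import defaultdict

# hypothetical excerpt of the (feature_id, split_value) tuples above
splits = [('2', '2.45'), ('2', '2.45'), ('3', '1.75'),
          ('3', '1.65'), ('2', '4.95')]

# collect deduplicated, sorted thresholds per feature
edges = defaultdict(set)
for feature, value in splits:
    edges[int(feature)].add(float(value))
edges = {f: sorted(v) for f, v in edges.items()}
print(edges)   # {2: [2.45, 4.95], 3: [1.65, 1.75]}

def bin_value(feature, x):
    """Return the bin index of x for a feature (0 = below all splits)."""
    return bisect_right(edges[feature], x)

print(bin_value(2, 3.0))   # 1: between 2.45 and 4.95
```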

Answer 2

Generally, it is not practical to do by hand, as there can be hundreds or even thousands of trees in an xgboost model. Credit: Jason Brownlee

Answer 3

You need to know the name of your tree; once you have it, you can refer to it in your code.
