Friday, March 2, 2018

Lime vs TreeInterpreter for interpreting a decision tree


Lime source: https://github.com/marcotcr/lime

treeinterpreter source: https://github.com/andosa/treeinterpreter

I am trying to understand how a DecisionTree makes its predictions, using both Lime and treeinterpreter. Both claim in their descriptions that they can interpret decision trees, yet they interpret the same DecisionTree in different ways; in particular, the order of the feature contributions differs. How is that possible, if both are looking at the same model and trying to explain the same prediction, but assign importance in a different order?

Who should we trust, especially in cases where the top feature really matters for the prediction?

The code for treeinterpreter

from __future__ import print_function

import numpy as np
import sklearn
import sklearn.datasets
import sklearn.ensemble
import lime
import lime.lime_tabular
from sklearn.tree import DecisionTreeClassifier
from treeinterpreter import treeinterpreter as ti

np.random.seed(1)

iris = sklearn.datasets.load_iris()

dt = DecisionTreeClassifier(random_state=42)
dt.fit(iris.data, iris.target)

n = 100
instances = iris.data[n].reshape(1, -1)

prediction, biases, contributions = ti.predict(dt, instances)

for i in range(len(instances)):
    print("prediction:", prediction)
    print("-" * 20)
    print("Feature contributions:")
    print("-" * 20)
    # sort features by the magnitude of their largest per-class contribution
    for c, feature in sorted(zip(contributions[i], iris.feature_names),
                             key=lambda x: -abs(x[0]).max()):
        print(feature, c)

The code for Lime

from __future__ import print_function

import numpy as np
import sklearn
import sklearn.datasets
import sklearn.ensemble
import lime
import lime.lime_tabular
from sklearn.tree import DecisionTreeClassifier

np.random.seed(1)

iris = sklearn.datasets.load_iris()

dt = DecisionTreeClassifier(random_state=42)
dt.fit(iris.data, iris.target)

explainer = lime.lime_tabular.LimeTabularExplainer(iris.data,
                                                   feature_names=iris.feature_names,
                                                   class_names=iris.target_names,
                                                   discretize_continuous=False)

n = 100
exp = explainer.explain_instance(iris.data[n], dt.predict_proba,
                                 num_features=4, top_labels=2)

exp.show_in_notebook(show_table=True, predict_proba=True,
                     show_predicted_value=True, show_all=False)

Let's first look at the output of treeinterpreter.

[image: treeinterpreter output for instance 100]

So it did correctly say it was virginica. However, it assigned the importance in this order:

1) petal width (cm), then petal length (cm)

Now let's look at the output of Lime.

[image: Lime explanation for instance 100]

Yes, it also says the algorithm predicted virginica, but looking at how it made that classification, we clearly see the following:

1) Lime ranks petal length (cm) above petal width (cm), instead of petal width (cm) above petal length (cm) as shown by treeinterpreter

2) Sepal width and sepal length, to which treeinterpreter assigns zero contribution, are given non-zero weights by Lime, as shown in the images above

What is happening here?

The problem only gets worse when there are 1000+ features, where every single feature can matter for the decision.

1 Answer

Answer 1

Why is it possible for the two approaches to have different results?

Lime: A short explanation of how it works, taken from their github page:

Intuitively, an explanation is a local linear approximation of the model's behaviour. While the model may be very complex globally, it is easier to approximate it around the vicinity of a particular instance. While treating the model as a black box, we perturb the instance we want to explain and learn a sparse linear model around it, as an explanation. The figure below illustrates the intuition for this procedure. The model's decision function is represented by the blue/pink background, and is clearly nonlinear. The bright red cross is the instance being explained (let's call it X). We sample instances around X, and weight them according to their proximity to X (weight here is indicated by size). We then learn a linear model (dashed line) that approximates the model well in the vicinity of X, but not necessarily globally.

There is much more detailed information in various links on the github page.
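To make that intuition concrete, here is a minimal sketch of such a local linear approximation around instance 100 of the iris data. This is not Lime's actual implementation; the noise scale, kernel width, and number of samples are illustrative choices. The idea is to perturb the instance, weight the perturbed samples by their proximity to it, and fit a weighted linear model whose coefficients serve as the explanation.

import numpy as np
import sklearn.datasets
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeClassifier

iris = sklearn.datasets.load_iris()
dt = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

x = iris.data[100]                                   # instance to explain
rng = np.random.RandomState(1)

# perturb the instance with Gaussian noise (scale is an illustrative choice)
samples = x + rng.normal(scale=iris.data.std(axis=0), size=(5000, 4))
probs = dt.predict_proba(samples)[:, 2]              # probability of "virginica"

# weight the perturbed samples by proximity to x (RBF kernel, illustrative width)
distances = np.linalg.norm(samples - x, axis=1)
weights = np.exp(-(distances ** 2) / (2 * 0.75 ** 2))

# fit a weighted linear model; its coefficients are the local explanation
local_model = Ridge(alpha=1.0).fit(samples, probs, sample_weight=weights)
for name, coef in zip(iris.feature_names, local_model.coef_):
    print(name, coef)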

treeinterpreter: An explanation of how this one works is available on http://blog.datadive.net/interpreting-random-forests/ (this is for regression; an example for classification, which works very similarly, can be found here).

In short: suppose we have a node that compares feature F to some value and splits instances based on that. Suppose that 50% of all instances reaching that node belong to class C. Suppose we have a new instance, and it ends up getting assigned to the left child of this node, where now 80% of all instances belong to class C. Then, the contribution of feature F for this decision is computed as 0.8 - 0.5 = 0.3 (plus additional terms if there are more nodes along the path to the leaf that also use feature F).
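As a rough sketch (not treeinterpreter's actual code, but the same idea), these contributions can be reproduced by walking the decision path of the fitted scikit-learn tree from the question and summing, per feature, the change in class proportions between each node and the child the instance falls into:

import numpy as np
import sklearn.datasets
from sklearn.tree import DecisionTreeClassifier

iris = sklearn.datasets.load_iris()
dt = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

x = iris.data[100].reshape(1, -1)
tree = dt.tree_
node_ids = dt.decision_path(x).indices               # nodes visited from root to leaf

# class proportions at every node (tree_.value holds per-class counts/weights)
proportions = tree.value[:, 0, :] / tree.value[:, 0, :].sum(axis=1, keepdims=True)

contributions = np.zeros((iris.data.shape[1], len(iris.target_names)))
for parent, child in zip(node_ids[:-1], node_ids[1:]):
    feature = tree.feature[parent]                    # feature compared at this split
    contributions[feature] += proportions[child] - proportions[parent]

for name, c in zip(iris.feature_names, contributions):
    print(name, c)

Summed together with the class proportions at the root (the bias term), these per-feature contributions should recover the tree's predicted class probabilities, which matches the decomposition treeinterpreter reports.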

Comparison: The important thing to note is that Lime is a model-independent method (not specific to Decision Trees / RFs), which is based on local linear approximation. Treeinterpreter, on the other hand, specifically operates in a similar manner to the Decision Tree itself, and really looks at which features are actually used in comparisons by the algorithm. So they're really fundamentally doing quite different things. Lime says "a feature is important if we wiggle it a bit and this results in a different prediction". Treeinterpreter says "a feature is important if it was compared to a threshold in one of our nodes and this caused us to take a split that drastically changed our prediction".


Which one to trust?

This is difficult to answer definitively. They're probably both useful in their own way. Intuitively, you may be inclined to lean towards treeinterpreter at first glance, because it was specifically created for Decision Trees. However, consider the following example:

  • Root Node: 50% of instances class 0, 50% class 1. IF F <= 50, go left, otherwise go right.
  • Left Child: 48% of instances class 0, 52% class 1. Subtree below this.
  • Right Child: 99% of instances class 0, 1% of instances class 1. Subtree below this.

This kind of setup is possible if the majority of instances go left and only some go right. Now suppose we have an instance with F = 49 that was assigned to the left child and ultimately classified as class 1. Treeinterpreter won't care that F was really close to ending up on the other side of the threshold in the root node, and will only assign it a low contribution of 0.48 - 0.50 = -0.02. Lime, on the other hand, will notice that changing F just a little bit would completely change the odds.

Which one is right? That's not really clear. You could say that F was really important because if it had been only a little bit different the prediction would be different (then lime wins). You could also argue that F did not contribute to our final prediction because we hardly got any closer to a decision after inspecting its value, and still had to investigate many other features afterwards. Then treeinterpreter wins.

To get a better idea here, it may help to also actually plot the learned Decision Tree itself. Then you can manually follow along its decision path and decide which features you think were important and/or see if you can understand why both Lime and treeinterpreter say what they say.
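For example, with the tree trained in the question's code, something like the following (assuming scikit-learn >= 0.21, which provides export_text) prints the learned decision rules so you can trace the path for instance 100 by hand:

import sklearn.datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = sklearn.datasets.load_iris()
dt = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# print the learned decision rules as indented text, one line per node
print(export_text(dt, feature_names=list(iris.feature_names)))

sklearn.tree.plot_tree produces a graphical version of the same tree if you prefer an actual plot.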
