Get the Best from your ML Model Features using Shapley Values

Luis Zul
4 min read · Jul 27, 2020


The WHAT values?

How do we measure the influence of our variables on our outcome?

How do we know if a feature gives our model good performance?
In Layman’s terms, a good model gets correct results for unseen cases. For example, if you train a model to classify an image as a cat or a dog with only 3D models, a good model would look at a real life picture of a cat and classify it as such.

How do we measure this before the fact?
In Machine Learning, algorithms learn through data. At minimum, data is partitioned into training and test sets (sometimes also into a validation set, but for the sake of simplicity we won't consider it here).

Usually, you first set aside a slice of your data as the test set, and the rest is used to train the algorithm for your task. After training, you use the algorithm to predict results on the test set. The predictions are then compared against the ground truth, and an evaluation score is calculated.
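As a sketch of this split-train-evaluate loop (the scikit-learn breast-cancer dataset and model choice here are my own, purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a toy dataset and set aside 20% of it as the test set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train on the training slice only
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Compare predictions on unseen data against the ground truth
baseline_score = accuracy_score(y_test, model.predict(X_test))
print(f"baseline accuracy: {baseline_score:.3f}")
```

This `baseline_score` is the reference point we'll compare feature importances against below.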

How important is a feature in determining these results?
The importance of a feature is how much it helps us get the correct answer. For example, the BMI of a patient may be important in determining if the patient has a risk of heart attack or not. This is because there’s a relationship between the independent variable and the outcome. However, sometimes these relationships are not apparent in every problem domain. How then, can we calculate the strength of the relationship between an independent variable and its outcome?

One simple approach is to drop the column you want to check, and then train your model with the changed dataset.

Then, you compare the performance of the original model against that of the model trained on the reduced dataset.

However, this would be expensive, since we'd need to train a model for every subset of the feature set. For example, with just 2 features there are already four subsets:
[], [F1], [F2], [F1, F2] (the empty set just means trying to predict the outcome with a coin flip, in the case of a binary classification model).

To avoid this, we can train a single model and, when evaluating on the test set, shuffle the column we want to evaluate before extracting the score. To determine the feature importance, you subtract the permuted-feature score from the original model's score on the test set (the baseline score). To reduce the noise in this estimate, you can shuffle the column multiple times and take the average of the resulting feature importances.
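This shuffle-and-rescore idea can be sketched as a small helper, assuming an already-trained scikit-learn-style model and NumPy test arrays (the function name is mine):

```python
import numpy as np

def permutation_importance(model, X_test, y_test, col, n_repeats=10, seed=0):
    """Average drop in test score when one column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X_test, y_test)
    drops = []
    for _ in range(n_repeats):
        X_perm = X_test.copy()
        rng.shuffle(X_perm[:, col])  # break the column's link to the outcome
        drops.append(baseline - model.score(X_perm, y_test))
    # Averaging over repeats reduces the noise of a single shuffle
    return float(np.mean(drops))
```

Note that no retraining happens here: the same trained model is scored on a corrupted copy of the test set, which is what makes this so much cheaper than the drop-column approach.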

Sometimes, including two features together has a greater effect than using each feature separately. In other words, the combination is greater than the sum of its parts. For example, if you're creating a model to predict whether a patient could have osteoporosis, age alone may not be enough to precisely predict the outcome. Likewise, just knowing the sex at birth of the patient on its own is not enough of a signal. However, when you combine both age and sex, you remember that osteoporosis is more likely in women aged 40 and above. This kind of relationship between features can't be captured by the permutation method.

How can we measure the influence of features between each other, which in turn influence the outcome predicted by the model?
To measure these relationships, we can make use of the Shapley Values. Let's start with a simple example of three features: [color, fabric, size] for a shirt. To get the Shapley Value of the “size” feature, we consider the following calculations:

To calculate the scores, where f is the evaluation function (for example, the model's test-set score):

  • S1: f({color, fabric, size})-f({color, fabric})
  • S2: f({color, size})-f({color})
  • S3: f({size})-f({})
  • S4: f({fabric, color, size})-f({fabric, color})
  • S5: f({fabric, size})-f({fabric})
  • S6: f({size})-f({})

In other words, you’re simulating the impact of the size feature should you add it to a particular arrangement of the feature set.

Finally, to calculate the Shapley Value, you average these scores: Shapley(size) = (S1 + S2 + S3 + S4 + S5 + S6) / 6.

Et voilà! You have the Shapley Value for the “size” feature! The Shapley Values of the other features would be:

Fabric

  • S1: f({color, size, fabric})-f({color, size})
  • S2: f({color, fabric})-f({color})
  • S3: f({fabric})-f({})
  • S4: f({size, color, fabric})-f({size, color})
  • S5: f({size, fabric})-f({size})
  • S6: f({fabric})-f({})

Color

  • S1: f({size, fabric, color})-f({size, fabric})
  • S2: f({size, color})-f({size})
  • S3: f({color})-f({})
  • S4: f({fabric, size, color})-f({fabric, size})
  • S5: f({fabric, color})-f({fabric})
  • S6: f({color})-f({})

With these values, we can compute the importance of a feature that on its own doesn't have enough power to contribute to the prediction, but that is able to do so with the help of one or more other features.
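The whole procedure above can be computed exactly for a small feature set by enumerating every ordering. The toy additive score function below is an assumption for illustration, not a real model:

```python
from itertools import permutations

def shapley_value(features, target, score):
    """Average marginal contribution of `target` over all orderings."""
    total = 0.0
    orderings = list(permutations(features))
    for order in orderings:
        i = order.index(target)
        before = set(order[:i])  # features already "in the model"
        total += score(before | {target}) - score(before)
    return total / len(orderings)

# Toy score: each feature contributes a fixed amount on its own
worth = {"size": 0.5, "fabric": 0.2, "color": 0.1}
score = lambda subset: sum(worth[f] for f in subset)

phi_size = shapley_value(["color", "fabric", "size"], "size", score)
```

For an additive score like this one, each feature's Shapley Value equals its own contribution; real models are interesting precisely because their scores are not additive, which is when the ordering-by-ordering averaging pays off.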

However, as you might imagine, computing Shapley Values for large feature sets on big datasets can be prohibitively expensive: the number of orderings grows factorially with the number of features. Fortunately, there are readily available libraries that implement faster approximations of this method, such as https://github.com/slundberg/shap in Python.

Now that you know what a Shapley Value is, you can consider it the next time you build your classification model!
