Feature Importance with MindsDB and Lightwood
MindsDB, together with the Lightwood ML engine, provides a feature importance tool.
What is Feature Importance?
Feature importance is a useful tool for explaining how a given machine learning model works in general. Simply put, it assigns a relative numerical score to each input feature so that a user understands which parts of the input the model relies on more, and which less, when generating predictions.
While there are a few variations of this technique, Lightwood offers the permutation-based variant.
Permutation Feature Importance
The procedure starts by splitting the data into training and validation sets. Once the ML model is trained on the former, the latter is used to, among other things, compute an importance score for each feature.
The algorithm is rather simple. We iterate over all the input features and, one at a time, randomly shuffle the values within a single feature while leaving the rest of the input features untouched. Then, the model generates predictions for this altered input as it normally would for any other input.
Once we have predictions for each shuffled variation of the validation dataset, we evaluate the accuracy metric of interest to the user, such as mean absolute error for regression tasks, and compare it against the value obtained for the original dataset, which acts as a reference. Based on the accuracy that is lost, we assign the column a numerical score that reflects this impact and report it as the column's importance for the model.
For edge cases where a feature is completely irrelevant (no accuracy is lost when the feature is shuffled) or absolutely critical (accuracy drops to the minimum possible value when the feature is shuffled), the importance scores are 0.0 and 1.0, respectively. However, the user should be careful to note that these scores do not model inter-feature dependencies or correlations, meaning that it is not wise to interpret the scores as independent from each other: any feature that is correlated with another highly-scored feature will present a high score, too.
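These edge cases suggest one way to write the normalized score. The formula below is a sketch implied by the behavior described above, not necessarily Lightwood's exact implementation:

$$\text{importance}_i = \frac{A_{\text{ref}} - A_i^{\text{shuffled}}}{A_{\text{ref}} - A_{\min}}$$

where $A_{\text{ref}}$ is the accuracy on the unmodified validation set, $A_i^{\text{shuffled}}$ is the accuracy after shuffling feature $i$, and $A_{\min}$ is the minimum possible value of the accuracy metric. An irrelevant feature yields a numerator of zero (score 0.0), while a critical feature drives $A_i^{\text{shuffled}}$ down to $A_{\min}$ (score 1.0).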
How to Use Permutation Feature Importance
You can customize the behavior of this analysis module from the MindsDB SQL editor via the USING clause of the CREATE MODEL statement. For example, if you want to consider all rows in the validation dataset, rather than clipping it to some default row limit, here is an example:
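The statement below is a minimal sketch. It assumes the home_rentals demo table used in the example later in this article, and it configures Lightwood's PermutationFeatureImportance analysis block; the analysis_blocks key and its row_limit argument (set to 0 here to use all validation rows) are assumptions about the Lightwood integration, so consult the Lightwood documentation for the exact parameter names.

```sql
CREATE MODEL mindsdb.home_rentals_model
FROM example_db (SELECT * FROM demo_data.home_rentals)
PREDICT rental_price
USING
    analysis_blocks = [{
        "module": "PermutationFeatureImportance",
        "args": {"row_limit": 0}
    }];
```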
Once you train a model, use the DESCRIBE model_name; command to see the reported importance scores.
Example
Let’s use the following data to create a model:
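The queries below assume the home_rentals demo dataset that MindsDB ships with its examples, exposed here as example_db.demo_data.home_rentals; if you connected your own data source, substitute your table instead.

```sql
SELECT *
FROM example_db.demo_data.home_rentals
LIMIT 10;
```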
On execution, we get a sample of the input data, with the feature columns and the rental_price target column.
Now we create a model using the USING clause, as shown in the previous section:
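That is, we reuse the sketched statement from above (again, the analysis_blocks configuration is an assumption about the Lightwood integration):

```sql
CREATE MODEL mindsdb.home_rentals_model
FROM example_db (SELECT * FROM demo_data.home_rentals)
PREDICT rental_price
USING
    analysis_blocks = [{
        "module": "PermutationFeatureImportance",
        "args": {"row_limit": 0}
    }];
```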
On execution, we get a record of the newly created model, including its training status.
Here is how you can monitor the status of the model:
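One way to do this is to query the models catalog; recent MindsDB versions expose it as the mindsdb.models table (older versions use mindsdb.predictors instead):

```sql
SELECT name, status
FROM mindsdb.models
WHERE name = 'home_rentals_model';
```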
Once the status is complete, we can query the model.
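For example, we can request a prediction for a single set of feature values. The columns in the WHERE clause (sqft, location, neighborhood, days_on_market) are assumed from the home_rentals demo schema; adjust them to your data:

```sql
SELECT rental_price
FROM mindsdb.home_rentals_model
WHERE sqft = 823
  AND location = 'good'
  AND neighborhood = 'downtown'
  AND days_on_market = 10;
```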
On execution, we get the predicted rental price for the provided feature values.
Here is how you can check the importance scores for all columns:
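As noted earlier, this is done with the DESCRIBE command, here using the model name from this example:

```sql
DESCRIBE home_rentals_model;
```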
On execution, we get the importance score reported for each column.
Please note that the rental_price column is not listed here, as it is the target column. The importance scores are presented for all the feature columns.
About Confidence Estimation
Check out the article on Model agnostic confidence estimation with conformal predictors for AutoML by Patricio Cerda Mardini to learn how MindsDB estimates confidence.