Using Feature Importance for Feature Selection
In the previous article on feature selection in machine learning, we mentioned that tree models such as GBDT can also serve as base models for feature selection. This post builds on that and covers XGBoost and LightGBM, the most widely used tree models besides plain decision trees.
DecisionTree
A decision tree's feature_importances_ attribute reports each feature's importance as the (normalized) total gain summed over all splits in the tree that use that feature.
The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
For background on information gain (Gain), see the introduction to decision trees.
GradientBoosting and ExtraTrees behave the same way as DecisionTree here.
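As a minimal sketch (assuming scikit-learn is installed, using the iris dataset purely for illustration), the normalized Gini importances can be read directly from a fitted tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# feature_importances_ is the normalized total impurity reduction (gain)
# contributed by each feature, so the values sum to 1.
for name, imp in zip(load_iris().feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

Because the values are normalized, they can be ranked or thresholded directly, e.g. with sklearn's SelectFromModel.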
XGBoost
get_score(fmap='', importance_type='weight')
Get feature importance of each feature. Importance type can be defined as:
- 'weight': the number of times a feature is used to split the data across all trees.
- 'gain': the average gain across all splits the feature is used in.
- 'cover': the average coverage across all splits the feature is used in.
- 'total_gain': the total gain across all splits the feature is used in.
- 'total_cover': the total coverage across all splits the feature is used in.
In other words:
- weight: the number of times the feature is chosen as a split feature.
- gain: the average gain the feature brings per split (across all trees): the sum of gains from the splits that use the feature divided by the number of such splits, i.e. gain = total_gain / weight.
- cover: the average coverage, i.e. the number of samples handled by the splits that use the feature, averaged over those splits.
- total_gain: the total gain across all splits that use the feature, summed over all trees.
- total_cover: the total number of samples processed (covered) across all splits that use the feature, over all trees.
Reference: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_score
LightGBM
feature_importance(importance_type='split', iteration=None)
Get feature importances.
- importance_type (string, optional (default="split")) – How the importance is calculated. If "split", result contains numbers of times the feature is used in a model. If "gain", result contains total gains of splits which use the feature.
- iteration (int or None, optional (default=None)) – Limit number of iterations in the feature importance calculation. If None, if the best iteration exists, it is used; otherwise, all trees are used. If <= 0, all trees are used (no limits).
In other words:
- split: the total number of times the feature is used to split, across all trees.
- gain: the total gain brought by the splits that use the feature, summed across all trees.