[Machine Learning Notes] Plotting and Saving Feature Importance with LightGBM

Introduction

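Tree-based models can be used to evaluate feature importance. This post walks through the steps for evaluating feature importance with the GBDT model in LightGBM. LightGBM is a gradient boosting framework released by Microsoft that aims for both high accuracy and high speed (some benchmarks suggest that LightGBM can produce predictions as accurate as XGBoost's while training up to 25 times faster).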

First, we import the required packages: pandas for data preprocessing, LightGBM for the GBDT model, and matplotlib for drawing the feature importance bar chart.

import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb

Next, we load and preprocess the training data. In this example, we use a predictive maintenance dataset.

# Read the data (raw string so the backslashes in the Windows path are kept literally)
train = pd.read_csv(r'E:\Data\predicitivemaintance_processed.csv')
# Drop the columns that are not used by the model
train = train.drop(['Date', 'FailureDate'], axis=1)
# Set the target column
target = 'FailNextWeek'
# One-hot encode the categorical feature
feature_categorical = ['Model']
train = pd.get_dummies(train, columns=feature_categorical)
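To make the one-hot encoding step concrete, here is a minimal sketch on a toy DataFrame. The column names mirror the ones used above, but the rows and the `Age` column are made up for illustration and do not come from the original dataset.

```python
import pandas as pd

# Toy frame: column names follow the article, the values are invented.
toy = pd.DataFrame({
    "Model": ["A", "B", "A", "C"],
    "Age": [3, 7, 1, 5],
    "FailNextWeek": [0, 1, 0, 1],
})

# Each distinct value of 'Model' becomes its own indicator column,
# and the original 'Model' column is dropped.
encoded = pd.get_dummies(toy, columns=["Model"])
print(list(encoded.columns))
```

The untouched columns keep their position, and the new indicator columns are appended with the `Model_` prefix.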

Next, we train the GBDT model on the training data:

lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'num_leaves': 30,
    'num_round': 360,  # alias of num_iterations: the number of boosting rounds
    'max_depth': 8,
    'learning_rate': 0.01,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.8,
    'bagging_freq': 12
}
# Features are every column except the target; the target column is the label
lgb_train = lgb.Dataset(train.drop(columns=[target]), label=train[target])
model = lgb.train(lgb_params, lgb_train)

Once the model is trained, we can pass it to LightGBM's plot_importance function to visualize the feature importance.

# Let plot_importance create the figure itself, so that figsize and title apply to the chart
lgb.plot_importance(model, max_num_features=30, figsize=(12, 6), title="Feature Importances")
plt.show()

Saving the feature importance

# lgb.train already returns a Booster (the booster_ attribute only exists on the scikit-learn wrappers)
booster = model
importance = booster.feature_importance(importance_type='split')
feature_name = booster.feature_name()
# Optionally print each feature with its importance:
# for name, score in zip(feature_name, importance):
#     print(name, score)
feature_importance = pd.DataFrame({'feature_name': feature_name, 'importance': importance})
feature_importance.to_csv('feature_importance.csv', index=False)
