Welcome to LightGBM’s documentation!¶
Quick Start¶
This is a quick start guide for the CLI version of LightGBM.
Follow the Installation Guide to install LightGBM first.
List of other Helpful Links
Training data format¶
LightGBM supports input data files in CSV, TSV and LibSVM formats.
The label is the first column, and the file has no header by default.
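For illustration, a minimal CSV training file (the values are hypothetical) with the label in the first column and no header could look like this:
0,1.2,0.3,5.4
1,0.8,1.1,2.2
0,2.5,0.0,7.1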
Categorical feature support¶
update 12/5/2016:
LightGBM can use categorical features directly (without one-hot encoding). Experiments on the Expo data show about an 8x speed-up compared with one-hot encoding.
For the setting details, please refer to Parameters.
Weight and query/group data¶
LightGBM also supports weighted training; it needs an additional weight data file. It also needs an additional query data file for the ranking task.
update 11/3/2016:
- input with a header is now supported
- the label column, weight column and query/group id column can be specified; both index and name are supported
- a list of columns to ignore can be specified
For the detailed usage, please refer to Configuration.
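For example, assuming a CSV file with a header row and columns named label, weight, query_id and user_id (hypothetical names), these columns could be specified on the command line like this:
./lightgbm config=train.conf header=true label=name:label weight=name:weight query=name:query_id ignore_column=name:user_id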
Parameter quick look¶
The parameter format is key1=value1 key2=value2 .... Parameters can be set both in the config file and on the command line.
Some important parameters:
config, default="", type=string, alias=config_file
- path of the config file
task, default=train, type=enum, options=train, prediction
- train for training
- prediction for prediction
application, default=regression, type=enum, options=regression, binary, lambdarank, multiclass, alias=objective, app
- regression, regression application
- binary, binary classification application
- lambdarank, lambdarank application
- multiclass, multi-class classification application; num_class should be set as well
boosting, default=gbdt, type=enum, options=gbdt, dart, alias=boost, boosting_type
- gbdt, traditional Gradient Boosting Decision Tree
- dart, Dropouts meet Multiple Additive Regression Trees
data, default="", type=string, alias=train, train_data
- training data; LightGBM will train from this data
valid, default="", type=multi-string, alias=test, valid_data, test_data
- validation/test data; LightGBM will output metrics for these data
- supports multiple validation data, separated by ,
num_iterations, default=100, type=int, alias=num_iteration, num_tree, num_trees, num_round, num_rounds
- number of boosting iterations/trees
learning_rate, default=0.1, type=double, alias=shrinkage_rate
- shrinkage rate
num_leaves, default=31, type=int, alias=num_leaf
- number of leaves in one tree
tree_learner, default=serial, type=enum, options=serial, feature, data
- serial, single machine tree learner
- feature, feature parallel tree learner
- data, data parallel tree learner
- Refer to the Parallel Learning Guide for more details
num_threads, default=OpenMP_default, type=int, alias=num_thread, nthread
- number of threads for LightGBM
- For the best speed, set this to the number of real CPU cores, not the number of logical threads (most CPUs use hyper-threading to provide 2 threads per core)
- For parallel learning, do not use all CPU cores, since this causes poor network performance
max_depth, default=-1, type=int
- limit the max depth of the tree model. This is used to deal with over-fitting when #data is small. Trees still grow leaf-wise.
- < 0 means no limit
min_data_in_leaf, default=20, type=int, alias=min_data_per_leaf, min_data
- minimal number of data in one leaf. Can be used to deal with over-fitting.
min_sum_hessian_in_leaf, default=1e-3, type=double, alias=min_sum_hessian_per_leaf, min_sum_hessian, min_hessian
- minimal sum of the Hessian in one leaf. Like min_data_in_leaf, it can be used to deal with over-fitting.
For all parameters, please refer to Parameters.
Run LightGBM¶
For Windows:
lightgbm.exe config=your_config_file other_args ...
For unix:
./lightgbm config=your_config_file other_args ...
Parameters can be both in the config file and command line, and the parameters in command line have higher priority than in config file. For example, following command line will keep ‘num_trees=10’ and ignore same parameter in config file.
./lightgbm config=train.conf num_trees=10
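As a sketch, a minimal train.conf could look like the following (the values are illustrative only); config files take one parameter per line, and # starts a comment:
# task and objective
task = train
application = binary
# data files
data = train.txt
valid = test.txt
# model complexity
num_iterations = 100
learning_rate = 0.1
num_leaves = 31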
Python Package Introduction¶
This document gives a basic walkthrough of the LightGBM Python package.
List of other Helpful Links
Install¶
- Install the library first; follow the installation guide in the wiki.
- Install the python-package dependencies: setuptools, numpy and scipy are required, and scikit-learn is required for the sklearn interface (recommended). Run:
pip install setuptools numpy scipy scikit-learn -U
- In the python-package directory, run:
python setup.py install
- To verify your installation, try to import lightgbm in Python:
import lightgbm as lgb
Data Interface¶
The LightGBM python module is able to load data from:
- libsvm/tsv/csv txt format file
- Numpy 2D array, pandas object
- LightGBM binary file
The data is stored in a Dataset object.
To load a LibSVM text file or a LightGBM binary file into Dataset:¶
train_data = lgb.Dataset('train.svm.bin')
To load a numpy array into Dataset:¶
import numpy as np
data = np.random.rand(500, 10)  # 500 entities, each contains 10 features
label = np.random.randint(2, size=500)  # binary target
train_data = lgb.Dataset(data, label=label)
To load a scipy.sparse.csr_matrix array into Dataset:¶
import scipy.sparse
# dat, row and col are the data, row-index and column-index arrays of the sparse matrix
csr = scipy.sparse.csr_matrix((dat, (row, col)))
train_data = lgb.Dataset(csr)
Saving Dataset into a LightGBM binary file will make loading faster:¶
train_data = lgb.Dataset('train.svm.txt')
train_data.save_binary("train.bin")
Create validation data¶
test_data = train_data.create_valid('test.svm')
or
test_data = lgb.Dataset('test.svm', reference=train_data)
In LightGBM, the validation data should be aligned with training data.
Specify feature names and categorical features¶
train_data = lgb.Dataset(data, label=label, feature_name=['c1', 'c2', 'c3'], categorical_feature=['c3'])
LightGBM can use categorical features as input directly; it does not need to convert them to one-hot encoding, and is much faster than one-hot encoding (about 8x speed-up).
Note: you should convert your categorical features to int type before constructing the Dataset.
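As a minimal sketch (assuming pandas is available and that the column c3 holds string categories), the categorical column can be encoded as integer codes before the Dataset is constructed:
import pandas as pd

df = pd.DataFrame({'c1': [1.0, 2.0, 3.0],
                   'c2': [0.5, 0.1, 0.9],
                   'c3': ['red', 'blue', 'red']})
# encode string categories as integer codes before constructing the Dataset
df['c3'] = df['c3'].astype('category').cat.codes
train_data = lgb.Dataset(df, label=[0, 1, 0],
                         feature_name=['c1', 'c2', 'c3'],
                         categorical_feature=['c3'])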
Weights can be set when needed:¶
w = np.random.rand(500, )
train_data = lgb.Dataset(data, label=label, weight=w)
or
train_data = lgb.Dataset(data, label=label)
w = np.random.rand(500, )
train_data.set_weight(w)
You can use Dataset.set_init_score() to set the initial score, and Dataset.set_group() to set group/query data for ranking tasks.
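For example, a minimal sketch for a ranking task (the group sizes are illustrative and must sum to the number of rows, here the 500-row example from above):
train_data = lgb.Dataset(data, label=label)
train_data.set_group([100, 200, 200])     # 3 queries covering the 500 rows
train_data.set_init_score(np.zeros(500))  # start boosting from a zero initial score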
Memory efficient usage¶
The Dataset object in LightGBM is very memory-efficient, because it only needs to save discrete bins. However, Numpy/Array/Pandas objects are memory-expensive. If you are concerned about memory consumption, you can save memory as follows (a short sketch is shown after this list):
- Let free_raw_data=True (the default) when constructing the Dataset
- Explicitly set raw_data=None after the Dataset has been constructed
- Call gc
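A minimal sketch of this pattern (the variable names are the ones from the examples above, and dropping your own reference to the raw array is one way to read the "raw_data=None" step):
import gc

train_data = lgb.Dataset(data, label=label, free_raw_data=True)  # free_raw_data=True is the default
# ... construct a Booster / call lgb.train here, which finalizes the Dataset ...
data = None   # drop your own reference to the raw numpy array
gc.collect()  # ask Python to reclaim the memory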
Setting Parameters¶
LightGBM can use either a list of pairs or a dictionary to set parameters. For instance:
- Booster parameters
param = {'num_leaves':31, 'num_trees':100, 'objective':'binary' }
param['metric'] = 'auc'
- You can also specify multiple eval metrics:
param['metric'] = ['auc', 'binary_logloss']
Training¶
Training a model requires a parameter list and data set.
num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[test_data] )
After training, the model can be saved.
bst.save_model('model.txt')
The trained model can also be dumped to JSON format
# dump model
json_model = bst.dump_model()
A saved model can be loaded as follows:
bst = lgb.Booster(model_file="model.txt") #init model
Early Stopping¶
If you have a validation set, you can use early stopping to find the optimal number of boosting rounds.
Early stopping requires at least one set in valid_sets
. If there’s more than one, it will use all of them.
bst = lgb.train(param, train_data, num_round, valid_sets=valid_sets, early_stopping_rounds=10)
bst.save_model('model.txt', num_iteration=bst.best_iteration)
The model will train until the validation score stops improving. Validation error needs to improve at least every early_stopping_rounds
to continue training.
If early stopping occurs, the model will have an additional field: bst.best_iteration
. Note that train()
will return a model from the last iteration, not the best one. And you can set num_iteration=bst.best_iteration
when saving model.
This works with both metrics to minimize (L2, log loss, etc.) and to maximize (NDCG, AUC). Note that if you specify more than one evaluation metric, all of them will be used for early stopping.
Prediction¶
A model that has been trained or loaded can perform predictions on data sets.
# 7 entities, each contains 10 features
data = np.random.rand(7, 10)
ypred = bst.predict(data)
If early stopping is enabled during training, you can get predictions from the best iteration with bst.best_iteration:
ypred = bst.predict(data, num_iteration=bst.best_iteration)
Parameters¶
This page lists all parameters in LightGBM.
List of other Helpful Links
Update of 04/13/2017
Default values for the following parameters have changed:
- min_data_in_leaf = 100 => 20
- min_sum_hessian_in_leaf = 10 => 1e-3
- num_leaves = 127 => 31
- num_iterations = 10 => 100
Parameter format¶
The parameter format is key1=value1 key2=value2 .... Parameters can be set both in config files and on the command line. On the command line, parameters should not have spaces before and after =. In config files, one line can contain only one parameter; you can use # for comments. If one parameter appears in both the command line and a config file, LightGBM will use the one from the command line.
Core Parameters¶
config, default="", type=string, alias=config_file
- path of the config file
task, default=train, type=enum, options=train, prediction
- train for training
- prediction for prediction
application, default=regression, type=enum, options=regression, regression_l1, huber, fair, poisson, binary, lambdarank, multiclass, alias=objective, app
- regression, regression application
- regression_l2, L2 loss, alias=mean_squared_error, mse
- regression_l1, L1 loss, alias=mean_absolute_error, mae
- huber, Huber loss
- fair, Fair loss
- poisson, Poisson regression
- binary, binary classification application
- lambdarank, lambdarank application
- multiclass, multi-class classification application; num_class should be set as well
boosting, default=gbdt, type=enum, options=gbdt, dart, goss, alias=boost, boosting_type
- gbdt, traditional Gradient Boosting Decision Tree
- dart, Dropouts meet Multiple Additive Regression Trees
- goss, Gradient-based One-Side Sampling
data, default="", type=string, alias=train, train_data
- training data; LightGBM will train from this data
valid, default="", type=multi-string, alias=test, valid_data, test_data
- validation/test data; LightGBM will output metrics for these data
- supports multiple validation data, separated by ,
num_iterations, default=100, type=int, alias=num_iteration, num_tree, num_trees, num_round, num_rounds
- number of boosting iterations
- note: num_tree here is equal to num_iterations. For multi-class, it actually learns num_class * num_iterations trees.
- note: for the python/R packages, this parameter cannot be used to control the number of iterations.
learning_rate, default=0.1, type=double, alias=shrinkage_rate
- shrinkage rate
- in dart, it also affects the normalization weights of dropped trees
num_leaves, default=31, type=int, alias=num_leaf
- number of leaves in one tree
tree_learner, default=serial, type=enum, options=serial, feature, data
- serial, single machine tree learner
- feature, feature parallel tree learner
- data, data parallel tree learner
- Refer to the Parallel Learning Guide for more details
num_threads, default=OpenMP_default, type=int, alias=num_thread, nthread
- number of threads for LightGBM
- For the best speed, set this to the number of real CPU cores, not the number of logical threads (most CPUs use hyper-threading to provide 2 threads per core)
- For parallel learning, do not use all CPU cores, since this causes poor network performance
device, default=cpu, options=cpu, gpu
- choose the device for tree learning; gpu can be used to achieve faster learning
- Note: 1. it is recommended to use a smaller max_bin (e.g. 63) to get a better speed-up. 2. For faster speed, the GPU uses 32-bit floating point for summation by default, which may affect accuracy for some tasks; you can set gpu_use_dp=true to enable 64-bit floating point, but it will slow down training. 3. Refer to the Installation Guide to build with GPU support.
Learning control parameters¶
max_depth, default=-1, type=int
- limit the max depth of the tree model. This is used to deal with over-fitting when #data is small. Trees still grow leaf-wise.
- < 0 means no limit
min_data_in_leaf, default=20, type=int, alias=min_data_per_leaf, min_data
- minimal number of data in one leaf. Can be used to deal with over-fitting.
min_sum_hessian_in_leaf, default=1e-3, type=double, alias=min_sum_hessian_per_leaf, min_sum_hessian, min_hessian
- minimal sum of the Hessian in one leaf. Like min_data_in_leaf, it can be used to deal with over-fitting.
feature_fraction, default=1.0, type=double, 0.0 < feature_fraction < 1.0, alias=sub_feature
- LightGBM will randomly select a subset of the features on each iteration if feature_fraction is smaller than 1.0. For example, if set to 0.8, 80% of the features will be selected before training each tree.
- Can be used to speed up training
- Can be used to deal with over-fitting
feature_fraction_seed, default=2, type=int
- random seed for feature fraction
bagging_fraction, default=1.0, type=double, 0.0 < bagging_fraction < 1.0, alias=sub_row
- like feature_fraction, but this will randomly select a subset of the data
- Can be used to speed up training
- Can be used to deal with over-fitting
- Note: to enable bagging, bagging_freq should be set to a non-zero value as well
bagging_freq, default=0, type=int
- frequency for bagging; 0 means bagging is disabled, k means bagging is performed every k iterations
- Note: to enable bagging, bagging_fraction should be set as well (a short parameter sketch follows the bagging_seed entry below)
bagging_seed, default=3, type=int
- random seed for bagging
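As an illustrative sketch (the values are arbitrary, not tuned recommendations), enabling feature sub-sampling and bagging in a Python parameter dict requires both bagging_fraction and bagging_freq:
param = {
    'feature_fraction': 0.9,   # use 90% of the features for each tree
    'bagging_fraction': 0.8,   # use 80% of the data in each bagging round
    'bagging_freq': 5,         # re-sample the data every 5 iterations
}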
early_stopping_round, default=0, type=int, alias=early_stopping_rounds, early_stopping
- training stops if one metric of one validation data doesn't improve in the last early_stopping_round rounds
lambda_l1, default=0, type=double
- L1 regularization
lambda_l2, default=0, type=double
- L2 regularization
min_gain_to_split, default=0, type=double
- the minimal gain to perform a split
drop_rate, default=0.1, type=double
- only used in dart
skip_drop, default=0.5, type=double
- only used in dart, probability of skipping a drop
max_drop, default=50, type=int
- only used in dart, max number of dropped trees in one iteration; <=0 means no limit
uniform_drop, default=false, type=bool
- only used in dart, set to true to use uniform drop
xgboost_dart_mode, default=false, type=bool
- only used in dart, set to true to use xgboost dart mode
drop_seed, default=4, type=int
- only used in dart, random seed used to choose which models are dropped
top_rate, default=0.2, type=double
- only used in goss, the retain ratio of large gradient data
other_rate, default=0.1, type=double
- only used in goss, the retain ratio of small gradient data
IO parameters¶
max_bin, default=255, type=int
- max number of bins that feature values will be bucketed into. A small number of bins may reduce training accuracy but may increase generalization power (deal with over-fitting).
- LightGBM will automatically compress memory according to max_bin. For example, LightGBM will use uint8_t for feature values if max_bin=255.
data_random_seed, default=1, type=int
- random seed for data partition in parallel learning (excluding feature parallel)
output_model, default=LightGBM_model.txt, type=string, alias=model_output, model_out
- file name of the output model in training
input_model, default="", type=string, alias=model_input, model_in
- file name of the input model
- for the prediction task, the prediction data will be predicted using this model
- for the train task, training will be continued from this model
output_result, default=LightGBM_predict_result.txt, type=string, alias=predict_result, prediction_result
- file name of the prediction result for the prediction task
is_pre_partition, default=false, type=bool
- used for parallel learning (excluding feature parallel)
- true if the training data are pre-partitioned and different machines use different partitions
is_sparse, default=true, type=bool, alias=is_enable_sparse
- used to enable/disable sparse optimization. Set to false to disable sparse optimization.
two_round, default=false, type=bool, alias=two_round_loading, use_two_round_loading
- by default, LightGBM will map the data file to memory and load features from memory. This provides faster data loading, but it may run out of memory when the data file is very big.
- set this to true if the data file is too big to fit in memory
save_binary, default=false, type=bool, alias=is_save_binary, is_save_binary_file
- set this to true to save the dataset (including validation data) to a binary file. This speeds up data loading next time.
verbosity, default=1, type=int, alias=verbose
- <0 = Fatal, =0 = Error (Warning), >0 = Info
header, default=false, type=bool, alias=has_header
- true if the input data has a header
label, default="", type=string, alias=label_column
- specify the label column
- use a number for the index, e.g. label=0 means column_0 is the label
- add a prefix name: for a column name, e.g. label=name:is_click
weight, default="", type=string, alias=weight_column
- specify the weight column
- use a number for the index, e.g. weight=0 means column_0 is the weight
- add a prefix name: for a column name, e.g. weight=name:weight
- Note: the index starts from 0, and it doesn't count the label column when the passing type is index; e.g. when the label is column_0 and the weight is column_1, the correct parameter is weight=0
query, default="", type=string, alias=query_column, group, group_column
- specify the query/group id column
- use a number for the index, e.g. query=0 means column_0 is the query id
- add a prefix name: for a column name, e.g. query=name:query_id
- Note: the data should be grouped by query_id. The index starts from 0, and it doesn't count the label column when the passing type is index; e.g. when the label is column_0 and the query_id is column_1, the correct parameter is query=0
ignore_column, default="", type=string, alias=ignore_feature, blacklist
- specify some columns to ignore in training
- use numbers for indices, e.g. ignore_column=0,1,2 means column_0, column_1 and column_2 will be ignored
- add a prefix name: for column names, e.g. ignore_column=name:c1,c2,c3 means c1, c2 and c3 will be ignored
- Note: the index starts from 0, and it doesn't count the label column
categorical_feature, default="", type=string, alias=categorical_column, cat_feature, cat_column
- specify categorical features
- use numbers for indices, e.g. categorical_feature=0,1,2 means column_0, column_1 and column_2 are categorical features
- add a prefix name: for column names, e.g. categorical_feature=name:c1,c2,c3 means c1, c2 and c3 are categorical features
- Note: only categorical features of int type are supported. The index starts from 0, and it doesn't count the label column.
predict_raw_score, default=false, type=bool, alias=raw_score, is_predict_raw_score
- only used in the prediction task
- set to true to predict only the raw scores
- set to false to predict transformed scores
predict_leaf_index, default=false, type=bool, alias=leaf_index, is_predict_leaf_index
- only used in the prediction task
- set to true to predict with the leaf index of all trees
bin_construct_sample_cnt, default=200000, type=int
- number of data points sampled to construct the histogram bins
- setting this larger will give a better training result, but will increase the data loading time
- set this to a larger value if the data is very sparse
num_iteration_predict, default=-1, type=int
- only used in the prediction task; controls how many trained iterations will be used in prediction
- <= 0 means no limit
Objective parameters¶
sigmoid, default=1.0, type=double
- parameter for the sigmoid function; used in binary classification and lambdarank
huber_delta, default=1.0, type=double
- parameter for Huber loss; used in the regression task
fair_c, default=1.0, type=double
- parameter for Fair loss; used in the regression task
poisson_max_delta_step, default=0.7, type=double
- parameter used to safeguard optimization
scale_pos_weight, default=1.0, type=double
- weight of the positive class in the binary classification task
boost_from_average, default=true, type=bool
- adjust the initial score to the mean of the labels for faster convergence; only used in the regression task
is_unbalance, default=false, type=bool
- used in binary classification; set this to true if the training data are unbalanced
max_position, default=20, type=int
- used in lambdarank; NDCG will be optimized at this position
label_gain, default=0,1,3,7,15,31,63,..., type=multi-double
- used in lambdarank; relevance gain for labels. For example, the gain of label 2 is 3 when using the default label gains.
- separated by ,
num_class, default=1, type=int, alias=num_classes
- only used in multi-class classification
Metric parameters¶
metric, default={l2 for regression}, {binary_logloss for binary classification}, {ndcg for lambdarank}, type=multi-enum, options=l1, l2, ndcg, auc, binary_logloss, binary_error, ...
- l1, absolute loss, alias=mean_absolute_error, mae
- l2, square loss, alias=mean_squared_error, mse
- l2_root, root square loss, alias=root_mean_squared_error, rmse
- huber, Huber loss
- fair, Fair loss
- poisson, Poisson regression
- ndcg, NDCG
- map, MAP
- auc, AUC
- binary_logloss, log loss
- binary_error, per-sample error: 0 for a correct classification, 1 for an incorrect classification
- multi_logloss, log loss for multi-class classification
- multi_error, error rate for multi-class classification
- multiple metrics are supported, separated by ,
metric_freq, default=1, type=int
- frequency of metric output
is_training_metric, default=false, type=bool
- set this to true to output metric results on the training data
ndcg_at, default=1,2,3,4,5, type=multi-int, alias=ndcg_eval_at, eval_at
- NDCG evaluation positions, separated by ,
Network parameters¶
The following parameters are used for parallel learning, and are only used in the base (socket) version.
num_machines, default=1, type=int, alias=num_machine
- used for parallel learning; the number of machines for the parallel learning application
- needs to be set in both the socket and mpi versions
local_listen_port, default=12400, type=int, alias=local_port
- TCP listen port for local machines
- allow this port in the firewall settings before training
time_out, default=120, type=int
- socket time-out in minutes
machine_list_file, default="", type=string
- file that lists the machines for this parallel learning application
- each line contains one IP and one port for one machine, in the format "ip port", separated by a space
GPU parameters¶
gpu_platform_id, default=-1, type=int
- OpenCL platform ID. Usually each GPU vendor exposes one OpenCL platform.
- the default value is -1, which uses the system-wide default platform
gpu_device_id, default=-1, type=int
- OpenCL device ID in the specified platform. Each GPU in the selected platform has a unique device ID.
- the default value is -1, which uses the default device in the selected platform
gpu_use_dp, default=false, type=bool
- set to true to use double precision math on the GPU (the default is single precision)
Others¶
Continued training with input score¶
LightGBM supports continued training with an initial score. It uses an additional file to store these initial scores, like the following:
0.5
-0.1
0.9
...
It means the initial score of the first data row is 0.5, of the second is -0.1, and so on. The initial score file corresponds to the data file line by line, with one score per line. If the data file is named “train.txt”, the initial score file should be named “train.txt.init” and placed in the same folder as the data file. LightGBM will load the initial score file automatically if it exists.
Weight data¶
LightGBM supports weighted training. It uses an additional file to store the weight data, like the following:
1.0
0.5
0.8
...
It means the weight of the first data row is 1.0, of the second is 0.5, and so on. The weight file corresponds to the data file line by line, with one weight per line. If the data file is named “train.txt”, the weight file should be named “train.txt.weight” and placed in the same folder as the data file. LightGBM will load the weight file automatically if it exists.
update:
You can now specify the weight column in the data file. Please refer to the weight parameter above.
Query data¶
For LambdaRank learning, query information is needed for the training data. LightGBM uses an additional file to store the query data. The following is an example:
27
18
67
...
It means the first 27 samples belong to one query, the next 18 belong to another, and so on. (Note: the data should be ordered by query.) If the data file is named “train.txt”, the query file should be named “train.txt.query” and placed in the same folder as the training data. LightGBM will load the query file automatically if it exists.
You can now specify the query/group id column in the data file. Please refer to the group parameter above.
Parameters Tuning¶
This page contains tips for parameter tuning in LightGBM.
List of other Helpful Links
Convert parameters from XGBoost¶
LightGBM uses a leaf-wise tree growth algorithm, while other popular tools, e.g. XGBoost, use depth-wise tree growth. So LightGBM uses num_leaves to control the complexity of the tree model, while other tools usually use max_depth. The following table shows the correspondence between leaves and depths; the relation is num_leaves = 2^(max_depth).
| max_depth | num_leaves |
| --------- | ---------- |
| 1 | 2 |
| 2 | 4 |
| 3 | 8 |
| 7 | 128 |
| 10 | 1024 |
For faster speed¶
- Use bagging by setting bagging_fraction and bagging_freq
- Use feature sub-sampling by setting feature_fraction
- Use a small max_bin
- Use save_binary to speed up data loading in future learning
- Use parallel learning; refer to the Parallel Learning Guide
For better accuracy¶
- Use a large max_bin (may be slower)
- Use a small learning_rate with a large num_iterations
- Use a large num_leaves (may cause over-fitting)
- Use bigger training data
- Try dart
Deal with over-fitting¶
- Use a small max_bin
- Use a small num_leaves
- Use min_data_in_leaf and min_sum_hessian_in_leaf
- Use bagging by setting bagging_fraction and bagging_freq
- Use feature sub-sampling by setting feature_fraction
- Use bigger training data
- Try lambda_l1, lambda_l2 and min_gain_to_split for regularization
- Try max_depth to avoid growing a deep tree (a parameter sketch combining several of these settings follows this list)
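As a hedged illustration (the values are arbitrary, not tuned recommendations), several of these knobs combined in one Python parameter dict:
param = {
    'num_leaves': 15,           # smaller trees
    'max_depth': 6,             # cap tree depth
    'min_data_in_leaf': 50,     # require more data per leaf
    'feature_fraction': 0.8,    # feature sub-sampling
    'bagging_fraction': 0.8,    # data sub-sampling...
    'bagging_freq': 5,          # ...every 5 iterations
    'lambda_l1': 0.1,           # L1 regularization
    'lambda_l2': 0.1,           # L2 regularization
    'min_gain_to_split': 0.01,  # minimal gain to split
}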
lightgbm package¶
Data Structure API¶
-
class
lightgbm.
Dataset
(data, label=None, max_bin=255, reference=None, weight=None, group=None, silent=False, feature_name='auto', categorical_feature='auto', params=None, free_raw_data=True)[source]¶ Bases:
object
Dataset in LightGBM.
-
create_valid
(data, label=None, weight=None, group=None, silent=False, params=None)[source]¶ Create validation data aligned with the current dataset
Parameters: - data (string/numpy array/scipy.sparse) – Data source of Dataset. When data type is string, it represents the path of txt file
- label (list or numpy 1-D array, optional) – Label of the training data.
- weight (list or numpy 1-D array , optional) – Weight for each instance.
- group (list or numpy 1-D array , optional) – Group/query size for dataset
- silent (boolean, optional) – Whether print messages during construction
- params (dict, optional) – Other parameters
-
get_field
(field_name)[source]¶ Get property from the Dataset.
Parameters: field_name (str) – The field name of the information Returns: info – A numpy array of information of the data Return type: array
-
get_init_score
()[source]¶ Get the initial score of the Dataset.
Returns: init_score Return type: array
-
num_feature
()[source]¶ Get the number of columns (features) in the Dataset.
Returns: number of columns Return type: int
-
save_binary
(filename)[source]¶ Save Dataset to binary file
Parameters: filename (string) – Name of the output file.
-
set_categorical_feature
(categorical_feature)[source]¶ Set categorical features
Parameters: categorical_feature (list of int or str) – Name/index of categorical features
-
set_feature_name
(feature_name)[source]¶ Set feature name
Parameters: feature_name (list of str) – Feature names
-
set_field
(field_name, data)[source]¶ Set property into the Dataset.
Parameters: - field_name (str) – The field name of the information
- data (numpy array or list or None) – The array of data to be set
-
set_group
(group)[source]¶ Set group size of Dataset (used for ranking).
Parameters: group (numpy array or list or None) – Group size of each group
-
set_init_score
(init_score)[source]¶ Set init score of booster to start from.
Parameters: init_score (numpy array or list or None) – Init score for booster
-
set_label
(label)[source]¶ Set label of Dataset
Parameters: label (numpy array or list or None) – The label information to be set into Dataset
-
set_reference
(reference)[source]¶ Set reference dataset
Parameters: reference (Dataset) – Will use reference as a template to construct the current dataset
-
-
class
lightgbm.
Booster
(params=None, train_set=None, model_file=None, silent=False)[source]¶ Bases:
object
Booster in LightGBM.
-
add_valid
(data, name)[source]¶ Add validation data
Parameters: - data (Dataset) – Validation data
- name (String) – Name of validation data
-
attr
(key)[source]¶ Get attribute string from the Booster.
Parameters: key (str) – The key to get the attribute from. Returns: value – The attribute value of the key; returns None if the attribute does not exist. Return type: str
-
dump_model
(num_iteration=-1)[source]¶ Dump model to json format
Parameters: num_iteration (int) – Number of iterations to dump; < 0 means dump up to the best iteration (if one exists). Returns: Return type: JSON format of model
-
eval
(data, name, feval=None)[source]¶ Evaluate for data
Parameters: - data (Dataset object) –
- name – Name of data
- feval (function) – Custom evaluation function.
Returns: result – Evaluation result list.
Return type: list
-
eval_train
(feval=None)[source]¶ Evaluate for training data
Parameters: feval (function) – Custom evaluation function. Returns: result – Evaluation result list. Return type: list
-
eval_valid
(feval=None)[source]¶ Evaluate for validation data
Parameters: feval (function) – Custom evaluation function. Returns: result – Evaluation result list. Return type: list
-
feature_importance
(importance_type='split')[source]¶ Get feature importances
Parameters: - importance_type (str, default "split") – How the importance is calculated: "split" is the number of times a feature is used in a model, "gain" is the total gain of splits which use the feature
Returns: result – Array of feature importances.
Return type: array
-
feature_name
()[source]¶ Get feature names.
Returns: result – Array of feature names. Return type: array
-
params_str
= None¶ construct booster object
-
predict
(data, num_iteration=-1, raw_score=False, pred_leaf=False, data_has_header=False, is_reshape=True)[source]¶ Predict logic
Parameters: - data (string/numpy array/scipy.sparse) – Data source for prediction When data type is string, it represents the path of txt file
- num_iteration (int) – Number of iterations used for prediction; < 0 means predict with the best iteration (if one exists)
- raw_score (bool) – True for predict raw score
- pred_leaf (bool) – True for predict leaf index
- data_has_header (bool) – Used for txt data
- is_reshape (bool) – Reshape to (nrow, ncol) if true
Returns: Return type: Prediction result
-
reset_parameter
(params)[source]¶ Reset parameters for booster
Parameters: - params (dict) – New parameters for boosters
- silent (boolean, optional) – Whether print messages during construction
-
save_model
(filename, num_iteration=-1)[source]¶ Save model of booster to file
Parameters: - filename (str) – Filename to save
- num_iteration (int) – Number of iteration that want to save. < 0 means save the best iteration(if have)
-
set_attr
(**kwargs)[source]¶ Set the attribute of the Booster.
Parameters: **kwargs – The attributes to set. Setting a value to None deletes an attribute.
-
update
(train_set=None, fobj=None)[source]¶ Update for one iteration. Note: for the multi-class task, the score is grouped by class_id first, then by row_id; if you want to get the i-th row score in the j-th class, access it as score[j * num_data + i], and you should group grad and hess in the same way. Parameters: - train_set – Training data; None means use the last training data
- fobj (function) – Customized objective function.
Returns: Return type: is_finished, bool
-
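A minimal sketch of how a trained Booster (bst, as in the earlier Quick Start examples) might be inspected with the methods above:
names = bst.feature_name()                               # list of feature names
imps = bst.feature_importance(importance_type='split')   # per-feature split counts
model_json = bst.dump_model()                            # JSON representation of the model
bst.save_model('model.txt')                              # write the text model to disk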
Training API¶
-
lightgbm.
train
(params, train_set, num_boost_round=100, valid_sets=None, valid_names=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, evals_result=None, verbose_eval=True, learning_rates=None, callbacks=None)[source]¶ Train with given parameters.
Parameters: - params (dict) – Parameters for training.
- train_set (Dataset) – Data to be trained.
- num_boost_round (int) – Number of boosting iterations.
- valid_sets (list of Datasets) – List of data to be evaluated during training
- valid_names (list of string) – Names of valid_sets
- fobj (function) – Customized objective function.
- feval (function) – Customized evaluation function. Note: should return (eval_name, eval_result, is_higher_better) or a list of such tuples
- init_model (file name of lightgbm model or 'Booster' instance) – model used for continued train
- feature_name (list of str, or 'auto') – Feature names If ‘auto’ and data is pandas DataFrame, use data columns name
- categorical_feature (list of str or int, or 'auto') – Categorical features, type int represents index, type str represents feature names (need to specify feature_name as well) If ‘auto’ and data is pandas DataFrame, use pandas categorical columns
- early_stopping_rounds (int) – Activates early stopping. Requires at least one validation data and one metric; if there's more than one, all of them will be checked. Returns the model with (best_iter + early_stopping_rounds) iterations. If early stopping occurs, the model will have an additional 'best_iteration' field.
- evals_result (dict or None) –
This dictionary is used to store all evaluation results of all the items in valid_sets. Example: with valid_sets containing [valid_set, train_set], valid_names containing ['eval', 'train'] and a parameter containing 'metric': 'logloss', it returns: {'train': {'logloss': ['0.48253', '0.35953', ...]}, 'eval': {'logloss': ['0.480385', '0.357756', ...]}}. Passing None means this feature is not used.
- verbose_eval (bool or int) – Requires at least one item in evals. If verbose_eval is True, the eval metric on the valid set is printed at each boosting stage. If verbose_eval is an int, the eval metric on the valid set is printed at every verbose_eval boosting stage. The last boosting stage, or the boosting stage found by using early_stopping_rounds, is also printed. Example: with verbose_eval=4 and at least one item in evals, an evaluation metric is printed every 4 (instead of 1) boosting stages.
- learning_rates (list or function) – List of learning rate for each boosting round or a customized function that calculates learning_rate in terms of current number of round (e.g. yields learning rate decay) - list l: learning_rate = l[current_round] - function f: learning_rate = f(current_round)
- callbacks (list of callback functions) – List of callback functions that are applied at each iteration. See Callbacks in Python-API.md for more information.
Returns: booster
Return type: a trained booster model
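A short usage sketch of lightgbm.train (the dataset names reuse the earlier Quick Start examples; the parameter values are illustrative):
params = {'objective': 'binary', 'metric': 'auc', 'num_leaves': 31}
evals_result = {}
bst = lgb.train(params, train_data,
                num_boost_round=100,
                valid_sets=[test_data],
                valid_names=['eval'],
                early_stopping_rounds=10,
                evals_result=evals_result,
                verbose_eval=10)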
-
lightgbm.
cv
(params, train_set, num_boost_round=10, folds=None, nfold=5, stratified=False, shuffle=True, metrics=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, fpreproc=None, verbose_eval=None, show_stdv=True, seed=0, callbacks=None)[source]¶ Cross-validation with given parameters.
Parameters: - params (dict) – Booster params.
- train_set (Dataset) – Data to be trained.
- num_boost_round (int) – Number of boosting iterations.
- folds (a generator or iterator of (train_idx, test_idx) tuples) – The train indices and test indices for each fold. This argument has the highest priority over other data split arguments.
- nfold (int) – Number of folds in CV.
- stratified (bool) – Perform stratified sampling.
- shuffle (bool) – Whether shuffle before split data
- metrics (string or list of strings) – Evaluation metrics to be watched in CV. If metrics is not None, the metric in params will be overridden.
- fobj (function) – Custom objective function.
- feval (function) – Custom evaluation function.
- init_model (file name of lightgbm model or 'Booster' instance) – model used for continued train
- feature_name (list of str, or 'auto') – Feature names If ‘auto’ and data is pandas DataFrame, use data columns name
- categorical_feature (list of str or int, or 'auto') – Categorical features, type int represents index, type str represents feature names (need to specify feature_name as well) If ‘auto’ and data is pandas DataFrame, use pandas categorical columns
- early_stopping_rounds (int) – Activates early stopping. CV error needs to decrease at least every <early_stopping_rounds> round(s) to continue. Last entry in evaluation history is the one from best iteration.
- fpreproc (function) – Preprocessing function that takes (dtrain, dtest, param) and returns transformed versions of those.
- verbose_eval (bool, int, or None, default None) –
Whether to display the progress. If None, progress will be displayed when np.ndarray is returned. If True, progress will be displayed at each boosting stage. If an integer is given, progress will be displayed at every given verbose_eval boosting stage.
- show_stdv (bool, default True) – Whether to display the standard deviation in progress. Results are not affected; they always contain std.
- seed (int) – Seed used to generate the folds (passed to numpy.random.seed).
- callbacks (list of callback functions) – List of callback functions that are applied at each iteration. See Callbacks in Python-API.md for more information.
Returns: evaluation history
Return type: list(string)
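A short usage sketch of lightgbm.cv (train_data is a Dataset as in the earlier examples; the parameter values are illustrative):
params = {'objective': 'binary', 'metric': 'binary_logloss'}
cv_history = lgb.cv(params, train_data,
                    num_boost_round=100,
                    nfold=5,
                    stratified=True,
                    early_stopping_rounds=10,
                    verbose_eval=20)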
Scikit-learn API¶
-
class
lightgbm.
LGBMModel
(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='regression', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, is_unbalance=False, seed=0, nthread=-1, silent=True, sigmoid=1.0, huber_delta=1.0, gaussian_eta=1.0, fair_c=1.0, poisson_max_delta_step=0.7, max_position=20, label_gain=None, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]¶ Bases:
object
-
apply
(X, num_iteration=0)[source]¶ Return the predicted leaf of every tree for each sample.
Parameters: - X (array_like, shape=[n_samples, n_features]) – Input features matrix.
- num_iteration (int) – Limit number of iterations in the prediction; defaults to 0 (use all trees).
Returns: X_leaves
Return type: array_like, shape=[n_samples, n_trees]
-
booster_
¶ Get the underlying lightgbm Booster of this model.
-
evals_result_
¶ Get the evaluation results.
-
feature_importances_
¶ Get normalized feature importances.
-
fit
(X, y, sample_weight=None, init_score=None, group=None, eval_set=None, eval_names=None, eval_sample_weight=None, eval_init_score=None, eval_group=None, eval_metric=None, early_stopping_rounds=None, verbose=True, feature_name='auto', categorical_feature='auto', callbacks=None)[source]¶ Fit the gradient boosting model
Parameters: - X (array_like) – Feature matrix
- y (array_like) – Labels
- sample_weight (array_like) – weight of training data
- init_score (array_like) – init score of training data
- group (array_like) – group data of training data
- eval_set (list, optional) – A list of (X, y) tuple pairs to use as a validation set for early-stopping
- eval_names (list of string) – Names of eval_set
- eval_sample_weight (List of array) – weight of eval data
- eval_init_score (List of array) – init score of eval data
- eval_group (List of array) – group data of eval data
- eval_metric (str, list of str, callable, optional) – If a str, should be a built-in evaluation metric to use. If callable, a custom evaluation metric, see note for more details.
- early_stopping_rounds (int) –
- verbose (bool) – If verbose and an evaluation set is used, writes the evaluation progress
- feature_name (list of str, or 'auto') – Feature names If ‘auto’ and data is pandas DataFrame, use data columns name
- categorical_feature (list of str or int, or 'auto') – Categorical features, type int represents index, type str represents feature names (need to specify feature_name as well) If ‘auto’ and data is pandas DataFrame, use pandas categorical columns
- callbacks (list of callback functions) – List of callback functions that are applied at each iteration. See Callbacks in Python-API.md for more information.
Note
- Custom eval function expects a callable with one of the following signatures: func(y_true, y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group), returning (eval_name, eval_result, is_bigger_better) or a list of such tuples.
- y_true: array_like of shape [n_samples] – the target values
- y_pred: array_like of shape [n_samples] or shape [n_samples * n_class] (for multi-class) – the predicted values
- weight: array_like of shape [n_samples] – the weight of samples
- group: array_like – group/query data, used for the ranking task
- eval_name: str – name of the evaluation
- eval_result: float – evaluation result
- is_bigger_better: bool – whether a bigger eval result is better, e.g. AUC is bigger_better
- For the multi-class task, y_pred is grouped by class_id first, then by row_id; to get the i-th row y_pred in the j-th class, access y_pred[j * num_data + i].
-
predict
(X, raw_score=False, num_iteration=0)[source]¶ Return the predicted value for each sample.
Parameters: - X (array_like, shape=[n_samples, n_features]) – Input features matrix.
- num_iteration (int) – Limit number of iterations in the prediction; defaults to 0 (use all trees).
Returns: predicted_result
Return type: array_like, shape=[n_samples] or [n_samples, n_classes]
-
-
class
lightgbm.
LGBMClassifier
(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='binary', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, is_unbalance=False, seed=0, nthread=-1, silent=True, sigmoid=1.0, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]¶ Bases:
lightgbm.sklearn.LGBMModel
,object
-
classes_
¶ Get class label array.
-
n_classes_
¶ Get number of classes
-
predict_proba
(X, raw_score=False, num_iteration=0)[source]¶ Return the predicted probability for each class for each sample.
Parameters: - X (array_like, shape=[n_samples, n_features]) – Input features matrix.
- num_iteration (int) – Limit number of iterations in the prediction; defaults to 0 (use all trees).
Returns: predicted_probability
Return type: array_like, shape=[n_samples, n_classes]
-
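A minimal sketch of the scikit-learn interface (the synthetic data and parameter values are illustrative only):
import numpy as np

X = np.random.rand(100, 10)
y = np.random.randint(2, size=100)
clf = lgb.LGBMClassifier(num_leaves=31, learning_rate=0.1, n_estimators=10)
clf.fit(X, y, eval_set=[(X, y)], eval_metric='binary_logloss', verbose=False)
pred_class = clf.predict(X)        # class labels
pred_proba = clf.predict_proba(X)  # class probabilities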
-
class
lightgbm.
LGBMRegressor
(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='regression', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, seed=0, nthread=-1, silent=True, huber_delta=1.0, gaussian_eta=1.0, fair_c=1.0, poisson_max_delta_step=0.7, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]¶ Bases:
lightgbm.sklearn.LGBMModel
,object
-
class
lightgbm.
LGBMRanker
(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='lambdarank', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, is_unbalance=False, seed=0, nthread=-1, silent=True, sigmoid=1.0, max_position=20, label_gain=None, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]¶ Bases:
lightgbm.sklearn.LGBMModel
-
fit
(X, y, sample_weight=None, init_score=None, group=None, eval_set=None, eval_names=None, eval_sample_weight=None, eval_init_score=None, eval_group=None, eval_metric='ndcg', eval_at=1, early_stopping_rounds=None, verbose=True, feature_name='auto', categorical_feature='auto', callbacks=None)[source]¶ Most arguments are the same as for the common fit method, except the following:
- eval_at : list of int
- The evaluation positions of NDCG
-
Callbacks¶
-
lightgbm.
early_stopping
(stopping_rounds, verbose=True)[source]¶ Create a callback that activates early stopping. Requires at least one validation data and one metric; if there's more than one, all of them will be checked.
Parameters: - stopping_rounds (int) – The number of rounds with no improvement after which training will be stopped.
- verbose (optional, bool) – Whether to print message about early stopping information.
Returns: callback – The requested callback function.
Return type: function
-
lightgbm.
print_evaluation
(period=1, show_stdv=True)[source]¶ Create a callback that prints the evaluation results.
Parameters: - period (int) – The period to log the evaluation results
- show_stdv (bool, optional) – Whether to show stdv if provided
Returns: callback – A callback that prints the evaluation results every period iterations.
Return type: function
-
lightgbm.
record_evaluation
(eval_result)[source]¶ Create a call back that records the evaluation history into eval_result.
Parameters: eval_result (dict) – A dictionary to store the evaluation results. Returns: callback – The requested callback function. Return type: function
-
lightgbm.
reset_parameter
(**kwargs)[source]¶ Reset parameters after the first iteration.
NOTE: the initial parameters will still take effect on the first iteration.
Parameters: **kwargs (value should be list or function) – List of parameters for each boosting round or a customized function that calculates learning_rate in terms of current number of round (e.g. yields learning rate decay) - list l: parameter = l[current_round] - function f: parameter = f(current_round) Returns: callback – The requested callback function. Return type: function
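A short sketch combining these callbacks with lightgbm.train (params, train_data and test_data are assumed to exist, as in the earlier examples; the values are illustrative):
eval_history = {}
bst = lgb.train(params, train_data, num_boost_round=100,
                valid_sets=[test_data],
                callbacks=[lgb.print_evaluation(period=10),
                           lgb.record_evaluation(eval_history),
                           lgb.early_stopping(stopping_rounds=10),
                           lgb.reset_parameter(learning_rate=lambda i: 0.1 * (0.99 ** i))])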
Plotting¶
-
lightgbm.
plot_importance
(booster, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance', xlabel='Feature importance', ylabel='Features', importance_type='split', max_num_features=None, ignore_zero=True, figsize=None, grid=True, **kwargs)[source]¶ Plot model feature importances.
Parameters: - booster (Booster or LGBMModel) – Booster or LGBMModel instance
- ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
- height (float) – Bar height, passed to ax.barh()
- xlim (tuple of 2 elements) – Tuple passed to axes.xlim()
- ylim (tuple of 2 elements) – Tuple passed to axes.ylim()
- title (str) – Axes title. Pass None to disable.
- xlabel (str) – X axis title label. Pass None to disable.
- ylabel (str) – Y axis title label. Pass None to disable.
- importance_type (str) – How the importance is calculated: “split” or “gain” “split” is the number of times a feature is used in a model “gain” is the total gain of splits which use the feature
- max_num_features (int) – Max number of top features displayed on plot. If None or smaller than 1, all features will be displayed.
- ignore_zero (bool) – Ignore features with zero importance
- figsize (tuple of 2 elements) – Figure size
- grid (bool) – Whether add grid for axes
- **kwargs – Other keywords passed to ax.barh()
Returns: ax
Return type: matplotlib Axes
-
lightgbm.
plot_metric
(booster, metric=None, dataset_names=None, ax=None, xlim=None, ylim=None, title='Metric during training', xlabel='Iterations', ylabel='auto', figsize=None, grid=True)[source]¶ Plot one metric during training.
Parameters: - booster (dict or LGBMModel) – Evals_result recorded by lightgbm.train() or LGBMModel instance
- metric (str or None) – The metric name to plot. Only one metric supported because different metrics have various scales. Pass None to pick first one (according to dict hashcode).
- dataset_names (None or list of str) – List of the dataset names to plot. Pass None to plot all datasets.
- ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
- xlim (tuple of 2 elements) – Tuple passed to axes.xlim()
- ylim (tuple of 2 elements) – Tuple passed to axes.ylim()
- title (str) – Axes title. Pass None to disable.
- xlabel (str) – X axis title label. Pass None to disable.
- ylabel (str) – Y axis title label. Pass None to disable. Pass ‘auto’ to use metric.
- figsize (tuple of 2 elements) – Figure size
- grid (bool) – Whether add grid for axes
Returns: ax
Return type: matplotlib Axes
-
lightgbm.
plot_tree
(booster, ax=None, tree_index=0, figsize=None, graph_attr=None, node_attr=None, edge_attr=None, show_info=None)[source]¶ Plot specified tree.
Parameters: - booster (Booster, LGBMModel) – Booster or LGBMModel instance.
- ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
- tree_index (int, default 0) – Specify tree index of target tree.
- figsize (tuple of 2 elements) – Figure size.
- graph_attr (dict) – Mapping of (attribute, value) pairs for the graph.
- node_attr (dict) – Mapping of (attribute, value) pairs set for all nodes.
- edge_attr (dict) – Mapping of (attribute, value) pairs set for all edges.
- show_info (list) – Information shows on nodes. options: ‘split_gain’, ‘internal_value’, ‘internal_count’ or ‘leaf_count’.
Returns: ax
Return type: matplotlib Axes
-
lightgbm.
create_tree_digraph
(booster, tree_index=0, show_info=None, name=None, comment=None, filename=None, directory=None, format=None, engine=None, encoding=None, graph_attr=None, node_attr=None, edge_attr=None, body=None, strict=False)[source]¶ Create a digraph of specified tree.
Parameters: - booster (Booster, LGBMModel) – Booster or LGBMModel instance.
- tree_index (int, default 0) – Specify tree index of target tree.
- show_info (list) – Information shows on nodes. options: ‘split_gain’, ‘internal_value’, ‘internal_count’ or ‘leaf_count’.
- name (str) – Graph name used in the source code.
- comment (str) – Comment added to the first line of the source.
- filename (str) – Filename for saving the source (defaults to name + ‘.gv’).
- directory (str) – (Sub)directory for source saving and rendering.
- format (str) – Rendering output format (‘pdf’, ‘png’, ...).
- engine (str) – Layout command used (‘dot’, ‘neato’, ...).
- encoding (str) – Encoding for saving the source.
- graph_attr (dict) – Mapping of (attribute, value) pairs for the graph.
- node_attr (dict) – Mapping of (attribute, value) pairs set for all nodes.
- edge_attr (dict) – Mapping of (attribute, value) pairs set for all edges.
- body (list of str) – Iterable of lines to add to the graph body.
- strict (bool) – Whether rendering should merge multi-edges.
Returns: graph
Return type: graphviz Digraph
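A minimal plotting sketch (assuming matplotlib and graphviz are installed, and reusing a trained Booster bst and an evals_result dict recorded with 'auc' from the training examples):
import matplotlib.pyplot as plt

ax = lgb.plot_importance(bst, max_num_features=10)  # bar chart of feature importances
ax = lgb.plot_metric(evals_result, metric='auc')    # metric curve recorded during training
ax = lgb.plot_tree(bst, tree_index=0)               # draw the first tree (needs graphviz)
plt.show()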
LightGBM GPU Tutorial¶
The purpose of this document is to give you a quick step-by-step tutorial on GPU training.
For Windows, please see GPU Windows Tutorial.
We will use the GPU instance on Microsoft Azure cloud computing platform for demonstration, but you can use any machine with modern AMD or NVIDIA GPUs.
GPU Setup¶
You need to launch an NV type instance on Azure (available in the East US, North Central US, South Central US, West Europe and Southeast Asia zones) and select Ubuntu 16.04 LTS as the operating system.
For testing, the smallest NV6 type virtual machine is sufficient; it includes half an M60 GPU, with 8 GB memory, 180 GB/s memory bandwidth and 4,825 GFLOPS peak computation power. Don't use the NC type instance, as its GPUs (K80) are based on an older architecture (Kepler).
First we need to install minimal NVIDIA drivers and OpenCL development environment:
sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-375
sudo apt-get install --no-install-recommends nvidia-opencl-icd-375 nvidia-opencl-dev opencl-headers
After installing the drivers you need to restart the server.
sudo init 6
After about 30 seconds, the server should be up again.
If you are using an AMD GPU, you should download and install the AMDGPU-Pro driver, and also install the packages ocl-icd-libopencl1 and ocl-icd-opencl-dev.
Build LightGBM¶
Now install necessary building tools and dependencies:
sudo apt-get install --no-install-recommends git cmake build-essential libboost-dev libboost-system-dev libboost-filesystem-dev
The NV6 GPU instance has a 320 GB ultra-fast SSD mounted at /mnt. Let’s use it as our workspace (skip this if you are using your own machine):
sudo mkdir -p /mnt/workspace
sudo chown $(whoami):$(whoami) /mnt/workspace
cd /mnt/workspace
Now we are ready to checkout LightGBM and compile it with GPU support:
git clone --recursive https://github.com/Microsoft/LightGBM
cd LightGBM
mkdir build ; cd build
cmake -DUSE_GPU=1 ..
make -j$(nproc)
cd ..
You will see that two binaries are generated, lightgbm and lib_lightgbm.so.
If you are building on OSX, you probably need to remove the macro BOOST_COMPUTE_USE_OFFLINE_CACHE in src/treelearner/gpu_tree_learner.h to avoid a known crash bug in Boost.Compute.
Install Python Interface (optional)¶
If you want to use the Python interface of LightGBM, you can install it now (along with some necessary Python package dependencies):
sudo apt-get -y install python-pip
sudo -H pip install setuptools numpy scipy scikit-learn -U
cd python-package/
sudo python setup.py install
cd ..
You need to set the additional parameter "device" : "gpu" (along with your other options like learning_rate, num_leaves, etc.) to use the GPU in Python.
You can read our Python Guide for more information on how to use the Python interface.
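For instance, a hedged Python sketch (train_data is assumed to be an existing Dataset; the other values are illustrative) that turns on GPU training:
params = {
    'objective': 'binary',
    'device': 'gpu',   # enable GPU training
    'max_bin': 63,     # a smaller max_bin is recommended for better GPU speed-up
    'num_leaves': 255,
    'learning_rate': 0.1,
}
bst = lgb.train(params, train_data, num_boost_round=50)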
Dataset Preparation¶
Use the following commands to prepare the Higgs dataset:
git clone https://github.com/guolinke/boosting_tree_benchmarks.git
cd boosting_tree_benchmarks/data
wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"
gunzip HIGGS.csv.gz
python higgs2libsvm.py
cd ../..
ln -s boosting_tree_benchmarks/data/higgs.train
ln -s boosting_tree_benchmarks/data/higgs.test
Now we create a configuration file for LightGBM by running the following commands (please copy the entire block and run it as a whole):
cat > lightgbm_gpu.conf <<EOF
max_bin = 63
num_leaves = 255
num_iterations = 50
learning_rate = 0.1
tree_learner = serial
task = train
is_train_metric = false
min_data_in_leaf = 1
min_sum_hessian_in_leaf = 100
ndcg_eval_at = 1,3,5,10
sparse_threshold = 1.0
device = gpu
gpu_platform_id = 0
gpu_device_id = 0
EOF
echo "num_threads=$(nproc)" >> lightgbm_gpu.conf
GPU support is enabled in the configuration file we just created by setting device=gpu. It will use the first GPU installed on the system by default (gpu_platform_id=0 and gpu_device_id=0).
Run Your First Learning Task on GPU¶
Now we are ready to start GPU training! First we want to verify the GPU works correctly. Run the following command to train on GPU, and take a note of the AUC after 50 iterations:
./lightgbm config=lightgbm_gpu.conf data=higgs.train valid=higgs.test objective=binary metric=auc
Now train the same dataset on CPU using the following command. You should observe a similar AUC:
./lightgbm config=lightgbm_gpu.conf data=higgs.train valid=higgs.test objective=binary metric=auc device=cpu
Now we can make a speed test on GPU without calculating AUC after each iteration.
./lightgbm config=lightgbm_gpu.conf data=higgs.train objective=binary metric=auc
Speed test on CPU:
./lightgbm config=lightgbm_gpu.conf data=higgs.train objective=binary metric=auc device=cpu
You should observe over three times speedup on this GPU.
The GPU acceleration can be used on other tasks/metrics (regression, multi-class classification, ranking, etc) as well. For example, we can train the Higgs dataset on GPU as a regression task:
./lightgbm config=lightgbm_gpu.conf data=higgs.train objective=regression_l2 metric=l2
Also, you can compare the training speed with CPU:
./lightgbm config=lightgbm_gpu.conf data=higgs.train objective=regression_l2 metric=l2 device=cpu
LightGBM FAQ¶
Catalog¶
Python-package¶
Question 1: I see error messages like this when installing from GitHub using python setup.py install:
error: Error: setup script specifies an absolute path: /Users/Microsoft/LightGBM/python-package/lightgbm/../../lib_lightgbm.so setup() arguments must *always* be /-separated paths relative to the setup.py directory, *never* absolute paths.
Solution 1: this error should be fixed in the latest version. If you still encounter it, try removing the lightgbm.egg-info folder in your python-package directory and reinstalling, or check this thread on Stack Overflow.
Question 2: I see error messages like Cannot get/set label/weight/init_score/group/num_data/num_feature before construct dataset, but I have already constructed a dataset with code like train = lightgbm.Dataset(X_train, y_train), or error messages like Cannot set predictor/reference/categorical feature after freed raw data, set free_raw_data=False when construct Dataset to avoid this.
Solution 2: LightGBM constructs bin mappers to build trees, and the train and valid Datasets within one Booster share the same bin mappers, categorical features, feature names, etc., so the Dataset objects are only fully constructed when a Booster is constructed. If you set free_raw_data=True (the default), the raw data (held in Python data structures) will be freed at that point. So, if you want to do any of the following (a short sketch follows this list):
- get the label (or weight/init_score/group) before constructing the dataset: it is the same as getting self.label
- set the label (or weight/init_score/group) before constructing the dataset: it is the same as self.label = some_label_array
- get num_data (or num_feature) before constructing the dataset: you can get the data with self.data; then, if your data is a numpy.ndarray, use something like self.data.shape
- set the predictor (or reference/categorical feature) after constructing the dataset: you should set free_raw_data=False, or initialize a Dataset object with the same raw data
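A minimal sketch following the points above (X_valid, y_valid and params are assumed to exist; the attribute access mirrors the self.data description above):
import lightgbm

train = lightgbm.Dataset(X_train, y_train, free_raw_data=False)  # keep the raw python data around
num_rows = train.data.shape[0]  # raw data is still accessible before the Booster is constructed
valid = lightgbm.Dataset(X_valid, y_valid, reference=train, free_raw_data=False)
bst = lightgbm.train(params, train, valid_sets=[valid])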
Development Guide¶
Classes And Code Structure¶
Important Classes¶
| Class | description |
| —– | ——— |
| Application
| The entrance of application, including training and prediction logic |
| Bin
| Data structure used to store discrete feature values (converted from float values) |
| Boosting
| Boosting interface, current implementation is GBDT and DART |
| Config
| Store parameters and configurations|
| Dataset
| Store information of dataset |
| DatasetLoader
| Used to construct dataset |
| Feature
| Store One column feature |
| Metric
| Evaluation metrics |
| Network
| Network interfaces and communication algorithms |
| ObjectiveFunction
| Objective function used to train |
| Tree
| Store information of tree model |
| TreeLearner
| Used to learn trees |
Code Structure¶
| Path | description |
| —– | ——— |
| ./include | header files |
| ./include/utils | some common functions |
| ./src/application | Implementations of training and prediction logic |
| ./src/boosting | Implementations of Boosting |
| ./src/io | Implementations of IO related classes, including Bin
, Config
, Dataset
, DatasetLoader
, Feature
and Tree
|
| ./src/metric | Implementations of metrics |
| ./src/network | Implementations of network functions |
| ./src/objective | Implementations of objective functions |
| ./src/treelearner | Implementations of tree learners |
High level Language package¶
Follow the implementation of python-package.