lightgbm package

Data Structure API

class lightgbm.Dataset(data, label=None, max_bin=255, reference=None, weight=None, group=None, silent=False, feature_name='auto', categorical_feature='auto', params=None, free_raw_data=True)[source]

Bases: object

Dataset in LightGBM.

construct()[source]

Lazy init

create_valid(data, label=None, weight=None, group=None, silent=False, params=None)[source]

Create validation data aligned with the current dataset.

Parameters:
  • data (string/numpy array/scipy.sparse) – Data source of Dataset. When data is a string, it is the path to a text file.
  • label (list or numpy 1-D array, optional) – Label of the training data.
  • weight (list or numpy 1-D array, optional) – Weight for each instance.
  • group (list or numpy 1-D array, optional) – Group/query size for the dataset.
  • silent (boolean, optional) – Whether to print messages during construction.
  • params (dict, optional) – Other parameters
get_field(field_name)[source]

Get property from the Dataset.

Parameters:field_name (str) – The field name of the information
Returns:info – A numpy array of information of the data
Return type:array
get_group()[source]

Get the group of the Dataset.

Returns:group
Return type:array
get_init_score()[source]

Get the initial score of the Dataset.

Returns:init_score
Return type:array
get_label()[source]

Get the label of the Dataset.

Returns:label
Return type:array
get_weight()[source]

Get the weight of the Dataset.

Returns:weight
Return type:array
num_data()[source]

Get the number of rows in the Dataset.

Returns:number of rows
Return type:int
num_feature()[source]

Get the number of columns (features) in the Dataset.

Returns:number of columns
Return type:int
save_binary(filename)[source]

Save Dataset to binary file

Parameters:filename (string) – Name of the output file.
set_categorical_feature(categorical_feature)[source]

Set categorical features

Parameters:categorical_feature (list of int or str) – Name/index of categorical features
set_feature_name(feature_name)[source]

Set feature name

Parameters:feature_name (list of str) – Feature names
set_field(field_name, data)[source]

Set property into the Dataset.

Parameters:
  • field_name (str) – The field name of the information
  • data (numpy array or list or None) – The array of data to be set
set_group(group)[source]

Set group size of Dataset (used for ranking).

Parameters:group (numpy array or list or None) – Group size of each group
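
As a rough illustration of what "group size of each group" means for ranking data, the sketch below expands a list of group sizes into one query id per row. It assumes that rows of the same query are stored contiguously and that the sizes sum to the number of rows. The helper name `expand_group_sizes` is hypothetical and is not part of lightgbm.

```python
def expand_group_sizes(group):
    """Map each row index to the index of the query (group) it belongs to.

    `group` holds one size per query, in row order (an assumption of this sketch).
    """
    query_ids = []
    for query_index, size in enumerate(group):
        # The next `size` rows all belong to this query.
        query_ids.extend([query_index] * size)
    return query_ids


group = [3, 2, 4]          # three queries covering 3 + 2 + 4 = 9 rows
query_ids = expand_group_sizes(group)
print(query_ids)
```

The length of the expanded list equals the sum of the group sizes, which is a quick sanity check against the number of rows in the data.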
set_init_score(init_score)[source]

Set init score of booster to start from.

Parameters:init_score (numpy array or list or None) – Init score for booster
set_label(label)[source]

Set label of Dataset

Parameters:label (numpy array or list or None) – The label information to be set into Dataset
set_reference(reference)[source]

Set reference dataset

Parameters:reference (Dataset) – Will use reference as a template to construct the current dataset
set_weight(weight)[source]

Set weight of each instance.

Parameters:weight (numpy array or list or None) – Weight for each data point
subset(used_indices, params=None)[source]

Get subset of current dataset

Parameters:
  • used_indices (list of int) – Used indices of this subset
  • params (dict) – Other parameters
class lightgbm.Booster(params=None, train_set=None, model_file=None, silent=False)[source]

Bases: object

Booster in LightGBM.

add_valid(data, name)[source]

Add validation data

Parameters:
  • data (Dataset) – Validation data
  • name (String) – Name of validation data
attr(key)[source]

Get attribute string from the Booster.

Parameters:key (str) – The key to get attribute from.
Returns:value – The attribute value of the key; returns None if the attribute does not exist.
Return type:str
dump_model(num_iteration=-1)[source]

Dump model to JSON format

Parameters:num_iteration (int) – Number of iterations to dump. < 0 means dump up to the best iteration (if one exists)
Returns:JSON format of model
eval(data, name, feval=None)[source]

Evaluate for data

Parameters:
  • data (Dataset object) – Data to evaluate
  • name – Name of data
  • feval (function) – Custom evaluation function.
Returns:

result – Evaluation result list.

Return type:

list

eval_train(feval=None)[source]

Evaluate for training data

Parameters:feval (function) – Custom evaluation function.
Returns:result – Evaluation result list.
Return type:list
eval_valid(feval=None)[source]

Evaluate for validation data

Parameters:feval (function) – Custom evaluation function.
Returns:result – Evaluation result list.
Return type:list
feature_importance(importance_type='split')[source]

Get feature importances

Parameters:importance_type (str, default "split") – How the importance is calculated. "split" is the number of times a feature is used in a model; "gain" is the total gain of splits which use the feature.
Returns:

result – Array of feature importances.

Return type:

array
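
The difference between the two importance types can be sketched with plain Python. The example below is only an illustration under stated assumptions: the list of `(feature_index, gain)` split records and the helper `count_importance` are hypothetical stand-ins, not lightgbm data structures or functions.

```python
def count_importance(splits, num_features, importance_type="split"):
    """Aggregate per-feature importance from hypothetical split records.

    splits: list of (feature_index, gain) tuples, one per split in the model.
    """
    importance = [0.0] * num_features
    for feature_index, gain in splits:
        if importance_type == "split":
            # Count how many times the feature is used in a split.
            importance[feature_index] += 1
        elif importance_type == "gain":
            # Sum the gain of every split that uses the feature.
            importance[feature_index] += gain
        else:
            raise ValueError("importance_type must be 'split' or 'gain'")
    return importance


# Feature 0 is used twice, feature 2 once, feature 1 never.
splits = [(0, 2.5), (2, 1.0), (0, 0.5)]
print(count_importance(splits, 3, "split"))
print(count_importance(splits, 3, "gain"))
```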

feature_name()[source]

Get feature names.

Returns:result – Array of feature names.
Return type:array
params_str = None

construct booster object

predict(data, num_iteration=-1, raw_score=False, pred_leaf=False, data_has_header=False, is_reshape=True)[source]

Predict logic

Parameters:
  • data (string/numpy array/scipy.sparse) – Data source for prediction. When data is a string, it is the path to a text file.
  • num_iteration (int) – Iteration used for prediction. < 0 means predict with the best iteration (if one exists).
  • raw_score (bool) – True for predict raw score
  • pred_leaf (bool) – True for predict leaf index
  • data_has_header (bool) – Used for txt data
  • is_reshape (bool) – Reshape to (nrow, ncol) if true
Returns:Prediction result

reset_parameter(params)[source]

Reset parameters for booster

Parameters:
  • params (dict) – New parameters for boosters
  • silent (boolean, optional) – Whether to print messages during construction
rollback_one_iter()[source]

Rollback one iteration

save_model(filename, num_iteration=-1)[source]

Save model of booster to file

Parameters:
  • filename (str) – Filename to save
  • num_iteration (int) – Number of iterations to save. < 0 means save the best iteration (if one exists)
set_attr(**kwargs)[source]

Set the attribute of the Booster.

Parameters:**kwargs – The attributes to set. Setting a value to None deletes an attribute.
update(train_set=None, fobj=None)[source]

Update for one iteration.

Note: for multi-class tasks, the score is grouped by class_id first, then by row_id. To get the score of the i-th row in the j-th class, access score[j*num_data+i]; grad and hess should be grouped in the same way.
Parameters:
  • train_set – Training data, None means use last training data
  • fobj (function) – Customized objective function.
Returns:

is_finished

Return type:

bool
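
The class-major layout described in the note for update() can be checked with a small pure-Python sketch. The helper names `flat_index` and `flatten_by_class` are hypothetical and only illustrate the index arithmetic score[j*num_data+i]; they are not part of lightgbm.

```python
def flat_index(class_id, row_id, num_data):
    """Position of (row_id, class_id) in a flat array grouped by class first."""
    return class_id * num_data + row_id


def flatten_by_class(scores_by_row):
    """Turn scores_by_row[i][j] into a flat list grouped by class, then by row."""
    num_data = len(scores_by_row)
    num_class = len(scores_by_row[0])
    return [scores_by_row[i][j] for j in range(num_class) for i in range(num_data)]


# Two rows, three classes.
scores_by_row = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]
flat = flatten_by_class(scores_by_row)

# Score of row 0 in class 1 lives at index 1 * num_data + 0.
print(flat[flat_index(1, 0, num_data=2)])
```

A custom objective that computes grad and hess per row and per class would need to flatten both arrays the same way before returning them.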

Training API

lightgbm.train(params, train_set, num_boost_round=100, valid_sets=None, valid_names=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, evals_result=None, verbose_eval=True, learning_rates=None, callbacks=None)[source]

Train with given parameters.

Parameters:
  • params (dict) – Parameters for training.
  • train_set (Dataset) – Data to be trained.
  • num_boost_round (int) – Number of boosting iterations.
  • valid_sets (list of Datasets) – List of data to be evaluated during training
  • valid_names (list of string) – Names of valid_sets
  • fobj (function) – Customized objective function.
  • feval (function) – Customized evaluation function. Note: should return (eval_name, eval_result, is_higher_better) or a list of such tuples.
  • init_model (file name of lightgbm model or 'Booster' instance) – Model used for continued training.
  • feature_name (list of str, or 'auto') – Feature names. If ‘auto’ and data is a pandas DataFrame, the data column names are used.
  • categorical_feature (list of str or int, or 'auto') – Categorical features. Type int represents an index; type str represents a feature name (feature_name must be specified as well). If ‘auto’ and data is a pandas DataFrame, pandas categorical columns are used.
  • early_stopping_rounds (int) – Activates early stopping. Requires at least one validation dataset and one metric. If there is more than one, all of them are checked. Returns the model with (best_iter + early_stopping_rounds) iterations. If early stopping occurs, the model will have a ‘best_iteration’ field.
  • evals_result (dict or None) –

    This dictionary is used to store all evaluation results of all the items in valid_sets. Example: with valid_sets containing [valid_set, train_set], valid_names containing [‘eval’, ‘train’], and a parameter containing (‘metric’:’logloss’), it returns:

    {‘train’: {‘logloss’: [‘0.48253’, ‘0.35953’, ...]},
    ‘eval’: {‘logloss’: [‘0.480385’, ‘0.357756’, ...]}}

    Passing None disables this feature.

  • verbose_eval (bool or int) –

    Requires at least one item in valid_sets. If verbose_eval is True, the eval metric on the valid set is printed at each boosting stage. If verbose_eval is an int, the eval metric on the valid set is printed at every verbose_eval boosting stage. The last boosting stage, or the boosting stage found by using early_stopping_rounds, is also printed. Example: with verbose_eval=4 and at least one item in valid_sets, an evaluation metric is printed every 4 (instead of 1) boosting stages.
  • learning_rates (list or function) – List of learning rates for each boosting round, or a customized function that calculates learning_rate from the current round number (e.g. to yield learning rate decay).
      - list l: learning_rate = l[current_round]
      - function f: learning_rate = f(current_round)
  • callbacks (list of callback functions) – List of callback functions that are applied at each iteration. See Callbacks in Python-API.md for more information.
Returns:

booster – A trained booster model.

Return type:

Booster
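
The nested layout of evals_result (dataset name, then metric name, then one value per boosting round) can be read back with ordinary dictionary access. The sketch below reuses the first two values from the example in the evals_result description; the helper `best_round` is hypothetical, and it converts values with float() so that it works whether they are stored as strings or as numbers.

```python
# Only the first two rounds of the documented example are reproduced here.
evals_result = {
    "train": {"logloss": ["0.48253", "0.35953"]},
    "eval": {"logloss": ["0.480385", "0.357756"]},
}


def best_round(evals_result, data_name, metric, higher_is_better=False):
    """Return (round_index, value) of the best recorded value for one metric."""
    values = [float(v) for v in evals_result[data_name][metric]]
    pick = max if higher_is_better else min
    best_value = pick(values)
    return values.index(best_value), best_value


# logloss is a metric where lower is better.
print(best_round(evals_result, "eval", "logloss"))
```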

lightgbm.cv(params, train_set, num_boost_round=10, folds=None, nfold=5, stratified=False, shuffle=True, metrics=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, fpreproc=None, verbose_eval=None, show_stdv=True, seed=0, callbacks=None)[source]

Cross-validation with given parameters.

Parameters:
  • params (dict) – Booster params.
  • train_set (Dataset) – Data to be trained.
  • num_boost_round (int) – Number of boosting iterations.
  • folds (a generator or iterator of (train_idx, test_idx) tuples) – The train indices and test indices for each fold. This argument takes priority over the other data split arguments.
  • nfold (int) – Number of folds in CV.
  • stratified (bool) – Perform stratified sampling.
  • shuffle (bool) – Whether to shuffle before splitting the data.
  • metrics (string or list of strings) – Evaluation metrics to be watched in CV. If metrics is not None, the metric in params will be overridden.
  • fobj (function) – Custom objective function.
  • feval (function) – Custom evaluation function.
  • init_model (file name of lightgbm model or 'Booster' instance) – Model used for continued training.
  • feature_name (list of str, or 'auto') – Feature names. If ‘auto’ and data is a pandas DataFrame, the data column names are used.
  • categorical_feature (list of str or int, or 'auto') – Categorical features. Type int represents an index; type str represents a feature name (feature_name must be specified as well). If ‘auto’ and data is a pandas DataFrame, pandas categorical columns are used.
  • early_stopping_rounds (int) – Activates early stopping. CV error needs to decrease at least every <early_stopping_rounds> round(s) to continue. Last entry in evaluation history is the one from best iteration.
  • fpreproc (function) – Preprocessing function that takes (dtrain, dtest, param) and returns transformed versions of those.
  • verbose_eval (bool, int, or None, default None) –

    Whether to display the progress. If None, progress will be displayed when np.ndarray is returned. If True, progress will be displayed at every boosting stage. If an integer is given, progress will be displayed at every given verbose_eval boosting stage.
  • show_stdv (bool, default True) – Whether to display the standard deviation in progress. Results are not affected, and always contain std.
  • seed (int) – Seed used to generate the folds (passed to numpy.random.seed).
  • callbacks (list of callback functions) – List of callback functions that are applied at each iteration. See Callbacks in Python-API.md for more information.
Returns:

evaluation history

Return type:

list(string)
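
The folds argument is described as a generator or iterator of (train_idx, test_idx) tuples. A minimal sketch of such a generator is shown below; it splits row indices into contiguous blocks without shuffling. The name `contiguous_folds` is hypothetical, and whether the indices must be lists or arrays is not stated in this reference, so plain lists are an assumption here.

```python
def contiguous_folds(num_data, nfold):
    """Yield (train_idx, test_idx) tuples over contiguous blocks of row indices."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        num_data // nfold + (1 if k < num_data % nfold else 0)
        for k in range(nfold)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, num_data))
        yield train_idx, test_idx
        start += size


folds = list(contiguous_folds(num_data=7, nfold=3))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

Every row index appears in exactly one test fold, and the train and test indices of a fold never overlap.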

Scikit-learn API

class lightgbm.LGBMModel(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='regression', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, is_unbalance=False, seed=0, nthread=-1, silent=True, sigmoid=1.0, huber_delta=1.0, gaussian_eta=1.0, fair_c=1.0, poisson_max_delta_step=0.7, max_position=20, label_gain=None, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]

Bases: object

apply(X, num_iteration=0)[source]

Return the predicted leaf of every tree for each sample.

Parameters:
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.
  • num_iteration (int) – Limit number of iterations in the prediction; defaults to 0 (use all trees).
Returns:

X_leaves

Return type:

array_like, shape=[n_samples, n_trees]

booster_

Get the underlying lightgbm Booster of this model.

evals_result_

Get the evaluation results.

feature_importances_

Get normalized feature importances.

fit(X, y, sample_weight=None, init_score=None, group=None, eval_set=None, eval_names=None, eval_sample_weight=None, eval_init_score=None, eval_group=None, eval_metric=None, early_stopping_rounds=None, verbose=True, feature_name='auto', categorical_feature='auto', callbacks=None)[source]

Fit the gradient boosting model

Parameters:
  • X (array_like) – Feature matrix
  • y (array_like) – Labels
  • sample_weight (array_like) – weight of training data
  • init_score (array_like) – init score of training data
  • group (array_like) – group data of training data
  • eval_set (list, optional) – A list of (X, y) tuple pairs to use as a validation set for early-stopping
  • eval_names (list of string) – Names of eval_set
  • eval_sample_weight (List of array) – weight of eval data
  • eval_init_score (List of array) – init score of eval data
  • eval_group (List of array) – group data of eval data
  • eval_metric (str, list of str, callable, optional) – If a str, should be a built-in evaluation metric to use. If callable, a custom evaluation metric, see note for more details.
  • early_stopping_rounds (int) –
  • verbose (bool) – If verbose and an evaluation set is used, writes the evaluation
  • feature_name (list of str, or 'auto') – Feature names. If ‘auto’ and data is a pandas DataFrame, the data column names are used.
  • categorical_feature (list of str or int, or 'auto') – Categorical features. Type int represents an index; type str represents a feature name (feature_name must be specified as well). If ‘auto’ and data is a pandas DataFrame, pandas categorical columns are used.
  • callbacks (list of callback functions) – List of callback functions that are applied at each iteration. See Callbacks in Python-API.md for more information.

Note

Custom eval function expects a callable with one of the following signatures:
func(y_true, y_pred), func(y_true, y_pred, weight)
or func(y_true, y_pred, weight, group).
It should return (eval_name, eval_result, is_bigger_better)
or a list of (eval_name, eval_result, is_bigger_better).

  • y_true (array_like of shape [n_samples]) – The target values
  • y_pred (array_like of shape [n_samples] or shape [n_samples * n_class] for multi-class) – The predicted values
  • weight (array_like of shape [n_samples]) – The weight of samples
  • group (array_like) – Group/query data, used for ranking tasks
  • eval_name (str) – Name of the evaluation
  • eval_result (float) – The evaluation result
  • is_bigger_better (bool) – Whether a bigger eval result is better, e.g. AUC is bigger_better.

For multi-class tasks, y_pred is grouped by class_id first, then by row_id. To get the y_pred of the i-th row in the j-th class, access y_pred[j*num_data+i].
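
A custom evaluation function following the two-argument signature in the note above might look like the sketch below. It computes the mean absolute error in pure Python; the function name and the metric label "mae" are illustrative choices, not names defined by lightgbm.

```python
def mean_absolute_error_eval(y_true, y_pred):
    """Custom eval in the func(y_true, y_pred) form.

    Returns (eval_name, eval_result, is_bigger_better).
    """
    total = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    # A smaller error is better, so is_bigger_better is False.
    return "mae", total / len(y_true), False


eval_name, eval_result, is_bigger_better = mean_absolute_error_eval(
    [1.0, 2.0, 4.0], [1.5, 2.0, 3.0]
)
print(eval_name, eval_result, is_bigger_better)
```

The three-argument and four-argument forms would take weight and group as additional positional parameters and return the same kind of tuple.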
predict(X, raw_score=False, num_iteration=0)[source]

Return the predicted value for each sample.

Parameters:
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.
  • num_iteration (int) – Limit number of iterations in the prediction; defaults to 0 (use all trees).
Returns:

predicted_result

Return type:

array_like, shape=[n_samples] or [n_samples, n_classes]

class lightgbm.LGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='binary', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, is_unbalance=False, seed=0, nthread=-1, silent=True, sigmoid=1.0, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]

Bases: lightgbm.sklearn.LGBMModel, object

classes_

Get class label array.

n_classes_

Get number of classes

predict_proba(X, raw_score=False, num_iteration=0)[source]

Return the predicted probability for each class for each sample.

Parameters:
  • X (array_like, shape=[n_samples, n_features]) – Input features matrix.
  • num_iteration (int) – Limit number of iterations in the prediction; defaults to 0 (use all trees).
Returns:

predicted_probability

Return type:

array_like, shape=[n_samples, n_classes]

class lightgbm.LGBMRegressor(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='regression', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, seed=0, nthread=-1, silent=True, huber_delta=1.0, gaussian_eta=1.0, fair_c=1.0, poisson_max_delta_step=0.7, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]

Bases: lightgbm.sklearn.LGBMModel, object

class lightgbm.LGBMRanker(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='lambdarank', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, is_unbalance=False, seed=0, nthread=-1, silent=True, sigmoid=1.0, max_position=20, label_gain=None, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]

Bases: lightgbm.sklearn.LGBMModel

fit(X, y, sample_weight=None, init_score=None, group=None, eval_set=None, eval_names=None, eval_sample_weight=None, eval_init_score=None, eval_group=None, eval_metric='ndcg', eval_at=1, early_stopping_rounds=None, verbose=True, feature_name='auto', categorical_feature='auto', callbacks=None)[source]

Most arguments are the same as in the common fit method, except the following:

eval_at : list of int
The evaluation positions of NDCG
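
To make "evaluation positions" concrete, the sketch below computes NDCG at a cut-off position k for one query. It uses a common textbook formulation (gain 2^label - 1 and a log2 position discount) and returns 0.0 when no document is relevant; both choices are assumptions of this sketch and may differ from what lightgbm computes internally.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain of the first k labels, in the given order."""
    return sum(
        (2 ** label - 1) / math.log2(position + 2)
        for position, label in enumerate(labels[:k])
    )


def ndcg_at_k(labels_in_predicted_order, k):
    """NDCG@k for one query, given relevance labels sorted by predicted score."""
    ideal = dcg_at_k(sorted(labels_in_predicted_order, reverse=True), k)
    if ideal == 0:
        return 0.0  # convention chosen for this sketch only
    return dcg_at_k(labels_in_predicted_order, k) / ideal


# The relevant document is ranked second, so NDCG@1 and NDCG@2 differ.
for k in (1, 2):
    print(k, ndcg_at_k([0, 1], k))
```

Each entry of eval_at corresponds to one such cut-off k.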

Callbacks

lightgbm.early_stopping(stopping_rounds, verbose=True)[source]

Create a callback that activates early stopping. Requires at least one validation dataset and one metric. If there is more than one, all of them are checked.

Parameters:
  • stopping_rounds (int) – The number of rounds without improvement after which training stops.
  • verbose (optional, bool) – Whether to print message about early stopping information.
Returns:

callback – The requested callback function.

Return type:

function
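
The general idea behind early stopping can be sketched over a recorded list of validation scores. This is a simplified illustration for a single metric on a single validation set, assuming that "stopping rounds" means the number of consecutive rounds without improvement; the helper `find_stopping_round` is hypothetical and is not the callback's actual implementation.

```python
def find_stopping_round(scores, stopping_rounds, higher_is_better=False):
    """Return (stop_round, best_round) for a non-empty list of per-round scores.

    stop_round is None if the stopping condition is never met.
    """
    best_round = 0
    best_score = scores[0]
    for current_round, score in enumerate(scores):
        improved = score > best_score if higher_is_better else score < best_score
        if current_round == 0 or improved:
            best_round, best_score = current_round, score
        elif current_round - best_round >= stopping_rounds:
            # No improvement for `stopping_rounds` rounds in a row.
            return current_round, best_round
    return None, best_round


scores = [0.50, 0.40, 0.41, 0.42, 0.39]  # lower is better
print(find_stopping_round(scores, stopping_rounds=2))
print(find_stopping_round(scores, stopping_rounds=3))
```

With a patience of 2 the loop stops before the late improvement in the last round is seen, which shows the trade-off in choosing stopping_rounds.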

lightgbm.print_evaluation(period=1, show_stdv=True)[source]

Create a callback that prints the evaluation result.

Parameters:
  • period (int) – The period to log the evaluation results
  • show_stdv (bool, optional) – Whether to show stdv (if provided)
Returns:

callback – A callback that prints the evaluation every period iterations.

Return type:

function

lightgbm.record_evaluation(eval_result)[source]

Create a callback that records the evaluation history into eval_result.

Parameters:eval_result (dict) – A dictionary to store the evaluation results.
Returns:callback – The requested callback function.
Return type:function
lightgbm.reset_parameter(**kwargs)[source]

Reset parameter after first iteration

NOTE: the initial parameter will still take effect on the first iteration.

Parameters:**kwargs (value should be list or function) – List of parameters for each boosting round, or a customized function that calculates the parameter value from the current round number (e.g. to yield learning rate decay).
  - list l: parameter = l[current_round]
  - function f: parameter = f(current_round)
Returns:callback – The requested callback function.
Return type:function
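
The two accepted forms, a list indexed by round or a function of the round number, can be resolved with the same small helper. The sketch below is illustrative only: `resolve_parameter` and `halving_decay` are hypothetical names, and the decay schedule is an arbitrary example.

```python
def resolve_parameter(schedule, current_round):
    """Return the parameter value for a round from a list or a function."""
    if callable(schedule):
        return schedule(current_round)   # function f: parameter = f(current_round)
    return schedule[current_round]       # list l: parameter = l[current_round]


def halving_decay(current_round):
    """Example schedule: start at 0.1 and halve the value every round."""
    return 0.1 * (0.5 ** current_round)


as_list = [0.1, 0.05, 0.025]
for current_round in range(3):
    print(
        current_round,
        resolve_parameter(as_list, current_round),
        resolve_parameter(halving_decay, current_round),
    )
```

A list must contain one entry per boosting round, while a function works for any number of rounds.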

Plotting

lightgbm.plot_importance(booster, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance', xlabel='Feature importance', ylabel='Features', importance_type='split', max_num_features=None, ignore_zero=True, figsize=None, grid=True, **kwargs)[source]

Plot model feature importances.

Parameters:
  • booster (Booster or LGBMModel) – Booster or LGBMModel instance
  • ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
  • height (float) – Bar height, passed to ax.barh()
  • xlim (tuple of 2 elements) – Tuple passed to axes.xlim()
  • ylim (tuple of 2 elements) – Tuple passed to axes.ylim()
  • title (str) – Axes title. Pass None to disable.
  • xlabel (str) – X axis title label. Pass None to disable.
  • ylabel (str) – Y axis title label. Pass None to disable.
  • importance_type (str) – How the importance is calculated: “split” or “gain”. “split” is the number of times a feature is used in a model; “gain” is the total gain of splits which use the feature.
  • max_num_features (int) – Max number of top features displayed on plot. If None or smaller than 1, all features will be displayed.
  • ignore_zero (bool) – Ignore features with zero importance
  • figsize (tuple of 2 elements) – Figure size
  • grid (bool) – Whether to add a grid for axes
  • **kwargs – Other keywords passed to ax.barh()
Returns:

ax

Return type:

matplotlib Axes

lightgbm.plot_metric(booster, metric=None, dataset_names=None, ax=None, xlim=None, ylim=None, title='Metric during training', xlabel='Iterations', ylabel='auto', figsize=None, grid=True)[source]

Plot one metric during training.

Parameters:
  • booster (dict or LGBMModel) – Evals_result recorded by lightgbm.train() or LGBMModel instance
  • metric (str or None) – The metric name to plot. Only one metric supported because different metrics have various scales. Pass None to pick first one (according to dict hashcode).
  • dataset_names (None or list of str) – List of the dataset names to plot. Pass None to plot all datasets.
  • ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
  • xlim (tuple of 2 elements) – Tuple passed to axes.xlim()
  • ylim (tuple of 2 elements) – Tuple passed to axes.ylim()
  • title (str) – Axes title. Pass None to disable.
  • xlabel (str) – X axis title label. Pass None to disable.
  • ylabel (str) – Y axis title label. Pass None to disable. Pass ‘auto’ to use metric.
  • figsize (tuple of 2 elements) – Figure size
  • grid (bool) – Whether to add a grid for axes
Returns:

ax

Return type:

matplotlib Axes

lightgbm.plot_tree(booster, ax=None, tree_index=0, figsize=None, graph_attr=None, node_attr=None, edge_attr=None, show_info=None)[source]

Plot specified tree.

Parameters:
  • booster (Booster, LGBMModel) – Booster or LGBMModel instance.
  • ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
  • tree_index (int, default 0) – Specify tree index of target tree.
  • figsize (tuple of 2 elements) – Figure size.
  • graph_attr (dict) – Mapping of (attribute, value) pairs for the graph.
  • node_attr (dict) – Mapping of (attribute, value) pairs set for all nodes.
  • edge_attr (dict) – Mapping of (attribute, value) pairs set for all edges.
  • show_info (list) – Information to show on nodes. Options: ‘split_gain’, ‘internal_value’, ‘internal_count’ or ‘leaf_count’.
Returns:

ax

Return type:

matplotlib Axes

lightgbm.create_tree_digraph(booster, tree_index=0, show_info=None, name=None, comment=None, filename=None, directory=None, format=None, engine=None, encoding=None, graph_attr=None, node_attr=None, edge_attr=None, body=None, strict=False)[source]

Create a digraph of specified tree.

Parameters:
  • booster (Booster, LGBMModel) – Booster or LGBMModel instance.
  • tree_index (int, default 0) – Specify tree index of target tree.
  • show_info (list) – Information to show on nodes. Options: ‘split_gain’, ‘internal_value’, ‘internal_count’ or ‘leaf_count’.
  • name (str) – Graph name used in the source code.
  • comment (str) – Comment added to the first line of the source.
  • filename (str) – Filename for saving the source (defaults to name + ‘.gv’).
  • directory (str) – (Sub)directory for source saving and rendering.
  • format (str) – Rendering output format (‘pdf’, ‘png’, ...).
  • engine (str) – Layout command used (‘dot’, ‘neato’, ...).
  • encoding (str) – Encoding for saving the source.
  • graph_attr (dict) – Mapping of (attribute, value) pairs for the graph.
  • node_attr (dict) – Mapping of (attribute, value) pairs set for all nodes.
  • edge_attr (dict) – Mapping of (attribute, value) pairs set for all edges.
  • body (list of str) – Iterable of lines to add to the graph body.
  • strict (bool) – Rendering should merge multi-edges.
Returns:

graph

Return type:

graphviz Digraph