lightgbm package¶

Data Structure API¶

class lightgbm.Dataset(data, label=None, max_bin=255, reference=None, weight=None, group=None, silent=False, feature_name='auto', categorical_feature='auto', params=None, free_raw_data=True)[source]¶

Bases: object

Dataset in LightGBM.

construct()[source]¶: Lazy init

create_valid(data, label=None, weight=None, group=None, silent=False, params=None)[source]¶

Create validation data aligned with the current dataset.

Parameters:

data (string/numpy array/scipy.sparse) – Data source of Dataset. When data type is string, it represents the path of txt file
label (list or numpy 1-D array, optional) – Label of the training data.
weight (list or numpy 1-D array , optional) – Weight for each instance.
group (list or numpy 1-D array , optional) – Group/query size for dataset
silent (boolean, optional) – Whether print messages during construction
params (dict, optional) – Other parameters

get_field(field_name)[source]¶

Get property from the Dataset.

Parameters:	field_name (str) – The field name of the information
Returns:	info – A numpy array of information of the data
Return type:	array

get_group()[source]¶

Get the group of the Dataset.

Returns:	group
Return type:	array

get_init_score()[source]¶

Get the initial score of the Dataset.

Returns:	init_score
Return type:	array

get_label()[source]¶

Get the label of the Dataset.

Returns:	label
Return type:	array

get_weight()[source]¶

Get the weight of the Dataset.

Returns:	weight
Return type:	array

num_data()[source]¶

Get the number of rows in the Dataset.

Returns:	number of rows
Return type:	int

num_feature()[source]¶

Get the number of columns (features) in the Dataset.

Returns:	number of columns
Return type:	int

save_binary(filename)[source]¶

Save Dataset to binary file

Parameters:	filename (string) – Name of the output file.

set_categorical_feature(categorical_feature)[source]¶

Set categorical features

Parameters:	categorical_feature (list of int or str) – Name/index of categorical features

set_feature_name(feature_name)[source]¶

Set feature name

Parameters:	feature_name (list of str) – Feature names

set_field(field_name, data)[source]¶

Set property into the Dataset.

Parameters:

field_name (str) – The field name of the information
data (numpy array or list or None) – The array of data to be set

set_group(group)[source]¶

Set group size of Dataset (used for ranking).

Parameters:	group (numpy array or list or None) – Group size of each group
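
The group argument is described above only as the "group size of each group". The sketch below assumes the usual ranking convention: rows that belong to the same query are stored contiguously, and group lists how many rows each query contributes, in row order. The helper name group_sizes and the query ids are hypothetical and are not part of LightGBM.

```python
from itertools import groupby


def group_sizes(query_ids):
    """Return the length of each run of consecutive identical query ids."""
    return [sum(1 for _ in rows) for _, rows in groupby(query_ids)]


# Hypothetical query ids for six rows that are already sorted by query.
query_ids = ["q1", "q1", "q1", "q2", "q2", "q3"]
group = group_sizes(query_ids)

# Under the assumed convention the sizes add up to the number of rows.
assert sum(group) == len(query_ids)
```

A list built this way is the kind of value that would be passed as group to Dataset or set_group.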

set_init_score(init_score)[source]¶

Set init score of booster to start from.

Parameters:	init_score (numpy array or list or None) – Init score for booster

set_label(label)[source]¶

Set label of Dataset

Parameters:	label (numpy array or list or None) – The label information to be set into Dataset

set_reference(reference)[source]¶

Set reference dataset

Parameters:	reference (Dataset) – Will use reference as template to construct current dataset

set_weight(weight)[source]¶

Set weight of each instance.

Parameters:	weight (numpy array or list or None) – Weight for each data point

subset(used_indices, params=None)[source]¶

Get subset of current dataset

Parameters:

used_indices (list of int) – Used indices of this subset
params (dict) – Other parameters

class lightgbm.Booster(params=None, train_set=None, model_file=None, silent=False)[source]¶

Bases: object

Booster in LightGBM.

add_valid(data, name)[source]¶

Add a validation dataset

Parameters:

data (Dataset) – Validation data
name (String) – Name of validation data

attr(key)[source]¶

Get attribute string from the Booster.

Parameters:	key (str) – The key to get attribute from.
Returns:	value – The attribute value of the key, returns None if the attribute does not exist.
Return type:	str

dump_model(num_iteration=-1)[source]¶

Dump model to JSON format

Parameters:	num_iteration (int) – Number of iterations to dump. < 0 means dump up to the best iteration (if available)
Returns:	JSON format of the model

eval(data, name, feval=None)[source]¶

Evaluate for data

Parameters:

data (Dataset object) – Data to evaluate
name – Name of data
feval (function) – Custom evaluation function.
Returns:	result – Evaluation result list.
Return type:	list

eval_train(feval=None)[source]¶

Evaluate for training data

Parameters:	feval (function) – Custom evaluation function.
Returns:	result – Evaluation result list.
Return type:	list

eval_valid(feval=None)[source]¶

Evaluate for validation data

Parameters:	feval (function) – Custom evaluation function.
Returns:	result – Evaluation result list.
Return type:	list

feature_importance(importance_type='split')[source]¶

Get feature importances

Parameters:	importance_type (str, default "split") – How the importance is calculated: "split" or "gain". "split" is the number of times a feature is used in a model; "gain" is the total gain of splits which use the feature
Returns:	result – Array of feature importances.
Return type:	array
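
To make the two importance types concrete, here is a minimal sketch that computes both from a toy list of splits. The (feature_index, split_gain) representation and the function name importance are hypothetical; this is not LightGBM's internal model format, only an illustration of the definitions above.

```python
# Hypothetical flattened model: one (feature_index, split_gain) pair per
# internal node, collected across all trees.
splits = [(0, 12.5), (2, 3.0), (0, 4.5), (1, 0.5), (0, 1.0)]


def importance(splits, num_features, importance_type="split"):
    """Accumulate per-feature importance from a list of splits."""
    totals = [0.0] * num_features
    for feature, gain in splits:
        if importance_type == "split":
            totals[feature] += 1      # count how often the feature is used
        elif importance_type == "gain":
            totals[feature] += gain   # sum the gain of splits using it
        else:
            raise ValueError("importance_type must be 'split' or 'gain'")
    return totals


by_split = importance(splits, 3, "split")
by_gain = importance(splits, 3, "gain")
```

A feature that is used often but with small gains ranks high under "split" and low under "gain", which is why the two orderings can disagree.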

feature_name()[source]¶

Get feature names.

Returns:	result – Array of feature names.
Return type:	array

params_str = None¶: construct booster object

predict(data, num_iteration=-1, raw_score=False, pred_leaf=False, data_has_header=False, is_reshape=True)[source]¶

Predict logic

Parameters:

data (string/numpy array/scipy.sparse) – Data source for prediction. When data type is string, it represents the path of txt file
num_iteration (int) – Iteration used for prediction. < 0 means predict with the best iteration (if available)
raw_score (bool) – True to predict raw scores
pred_leaf (bool) – True to predict leaf indices
data_has_header (bool) – Whether the data has a header; used for txt data
is_reshape (bool) – Reshape to (nrow, ncol) if True
Returns:	Prediction result

reset_parameter(params)[source]¶

Reset parameters for booster

Parameters:

params (dict) – New parameters for boosters
silent (boolean, optional) – Whether print messages during construction

rollback_one_iter()[source]¶: Rollback one iteration

save_model(filename, num_iteration=-1)[source]¶

Save model of booster to file

Parameters:

filename (str) – Filename to save
num_iteration (int) – Number of iterations to save. < 0 means save up to the best iteration (if available)

set_attr(**kwargs)[source]¶

Set the attribute of the Booster.

Parameters:	**kwargs – The attributes to set. Setting a value to None deletes an attribute.

update(train_set=None, fobj=None)[source]¶

Update for one iteration.

Note: for multi-class tasks, the score is grouped by class_id first, then by row_id. To get the score of the i-th row in the j-th class, access score[j*num_data+i]. grad and hess should be grouped in the same way.

Parameters:

train_set – Training data, None means use last training data
fobj (function) – Customized objective function.
Returns:	is_finished
Return type:	bool
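
The class-major layout described in the note can be sketched with plain lists. The helper names flat_index and to_flat and the sample values are hypothetical; only the formula j*num_data+i comes from the text above.

```python
def flat_index(row_id, class_id, num_data):
    """Position of (row i, class j) in the flat, class-major array."""
    return class_id * num_data + row_id


def to_flat(per_row):
    """Flatten per_row[i][j] (row i, class j) into class-major order."""
    num_data = len(per_row)
    num_class = len(per_row[0])
    return [per_row[i][j] for j in range(num_class) for i in range(num_data)]


# Hypothetical scores for 2 rows and 3 classes.
per_row = [
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
]
score = to_flat(per_row)

# Row 1, class 2 is found at index 2 * 2 + 1 = 5.
assert score[flat_index(1, 2, num_data=2)] == per_row[1][2]
```

A custom objective for a multi-class task would lay out its grad and hess values in this same order.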

Training API¶

lightgbm.train(params, train_set, num_boost_round=100, valid_sets=None, valid_names=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, evals_result=None, verbose_eval=True, learning_rates=None, callbacks=None)[source]¶

Train with given parameters.

Parameters:

params (dict) – Parameters for training.
train_set (Dataset) – Data to be trained.
num_boost_round (int) – Number of boosting iterations.
valid_sets (list of Datasets) – List of data to be evaluated during training
valid_names (list of string) – Names of valid_sets
fobj (function) – Customized objective function.
feval (function) – Customized evaluation function. Note: should return (eval_name, eval_result, is_higher_better) or a list of such tuples
init_model (file name of lightgbm model or 'Booster' instance) – Model used for continued training
feature_name (list of str, or 'auto') – Feature names. If ‘auto’ and data is pandas DataFrame, use data columns name
categorical_feature (list of str or int, or 'auto') – Categorical features, type int represents index, type str represents feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, use pandas categorical columns
early_stopping_rounds (int) – Activates early stopping. Requires at least one validation data and one metric. If there’s more than one, will check all of them. Returns the model with (best_iter + early_stopping_rounds). If early stopping occurs, the model will add ‘best_iteration’ field
evals_result (dict or None) – This dictionary is used to store all evaluation results of all the items in valid_sets. Example: with a valid_sets containing [valid_set, train_set], a valid_names containing [‘eval’, ‘train’] and a parameter containing (‘metric’:’logloss’), returns: {‘train’: {‘logloss’: [‘0.48253’, ‘0.35953’, ...]}, ‘eval’: {‘logloss’: [‘0.480385’, ‘0.357756’, ...]}}. Passing None means this feature is not used
verbose_eval (bool or int) – Requires at least one item in evals. If verbose_eval is True, the eval metric on the valid set is printed at each boosting stage. If verbose_eval is int, the eval metric on the valid set is printed at every verbose_eval boosting stage. The last boosting stage or the boosting stage found by using early_stopping_rounds is also printed. Example: with verbose_eval=4 and at least one item in evals, an evaluation metric is printed every 4 (instead of 1) boosting stages.
learning_rates (list or function) – List of learning rates for each boosting round, or a customized function that calculates learning_rate in terms of the current number of rounds (e.g. yields learning rate decay). list l: learning_rate = l[current_round]; function f: learning_rate = f(current_round)
callbacks (list of callback functions) – List of callback functions that are applied at each iteration. See Callbacks in Python-API.md for more information.
Returns:	booster
Return type:	a trained booster model
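
The two accepted forms of learning_rates (a list indexed by round, or a function of the round) can be sketched as follows. The helper resolve_learning_rate and the decay constants are hypothetical and only mirror the rule stated above; they are not LightGBM internals.

```python
def resolve_learning_rate(learning_rates, current_round):
    """Apply the documented rule: l[current_round] or f(current_round)."""
    if callable(learning_rates):
        return learning_rates(current_round)
    return learning_rates[current_round]


def decay(current_round):
    """Hypothetical schedule: start at 0.1 and shrink by 10% per round."""
    return 0.1 * (0.9 ** current_round)


num_boost_round = 5

# The same schedule expressed as a list with one entry per boosting round.
as_list = [decay(i) for i in range(num_boost_round)]

for current_round in range(num_boost_round):
    assert resolve_learning_rate(as_list, current_round) == \
        resolve_learning_rate(decay, current_round)
```

The list form needs one entry per boosting round, while the function form works for any num_boost_round.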

lightgbm.cv(params, train_set, num_boost_round=10, folds=None, nfold=5, stratified=False, shuffle=True, metrics=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, fpreproc=None, verbose_eval=None, show_stdv=True, seed=0, callbacks=None)[source]¶

Cross-validation with given parameters.

Parameters:

params (dict) – Booster params.
train_set (Dataset) – Data to be trained.
num_boost_round (int) – Number of boosting iterations.
folds (a generator or iterator of (train_idx, test_idx) tuples) – The train indices and test indices for each fold. This argument has highest priority over other data split arguments.
nfold (int) – Number of folds in CV.
stratified (bool) – Perform stratified sampling.
shuffle (bool) – Whether to shuffle before splitting data
metrics (string or list of strings) – Evaluation metrics to be watched in CV. If metrics is not None, the metric in params will be overridden.
fobj (function) – Custom objective function.
feval (function) – Custom evaluation function.
init_model (file name of lightgbm model or 'Booster' instance) – Model used for continued training
feature_name (list of str, or 'auto') – Feature names. If ‘auto’ and data is pandas DataFrame, use data columns name
categorical_feature (list of str or int, or 'auto') – Categorical features, type int represents index, type str represents feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, use pandas categorical columns
early_stopping_rounds (int) – Activates early stopping. CV error needs to decrease at least every <early_stopping_rounds> round(s) to continue. Last entry in evaluation history is the one from best iteration.
fpreproc (function) – Preprocessing function that takes (dtrain, dtest, param) and returns transformed versions of those.
verbose_eval (bool, int, or None, default None) – Whether to display the progress. If None, progress will be displayed when np.ndarray is returned. If True, progress will be displayed at every boosting stage. If an integer is given, progress will be displayed at every given verbose_eval boosting stage.
show_stdv (bool, default True) – Whether to display the standard deviation in progress. Results are not affected, and always contain std.
seed (int) – Seed used to generate the folds (passed to numpy.random.seed).
callbacks (list of callback functions) – List of callback functions that are applied at each iteration. See Callbacks in Python-API.md for more information.
Returns:	evaluation history
Return type:	list(string)
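
The folds argument is documented as a generator or iterator of (train_idx, test_idx) tuples. A minimal sketch of such a generator, using contiguous and unshuffled folds, is shown below. The function name contiguous_folds is hypothetical, and the sketch assumes plain lists of integer row indices are acceptable.

```python
def contiguous_folds(num_data, nfold):
    """Yield (train_idx, test_idx) tuples for nfold contiguous folds."""
    indices = list(range(num_data))
    fold_size, remainder = divmod(num_data, nfold)
    start = 0
    for k in range(nfold):
        # The first `remainder` folds take one extra row each.
        stop = start + fold_size + (1 if k < remainder else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop


# Seven rows split into three folds of sizes 3, 2 and 2.
folds = list(contiguous_folds(7, 3))
for train_idx, test_idx in folds:
    # Every row lands in exactly one of the two index lists.
    assert sorted(train_idx + test_idx) == list(range(7))
```

Contiguous folds like these are useful when the row order carries meaning (for example time), where shuffled folds would leak later rows into training.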

Scikit-learn API¶

class lightgbm.LGBMModel(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='regression', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, is_unbalance=False, seed=0, nthread=-1, silent=True, sigmoid=1.0, huber_delta=1.0, gaussian_eta=1.0, fair_c=1.0, poisson_max_delta_step=0.7, max_position=20, label_gain=None, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]¶

Bases: object

apply(X, num_iteration=0)[source]¶

Return the predicted leaf of every tree for each sample.

Parameters:

X (array_like, shape=[n_samples, n_features]) – Input features matrix.
num_iteration (int) – Limit number of iterations in the prediction; defaults to 0 (use all trees).
Returns:	X_leaves
Return type:	array_like, shape=[n_samples, n_trees]

booster_¶: Get the underlying lightgbm Booster of this model.

evals_result_¶: Get the evaluation results.

feature_importances_¶: Get normalized feature importances.

fit(X, y, sample_weight=None, init_score=None, group=None, eval_set=None, eval_names=None, eval_sample_weight=None, eval_init_score=None, eval_group=None, eval_metric=None, early_stopping_rounds=None, verbose=True, feature_name='auto', categorical_feature='auto', callbacks=None)[source]¶

Fit the gradient boosting model

Parameters:

X (array_like) – Feature matrix
y (array_like) – Labels
sample_weight (array_like) – weight of training data
init_score (array_like) – init score of training data
group (array_like) – group data of training data
eval_set (list, optional) – A list of (X, y) tuple pairs to use as a validation set for early-stopping
eval_names (list of string) – Names of eval_set
eval_sample_weight (List of array) – weight of eval data
eval_init_score (List of array) – init score of eval data
eval_group (List of array) – group data of eval data
eval_metric (str, list of str, callable, optional) – If a str, should be a built-in evaluation metric to use. If callable, a custom evaluation metric, see note for more details.
early_stopping_rounds (int) –
verbose (bool) – If verbose and an evaluation set is used, writes the evaluation
feature_name (list of str, or 'auto') – Feature names If ‘auto’ and data is pandas DataFrame, use data columns name
categorical_feature (list of str or int, or 'auto') – Categorical features, type int represents index, type str represents feature names (need to specify feature_name as well) If ‘auto’ and data is pandas DataFrame, use pandas categorical columns
callbacks (list of callback functions) – List of callback functions that are applied at each iteration. See Callbacks in Python-API.md for more information.

Note

Custom eval function expects a callable with one of the following signatures:

func(y_true, y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group).
return (eval_name, eval_result, is_bigger_better) or list of (eval_name, eval_result, is_bigger_better)
y_true: array_like of shape [n_samples]: The target values
y_pred: array_like of shape [n_samples] or shape[n_samples * n_class] (for multi-class): The predicted values
weight: array_like of shape [n_samples]: The weight of samples
group: array_like: group/query data, used for ranking task
eval_name: str: name of evaluation
eval_result: float: eval result
is_bigger_better: bool: whether a bigger eval result is better, e.g. AUC is bigger_better.

For multi-class tasks, y_pred is grouped by class_id first, then by row_id. To get the y_pred of the i-th row in the j-th class, access y_pred[j*num_data+i].
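
As an illustration of the func(y_true, y_pred) form described in the note, here is a minimal sketch of a custom binary metric. The metric name error_rate, the 0.5 threshold, and the assumption that y_pred holds positive-class probabilities are all choices made for this example, not something the documentation above specifies.

```python
def error_rate(y_true, y_pred):
    """Custom metric returning (eval_name, eval_result, is_bigger_better).

    Assumes binary labels in {0, 1} and y_pred as positive-class
    probabilities, thresholded at 0.5.
    """
    wrong = sum(
        1 for truth, prob in zip(y_true, y_pred)
        if (prob > 0.5) != (truth == 1)
    )
    # A lower error rate is better, so is_bigger_better is False.
    return "error_rate", wrong / len(y_true), False


# Hypothetical labels and predictions: only the third row is misclassified.
name, value, is_bigger_better = error_rate([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7])
```

A callable of this shape is what eval_metric in fit accepts alongside the built-in metric names.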

predict(X, raw_score=False, num_iteration=0)[source]¶

Return the predicted value for each sample.

Parameters:

X (array_like, shape=[n_samples, n_features]) – Input features matrix.
num_iteration (int) – Limit number of iterations in the prediction; defaults to 0 (use all trees).
Returns:	predicted_result
Return type:	array_like, shape=[n_samples] or [n_samples, n_classes]

class lightgbm.LGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='binary', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, is_unbalance=False, seed=0, nthread=-1, silent=True, sigmoid=1.0, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]¶

Bases: lightgbm.sklearn.LGBMModel, object

classes_¶: Get class label array.

n_classes_¶: Get number of classes

predict_proba(X, raw_score=False, num_iteration=0)[source]¶

Return the predicted probability for each class for each sample.

Parameters:

X (array_like, shape=[n_samples, n_features]) – Input features matrix.
num_iteration (int) – Limit number of iterations in the prediction; defaults to 0 (use all trees).
Returns:	predicted_probability
Return type:	array_like, shape=[n_samples, n_classes]

class lightgbm.LGBMRegressor(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='regression', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, seed=0, nthread=-1, silent=True, huber_delta=1.0, gaussian_eta=1.0, fair_c=1.0, poisson_max_delta_step=0.7, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]¶: Bases: lightgbm.sklearn.LGBMModel, object

class lightgbm.LGBMRanker(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='lambdarank', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, is_unbalance=False, seed=0, nthread=-1, silent=True, sigmoid=1.0, max_position=20, label_gain=None, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]¶

Bases: lightgbm.sklearn.LGBMModel

fit(X, y, sample_weight=None, init_score=None, group=None, eval_set=None, eval_names=None, eval_sample_weight=None, eval_init_score=None, eval_group=None, eval_metric='ndcg', eval_at=1, early_stopping_rounds=None, verbose=True, feature_name='auto', categorical_feature='auto', callbacks=None)[source]¶

Most arguments are the same as in the common fit method, except the following:

eval_at : list of int: The evaluation positions of NDCG

Callbacks¶

lightgbm.early_stopping(stopping_rounds, verbose=True)[source]¶

Create a callback that activates early stopping. Requires at least one validation data and one metric. If there’s more than one, will check all of them.

Parameters:

stopping_rounds (int) – The stopping rounds before the trend occurs.
verbose (optional, bool) – Whether to print message about early stopping information.
Returns:	callback – The requested callback function.
Return type:	function
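
The stopping rule can be sketched over a plain list of validation scores. This is a simplified illustration and not LightGBM's implementation: it assumes a single metric on a single validation set and stops once stopping_rounds consecutive rounds have passed without an improvement. The function name and the scores are hypothetical.

```python
def best_round_with_early_stopping(scores, stopping_rounds, higher_better=False):
    """Return (best_iter, last_iter) for a sequence of per-round scores."""
    best_score = None
    best_iter = -1
    for i, score in enumerate(scores):
        if best_score is None:
            improved = True
        elif higher_better:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_score, best_iter = score, i
        elif i - best_iter >= stopping_rounds:
            return best_iter, i  # no improvement for stopping_rounds rounds
    return best_iter, len(scores) - 1


# Hypothetical validation losses; the minimum is reached at round index 2.
losses = [0.50, 0.45, 0.44, 0.46, 0.47, 0.48, 0.43]
best_iter, last_iter = best_round_with_early_stopping(losses, stopping_rounds=3)
```

In this sketch training ends at round best_iter + stopping_rounds, so the final improvement at index 6 is never reached.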

lightgbm.print_evaluation(period=1, show_stdv=True)[source]¶

Create a callback that prints evaluation results.

Parameters:

period (int) – The period to log the evaluation results
show_stdv (bool, optional) – Whether to show stdv if provided
Returns:	callback – A callback that prints evaluation results every period iterations.
Return type:	function

lightgbm.record_evaluation(eval_result)[source]¶

Create a callback that records the evaluation history into eval_result.

Parameters:	eval_result (dict) – A dictionary to store the evaluation results.
Returns:	callback – The requested callback function.
Return type:	function
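
Assuming the recorded dictionary has the nested layout shown for evals_result in lightgbm.train (dataset name, then metric name, then one value per boosting round), the history can be inspected with ordinary dictionary access. The values below are made up for illustration, and best_iteration is a hypothetical helper.

```python
# Hypothetical recorded history: {dataset_name: {metric_name: [per-round values]}}
eval_result = {
    "train": {"logloss": [0.4825, 0.3595, 0.3011]},
    "eval": {"logloss": [0.4804, 0.3578, 0.3600]},
}


def best_iteration(eval_result, dataset, metric, higher_better=False):
    """Return (round_index, value) of the best recorded value."""
    history = eval_result[dataset][metric]
    pick = max if higher_better else min
    best = pick(range(len(history)), key=history.__getitem__)
    return best, history[best]


best_round, best_value = best_iteration(eval_result, "eval", "logloss")
```

Here the training loss keeps falling while the validation loss bottoms out one round earlier, the usual sign that later rounds are overfitting.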

lightgbm.reset_parameter(**kwargs)[source]¶

Reset parameter after first iteration

NOTE: the initial parameter will still take effect on the first iteration.

Parameters:	**kwargs (value should be list or function) – List of parameters for each boosting round, or a customized function that calculates the parameter in terms of the current number of rounds (e.g. yields learning rate decay). list l: parameter = l[current_round]; function f: parameter = f(current_round)
Returns:	callback – The requested callback function.
Return type:	function

Plotting¶

lightgbm.plot_importance(booster, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance', xlabel='Feature importance', ylabel='Features', importance_type='split', max_num_features=None, ignore_zero=True, figsize=None, grid=True, **kwargs)[source]¶

Plot model feature importances.

Parameters:

booster (Booster or LGBMModel) – Booster or LGBMModel instance
ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
height (float) – Bar height, passed to ax.barh()
xlim (tuple of 2 elements) – Tuple passed to axes.xlim()
ylim (tuple of 2 elements) – Tuple passed to axes.ylim()
title (str) – Axes title. Pass None to disable.
xlabel (str) – X axis title label. Pass None to disable.
ylabel (str) – Y axis title label. Pass None to disable.
importance_type (str) – How the importance is calculated: “split” or “gain”. “split” is the number of times a feature is used in a model; “gain” is the total gain of splits which use the feature
max_num_features (int) – Max number of top features displayed on plot. If None or smaller than 1, all features will be displayed.
ignore_zero (bool) – Ignore features with zero importance
figsize (tuple of 2 elements) – Figure size
grid (bool) – Whether to add a grid for axes
**kwargs – Other keywords passed to ax.barh()
Returns:	ax
Return type:	matplotlib Axes

lightgbm.plot_metric(booster, metric=None, dataset_names=None, ax=None, xlim=None, ylim=None, title='Metric during training', xlabel='Iterations', ylabel='auto', figsize=None, grid=True)[source]¶

Plot one metric during training.

Parameters:

booster (dict or LGBMModel) – Evals_result recorded by lightgbm.train() or LGBMModel instance
metric (str or None) – The metric name to plot. Only one metric supported because different metrics have various scales. Pass None to pick first one (according to dict hashcode).
dataset_names (None or list of str) – List of the dataset names to plot. Pass None to plot all datasets.
ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
xlim (tuple of 2 elements) – Tuple passed to axes.xlim()
ylim (tuple of 2 elements) – Tuple passed to axes.ylim()
title (str) – Axes title. Pass None to disable.
xlabel (str) – X axis title label. Pass None to disable.
ylabel (str) – Y axis title label. Pass None to disable. Pass ‘auto’ to use metric.
figsize (tuple of 2 elements) – Figure size
grid (bool) – Whether to add a grid for axes
Returns:	ax
Return type:	matplotlib Axes

lightgbm.plot_tree(booster, ax=None, tree_index=0, figsize=None, graph_attr=None, node_attr=None, edge_attr=None, show_info=None)[source]¶

Plot specified tree.

Parameters:

booster (Booster, LGBMModel) – Booster or LGBMModel instance.
ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
tree_index (int, default 0) – Specify tree index of target tree.
figsize (tuple of 2 elements) – Figure size.
graph_attr (dict) – Mapping of (attribute, value) pairs for the graph.
node_attr (dict) – Mapping of (attribute, value) pairs set for all nodes.
edge_attr (dict) – Mapping of (attribute, value) pairs set for all edges.
show_info (list) – Information shown on nodes. Options: ‘split_gain’, ‘internal_value’, ‘internal_count’ or ‘leaf_count’.
Returns:	ax
Return type:	matplotlib Axes

lightgbm.create_tree_digraph(booster, tree_index=0, show_info=None, name=None, comment=None, filename=None, directory=None, format=None, engine=None, encoding=None, graph_attr=None, node_attr=None, edge_attr=None, body=None, strict=False)[source]¶

Create a digraph of specified tree.

See:

http://graphviz.readthedocs.io/en/stable/api.html#digraph

Parameters:

booster (Booster, LGBMModel) – Booster or LGBMModel instance.
tree_index (int, default 0) – Specify tree index of target tree.
show_info (list) – Information shown on nodes. Options: ‘split_gain’, ‘internal_value’, ‘internal_count’ or ‘leaf_count’.
name (str) – Graph name used in the source code.
comment (str) – Comment added to the first line of the source.
filename (str) – Filename for saving the source (defaults to name + ‘.gv’).
directory (str) – (Sub)directory for source saving and rendering.
format (str) – Rendering output format (‘pdf’, ‘png’, ...).
engine (str) – Layout command used (‘dot’, ‘neato’, ...).
encoding (str) – Encoding for saving the source.
graph_attr (dict) – Mapping of (attribute, value) pairs for the graph.
node_attr (dict) – Mapping of (attribute, value) pairs set for all nodes.
edge_attr (dict) – Mapping of (attribute, value) pairs set for all edges.
body (list of str) – Iterable of lines to add to the graph body.
strict (bool) – Rendering should merge multi-edges.
Returns:	graph
Return type:	graphviz Digraph