lightgbm package¶
Data Structure API¶
-
class
lightgbm.
Dataset
(data, label=None, max_bin=255, reference=None, weight=None, group=None, silent=False, feature_name='auto', categorical_feature='auto', params=None, free_raw_data=True)[source]¶ Bases:
object
Dataset in LightGBM.
-
create_valid
(data, label=None, weight=None, group=None, silent=False, params=None)[source]¶ Create validation data aligned with the current dataset.
Parameters: - data (string/numpy array/scipy.sparse) – Data source of Dataset. When data type is string, it represents the path of a txt file.
- label (list or numpy 1-D array, optional) – Label of the training data.
- weight (list or numpy 1-D array , optional) – Weight for each instance.
- group (list or numpy 1-D array , optional) – Group/query size for dataset
- silent (boolean, optional) – Whether to print messages during construction
- params (dict, optional) – Other parameters
-
get_field
(field_name)[source]¶ Get property from the Dataset.
Parameters: field_name (str) – The field name of the information Returns: info – A numpy array of information of the data Return type: array
-
get_init_score
()[source]¶ Get the initial score of the Dataset.
Returns: init_score Return type: array
-
num_feature
()[source]¶ Get the number of columns (features) in the Dataset.
Returns: number of columns Return type: int
-
save_binary
(filename)[source]¶ Save Dataset to binary file
Parameters: filename (string) – Name of the output file.
-
set_categorical_feature
(categorical_feature)[source]¶ Set categorical features
Parameters: categorical_feature (list of int or str) – Name/index of categorical features
-
set_feature_name
(feature_name)[source]¶ Set feature name
Parameters: feature_name (list of str) – Feature names
-
set_field
(field_name, data)[source]¶ Set property into the Dataset.
Parameters: - field_name (str) – The field name of the information
- data (numpy array or list or None) – The array of data to be set
-
set_group
(group)[source]¶ Set group size of Dataset (used for ranking).
Parameters: group (numpy array or list or None) – Group size of each group
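To illustrate what "group size of each group" means, the sketch below assumes that rows of the same query are stored contiguously and that `group` lists how many rows each query contains. The helper `group_boundaries` is hypothetical and not part of lightgbm.

```python
def group_boundaries(group):
    """Turn per-query sizes into (start, end) row ranges, end exclusive."""
    boundaries = []
    start = 0
    for size in group:
        boundaries.append((start, start + size))
        start += size
    return boundaries


# Three queries with 3, 2 and 4 rows: 9 rows in total (made-up sizes).
sizes = [3, 2, 4]
ranges = group_boundaries(sizes)
```

Under this assumption the sizes should sum to the number of rows in the Dataset.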
-
set_init_score
(init_score)[source]¶ Set init score of booster to start from.
Parameters: init_score (numpy array or list or None) – Init score for booster
-
set_label
(label)[source]¶ Set label of Dataset
Parameters: label (numpy array or list or None) – The label information to be set into Dataset
-
set_reference
(reference)[source]¶ Set reference dataset
Parameters: reference (Dataset) – Will use reference as template to construct current dataset
-
-
class
lightgbm.
Booster
(params=None, train_set=None, model_file=None, silent=False)[source]¶ Bases:
object
Booster in LightGBM.
-
add_valid
(data, name)[source]¶ Add validation data
Parameters: - data (Dataset) – Validation data
- name (String) – Name of validation data
-
attr
(key)[source]¶ Get attribute string from the Booster.
Parameters: key (str) – The key to get attribute from. Returns: value – The attribute value of the key, returns None if the attribute does not exist. Return type: str
-
dump_model
(num_iteration=-1)[source]¶ Dump model to json format
Parameters: num_iteration (int) – Number of iterations to dump. < 0 means dump the best iteration (if one exists). Returns: The model in JSON format.
-
eval
(data, name, feval=None)[source]¶ Evaluate for data
Parameters: - data (Dataset object) –
- name – Name of data
- feval (function) – Custom evaluation function.
Returns: result – Evaluation result list.
Return type: list
-
eval_train
(feval=None)[source]¶ Evaluate for training data
Parameters: feval (function) – Custom evaluation function. Returns: result – Evaluation result list. Return type: list
-
eval_valid
(feval=None)[source]¶ Evaluate for validation data
Parameters: feval (function) – Custom evaluation function. Returns: result – Evaluation result list. Return type: list
-
feature_importance
(importance_type='split')[source]¶ Get feature importances
Parameters: - importance_type (str, default "split") –
- the importance is calculated (How) –
- is the number of times a feature is used in a model ("split") –
- is the total gain of splits which use the feature ("gain") –
Returns: result – Array of feature importances.
Return type: array
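The difference between the two importance types can be sketched in plain Python. The list of `(feature_index, gain)` pairs and the `importance` helper below are hypothetical stand-ins for the splits of a trained model. They are not lightgbm structures.

```python
def importance(splits, num_features, importance_type="split"):
    """Aggregate per-feature importance from (feature_index, gain) pairs."""
    totals = [0.0] * num_features
    for feature_index, gain in splits:
        if importance_type == "split":
            totals[feature_index] += 1      # count how often the feature is used
        elif importance_type == "gain":
            totals[feature_index] += gain   # sum the gain of its splits
        else:
            raise ValueError("importance_type must be 'split' or 'gain'")
    return totals


# Made-up splits collected over all trees of a model.
splits = [(0, 2.5), (2, 0.5), (0, 1.0), (1, 4.0)]
by_split = importance(splits, num_features=3, importance_type="split")
by_gain = importance(splits, num_features=3, importance_type="gain")
```

In this made-up data, feature 0 is used most often, while feature 1 contributes the largest total gain, so the two types can rank features differently.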
-
feature_name
()[source]¶ Get feature names.
Returns: result – Array of feature names. Return type: array
-
params_str
= None¶ construct booster object
-
predict
(data, num_iteration=-1, raw_score=False, pred_leaf=False, data_has_header=False, is_reshape=True)[source]¶ Predict logic
Parameters: - data (string/numpy array/scipy.sparse) – Data source for prediction. When data type is string, it represents the path of a txt file.
- num_iteration (int) – Number of iterations used for prediction. < 0 means predict with the best iteration (if one exists).
- raw_score (bool) – True for predict raw score
- pred_leaf (bool) – True for predict leaf index
- data_has_header (bool) – Used for txt data
- is_reshape (bool) – Reshape to (nrow, ncol) if true
Returns: Prediction result
-
reset_parameter
(params)[source]¶ Reset parameters for booster
Parameters: - params (dict) – New parameters for boosters
- silent (boolean, optional) – Whether to print messages during construction
-
save_model
(filename, num_iteration=-1)[source]¶ Save model of booster to file
Parameters: - filename (str) – Filename to save
- num_iteration (int) – Number of iterations to save. < 0 means save the best iteration (if one exists).
-
set_attr
(**kwargs)[source]¶ Set the attribute of the Booster.
Parameters: **kwargs – The attributes to set. Setting a value to None deletes an attribute.
-
update
(train_set=None, fobj=None)[source]¶ Update for one iteration.
Note: for a multi-class task, the score is grouped by class_id first, then by row_id. To get the score of the i-th row in the j-th class, access score[j*num_data+i]. grad and hess should be grouped in the same way.
Parameters: - train_set – Training data, None means use last training data
- fobj (function) – Customized objective function.
Returns: is_finished Return type: bool
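The class-major layout described in the note above can be made concrete with a small sketch. The helpers `flat_index` and `to_class_major` are hypothetical and only illustrate the indexing rule score[j*num_data+i].

```python
def flat_index(row_id, class_id, num_data):
    """Position of (row_id, class_id) in a class-major flat score array."""
    return class_id * num_data + row_id


def to_class_major(scores_by_row):
    """Flatten a [num_data][num_class] nested list into class-major order."""
    num_data = len(scores_by_row)
    num_class = len(scores_by_row[0])
    return [scores_by_row[i][j] for j in range(num_class) for i in range(num_data)]


# Two rows, three classes (made-up numbers).
scores_by_row = [[0.1, 0.2, 0.7],
                 [0.6, 0.3, 0.1]]
flat = to_class_major(scores_by_row)

# Score of row 1 in class 2, read back from the flat array.
value = flat[flat_index(row_id=1, class_id=2, num_data=2)]
```

A custom objective that returns grad and hess would need to lay out its outputs in the same order.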
-
Training API¶
-
lightgbm.
train
(params, train_set, num_boost_round=100, valid_sets=None, valid_names=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, evals_result=None, verbose_eval=True, learning_rates=None, callbacks=None)[source]¶ Train with given parameters.
Parameters: - params (dict) – Parameters for training.
- train_set (Dataset) – Data to be trained.
- num_boost_round (int) – Number of boosting iterations.
- valid_sets (list of Datasets) – List of data to be evaluated during training
- valid_names (list of string) – Names of valid_sets
- fobj (function) – Customized objective function.
- feval (function) – Customized evaluation function. Note: should return (eval_name, eval_result, is_higher_better) or a list of such tuples.
- init_model (file name of lightgbm model or 'Booster' instance) – Model used for continued training.
- feature_name (list of str, or 'auto') – Feature names. If ‘auto’ and data is pandas DataFrame, use data columns name.
- categorical_feature (list of str or int, or 'auto') – Categorical features. Type int represents index, type str represents feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, use pandas categorical columns.
- early_stopping_rounds (int) – Activates early stopping. Requires at least one validation data and one metric. If there’s more than one, will check all of them. Returns the model with (best_iter + early_stopping_rounds). If early stopping occurs, the model will add a ‘best_iteration’ field.
- evals_result (dict or None) – This dictionary is used to store all evaluation results of all the items in valid_sets. Example: with valid_sets containing [valid_set, train_set], valid_names containing [‘eval’, ‘train’] and a parameter containing (‘metric’:’logloss’), returns: {‘train’: {‘logloss’: [‘0.48253’, ‘0.35953’, ...]}, ‘eval’: {‘logloss’: [‘0.480385’, ‘0.357756’, ...]}}. Passing None means this feature is not used.
- verbose_eval (bool or int) – Requires at least one item in evals. If verbose_eval is True, the eval metric on the valid set is printed at each boosting stage. If verbose_eval is int, the eval metric on the valid set is printed at every verbose_eval boosting stage. The last boosting stage or the boosting stage found by using early_stopping_rounds is also printed. Example: with verbose_eval=4 and at least one item in evals, an evaluation metric is printed every 4 (instead of 1) boosting stages.
- learning_rates (list or function) – List of learning rates, one for each boosting round, or a customized function that calculates learning_rate from the current round number (e.g. to yield learning rate decay). - list l: learning_rate = l[current_round] - function f: learning_rate = f(current_round)
- callbacks (list of callback functions) – List of callback functions that are applied at each iteration. See Callbacks in Python-API.md for more information.
Returns: booster
Return type: a trained booster model
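The nested shape of evals_result shown in the parameter description can be explored with a short sketch. The helper `best_round` is hypothetical, and the numbers are made up. It converts values with `float` because the example in the description shows them as strings.

```python
def best_round(evals_result, data_name, metric, higher_is_better=False):
    """Return (round_index, value) of the best recorded value for one metric."""
    history = [float(v) for v in evals_result[data_name][metric]]
    pick = max if higher_is_better else min
    best_value = pick(history)
    return history.index(best_value), best_value


# Made-up history: outer keys are valid_names, inner keys are metric names,
# and each list holds one value per boosting round.
evals_result = {
    "train": {"logloss": [0.48253, 0.35953, 0.30112]},
    "eval": {"logloss": [0.480385, 0.357756, 0.361204]},
}
round_index, value = best_round(evals_result, "eval", "logloss")
```

Here the validation loss is lowest at round index 1 even though the training loss keeps falling, which is the situation early stopping is meant to detect.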
-
lightgbm.
cv
(params, train_set, num_boost_round=10, folds=None, nfold=5, stratified=False, shuffle=True, metrics=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, fpreproc=None, verbose_eval=None, show_stdv=True, seed=0, callbacks=None)[source]¶ Cross-validation with given parameters.
Parameters: - params (dict) – Booster params.
- train_set (Dataset) – Data to be trained.
- num_boost_round (int) – Number of boosting iterations.
- folds (a generator or iterator of (train_idx, test_idx) tuples) – The train indices and test indices for each fold. This argument takes priority over the other data split arguments.
- nfold (int) – Number of folds in CV.
- stratified (bool) – Perform stratified sampling.
- shuffle (bool) – Whether to shuffle before splitting data
- metrics (string or list of strings) – Evaluation metrics to be watched in CV. If metrics is not None, the metric in params will be overridden.
- fobj (function) – Custom objective function.
- feval (function) – Custom evaluation function.
- init_model (file name of lightgbm model or 'Booster' instance) – Model used for continued training.
- feature_name (list of str, or 'auto') – Feature names. If ‘auto’ and data is pandas DataFrame, use data columns name.
- categorical_feature (list of str or int, or 'auto') – Categorical features. Type int represents index, type str represents feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, use pandas categorical columns.
- early_stopping_rounds (int) – Activates early stopping. CV error needs to decrease at least every <early_stopping_rounds> round(s) to continue. Last entry in evaluation history is the one from best iteration.
- fpreproc (function) – Preprocessing function that takes (dtrain, dtest, param) and returns transformed versions of those.
- verbose_eval (bool, int, or None, default None) –
Whether to display the progress. If None, progress will be displayed when np.ndarray is returned. If True, progress will be displayed at every boosting stage. If an integer is given, progress will be displayed at every given verbose_eval boosting stage.
- show_stdv (bool, default True) – Whether to display the standard deviation in progress. Results are not affected, and always contain std.
- seed (int) – Seed used to generate the folds (passed to numpy.random.seed).
- callbacks (list of callback functions) – List of callback functions that are applied at each iteration. See Callbacks in Python-API.md for more information.
Returns: evaluation history
Return type: list(string)
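The folds argument is described as a generator or iterator of (train_idx, test_idx) tuples. A minimal sketch of such a generator follows. The function `contiguous_folds` is hypothetical, it produces unshuffled contiguous folds, and it yields plain lists of row indices. Whether lists are accepted or arrays are required is not stated in the description above.

```python
def contiguous_folds(num_data, nfold):
    """Yield (train_idx, test_idx) tuples for nfold contiguous, unshuffled folds."""
    base, extra = divmod(num_data, nfold)
    start = 0
    for fold in range(nfold):
        # The first `extra` folds get one additional row each.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, num_data))
        yield train_idx, test_idx
        start += size


folds = list(contiguous_folds(num_data=10, nfold=3))
```

Each row appears in exactly one test fold, and the train and test indices of a fold never overlap.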
Scikit-learn API¶
-
class
lightgbm.
LGBMModel
(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='regression', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, is_unbalance=False, seed=0, nthread=-1, silent=True, sigmoid=1.0, huber_delta=1.0, gaussian_eta=1.0, fair_c=1.0, poisson_max_delta_step=0.7, max_position=20, label_gain=None, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]¶ Bases:
object
-
apply
(X, num_iteration=0)[source]¶ Return the predicted leaf of every tree for each sample.
Parameters: - X (array_like, shape=[n_samples, n_features]) – Input features matrix.
- num_iteration (int) – Limit number of iterations in the prediction; defaults to 0 (use all trees).
Returns: X_leaves
Return type: array_like, shape=[n_samples, n_trees]
-
booster_
¶ Get the underlying lightgbm Booster of this model.
-
evals_result_
¶ Get the evaluation results.
-
feature_importances_
¶ Get normalized feature importances.
-
fit
(X, y, sample_weight=None, init_score=None, group=None, eval_set=None, eval_names=None, eval_sample_weight=None, eval_init_score=None, eval_group=None, eval_metric=None, early_stopping_rounds=None, verbose=True, feature_name='auto', categorical_feature='auto', callbacks=None)[source]¶ Fit the gradient boosting model
Parameters: - X (array_like) – Feature matrix
- y (array_like) – Labels
- sample_weight (array_like) – weight of training data
- init_score (array_like) – init score of training data
- group (array_like) – group data of training data
- eval_set (list, optional) – A list of (X, y) tuple pairs to use as a validation set for early-stopping
- eval_names (list of string) – Names of eval_set
- eval_sample_weight (List of array) – weight of eval data
- eval_init_score (List of array) – init score of eval data
- eval_group (List of array) – group data of eval data
- eval_metric (str, list of str, callable, optional) – If a str, should be a built-in evaluation metric to use. If callable, a custom evaluation metric, see note for more details.
- early_stopping_rounds (int) –
- verbose (bool) – If verbose and an evaluation set is used, writes the evaluation
- feature_name (list of str, or 'auto') – Feature names. If ‘auto’ and data is pandas DataFrame, use data columns name.
- categorical_feature (list of str or int, or 'auto') – Categorical features. Type int represents index, type str represents feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, use pandas categorical columns.
- callbacks (list of callback functions) – List of callback functions that are applied at each iteration. See Callbacks in Python-API.md for more information.
Note
- Custom eval function expects a callable with one of the following signatures: func(y_true, y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group).
- It should return (eval_name, eval_result, is_bigger_better) or a list of (eval_name, eval_result, is_bigger_better).
- y_true: array_like of shape [n_samples]
- The target values
- y_pred: array_like of shape [n_samples] or shape[n_samples * n_class] (for multi-class)
- The predicted values
- weight: array_like of shape [n_samples]
- The weight of samples
- group: array_like
- group/query data, used for ranking task
- eval_name: str
- name of evaluation
- eval_result: float
- eval result
- is_bigger_better: bool
- Whether a bigger eval result is better, e.g. AUC is bigger_better.
- For a multi-class task, y_pred is grouped by class_id first, then by row_id.
- To get the y_pred of the i-th row in the j-th class, access y_pred[j*num_data+i].
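A custom eval function following the two-argument form from the note might look like the sketch below. The metric (mean absolute error), the name "mae" and the sample values are illustrative choices, not something the library prescribes.

```python
def mean_absolute_error(y_true, y_pred):
    """Custom metric following the func(y_true, y_pred) form from the note."""
    total = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    eval_result = total / len(y_true)
    # A smaller error is better, so is_bigger_better is False.
    return "mae", eval_result, False


# Made-up targets and predictions.
name, result, is_bigger_better = mean_absolute_error([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
```

For a metric such as AUC the third element would be True instead.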
-
predict
(X, raw_score=False, num_iteration=0)[source]¶ Return the predicted value for each sample.
Parameters: - X (array_like, shape=[n_samples, n_features]) – Input features matrix.
- num_iteration (int) – Limit number of iterations in the prediction; defaults to 0 (use all trees).
Returns: predicted_result
Return type: array_like, shape=[n_samples] or [n_samples, n_classes]
-
-
class
lightgbm.
LGBMClassifier
(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='binary', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, is_unbalance=False, seed=0, nthread=-1, silent=True, sigmoid=1.0, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]¶ Bases:
lightgbm.sklearn.LGBMModel
,object
-
classes_
¶ Get class label array.
-
n_classes_
¶ Get number of classes
-
predict_proba
(X, raw_score=False, num_iteration=0)[source]¶ Return the predicted probability for each class for each sample.
Parameters: - X (array_like, shape=[n_samples, n_features]) – Input features matrix.
- num_iteration (int) – Limit number of iterations in the prediction; defaults to 0 (use all trees).
Returns: predicted_probability
Return type: array_like, shape=[n_samples, n_classes]
-
-
class
lightgbm.
LGBMRegressor
(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='regression', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, seed=0, nthread=-1, silent=True, huber_delta=1.0, gaussian_eta=1.0, fair_c=1.0, poisson_max_delta_step=0.7, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]¶ Bases:
lightgbm.sklearn.LGBMModel
,object
-
class
lightgbm.
LGBMRanker
(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=50000, objective='lambdarank', min_split_gain=0, min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, is_unbalance=False, seed=0, nthread=-1, silent=True, sigmoid=1.0, max_position=20, label_gain=None, drop_rate=0.1, skip_drop=0.5, max_drop=50, uniform_drop=False, xgboost_dart_mode=False)[source]¶ Bases:
lightgbm.sklearn.LGBMModel
-
fit
(X, y, sample_weight=None, init_score=None, group=None, eval_set=None, eval_names=None, eval_sample_weight=None, eval_init_score=None, eval_group=None, eval_metric='ndcg', eval_at=1, early_stopping_rounds=None, verbose=True, feature_name='auto', categorical_feature='auto', callbacks=None)[source]¶ Most arguments are the same as in the common methods, except the following:
- eval_at : list of int
- The evaluation positions of NDCG
-
Callbacks¶
-
lightgbm.
early_stopping
(stopping_rounds, verbose=True)[source]¶ Create a callback that activates early stopping. Requires at least one validation data and one metric. If there’s more than one, will check all of them.
Parameters: - stopping_rounds (int) – The number of rounds without improvement after which training stops.
- verbose (optional, bool) – Whether to print message about early stopping information.
Returns: callback – The requested callback function.
Return type: function
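The stopping rule can be sketched on a recorded metric history. This is an illustration under the assumption that training stops once the metric has not improved for stopping_rounds consecutive rounds. The function `find_stopping_round` is hypothetical and is not the callback's implementation.

```python
def find_stopping_round(history, stopping_rounds, higher_is_better=False):
    """Return (best_round, stop_round); stop_round is None if no early stop happens."""
    best_round = 0
    best_value = history[0]
    for current_round, value in enumerate(history):
        improved = value > best_value if higher_is_better else value < best_value
        if current_round == 0 or improved:
            best_round, best_value = current_round, value
        elif current_round - best_round >= stopping_rounds:
            return best_round, current_round
    return best_round, None


# Made-up validation errors: the best value before stopping is at round 2.
errors = [0.50, 0.42, 0.40, 0.41, 0.43, 0.44, 0.39]
best, stopped = find_stopping_round(errors, stopping_rounds=3)
```

With these numbers the stop happens at best round plus stopping_rounds, so the later improvement at round 6 is never reached. A larger stopping_rounds value would let training continue to it.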
-
lightgbm.
print_evaluation
(period=1, show_stdv=True)[source]¶ Create a callback that prints evaluation results.
Parameters: - period (int) – The period to log the evaluation results
- show_stdv (bool, optional) – Whether to show the standard deviation, if provided
Returns: callback – A callback that prints evaluation results every period iterations.
Return type: function
-
lightgbm.
record_evaluation
(eval_result)[source]¶ Create a callback that records the evaluation history into eval_result.
Parameters: eval_result (dict) – A dictionary to store the evaluation results. Returns: callback – The requested callback function. Return type: function
-
lightgbm.
reset_parameter
(**kwargs)[source]¶ Reset parameter after first iteration
NOTE: the initial parameter will still take in-effect on first iteration.
Parameters: **kwargs (value should be list or function) – List of parameter values, one for each boosting round, or a customized function that calculates the parameter from the current round number (e.g. to yield learning rate decay). - list l: parameter = l[current_round] - function f: parameter = f(current_round) Returns: callback – The requested callback function. Return type: function
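The two accepted forms, a list indexed by round or a function of the round number, can be sketched as follows. The helper `resolve_parameter` and the decay rule are hypothetical and only mirror the rules parameter = l[current_round] and parameter = f(current_round).

```python
def resolve_parameter(schedule, current_round):
    """Value of a scheduled parameter at current_round (list or function)."""
    if callable(schedule):
        return schedule(current_round)
    return schedule[current_round]


# A list gives one value per boosting round.
as_list = [0.1, 0.1, 0.05, 0.05]


# A function can express decay (made-up rule: halve the value every round).
def decay(current_round):
    return 0.1 * (0.5 ** current_round)


from_list = [resolve_parameter(as_list, r) for r in range(4)]
from_function = [resolve_parameter(decay, r) for r in range(4)]
```

In this sketch a list must have at least as many entries as there are boosting rounds, otherwise indexing raises an IndexError. A function has no such limit.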
Plotting¶
-
lightgbm.
plot_importance
(booster, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance', xlabel='Feature importance', ylabel='Features', importance_type='split', max_num_features=None, ignore_zero=True, figsize=None, grid=True, **kwargs)[source]¶ Plot model feature importances.
Parameters: - booster (Booster or LGBMModel) – Booster or LGBMModel instance
- ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
- height (float) – Bar height, passed to ax.barh()
- xlim (tuple of 2 elements) – Tuple passed to axes.xlim()
- ylim (tuple of 2 elements) – Tuple passed to axes.ylim()
- title (str) – Axes title. Pass None to disable.
- xlabel (str) – X axis title label. Pass None to disable.
- ylabel (str) – Y axis title label. Pass None to disable.
- importance_type (str) – How the importance is calculated: “split” or “gain”. “split” is the number of times a feature is used in a model. “gain” is the total gain of splits which use the feature.
- max_num_features (int) – Max number of top features displayed on plot. If None or smaller than 1, all features will be displayed.
- ignore_zero (bool) – Ignore features with zero importance
- figsize (tuple of 2 elements) – Figure size
- grid (bool) – Whether to add a grid to the axes
- **kwargs – Other keywords passed to ax.barh()
Returns: ax
Return type: matplotlib Axes
-
lightgbm.
plot_metric
(booster, metric=None, dataset_names=None, ax=None, xlim=None, ylim=None, title='Metric during training', xlabel='Iterations', ylabel='auto', figsize=None, grid=True)[source]¶ Plot one metric during training.
Parameters: - booster (dict or LGBMModel) – Evals_result recorded by lightgbm.train() or LGBMModel instance
- metric (str or None) – The metric name to plot. Only one metric supported because different metrics have various scales. Pass None to pick first one (according to dict hashcode).
- dataset_names (None or list of str) – List of the dataset names to plot. Pass None to plot all datasets.
- ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
- xlim (tuple of 2 elements) – Tuple passed to axes.xlim()
- ylim (tuple of 2 elements) – Tuple passed to axes.ylim()
- title (str) – Axes title. Pass None to disable.
- xlabel (str) – X axis title label. Pass None to disable.
- ylabel (str) – Y axis title label. Pass None to disable. Pass ‘auto’ to use metric.
- figsize (tuple of 2 elements) – Figure size
- grid (bool) – Whether to add a grid to the axes
Returns: ax
Return type: matplotlib Axes
-
lightgbm.
plot_tree
(booster, ax=None, tree_index=0, figsize=None, graph_attr=None, node_attr=None, edge_attr=None, show_info=None)[source]¶ Plot specified tree.
Parameters: - booster (Booster, LGBMModel) – Booster or LGBMModel instance.
- ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
- tree_index (int, default 0) – Specify tree index of target tree.
- figsize (tuple of 2 elements) – Figure size.
- graph_attr (dict) – Mapping of (attribute, value) pairs for the graph.
- node_attr (dict) – Mapping of (attribute, value) pairs set for all nodes.
- edge_attr (dict) – Mapping of (attribute, value) pairs set for all edges.
- show_info (list) – Information shown on nodes. Options: ‘split_gain’, ‘internal_value’, ‘internal_count’ or ‘leaf_count’.
Returns: ax
Return type: matplotlib Axes
-
lightgbm.
create_tree_digraph
(booster, tree_index=0, show_info=None, name=None, comment=None, filename=None, directory=None, format=None, engine=None, encoding=None, graph_attr=None, node_attr=None, edge_attr=None, body=None, strict=False)[source]¶ Create a digraph of specified tree.
Parameters: - booster (Booster, LGBMModel) – Booster or LGBMModel instance.
- tree_index (int, default 0) – Specify tree index of target tree.
- show_info (list) – Information shown on nodes. Options: ‘split_gain’, ‘internal_value’, ‘internal_count’ or ‘leaf_count’.
- name (str) – Graph name used in the source code.
- comment (str) – Comment added to the first line of the source.
- filename (str) – Filename for saving the source (defaults to name + ‘.gv’).
- directory (str) – (Sub)directory for source saving and rendering.
- format (str) – Rendering output format (‘pdf’, ‘png’, ...).
- engine (str) – Layout command used (‘dot’, ‘neato’, ...).
- encoding (str) – Encoding for saving the source.
- graph_attr (dict) – Mapping of (attribute, value) pairs for the graph.
- node_attr (dict) – Mapping of (attribute, value) pairs set for all nodes.
- edge_attr (dict) – Mapping of (attribute, value) pairs set for all edges.
- body (list of str) – Iterable of lines to add to the graph body.
- strict (bool) – Rendering should merge multi-edges.
Returns: graph
Return type: graphviz Digraph