Ensembling and LightGBM Classifier Hyperparameter Basics
Hi Reader! LightGBM, XGBoost and CatBoost are some of the most well-known gradient boosting decision tree algorithms.
Ensembling means combining multiple trees into a single model. But wait… why do we need multiple trees instead of a single large tree? The answer is that many simple/weak (and especially non-linear) learners can jointly outperform a single, usually over-fitting, complicated/strong learner.
There are two techniques used for Ensembling:
a. Bagging: The objective here is to create several subsets of the training data, chosen randomly with replacement. Each subset is used to train its own decision tree, so we end up with an ensemble of different models. The average of the predictions from the different trees is used, which is more robust than a single decision tree classifier.
b. Boosting: In this technique, learners are trained sequentially, with early learners fitting simple models to the data and later ones analysing the errors. Consecutive trees (on random samples) are fit, and at every step the goal is to improve on the accuracy of the prior tree.
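To make the two techniques concrete, here is a minimal sketch using scikit-learn (which offers both styles of ensemble); the synthetic dataset and all settings below are arbitrary examples, not recommendations:

```python
# Bagging vs. boosting on a synthetic dataset (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Bagging: many trees (the default base learner is a decision tree) trained
# independently on bootstrap samples; predictions are combined by vote.
bagging = BaggingClassifier(n_estimators=50, random_state=42)

# Boosting: trees trained sequentially, each one correcting the errors
# of the trees before it.
boosting = GradientBoostingClassifier(n_estimators=50, random_state=42)

bagging.fit(X_tr, y_tr)
boosting.fit(X_tr, y_tr)
print(bagging.score(X_te, y_te), boosting.score(X_te, y_te))
```

Both ensembles comfortably beat a single shallow tree on data like this, which is the whole point of ensembling.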
What is a Decision Tree?
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
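A tiny sketch of "learning simple decision rules from the features" (the one-feature dataset below is made up for illustration):

```python
# A decision tree learns threshold rules from the features (toy example).
from sklearn.tree import DecisionTreeClassifier

# Made-up data: one feature; the label is 1 whenever the feature exceeds 5.
X = [[1], [2], [3], [6], [7], [8]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier().fit(X, y)
# The tree recovers a simple rule of the form "feature > threshold".
print(tree.predict([[2], [7]]))  # → [0 1]
```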
LightGBM Classifier Params & Hyperparameters
1. Task: This param determines the action to be executed. The default value is ‘train’. Other possible values include: predict, refit, save_binary, etc.
2. Boosting Type: Default value is ‘gbdt’. Other values include:
2a. Dart: Standard gbdt suffers from over-specialization, meaning that trees added at later iterations tend to impact the prediction of only a few instances and make a negligible contribution towards the remaining instances. Adding dropout makes it more difficult for trees at later iterations to specialize on those few samples and hence improves performance. For more details: https://arxiv.org/abs/1505.01866
2b. Goss: Standard gbdt is reliable but not fast enough on large datasets, so goss proposes a gradient-based sampling method to avoid searching the whole space. For each data instance, a small gradient means the instance is already well-trained, while a large gradient means it should be trained further. Goss therefore keeps all instances with large gradients and takes a random sample of the instances with small gradients; sampling only one of the two sides is why it is called Gradient-based One-Side Sampling. This makes the search space smaller, so goss can converge faster. Finally, for more insight into goss, you can check this blog post.
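In code, `task` and `boosting_type` are passed to LightGBM via a params dict. A sketch, with key names taken from the LightGBM docs and values that are arbitrary examples rather than tuned recommendations:

```python
# Illustrative LightGBM parameter dicts (values are examples, not advice).
dart_params = {
    "task": "train",          # the default action
    "boosting_type": "dart",  # gbdt with dropout, to counter over-specialization
    "drop_rate": 0.1,         # fraction of trees dropped each iteration (dart-only)
}

goss_params = {
    "task": "train",
    "boosting_type": "goss",  # Gradient-based One-Side Sampling
    "top_rate": 0.2,          # keep this fraction of large-gradient instances
    "other_rate": 0.1,        # randomly sample this fraction of the rest
}
```

With the `lightgbm` package installed, either dict could be handed to `lightgbm.train(...)` along with a training dataset.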
3. Objective: The type of label your model is trying to fit. For example, for a classification task your objective can be ‘binary’ or ‘multiclass’; for regression, the objective is set to ‘regression’.
4. Random State: Used to seed random number generation in the algorithm. Setting a fixed random state helps ensure that you come up with the exact same model, given that other factors (like the train, validation, and test splits and other hyperparams) are constant.
5. N_Jobs (int, optional (default=-1)): Number of parallel threads to use for training.
6. Metric: The metric(s) to be evaluated on the evaluation set(s), e.g. logloss, AUC, MSE, etc.
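Params 3–6 are typically set together when constructing the model. A sketch of the scikit-learn-style constructor arguments (the values here are illustrative placeholders):

```python
# Constructor arguments for a LightGBM classifier (illustrative values).
clf_kwargs = {
    "objective": "binary",  # 'multiclass' for >2 classes; 'regression' for regressors
    "random_state": 42,     # fixed seed so repeated runs give the same model
    "n_jobs": -1,           # use all available threads
    "metric": "auc",        # evaluation metric, e.g. 'binary_logloss', 'auc'
}
# With lightgbm installed: model = lightgbm.LGBMClassifier(**clf_kwargs)
```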
7. Monotone Constraints: Used to constrain features to be monotonic. For example, if X is a numeric feature used to predict a regression target Y, and we know (by logical reasoning or statistical evidence) that Y only ever increases, or only ever decreases, as X increases, we can set a monotone constraint on feature X.
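Per the LightGBM docs, `monotone_constraints` takes one value per feature: 1 for non-decreasing, -1 for non-increasing, 0 for unconstrained. A sketch with hypothetical house-price features:

```python
# Hypothetical features for a price-prediction task (illustrative only).
feature_names = ["sqft", "age", "rooms"]

# Price should not fall as sqft grows (+1), should not rise as age grows (-1),
# and 'rooms' is left unconstrained (0).
params = {
    "objective": "regression",
    "monotone_constraints": [1, -1, 0],  # one entry per feature, in order
}
```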
8. Bin_construct_sample_cnt: The number of data points sampled to construct histogram bins. Decreasing this value usually decreases training time, but may also decrease accuracy.
9. Cat_l2: L2 regularization in categorical splits. Over-fitting can be reduced by increasing this param.
10. Cat_smooth: This can reduce the effect of noise in categorical features, especially for categories with few data points. This parameter is used to deal with over-fitting (when the amount of data is small or the number of categories is large).
11. Feature_fraction: LightGBM will randomly select a subset of features on each iteration (tree) if feature_fraction is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of features before training each tree. This can be used to speed up training and deal with over-fitting.
12. Lambda_L1: L1 regularization. Used to deal with overfitting.
13. Lambda_L2: L2 regularization. Used to deal with overfitting.
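Params 8–13 all trade accuracy on the training set for generalization or speed, and are usually tuned together. A sketch of a params dict combining them (names from the LightGBM docs; the values are arbitrary starting points, not recommendations):

```python
# Illustrative over-fitting / speed controls (values are examples only).
regularization_params = {
    "bin_construct_sample_cnt": 200000,  # data points sampled to build histogram bins
    "cat_l2": 10.0,           # L2 regularization on categorical splits
    "cat_smooth": 10.0,       # smoothing for categories with few data points
    "feature_fraction": 0.8,  # each tree sees a random 80% of the features
    "lambda_l1": 0.1,         # L1 regularization on leaf weights
    "lambda_l2": 0.1,         # L2 regularization on leaf weights
}
```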
14. Learning Rate: Boosting learning rate. A lower learning rate usually results in better accuracy; the downside is that it requires more estimators to converge, which increases both training and prediction times.
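The learning-rate/estimator-count trade-off holds for any gradient boosting implementation; a sketch with scikit-learn's GradientBoostingClassifier (the dataset and rates below are arbitrary):

```python
# Lower learning rate + more estimators vs. higher rate + fewer estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A lower learning rate shrinks each tree's contribution, so more
# estimators are needed to reach a comparable training loss.
fast = GradientBoostingClassifier(learning_rate=0.3, n_estimators=50, random_state=0)
slow = GradientBoostingClassifier(learning_rate=0.03, n_estimators=500, random_state=0)
fast.fit(X_tr, y_tr)
slow.fit(X_tr, y_tr)
print(fast.score(X_te, y_te), slow.score(X_te, y_te))
```

Here the slower configuration does roughly 10x the work per run, which is the training-time cost the text describes.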
<More to be added>