1. num_boost_round – n_estimators
Next, you need to choose the number of decision trees (often called base learners in XGBoost) to build during training using num_boost_round. The default is 100, but that is hardly enough for today's large datasets.
Increasing the parameter builds more trees, but it also significantly increases the chance of overfitting because the model becomes more complex.
One trick I learned from Kaggle is to set a high number like 100,000 for num_boost_round and make use of early stopping rounds.
In each boosting round, XGBoost grows another decision tree to improve the collective score of the previous ones. That's why it is called boosting. This process continues for num_boost_round rounds, regardless of whether each new round improves on the last.
But with early stopping, we can halt training, and thus stop growing unnecessary trees, when the score hasn't improved for the last 5, 10, 50, or any arbitrary number of rounds.
With this trick, we can find the optimal number of decision trees without even tuning num_boost_round, and we save time and computation resources. Here is what it might look like in code:
import xgboost as xgb

# Define the remaining parameters
params = {...}

# Build the train/validation sets
dtrain_final = xgb.DMatrix(X_train, label=y_train)
dvalid_final = xgb.DMatrix(X_valid, label=y_valid)

bst_final = xgb.train(
    params,
    dtrain_final,
    num_boost_round=100000,  # Set a high number
    evals=[(dvalid_final, "validation")],
    early_stopping_rounds=50,  # Enable early stopping
    verbose_eval=False,
)
The above code would have let XGBoost build up to 100k decision trees, but thanks to early stopping, training stops once the validation score hasn't improved for the last 50 rounds. Typically, the number of trees actually required will be lower than 5,000–10,000.
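To see how many trees early stopping actually kept, you can read the booster's best_iteration and best_score attributes and use them at prediction time. Here is a minimal sketch, assuming XGBoost 1.4 or newer (where predict accepts iteration_range) and a hypothetical X_test set:

# Number of the best boosting round and its validation score
print(bst_final.best_iteration, bst_final.best_score)

# Predict using only the trees up to and including the best iteration
dtest = xgb.DMatrix(X_test)
preds = bst_final.predict(dtest, iteration_range=(0, bst_final.best_iteration + 1))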
Controlling num_boost_round is also one of the biggest factors in how long the training process runs, since more trees require more resources.
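For reference, the scikit-learn wrapper exposes the same knob as n_estimators, which is why the two names are used interchangeably. A minimal sketch of the same trick with that API, assuming XGBoost 1.6+ (where early_stopping_rounds is set on the estimator rather than in fit), might look like this:

import xgboost as xgb

# n_estimators plays the role of num_boost_round in the scikit-learn API
model = xgb.XGBRegressor(
    n_estimators=100000,       # upper bound on the number of trees
    early_stopping_rounds=50,  # stop when validation stops improving
)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    verbose=False,
)

print(model.best_iteration)  # boosting rounds actually kept

The same logic applies: set the ceiling high, let early stopping decide where to stop.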