Introduction
- Understanding your data is key.
- Data is typically partitioned into training and test sets.
- Setting random states (fixed seeds) helps to promote reproducibility; see the sketch below.
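A minimal sketch of a train/test split with a fixed random state, assuming scikit-learn is available; the dataset here is synthetic, purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data used only to illustrate the split.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# random_state fixes the shuffle, so the same split is produced on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)  # (750, 10) (250, 10)
```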
Decision trees
- Decision trees are intuitive models that can be used for classification and regression.
- Gini impurity measures how mixed the classes at a node are (1 minus the sum of squared class proportions). The higher the value, the greater the mix of classes; a 50/50 split of two classes gives an impurity of 0.5, while a pure node gives 0.
- Greedy algorithms take the locally optimal decision at each step, without considering the larger problem as a whole.
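A short sketch of the Gini calculation and of a tree grown greedily on it, reusing the train/test split from the earlier snippet (assumed, not shown again here).

```python
from sklearn.tree import DecisionTreeClassifier

def gini(class_proportions):
    """Gini impurity: 1 - sum(p_k^2). 0 = pure node, 0.5 = even two-class mix."""
    return 1.0 - sum(p ** 2 for p in class_proportions)

print(gini([0.5, 0.5]))  # 0.5
print(gini([1.0, 0.0]))  # 0.0

# scikit-learn grows the tree greedily, choosing at each node the split that
# most reduces impurity, without looking ahead.
tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
```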
Variance
- Overfitting occurs when our algorithm fits the training data too closely, capturing noise as well as signal.
- Models that are overfitted may not generalise well to “unseen” data.
- Pruning is one approach for helping to prevent overfitting.
- By combining many instances of “high variance” classifiers, we can end up with a single classifier with low variance.
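A sketch of two simple ways to rein in an overfitted tree, again assuming the earlier split. The specific values of `max_depth` and `ccp_alpha` are illustrative, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

# Unconstrained tree: tends to fit the training data very closely (high variance).
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Constrained/pruned tree: max_depth and ccp_alpha (cost-complexity pruning)
# both trade a little training accuracy for better generalisation.
pruned_tree = DecisionTreeClassifier(
    max_depth=4, ccp_alpha=0.01, random_state=42
).fit(X_train, y_train)

print("deep  :", deep_tree.score(X_train, y_train), deep_tree.score(X_test, y_test))
print("pruned:", pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))
```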
Boosting
- An algorithm that performs only somewhat better than chance at a task - such as a simple decision tree - is sometimes referred to as a “weak learner”.
- With boosting, we create a combination of many weak learners to form a single “strong” learner.
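A sketch of boosting many weak learners (depth-1 decision “stumps”) with AdaBoost, assuming scikit-learn 1.2+ (where the base learner parameter is `estimator`) and the earlier train/test split.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Each weak learner is a one-level tree; later learners give more weight to
# the examples the earlier ones misclassified.
boosted = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    random_state=42,
)
boosted.fit(X_train, y_train)
print(boosted.score(X_test, y_test))
```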
Bagging
- “Bagging” is short for bootstrap aggregation.
- Bootstrapping is a resampling technique in which samples are drawn from the data with replacement.
- Bagging is another method for combining multiple weak learners to create a strong learner.
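A sketch of bagging decision trees, assuming scikit-learn 1.2+ and the earlier train/test split.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each tree is trained on a bootstrap sample (drawn with replacement) of the
# training data; predictions are combined by majority vote.
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=42,
)
bagged.fit(X_train, y_train)
print(bagged.score(X_test, y_test))
```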
Random forest
- With Random Forest models, we resample the data (bootstrapping) and use random subsets of features at each split.
- Random Forests are powerful predictive models.
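A sketch of a random forest, which combines bootstrap resampling with random feature subsets at each split; it assumes the earlier train/test split, and the hyperparameter values are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # consider only a random subset of features per split
    random_state=42,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```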
Gradient boosting
- As a “boosting” method, gradient boosting involves iteratively building trees, with each new tree aiming to correct the errors of the trees built so far.
- Gradient boosting also borrows the concept of sub-sampling the variables (just like Random Forests), which can help to prevent overfitting.
- The performance gains come at the cost of interpretability.
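A sketch of gradient boosting with both row and column sub-sampling, assuming the earlier train/test split; the parameter values are illustrative defaults rather than tuned choices.

```python
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,        # row sub-sampling (stochastic gradient boosting)
    max_features="sqrt",  # column sub-sampling, as in random forests
    random_state=42,
)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))
```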
Performance
- There can be a large performance gap between different types of tree-based model; a single decision tree is often outperformed by ensemble methods.
- Boosted models typically perform strongly.
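A sketch of comparing tree-based models with cross-validation, reusing the synthetic `X, y` from the first snippet; actual results will depend on the dataset.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each model.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```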