Introduction
- Understanding your data is key.
- Data is typically partitioned into training and test sets.
- Setting random states (fixed seeds) helps to promote reproducibility; see the sketch below.
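A minimal sketch of a train/test split with a fixed random state, assuming scikit-learn is available; the dataset here is synthetic, purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data used only to illustrate the split.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# random_state fixes the shuffle, so the same split is produced on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)  # (750, 10) (250, 10)
```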
Decision trees
- Decision trees are intuitive models that can be used for classification and regression.
- Gini impurity measures how mixed the classes at a node are (1 minus the sum of squared class proportions). The higher the value, the greater the mix of classes; a 50/50 split of two classes gives an impurity of 0.5, while a pure node gives 0.
- Greedy algorithms take the locally optimal decision at each step, without considering the larger problem as a whole.
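A short sketch of the Gini calculation and of a tree grown greedily on it, reusing the train/test split from the earlier snippet (assumed, not shown again here).

```python
from sklearn.tree import DecisionTreeClassifier

def gini(class_proportions):
    """Gini impurity: 1 - sum(p_k^2). 0 = pure node, 0.5 = even two-class mix."""
    return 1.0 - sum(p ** 2 for p in class_proportions)

print(gini([0.5, 0.5]))  # 0.5
print(gini([1.0, 0.0]))  # 0.0

# scikit-learn grows the tree greedily, choosing at each node the split that
# most reduces impurity, without looking ahead.
tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
```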
Variance
- Overfitting occurs when our algorithm fits the training data too closely, capturing noise as well as signal.
- Models that are overfitted may not generalise well to “unseen” data.
- Pruning is one approach for helping to prevent overfitting.
- By combining many instances of “high variance” classifiers, we can end up with a single classifier with low variance.
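A sketch of two simple ways to rein in an overfitted tree, again assuming the earlier split. The specific values of `max_depth` and `ccp_alpha` are illustrative, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

# Unconstrained tree: tends to fit the training data very closely (high variance).
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Constrained/pruned tree: max_depth and ccp_alpha (cost-complexity pruning)
# both trade a little training accuracy for better generalisation.
pruned_tree = DecisionTreeClassifier(
    max_depth=4, ccp_alpha=0.01, random_state=42
).fit(X_train, y_train)

print("deep  :", deep_tree.score(X_train, y_train), deep_tree.score(X_test, y_test))
print("pruned:", pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))
```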
Boosting
- An algorithm that performs only somewhat better than chance at a task - such as a simple decision tree - is sometimes referred to as a “weak learner”.
- With boosting, we create a combination of many weak learners to form a single “strong” learner.
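A sketch of boosting many weak learners (depth-1 decision “stumps”) with AdaBoost, assuming scikit-learn 1.2+ (where the base learner parameter is `estimator`) and the earlier train/test split.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Each weak learner is a one-level tree; later learners give more weight to
# the examples the earlier ones misclassified.
boosted = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    random_state=42,
)
boosted.fit(X_train, y_train)
print(boosted.score(X_test, y_test))
```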
Bagging
- “Bagging” is short for bootstrap aggregation.
- Bootstrapping is a resampling technique in which samples are drawn from the data with replacement.
- Bagging is another method for combining multiple weak learners to create a strong learner.
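A sketch of bagging decision trees, assuming scikit-learn 1.2+ and the earlier train/test split.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each tree is trained on a bootstrap sample (drawn with replacement) of the
# training data; predictions are combined by majority vote.
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=42,
)
bagged.fit(X_train, y_train)
print(bagged.score(X_test, y_test))
```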
Random forest
- With Random Forest models, we resample the data (bootstrapping) and use random subsets of features at each split.
- Random Forests are powerful predictive models.
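A sketch of a random forest, which combines bootstrap resampling with random feature subsets at each split; it assumes the earlier train/test split, and the hyperparameter values are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # consider only a random subset of features per split
    random_state=42,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```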
Gradient boosting
- As a “boosting” method, gradient boosting involves iteratively building trees, with each new tree aiming to correct the errors of the trees built so far.
- Gradient boosting also borrows the concept of sub-sampling the variables (just like Random Forests), which can help to prevent overfitting.
- The performance gains come at the cost of interpretability.
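A sketch of gradient boosting with both row and column sub-sampling, assuming the earlier train/test split; the parameter values are illustrative defaults rather than tuned choices.

```python
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,        # row sub-sampling (stochastic gradient boosting)
    max_features="sqrt",  # column sub-sampling, as in random forests
    random_state=42,
)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))
```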
Performance
- There can be a large performance gap between different types of tree-based model; a single decision tree is often outperformed by ensemble methods.
- Boosted models typically perform strongly.
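A sketch of comparing tree-based models with cross-validation, reusing the synthetic `X, y` from the first snippet; actual results will depend on the dataset.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each model.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```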