Introduction


  • Understanding your data is key.
  • Data is typically partitioned into training and test sets.
  • Setting random states helps to promote reproducibility.
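
As a minimal sketch in scikit-learn, assuming a hypothetical pandas DataFrame df with an “outcome” column as the prediction target:

    from sklearn.model_selection import train_test_split

    # "df" and "outcome" are hypothetical placeholders for your own dataset.
    X = df.drop(columns=["outcome"])
    y = df["outcome"]

    # Hold out 20% of the rows for testing; fixing random_state makes the
    # split reproducible across runs.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )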

Decision trees


  • Decision trees are intuitive models that can be used for classification and regression.
  • Gini impurity is a measure of “impurity”: the higher the value, the more mixed the classes at a node. A 50/50 split of two classes gives a Gini impurity of 0.5 (the sketch after this list works through the calculation).
  • Greedy algorithms make the locally optimal choice at each step, without considering the larger problem as a whole.
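
As a rough sketch, Gini impurity for a node can be computed as one minus the sum of squared class proportions, so a 50/50 two-class split gives 1 - (0.5² + 0.5²) = 0.5:

    def gini_impurity(class_counts):
        """Gini impurity: 1 minus the sum of squared class proportions."""
        total = sum(class_counts)
        return 1 - sum((count / total) ** 2 for count in class_counts)

    print(gini_impurity([50, 50]))   # 0.5 - maximally mixed two-class node
    print(gini_impurity([100, 0]))   # 0.0 - pure node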

Variance


  • Overfitting is a problem that occurs when our model fits the training data too closely, capturing noise as well as signal.
  • Models that are overfitted may not generalise well to “unseen” data.
  • Pruning is one approach for helping to prevent overfitting.
  • By combining many instances of “high variance” classifiers, we can end up with a single classifier with low variance.
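
To illustrate the first three points, a minimal sketch (reusing the earlier, hypothetical X_train/X_test split): an unconstrained tree will typically score far better on the training set than on the test set, while limiting its depth - a simple form of pruning - narrows that gap:

    from sklearn.tree import DecisionTreeClassifier

    # An unconstrained tree can keep splitting until it memorises the training data.
    deep_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

    # Restricting depth is a simple pruning-style control on model complexity.
    pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

    for name, model in [("deep", deep_tree), ("pruned", pruned_tree)]:
        print(name,
              "train:", round(model.score(X_train, y_train), 3),
              "test:", round(model.score(X_test, y_test), 3))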

Boosting


  • An algorithm that performs only slightly better than chance at a task - such as a simple decision tree - is sometimes referred to as a “weak learner”.
  • With boosting, we create a combination of many weak learners to form a single “strong” learner.
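
A minimal sketch of boosting with scikit-learn's AdaBoostClassifier (reusing the hypothetical split from earlier); by default each weak learner is a depth-1 decision tree, or “stump”:

    from sklearn.ensemble import AdaBoostClassifier

    # Combine 100 weak learners (depth-1 trees by default) into one strong learner.
    boosted = AdaBoostClassifier(n_estimators=100, random_state=42)
    boosted.fit(X_train, y_train)
    print(boosted.score(X_test, y_test))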

Bagging


  • “Bagging” is short for bootstrap aggregation.
  • Bootstrapping is a resampling technique in which samples are drawn from the data with replacement.
  • Bagging is another method for combining multiple weak learners to create a strong learner.
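
A minimal sketch of bagging with scikit-learn (again reusing the hypothetical split); each base tree is fitted to a bootstrap sample drawn with replacement from the training data:

    from sklearn.ensemble import BaggingClassifier

    # By default the base estimator is a decision tree; each one sees a
    # bootstrap sample, and predictions are combined by majority vote.
    bagged = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=42)
    bagged.fit(X_train, y_train)
    print(bagged.score(X_test, y_test))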

Random forest


  • With Random Forest models, we resample the data and use random subsets of features at each split.
  • Random Forests are powerful predictive models.
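
A minimal sketch with scikit-learn's RandomForestClassifier, which combines bootstrap resampling of the rows with a random subset of features at each split (here the square root of the number of features):

    from sklearn.ensemble import RandomForestClassifier

    # Each tree is trained on a bootstrap sample; each split considers only
    # a random subset of the features (controlled by max_features).
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=42)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))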

Gradient boosting


  • As a “boosting” method, gradient boosting involves iteratively building trees, with each new tree aiming to correct the misclassifications of the previous trees.
  • Gradient boosting also borrows the concept of sub-sampling the variables (just like Random Forests), which can help to prevent overfitting.
  • The performance gains come at the cost of interpretability.
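
A minimal sketch with scikit-learn's GradientBoostingClassifier (reusing the hypothetical split); subsample draws a fraction of the rows for each tree and max_features sub-samples the variables, both of which can help to limit overfitting:

    from sklearn.ensemble import GradientBoostingClassifier

    # Trees are added one at a time, each fitted to the errors of the
    # ensemble built so far; row and feature sub-sampling add randomness.
    gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     subsample=0.8, max_features="sqrt",
                                     random_state=42)
    gbm.fit(X_train, y_train)
    print(gbm.score(X_test, y_test))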

Performance


  • There can be a large performance gap between different types of tree-based model.
  • Boosted models typically perform strongly.