Measuring Purity with Entropy:
Purity and Impurity: A set of examples is pure if all examples belong to a single class (e.g., all cats or all not cats).
Impurity arises when the set contains a mix of classes.
Entropy Definition: Entropy measures the impurity of a set of data. It is denoted H(p_1), where p_1 is the fraction of examples that are cats.
Entropy Calculation:
For a set with 3 cats and 3 dogs (p_1 = 0.5), entropy H(p_1) = 1.
The entropy function peaks at 1 when the set is a 50-50 mix of two classes.
If the set is all cats or all not cats, entropy is 0.
Examples:
5 cats and 1 dog (p_1 = 0.83): Entropy ≈ 0.65.
All cats (p_1 = 1): Entropy = 0.
2 cats and 4 dogs (p_1 = 0.33): Entropy ≈ 0.92.
All dogs (p_1 = 0): Entropy = 0.
Entropy Function:
H(p_1) = -p_1 log_2(p_1) - (1 - p_1) log_2(1 - p_1).
Logarithms are taken to base 2 to make the peak of the curve equal to 1.
Conventionally, 0 log(0) is defined as 0 for entropy calculations.
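As a quick check of the numbers above, here is a minimal sketch of this binary entropy function in Python (the function name and the printed example values are just illustrations of the definition above, not part of any library):

```python
import math

def entropy(p1):
    """Binary entropy H(p1), where p1 is the fraction of examples that are cats."""
    if p1 == 0 or p1 == 1:
        return 0.0          # convention: 0 * log(0) = 0, so a pure set has zero entropy
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

print(entropy(0.5))    # 1.0   (3 cats, 3 dogs)
print(entropy(5 / 6))  # ~0.65 (5 cats, 1 dog)
print(entropy(2 / 6))  # ~0.92 (2 cats, 4 dogs)
print(entropy(1.0))    # 0.0   (all cats)
```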
Comparison with Gini Criterion: The Gini criterion is another measure of impurity that behaves similarly to entropy
and can also be used for building decision trees.
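For comparison, a minimal sketch of the Gini criterion is shown below; it uses the standard two-class Gini impurity formula, which is not derived in these notes:

```python
def gini(p1):
    """Two-class Gini impurity: 1 - p1**2 - (1 - p1)**2, which simplifies to 2 * p1 * (1 - p1)."""
    return 2 * p1 * (1 - p1)

# Like entropy, Gini impurity is 0 for a pure set and largest for a 50-50 mix,
# although its peak value is 0.5 rather than 1.
print(gini(0.5))  # 0.5
print(gini(1.0))  # 0.0
```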
This summary should give you a clear understanding of how entropy is used to measure the impurity of a set of examples
and its application in decision trees.
Decision Trees and Information Gain
Choosing Features to Split:
When building a decision tree, the goal is to choose features that reduce entropy (impurity) the most, thereby maximizing purity.
The reduction in entropy is called information gain.
Example:
Consider a decision tree for recognizing cats vs. not cats.
Ear Shape Feature:
Left: 4/5 cats (p_1 = 0.8), Entropy ≈ 0.72
Right: 1/5 cats (p_1 = 0.2), Entropy ≈ 0.72
Face Shape Feature:
Left: 4/7 cats (p_1 ≈ 0.57), Entropy ≈ 0.99
Right: 1/3 cats (p_1 ≈ 0.33), Entropy ≈ 0.92
Whiskers Feature:
Left: 3/4 cats (p_1 = 0.75), Entropy ≈ 0.81
Right: 2/6 cats (p_1 ≈ 0.33), Entropy ≈ 0.92
Weighted Average Entropy:
To decide the best feature, compute the weighted average entropy of the left and right sub-branches.
Example for Ear Shape:
Weighted Entropy = (5/10) × 0.72 + (5/10) × 0.72 = 0.72
Information Gain:
Calculate the reduction in entropy compared to the root node.
Root Node Entropy (for 5 cats and 5 dogs) = 1
Information Gain = Root Node Entropy - Weighted Entropy
Example for Ear Shape:
Information Gain = 1 - 0.72 = 0.28
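A small sketch of this calculation, reusing the entropy function from the earlier snippet (the argument names are illustrative, not from any library):

```python
def information_gain(p_root, p_left, w_left, p_right, w_right):
    """Reduction in entropy from a split.

    p_root  - fraction of cats at the node before splitting
    p_left  - fraction of cats in the left branch;  w_left  - fraction of examples sent left
    p_right - fraction of cats in the right branch; w_right - fraction of examples sent right
    """
    weighted_entropy = w_left * entropy(p_left) + w_right * entropy(p_right)
    return entropy(p_root) - weighted_entropy

# Ear shape: 5 of 10 examples go left (4/5 cats) and 5 go right (1/5 cats).
print(information_gain(0.5, 4 / 5, 5 / 10, 1 / 5, 5 / 10))  # ~0.28
```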
Choosing the Best Feature:
Compare the information gain for all features and choose the one with the highest value.
Example:
Ear Shape: 0.28
Face Shape: 0.03
Whiskers: 0.12
Best feature to split on: Ear Shape (highest information gain).
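Continuing the same sketch, the gains for all three candidate features can be compared directly; the counts below are the ones listed above:

```python
# (cats_left, total_left, cats_right, total_right) for each candidate split
splits = {
    "ear shape":  (4, 5, 1, 5),
    "face shape": (4, 7, 1, 3),
    "whiskers":   (3, 4, 2, 6),
}

gains = {}
for feature, (cats_l, n_l, cats_r, n_r) in splits.items():
    n = n_l + n_r
    gains[feature] = information_gain((cats_l + cats_r) / n,
                                      cats_l / n_l, n_l / n,
                                      cats_r / n_r, n_r / n)

print(gains)                      # ear shape ~0.28, face shape ~0.03, whiskers ~0.12
print(max(gains, key=gains.get))  # 'ear shape'
```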
Stopping Criteria:
If the reduction in entropy is too small, stop splitting to avoid overfitting.
Building a Decision Tree Using Information Gain
Overall Process:
Start at the Root Node:
Begin with all training examples at the root node.
Calculate the information gain for all possible features.
Choose the feature with the highest information gain to split the data.
Splitting the Data:
Split the dataset into two subsets based on the selected feature.
Create left and right branches and distribute the training examples accordingly.
Repeat the Process:
Continue splitting the data at each node using the same criteria.
Repeat until the stopping criteria are met.
Stopping Criteria:
A node is 100% a single class (entropy of zero).
Splitting further would exceed the maximum tree depth.
Information gain from additional splits is below a threshold.
The number of examples in a node is below a threshold.
Example Illustration:
Root Node Split:
Choose the best feature (e.g., ear shape) and split the data.
Create left and right sub-branches based on the feature values.
Left Sub-Branch:
Check if stopping criteria are met (e.g., all cats or all dogs).
If not, compute information gain for remaining features and split again.
Continue until stopping criteria are met, creating leaf nodes with predictions.
Right Sub-Branch:
Follow the same process as the left sub-branch.
Recursive Algorithm:
Building a decision tree involves creating smaller sub-trees recursively.
Each sub-tree is built on the subset of the training examples that reaches its node.
This recursive approach simplifies the implementation of decision trees.
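One way this recursion might look in code is sketched below. It is a self-contained illustration, not an optimized implementation: examples are assumed to be dicts of binary features, and the stopping criteria used are a pure node, a depth limit, and a minimum information gain.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a list of 0/1 class labels (1 = cat, 0 = not cat)."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def build_tree(X, y, features, depth=0, max_depth=3, min_gain=1e-6):
    """Recursively build a decision tree.

    X is a list of dicts mapping feature name -> 0/1, y is a list of 0/1 labels.
    Returns either a leaf (a predicted class) or a dict describing the split.
    """
    # Stopping criteria: pure node, maximum depth reached, or no features left.
    if len(set(y)) == 1 or depth >= max_depth or not features:
        return Counter(y).most_common(1)[0][0]

    def gain(f):
        left = [label for x, label in zip(X, y) if x[f] == 1]
        right = [label for x, label in zip(X, y) if x[f] == 0]
        weighted = (len(left) * entropy_of(left) + len(right) * entropy_of(right)) / len(y)
        return entropy_of(y) - weighted

    # Choose the feature with the highest information gain; stop if the gain is tiny.
    best = max(features, key=gain)
    if gain(best) < min_gain:
        return Counter(y).most_common(1)[0][0]

    left_rows = [i for i, x in enumerate(X) if x[best] == 1]
    right_rows = [i for i, x in enumerate(X) if x[best] == 0]
    remaining = [f for f in features if f != best]
    return {
        "feature": best,
        "left": build_tree([X[i] for i in left_rows], [y[i] for i in left_rows],
                           remaining, depth + 1, max_depth, min_gain),
        "right": build_tree([X[i] for i in right_rows], [y[i] for i in right_rows],
                            remaining, depth + 1, max_depth, min_gain),
    }
```

Representing the tree as nested dicts keeps the sketch short; a library implementation would typically use node objects and vectorized splits instead.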
Choosing Maximum Depth:
Larger maximum depth allows for more complex models but increases the risk of overfitting.
Use cross-validation or default settings in open-source libraries to choose the optimal depth.
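A rough sketch of choosing the depth with cross-validation is shown below, using scikit-learn as an assumed choice of library (the notes only say "open-source libraries") and the iris dataset as a stand-in for real data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset; any labeled data works

# Evaluate a few candidate depths with 5-fold cross-validation and keep the best one.
scores = {
    depth: cross_val_score(
        DecisionTreeClassifier(max_depth=depth, criterion="entropy", random_state=0),
        X, y, cv=5,
    ).mean()
    for depth in (1, 2, 3, 4, 5, None)  # None = no depth limit
}
best_depth = max(scores, key=scores.get)
print(best_depth, scores[best_depth])
```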
Stopping Splits:
Stop splitting if the information gain is too small or the number of examples is below a threshold.
Making Predictions:
To make a prediction, follow the decision path from the root to a leaf node based on the feature values of the test example.
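A prediction is then a simple walk down the tree. The sketch below assumes the nested-dict tree produced by the build_tree sketch above; the example animal and the training variables are hypothetical:

```python
def predict(node, x):
    """Follow the decision path from the root to a leaf for one example x."""
    while isinstance(node, dict):  # internal node: go left when the feature is present (1)
        node = node["left"] if x[node["feature"]] == 1 else node["right"]
    return node                    # leaf: the predicted class

# Hypothetical usage, assuming training data in the format used by build_tree:
# tree = build_tree(X_train, y_train, ["ear_shape", "face_shape", "whiskers"])
# predict(tree, {"ear_shape": 1, "face_shape": 1, "whiskers": 1})  # -> 1 (cat) or 0 (not cat)
```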