Learn how Decision Trees split data into pure groups using Gini or Entropy, how to control overfitting with pruning, and train a tree in Python with scikit-learn.
At each node, the tree tests a feature and threshold (e.g., PetalLength ≤ 2.45) to best separate the classes, choosing the split that maximizes the purity of the resulting child nodes.
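To make "purity" concrete, here is a minimal sketch of how Gini impurity scores a candidate split. The helper names (`gini`, `split_impurity`) and the toy data are illustrative, not part of scikit-learn:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity after splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy data: petal lengths with species labels (0 = setosa, 1 = versicolor)
petal_length = np.array([1.4, 1.3, 1.5, 4.7, 4.5, 4.9])
species = np.array([0, 0, 0, 1, 1, 1])

# Before splitting, the node is maximally impure for two balanced classes.
print(gini(species))                                   # 0.5

# The split PetalLength <= 2.45 separates the classes perfectly,
# so the weighted impurity drops to zero.
print(split_impurity(petal_length, species, 2.45))     # 0.0
```

The tree builder evaluates many such thresholds per feature and keeps the one with the largest impurity decrease.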
The main hyperparameters for controlling tree growth (pre-pruning) are max_depth, min_samples_split, min_samples_leaf, and max_leaf_nodes.

# Step 1: Imports
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import classification_report, confusion_matrix
# Step 2: Data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Step 3: Model + Hyperparameters
clf = DecisionTreeClassifier(random_state=42)
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 2, 3],
}
grid = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
best_tree = grid.best_estimator_
y_pred = best_tree.predict(X_test)
print("Best params:", grid.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Optional: Text view of the tree
print(export_text(best_tree, feature_names=[f"x{i}" for i in range(X.shape[1])]))
Tip: Use export_text for a quick textual view of the tree, or plot_tree (matplotlib) for a graphical one.
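For completeness, a short sketch of the plot_tree option mentioned above. It trains a small tree on iris and writes the diagram to a file (the filename "tree.png" and the depth limit are arbitrary choices for this example):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)

# Render the fitted tree: filled=True colors nodes by majority class
fig, ax = plt.subplots(figsize=(10, 6))
plot_tree(tree, filled=True, feature_names=iris.feature_names,
          class_names=iris.target_names, ax=ax)
plt.savefig("tree.png")  # or plt.show() in an interactive session
```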
How do max_depth and min_samples_leaf reduce overfitting? Limiting depth caps the number of questions the tree can ask, and requiring a minimum number of samples per leaf stops it from carving out tiny, noise-driven regions of the feature space.
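A quick way to see this effect is to compare an unconstrained tree against a pruned one on the same train/test split; the specific values (max_depth=3, min_samples_leaf=3) are just example settings:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Unconstrained tree: grows until every leaf is pure, so it fits the
# training set perfectly (and can memorize noise).
deep = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pruned tree: shallow depth and a minimum leaf size limit complexity.
pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=3, random_state=42
).fit(X_train, y_train)

print("deep   train/test:", deep.score(X_train, y_train), deep.score(X_test, y_test))
print("pruned train/test:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))
```

The unconstrained tree scores 100% on the training set; the pruned tree trades a little training accuracy for a simpler model that generalizes at least as well on small, noisy datasets.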