Learn how Decision Trees split data into pure groups using Gini or Entropy, how to control overfitting with pruning, and train a tree in Python with scikit-learn.
At each node, the tree tests a feature and threshold (e.g., PetalLength ≤ 2.45) to best separate the classes, choosing the split that maximizes the purity of the resulting child nodes.
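To make "purity" concrete, here is a minimal sketch of how Gini impurity scores a candidate split. The helper names (`gini`, `split_impurity`) and the toy data are illustrative, not part of scikit-learn:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity after splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy data: petal lengths with species labels (0 = setosa, 1 = versicolor)
petal_length = np.array([1.4, 1.3, 1.5, 4.7, 4.5, 4.9])
species = np.array([0, 0, 0, 1, 1, 1])

# Before splitting, the node is maximally impure for two balanced classes.
print(gini(species))                                   # 0.5

# The split PetalLength <= 2.45 separates the classes perfectly,
# so the weighted impurity drops to zero.
print(split_impurity(petal_length, species, 2.45))     # 0.0
```

The tree builder evaluates many such thresholds per feature and keeps the one with the largest impurity decrease.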
The main hyperparameters for controlling tree growth (pre-pruning) are max_depth, min_samples_split, min_samples_leaf, and max_leaf_nodes.

# Step 1: Imports
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import classification_report, confusion_matrix
# Step 2: Data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Step 3: Model + Hyperparameters
clf = DecisionTreeClassifier(random_state=42)
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 2, 3],
}
grid = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
best_tree = grid.best_estimator_
y_pred = best_tree.predict(X_test)
print("Best params:", grid.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Optional: Text view of the tree
print(export_text(best_tree, feature_names=[f"x{i}" for i in range(X.shape[1])]))
Tip: Use export_text for a quick textual view of the tree, or plot_tree (matplotlib) for a graphical one.
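For completeness, a short sketch of the plot_tree option mentioned above. It trains a small tree on iris and writes the diagram to a file (the filename "tree.png" and the depth limit are arbitrary choices for this example):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)

# Render the fitted tree: filled=True colors nodes by majority class
fig, ax = plt.subplots(figsize=(10, 6))
plot_tree(tree, filled=True, feature_names=iris.feature_names,
          class_names=iris.target_names, ax=ax)
plt.savefig("tree.png")  # or plt.show() in an interactive session
```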
How do max_depth and min_samples_leaf reduce overfitting? Limiting depth caps the number of questions the tree can ask, and requiring a minimum number of samples per leaf stops it from carving out tiny, noise-driven regions of the feature space.
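A quick way to see this effect is to compare an unconstrained tree against a pruned one on the same train/test split; the specific values (max_depth=3, min_samples_leaf=3) are just example settings:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Unconstrained tree: grows until every leaf is pure, so it fits the
# training set perfectly (and can memorize noise).
deep = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pruned tree: shallow depth and a minimum leaf size limit complexity.
pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=3, random_state=42
).fit(X_train, y_train)

print("deep   train/test:", deep.score(X_train, y_train), deep.score(X_test, y_test))
print("pruned train/test:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))
```

The unconstrained tree scores 100% on the training set; the pruned tree trades a little training accuracy for a simpler model that generalizes at least as well on small, noisy datasets.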