A simple, powerful algorithm that classifies a point based on the majority label of its K nearest neighbors. No training phase — just data and distance.
For a new point, find its K closest points by a distance metric (e.g., Euclidean). Assign the class by majority vote.
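The find-neighbors-then-vote step can be sketched from scratch in a few lines (a minimal NumPy illustration; `knn_predict` and the toy arrays are hypothetical names for this sketch, not part of any library):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two well-separated clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.05, 0.05]), k=3))  # → 0
print(knn_predict(X_train, y_train, np.array([5.0, 5.0]), k=3))    # → 1
```

In practice you would use KNeighborsClassifier instead, which adds efficient neighbor search (KD-tree/ball-tree) on top of the same idea.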
Common metrics: Euclidean, Manhattan. Always scale features (e.g., with StandardScaler) so that features measured in large units don't dominate the distance.
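A quick sketch of why scaling matters: when features sit on very different scales, one column can dominate the Euclidean distance entirely (the income/age values below are made-up toy data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: income (~tens of thousands) and age.
X = np.array([[50_000.0, 25.0],
              [51_000.0, 60.0],
              [90_000.0, 26.0]])

# Unscaled: the income column swamps the distance.
d_unscaled_01 = np.linalg.norm(X[0] - X[1])  # ~1000, despite a 35-year age gap
d_unscaled_02 = np.linalg.norm(X[0] - X[2])  # ~40000

# After standardization, both features contribute comparably.
Xs = StandardScaler().fit_transform(X)
d_scaled_01 = np.linalg.norm(Xs[0] - Xs[1])
d_scaled_02 = np.linalg.norm(Xs[0] - Xs[2])
print(d_unscaled_02 / d_unscaled_01)  # huge ratio: income alone decides
print(d_scaled_02 / d_scaled_01)      # close to 1: both features matter
```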
Small K → low bias, high variance (noisy decision boundary). Large K → high bias, low variance (smoother boundary). Use cross-validation to pick K.
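The trade-off becomes visible if you sweep K and compare cross-validated accuracy (a quick sketch on the same wine dataset used in the full example below; exact scores depend on the folds):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
for k in [1, 5, 15, 45]:
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipe, X, y, cv=5)  # 5-fold CV accuracy
    print(f"k={k:2d}  mean CV accuracy={scores.mean():.3f}")
```

GridSearchCV (used below) automates exactly this sweep and also records the best setting.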
# Step 1: Imports
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
# Step 2: Data
X, y = load_wine(return_X_y=True)
# Step 3: Pipeline (scale -> KNN)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
# Step 4: Hyperparameter search for K
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11], "knn__weights": ["uniform", "distance"]}
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
# Step 5: Evaluate
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Best params:", grid.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Tip: Use a Pipeline so scaling is applied inside CV folds (prevents data leakage).
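To see the difference, compare scaling the whole dataset up front against scaling inside a Pipeline (a sketch; on this small, clean dataset the two scores may come out nearly identical, but only the pipeline version is methodologically sound, and the gap can matter on real data):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# Leaky: the scaler sees the full dataset, including rows that later
# become validation folds inside cross_val_score.
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(KNeighborsClassifier(), X_leaky, y, cv=5).mean()

# Safe: the Pipeline refits the scaler on each training fold only.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
safe = cross_val_score(pipe, X, y, cv=5).mean()
print(f"leaky CV={leaky:.3f}  pipeline CV={safe:.3f}")
```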
