A simple, powerful algorithm that classifies a point based on the majority label of its K nearest neighbors. No training phase — just data and distance.
For a new point, find its K closest points by a distance metric (e.g., Euclidean). Assign the class by majority vote.
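The find-neighbors-then-vote step can be sketched from scratch in a few lines (a minimal NumPy illustration; `knn_predict` and the toy arrays are hypothetical names for this sketch, not part of any library):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two well-separated clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.05, 0.05]), k=3))  # → 0
print(knn_predict(X_train, y_train, np.array([5.0, 5.0]), k=3))    # → 1
```

In practice you would use KNeighborsClassifier instead, which adds efficient neighbor search (KD-tree/ball-tree) on top of the same idea.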
Common metrics: Euclidean, Manhattan. Always scale features (e.g., with StandardScaler) so that features measured in large units don't dominate the distance.
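A quick sketch of why scaling matters: when features sit on very different scales, one column can dominate the Euclidean distance entirely (the income/age values below are made-up toy data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: income (~tens of thousands) and age.
X = np.array([[50_000.0, 25.0],
              [51_000.0, 60.0],
              [90_000.0, 26.0]])

# Unscaled: the income column swamps the distance.
d_unscaled_01 = np.linalg.norm(X[0] - X[1])  # ~1000, despite a 35-year age gap
d_unscaled_02 = np.linalg.norm(X[0] - X[2])  # ~40000

# After standardization, both features contribute comparably.
Xs = StandardScaler().fit_transform(X)
d_scaled_01 = np.linalg.norm(Xs[0] - Xs[1])
d_scaled_02 = np.linalg.norm(Xs[0] - Xs[2])
print(d_unscaled_02 / d_unscaled_01)  # huge ratio: income alone decides
print(d_scaled_02 / d_scaled_01)      # close to 1: both features matter
```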
Small K → low bias, high variance (noisy decision boundary). Large K → high bias, low variance (smoother boundary). Use cross-validation to pick K.
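The trade-off becomes visible if you sweep K and compare cross-validated accuracy (a quick sketch on the same wine dataset used in the full example below; exact scores depend on the folds):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
for k in [1, 5, 15, 45]:
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipe, X, y, cv=5)  # 5-fold CV accuracy
    print(f"k={k:2d}  mean CV accuracy={scores.mean():.3f}")
```

GridSearchCV (used below) automates exactly this sweep and also records the best setting.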
# Step 1: Imports
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
# Step 2: Data
X, y = load_wine(return_X_y=True)
# Step 3: Pipeline (scale -> KNN)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
# Step 4: Hyperparameter search for K
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11], "knn__weights": ["uniform", "distance"]}
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
# Step 5: Evaluate
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Best params:", grid.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Tip: Use a Pipeline so scaling is applied inside CV folds (prevents data leakage).
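To see the difference, compare scaling the whole dataset up front against scaling inside a Pipeline (a sketch; on this small, clean dataset the two scores may come out nearly identical, but only the pipeline version is methodologically sound, and the gap can matter on real data):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# Leaky: the scaler sees the full dataset, including rows that later
# become validation folds inside cross_val_score.
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(KNeighborsClassifier(), X_leaky, y, cv=5).mean()

# Safe: the Pipeline refits the scaler on each training fold only.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
safe = cross_val_score(pipe, X, y, cv=5).mean()
print(f"leaky CV={leaky:.3f}  pipeline CV={safe:.3f}")
```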
