🤖 Classification Algorithm in Machine Learning क्या है?
आसान भाषा में समझें — क्या है Classification, कहाँ use होता है, और क्यों Data Science students के लिए यह बेसिक skill है। साथ में छोटे real-world examples और quick visual समझ।
🎯 Quick snapshot — क्यों Classification सीखें?
- High demand: Business & healthcare models में classification सबसे ज़्यादा इस्तेमाल होता है।
- Real problems solved: Spam detection, loan approval, disease diagnosis — सब classification पर निर्भर।
- Easy to start: Simple models जैसे Logistic Regression और Decision Tree से शुरुआत कर सकते हैं।
📘 Classification क्या है?
Classification का मतलब है — किसी data को एक या कई predefined categories (labels) में बाँटना। Machine Learning में जब model को यह सिखाया जाता है कि नए data को किस category में डालना है, तो यह process classification कहलाती है।
- 📧 Spam Detection: Email → Spam / Not Spam
- 🏦 Loan Decision: Applicant data → Approve / Reject
- 🩺 Disease Prediction: Symptoms → Disease label
⚙️ Classification कैसे काम करता है? (Simple Flow)
Data collect → Clean & preprocess → Features select → Model train → Prediction → Evaluation.
| Use Case | Input | Output (Class) |
|---|---|---|
| Email Filtering | Email text features | Spam / Not Spam |
| Bank Loan | Credit score, income | Approve / Reject |
⚖️ Classification vs Regression (Quick)
अगर result category (discrete) है → Classification. अगर result continuous number है → Regression.
- Binary Classification: 2 classes (Yes/No)
- Multi-class: 3+ classes (e.g., cat/dog/bird)
- Multi-label: एक sample पर multiple labels हो सकते हैं
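ऊपर के तीनों types को code में देखने के लिए एक छोटा sketch — labels (y) की shape ही तीनों में फ़र्क है (values सिर्फ़ hypothetical illustration हैं):
# Hypothetical label examples — सिर्फ़ illustration के लिए
y_binary = [0, 1, 1, 0]                       # Binary: 2 classes (Yes/No)
y_multiclass = ["cat", "dog", "bird", "cat"]  # Multi-class: हर sample पर exactly एक label (3+ options)
y_multilabel = [[1, 0, 1], [0, 1, 0]]         # Multi-label: हर sample पर कई labels (indicator matrix)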
Vista Academy — Practical, project-driven learning for students who want real skills. Continue to Section 2 for algorithm details & Python code.
⚙️ Classification Algorithm Machine Learning में कैसे काम करता है?
Step-by-step flow में समझिए कि Machine Learning model कैसे data से patterns सीखकर सही class predict करता है।
🔁 Step-by-Step Process
- Data Collection: Model को train करने के लिए relevant data collect किया जाता है (जैसे emails, transactions, images आदि)।
- Data Pre-processing: Missing values को handle करना और categorical data को encode करना।
- Feature Selection: ऐसे features choose करना जो output पर ज़्यादा effect डालते हैं।
- Model Training: Dataset को train-test में split करके (70/30 या 80/20) train set पर model fit किया जाता है।
- Prediction: नए data पर model prediction करता है कि कौन-सी class में आएगा।
- Evaluation: Accuracy, Precision, Recall जैसे metrics से model की performance check की जाती है।
| Step | Action | Purpose |
|---|---|---|
| 1️⃣ Data Prep | Clean & format data | Remove noise & bias |
| 2️⃣ Training | Feed data to algorithm | Learn patterns |
| 3️⃣ Testing | Predict unknown data | Check accuracy |
# Basic classification flow — example के तौर पर Iris dataset use किया गया है
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)  # X = features, y = labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=200)  # max_iter थोड़ा बढ़ाया ताकि convergence warning न आए
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
ऊपर code एक basic flow दिखाता है जहाँ model data से pattern सीखकर prediction करता है।
अब आप जान चुके हैं कि classification model कैसे काम करता है। Section 3 में हम देखेंगे — Logistic Regression, KNN, Decision Tree और SVM का पूरा tutorial।
🧩 Types of Classification Algorithms — आसान समझ और Python Examples
नीचे सबसे ज़रूरी classification algorithms दिए हैं — हर एक का intuition, कब use करें, pros/cons और छोटा sklearn-based code snippet.
1. Logistic Regression (लॉजिस्टिक रिग्रेशन)
Simple और interpretable linear model — binary classification के लिए सबसे common। यह probability estimate करता है (sigmoid function) और threshold के आधार पर class predict करता है.
Use when: linear boundary उम्मीद हो, features numeric हों।
Pros: Fast, interpretable, baseline model.
Cons: Non-linear problems में कम perform करता है।
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
2. K-Nearest Neighbors (KNN)
Instance-based algorithm — prediction के लिए closest k points देखता है (distance like Euclidean). Simple और intuitive.
Use when: small dataset, clear clusters हों।
Pros: No training (lazy), simple.
Cons: Big datasets में slow, feature scaling जरूरी।
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
3. Decision Tree (निर्णय वृक्ष)
Tree-like model जो features पर splits बनाकर decision लेता है. Intuitive और easily visualizable.
Use when: interpretability चाहिए और categorical/numeric mix हो।
Pros: Easy to explain, no scaling required.
Cons: Overfitting (deep trees) — pruning जरूरी।
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
4. Random Forest (रैंडम फॉरेस्ट)
Ensemble of decision trees — bagging और feature randomness use करके accuracy और robustness बढ़ाता है.
Use when: strong baseline चाहिए, overfitting कम करना हो।
Pros: High accuracy, robust, feature importance भी देता है।
Cons: Harder to interpret, heavier compute।
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
5. Support Vector Machine (SVM)
Margin-based classifier — best separating hyperplane ढूँढता है. Kernels से non-linear boundaries भी handle करता है.
Use when: high-dimensional space, clear margin expected।
Pros: Effective in complex spaces, margin maximization।
Cons: Slow on large datasets, kernel tuning जरूरी।
from sklearn.svm import SVC
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
6. Naive Bayes (नाइव बेयज़)
Probabilistic classifier based on Bayes’ theorem — features की independence assume करता है. Text classification में बहुत fast और effective।
Use when: Text classification, high-dimensional sparse data।
Pros: Fast, works well with small datasets।
Cons: Independence assumption realistic नहीं होता।
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
- Fast & interpretable: Logistic Regression, Decision Tree
- Best baseline & accuracy: Random Forest
- Works well on text: Naive Bayes
- High-dim complex boundary: SVM
- Small dataset & intuitive: KNN
अगले सेक्शन में हम हर algorithm का deep-dive करेंगे — math intuition, hyperparameters, और real dataset examples (Iris, Titanic, SMS Spam).
🧠 Logistic Regression — Math Intuition, Python Code और Decision Boundary
Logistic Regression एक simple लेकिन powerful model है binary classification के लिए — इस सेक्शन में हम इसकी theory, code और visualization करेंगे।
Concept & Intuition (आसान भाषा)
Logistic Regression का नाम regression जैसा है, पर यह एक classification model है। इसका goal है probability estimate करना कि sample किसी class (label) में आता है या नहीं — और फिर threshold (जैसे 0.5) के basis पर class assign करना।
Linear combination: z = w₀ + w₁x₁ + w₂x₂ + …
Probability = sigmoid(z) = 1 / (1 + e^(−z))
Decision: if sigmoid(z) ≥ 0.5 → class 1, else class 0.
Loss Function (Log Loss)
Training में हम weights (w) उस तरीके से चुनते हैं जिससे log loss (cross-entropy) minimize हो:
L = −[y log(p) + (1−y) log(1−p)] summed over samples — जहाँ p = sigmoid(z)
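यह formula कैसे काम करता है, उसका एक छोटा numpy sketch नीचे है (assumption: predicted probabilities p पहले से मौजूद हैं) — sklearn के log_loss से cross-check भी किया गया है:
import numpy as np
from sklearn.metrics import log_loss
# Hypothetical labels और predicted probabilities (sigmoid outputs)
y_true = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.4])
# L = mean of −[y·log(p) + (1−y)·log(1−p)]
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(round(manual, 3))               # ≈ 0.400
print(round(log_loss(y_true, p), 3))  # sklearn से वही value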
Use-cases & Assumptions
- Binary classification problems (spam vs not spam, fraud vs legit)।
- Assumes linear relationship between features and log-odds (logit)।
- Features should ideally be scaled for better convergence।
End-to-End Python Example (Iris → Binary)
नीचे पूरा workflow है — data load, preprocessing, train, evaluate और decision boundary plot. (Iris dataset में हम class 0 vs rest बना रहे हैं)
# Logistic Regression - End to End (Iris binary example)
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# 1. Load dataset
iris = load_iris()
X = iris.data[:, :2] # for easy 2D visualization use first two features
y = (iris.target == 0).astype(int) # class 0 vs rest (binary)
# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# 3. Scale features
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# 4. Train model
model = LogisticRegression()
model.fit(X_train_s, y_train)
# 5. Predict & Evaluate
y_pred = model.predict(X_test_s)
print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test_s)[:,1]))
# 6. Decision boundary plot (2D)
xx, yy = np.mgrid[X_train_s[:,0].min()-1:X_train_s[:,0].max()+1:0.02,
X_train_s[:,1].min()-1:X_train_s[:,1].max()+1:0.02]
grid = np.c_[xx.ravel(), yy.ravel()]
probs = model.predict_proba(grid)[:,1].reshape(xx.shape)
plt.figure(figsize=(8,6))
plt.contourf(xx, yy, probs, levels=[0,0.5,1], alpha=0.2)
plt.scatter(X_train_s[:,0], X_train_s[:,1], c=y_train, edgecolor='k', s=50)
plt.title('Decision Boundary (Logistic Regression)')
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.show()
ऊपर का plot code 2D decision boundary दिखाता है — production में जहाँ features ज़्यादा होते हैं, वहाँ plot की बजाय ROC/AUC और probability thresholds पर ध्यान दिया जाता है।
Important Hyperparameters
- C: Inverse regularization strength — छोटी C => strong regularization (less overfitting)
- penalty: ‘l2’ (default) या ‘l1’ (sparse features)
- solver: ‘liblinear’,’saga’,’lbfgs’ — depending on data size and penalty
- class_weight: handle imbalance (e.g., ‘balanced’)
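class_weight='balanced' का असर देखने के लिए एक छोटा sketch — data make_classification से बना hypothetical imbalanced example है (90% vs 10%):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Hypothetical imbalanced dataset: ~90% class 0, ~10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# class_weight='balanced' => minority class के errors पर ज़्यादा penalty
model = LogisticRegression(class_weight='balanced', C=1.0, max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))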
Pros / Cons
- Interpretable (coefficients explain feature impact)
- Fast to train & predict
- Works well as baseline
- Assumes linear decision boundary
- Not ideal for complex non-linear patterns
- Feature scaling often required
Visual & UX Suggestions (for blog)
- Show a small animated SVG of sigmoid function (hover to show formula).
- Interactive decision-boundary demo (2 sliders to adjust coefficients) — embed JS demo or Observable notebook.
- Add an expandable code block so mobile users can view/copy code easily.
अब Logistic Regression का strong foundation बन गया है — अगले सेक्शन में हम KNN और Decision Tree का deep-dive करेंगे (math, pros/cons, aur Python examples).
👥 K-Nearest Neighbors (KNN) Algorithm — Intuition, Steps & Python Example
KNN एक simple लेकिन powerful non-parametric algorithm है जो नए data points को उनके सबसे नज़दीकी पड़ोसियों की classes के आधार पर classify करता है।
🔍 KNN क्या है?
KNN (K-Nearest Neighbors) एक instance-based supervised learning algorithm है। यह training के दौरान कोई explicit model नहीं बनाता (lazy learning) — prediction के समय data points के बीच की दूरी देखकर decision लेता है।
📏 Distance Calculation
सबसे common distance metric है Euclidean Distance:
d(p, q) = √Σ (pᵢ − qᵢ)²
🔢 K की value का असर
- Small k: Model noisy हो सकता है (overfit की ओर)
- Large k: Model smooth हो जाता है पर underfit कर सकता है
🐍 Python Example (Iris Dataset)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
आप n_neighbors को tune करके accuracy और smoothness में balance ला सकते हैं।
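n_neighbors tune करने का एक simple तरीका है cross-validation से अलग-अलग k compare करना — नीचे एक sketch (Iris पर, scaling pipeline के साथ; exact numbers आपके data पर अलग होंगे):
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
X, y = load_iris(return_X_y=True)
for k in [1, 3, 5, 7, 11, 15]:
    # Pipeline: scaling + KNN (KNN में scaling ज़रूरी है)
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"k={k:2d}  mean CV accuracy = {scores.mean():.3f}")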
⚙️ Hyperparameters
- n_neighbors (K): कितने पड़ोसी consider करने हैं
- metric: Distance type (euclidean, manhattan, minkowski)
- weights: ‘uniform’ या ‘distance’ — closer neighbors का ज़्यादा impact
👍 Pros / 👎 Cons
- No training time — instant fit
- Simple and intuitive
- Works on non-linear boundaries
- Slow for large datasets
- Feature scaling required
- Sensitive to irrelevant features
अगले सेक्शन में हम Decision Tree Algorithm को समझेंगे — splits, entropy, Gini index, और pruning के साथ Python visualization.
🌳 Decision Tree (निर्णय वृक्ष) — Intuition, Entropy / Gini और Pruning
Decision Tree एक visual और interpretable model है — इस सेक्शन में हम समझेंगे कैसे split बनते हैं, entropy और gini क्या हैं, pruning क्यों ज़रूरी है और sklearn code के साथ visualization.
💡 Intuition — Tree कैसे काम करता है?
Decision Tree data को condition-based rules में बाँटता है — हर node पर एक feature के आधार पर split होता है। एक simple आईडिया: जैसे आप एक fruit बेच रहे हों — पहले पूछो “क्या fruit लाल है?” → हाँ तो next question, नहीं तो अलग branch। इसी तरह tree classify करता है।
🧮 Split Criteria — Entropy और Gini
Split choose करने के लिए algorithm दो popular impurity metrics use करता है: Entropy (Information Gain) और Gini Impurity.
H(S) = − Σ p(i) log₂ p(i)
जहाँ p(i) किसी class का probability है।
Information Gain = H(parent) − weighted avg H(children).
Gini = 1 − Σ p(i)²
दोनों metrics impurity कम करने वाली splits चुनते हैं — Gini थोड़ा faster है, entropy थोड़ा theoretically grounded।
छोटा Example (Hands-on)
मान लीजिए node में 10 samples हैं — 6 class A और 4 class B।
Entropy = −(0.6 log₂0.6 + 0.4 log₂0.4) ≈ 0.971 bits.
Gini = 1 − (0.6² + 0.4²) = 0.48.
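इसी 6/4 वाले example को Python में verify करने का छोटा sketch:
import numpy as np
p = np.array([0.6, 0.4])                 # class A और class B की proportions
entropy = -np.sum(p * np.log2(p))        # ≈ 0.971 bits
gini = 1 - np.sum(p ** 2)                # 0.48
print(round(entropy, 3), round(gini, 3))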
⚠️ Overfitting और Pruning
Decision trees आसानी से overfit कर लेते हैं अगर depth ज़्यादा हो। इसलिए pruning और constraints ज़रूरी हैं — जैसे max_depth, min_samples_leaf, min_samples_split।
- Pre-pruning: tree grow करने से पहले limits (max_depth आदि) लगाना
- Post-pruning: पूरा tree grow करके फिर low-importance branches हटाना
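sklearn में post-pruning cost-complexity pruning (ccp_alpha) से होती है — नीचे एक छोटा sketch (Iris पर) जो अलग-अलग alpha values पर pruned tree की test accuracy print करता है:
# Post-pruning sketch: cost-complexity pruning (ccp_alpha)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# पहले full tree से pruning path (possible alpha values) निकालें
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
for alpha in path.ccp_alphas[::2]:       # हर दूसरा alpha try करें
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  test accuracy={pruned.score(X_test, y_test):.3f}")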
🐍 Python Example (Decision Tree with Visualization)
# Decision Tree - train, visualize & evaluate (Iris example)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
# Load data (use first two features for visualization)
iris = load_iris()
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Train with pre-pruning
model = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
# Visualize tree
plt.figure(figsize=(12,8))
plot_tree(model, feature_names=iris.feature_names[:2], class_names=iris.target_names, filled=True, rounded=True)
plt.show()
ऊपर plot_tree function से आप tree का visual देख सकते हैं — हर split पर feature और threshold दिखेगा।
🔑 Feature Importance
Decision trees feature importance निकालते हैं — यह बताता है कि कौन-सा feature splitting में ज़्यादा use हुआ और prediction में ज़्यादा contribute करता है।
# Feature importance example
for name, score in zip(iris.feature_names[:2], model.feature_importances_):
    print(name, round(score, 3))
🔧 Important Hyperparameters
- max_depth: Tree की maximum depth — overfitting control
- min_samples_split: Node split करने के लिए minimum samples
- min_samples_leaf: Leaf node में minimum samples
- criterion: ‘gini’ या ‘entropy’
- random_state: Reproducibility
👍 Pros / 👎 Cons
- Easy to interpret & visualize
- No feature scaling required
- Works with mixed (categorical + numeric) data
- Easy to overfit without pruning
- Small changes in data can change the tree structure
- Not as accurate as ensembles (Random Forest, XGBoost) on many tasks
Visual & UX Suggestions (for blog)
- Embed an interactive tree explorer (collapse/expand nodes) — use d3.js or Observable.
- Show side-by-side: raw data → split chosen → resulting child nodes (animated).
- Add small tooltip for entropy/gini formula when hovering over a split node.
Decision Tree समझना बहुत ज़रूरी है क्योंकि यह interpretability देता है — अगले सेक्शन में हम Random Forest और ensemble techniques पर detailed work करेंगे।
🌲 Random Forest Algorithm — Bagging, Feature Importance और Python Code
Random Forest एक ensemble learning method है जो कई Decision Trees को मिलाकर accuracy बढ़ाता है और overfitting घटाता है। इसे समझना किसी भी Machine Learning aspirant के लिए must-have है।
🌳 Random Forest क्या है?
Random Forest कई Decision Trees का combination होता है। हर tree data के random subset और features के random subset पर train होता है, जिससे variance कम और generalization better होती है।
🧩 Working Process (Bagging Concept)
- Training data से कई random subsets बनाए जाते हैं।
- हर subset पर एक decision tree train होता है।
- Final prediction → majority vote (classification) या average (regression)।
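Bagging का एक bonus है Out-of-Bag (OOB) evaluation — हर tree के bootstrap sample से बाहर रहे points पर free validation मिल जाती है। नीचे एक छोटा sketch (Iris पर, सिर्फ़ illustration के लिए):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
X, y = load_iris(return_X_y=True)
# oob_score=True: bootstrap से बाहर रहे samples पर built-in validation score
model = RandomForestClassifier(n_estimators=200, oob_score=True, bootstrap=True, random_state=42)
model.fit(X, y)
print("OOB accuracy:", round(model.oob_score_, 3))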
🐍 Python Code (Iris Dataset)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Train Random Forest
model = RandomForestClassifier(
n_estimators=100, max_depth=5, random_state=42, criterion='gini'
)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Feature Importance
import pandas as pd
importance = pd.Series(model.feature_importances_, index=["Sepal L", "Sepal W", "Petal L", "Petal W"])
print(importance.sort_values(ascending=False))
⚙️ Key Hyperparameters
- n_estimators: Tree की संख्या (default = 100)
- max_depth: Individual tree की depth
- max_features: Split करने के लिए random feature count
- criterion: “gini” या “entropy”
- bootstrap: Sampling with replacement (True by default)
👍 Pros / 👎 Cons
- High accuracy and stability
- Less overfitting than a single Decision Tree
- Robust to noise and outliers
- Complex to interpret (black box)
- Computationally expensive for large datasets
- May require tuning for optimal results
अब आप Ensemble Learning की foundation समझ चुके हैं — अगले सेक्शन में हम Support Vector Machine (SVM) के concepts, kernel tricks और margin theory सीखेंगे।
⚔️ Support Vector Machine (SVM) — Margin Theory, Kernel Trick और Python Visualization
SVM एक powerful margin-based classifier है — high-dimensional और non-linear problems में kernel trick से शानदार काम करता है। चलिए step-by-step समझते हैं।
🔎 Basic Intuition — क्या है SVM?
SVM classification के लिए सबसे अच्छी separating hyperplane खोजता है — जो दो classes के बीच **maximum margin** रखे। Margin को maximize करने से generalization बेहतर होती है।
- Hyperplane: Decision boundary (line in 2D, plane in 3D).
- Margin: Distance between hyperplane and nearest points of classes.
- Support Vectors: वो points जो margin को define करते हैं (closest points).
🧮 Math (Short)
SVM solves optimization: minimize ||w|| subject to yᵢ (w·xᵢ + b) ≥ 1 (for hard-margin). Soft-margin allows slack variables ξᵢ and penalizes them with parameter C.
minimize (1/2)||w||² + C Σ ξᵢ — जहाँ C trade-off है margin width और misclassification penalties के बीच।
🪄 Kernel Trick — Non-linear Problems का जादू
Kernel trick से हम inputs को higher-dimensional space में project कर सकते हैं बिना explicit mapping के — और वहाँ linear separator ढूँढ लें। Common kernels:
- linear — simple linear separator
- rbf (Gaussian) — flexible, good default for many tasks
- poly — polynomial relations
- sigmoid — neural-network like
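अलग-अलग kernels का rough comparison करने के लिए एक छोटा sketch (Iris, scaled features, बिना tuning के — सिर्फ़ idea के लिए):
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
    pipe = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{kernel:8s} mean CV accuracy = {scores.mean():.3f}")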
🧭 When to use SVM?
- Medium-sized datasets (not extremely huge).
- High-dimensional feature spaces (text data, TF-IDF).
- When margin-based robustness is desired.
🐍 Python Example (2D Decision Boundary with RBF Kernel)
# SVM - train and 2D decision boundary (Iris binary example)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
iris = load_iris()
X = iris.data[:, :2] # use first two features for visualization
y = (iris.target != 0).astype(int) # binary: class 0 vs rest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
model = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True)
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
# Decision boundary
xx, yy = np.meshgrid(np.linspace(X_train_s[:,0].min()-1, X_train_s[:,0].max()+1, 300),
np.linspace(X_train_s[:,1].min()-1, X_train_s[:,1].max()+1, 300))
Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(8,6))
plt.contourf(xx, yy, Z, levels=50, cmap='RdYlBu', alpha=0.6)
plt.contour(xx, yy, Z, levels=[0], colors='k', linewidths=1) # decision boundary
plt.scatter(X_train_s[:,0], X_train_s[:,1], c=y_train, edgecolor='k', s=50)
plt.title('SVM Decision Boundary (RBF Kernel)')
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.show()
ऊपर decision_function contour से आप margin और boundary दोनों देख सकते हैं — decision boundary वो contour level है जहाँ function = 0।
⚙️ Key Hyperparameters
- C: Regularization — बड़ी C => low bias, high variance (less regularization)
- kernel: ‘linear’,’rbf’,’poly’,’sigmoid’
- gamma: For RBF/poly — scale of influence (auto/scale or float)
- class_weight: handle imbalance (‘balanced’)
👍 Pros / 👎 Cons
- Effective in high-dimensional spaces
- Works well with clear margin separation
- Robust with kernel trick for non-linear data
- Slow on very large datasets (computationally heavy)
- Requires careful hyperparameter tuning (C, gamma)
- Less interpretable than simple linear models
💡 Practical Tips
- Always scale features before SVM (StandardScaler).
- Start with kernel=’rbf’ and tune C & gamma via GridSearchCV.
- For text classification, use linear kernel with sparse TF-IDF features (fast & effective).
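ऊपर वाली GridSearchCV tip का एक छोटा sketch — C और gamma की grid सिर्फ़ example values हैं, अपने data के हिसाब से adjust करें:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='rbf'))])
# Hypothetical grid — range अपने data के हिसाब से बदलें
param_grid = {'svc__C': [0.1, 1, 10, 100], 'svc__gamma': ['scale', 0.01, 0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
print("Best params:", grid.best_params_, " Best CV score:", round(grid.best_score_, 3))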
SVM समझना advanced ML के लिए helpful है — अगले सेक्शन में हम Naive Bayes और text-classification techniques पर practical tutorial करेंगे।
📊 Naive Bayes Algorithm — Bayes’ Theorem, Text Classification और Python Example
Naive Bayes एक probabilistic classifier है जो Bayes’ theorem पर आधारित है। यह fast, scalable और text data (spam detection, sentiment analysis) के लिए बहुत useful algorithm है।
🧠 Basic Intuition — Bayes’ Theorem
Bayes’ theorem किसी hypothesis (class) की probability को data evidence के आधार पर update करने का तरीका है।
P(Class | Features) = [ P(Features | Class) × P(Class) ] / P(Features)
यानी किसी data point के किसी class से belong करने की संभावना proportional होती है कि उस class में ऐसे features कितनी बार देखे गए हैं।
📚 Types of Naive Bayes
- Gaussian Naive Bayes: Continuous data के लिए (assumes normal distribution)
- Multinomial Naive Bayes: Count data (जैसे word frequency) के लिए
- Bernoulli Naive Bayes: Binary features (जैसे presence/absence of a word)
🐍 Python Example — SMS Spam Detection
# Naive Bayes - SMS Spam Classifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Sample data (replace with sms_spam.csv)
data = {'text': ["Free entry in 2 a wkly comp!", "Hey, are you free tonight?", "Win cash now!!!", "Let's go for dinner"],
'label': ["spam","ham","spam","ham"]}
df = pd.DataFrame(data)
# Split data
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.3, random_state=42)
# Convert text to numeric using CountVectorizer
cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)
# Train model
model = MultinomialNB()
model.fit(X_train_cv, y_train)
# Predict
y_pred = model.predict(X_test_cv)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
MultinomialNB text features (word counts or TF-IDF) के साथ सबसे अच्छा काम करता है और spam detection में industry standard है।
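CountVectorizer की जगह TF-IDF use करना हो तो same idea Pipeline में bundle हो जाता है — नीचे एक minimal sketch (ऊपर वाले ही 4 toy messages पर, सिर्फ़ illustration के लिए):
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
texts = ["Free entry in 2 a wkly comp!", "Hey, are you free tonight?", "Win cash now!!!", "Let's go for dinner"]
labels = ["spam", "ham", "spam", "ham"]
# Pipeline: text → TF-IDF features → Naive Bayes (alpha = Laplace smoothing)
clf = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(texts, labels)
print(clf.predict(["Win a free prize now", "see you at dinner"]))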
⚙️ Key Hyperparameters
- alpha: Laplace smoothing (default = 1.0)
- fit_prior: Prior class probabilities consider करना है या नहीं
- class_prior: Custom class probability manually set करना
👍 Pros / 👎 Cons
- Very fast, even on large datasets
- Works great for text & document classification
- Works reasonably well even with small training data
- Assumes feature independence (rare in real-world data)
- Weak when numeric features are strongly correlated
- Less interpretable than linear models
अब आपने सभी major classification algorithms सीख लिए — अगले सेक्शन में हम **Model Evaluation Metrics** जैसे Accuracy, Precision, Recall, F1-score और ROC curve को detail में सीखेंगे।
📊 Model Evaluation Metrics — Confusion Matrix, Precision, Recall, F1 और ROC (समझें और लागू करें)
Classification models को सही से evaluate करने के लिए सिर्फ accuracy देखना अक्सर भ्रमित करने वाला होता है — इस सेक्शन में हम सभी important metrics आसान Hinglish में समझेंगे और Python में कैसे plot/interpret करें दिखाएंगे।
🧾 Confusion Matrix — आधार (TP, FP, TN, FN)
Confusion matrix एक 2×2 table है (binary case) जो model के predictions और actual values को मिलाकर दिखाती है — इससे आप समझ पाते हैं कहाँ model गलत कर रहा है।
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
यह matrix निम्न metrics के लिए base है — आइए formula और intuition देखें:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP) — predicted positive में से कितने सही थे
- Recall (Sensitivity) = TP / (TP + FN) — actual positive में से कितने correct पकड़े
- F1-score = 2 * (Precision * Recall) / (Precision + Recall) — precision & recall का harmonic mean
👉 **कब कौन सा metric देखें?**
- अगर classes balanced हों और FP/FN की cost similar हो → accuracy ठीक है।
- अगर class imbalance हो (fraud detection, rare disease) → **precision/recall** ज़्यादा meaningful हैं।
- F1 तब useful है जब precision और recall दोनों important हों।
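Formulas को concrete करने के लिए एक छोटा sketch — hypothetical TP/FP/FN/TN counts से metrics हाथ से compute करना:
# Hypothetical counts (सिर्फ़ illustration के लिए)
TP, FP, FN, TN = 40, 10, 5, 45
accuracy  = (TP + TN) / (TP + TN + FP + FN)       # 0.85
precision = TP / (TP + FP)                         # 0.80
recall    = TP / (TP + FN)                         # ≈ 0.889
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, round(recall, 3), round(f1, 3))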
📈 ROC Curve & AUC
ROC (Receiver Operating Characteristic) curve true positive rate (TPR = recall) को false positive rate (FPR = FP/(FP+TN)) के against plot करती है। AUC (Area Under Curve) model की overall ranking ability बताता है — 0.5 = random, 1.0 = perfect.
- AUC ≈ 0.7 to 0.8 — fair
- AUC ≈ 0.8 to 0.9 — good
- AUC > 0.9 — excellent
🐍 Python Code — Confusion Matrix, Classification Report & ROC
# Evaluation Example: confusion matrix, classification report and ROC/AUC
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns
# assume X, y already prepared (binary)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:,1]
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\\n", cm)
# Classification Report
print(classification_report(y_test, y_pred))
# ROC & AUC
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)
print("AUC:", auc_score)
# Plot Confusion Matrix (heatmap)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='YlOrBr', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Plot ROC Curve
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f'AUC = {auc_score:.3f}')
plt.plot([0,1],[0,1],'--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.show()
**Note:** seaborn used for nicer heatmap — अगर आप blog में charts दिखाना चाहते हैं, तब images save करके blog में embed करें (PNG/SVG)। Mobile users के लिए small-size images optimized रखें।
🔁 Multi-class Evaluation
Multi-class में metrics को average करने के तरीके होते हैं:
- **macro**: हर class का metric निकालकर simple average (हर class को equal weight)
- **micro**: सभी classes के TP/FP/FN को globally pool करके metric निकालना
- **weighted**: हर class के support (sample count) के हिसाब से weighted average
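तीनों averaging options का फ़र्क देखने के लिए एक छोटा sketch (hypothetical 3-class predictions पर):
from sklearn.metrics import f1_score
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]
for avg in ['macro', 'micro', 'weighted']:
    print(avg, round(f1_score(y_true, y_pred, average=avg), 3))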
💡 Practical Tips
- Imbalanced data में accuracy misleading होती है — prefer precision/recall or AUC.
- Use confusion matrix to find error types (FP vs FN) और business cost के हिसाब से threshold adjust करें।
- ROC से threshold-independent performance पता चलता है — लेकिन जब classes heavily imbalanced हों, तो Precision-Recall curve ज़्यादा informative हो सकती है (नीचे sketch देखें)।
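Precision-Recall curve का एक छोटा sketch (assumption: y_test और y_proba ऊपर के ROC example से already available हैं):
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt
# y_test, y_proba ऊपर वाले evaluation example से (assumption)
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)
plt.figure(figsize=(6,4))
plt.plot(recall, precision, label=f'AP = {ap:.3f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()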
अब आप model evaluation के key metrics जानते हैं — अगले सेक्शन में हम “Handling Imbalanced Data” (SMOTE, class weights, undersampling) पर practical solutions देखेंगे।
