Beginner Projects in Data Science (Step-by-Step Guide with Outcomes)
Start your data science journey with 5 hands-on projects. Learn classification, regression, NLP, time series, and healthcare analytics — with datasets, Python code, and expected outcomes.
The fastest way to become a data scientist is by building real projects. This guide takes you through five beginner-friendly projects step by step. You’ll clean data, engineer features, train models, evaluate performance, and understand the results — just like a professional data analyst or data scientist would. Each project comes with Python code and clear outcomes so you can practice, learn, and add them to your portfolio.
🛠️ Prerequisites & Toolkit
Before jumping into the projects, make sure you have the right setup. Don’t worry—these requirements are beginner-friendly and free to use.
🐍 Python
Install Python 3.8+ via Anaconda or python.org.
📓 Jupyter Notebook
Use Jupyter Notebook / JupyterLab to run code step by step, visualize results, and annotate your workflow.
📦 Python Libraries
Install these essentials:
• pandas, numpy (data handling)
• matplotlib, seaborn (visualization)
• scikit-learn (ML models & preprocessing)
• nltk, textblob (for the NLP project)
📂 Datasets
We’ll use free public datasets from Kaggle and other open sources. Each project section will link directly to the dataset.
📊 How We Evaluate Outcomes
Data science isn’t just about building models — it’s about measuring how well they perform. For each project in this guide, we’ll define clear metrics so you know exactly when your solution is working.
✅ Classification
Metrics: Accuracy, Precision, Recall, F1-score. Example: Did our Titanic model correctly predict who survived?
📈 Regression
Metrics: RMSE (Root Mean Squared Error), MAE, and R². Example: How close were our predicted house prices to the real ones?
📝 NLP
Metrics: Accuracy, F1, and interpretability of top keywords. Example: Does our sentiment analysis correctly classify positive vs negative reviews?
⏳ Time Series
Metrics: RMSE on forecasted values vs actuals. Example: How well do our sales forecasts match reality?
🏥 Healthcare
Metrics: ROC-AUC, Recall (catching positives), Precision. Example: Does our diabetes prediction catch most true cases while avoiding false alarms?
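To make these metrics concrete, here is a minimal sketch of how each one is computed with scikit-learn. The tiny hand-made arrays below are illustrative stand-ins for real model predictions, not output from any of the projects:

# Illustrative only: small made-up arrays stand in for real model output.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             mean_squared_error, mean_absolute_error, r2_score, roc_auc_score)

# Classification-style metrics (Titanic, sentiment, diabetes)
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))

# Regression / forecasting metrics (house prices, sales)
y_true_reg = np.array([200.0, 150.0, 320.0])
y_pred_reg = np.array([210.0, 140.0, 300.0])
print("RMSE:", np.sqrt(mean_squared_error(y_true_reg, y_pred_reg)))
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
print("R²:", r2_score(y_true_reg, y_pred_reg))

# Healthcare ranking metric: ROC-AUC scores predicted probabilities, not hard labels
y_scores = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.7])
print("ROC-AUC:", roc_auc_score(y_true, y_scores))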
🚢 Project 1: Titanic Survival Prediction (Classification)
One of the most popular beginner projects in data science. The goal is to predict whether a passenger survived the Titanic disaster based on features like age, gender, ticket class, and family size.
📂 Dataset
Download the dataset from Kaggle: Titanic: Machine Learning from Disaster.
Files: train.csv, test.csv.
🔎 Step-by-Step Process
- Load & Inspect: read CSV, explore columns, check missing values.
- Target & Features: Target = Survived; drop ID/noisy fields.
- Handle Missing Values: impute age (median), embarked (mode).
- Feature Engineering: create FamilySize, IsAlone.
- Encode Categoricals: convert sex/embarked into numeric.
- Split Train/Test: evaluate with hold-out data.
- Train Model: Logistic Regression as baseline, Random Forest for better results.
- Evaluate: Accuracy, F1-score, confusion matrix.
💻 Complete Python Code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 1) Load
df = pd.read_csv("train.csv")

# 2) Target & drop irrelevant
y = df["Survived"]
X = df.drop(columns=["Survived", "PassengerId", "Name", "Ticket", "Cabin"])

# 3) Handle missing
X["Age"] = X["Age"].fillna(X["Age"].median())
X["Embarked"] = X["Embarked"].fillna(X["Embarked"].mode()[0])

# 4) Feature engineering
X["FamilySize"] = X["SibSp"] + X["Parch"] + 1
X["IsAlone"] = (X["FamilySize"] == 1).astype(int)
X = X.drop(columns=["SibSp", "Parch"])

# 5) Encode categoricals
le = LabelEncoder()
X["Sex"] = le.fit_transform(X["Sex"])
X["Embarked"] = le.fit_transform(X["Embarked"])

# 6) Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 7) Models
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)
pred_log = logreg.predict(X_test)

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

# 8) Evaluation
print("Logistic Regression Accuracy:", accuracy_score(y_test, pred_log))
print(confusion_matrix(y_test, pred_log))
print(classification_report(y_test, pred_log))

print("\nRandom Forest Accuracy:", accuracy_score(y_test, pred_rf))
print(confusion_matrix(y_test, pred_rf))
print(classification_report(y_test, pred_rf))
📊 Expected Outcomes
- ✔️ Logistic Regression baseline: ~78–80% accuracy
- ✔️ Random Forest: ~81–84% accuracy (higher recall for survivors)
- ✔️ Confusion Matrix: see survival vs non-survival predictions
- ✔️ Learnings: importance of gender, class, and family size in survival
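To see which features drove those predictions, you can inspect the Random Forest's feature importances. A minimal sketch, assuming the fitted rf model and the feature DataFrame X from the code above are still in memory:

# Rank Titanic features by Random Forest importance.
# Assumes `rf` (fitted RandomForestClassifier) and `X` (feature DataFrame) exist as above.
import pandas as pd

importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
# Sex, Fare, Age and Pclass typically rank near the top, echoing the learnings above.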
🏠 Project 2: House Price Prediction (Regression)
Predict the SalePrice of homes using features like quality, living area, year built, and garage size. This project teaches supervised regression, data preprocessing, one-hot encoding, and model evaluation using RMSE and R².
📂 Dataset
Download from Kaggle: House Prices – Advanced Regression Techniques.
Place train.csv in your working folder.
🔎 Step-by-Step Process
- Load & Inspect: understand datatypes, missing values, and target skew.
- Target Transform: apply log1p to SalePrice for normality.
- Split Columns: identify numeric vs categorical features.
- Preprocess: median-impute numerics; most-frequent-impute + one-hot-encode categoricals.
- Train/Test Split: hold out 20% for honest evaluation.
- Models: baseline LinearRegression, improve with Ridge or RandomForestRegressor.
- Evaluate: report RMSE (log & $) and R²; show predicted vs actual intuition.
- Interpret: examine top features (quality, living area, garage) for stakeholder insights.
💻 Complete Python Code (Pipeline + Metrics)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 1) Load
df = pd.read_csv("train.csv")  # Kaggle House Prices train.csv

# 2) Target transform
y = np.log1p(df["SalePrice"])  # log(1 + price)
X = df.drop(columns=["SalePrice", "Id"])

# 3) Identify column types
num_cols = X.select_dtypes(include=["int64", "float64"]).columns
cat_cols = X.select_dtypes(include=["object"]).columns

# 4) Preprocessing (pipelines avoid leakage)
num_pipe = SimpleImputer(strategy="median")
cat_pipe = Pipeline([
    ("imp", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])
preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

# 5) Train/Test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# ---- Baseline Linear Regression ----
lin = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
lin.fit(X_tr, y_tr)
pred_log_lin = lin.predict(X_te)
rmse_log_lin = np.sqrt(mean_squared_error(y_te, pred_log_lin))
r2_lin = r2_score(y_te, pred_log_lin)

# invert to dollars for intuition
pred_lin_d = np.expm1(pred_log_lin)
y_te_d = np.expm1(y_te)
rmse_lin_d = np.sqrt(mean_squared_error(y_te_d, pred_lin_d))
print(f"Linear Regression -> RMSE(log): {rmse_log_lin:.4f} | R²: {r2_lin:.3f} | RMSE($): {rmse_lin_d:,.0f}")

# ---- Ridge (regularized linear) ----
ridge = Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=10.0, random_state=42))])
ridge.fit(X_tr, y_tr)
pred_log_ridge = ridge.predict(X_te)
rmse_log_ridge = np.sqrt(mean_squared_error(y_te, pred_log_ridge))
r2_ridge = r2_score(y_te, pred_log_ridge)
print(f"Ridge(alpha=10) -> RMSE(log): {rmse_log_ridge:.4f} | R²: {r2_ridge:.3f}")

# ---- Random Forest (non-linear) ----
rf = Pipeline([("prep", preprocess),
               ("rf", RandomForestRegressor(n_estimators=400, random_state=42, n_jobs=-1))])
rf.fit(X_tr, y_tr)
pred_log_rf = rf.predict(X_te)
rmse_log_rf = np.sqrt(mean_squared_error(y_te, pred_log_rf))
r2_rf = r2_score(y_te, pred_log_rf)
pred_rf_d = np.expm1(pred_log_rf)
rmse_rf_d = np.sqrt(mean_squared_error(y_te_d, pred_rf_d))
print(f"RandomForest -> RMSE(log): {rmse_log_rf:.4f} | R²: {r2_rf:.3f} | RMSE($): {rmse_rf_d:,.0f}")

# ---- Optional: quick hyperparameter search for RF ----
param_dist = {
    "rf__n_estimators": [300, 400, 600],
    "rf__max_depth": [None, 10, 20, 30],
    "rf__min_samples_split": [2, 5, 10],
    "rf__min_samples_leaf": [1, 2, 4]
}
search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=10, cv=3,
                            n_jobs=-1, random_state=42, verbose=0)
search.fit(X_tr, y_tr)
best = search.best_estimator_
best_pred_log = best.predict(X_te)
rmse_log_best = np.sqrt(mean_squared_error(y_te, best_pred_log))
r2_best = r2_score(y_te, best_pred_log)
print("Best RF params:", search.best_params_)
print(f"Best RF -> RMSE(log): {rmse_log_best:.4f} | R²: {r2_best:.3f}")

# ---- Feature Importance (for best RF) ----
feat_names = best.named_steps["prep"].get_feature_names_out()
rf_model = best.named_steps["rf"]
importances = pd.Series(rf_model.feature_importances_, index=feat_names).sort_values(ascending=False)
print("Top 15 features:\n", importances.head(15).round(4))
📊 Expected Outcomes
- ✔️ R²: often ≥ 0.85 with tuned tree-based models.
- ✔️ RMSE (log): ~0.14–0.16; also report RMSE in dollars for business context.
- ✔️ Top Drivers: OverallQual, GrLivArea (living area), GarageCars, TotalBsmtSF, YearBuilt.
- ✔️ Deliverables: 1-page summary, predicted vs actual scatter, feature-importance bar chart.
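Two of those deliverables — the predicted vs actual scatter and the feature-importance bar chart — can be sketched with matplotlib in a few lines. A minimal sketch, assuming y_te_d, pred_rf_d, and importances from the project code above are still in memory:

# Predicted vs actual scatter and feature-importance bar chart.
# Assumes `y_te_d`, `pred_rf_d` and `importances` exist from the code above.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Predicted vs actual sale price (dollar scale)
axes[0].scatter(y_te_d, pred_rf_d, alpha=0.4)
lims = [min(y_te_d.min(), pred_rf_d.min()), max(y_te_d.max(), pred_rf_d.max())]
axes[0].plot(lims, lims, color="red", linestyle="--")  # perfect-prediction line
axes[0].set_xlabel("Actual SalePrice ($)")
axes[0].set_ylabel("Predicted SalePrice ($)")
axes[0].set_title("Predicted vs Actual")

# Top 15 feature importances
importances.head(15).sort_values().plot.barh(ax=axes[1])
axes[1].set_title("Top 15 Feature Importances")

plt.tight_layout()
plt.show()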
📝 Project 3: Sentiment Analysis on Reviews (NLP)
In this project, you’ll classify reviews as positive or negative using Natural Language Processing (NLP). This introduces text cleaning, TF-IDF vectorization, and logistic regression — powerful foundations for any NLP task.
📂 Dataset
Use the IMDb 50K Movie Reviews dataset.
Columns: review (text), sentiment (Positive/Negative).
🔎 Step-by-Step Process
- Load & Inspect: read dataset, check distribution of sentiments.
- Clean Text: remove HTML tags, lowercase, strip extra spaces.
- Split: train/test (80/20, stratified).
- Vectorize: use TfidfVectorizer with unigrams + bigrams.
- Train: fit a LogisticRegression or LinearSVC model.
- Evaluate: Accuracy, F1-score, confusion matrix.
- Interpret: extract top positive/negative words/phrases from model coefficients.
- Test: run predictions on custom sentences.
💻 Complete Python Code
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

# 1) Load
df = pd.read_csv("IMDB_Dataset.csv")
df = df.dropna(subset=["review", "sentiment"]).reset_index(drop=True)

# 2) Clean
def clean_text(x):
    x = BeautifulSoup(x, "html.parser").get_text(" ")
    x = x.lower()
    x = re.sub(r"\s+", " ", x).strip()
    return x

df["review_clean"] = df["review"].apply(clean_text)
df["label"] = (df["sentiment"].str.lower() == "positive").astype(int)

# 3) Split
X_train, X_test, y_train, y_test = train_test_split(
    df["review_clean"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

# 4) Vectorize
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.95)
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

# 5) Model
clf = LogisticRegression(max_iter=200, C=2.0, n_jobs=-1)
clf.fit(X_train_vec, y_train)

# 6) Evaluate
pred = clf.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, pred))
print("F1-score:", f1_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

# 7) Interpret
feat_names = np.array(tfidf.get_feature_names_out())
coefs = clf.coef_[0]
top_pos = feat_names[np.argsort(coefs)[-15:]]
top_neg = feat_names[np.argsort(coefs)[:15]]
print("Top Positive:", top_pos)
print("Top Negative:", top_neg)

# 8) Custom Test
samples = ["Absolutely loved it!", "Waste of time, boring..."]
print(clf.predict(tfidf.transform(samples)))
📊 Expected Outcomes
- ✔️ Accuracy: ~85–90% with TF-IDF + Logistic Regression.
- ✔️ Confusion Matrix: clear split between positive and negative reviews.
- ✔️ Top Positive phrases: “highly recommend”, “great movie”.
- ✔️ Top Negative phrases: “waste of time”, “poor acting”.
- ✔️ Students can test their own reviews and see instant predictions.
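For a friendlier way to test your own reviews, a small helper can wrap cleaning, vectorizing, and probability output in one call. A minimal sketch, assuming clf, tfidf, and clean_text from the code above are still in memory:

# Predict sentiment for arbitrary text, with a confidence score.
# Assumes `clf`, `tfidf` and `clean_text` from the project code are available.
def predict_sentiment(text):
    vec = tfidf.transform([clean_text(text)])
    proba = clf.predict_proba(vec)[0, 1]  # probability of the "positive" class
    label = "positive" if proba >= 0.5 else "negative"
    return label, round(float(proba), 3)

print(predict_sentiment("One of the best films I have seen this year."))
print(predict_sentiment("Terrible pacing and wooden acting."))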
⏳ Project 4: Retail Sales Forecasting (Time Series)
Forecast future retail sales using time series data. This project introduces trend analysis, seasonality, moving averages, and ARIMA models — essential skills for demand forecasting, inventory planning, and financial projections.
📂 Dataset
Download a sample dataset: Daily Historical Retail Sales (Kaggle).
Columns: Date, Sales.
🔎 Step-by-Step Process
- Load & Parse Dates: convert Date to a datetime index.
- Visualize: line plots to check trend & seasonality.
- Smoothing: moving average to understand overall patterns.
- Stationarity Check: Augmented Dickey–Fuller (ADF) test (see the sketch after the main code below).
- Model: build an ARIMA/SARIMA model.
- Forecast: generate predictions for the next N days/weeks.
- Evaluate: RMSE between predicted vs actual.
- Visualize Forecast: overlay actual vs forecasted sales.
💻 Complete Python Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error

# 1) Load dataset
df = pd.read_csv("retail_sales.csv", parse_dates=["Date"])
df = df.sort_values("Date").set_index("Date")

# 2) Visualize
df["Sales"].plot(figsize=(12, 5), title="Retail Sales Over Time")

# 3) Train/Test split (last 3 months as test)
train = df.iloc[:-90]
test = df.iloc[-90:]

# 4) Build SARIMA model
model = SARIMAX(train["Sales"], order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))  # weekly seasonality
res = model.fit(disp=False)

# 5) Forecast
pred = res.get_forecast(steps=90)
pred_ci = pred.conf_int()

# 6) Evaluation
y_pred = pred.predicted_mean
rmse = np.sqrt(mean_squared_error(test["Sales"], y_pred))
print("RMSE:", rmse)

# 7) Plot forecast vs actual
ax = train["Sales"].plot(label="Train", figsize=(12, 5))
test["Sales"].plot(ax=ax, label="Test")
y_pred.plot(ax=ax, label="Forecast", alpha=0.7)
plt.fill_between(pred_ci.index, pred_ci.iloc[:, 0], pred_ci.iloc[:, 1], color="k", alpha=0.1)
plt.legend()
plt.show()
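The step list above mentions a stationarity check that the main code skips. Here is a minimal sketch of the Augmented Dickey–Fuller test with statsmodels, assuming the df loaded above:

# Augmented Dickey–Fuller test: a p-value below ~0.05 suggests the series is stationary.
# Assumes `df` with a "Sales" column, as loaded in the project code above.
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(df["Sales"].dropna())
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")

# If the raw series is not stationary, difference it once and re-test;
# the d=1 terms in the SARIMA order above play the same role inside the model.
adf_stat_d, p_value_d, *_ = adfuller(df["Sales"].diff().dropna())
print(f"ADF (1st difference) statistic: {adf_stat_d:.3f}, p-value: {p_value_d:.4f}")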
📊 Expected Outcomes
- ✔️ Time series visualization of sales trends & seasonality.
- ✔️ RMSE score to measure forecast accuracy.
- ✔️ Forecast plot showing train vs test vs predictions.
- ✔️ Understanding of ARIMA/SARIMA for business forecasting tasks.
🏥 Project 5: Diabetes Prediction (Healthcare Classification)
Healthcare is one of the most impactful applications of data science. In this project, you’ll build a model to predict whether a patient is likely to have diabetes based on medical features like glucose levels, BMI, age, and insulin measurements. The goal is to understand how classification models can assist in preventive healthcare.
📂 Dataset
Use the popular Pima Indians Diabetes Dataset from Kaggle.
Columns include: Glucose, Insulin, BMI, Age, and the target label Outcome (0/1).
🔎 Step-by-Step Process
- Load & Inspect: check shape, null values, target distribution.
- Preprocess: treat zero values in glucose/BMI as missing → impute with the median.
- Split: 80/20 train-test split.
- Scale Features: standardize numeric values using StandardScaler.
- Model: try Logistic Regression and Random Forest; optionally XGBoost (see the sketch after the main code).
- Evaluate: accuracy, recall, precision, F1-score, ROC-AUC.
- Interpret: identify the most important predictors (e.g., glucose & BMI).
💻 Complete Python Code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, confusion_matrix

# 1) Load dataset
df = pd.read_csv("diabetes.csv")

# 2) Replace zeros with NaN in medical columns where zero is impossible
cols_to_fix = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols_to_fix] = df[cols_to_fix].replace(0, np.nan)
df.fillna(df.median(), inplace=True)

# 3) Features & target
X = df.drop(columns=["Outcome"])
y = df["Outcome"]

# 4) Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 5) Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 6) Models
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)
pred_log = logreg.predict(X_test)
proba_log = logreg.predict_proba(X_test)[:, 1]

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)
proba_rf = rf.predict_proba(X_test)[:, 1]

# 7) Evaluation (ROC-AUC is computed on predicted probabilities, not hard labels)
print("=== Logistic Regression ===")
print("Accuracy:", accuracy_score(y_test, pred_log))
print("ROC-AUC:", roc_auc_score(y_test, proba_log))
print(confusion_matrix(y_test, pred_log))
print(classification_report(y_test, pred_log))

print("\n=== Random Forest ===")
print("Accuracy:", accuracy_score(y_test, pred_rf))
print("ROC-AUC:", roc_auc_score(y_test, proba_rf))
print(confusion_matrix(y_test, pred_rf))
print(classification_report(y_test, pred_rf))
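The step list mentions XGBoost as an optional third model. Here is a minimal sketch, assuming the xgboost package is installed (pip install xgboost) and the scaled X_train/X_test split from the code above is still in memory:

# Optional: gradient-boosted trees with XGBoost on the same split.
# Assumes `xgboost` is installed and X_train, X_test, y_train, y_test exist as above.
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

xgb = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    eval_metric="logloss",
    random_state=42,
)
xgb.fit(X_train, y_train)

pred_xgb = xgb.predict(X_test)
proba_xgb = xgb.predict_proba(X_test)[:, 1]
print("XGBoost Accuracy:", accuracy_score(y_test, pred_xgb))
print("XGBoost ROC-AUC:", roc_auc_score(y_test, proba_xgb))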
📊 Expected Outcomes
- ✔️ Accuracy: ~75–80% (baseline Logistic Regression).
- ✔️ Random Forest/XGBoost: ~82–85% accuracy, higher recall for positive cases.
- ✔️ ROC-AUC ~0.85 shows strong separability of classes.
- ✔️ Key drivers: glucose, BMI, age are top predictors.
- ✔️ Understanding trade-offs: recall is critical in healthcare (catching true cases).
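Because missing a true diabetes case usually costs more than raising a false alarm, it can help to lower the decision threshold below the default 0.5 and watch how recall and precision move. A minimal sketch, assuming proba_rf and y_test from the project code above:

# Trade precision for recall by lowering the probability threshold.
# Assumes `proba_rf` (positive-class probabilities) and `y_test` from the code above;
# if needed, re-create it with: proba_rf = rf.predict_proba(X_test)[:, 1]
from sklearn.metrics import precision_score, recall_score

for threshold in [0.5, 0.4, 0.3]:
    preds = (proba_rf >= threshold).astype(int)
    print(f"threshold={threshold:.1f} -> "
          f"recall={recall_score(y_test, preds):.2f}, "
          f"precision={precision_score(y_test, preds):.2f}")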
🎯 Conclusion – From Beginner to Portfolio-Ready
Congratulations! You’ve explored 5 beginner-friendly data science projects covering classification, regression, NLP, time series, and healthcare analytics. Each project included datasets, Python code, and outcomes so you can replicate them step by step. By completing these, you’re not only learning the concepts but also creating a portfolio that hiring managers love.
Remember — recruiters don’t just look for theory, they want to see what you’ve built. These projects can be showcased on your GitHub, LinkedIn, or resume to stand out in interviews. The next step? Keep building, keep experimenting, and move from beginner to job-ready data scientist.
🚀 Talk to Vista Academy & Kickstart Your Data Science Career
📞 Need guidance? Chat with us directly on WhatsApp and join our next free masterclass.
❓ Frequently Asked Questions (FAQ)
1. Do I need strong math skills to start these projects?
No. These beginner projects focus on Python, data handling, and machine learning basics. High school-level math is enough to start. As you advance, linear algebra and statistics will help deepen your understanding.
2. How long does it take to complete all 5 projects?
On average, 2–3 weeks if you dedicate 1–2 hours daily. Titanic & Sentiment Analysis can be done in a few days, while House Price Prediction and Time Series may take longer.
3. Do I need powerful hardware for these projects?
Not at all. These projects run fine on a laptop with 4–8GB RAM. For bigger datasets, you can use free cloud platforms like Google Colab.
4. Can I showcase these projects on my resume and LinkedIn?
Absolutely. Upload your code on GitHub, write short summaries on LinkedIn, and include key metrics (e.g., “Built a Random Forest model achieving 85% accuracy in diabetes prediction”). This makes your profile stand out to recruiters.
5. What’s the next step after these beginner projects?
After these projects, move to intermediate projects like customer segmentation, recommendation systems, fraud detection, or deploying ML models with Flask/Streamlit. These show industry-ready skills.