Python is one of the most popular languages in data science due to its readability, versatility, and rich ecosystem of data analytics libraries. Its syntax is simple and beginner-friendly, which makes it accessible to people from various backgrounds, including those without prior programming experience.
Libraries like pandas and numpy make it easy to handle large datasets and perform complex data manipulations, while visualization libraries like matplotlib and seaborn allow data scientists to create meaningful and customized plots. Python’s relevance in data science and analytics is backed by its adoption at major companies like Google, Netflix, and Amazon, where it’s used for tasks like recommendation engines, natural language processing, and data visualization.
Getting Python and its essential tools set up is the first step to start working on data analytics projects.
Python ships with pip, a package manager that allows you to install additional libraries. After installing Python, you can check if pip is working by running pip --version in your terminal.
Jupyter Notebooks are an essential tool for data analysts, offering an interactive coding environment where code, text, and visuals can be combined seamlessly. Run pip install notebook to install Jupyter Notebook (or pip install jupyterlab for the newer JupyterLab interface). You can then launch it by running jupyter notebook in your terminal, which will open a new browser window.
pandas provides DataFrames, a two-dimensional, tabular data structure essential for data work: pip install pandas
numpy enables efficient handling of arrays and mathematical operations on large datasets: pip install numpy
matplotlib is the core plotting library for creating customizable static charts: pip install matplotlib
Built on top of matplotlib, seaborn simplifies data visualization by offering an easier syntax and additional plot types for statistical graphics: pip install seaborn
After installation, open a Jupyter Notebook and run a small script to verify the libraries are ready to use:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
print("Libraries are working correctly!")
For beginners in Python, understanding the fundamentals of Python syntax is essential before diving into analytics. Here’s a quick overview of core concepts:
x = 5 # integer
y = 3.14 # float
name = "Python" # string
is_data_science_fun = True # boolean
Python supports operators for arithmetic (+, -, *, /), comparison (>, <, ==), and logical operations (and, or, not).
You can perform mathematical operations on numbers or even on collections like lists.
total = 10 + 15 # arithmetic (avoid the name 'sum', which shadows Python's built-in function)
is_greater = 10 > 5 # comparison
in_range = total > 0 and total < 100 # logical
combined = [1, 2] + [3, 4] # + also concatenates lists
Conditional Statements: Control the flow of the program based on conditions.
if x > 0:
    print("x is positive")
elif x == 0:
    print("x is zero")
else:
    print("x is negative")
Loops: Used to execute a block of code multiple times.
Iterate over items in a sequence (like lists or ranges).
for i in range(5):
    print(i)
Continue looping as long as a condition is true.
i = 0
while i < 5:
    print(i)
    i += 1
Functions are blocks of reusable code. Define them using the def keyword:
def greet(name):
    return f"Hello, {name}!"
print(greet("Data Scientist"))
Lists store ordered collections of items, while dictionaries map keys to values:
fruits = ["apple", "banana", "cherry"]
student = {"name": "John", "age": 22, "grade": "A"}
Tuples and Sets are additional data structures useful in analytics, with specific properties for data immutability and uniqueness.
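For instance (a minimal illustration with made-up values), a tuple holds a fixed, immutable record, while a set automatically removes duplicates:
coordinates = (40.7, -74.0) # tuple: immutable - elements cannot be reassigned
unique_grades = {"A", "B", "A", "C"} # set: the duplicate "A" is dropped automatically
print(coordinates[0]) # 40.7
print(unique_grades) # {'A', 'B', 'C'} (order may vary)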
Pandas is a powerful library in Python used for data manipulation and analysis. With its versatile functions, you can load, clean, and transform data, making it ready for analysis. Below are some essential techniques for working with data in Pandas.
Pandas allows you to easily load datasets from multiple file types such as CSV, Excel, JSON, and SQL databases. Here’s how to load data in various formats:
import pandas as pd
# Load a CSV file into a DataFrame
df = pd.read_csv("data.csv")
# Load an Excel file into a DataFrame
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
# Load a JSON file into a DataFrame
df = pd.read_json("data.json")
from sqlalchemy import create_engine
# Create a database connection
engine = create_engine("sqlite:///database.db")
# Load the results of a SQL query into a DataFrame
df = pd.read_sql("SELECT * FROM tablename", engine)
Data cleaning is crucial to ensure the quality and accuracy of your data. Here are some common techniques to handle missing values, duplicates, and formatting issues in Pandas.
Missing values can be filled with a specific value or dropped entirely.
# Drop rows with missing values
df = df.dropna()
# Fill missing values with a specific value (e.g., 0)
df = df.fillna(0)
# Fill missing values with the mean of the column
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
Duplicates can distort your analysis. Use the following code to identify and remove them:
# Identify duplicates
duplicates = df.duplicated()
# Remove duplicates
df = df.drop_duplicates()
Formatting data consistently is essential, especially with strings and dates.
# Convert text to lowercase
df['column_name'] = df['column_name'].str.lower()
# Trim whitespace from strings
df['column_name'] = df['column_name'].str.strip()
# Convert a column to datetime
df['date_column'] = pd.to_datetime(df['date_column'])
Data manipulation allows you to organize and transform your data for analysis. Here are some core techniques using Pandas:
Filter rows based on conditions:
# Filter rows where 'column_name' is greater than 10
filtered_df = df[df['column_name'] > 10]
Grouping allows you to split data into categories and apply functions to each category:
# Group by 'category_column' and calculate the mean of each group
grouped_df = df.groupby('category_column').mean(numeric_only=True) # numeric_only avoids errors on non-numeric columns
Aggregation lets you summarize data, such as calculating sums or counts:
# Calculate the sum of each group in 'category_column'
aggregated_df = df.groupby('category_column').agg({'numeric_column': 'sum'})
You can apply multiple aggregations to get various summaries for each group:
# Apply multiple aggregations
aggregated_df = df.groupby('category_column').agg({
    'numeric_column': ['mean', 'sum', 'count']
})
Exploratory Data Analysis (EDA) is an essential step in the data analysis process, allowing analysts to gain insights into the dataset’s structure, patterns, and relationships before applying more complex analyses. This section covers key techniques used in EDA, including statistical summaries, data visualization, and identifying trends and patterns.
Descriptive statistics provide a summary of the main characteristics of a dataset. This includes measures of central tendency and measures of dispersion.
# Basic descriptive statistics
summary = df.describe()
# Specific statistics
mean_value = df['column_name'].mean()
median_value = df['column_name'].median()
mode_value = df['column_name'].mode()[0]
std_dev = df['column_name'].std()
Data visualization is crucial for understanding complex data and communicating findings effectively. Libraries like matplotlib and seaborn make it easy to create a variety of plots.
import matplotlib.pyplot as plt
# Line plot
plt.plot(df['x_column'], df['y_column'])
plt.title('Line Plot')
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.show()
Seaborn provides a high-level interface for drawing attractive statistical graphics.
import seaborn as sns
# Scatter plot
sns.scatterplot(data=df, x='x_column', y='y_column', hue='category_column')
plt.title('Scatter Plot')
plt.show()
# Box plot
sns.boxplot(x='category_column', y='numeric_column', data=df)
plt.title('Box Plot')
plt.show()
Understanding trends and patterns within your data is key to drawing insights. Here are two powerful techniques for this purpose.
Correlation measures the strength and direction of a linear relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation.
# Calculate the correlation matrix
correlation_matrix = df.corr(numeric_only=True) # restrict the calculation to numeric columns
# Visualize the correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Pivot tables allow you to summarize and analyze data by creating a new DataFrame based on categorical variables.
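As a brief sketch (assuming a DataFrame df with the hypothetical category_column and numeric_column names used in the earlier examples), pandas’ pivot_table builds such a summary:
import pandas as pd
# Summarize numeric_column for each category; add columns='date_column' for a two-way table
pivot = pd.pivot_table(df, values='numeric_column', index='category_column', aggfunc='mean')
print(pivot)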
Data visualization is not just about presenting data; it’s about telling a story and providing insights that help decision-making. This section covers advanced visualization techniques using interactive visuals, geospatial data, and effective storytelling strategies.
Interactive visualizations allow users to engage with data dynamically, making it easier to explore and analyze insights. Libraries like plotly and bokeh provide powerful tools for creating interactive charts and dashboards.
import plotly.express as px
# Sample data
df = px.data.iris()
# Create an interactive scatter plot
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species', title='Interactive Scatter Plot of Iris Data')
fig.show()
Bokeh is great for creating interactive web-based visualizations. Here’s how to create a simple interactive plot, which can later be combined with others into a dashboard using Bokeh’s layout functions:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.layouts import column
output_notebook()
# Create a new plot (assumes a DataFrame df with 'x_column' and 'y_column' columns)
p = figure(title="Interactive Line Plot", x_axis_label='X-axis', y_axis_label='Y-axis')
p.line(df['x_column'], df['y_column'], legend_label='Data Line', line_width=2)
# Show the plot
show(p)
Geospatial visualization allows you to visualize data on maps, which is essential for understanding location-based insights. Libraries like folium and geopandas make this process straightforward.
import folium
# Example coordinates (hypothetical; replace with your own location)
latitude, longitude = 51.5074, -0.1278
# Create a map centered at that location
m = folium.Map(location=[latitude, longitude], zoom_start=10)
# Add a marker
folium.Marker([latitude, longitude], popup='Location Name').add_to(m)
# Display the map
m.save('map.html') # Save to an HTML file to view in a browser
GeoPandas extends the Pandas library to enable spatial operations. Here’s how to plot geospatial data:
import geopandas as gpd
import matplotlib.pyplot as plt
# Load a geospatial dataset (the bundled 'naturalearth_lowres' dataset was removed in GeoPandas 1.0;
# with newer versions, read Natural Earth data from a downloaded file instead)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Plot the data
world.plot()
plt.title('World Map')
plt.show()
Data storytelling involves combining data visualization with narrative techniques to communicate insights more effectively. Key principles include knowing your audience, focusing each visual on a single clear message, providing enough context for the numbers, and guiding the reader toward an actionable conclusion.
Explore case studies or examples of successful data storytelling in media, business reports, and presentations to understand how to apply these principles in practice.
Statistical analysis and hypothesis testing are essential for making inferences from data. This section covers key concepts in probability, distributions, inferential statistics, and methodologies for A/B testing and experimentation.
Probability is the measure of the likelihood of an event occurring. Understanding probability distributions is crucial for statistical analysis.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Generate data for normal distribution
data = np.random.normal(loc=0, scale=1, size=1000)
# Plotting the normal distribution
sns.histplot(data, bins=30, kde=True)
plt.title('Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Inferential statistics allows us to draw conclusions about a population based on a sample. This includes formulating and testing hypotheses.
Hypothesis testing involves two competing hypotheses: the null hypothesis (H0), which assumes there is no effect or difference, and the alternative hypothesis (H1), which assumes there is one.
T-tests are used to compare the means of two groups.
from scipy import stats
# Sample data
group1 = [2, 3, 5, 7, 9]
group2 = [1, 4, 6, 8, 10]
# Conducting a t-test
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f'T-statistic: {t_stat}, P-value: {p_value}')
Chi-square tests are used to determine if there is a significant association between categorical variables.
# Sample contingency table
observed = [[10, 20], [30, 40]]
# Conducting a chi-square test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)
print(f'Chi-square Statistic: {chi2_stat}, P-value: {p_value}')
ANOVA is used to compare means across three or more groups.
# Sample data
group1 = [5, 7, 8, 6]
group2 = [6, 5, 7, 8]
group3 = [8, 9, 10, 10]
# Conducting ANOVA
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f'F-statistic: {f_stat}, P-value: {p_value}')
A/B testing is a method of comparing two versions of a webpage, app, or product to determine which one performs better. It’s a key component of experimentation in data analysis.
# Example: Analyzing conversion rates for A/B test
conversions_a = 120 # Conversions for version A
conversions_b = 150 # Conversions for version B
visitors_a = 1000 # Visitors for version A
visitors_b = 1000 # Visitors for version B
# Calculate conversion rates
conversion_rate_a = conversions_a / visitors_a
conversion_rate_b = conversions_b / visitors_b
print(f'Conversion Rate A: {conversion_rate_a:.2f}, Conversion Rate B: {conversion_rate_b:.2f}')
# Use a two-proportion z-test (from statsmodels) to check whether the difference is statistically significant
from statsmodels.stats.proportion import proportions_ztest
z_score, p_value = proportions_ztest([conversions_a, conversions_b], [visitors_a, visitors_b])
print(f'Z-score: {z_score}, P-value: {p_value}')
Data transformation and feature engineering are critical steps in the data preprocessing phase. These processes enhance the dataset’s quality and prepare it for machine learning models. This section covers handling date and time data, creating new features, and dimensionality reduction techniques.
Time series data often requires special handling due to its temporal nature. The datetime module and pandas provide robust functionality for managing and manipulating date and time data.
import pandas as pd
# Creating a DataFrame with datetime
data = {'date': ['2023-01-01', '2023-01-02', '2023-01-03'],
        'value': [10, 20, 30]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date']) # Convert to datetime
# Set the date column as the index
df.set_index('date', inplace=True)
# Display the DataFrame
print(df)
Pandas makes it easy to perform time series operations, such as resampling and calculating moving averages.
# Resampling the data to monthly frequency
monthly_data = df.resample('M').sum()
# Calculating a moving average
df['moving_avg'] = df['value'].rolling(window=2).mean()
print(df)
Feature engineering involves creating new variables that can help improve the performance of machine learning models. This may include transformations, interactions, or aggregations.
from sklearn.preprocessing import PolynomialFeatures
# Sample data
data = {'feature1': [1, 2, 3],
        'feature2': [4, 5, 6]}
df = pd.DataFrame(data)
# Creating polynomial features
poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(df)
# Display the new features
print(poly_features)
Dimensionality reduction techniques are used to reduce the number of features in a dataset while retaining as much information as possible. This can help improve model performance and reduce computational costs.
PCA is a popular method for reducing dimensionality by transforming to a new set of variables (principal components) that are orthogonal and capture the most variance in the data.
from sklearn.decomposition import PCA
# Sample data
data = [[2, 8], [3, 6], [5, 4], [6, 1], [7, 2]]
df = pd.DataFrame(data, columns=['Feature1', 'Feature2'])
# Apply PCA
pca = PCA(n_components=1)
reduced_data = pca.fit_transform(df)
# Display reduced data
print(reduced_data)
Machine learning is a powerful tool for analyzing data and making predictions. This section provides an overview of machine learning types, guides you through the process of building a model, and explains how to evaluate model performance using Scikit-Learn.
Machine learning can be broadly categorized into two main types:
In supervised learning, models are trained on labeled data, meaning that the input data is paired with the correct output. Common tasks include classification and regression.
In unsupervised learning, models are trained on data without labels, and the goal is to identify patterns or groupings within the data. Common tasks include clustering and dimensionality reduction.
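As a minimal illustration of the unsupervised case (using a few made-up, unlabeled points rather than a real dataset), a clustering algorithm finds groups purely from the structure of the data:
import numpy as np
from sklearn.cluster import KMeans
# Unlabeled 2-D points
points = np.array([[1, 1], [1.5, 2], [0.5, 1.5], [8, 8], [8, 9], [9, 8]])
# K-Means groups the points into 2 clusters without ever seeing labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(points)
print(cluster_ids) # cluster assignments such as [0 0 0 1 1 1]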
Building a machine learning model involves several key steps: loading and inspecting the data, handling missing values, splitting the data into training and test sets, training the model, and making predictions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load dataset
df = pd.read_csv('data.csv')
# Display the first few rows
print(df.head())
# Handle missing values (numeric_only restricts the mean calculation to numeric columns)
df.fillna(df.mean(numeric_only=True), inplace=True)
# Splitting data into features and target
X = df.drop('target_column', axis=1)
y = df['target_column']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the model
model = RandomForestClassifier()
# Fit the model to the training data
model.fit(X_train, y_train)
# Make predictions on the test set
predictions = model.predict(X_test)
Evaluating a model’s performance is crucial to ensure it makes accurate predictions. Common evaluation metrics include:
Accuracy measures the proportion of correct predictions made by the model.
# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.2f}')
Precision and recall are particularly useful for classification problems where class distributions are imbalanced:
from sklearn.metrics import precision_score, recall_score
# Calculate precision and recall
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
print(f'Precision: {precision:.2f}, Recall: {recall:.2f}')
Hands-on projects are essential for solidifying your understanding of data analytics. This section outlines three real-world projects that apply various data analysis techniques, from predictive modeling to customer segmentation and time series forecasting.
In this project, you will create a predictive model to forecast housing prices based on various features such as location, size, number of bedrooms, and amenities.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
# Load the dataset
df = pd.read_csv('housing_data.csv')
# Preprocessing steps (e.g., handle missing values, encoding)
# Split the dataset
X = df.drop('price', axis=1) # Features
y = df['price'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the model
model = RandomForestRegressor()
model.fit(X_train, y_train)
# Predictions and evaluation
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f'Mean Absolute Error: {mae:.2f}, R-squared: {r2:.2f}')
This project involves applying clustering techniques to segment customers into distinct groups based on purchasing behavior, demographics, or other attributes.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('customer_data.csv')
# Preprocessing steps
# Apply K-Means Clustering
kmeans = KMeans(n_clusters=5) # Choosing 5 clusters
clusters = kmeans.fit_predict(df[['feature1', 'feature2']]) # Features used for clustering
df['Cluster'] = clusters
# Visualize the clusters
plt.scatter(df['feature1'], df['feature2'], c=df['Cluster'], cmap='viridis')
plt.title('Customer Segmentation')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
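The third project is time series forecasting, for example predicting future sales. The snippet below is a minimal, hypothetical sketch (assuming a sales_data.csv file with 'date' and 'sales' columns) that uses resampling and a moving average as a simple baseline forecast:
import pandas as pd
# Load the dataset (hypothetical file with 'date' and 'sales' columns)
df = pd.read_csv('sales_data.csv', parse_dates=['date'])
df.set_index('date', inplace=True)
# Aggregate to monthly totals
monthly_sales = df['sales'].resample('M').sum()
# Smooth the series with a 3-month moving average
monthly_sales_smoothed = monthly_sales.rolling(window=3).mean()
# Naive baseline forecast: next month = average of the last 3 observed months
next_month_forecast = monthly_sales.tail(3).mean()
print(f'Baseline forecast for next month: {next_month_forecast:.2f}')
More sophisticated approaches, such as ARIMA, exponential smoothing, or regression models with lag features, can then be compared against this baseline.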
Automating data pipelines is crucial for ensuring efficient data processing and analysis. This section explores ETL processes, introduces data pipeline tools, and discusses scheduling and monitoring techniques to maintain data reliability and timely updates.
The ETL process involves three key steps:
Data is extracted from various sources, including databases, APIs, and flat files. Automation can be achieved by scheduling regular data extractions using scripts or specialized tools.
Data transformation involves cleaning and processing the extracted data to make it suitable for analysis. This can include operations like filtering, aggregating, and merging datasets. Automation of transformation can be accomplished using frameworks like Pandas or built-in functions within ETL tools.
Finally, the transformed data is loaded into a target destination, such as a data warehouse or a database. Automated loading can be performed using data pipeline orchestration tools that manage the workflow.
import pandas as pd
from sqlalchemy import create_engine
# Step 1: Extract
data = pd.read_csv('data_source.csv')
# Step 2: Transform
data['new_column'] = data['old_column'].apply(lambda x: x * 2) # Example transformation
# Step 3: Load
engine = create_engine('postgresql://username:password@localhost:5432/mydatabase')
data.to_sql('my_table', engine, if_exists='replace', index=False)
Several tools can help automate and orchestrate data pipelines. Two of the most popular are Apache Airflow and Luigi:
Airflow is an open-source platform that allows you to programmatically author, schedule, and monitor workflows. You define workflows as Directed Acyclic Graphs (DAGs), which enable complex data pipeline management.
from airflow import DAG
from airflow.operators.python import PythonOperator # modern import path for Airflow 2.x
from datetime import datetime
def extract():
    # Extraction logic here
    pass
def transform():
    # Transformation logic here
    pass
def load():
    # Load logic here
    pass
with DAG('my_etl_pipeline', start_date=datetime(2023, 1, 1), schedule_interval='@daily') as dag:
    t1 = PythonOperator(task_id='extract', python_callable=extract)
    t2 = PythonOperator(task_id='transform', python_callable=transform)
    t3 = PythonOperator(task_id='load', python_callable=load)
    t1 >> t2 >> t3 # Set task dependencies
Luigi is another Python package that helps you build complex data pipelines. It is designed to manage long-running batch processes and provides a simple way to visualize pipeline tasks.
import luigi
class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget('data/extracted_data.csv')
    def run(self):
        # Extraction logic
        with self.output().open('w') as f:
            f.write("data extracted")
class Transform(luigi.Task):
    def requires(self):
        return Extract()
    def output(self):
        return luigi.LocalTarget('data/transformed_data.csv')
    def run(self):
        # Transformation logic
        with self.output().open('w') as f:
            f.write("data transformed")
class Load(luigi.Task):
    def requires(self):
        return Transform()
    def run(self):
        # Load logic
        print("Data loaded successfully")
if __name__ == '__main__':
    luigi.run()
Effective scheduling and monitoring are vital for the reliability of data pipelines:
Using tools like Airflow, you can schedule your data pipelines to run at specific intervals (e.g., hourly, daily). This automation ensures that data is consistently up-to-date without manual intervention.
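For example (a hedged sketch using a hypothetical DAG name), a cron-style schedule string can be used in place of a preset like '@daily' to run a pipeline hourly:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def extract():
    pass # extraction logic would go here
# Cron syntax 'minute hour day month weekday': run at the top of every hour
with DAG('my_hourly_pipeline', start_date=datetime(2023, 1, 1), schedule_interval='0 * * * *') as dag:
    hourly_extract = PythonOperator(task_id='extract', python_callable=extract)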
Monitoring tools provide insights into the health and performance of your data pipelines. You can track metrics such as job success rates, execution times, and error logs. Airflow offers a web interface to monitor DAG runs and task status.
Airflow allows you to set alerts for task failures and successes, enabling proactive management of your data workflows.
from airflow.utils.email import send_email
def on_failure_callback(context):
    send_email(
        to='alert@example.com',
        subject='Airflow Alert: Task Failed',
        html_content='Task failed. Please check logs.'
    )
t1 = PythonOperator(task_id='extract', python_callable=extract, on_failure_callback=on_failure_callback)
This section explores advanced topics in data analytics, including Natural Language Processing (NLP), Deep Learning, and Big Data Integration. These topics are essential for tackling complex data-rich projects and leveraging cutting-edge technologies.
NLP is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. It involves techniques for analyzing and understanding text data.
NLTK is a powerful library in Python for working with human language data. It provides tools for tokenization, parsing, classification, stemming, and more.
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt') # Download the tokenizer models on first use
# Sample text
text = "Natural language processing is fascinating!"
tokens = word_tokenize(text)
print(tokens)
spaCy is another popular NLP library that is designed for efficiency and usability. It offers advanced features such as part-of-speech tagging, named entity recognition, and dependency parsing.
import spacy
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')
# Process a text
doc = nlp("Natural language processing is fascinating!")
for token in doc:
    print(f'{token.text} - {token.pos_}')
Deep learning is a subset of machine learning that uses neural networks to model complex patterns in large datasets. TensorFlow and Keras are popular frameworks for building deep learning models.
TensorFlow is an open-source library developed by Google for numerical computation that makes machine learning faster and easier. It supports both CPUs and GPUs for efficient training.
Keras is an API that runs on top of TensorFlow, providing a high-level interface for building and training deep learning models.
import tensorflow as tf
from tensorflow import keras
# Build a simple neural network ('input_shape' is a placeholder for the number of input features)
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(input_shape,)),
    keras.layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Train the model (train_data and train_labels are placeholders for your prepared arrays)
model.fit(train_data, train_labels, epochs=10, batch_size=32)
Big data technologies enable the processing and analysis of large datasets that cannot be handled by traditional data processing methods.
PySpark is the Python API for Apache Spark, an open-source distributed computing system. It provides high-level APIs to manipulate large datasets efficiently.
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("BigDataExample").getOrCreate()
# Load a large dataset
df = spark.read.csv('large_dataset.csv', header=True, inferSchema=True)
# Perform operations
df.show()
df.groupBy('column_name').count().show()
Dask is a flexible parallel computing library for analytics. It allows you to work with large datasets using familiar pandas-like syntax, scaling your computations across multiple cores or clusters.
import dask.dataframe as dd
# Load a large dataset
ddf = dd.read_csv('large_dataset.csv')
# Perform operations
result = ddf.groupby('column_name').count().compute()
print(result)