Essential Data Analytics Interview Questions: Ace Your Next Interview
What is the difference between Data Mining and Data Analysis?
Data Mining
- Used to identify patterns in stored data.
- Mining is performed on clean and well documented data.
- Results extracted from data mining are not easy to interpret.
Data Analysis
- Used to order & organize raw data in a meaningful manner.
- Data cleaning is part of the data analysis process, because raw data is rarely recorded in a clean, accurate form.
- Results extracted from data analysis are easy to interpret.
To recap, Data Mining is a technique for identifying patterns within recorded data. Analysts employ algorithms to identify patterns, which is commonly used in Machine Learning. Data analysis involves cleaning and organizing raw data to get insights.
What is the process of Data Analysis?
Data analysis involves gathering, cleaning, analyzing, converting, and modeling data to provide insights and reports for commercial profitability.
Refer to the graphic below to see the various phases in the procedure.
- Collect Data: Data is gathered from numerous sources and stored so that it can be cleaned and processed. In this phase, missing values and outliers are handled.
- Analyze Data: Once the data is prepared, the next step is to analyze it. A model is run repeatedly and improved. The model is then validated to ensure that it fulfills the necessary business criteria.
- Create Reports: Finally, the model is implemented, and the generated reports are distributed to the stakeholders.
What is the difference between Data Mining and Data Profiling?
Data Mining is the process of analyzing data to identify previously unknown relationships. It primarily focuses on the detection of anomalous records, dependencies, and cluster analysis.
Data Profiling is the process of examining specific aspects of data. It primarily focuses on giving useful information about data properties such as data type, frequency, etc.
What is data cleansing and what are the best ways to practice data cleansing?
Data cleansing (also called data cleaning or data wrangling) is the process of discovering and correcting faults in order to improve the quality of data. Refer to the figure below to learn about the various approaches to dealing with missing data.
What are the important steps in the data validation process?
As the name suggests, Data Validation is the process of verifying data. This stage involves two primary processes. These are data screening and verification.
- Data Screening: Various algorithms are employed in this stage to screen the full data and identify any incorrect numbers.
- Data Verification: Each suspicious value is reviewed against several use-cases before a final judgment is made on whether it should be included in the data or not.
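The screening step can be sketched in plain Python. This is only an illustration: the z-score rule and its cutoff of 2 are assumptions for the example, not a prescribed method, and the data is invented.

```python
from statistics import mean, stdev

def screen_outliers(values, z_threshold=2.0):
    """Data screening: flag values whose z-score exceeds an illustrative cutoff."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

data = [10, 12, 11, 13, 12, 11, 98]  # 98 looks like an incorrect entry
suspects = screen_outliers(data)
print(suspects)  # [98]
```

Each flagged value would then go through the verification step, where an analyst reviews it against the relevant use-cases before deciding whether to keep it.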
What do you think are the criteria to say whether a developed data model is good or not?
The answer to this question may differ from person to person. However, the following are a few factors that I believe should be evaluated to determine whether a data model is good or not:
- The model constructed for the dataset should exhibit predictable performance. This is essential to forecast the future.
- A model is regarded to be good if it is easily adaptable to modifications based on company needs.
- If the data changes, the model should be able to scale with it.
- The generated model should also be easy for clients to consume in order to provide actionable and lucrative results.
When do you think you should retrain a model? Is it dependent on the data?
Business data changes daily, but its format tends to remain stable. When the business enters a new market, faces new competition, or repositions itself, the model should be retrained. So yes, it depends on the data: whenever company dynamics change, retrain the model so that it reflects changing customer behavior.
Can you mention a few problems that data analysts usually encounter while performing analysis?
The following are some of the most common issues faced during data analysis.
- Duplicate entries and spelling errors diminish data quality.
- Extracting data from a bad source may require significant cleaning effort.
- When data is extracted from multiple sources, its representation may differ. Mixing data from sources that represent the same values differently creates inconsistencies that must be reconciled before analysis.
- Finally, missing data can pose a problem while performing data analysis.
What is the KNN imputation method?
This approach imputes a missing attribute value using the values of the k records that are most similar to the record with the missing attribute.
Distance functions are used to determine how similar two records are.
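A minimal sketch of KNN imputation in plain Python, assuming a small numeric table where `None` marks the missing value (the data and the choice of k=2 are invented for illustration):

```python
import math

def knn_impute(rows, target_idx, k=2):
    """Fill missing values (None) in column target_idx with the mean of that
    column over the k nearest complete rows (Euclidean distance on the
    remaining columns)."""
    complete = [r for r in rows if r[target_idx] is not None]
    filled = []
    for r in rows:
        if r[target_idx] is not None:
            filled.append(list(r))
            continue
        # Distance computed on every attribute except the one being imputed
        def dist(other):
            return math.dist(
                [v for i, v in enumerate(r) if i != target_idx],
                [v for i, v in enumerate(other) if i != target_idx],
            )
        neighbors = sorted(complete, key=dist)[:k]
        estimate = sum(n[target_idx] for n in neighbors) / k
        filled.append([estimate if i == target_idx else v
                       for i, v in enumerate(r)])
    return filled

data = [[1.0, 2.0], [1.1, 2.1], [5.0, 9.0], [1.05, None]]
print(knn_impute(data, target_idx=1))  # last row becomes [1.05, 2.05]
```

In practice a library implementation such as scikit-learn's `KNNImputer` would be used instead of hand-rolled code; the sketch just shows the idea of "borrow values from the most similar records".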
Mention the name of the framework developed by Apache for processing large datasets for an application in a distributed computing environment.
The Hadoop Ecosystem was developed for processing large datasets in a distributed computing environment. The Hadoop Ecosystem consists of the following components.
- HDFS -> Hadoop Distributed File System
- YARN -> Yet Another Resource Negotiator
- MapReduce -> Data processing using programming
- Spark -> In-memory Data Processing
- PIG, HIVE-> Data Processing Services using Query (SQL-like)
- HBase -> NoSQL Database
- Mahout, Spark MLlib -> Machine Learning
- Apache Drill -> SQL on Hadoop
- Zookeeper -> Managing Cluster
- Oozie -> Job Scheduling
- Flume, Sqoop -> Data Ingesting Services
- Solr & Lucene -> Searching & Indexing
- Ambari -> Provision, Monitor and Maintain cluster
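The MapReduce component above can be illustrated with the classic word-count example. This is a plain-Python sketch that mimics the map, shuffle, and reduce phases, not the actual Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the grouped values, here by summing the counts."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big insights", "big cluster"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 1, 'insights': 1, 'cluster': 1}
```

In real Hadoop, the map and reduce functions run in parallel across the cluster and the framework handles the shuffle; the logic, however, is the same.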
Now we go on to the next set of questions, the Excel Interview Questions.
Data Analyst Interview Questions:
Excel
Microsoft Excel is one of the easiest and most powerful software tools available today. It allows users to perform quantitative and statistical analysis through a simple interface for data manipulation, so its applications span many disciplines and professional needs. It is an essential skill that provides a foundation for becoming a Data Analyst. Let's quickly explore the questions on this topic.
Can you tell what is a waterfall chart and when do we use it?
The waterfall chart displays both the positive and negative values that contribute to the final outcome value. For instance, when examining a company's net income, the chart includes all the cost figures, visualizing how revenue becomes net income after the costs are deducted.
How can you highlight cells with negative values in Excel?
Excel’s conditional formatting allows you to highlight cells with negative values. Here are the steps you can take:
- Select the range of cells you want to check.
- Go to the Home tab and choose Conditional Formatting.
- Navigate to Highlight Cell Rules and select the Less Than option.
- In the Less Than dialog box, set the value to 0.
How can you clear all the formatting without actually removing the cell contents?
You may wish to remove all of the formatting and keep only the plain data. To accomplish this, use the 'Clear Formats' option on the Home tab: clicking the 'Clear' drop-down menu reveals it.
What is a Pivot Table, and what are the different sections of a Pivot Table?
A Pivot Table is a basic tool in Microsoft Excel that lets you easily summarize large datasets. It is quite simple to use, since generating reports only requires dragging and dropping row/column headings.
A pivot table consists of four distinct sections:
- Values Area: Where the values are reported.
- Rows Area: The headings located to the left of the values area.
- Column Area: The headings at the top of the values area.
- Filter Area: An optional filter for drilling down into the dataset.
Can you make a Pivot Table from multiple tables?
Yes. We can build a single Pivot Table from multiple tables, provided there is a relationship between them.
How can we select all blank cells in Excel?
If you want to select all blank cells in Excel, utilize the Go To Special Dialog Box. Here are the steps you may take to select all of the blank cells in Excel.
- Select the complete dataset and press F5. This will open the Go To dialog box.
- Select the ‘Special‘ button to open the Go To special dialog box.
- After that, pick Blanks and click OK.
This selects all of the blank cells in your dataset.
What are the most common questions you should ask a client before creating a dashboard?
The answer to this question varies from case to case. However, here are some frequent questions to ask before constructing a dashboard in Excel:
- What is the purpose of the dashboard?
- Are there multiple data sources, and what are they?
- How will the Excel dashboard be used?
- How frequently should the dashboard be refreshed?
- Which version of Office does the client use?
What is a Print Area and how can you set it in Excel?
In Excel, a Print Area is a collection of cells that are set to print whenever the worksheet is printed. For example, if you just want to print the first 20 rows from the full worksheet, you may choose the first 20 rows as the Print Area.
Now, to configure the Print Area in Excel, follow the steps below:
- Select the cells to set the Print Area.
- Then, select the Page Layout tab.
- Select Print Area.
- Select Set Print Area.
What steps can you take to handle slow Excel workbooks?
There are several strategies for dealing with sluggish Excel workbooks. Here are a few ways you can manage them:
- Try manual calculation mode.
- Keep all referenced data in a single spreadsheet.
- Frequently utilize Excel tables and named ranges.
- Use helper columns rather than array formulae.
- Avoid utilizing complete rows or columns in references.
- Convert all unnecessary formulas into values.
Can you sort multiple columns at one time?
Multiple sorting means sorting by one column and then sorting by another column while keeping the order of the first intact. In Excel, you can sort by several columns at the same time.
To perform multiple sorting, utilize the Sort Dialog Box. To accomplish this, choose the data you wish to sort and then click on the Data Tab. After that, click the Sort icon.
In this dialog box, you specify the sorting for one column, and can then sort by another column by clicking the Add Level button.
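Outside Excel, the same multi-level sort can be sketched in Python with a tuple sort key, where each tuple element plays the role of one level in the Sort dialog (the rows here are invented):

```python
rows = [
    {"region": "East", "sales": 250},
    {"region": "West", "sales": 100},
    {"region": "East", "sales": 100},
    {"region": "West", "sales": 300},
]

# Sort by region first, then by sales within each region,
# mirroring the Add Level button in Excel's Sort dialog
rows.sort(key=lambda r: (r["region"], r["sales"]))
print(rows)
```

Python's sort is stable, so sorting by a composite key gives exactly the "sort, then sort within" behavior the Sort dialog describes.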
Moving on to the following series of questions, which are connected to statistics.
Data Analyst Interview Questions: Statistics
Statistics is a discipline of mathematics that involves data gathering, organization, analysis, interpretation, and presentation. Statistics falls into two categories: Descriptive and Inferential. This discipline is connected to mathematics and provides a strong foundation for a career in data analysis.
What do you understand by the term Normal Distribution?
This is one of the most significant and extensively used distributions in statistics. Normal distributions, often known as the Bell Curve or Gaussian curve, indicate the range of values in terms of mean and standard deviation. Refer to the image below.
The figure above shows that data is often spread evenly around a central value, with no bias to either side. The random variables are distributed in a symmetrical bell-shaped curve.
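As a quick illustration, Python's `statistics.NormalDist` can verify the familiar 68-95-99.7 coverage of a normal distribution:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
nd = NormalDist(mu=0, sigma=1)

# Probability mass within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    coverage = nd.cdf(k) - nd.cdf(-k)
    print(f"within {k} sigma: {coverage:.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```

The symmetric coverage around the mean is exactly the "no bias to either side" property described above.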
What is A/B Testing?
A/B testing (also called split testing) is a statistical hypothesis test for a randomized experiment with two variants, A and B. It uses sample statistics to estimate population parameters. The test compares two versions of a web page by showing variants A and B to a similar number of visitors; the variant with the higher conversion rate wins.
The purpose of A/B testing is to measure the effect of modifications to a website. For example, consider a banner ad on which you have spent a significant amount of money. A/B testing lets you estimate the return on investment (ROI) by comparing the click-through rates of the banner variants.
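A hedged sketch of how such a comparison might be evaluated, using a two-proportion z-test with invented visitor and conversion counts (a real experiment involves more care with design, sample size, and multiple testing):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
    return z, p_value

# Variant A: 200 conversions out of 1000 visitors; variant B: 260 out of 1000
z, p = two_proportion_z(conv_a=200, n_a=1000, conv_b=260, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (conventionally below 0.05) suggests the difference in conversion rates is unlikely to be due to chance alone.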
What is the statistical power of sensitivity?
Sensitivity (also called recall or the true positive rate) is used to assess how well a classifier detects true events.
The classifier might be Logistic Regression, a Support Vector Machine, or a Random Forest.
Sensitivity is the ratio of correctly predicted true events to the total number of true events: Sensitivity = True Positives / (True Positives + False Negatives).
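As a sketch, sensitivity can be computed directly from confusion-matrix counts (the numbers below are invented):

```python
def sensitivity(true_positives, false_negatives):
    """Sensitivity (recall) = TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical classifier results: 80 true events caught, 20 missed
print(sensitivity(true_positives=80, false_negatives=20))  # 0.8
```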
What is the Alternative Hypothesis?
To understand the Alternative Hypothesis, first understand the Null Hypothesis. The null hypothesis is the statement under test, put forward for possible rejection on the premise that the observed outcome is due to chance alone.
Following this, the alternative hypothesis is the statement that contradicts the Null Hypothesis: it assumes the observations are the result of a real effect, with some amount of chance variation on top.
What is the difference between univariate, bivariate and multivariate analysis?
The distinctions among univariate, bivariate, and multivariate analyses are as follows:
- Univariate: The analysis of a single variable at a time; a descriptive approach that summarizes one variable's distribution.
- Bivariate: The analysis of two variables at a time, used to determine the relationship between them.
- Multivariate: The analysis of three or more variables, used to investigate the effect of several factors on the responses.
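A small plain-Python illustration of the univariate versus bivariate distinction, with invented data (the Pearson correlation coefficient is a classic bivariate measure):

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]   # e.g. advertising spend (invented)
y = [2, 4, 5, 4, 5]   # e.g. sales (invented)

# Univariate: describe one variable at a time
print(mean(x), round(stdev(x), 3))

# Bivariate: Pearson correlation between two variables
n = len(x)
cov = sum((a - mean(x)) * (b - mean(y)) for a, b in zip(x, y)) / (n - 1)
r = cov / (stdev(x) * stdev(y))
print(round(r, 3))  # 0.775
```

A multivariate analysis would extend this to three or more variables at once, for example via multiple regression.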
Can you tell me what are Eigenvectors and Eigenvalues?
Eigenvectors: They are mostly used to understand linear transformations, and are typically calculated for correlation or covariance matrices.
Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching.
Eigenvalue: The eigenvalue is the strength of the transformation, that is, the factor by which it stretches or compresses in the direction of the corresponding eigenvector.
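As a sketch, the dominant eigenvalue and eigenvector of a small matrix can be found by power iteration, done here in plain Python for an illustrative 2x2 stretch matrix:

```python
import math

def power_iteration(matrix, steps=100):
    """Estimate the dominant eigenvalue/eigenvector of a 2x2 matrix by
    repeatedly applying the transformation and normalizing."""
    v = [1.0, 1.0]
    for _ in range(steps):
        w = [matrix[0][0] * v[0] + matrix[0][1] * v[1],
             matrix[1][0] * v[0] + matrix[1][1] * v[1]]
        norm = math.hypot(*w)
        v = [w[0] / norm, w[1] / norm]
    # Rayleigh quotient v.(Av) gives the eigenvalue for the converged direction
    av = [matrix[0][0] * v[0] + matrix[0][1] * v[1],
          matrix[1][0] * v[0] + matrix[1][1] * v[1]]
    eigenvalue = av[0] * v[0] + av[1] * v[1]
    return eigenvalue, v

# A stretch by 3 along x and 1 along y: dominant eigenvalue 3, direction (1, 0)
lam, vec = power_iteration([[3.0, 0.0], [0.0, 1.0]])
print(lam, vec)
```

The converged direction is the eigenvector (the direction the transformation stretches along), and the returned value is its eigenvalue (the stretch factor).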
What is the difference between 1-Sample T-test, and 2-Sample T-test?
To answer this question, first explain what T-tests are.
T-tests are hypothesis tests that compare means. Each test on sample data produces a single value, known as the T-value. For the 1-Sample T-test, the formula is t = (x̄ - μ₀) / (s / √n), where x̄ is the sample mean, μ₀ is the null hypothesis value, s is the sample standard deviation, and n is the sample size.
Because this formula is in ratio format, you may describe it using the signal-to-noise ratio analogy.
The numerator would be a signal, whereas the denominator would be noise.
So, to compute the 1-Sample T-test, subtract the null hypothesis value from the sample mean: the larger this difference, the stronger the signal. If your sample mean is 7 and the null hypothesis value is 2, the signal is 5.
Now, look at the denominator, which is the noise in our example and is a measure of variability known as the standard error of the mean. This reflects your sample’s ability to reliably predict the population or dataset mean.
Noise has an inverse relationship with sample accuracy: the higher the noise, the less reliably the sample mean estimates the population mean.
The 1-Sample T-value is then the signal-to-noise ratio, which shows how well your signal stands out from the noise.
To calculate the 2-Sample T-test, take the ratio of the difference between the two sample means to the standard error of that difference.
To recap, the 1-Sample T-test compares a sample set to a mean, whereas the 2-Sample T-test assesses whether a mean difference between two sample sets is statistically significant for the total population or due to chance.
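The 1-Sample computation can be sketched in Python; the sample below is invented, but chosen so the sample mean is 7 and the null value is 2, matching the signal-of-5 example above:

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, null_mean):
    """1-Sample T-value: signal (mean minus null value) over noise
    (standard error of the mean)."""
    signal = mean(sample) - null_mean
    noise = stdev(sample) / sqrt(len(sample))  # standard error of the mean
    return signal / noise

sample = [5, 6, 7, 8, 9]  # sample mean is 7
print(round(one_sample_t(sample, null_mean=2), 3))  # 7.071
```

Here the signal is 5 and the standard error is about 0.707, so the T-value is about 7.07: the signal stands well clear of the noise.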
What are different types of Hypothesis Testing?
The many forms of hypothesis testing are listed below:
- T-test: Used when the population standard deviation is unknown and the sample size is small.
- Chi-Square Test for Independence: Assesses the association between categorical variables in a population sample.
- Analysis of Variance (ANOVA): Compares mean values across groups. It is used much like a T-test, but for more than two groups.
- Welch's T-test: Determines whether two samples have equal means when their variances cannot be assumed equal.
How to represent a Bayesian Network in the form of Markov Random Fields (MRF)?
Consider the following examples for representing a Bayesian Network as Markov Random Fields:
Consider two variables, A and B, connected by an edge A -> B in a Bayesian network. Its probability distribution factorizes as P(A) * P(B | A). In contrast, if we consider the same network as a Markov Random Field, it is represented by a single potential function over the pair (A, B).
So that was a basic example to begin with. Now, consider a harder situation where one variable is a parent of two others: A is the parent variable, pointing down to B and C. In this scenario, the joint distribution is P(A) * P(B | A) * P(C | A). To transform this into a Markov Random Field, factorize the similarly structured undirected network using potential functions on the A-B and A-C edges. Refer to the image below.
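A tiny numeric check of this equivalence, using invented probabilities for binary variables A, B, and C: folding P(A) into the A-B potential makes the MRF's product of potentials match the Bayesian network's joint distribution exactly.

```python
# Toy Bayesian network A -> B, A -> C with binary variables (numbers invented)
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_a = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # p_c_given_a[a][c]

# MRF potentials on the two edges: fold P(A) into the A-B potential
phi_ab = {(a, b): p_a[a] * p_b_given_a[a][b] for a in (0, 1) for b in (0, 1)}
phi_ac = {(a, c): p_c_given_a[a][c] for a in (0, 1) for c in (0, 1)}

# The product of edge potentials equals the BN joint on every assignment
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            bn = p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]
            mrf = phi_ab[(a, b)] * phi_ac[(a, c)]
            assert abs(bn - mrf) < 1e-12
print("BN and MRF factorizations agree on every assignment")
```

Because P(A) was absorbed into one potential, the partition function here is 1; in general an MRF's potentials only define the distribution up to a normalizing constant.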