Restaurant prediction in Python
# Importing the Libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
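import warnings and warnings.filterwarnings("ignore"):
- The warnings module provides a way to handle warning messages in Python. It is part of the standard library, so no additional installation is required. Calling warnings.filterwarnings("ignore") suppresses all warnings for the rest of the session, which keeps the output clean but can also hide useful deprecation notices.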
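If silencing every warning feels too broad, a narrower filter is possible. The sketch below suppresses a single category and records what still gets through; the choice of FutureWarning and the message strings are assumptions for illustration only.

```python
import warnings

# Suppress only FutureWarning (an illustrative choice); other categories still surface.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                            # start from "show everything"
    warnings.filterwarnings("ignore", category=FutureWarning)  # then hide one category
    warnings.warn("old API", FutureWarning)                    # filtered out
    warnings.warn("check your input", UserWarning)             # still recorded

remaining = [str(w.message) for w in caught]
print(remaining)
```

Using catch_warnings as a context manager also restores the previous filters on exit, so the change does not leak into the rest of the notebook.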
Pre-Processing Steps
#importing the dataframe
df = pd.read_csv('C:\\Users\\welcome\\Documents\\Python Scripts\\Predict_Restaurant.csv')
print(df.head())
df = df.drop('Restaurant ID', axis=1)
df = df.drop('Restaurant Name', axis=1)
df = df.drop('Country Code', axis=1)
df = df.drop('City', axis=1)
df = df.drop('Address', axis=1)
df = df.drop('Locality', axis=1)
df = df.drop('Locality Verbose', axis=1)
df = df.drop('Longitude', axis=1)
df = df.drop('Latitude', axis=1)
df = df.drop('Cuisines', axis=1)
df = df.drop('Currency', axis=1)
The code above removes several columns from the pandas DataFrame, one column per line. This cleans the dataset by discarding columns that are not needed for the prediction task. The same result can be achieved more concisely by passing a list of column names to a single df.drop call.
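A minimal sketch of the single-call version follows. It uses a small made-up DataFrame rather than the real CSV, and the name cols_to_drop is hypothetical.

```python
import pandas as pd

# Toy frame standing in for the restaurant data; values are made up.
toy = pd.DataFrame({
    "Restaurant ID": [1, 2],
    "City": ["A", "B"],
    "Votes": [10, 20],
    "Aggregate rating": [3.5, 4.1],
})

# One call drops every listed column; errors="ignore" skips names that are absent.
cols_to_drop = ["Restaurant ID", "City", "Address"]
toy = toy.drop(columns=cols_to_drop, errors="ignore")
print(list(toy.columns))
```

Keeping the names in one list also makes it easy to review which columns were excluded.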
df
df.shape
(9551, 10)
The df.shape attribute in pandas returns the dimensions of a DataFrame as a tuple of (number of rows, number of columns). Here the DataFrame has 9551 rows and 10 columns.
df.info()
The df.info() method in pandas prints a concise summary of a DataFrame: the index range, each column's name, non-null count and data type, and the total memory usage. Comparing the non-null counts against the number of rows is a quick way to spot columns with missing values.
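To see what the summary looks like without the real dataset, the sketch below runs info() on a tiny made-up frame and captures the text in a buffer. The column values are assumptions chosen so that each column has one missing entry.

```python
import io
import pandas as pd

# Toy data with one missing value per column; values are made up.
toy = pd.DataFrame({
    "Votes": [10, None, 30],
    "Rating text": ["Good", "Poor", None],
})

buffer = io.StringIO()
toy.info(buf=buffer)          # write the summary to a buffer instead of stdout
summary = buffer.getvalue()
print(summary)
```

Each column reports 2 non-null entries out of 3 rows, which is how missing data shows up in this view.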
df.describe()
The df.describe() method in pandas provides a summary of the statistical properties of the numerical (and optionally, categorical) columns in a DataFrame. This summary includes measures such as mean, standard deviation, minimum and maximum values, and quartiles.
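The "optionally, categorical" part is controlled by the include parameter. The following sketch, on made-up data, contrasts the default call with include="all".

```python
import pandas as pd

# Toy data; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "Votes": [10, 20, 30, 40],
    "Has Online delivery": ["Yes", "No", "Yes", "Yes"],
})

numeric_summary = toy.describe()            # numeric columns only by default
full_summary = toy.describe(include="all")  # adds count/unique/top/freq for text columns

print(numeric_summary.loc["mean", "Votes"])
print(full_summary.loc["top", "Has Online delivery"])
```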
# Checking for missing values
df.isnull().sum()
# Checking for duplicated values
df.duplicated().sum()
# Remove rows with missing values in place, i.e. modify the original DataFrame directly
df.dropna(inplace=True)
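The checks and the clean-up step can be tried on a small example. The sketch below uses made-up values and assigns the result to a new variable instead of using inplace=True, which is one possible style and not necessarily the one used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value and one repeated row; values are made up.
toy = pd.DataFrame({
    "Votes": [10, np.nan, 30, 30],
    "Price range": [1, 2, 3, 3],
})

missing_per_column = toy.isnull().sum()   # count of NaN per column
duplicate_rows = toy.duplicated().sum()   # rows identical to an earlier row

# Non-mutating alternative to inplace=True: keep the original and build a cleaned copy.
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)
print(missing_per_column["Votes"], duplicate_rows, cleaned.shape)
```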
df
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df['Price range'].value_counts().plot(kind='pie', autopct = '%.2f')
df['Aggregate rating'].value_counts().plot(kind='pie', autopct = '%.2f')
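# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(df['Aggregate rating'], kde=True, stat='density')
sns.histplot(df['Price range'], kde=True, stat='density')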
sns.barplot(x=df["Rating text"],y=df["Votes"],hue =df["Rating color"])
sns.scatterplot(x=df["Aggregate rating"],y=df["Votes"],hue=df["Price range"])
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['Has Table booking'] = label_encoder.fit_transform(df['Has Table booking'])
df['Has Online delivery'] = label_encoder.fit_transform(df['Has Online delivery'])
df['Is delivering now'] = label_encoder.fit_transform(df['Is delivering now'])
df['Switch to order menu'] = label_encoder.fit_transform(df['Switch to order menu'])
df['Rating color'] = label_encoder.fit_transform(df['Rating color'])
df['Rating text'] = label_encoder.fit_transform(df['Rating text'])
The LabelEncoder converts categorical text data into integer codes, which many machine learning algorithms require. Note, however, that LabelEncoder assigns each unique category an integer based on the sorted order of the labels, which can introduce unintended ordinal relationships between categories. If your categorical variables are nominal (i.e., no inherent order), you might want to use one-hot encoding (pd.get_dummies()) instead.
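The difference between the two encodings is easiest to see side by side. This is a minimal sketch on a made-up column; the category values are assumptions, not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column; values are made up.
toy = pd.DataFrame({"Rating text": ["Good", "Poor", "Excellent", "Good"]})

# Label encoding: one integer per category, assigned in sorted label order.
encoder = LabelEncoder()
codes = encoder.fit_transform(toy["Rating text"])
print(list(encoder.classes_), codes.tolist())

# One-hot encoding: one 0/1 column per category, with no implied order.
onehot = pd.get_dummies(toy, columns=["Rating text"], dtype=int)
print(onehot)
```

With label encoding, "Poor" receives a larger code than "Good" purely because of alphabetical order, which is the kind of accidental ordering the paragraph above warns about.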
df
correlation_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f",linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import r2_score
x = df.drop('Aggregate rating', axis=1)
y = df['Aggregate rating']
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.1,random_state=353)
x_train.head()
y_train.head()
#Running the Linear Regression Model
reg=LinearRegression()
reg.fit(x_train,y_train)
y_pred=reg.predict(x_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
0.44846419965192585
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
reg = LinearRegression()
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")
r2 = r2_score(y_test, y_pred)
print(f"R-squared (R2) Score: {r2:.2f}")
#Building the Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.1,random_state=105)
DTree=DecisionTreeRegressor(min_samples_leaf=.0001)
DTree.fit(x_train,y_train)
y_predict=DTree.predict(x_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_predict)
0.9774319598898318
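The linear model and the decision tree above are scored on different train/test splits (random_state=353 and random_state=105), so their R2 values are not strictly comparable. A hedged sketch of a like-for-like comparison follows; it uses synthetic data from make_regression as a stand-in, so the scores it prints say nothing about the restaurant dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for the restaurant features; it is not the real dataset.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# A single split shared by both models keeps the comparison like-for-like.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

print(scores)
```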
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
DTree = DecisionTreeRegressor(min_samples_leaf=0.0001)
DTree.fit(x_train, y_train)
y_predict = DTree.predict(x_test)
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_predict)
print(f"Mean Squared Error (MSE): {mse:.2f}")
# Calculate the R-squared (R2) score
r2 = r2_score(y_test, y_predict)
print(f"R-squared (R2) Score: {r2:.2f}")
Mean Squared Error (MSE): 0.05
R-squared (R2) Score: 0.98
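MSE and RMSE are related but not interchangeable: RMSE is the square root of MSE and is expressed in the same units as the target. The sketch below shows the relationship on a few made-up ratings; the arrays are illustrative and are not model output from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted ratings, for illustration only.
y_true = np.array([3.0, 4.0, 2.5, 4.5])
y_hat = np.array([2.5, 4.0, 3.0, 4.0])

mse = mean_squared_error(y_true, y_hat)   # mean of squared errors
rmse = np.sqrt(mse)                       # same units as the rating itself

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```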