
Context
The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on these restaurants due to their hectic lifestyles. Online food delivery is a great option for them: it provides good food from their favorite restaurants. FoodHub, a food aggregator company, offers access to multiple restaurants through a single smartphone app.
The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer’s location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.
Objective
The food aggregator company has stored the data of the different orders made by registered customers in their online portal. They want to analyze the data to get a fair idea of the demand for different restaurants, which will help them enhance the customer experience. You are hired as a Data Scientist in this company to help improve their business.
Suggestion
Your task is to ask yourself key questions and answer them to help the company improve its business.

Data Description
The provided data contains information related to each food order. The detailed data dictionary is given below.
- order_id: Unique ID of the order
- customer_id: ID of the customer who ordered the food
- restaurant_name: Name of the restaurant
- cuisine_type: Cuisine ordered by the customer
- cost_of_the_order: Cost of the order
- day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
- rating: Rating given by the customer out of 5 (orders without a rating are marked 'Not given')
- food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant’s order confirmation and the delivery person’s pick-up confirmation.
- delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person’s pick-up confirmation and drop-off confirmation.
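The dataset only contains these two durations already expressed in minutes. For illustration, if the raw timestamps were available, the durations could be derived as follows; the column names below are hypothetical and not part of the provided data.

import pandas as pd

# Hypothetical raw timestamps -- these columns are NOT part of the provided dataset
orders = pd.DataFrame({
    'restaurant_confirmation': pd.to_datetime(['2022-01-01 12:00']),
    'pickup_confirmation': pd.to_datetime(['2022-01-01 12:27']),
    'dropoff_confirmation': pd.to_datetime(['2022-01-01 12:51']),
})

# food_preparation_time: restaurant confirmation -> pick-up confirmation, in minutes
orders['food_preparation_time'] = (
    orders['pickup_confirmation'] - orders['restaurant_confirmation']
).dt.total_seconds() / 60

# delivery_time: pick-up confirmation -> drop-off confirmation, in minutes
orders['delivery_time'] = (
    orders['dropoff_confirmation'] - orders['pickup_confirmation']
).dt.total_seconds() / 60

print(orders[['food_preparation_time', 'delivery_time']])  # 27.0 and 24.0 minutes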
Understanding the structure of the data
Download the Jupyter Notebook source code from the Git repository, as in Part-1 of this series.
After importing all required libraries, let's start exploring the given data by loading it as a pandas DataFrame object.
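For reference, the aliases pd, np, sns, and plt used throughout the code below assume the standard data-analysis imports; the cfg object that supplies the data path comes from the project configuration covered in Part-1 and is not shown here.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt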
# read the data
df = pd.read_csv(f'../{cfg.data.project}')
# returns the first 5 rows
df.head()

How many rows and columns are present in the data?
# Get the dimensions of the data frame
df.shape

What are the datatypes of the different columns in the dataset?
# Use info() to print a concise summary of the DataFrame
df.info()

print(f'We have {df.restaurant_name.nunique()} unique restaurants in the dataset')
print(f'We have {df.cuisine_type.nunique()} unique cuisines in the dataset')
- We have 178 unique restaurants in the dataset
- We have 14 unique cuisines in the dataset
Are there any missing values in the data?
# Check if any of the features are null in the dataset
df.isnull().sum()
There is no missing data in the data frame.

Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed?
We will summarize only the numerical data. Although order_id and customer_id are numeric, we ignore them, since it does not make sense to summarize identifiers.
# Get the description of the dataframe
df[['cost_of_the_order', 'food_preparation_time', 'delivery_time']].describe().T

Observation
- The minimum time it takes for food to be prepared is 20 minutes, the average is ~27 minutes, and the maximum is 35 minutes. The standard deviation is ~4.6 minutes, which means the spread is not huge.
- The fastest delivery time is 15 minutes, the average delivery time is ~24 minutes, and the longest is 33 minutes. The standard deviation is ~5 minutes, so again the spread is not huge.
- The minimum cost of an order is around 4.47 dollars and the maximum is around 35.41 dollars. The standard deviation is around 7.48 dollars, which denotes a larger spread relative to the mean.
How many orders are not rated?
# Group the dataframe by rating and determine the percentage of orders that were not rated
print(df.groupby('rating')['rating'].count())
print(f"Total percentage of unrated orders: {len(df[df['rating'] == 'Not given'])/len(df) * 100:.2f}%")

Exploratory Data Analysis (EDA) – Key Points
- Data Distribution Analysis: Examine the data’s distribution characteristics (such as normality or skewness) using tools like histograms, box plots, and statistical summaries to gain insights into central tendencies and variability.
- Handling Missing Values: Crucial for robust analysis, this involves strategies for managing missing data, which could include imputation, deletion, or investigating the causes of missingness.
- Outlier Investigation: Identify and scrutinize outliers to assess their influence on the data set and determine appropriate responses, whether it be removal or transformation.
- Correlation Exploration: Investigate the relationships between variables through correlation coefficients and scatter plots to pinpoint linkages and potential dependencies.
- Pattern and Trend Analysis: Search for and analyze patterns, trends, or anomalies using tools like line graphs, bar charts, or time-series methods to better understand the data.
- Group Comparative Analysis: Conduct comparisons across different categories, such as groups or time periods, to uncover notable differences or similarities.
- Data Type Evaluation: Understand the nature of data involved (be it numerical, categorical, or ordinal) and how each type should be appropriately handled in analyses.
- Data Quality Review: Assess the quality of the data to identify and rectify any errors or inconsistencies that may be present.
- Visual Data Exploration: Utilize a range of visualization techniques, including heat-maps and pair plots, to intuitively explore and understand complex relationships within the data.
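As a minimal example of the correlation and visual-exploration points above, a heat-map over the numerical columns of this dataset could look like the sketch below (this is illustrative and not part of the original notebook).

# Sketch: correlation heat-map of the numerical columns, assuming df is already loaded
num_cols = ['cost_of_the_order', 'food_preparation_time', 'delivery_time']
sns.heatmap(df[num_cols].corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation between numerical features')
plt.show()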
Univariate Analysis
Helper functions in analysis
def histogram_boxplot(feature, figsize=(15, 10), bins="auto"):
    """
    Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15, 10))
    bins: number of bins (default "auto")
    """
    f, (ax_box, ax_hist) = plt.subplots(
        nrows=2,            # Number of rows of the subplot grid
        sharex=True,        # The X-axis will be shared among all the subplots
        gridspec_kw={"height_ratios": (.25, .75)},
        figsize=figsize
    )  # Creating the subplots
    # Boxplot will be created and the mean value of the column will be indicated using a symbol
    sns.boxplot(x=feature, ax=ax_box, showmeans=True, color='red')
    # For the histogram
    sns.histplot(x=feature, kde=True, ax=ax_hist, bins=bins)
    ax_hist.axvline(np.mean(feature), color='g', linestyle='--')       # Add mean to the histogram
    ax_hist.axvline(np.median(feature), color='black', linestyle='-')  # Add median to the histogram
    print(f'Mean: {np.mean(feature):.2f} and Median: {np.median(feature):.2f}')
    plt.show()

def bar_perc(data_org, z):
    """
    Count/Bar plot with percentage of data
    data_org: Dataset to be plotted
    z: feature or column of interest to be plotted
    """
    # Since there is data manipulation here, we need to copy the data
    data = data_org.copy()
    total = len(data[z])  # Length of the column
    plt.figure(figsize=(15, 5))
    # Convert the column to a categorical data type
    data[z] = data[z].astype('category')
    ax = sns.countplot(x=z, data=data, palette='Paired', order=data[z].value_counts().index)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # Percentage of each class
        x = p.get_x() + p.get_width() / 2 - 0.05                     # X position of the annotation
        y = p.get_y() + p.get_height()                               # Y position of the annotation
        ax.annotate(percentage, (x, y), size=12)                     # Annotate the percentage
    plt.xticks(rotation=45)
    plt.show()
Cost of the Order
histogram_boxplot(df.cost_of_the_order)
Observation:
- The distribution of cost of the order is right-skewed.
- The majority of the order prices are between 10 and 16 dollars.
- For one particular order value (around 11 dollars), the count is as high as 350 orders.
- The median of the cost of the order is around 14 dollars.
- There are no outliers in the dataset.

Food Preparation Time
histogram_boxplot(df.food_preparation_time)
Observation:
- The distribution is roughly symmetrical.
- The median and mean are close.
- There are no outliers in the dataset.

Food Delivery Time
histogram_boxplot(df.delivery_time)
Observation:
- Interestingly, the median lies to the right of the mean, which means the data is slightly left-skewed.
- On average, the delivery time is about 24 minutes.
- We could have had a slightly better understanding if we had distance as a feature.

Restaurant Analysis
# Number of restaurants in the dataset with less than 5 orders
least_order = df['restaurant_name'].value_counts().sort_values()
print(f'\nLeast Ordered/Popular restaurants: {least_order[least_order < 5].count()}')
print(f'Restaurants with just one order: {least_order[least_order == 1].count()}')
# Divide by the number of restaurants (not the number of orders) to get the share of restaurants
print(f'Percentage of restaurants with less than 5 orders: {least_order[least_order < 5].count() / len(least_order) * 100:.2f}%')

# Restaurant distribution by type of cuisine
rest_list = list(dict(least_order[least_order < 5]).keys())
df_least_rest = df[df['restaurant_name'].isin(rest_list)]
bar_perc(df_least_rest, 'cuisine_type')
Observations: We have around 109 restaurants with less than 5 orders (about 61% of the 178 restaurants) and 50 restaurants with just one order.


Order of the Day
# Pie chart to visualize the number of orders based on the day of the week, i.e. weekday or weekend
df_pie = pd.DataFrame({
    'order_day': ['Weekday', 'Weekend'],
    'Order Count': [len(df[df['day_of_the_week'] == 'Weekday']),
                    len(df[df['day_of_the_week'] == 'Weekend'])]
})
df_pie.set_index('order_day', inplace=True)
df_pie.plot.pie(y='Order Count', title='Orders based on the day of the week', legend=False,
                autopct='%1.1f%%', explode=(0, 0.1), shadow=True, startangle=90)
plt.show()

Which are the top 5 restaurants in terms of the number of orders received?
# Top 5 restaurant names based on orders received
df['restaurant_name'].value_counts().head(5)

# Data visualization for the top 5 restaurants
df['restaurant_name'].value_counts().head(5).plot(kind='bar')
plt.title('Top 5 restaurants')

# Cuisine type of the top restaurant, i.e. Shake Shack
df[df['restaurant_name'] == 'Shake Shack'].groupby('cuisine_type').size().sort_values(ascending=False).head(1)


Which is the most popular cuisine on weekends vs weekdays?
# Compare cuisine types based on the day of the week
sns.countplot(x='cuisine_type', data=df, hue='day_of_the_week')
plt.xticks(rotation=90)
plt.show()
- American is the most popular cuisine on weekends.
- Next come Japanese and Italian, followed by Chinese.

What percentage of the orders cost more than 20 dollars?
# Percentage of orders costing more than $20 = (Number of orders > $20 / Total number of orders) * 100
count_less_than_20 = len(df[df['cost_of_the_order'] <= 20])
count_more_than_20 = len(df[df['cost_of_the_order'] > 20])
order_percent = (count_more_than_20 / df['cost_of_the_order'].count()) * 100
print(f"Percentage of orders costing more than 20 dollars: {order_percent}")

# Visual representation (pie chart) of orders costing more vs. less than 20 dollars
df_pie = pd.DataFrame({'dollar_category': ['Less than $20', 'More than $20'],
                       'orders': [count_less_than_20, count_more_than_20]})
df_pie.set_index('dollar_category', inplace=True)
df_pie.plot.pie(y='orders', title='Orders split above and below $20', legend=False,
                autopct='%1.1f%%', explode=(0, 0.1), shadow=True, startangle=90)
plt.show()

- The majority of the orders cost less than $20
- We have about 30% of the orders costing more than $20
Multivariate Analysis
Perform a multivariate analysis to explore relationships between the important variables in the dataset
# Compare food preparation time across cuisine types for weekdays vs. weekends
sns.catplot(data=df, x="day_of_the_week", y="food_preparation_time",
            hue="cuisine_type", kind='box', palette='bright')
- Food preparation time for the same type of cuisine varies between weekdays and weekends.

# Observations on delivery time
sns.violinplot(data=df, x='day_of_the_week', y='delivery_time', orient='v')
- Weekday delivery takes longer than weekend delivery.
- This may be due to traffic conditions during weekdays.

# Top 5 customers' preference in cuisine type
top_cust = df['customer_id'].value_counts().head(5)
cust_list = list(dict(top_cust).keys())
df_top_cust = df[df['customer_id'].isin(cust_list)]
print(df_top_cust['customer_id'].value_counts())
df_top_cust.hist(by='customer_id', column='cuisine_type', figsize=(7, 12))
- Interestingly, all top 5 customers prefer American food.
- Customer 52832 has ordered food from the most cuisines.
- The other customers have a strong preference for a few cuisines over the rest.
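As an alternative to the per-customer histograms above, a crosstab shows the same cuisine breakdown per top customer in tabular form; this is a sketch using the same column names and is not part of the original notebook.

# Sketch: cuisine preference of the top 5 customers as a table
top5 = df['customer_id'].value_counts().head(5).index
df_top5 = df[df['customer_id'].isin(top5)]
print(pd.crosstab(df_top5['customer_id'], df_top5['cuisine_type']))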

The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer
# Remove the entries where a rating was not given
df_valid_rating = df[df['rating'] != 'Not given']

# Pick the restaurants with more than 50 rated orders
top_restaurants = df_valid_rating['restaurant_name'].value_counts()
print(f'\nTop Restaurants with more than 50 valid ratings:\n{top_restaurants[top_restaurants > 50]}\n')
restaurant_list = list(dict(top_restaurants[top_restaurants > 50]).keys())

# Get the dataset for only these top restaurants with valid rating entries
df_top_restaurants = df_valid_rating[df_valid_rating['restaurant_name'].isin(restaurant_list)]

# Copy the dataset to a new dataframe and convert the rating to an integer
df_top_rated_restaurants = df_top_restaurants.copy()
df_top_rated_restaurants['rating'] = df_top_rated_restaurants['rating'].astype(int)

for restaurant in restaurant_list:
    df_restaurant = df_top_rated_restaurants[df_top_rated_restaurants['restaurant_name'] == restaurant]
    print(f"Restaurant: '{restaurant}' has average rating: {df_restaurant['rating'].mean()}")

# Visual representation of the dataset
ax = sns.swarmplot(x="restaurant_name", y="rating", data=df_top_rated_restaurants, size=1)
df_means = df_top_rated_restaurants.groupby("restaurant_name")["rating"].agg("mean").reset_index()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
sns.scatterplot(x="restaurant_name", y="rating", marker='X', color='black', s=100,
                zorder=3, ax=ax, legend=False, data=df_means)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
plt.xticks(rotation=45)
plt.show()
- All four restaurants with more than 50 valid customer ratings have an average rating above 4.
- The restaurant ‘The Meatball Shop’ has the highest average rating of 4.5, with at least 84 registered ratings.
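For comparison, a more compact way to arrive at the same shortlist, assuming the same column names, is a single groupby aggregation; this is a sketch rather than the notebook's original approach.

# Sketch: restaurants with more than 50 ratings and an average rating above 4
rated = df[df['rating'] != 'Not given'].copy()
rated['rating'] = rated['rating'].astype(int)
summary = rated.groupby('restaurant_name')['rating'].agg(['count', 'mean'])
print(summary[(summary['count'] > 50) & (summary['mean'] > 4)])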


The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars (and up to 20 dollars). Find the net revenue generated by the company across all orders.
# 25% commission on orders costing more than 20 dollars
revenue_25_percent = ((df[df['cost_of_the_order'] > 20])['cost_of_the_order'] * 0.25).sum()

# 15% commission on orders costing more than 5 dollars (and up to 20 dollars)
df_greater_5 = df[df['cost_of_the_order'] > 5]
revenue_15_percent = ((df_greater_5[df_greater_5['cost_of_the_order'] <= 20])['cost_of_the_order'] * 0.15).sum()

# Net revenue generated by the company across all orders
print(f'Net revenue generated by the company across all orders: {revenue_25_percent + revenue_15_percent}')

def autopct_format(values):
    def my_format(pct):
        total = sum(values)
        val = pct * total / 100.0
        return '{:.2f}%\n(${:.2f})'.format(pct, val)
    return my_format

df_pie = pd.DataFrame({'key': ['More than $20', 'Between $5 and $20'],
                       'revenue': [revenue_25_percent, revenue_15_percent]})
df_pie.set_index('key', inplace=True)
pie_values = [i for i in df_pie['revenue']]
df_pie.plot.pie(y='revenue', title=f'Net Revenue ${revenue_25_percent + revenue_15_percent}',
                legend=False, autopct=autopct_format(pie_values), explode=(0, 0.1),
                shadow=True, startangle=90)
plt.show()
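An equivalent, fully vectorized way to compute the commission, assuming the same column name, is to map each order to its rate with numpy.select and cross-check the total; this is a sketch rather than the notebook's original code.

import numpy as np

# Sketch: map each order to its commission rate, then sum the revenue
cost = df['cost_of_the_order']
rate = np.select([cost > 20, cost > 5], [0.25, 0.15], default=0.0)
print(f'Net revenue (vectorized check): {(cost * rate).sum():.2f}')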

The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.)
# Add a new column to the dataframe for the total time taken from order confirmation to delivery
df['total_time'] = df['food_preparation_time'] + df['delivery_time']
print(f"Percentage of orders taking more than 60 minutes: {len(df[df['total_time'] > 60]) / len(df) * 100}")

df_pie = pd.DataFrame({'key': ['More than 60 minutes', 'Less than 60 minutes'],
                       'data': [len(df[df['total_time'] > 60]), len(df[df['total_time'] <= 60])]})
print(df_pie)

def autopct_format(values):
    def my_format(pct):
        total = sum(values)
        val = pct * total / 100.0
        return '{:.2f}%\n({:.2f})'.format(pct, val)
    return my_format

df_pie.set_index('key', inplace=True)
pie_values = [i for i in df_pie['data']]
df_pie.plot.pie(y='data', title='Orders that took more than 60 minutes', legend=False,
                autopct=autopct_format(pie_values), explode=(0, 0.1), shadow=True, startangle=45)
plt.show()


Orders with total time greater than 60 minutes, categorized by day of the week.
df_more_than_60 = df[df['total_time'] > 60]
df_pie = pd.DataFrame({
    'order_day': ['Weekday', 'Weekend'],
    'Order Count': [len(df_more_than_60[df_more_than_60['day_of_the_week'] == 'Weekday']),
                    len(df_more_than_60[df_more_than_60['day_of_the_week'] == 'Weekend'])]
})
df_pie.set_index('order_day', inplace=True)
df_pie.plot.pie(y='Order Count', title='Orders total time > 60 based on the day of the week',
                legend=False, autopct='%1.1f%%', explode=(0, 0.1), shadow=True, startangle=90)
plt.show()

Conclusion
We have analyzed a dataset of 1898 orders covering 178 unique restaurants and 14 distinct cuisines. The data is categorized into weekday and weekend orders. The primary goal of this analysis is to find the most popular restaurants and cuisines based on the orders and the ratings provided by the customers.
- The analysis shows that the most popular restaurant is “Shake Shack” (American cuisine) with 219 orders, and the most popular cuisine is “American” with 368 total orders.
- There are around 109 restaurants with less than 5 orders and 50 restaurants with just one order.
- The delivery service is quite popular during the weekends (71.2%) compared to weekdays (28.8%).
- The food preparation time and delivery time are very similar across restaurants, with the mean and median very close.
- Unfortunately, we don’t have a customer rating for almost 38.78% of the orders.
- Higher profit-margin orders (price above $20) make up about 30% of all orders but contribute a disproportionately large share of the company’s revenue compared to the lower-margin orders.
Recommendations
- Around 61% of the restaurants (109 of 178) have fewer than 5 orders in total. These restaurants may not be worth retaining if they have been on the platform for a long time; alternatively, they could offer promotions to attract more orders.
- Customers love “American” cuisine; explore adding more American restaurants to the platform.
- Orders during weekdays are far fewer than on weekends. Explore the reason behind this: is it because fewer customers order food on weekdays, fewer delivery drivers are available, or fewer restaurants participate in the service?
- Customers are willing to spend more than $20 on food, which also yields a higher revenue margin. Consider adding more restaurants in the higher price range.
- Encourage customers to leave ratings through promotions such as discounts, so that recommendations based on ratings can be improved.