Eating out in Lyon - Data analysis
Analyzing Yelp restaurant ratings to find delicious pizza.
June 10, 2022
The code and data for this project are available on GitHub.
Summary
This is the second part of a data analysis project focused on finding a delicious pizza in my hometown of Lyon. After gathering and cleaning the data in the first part, I now turn to the analysis itself. The exploratory analysis looks mainly at the mean price and rating of restaurants across city districts and restaurant categories. I then go looking for outstanding restaurants, i.e., restaurants rated higher than restaurants of the same category and district, using the residuals of a linear model.
Findings
Restaurant properties vary by city district:
- The historical center (Lyon 1 and 2) has the most restaurants, as well as the most reviews, which makes sense given its touristic appeal.
- Restaurants in fancy neighborhoods (Lyon 5 and 6) are the most expensive on average, but this is not reflected in higher ratings.
- Restaurants have better ratings on average in Lyon 7.
Here are the best pizzerias by district, as identified through their residuals:
- Lyon 1: Le Jardin des Pentes
- Lyon 2: Casa Nobile
- Lyon 3: Le Ferrari
- Lyon 4: Chez Puce
- Lyon 5: Al Dente
- Lyon 6: Neroliva
- Lyon 7: Le Vivaldi - Nicolo e Maria
- Lyon 8: Pizza Lina
- Lyon 9: Domeva Caffé
Analysis
Let’s first import some libraries and take a look at the data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('data/restaurants.csv')
data[data.columns[:8]].head()
| | id | alias | name | review_count | rating | price | coordinates.latitude | coordinates.longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | D3NHTerar80aeR6mlyE2mw | azur-afghan-lyon | Azur Afghan | 23 | 4.0 | €€ | 45.775020 | 4.828750 |
| 1 | ee4wtKIBI_yTz0fJD054pg | tendance-afghane-lyon | Tendance Afghane | 1 | 3.0 | NaN | 45.759540 | 4.825560 |
| 2 | zmk41IUwIkvO_eM0UGD7Sg | sufy-lyon | Sufy | 2 | 3.5 | NaN | 45.752212 | 4.864384 |
| 3 | Vo0U5EcXbh7qlpdaQwZchA | le-conakry-lyon | Le Conakry | 9 | 4.0 | €€ | 45.750642 | 4.849127 |
| 4 | -mFHJBuCxZJ_wJrO-o2Ypw | afc-africa-food-concept-lyon | AFC Africa Food Concept | 8 | 3.5 | €€ | 45.754336 | 4.843469 |
data[data.columns[8:]].head()
| | location.address1 | location.address2 | location.address3 | location.zip_code | location.country | location.display_address |
|---|---|---|---|---|---|---|
| 0 | 6 Rue Villeneuve | NaN | NaN | 69004 | FR | ['6 Rue Villeneuve', '69004 Lyon', 'France'] |
| 1 | 25 Rue Tramassac | NaN | NaN | 69005 | FR | ['25 Rue Tramassac', '69005 Lyon', 'France'] |
| 2 | 34 rue Jeanne Hachette | NaN | NaN | 69003 | FR | ['34 rue Jeanne Hachette', '69003 Lyon', 'Fran... |
| 3 | 112 Grande rue de la Guillotière | NaN | NaN | 69007 | FR | ['112 Grande rue de la Guillotière', '69007 Ly... |
| 4 | 14 Grande rue de la Guillotière | NaN | NaN | 69007 | FR | ['14 Grande rue de la Guillotière', '69007 Lyo... |
Review count and ratings distribution
Let’s take a look at the review count and ratings distributions:
# review count distribution
sns.displot(data = data, x = "review_count", bins = 50);
# mapping the price symbols to a numeric scale for later aggregation
data['price_num'] = data['price'].map({'€': 1, '€€': 2, '€€€': 3, '€€€€': 4})
# ratings distribution
sns.countplot(data = data, x = "rating");
The review counts are exponentially distributed, meaning that a large share of restaurants have only a handful of reviews. The ratings are strongly bimodal: the bulk of the distribution is centered around 4, but numerous restaurants are rated 0 stars. Because ratings based on few reviews might not be very informative, let's drop all restaurants with fewer than 10 reviews:
data = data[data['review_count'] >= 10]
sns.countplot(data = data, x = "rating");
Interestingly, ratings of 0 or 1 star only occurred for restaurants with few reviews, so we are now left with a unimodal ratings distribution.
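As a quick sanity check (my addition, rerun on the raw CSV rather than part of the original analysis), we can verify that no restaurant rated 1 star or lower has 10 or more reviews:
# sanity check: in the raw data, the highest review count among
# restaurants rated 0 or 1 star should be below 10
raw = pd.read_csv('data/restaurants.csv')
print(raw.loc[raw['rating'] <= 1, 'review_count'].max())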
Restaurant properties by district
Let’s now take a look at how variables of interest differ between districts. Using geopandas, I’ll look at the number of restaurants, mean rating, mean price and median number of reviews by district.
# importing spatial libraries
import geopandas
import geodatasets
import contextily as cx
from shapely.geometry import Point
# importing a geojson file describing the city's districts as a geodataframe called 'lyon_map'
# obtained from https://data.grandlyon.com/jeux-de-donnees/arrondissements-lyon/info
lyon_map = geopandas.read_file('data/adr_voie_lieu.adrarrond.geojson')
# adding zip codes to be able to merge with the data
lyon_map['location.zip_code'] = lyon_map['nomreduit'].apply(lambda x:('6900' + x[-1])).astype('int')
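# note: 'nomreduit' has the form 'Lyon 1' ... 'Lyon 9', so prefixing '6900'
# to its last character yields the district zip codes 69001 to 69009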
lyon_map
| | nom | nomreduit | insee | datemaj | trigramme | gid | geometry | location.zip_code |
|---|---|---|---|---|---|---|---|---|
| 0 | Lyon 1er Arrondissement | Lyon 1 | 69381 | 1997-10-22 00:00:00+00:00 | LY1 | 128 | POLYGON ((4.83049 45.76454, 4.83125 45.76484, ... | 69001 |
| 1 | Lyon 9e Arrondissement | Lyon 9 | 69389 | 2005-07-19 00:00:00+00:00 | LY9 | 181 | POLYGON ((4.81088 45.78099, 4.81145 45.78177, ... | 69009 |
| 2 | Lyon 2e Arrondissement | Lyon 2 | 69382 | 1997-10-22 00:00:00+00:00 | LY2 | 59 | POLYGON ((4.81782 45.72649, 4.81868 45.72660, ... | 69002 |
| 3 | Lyon 4e Arrondissement | Lyon 4 | 69384 | 2003-05-12 00:00:00+00:00 | LY4 | 29 | POLYGON ((4.81856 45.78944, 4.81817 45.78914, ... | 69004 |
| 4 | Lyon 3e Arrondissement | Lyon 3 | 69383 | 2005-07-19 00:00:00+00:00 | LY3 | 125 | POLYGON ((4.83901 45.75660, 4.83956 45.75643, ... | 69003 |
| 5 | Lyon 5e Arrondissement | Lyon 5 | 69385 | 2005-07-19 00:00:00+00:00 | LY5 | 53 | POLYGON ((4.81353 45.74819, 4.81357 45.74823, ... | 69005 |
| 6 | Lyon 7e Arrondissement | Lyon 7 | 69387 | 2000-03-30 00:00:00+00:00 | LY7 | 189 | POLYGON ((4.83770 45.70737, 4.83894 45.70763, ... | 69007 |
| 7 | Lyon 8e Arrondissement | Lyon 8 | 69388 | 2011-03-03 00:00:00+00:00 | LY8 | 42 | POLYGON ((4.84879 45.71885, 4.84881 45.71879, ... | 69008 |
| 8 | Lyon 6e Arrondissement | Lyon 6 | 69386 | 2005-07-19 00:00:00+00:00 | LY6 | 45 | POLYGON ((4.86994 45.76373, 4.86986 45.76384, ... | 69006 |
To obtain data by district, I group the restaurant data by 'location.zip_code' before merging it with the geodataframe 'lyon_map':
# merging lyon_map with the data aggregated by zip code to get mean and median rating, price and review counts:
lyon_map = lyon_map.merge(
data.groupby('location.zip_code')[['review_count', 'rating', 'price_num']].agg([np.mean, np.median]),
on = 'location.zip_code',
how = 'left')
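# note: the .agg([np.mean, np.median]) call above creates MultiIndex columns
# such as ('rating', 'mean'); the plotting code below selects them as tuples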
# merging lyon_map with the data aggregated by zip code to get the restaurant count by district:
lyon_map = lyon_map.merge(
data.groupby('location.zip_code')['id'].count().reset_index(name="restaurants_count"),
on = 'location.zip_code',
how = 'left')
I finally plot the variables of interest aggregated by district:
# changing the coordinates reference system to a more visually appealing one:
lyon_map = lyon_map.to_crs(epsg = 3857)
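# EPSG:3857 (Web Mercator) is also the projection used by the contextily
# basemap tiles added below, so the polygons and the tiles line up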
fig, ax = plt.subplots(2, 2, figsize = (15, 12))
# plotting number of restaurants
lyon_map.plot(ax = ax[0, 0], alpha = 0.5, edgecolor = "k", column = 'restaurants_count', legend = True)
lyon_map.apply(lambda x: ax[0, 0].annotate(text=x['location.zip_code'],
xy=x.geometry.centroid.coords[0],
ha='center'), axis=1)
cx.add_basemap(ax[0, 0], zoom = 12, source = cx.providers.Esri.WorldGrayCanvas)
ax[0, 0].set_title('Number of restaurants')
# plotting mean rating
lyon_map.plot(ax = ax[0, 1],alpha = 0.5, edgecolor = "k",
column = ('rating', 'mean'), legend = True)
lyon_map.apply(lambda x: ax[0, 1].annotate(text=x['location.zip_code'],
xy=x.geometry.centroid.coords[0],
ha='center'), axis=1)
cx.add_basemap(ax[0, 1], zoom = 12, source = cx.providers.Esri.WorldGrayCanvas)
ax[0, 1].set_title('Mean rating')
# plotting median review count
lyon_map.plot(ax = ax[1, 0], alpha = 0.5, edgecolor = "k", column = ('review_count', 'median'), legend = True)
lyon_map.apply(lambda x: ax[1, 0].annotate(text=x['location.zip_code'],
xy=x.geometry.centroid.coords[0],
ha='center'), axis=1)
cx.add_basemap(ax[1, 0], zoom = 12, source = cx.providers.Esri.WorldGrayCanvas)
ax[1, 0].set_title('Median number of reviews')
# plotting mean price
lyon_map.plot(ax = ax[1, 1], alpha = 0.5, edgecolor = "k", column = ('price_num', 'mean'), legend = True)
lyon_map.apply(lambda x: ax[1, 1].annotate(text=x['location.zip_code'],
xy=x.geometry.centroid.coords[0],
ha='center'), axis=1)
cx.add_basemap(ax[1, 1], zoom = 12, source = cx.providers.Esri.WorldGrayCanvas)
ax[1, 1].set_title('Mean price')
# removing ticks for better visibility
for a in ax.flat:
    a.set_xticks([])
    a.set_yticks([])
fig.tight_layout()
- The historical center (Lyon 1 and 2) has the most restaurants, as well as the most reviews, which makes sense given its touristic appeal.
- Restaurants in fancy neighborhoods (Lyon 5 and 6) are the most expensive on average, but this is not reflected in higher ratings.
- Restaurants have better ratings on average in Lyon 7, but this effect is rather small.
Restaurant properties by category
Let's now take a look at how restaurant properties vary by category.
# Loading the dataframe of categories
categories_df = pd.read_csv('data/categories.csv')
# taking a look at the distribution of the number of restaurants by category
ax = sns.histplot(categories_df.drop('id', axis = 1).sum())
ax.set_xlabel('Number of restaurants');
Most categories contain very few restaurants and are therefore not very informative. For the remainder of the analysis, I keep only the most frequent categories.
# dropping all categories with fewer than 100 restaurants
sorted_categories = categories_df.drop('id', axis = 1).sum().sort_values(ascending = False)
# saving a vector of the top categories
top_categories = sorted_categories[sorted_categories >= 100]
# and a vector of all other categories to filter categories_df
non_top_categories = sorted_categories[sorted_categories < 100]
# dropping rare categories
categories_df.drop(non_top_categories.index, axis = 1, inplace = True)
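For context, the exact layout of categories.csv is not shown here, but the way the code indexes it suggests a wide indicator table: one row per restaurant 'id' and one 0/1 column per category alias. A quick peek would confirm this inferred structure:
# peeking at the (inferred) one-hot structure of the categories dataframe
categories_df.head()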
To check how our variables of interest depend on categories, I create a dataframe with the names of the most common categories, then compute the restaurant count, mean rating, mean price, and median review count for each category.
# creating a dataframe of the most frequent categories ('alias') and the number of categories for each ('count')
top_categories = pd.DataFrame(top_categories)
top_categories.reset_index(inplace = True)
top_categories.columns = ['alias', 'count']
# computing the mean rating by category
top_categories["mean_rating"] = top_categories['alias'].apply(lambda x: data['rating'][categories_df[x] > 0].mean())
# computing the median number of reviews
top_categories["median_reviews"] = top_categories['alias'].apply(lambda x: data['review_count'][categories_df[x] > 0].median())
# computing the mean price
top_categories["mean_price"] = top_categories['alias'].apply(lambda x: data['price_num'][categories_df[x] > 0].mean())
# taking a look at the resulting dataset:
top_categories
| | alias | count | mean_rating | median_reviews | mean_price |
|---|---|---|---|---|---|
| 0 | french | 730.0 | 3.747368 | 30.0 | 2.553571 |
| 1 | hotdogs | 308.0 | 3.366667 | 19.0 | 1.750000 |
| 2 | pizza | 255.0 | 3.561404 | 20.0 | 2.017544 |
| 3 | italian | 174.0 | 3.571429 | 20.5 | 2.115942 |
| 4 | sandwiches | 164.0 | 3.557692 | 20.0 | 1.576923 |
| 5 | cafes | 159.0 | 3.833333 | 30.0 | 1.974359 |
| 6 | bistros | 139.0 | 3.957143 | 24.0 | 2.314286 |
| 7 | brasseries | 134.0 | 3.346154 | 28.0 | 2.435897 |
| 8 | burgers | 123.0 | 3.337209 | 34.0 | 1.906977 |
| 9 | japanese | 113.0 | 3.661017 | 31.0 | 2.237288 |
| 10 | lyonnais | 100.0 | 3.701613 | 40.0 | 2.677419 |
Finally, let’s take a look at the differences between categories in the form of bar charts:
fig, ax = plt.subplots(2, 2, figsize = (12, 12))
sns.barplot(ax = ax[0, 0], data = top_categories, x = 'count', y = 'alias')
ax[0, 0].set_xlabel('Number of restaurants')
ax[0, 0].set_ylabel('Category')
sns.barplot(ax = ax[0, 1], data = top_categories, x = 'mean_rating', y = 'alias',
order = top_categories.sort_values('mean_rating', ascending = False).alias)
ax[0, 1].set_xlabel('Mean rating')
ax[0, 1].set_ylabel('Category')
sns.barplot(ax = ax[1, 0], data = top_categories, x = 'median_reviews', y = 'alias',
order = top_categories.sort_values('median_reviews', ascending = False).alias)
ax[1, 0].set_xlabel('Median number of reviews')
ax[1, 0].set_ylabel('Category')
sns.barplot(ax = ax[1, 1], data = top_categories, x = 'mean_price', y = 'alias',
order = top_categories.sort_values('mean_price', ascending = False).alias)
ax[1, 1].set_xlabel('Mean price')
ax[1, 1].set_ylabel('Category')
fig.tight_layout();
Finding the best pizza in town
I now want to identify “good” restaurants, i.e., restaurants that are better rated than similar restaurants. To do so, let's fit a simple linear model that predicts a restaurant's rating from its zip code, categories, and price. Such a simple model will of course be fairly bad at predicting the rating of any individual restaurant, but that is not the goal: it only needs to capture the average effect of a restaurant's district, categories, and price. The residuals of the model (the difference between the true rating and the predicted rating) then tell us how well a restaurant performs compared to other restaurants with similar properties.
# building and fitting a simple linear regression model to predict the rating
from sklearn.linear_model import LinearRegression
# ratings as the target variable
y = data['rating']
# keeping zip code, price, and categories as predictor variables
X = data[['id', 'location.zip_code']].merge(categories_df, how = 'left', on = 'id')
X = pd.concat([X, pd.get_dummies(data['price']).reset_index()], axis = 1).drop(['id', 'index'], axis = 1)
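# note: pd.get_dummies ignores NaN by default, so restaurants with a missing
# price simply get all-zero price columns here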
X = X.values
# initializing and fitting the model
lm = LinearRegression()
lm.fit(X, y);
# getting predictions
data['predicted_rating'] = lm.predict(X)
# computing the residuals
data['residual'] = data['rating'] - data['predicted_rating']
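# a quick check (my addition, not in the original post): R² on the training
# data shows how much of the rating variance this simple model explains
print(f"R^2 = {lm.score(X, y):.3f}")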
# selecting, for each zip code, the pizzeria with the highest residual
best_pizza = data.loc[data[categories_df["pizza"] == 1].groupby('location.zip_code')['residual'].idxmax()]
# taking a quick look at the selected pizzerias:
best_pizza[['name', 'rating', 'price', "location.zip_code", 'residual']]
| | name | rating | price | location.zip_code | residual |
|---|---|---|---|---|---|
| 2262 | Le Jardin des Pentes | 4.0 | € | 69001 | 0.447680 |
| 1773 | Casa Nobile | 4.5 | €€ | 69002 | 0.987304 |
| 1823 | Le Ferrari | 4.5 | €€€ | 69003 | 0.957764 |
| 2274 | Chez Puce | 4.5 | €€ | 69004 | 0.882967 |
| 1804 | Al Dente | 4.0 | €€ | 69005 | 0.463095 |
| 1786 | Neroliva | 4.5 | €€ | 69006 | 0.955026 |
| 1790 | Le Vivaldi - Nicolo e Maria | 4.5 | €€ | 69007 | 0.946956 |
| 2282 | Pizza Lina | 4.0 | €€ | 69008 | 0.350689 |
| 1808 | Domeva Caffé | 4.5 | €€ | 69009 | 0.930817 |
Finally, we can plot the selected pizzerias on a map:
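# converting best_pizza to a GeoDataFrame, building point geometries from the
# longitude/latitude columns (in WGS84, i.e. EPSG:4326)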
best_pizza = geopandas.GeoDataFrame(
best_pizza,
geometry = geopandas.points_from_xy(best_pizza['coordinates.longitude'], best_pizza['coordinates.latitude'],
crs = 'EPSG:4326')
)
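# reprojecting the points to lyon_map's CRS (EPSG:3857) so they overlay
# correctly on the district polygons and the basemap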
best_pizza.to_crs(str(lyon_map.crs), inplace = True)
ax = lyon_map.plot(edgecolor = "teal", facecolor = 'none', figsize=(15, 15), linewidth=2)
best_pizza.plot(ax = ax)
best_pizza.apply(lambda x: ax.annotate(text=x['name'],
xy=x.geometry.centroid.coords[0],
ha='center',
xytext=(-5, -15) if x['location.zip_code'] in [69005, 69007] else (-5, 20),
textcoords="offset points",
weight = 'bold',
fontsize = 12,
backgroundcolor='1'), axis=1)
cx.add_basemap(ax, zoom = 13, crs = best_pizza.crs,
source = cx.providers.Esri.WorldTopoMap);