Eating out in Lyon - Data analysis

Analyzing yelp restaurant ratings to find delicious pizza.

June 10, 2022

The code and data for this project are available on GitHub.

Summary

This is the second part of a data analysis project focused on finding a delicious pizza in my hometown of Lyon. After gathering and cleaning the data in the first part, I now turn to the analysis itself. In the exploratory data analysis, I look mainly at the mean price and ratings of restaurants across city districts and restaurant categories. I then go looking for outstanding restaurants, i.e., restaurants with a higher rating than restaurants of the same category and district, using the residuals of a linear model.

Findings

Restaurant properties vary by city district:

  • The historical center (Lyon 1 and 2) has the most restaurants, as well as the most reviews, which makes sense given its touristic appeal.
  • Restaurants in fancy neighborhoods (Lyon 5 and 6) are the most expensive on average, but this is not reflected by higher ratings.
  • Restaurants have better ratings on average in Lyon 7.

Here are the best pizzerias by district, as identified through their residuals:

  • Lyon 1: Le Jardin des Pentes
  • Lyon 2: Casa Nobile
  • Lyon 3: Le Ferrari
  • Lyon 4: Chez Puce
  • Lyon 5: Al Dente
  • Lyon 6: Neroliva
  • Lyon 7: Le Vivaldi - Nicolo e Maria
  • Lyon 8: Pizza Lina
  • Lyon 9: Domeva Caffé

Analysis

Let’s first import some libraries and take a look at the data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('data/restaurants.csv')
data[data.columns[:8]].head()

|   | id | alias | name | review_count | rating | price | coordinates.latitude | coordinates.longitude |
|---|----|-------|------|--------------|--------|-------|----------------------|-----------------------|
| 0 | D3NHTerar80aeR6mlyE2mw | azur-afghan-lyon | Azur Afghan | 23 | 4.0 | €€ | 45.775020 | 4.828750 |
| 1 | ee4wtKIBI_yTz0fJD054pg | tendance-afghane-lyon | Tendance Afghane | 1 | 3.0 | NaN | 45.759540 | 4.825560 |
| 2 | zmk41IUwIkvO_eM0UGD7Sg | sufy-lyon | Sufy | 2 | 3.5 | NaN | 45.752212 | 4.864384 |
| 3 | Vo0U5EcXbh7qlpdaQwZchA | le-conakry-lyon | Le Conakry | 9 | 4.0 | €€ | 45.750642 | 4.849127 |
| 4 | -mFHJBuCxZJ_wJrO-o2Ypw | afc-africa-food-concept-lyon | AFC Africa Food Concept | 8 | 3.5 | €€ | 45.754336 | 4.843469 |
data[data.columns[8:]].head()

|   | location.address1 | location.address2 | location.address3 | location.zip_code | location.country | location.display_address |
|---|-------------------|-------------------|-------------------|-------------------|------------------|--------------------------|
| 0 | 6 Rue Villeneuve | NaN | NaN | 69004 | FR | ['6 Rue Villeneuve', '69004 Lyon', 'France'] |
| 1 | 25 Rue Tramassac | NaN | NaN | 69005 | FR | ['25 Rue Tramassac', '69005 Lyon', 'France'] |
| 2 | 34 rue Jeanne Hachette | NaN | NaN | 69003 | FR | ['34 rue Jeanne Hachette', '69003 Lyon', 'Fran... |
| 3 | 112 Grande rue de la Guillotière | NaN | NaN | 69007 | FR | ['112 Grande rue de la Guillotière', '69007 Ly... |
| 4 | 14 Grande rue de la Guillotière | NaN | NaN | 69007 | FR | ['14 Grande rue de la Guillotière', '69007 Lyo... |

Review count and ratings distribution

Let’s take a look at the review count and ratings distributions:

sns.displot(data = data, x = "review_count", bins = 50);

[Figure: histogram of the review counts]

# converting the price symbols into a numeric scale
data['price_num'] = data['price'].map({'€': 1, '€€': 2, '€€€': 3, '€€€€': 4})
sns.countplot(data = data, x = "rating");

[Figure: count plot of the ratings]

The ratings are strongly bimodal, with the bulk of the distribution centered around 4 but also numerous restaurants rated 0 stars. The review counts are exponentially distributed, meaning that a large share of restaurants have very few reviews. Because ratings based on few reviews might not be very informative, let’s get rid of all restaurants with fewer than 10 reviews:

data = data[data['review_count'] >= 10]
sns.countplot(data = data, x = "rating");

[Figure: count plot of the ratings, restaurants with at least 10 reviews]

Interestingly, ratings of 0 or 1 star only applied to restaurants with few reviews, so we now have a unimodal ratings distribution.
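
This can be verified directly by grouping on rating and checking the largest review count seen for each value: a rating disappears after filtering only if its best-reviewed restaurant has fewer than 10 reviews. Here is a minimal sketch on made-up data (the column names match the post; the values are invented):

```python
import pandas as pd

# toy data mimicking the post's columns: the low-rated rows have few reviews
data = pd.DataFrame({
    "rating": [0.0, 1.0, 3.5, 4.0, 4.0, 4.5, 2.5],
    "review_count": [2, 5, 40, 120, 15, 60, 3],
})

# maximum review count observed for each rating value
max_reviews = data.groupby("rating")["review_count"].max()

# ratings whose best-reviewed restaurant still has < 10 reviews
# vanish entirely once we filter on review_count >= 10
low_evidence = max_reviews[max_reviews < 10].index.tolist()
print(low_evidence)  # → [0.0, 1.0, 2.5]
```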

Restaurant properties by district

Let’s now take a look at how variables of interest differ between districts. Using geopandas, I’ll look at the number of restaurants, mean rating, mean price and median number of reviews by district.

# importing spatial libraries
import geopandas
import geodatasets
import contextily as cx
from shapely.geometry import Point
# importing a geojson file describing the city's districts as a geo-dataframe called 'lyon_map'
# obtained from https://data.grandlyon.com/jeux-de-donnees/arrondissements-lyon/info
lyon_map = geopandas.read_file('data/adr_voie_lieu.adrarrond.geojson')

# adding zip codes to be able to merge with the data
lyon_map['location.zip_code'] = lyon_map['nomreduit'].apply(lambda x:('6900' + x[-1])).astype('int')
lyon_map

|   | nom | nomreduit | insee | datemaj | trigramme | gid | geometry | location.zip_code |
|---|-----|-----------|-------|---------|-----------|-----|----------|-------------------|
| 0 | Lyon 1er Arrondissement | Lyon 1 | 69381 | 1997-10-22 00:00:00+00:00 | LY1 | 128 | POLYGON ((4.83049 45.76454, 4.83125 45.76484, ... | 69001 |
| 1 | Lyon 9e Arrondissement | Lyon 9 | 69389 | 2005-07-19 00:00:00+00:00 | LY9 | 181 | POLYGON ((4.81088 45.78099, 4.81145 45.78177, ... | 69009 |
| 2 | Lyon 2e Arrondissement | Lyon 2 | 69382 | 1997-10-22 00:00:00+00:00 | LY2 | 59 | POLYGON ((4.81782 45.72649, 4.81868 45.72660, ... | 69002 |
| 3 | Lyon 4e Arrondissement | Lyon 4 | 69384 | 2003-05-12 00:00:00+00:00 | LY4 | 29 | POLYGON ((4.81856 45.78944, 4.81817 45.78914, ... | 69004 |
| 4 | Lyon 3e Arrondissement | Lyon 3 | 69383 | 2005-07-19 00:00:00+00:00 | LY3 | 125 | POLYGON ((4.83901 45.75660, 4.83956 45.75643, ... | 69003 |
| 5 | Lyon 5e Arrondissement | Lyon 5 | 69385 | 2005-07-19 00:00:00+00:00 | LY5 | 53 | POLYGON ((4.81353 45.74819, 4.81357 45.74823, ... | 69005 |
| 6 | Lyon 7e Arrondissement | Lyon 7 | 69387 | 2000-03-30 00:00:00+00:00 | LY7 | 189 | POLYGON ((4.83770 45.70737, 4.83894 45.70763, ... | 69007 |
| 7 | Lyon 8e Arrondissement | Lyon 8 | 69388 | 2011-03-03 00:00:00+00:00 | LY8 | 42 | POLYGON ((4.84879 45.71885, 4.84881 45.71879, ... | 69008 |
| 8 | Lyon 6e Arrondissement | Lyon 6 | 69386 | 2005-07-19 00:00:00+00:00 | LY6 | 45 | POLYGON ((4.86994 45.76373, 4.86986 45.76384, ... | 69006 |

To obtain data by district, I group the restaurant data by 'location.zip_code' before merging it with the geodataframe 'lyon_map':


# merging lyon_map with the data aggregated by zip code to get mean and median rating, price and review counts:
lyon_map = lyon_map.merge(
                data.groupby('location.zip_code')[['review_count', 'rating', 'price_num']].agg(['mean', 'median']),
                on = 'location.zip_code',
                how = 'left')

# merging lyon_map with the data aggregated by zip code to get the restaurant count by district:
lyon_map = lyon_map.merge(
                data.groupby('location.zip_code')['id'].count().reset_index(name="restaurants_count"),
                on = 'location.zip_code',
                how = 'left')

I finally plot the variables of interest aggregated by district:

# changing the coordinates reference system to a more visually appealing one:
lyon_map = lyon_map.to_crs(epsg = 3857)

fig, ax = plt.subplots(2, 2, figsize = (15, 12))

# plotting number of restaurants
lyon_map.plot(ax = ax[0, 0], alpha = 0.5, edgecolor = "k", column = 'restaurants_count', legend = True)
lyon_map.apply(lambda x: ax[0, 0].annotate(text=x['location.zip_code'],
                                     xy=x.geometry.centroid.coords[0],
                                     ha='center'), axis=1)
cx.add_basemap(ax[0, 0], zoom = 12, source = cx.providers.Esri.WorldGrayCanvas)
ax[0, 0].set_title('Number of restaurants')


# plotting mean rating
lyon_map.plot(ax = ax[0, 1],alpha = 0.5, edgecolor = "k",
                   column = ('rating', 'mean'), legend = True)
lyon_map.apply(lambda x: ax[0, 1].annotate(text=x['location.zip_code'],
                                     xy=x.geometry.centroid.coords[0],
                                     ha='center'), axis=1)
cx.add_basemap(ax[0, 1], zoom = 12, source = cx.providers.Esri.WorldGrayCanvas)
ax[0, 1].set_title('Mean rating')

# plotting median review count
lyon_map.plot(ax = ax[1, 0], alpha = 0.5, edgecolor = "k", column = ('review_count', 'median'), legend = True)
lyon_map.apply(lambda x: ax[1, 0].annotate(text=x['location.zip_code'],
                                     xy=x.geometry.centroid.coords[0],
                                     ha='center'), axis=1)
cx.add_basemap(ax[1, 0], zoom = 12, source = cx.providers.Esri.WorldGrayCanvas)
ax[1, 0].set_title('Median number of reviews')

# plotting mean price
lyon_map.plot(ax = ax[1, 1], alpha = 0.5, edgecolor = "k", column = ('price_num', 'mean'), legend = True)
lyon_map.apply(lambda x: ax[1, 1].annotate(text=x['location.zip_code'],
                                     xy=x.geometry.centroid.coords[0],
                                     ha='center'), axis=1)
cx.add_basemap(ax[1, 1], zoom = 12, source = cx.providers.Esri.WorldGrayCanvas)
ax[1, 1].set_title('Mean price')

# removing ticks for better visibility
for k in [0, 1]:
    for l in [0, 1]:
        ax[k, l].set_xticks([])
        ax[k, l].set_yticks([])
        
fig.tight_layout()

[Figure: choropleth maps of restaurant count, mean rating, median review count and mean price by district]

  • The historical center (Lyon 1 and 2) has the most restaurants, as well as the most reviews, which makes sense given its touristic appeal.
  • Restaurants in fancy neighborhoods (Lyon 5 and 6) are the most expensive on average, but this is not reflected by higher ratings.
  • Restaurants have better ratings on average in Lyon 7, but this effect is rather small.

Restaurant properties by category

Let us now take a look at how categories affect the restaurant properties.

# Loading the dataframe of categories
categories_df = pd.read_csv('data/categories.csv')
# taking a look at the distribution of the number of restaurants by category
ax = sns.histplot(categories_df.drop('id', axis = 1).sum())
ax.set_xlabel('Number of restaurants');

[Figure: histogram of the number of restaurants per category]

Most categories contain very few restaurants and are therefore not very informative. For the remainder of the analysis, I keep only the most frequent categories.

# dropping all categories with fewer than 100 restaurants
sorted_categories = categories_df.drop('id', axis = 1).sum().sort_values(ascending = False)
# saving a vector of the top categories
top_categories = sorted_categories[sorted_categories >= 100]
# and a vector of all other categories to filter categories_df
non_top_categories = sorted_categories[sorted_categories < 100]
# dropping rare categories
categories_df.drop(non_top_categories.index, axis = 1, inplace = True)
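
The column-sum filtering above can be sketched on a toy one-hot frame (the category names here are invented, and the threshold is lowered to 2 for the example):

```python
import pandas as pd

# toy one-hot category frame: one row per restaurant, one column per category
cats = pd.DataFrame({
    "id":     ["r1", "r2", "r3", "r4"],
    "pizza":  [1, 1, 1, 0],
    "french": [0, 1, 1, 1],
    "fondue": [1, 0, 0, 0],   # rare category, should be dropped
})

# count restaurants per category and split on a threshold (2 here, 100 in the post)
sums = cats.drop("id", axis=1).sum().sort_values(ascending=False)
rare = sums[sums < 2]
cats = cats.drop(rare.index, axis=1)
print(list(cats.columns))  # → ['id', 'pizza', 'french']
```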

To check how our variables of interest depend on categories, I create a dataframe with the names of the most common categories, then compute the restaurant count, mean rating, mean price, and median review count for each category.

# creating a dataframe of the most frequent categories ('alias') and the number of restaurants in each ('count')
top_categories = pd.DataFrame(top_categories)
top_categories.reset_index(inplace = True)
top_categories.columns = ['alias', 'count']
# computing the mean rating by category
top_categories["mean_rating"] = top_categories['alias'].apply(lambda x: data['rating'][categories_df[x] > 0].mean())

# computing the median number of reviews
top_categories["median_reviews"] = top_categories['alias'].apply(lambda x: data['review_count'][categories_df[x] > 0].median())

# computing the mean price
top_categories["mean_price"] = top_categories['alias'].apply(lambda x: data['price_num'][categories_df[x] > 0].mean())

# taking a look at the resulting dataset:
top_categories

|    | alias | count | mean_rating | median_reviews | mean_price |
|----|-------|-------|-------------|----------------|------------|
| 0 | french | 730.0 | 3.747368 | 30.0 | 2.553571 |
| 1 | hotdogs | 308.0 | 3.366667 | 19.0 | 1.750000 |
| 2 | pizza | 255.0 | 3.561404 | 20.0 | 2.017544 |
| 3 | italian | 174.0 | 3.571429 | 20.5 | 2.115942 |
| 4 | sandwiches | 164.0 | 3.557692 | 20.0 | 1.576923 |
| 5 | cafes | 159.0 | 3.833333 | 30.0 | 1.974359 |
| 6 | bistros | 139.0 | 3.957143 | 24.0 | 2.314286 |
| 7 | brasseries | 134.0 | 3.346154 | 28.0 | 2.435897 |
| 8 | burgers | 123.0 | 3.337209 | 34.0 | 1.906977 |
| 9 | japanese | 113.0 | 3.661017 | 31.0 | 2.237288 |
| 10 | lyonnais | 100.0 | 3.701613 | 40.0 | 2.677419 |

Finally, let’s take a look at the differences between categories in the form of bar charts:

fig, ax = plt.subplots(2, 2, figsize = (12, 12))

sns.barplot(ax = ax[0, 0], data = top_categories, x = 'count', y = 'alias')
ax[0, 0].set_xlabel('Number of restaurants')
ax[0, 0].set_ylabel('Category')

sns.barplot(ax = ax[0, 1], data = top_categories, x = 'mean_rating', y = 'alias',
            order = top_categories.sort_values('mean_rating', ascending = False).alias)
ax[0, 1].set_xlabel('Mean rating')
ax[0, 1].set_ylabel('Category')

sns.barplot(ax = ax[1, 0], data = top_categories, x = 'median_reviews', y = 'alias',
            order = top_categories.sort_values('median_reviews', ascending = False).alias)
ax[1, 0].set_xlabel('Median number of reviews')
ax[1, 0].set_ylabel('Category')

sns.barplot(ax = ax[1, 1], data = top_categories, x = 'mean_price', y = 'alias',
            order = top_categories.sort_values('mean_price', ascending = False).alias)
ax[1, 1].set_xlabel('Mean price')
ax[1, 1].set_ylabel('Category')

fig.tight_layout();

[Figure: bar charts of restaurant count, mean rating, median review count and mean price by category]

Finding the best pizza in town

I now want to identify “good” restaurants, i.e., restaurants that are better rated than similar restaurants. To do so, let us fit a simple linear model that predicts rating from the zip code, categories, and price of a restaurant. Such a simple model will of course be fairly bad at predicting the rating of any individual restaurant, but it captures the average effect of a restaurant’s district, categories, and price. The residuals of the model (the difference between the true rating and the predicted rating) then give us a good understanding of how well a restaurant performs compared to other restaurants with similar properties.

# building and fitting a simple linear regression model to predict the rating
from sklearn.linear_model import LinearRegression

# ratings as the target variable
y = data['rating']

# keeping zip code, price, and categories as predictor variables
X = data[['id', 'location.zip_code']].merge(categories_df, how = 'left', on = 'id')
X = pd.concat([X, pd.get_dummies(data['price']).reset_index()], axis = 1).drop(['id', 'index'], axis = 1)
X = X.values

# initializing and fitting the model
lm = LinearRegression()
lm.fit(X, y);
# getting predictions
data['predicted_rating'] = lm.predict(X)
# computing the residuals
data['residual'] = data['rating'] - data['predicted_rating']
# creating a data frame of the pizzerias with the highest residuals in each zip code
best_pizza = data.loc[data[categories_df["pizza"] == 1].groupby('location.zip_code')['residual'].idxmax()]

# taking a quick look at the selected pizzas:
best_pizza[['name', 'rating', 'price', "location.zip_code", 'residual']]
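
The residual logic can be illustrated end-to-end on a toy dataset: fit a linear model on one-hot district features, then pick the row with the largest residual in each group. Everything below is synthetic and only mirrors the structure of the real pipeline:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy data: six restaurants in two districts with ratings
df = pd.DataFrame({
    "name":   ["A", "B", "C", "D", "E", "F"],
    "zip":    [69001, 69001, 69001, 69002, 69002, 69002],
    "rating": [4.5, 3.5, 4.0, 3.0, 4.5, 3.0],
})

# one-hot encode the district, as get_dummies does in the real pipeline
X = pd.get_dummies(df["zip"].astype(str)).values.astype(float)
y = df["rating"].values

# with district as the only predictor, the model predicts the district means
lm = LinearRegression().fit(X, y)
df["residual"] = y - lm.predict(X)

# highest residual per district = restaurant most above its district's mean
best = df.loc[df.groupby("zip")["residual"].idxmax(), "name"].tolist()
print(best)  # → ['A', 'E']
```

A (rating 4.5 vs a district mean of 4.0) and E (4.5 vs 3.5) come out on top, exactly the over-performers the residuals are meant to surface.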

|      | name | rating | price | location.zip_code | residual |
|------|------|--------|-------|-------------------|----------|
| 2262 | Le Jardin des Pentes | 4.0 | NaN | 69001 | 0.447680 |
| 1773 | Casa Nobile | 4.5 | €€ | 69002 | 0.987304 |
| 1823 | Le Ferrari | 4.5 | €€€ | 69003 | 0.957764 |
| 2274 | Chez Puce | 4.5 | €€ | 69004 | 0.882967 |
| 1804 | Al Dente | 4.0 | €€ | 69005 | 0.463095 |
| 1786 | Neroliva | 4.5 | €€ | 69006 | 0.955026 |
| 1790 | Le Vivaldi - Nicolo e Maria | 4.5 | €€ | 69007 | 0.946956 |
| 2282 | Pizza Lina | 4.0 | €€ | 69008 | 0.350689 |
| 1808 | Domeva Caffé | 4.5 | €€ | 69009 | 0.930817 |

Finally, we can plot the selected pizzerias on a map:

best_pizza = geopandas.GeoDataFrame(
    best_pizza,
    geometry = geopandas.points_from_xy(best_pizza['coordinates.longitude'],
                                        best_pizza['coordinates.latitude'],
                                        crs = 'EPSG:4326')
)

best_pizza.to_crs(str(lyon_map.crs), inplace = True)


ax = lyon_map.plot(edgecolor = "teal", facecolor = 'none', figsize=(15, 15), linewidth=2)
best_pizza.plot(ax = ax)
best_pizza.apply(lambda x: ax.annotate(text=x['name'],
                                     xy=x.geometry.centroid.coords[0],
                                     ha='center',
                                     xytext=(-5, -15) if x['location.zip_code'] in [69005, 69007] else (-5, 20),
                                     textcoords="offset points",
                                     weight = 'bold',
                                     fontsize = 12,
                                     backgroundcolor='1'), axis=1)

cx.add_basemap(ax, zoom = 13, crs = best_pizza.crs,
               source = cx.providers.Esri.WorldTopoMap);

[Figure: map of the best pizzeria in each district]
