Eating out in Lyon - Data gathering
Gathering restaurant ratings using the Yelp Fusion API.
June 1, 2022
The code and data for this project are available on GitHub.
Summary
This is the first part of a data analysis project focused on finding a delicious pizza in my hometown of Lyon. Follow me as I:
- extract restaurant ratings using the Yelp Fusion API and the requests package,
- transform the JSON response into a tidy dataframe,
- ensure the data integrity of variables of interest.
# loading all necessary packages at once
import requests
import json
import pandas as pd
import numpy as np
Gathering restaurant data from the Yelp API
The Yelp API returns at most 50 results per request and only gives access to the first 1000 results of a given query. To work around this limitation, we can first fetch all the restaurant subcategories (e.g., Italian) and then loop over these categories, retrieving up to 1000 restaurants per category.
Fetching all restaurant subcategories:
I first get the categories using the ‘categories’ endpoint:
# loading my yelp API key
key_file = open('temp_api_key.txt')
api_key = key_file.readline().rstrip('\n')
key_file.close()
# replace the above lines with your own key:
# api_key = <your api key>
# specifying the headers, endpoint and parameters:
headers = {'Authorization': 'bearer %s' % api_key}
endpoint = "https://api.yelp.com/v3/categories"
parameters = {'location': 'Lyon, FR'}
# let the request package build the url and get a response:
response = requests.get(url = endpoint, params = parameters, headers = headers)
# transform the response into a data frame
categories_df = pd.json_normalize(response.json()['categories'])
categories_df.head()
| | alias | title | parent_aliases | country_whitelist | country_blacklist |
|---|---|---|---|---|---|
| 0 | 3dprinting | 3D Printing | [localservices] | [] | [] |
| 1 | abruzzese | Abruzzese | [italian] | [IT] | [] |
| 2 | absinthebars | Absinthe Bars | [bars] | [CZ] | [] |
| 3 | acaibowls | Acai Bowls | [food] | [] | [AR, CL, IT, MX, PL, TR] |
| 4 | accessories | Accessories | [fashion] | [] | [] |
As you can see above, I now have a dataframe with all the Yelp categories. Most of them have nothing to do with food, such as the first one (‘3dprinting’). To keep only categories of interest, I filter the data frame down to the rows whose parent categories (‘parent_aliases’ column) contain ‘restaurants’:
# creating a column that states whether the row is a sub-category of 'restaurants':
categories_df['is_restaurant'] = ['restaurants' in parent for parent in categories_df['parent_aliases']]
# filtering by the 'is_restaurant' column:
restaurants_df = categories_df[categories_df.is_restaurant]
restaurants_df.head()
| | alias | title | parent_aliases | country_whitelist | country_blacklist | is_restaurant |
|---|---|---|---|---|---|---|
| 18 | afghani | Afghan | [restaurants] | [] | [MX, TR] | True |
| 19 | african | African | [restaurants] | [] | [TR] | True |
| 39 | andalusian | Andalusian | [restaurants] | [ES, IT] | [] | True |
| 53 | arabian | Arabic | [restaurants] | [] | [DK] | True |
| 59 | argentine | Argentine | [restaurants] | [] | [FI] | True |
Finding all restaurants in Lyon
We can now fetch restaurant information by looping over all categories, using the ‘businesses/search’ endpoint of the Yelp API. I first declare some parameters for the query:
key_file = open('temp_api_key.txt')
api_key = key_file.readline().rstrip('\n')
key_file.close()
headers = {'Authorization': 'bearer %s' % api_key}
endpoint = "https://api.yelp.com/v3/businesses/search"
parameters = {'location': 'Lyon, FR', # where to look
'offset' : 0, # starting from the first result
'limit': 50, # taking 50 results (the maximum available at a time)
'term': 'restaurants'}
I then declare a list in which to store the responses and loop over the categories:
restaurants_ratings = []
# looping over categories
for category in restaurants_df.alias:
    # specifying the category in the parameters
    parameters['categories'] = category
    # looping a second time to fetch 50 results at a time
    for offset in range(0, 1000, 50):
        parameters['offset'] = offset
        response = requests.get(url = endpoint, params = parameters, headers = headers)
        # we break the loop if there are no restaurants left
        if not response.json().get('businesses', False):
            break
        # we extend the restaurants list with the new response
        restaurants_ratings.extend(response.json()['businesses'])
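If the script is left to run unattended, it can also be worth guarding each call against failed requests and the API rate limit. A minimal variant of the request inside the loop (an illustrative sketch, not what was run above) could be:
import time
response = requests.get(url = endpoint, params = parameters, headers = headers)
# fail loudly on HTTP errors (e.g. 429 Too Many Requests) instead of silently skipping data
response.raise_for_status()
# small pause between calls to stay clear of the rate limit
time.sleep(0.2)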
len(restaurants_ratings)
4995
restaurants_ratings[0]
{'id': 'D3NHTerar80aeR6mlyE2mw',
'alias': 'azur-afghan-lyon',
'name': 'Azur Afghan',
'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/8i5nsqv5tbxxg8HdndPY4Q/o.jpg',
'is_closed': False,
'url': 'https://www.yelp.com/biz/azur-afghan-lyon?adjust_creative=qbeDf2GYKB1Prc0VgyQp0A&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=qbeDf2GYKB1Prc0VgyQp0A',
'review_count': 23,
'categories': [{'alias': 'afghani', 'title': 'Afghan'}],
'rating': 4.0,
'coordinates': {'latitude': 45.77502, 'longitude': 4.82875},
'transactions': [],
'price': '€€',
'location': {'address1': '6 Rue Villeneuve',
'address2': None,
'address3': None,
'city': 'Lyon',
'zip_code': '69004',
'country': 'FR',
'state': '69',
'display_address': ['6 Rue Villeneuve', '69004 Lyon', 'France']},
'phone': '+33478396619',
'display_phone': '+33 4 78 39 66 19',
'distance': 1845.795955776875}
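One caveat worth keeping in mind: since the search loops over overlapping subcategories, the same business (identified by the ‘id’ field shown above) can be returned under several categories. A quick integrity check, left here as a sketch since the duplicates are kept in what follows, would be to count the distinct ids:
# counting distinct businesses by their unique yelp id
unique_ids = {business['id'] for business in restaurants_ratings}
len(unique_ids)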
Data cleaning
We gathered data for 4995 restaurants in JSON format. We can first transform it into a data frame using the ‘json_normalize’ function, which deals fairly well with nested JSON:
data = pd.json_normalize(restaurants_ratings)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4995 entries, 0 to 4994
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 4995 non-null object
1 alias 4995 non-null object
2 name 4995 non-null object
3 image_url 4995 non-null object
4 is_closed 4995 non-null bool
5 url 4995 non-null object
6 review_count 4995 non-null int64
7 categories 4995 non-null object
8 rating 4995 non-null float64
9 transactions 4995 non-null object
10 price 2681 non-null object
11 phone 4995 non-null object
12 display_phone 4995 non-null object
13 distance 4995 non-null float64
14 coordinates.latitude 4992 non-null float64
15 coordinates.longitude 4992 non-null float64
16 location.address1 4987 non-null object
17 location.address2 3122 non-null object
18 location.address3 2730 non-null object
19 location.city 4995 non-null object
20 location.zip_code 4995 non-null object
21 location.country 4995 non-null object
22 location.state 4995 non-null object
23 location.display_address 4995 non-null object
dtypes: bool(1), float64(4), int64(1), object(18)
memory usage: 902.5+ KB
We can already drop columns that don’t interest us:
# dropping columns we won't use:
data.drop(['image_url',
           'is_closed',
           'url',
           'transactions',
           'phone',
           'display_phone',
           'distance',
           'location.state' # redundant with location.zip_code
           ],
          axis = 1, inplace = True)
Let’s check that all the restaurants are in Lyon, France:
data['location.country'].value_counts()
FR 4994
IT 1
Name: location.country, dtype: int64
data['location.city'].value_counts()
Lyon 3678
Villeurbanne 429
Bron 74
Vénissieux 60
Oullins 54
...
Roanne 1
Lyon 7Eme 1
Saint-Paul 1
Lyon 08 1
Oullins Cedex 1
Name: location.city, Length: 131, dtype: int64
It seems that one Italian restaurant and quite a lot of restaurants from cities near Lyon have made it into our data, so let’s filter by city:
# lowercasing the city field to make sure we don't exclude any restaurants due to case issues
data['location.city'] = data['location.city'].str.lower()
data = data[data['location.city'].str.find('lyon') >= 0]
# checking that all restaurants are now in lyon
data['location.city'].value_counts()
lyon 3680
sainte-foy-lès-lyon 11
lyon 06 10
lyon 07 9
lyon 6eme 9
lyon 03 8
lyon 2eme 8
lyon 9eme 7
lyon 1er 6
sainte foy les lyon 5
lyon 02 5
lyon-7e-arrondissement 4
lyon-2e-arrondissement 4
lyon 5eme 4
lyon 01 4
lyon 04 3
lyon-5e-arrondissement 3
lyon 05 2
lyon 3eme 2
lyon-3e-arrondissement 2
lyon eme 1
lyon 08 1
lyon cedex 3 1
lyon 7eme 1
lyon 8eme 1
lyon 3 eme 1
lyon 09 1
sainte foy lès lyon 1
Name: location.city, dtype: int64
# excluding restaurants from 'sainte-foy-les-lyon' which is not in lyon
data = data[data['location.city'].str.find('sainte') == -1]
# checking that all restaurants are now in lyon
data['location.city'].value_counts()
lyon 3680
lyon 06 10
lyon 6eme 9
lyon 07 9
lyon 03 8
lyon 2eme 8
lyon 9eme 7
lyon 1er 6
lyon 02 5
lyon-2e-arrondissement 4
lyon-7e-arrondissement 4
lyon 5eme 4
lyon 01 4
lyon-5e-arrondissement 3
lyon 04 3
lyon 3eme 2
lyon 05 2
lyon-3e-arrondissement 2
lyon 08 1
lyon cedex 3 1
lyon eme 1
lyon 7eme 1
lyon 8eme 1
lyon 3 eme 1
lyon 09 1
Name: location.city, dtype: int64
# We can finally drop the city column, since it is redundant with zip codes
data.drop('location.city', axis = 1, inplace = True)
We can run a final check of our filter using zip codes: Lyon’s zip codes go from 69001 to 69009, so there should be no others:
data['location.zip_code'].value_counts()
69003 710
69002 658
69007 574
69006 552
69001 468
69005 293
69009 199
69004 150
69008 148
10
69100 3
69000 3
26150 2
69300 2
69200 1
69363 1
69326 1
69800 1
69500 1
Name: location.zip_code, dtype: int64
# dropping restaurants with zip-codes outside of Lyon
data = data[(data['location.zip_code'].isin(['69001', '69002', '69003', '69004',
'69005', '69006', '69007', '69008',
'69009']))]
len(data)
3752
We can then inspect missing values:
data.isna().sum()
id 0
alias 0
name 0
review_count 0
categories 0
rating 0
price 1556
coordinates.latitude 0
coordinates.longitude 0
location.address1 5
location.address2 1364
location.address3 1722
location.zip_code 0
location.country 0
location.display_address 0
dtype: int64
Missing secondary and tertiary addresses are not an issue since most places only have a primary address. Missing prices might be concerning if we later want to analyse ratings versus price. Lastly, we can drop the rows with a missing value in the coordinates or primary address, since they only concern a few restaurants.
data.dropna(subset = ['coordinates.latitude', 'coordinates.longitude', 'location.address1'], inplace = True)
data[data.columns[:8]].head()
| | id | alias | name | review_count | categories | rating | price | coordinates.latitude |
|---|---|---|---|---|---|---|---|---|
| 0 | D3NHTerar80aeR6mlyE2mw | azur-afghan-lyon | Azur Afghan | 23 | [{'alias': 'afghani', 'title': 'Afghan'}] | 4.0 | €€ | 45.775020 |
| 1 | zmk41IUwIkvO_eM0UGD7Sg | sufy-lyon | Sufy | 2 | [{'alias': 'indpak', 'title': 'Indian'}, {'ali... | 3.5 | NaN | 45.752212 |
| 2 | ee4wtKIBI_yTz0fJD054pg | tendance-afghane-lyon | Tendance Afghane | 1 | [{'alias': 'afghani', 'title': 'Afghan'}] | 3.0 | NaN | 45.759540 |
| 3 | Vo0U5EcXbh7qlpdaQwZchA | le-conakry-lyon | Le Conakry | 9 | [{'alias': 'african', 'title': 'African'}] | 4.0 | €€ | 45.750642 |
| 4 | -mFHJBuCxZJ_wJrO-o2Ypw | afc-africa-food-concept-lyon | AFC Africa Food Concept | 8 | [{'alias': 'african', 'title': 'African'}, {'a... | 3.5 | €€ | 45.754336 |
data[data.columns[8:]].head()
| | coordinates.longitude | location.address1 | location.address2 | location.address3 | location.zip_code | location.country | location.display_address |
|---|---|---|---|---|---|---|---|
| 0 | 4.828750 | 6 Rue Villeneuve | None | None | 69004 | FR | [6 Rue Villeneuve, 69004 Lyon, France] |
| 1 | 4.864384 | 34 rue Jeanne Hachette | None | | 69003 | FR | [34 rue Jeanne Hachette, 69003 Lyon, France] |
| 2 | 4.825560 | 25 Rue Tramassac | | | 69005 | FR | [25 Rue Tramassac, 69005 Lyon, France] |
| 3 | 4.849127 | 112 Grande rue de la Guillotière | | | 69007 | FR | [112 Grande rue de la Guillotière, 69007 Lyon,... |
| 4 | 4.843469 | 14 Grande rue de la Guillotière | | | 69007 | FR | [14 Grande rue de la Guillotière, 69007 Lyon, ... |
One thing left to deal with: the ‘categories’ column still contains nested lists of category dictionaries, as can be seen in the tables above.
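A possible way to handle it (just a sketch, not applied before saving; the ‘category_aliases’ column name is illustrative) would be to collapse each list into a comma-separated string of aliases:
# collapsing the list of category dicts into a comma-separated string of aliases
data['category_aliases'] = data['categories'].apply(
    lambda categories: ', '.join(category['alias'] for category in categories))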
data.to_csv('data/restaurants.csv')
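The saved file can then be read back at the start of the next part of the project (the index was written out by default, so we restore it from the first column):
restaurants = pd.read_csv('data/restaurants.csv', index_col = 0)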