base_location = "/content/drive/MyDrive/602_project"
The Geography of Fast Food Pricing: Economic Determinants of Menu Prices Across America¶
Authors: Akhil Kambhatla (122068766), Mokshda Gangrade (122151017), Vyom Agarwal (122246614)
Course: DATA602 - Principles of Data Science
Date: December 2025
1. Introduction¶
Have you ever wondered whether your favorite fast-food meal costs the same everywhere in America? Walk into any Chipotle in San Francisco's Financial District, and your usual chicken bowl rings up at 12.95 dollars. Meanwhile, your college friend in rural Iowa messages you a photo of his identical bowl priced at 9.50 dollars. That's a 3.45 dollar difference—roughly 36% more—for the exact same product from the same company with the same supply chain. What drives these geographic price differences?
While we often assume that national chain restaurants maintain the same pricing, the reality is much different. Fast-food companies operate in a complex economic landscape where local conditions—median household income, state minimum wage laws, urban versus rural settings, and competitive dynamics—may all influence their pricing strategies.
Research Question¶
This tutorial investigates: What economic and geographic factors systematically influence fast-food menu pricing across the United States?
Specifically, we examine whether pricing reflects:
- Demand-side factors: Do restaurants charge more in wealthier areas because customers can afford higher prices?
- Supply-side factors: Do higher labor costs (minimum wages) get passed through to consumers?
- Business model differences: Do corporate-owned chains (like Chipotle) price more consistently than franchise-dominated chains (like Domino's and Papa John's)?
- Geographic factors: Do different US regions show systematic pricing variations?
But why do we need this information? Well, understanding fast-food pricing patterns has implications for:
- Economic Geography: Revealing how businesses adapt to local purchasing power
- Consumer Welfare: Identifying whether low-income communities face regressive pricing
- Business Strategy: Comparing centralized versus decentralized pricing models
- Policy Analysis: Assessing the real-world effects of minimum wage policies on consumer prices
Data and Methodology¶
To answer these questions, we collect and integrate data from multiple sources:
- Restaurant Menu Data: Web-scraped pricing from 6,700+ locations across Chipotle, Domino's, and Papa John's
- US Census Data: Median household income at ZIP code level
- Department of Labor Data: State minimum wage laws
- Regional Food Prices: USDA food retail price indices as control variables
We apply the complete data science lifecycle to this investigation:
Data Collection → Scraping restaurant APIs and integrating government datasets
Data Preprocessing → Cleaning inconsistent formats, handling missing values, standardizing measurements
Exploratory Data Analysis → Visualizing geographic patterns, examining distributions, computing correlations
Hypothesis Testing → Statistical tests (ANOVA, regression) to validate economic theories
Predictive Modeling → Machine learning models to quantify factor importance and forecast prices
Insights and Interpretation → Synthesizing findings to answer our research questions
What We Will Discover¶
This analysis will reveal which economic factors matter most for fast-food pricing, how much variance they explain, and whether corporate structure affects pricing consistency. By the end of this tutorial, you will understand not just what drives geographic price variation, but also how to apply rigorous data science methods to real-world economic questions.
Technical Requirements¶
This tutorial uses the Python programming language for its versatility in data science workflows and Jupyter Notebook for interactive development. Specifically, we develop this project in Google Colaboratory, which provides free cloud-based computing with pre-installed data science libraries. All code in this tutorial is executable in Google Colab.
Let's begin by collecting our data.
2. Data Sources & Collection¶
This is the first step of the entire project. We could have built this project on existing restaurant APIs or pre-packaged datasets, but we chose to scrape the data from scratch and combine the sources ourselves, resembling a real-world application. We used web-scraping techniques (described in the Webscraping section below) to obtain menus and restaurant location details such as latitude, longitude, and ZIP code for the chains Chipotle, Domino's, and Papa John's. Each chain's details are stored in a CSV file.
Apart from these three individual restaurant datasets, we also took the time to see if we could find any other demographic data that would help us answer our main question of the impact of socioeconomic conditions on pricing. Upon searching through the web, we found three datasets that significantly contribute to our research:
Minimum Wages per State: State-level minimum wage laws from the Department of Labor, which help us test whether labor costs influence menu prices (supply-side hypothesis).
ZIP Code-Level Income Data: Median household income by ZIP code from the US Census Bureau, allowing us to test whether restaurants charge more in wealthier areas (demand-side hypothesis).
Regional Food Retail Prices: Regional food price indices from the Bureau of Labor Statistics (West, South, Midwest, Northeast), which serve as control variables to account for general cost-of-living differences across regions rather than restaurant-specific pricing strategies.
Each dataset presents unique challenges—from rate-limited APIs during web scraping to inconsistent geographic identifiers that require careful matching across ZIP codes, counties, and states. We will walk through the collection process for each source, explaining the technical approach and data quality considerations.
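One pitfall worth flagging before any matching: ZIP codes read as integers silently lose their leading zeros (02139 becomes 2139), which breaks joins against Census identifiers. Throughout the project we therefore normalize ZIPs to zero-padded strings; a minimal sketch (the column name `zip` is illustrative):

```python
import pandas as pd

# ZIP codes parsed as integers drop their leading zeros, so we
# normalize to 5-character zero-padded strings before any join.
df = pd.DataFrame({"zip": [2139, 94105, 501]})
df["zip"] = df["zip"].astype(str).str.zfill(5)
print(df["zip"].tolist())  # ['02139', '94105', '00501']
```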
Setup and Configuration¶
Before collecting data, we install required libraries and import necessary modules.
Installations¶
We install Folium for creating interactive geographic maps to visualize pricing patterns across the United States.
!pip install folium
!pip install statsmodels
!pip install catboost
Imports¶
We import libraries organized by functionality:
- Web Scraping: `requests` for API calls, `json` for parsing responses, `time` for rate limiting
- Data Manipulation: `pandas` for dataframes, `numpy` for numerical operations
- Visualization: `matplotlib` and `seaborn` for static plots, `folium` for interactive maps
- Statistical Analysis: `scipy.stats` for hypothesis testing, `statsmodels` for regression
- Machine Learning: `sklearn` for preprocessing and modeling
import requests
import csv
import json
import time
from urllib.parse import urlencode
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap
from folium.plugins import MarkerCluster
from scipy.stats import f_oneway, pearsonr, ttest_ind
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pickle
from transformers import AutoTokenizer, AutoModel
from catboost import CatBoostRegressor, Pool
from google.colab import drive
drive.mount('/content/drive')
Webscraping¶
We now collect restaurant menu and location data by scraping the APIs of three major fast-food chains. Each chain exposes different API structures, requiring custom approaches for data extraction. We use Chrome Developer Tools to inspect network requests and identify the API endpoints that power their store locator and menu systems.
Our scraping strategy collects two types of data for each chain:
- Store Locations: Latitude, longitude, address, ZIP code, and store identifiers
- Menu Pricing: Item names, prices, categories, and availability by location
The following subsections detail our scraping methodology for each chain, as well as the scraping of state minimum wage data.
Chipotle¶
Chipotle operates a corporate-owned model with centralized pricing decisions. Their API structure separates location data from menu data, requiring two separate scraping operations. We first retrieve all restaurant locations nationwide, then query each location's menu individually.
API Endpoints:
- Locations: `https://services.chipotle.com/restaurant/v3/restaurant`
- Menus: `https://services.chipotle.com/menuinnovation/v1/restaurants/{restaurantNumber}/onlinemenu`
Authentication: Requires Ocp-Apim-Subscription-Key header for API access.
First, Chain Locations
We begin by fetching all Chipotle store locations using their restaurant search API. The API accepts geographic coordinates and a radius parameter. To capture all US locations, we use a central point (coordinates of Cincinnati, OH) with a radius large enough to cover the entire continental United States (8,000 km).
The API returns paginated results with a maximum of 4,000 stores per page. We extract key fields including restaurant number, address, coordinates, and ZIP code for later matching with demographic data.
NOTE: The web-scraping code takes hours to run because it collects roughly 1,800,000 records. For evaluation, we have therefore also included CSV files for all our data, which can be used directly to run the rest of the code successfully.
Uncomment the code below to run the data collection pipeline yourself.
# # chipotle api url
# url = "https://services.chipotle.com/restaurant/v3/restaurant"
# payload_template = {
# "timestamp": "2020-3-14",
# "latitude": 39.1031182,
# "longitude": -84.5120196,
# "radius": 8000467,
# "restaurantStatuses": ["OPEN", "LAB"],
# "conceptIds": ["CMG"],
# "orderBy": "distance",
# "orderByDescending": False,
# "pageSize": 4000,
# "pageIndex": 0,
# "embeds": {
# "addressTypes": ["MAIN"],
# "publicPhoneTypes": ["MAIN PHONE"],
# "realHours": True,
# "directions": True,
# "catering": True,
# "onlineOrdering": True,
# "timezone": True,
# "marketing": True
# }
# }
# headers = {
# 'Ocp-Apim-Subscription-Key': 'b4d9f36380184a3788857063bce25d6a',
# 'Content-Type': 'application/json',
# }
# all_stores = []
# # Pagination loop
# page = 0
# while True:
# payload = payload_template.copy()
# payload["pageIndex"] = page
# response = requests.post(url, headers=headers, data=json.dumps(payload))
# stores = response.json()
# # If API errors, break
# if "data" not in stores:
# print("API returned an error:", stores)
# break
# data = stores["data"]
# if not data:
# break
# all_stores.extend(data)
# page += 1
# print(f"Fetched page {page}, {len(data)} stores")
# print(f"Total stores fetched: {len(all_stores)}")
# # Write CSV
# with open('ChipotleLocations.csv', 'w', newline='') as CSVFile:
# writer = csv.writer(CSVFile)
# writer.writerow(["restaurantNumber", "restaurantName", "latitude", "longitude"])
# for store in all_stores:
# writer.writerow([
# store.get("restaurantNumber"),
# store.get("restaurantName"),
# store["addresses"][0]["latitude"],
# store["addresses"][0]["longitude"]
# ])
Next, Menu by Locations
With all store locations collected, we now scrape individual menus for each restaurant. Chipotle's menu API requires the restaurant number as a path parameter. For each of the 3,800+ locations, we query the online menu endpoint and extract item names, prices, and categories.
We implement rate limiting (0.1 second delay between requests) to avoid overwhelming the API server. Menu items are deduplicated by item code, as some items appear multiple times with different customization options.
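The deduplication idea can be sketched in pandas; the item codes, names, and prices below are hypothetical, but the pattern is the same one applied to the scraped rows:

```python
import pandas as pd

# Hypothetical scraped rows: the same item can appear more than once
# (e.g., under different customization options) with the same code.
menu = pd.DataFrame({
    "item_code": ["CB-1", "CB-1", "GUAC"],
    "item_name": ["Chicken Bowl", "Chicken Bowl", "Guacamole"],
    "price": [10.70, 10.70, 2.85],
})

# Keep only the first row per item code.
deduped = menu.drop_duplicates(subset=["item_code"], keep="first")
print(len(deduped))  # 2
```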
# RESTAURANT_URL = "https://services.chipotle.com/restaurant/v3/restaurant"
# MENU_URL_TEMPLATE = (
# "https://services.chipotle.com/menuinnovation/v1/restaurants/"
# "{restaurant_id}/onlinemenu?channelId=web&includeUnavailableItems=true"
# )
# COMMON_HEADERS = {
# 'Accept': 'application/json, text/plain, */*',
# 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
# 'AppleWebKit/537.36 (KHTML, like Gecko) '
# 'Chrome/80.0.3987.132 Safari/537.36',
# 'Origin': 'https://www.chipotle.com',
# 'Referer': 'https://www.chipotle.com/',
# 'Accept-Language': 'en-US,en;q=0.9,la;q=0.8,pt;q=0.7',
# 'Ocp-Apim-Subscription-Key': 'b4d9f36380184a3788857063bce25d6a',
# }
# RESTAURANT_HEADERS = {
# **COMMON_HEADERS,
# 'Content-Type': 'application/json',
# }
# MENU_HEADERS = {
# **COMMON_HEADERS
# }
# # menu item scraping function
# def extract_menu_items(menu_json):
# """
# Recursively walk the menu JSON and return a list of
# (item_name, unit_price) tuples for any object that has
# an itemName and unitPrice field.
# """
# items = []
# def walk(obj):
# if isinstance(obj, dict):
# if "itemName" in obj and "unitPrice" in obj:
# name = obj.get("itemName")
# price_raw = obj.get("unitPrice")
# price = None
# if isinstance(price_raw, (int, float, str)):
# price = price_raw
# elif isinstance(price_raw, dict):
# price = (
# price_raw.get("amount")
# or price_raw.get("price")
# or price_raw.get("value")
# )
# items.append((name, price))
# for v in obj.values():
# walk(v)
# elif isinstance(obj, list):
# for v in obj:
# walk(v)
# walk(menu_json)
# return items
# def fetch_all_restaurants():
# page_index = 0
# page_size = 4000
# all_restaurants = []
# while True:
# payload = {
# "timestamp": "2020-3-14",
# "latitude": 39.1031182,
# "longitude": -84.5120196,
# "radius": 8000467,
# "restaurantStatuses": ["OPEN", "LAB"],
# "conceptIds": ["CMG"],
# "orderBy": "distance",
# "orderByDescending": False,
# "pageSize": page_size,
# "pageIndex": page_index,
# "embeds": {
# "addressTypes": ["MAIN"],
# "publicPhoneTypes": ["MAIN PHONE"],
# "realHours": True,
# "directions": True,
# "catering": True,
# "onlineOrdering": True,
# "timezone": True,
# "marketing": True
# }
# }
# resp = requests.post(
# RESTAURANT_URL,
# headers=RESTAURANT_HEADERS,
# data=json.dumps(payload)
# )
# data = resp.json()
# if "data" not in data:
# print("Restaurant API error on page", page_index, "->", data)
# break
# restaurants = data["data"]
# if not restaurants:
# break
# all_restaurants.extend(restaurants)
# print(f"Fetched restaurant page {page_index}, {len(restaurants)} records")
# page_index += 1
# time.sleep(0.2)
# print(f"Total restaurants fetched: {len(all_restaurants)}")
# return all_restaurants
# # for each restaurant here, we fetch menu and write it to csv
# def get_chipotle():
# restaurants = fetch_all_restaurants()
# with open("ChipotleMenuByLocation.csv", mode="w", newline="", encoding="utf-8") as f:
# writer = csv.writer(f)
# writer.writerow(["city", "state", "pincode", "menu_item", "price"])
# for idx, store in enumerate(restaurants, start=1):
# addr = (store.get("addresses") or [{}])[0]
# city = addr.get("locality")
# state = addr.get("administrativeArea")
# pincode = addr.get("postalCode")
# restaurant_number = store.get("restaurantNumber")
# if not restaurant_number:
# print("Skipping store with no restaurantNumber:", store)
# continue
# menu_url = MENU_URL_TEMPLATE.format(restaurant_id=restaurant_number)
# try:
# menu_resp = requests.get(menu_url, headers=MENU_HEADERS)
# if menu_resp.status_code != 200:
# print(
# f"[{idx}/{len(restaurants)}] "
# f"Menu API error {menu_resp.status_code} for {restaurant_number}"
# )
# continue
# menu_json = menu_resp.json()
# except Exception as e:
# print(f"Error fetching menu for {restaurant_number}: {e}")
# continue
# menu_items = extract_menu_items(menu_json)
# if not menu_items:
# print(f"[{idx}/{len(restaurants)}] No menu items found for {restaurant_number}")
# continue
# print(
# f"[{idx}/{len(restaurants)}] {city}, {state} ({pincode}) "
# f"- {len(menu_items)} items"
# )
# for item_name, price in menu_items:
# writer.writerow([city, state, pincode, item_name, price])
# time.sleep(0.1)
# get_chipotle()
Domino's¶
Domino's operates primarily through a franchise model, where individual franchisees have more autonomy over local pricing decisions. Unlike Chipotle's two-step process, Domino's allows us to query stores by ZIP code and retrieve menu data in a single API call per store.
API Endpoints:
- Store Locator: `https://order.dominos.com/power/store-locator` (accepts ZIP code parameter)
- Menu: `https://order.dominos.com/power/store/{store_id}/menu?lang=en&structured=true`
Approach: We iterate through all US ZIP codes, find stores in each ZIP, then scrape their menus. This ZIP-based approach naturally aligns with our demographic data integration strategy.
Note: We use an existing list of ZIP codes spanning the entire US as a reference to fetch menus for all valid (matched) locations. The same approach is used for both Domino's and Papa John's.
Menu By Locations
For each ZIP code, we query Domino's store locator API to find nearby restaurants. The API returns store IDs, addresses, and coordinates. We then fetch each store's menu using the store-specific menu endpoint.
Key Implementation Details:
- ZIP Code Coverage: We query all ~42,000 US ZIP codes to ensure complete national coverage
- Rate Limiting: 0.05 second delay between menu requests (faster than Chipotle due to simpler API)
- Error Handling: We implement timeout protection (10 seconds) and skip stores that fail to respond
- Progress Tracking: Status printed every 50 stores to monitor scraping progress
The output CSV contains store location (city, state, ZIP), menu item names, and prices for approximately 6,500 Domino's locations nationwide.
# # US ZIP CODE LINK TO DRIVE
# zip_url = "https://drive.google.com/file/d/1IDG4kjx1-7DkDj-mXy6H_rv9H_lR3Mix/view?usp=share_link"
# zip_url='https://drive.google.com/uc?id=' + zip_url.split('/')[-2]
# df = pd.read_csv(zip_url)
# zip_list = df["ZIP CODE"].astype(str).str.zfill(5).unique().tolist()
# ZIP_CODES = zip_list
# STORE_LOCATOR_URL = "https://order.dominos.com/power/store-locator"
# MENU_URL_TEMPLATE = (
# "https://order.dominos.com/power/store/{store_id}/menu?lang=en&structured=true"
# )
# COMMON_HEADERS = {
# "Accept": "application/json, text/plain, */*",
# "User-Agent": "Mozilla/5.0",
# }
# def extract_menu_items(menu_json):
# items = []
# seen = set()
# def get_name_and_price(obj):
# name = (
# obj.get("Name") or obj.get("name") or
# obj.get("ProductName") or obj.get("productName")
# )
# price_raw = (
# obj.get("Price") or obj.get("price") or
# obj.get("Amount") or obj.get("amount")
# )
# if name is None or price_raw is None:
# return None, None
# price = price_raw
# if isinstance(price_raw, dict):
# price = (
# price_raw.get("Amount") or price_raw.get("amount") or
# price_raw.get("Price") or price_raw.get("price")
# )
# return name, price
# def walk(obj):
# if isinstance(obj, dict):
# name, price = get_name_and_price(obj)
# if name and price:
# key = (str(name), str(price))
# if key not in seen:
# seen.add(key)
# items.append((name, price))
# for v in obj.values():
# walk(v)
# elif isinstance(obj, list):
# for v in obj:
# walk(v)
# walk(menu_json)
# return items
# # FETCH STORES FOR GIVEN ZIP
# def fetch_stores_for_zip(zip_code):
# params = {"s": "", "c": zip_code, "type": "Delivery"}
# url = f"{STORE_LOCATOR_URL}?{urlencode(params)}"
# try:
# resp = requests.get(url, headers=COMMON_HEADERS, timeout=10)
# data = resp.json()
# except:
# return []
# stores = data.get("Stores") or []
# return stores
# # MAIN SCRAPER
# def get_dominos():
# all_stores = {} # StoreID → store object
# print("Collecting all Domino's stores across the USA...")
# for i, z in enumerate(ZIP_CODES):
# stores = fetch_stores_for_zip(z)
# for s in stores:
# store_id = (
# s.get("StoreID") or s.get("StoreId") or
# s.get("storeID") or s.get("id")
# )
# if not store_id:
# continue
# if store_id not in all_stores:
# all_stores[store_id] = s
# if i % 200 == 0:
# print(f"Processed {i} ZIPs, total unique stores so far: {len(all_stores)}")
# time.sleep(0.05)
# print(f"\nTotal Unique Domino's Stores Found: {len(all_stores)}")
# print("Fetching menus...")
# with open("Dominos_ALL_USA.csv", "w", newline="", encoding="utf-8") as f:
# writer = csv.writer(f)
# writer.writerow(["city", "state", "zip", "menu_item", "price"])
# for idx, (store_id, store) in enumerate(all_stores.items(), start=1):
# addr = store.get("Address") or store.get("StoreAddress") or {}
# city = (
# store.get("City") or store.get("city") or
# addr.get("City") or addr.get("city")
# )
# state = (
# store.get("Region") or store.get("region") or
# addr.get("Region") or addr.get("region")
# )
# postal = (
# store.get("PostalCode") or store.get("postalCode") or
# addr.get("PostalCode") or addr.get("postalCode")
# )
# menu_url = MENU_URL_TEMPLATE.format(store_id=store_id)
# try:
# menu_resp = requests.get(menu_url, headers=COMMON_HEADERS, timeout=10)
# menu_json = menu_resp.json()
# except:
# continue
# menu_items = extract_menu_items(menu_json)
# for name, price in menu_items:
# writer.writerow([city, state, postal, name, price])
# if idx % 50 == 0:
# print(f"Store {idx}/{len(all_stores)} complete")
# time.sleep(0.05)
# print("\n Completed USA Domino’s scrape! File saved as Dominos_ALL_USA.csv")
# get_dominos()
Papa John's¶
Papa John's, like Domino's, operates through a franchise model. Their API structure is similar to Domino's, allowing ZIP-based store lookup followed by individual store menu queries.
API Endpoints:
- Store Search: `https://www.papajohns.com/order/storesSearchJson` (accepts ZIP code parameter)
- Menu Products: `https://www.papajohns.com/api/v6/stores/{store_id}/products`
Approach: We use the same ZIP code iteration strategy as Domino's. For each ZIP, we find nearby stores, then fetch their product catalogs. The implementation includes longer timeout periods (15 seconds) compared to Domino's, as Papa John's API tends to have slower response times.
Technical Notes:
- Rate Limiting: 0.1 second delay between requests (moderate throttling)
- Timeout Protection: 15 second timeout (longer than Domino's due to API performance)
- Session Management: uses `requests.Session()` to maintain connection pooling for efficiency
# # US ZIP CODE URL LINK
# zip_url = "https://drive.google.com/file/d/1IDG4kjx1-7DkDj-mXy6H_rv9H_lR3Mix/view?usp=share_link"
# zip_url='https://drive.google.com/uc?id=' + zip_url.split('/')[-2]
# zip_df = pd.read_csv(zip_url)
# zip_series = zip_df["ZIP CODE"].dropna()
# zip_codes = (
# zip_series
# .astype(int)
# .astype(str)
# .str.zfill(5)
# .unique()
# .tolist()
# )
# print(f"Total unique ZIPs loaded: {len(zip_codes)}")
# # API configs
# STORE_SEARCH_URL = "https://www.papajohns.com/order/storesSearchJson"
# PRODUCTS_URL_TEMPLATE = "https://www.papajohns.com/api/v6/stores/{store_id}/products"
# session = requests.Session()
# session.headers.update({
# "User-Agent": "Mozilla/5.0 (compatible; PapaScraper/1.0; +https://example.com)"
# })
# # Fetch stores for each ZIP
# stores_data = []
# seen_store_ids = set()
# for z in tqdm(zip_codes, desc="Fetching stores by ZIP"):
# params = {
# "searchType": "CARRYOUT",
# "zipcode": z
# }
# try:
# resp = session.get(STORE_SEARCH_URL, params=params, timeout=15)
# if resp.status_code != 200:
# print(f"[WARN] ZIP {z}: status {resp.status_code}")
# continue
# data = resp.json()
# except Exception as e:
# print(f"[ERROR] ZIP {z}: {e}")
# continue
# for store in data.get("stores", []):
# store_id = store.get("storeId")
# if store_id is None or store_id in seen_store_ids:
# continue
# seen_store_ids.add(store_id)
# loc = store.get("storeLocation", {}) or {}
# stores_data.append({
# "restaurant_id": store_id,
# "city": loc.get("city"),
# "state": loc.get("state"),
# "pincode": loc.get("postalCode"),
# "latitude": loc.get("latitude"),
# "longitude": loc.get("longitude"),
# "restaurant_name": store.get("storeName", "Papa Johns")
# })
# time.sleep(0.1)
# stores_df = pd.DataFrame(stores_data)
# print(f"Total unique stores found: {len(stores_df)}")
# # Fetch menu items for each store
# menu_rows = []
# for idx, store_row in tqdm(stores_df.iterrows(), total=len(stores_df), desc="Fetching products by store"):
# store_id = store_row["restaurant_id"]
# url = PRODUCTS_URL_TEMPLATE.format(store_id=store_id)
# try:
# resp = session.get(url, timeout=15)
# if resp.status_code != 200:
# print(f"[WARN] Store {store_id}: status {resp.status_code}")
# continue
# data = resp.json()
# except Exception as e:
# print(f"[ERROR] Store {store_id}: {e}")
# continue
# products = data.get("data", [])
# for p in products:
# menu_item = p.get("name") or p.get("title")
# price = p.get("displayPrice") or p.get("regularMenuPrice")
# menu_type = p.get("tag") or p.get("productTypeId")
# menu_rows.append({
# "restaurant_id": store_row["restaurant_id"],
# "city": store_row["city"],
# "state": store_row["state"],
# "pincode": store_row["pincode"],
# "latitude": store_row["latitude"],
# "longitude": store_row["longitude"],
# "restaurant_name": store_row["restaurant_name"],
# "menu_item": menu_item,
# "price": price,
# "menu_type": menu_type
# })
# time.sleep(0.1)
# menu_df = pd.DataFrame(menu_rows)
# menu_df.drop_duplicates(
# subset=["restaurant_id", "menu_item", "price"],
# inplace=True
# )
# print(f"Total menu rows: {len(menu_df)}")
# OUTPUT_CSV = "papajohns_menu_USA.csv"
# menu_df.to_csv(OUTPUT_CSV, index=False)
# print(f"Saved to: {OUTPUT_CSV}")
Minimum Wage Data¶
Our final web scraping target is the Department of Labor's minimum wage table. Unlike restaurant APIs, this data is published as an HTML table on a government website. We use pandas' read_html() function to parse the table directly from the webpage.
Data Source: DOL Minimum Wage Consolidated Table
Challenge: The table has an unusual format with three columns:
- States with minimum wage above federal level (e.g., "CA \$16.50")
- States at federal minimum ($7.25)
- States with minimum below federal (but federal applies)
We parse each column differently, extracting state codes and wage values, then standardize the result to a simple state-to-wage mapping. For entries that list multiple rates (e.g., "NY \$16.50 or \$15.50"), we keep the first listed rate.
# MINIMUM WAGE
def get_min_wage_data():
    url = "https://www.dol.gov/agencies/whd/mw-consolidated"
    dfs = pd.read_html(url)
    wage_table = dfs[0]
    states_wages = []
    for col_idx in range(min(3, len(wage_table.columns))):
        col_data = wage_table.iloc[:, col_idx].dropna()
        for entry in col_data:
            entry_str = str(entry).strip()
            # Column 1: states with wages > $7.25 (format: "AK $11.91" or "NY $16.50 or $15.50")
            if '$' in entry_str:
                if ' or ' in entry_str:
                    entry_str = entry_str.split(' or ')[0].strip()
                parts = entry_str.split('$')
                if len(parts) == 2:
                    state = parts[0].strip()
                    wage_str = parts[1].strip()
                    # Clean wage (remove footnote characters)
                    wage_clean = ''.join(c for c in wage_str if c.isdigit() or c == '.')
                    try:
                        wage = float(wage_clean)
                    except ValueError:
                        continue
                    if len(state) == 2 and state.isalpha():
                        states_wages.append({'state': state.upper(), 'min_wage': wage})
            # Columns 2 & 3: states at/below the federal minimum ($7.25)
            elif len(entry_str) == 2 and entry_str.isalpha():
                states_wages.append({'state': entry_str.upper(), 'min_wage': 7.25})
    # Create DataFrame and remove duplicates
    wage_df = pd.DataFrame(states_wages).drop_duplicates('state', keep='first')
    wage_df['min_wage'] = pd.to_numeric(wage_df['min_wage'], errors='coerce')
    # Sort by wage descending
    wage_df = wage_df.sort_values('min_wage', ascending=False).reset_index(drop=True)
    return wage_df

wage_df = get_min_wage_data()
print("Sample (top 10 states by wage):")
print(wage_df.head(10)[['state', 'min_wage']].to_string(index=False))
External Data Sources¶
In addition to the restaurant and minimum wage data we scraped, we integrate two publicly available datasets from government sources. These provide the additional context needed to test our hypotheses about pricing determinants.
Note on Data Access: The code below shows our file paths from Google Drive. You can either:
- Follow the same Google Drive structure, add the files to your Drive, and mount your drive, or
- Download these datasets from our GitHub repository and modify the file paths to your local directory
All datasets are also available directly from the official government sources linked below.
ZIP Code-Level Income Data¶
We use median household income at the ZIP code level to test our demand-side pricing hypothesis: do restaurants charge more in wealthier areas because customers can afford higher prices?
Data Source: US Census Bureau - Income by ZIP Code Tabulation Area
Geographic Granularity: ZIP code level (5-digit ZIP codes) provides the finest available geographic resolution for income data, allowing precise matching with restaurant locations.
Key Variable: Median household income (in dollars) represents the typical household's purchasing power in each ZIP code area.
Data Format: The Census Bureau provides this data through downloadable tables (click "Download Table Data" on the linked page). We downloaded the data in .xlsx format, reduced the dataset to two features (ZIP code and income), and exported the result as a comma-separated file (.csv).
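For reference, the reduction from the raw download to the two-column CSV can be sketched as follows; the raw column names and "ZCTA5 xxxxx"-style labels here are illustrative stand-ins, not the Census Bureau's exact headers:

```python
import pandas as pd

# Hypothetical raw Census export with extra columns we do not need.
raw = pd.DataFrame({
    "Geographic Area Name": ["ZCTA5 02139", "ZCTA5 94105"],
    "Median Household Income": [109000, 150000],
    "Margin of Error": [5000, 8000],
})

# Keep only the five-digit ZIP and the income value, then export.
clean = pd.DataFrame({
    "zip": raw["Geographic Area Name"].str.extract(r"(\d{5})")[0],
    "income": raw["Median Household Income"],
})
clean.to_csv("zip_income_data.csv", index=False)
print(clean["zip"].tolist())  # ['02139', '94105']
```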
We load this dataset and rename the ZIP code column to match our restaurant data format for easier integration during the preprocessing phase.
income_df = pd.read_csv(f"{base_location}/Data/zip_income_data.csv", dtype={"zip": str})
income_df = income_df.rename(columns = {"zip":"pincode"})
USDA Rural-Urban Continuum Codes¶
NOTE: WE ARE NOT USING THIS DATASET IN THE ANALYSIS. IT IS INCLUDED FOR EXTENDED KNOWLEDGE PURPOSES ONLY.
The USDA Economic Research Service publishes county-level classification codes that measure urbanization on a scale from 1 to 9:
- 1-3: Metropolitan counties (large to small metros with 250,000+ population)
- 4-6: Urban non-metropolitan counties (with towns 2,500-19,999 residents)
- 7-9: Rural counties (no town over 2,500 residents)
Data Source: USDA ERS Rural-Urban Continuum Codes 2023
Why This Matters: Urbanization affects restaurant operating costs (commercial rent, labor markets) and competitive dynamics. Rural areas may have less competition, potentially leading to market power and higher prices despite lower local incomes. This variable allows us to test whether geographic isolation affects pricing strategies.
Format: County-level FIPS codes (Federal Information Processing Standards) matched to rural-urban classification. We downloaded this directly as a CSV file from the USDA website.
We load the CSV and extract the 2023 classification codes, then process them to enable matching with restaurant ZIP codes through county-level geographic identifiers.
# USDA RURAL CODES
def get_usda_rural_data():
    usda_df = pd.read_csv(f"{base_location}/Data/Ruralurbancontinuumcodes2023.csv", encoding='latin1')
    usda_rucc = usda_df[usda_df['Attribute'] == 'RUCC_2023'].copy()
    usda_rucc.rename(columns={
        'Value': 'rural_code',
        'County_Name': 'county_name'
    }, inplace=True)
    usda_rucc['FIPS'] = usda_rucc['FIPS'].astype(str).str.zfill(5)
    usda_rucc['rural_code'] = pd.to_numeric(usda_rucc['rural_code'], errors='coerce')
    usda_rucc = usda_rucc[['FIPS', 'State', 'county_name', 'rural_code']].copy()
    return usda_rucc

usda_df = get_usda_rural_data()
print("\nSample:")
print(usda_df.head(10).to_string(index=False))
Regional Food Retail Prices¶
To control for general cost-of-living differences across regions, we include regional food price indices from the Bureau of Labor Statistics. These indices capture the baseline cost of food retail items (groceries, milk, bread, meat, etc.) in four major US Census regions.
Data Sources: Bureau of Labor Statistics - Average Retail Food Prices by Region
Why This Matters: Fast-food prices may be higher in certain regions simply because all food costs more there, not because of restaurant-specific pricing strategies. By including regional food prices as a control variable, we can separate general cost-of-living effects from pricing differences based on local income levels.
We will use this regional control in the machine learning phase of the analysis.
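A minimal sketch of how such a regional control could be attached: map each state to its Census region, then merge in a regional food price index. The region assignments follow the four Census regions, but the index values here are illustrative placeholders, not BLS figures.

```python
import pandas as pd

# Illustrative placeholders: a partial state-to-region lookup and made-up
# regional food price index values (the real values come from BLS tables).
state_to_region = {"CA": "West", "NY": "Northeast", "OH": "Midwest", "TX": "South"}
region_food_index = pd.DataFrame({
    "region": ["Northeast", "Midwest", "South", "West"],
    "food_price_index": [1.05, 0.97, 0.95, 1.08],
})

stores = pd.DataFrame({"state": ["CA", "OH", "TX"]})
stores["region"] = stores["state"].map(state_to_region)
stores = stores.merge(region_food_index, on="region", how="left")
print(stores)
```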
3. Data Preprocessing¶
After collecting raw data from multiple sources, we now face the challenge of transforming these different datasets into an analysis-ready format. This preprocessing phase addresses several critical data quality issues:
- Inconsistent Formats: Restaurant data uses different column names and data types
- Missing Values: Not all stores have complete ZIP code or location information
- Invalid Entries: Some scraped data includes non-US locations or placeholder values
- Geographic Mismatches: Restaurant ZIP codes must be matched to county-level rural codes
- Menu Item Comparability: We need to filter for equivalent meal types across chains
Our preprocessing workflow follows these steps:
State Validation → Filter to valid US states and territories (51 total including DC)
Individual Brand Cleaning → Process each restaurant chain separately to handle brand-specific data structures
Data Integration → Merge restaurant data with demographic variables (income, minimum wage, rural codes)
Missing Value Imputation → Handle incomplete data using geographic nearest-neighbor approaches
Data Type Standardization → Ensure numeric columns are properly formatted
Final Quality Checks → Remove remaining invalid entries and verify data completeness
This section documents each preprocessing step, explaining how it prepares the data for analysis.
State Validation¶
Before processing restaurant data, we define a reference list of valid US states and territories. This list serves as a filter to remove any non-US entries that may have been captured during web scraping (e.g., Canadian or international locations that appeared in API responses).
Our validation includes all 50 US states plus the District of Columbia (DC), for a total of 51 valid geographic entities. We will apply this filter to each restaurant dataset.
valid_states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'DC']
Brand-Specific Cleaning: Chipotle¶
We begin by processing Chipotle's data, which was collected in two separate files during the scraping phase:
- ChipotleMenuByLocation.csv - Contains menu items and prices per restaurant
- ChipotleLocations.csv - Contains restaurant metadata (coordinates, addresses)
These files must be merged to create a complete dataset with both menu information and geographic location for each restaurant.
Cleaning Steps:
- Load menu data file
- Load location data and merge on restaurant ID
- Remove zero-price items (free sides, customization options)
- Filter to valid US states only
This process reduces the raw dataset from approximately 5.89M rows to 1.48M validated menu items.
Load Menu Data¶
We first load the menu-by-location file containing pricing information for each Chipotle restaurant.
chipotle_df = pd.read_csv(f"{base_location}/Data/ChipotleMenuByLocation.csv")
chipotle_df.head()
Merge with Location Data¶
We now load the restaurant locations file and merge it with menu data using restaurant ID as the key. This two-file structure resulted from our scraping approach: Chipotle's API provides separate endpoints for location lookup and menu retrieval.
After merging, we drop duplicate columns and add a standardized restaurant_name identifier for consistency across all three chains.
chipotle_df_2 = pd.read_csv(f"{base_location}/Data/ChipotleLocations.csv")
chipotle_merged_df = pd.merge(chipotle_df, chipotle_df_2, left_on="restaurant_id", right_on="restaurantNumber")
chipotle_merged_df.drop(columns=['restaurantNumber','restaurantName'], inplace=True)
chipotle_merged_df["restaurant_name"] = "chipotle"
Data Quality Filtering¶
We apply two filters:
Remove zero-price items: Chipotle's menu includes free customization options (extra rice, beans) and included sides with $0.00 prices. These aren't purchasable items, so we exclude them.
State validation: Filter to valid US states only, removing any international locations captured during scraping.
The print statements show progressive data reduction, verifying each filter's impact.
print(len(chipotle_merged_df))
chipotle_merged_df = chipotle_merged_df[chipotle_merged_df['price']>0]
print(len(chipotle_merged_df))
chipotle_merged_df = chipotle_merged_df[chipotle_merged_df['state'].isin(valid_states)]
print(len(chipotle_merged_df))
Brand-Specific Cleaning: Domino's¶
Unlike Chipotle's two-file structure, Domino's data was collected in a single CSV file (Dominos_ALL_USA.csv). However, Domino's presents a different challenge: their franchise model means menu item naming varies by location, and the dataset includes combo deals that must be filtered out for fair comparison.
Cleaning Steps:
- Load CSV and standardize column names
- Categorize menu items (mains, sides, drinks)
- Exclude combo deals and bundles
- Remove non-food items
- Filter zero-price items and validate states
Load and Standardize¶
We load Domino's data and immediately standardize column names to match Chipotle's structure, ensuring consistency for later merging.
dominoes_df = pd.read_csv(f"{base_location}/Data/Dominos_ALL_USA.csv")
dominoes_df["restaurant_name"] = "dominoes"
dominoes_df = dominoes_df.rename(columns = {"zip":"pincode","menu_type":"item_type"})
print(len(dominoes_df))
Menu Item Categorization¶
Domino's menu contains a wide variety of items including pizzas, pasta, wings, sandwiches, sides, drinks, and combo deals. To ensure fair price comparison with Chipotle, we must:
- Identify single-item purchases (mains, sides, drinks)
- Exclude combo deals (e.g., "2 Medium Pizzas Deal", "Choose Any 2")
- Remove non-food items (donations, utensils, bags)
We use keyword-based categorization to classify each menu item. Items matching combo keywords are explicitly excluded (set to None). Items that don't match any category remain uncategorized and will be filtered out.
Categorization Logic:
- Mains: Pizza (all crust types), pasta, wings, sandwiches, hoagies
- Sides: Breadsticks, cheesy bread, dipping sauces, salads, cookies
- Drinks: Sodas, teas, energy drinks, water
- Excluded: Combo deals, bundles, meals, donations, utensils
The code below implements this sequential categorization, processing items in order: exclude combos first, then categorize mains, sides, drinks, and finally remove "other" non-food items.
menu = dominoes_df["menu_item"].str.lower().fillna("")
dominoes_df["item_type"] = None
# 0. EXCLUDE combos explicitly (keep None)
combo_keywords = ["combo", "meal", "deal", "bundle", "and ", " & ", "choose any", "any 2"]
pattern_combo = "|".join(combo_keywords)
dominoes_df.loc[menu.str.contains(pattern_combo), "item_type"] = None
# 1. MAINS (pizza, pasta, wings, sandwiches, specialty chicken, hoagies)
mains_keywords = [
"pizza", "pan", "hand tossed", "thin", "new york", "brooklyn",
"alfredo", "pasta", "mac", "carbonara", "primavera",
"wings", "boneless", "chicken", "specialty chicken",
"sandwich", "hoagie", "loaded chicken"
]
pattern_mains = "|".join(mains_keywords)
dominoes_df.loc[
dominoes_df["item_type"].isna() & menu.str.contains(pattern_mains),
"item_type"
] = "mains"
# 2. SIDES (bread, dips, sauces, pasta trays, cookies)
side_keywords = [
"bread", "bites", "tots", "brownie", "cookie",
"dip", "dipping", "sauce", "marinara",
"packets", "seasoning", "parmesan", "jalapeno",
"side ", "tray", "salad", "garden", "caesar"
]
pattern_sides = "|".join(side_keywords)
dominoes_df.loc[
dominoes_df["item_type"].isna() & menu.str.contains(pattern_sides),
"item_type"
] = "sides"
# 3. DRINKS
drink_keywords = [
"coke", "cola", "sprite", "fanta", "dr pepper", "pepper",
"tea", "powerade", "energy", "monster", "sunkist",
"water", "mello", "pibb", "iced tea", "nos", "sun drop"
]
pattern_drinks = "|".join(drink_keywords)
dominoes_df.loc[
dominoes_df["item_type"].isna() & menu.str.contains(pattern_drinks),
"item_type"
] = "drinks"
# 4. OTHER (donations, bags, forks)
other_keywords = ["donation", "bag", "fork"]
pattern_other = "|".join(other_keywords)
dominoes_df.loc[
dominoes_df["item_type"].isna() & menu.str.contains(pattern_other),
"item_type"
] = "other"
dominoes_df = dominoes_df[dominoes_df["item_type"]!="other"]
Remove Uncategorized and Non-Food Items¶
After categorization, we remove two types of items:
- Uncategorized items (item_type = None): These are primarily combo deals that were explicitly excluded, plus any unusual menu items that didn't match our keywords
- Non-food items (item_type = "other"): Donations, bags, utensils
This ensures our dataset contains only single, purchasable food and drink items that can be fairly compared across chains.
dominoes_df = dominoes_df.dropna(subset=["item_type"]).reset_index(drop=True)
print(len(dominoes_df))
Apply Standard Quality Filters¶
Finally, we apply the same filters used for Chipotle:
- Remove zero-price items: Free add-ons and customizations
- State validation: Filter to valid US states only
The three print statements show the progressive data reduction: after categorization (890,170 items) → after price filter (888,758 items) → after state filter (878,588 items).
print(len(dominoes_df))
dominoes_df = dominoes_df[dominoes_df['price']>0]
print(len(dominoes_df))
dominoes_df = dominoes_df[dominoes_df['state'].isin(valid_states)]
print(len(dominoes_df))
Brand-Specific Cleaning: Papa John's¶
Papa John's data follows the same single-file structure as Domino's and presents similar challenges with franchise-based menu variation. We apply the same categorization approach to identify single-item purchases and exclude combo deals.
Cleaning Steps:
- Load CSV and standardize column names
- Validate states
- Fix price formatting (remove commas)
- Categorize menu items (mains, sides, drinks)
- Exclude uncategorized items
We start by loading Papa John's data.
papa_johns_df = pd.read_csv(f"{base_location}/Data/papajohns_menu_USA.csv")
papa_johns_df["restaurant_name"] = "papa_johns"
papa_johns_df = papa_johns_df.rename(columns = {"menu_type":"item_type"})
Menu Item Categorization¶
We categorize Papa John's menu items using keyword matching, similar to Domino's but with brand-specific menu terms:
Mains: Pizzas (all crust types), wings (boneless and bone-in), calzones, hoagies, sandwiches
Sides: Breadsticks, cheesesticks, garlic knots, dipping sauces, salads, desserts (ice cream)
Drinks: Pepsi products (Mountain Dew, Starry), Lipton teas, juices, Gatorade, Aquafina water
Other: Rare non-food items like extra salad dressing packets
print(len(papa_johns_df))
menu = papa_johns_df["menu_item"].str.lower().fillna("")
papa_johns_df["item_type"] = None
# 1. MAINS — pizzas, wings, calzones, hoagies
mains_keywords = [
"pizza", "pan", "thin", "new york", "ny style",
"create your own", "calzone",
"boneless", "wings",
"hoagie", "sandwich"
]
pattern_mains = "|".join(mains_keywords)
papa_johns_df.loc[
menu.str.contains(pattern_mains),
"item_type"
] = "mains"
# 2. SIDES — breadsticks, knots, cheesesticks, dips, salads, desserts
side_keywords = [
"breadsticks", "cheesesticks", "knots", "sticks",
"dip", "dipping", "sauce",
"salad", "garden", "caesar",
"ice cream", "dessert",
"jalapeno", "parmesan"
]
pattern_sides = "|".join(side_keywords)
papa_johns_df.loc[
(papa_johns_df["item_type"].isna()) &
menu.str.contains(pattern_sides),
"item_type"
] = "sides"
# 3. DRINKS — Pepsi, Mountain Dew, tea, juice, Gatorade, etc.
drink_keywords = [
"pepsi", "mountain dew", "dew", "starry",
"lipton", "tea", "pure leaf",
"ocean spray", "juice",
"crush", "gatorade", "life", "life wtr",
"aquafina", "water",
"soda", "wham", "sol"
]
pattern_drinks = "|".join(drink_keywords)
papa_johns_df.loc[
(papa_johns_df["item_type"].isna()) &
menu.str.contains(pattern_drinks),
"item_type"
] = "drinks"
# 4. Other (rare)
other_keywords = ["salad dressing"]
pattern_other = "|".join(other_keywords)
papa_johns_df.loc[
(papa_johns_df["item_type"].isna()) &
menu.str.contains(pattern_other),
"item_type"
] = "other"
papa_johns_df = papa_johns_df[papa_johns_df["item_type"] != "other"]  # drop rare non-food items, matching the Domino's flow
papa_johns_df = papa_johns_df.dropna(subset=["item_type"]).reset_index(drop=True)
print(len(papa_johns_df))
We remove items that didn't match any category (item_type = None), as well as the rare "other" items. This filters the dataset from 389,374 rows to 360,487 validated menu items.
We now apply basic data quality fixes: standardizing city names to lowercase, validating states, and cleaning price strings (removing commas that may interfere with numeric operations).
papa_johns_df['city'] = papa_johns_df['city'].str.lower()
papa_johns_df = papa_johns_df[papa_johns_df['state'].isin(valid_states)]
papa_johns_df['price'] = papa_johns_df['price'].astype(str).str.replace(',', '')
len(papa_johns_df)
Data Integration¶
Now that all three restaurant datasets are cleaned and categorized, we integrate them with demographic data and combine them into a single dataset for analysis.
Integration Steps:
- Merge minimum wage data (state-level) to each restaurant dataset
- Combine all three restaurant datasets into one
- Standardize text fields and data types
- Handle ZIP code formatting inconsistencies
- Identify missing ZIP codes
- Impute missing ZIP codes using geographic nearest-neighbor
- Merge income data (ZIP-level)
- Alternative income matching for remaining gaps
This multi-step process ensures data quality while maximizing the amount of usable information in our final analysis dataset.
Merge Minimum Wage Data¶
We add state-level minimum wage information to each of the three restaurant datasets before combining them. The merge is performed as a left join, ensuring all restaurant records are retained even if minimum wage data is missing (though we expect complete coverage for all US states).
dominos_merged_df = dominoes_df.merge(wage_df, on="state", how="left")
chipotle_merged_df = chipotle_merged_df.merge(wage_df, on="state", how="left")
papa_johns_merged_df = papa_johns_df.merge(wage_df,on="state", how="left")
Combine All Restaurant Datasets¶
We concatenate Chipotle, Domino's, and Papa John's into a single unified dataset using pd.concat(). This creates a combined dataset with approximately 2.7 million menu item records from 13,000+ restaurant locations nationwide.
Post-Combination Cleaning:
- Standardize text fields: Convert menu items and city names to lowercase and strip whitespace for consistency
- Ensure numeric types: Convert price to float to handle any string artifacts
- Remove duplicates: Drop exact duplicate rows that may have resulted from the merge
- Standardize ZIP codes: Trim ZIP+4 codes (9 digits) to standard 5-digit format
These standardization steps ensure consistency across the three brands, which had slightly different data formats from their respective APIs.
all_df = pd.concat([chipotle_merged_df,dominos_merged_df,papa_johns_merged_df], ignore_index=True)
all_df.head()
all_df['menu_item'] = all_df['menu_item'].str.lower().str.strip()
all_df['city'] = all_df['city'].str.lower().str.strip()
all_df['price'] = all_df['price'].astype(float)
print(len(all_df['restaurant_id']))
all_df.drop_duplicates(inplace=True)
mask = (all_df['pincode'].notnull()) & (all_df['pincode'].str.len() > 5)
all_df.loc[mask, 'pincode'] = all_df.loc[mask, 'pincode'].str[:5]
Identify Missing ZIP Codes¶
Some restaurant locations have missing or null ZIP codes from the scraping process. We check how many rows are affected before applying imputation strategies.
The output shows 241 restaurants with missing ZIP codes out of 2.7 million records - approximately 0.009% of the dataset. These restaurants do have latitude/longitude coordinates, which we'll use for geographic imputation.
missing_pincode_rows = all_df[all_df['pincode'].isnull()]
print(len(missing_pincode_rows))
Missing ZIP Code Imputation: Geographic Nearest-Neighbor¶
For restaurants with missing ZIP codes, we use a geographic nearest-neighbor approach based on latitude/longitude coordinates. The strategy:
- Build spatial index: Use scikit-learn's NearestNeighbors with the ball-tree algorithm on restaurants that DO have valid ZIP codes
- Find nearest neighbor: For each restaurant missing a ZIP code, find the single closest restaurant (by geographic distance) that has a valid ZIP code
- Impute: Assign the nearest neighbor's ZIP code
Why this works: Nearby restaurants (within ~100 meters) almost always share the same ZIP code. This geographic imputation is more accurate than dropping rows or using city-level averages.
Additional Imputation: The code also fills remaining missing values:
- Numeric columns (price, lat, lon): Use median imputation
- Categorical columns (city, state): Use mode imputation
- Menu items: Replace "nan" strings with proper NaN, then fill with "unknown"
Finally, we drop any rows missing critical identifiers (restaurant_id, restaurant_name).
from sklearn.neighbors import NearestNeighbors
missing_pincode_rows = all_df[all_df['pincode'].isnull()]
valid_df = all_df[
all_df['pincode'].notnull() &
all_df['latitude'].notnull() &
all_df['longitude'].notnull()
]
nbrs = NearestNeighbors(n_neighbors=1, algorithm='ball_tree')
nbrs.fit(valid_df[['latitude', 'longitude']])
missing_with_coords = missing_pincode_rows[
missing_pincode_rows['latitude'].notnull() &
missing_pincode_rows['longitude'].notnull()
]
distances, indices = nbrs.kneighbors(missing_with_coords[['latitude', 'longitude']])
imputed_zips = valid_df.iloc[indices.flatten()]['pincode'].values
all_df.loc[missing_with_coords.index, 'pincode'] = imputed_zips
num_cols = ['price', 'latitude', 'longitude']
for col in num_cols:
if col in all_df:
all_df[col] = all_df[col].fillna(all_df[col].median())
cat_cols = ['city', 'state', 'pincode', 'menu_type']
for col in cat_cols:
if col in all_df:
mode_val = all_df[col].mode()
if not mode_val.empty:
all_df[col] = all_df[col].fillna(mode_val[0])
all_df['menu_item'] = all_df['menu_item'].replace("nan", np.nan)
all_df['menu_item'] = all_df['menu_item'].fillna("unknown")
all_df = all_df.dropna(subset=['restaurant_id', 'restaurant_name'])
print("Total missing values left:", all_df.isnull().sum().sum())
print("Final shape:", all_df.shape)
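One caveat about the ball tree above: fitting it on raw latitude/longitude treats degrees as plane coordinates, which is a fine approximation at neighbor scales but distorts true distances. A variant using scikit-learn's haversine metric (which expects coordinates in radians) would look like this sketch on toy data:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy data: two stores with known ZIPs, one store missing its ZIP.
valid = pd.DataFrame({
    "latitude":  [37.655, 41.878],
    "longitude": [-122.407, -93.097],
    "pincode":   ["94080", "50010"],
})
missing = pd.DataFrame({"latitude": [37.654], "longitude": [-122.406]})

# Haversine expects [lat, lon] in radians; returned distances are also in
# radians (multiply by Earth's radius, ~6371 km, to get kilometers).
nbrs = NearestNeighbors(n_neighbors=1, metric="haversine")
nbrs.fit(np.radians(valid[["latitude", "longitude"]]))
dist, idx = nbrs.kneighbors(np.radians(missing[["latitude", "longitude"]]))
imputed = valid.iloc[idx.flatten()]["pincode"].values
print(imputed)
```

For neighbors a few hundred meters apart, both metrics pick the same ZIP code, so the choice mainly matters if the distances themselves are reported.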
The final cleaning steps leave 1.72 million records.
Remove Invalid ZIP Codes¶
Some restaurants have the placeholder value "USA" instead of a valid 5-digit ZIP code. This is a scraping artifact where the API returned a country code rather than a postal code. We remove these rows as they cannot be matched to demographic data.
all_df = all_df[all_df["pincode"] != "USA"]
Merge Income Data¶
We merge ZIP code-level median household income data to each restaurant. This is the key variable for testing our demand-side pricing hypothesis.
The merge is a left join on the pincode column. Some restaurants may not have matching income data after this merge if their ZIP codes aren't in the Census dataset. We'll handle these remaining gaps in the next step.
all_df = all_df.merge(income_df, on="pincode", how="left")
Alternative Income Matching: Numerical ZIP Code Proximity¶
Some restaurants still lack income data after the direct ZIP code match. This can happen for several reasons:
- ZIP code formatting differences (e.g., leading zeros handled differently)
- Restaurants in ZIP codes not present in the Census dataset
- New ZIP codes created after the Census data snapshot
Alternative Strategy: Convert both restaurant and income ZIP codes to integers, then find the numerically closest ZIP code that has income data. For example, if ZIP 10001 has no data, we might use data from ZIP 10002 or 10000.
Why this works: ZIP codes are assigned geographically in sequential order, so numerically adjacent ZIP codes are typically also geographically adjacent and have similar income demographics.
The get_nearest_income() function computes the absolute difference between the target ZIP code and all available ZIP codes, then returns the income value from the closest match.
After this secondary matching, we verify that all restaurants now have income data (output shows 0 remaining nulls).
income_df["zip_int"] = income_df["pincode"].astype(int)
all_df["zip_int"] = all_df["pincode"].astype(int)
def get_nearest_income(zip_code, income_df):
# absolute difference
income_df["dist"] = (income_df["zip_int"] - zip_code).abs()
nearest_row = income_df.loc[income_df["dist"].idxmin()]
return nearest_row["income"]
missing_mask = all_df["income"].isna()
all_df.loc[missing_mask, "income"] = all_df.loc[missing_mask, "zip_int"].apply(
lambda z: get_nearest_income(z, income_df)
)
print(len(all_df[all_df["income"].isnull()]))
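The per-row lookup above recomputes distances against every income ZIP for each missing restaurant. As a hedged alternative sketch (not the code the project ran), pandas' merge_asof with direction="nearest" performs the same nearest-integer-ZIP match in one vectorized pass, provided both sides are sorted on the key:

```python
import pandas as pd

# Toy stand-ins for the income table and the restaurants still missing income.
income_lookup = pd.DataFrame({"zip_int": [10000, 10002, 10010],
                              "income":  [52000, 61000, 74000]})
missing_zips = pd.DataFrame({"zip_int": [10003, 10009]})

# merge_asof requires sorted keys; "nearest" picks the closest zip_int.
matched = pd.merge_asof(
    missing_zips.sort_values("zip_int"),
    income_lookup.sort_values("zip_int"),
    on="zip_int",
    direction="nearest",
)
print(matched)
```

Here 10003 is matched to 10002 and 10009 to 10010, the numerically (and, by the ZIP assignment argument above, typically geographically) closest codes with data.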
4. Exploratory Data Analysis¶
With our integrated dataset containing 1.72 million menu items from 6,717 restaurants across the United States, we now explore patterns and relationships that will inform our hypothesis testing. This EDA phase addresses questions like:
- How do prices vary across restaurant brands/geographic locations?
- Are there regional patterns in pricing that suggest geographic discrimination?
- How consistent is pricing within each chain versus across competitors?
Our exploratory analysis follows this structure:
- Feature Engineering - Create derived variables for analysis
- Dataset Overview - Summary statistics and data quality verification
- Price Distribution Analysis - Understand pricing patterns and outliers
- Brand Comparison - Compare pricing strategies across Chipotle, Domino's, and Papa John's
- Geographic Analysis - Examine city, state, and spatial pricing patterns
- Correlation Analysis - Identify relationships between economic and pricing variables
These visualizations and summary statistics will reveal the preliminary patterns that our statistical tests (Section 5) will formally validate.
Feature Engineering¶
Before conducting exploratory analysis, we create derived features that capture important relationships in our data. These engineered features will be essential for both visualization and modeling:
Price Features:
- avg_price: Restaurant-level average price (captures overall price positioning)
- price_vs_rest_avg: How much each item deviates from its restaurant's average (captures menu item variability)
- log_price: Log-transformed price for handling right-skewed distributions
Location Features:
- city_rest_count: Number of unique restaurants per city (measures local competition)
- state_rest_count: Number of unique restaurants per state (captures market density)
- pincode_rest_count: Number of unique restaurants per ZIP code (fine-grained competition measure)
These features enable us to examine both absolute pricing levels and relative positioning within local markets.
print("FEATURE ENGINEERING - CREATING DERIVED VARIABLES")
df_engineered = all_df.copy()
print(f"\nStarting columns: {len(df_engineered.columns)}")
print("PRICE FEATURES")
df_engineered['avg_price'] = (
df_engineered.groupby('restaurant_id')['price'].transform('mean')
)
df_engineered['price_vs_rest_avg'] = (
df_engineered['price'] - df_engineered['avg_price']
)
df_engineered['log_price'] = np.log(df_engineered['price'] + 1)
print("Location FEATURES")
df_engineered['city_rest_count'] = df_engineered.groupby('city')['restaurant_id'].transform('nunique')
df_engineered['state_rest_count'] = df_engineered.groupby('state')['restaurant_id'].transform('nunique')
df_engineered['pincode_rest_count'] = df_engineered.groupby('pincode')['restaurant_id'].transform('nunique')
print(f"\nFinal columns: {len(df_engineered.columns)}")
print(df_engineered.head().to_string(index=False))
The feature engineering step successfully creates 6 new variables, expanding our dataset from 13 to 19 columns. The avg_price variable will be particularly important for comparing restaurant pricing levels while controlling for menu composition differences. The competition measures (restaurant counts by geography) will help us test whether market concentration affects pricing power.
Dataset Overview¶
We begin with high-level summary statistics to understand our data's scope and completeness. This verification step ensures data quality before proceeding with detailed analysis.
df = df_engineered.copy()
print("\nDATASET OVERVIEW")
print("\nUnique restaurants:", df["restaurant_id"].nunique())
print("Unique cities:", df["city"].nunique())
print("Unique states:", df["state"].nunique())
print("Unique menu items:", df["menu_item"].nunique())
print("\nMISSING VALUES")
missing = df.isnull().sum()
print(missing.to_string())
Our final integrated dataset contains:
- 6,717 unique restaurants across three major chains
- 3,077 cities spanning all 51 US states and territories (including DC)
- 1,265 distinct menu items after filtering for single-item purchases
- 1.72 million total observations (menu item × restaurant combinations)
The item_type missingness occurs because some menu items (seasonal specials, customizations) don't match our "mains/sides/drinks" keyword categories.
Price Distribution Analysis¶
Understanding the distribution of menu prices is fundamental to our research question. We examine both item-level prices and restaurant-level average prices to identify patterns, outliers, and the overall price landscape.
display(df[["price", "avg_price", "log_price"]].describe())
plt.figure(figsize=(8,5))
sns.histplot(df["avg_price"], kde=True)
plt.title("Distribution of Average Price per Restaurant")
plt.xlabel("Average Price ($)")
plt.show()
print()
rest_price_var = df.groupby("restaurant_id")["price"].std().fillna(0)
plt.figure(figsize=(8,5))
sns.histplot(rest_price_var, bins=30)
plt.title("Variation of Prices Within Restaurants")
plt.xlabel("Standard Deviation of Price")
plt.show()
expensive = df.groupby("restaurant_id")["avg_price"].mean().nlargest(5)
print(expensive.to_string())
cheap = df.groupby("restaurant_id")["avg_price"].mean().nsmallest(5)
print(cheap.to_string())
Price Distribution Characteristics:
The average menu item price across all chains is approximately $13.50, but with substantial variation:
- Standard deviation: Wide spread indicating diverse menu offerings from low-cost sides to premium entrees
- Log-normal distribution: Prices are right-skewed, with most items clustering in the $8-15 range and a long tail of expensive items extending to $22+
Restaurant-Level Pricing:
When aggregated by restaurant (using avg_price), we observe:
- Most restaurants cluster around $11-15 average price points
- The distribution is more symmetric than item-level prices, suggesting consistent menu composition strategies
- Within-restaurant variation: The standard deviation plot shows most restaurants maintain relatively consistent pricing across their menu (standard deviation around $3-5), with some outliers showing high variance
Pricing Extremes:
The five most expensive restaurants (average price ~\$22) and five least expensive (average price ~\$8) differ by nearly 3x. This dramatic range suggests either:
- Different menu compositions (premium items vs. budget options)
- Geographic pricing discrimination (high-cost vs. low-cost markets)
- Brand positioning differences (corporate premium vs. franchise value)
Our subsequent analyses will decompose these sources of variation.
Brand Comparison Analysis¶
We compare pricing strategies across our three fast-food chains.
brand_avg = df.groupby("restaurant_name")["avg_price"].mean()
print("\nAverage price by brand:")
print(brand_avg.to_string())
plt.figure(figsize=(6,5))
sns.boxplot(data=df[df["restaurant_name"].isin(["dominoes", "chipotle","papa_johns"])],
x="restaurant_name", y="avg_price",
palette={"dominoes": "#006491", "chipotle": "#43A047","papa_johns":"#CC3745"})
plt.title("Average Price Distribution by Brand")
plt.xlabel("Brand")
plt.ylabel("Average Price ($)")
plt.show()
Average Price by Brand:
- Domino's: $14.32 (highest average)
- Papa John's: $13.88 (middle)
- Chipotle: $11.44 (lowest average)
This ranking is initially surprising—Chipotle is often perceived as a premium "fast-casual" brand, yet has the lowest average menu prices in our dataset. Several factors explain this:
- Menu composition: Chipotle's menu is simpler (burritos, bowls, tacos) with fewer premium add-ons, while pizza chains offer a wider range of sizes, specialty pizzas, and combination deals that push averages higher
- Measurement approach: Our filtering excluded combo deals, but pizza chains still include multiple size options (small, medium, large, extra-large) where large sizes inflate averages
- Category differences: Pizza (per item) costs more than individual burrito bowls, even though per-serving value may differ
Price Distribution Patterns:
The box plot reveals:
- Chipotle: Tight distribution with small interquartile range, confirming centralized pricing consistency
- Domino's: Wider spread with more outliers, suggesting franchise pricing autonomy and regional variation
- Papa John's: Similar to Domino's in spread, though slightly lower median
This visual evidence shows that corporate-owned chains (Chipotle) maintain more consistent pricing than franchise-dominated chains (Domino's, Papa John's), which allow greater local price adaptation.
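One simple way to quantify this consistency claim is the coefficient of variation (std/mean) of restaurant-level average prices within each brand; a lower value means more uniform pricing across locations. The numbers below are toy values for illustration, not the project data:

```python
import pandas as pd

# Toy restaurant-level average prices: tight for the corporate brand,
# dispersed for the franchise brand.
toy = pd.DataFrame({
    "restaurant_name": ["chipotle"] * 3 + ["dominoes"] * 3,
    "avg_price": [11.2, 11.5, 11.4, 12.0, 15.5, 17.8],
})

# Coefficient of variation per brand (scale-free, so brands with different
# average price levels remain comparable)
cv = toy.groupby("restaurant_name")["avg_price"].agg(lambda s: s.std() / s.mean())
print(cv)
```

On the real dataset, this statistic run over df would put a single number on the corporate-versus-franchise dispersion gap visible in the box plot.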
Geographic Pricing Patterns¶
Geographic location is central to our research questions. We examine pricing variation across cities, states, and spatial coordinates to identify whether restaurants practice geographic price discrimination based on local economic conditions.
# City-wise price average
city_price = df.groupby("city")["avg_price"].mean().sort_values(ascending=False)
print("\nTop 10 expensive price cities:")
print(city_price.head(10).to_string())
# Count state appearances in top 10 cities
top_10_cities = city_price.head(10).index
top_10_states = df[df['city'].isin(top_10_cities)].groupby('city')['state'].first()
state_counts = top_10_states.value_counts()
print("\nState distribution in top 10 cities:")
for state, count in state_counts.items():
print(f" {state}: {count} cities")
plt.figure(figsize=(10,5))
city_price.head(15).plot(kind='bar')
plt.title("Top 15 Expensive Cities — Avg Price")
plt.ylabel("Avg Price ($)")
plt.show()
City-Level Pricing:
The most expensive cities for fast-food pricing are predominantly California coastal communities:
- San Juan Capistrano ($19.39 average) - Orange County, CA
- Pacifica ($19.02) - San Mateo County, CA (near San Francisco)
- Hercules ($18.94) - Contra Costa County, CA (SF Bay Area)
- South San Francisco ($18.80) - San Mateo County, CA
This California concentration is striking - 8 of the top 10 most expensive cities are in California.
The bar chart of top 15 cities shows a clear price tier above $17, compared to national averages around $13-14.
#State-Wise
state_price = df.groupby("state")["avg_price"].mean().sort_values()
print("\nBottom 4 states (lowest prices):")
print(state_price.head(4).to_string())
print("\nTop 4 states (highest prices):")
print(state_price.tail(4).to_string())
plt.figure(figsize=(12,5))
state_price.plot(kind='bar')
plt.title("Average Price by State")
plt.ylabel("Avg Price ($)")
plt.show()
State-Level Pricing:
The state-level analysis reveals systematic regional patterns:
- Highest-price states: Alaska, Hawaii, Washington, California
- Lowest-price states: Rhode Island, Ohio, Minnesota, Michigan
The state bar chart shows a gradual gradient rather than discrete jumps, suggesting pricing responds to continuous economic variation rather than arbitrary geographic boundaries. The ~$5-6 spread from lowest to highest state averages (roughly 40-50% difference) indicates substantial geographic price discrimination.
Key Insight: Restaurants charge significantly more in wealthy states. As a sanity check, we cross-referenced cost-of-living figures for the highest- and lowest-priced states above and found that they track the state-level pricing pattern.
# Average price at each unique restaurant location
avg_df = df.groupby(["latitude", "longitude"])["price"].mean().reset_index()
print("Unique locations:", len(avg_df))
heat_data = avg_df[['latitude', 'longitude', 'price']].values.tolist()
# Center the map on the continental US
m = folium.Map(location=[39.5, -98.35], zoom_start=4)
HeatMap(
    heat_data,
    radius=18,
    blur=12,
    max_zoom=8,
    min_opacity=0.4
).add_to(m)
m.save("price_intensity_heatmap.html")
m
Spatial Heat Map Analysis:
The interactive heat map visualizes 9,707 unique restaurant locations, color-coded by average menu price intensity. Key spatial patterns emerge:
- Coastal concentration: Intense red zones (high prices) cluster on the West Coast (San Francisco Bay Area, Los Angeles, Seattle) and Northeast corridor (New York, Boston)
- Interior affordability: The Midwest and much of the South show cooler colors (blue/green), indicating lower average prices
- Urban-rural gradient: Metropolitan areas show visibly higher intensity than surrounding rural regions within the same state
This geographic heat map provides visual confirmation that fast-food pricing is not uniform across America. The spatial clustering suggests restaurants are responding to local market conditions rather than implementing flat national pricing.
Menu Item Categories¶
Our categorization process classified items into three main types:
- Mains: Entrees and primary menu items (pizzas, burritos, bowls, sandwiches, wings) - 78.1% of categorized items
- Sides: Complementary items (breadsticks, chips, salads, cookies, dipping sauces) - 14.7% of categorized items
- Drinks: Beverages (sodas, teas, juices, water) - 7.2% of categorized items
The distribution shows mains heavily dominate the dataset, representing over three-quarters of categorized items. This makes sense as entrees are the core offerings at these restaurants.
Of the 1.72 million total observations, 1.24 million (72%) were successfully categorized using keyword matching on menu item names, while 481,202 items (28%) remain uncategorized. Uncategorized items typically include seasonal specials, customizations, promotional bundles, and items with non-standard naming conventions that don't match our "mains/sides/drinks" keyword patterns.
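The keyword-matching step can be sketched as follows. The keyword lists and function name here are hypothetical stand-ins; the notebook's actual lists may differ, but the mechanism is the same: an item falls into the first category whose keyword appears in its name, and items matching nothing stay uncategorized.

```python
import pandas as pd

# Hypothetical keyword lists -- illustrative, not the notebook's exact ones
KEYWORDS = {
    "main":  ["pizza", "burrito", "bowl", "sandwich", "wings"],
    "side":  ["breadstick", "chips", "salad", "cookie", "sauce"],
    "drink": ["soda", "tea", "juice", "water"],
}

def categorize(item_name):
    """Return the first category whose keyword appears in the item name, else None."""
    name = item_name.lower()
    for category, words in KEYWORDS.items():
        if any(w in name for w in words):
            return category
    return None  # uncategorized: seasonal specials, bundles, non-standard names

items = pd.Series(["Pepperoni Pizza", "Garden Salad", "Bottled Water", "Family Feast Bundle"])
print(items.map(categorize).tolist())  # ['main', 'side', 'drink', None]
```

On the full dataset this would populate the `item_type` column used in the cells below, with bundles and promotions left as `None`.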
# Menu Item Type Distribution
print("MENU ITEM CATEGORIZATION")
print("\nItem type distribution:")
print(df['item_type'].value_counts().to_string())
print("\nItem type percentage:")
item_type_pct = df['item_type'].value_counts(normalize=True) * 100
print(item_type_pct.round(2).to_string())
print(f"\nCategorized items: {df['item_type'].notna().sum():,}")
print(f"Uncategorized items: {df['item_type'].isna().sum():,}")
print(f"Total items: {len(df):,}")
# Visualize distribution
plt.figure(figsize=(8, 5))
df['item_type'].value_counts().plot(kind='bar')
plt.title("Menu Items by Category")
plt.xlabel("Item Type")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Correlation Analysis¶
We examine relationships between pricing variables, economic factors (income, minimum wage), and competition measures using a correlation matrix.
Key Findings:
The correlation matrix reveals surprisingly weak relationships between menu prices and local economic conditions:
- Minimum wage (r = 0.03): Nearly zero - labor costs barely predict prices
- Income (r = -0.02): Negligible negative - wealthier areas aren't more expensive
- Competition (r < 0.02): No relationship - restaurant density doesn't affect pricing
- Restaurant average (r = 0.21): Weak positive - items track their restaurant's overall level
Interpretation:
These weak correlations suggest simple linear relationships cannot explain fast-food pricing. Possible reasons include non-linear effects, confounding variables (brand identity, menu composition), and within-chain pricing standardization that overrides local market conditions.
These findings motivate Section 5's hypothesis testing, where we'll use ANOVA and multivariate regression to test categorical differences (brand, region, item type) while controlling for multiple predictors simultaneously. The weak correlations preview our core finding: fast-food pricing is product-driven (what you buy, which brand) rather than market-driven (local economics).
print("\nCORRELATION MATRIX")
# Include price, economic factors (income, min_wage), and competition measures
num_cols = ["price", "avg_price", "income", "min_wage",
"city_rest_count", "state_rest_count", "pincode_rest_count"]
plt.figure(figsize=(12, 8))
corr_matrix = df[num_cols].corr()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f",
center=0, vmin=-1, vmax=1, square=True, linewidths=0.5)
plt.title("Correlation Matrix: Prices, Economic Factors, and Competition", fontsize=14, pad=20)
plt.xlabel("")
plt.ylabel("")
plt.tight_layout()
plt.show()
# Print key correlations for interpretation
print("\nKey correlations with price:")
price_corr = corr_matrix['price'].sort_values(ascending=False)
print(price_corr.to_string())
5. Hypothesis Testing¶
This section moves from descriptive analysis to inferential statistics, testing specific hypotheses about the relationships between economic factors and fast-food pricing.
Our hypothesis testing strategy addresses the core research questions posed in the Introduction:
- H1: Restaurant Chain Effects - Do different brands (Chipotle, Domino's, Papa John's) have significantly different pricing strategies?
- H2: Item Type Effects - Does the category of menu item (mains, sides, drinks) systematically affect price?
- H3: Regional Pricing Differences - Do US geographic regions show significant variation in menu prices?
- H4: Multivariate Regression - How do income, minimum wage, and restaurant chain jointly explain price variation?
- H5: Competition Effects - Does the level of local restaurant competition influence pricing?
Each hypothesis test follows the same structure:
- Null Hypothesis (H₀): Statement of no effect or no difference
- Alternative Hypothesis (H₁): Statement that an effect exists
- Test Selection: Appropriate statistical method (ANOVA, regression, correlation)
- Results: Test statistic, p-value, and interpretation
- Conclusion: Reject or fail to reject the null hypothesis, with practical implications
We use a significance level of α = 0.05 for all tests, meaning we require p-values below 0.05 to claim statistical significance.
We begin by preparing the dataset for hypothesis testing. This involves converting all price-related columns to numeric format, ensuring categorical variables (restaurant names, states, cities) are properly formatted, and removing any rows with missing critical information.
After cleaning, our analysis dataset contains 1,718,220 observations - a robust sample size that provides strong statistical power for detecting even small effects. With this large sample, we must be careful to distinguish between statistical significance (p < 0.05) and practical significance (effect size matters in real-world terms).
all_dataset_df = all_df.copy()
df = all_dataset_df.copy()
# Make sure key numeric columns are numeric
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["income"] = pd.to_numeric(df["income"], errors="coerce")
df["min_wage"] = pd.to_numeric(df["min_wage"], errors="coerce")
# Drop rows missing core fields where needed
df = df.dropna(subset=["price", "restaurant_name", "state", "city"])
print("Total rows after cleaning:", len(df))
H1: Effect of Restaurant Chain on Menu Price¶
Research Question: Do the three restaurant chains (Chipotle, Domino's, Papa John's) have significantly different average menu prices?
Hypotheses:
- H₀: Mean menu price is the same across all restaurant brands (μ_Chipotle = μ_Domino's = μ_PapaJohns)
- H₁: At least one restaurant brand has a different mean menu price
Statistical Test: One-way ANOVA (Analysis of Variance)
Why ANOVA? We're comparing means across three independent groups. ANOVA tests whether the variance between groups is significantly larger than the variance within groups. If restaurant brand influences pricing, we expect items within each brand to be similar to each other but different from other brands.
print("\n=== H1: Price ~ Restaurant Name (ANOVA) ===")
groups_rest = [
g["price"].values
for _, g in df.groupby("restaurant_name")
if len(g) > 1
]
anova_rest = f_oneway(*groups_rest)
print("H1 F-statistic:", anova_rest.statistic)
print("H1 p-value:", anova_rest.pvalue)
Results¶
Test Statistics:
- F-statistic: 67,317.66
- p-value: 0 (below floating-point precision)
Interpretation:
The one-way ANOVA produces an extremely large F-statistic with a p-value reported as 0. This provides overwhelming evidence to reject the null hypothesis.
What the F-statistic means: F is the ratio of between-group to within-group mean squares. A value of 67,317 means the price variation between the three chains dwarfs the variation within each chain - restaurant brand is a massive determinant of menu pricing.
Conclusion:
Menu prices differ highly significantly across restaurant chains. Recall from our EDA:
- Chipotle: $11.44 average
- Papa John's: $13.88 average
- Domino's: $14.32 average
These differences are not due to random chance - they reflect fundamental differences in business models, menu composition, and corporate pricing strategies. The corporate-owned Chipotle maintains lower, more consistent prices, while the franchise-dominated pizza chains have higher and more variable pricing.
Practical Significance: A consumer choosing between chains can expect to pay approximately $2-3 more per item at Domino's or Papa John's compared to Chipotle, a difference of about 20-25%.
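With 1.7 million rows, almost any difference is statistically significant, so an effect size helps separate statistical from practical significance. A minimal sketch, using small synthetic samples drawn around the reported chain means (the sample sizes and spreads are invented), computes eta-squared: the share of total price variance explained by chain membership.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Synthetic prices centered on the reported chain means (spreads are assumptions)
chipotle  = rng.normal(11.44, 2.0, 5000)
dominos   = rng.normal(14.32, 3.0, 5000)
papajohns = rng.normal(13.88, 3.0, 5000)
groups = [chipotle, dominos, papajohns]

f_stat, p_val = f_oneway(*groups)

# Eta-squared = SS_between / SS_total: variance share explained by the grouping
all_vals = np.concatenate(groups)
grand_mean = all_vals.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_vals - grand_mean) ** 2).sum()
eta_sq = ss_between / ss_total
print(f"F = {f_stat:.1f}, p = {p_val:.3g}, eta^2 = {eta_sq:.3f}")
```

The same three lines of arithmetic applied to the real groups would quantify how much of the price variance brand alone explains, complementing the raw F-statistic.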
H2: Effect of Item Type on Menu Price¶
Research Question: Does menu item category (mains, sides, drinks) significantly affect price?
Hypotheses:
- H₀: Mean price is the same across all item types (μ_mains = μ_sides = μ_drinks)
- H₁: At least one item type has a different mean price
Statistical Test: One-way ANOVA
Why This Matters: This test validates our data quality and menu categorization. We expect mains (entrees) to cost significantly more than sides and drinks. If this hypothesis fails, it would suggest problems with our categorization or data collection.
print("\n=== H2: Price ~ Item Type (ANOVA) ===")
df_item = df.dropna(subset=["item_type"])
groups_item = [
g["price"].values
for _, g in df_item.groupby("item_type")
if len(g) > 1
]
anova_item = f_oneway(*groups_item)
print("H2 F-statistic:", anova_item.statistic)
print("H2 p-value:", anova_item.pvalue)
Results¶
Test Statistics:
- F-statistic: 416,933.06
- p-value: 0
Interpretation:
The ANOVA produces an extraordinarily large F-statistic, providing definitive evidence to reject the null hypothesis. Menu item type has a profound effect on pricing.
What this tells us: The between-group mean square (differences across item types) is 416,933 times the within-group mean square. This is one of the strongest effects in our entire analysis, which makes intuitive sense - a pizza naturally costs more than a side of breadsticks or a soda.
Conclusion:
Menu prices differ extremely significantly by item category (p = 0). This validates our categorization approach and confirms that comparing "average prices" across restaurants requires controlling for menu composition. A restaurant with more drinks and sides will appear cheaper than one selling primarily entrees, even if individual item prices are similar.
Implication for Analysis: When comparing restaurant pricing strategies, we must account for item type. This is why our multivariate regression (H4) includes item type as a control variable.
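ANOVA only says that at least one category differs; a post-hoc test identifies which pairs. A sketch using Tukey's HSD from statsmodels, on small synthetic samples loosely matching the mains > sides > drinks ordering (the means and sample sizes are invented), shows the pattern:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
# Synthetic item-level prices (means chosen to mimic mains > sides > drinks)
data = pd.DataFrame({
    "price": np.concatenate([
        rng.normal(14.0, 3.0, 2000),  # mains
        rng.normal(6.0, 2.0, 2000),   # sides
        rng.normal(2.5, 1.0, 2000),   # drinks
    ]),
    "item_type": ["main"] * 2000 + ["side"] * 2000 + ["drink"] * 2000,
})

# Tukey HSD compares every pair of categories while controlling family-wise error
tukey = pairwise_tukeyhsd(endog=data["price"], groups=data["item_type"], alpha=0.05)
print(tukey.summary())
```

Run on the real `df_item`, this would confirm that all three pairwise gaps (main-side, main-drink, side-drink) are individually significant, not just the omnibus test.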
H3: Regional Differences in Menu Pricing¶
Research Question: Are there significant price differences across US Census regions (West, Northeast, South, Midwest)?
Hypotheses:
- H₀: Mean price is the same across all regions (μ_West = μ_Northeast = μ_South = μ_Midwest)
- H₁: At least one region has a different mean price
Statistical Test: One-way ANOVA
Why Regions Matter: US Census regions capture broad economic and geographic differences:
- West: Includes high-cost states like California, Washington, Hawaii
- Northeast: Major metros (NYC, Boston) with high cost of living
- South: Generally lower wages and costs (except major cities)
- Midwest: Lower cost of living, more rural areas
print("\n=== H3: Price ~ Region (ANOVA) ===")
region_map = {
"ME":"Northeast","NH":"Northeast","VT":"Northeast","MA":"Northeast","RI":"Northeast","CT":"Northeast",
"NY":"Northeast","NJ":"Northeast","PA":"Northeast",
"OH":"Midwest","MI":"Midwest","IN":"Midwest","IL":"Midwest","WI":"Midwest","MN":"Midwest","IA":"Midwest",
"MO":"Midwest","ND":"Midwest","SD":"Midwest","NE":"Midwest","KS":"Midwest",
"DE":"South","MD":"South","DC":"South","VA":"South","WV":"South","NC":"South","SC":"South",
"GA":"South","FL":"South","KY":"South","TN":"South","AL":"South","MS":"South","AR":"South",
"LA":"South","TX":"South","OK":"South",
"MT":"West","ID":"West","WY":"West","CO":"West","NM":"West","AZ":"West",
"UT":"West","NV":"West","WA":"West","OR":"West","CA":"West","AK":"West","HI":"West"
}
df["region"] = df["state"].map(region_map)
df_region = df.dropna(subset=["region"])
groups_region = [
g["price"].values
for _, g in df_region.groupby("region")
]
anova_region = f_oneway(*groups_region)
print("H3 F-statistic:", anova_region.statistic)
print("H3 p-value:", anova_region.pvalue)
region_means = df_region.groupby("region")["price"].mean().round(2)
print("\nMean price by region:")
print(region_means)
Results¶
Test Statistics:
- F-statistic: 1,407.68
- p-value: 0
Regional Mean Prices:
| Region | Mean Price | Ranking |
|---|---|---|
| West | $14.34 | Highest |
| South | $13.30 | 2nd |
| Northeast | $13.20 | 3rd |
| Midwest | $12.91 | Lowest |
Interpretation:
The ANOVA reveals highly significant regional price variation (p = 0). We confidently reject the null hypothesis - menu prices are not uniform across US regions.
Regional Price Spread:
- The difference between West (14.34) and Midwest (12.91) is $1.43 per item
- This represents approximately 11% higher prices in the West compared to Midwest
- For a family purchasing 4 items, this translates to $5.72 more in Western states
Conclusion:
Geographic region significantly influences fast-food pricing, with the West showing clearly higher prices. This provides strong evidence for geographic price discrimination - restaurants adjust prices based on local economic conditions.
H4: Multivariate Regression - Price ~ Income + Minimum Wage + Restaurant Chain¶
Research Question: How do local income, state minimum wage, and restaurant brand simultaneously influence menu prices?
Hypotheses:
- H₀: All predictor coefficients equal zero (β_income = β_wage = β_brand = 0) - the model has no explanatory power
- H₁: At least one predictor has a non-zero coefficient - the model explains significant price variance
Statistical Test: Multiple Linear Regression (Ordinary Least Squares)
Model Specification:
Price = β₀ + β₁(Income) + β₂(MinWage) + β₃(Domino's) + β₄(PapaJohns) + ε
Where:
- Income: Median household income in ZIP code (continuous, in dollars)
- MinWage: State minimum wage (continuous, in dollars/hour)
- Domino's: Binary indicator (1 if Domino's, 0 otherwise); Chipotle is the reference category
- PapaJohns: Binary indicator (1 if Papa John's, 0 otherwise)
- ε: Random error term
print("\n=== H4: price ~ income + min_wage + C(restaurant_name) (OLS Regression) ===")
df_reg = df.dropna(subset=["income", "min_wage"])
model_h4 = smf.ols("price ~ income + min_wage + C(restaurant_name)", data=df_reg).fit()
print(model_h4.summary())
anova_h4 = anova_lm(model_h4, typ=2)
print("\nH4 ANOVA table:")
print(anova_h4)
Results¶
Overall Model Fit:
- F-statistic: 35,730
- R²: 0.077 (explains 7.7% of price variance)
- Observations: 1,718,220
Interpretation:
The model is highly statistically significant (p = 0), confirming that income, minimum wage, and restaurant chain collectively influence menu prices. However, R² = 0.077 indicates these macro-economic factors explain only a modest portion of price variation - most variance comes from item-specific factors (what's being sold) rather than location economics.
Key Coefficients:
| Predictor | Coefficient | p-value | Interpretation |
|---|---|---|---|
| Income | 2.465e-06 | 0 | +$0.0025 per $1,000 income increase |
| Min Wage | 0.160 | 0 | +$0.16 per $1.00 wage increase |
| Domino's | +5.89 | 0 | +$5.89 vs Chipotle |
| Papa John's | +6.69 | 0 | +$6.69 vs Chipotle |
Key Findings:
1. Income Effect (Demand-Side): Each \$10,000 increase in median household income is associated with only a $0.025 price increase (coefficient = 2.465e-06 × 10,000 = 0.0246). This is economically negligible - restaurants barely adjust prices based on local wealth.
2. Minimum Wage Effect (Supply-Side): Each \$1.00 increase in state minimum wage is associated with a $0.16 price increase. This represents ~16% pass-through of labor costs to consumers. Example: California's \$16.50 wage vs Texas's \$7.25 wage (\$9.25 difference) predicts a \$1.48 price premium in California (0.16 × 9.25 = 1.48).
3. Restaurant Brand Dominance: Even controlling for income and wages, Domino's charges \$5.89 more and Papa John's charges \$6.69 more than Chipotle per menu item. Brand identity has a far larger impact (roughly 37× the per-dollar wage effect) than local economic conditions.
Conclusion:
All predictors are statistically significant, but their practical importance varies dramatically:
- Restaurant chain (+\$5.89-\$6.69) >> Minimum wage (+\$0.16 per dollar) >>> Income (+\$0.0025 per \$1,000)
The brand effect is approximately 37× larger than the wage effect and 2,400× larger than the income effect, confirming that what you buy and where you buy from matters far more than where the restaurant is located.
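The coefficient table supports back-of-envelope predictions. Since price differences cancel the intercept, only the reported coefficients are needed; the sketch below compares a Domino's in California (\$16.50 wage, as quoted above) with a Chipotle in Texas (\$7.25 wage). The \$90k and \$60k median incomes are hypothetical scenario values, not figures from the data.

```python
# Reported H4 coefficients (no intercept needed for a price *difference*)
beta_income  = 2.465e-06  # $ per $1 of median household income
beta_wage    = 0.160      # $ per $1/hour of state minimum wage
beta_dominos = 5.89       # $ vs Chipotle (reference category)

# Hypothetical scenario: Domino's in CA ($16.50 wage, $90k income)
# vs Chipotle in TX ($7.25 wage, $60k income)
diff = (beta_income * (90_000 - 60_000)    # income gap:   ~$0.07
        + beta_wage * (16.50 - 7.25)       # wage gap:     ~$1.48
        + beta_dominos)                    # brand gap:     $5.89
print(f"Predicted price gap: ${diff:.2f}")  # ~ $7.44
```

The decomposition makes the conclusion concrete: of the ~\$7.44 predicted gap, \$5.89 is brand, \$1.48 is wage pass-through, and only pennies come from income.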
H5: Effect of City-Level Competition on Menu Prices¶
Research Question: Does the number of competing restaurant locations in a city affect average menu prices?
Hypotheses:
- H₀: City-level restaurant count has no relationship with average menu price (β_competition = 0)
- H₁: Restaurant count significantly predicts average menu price (β_competition ≠ 0)
Statistical Test: Linear Regression at city level
Approach:
We aggregate data to the city level to test if restaurant density affects pricing. Standard competition theory predicts more competition → lower prices. However, restaurant count may proxy for city characteristics (big cities have both more restaurants AND higher costs), creating a confounding variable. We control for this by including mean income in our regression model.
print("\n=== H5: City-level Competition vs Mean Price (Regression) ===")
# 1) city-level restaurant_count
city_counts = df.groupby(["city", "state"])["restaurant_id"].nunique().reset_index()
city_counts.columns = ["city", "state", "restaurant_count"]
# 2) city-level mean price
city_prices = df.groupby(["city", "state"])["price"].mean().reset_index()
city_prices.columns = ["city", "state", "mean_price"]
# 3) city-level mean income
city_income = df.groupby(["city", "state"])["income"].mean().reset_index()
city_income.columns = ["city", "state", "mean_income"]
city_df = city_counts.merge(city_prices, on=["city", "state"], how="inner")
city_df = city_df.merge(city_income, on=["city", "state"], how="left")
city_df = city_df.dropna(subset=["mean_price", "restaurant_count", "mean_income"])
corr_comp, p_comp = pearsonr(city_df["restaurant_count"], city_df["mean_price"])
print("H5 City-level corr(restaurant_count, mean_price):", corr_comp)
print("H5 correlation p-value:", p_comp)
# Regression: mean_price ~ restaurant_count + mean_income
model_h5 = smf.ols("mean_price ~ restaurant_count + mean_income", data=city_df).fit()
print("\nH5 Regression: mean_price ~ restaurant_count + mean_income")
print(model_h5.summary())
anova_h5 = anova_lm(model_h5, typ=2)
print("\nH5 ANOVA table:")
print(anova_h5)
Results¶
Simple Correlation:
- Pearson r = 0.083 (very weak positive correlation)
- p-value = 3.64e-07 (< 0.001, statistically significant)
Multiple Regression: mean_price ~ restaurant_count + mean_income
Model Fit:
- R² = 0.019 (explains only 1.9% of city-level variance)
- F-statistic = 36.4 (p < 0.001)
Coefficients:
| Predictor | Coefficient | p-value | Interpretation |
|---|---|---|---|
| restaurant_count | +0.042 | < 0.001 | Each additional restaurant adds \$0.042 to city mean price |
| mean_income | -0.000006 | < 0.001 | Each \$1,000 income reduces price by \$0.006 |
Interpretation:
The analysis reveals a paradoxical finding: cities with more restaurants have slightly higher prices (coefficient = +0.042), not lower as competition theory predicts. This suggests restaurant count proxies for urbanization rather than competitive pressure—big cities have both more restaurants AND higher costs (rent, wages, regulations).
Income Effect:
The negative income coefficient (-0.000006) is surprising but practically negligible. A $20,000 income difference changes prices by only 12 cents—essentially no effect after controlling for competition.
Conclusion:
City-level competition shows a weak positive relationship with prices (r = 0.083, p < 0.001), contrary to standard economic theory. However, the model explains only 1.9% of variance, indicating that city-level factors (competition, income) are poor predictors of fast-food pricing. Restaurant brand and menu composition dominate over local market conditions.
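The "more restaurants, higher prices" result is exactly what omitted-variable confounding would produce. A toy simulation (all numbers invented) makes the mechanism explicit: a latent urbanization variable raises both restaurant count and price, while competition itself has no effect, yet a naive regression still assigns restaurant count a positive coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500
# Latent city size drives BOTH variables; price does NOT depend on count directly
urbanization = rng.uniform(0, 1, n)                              # unobserved
restaurant_count = (5 + 40 * urbanization + rng.normal(0, 3, n)).round()
mean_price = 12 + 2.5 * urbanization + rng.normal(0, 0.5, n)
sim = pd.DataFrame({"restaurant_count": restaurant_count,
                    "mean_price": mean_price})

# Naive regression omitting urbanization: count picks up a spurious positive slope
model = smf.ols("mean_price ~ restaurant_count", data=sim).fit()
print("restaurant_count coefficient:", round(model.params["restaurant_count"], 4))
```

This mirrors the H5 finding: the positive coefficient reflects urbanization (rent, wages, regulations), not a failure of competition theory.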
Note: Section 6 (Predictive Modelling) does not run in this runtime because of a Colab RAM crash; run it from the companion notebook¶
# Persist the cleaned dataset for the predictive-modelling notebook
all_dataset_df.to_csv(f"{base_location}/Data/all_dataset.csv")