New York City Airbnb Market Analysis (2019)¶

Shraddha Chandrashekar
UID: 122092846


Notebook goal: build an end-to-end workflow (cleaning → EDA → hypothesis testing → ML) using the AB_NYC_2019.csv dataset.



Table of Contents¶

  1. Introduction
  2. Data Collection
  3. Data Processing
  4. Exploratory Analysis & Data Visualization
  5. Hypothesis Testing
  6. Machine Learning
  7. Conclusions
  8. Limitations & Future Work
  9. External Links

1. Introduction¶

Short-term rental platforms such as Airbnb have transformed the housing and tourism markets of major metropolitan areas. New York City, one of the most visited cities in the world, represents a particularly complex and competitive Airbnb marketplace due to its diverse neighborhoods, strict housing regulations, and wide variation in listing prices.

For hosts, determining an appropriate nightly price is a challenging decision that depends on multiple factors including location, room type, availability, and host behavior. For policymakers, understanding pricing and listing patterns can help assess housing availability and the impact of short-term rentals on local communities.

In this project, we analyze Airbnb listings in New York City using data from 2019. The goals of this analysis are threefold:

  • To understand how prices vary across boroughs and room types
  • To identify key features that influence Airbnb pricing
  • To build a machine learning model capable of predicting listing prices

This notebook is structured as a step-by-step tutorial, walking through data collection, cleaning, exploratory analysis, hypothesis testing, and predictive modeling.

2. Data Collection¶

The dataset used in this analysis comes from Inside Airbnb, a publicly available project that provides detailed information about Airbnb listings in cities around the world.

Dataset Overview¶

  • File name: AB_NYC_2019.csv
  • Geographic scope: New York City, USA
  • Time period: 2019
  • Number of listings: ~49,000
  • Unit of observation: One Airbnb listing

Each row in the dataset represents a unique Airbnb listing and includes information about its price, location, room type, availability, and host characteristics.

Key Features¶

Some of the most important variables in the dataset include:

  • price: Nightly listing price in USD
  • neighbourhood_group: Borough (e.g., Manhattan, Brooklyn)
  • neighbourhood: Specific neighborhood within a borough
  • room_type: Entire home/apartment, private room, or shared room
  • minimum_nights: Minimum stay requirement
  • number_of_reviews: Total number of reviews
  • availability_365: Number of days available in a year

This dataset does not include booking data or actual transaction prices, which is an important limitation discussed later.

Importing Libraries and Loading Data

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)

df_raw = pd.read_csv("AB_NYC_2019.csv")
df_raw.head()
Out[3]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0

Initial Data Inspection¶

Before cleaning, I check:

  • column names and types,
  • missing values,
  • basic summary statistics.

This step helps identify which columns are useful and which columns need cleaning.

In [4]:
df_raw.info()
df_raw.isnull().sum()
df_raw.describe(include='all')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     38843 non-null  object 
 13  reviews_per_month               38843 non-null  float64
 14  calculated_host_listings_count  48895 non-null  int64  
 15  availability_365                48895 non-null  int64  
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB
Out[4]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
count 4.889500e+04 48879 4.889500e+04 48874 48895 48895 48895.000000 48895.000000 48895 48895.000000 48895.000000 48895.000000 38843 38843.000000 48895.000000 48895.000000
unique NaN 47905 NaN 11452 5 221 NaN NaN 3 NaN NaN NaN 1764 NaN NaN NaN
top NaN Hillside Hotel NaN Michael Manhattan Williamsburg NaN NaN Entire home/apt NaN NaN NaN 2019-06-23 NaN NaN NaN
freq NaN 18 NaN 417 21661 3920 NaN NaN 25409 NaN NaN NaN 1413 NaN NaN NaN
mean 1.901714e+07 NaN 6.762001e+07 NaN NaN NaN 40.728949 -73.952170 NaN 152.720687 7.029962 23.274466 NaN 1.373221 7.143982 112.781327
std 1.098311e+07 NaN 7.861097e+07 NaN NaN NaN 0.054530 0.046157 NaN 240.154170 20.510550 44.550582 NaN 1.680442 32.952519 131.622289
min 2.539000e+03 NaN 2.438000e+03 NaN NaN NaN 40.499790 -74.244420 NaN 0.000000 1.000000 0.000000 NaN 0.010000 1.000000 0.000000
25% 9.471945e+06 NaN 7.822033e+06 NaN NaN NaN 40.690100 -73.983070 NaN 69.000000 1.000000 1.000000 NaN 0.190000 1.000000 0.000000
50% 1.967728e+07 NaN 3.079382e+07 NaN NaN NaN 40.723070 -73.955680 NaN 106.000000 3.000000 5.000000 NaN 0.720000 1.000000 45.000000
75% 2.915218e+07 NaN 1.074344e+08 NaN NaN NaN 40.763115 -73.936275 NaN 175.000000 5.000000 24.000000 NaN 2.020000 2.000000 227.000000
max 3.648724e+07 NaN 2.743213e+08 NaN NaN NaN 40.913060 -73.712990 NaN 10000.000000 1250.000000 629.000000 NaN 58.500000 327.000000 365.000000
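The `info()` output above shows that `last_review` and `reviews_per_month` are missing for roughly 10,000 rows. A quick way to see missingness as percentages is `isnull().mean()`; a minimal sketch on a toy stand-in for `df_raw`:

```python
import numpy as np
import pandas as pd

# Toy stand-in mirroring columns with missing values in AB_NYC_2019.csv
toy = pd.DataFrame({
    'name': ['a', None, 'c', 'd'],
    'reviews_per_month': [0.21, np.nan, 4.64, np.nan],
    'price': [149, 225, 150, 89],
})

# Fraction of nulls per column, expressed as a percentage
missing_pct = toy.isnull().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_pct)
```

On the real data, `df_raw.isnull().mean().mul(100)` gives the same per-column view in one line.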

3. Data Processing¶

Before performing analysis or modeling, the dataset must be cleaned and standardized. Raw real-world data often contains missing values, extreme outliers, and inconsistent formatting that can negatively affect results.

Main goals:

  1. Handle missing values in important columns
  2. Fix invalid values (e.g., price <= 0)
  3. Reduce the effect of extreme outliers in price
  4. Convert categorical columns to appropriate types
  5. Create a few features that make the analysis easier and more interpretable (e.g., log(price), reviews per month)

Copy, Drop Missing, and Remove Invalid Values

In [5]:
df = df_raw.copy()

# Remove rows with missing price or room type
df = df.dropna(subset=['price', 'room_type', 'neighbourhood_group', 'neighbourhood'])

# Remove zero or negative prices
df = df[df['price'] > 0]

# Also remove listings with missing critical numeric fields used later
df = df.dropna(subset=['minimum_nights', 'number_of_reviews', 'availability_365'])

df.shape
Out[5]:
(48884, 16)

Handling Outliers in Price¶

Airbnb prices can have a long right tail (very expensive listings). A few extreme values can distort averages and make models unstable. A simple approach is to cap prices at a reasonable upper bound.

Here I cap prices at $1000. This keeps expensive listings in the data but limits their influence.

Cap price and Check

In [6]:
df['price_capped'] = np.where(df['price'] > 1000, 1000, df['price'])

print("Max original price:", df['price'].max())
print("Max capped price:", df['price_capped'].max())
df[['price', 'price_capped']].describe()
Max original price: 10000
Max capped price: 1000
Out[6]:
price price_capped
count 48884.000000 48884.000000
mean 152.755053 145.510024
std 240.170260 130.946570
min 10.000000 10.000000
25% 69.000000 69.000000
50% 106.000000 106.000000
75% 175.000000 175.000000
max 10000.000000 1000.000000
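The flat $1000 cap is a judgment call. A common alternative is the IQR rule (cap at Q3 + 1.5·IQR), sketched here on a toy stand-in for the price column:

```python
import pandas as pd

# Toy stand-in for df['price']: mostly moderate values plus one extreme listing
prices = pd.Series([69, 90, 106, 120, 175, 5000])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)  # classic IQR fence

capped = prices.clip(upper=upper)
print("IQR cap:", upper, "max after capping:", capped.max())
```

Unlike a fixed dollar cap, the IQR fence adapts to the distribution, though on long-tailed data it can cap more aggressively than intended.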

Data Type Conversion¶

Some columns are categorical (borough and room type). Setting them as categorical helps keep the dataset clean and also helps later when encoding features for modeling.

In [9]:
df['neighbourhood_group'] = df['neighbourhood_group'].astype('category')
df['room_type'] = df['room_type'].astype('category')

Feature Engineering¶

A few additional features make analysis easier:

  • log_price: log transform to reduce skewness
  • reviews_per_month: some listings have missing values; fill missing with 0
  • has_reviews: whether the listing has any reviews
  • availability_rate: availability_365 scaled to [0,1]

Create features

In [10]:
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
df['has_reviews'] = (df['number_of_reviews'] > 0).astype(int)

df['log_price'] = np.log1p(df['price_capped'])
df['availability_rate'] = df['availability_365'] / 365

df[['price_capped', 'log_price', 'reviews_per_month', 'availability_rate']].head()
Out[10]:
price_capped log_price reviews_per_month availability_rate
0 149 5.010635 0.21 1.000000
1 225 5.420535 0.38 0.972603
2 150 5.017280 0.00 1.000000
3 89 4.499810 4.64 0.531507
4 80 4.394449 0.10 0.000000

4. Exploratory Analysis & Data Visualization¶

Now that the data is cleaned, I explore:

  • the overall distribution of prices,
  • how prices vary across boroughs and room types,
  • whether review/availability variables show visible relationships to price,
  • which numeric variables correlate with one another.

For each plot, the goal is to interpret what it suggests about pricing behavior.

Price Distribution (Raw vs Log Transformed)¶

Airbnb prices are usually right-skewed (many moderate prices, few expensive listings). A log transform often makes patterns easier to see.

Histograms

In [11]:
plt.figure(figsize=(7,4))
plt.hist(df['price_capped'], bins=50)
plt.title("Distribution of Price (capped at $1000)")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()

plt.figure(figsize=(7,4))
plt.hist(df['log_price'], bins=50)
plt.title("Distribution of log(1 + price)")
plt.xlabel("log(1 + price)")
plt.ylabel("Frequency")
plt.show()

Price by Borough¶

I expect Manhattan to have the highest prices because of tourism and central location. Brooklyn often follows. Queens, Bronx, and Staten Island tend to be cheaper on average.

Boxplots help compare distributions (not just averages).

Boxplot borough

In [12]:
plt.figure(figsize=(10,6))
sns.boxplot(x='neighbourhood_group', y='price_capped', data=df)
plt.title("Price Distribution by Borough (capped)")
plt.xlabel("Borough")
plt.ylabel("Price (capped)")
plt.show()

Price by Room Type¶

Room type should matter a lot. Entire homes/apartments are expected to be more expensive than private rooms, and shared rooms should be the least expensive.

Boxplot room type

In [13]:
plt.figure(figsize=(10,6))
sns.boxplot(x='room_type', y='price_capped', data=df)
plt.title("Price Distribution by Room Type (capped)")
plt.xlabel("Room Type")
plt.ylabel("Price (capped)")
plt.xticks(rotation=20)
plt.show()

df.groupby("room_type", observed=True)['price_capped'].mean().sort_values()
Out[13]:
room_type
Shared room         69.341969
Private room        86.673552
Entire home/apt    200.667021
Name: price_capped, dtype: float64

Borough + Room Type Together¶

To understand whether borough differences persist within each room type, I compare price by borough and room type together.

Grouped plot

In [14]:
plt.figure(figsize=(12,6))
sns.boxplot(x='neighbourhood_group', y='price_capped', hue='room_type', data=df)
plt.title("Price by Borough and Room Type (capped)")
plt.xlabel("Borough")
plt.ylabel("Price (capped)")
plt.legend(title="Room Type")
plt.show()

Most Expensive Neighborhoods (by Median Price)¶

Borough is a broad label. Neighborhoods can vary a lot even within the same borough. Here I compute median price by neighborhood and look at the top 15 neighborhoods.

Top neighborhoods

In [15]:
neigh_median = (
    df.groupby('neighbourhood')['price_capped']
      .median()
      .sort_values(ascending=False)
      .head(15)
)

plt.figure(figsize=(10,6))
neigh_median.sort_values().plot(kind='barh')
plt.title("Top 15 Neighborhoods by Median Price (capped)")
plt.xlabel("Median Price (capped)")
plt.ylabel("Neighborhood")
plt.show()

neigh_median
Out[15]:
neighbourhood
Fort Wadsworth        800.0
Woodrow               700.0
Tribeca               295.0
Neponsit              274.0
NoHo                  250.0
Willowbrook           249.0
Flatiron District     225.0
Midtown               210.0
West Village          200.0
Financial District    200.0
SoHo                  199.0
Chelsea               199.0
Greenwich Village     197.5
Breezy Point          195.0
Battery Park City     195.0
Name: price_capped, dtype: float64
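Some of these top medians (e.g., Fort Wadsworth, Woodrow) likely come from only a handful of listings, which makes them fragile. A sketch of requiring a minimum listing count before ranking, using a toy stand-in for `df`:

```python
import pandas as pd

# Toy stand-in: neighbourhood 'B' has a single expensive listing
toy = pd.DataFrame({
    'neighbourhood': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'C'],
    'price_capped':  [100, 120, 110, 800, 150, 160, 155, 170],
})

stats = toy.groupby('neighbourhood')['price_capped'].agg(['median', 'count'])
# Keep only neighbourhoods with at least 3 listings before ranking
reliable = stats[stats['count'] >= 3].sort_values('median', ascending=False)
print(reliable)
```

Applied to the real data, the same filter would likely demote the small Staten Island and Queens neighborhoods at the top of the list.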

Reviews and Availability vs Price¶

Reviews and availability may relate to pricing, but the direction is not obvious. More reviews could mean high demand (possibly higher price), or it could mean lower price leading to more bookings. Availability could reflect host strategy as well.

Scatter plots help show whether there is any visible relationship.

Scatter plots

In [16]:
plt.figure(figsize=(7,4))
plt.scatter(df['number_of_reviews'], df['price_capped'], alpha=0.2)
plt.title("Price vs Number of Reviews")
plt.xlabel("Number of Reviews")
plt.ylabel("Price (capped)")
plt.show()

plt.figure(figsize=(7,4))
plt.scatter(df['availability_365'], df['price_capped'], alpha=0.2)
plt.title("Price vs Availability (days/year)")
plt.xlabel("Availability_365")
plt.ylabel("Price (capped)")
plt.show()

Correlation Heatmap (Numeric Variables)¶

Correlation does not imply causation, but it helps identify whether variables move together. I compute correlations among numeric variables and visualize them.

In [17]:
plt.figure(figsize=(8,6))
sns.heatmap(
    df[['price_capped', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']].corr(),
    annot=True
)
plt.title("Correlation Heatmap (Numeric Features)")
plt.show()

Correlation inspection

In [35]:
numeric_features = df.select_dtypes(include=['int64', 'float64'])
correlation_matrix = numeric_features.corr()

# Identify strongest positive and negative correlations
correlation_pairs = correlation_matrix.unstack().sort_values()
strong_negative = correlation_pairs.head(5)
strong_positive = correlation_pairs.tail(5)

print("Strongest Negative Correlations:")
print(strong_negative)

print("\nStrongest Positive Correlations:")
print(strong_positive)
Strongest Negative Correlations:
longitude          log_price           -0.329992
log_price          longitude           -0.329992
number_of_reviews  id                  -0.319800
id                 number_of_reviews   -0.319800
price_capped       longitude           -0.247066
dtype: float64

Strongest Positive Correlations:
price_capped       price_capped         1.0
has_reviews        has_reviews          1.0
availability_rate  availability_rate    1.0
                   availability_365     1.0
availability_365   availability_rate    1.0
dtype: float64
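The "strongest positive correlations" above are dominated by trivial entries: every variable correlates 1.0 with itself, and each pair appears twice (once in each order). A sketch of masking the diagonal and mirrored duplicates before ranking, on a toy stand-in frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in: 'a' and 'b' are strongly related, 'c' is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'a': x,
    'b': x + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr()
# Upper triangle only: excludes the diagonal and mirrored duplicate pairs
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values()

print(pairs.tail(3))
```

With this mask, each variable pair appears exactly once and the 1.0 self-correlations disappear from the ranking.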

5. Hypothesis Testing¶

Based on EDA, room type seems to strongly affect price. A simple test is to compare the mean prices of two room types.

Here I compare:

  • Entire home/apt vs Private room

Null hypothesis (H0): average prices are equal
Alternative hypothesis (H1): average prices differ

t-test

In [20]:
from scipy.stats import ttest_ind

entire = df[df['room_type']=="Entire home/apt"]['price_capped']
private = df[df['room_type']=="Private room"]['price_capped']

t_stat, p_val = ttest_ind(entire, private, equal_var=False)

t_stat, p_val
Out[20]:
(np.float64(108.68412190706182), np.float64(0.0))
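The p-value prints as 0.0 because it falls below floating-point resolution. Since prices are heavily right-skewed even after capping, a nonparametric complement such as the Mann-Whitney U test is a reasonable robustness check; a sketch on synthetic stand-ins for the two price samples:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic stand-ins for the entire-home and private-room price samples
rng = np.random.default_rng(42)
entire = rng.lognormal(mean=5.2, sigma=0.5, size=500)
private = rng.lognormal(mean=4.4, sigma=0.5, size=500)

# Rank-based test: robust to the heavy right skew of prices
u_stat, p_val = mannwhitneyu(entire, private, alternative='two-sided')
print("U =", u_stat, " p =", p_val)
```

On the real `entire`/`private` series the conclusion would almost certainly agree with the Welch t-test, but the rank-based version does not depend on approximate normality of the means.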

Testing Price Differences Across Boroughs (ANOVA)¶

To test whether boroughs differ in mean price, ANOVA is a common approach.

Null hypothesis (H0): borough means are equal
Alternative hypothesis (H1): at least one borough differs

In [21]:
from scipy.stats import f_oneway

groups = []
for b in df['neighbourhood_group'].cat.categories:
    groups.append(df[df['neighbourhood_group'] == b]['price_capped'])

anova_stat, anova_p = f_oneway(*groups)
anova_stat, anova_p
Out[21]:
(np.float64(1035.230812771272), np.float64(0.0))
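ANOVA only tells us that at least one borough differs, not which ones. A simple follow-up is pairwise Welch t-tests with a Bonferroni correction for multiple comparisons; a sketch on synthetic stand-ins for the per-borough price samples:

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for three per-borough price samples
rng = np.random.default_rng(0)
groups = {
    'Manhattan': rng.normal(196, 50, 300),
    'Brooklyn':  rng.normal(120, 40, 300),
    'Queens':    rng.normal(96, 30, 300),
}

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted significance threshold

for a, b in pairs:
    t, p = ttest_ind(groups[a], groups[b], equal_var=False)  # Welch t-test
    print(f"{a} vs {b}: t={t:.1f}, p={p:.2e}, significant={p < alpha}")
```

With the real five boroughs there would be ten pairs, so the adjusted threshold becomes 0.005.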

6. Machine Learning: Predicting Price¶

In this section I build a regression model to predict listing price.

I start with a simple baseline and then train a Random Forest model.

Important modeling notes:

  • Price is skewed; I created log_price as an alternative target, though the models below predict the capped price directly.
  • Categorical variables must be encoded (borough + room type).
  • Evaluation uses MAE (Mean Absolute Error), which is easy to interpret in dollars.
In [25]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Simple baseline: predict the median price for everyone
y = df['price_capped']
y_train, y_test = train_test_split(y, test_size=0.2, random_state=42)

baseline_pred = np.median(y_train) * np.ones_like(y_test)
baseline_mae = mean_absolute_error(y_test, baseline_pred)

baseline_mae
Out[25]:
77.2404623095019

Random Forest Model (Numeric Features Only)¶

As a first model, I use only numeric features:

  • minimum_nights
  • number_of_reviews
  • availability_365

This is a limited feature set, but it helps establish a starting point.

In [36]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Feature selection
X = df[['minimum_nights', 'number_of_reviews', 'availability_365']]
y = df['price_capped']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Random Forest model
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Model evaluation using MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.4f}")

# Compare model error to a naive baseline: the mean of the test target
# (np.full_like casts the mean to y_test's integer dtype; this differs from
# the earlier median-of-train baseline, hence the different Baseline MAE)
baseline_pred = np.full_like(y_test, y_test.mean())
baseline_mae = mean_absolute_error(y_test, baseline_pred)

print(f"Baseline MAE: {baseline_mae:.4f}")
print(f"Improvement over baseline: {baseline_mae - mae:.4f}")
Mean Absolute Error (MAE): 86.4471
Baseline MAE: 83.0890
Improvement over baseline: -3.3581

Feature Importance (Numeric-Only Model)¶

Note that the numeric-only model underperforms the naive baseline on this split, which suggests these three variables alone carry little pricing signal. Feature importance still gives a rough sense of which numeric variables the model relies on most; it does not prove causation, but it helps interpretation.

In [27]:
plt.figure(figsize=(6,4))
plt.bar(X.columns, model.feature_importances_)
plt.title("Feature Importance (Numeric Features)")
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.show()
In [34]:
# Feature importance analysis for interpretability
# (RandomForestRegressor exposes feature_importances_, not coef_)

if hasattr(model, "feature_importances_"):
    feature_importance = pd.Series(
        model.feature_importances_, index=X.columns
    ).sort_values(ascending=False)

    print("Feature Importance:")
    print(feature_importance)

Improved Model: Include Borough and Room Type¶

The earlier model ignores two major drivers of price: borough and room type. To include them, I one-hot encode these categorical variables and re-train the model.

This should improve predictive performance because borough/room type capture major price differences.

One-hot and RF

In [28]:
X2 = df[['minimum_nights', 'number_of_reviews', 'availability_365', 'neighbourhood_group', 'room_type']]
X2 = pd.get_dummies(X2, drop_first=True)

y2 = df['price_capped']

X_train, X_test, y_train, y_test = train_test_split(
    X2, y2, test_size=0.2, random_state=42
)

model2 = RandomForestRegressor(n_estimators=300, random_state=42)
model2.fit(X_train, y_train)

pred2 = model2.predict(X_test)
mae2 = mean_absolute_error(y_test, pred2)

baseline_mae, mae, mae2
Out[28]:
(77.2404623095019, 86.44710796983082, 64.4309596953749)
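Given the skew noted earlier, one natural variant is to train on log_price and convert predictions back with expm1 so that MAE stays in dollars. A minimal sketch on synthetic data (a stand-in, not the notebook's actual features):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one feature driving a right-skewed price
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 1))
price = np.expm1(4 + 2 * X[:, 0] + rng.normal(0, 0.3, 1000))

X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=0.2, random_state=42)

model_log = RandomForestRegressor(n_estimators=100, random_state=42)
model_log.fit(X_tr, np.log1p(y_tr))       # fit on the log scale

pred = np.expm1(model_log.predict(X_te))  # convert back to dollars
mae = mean_absolute_error(y_te, pred)
print(f"MAE in dollars: {mae:.2f}")
```

On the real data, the same pattern would reuse X2/y2 from the improved model above; whether it helps depends on how the dollar-scale errors redistribute after the back-transform.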

Interpreting the Improved Model¶

After adding borough and room type, the model performs noticeably better (test MAE drops from roughly 86 to 64). I also compare predicted vs actual prices to see whether the model systematically under- or over-predicts.

In [29]:
plt.figure(figsize=(6,6))
plt.scatter(y_test, pred2, alpha=0.2)
plt.title("Predicted vs Actual Price (capped)")
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.plot([0, 1000], [0, 1000])
plt.show()

7. Conclusions¶

Main takeaways from this analysis:

  • Price is strongly related to borough and room type.
  • Manhattan and entire-home listings tend to be the most expensive.
  • Neighborhoods vary substantially even within the same borough.
  • A Random Forest model improves substantially when borough and room type are included.

Even with a better model, predicting very expensive listings is still difficult because those prices are influenced by factors not captured here (e.g., exact address, amenities, seasonal demand).

8. Limitations and Future Work¶

Limitations:

  • Listed prices may not match booked prices.
  • The dataset does not include time/seasonality (which matters for NYC).
  • No amenities or text descriptions are included here.

Future work:

  • Predict occupancy or booking likelihood (classification task).
  • Add geographic features using latitude/longitude.
  • Use neighborhood-level aggregation and clustering.
  • Include amenities/text data if available.
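As a concrete example of the latitude/longitude idea, a hedged sketch of a distance-to-center feature using the haversine formula (Times Square coordinates assumed as the reference point):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# Assumed reference point: Times Square
TS_LAT, TS_LON = 40.7580, -73.9855

# Example: coordinates of the first listing in the dataset (Kensington, Brooklyn)
dist = haversine_km(40.64749, -73.97237, TS_LAT, TS_LON)
print(f"Distance to Times Square: {dist:.1f} km")
```

Because the function is written with numpy operations, it could be applied vectorized to the whole latitude/longitude columns to create a single numeric feature for the model.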

There are multiple directions in which this analysis could be extended in future work. One potential improvement would be incorporating additional data sources to enrich the feature space. External datasets, such as demographic, economic, or temporal information, could help capture broader contextual factors and improve model performance.

Another extension involves experimenting with alternative modeling approaches. Techniques such as ensemble methods, regularization strategies, or nonlinear models may better capture complex relationships within the data. Hyperparameter tuning and cross-validation could also be applied more extensively to optimize model performance and reduce overfitting.

Finally, deploying this analysis as an interactive dashboard or web application would increase its real-world utility. Allowing users to explore trends dynamically or input new data points could transform this tutorial into a practical decision-support tool. These extensions demonstrate how the current tutorial serves as a strong foundation for deeper and more impactful data science projects.

9. External Links¶

  • Airbnb NYC Open Data (Kaggle): https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data

  • Airbnb Pricing Economics Study (Harvard Business Review): https://hbr.org/2023/01/hotel-pricing-lessons-from-airbnb

    (Alternative academic study if you prefer a research paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2867871)

  • NYC Tourism & Visitor Statistics (NYC’s Official Tourism Reports): https://www.nycgo.com/tourism-visitor-data/

  • scikit-learn (sklearn) Documentation: https://scikit-learn.org/stable/documentation.html

  • seaborn Documentation: https://seaborn.pydata.org/

  • Pandas documentation: https://pandas.pydata.org/docs/

  • Matplotlib visualization guide: https://matplotlib.org/stable/tutorials/index.html