During a recession, home prices suffer. How do college towns compare to other communities? Let's find out. But first, a few definitions.
- A quarter is a specific three-month period: Q1 is January through March, Q2 is April through June, Q3 is July through September, and Q4 is October through December.
- A recession is defined as starting with two consecutive quarters of GDP decline and ending with two consecutive quarters of GDP growth (see the toy example just after this list).
- A recession bottom is the quarter within a recession that has the lowest GDP.
- A university town is a city that has a high percentage of college students relative to its total population.
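To make these definitions concrete, here is a tiny made-up example; the GDP numbers below are invented purely for illustration.
# illustrative only: six made-up quarters of GDP
toy_gdp = [100.0, 98.0, 96.0, 95.0, 97.0, 99.0]   # q1, q2, ..., q6
# q2 and q3 both decline relative to the prior quarter -> the recession starts at q2
# q5 and q6 both grow relative to the prior quarter    -> the recession ends at q6
# q4 has the lowest GDP (95.0)                         -> q4 is the recession bottom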
Hypothesis
The hypothesis is that mean housing prices in college towns are less affected by recessions. To test it, run a t-test comparing the price ratio (the mean house price the quarter before the recession starts divided by the mean price at the recession bottom) for university towns against the ratio for other towns. Values less than one show home price increases, and values greater than one show home price declines.
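As a minimal sketch of the planned test (the ratios below are made up, not from the real data), scipy's independent-samples t-test compares the two groups directly:
import numpy as np
from scipy.stats import ttest_ind

# hypothetical start-price / bottom-price ratios for a handful of towns
college_ratios = np.array([0.98, 1.01, 1.03, 0.99, 1.02])
other_ratios = np.array([1.05, 1.10, 1.07, 1.12, 1.06])
t_stat, p_value = ttest_ind(college_ratios, other_ratios)
# a small p_value suggests the mean ratios of the two groups differ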
Data
- From the Zillow research data site there is housing data for the United States. In particular, the data file for all homes at the city level, zillow_homes_prices_by_city.csv, has median home sale prices at a fine-grained level.
- From the Wikipedia page on college towns there is a list of college towns in the United States, saved as college_towns.txt.
- From the Bureau of Economic Analysis, US Department of Commerce, there is the GDP of the United States over time in current dollars (use the chained value in 2009 dollars), in quarterly intervals, in the file gdplev.xls. For this project, I only look at GDP data from the first quarter of 2000 onward. (A quick preview of each file follows this list.)
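Before any cleaning, a quick peek at each file helps confirm the layouts. This is only a sanity-check sketch; it assumes the files sit in a local data/ directory, just as the analysis code below does.
import pandas as pd

# first few lines of the Wikipedia-derived college towns list
with open('data/college_towns.txt') as fin:
    for line in fin.readlines()[:5]:
        print(line.rstrip())

# first few rows of the Zillow home-price file
print(pd.read_csv('data/zillow_homes_prices_by_city.csv', nrows=3))

# first few quarterly GDP rows; column positions are assumed to match the parsing used later
print(pd.read_excel('data/gdplev.xls', skiprows=7, usecols=[4, 5, 6],
                    names=['qtr', 'gdp_curr', 'gdp_2009']).head(3))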
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import ttest_ind
# Use fully spelled out place names
states = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada', 'WY': 'Wyoming', 'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland', 'AK': 'Alaska', 'UT': 'Utah', 'OR': 'Oregon', 'MT': 'Montana', 'IL': 'Illinois', 'TN': 'Tennessee', 'DC': 'District of Columbia', 'VT': 'Vermont', 'ID': 'Idaho', 'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii', 'WI': 'Wisconsin', 'MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 'AZ': 'Arizona', 'GU': 'Guam', 'MS': 'Mississippi', 'PR': 'Puerto Rico', 'NC': 'North Carolina', 'TX': 'Texas', 'SD': 'South Dakota', 'MP': 'Northern Mariana Islands', 'IA': 'Iowa', 'MO': 'Missouri', 'CT': 'Connecticut', 'WV': 'West Virginia', 'SC': 'South Carolina', 'LA': 'Louisiana', 'KS': 'Kansas', 'NY': 'New York', 'NE': 'Nebraska', 'OK': 'Oklahoma', 'FL': 'Florida', 'CA': 'California', 'CO': 'Colorado', 'PA': 'Pennsylvania', 'DE': 'Delaware', 'NM': 'New Mexico', 'RI': 'Rhode Island', 'MN': 'Minnesota', 'VI': 'Virgin Islands', 'NH': 'New Hampshire', 'MA': 'Massachusetts', 'GA': 'Georgia', 'ND': 'North Dakota', 'VA': 'Virginia'}
def get_list_of_university_towns():
    """Returns a DataFrame of towns and the states they are in from the
    college_towns.txt list. The format of the DataFrame should be:
    DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Ypsilanti"] ],
    columns=["State", "RegionName"] )
    Accomplish the following cleaning:
    1. For "State", remove characters from "[" to the end.
    2. For "RegionName", when applicable, remove every character from " (" to the end.
    3. Depending on how you read the data, you may need to remove the newline character '\n'. """
    towns_by_state = {}
    towns = []
    state = None
    with open('data/college_towns.txt') as fin:
        for line in fin:
            if '[edit]' in line:
                # a new state heading: store the towns collected for the previous state
                if towns:
                    towns_by_state[state] = towns
                state = line.split('[')[0].strip()
                towns = []
            elif (line.find(',') != -1) and (line.find('(') != -1) and (line.find(',') < line.find('(')):
                town = line.split(',')[0].strip()
                towns.append(town)
            elif line.find('(') != -1:
                town = line.split('(')[0].strip()
                towns.append(town)
            elif line.find(',') != -1:
                town = line.split(',')[0].strip()
                towns.append(town)
    # don't forget the towns collected for the final state
    if towns:
        towns_by_state[state] = towns
    state_town_tups = [(state, town) for state, towns in towns_by_state.items() for town in towns]
    # the file already spells out state names, so no abbreviation mapping is needed
    df = pd.DataFrame(data=state_town_tups, columns=['State', 'RegionName'])
    return df
get_list_of_university_towns().head()
 | State | RegionName |
---|---|---|
0 | North Dakota | Fargo |
1 | North Dakota | Grand Forks |
2 | Colorado | Alamosa |
3 | Colorado | Boulder |
4 | Colorado | Durango |
def gdp_data():
    """Return a dataframe of quarterly GDP in current dollars and in chained 2009 dollars."""
    df = pd.read_excel('data/gdplev.xls',
                       skiprows=7,
                       usecols=[4, 5, 6],
                       names=['qtr', 'gdp_curr', 'gdp_2009'])
    return df
gdp_data().head()
 | qtr | gdp_curr | gdp_2009 |
---|---|---|---|
0 | 1947q1 | 243.1 | 1934.5 |
1 | 1947q2 | 246.3 | 1932.3 |
2 | 1947q3 | 250.1 | 1930.3 |
3 | 1947q4 | 260.3 | 1960.7 |
4 | 1948q1 | 266.2 | 1989.5 |
# preferred approach: convert the quarter strings to pandas Period objects
def qtr_to_prd(df):
    """Return the provided dataframe with the quarterly date strings converted to
    pandas Period objects in the 'date' column, the year in the 'year' column,
    and the quarter number in the 'qtr' column."""
    df['date'] = pd.to_datetime(df.qtr).dt.to_period('q')
    df['year'] = df['date'].dt.year
    df['qtr'] = df['date'].dt.quarter
    return df
qtr_to_prd(gdp_data()).head()
 | qtr | gdp_curr | gdp_2009 | date | year |
---|---|---|---|---|---|
0 | 1 | 243.1 | 1934.5 | 1947Q1 | 1947 |
1 | 2 | 246.3 | 1932.3 | 1947Q2 | 1947 |
2 | 3 | 250.1 | 1930.3 | 1947Q3 | 1947 |
3 | 4 | 260.3 | 1960.7 | 1947Q4 | 1947 |
4 | 1 | 266.2 | 1989.5 | 1948Q1 | 1948 |
# an alternate approach using the pandas "str" accessor
def qrt_to_dt(df):
    """Return the provided dataframe keeping the original quarterly date strings in the
    'date' column and splitting them into an integer year in the 'year' column and
    the quarter number in the 'qtr' column."""
    df['date'] = df.qtr.copy()
    df[['year', 'qtr']] = df['qtr'].str.split('q', expand=True)
    df.year = df.year.astype(int)
    df.qtr = df.qtr.astype(int)
    return df
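As an optional consistency check (not part of the original analysis), the two conversion routines can be run side by side to confirm they derive the same year and quarter values:
# sanity check: the Period-based and string-based conversions should agree
prd_df = qtr_to_prd(gdp_data())
str_df = qrt_to_dt(gdp_data())
assert (prd_df['year'] == str_df['year']).all()
assert (prd_df['qtr'].astype(int) == str_df['qtr'].astype(int)).all()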
def gdp_change(df):
    """Return a Series containing the differences between an element
    and the value in the previous row of the gdp_2009 column."""
    return df.gdp_2009.diff()
def is_increasing(df):
    """Return a Series of floating point values containing 1.0 if the value
    in the 'delta' column is greater than 0, otherwise 0.0 is returned."""
    return (df['delta'] > 0).astype(float)
df = gdp_data()
df = qtr_to_prd(df).set_index('date')
df['delta'] = gdp_change(df)
df['increasing'] = is_increasing(df)
df.head()
date | qtr | gdp_curr | gdp_2009 | year | delta | increasing |
---|---|---|---|---|---|---|
1947Q1 | 1 | 243.1 | 1934.5 | 1947 | NaN | 0.0 |
1947Q2 | 2 | 246.3 | 1932.3 | 1947 | -2.2 | 0.0 |
1947Q3 | 3 | 250.1 | 1930.3 | 1947 | -2.0 | 0.0 |
1947Q4 | 4 | 260.3 | 1960.7 | 1947 | 30.4 | 1.0 |
1948Q1 | 1 | 266.2 | 1989.5 | 1948 | 28.8 | 1.0 |
def recession_prep(df):
    """Return the dataframe with columns added that support analysis of recessions.
    In particular, determine whether GDP is increasing or decreasing by using the shift
    method to create Series whose values have been shifted by the specified number of
    rows. Together the shifted columns allow the quarters in which a recession starts
    and ends to be identified."""
    df = qtr_to_prd(df).set_index('date')
    df['delta'] = gdp_change(df)
    df['increasing'] = is_increasing(df)
    df['prv1'] = df['increasing'].shift(1)   # look back one quarter
    df['prv2'] = df['increasing'].shift(2)   # look back two quarters
    df['curr'] = df['increasing']
    df['nxt1'] = df['increasing'].shift(-1)  # look ahead one quarter
    df['nxt2'] = df['increasing'].shift(-2)  # look ahead two quarters
    # a recession starts on the first of two consecutive declining quarters
    df['start'] = ((df['prv1'] == 1) & (df['curr'] == 0) & (df['nxt1'] == 0)).astype(float)
    # a recession ends on the second of two consecutive growing quarters
    df['end'] = ((df['prv2'] == 0) & (df['prv1'] == 1) & (df['curr'] == 1)).astype(float)
    return df
recession_prep(gdp_data()).loc['2007q3':'2010q4']
date | qtr | gdp_curr | gdp_2009 | year | delta | increasing | prv1 | prv2 | curr | nxt1 | nxt2 | start | end |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2007Q3 | 3 | 14569.7 | 14938.5 | 2007 | 99.8 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
2007Q4 | 4 | 14685.3 | 14991.8 | 2007 | 53.3 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2008Q1 | 1 | 14668.4 | 14889.5 | 2008 | -102.3 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
2008Q2 | 2 | 14813.0 | 14963.4 | 2008 | 73.9 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2008Q3 | 3 | 14843.0 | 14891.6 | 2008 | -71.8 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2008Q4 | 4 | 14549.9 | 14577.0 | 2008 | -314.6 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2009Q1 | 1 | 14383.9 | 14375.0 | 2009 | -202.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2009Q2 | 2 | 14340.4 | 14355.6 | 2009 | -19.4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
2009Q3 | 3 | 14384.1 | 14402.5 | 2009 | 46.9 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
2009Q4 | 4 | 14566.5 | 14541.9 | 2009 | 139.4 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
2010Q1 | 1 | 14681.1 | 14604.8 | 2010 | 62.9 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
2010Q2 | 2 | 14888.6 | 14745.9 | 2010 | 141.1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
2010Q3 | 3 | 15057.7 | 14845.5 | 2010 | 99.6 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
2010Q4 | 4 | 15230.2 | 14939.0 | 2010 | 93.5 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
def get_recessions(df):
    """Return a list of tuples containing the starting and ending quarters for recessions."""
    df = df.loc['2000q1':]
    recessions = []
    is_recession = False
    start, end = None, None
    for index, row in df.iterrows():
        if row['start'] == 1.0:
            start = index
            is_recession = True
        elif is_recession and row['end'] == 1.0:
            end = index
            is_recession = False
            recessions.append((start.strftime('%Yq%q'), end.strftime('%Yq%q')))
    return recessions
get_recessions(recession_prep(gdp_data()))
[('2008q3', '2009q4')]
def get_recession_start():
    """Returns the year and quarter of the recession start time as a
    string value in a format such as 2005q3."""
    # only one recession during this period, so no need to loop over (start, end) tuples
    start, end = get_recessions(recession_prep(gdp_data()))[0]
    return start
get_recession_start()
'2008q3'
def get_recession_end():
    """Returns the year and quarter of the recession end time as a
    string value in a format such as 2005q3."""
    # only one recession during this period, so no need to loop over (start, end) tuples
    start, end = get_recessions(recession_prep(gdp_data()))[0]
    return end
get_recession_end()
'2009q4'
def get_recession_bottom():
    """Returns the year and quarter of the recession bottom time as a
    string value in a format such as 2005q3."""
    df = recession_prep(gdp_data())
    start, end = get_recessions(df)[0]
    return df.loc[start:end, 'gdp_2009'].idxmin().strftime('%Yq%q')
get_recession_bottom()
'2009q2'
def convert_housing_data_to_quarters():
    """Converts the housing data to quarters and returns it as mean
    values in a dataframe. This dataframe should be a dataframe with
    columns for 2000q1 through 2016q3, and should have a multi-index
    in the shape of ["State", "RegionName"].
    """
    df = pd.read_csv('data/zillow_homes_prices_by_city.csv',
                     dtype={'RegionID': 'str',
                            'RegionName': 'str',
                            'State': 'str',
                            'Metro': 'str',
                            'CountyName': 'str',
                            'SizeRank': 'int'})
    # keep the identifying columns separately, with state abbreviations spelled out
    region_id_df = df[['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank']]
    region_id_df = region_id_df.assign(State=df.State.map(states))
    # reshape the monthly price columns into rows
    df = df.drop(['RegionName', 'State', 'Metro', 'CountyName', 'SizeRank'], axis=1)
    df = pd.melt(df,
                 id_vars=['RegionID'],
                 var_name='Date',
                 value_name='MedianSalesPrice')
    df = df.set_index(pd.to_datetime(df.Date)).sort_index()
    df = df.drop('Date', axis=1)
    df = df.loc['2000-01-01':]
    # average the monthly prices within each quarter for every region
    region_mean_df = df.reset_index().groupby([pd.Grouper(freq='Q', key='Date'), 'RegionID']).mean()
    joined_df = pd.merge(region_id_df, region_mean_df.reset_index(), how='inner', on='RegionID')
    # Note: Pivoting on State and RegionName without RegionID reduces the number of rows to 10592.
    # Only by pivoting on all three and later dropping RegionID can I get the correct number of rows (10730).
    # Finding this was truly painful.
    pivot_df = joined_df.pivot_table(index=['State', 'RegionID', 'RegionName'], columns='Date', values='MedianSalesPrice')
    pivot_df.columns = [c.lower() for c in pivot_df.columns.to_period('Q').format()]
    pivot_df.index = pivot_df.index.droplevel('RegionID')
    return pivot_df
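The row-count note inside the function can be checked directly. The sketch below (reusing the raw CSV) counts how many State/RegionName pairs are shared by more than one RegionID; the exact count depends on the Zillow snapshot used.
# count (State, RegionName) pairs that map to more than one RegionID,
# which is why pivoting without RegionID collapses rows
raw = pd.read_csv('data/zillow_homes_prices_by_city.csv',
                  usecols=['RegionID', 'RegionName', 'State'], dtype=str)
raw['State'] = raw['State'].map(states)
shared = raw.groupby(['State', 'RegionName'])['RegionID'].nunique()
print((shared > 1).sum(), 'State/RegionName pairs map to multiple RegionIDs')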
# save dataframe as csv
filepath = './data/effect_of_recession_on_median_home_prices.csv'
convert_housing_data_to_quarters().to_csv(filepath)
def run_ttest():
    '''First creates new data showing the decline or growth of housing prices
    between the recession start and the recession bottom. Then runs a ttest
    comparing the university town values to the non-university town values,
    returning whether the null hypothesis (that the two groups are the same)
    can be rejected, along with the test's p-value.
    Return the tuple (different, p, better) where different=True if the t-test is
    significant at p < 0.01 (we reject the null hypothesis), or different=False
    otherwise (we cannot reject the null hypothesis). The variable p should
    be equal to the exact p-value returned from scipy.stats.ttest_ind(). The
    value for better should be either "university town" or "non-university town"
    depending on which has a lower mean price ratio (which is equivalent to a
    reduced market loss).'''
    # group university and non-university towns
    univ_towns = get_list_of_university_towns()
    univ_towns['Location'] = univ_towns[['State', 'RegionName']].apply(tuple, axis=1)
    # get home sales data
    sales_df = convert_housing_data_to_quarters().reset_index()
    sales_df['Location'] = sales_df[['State', 'RegionName']].apply(tuple, axis=1)
    # calculate the ratio of prices from the start to the bottom of the recession;
    # smaller values indicate that homes retained their value during the recession
    sales_df['Price_Ratio'] = sales_df.loc[:, get_recession_start()] / sales_df.loc[:, get_recession_bottom()]
    sales_df = sales_df.loc[:, ['Location', 'Price_Ratio']]
    sales_df = sales_df.dropna()
    # home sales data series for university and other towns
    is_univ = sales_df.Location.isin(univ_towns.Location)
    univ_sales_df = sales_df.loc[is_univ]
    other_sales_df = sales_df.loc[~is_univ]
    # compare the mean price ratios of the two groups
    _, p_value = ttest_ind(other_sales_df.Price_Ratio.values, univ_sales_df.Price_Ratio.values)
    # interpret the results
    if p_value < 0.01:
        # reject the null hypothesis --> the population means are different
        different = True
    else:
        # fail to reject the null hypothesis --> no evidence the means differ
        different = False
    # the "better" group is the one with the lower mean price ratio (the smaller loss)
    if univ_sales_df.Price_Ratio.mean() < other_sales_df.Price_Ratio.mean():
        better = 'university town'
    else:
        better = 'non-university town'
    # return results
    return different, p_value, better
run_ttest()
(True, 0.0050464756985043489, 'university town')
Results
Median home prices in college towns declined less than those in other towns during the recession that started in late 2008. In this data set, betting on college towns appears to be a winner: the difference between the two groups is statistically significant at the p < 0.01 level (p ≈ 0.005). Let's dive a little deeper into the data and see what it shows.
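To see the size of the effect behind that p-value, the group means can be pulled out directly; this sketch simply repeats the grouping done in run_ttest():
# mean price ratio (recession start / recession bottom) for each group
univ_towns = get_list_of_university_towns()
univ_towns['Location'] = univ_towns[['State', 'RegionName']].apply(tuple, axis=1)
sales_df = convert_housing_data_to_quarters().reset_index()
sales_df['Location'] = sales_df[['State', 'RegionName']].apply(tuple, axis=1)
sales_df['Price_Ratio'] = sales_df[get_recession_start()] / sales_df[get_recession_bottom()]
sales_df = sales_df.dropna(subset=['Price_Ratio'])
is_univ = sales_df.Location.isin(univ_towns.Location)
print('college towns:    ', sales_df.loc[is_univ, 'Price_Ratio'].mean())
print('non-college towns:', sales_df.loc[~is_univ, 'Price_Ratio'].mean())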
def create_plotting_frame():
    """Return a dataframe of price ratios labeled by state and by town type
    (college vs. regular), ready for plotting."""
    # group university and non-university towns
    univ_towns = get_list_of_university_towns()
    univ_towns['Location'] = univ_towns[['State', 'RegionName']].apply(tuple, axis=1)
    # get home sales data
    sales_df = convert_housing_data_to_quarters().reset_index()
    sales_df['Location'] = sales_df[['State', 'RegionName']].apply(tuple, axis=1)
    # calculate a more intuitive price ratio where values above one represent price gains
    sales_df['Intuitive_Price_Ratio'] = sales_df.loc[:, get_recession_bottom()] / sales_df.loc[:, get_recession_start()]
    sales_df = sales_df.loc[:, ['Location', 'State', 'Intuitive_Price_Ratio']]
    sales_df = sales_df.dropna()
    # identify and label towns as either college or regular
    sales_df = sales_df.assign(Type=sales_df.Location.isin(univ_towns.Location))
    sales_df = sales_df.assign(Town=sales_df.Type.map({True: 'College', False: 'Regular'}))
    # replace long state names with two-letter abbreviations for plotting
    reverse_state_lookup = {long: short for short, long in states.items()}
    sales_df = sales_df.assign(State=sales_df.State.apply(reverse_state_lookup.get))
    return sales_df
plot_df = create_plotting_frame()
Comparing College and Regular Towns
A price ratio greater than one indicates that median home prices increased during the recession, and a value less than one indicates a decrease. The ratio is formed by dividing the median price at the recession bottom by the median price at the recession start. Looking closely at the plot reveals that college towns have a subtly higher median value (indicated by the line in the middle of the box) than other towns. The median is unaffected by the outliers (shown as dots), which comprise less than 1% of all towns. Home prices in college towns were also more stable, showing less variance.
sns.set(rc={'figure.figsize':(4, 10)})
sns.boxplot(x='Town', y='Intuitive_Price_Ratio', hue='Town', data=plot_df)
sns.despine(offset=10, trim=True)
State-Level Median Home Prices
Let's see how values compare by state, with New Mexico standing out as one of the strongest housing markets during the recession. The violin plot shows the distribution of home price ratios by town type for each state.
sns.set(rc={'figure.figsize':(20, 6)})
sns.violinplot(x='State', y='Intuitive_Price_Ratio', hue='Town', data=plot_df, split=True)
Best and Worst State Housing Markets
How did home prices change at the state level? Let's find the best and worst states by sorting the median price ratio and extracting the states with the best- and worst-performing housing markets.
- Best: New Mexico
- Worst: Nevada
# states with the best housing markets
sorted_states_df = plot_df.groupby(by='State').agg({'Intuitive_Price_Ratio': 'median'}).sort_values(by='Intuitive_Price_Ratio', ascending=False)
sorted_states_df.head(10)
State | Intuitive_Price_Ratio |
---|---|
NM | 1.147781 |
MS | 1.108406 |
MT | 1.090499 |
WY | 1.055361 |
LA | 1.047571 |
KS | 1.036582 |
VT | 1.036013 |
WV | 1.027663 |
OK | 1.015508 |
SC | 1.015338 |
# states with the worst housing market
sorted_states_df.tail(10)
State | Intuitive_Price_Ratio |
---|---|
RI | 0.939939 |
WA | 0.931055 |
AK | 0.926371 |
ID | 0.924646 |
AZ | 0.921471 |
HI | 0.915682 |
MI | 0.914556 |
CA | 0.898245 |
FL | 0.880623 |
NV | 0.837595 |
Best and Worst College Towns Housing Markets
Silver City, New Mexico, the home of Western New Mexico University, had a median home price increase of 33%, the best of the college towns analyzed. The 27% decline in Riverside, California, was the worst. Hopefully, both have fully recovered.
# best and worst college towns
sorted_college_df = plot_df[plot_df.Town.str.contains('College')].sort_values(by='Intuitive_Price_Ratio', ascending=False)
top_10 = sorted_college_df.head(10)
top_10
 | Location | State | Intuitive_Price_Ratio | Type | Town |
---|---|---|---|---|---|
5840 | (New Mexico, Silver City) | NM | 1.331176 | True | College |
9705 | (Texas, Stephenville) | TX | 1.215334 | True | College |
6991 | (North Dakota, Grand Forks) | ND | 1.211294 | True | College |
5013 | (Montana, Bozeman) | MT | 1.142552 | True | College |
8616 | (Pennsylvania, California) | PA | 1.122962 | True | College |
9693 | (Texas, Lubbock) | TX | 1.120328 | True | College |
5020 | (Montana, Missoula) | MT | 1.101759 | True | College |
2987 | (Indiana, West Lafayette) | IN | 1.099918 | True | College |
9819 | (Utah, Salt Lake City) | UT | 1.081056 | True | College |
9852 | (Vermont, Burlington) | VT | 1.077434 | True | College |
bottom_10 = sorted_college_df.tail(10)
bottom_10
 | Location | State | Intuitive_Price_Ratio | Type | Town |
---|---|---|---|---|---|
523 | (California, Sacramento) | CA | 0.856530 | True | College |
1558 | (Florida, Orlando) | FL | 0.855448 | True | College |
2826 | (Indiana, South Bend) | IN | 0.846934 | True | College |
4391 | (Michigan, Flint) | MI | 0.828449 | True | College |
576 | (California, Merced) | CA | 0.825599 | True | College |
963 | (California, Turlock) | CA | 0.818946 | True | College |
6802 | (North Carolina, Cullowhee) | NC | 0.806439 | True | College |
5045 | (Nevada, Las Vegas) | NV | 0.770192 | True | College |
518 | (California, Pomona) | CA | 0.731259 | True | College |
862 | (California, Riverside) | CA | 0.726472 | True | College |