Clustering and Predicting Housing Price Trends at the Neighborhood Level, NYC
Authors: Jianwen Du, Hanbo Lei
1) Background
As Mayor Bill de Blasio said, New York City wants to remain a safe, welcoming, and exciting place that attracts a broad mix of people (Mayor Bill De Blasio, 2019). Housing price is an important feature of the city: it affects everyone's life and the city's attractiveness. A sharp movement in housing prices in either direction is harmful: a sharp rise increases the rent burden, while a sharp drop damages economic development.
Therefore, it is important for the city government to know the city's housing price trends and to identify their patterns, so that it can design place-based policies that guide housing prices and economic development. To that end, we build a model, based mainly on machine learning methods, to help NYC officials understand and predict housing price trends at the neighborhood level.
2) Literature review
The housing market's strong impact on cities has been documented by many researchers. On the one hand, when housing prices rise sharply, the city-level rent burden rate also rises: more people spend more than 30% of their income on housing, which significantly reduces the city's attractiveness. Gabriel and Painter state that elevated rent burdens are associated with a myriad of adverse household outcomes including residential crowding, long commutes, low levels of family expenditure on health care and other vital family needs, and problems of child well-being and development (Gabriel 2020). On the other hand, when housing prices drop sharply, not only do property owners lose value, but economic development may also stall. Christiansen et al. state that large negative house price co-movements contribute above and beyond traditional recession predictors (Christiansen 2019).
In this paper, we build a model that predicts housing price trends rather than housing prices directly. This is inspired by the work of Zhang et al., whose method uses an unsupervised heuristic algorithm to classify stocks into four main classes (Up, Down, Flat, and Unknown) and then predicts the probability of forming a predefined shape in a fixed duration when only some of the data from the early trading days are observed (Zhang 2018). They report that this kind of trend prediction model is robust to market volatility, a setting in which traditional price prediction usually does not work well. Following this idea, we classify housing price trends into four classes, such as up, down, flat, and other.
Regarding predictor variables, much research has predicted the price of individual housing units with micro-level data. Kang et al. note that existing models, such as the hedonic pricing model proposed by Rosen (1974), typically take only structural attributes and locational amenities into consideration, and that later researchers added built-environment data, such as the physical appearance of the house and its physical surroundings (Kang 2020). For city-level research, however, the unit of analysis is the housing market of a neighborhood rather than an individual housing unit. Thus, we focus on neighborhood-level data, such as median household income and the number of new residential units in the neighborhood, to make the predictions.
3) Research Questions
In this project, we want to focus on these two research questions:
1. Clustering housing price trend patterns in NYC
How have housing prices changed in each neighborhood in recent years? Are there patterns of housing price trends in NYC?
2. Predicting housing price change with neighborhood characteristics
For each housing price trend pattern, do the neighborhoods share common characteristics? Can changes in these neighborhood characteristics over time explain the different housing price trends?
4) Methodology
This project consists of two parts of analysis, each using different methods:
1. Clustering
- Data Collection, Cleaning, and Munging
- Statistical Analysis
- KMeans Clustering/DBSCAN/Agglomerative Clustering/Gaussian Mixture Clustering
- Visualization
2. Predicting
- Data Collection, Cleaning, and Munging
- Statistical Analysis
- Random Forest / Decision Tree / Support Vector Machine / Gaussian Naive Bayes
- 5-fold Cross-Validation
- Visualization
5) Data Source and Description
In this paper, we define a neighborhood as a zip code district in New York City. In the first part of the analysis, we compute the average housing price per square foot at the zip code level from the Property Rolling Sales Data for 2015 to 2019. In the second part, we use the clustering result of the housing price trend as the dependent variable of our prediction model. For the independent variables, we focus on the major neighborhood characteristics related to housing prices and use five years of data for each indicator. These data come from NYC Open Data or the Census Bureau. Specifically, to account for the delayed impact of neighborhood characteristics on housing prices, we use the data from 2011 to 2015 to train and test the prediction model. We then use the latest data, from 2015 to 2019, to predict the housing price trend from 2019 to 2023. Detailed data sources are listed below.
Property Rolling Sales Data
Annualized Rolling Sales Update, 2015–2019, Department of Finance, https://www1.nyc.gov/site/finance/taxes/property-annualized-sales-update.page
Median Household Income
Median Household Income, 2011–2019, American Community Survey 5 Year Estimates, https://data.census.gov/cedsci/table?t=Income%20and%20Poverty&tid=ACSST5Y2019.S1903
Housing Supply
MapPLUTO, 2021, New York City Department of City Planning, https://www1.nyc.gov/site/planning/data-maps/open-data.page
Total Population
Total Population, 2011–2019, American Community Survey 5 Year Estimates, https://data.census.gov/cedsci/table?t=Populations%20and%20People&tid=ACSST5Y2019.S0101&hidePreview=false
Number of 311 Complaints
311 Service Requests, 2011–2019, Department of Information Technology and Telecommunications, https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
Poverty Status
Poverty Rate, 2011–2019, American Community Survey 5 Year Estimates, https://data.census.gov/cedsci/table?t=Poverty&tid=ACSST5Y2019.S1701&hidePreview=false
Unemployment Status
Employment Status, 2014–2018, American Community Survey 5 Year Estimates, https://data.census.gov/cedsci/table?q=employment&tid=ACSST5Y2019.S2301&hidePreview=false
Homeowner Status
Occupied Housing Units by Tenure, 2011–2019, American Community Survey 5 Year Estimates, https://data.census.gov/cedsci/table?q=Occupied%20housing%20units&g=8600000US10001&tid=ACSDT5Y2019.B25008&hidePreview=false
Percentage of Households with No Child
Percentage of Households with No Child under 18, 2011–2019, American Community Survey 5 Year Estimates, https://data.census.gov/cedsci/table?t=Income%20and%20Poverty&g=8600000US10001&tid=ACSST5Y2019.S1702&hidePreview=false
6) Analysis and Results
1. Clustering
In this section, we perform a cluster analysis of five-year housing price trends at the zip code level in NYC. The data are the annual housing sales from 2015 to 2019 published by the Department of Finance. We aggregated the individual sale records to the zip code level and calculated the average housing price per square foot in each zip code for each of the five years. After data cleaning and munging, we conducted the detailed analysis below.
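The aggregation step described above could look roughly like the following sketch. The column names (`zip_code`, `sale_year`, `sale_price`, `gross_square_feet`) are hypothetical stand-ins rather than the actual Rolling Sales headers, the rows are invented, and both the cleaning rule and the choice to average per-sale ratios are assumptions made for illustration.

```python
import pandas as pd

# Toy sales records; column names are hypothetical stand-ins for the
# Rolling Sales fields, and the values are invented for illustration.
sales = pd.DataFrame({
    "zip_code": ["10001", "10001", "10002", "10002", "10002"],
    "sale_year": [2015, 2016, 2015, 2015, 2016],
    "sale_price": [1_000_000, 1_200_000, 500_000, 0, 660_000],
    "gross_square_feet": [1000, 1000, 1000, 800, 1100],
})

def price_per_sqft_by_zip(df: pd.DataFrame) -> pd.DataFrame:
    """Return a zip-code x year table of mean sale price per square foot."""
    # Assumed cleaning rule: drop records with a non-positive price or area.
    valid = df[(df["sale_price"] > 0) & (df["gross_square_feet"] > 0)].copy()
    # Assumed definition: average the per-sale price/area ratios.
    valid["price_per_sqft"] = valid["sale_price"] / valid["gross_square_feet"]
    # One row per zip code, one column per sale year.
    return valid.pivot_table(
        index="zip_code", columns="sale_year",
        values="price_per_sqft", aggfunc="mean",
    )

trend_table = price_per_sqft_by_zip(sales)
print(trend_table)
```

Each row of the resulting table is one zip code's price series, which is the shape of input that the clustering models expect.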
a) Data description and pre-visualization
As shown in the descriptive statistics table, a total of 170 zip codes are included in the analysis. From 2015 to 2019, the 25th percentile and the median of housing price per square foot in NYC keep growing, which indicates an overall increasing trend in housing prices. On the other hand, the maximum and the 75th percentile across all zip codes increase from 2015 to 2017 but decrease afterwards, which indicates a decrease in high-priced housing in recent years.
Regarding the geographic distribution of housing prices, the maps show that the housing market in NYC can be divided into five tiers, from the highest housing prices to the lowest (shown as five colors). Over these five years, the boundaries of the tiers changed only slightly, and the overall housing price pattern in NYC remained very similar. The first tier is located in Midtown and Downtown Manhattan and the north of Brooklyn. The second tier is located in the southwest of Brooklyn and the northeast and northwest corners of Queens. The third tier is mainly located in the northeast, south, and north of Brooklyn. The fourth tier is located in Staten Island and the southwest of Queens. The lowest-priced tier is located in the Bronx. Overall, the ranking of the housing markets of the five boroughs is Manhattan > Brooklyn > Queens > Staten Island > Bronx.
Clustering by using four different models
For clustering, this paper uses four different models: KMeans Clustering, DBSCAN, Agglomerative Clustering, and Gaussian Mixture Clustering. The detailed outcomes are described below.
- KMeans Clustering
Using the silhouette method, we find that the optimal number of clusters is two. With this setting, we obtain the outcome shown in the graph. 115 out of 170 zip codes are clustered into Group One, shown as the blue line. The median values of the two groups show upward and downward housing price trends over these five years, and a large portion of the neighborhoods in NYC have an upward trend in housing prices.
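A minimal sketch of choosing the number of clusters with the silhouette score is given here. The data are synthetic (invented rising and falling series), and the row-wise z-scoring is an assumption made so that clusters reflect trend shape rather than price level; the text does not state how the series were scaled.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
years = np.arange(5)  # stands in for 2015..2019

# Synthetic trends (invented data): 60 rising and 40 falling series.
up = 1.0 * years + rng.normal(0, 0.3, size=(60, 5))
down = -1.0 * years + rng.normal(0, 0.3, size=(40, 5))
trends = np.vstack([up, down])

# Assumption: z-score each row so clusters capture shape, not price level.
row_mean = trends.mean(axis=1, keepdims=True)
row_std = trends.std(axis=1, keepdims=True)
scaled = (trends - row_mean) / row_std

# Fit KMeans for several candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

# The k with the highest silhouette score is taken as the optimal one.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```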
- Agglomerative Clustering
By using the dendrogram, we find that the optimal number of clusters is still two. The outcome is similar to that of KMeans Clustering: 105 of 170 neighborhoods are categorized into Group One, which has an increasing trend in housing prices.
- DBSCAN
By using the nearest neighbor distances, we find that the optimal value of the eps parameter is 0.6, with a minimum number of samples of 3. As a result, we obtain seven groups after clustering, and 65 out of 170 zip codes are treated as outliers. The largest group has 83 zip codes, whose median housing price keeps growing in these years. The other 22 zip codes are clustered into six groups with more complicated housing price trends. Group Six decreases for one year and then keeps growing from 2016 to 2019. Group Four also decreases for one year and then remains stable in recent years. Group Two has two years of slight decrease followed by a rapid increase in housing prices. Groups Three and Five have more unstable fluctuations in housing prices.
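The eps and minimum-samples values discussed above correspond to the `eps` and `min_samples` parameters of scikit-learn's `DBSCAN`. The sketch below shows one common way to inspect nearest neighbor distances before running the clustering. The data are synthetic and invented for illustration, so the resulting group counts are unrelated to those reported in the text.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# Synthetic stand-in for scaled five-year trends (invented data):
# two dense groups plus ten widely scattered points.
dense_a = rng.normal(0.0, 0.1, size=(40, 5))
dense_b = rng.normal(3.0, 0.1, size=(40, 5))
scattered = rng.uniform(-10.0, 10.0, size=(10, 5))
X = np.vstack([dense_a, dense_b, scattered])

min_samples = 3

# k-distance curve: distance from each point to its min_samples-th nearest
# neighbor, counting the point itself (DBSCAN counts it too). Sorting the
# values and looking for the "elbow" is one common way to pick eps.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# eps=0.6 and min_samples=3 are the values reported in the text.
labels = DBSCAN(eps=0.6, min_samples=min_samples).fit_predict(X)

n_clusters = len(set(labels) - {-1})    # label -1 marks outliers (noise)
n_outliers = int(np.sum(labels == -1))
print(n_clusters, n_outliers)
```

Unlike KMeans, DBSCAN does not take the number of clusters as input, which is why the number of groups and outliers follows from the chosen eps and minimum samples.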
- Gaussian Mixture Clustering
After using the other three methods, we found that the number of resulting clusters is either too large or too small, which makes it hard to capture the overall trends of housing prices. Therefore, when using this method, we set four groups of trends based on the literature (Zhang 2018) to cluster the housing price trends in NYC, which helps us better understand the housing price trend pattern. As shown in figure 7, Group Zero, with 79 neighborhoods that used to have lower housing prices, has a strongly increasing trend in housing prices. Group One, with 41 neighborhoods, has a slight increase in the first year and then trends downward, interrupted by a one-year slight increase. Overall, its housing price shows normal fluctuations and a slight decrease. Group Two, with 25 neighborhoods, has very stable housing prices in the first three years, followed by a sudden increase and a return to the normal level. These neighborhoods used to have below-average housing prices. Group Three, with 25 neighborhoods, used to have higher housing prices in NYC. Its housing price drops sharply in the first year and then keeps growing, but it is still lower than in the beginning year.
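Fixing the number of groups at four and summarizing each group by its median trend can be sketched as follows. The four trend shapes and all values are invented, and the `GaussianMixture` settings other than `n_components=4` are assumptions, since the text does not report them.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
years = np.arange(5)  # stands in for 2015..2019

# Four invented trend shapes, loosely "up", "down", "flat", and "other".
shapes = {
    "up": 1.0 * years,
    "down": -1.0 * years,
    "flat": np.zeros(5),
    "other": np.array([0.0, 0.0, 0.0, 3.0, 0.0]),
}
X = np.vstack([
    shape + rng.normal(0, 0.2, size=(30, 5)) for shape in shapes.values()
])

# n_components=4 follows the text; the remaining settings are assumptions.
gmm = GaussianMixture(n_components=4, n_init=5, random_state=0)
labels = gmm.fit_predict(X)

# Group sizes and the median trend of each group, as used for interpretation.
group_sizes = np.bincount(labels, minlength=4)
median_trends = np.array(
    [np.median(X[labels == g], axis=0) for g in range(4)]
)
print(group_sizes)
print(median_trends.round(2))
```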
Visualization and analysis
After comparing the outcomes of these four methods, we choose the result of Gaussian Mixture Clustering with four groups, because it better generalizes the overall pattern of housing price trends in NYC and is easy to understand. As the choropleth map shows, Group Zero, with strongly increasing housing prices, is mainly located in the Bronx, Brooklyn, Queens, and Staten Island. These areas are mostly outside the first tier of the housing market, with housing prices per square foot ranging from $100 to $1500. Group One, with an overall slight downtrend, is mainly located in Midtown and Downtown Manhattan and the south of Brooklyn. Most of these neighborhoods belong to the first-tier housing market in NYC. These results imply a slight downtrend in the high-priced housing market and an uptrend in the medium- and low-priced housing markets. As a result, housing prices in Staten Island, the Bronx, and Queens show an overall increasing trend in these five years.
2. Predicting
In this section, we compare four classifier models, covering both linear and nonlinear methods, to build our prediction model: Random Forest (RF), Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), and Decision Tree (DT). We use 5-fold cross-validation to reduce bias and stabilize the results, and we choose the best model based on the mean and standard deviation of the accuracy. The data used for the prediction come from NYC Open Data and the Census Bureau and cover 2011 to 2019. We first use the following eight independent variables from 2011 to 2015 to build the model. We then use this model to predict the housing price trend from 2019 to 2023 based on the neighborhood data from 2015 to 2019.
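One possible way to arrange the yearly indicators so that a model trained on the 2011 to 2015 window can later be applied to the 2015 to 2019 window is sketched here. The long-format layout, the column names, and the use of relative year offsets (`t0`, `t1`, ...) in the feature names are all assumptions for illustration, not a description of the authors' pipeline.

```python
import pandas as pd

# Toy long-format table (invented values); column names are hypothetical.
long_df = pd.DataFrame({
    "zip_code": ["10001"] * 4 + ["10002"] * 4,
    "year": [2011, 2012] * 4,
    "variable": ["income", "income", "population", "population"] * 2,
    "value": [50, 52, 1000, 1010, 40, 41, 2000, 1990],
})

def to_feature_matrix(df: pd.DataFrame, first_year: int, last_year: int) -> pd.DataFrame:
    """Pivot yearly indicators into one row per zip code for a given window."""
    window = df[df["year"].between(first_year, last_year)].copy()
    # Relative offsets keep column names identical across windows, so the
    # training window and the prediction window produce matching features.
    window["offset"] = window["year"] - first_year
    wide = window.pivot_table(
        index="zip_code", columns=["variable", "offset"], values="value"
    )
    wide.columns = [f"{var}_t{off}" for var, off in wide.columns]
    return wide

X_train = to_feature_matrix(long_df, 2011, 2015)
print(X_train)
```

Calling the same function with `(2015, 2019)` on the full data would yield the prediction-window features with the same column names.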
Data Description
For the neighborhood characteristics, this paper uses the number of 311 calls, homeowner rate, median household income (MHI), total population, poverty rate, unemployment rate, number of new residential units, and rate of households with no child as independent variables. We grouped the zip codes by the clustering result from the first part of the analysis and computed the median values of these variables from 2011 to 2015. As shown in figure 9, the median number of 311 calls in Group Two districts increases and then decreases, while in the other groups it increases after a drop in 2011. The median population of Group One districts decreases but remains larger than that of the other groups, whose median populations are increasing. The median number of new residential units increases in all groups, and it increases sharply in Group One and Group Three districts. The median MHI increases in all groups, and the median MHI of Group Three districts is significantly higher than that of the others. The table of all median values of the independent variables is attached in the appendix.
For further analysis, this paper standardizes the independent variables so that each scaled variable has a mean of 0 and a standard deviation of 1.
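Standardization replaces each value x with z = (x - mean) / std, computed per variable. A small sketch with scikit-learn's `StandardScaler` follows; the numbers are invented, and reusing the training-window statistics for the later window is one common choice assumed here, not something the text specifies.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented values: columns are (median income, population) for three zip codes.
X_train = np.array([
    [50_000.0, 1000.0],
    [60_000.0, 3000.0],
    [70_000.0, 2000.0],
])
# Invented later-window values for one zip code.
X_future = np.array([[65_000.0, 2500.0]])

# Learn each column's mean and standard deviation from the training window.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Assumption: apply the same training statistics to the later window.
X_future_scaled = scaler.transform(X_future)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_future_scaled)
```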
Methods Comparison
To identify the best-performing model, we calculate the mean accuracy and its standard deviation for the best configuration of each model using 5-fold cross-validation. As shown in table 2, the SVM and RF models perform better than the other two models. In terms of accuracy, SVM with a linear kernel and RF outperform SVM with an RBF kernel. In addition, the RF model has a lower standard deviation than SVM with a linear kernel. Considering both accuracy and standard deviation, this paper uses the RF model for further analysis.
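The comparison described above can be sketched with `cross_val_score`. The data below are synthetic (generated with `make_classification`), the hyperparameters are library defaults rather than the tuned values behind table 2, and the stratified, shuffled folds are an assumption, so the printed numbers are not comparable to the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 170 samples, 40 features, 4 classes (invented data).
X, y = make_classification(
    n_samples=170, n_features=40, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Default hyperparameters are an assumption; the paper's settings are unknown.
models = {
    "RF": RandomForestClassifier(random_state=0),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (RBF)": SVC(kernel="rbf"),
    "GNB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Assumption: stratified, shuffled folds so each fold sees all four classes.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    # Scaling inside the pipeline is refit on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())

for name, (mean_acc, std_acc) in results.items():
    print(f"{name:13s} accuracy = {mean_acc:.3f} +/- {std_acc:.3f}")
```

Placing the scaler inside the pipeline keeps the held-out fold from influencing the scaling statistics.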
Visualization and Analysis
Based on the RF model, this paper builds the prediction model. As shown in figure 10, the importance of each variable changes over these five years. Overall, the variables in 2011 are more important than those in later years, which is consistent with our hypothesis that neighborhood characteristics have a delayed impact on housing prices. Among the 8 independent variables, the most influential in the prediction model are median household income, total population, number of newly built residential units, and number of 311 calls. However, the highest importance among all features is only about 6%, which indicates that predictive power is spread thinly across the features and that no single variable strongly predicts the housing price trend.
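Feature importances of a fitted random forest can be read from `feature_importances_` and summed by year or by variable, as in the sketch below. The data are random noise, the labels are invented, and the variable-by-year column naming is a hypothetical layout (8 variables times 5 years), so the printed importances carry no meaning beyond illustrating the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

# Hypothetical feature names: one column per variable and year.
variables = [
    "calls_311", "homeowner_rate", "median_income", "population",
    "poverty_rate", "unemployment_rate", "new_units", "no_child_rate",
]
columns = [f"{v}_{y}" for v in variables for y in range(2011, 2016)]

# Random stand-in data (invented): 170 zip codes, 4 trend classes.
X = pd.DataFrame(rng.normal(size=(170, len(columns))), columns=columns)
y = rng.integers(0, 4, size=170)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importance = pd.Series(rf.feature_importances_, index=columns)

# Aggregate by the year suffix and by the variable prefix of each name.
by_year = importance.groupby(lambda c: c.rsplit("_", 1)[1]).sum()
by_variable = importance.groupby(lambda c: c.rsplit("_", 1)[0]).sum()

print(by_year.round(3))
print(by_variable.sort_values(ascending=False).round(3))
```

Because these importances always sum to 1, with 40 features an even split would give each feature 2.5%, which is a useful reference point when reading individual values.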
We then use this model to predict the future housing price trend based on neighborhood characteristics from 2015 to 2019. The choropleth map shows mainly two categories, Group Zero and Group One, in the predicted outcome. A large portion of zip codes is predicted as Group Zero. The neighborhoods classified into Group One are mainly located in Downtown Manhattan, the northwest of Queens, and the northwest of Brooklyn. Many of these neighborhoods belong to the high-priced housing market and had slight downtrends in the previous years. The prediction indicates that their housing prices will fluctuate over the next several years and end slightly lower. In summary, our prediction indicates that the overall housing price in NYC will keep increasing in the future, while prices in the high-priced housing market will be more stable and may even decrease.
7) Implications and Limitations
a) Implications
In this paper, based on past housing prices in NYC, we identify the overall housing market pattern across the five boroughs. In addition, by clustering housing price trends from 2015 to 2019, this paper finds that the high-priced housing market fluctuated and slightly decreased in these years. In contrast, the medium- and low-priced housing markets grew strongly, especially the low-priced housing market in the Bronx, Queens, Brooklyn, and Staten Island. This may result from affordable housing policies or rent-related policies that restrain the growth of the high-priced housing market. Furthermore, our prediction model suggests that some neighborhoods in the high-priced housing market will continue their current trends and are likely to see a slight decrease in housing prices. Overall, however, housing prices in many neighborhoods are predicted to keep increasing. Housing officials should take note of this result when addressing affordable housing problems in these areas.
b) Limitations and Next Steps
This paper has the following limitations.
Low Accuracy: The accuracy of this model is 0.479, which means that more than half of the predictions are likely to be wrong. This is still relatively high compared with the 0.25 expected from uniform random guessing among the 4 housing price trend classes defined in this paper, but the prediction results should serve only as a reference for policymakers.
Period of Indicators: To account for the delayed impact of neighborhood characteristics on housing prices, we use the values of these variables from four years earlier to train the model and predict the housing price trend. The lag is constrained mainly by data availability: the American Community Survey data we use only cover 2011 to 2019. This paper therefore builds the model on the assumption that these variables have short-term impacts on housing prices.
Variable Selection: In this paper, we consider only neighborhood-level variables in predicting the housing price trend. However, the housing market is very local, and project-level characteristics also play an important role in determining housing prices. We also do not include any macro-level data, such as GDP, in this model. Moreover, due to data limitations, our neighborhood-level data lack some important indicators related to public health and amenities. These omissions affect the performance of the model.
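To put the accuracy of 0.479 noted under Low Accuracy in context, it can be compared with two simple baselines. The sketch below uses the four Gaussian Mixture group sizes reported in the clustering section and assumes that the labels used to train the classifier have the same distribution over those 170 zip codes, which the text does not confirm.

```python
# Group sizes reported for the four Gaussian Mixture groups (Zero to Three).
group_sizes = [79, 41, 25, 25]
model_accuracy = 0.479  # accuracy reported for the model

total = sum(group_sizes)

# Baseline 1: guess one of the four classes uniformly at random.
uniform_baseline = 1 / len(group_sizes)

# Baseline 2: always predict the largest class.
# Assumption: training labels follow the reported group sizes.
majority_baseline = max(group_sizes) / total

print(f"uniform guess:  {uniform_baseline:.3f}")
print(f"majority class: {majority_baseline:.3f}")
print(f"model:          {model_accuracy:.3f}")
```

Under that assumption, the model's margin over always predicting the largest group is much narrower than its margin over uniform guessing, which supports treating the predictions as a reference only.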
Given these limitations, our next steps will focus on selecting more comprehensive variables and adding more models, especially nonlinear models, to the model comparison. If more data become available, we will also consider longer-term effects of the variables on housing prices and use earlier data to make the prediction. In addition, we should try to connect the current and future housing price patterns with place-based real estate policies in NYC. All of these steps can help us build better predictive models that capture future housing price trends in NYC and understand the underlying reasons for them.
Citation
Mayor Bill De Blasio (2019). OneNYC 2050: Building a Strong and Fair City. New York City. http://onenyc.cityofnewyork.us/
Stuart Gabriel, Gary Painter, 2020, Why affordability matters, Regional Science and Urban Economics, Volume 80
Charlotte Christiansen, Jonas N. Eriksen, Stig V. Møller, 2019, Negative house price co-movements and US recessions, Regional Science and Urban Economics, Volume 77, Pages 382–394
Yuhao Kang, Fan Zhang, et al., 2020, Understanding house price appreciation using multi-source big geo-data and machine learning, Land Use Policy, online 21 July 2020
Jing Zhang, Shicheng Cui, et al., 2018, A novel data-driven stock price trend prediction system, Expert Systems with Applications, Volume 97, Pages 60–69