1. Introduction

The real estate market can be intimidating for anyone looking for a new home. For first time home buyers and experienced investors alike, news articles are citing that we’re living through the hottest market in decades 1.

For many people, buying a house is one of the most important decisions they’ll make in their lives 2. However, when is the right time to buy a house? Are prices high right now, and will they further increase? Should someone consider purchasing a house thinking the market may worsen? How does one maximize their investment by buying at the right time? By predicting the home value based on what we know historically from houses in the same profile, we can help an individual understand if the price is inflated or not, and help them determine if this is the right moment.

For this data analysis project, we will study the housing market in Austin, Texas, one of America’s hottest real estate markets. By using historical data, we’ll attempt to predict home prices using multiple linear regression, and apply methods discussed in class to find the best-possible prediction model.We believe that the trends in Austin can serve as a thermometer for the rest of the country due to high interest driven by the COVID-19 pandemic, which is shifting workers to home-offices 3. By using detailed historical data about properties that were sold in Austin in the past, our intent is to offer home buyers guidance to help them understand if the sale price for a house is within overall market expectation, helping them on the decision making process.

This project implements the following data analysis techniques and concepts:

2. Dataset description and analysis

We gathered data from a large study 4 based on Zillow listings that include the sale price of properties between the years 2018 and 2021 in Austin, Texas, and several variables that help describe the given properties. We will try to predict prices in the past, present, and also future trends based on those variables.

The data used in our study offers 15,171 observations and 45 variables. Following is the list of variables in the dataset:

  1. zpid: Unique Identifier or Property ID
  2. city: The lowercase name of a city or town in or surrounding Austin, Texas.
  3. streetaddress: The street address of the property.
  4. zipcode: The property’s 5-digit ZIP code.
  5. description: The description of the property listing from Zillow.
  6. latitude: Latitude of the property.
  7. longitude: Longitude of the property.
  8. propertyTaxRate: Tax rate for the property.
  9. garageSpaces: Number of garage spaces. This is a subset of the ParkingSpaces feature.
  10. hasAssociation: Indicates if there is a Homeowners Association associated with the property.
  11. hasCooling: Boolean indicating if the property has a cooling system.
  12. hasGarage: Boolean indicating if the property has a garage.
  13. hasHeating: Boolean indicating if the property has a heating system.
  14. hasSpa: Boolean indicating if the property has a Spa.
  15. hasView: Boolean indicating if the property comes with a view.
  16. homeType: The home type (i.e., Single Family, Townhouse, Apartment).
  17. parkingSpaces: The number of parking spots.
  18. yearBuilt: The year the property was built.
  19. numPriceChanges: The number of price changes the property has undergone since being listed.
  20. latest_saledate: The latest sale date (YYYY-MM-DD).
  21. latest_salemonth: The month the property sold (1-12).
  22. latest_saleyear: The year the property sold (2018-2021).
  23. latestPriceSource: The party that provided the sale price.
  24. numOfPhotos: The number of photos in the Zillow listing.
  25. numOfAccessibilities: The number of unique accessibility features in the property.
  26. numOfAppliances: The number of unique appliances in the property.
  27. numOfParkingFeatures: The number of unique parking features in the property.
  28. numOfPatioAndPorts: The number of unique patio and/or porch features in the property.
  29. numOfSecurityFeatures: The number of unique security features in the property.
  30. numOfWaterFront: The number of unique waterfront features in the property.
  31. numOfUniqueWindowFeatures: The number of unique window aesthetics in the property.
  32. numOfCommunityFeatures: The number of unique community features (community meeting room, mailbox) in the property.
  33. lotSizeSqFt: The lot size of the property reported in square feet.
  34. livingAreaSqFt: The living area of the property reported in square feet.
  35. numOfPrimarySchools: The number of primary schools listed in the area on the listing.
  36. numOfElementrySchools: The number of elementary schools listed in the area on the listing.
  37. numOfMiddleSchools: The number of middle schools listed in the area on the listing.
  38. numOfHighSchools: The number of high schools listed in the area on the listing.
  39. avgSchoolDistance: The average distance of all school types (i.e., Middle, High) in the listing.
  40. avgSchoolRating: The average school rating of all school types (i.e., Middle, High) in the listing.
  41. avgSchoolSize: The average school size of all school types (i.e., Middle, High) in the listing.
  42. MedianStudentsPerTeacher: The median students-per-teacher for all schools near the listing.
  43. numOfBathrooms: The number of bathrooms in the property.
  44. numOfBedrooms: The number of bedrooms in the property.
  45. numOfStories: The number of stories in the property.

2.1. Data cleaning

Our first step will be to read the data from a csv file (austinHousingData.csv) and perform some data cleaning tasks:

  • Remove rows with missing data.
  • The homeType variable has 10 different possible values (Apartment, Condo, Residential, etc). We’ll make it a factor variable.
  • Remove variables that will not be used in the analysis for lack of relevancy to the property price: zpid, latest_saledate, latestPriceSource, city, homeImage, streetAddress, and numOfPhotos.
raw_housing_data = read.csv("austinHousingData.csv")

# remove all rows with missing data
raw_housing_data = na.omit(raw_housing_data)

# Make homeType a factor variable
raw_housing_data$homeType = as.factor(raw_housing_data$homeType)

# Remove predictors that are not used
selected_housing_data = subset(
  raw_housing_data,
  select = -c(
    zpid,
    latest_salemonth,
    latest_saledate,
    latestPriceSource,
    city,
    homeImage,
    streetAddress,
    numOfPhotos
  )
)

Now that we’ve performed basic data cleaning tasks, let’s take a look at the dataset.

head(selected_housing_data)
##   zipcode latitude longitude propertyTaxRate garageSpaces hasAssociation
## 1   78660    30.43    -97.66            1.98            2           TRUE
## 2   78660    30.43    -97.66            1.98            2           TRUE
## 3   78660    30.41    -97.64            1.98            0           TRUE
## 4   78660    30.43    -97.66            1.98            2           TRUE
## 5   78660    30.44    -97.66            1.98            0           TRUE
## 6   78660    30.44    -97.66            1.98            2           TRUE
##   hasCooling hasGarage hasHeating hasSpa hasView      homeType parkingSpaces
## 1       TRUE      TRUE       TRUE  FALSE   FALSE Single Family             2
## 2       TRUE      TRUE       TRUE  FALSE   FALSE Single Family             2
## 3       TRUE     FALSE       TRUE  FALSE   FALSE Single Family             0
## 4       TRUE      TRUE       TRUE  FALSE   FALSE Single Family             2
## 5       TRUE     FALSE       TRUE  FALSE   FALSE Single Family             0
## 6       TRUE      TRUE       TRUE  FALSE   FALSE Single Family             2
##   yearBuilt latestPrice numPriceChanges latest_saleyear
## 1      2012      305000               5            2019
## 2      2013      295000               1            2020
## 3      2018      256125               1            2019
## 4      2013      240000               4            2018
## 5      2002      239900               3            2018
## 6      2020      309045               2            2020
##   numOfAccessibilityFeatures numOfAppliances numOfParkingFeatures
## 1                          0               5                    2
## 2                          0               1                    2
## 3                          0               4                    1
## 4                          0               0                    2
## 5                          0               0                    1
## 6                          0               3                    1
##   numOfPatioAndPorchFeatures numOfSecurityFeatures numOfWaterfrontFeatures
## 1                          1                     3                       0
## 2                          0                     0                       0
## 3                          0                     1                       0
## 4                          0                     0                       0
## 5                          0                     0                       0
## 6                          2                     2                       0
##   numOfWindowFeatures numOfCommunityFeatures lotSizeSqFt livingAreaSqFt
## 1                   1                      0        6011           2601
## 2                   0                      0        6185           1768
## 3                   0                      0        7840           1478
## 4                   0                      0        6098           1678
## 5                   0                      0        6708           2132
## 6                   0                      0        5161           1446
##   numOfPrimarySchools numOfElementarySchools numOfMiddleSchools
## 1                   1                      0                  1
## 2                   1                      0                  1
## 3                   0                      2                  1
## 4                   1                      0                  1
## 5                   1                      0                  1
## 6                   1                      0                  1
##   numOfHighSchools avgSchoolDistance avgSchoolRating avgSchoolSize
## 1                1             1.267           2.667          1063
## 2                1             1.400           2.667          1063
## 3                1             1.200           3.000          1108
## 4                1             1.400           2.667          1063
## 5                1             1.133           4.000          1223
## 6                1             1.067           4.000          1223
##   MedianStudentsPerTeacher numOfBathrooms numOfBedrooms numOfStories
## 1                       14              3             4            2
## 2                       14              2             4            1
## 3                       14              2             3            1
## 4                       14              2             3            1
## 5                       14              3             3            2
## 6                       14              2             3            1

We’ll also look at the summary of the dataset to better understand the data ranges for each variable. This allows us to understand if some of the variables present abnormal max or min values when compared to the mean of that variable, helping to identify outliers which may cause noise in the dataset.

summary(selected_housing_data)
##     zipcode         latitude      longitude     propertyTaxRate  garageSpaces  
##  Min.   :78617   Min.   :30.1   Min.   :-98.0   Min.   :1.98    Min.   : 0.00  
##  1st Qu.:78727   1st Qu.:30.2   1st Qu.:-97.8   1st Qu.:1.98    1st Qu.: 0.00  
##  Median :78739   Median :30.3   Median :-97.8   Median :1.98    Median : 1.00  
##  Mean   :78736   Mean   :30.3   Mean   :-97.8   Mean   :1.99    Mean   : 1.23  
##  3rd Qu.:78749   3rd Qu.:30.4   3rd Qu.:-97.7   3rd Qu.:1.98    3rd Qu.: 2.00  
##  Max.   :78759   Max.   :30.5   Max.   :-97.6   Max.   :2.21    Max.   :22.00  
##                                                                                
##  hasAssociation  hasCooling      hasGarage       hasHeating     
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:7164      FALSE:274       FALSE:6825      FALSE:149      
##  TRUE :8007      TRUE :14897     TRUE :8346      TRUE :15022    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##    hasSpa         hasView                      homeType     parkingSpaces  
##  Mode :logical   Mode :logical   Single Family     :14241   Min.   : 0.00  
##  FALSE:13972     FALSE:11716     Condo             :  470   1st Qu.: 0.00  
##  TRUE :1199      TRUE :3455      Townhouse         :  174   Median : 1.00  
##                                  Multiple Occupancy:   96   Mean   : 1.23  
##                                  Vacant Land       :   83   3rd Qu.: 2.00  
##                                  Apartment         :   37   Max.   :22.00  
##                                  (Other)           :   70                  
##    yearBuilt     latestPrice       numPriceChanges latest_saleyear
##  Min.   :1905   Min.   :    5500   Min.   : 1.00   Min.   :2018   
##  1st Qu.:1974   1st Qu.:  309000   1st Qu.: 1.00   1st Qu.:2018   
##  Median :1993   Median :  405000   Median : 2.00   Median :2019   
##  Mean   :1989   Mean   :  512768   Mean   : 3.03   Mean   :2019   
##  3rd Qu.:2006   3rd Qu.:  575000   3rd Qu.: 4.00   3rd Qu.:2020   
##  Max.   :2020   Max.   :13500000   Max.   :23.00   Max.   :2021   
##                                                                   
##  numOfAccessibilityFeatures numOfAppliances numOfParkingFeatures
##  Min.   :0.000              Min.   : 0.00   Min.   :0.00        
##  1st Qu.:0.000              1st Qu.: 2.00   1st Qu.:1.00        
##  Median :0.000              Median : 3.00   Median :2.00        
##  Mean   :0.013              Mean   : 3.48   Mean   :1.71        
##  3rd Qu.:0.000              3rd Qu.: 4.00   3rd Qu.:2.00        
##  Max.   :8.000              Max.   :12.00   Max.   :6.00        
##                                                                 
##  numOfPatioAndPorchFeatures numOfSecurityFeatures numOfWaterfrontFeatures
##  Min.   :0.000              Min.   :0.000         Min.   :0.0000         
##  1st Qu.:0.000              1st Qu.:0.000         1st Qu.:0.0000         
##  Median :0.000              Median :0.000         Median :0.0000         
##  Mean   :0.663              Mean   :0.467         Mean   :0.0028         
##  3rd Qu.:1.000              3rd Qu.:1.000         3rd Qu.:0.0000         
##  Max.   :8.000              Max.   :6.000         Max.   :2.0000         
##                                                                          
##  numOfWindowFeatures numOfCommunityFeatures  lotSizeSqFt        
##  Min.   :0.000       Min.   :0.000          Min.   :       100  
##  1st Qu.:0.000       1st Qu.:0.000          1st Qu.:      6534  
##  Median :0.000       Median :0.000          Median :      8276  
##  Mean   :0.208       Mean   :0.019          Mean   :    119084  
##  3rd Qu.:0.000       3rd Qu.:0.000          3rd Qu.:     10890  
##  Max.   :4.000       Max.   :8.000          Max.   :1508482800  
##                                                                 
##  livingAreaSqFt   numOfPrimarySchools numOfElementarySchools numOfMiddleSchools
##  Min.   :   300   Min.   :0.000       Min.   :0.0000         Min.   :0.00      
##  1st Qu.:  1483   1st Qu.:1.000       1st Qu.:0.0000         1st Qu.:1.00      
##  Median :  1975   Median :1.000       Median :0.0000         Median :1.00      
##  Mean   :  2208   Mean   :0.941       Mean   :0.0492         Mean   :1.04      
##  3rd Qu.:  2687   3rd Qu.:1.000       3rd Qu.:0.0000         3rd Qu.:1.00      
##  Max.   :109292   Max.   :2.000       Max.   :2.0000         Max.   :3.00      
##                                                                                
##  numOfHighSchools avgSchoolDistance avgSchoolRating avgSchoolSize 
##  Min.   :0.000    Min.   :0.20      Min.   :2.33    Min.   : 396  
##  1st Qu.:1.000    1st Qu.:1.10      1st Qu.:4.00    1st Qu.: 966  
##  Median :1.000    Median :1.57      Median :5.78    Median :1287  
##  Mean   :0.977    Mean   :1.84      Mean   :5.78    Mean   :1237  
##  3rd Qu.:1.000    3rd Qu.:2.27      3rd Qu.:7.00    3rd Qu.:1496  
##  Max.   :2.000    Max.   :9.00      Max.   :9.50    Max.   :1913  
##                                                                   
##  MedianStudentsPerTeacher numOfBathrooms  numOfBedrooms    numOfStories 
##  Min.   :10.0             Min.   : 0.00   Min.   : 0.00   Min.   :1.00  
##  1st Qu.:14.0             1st Qu.: 2.00   1st Qu.: 3.00   1st Qu.:1.00  
##  Median :15.0             Median : 3.00   Median : 3.00   Median :1.00  
##  Mean   :14.9             Mean   : 2.68   Mean   : 3.44   Mean   :1.47  
##  3rd Qu.:16.0             3rd Qu.: 3.00   3rd Qu.: 4.00   3rd Qu.:2.00  
##  Max.   :19.0             Max.   :27.00   Max.   :20.00   Max.   :4.00  
## 

Some of the variables demonstrate strange min and max values when compared to their mean: avgSchoolDistance, livingAreaSqFt, lotSizeSqFt, numOfBedrooms, and numOfBathrooms. We’ll plot the observations of these variables in charts to better understand if they are outliers.

data_visuals = function(data) {
  par(mfrow = c(2, 3))
  
  plot(
    latestPrice ~ homeType,
    data = data,
    pch = 20,
    col = "dodgerblue",
    main = "latestPrice vs. homeType",
    cex = 1.5
  )
  plot(
    latestPrice ~ avgSchoolDistance  ,
    data = data,
    pch = 20,
    col = "dodgerblue",
    main = "latestPrice vs. avgSchoolDistance  ",
    cex = 1.5
  )
  plot(
    latestPrice ~ livingAreaSqFt,
    data = data,
    pch = 20,
    col = "dodgerblue",
    main = "latestPrice vs. livingAreaSqFt",
    cex = 1.5
  )
  
  plot(
    latestPrice ~ lotSizeSqFt,
    data = data,
    pch = 20,
    col = "dodgerblue",
    main = "latestPrice vs. lotSizeSqFt",
    cex = 1.5
  )
  plot(
    latestPrice ~ numOfBedrooms,
    data = data,
    pch = 20,
    col = "dodgerblue",
    main = "latestPrice vs. numOfBedrooms",
    cex = 1.5
  )
  plot(
    latestPrice ~ numOfBathrooms,
    data = data,
    pch = 20,
    col = "dodgerblue",
    main = "latestPrice vs. numOfBathrooms",
    cex = 1.5
  )
}

data_visuals(selected_housing_data)

From the data structure and visuals, we see that there are significant outliers in the dataset. For instance, one observation has a livingAreaSqft value of ‘109,292’, compared to its mean ‘2,208’. We shall remove these outliers using boxplot stats.

for (x in c(
  'homeType',
  'latestPrice',
  'avgSchoolDistance',
  'livingAreaSqFt',
  'lotSizeSqFt',
  'numOfBedrooms',
  'numOfBathrooms'
))
{
  value = selected_housing_data[, x][selected_housing_data[, x] %in% boxplot.stats(selected_housing_data[, x])$out]
  selected_housing_data[, x][selected_housing_data[, x] %in% value] = NA
}

# remove all rows with missing data
selected_housing_data = na.omit(selected_housing_data
)

Looking at the plots again, we confirm that the observations are better represented now, without outliers.

data_visuals(selected_housing_data)

The “cleaned” dataset now offers 11,493 observations and 39 variables.

str(selected_housing_data)
## 'data.frame':    12475 obs. of  38 variables:
##  $ zipcode                   : int  78660 78660 78660 78660 78660 78660 78660 78660 78660 78617 ...
##  $ latitude                  : num  30.4 30.4 30.4 30.4 30.4 ...
##  $ longitude                 : num  -97.7 -97.7 -97.6 -97.7 -97.7 ...
##  $ propertyTaxRate           : num  1.98 1.98 1.98 1.98 1.98 1.98 1.98 1.98 1.98 1.98 ...
##  $ garageSpaces              : int  2 2 0 2 0 2 0 0 0 2 ...
##  $ hasAssociation            : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ hasCooling                : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ hasGarage                 : logi  TRUE TRUE FALSE TRUE FALSE TRUE ...
##  $ hasHeating                : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ hasSpa                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ hasView                   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ homeType                  : Factor w/ 10 levels "Apartment","Condo",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ parkingSpaces             : int  2 2 0 2 0 2 0 0 0 2 ...
##  $ yearBuilt                 : int  2012 2013 2018 2013 2002 2020 2016 2002 2002 2013 ...
##  $ latestPrice               : int  305000 295000 256125 240000 239900 309045 315000 219900 225000 194800 ...
##  $ numPriceChanges           : int  5 1 1 4 3 2 2 2 1 1 ...
##  $ latest_saleyear           : int  2019 2020 2019 2018 2018 2020 2020 2018 2019 2018 ...
##  $ numOfAccessibilityFeatures: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ numOfAppliances           : int  5 1 4 0 0 3 3 3 2 3 ...
##  $ numOfParkingFeatures      : int  2 2 1 2 1 1 1 1 1 2 ...
##  $ numOfPatioAndPorchFeatures: int  1 0 0 0 0 2 0 0 1 0 ...
##  $ numOfSecurityFeatures     : int  3 0 1 0 0 2 0 0 1 0 ...
##  $ numOfWaterfrontFeatures   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ numOfWindowFeatures       : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ numOfCommunityFeatures    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lotSizeSqFt               : num  6011 6185 7840 6098 6708 ...
##  $ livingAreaSqFt            : int  2601 1768 1478 1678 2132 1446 2432 1422 1870 1422 ...
##  $ numOfPrimarySchools       : int  1 1 0 1 1 1 1 1 1 1 ...
##  $ numOfElementarySchools    : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ numOfMiddleSchools        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ numOfHighSchools          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ avgSchoolDistance         : num  1.27 1.4 1.2 1.4 1.13 ...
##  $ avgSchoolRating           : num  2.67 2.67 3 2.67 4 ...
##  $ avgSchoolSize             : int  1063 1063 1108 1063 1223 1223 1051 1223 1223 1615 ...
##  $ MedianStudentsPerTeacher  : int  14 14 14 14 14 14 12 14 14 14 ...
##  $ numOfBathrooms            : num  3 2 2 2 3 2 3 3 2 3 ...
##  $ numOfBedrooms             : int  4 4 3 3 3 3 4 3 3 3 ...
##  $ numOfStories              : int  2 1 1 1 2 1 2 2 2 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:2696] 18 23 24 29 30 34 35 38 42 44 ...
##   ..- attr(*, "names")= chr [1:2696] "18" "23" "24" "29" ...

Now that we have a clean dataset, without outliers, let’s have a look at the distribution of property prices when plotted over the map of Austin. We notice that the most expensive houses are concentrated near the central part of Austin, with some exceptions for more prestigious areas. Overall, houses are in the $250,000 to $750,000 range.

library(ggmap)
library(ggplot2)

# register google maps API key
register_google(key = "AIzaSyAXuwivTHN6rIgi3teuusdz3r8dqNMQQx8")

## Central co-ordinates of the region we are interested in.
central_location = c(mean(selected_housing_data$longitude),
                     mean(selected_housing_data$latitude))

## Get map centered on Austin, TX (or the mean of the coordinates in our dataset)
austin_map = ggmap(get_googlemap(
  center = central_location,
  scale = 1,
  zoom = 10
),
extent = "normal")

## Plot heatmap
austin_map + geom_point(
  aes(x = longitude, y = latitude, color = latestPrice),
  data = selected_housing_data,
  alpha = 0.4,
  size = 3
) + xlim(range(selected_housing_data$longitude)) + ylim(range(selected_housing_data$latitude)) + scale_color_distiller(palette = "Spectral", labels = scales::comma) + xlab("Longitude") + ylab("Latitude") + ggtitle("Heatmap: latest sale price ($ USD) by property")