1. Introduction

The real estate market can be intimidating for anyone looking for a new home. For first-time home buyers and experienced investors alike, news articles report that we’re living through the hottest market in decades ¹.

For many people, buying a house is one of the most important decisions they’ll make in their lives ². However, when is the right time to buy a house? Are prices high right now, and will they increase further? Should someone consider purchasing a house thinking the market may worsen? How does one maximize their investment by buying at the right time? By predicting a home’s value from what we know historically about houses with a similar profile, we can help an individual understand whether the price is inflated and determine whether this is the right moment to buy.

For this data analysis project, we will study the housing market in Austin, Texas, one of America’s hottest real estate markets. Using historical data, we’ll attempt to predict home prices with multiple linear regression, applying methods discussed in class to find the best possible prediction model. We believe that the trends in Austin can serve as a thermometer for the rest of the country due to high interest driven by the COVID-19 pandemic, which is shifting workers to home offices ³. By using detailed historical data about properties sold in Austin, our intent is to help home buyers understand whether the sale price of a house is within overall market expectations, supporting them in the decision-making process.

This project implements the following data analysis techniques and concepts:

Data cleaning
Collinearity
Multiple linear regression
ANOVA
Interaction
Assumption diagnostics
Outlier diagnostics
Transformations
Stepwise model selection
Variable selection

2. Dataset description and analysis

We gathered data from a large study ⁴ based on Zillow listings that includes the sale price of properties sold between 2018 and 2021 in Austin, Texas, along with several variables that describe those properties. Based on those variables, we will try to predict past and present prices and to anticipate future trends.

The data used in our study offers 15,171 observations and 45 variables. Following is the list of variables in the dataset:

zpid: Unique Identifier or Property ID
city: The lowercase name of a city or town in or surrounding Austin, Texas.
streetAddress: The street address of the property.
zipcode: The property’s 5-digit ZIP code.
description: The description of the property listing from Zillow.
latitude: Latitude of the property.
longitude: Longitude of the property.
propertyTaxRate: Tax rate for the property.
garageSpaces: Number of garage spaces. This is a subset of the parkingSpaces feature.
hasAssociation: Indicates if there is a Homeowners Association associated with the property.
hasCooling: Boolean indicating if the property has a cooling system.
hasGarage: Boolean indicating if the property has a garage.
hasHeating: Boolean indicating if the property has a heating system.
hasSpa: Boolean indicating if the property has a Spa.
hasView: Boolean indicating if the property comes with a view.
homeType: The home type (e.g., Single Family, Townhouse, Apartment).
parkingSpaces: The number of parking spots.
yearBuilt: The year the property was built.
numPriceChanges: The number of price changes the property has undergone since being listed.
latest_saledate: The latest sale date (YYYY-MM-DD).
latest_salemonth: The month the property sold (1-12).
latest_saleyear: The year the property sold (2018-2021).
latestPriceSource: The party that provided the sale price.
numOfPhotos: The number of photos in the Zillow listing.
numOfAccessibilityFeatures: The number of unique accessibility features in the property.
numOfAppliances: The number of unique appliances in the property.
numOfParkingFeatures: The number of unique parking features in the property.
numOfPatioAndPorchFeatures: The number of unique patio and/or porch features in the property.
numOfSecurityFeatures: The number of unique security features in the property.
numOfWaterfrontFeatures: The number of unique waterfront features in the property.
numOfWindowFeatures: The number of unique window aesthetics in the property.
numOfCommunityFeatures: The number of unique community features (community meeting room, mailbox) in the property.
lotSizeSqFt: The lot size of the property reported in square feet.
livingAreaSqFt: The living area of the property reported in square feet.
numOfPrimarySchools: The number of primary schools listed in the area on the listing.
numOfElementarySchools: The number of elementary schools listed in the area on the listing.
numOfMiddleSchools: The number of middle schools listed in the area on the listing.
numOfHighSchools: The number of high schools listed in the area on the listing.
avgSchoolDistance: The average distance of all school types (e.g., Middle, High) in the listing.
avgSchoolRating: The average school rating of all school types (e.g., Middle, High) in the listing.
avgSchoolSize: The average school size of all school types (e.g., Middle, High) in the listing.
MedianStudentsPerTeacher: The median students-per-teacher for all schools near the listing.
numOfBathrooms: The number of bathrooms in the property.
numOfBedrooms: The number of bedrooms in the property.
numOfStories: The number of stories in the property.

2.1. Data cleaning

Our first step will be to read the data from a csv file (austinHousingData.csv) and perform some data cleaning tasks:

Remove rows with missing data.
The homeType variable has 10 different possible values (Apartment, Condo, Residential, etc.). We’ll make it a factor variable.
Remove variables that will not be used in the analysis for lack of relevance to the property price: zpid, latest_salemonth, latest_saledate, latestPriceSource, city, homeImage, streetAddress, and numOfPhotos.

raw_housing_data = read.csv("austinHousingData.csv")

# remove all rows with missing data
raw_housing_data = na.omit(raw_housing_data)

# Make homeType a factor variable
raw_housing_data$homeType = as.factor(raw_housing_data$homeType)

# Remove predictors that are not used
selected_housing_data = subset(
  raw_housing_data,
  select = -c(
    zpid,
    latest_salemonth,
    latest_saledate,
    latestPriceSource,
    city,
    homeImage,
    streetAddress,
    numOfPhotos
  )
)

Now that we’ve performed basic data cleaning tasks, let’s take a look at the dataset.

head(selected_housing_data)

##   zipcode latitude longitude propertyTaxRate garageSpaces hasAssociation
## 1   78660    30.43    -97.66            1.98            2           TRUE
## 2   78660    30.43    -97.66            1.98            2           TRUE
## 3   78660    30.41    -97.64            1.98            0           TRUE
## 4   78660    30.43    -97.66            1.98            2           TRUE
## 5   78660    30.44    -97.66            1.98            0           TRUE
## 6   78660    30.44    -97.66            1.98            2           TRUE
##   hasCooling hasGarage hasHeating hasSpa hasView      homeType parkingSpaces
## 1       TRUE      TRUE       TRUE  FALSE   FALSE Single Family             2
## 2       TRUE      TRUE       TRUE  FALSE   FALSE Single Family             2
## 3       TRUE     FALSE       TRUE  FALSE   FALSE Single Family             0
## 4       TRUE      TRUE       TRUE  FALSE   FALSE Single Family             2
## 5       TRUE     FALSE       TRUE  FALSE   FALSE Single Family             0
## 6       TRUE      TRUE       TRUE  FALSE   FALSE Single Family             2
##   yearBuilt latestPrice numPriceChanges latest_saleyear
## 1      2012      305000               5            2019
## 2      2013      295000               1            2020
## 3      2018      256125               1            2019
## 4      2013      240000               4            2018
## 5      2002      239900               3            2018
## 6      2020      309045               2            2020
##   numOfAccessibilityFeatures numOfAppliances numOfParkingFeatures
## 1                          0               5                    2
## 2                          0               1                    2
## 3                          0               4                    1
## 4                          0               0                    2
## 5                          0               0                    1
## 6                          0               3                    1
##   numOfPatioAndPorchFeatures numOfSecurityFeatures numOfWaterfrontFeatures
## 1                          1                     3                       0
## 2                          0                     0                       0
## 3                          0                     1                       0
## 4                          0                     0                       0
## 5                          0                     0                       0
## 6                          2                     2                       0
##   numOfWindowFeatures numOfCommunityFeatures lotSizeSqFt livingAreaSqFt
## 1                   1                      0        6011           2601
## 2                   0                      0        6185           1768
## 3                   0                      0        7840           1478
## 4                   0                      0        6098           1678
## 5                   0                      0        6708           2132
## 6                   0                      0        5161           1446
##   numOfPrimarySchools numOfElementarySchools numOfMiddleSchools
## 1                   1                      0                  1
## 2                   1                      0                  1
## 3                   0                      2                  1
## 4                   1                      0                  1
## 5                   1                      0                  1
## 6                   1                      0                  1
##   numOfHighSchools avgSchoolDistance avgSchoolRating avgSchoolSize
## 1                1             1.267           2.667          1063
## 2                1             1.400           2.667          1063
## 3                1             1.200           3.000          1108
## 4                1             1.400           2.667          1063
## 5                1             1.133           4.000          1223
## 6                1             1.067           4.000          1223
##   MedianStudentsPerTeacher numOfBathrooms numOfBedrooms numOfStories
## 1                       14              3             4            2
## 2                       14              2             4            1
## 3                       14              2             3            1
## 4                       14              2             3            1
## 5                       14              3             3            2
## 6                       14              2             3            1

We’ll also look at the summary of the dataset to better understand the data ranges for each variable. This allows us to understand if some of the variables present abnormal max or min values when compared to the mean of that variable, helping to identify outliers which may cause noise in the dataset.

summary(selected_housing_data)

##     zipcode         latitude      longitude     propertyTaxRate  garageSpaces  
##  Min.   :78617   Min.   :30.1   Min.   :-98.0   Min.   :1.98    Min.   : 0.00  
##  1st Qu.:78727   1st Qu.:30.2   1st Qu.:-97.8   1st Qu.:1.98    1st Qu.: 0.00  
##  Median :78739   Median :30.3   Median :-97.8   Median :1.98    Median : 1.00  
##  Mean   :78736   Mean   :30.3   Mean   :-97.8   Mean   :1.99    Mean   : 1.23  
##  3rd Qu.:78749   3rd Qu.:30.4   3rd Qu.:-97.7   3rd Qu.:1.98    3rd Qu.: 2.00  
##  Max.   :78759   Max.   :30.5   Max.   :-97.6   Max.   :2.21    Max.   :22.00  
##                                                                                
##  hasAssociation  hasCooling      hasGarage       hasHeating     
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:7164      FALSE:274       FALSE:6825      FALSE:149      
##  TRUE :8007      TRUE :14897     TRUE :8346      TRUE :15022    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##    hasSpa         hasView                      homeType     parkingSpaces  
##  Mode :logical   Mode :logical   Single Family     :14241   Min.   : 0.00  
##  FALSE:13972     FALSE:11716     Condo             :  470   1st Qu.: 0.00  
##  TRUE :1199      TRUE :3455      Townhouse         :  174   Median : 1.00  
##                                  Multiple Occupancy:   96   Mean   : 1.23  
##                                  Vacant Land       :   83   3rd Qu.: 2.00  
##                                  Apartment         :   37   Max.   :22.00  
##                                  (Other)           :   70                  
##    yearBuilt     latestPrice       numPriceChanges latest_saleyear
##  Min.   :1905   Min.   :    5500   Min.   : 1.00   Min.   :2018   
##  1st Qu.:1974   1st Qu.:  309000   1st Qu.: 1.00   1st Qu.:2018   
##  Median :1993   Median :  405000   Median : 2.00   Median :2019   
##  Mean   :1989   Mean   :  512768   Mean   : 3.03   Mean   :2019   
##  3rd Qu.:2006   3rd Qu.:  575000   3rd Qu.: 4.00   3rd Qu.:2020   
##  Max.   :2020   Max.   :13500000   Max.   :23.00   Max.   :2021   
##                                                                   
##  numOfAccessibilityFeatures numOfAppliances numOfParkingFeatures
##  Min.   :0.000              Min.   : 0.00   Min.   :0.00        
##  1st Qu.:0.000              1st Qu.: 2.00   1st Qu.:1.00        
##  Median :0.000              Median : 3.00   Median :2.00        
##  Mean   :0.013              Mean   : 3.48   Mean   :1.71        
##  3rd Qu.:0.000              3rd Qu.: 4.00   3rd Qu.:2.00        
##  Max.   :8.000              Max.   :12.00   Max.   :6.00        
##                                                                 
##  numOfPatioAndPorchFeatures numOfSecurityFeatures numOfWaterfrontFeatures
##  Min.   :0.000              Min.   :0.000         Min.   :0.0000         
##  1st Qu.:0.000              1st Qu.:0.000         1st Qu.:0.0000         
##  Median :0.000              Median :0.000         Median :0.0000         
##  Mean   :0.663              Mean   :0.467         Mean   :0.0028         
##  3rd Qu.:1.000              3rd Qu.:1.000         3rd Qu.:0.0000         
##  Max.   :8.000              Max.   :6.000         Max.   :2.0000         
##                                                                          
##  numOfWindowFeatures numOfCommunityFeatures  lotSizeSqFt        
##  Min.   :0.000       Min.   :0.000          Min.   :       100  
##  1st Qu.:0.000       1st Qu.:0.000          1st Qu.:      6534  
##  Median :0.000       Median :0.000          Median :      8276  
##  Mean   :0.208       Mean   :0.019          Mean   :    119084  
##  3rd Qu.:0.000       3rd Qu.:0.000          3rd Qu.:     10890  
##  Max.   :4.000       Max.   :8.000          Max.   :1508482800  
##                                                                 
##  livingAreaSqFt   numOfPrimarySchools numOfElementarySchools numOfMiddleSchools
##  Min.   :   300   Min.   :0.000       Min.   :0.0000         Min.   :0.00      
##  1st Qu.:  1483   1st Qu.:1.000       1st Qu.:0.0000         1st Qu.:1.00      
##  Median :  1975   Median :1.000       Median :0.0000         Median :1.00      
##  Mean   :  2208   Mean   :0.941       Mean   :0.0492         Mean   :1.04      
##  3rd Qu.:  2687   3rd Qu.:1.000       3rd Qu.:0.0000         3rd Qu.:1.00      
##  Max.   :109292   Max.   :2.000       Max.   :2.0000         Max.   :3.00      
##                                                                                
##  numOfHighSchools avgSchoolDistance avgSchoolRating avgSchoolSize 
##  Min.   :0.000    Min.   :0.20      Min.   :2.33    Min.   : 396  
##  1st Qu.:1.000    1st Qu.:1.10      1st Qu.:4.00    1st Qu.: 966  
##  Median :1.000    Median :1.57      Median :5.78    Median :1287  
##  Mean   :0.977    Mean   :1.84      Mean   :5.78    Mean   :1237  
##  3rd Qu.:1.000    3rd Qu.:2.27      3rd Qu.:7.00    3rd Qu.:1496  
##  Max.   :2.000    Max.   :9.00      Max.   :9.50    Max.   :1913  
##                                                                   
##  MedianStudentsPerTeacher numOfBathrooms  numOfBedrooms    numOfStories 
##  Min.   :10.0             Min.   : 0.00   Min.   : 0.00   Min.   :1.00  
##  1st Qu.:14.0             1st Qu.: 2.00   1st Qu.: 3.00   1st Qu.:1.00  
##  Median :15.0             Median : 3.00   Median : 3.00   Median :1.00  
##  Mean   :14.9             Mean   : 2.68   Mean   : 3.44   Mean   :1.47  
##  3rd Qu.:16.0             3rd Qu.: 3.00   3rd Qu.: 4.00   3rd Qu.:2.00  
##  Max.   :19.0             Max.   :27.00   Max.   :20.00   Max.   :4.00  
##
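
One way to screen for this kind of extreme value programmatically is to compare each numeric column's maximum with its median. The following is a minimal sketch on a made-up data frame; the toy values and the cutoff of 10 are illustrative assumptions, not part of the original analysis.

```r
# Made-up data for illustration; 109292 mimics an implausibly large living area
toy = data.frame(
  livingAreaSqFt = c(1483, 1975, 2208, 2687, 109292),
  numOfStories   = c(1, 1, 2, 2, 4)
)

# Ratio of the maximum to the median for every numeric column
numeric_cols = sapply(toy, is.numeric)
max_to_median = sapply(toy[numeric_cols], function(v) max(v) / median(v))

# Arbitrary cutoff (an assumption): flag columns whose maximum exceeds 10x the median
suspicious = names(max_to_median)[max_to_median > 10]
print(suspicious)
```

A screen like this only points to columns worth inspecting; whether a value is a genuine outlier still depends on context.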

Some of the variables show unusual minimum or maximum values when compared to their mean: avgSchoolDistance, livingAreaSqFt, lotSizeSqFt, numOfBedrooms, and numOfBathrooms. We’ll plot these variables against the sale price to better understand whether those values are outliers.

data_visuals = function(data) {
  par(mfrow = c(2, 3))
  
  plot(
    latestPrice ~ homeType,
    data = data,
    pch = 20,
    col = "dodgerblue",
    main = "latestPrice vs. homeType",
    cex = 1.5
  )
  plot(
    latestPrice ~ avgSchoolDistance  ,
    data = data,
    pch = 20,
    col = "dodgerblue",
    main = "latestPrice vs. avgSchoolDistance  ",
    cex = 1.5
  )
  plot(
    latestPrice ~ livingAreaSqFt,
    data = data,
    pch = 20,
    col = "dodgerblue",
    main = "latestPrice vs. livingAreaSqFt",
    cex = 1.5
  )
  
  plot(
    latestPrice ~ lotSizeSqFt,
    data = data,
    pch = 20,
    col = "dodgerblue",
    main = "latestPrice vs. lotSizeSqFt",
    cex = 1.5
  )
  plot(
    latestPrice ~ numOfBedrooms,
    data = data,
    pch = 20,
    col = "dodgerblue",
    main = "latestPrice vs. numOfBedrooms",
    cex = 1.5
  )
  plot(
    latestPrice ~ numOfBathrooms,
    data = data,
    pch = 20,
    col = "dodgerblue",
    main = "latestPrice vs. numOfBathrooms",
    cex = 1.5
  )
}

data_visuals(selected_housing_data)

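From the summary and the plots, we see that there are significant outliers in the dataset. For instance, one observation has a livingAreaSqFt value of 109,292, compared with a mean of 2,208. We remove these outliers using boxplot.stats(), which by default flags values lying more than 1.5 times the interquartile range beyond the hinges (roughly the first and third quartiles).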

# Flag boxplot outliers as NA, one variable at a time. boxplot.stats() marks
# values lying more than 1.5 * IQR beyond the hinges.
# homeType is omitted: it is a factor, so boxplot statistics do not apply to it.
for (x in c(
  'latestPrice',
  'avgSchoolDistance',
  'livingAreaSqFt',
  'lotSizeSqFt',
  'numOfBedrooms',
  'numOfBathrooms'
))
{
  value = boxplot.stats(selected_housing_data[, x])$out
  selected_housing_data[, x][selected_housing_data[, x] %in% value] = NA
}

# remove all rows with missing data
selected_housing_data = na.omit(selected_housing_data)
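
To make the rule behind boxplot.stats() concrete, here is a minimal sketch on a made-up vector. With its default coefficient of 1.5, the function flags values lying more than 1.5 times the interquartile range beyond the hinges. The numbers below are illustrative assumptions, not rows from the dataset.

```r
# Made-up living-area values; 109292 mimics the extreme observation discussed above
living_area = c(1483, 1768, 1975, 2132, 2208, 2601, 2687, 109292)

stats = boxplot.stats(living_area)      # default coef = 1.5
hinges = stats$stats[c(2, 4)]           # lower and upper hinge
iqr = diff(hinges)                      # spread between the hinges
upper_fence = hinges[2] + 1.5 * iqr     # values above this are flagged
lower_fence = hinges[1] - 1.5 * iqr     # values below this are flagged

flagged = stats$out                     # the values boxplot.stats() reports as outliers
print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(flagged)

# Same replacement pattern as in the loop: mark flagged values as NA, then drop them
living_area[living_area %in% flagged] = NA
cleaned = living_area[!is.na(living_area)]
```

Because the fences are recomputed from whatever data is passed in, running the procedure a second time on the cleaned data can flag additional points.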

Looking at the plots again, we confirm that the observations are better represented now, without outliers.

data_visuals(selected_housing_data)

The “cleaned” dataset now offers 12,475 observations and 38 variables.

str(selected_housing_data)

## 'data.frame':    12475 obs. of  38 variables:
##  $ zipcode                   : int  78660 78660 78660 78660 78660 78660 78660 78660 78660 78617 ...
##  $ latitude                  : num  30.4 30.4 30.4 30.4 30.4 ...
##  $ longitude                 : num  -97.7 -97.7 -97.6 -97.7 -97.7 ...
##  $ propertyTaxRate           : num  1.98 1.98 1.98 1.98 1.98 1.98 1.98 1.98 1.98 1.98 ...
##  $ garageSpaces              : int  2 2 0 2 0 2 0 0 0 2 ...
##  $ hasAssociation            : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ hasCooling                : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ hasGarage                 : logi  TRUE TRUE FALSE TRUE FALSE TRUE ...
##  $ hasHeating                : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ hasSpa                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ hasView                   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ homeType                  : Factor w/ 10 levels "Apartment","Condo",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ parkingSpaces             : int  2 2 0 2 0 2 0 0 0 2 ...
##  $ yearBuilt                 : int  2012 2013 2018 2013 2002 2020 2016 2002 2002 2013 ...
##  $ latestPrice               : int  305000 295000 256125 240000 239900 309045 315000 219900 225000 194800 ...
##  $ numPriceChanges           : int  5 1 1 4 3 2 2 2 1 1 ...
##  $ latest_saleyear           : int  2019 2020 2019 2018 2018 2020 2020 2018 2019 2018 ...
##  $ numOfAccessibilityFeatures: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ numOfAppliances           : int  5 1 4 0 0 3 3 3 2 3 ...
##  $ numOfParkingFeatures      : int  2 2 1 2 1 1 1 1 1 2 ...
##  $ numOfPatioAndPorchFeatures: int  1 0 0 0 0 2 0 0 1 0 ...
##  $ numOfSecurityFeatures     : int  3 0 1 0 0 2 0 0 1 0 ...
##  $ numOfWaterfrontFeatures   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ numOfWindowFeatures       : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ numOfCommunityFeatures    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lotSizeSqFt               : num  6011 6185 7840 6098 6708 ...
##  $ livingAreaSqFt            : int  2601 1768 1478 1678 2132 1446 2432 1422 1870 1422 ...
##  $ numOfPrimarySchools       : int  1 1 0 1 1 1 1 1 1 1 ...
##  $ numOfElementarySchools    : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ numOfMiddleSchools        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ numOfHighSchools          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ avgSchoolDistance         : num  1.27 1.4 1.2 1.4 1.13 ...
##  $ avgSchoolRating           : num  2.67 2.67 3 2.67 4 ...
##  $ avgSchoolSize             : int  1063 1063 1108 1063 1223 1223 1051 1223 1223 1615 ...
##  $ MedianStudentsPerTeacher  : int  14 14 14 14 14 14 12 14 14 14 ...
##  $ numOfBathrooms            : num  3 2 2 2 3 2 3 3 2 3 ...
##  $ numOfBedrooms             : int  4 4 3 3 3 3 4 3 3 3 ...
##  $ numOfStories              : int  2 1 1 1 2 1 2 2 2 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:2696] 18 23 24 29 30 34 35 38 42 44 ...
##   ..- attr(*, "names")= chr [1:2696] "18" "23" "24" "29" ...

Now that we have a clean dataset, without outliers, let’s have a look at the distribution of property prices when plotted over the map of Austin. We notice that the most expensive houses are concentrated near the central part of Austin, with some exceptions for more prestigious areas. Overall, houses are in the $250,000 to $750,000 range.

library(ggmap)
library(ggplot2)

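# Register the Google Maps API key. It is read from an environment variable
# (the name GOOGLE_MAPS_API_KEY is an assumption) rather than hard-coded in the report.
register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))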

## Central co-ordinates of the region we are interested in.
central_location = c(mean(selected_housing_data$longitude),
                     mean(selected_housing_data$latitude))

## Get map centered on Austin, TX (or the mean of the coordinates in our dataset)
austin_map = ggmap(
  get_googlemap(
    center = central_location,
    scale = 1,
    zoom = 10
  ),
  extent = "normal"
)

## Plot heatmap
austin_map +
  geom_point(
    aes(x = longitude, y = latitude, color = latestPrice),
    data = selected_housing_data,
    alpha = 0.4,
    size = 3
  ) +
  xlim(range(selected_housing_data$longitude)) +
  ylim(range(selected_housing_data$latitude)) +
  scale_color_distiller(palette = "Spectral", labels = scales::comma) +
  xlab("Longitude") +
  ylab("Latitude") +
  ggtitle("Heatmap: latest sale price ($ USD) by property")

2.2. Train-Test split

We’ll split our dataset into two data frames: one used for training, which will contain 25% of the observations, and one for testing, containing the remaining 75%.

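set.seed(19870412)
ratio = 0.25  # share of observations used for training
# floor() makes the sample size an explicit whole number
idx  = sample(nrow(selected_housing_data),
              size = floor(nrow(selected_housing_data) * ratio))
housing_data_train = selected_housing_data[idx, ]  # rows drawn for training
housing_data_test = selected_housing_data[-idx, ]  # all remaining rows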
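
A quick sanity check for a split like this is to confirm that the two parts have the expected sizes and share no rows. The sketch below uses a made-up data frame; the toy data, its id column, and the seed are assumptions for illustration only.

```r
set.seed(42)

# Made-up data: 100 properties with an id and a simulated price
toy = data.frame(id = 1:100, price = rnorm(100, mean = 400000, sd = 50000))

ratio = 0.25
idx = sample(nrow(toy), size = floor(nrow(toy) * ratio))
toy_train = toy[idx, ]    # 25% of the rows
toy_test = toy[-idx, ]    # the remaining 75%

# The parts should add up to the full data and must not overlap
n_train = nrow(toy_train)
n_test = nrow(toy_test)
n_shared = length(intersect(toy_train$id, toy_test$id))
print(c(train = n_train, test = n_test, shared = n_shared))
```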

2.3. Collinearity

Next, we’ll look at the variables in the dataset to investigate if there’s multicollinearity.
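
For intuition about what a correlation matrix can reveal, here is a minimal sketch on made-up data in which one predictor duplicates another. The variable names, the simulated values, and the 0.8 cutoff are assumptions for illustration; they are not taken from the dataset.

```r
set.seed(1)
n = 200

# Made-up predictors: parking is an exact copy of garage, area is unrelated
garage = sample(0:3, n, replace = TRUE)
parking = garage
area = rnorm(n, mean = 2000, sd = 500)
toy = data.frame(garage, parking, area)

cm = cor(toy)

# Keep each pair once: blank out the diagonal and the upper triangle
cm[upper.tri(cm, diag = TRUE)] = NA

# List pairs whose absolute correlation exceeds an arbitrary cutoff of 0.8
high = which(abs(cm) > 0.8, arr.ind = TRUE)
high_pairs = data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r = cm[high]
)
print(high_pairs)
```

A pair with a correlation near 1 carries almost the same information twice, so keeping both in a regression makes the individual coefficient estimates unstable.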

library(faraway)
options(max.print = 1000000)

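# This is a helper function to get the top n items from a matrix.
# Adjusted from https://stackoverflow.com/questions/32544566/find-the-largest-values-on-a-matrix-in-r
#   m:   numeric matrix
#   n:   number of values to return
#   sim: set to TRUE for a symmetric matrix, where each off-diagonal value
#        appears twice, so only every second entry of the ordering is kept

nlargest = function(m, n = 10, sim = TRUE) {
  mult = 1
  
  if (sim)
    mult = 2
  # linear indices of the entries of m, sorted from largest to smallest
  res = order(m, decreasing = TRUE)[seq_len(n) * mult]
  # convert the linear indices back to (row, column) positions
  pos = arrayInd(res, dim(m), useNames = TRUE)
  list(values = m[res],
       position = pos)
}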

# A correlation cannot be computed for factor variables, so we'll create a copy of the data frame with only the numeric columns (factor and logical columns are dropped) to run the collinearity analysis
num_cols = unlist(lapply(housing_data_train, is.numeric))
housing_data_numerical = housing_data_train[, num_cols]

# Pairs won't work with more than 26 variables
# pairs(housing_data_train[,1:26], col="dodgerblue")

# run cor() and store the results in a matrix
(coll_matrix = round(cor(housing_data_numerical), 2))

##                            zipcode latitude longitude propertyTaxRate
## zipcode                       1.00    -0.07     -0.16           -0.21
## latitude                     -0.07     1.00      0.36            0.48
## longitude                    -0.16     0.36      1.00           -0.01
## propertyTaxRate              -0.21     0.48     -0.01            1.00
## garageSpaces                 -0.01     0.05     -0.06            0.05
## parkingSpaces                -0.01     0.04     -0.06            0.05
## yearBuilt                    -0.02    -0.11     -0.19            0.13
## latestPrice                  -0.15     0.14     -0.27           -0.03
## numPriceChanges              -0.05     0.00     -0.04           -0.02
## latest_saleyear              -0.04    -0.04     -0.02           -0.01
## numOfAccessibilityFeatures   -0.03     0.00      0.03           -0.01
## numOfAppliances               0.03    -0.01     -0.01           -0.01
## numOfParkingFeatures         -0.08     0.19     -0.01            0.31
## numOfPatioAndPorchFeatures   -0.02    -0.03     -0.08           -0.01
## numOfSecurityFeatures         0.00    -0.02     -0.06            0.03
## numOfWaterfrontFeatures      -0.01    -0.02     -0.02           -0.01
## numOfWindowFeatures           0.00     0.00     -0.13            0.01
## numOfCommunityFeatures       -0.01     0.02     -0.03            0.03
## lotSizeSqFt                   0.12     0.18     -0.25            0.09
## livingAreaSqFt               -0.01     0.14     -0.43            0.19
## numOfPrimarySchools          -0.05    -0.07      0.12            0.03
## numOfElementarySchools        0.06     0.13      0.00           -0.04
## numOfMiddleSchools           -0.01    -0.02     -0.20            0.00
## numOfHighSchools              0.08     0.11      0.38           -0.03
## avgSchoolDistance             0.05     0.01     -0.04           -0.03
## avgSchoolRating               0.05     0.27     -0.55            0.21
## avgSchoolSize                 0.12     0.02     -0.43            0.18
## MedianStudentsPerTeacher      0.11    -0.02     -0.60           -0.03
## numOfBathrooms               -0.03     0.05     -0.31            0.12
## numOfBedrooms                 0.04     0.09     -0.28            0.15
## numOfStories                 -0.03     0.01     -0.18            0.07
##                            garageSpaces parkingSpaces yearBuilt latestPrice
## zipcode                           -0.01         -0.01     -0.02       -0.15
## latitude                           0.05          0.04     -0.11        0.14
## longitude                         -0.06         -0.06     -0.19       -0.27
## propertyTaxRate                    0.05          0.05      0.13       -0.03
## garageSpaces                       1.00          1.00      0.04        0.14
## parkingSpaces                      1.00          1.00      0.05        0.14
## yearBuilt                          0.04          0.05      1.00       -0.08
## latestPrice                        0.14          0.14     -0.08        1.00
## numPriceChanges                    0.10          0.10     -0.05       -0.05
## latest_saleyear                    0.34          0.34      0.01        0.13
## numOfAccessibilityFeatures         0.05          0.05      0.03        0.05
## numOfAppliances                    0.15          0.15      0.09        0.04
## numOfParkingFeatures               0.67          0.67      0.00        0.11
## numOfPatioAndPorchFeatures         0.24          0.24      0.02        0.17
## numOfSecurityFeatures              0.17          0.17      0.14        0.11
## numOfWaterfrontFeatures            0.02          0.02     -0.02        0.04
## numOfWindowFeatures                0.03          0.03      0.04        0.14
## numOfCommunityFeatures             0.05          0.02      0.02        0.04
## lotSizeSqFt                        0.06          0.06     -0.15        0.31
## livingAreaSqFt                     0.13          0.13      0.41        0.48
## numOfPrimarySchools               -0.01         -0.01     -0.06       -0.14
## numOfElementarySchools             0.00          0.00      0.04        0.11
## numOfMiddleSchools                 0.01          0.01      0.08        0.12
## numOfHighSchools                  -0.03         -0.02      0.10       -0.23
## avgSchoolDistance                  0.02          0.02      0.32       -0.11
## avgSchoolRating                    0.08          0.08      0.11        0.46
## avgSchoolSize                      0.03          0.03      0.34        0.10
## MedianStudentsPerTeacher           0.07          0.07      0.07        0.35
## numOfBathrooms                     0.12          0.12      0.52        0.34
## numOfBedrooms                      0.09          0.09      0.24        0.26
## numOfStories                       0.06          0.06      0.41        0.18
##                            numPriceChanges latest_saleyear
## zipcode                              -0.05           -0.04
## latitude                              0.00           -0.04
## longitude                            -0.04           -0.02
## propertyTaxRate                      -0.02           -0.01
## garageSpaces                          0.10            0.34
## parkingSpaces                         0.10            0.34
## yearBuilt                            -0.05            0.01
## latestPrice                          -0.05            0.13
## numPriceChanges                       1.00            0.00
## latest_saleyear                       0.00            1.00
## numOfAccessibilityFeatures           -0.04            0.08
## numOfAppliances                       0.07            0.09
## numOfParkingFeatures                  0.11            0.25
## numOfPatioAndPorchFeatures           -0.03            0.53
## numOfSecurityFeatures                -0.03            0.41
## numOfWaterfrontFeatures               0.01            0.04
## numOfWindowFeatures                   0.00            0.26
## numOfCommunityFeatures                0.04            0.05
## lotSizeSqFt                           0.01           -0.05
## livingAreaSqFt                        0.09           -0.03
## numOfPrimarySchools                  -0.02            0.03
## numOfElementarySchools               -0.01           -0.04
## numOfMiddleSchools                    0.01           -0.03
## numOfHighSchools                     -0.05           -0.01
## avgSchoolDistance                    -0.03            0.00
## avgSchoolRating                       0.02           -0.03
## avgSchoolSize                        -0.04           -0.03
## MedianStudentsPerTeacher              0.03            0.00
## numOfBathrooms                        0.08            0.00
## numOfBedrooms                         0.07           -0.01
## numOfStories                          0.06           -0.03
##                            numOfAccessibilityFeatures numOfAppliances
## zipcode                                         -0.03            0.03
## latitude                                         0.00           -0.01
## longitude                                        0.03           -0.01
## propertyTaxRate                                 -0.01           -0.01
## garageSpaces                                     0.05            0.15
## parkingSpaces                                    0.05            0.15
## yearBuilt                                        0.03            0.09
## latestPrice                                      0.05            0.04
## numPriceChanges                                 -0.04            0.07
## latest_saleyear                                  0.08            0.09
## numOfAccessibilityFeatures                       1.00            0.03
## numOfAppliances                                  0.03            1.00
## numOfParkingFeatures                             0.05            0.16
## numOfPatioAndPorchFeatures                       0.11            0.15
## numOfSecurityFeatures                            0.05            0.17
## numOfWaterfrontFeatures                          0.00           -0.02
## numOfWindowFeatures                             -0.02            0.10
## numOfCommunityFeatures                           0.02            0.08
## lotSizeSqFt                                     -0.05           -0.06
## livingAreaSqFt                                  -0.02            0.03
## numOfPrimarySchools                              0.01            0.01
## numOfElementarySchools                           0.00           -0.01
## numOfMiddleSchools                              -0.01            0.01
## numOfHighSchools                                -0.01           -0.01
## avgSchoolDistance                               -0.02            0.00
## avgSchoolRating                                 -0.01            0.02
## avgSchoolSize                                   -0.01            0.04
## MedianStudentsPerTeacher                        -0.01            0.02
## numOfBathrooms                                  -0.01            0.08
## numOfBedrooms                                   -0.04            0.01
## numOfStories                                     0.03            0.06
##                            numOfParkingFeatures numOfPatioAndPorchFeatures
## zipcode                                   -0.08                      -0.02
## latitude                                   0.19                      -0.03
## longitude                                 -0.01                      -0.08
## propertyTaxRate                            0.31                      -0.01
## garageSpaces                               0.67                       0.24
## parkingSpaces                              0.67                       0.24
## yearBuilt                                  0.00                       0.02
## latestPrice                                0.11                       0.17
## numPriceChanges                            0.11                      -0.03
## latest_saleyear                            0.25                       0.53
## numOfAccessibilityFeatures                 0.05                       0.11
## numOfAppliances                            0.16                       0.15
## numOfParkingFeatures                       1.00                       0.14
## numOfPatioAndPorchFeatures                 0.14                       1.00
## numOfSecurityFeatures                      0.11                       0.50
## numOfWaterfrontFeatures                    0.01                       0.03
## numOfWindowFeatures                        0.04                       0.34
## numOfCommunityFeatures                     0.04                       0.07
## lotSizeSqFt                                0.01                       0.00
## livingAreaSqFt                             0.09                       0.05
## numOfPrimarySchools                        0.02                       0.01
## numOfElementarySchools                    -0.02                      -0.02
## numOfMiddleSchools                         0.01                      -0.02
## numOfHighSchools                          -0.02                      -0.04
## avgSchoolDistance                         -0.05                       0.00
## avgSchoolRating                            0.12                       0.04
## avgSchoolSize                              0.04                       0.03
## MedianStudentsPerTeacher                   0.04                       0.06
## numOfBathrooms                             0.08                       0.04
## numOfBedrooms                              0.05                       0.04
## numOfStories                               0.08                      -0.01
##                            numOfSecurityFeatures numOfWaterfrontFeatures
## zipcode                                     0.00                   -0.01
## latitude                                   -0.02                   -0.02
## longitude                                  -0.06                   -0.02
## propertyTaxRate                             0.03                   -0.01
## garageSpaces                                0.17                    0.02
## parkingSpaces                               0.17                    0.02
## yearBuilt                                   0.14                   -0.02
## latestPrice                                 0.11                    0.04
## numPriceChanges                            -0.03                    0.01
## latest_saleyear                             0.41                    0.04
## numOfAccessibilityFeatures                  0.05                    0.00
## numOfAppliances                             0.17                   -0.02
## numOfParkingFeatures                        0.11                    0.01
## numOfPatioAndPorchFeatures                  0.50                    0.03
## numOfSecurityFeatures                       1.00                   -0.01
## numOfWaterfrontFeatures                    -0.01                    1.00
## numOfWindowFeatures                         0.41                   -0.01
## numOfCommunityFeatures                      0.03                    0.00
## lotSizeSqFt                                -0.02                    0.08
## livingAreaSqFt                              0.10                   -0.01
## numOfPrimarySchools                         0.00                    0.01
## numOfElementarySchools                     -0.02                   -0.01
## numOfMiddleSchools                         -0.01                    0.00
## numOfHighSchools                           -0.02                    0.00
## avgSchoolDistance                           0.02                    0.00
## avgSchoolRating                             0.04                    0.00
## avgSchoolSize                               0.03                    0.00
## MedianStudentsPerTeacher                    0.03                    0.02
## numOfBathrooms                              0.10                   -0.02
## numOfBedrooms                               0.07                    0.01
## numOfStories                                0.06                   -0.03
##                            numOfWindowFeatures numOfCommunityFeatures
## zipcode                                   0.00                  -0.01
## latitude                                  0.00                   0.02
## longitude                                -0.13                  -0.03
## propertyTaxRate                           0.01                   0.03
## garageSpaces                              0.03                   0.05
## parkingSpaces                             0.03                   0.02
## yearBuilt                                 0.04                   0.02
## latestPrice                               0.14                   0.04
## numPriceChanges                           0.00                   0.04
## latest_saleyear                           0.26                   0.05
## numOfAccessibilityFeatures               -0.02                   0.02
## numOfAppliances                           0.10                   0.08
## numOfParkingFeatures                      0.04                   0.04
## numOfPatioAndPorchFeatures                0.34                   0.07
## numOfSecurityFeatures                     0.41                   0.03
## numOfWaterfrontFeatures                  -0.01                   0.00
## numOfWindowFeatures                       1.00                   0.01
## numOfCommunityFeatures                    0.01                   1.00
## lotSizeSqFt                               0.05                   0.04
## livingAreaSqFt                            0.15                   0.04
## numOfPrimarySchools                      -0.02                  -0.04
## numOfElementarySchools                    0.02                  -0.01
## numOfMiddleSchools                        0.00                   0.06
## numOfHighSchools                         -0.05                  -0.03
## avgSchoolDistance                         0.03                   0.00
## avgSchoolRating                           0.13                   0.04
## avgSchoolSize                             0.11                  -0.01
## MedianStudentsPerTeacher                  0.13                   0.01
## numOfBathrooms                            0.08                   0.03
## numOfBedrooms                             0.09                   0.02
## numOfStories                              0.05                  -0.02
##                            lotSizeSqFt livingAreaSqFt numOfPrimarySchools
## zipcode                           0.12          -0.01               -0.05
## latitude                          0.18           0.14               -0.07
## longitude                        -0.25          -0.43                0.12
## propertyTaxRate                   0.09           0.19                0.03
## garageSpaces                      0.06           0.13               -0.01
## parkingSpaces                     0.06           0.13               -0.01
## yearBuilt                        -0.15           0.41               -0.06
## latestPrice                       0.31           0.48               -0.14
## numPriceChanges                   0.01           0.09               -0.02
## latest_saleyear                  -0.05          -0.03                0.03
## numOfAccessibilityFeatures       -0.05          -0.02                0.01
## numOfAppliances                  -0.06           0.03                0.01
## numOfParkingFeatures              0.01           0.09                0.02
## numOfPatioAndPorchFeatures        0.00           0.05                0.01
## numOfSecurityFeatures            -0.02           0.10                0.00
## numOfWaterfrontFeatures           0.08          -0.01                0.01
## numOfWindowFeatures               0.05           0.15               -0.02
## numOfCommunityFeatures            0.04           0.04               -0.04
## lotSizeSqFt                       1.00           0.38               -0.20
## livingAreaSqFt                    0.38           1.00               -0.15
## numOfPrimarySchools              -0.20          -0.15                1.00
## numOfElementarySchools            0.14           0.11               -0.81
## numOfMiddleSchools                0.15           0.14               -0.41
## numOfHighSchools                 -0.17          -0.09                0.45
## avgSchoolDistance                 0.00           0.16                0.05
## avgSchoolRating                   0.29           0.54               -0.19
## avgSchoolSize                     0.13           0.44               -0.03
## MedianStudentsPerTeacher          0.23           0.41               -0.01
## numOfBathrooms                    0.13           0.75               -0.13
## numOfBedrooms                     0.34           0.69               -0.10
## numOfStories                     -0.10           0.49               -0.05
##                            numOfElementarySchools numOfMiddleSchools
## zipcode                                      0.06              -0.01
## latitude                                     0.13              -0.02
## longitude                                    0.00              -0.20
## propertyTaxRate                             -0.04               0.00
## garageSpaces                                 0.00               0.01
## parkingSpaces                                0.00               0.01
## yearBuilt                                    0.04               0.08
## latestPrice                                  0.11               0.12
## numPriceChanges                             -0.01               0.01
## latest_saleyear                             -0.04              -0.03
## numOfAccessibilityFeatures                   0.00              -0.01
## numOfAppliances                             -0.01               0.01
## numOfParkingFeatures                        -0.02               0.01
## numOfPatioAndPorchFeatures                  -0.02              -0.02
## numOfSecurityFeatures                       -0.02              -0.01
## numOfWaterfrontFeatures                     -0.01               0.00
## numOfWindowFeatures                          0.02               0.00
## numOfCommunityFeatures                      -0.01               0.06
## lotSizeSqFt                                  0.14               0.15
## livingAreaSqFt                               0.11               0.14
## numOfPrimarySchools                         -0.81              -0.41
## numOfElementarySchools                       1.00               0.31
## numOfMiddleSchools                           0.31               1.00
## numOfHighSchools                            -0.22              -0.36
## avgSchoolDistance                           -0.04               0.08
## avgSchoolRating                              0.11               0.15
## avgSchoolSize                                0.07              -0.03
## MedianStudentsPerTeacher                    -0.01              -0.03
## numOfBathrooms                               0.10               0.11
## numOfBedrooms                                0.08               0.10
## numOfStories                                 0.06               0.05
##                            numOfHighSchools avgSchoolDistance avgSchoolRating
## zipcode                                0.08              0.05            0.05
## latitude                               0.11              0.01            0.27
## longitude                              0.38             -0.04           -0.55
## propertyTaxRate                       -0.03             -0.03            0.21
## garageSpaces                          -0.03              0.02            0.08
## parkingSpaces                         -0.02              0.02            0.08
## yearBuilt                              0.10              0.32            0.11
## latestPrice                           -0.23             -0.11            0.46
## numPriceChanges                       -0.05             -0.03            0.02
## latest_saleyear                       -0.01              0.00           -0.03
## numOfAccessibilityFeatures            -0.01             -0.02           -0.01
## numOfAppliances                       -0.01              0.00            0.02
## numOfParkingFeatures                  -0.02             -0.05            0.12
## numOfPatioAndPorchFeatures            -0.04              0.00            0.04
## numOfSecurityFeatures                 -0.02              0.02            0.04
## numOfWaterfrontFeatures                0.00              0.00            0.00
## numOfWindowFeatures                   -0.05              0.03            0.13
## numOfCommunityFeatures                -0.03              0.00            0.04
## lotSizeSqFt                           -0.17              0.00            0.29
## livingAreaSqFt                        -0.09              0.16            0.54
## numOfPrimarySchools                    0.45              0.05           -0.19
## numOfElementarySchools                -0.22             -0.04            0.11
## numOfMiddleSchools                    -0.36              0.08            0.15
## numOfHighSchools                       1.00              0.18           -0.21
## avgSchoolDistance                      0.18              1.00            0.08
## avgSchoolRating                       -0.21              0.08            1.00
## avgSchoolSize                         -0.07              0.28            0.63
## MedianStudentsPerTeacher              -0.27              0.08            0.76
## numOfBathrooms                        -0.07              0.13            0.35
## numOfBedrooms                         -0.04              0.11            0.30
## numOfStories                          -0.03              0.09            0.22
##                            avgSchoolSize MedianStudentsPerTeacher
## zipcode                             0.12                     0.11
## latitude                            0.02                    -0.02
## longitude                          -0.43                    -0.60
## propertyTaxRate                     0.18                    -0.03
## garageSpaces                        0.03                     0.07
## parkingSpaces                       0.03                     0.07
## yearBuilt                           0.34                     0.07
## latestPrice                         0.10                     0.35
## numPriceChanges                    -0.04                     0.03
## latest_saleyear                    -0.03                     0.00
## numOfAccessibilityFeatures         -0.01                    -0.01
## numOfAppliances                     0.04                     0.02
## numOfParkingFeatures                0.04                     0.04
## numOfPatioAndPorchFeatures          0.03                     0.06
## numOfSecurityFeatures               0.03                     0.03
## numOfWaterfrontFeatures             0.00                     0.02
## numOfWindowFeatures                 0.11                     0.13
## numOfCommunityFeatures             -0.01                     0.01
## lotSizeSqFt                         0.13                     0.23
## livingAreaSqFt                      0.44                     0.41
## numOfPrimarySchools                -0.03                    -0.01
## numOfElementarySchools              0.07                    -0.01
## numOfMiddleSchools                 -0.03                    -0.03
## numOfHighSchools                   -0.07                    -0.27
## avgSchoolDistance                   0.28                     0.08
## avgSchoolRating                     0.63                     0.76
## avgSchoolSize                       1.00                     0.66
## MedianStudentsPerTeacher            0.66                     1.00
## numOfBathrooms                      0.32                     0.27
## numOfBedrooms                       0.29                     0.23
## numOfStories                        0.24                     0.16
##                            numOfBathrooms numOfBedrooms numOfStories
## zipcode                             -0.03          0.04        -0.03
## latitude                             0.05          0.09         0.01
## longitude                           -0.31         -0.28        -0.18
## propertyTaxRate                      0.12          0.15         0.07
## garageSpaces                         0.12          0.09         0.06
## parkingSpaces                        0.12          0.09         0.06
## yearBuilt                            0.52          0.24         0.41
## latestPrice                          0.34          0.26         0.18
## numPriceChanges                      0.08          0.07         0.06
## latest_saleyear                      0.00         -0.01        -0.03
## numOfAccessibilityFeatures          -0.01         -0.04         0.03
## numOfAppliances                      0.08          0.01         0.06
## numOfParkingFeatures                 0.08          0.05         0.08
## numOfPatioAndPorchFeatures           0.04          0.04        -0.01
## numOfSecurityFeatures                0.10          0.07         0.06
## numOfWaterfrontFeatures             -0.02          0.01        -0.03
## numOfWindowFeatures                  0.08          0.09         0.05
## numOfCommunityFeatures               0.03          0.02        -0.02
## lotSizeSqFt                          0.13          0.34        -0.10
## livingAreaSqFt                       0.75          0.69         0.49
## numOfPrimarySchools                 -0.13         -0.10        -0.05
## numOfElementarySchools               0.10          0.08         0.06
## numOfMiddleSchools                   0.11          0.10         0.05
## numOfHighSchools                    -0.07         -0.04        -0.03
## avgSchoolDistance                    0.13          0.11         0.09
## avgSchoolRating                      0.35          0.30         0.22
## avgSchoolSize                        0.32          0.29         0.24
## MedianStudentsPerTeacher             0.27          0.23         0.16
## numOfBathrooms                       1.00          0.55         0.67
## numOfBedrooms                        0.55          1.00         0.29
## numOfStories                         0.67          0.29         1.00

The only pairwise correlation above 0.8 is the one between parkingSpaces and garageSpaces. Recall that parkingSpaces is the total number of parking spots, while garageSpaces counts only garage spaces and is therefore a subset of parkingSpaces. parkingSpaces may also include additional parking spaces provided by common areas.

At first glance, parkingSpaces and garageSpaces look like the same variable. They are in fact equal in almost all observations; they differ in only 0.16% of the dataset.

#garageSpaces is not always the same as parkingSpaces
spaces_different = housing_data_train$parkingSpaces != housing_data_train$garageSpaces

# Proportion of observations where parkingSpaces is different than garageSpaces
length(spaces_different[spaces_different == TRUE]) / length(spaces_different)

## [1] 0.001604
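The same proportion can be computed more compactly, because the mean of a logical vector is the share of TRUE values. The snippet below is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the Austin data).

```r
# Hypothetical toy data, for illustration only
toy = data.frame(
  parkingSpaces = c(2, 2, 0, 3, 1),
  garageSpaces  = c(2, 2, 0, 2, 1)
)

# TRUE where the two counts disagree
spaces_differ = toy$parkingSpaces != toy$garageSpaces

# The mean of a logical vector is the proportion of TRUE values
prop_differ = mean(spaces_differ)
prop_differ
```

Note that `mean()` returns NA if either column has missing values, unless `na.rm = TRUE` is passed.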

Therefore, we’ll eliminate the garageSpaces variable from the dataset.

housing_data_train = subset(housing_data_train, select = -c(garageSpaces))

We’ll further look for multicollinearity with the remaining variables.

num_cols = unlist(lapply(housing_data_train, is.numeric))
housing_data_numerical = housing_data_train[, num_cols]
coll_matrix = round(cor(housing_data_numerical), 2)

This matrix is extensive, and it may be easy to miss high values. So let’s use a function to look at the highest values in the matrix.

# Look at the top values from coll_matrix that are different than 1: 
nlargest(coll_matrix, n = 45)$values[nlargest(coll_matrix, n = 45)$values < 1]

##  [1] 0.76 0.75 0.69 0.67 0.67 0.66 0.63 0.55 0.54 0.53 0.52 0.50 0.49 0.48 0.48
## [16] 0.46 0.45 0.44 0.41 0.41 0.41 0.41 0.41 0.38 0.38 0.36 0.35 0.35 0.34 0.34
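A list of values alone does not say which variables are involved, and sorting by raw value hides strong negative correlations. As a rough sketch of an alternative, the code below lists variable pairs ordered by absolute correlation using base R only. The data frame `toy` and the names `cor_pairs` and `top_pairs` are illustrative assumptions, not part of our analysis.

```r
set.seed(42)
n = 200
x1 = rnorm(n)
# Simulated toy data (not the housing data)
toy = data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.3),   # strongly positively correlated with x1
  x3 = -x1 + rnorm(n, sd = 0.3),  # strongly negatively correlated with x1
  x4 = rnorm(n)                   # unrelated noise
)

cor_mat = cor(toy)

# Look only above the diagonal so each pair appears once and the 1s are skipped
idx = which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs = data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort by absolute value so large negative correlations are not missed
top_pairs = cor_pairs[order(abs(cor_pairs$r), decreasing = TRUE), ]
head(top_pairs, 3)
```

Applied to coll_matrix, the same approach would return the variable names next to each value.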

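The largest positive pairwise correlation is now 0.76, below our 0.8 threshold. Note that this check only surfaces the largest values, so strong negative correlations, such as the -0.81 between numOfPrimarySchools and numOfElementarySchools in the matrix above, do not appear in it. Pairwise correlations can also miss dependencies that involve several variables at once, so we'll compute variance inflation factors (VIF) on the full model as a further check for multicollinearity.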

housing_data_model = lm(latestPrice ~ ., data = housing_data_train)

vif = vif(housing_data_model)
sort(vif[which(vif > 5)], decreasing = TRUE)

##      homeTypeSingle Family              homeTypeCondo 
##                    164.488                    115.289 
##          homeTypeTownhouse homeTypeMultiple Occupancy 
##                     36.950                     11.364 
##        homeTypeResidential              hasGarageTRUE 
##                      5.812                      5.369

The homeType predictor will be key in our analysis, so we’ve decided to keep it in the model despite the large VIF values of its dummy variables. However, the variable hasGarage shows a VIF greater than 5, which may be a concern. What proportion of the observed variation in hasGarage is explained by a linear relationship with the other predictors?

summary(lm(hasGarage ~ . - latestPrice, data = housing_data_train))$r.squared

## [1] 0.8137
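This $R^2$ is tied directly to the VIF: for predictor $j$, $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the others. The sketch below checks the identity on simulated data (the `toy` data frame and its variables are assumptions made for illustration) and then applies the same arithmetic to the 0.8137 reported above.

```r
set.seed(1)
n = 100
x1 = rnorm(n)
x2 = 0.9 * x1 + rnorm(n, sd = 0.4)   # deliberately correlated with x1
x3 = rnorm(n)
y  = 1 + x1 + x2 + x3 + rnorm(n)
toy = data.frame(y, x1, x2, x3)       # simulated data, not the housing data

# R^2 from regressing x2 on the other predictors (the response y is excluded)
r2_x2  = summary(lm(x2 ~ . - y, data = toy))$r.squared
vif_x2 = 1 / (1 - r2_x2)

# Cross-check: VIFs are the diagonal of the inverse predictor correlation matrix
vif_check = unname(diag(solve(cor(toy[, c("x1", "x2", "x3")])))[2])

# Same arithmetic using the R^2 reported for hasGarage
vif_hasGarage = 1 / (1 - 0.8137)
vif_hasGarage
```

The last value comes out at roughly 5.37, in line with the VIF of 5.369 reported for hasGarageTRUE; the small gap is due to rounding of the $R^2$.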

housing_data_model_non_significant = lm(latestPrice ~ . - hasGarage, data = housing_data_train)
vif_non_significant = vif(housing_data_model_non_significant)
vif_non_significant[which(vif_non_significant > 5)]

##              homeTypeCondo homeTypeMultiple Occupancy 
##                    112.768                     11.167 
##        homeTypeResidential      homeTypeSingle Family 
##                      5.664                    161.098 
##          homeTypeTownhouse 
##                     36.195

#Finally, compare both models
(anova_results = anova(housing_data_model, housing_data_model_non_significant))

## Analysis of Variance Table
## 
## Model 1: latestPrice ~ zipcode + latitude + longitude + propertyTaxRate + 
##     hasAssociation + hasCooling + hasGarage + hasHeating + hasSpa + 
##     hasView + homeType + parkingSpaces + yearBuilt + numPriceChanges + 
##     latest_saleyear + numOfAccessibilityFeatures + numOfAppliances + 
##     numOfParkingFeatures + numOfPatioAndPorchFeatures + numOfSecurityFeatures + 
##     numOfWaterfrontFeatures + numOfWindowFeatures + numOfCommunityFeatures + 
##     lotSizeSqFt + livingAreaSqFt + numOfPrimarySchools + numOfElementarySchools + 
##     numOfMiddleSchools + numOfHighSchools + avgSchoolDistance + 
##     avgSchoolRating + avgSchoolSize + MedianStudentsPerTeacher + 
##     numOfBathrooms + numOfBedrooms + numOfStories
## Model 2: latestPrice ~ (zipcode + latitude + longitude + propertyTaxRate + 
##     hasAssociation + hasCooling + hasGarage + hasHeating + hasSpa + 
##     hasView + homeType + parkingSpaces + yearBuilt + numPriceChanges + 
##     latest_saleyear + numOfAccessibilityFeatures + numOfAppliances + 
##     numOfParkingFeatures + numOfPatioAndPorchFeatures + numOfSecurityFeatures + 
##     numOfWaterfrontFeatures + numOfWindowFeatures + numOfCommunityFeatures + 
##     lotSizeSqFt + livingAreaSqFt + numOfPrimarySchools + numOfElementarySchools + 
##     numOfMiddleSchools + numOfHighSchools + avgSchoolDistance + 
##     avgSchoolRating + avgSchoolSize + MedianStudentsPerTeacher + 
##     numOfBathrooms + numOfBedrooms + numOfStories) - hasGarage
##   Res.Df      RSS Df Sum of Sq    F Pr(>F)
## 1   3075 3.71e+13                         
## 2   3076 3.71e+13 -1 -79148314 0.01   0.94

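When we compare the model with all predictors against one that does not include the hasGarage predictor, the p-value is 0.94, which is far from significant, so we fail to reject the null hypothesis that the hasGarage coefficient is zero. In other words, hasGarage adds little once the other predictors are in the model. For now, we’ll continue the analysis with the hasGarage predictor still in the dataset.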
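The comparison above is a nested-model F-test. As a minimal sketch of what anova() computes, the code below rebuilds the F statistic and p-value by hand from the residual sums of squares of two nested models. The data are simulated and the variable names are hypothetical.

```r
set.seed(7)
n = 120
x1 = rnorm(n)
x2 = rnorm(n)
y  = 2 + 1.5 * x1 + rnorm(n)   # x2 has no true effect in this simulation
toy = data.frame(y, x1, x2)

full    = lm(y ~ x1 + x2, data = toy)
reduced = lm(y ~ x1, data = toy)

rss_full    = sum(resid(full)^2)
rss_reduced = sum(resid(reduced)^2)
df_diff     = df.residual(reduced) - df.residual(full)   # number of dropped terms

# F = (drop in RSS per dropped term) / (residual variance of the full model)
f_stat  = ((rss_reduced - rss_full) / df_diff) / (rss_full / df.residual(full))
p_value = pf(f_stat, df_diff, df.residual(full), lower.tail = FALSE)

# anova() reports the same two numbers in its second row
anova_tab = anova(reduced, full)
c(f_stat, p_value)
```

A large p-value means the extra term does not reduce the residual sum of squares by more than chance would suggest.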

3. Model Building

Now that we’ve cleaned the dataset by removing outliers and predictors that may not be helpful, we’ll start looking at options to build an optimal model. We will build an additive model and an interaction model, then perform model selection and diagnostics to choose the model that best represents our dataset and purpose.

3.1 Additive model

Let’s build an additive model with all available predictors.

model_measures = function(models, names){

  df = data.frame(matrix(nrow = 4, ncol = length(names)))
  # assign row names
  rownames(df) = c("Number Of Predictors", "RSquare", "Adj. RSquare", "LOOCV_RMSE")
  # assign column names
  colnames(df) = names
  
  for(i in seq_along(models)) { 
    model = models[[i]]
    
    # number of estimated coefficients, including the intercept
    num_of_predictors = length(coef(model))
    rsquare = summary(model)$r.squared
    adj_rsquare = summary(model)$adj.r.squared
    # LOOCV RMSE from the leverage shortcut; Inf if any leverage equals 1
    loocv_rmse = sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
  
    # values must follow the same order as the row names above
    df[i] =  c(num_of_predictors, rsquare, adj_rsquare, loocv_rmse)
  }
  knitr::kable(df, "simple")
}

housing_data_model = lm(latestPrice ~ ., data = housing_data_train)
model_measures(list(housing_data_model), c("Additive Model"))

	Additive Model
Number Of Predictors	43.0000
RSquare	0.5565
Adj. RSquare	0.5505
LOOCV_RMSE	Inf
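The LOOCV_RMSE row relies on a shortcut: for linear regression, the leave-one-out prediction error of observation $i$ equals $e_i / (1 - h_i)$, where $e_i$ is the ordinary residual and $h_i$ the leverage, so no refitting is needed. The sketch below compares the shortcut with an explicit leave-one-out loop on simulated data (`toy` is a hypothetical data frame).

```r
set.seed(3)
n = 40
toy = data.frame(x = rnorm(n))        # simulated data, not the housing data
toy$y = 1 + 2 * toy$x + rnorm(n)
fit = lm(y ~ x, data = toy)

# Shortcut: residuals rescaled by (1 - leverage)
loocv_shortcut = sqrt(mean((resid(fit) / (1 - hatvalues(fit)))^2))

# Explicit version: refit without observation i, then predict observation i
loo_errors = sapply(seq_len(n), function(i) {
  fit_i = lm(y ~ x, data = toy[-i, ])
  toy$y[i] - predict(fit_i, newdata = toy[i, , drop = FALSE])
})
loocv_explicit = sqrt(mean(loo_errors^2))

c(loocv_shortcut, loocv_explicit)
```

The shortcut divides by $1 - h_i$, so it returns Inf whenever an observation has leverage exactly 1. One plausible cause in our full model is a factor level, such as a homeType category, that contains a single observation; we have not verified this in the data.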

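We will find the predictors whose coefficients are significant at $\alpha = 0.05$.

alpha = 0.05
p_values = summary(housing_data_model)$coefficients[, "Pr(>|t|)"]
variables_significant = p_values < alpha
# Keep the names of the significant coefficients, dropping the intercept by name
variableNames_significant = setdiff(names(which(variables_significant)), "(Intercept)")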

predictors = paste(variableNames_significant, collapse = "+")
predictors

## [1] "zipcode+longitude+propertyTaxRate+hasAssociationTRUE+homeTypeMultiFamily+yearBuilt+numPriceChanges+latest_saleyear+numOfAccessibilityFeatures+numOfAppliances+numOfPatioAndPorchFeatures+numOfWaterfrontFeatures+lotSizeSqFt+livingAreaSqFt+numOfPrimarySchools+numOfElementarySchools+numOfHighSchools+avgSchoolDistance+avgSchoolRating+avgSchoolSize+numOfBathrooms+numOfBedrooms"

housing_data_model_significant =  lm(
  latestPrice ~ zipcode + longitude + propertyTaxRate + hasAssociation + yearBuilt +
    numPriceChanges + latest_saleyear + numOfPatioAndPorchFeatures + lotSizeSqFt +
    livingAreaSqFt + numOfPrimarySchools + numOfElementarySchools + numOfHighSchools +
    avgSchoolDistance + avgSchoolRating + avgSchoolSize + numOfBathrooms + numOfBedrooms,
  data = housing_data_train
)
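The names returned by the significance filter are coefficient names (for example hasAssociationTRUE or homeTypeMultiFamily), not column names, so they cannot be pasted into a formula directly. One possible way to automate the step, sketched below on simulated data, is to map each coefficient back to its model term through the "assign" attribute of the model matrix and then build the formula with reformulate(). The data frame `toy` and every variable in it are hypothetical.

```r
set.seed(11)
n = 150
toy = data.frame(                       # simulated data, not the housing data
  size   = rnorm(n, mean = 2000, sd = 400),
  hasSpa = sample(c(TRUE, FALSE), n, replace = TRUE),
  noise  = rnorm(n)
)
toy$price = 50000 + 150 * toy$size + 20000 * toy$hasSpa + rnorm(n, sd = 20000)
fit = lm(price ~ ., data = toy)

# Significant coefficient names, e.g. "hasSpaTRUE" rather than "hasSpa"
p_values  = summary(fit)$coefficients[, "Pr(>|t|)"]
sig_coefs = setdiff(names(p_values)[p_values < 0.05], "(Intercept)")

# "assign" gives, for each coefficient, the index of its term (0 = intercept)
assign_idx  = attr(model.matrix(fit), "assign")
term_labels = attr(terms(fit), "term.labels")
coef_terms  = setNames(c("(Intercept)", term_labels)[assign_idx + 1],
                       names(coef(fit)))

# Unique terms behind the significant coefficients, then a formula from them
sig_terms   = unique(unname(coef_terms[sig_coefs]))
sig_formula = reformulate(sig_terms, response = "price")
fit_sig     = lm(sig_formula, data = toy)
```

With this mapping, a factor is kept as a whole as soon as any one of its levels is significant, which is a design choice rather than the only reasonable rule.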

anova(housing_data_model, housing_data_model_significant)

## Analysis of Variance Table
## 
## Model 1: latestPrice ~ zipcode + latitude + longitude + propertyTaxRate + 
##     hasAssociation + hasCooling + hasGarage + hasHeating + hasSpa + 
##     hasView + homeType + parkingSpaces + yearBuilt + numPriceChanges + 
##     latest_saleyear + numOfAccessibilityFeatures + numOfAppliances + 
##     numOfParkingFeatures + numOfPatioAndPorchFeatures + numOfSecurityFeatures + 
##     numOfWaterfrontFeatures + numOfWindowFeatures + numOfCommunityFeatures + 
##     lotSizeSqFt + livingAreaSqFt + numOfPrimarySchools + numOfElementarySchools + 
##     numOfMiddleSchools + numOfHighSchools + avgSchoolDistance + 
##     avgSchoolRating + avgSchoolSize + MedianStudentsPerTeacher + 
##     numOfBathrooms + numOfBedrooms + numOfStories
## Model 2: latestPrice ~ zipcode + longitude + propertyTaxRate + hasAssociation + 
##     yearBuilt + numPriceChanges + latest_saleyear + numOfPatioAndPorchFeatures + 
##     lotSizeSqFt + livingAreaSqFt + numOfPrimarySchools + numOfElementarySchools + 
##     numOfHighSchools + avgSchoolDistance + avgSchoolRating + 
##     avgSchoolSize + numOfBathrooms + numOfBedrooms
##   Res.Df      RSS  Df     Sum of Sq    F  Pr(>F)    
## 1   3075 3.71e+13                                   
## 2   3099 3.78e+13 -24 -699422042303 2.41 0.00014 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model_measures(
  list(housing_data_model, housing_data_model_significant),
  c("All predictors", "Significant predictors")
)

	All predictors	Significant predictors
Number Of Predictors	43.0000	19.0000
RSquare	0.5565	0.5482
Adj. RSquare	0.5505	0.5456
LOOCV_RMSE	Inf	110989.6549

housing_data_model = housing_data_model_significant

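From the $R^2$ value, about 54.82% of the variation in latestPrice is explained by this model, which has 19 coefficients (18 predictors plus the intercept) rather than the 43 of the full additive model. Next, we will try to find a “better” model with $R^2$ greater than 0.5482, or adjusted $R^2$ greater than 0.5456.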

We’ll also try to lower the LOOCV_RMSE below 110989. Next, we’ll investigate how well backward AIC and BIC selection perform on the additive model.

## Additive model AIC and BIC

housing_data_model_aic = step(housing_data_model, direction = "backward", trace = 0)
extractAIC(housing_data_model_aic) # returns both p and AIC

## [1]    19 72436

summary(housing_data_model_aic)$adj.r.squared

## [1] 0.5456

housing_data_model_bic = step(
  housing_data_model,
  direction = "backward",
  trace = 0,
  k = log(nrow(housing_data_train))  # BIC penalty: log(n)
)
extractAIC(housing_data_model_bic)  # returns both p and AIC

## [1]    19 72436

summary(housing_data_model_bic)$adj.r.squared

## [1] 0.5456
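The only difference between the two calls above is the penalty per parameter: step() uses k = 2 for AIC by default, and k = log(n) turns it into BIC. The sketch below illustrates this on simulated data with two pure-noise predictors (all names are hypothetical). It also passes the same k to extractAIC(), since without it extractAIC() reports AIC even for a BIC-selected model.

```r
set.seed(5)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y = 3 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)   # x3 and x4 are pure noise
full = lm(y ~ ., data = toy)

# Backward selection with the AIC penalty (k = 2, the default)
fit_aic = step(full, direction = "backward", trace = 0)

# Backward selection with the BIC penalty (k = log(n))
fit_bic = step(full, direction = "backward", trace = 0, k = log(n))

# extractAIC() returns c(edf, criterion); pass k = log(n) to report BIC
aic_out = extractAIC(fit_aic)
bic_out = extractAIC(fit_bic, k = log(n))
names(coef(fit_aic))
names(coef(fit_bic))
```

Because log(n) exceeds 2 for any n above 7, BIC penalizes extra terms more heavily and tends to select models that are the same size or smaller.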

The adjusted $R^2$ values are 0.5456 for the AIC model, and 0.5456 for the BIC model. Neither procedure removes a term: both return the same 19-coefficient model we started from, so backward AIC and BIC selection do not improve on it.

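As another attempt, we’ll use an exhaustive best-subset search (regsubsets from the leaps package) and see if we can find a better model. The search reports the best model of each size, here for sizes one through nine.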

library(leaps)

housing_data_model_leaps = summary(regsubsets(latestPrice ~ ., data = housing_data_train))

## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 2 linear dependencies found

## Reordering variables and trying again:

housing_data_model_leaps$rss

## [1] 64146841620808 56161407030534 52094506040565 48713182226940 46559324402781
## [6] 44319773668574 42659684991101 41238981978781 40360747739034

housing_data_model_leaps$adjr2

## [1] 0.2338 0.3290 0.3774 0.4176 0.4432 0.4698 0.4895 0.5063 0.5167

housing_data_model_leaps_r2_index = which.max(housing_data_model_leaps$adjr2)
housing_data_model_leaps$which[housing_data_model_leaps_r2_index,]

##                   (Intercept)                       zipcode 
##                          TRUE                          TRUE 
##                      latitude                     longitude 
##                         FALSE                         FALSE 
##               propertyTaxRate            hasAssociationTRUE 
##                          TRUE                          TRUE 
##                hasCoolingTRUE                 hasGarageTRUE 
##                         FALSE                         FALSE 
##                hasHeatingTRUE                    hasSpaTRUE 
##                         FALSE                         FALSE 
##                   hasViewTRUE                 homeTypeCondo 
##                         FALSE                         FALSE 
## homeTypeMobile / Manufactured           homeTypeMultiFamily 
##                         FALSE                         FALSE 
##    homeTypeMultiple Occupancy                 homeTypeOther 
##                         FALSE                         FALSE 
##           homeTypeResidential         homeTypeSingle Family 
##                         FALSE                         FALSE 
##             homeTypeTownhouse           homeTypeVacant Land 
##                         FALSE                         FALSE 
##                 parkingSpaces                     yearBuilt 
##                         FALSE                          TRUE 
##               numPriceChanges               latest_saleyear 
##                          TRUE                          TRUE 
##    numOfAccessibilityFeatures               numOfAppliances 
##                         FALSE                         FALSE 
##          numOfParkingFeatures    numOfPatioAndPorchFeatures 
##                         FALSE                         FALSE 
##         numOfSecurityFeatures       numOfWaterfrontFeatures 
##                         FALSE                         FALSE 
##           numOfWindowFeatures        numOfCommunityFeatures 
##                         FALSE                         FALSE 
##                   lotSizeSqFt                livingAreaSqFt 
##                         FALSE                          TRUE 
##           numOfPrimarySchools        numOfElementarySchools 
##                         FALSE                         FALSE 
##            numOfMiddleSchools              numOfHighSchools 
##                         FALSE                         FALSE 
##             avgSchoolDistance               avgSchoolRating 
##                         FALSE                          TRUE 
##                 avgSchoolSize      MedianStudentsPerTeacher 
##                          TRUE                         FALSE 
##                numOfBathrooms                 numOfBedrooms 
##                         FALSE                         FALSE 
##                  numOfStories 
##                         FALSE
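To make the idea behind the search above concrete, here is a small base-R sketch of best-subset selection. It is not how leaps is implemented (leaps uses a much faster branch-and-bound algorithm); it simply fits every combination of a few hypothetical candidate predictors from the built-in mtcars data, keeps the lowest-RSS model of each size, and then picks the size with the highest adjusted $R^2$.

```r
predictors = c("wt", "hp", "qsec", "drat")  # illustrative candidates only

# For each model size, fit every combination and keep the one with the lowest RSS
best_by_size = lapply(seq_along(predictors), function(size) {
  combos = combn(predictors, size, simplify = FALSE)
  fits = lapply(combos, function(vars) {
    lm(reformulate(vars, response = "mpg"), data = mtcars)
  })
  rss_values = sapply(fits, function(fit) sum(resid(fit) ^ 2))
  fits[[which.min(rss_values)]]
})

rss = sapply(best_by_size, function(fit) sum(resid(fit) ^ 2))
adjr2 = sapply(best_by_size, function(fit) summary(fit)$adj.r.squared)

# RSS can only go down as size grows; adjusted R^2 penalizes the extra terms
best = best_by_size[[which.max(adjr2)]]
formula(best)
```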

housing_data_model_leaps_best = lm(
  latestPrice ~ zipcode + propertyTaxRate + hasAssociation + latest_saleyear +
    yearBuilt + numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize +
    livingAreaSqFt + avgSchoolRating,
  data = housing_data_train
)

anova(housing_data_model, housing_data_model_leaps_best)[2, "Pr(>F)"]

## [1] 6.387e-38

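From the anova results, the p-value (6.387e-38) is extremely small, so the null hypothesis can be rejected: the two models do not fit the data equally well. Note that the model above also includes numOfWaterfrontFeatures, which the subset search did not select. Taking the leaps model as our starting point, we shall continue with further model improvement techniques.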
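The p-value returned by anova comes from an F-test that compares the residual sums of squares of a smaller model and a larger model that contains it. The sketch below reproduces the statistic by hand on the built-in mtcars data (an illustrative assumption), which may help when reading the anova tables in this report.

```r
small = lm(mpg ~ wt, data = mtcars)
large = lm(mpg ~ wt + hp + qsec, data = mtcars)  # contains every term of `small`

rss_small = sum(resid(small) ^ 2)
rss_large = sum(resid(large) ^ 2)
df_diff = df.residual(small) - df.residual(large)  # number of extra coefficients

# F = (drop in RSS per extra coefficient) / (residual variance of the larger model)
f_stat = ((rss_small - rss_large) / df_diff) / (rss_large / df.residual(large))
p_value = pf(f_stat, df_diff, df.residual(large), lower.tail = FALSE)

comparison = anova(small, large)
c(manual = p_value, anova = comparison[2, "Pr(>F)"])
```

A small p-value means the extra terms of the larger model reduce the RSS by more than chance alone would explain.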

3.2. Interactive model

We’ll now build an interactive model that includes every two-way interaction between the predictors of the leaps model.

housing_data_model_interaction = lm(
  latestPrice ~ (
    zipcode + propertyTaxRate + hasAssociation + yearBuilt +
      numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize + livingAreaSqFt + latest_saleyear + avgSchoolRating
  ) ^ 2,
  data = housing_data_train
)

summary(housing_data_model_interaction)

## 
## Call:
## lm(formula = latestPrice ~ (zipcode + propertyTaxRate + hasAssociation + 
##     yearBuilt + numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize + 
##     livingAreaSqFt + latest_saleyear + avgSchoolRating)^2, data = housing_data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -510968  -63943   -5403   52983  497439 
## 
## Coefficients: (8 not defined because of singularities)
##                                             Estimate Std. Error t value
## (Intercept)                                 1.72e+09   1.62e+09    1.06
## zipcode                                     8.79e+03   1.86e+04    0.47
## propertyTaxRate                             3.82e+08   5.78e+08    0.66
## hasAssociationTRUE                          1.31e+07   2.88e+07    0.45
## yearBuilt                                  -1.14e+06   6.06e+05   -1.88
## numPriceChanges                             1.84e+07   4.12e+06    4.47
## numOfWaterfrontFeatures                     2.26e+09   1.37e+09    1.65
## avgSchoolSize                              -2.83e+05   3.70e+04   -7.63
## livingAreaSqFt                              8.38e+04   2.07e+04    4.05
## latest_saleyear                            -1.16e+06   3.08e+05   -3.76
## avgSchoolRating                            -8.76e+06   7.67e+06   -1.14
## zipcode:propertyTaxRate                    -5.38e+03   7.10e+03   -0.76
## zipcode:hasAssociationTRUE                  3.74e+02   3.22e+02    1.16
## zipcode:yearBuilt                          -1.72e+00   6.64e+00   -0.26
## zipcode:numPriceChanges                    -1.12e+02   4.43e+01   -2.53
## zipcode:numOfWaterfrontFeatures            -1.95e+04   1.14e+04   -1.71
## zipcode:avgSchoolSize                       4.13e+00   3.89e-01   10.62
## zipcode:livingAreaSqFt                     -1.11e+00   2.33e-01   -4.77
## zipcode:latest_saleyear                           NA         NA      NA
## zipcode:avgSchoolRating                     1.08e+02   8.39e+01    1.28
## propertyTaxRate:hasAssociationTRUE          1.75e+05   1.23e+05    1.43
## propertyTaxRate:yearBuilt                   1.65e+04   4.57e+03    3.62
## propertyTaxRate:numPriceChanges            -1.65e+03   1.59e+04   -0.10
## propertyTaxRate:numOfWaterfrontFeatures           NA         NA      NA
## propertyTaxRate:avgSchoolSize               8.88e+02   2.83e+02    3.13
## propertyTaxRate:livingAreaSqFt             -1.39e+02   5.97e+01   -2.33
## propertyTaxRate:latest_saleyear             4.14e+03   4.53e+04    0.09
## propertyTaxRate:avgSchoolRating            -1.49e+05   4.56e+04   -3.28
## hasAssociationTRUE:yearBuilt                2.14e+03   3.19e+02    6.71
## hasAssociationTRUE:numPriceChanges          3.40e+03   2.32e+03    1.47
## hasAssociationTRUE:numOfWaterfrontFeatures        NA         NA      NA
## hasAssociationTRUE:avgSchoolSize            1.03e+02   2.41e+01    4.25
## hasAssociationTRUE:livingAreaSqFt          -8.07e+01   1.01e+01   -8.03
## hasAssociationTRUE:latest_saleyear         -2.33e+04   6.70e+03   -3.49
## hasAssociationTRUE:avgSchoolRating         -8.16e+03   4.60e+03   -1.77
## yearBuilt:numPriceChanges                  -1.84e+01   5.21e+01   -0.35
## yearBuilt:numOfWaterfrontFeatures          -3.68e+05   2.45e+05   -1.50
## yearBuilt:avgSchoolSize                    -4.88e+00   6.16e-01   -7.93
## yearBuilt:livingAreaSqFt                   -2.95e-02   2.17e-01   -0.14
## yearBuilt:latest_saleyear                   6.15e+02   1.52e+02    4.04
## yearBuilt:avgSchoolRating                   2.15e+02   1.07e+02    2.01
## numPriceChanges:numOfWaterfrontFeatures           NA         NA      NA
## numPriceChanges:avgSchoolSize               4.84e+00   3.47e+00    1.39
## numPriceChanges:livingAreaSqFt             -4.58e+00   1.31e+00   -3.49
## numPriceChanges:latest_saleyear            -4.73e+03   9.73e+02   -4.86
## numPriceChanges:avgSchoolRating            -8.47e+02   6.11e+02   -1.39
## numOfWaterfrontFeatures:avgSchoolSize             NA         NA      NA
## numOfWaterfrontFeatures:livingAreaSqFt            NA         NA      NA
## numOfWaterfrontFeatures:latest_saleyear           NA         NA      NA
## numOfWaterfrontFeatures:avgSchoolRating           NA         NA      NA
## avgSchoolSize:livingAreaSqFt                4.66e-02   1.63e-02    2.86
## avgSchoolSize:latest_saleyear              -1.72e+01   1.01e+01   -1.71
## avgSchoolSize:avgSchoolRating              -3.30e+01   5.25e+00   -6.29
## livingAreaSqFt:latest_saleyear              1.96e+00   4.31e+00    0.45
## livingAreaSqFt:avgSchoolRating              3.64e+00   2.33e+00    1.57
## latest_saleyear:avgSchoolRating             1.14e+02   1.85e+03    0.06
##                                            Pr(>|t|)    
## (Intercept)                                 0.28924    
## zipcode                                     0.63615    
## propertyTaxRate                             0.50897    
## hasAssociationTRUE                          0.64934    
## yearBuilt                                   0.06066 .  
## numPriceChanges                             8.0e-06 ***
## numOfWaterfrontFeatures                     0.09805 .  
## avgSchoolSize                               3.1e-14 ***
## livingAreaSqFt                              5.3e-05 ***
## latest_saleyear                             0.00017 ***
## avgSchoolRating                             0.25351    
## zipcode:propertyTaxRate                     0.44863    
## zipcode:hasAssociationTRUE                  0.24508    
## zipcode:yearBuilt                           0.79573    
## zipcode:numPriceChanges                     0.01145 *  
## zipcode:numOfWaterfrontFeatures             0.08732 .  
## zipcode:avgSchoolSize                       < 2e-16 ***
## zipcode:livingAreaSqFt                      1.9e-06 ***
## zipcode:latest_saleyear                          NA    
## zipcode:avgSchoolRating                     0.19980    
## propertyTaxRate:hasAssociationTRUE          0.15236    
## propertyTaxRate:yearBuilt                   0.00030 ***
## propertyTaxRate:numPriceChanges             0.91715    
## propertyTaxRate:numOfWaterfrontFeatures          NA    
## propertyTaxRate:avgSchoolSize               0.00174 ** 
## propertyTaxRate:livingAreaSqFt              0.01970 *  
## propertyTaxRate:latest_saleyear             0.92709    
## propertyTaxRate:avgSchoolRating             0.00106 ** 
## hasAssociationTRUE:yearBuilt                2.4e-11 ***
## hasAssociationTRUE:numPriceChanges          0.14216    
## hasAssociationTRUE:numOfWaterfrontFeatures       NA    
## hasAssociationTRUE:avgSchoolSize            2.2e-05 ***
## hasAssociationTRUE:livingAreaSqFt           1.4e-15 ***
## hasAssociationTRUE:latest_saleyear          0.00050 ***
## hasAssociationTRUE:avgSchoolRating          0.07625 .  
## yearBuilt:numPriceChanges                   0.72358    
## yearBuilt:numOfWaterfrontFeatures           0.13368    
## yearBuilt:avgSchoolSize                     3.0e-15 ***
## yearBuilt:livingAreaSqFt                    0.89196    
## yearBuilt:latest_saleyear                   5.5e-05 ***
## yearBuilt:avgSchoolRating                   0.04437 *  
## numPriceChanges:numOfWaterfrontFeatures          NA    
## numPriceChanges:avgSchoolSize               0.16341    
## numPriceChanges:livingAreaSqFt              0.00049 ***
## numPriceChanges:latest_saleyear             1.2e-06 ***
## numPriceChanges:avgSchoolRating             0.16570    
## numOfWaterfrontFeatures:avgSchoolSize            NA    
## numOfWaterfrontFeatures:livingAreaSqFt           NA    
## numOfWaterfrontFeatures:latest_saleyear          NA    
## numOfWaterfrontFeatures:avgSchoolRating          NA    
## avgSchoolSize:livingAreaSqFt                0.00421 ** 
## avgSchoolSize:latest_saleyear               0.08695 .  
## avgSchoolSize:avgSchoolRating               3.7e-10 ***
## livingAreaSqFt:latest_saleyear              0.65022    
## livingAreaSqFt:avgSchoolRating              0.11759    
## latest_saleyear:avgSchoolRating             0.95072    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106000 on 3070 degrees of freedom
## Multiple R-squared:  0.59,   Adjusted R-squared:  0.584 
## F-statistic:   94 on 47 and 3070 DF,  p-value: <2e-16

length(coef(housing_data_model_interaction))

## [1] 56

summary(housing_data_model_interaction)$r.squared

## [1] 0.5899

summary(housing_data_model_interaction)$adj.r.squared

## [1] 0.5836

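From the $R^2$ value, about 58.99% of the variation in latestPrice is explained by this model. It has 56 coefficients, 8 of which could not be estimated because of singularities. The adjusted $R^2$ is 0.5836, higher than the additive model’s 0.5456, so this model is preferred over the additive model.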
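The `( ... ) ^ 2` formula syntax expands to all main effects plus every two-way interaction. With 10 predictors that gives 1 + 10 + 45 = 56 coefficients, which matches the count above. The sketch below shows the same expansion on three predictors from the built-in mtcars data (an illustrative assumption), and how lm reports NA for coefficients it cannot estimate. The all-zero column `z` is a hypothetical, deliberately extreme example of a singularity.

```r
fit = lm(mpg ~ (wt + hp + qsec) ^ 2, data = mtcars)
names(coef(fit))   # intercept, 3 main effects, choose(3, 2) = 3 interactions
length(coef(fit))

# A predictor with no variation makes its own column and its interactions singular
cars = transform(mtcars, z = 0)
fit_z = lm(mpg ~ (wt + hp + z) ^ 2, data = cars)
coef(fit_z)                 # z, wt:z and hp:z are NA
sum(is.na(coef(fit_z)))
```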

Let’s compare the interactive model with the original model:

anova(housing_data_model, housing_data_model_interaction)[2, "Pr(>F)"]

## [1] 1.356e-46

With a p-value of 1.356e-46, we can reject the null hypothesis and choose this model.

Similarly to what we did with the additive model, we’ll investigate how well backwards AIC and BIC perform on the interactive model.

housing_data_model_interaction_aic = step(housing_data_model_interaction,
                                          direction = "backward",
                                          trace = 0)
extractAIC(housing_data_model_interaction_aic) # returns both p and AIC

## [1]    36 72174

housing_data_model_interaction_aic

## 
## Call:
## lm(formula = latestPrice ~ zipcode + propertyTaxRate + hasAssociation + 
##     yearBuilt + numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize + 
##     livingAreaSqFt + latest_saleyear + avgSchoolRating + zipcode:numPriceChanges + 
##     zipcode:numOfWaterfrontFeatures + zipcode:avgSchoolSize + 
##     zipcode:livingAreaSqFt + propertyTaxRate:hasAssociation + 
##     propertyTaxRate:yearBuilt + propertyTaxRate:avgSchoolSize + 
##     propertyTaxRate:livingAreaSqFt + propertyTaxRate:avgSchoolRating + 
##     hasAssociation:yearBuilt + hasAssociation:numPriceChanges + 
##     hasAssociation:avgSchoolSize + hasAssociation:livingAreaSqFt + 
##     hasAssociation:latest_saleyear + hasAssociation:avgSchoolRating + 
##     yearBuilt:numOfWaterfrontFeatures + yearBuilt:avgSchoolSize + 
##     yearBuilt:latest_saleyear + yearBuilt:avgSchoolRating + numPriceChanges:livingAreaSqFt + 
##     numPriceChanges:latest_saleyear + avgSchoolSize:livingAreaSqFt + 
##     avgSchoolSize:latest_saleyear + avgSchoolSize:avgSchoolRating + 
##     livingAreaSqFt:avgSchoolRating, data = housing_data_train)
## 
## Coefficients:
##                        (Intercept)                             zipcode  
##                           2.83e+09                           -4.95e+03  
##                    propertyTaxRate                  hasAssociationTRUE  
##                          -3.50e+07                            4.20e+07  
##                          yearBuilt                     numPriceChanges  
##                          -1.30e+06                            1.78e+07  
##            numOfWaterfrontFeatures                       avgSchoolSize  
##                           2.20e+09                           -2.99e+05  
##                     livingAreaSqFt                     latest_saleyear  
##                           8.04e+04                           -1.18e+06  
##                    avgSchoolRating             zipcode:numPriceChanges  
##                          -5.30e+04                           -1.07e+02  
##    zipcode:numOfWaterfrontFeatures               zipcode:avgSchoolSize  
##                          -1.90e+04                            4.28e+00  
##             zipcode:livingAreaSqFt  propertyTaxRate:hasAssociationTRUE  
##                          -1.02e+00                            1.83e+05  
##          propertyTaxRate:yearBuilt       propertyTaxRate:avgSchoolSize  
##                           1.72e+04                            8.84e+02  
##     propertyTaxRate:livingAreaSqFt     propertyTaxRate:avgSchoolRating  
##                          -1.27e+02                           -1.50e+05  
##       hasAssociationTRUE:yearBuilt  hasAssociationTRUE:numPriceChanges  
##                           2.15e+03                            3.64e+03  
##   hasAssociationTRUE:avgSchoolSize   hasAssociationTRUE:livingAreaSqFt  
##                           1.11e+02                           -8.24e+01  
## hasAssociationTRUE:latest_saleyear  hasAssociationTRUE:avgSchoolRating  
##                          -2.31e+04                           -8.70e+03  
##  yearBuilt:numOfWaterfrontFeatures             yearBuilt:avgSchoolSize  
##                          -3.57e+05                           -4.98e+00  
##          yearBuilt:latest_saleyear           yearBuilt:avgSchoolRating  
##                           6.27e+02                            2.13e+02  
##     numPriceChanges:livingAreaSqFt     numPriceChanges:latest_saleyear  
##                          -5.02e+00                           -4.62e+03  
##       avgSchoolSize:livingAreaSqFt       avgSchoolSize:latest_saleyear  
##                           4.67e-02                           -1.51e+01  
##      avgSchoolSize:avgSchoolRating      livingAreaSqFt:avgSchoolRating  
##                          -3.09e+01                            3.23e+00

summary(housing_data_model_interaction_aic)$adj.r.squared

## [1] 0.5844

housing_data_model_interaction_bic = step(
  housing_data_model_interaction,
  direction = "backward",
  trace = 0,
  k = log(nrow(housing_data_numerical))
)

extractAIC(housing_data_model_interaction_bic)  # returns both p and AIC

## [1]    25 72189

summary(housing_data_model_interaction_bic)$adj.r.squared

## [1] 0.581

The adjusted $R^2$ values are 0.5844 for the AIC model, and 0.581 for the BIC model. Both of them are superior to the additive model (0.5456), and the AIC model also slightly exceeds the full interactive model (0.5836) with fewer parameters. The interaction model with backwards AIC (housing_data_model_interaction_aic) is the preferred model so far.

3.4. Diagnostics

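To perform model diagnostics, we’ll define a helper function that shows the Fitted versus Residuals plot, the Normal Q-Q Plot, and the Histogram of Residuals, and runs the Breusch-Pagan test for constant variance and the Shapiro-Wilk test for normality of errors. Because the function does not print the tests explicitly, only the result of its last expression, the Shapiro-Wilk test, is displayed when it is called.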

diagnostics = function (model) {
  par(mfrow = c(1, 3))
  
  plot(
    fitted(model),
    resid(model),
    pch = 20,
    xlab = "Fitted Values",
    ylab = "Residuals",
    main = "Fitted vs Residuals",
    col = "grey"
  )
  
  abline(h = 0, lwd = 2, col = "orange")
  
  qqnorm(resid(model),
         pch = 20,
         main = "Normal Q-Q Plot",
         col = "grey")
  qqline(resid(model), lwd = 2, col =  "orange")
  
  hist(
    resid(model),
    main = "Histogram of Residuals",
    col = "orange",
    xlab = "Residuals",
    ylab = "Frequency"
  )
  
  library(lmtest)
  bptest(model)
  shapiro.test(resid(model))
}

Having defined the function, let’s visualize the plots for the chosen model housing_data_model_interaction_aic:

diagnostics(housing_data_model_interaction_aic)

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model)
## W = 0.97, p-value <2e-16

The Fitted versus Residuals plot shows residuals spread far from zero for many fitted values, on the order of 200,000. The Q-Q Plot and the Histogram of Residuals show data points away from the line from -2 to 1 Theoretical Quantiles. This is a suspect Q-Q plot, and together with the very small Shapiro-Wilk p-value (< 2e-16) it leads us to believe that the errors do not follow a normal distribution.
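A variant of the helper that displays both tests could look like the following sketch. To stay self-contained it does not call lmtest; `bp_studentized` is a hypothetical hand-rolled version of the studentized (Koenker) Breusch-Pagan statistic, n times the $R^2$ of a regression of the squared residuals on the model's predictors. The mtcars model is only an illustration.

```r
# Studentized Breusch-Pagan statistic computed by hand in base R
bp_studentized = function(model) {
  squared_resid = resid(model) ^ 2
  aux = lm.fit(model.matrix(model), squared_resid)  # auxiliary regression
  r2 = 1 - sum(aux$residuals ^ 2) /
    sum((squared_resid - mean(squared_resid)) ^ 2)
  statistic = length(squared_resid) * r2
  df = aux$rank - 1
  c(BP = statistic, df = df,
    p.value = pchisq(statistic, df, lower.tail = FALSE))
}

diagnostic_tests = function(model) {
  bp = bp_studentized(model)
  sw = shapiro.test(resid(model))
  print(bp)  # explicit print(): otherwise only the last value would be shown
  print(sw)
  invisible(list(bp = bp, shapiro = sw))
}

demo_fit = lm(mpg ~ wt + hp, data = mtcars)
results = diagnostic_tests(demo_fit)
```

Returning the results invisibly keeps them available for later comparisons without printing them twice.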

3.5. Outliers

To try to identify the issues shown in the model diagnostics, we’ll look for influential observations that have a large effect on the regression. To measure this, we’ll use Cook’s distance.

cooksd = cooks.distance(housing_data_model_interaction_aic)

plot(cooksd,
     pch = "*",
     cex = 2,
     main = "Influential Observations by Cooks distance")  # plot cook's distance
abline(h = 2 * mean(cooksd, na.rm = T), col = "black")  # add cutoff line

text(
  x = 1:length(cooksd) + 1,
  y = cooksd,
  labels = ifelse(cooksd > 2 * mean(cooksd, na.rm = T), names(cooksd), ""),
  col = "red"
)  # add labels
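Cook's distance combines the size of each residual with the leverage of the observation. The sketch below recomputes it from those two ingredients on the built-in mtcars data (an illustrative assumption) and compares the cutoff used in this report, twice the mean distance, with another common rule of thumb, 4/n. Neither cutoff is definitive, and they can flag different observations.

```r
fit = lm(mpg ~ wt + hp, data = mtcars)

h = hatvalues(fit)       # leverage of each observation
p = length(coef(fit))    # number of estimated coefficients

# D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
cooks_manual = resid(fit) ^ 2 / (p * sigma(fit) ^ 2) * h / (1 - h) ^ 2
cooksd = cooks.distance(fit)

flag_mean_rule = cooksd > 2 * mean(cooksd)  # cutoff used in this report
flag_4n_rule = cooksd > 4 / length(cooksd)  # alternative rule of thumb

names(cooksd)[flag_mean_rule]
names(cooksd)[flag_4n_rule]
```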

Now that we’ve computed Cook’s distance for every observation and stored the results in the cooksd variable, we’ll treat observations above the cutoff (twice the mean distance) as outliers, build a new model without them, and run diagnostics again.

housing_data_model_interaction_aic_without_outliers = lm(
  latestPrice ~ zipcode + propertyTaxRate + hasAssociation + 
    yearBuilt + numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize + 
    livingAreaSqFt + latest_saleyear + avgSchoolRating + zipcode:numPriceChanges + 
    zipcode:numOfWaterfrontFeatures + zipcode:avgSchoolSize + 
    zipcode:livingAreaSqFt + propertyTaxRate:hasAssociation + 
    propertyTaxRate:yearBuilt + propertyTaxRate:avgSchoolSize + 
    propertyTaxRate:livingAreaSqFt + propertyTaxRate:avgSchoolRating + 
    hasAssociation:yearBuilt + hasAssociation:numPriceChanges + 
    hasAssociation:avgSchoolSize + hasAssociation:livingAreaSqFt + 
    hasAssociation:latest_saleyear + hasAssociation:avgSchoolRating + 
    yearBuilt:numOfWaterfrontFeatures + yearBuilt:avgSchoolSize + 
    yearBuilt:latest_saleyear + yearBuilt:avgSchoolRating + numPriceChanges:livingAreaSqFt + 
    numPriceChanges:latest_saleyear + avgSchoolSize:livingAreaSqFt + 
    avgSchoolSize:latest_saleyear + avgSchoolSize:avgSchoolRating + 
    livingAreaSqFt:avgSchoolRating,
  data = housing_data_train,
  subset = cooksd < 2 * mean(cooksd, na.rm = T)
)

diagnostics(housing_data_model_interaction_aic_without_outliers)

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model)
## W = 1, p-value = 0.00009

bptest(housing_data_model_interaction_aic_without_outliers)

## 
##  studentized Breusch-Pagan test
## 
## data:  housing_data_model_interaction_aic_without_outliers
## BP = 322, df = 32, p-value <2e-16

The Fitted versus Residuals plot still shows residuals spread away from zero on the order of hundreds of thousands, but at about half the distance when compared to the previous model. Also, the Q-Q Plot and the Histogram of Residuals show data points much closer to the line, so the errors are closer to a normal distribution, although the Shapiro-Wilk test (p-value = 0.00009) still rejects normality and the Breusch-Pagan test (p-value < 2e-16) still rejects constant variance.

Finally, we’ll apply a Box-Cox transformation to the response to improve the constant variance.

library(MASS)
boxcox(
  housing_data_model_interaction_aic_without_outliers,
  plotit = TRUE,
  lambda = seq(0, 1, by = 0.05)
)

housing_data_model_interaction_aic_without_outliers = lm(
  (((latestPrice ^ 0.5) - 1) / 0.5) ~ zipcode + propertyTaxRate + hasAssociation +
    yearBuilt + numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize +
    livingAreaSqFt + latest_saleyear + avgSchoolRating + zipcode:numPriceChanges +
    zipcode:numOfWaterfrontFeatures + zipcode:avgSchoolSize +
    zipcode:livingAreaSqFt + propertyTaxRate:hasAssociation +
    propertyTaxRate:yearBuilt + propertyTaxRate:avgSchoolSize +
    propertyTaxRate:livingAreaSqFt + propertyTaxRate:avgSchoolRating +
    hasAssociation:yearBuilt + hasAssociation:numPriceChanges +
    hasAssociation:avgSchoolSize + hasAssociation:livingAreaSqFt +
    hasAssociation:latest_saleyear + hasAssociation:avgSchoolRating +
    yearBuilt:numOfWaterfrontFeatures + yearBuilt:avgSchoolSize +
    yearBuilt:latest_saleyear + yearBuilt:avgSchoolRating + numPriceChanges:livingAreaSqFt +
    numPriceChanges:latest_saleyear + avgSchoolSize:livingAreaSqFt +
    avgSchoolSize:latest_saleyear + avgSchoolSize:avgSchoolRating +
    livingAreaSqFt:avgSchoolRating,
  data = housing_data_train,
  subset = cooksd < 2 * mean(cooksd, na.rm = T)
)

diagnostics(housing_data_model_interaction_aic_without_outliers)

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model)
## W = 1, p-value = 0.04

bptest(housing_data_model_interaction_aic_without_outliers)

## 
##  studentized Breusch-Pagan test
## 
## data:  housing_data_model_interaction_aic_without_outliers
## BP = 202, df = 32, p-value <2e-16

As seen in the Box-Cox plot, the log-likelihood peaks at a $\lambda$ of about 0.43. We tried exponents from 0.43 to 0.50, as suggested by the Box-Cox plot, and obtained the best model at 0.50. The Shapiro-Wilk p-value is now 0.04; however, the Breusch-Pagan test for constant variance still shows a low p-value. As we improved a lot on the original Fitted versus Residuals plot, we are choosing the interactive model with backwards AIC without outliers (housing_data_model_interaction_aic_without_outliers), after applying the Box-Cox transformation, as our final model.
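The value of $\lambda$ can also be read from the object that boxcox returns rather than from the plot. The sketch below does this for a small model on the built-in mtcars data (an illustrative assumption) and derives the approximate 95% interval that the plot draws, that is, every $\lambda$ whose log-likelihood lies within qchisq(0.95, 1) / 2 of the maximum. Any convenient value inside that interval, such as 0.5, is usually considered a reasonable choice.

```r
library(MASS)

# y = TRUE stores the response in the fit, which boxcox() needs
fit = lm(mpg ~ wt + hp, data = mtcars, y = TRUE)
bc = boxcox(fit, plotit = FALSE, lambda = seq(-2, 2, by = 0.05))

# lambda that maximizes the profile log-likelihood
lambda_hat = bc$x[which.max(bc$y)]

# approximate 95% interval for lambda
in_interval = bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_range = range(bc$x[in_interval])

lambda_hat
lambda_range
```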

Lastly, we will calculate the error of the final model for both the train and test data frames. Because the model was fitted to the Box-Cox-transformed response, we apply a reverse transformation to the sigma of the model to get the final error value.

For illustration, here’s the residual standard error for the original model (additive):

sigma(lm(latestPrice ~ ., data = housing_data_train))

## [1] 109900

sigma(lm(latestPrice ~ ., data = housing_data_test))

## [1] 110702

And the error obtained from the chosen model:

error_raw = sigma(housing_data_model_interaction_aic_without_outliers)
error = (error_raw ^2 + 1) * 0.5
error

## [1] 7427

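This error value is much smaller than the residual standard error of the additive model, which suggests a more accurate model. The comparison is only approximate, however, because sigma is measured on the transformed scale and the expression above is not the exact inverse of the Box-Cox transformation, $(\lambda z + 1)^{1/\lambda}$.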

# Refit the final formula on the test data.
# cooksd still holds the distances computed from the training-data model.
housing_data_model_interaction_aic_without_outliers = lm(
  (((latestPrice ^ 0.5) - 1) / 0.5) ~ zipcode + propertyTaxRate + hasAssociation +
    yearBuilt + numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize +
    livingAreaSqFt + latest_saleyear + avgSchoolRating + zipcode:numPriceChanges +
    zipcode:numOfWaterfrontFeatures + zipcode:avgSchoolSize +
    zipcode:livingAreaSqFt + propertyTaxRate:hasAssociation +
    propertyTaxRate:yearBuilt + propertyTaxRate:avgSchoolSize +
    propertyTaxRate:livingAreaSqFt + propertyTaxRate:avgSchoolRating +
    hasAssociation:yearBuilt + hasAssociation:numPriceChanges +
    hasAssociation:avgSchoolSize + hasAssociation:livingAreaSqFt +
    hasAssociation:latest_saleyear + hasAssociation:avgSchoolRating +
    yearBuilt:numOfWaterfrontFeatures + yearBuilt:avgSchoolSize +
    yearBuilt:latest_saleyear + yearBuilt:avgSchoolRating + numPriceChanges:livingAreaSqFt +
    numPriceChanges:latest_saleyear + avgSchoolSize:livingAreaSqFt +
    avgSchoolSize:latest_saleyear + avgSchoolSize:avgSchoolRating +
    livingAreaSqFt:avgSchoolRating,
  data = housing_data_test,
  subset = cooksd < 2 * mean(cooksd, na.rm = T)
)
error_raw = sigma(housing_data_model_interaction_aic_without_outliers)
error = (error_raw ^ 2 + 1) * 0.5
error

## [1] 13722

RMSE for train and test data (respectively):

library(Metrics)

predictions_train = predict(housing_data_model_interaction_aic_without_outliers,
                            housing_data_train)
error_1 = rmse(((housing_data_train$latestPrice ^ 0.5 - 1) / 0.5), predictions_train)

(error_1 ^ 2 + 1) * 0.5

## [1] 13104

predictions_test = predict(housing_data_model_interaction_aic_without_outliers,
                           housing_data_test)
error_1 = rmse(((housing_data_test$latestPrice ^ 0.5 - 1) / 0.5), predictions_test)

(error_1 ^ 2 + 1) * 0.5

## [1] 13576
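One alternative way to express these errors in dollars is to transform the predictions back to the original scale first and compute the RMSE afterwards. The sketch below shows the idea with $\lambda = 0.5$ on the built-in mtcars data. The data, the arbitrary train/test split, and the helper names `box_cox`, `inv_box_cox` and `rmse_original_scale` are all assumptions made for illustration and are not part of the report's code.

```r
# Box-Cox transform and its exact inverse (valid for lambda != 0)
box_cox = function(y, lambda) (y ^ lambda - 1) / lambda
inv_box_cox = function(z, lambda) (lambda * z + 1) ^ (1 / lambda)

lambda = 0.5
train = mtcars[1:24, ]   # arbitrary split, for illustration only
test = mtcars[25:32, ]

# Fit on the transformed response using the training rows only
fit = lm(box_cox(mpg, lambda) ~ wt + hp, data = train)

# RMSE on the original scale: back-transform predictions, then compare
rmse_original_scale = function(model, data) {
  predicted = inv_box_cox(predict(model, newdata = data), lambda)
  sqrt(mean((data$mpg - predicted) ^ 2))
}

rmse_train = rmse_original_scale(fit, train)
rmse_test = rmse_original_scale(fit, test)
c(train = rmse_train, test = rmse_test)
```

Because the same training-data fit is used for both calls, the test value measures how well the model generalizes to rows it has not seen.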

4. Conclusion

Through this project, we’ve built a model that would help home buyers predict house prices in Austin, TX and surrounding cities and towns. Using data obtained from house sale listings on zillow.com, we were able to produce a clean data set to work with, verify the predictors for relevancy and collinearity, and adjust the dataset based on the findings of this analysis.

Model building considered several techniques, including the use of additive or interactive models, and the use of backwards AIC and BIC to find an optimal model. Having identified the model of choice, the interactive model with backwards AIC without outliers, we performed diagnostics and fine-tuning, using an analysis of outliers and a Box-Cox transformation to reduce errors and increase the accuracy of the model.

Links and citations

Appendix A: about Team 42 PST

Our team is formed by the following individuals:

Jagadeesh Kedarisetty (jk64)
Nilesh Bhandarwar (nileshb2)
Peri Rocha (procha2)

Appendix B: libraries used

The following libraries were used in the creation of this report:

- ggmap

D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

- ggplot2

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

- faraway

Julian Faraway (2016). faraway: Functions and Datasets for Books by Julian Faraway. R package version 1.0.7. https://CRAN.R-project.org/package=faraway

- MASS

Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0

- lmtest

Achim Zeileis, Torsten Hothorn (2002). Diagnostic Checking in Regression Relationships. R News 2(3), 7-10. URL https://CRAN.R-project.org/doc/Rnews/

- leaps

Thomas Lumley based on Fortran code by Alan Miller (2020). leaps: Regression Subset Selection. R package version 3.1. https://CRAN.R-project.org/package=leaps

- Metrics

Ben Hamner and Michael Frasco (2018). Metrics: Evaluation Metrics for Machine Learning. R package version 0.1.4. https://CRAN.R-project.org/package=Metrics

Appendix C: links

¹ “How To Succeed As A First-Time Home Buyer In Today’s Market” (https://www.forbes.com/sites/forbesrealestatecouncil/2021/07/19/how-to-succeed-as-a-first-time-home-buyer-in-todays-market/?sh=79e0d37f19f8)
² “Your 4 Most Important Financial Decisions: #1 – The House Purchase” (https://www.retirementstewardship.com/2016/05/28/4-important-financial-decisions-1-house-purchase/)
³ “Why hot-desking is a terrible idea” (https://www.msn.com/en-us/lifestyle/career/why-hot-desking-is-a-terrible-idea/ar-AAMjgTM?ocid=BingNewsSearch)
⁴ Kaggle dataset: “Austin, TX House Listings - Features and Images scraped in January 2021”. (https://www.kaggle.com/ericpierce/austinhousingprices, austinHousingData.csv).

A study of housing market trends in Austin, Texas

STAT 420, Summer 2021, Team 42 PST

08/08/2021