The real estate market can be intimidating for anyone looking for a new home. For first-time home buyers and experienced investors alike, news articles claim that we're living through the hottest market in decades [1].
For many people, buying a house is one of the most important decisions they'll make in their lives [2]. But when is the right time to buy? Are prices high right now, and will they increase further? Should someone buy now, expecting the market to worsen? How does one maximize the investment by buying at the right time? By predicting a home's value from what we know historically about houses with a similar profile, we can help an individual understand whether a price is inflated and decide if this is the right moment to buy.
For this data analysis project, we will study the housing market in Austin, Texas, one of America's hottest real estate markets. Using historical data, we'll attempt to predict home prices with multiple linear regression and apply methods discussed in class to find the best possible prediction model. We believe trends in Austin can serve as a barometer for the rest of the country, given the high interest driven by the COVID-19 pandemic, which is shifting workers to home offices [3]. By using detailed historical data about properties sold in Austin, our intent is to offer home buyers guidance on whether the sale price of a house is within overall market expectations, helping them in the decision-making process.
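As a minimal sketch of the underlying idea (entirely synthetic numbers, not Austin data): fit a linear model on historical sales, then compare a listing's asking price with the model's prediction for a home of the same profile.

```r
set.seed(1)
# Synthetic historical sales: price driven by living area and bedroom count
sqft  = runif(50, 1000, 3000)
beds  = sample(2:5, 50, replace = TRUE)
price = 50000 + 150 * sqft + 10000 * beds + rnorm(50, sd = 20000)
fit = lm(price ~ sqft + beds)

# Model's "expected" price for a hypothetical 2,000 sq ft, 3-bedroom listing;
# an asking price far above this suggests the listing may be inflated
predict(fit, newdata = data.frame(sqft = 2000, beds = 3))
```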
This project implements several data analysis techniques and concepts discussed in class, including data cleaning, outlier detection, collinearity analysis, and multiple linear regression.
We gathered data from a large study [4] based on Zillow listings, which includes the sale prices of properties sold in Austin, Texas between 2018 and 2021, along with several variables describing those properties. We will try to predict past and present prices, as well as future trends, based on those variables.
The data used in our study offers 15,171 observations and 45 variables. Following is the list of variables in the dataset:

- zpid: Unique identifier (property ID).
- city: The lowercase name of a city or town in or surrounding Austin, Texas.
- streetAddress: The street address of the property.
- zipcode: The property's 5-digit ZIP code.
- description: The description of the property listing from Zillow.
- latitude: Latitude of the property.
- longitude: Longitude of the property.
- propertyTaxRate: Tax rate for the property.
- garageSpaces: Number of garage spaces; a subset of the parkingSpaces feature.
- hasAssociation: Indicates if there is a homeowners association associated with the property.
- hasCooling: Boolean indicating if the property has a cooling system.
- hasGarage: Boolean indicating if the property has a garage.
- hasHeating: Boolean indicating if the property has a heating system.
- hasSpa: Boolean indicating if the property has a spa.
- hasView: Boolean indicating if the property comes with a view.
- homeType: The home type (e.g., Single Family, Townhouse, Apartment).
- parkingSpaces: The number of parking spots.
- yearBuilt: The year the property was built.
- numPriceChanges: The number of price changes the property has undergone since being listed.
- latest_saledate: The latest sale date (YYYY-MM-DD).
- latest_salemonth: The month the property sold (1-12).
- latest_saleyear: The year the property sold (2018-2021).
- latestPriceSource: The party that provided the sale price.
- numOfPhotos: The number of photos in the Zillow listing.
- numOfAccessibilities: The number of unique accessibility features in the property.
- numOfAppliances: The number of unique appliances in the property.
- numOfParkingFeatures: The number of unique parking features in the property.
- numOfPatioAndPorts: The number of unique patio and/or porch features in the property.
- numOfSecurityFeatures: The number of unique security features in the property.
- numOfWaterFront: The number of unique waterfront features in the property.
- numOfUniqueWindowFeatures: The number of unique window aesthetics in the property.
- numOfCommunityFeatures: The number of unique community features (community meeting room, mailbox) in the property.
- lotSizeSqFt: The lot size of the property reported in square feet.
- livingAreaSqFt: The living area of the property reported in square feet.
- numOfPrimarySchools: The number of primary schools listed in the area on the listing.
- numOfElementrySchools: The number of elementary schools listed in the area on the listing.
- numOfMiddleSchools: The number of middle schools listed in the area on the listing.
- numOfHighSchools: The number of high schools listed in the area on the listing.
- avgSchoolDistance: The average distance of all school types (e.g., Middle, High) in the listing.
- avgSchoolRating: The average school rating of all school types (e.g., Middle, High) in the listing.
- avgSchoolSize: The average school size of all school types (e.g., Middle, High) in the listing.
- MedianStudentsPerTeacher: The median students-per-teacher ratio for all schools near the listing.
- numOfBathrooms: The number of bathrooms in the property.
- numOfBedrooms: The number of bedrooms in the property.
- numOfStories: The number of stories in the property.

Our first step is to read the data from a CSV file (austinHousingData.csv) and perform some data cleaning tasks:
- The homeType variable has 10 different possible values (Apartment, Condo, Residential, etc.). We'll convert it to a factor variable.
- We'll remove predictors that won't be used in the analysis: zpid, latest_salemonth, latest_saledate, latestPriceSource, city, homeImage, streetAddress, and numOfPhotos.
.raw_housing_data = read.csv("austinHousingData.csv")
# remove all rows with missing data
raw_housing_data = na.omit(raw_housing_data)
# Make homeType a factor variable
raw_housing_data$homeType = as.factor(raw_housing_data$homeType)
# Remove predictors that are not used
selected_housing_data = subset(
raw_housing_data,
select = -c(
zpid,
latest_salemonth,
latest_saledate,
latestPriceSource,
city,
homeImage,
streetAddress,
numOfPhotos
)
)
Now that we’ve performed basic data cleaning tasks, let’s take a look at the dataset.
head(selected_housing_data)
## zipcode latitude longitude propertyTaxRate garageSpaces hasAssociation
## 1 78660 30.43 -97.66 1.98 2 TRUE
## 2 78660 30.43 -97.66 1.98 2 TRUE
## 3 78660 30.41 -97.64 1.98 0 TRUE
## 4 78660 30.43 -97.66 1.98 2 TRUE
## 5 78660 30.44 -97.66 1.98 0 TRUE
## 6 78660 30.44 -97.66 1.98 2 TRUE
## hasCooling hasGarage hasHeating hasSpa hasView homeType parkingSpaces
## 1 TRUE TRUE TRUE FALSE FALSE Single Family 2
## 2 TRUE TRUE TRUE FALSE FALSE Single Family 2
## 3 TRUE FALSE TRUE FALSE FALSE Single Family 0
## 4 TRUE TRUE TRUE FALSE FALSE Single Family 2
## 5 TRUE FALSE TRUE FALSE FALSE Single Family 0
## 6 TRUE TRUE TRUE FALSE FALSE Single Family 2
## yearBuilt latestPrice numPriceChanges latest_saleyear
## 1 2012 305000 5 2019
## 2 2013 295000 1 2020
## 3 2018 256125 1 2019
## 4 2013 240000 4 2018
## 5 2002 239900 3 2018
## 6 2020 309045 2 2020
## numOfAccessibilityFeatures numOfAppliances numOfParkingFeatures
## 1 0 5 2
## 2 0 1 2
## 3 0 4 1
## 4 0 0 2
## 5 0 0 1
## 6 0 3 1
## numOfPatioAndPorchFeatures numOfSecurityFeatures numOfWaterfrontFeatures
## 1 1 3 0
## 2 0 0 0
## 3 0 1 0
## 4 0 0 0
## 5 0 0 0
## 6 2 2 0
## numOfWindowFeatures numOfCommunityFeatures lotSizeSqFt livingAreaSqFt
## 1 1 0 6011 2601
## 2 0 0 6185 1768
## 3 0 0 7840 1478
## 4 0 0 6098 1678
## 5 0 0 6708 2132
## 6 0 0 5161 1446
## numOfPrimarySchools numOfElementarySchools numOfMiddleSchools
## 1 1 0 1
## 2 1 0 1
## 3 0 2 1
## 4 1 0 1
## 5 1 0 1
## 6 1 0 1
## numOfHighSchools avgSchoolDistance avgSchoolRating avgSchoolSize
## 1 1 1.267 2.667 1063
## 2 1 1.400 2.667 1063
## 3 1 1.200 3.000 1108
## 4 1 1.400 2.667 1063
## 5 1 1.133 4.000 1223
## 6 1 1.067 4.000 1223
## MedianStudentsPerTeacher numOfBathrooms numOfBedrooms numOfStories
## 1 14 3 4 2
## 2 14 2 4 1
## 3 14 2 3 1
## 4 14 2 3 1
## 5 14 3 3 2
## 6 14 2 3 1
We'll also look at a summary of the dataset to better understand the range of each variable. This helps us spot variables whose maximum or minimum values are abnormal relative to their means, flagging potential outliers that may add noise to the dataset.
summary(selected_housing_data)
## zipcode latitude longitude propertyTaxRate garageSpaces
## Min. :78617 Min. :30.1 Min. :-98.0 Min. :1.98 Min. : 0.00
## 1st Qu.:78727 1st Qu.:30.2 1st Qu.:-97.8 1st Qu.:1.98 1st Qu.: 0.00
## Median :78739 Median :30.3 Median :-97.8 Median :1.98 Median : 1.00
## Mean :78736 Mean :30.3 Mean :-97.8 Mean :1.99 Mean : 1.23
## 3rd Qu.:78749 3rd Qu.:30.4 3rd Qu.:-97.7 3rd Qu.:1.98 3rd Qu.: 2.00
## Max. :78759 Max. :30.5 Max. :-97.6 Max. :2.21 Max. :22.00
##
## hasAssociation hasCooling hasGarage hasHeating
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:7164 FALSE:274 FALSE:6825 FALSE:149
## TRUE :8007 TRUE :14897 TRUE :8346 TRUE :15022
##
##
##
##
## hasSpa hasView homeType parkingSpaces
## Mode :logical Mode :logical Single Family :14241 Min. : 0.00
## FALSE:13972 FALSE:11716 Condo : 470 1st Qu.: 0.00
## TRUE :1199 TRUE :3455 Townhouse : 174 Median : 1.00
## Multiple Occupancy: 96 Mean : 1.23
## Vacant Land : 83 3rd Qu.: 2.00
## Apartment : 37 Max. :22.00
## (Other) : 70
## yearBuilt latestPrice numPriceChanges latest_saleyear
## Min. :1905 Min. : 5500 Min. : 1.00 Min. :2018
## 1st Qu.:1974 1st Qu.: 309000 1st Qu.: 1.00 1st Qu.:2018
## Median :1993 Median : 405000 Median : 2.00 Median :2019
## Mean :1989 Mean : 512768 Mean : 3.03 Mean :2019
## 3rd Qu.:2006 3rd Qu.: 575000 3rd Qu.: 4.00 3rd Qu.:2020
## Max. :2020 Max. :13500000 Max. :23.00 Max. :2021
##
## numOfAccessibilityFeatures numOfAppliances numOfParkingFeatures
## Min. :0.000 Min. : 0.00 Min. :0.00
## 1st Qu.:0.000 1st Qu.: 2.00 1st Qu.:1.00
## Median :0.000 Median : 3.00 Median :2.00
## Mean :0.013 Mean : 3.48 Mean :1.71
## 3rd Qu.:0.000 3rd Qu.: 4.00 3rd Qu.:2.00
## Max. :8.000 Max. :12.00 Max. :6.00
##
## numOfPatioAndPorchFeatures numOfSecurityFeatures numOfWaterfrontFeatures
## Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.000 Median :0.000 Median :0.0000
## Mean :0.663 Mean :0.467 Mean :0.0028
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :8.000 Max. :6.000 Max. :2.0000
##
## numOfWindowFeatures numOfCommunityFeatures lotSizeSqFt
## Min. :0.000 Min. :0.000 Min. : 100
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.: 6534
## Median :0.000 Median :0.000 Median : 8276
## Mean :0.208 Mean :0.019 Mean : 119084
## 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.: 10890
## Max. :4.000 Max. :8.000 Max. :1508482800
##
## livingAreaSqFt numOfPrimarySchools numOfElementarySchools numOfMiddleSchools
## Min. : 300 Min. :0.000 Min. :0.0000 Min. :0.00
## 1st Qu.: 1483 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:1.00
## Median : 1975 Median :1.000 Median :0.0000 Median :1.00
## Mean : 2208 Mean :0.941 Mean :0.0492 Mean :1.04
## 3rd Qu.: 2687 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:1.00
## Max. :109292 Max. :2.000 Max. :2.0000 Max. :3.00
##
## numOfHighSchools avgSchoolDistance avgSchoolRating avgSchoolSize
## Min. :0.000 Min. :0.20 Min. :2.33 Min. : 396
## 1st Qu.:1.000 1st Qu.:1.10 1st Qu.:4.00 1st Qu.: 966
## Median :1.000 Median :1.57 Median :5.78 Median :1287
## Mean :0.977 Mean :1.84 Mean :5.78 Mean :1237
## 3rd Qu.:1.000 3rd Qu.:2.27 3rd Qu.:7.00 3rd Qu.:1496
## Max. :2.000 Max. :9.00 Max. :9.50 Max. :1913
##
## MedianStudentsPerTeacher numOfBathrooms numOfBedrooms numOfStories
## Min. :10.0 Min. : 0.00 Min. : 0.00 Min. :1.00
## 1st Qu.:14.0 1st Qu.: 2.00 1st Qu.: 3.00 1st Qu.:1.00
## Median :15.0 Median : 3.00 Median : 3.00 Median :1.00
## Mean :14.9 Mean : 2.68 Mean : 3.44 Mean :1.47
## 3rd Qu.:16.0 3rd Qu.: 3.00 3rd Qu.: 4.00 3rd Qu.:2.00
## Max. :19.0 Max. :27.00 Max. :20.00 Max. :4.00
##
Some of the variables show strange minimum and maximum values relative to their means: avgSchoolDistance, livingAreaSqFt, lotSizeSqFt, numOfBedrooms, and numOfBathrooms. We'll plot the observations of these variables in charts to better understand whether they are outliers.
data_visuals = function(data) {
par(mfrow = c(2, 3))
plot(
latestPrice ~ homeType,
data = data,
pch = 20,
col = "dodgerblue",
main = "latestPrice vs. homeType",
cex = 1.5
)
plot(
latestPrice ~ avgSchoolDistance,
data = data,
pch = 20,
col = "dodgerblue",
main = "latestPrice vs. avgSchoolDistance",
cex = 1.5
)
plot(
latestPrice ~ livingAreaSqFt,
data = data,
pch = 20,
col = "dodgerblue",
main = "latestPrice vs. livingAreaSqFt",
cex = 1.5
)
plot(
latestPrice ~ lotSizeSqFt,
data = data,
pch = 20,
col = "dodgerblue",
main = "latestPrice vs. lotSizeSqFt",
cex = 1.5
)
plot(
latestPrice ~ numOfBedrooms,
data = data,
pch = 20,
col = "dodgerblue",
main = "latestPrice vs. numOfBedrooms",
cex = 1.5
)
plot(
latestPrice ~ numOfBathrooms,
data = data,
pch = 20,
col = "dodgerblue",
main = "latestPrice vs. numOfBathrooms",
cex = 1.5
)
}
data_visuals(selected_housing_data)
From the data structure and visuals, we see that there are significant outliers in the dataset. For instance, one observation has a livingAreaSqFt value of 109,292, compared to the variable's mean of 2,208. We'll remove these outliers using boxplot statistics.
# Flag boxplot outliers as NA for each numeric variable of interest.
# (homeType is a factor, so boxplot.stats() does not apply to it.)
for (x in c(
'latestPrice',
'avgSchoolDistance',
'livingAreaSqFt',
'lotSizeSqFt',
'numOfBedrooms',
'numOfBathrooms'
))
{
outliers = boxplot.stats(selected_housing_data[, x])$out
selected_housing_data[, x][selected_housing_data[, x] %in% outliers] = NA
}
# remove all rows with missing data
selected_housing_data = na.omit(selected_housing_data)
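To make the filtering rule concrete, here is a quick illustration on a synthetic vector (not the housing data): boxplot.stats() flags values that fall beyond 1.5 times the interquartile range from the hinges.

```r
x = c(1, 2, 3, 4, 5, 100)   # 100 is an obvious outlier
out = boxplot.stats(x)$out  # values outside the 1.5 * IQR whiskers
out                         # 100
x[!x %in% out]              # 1 2 3 4 5
```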
Looking at the plots again, we confirm that the observations are better represented now, without outliers.
data_visuals(selected_housing_data)
The “cleaned” dataset now offers 12,475 observations and 38 variables.
str(selected_housing_data)
## 'data.frame': 12475 obs. of 38 variables:
## $ zipcode : int 78660 78660 78660 78660 78660 78660 78660 78660 78660 78617 ...
## $ latitude : num 30.4 30.4 30.4 30.4 30.4 ...
## $ longitude : num -97.7 -97.7 -97.6 -97.7 -97.7 ...
## $ propertyTaxRate : num 1.98 1.98 1.98 1.98 1.98 1.98 1.98 1.98 1.98 1.98 ...
## $ garageSpaces : int 2 2 0 2 0 2 0 0 0 2 ...
## $ hasAssociation : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ hasCooling : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ hasGarage : logi TRUE TRUE FALSE TRUE FALSE TRUE ...
## $ hasHeating : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ hasSpa : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ hasView : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ homeType : Factor w/ 10 levels "Apartment","Condo",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ parkingSpaces : int 2 2 0 2 0 2 0 0 0 2 ...
## $ yearBuilt : int 2012 2013 2018 2013 2002 2020 2016 2002 2002 2013 ...
## $ latestPrice : int 305000 295000 256125 240000 239900 309045 315000 219900 225000 194800 ...
## $ numPriceChanges : int 5 1 1 4 3 2 2 2 1 1 ...
## $ latest_saleyear : int 2019 2020 2019 2018 2018 2020 2020 2018 2019 2018 ...
## $ numOfAccessibilityFeatures: int 0 0 0 0 0 0 0 0 0 0 ...
## $ numOfAppliances : int 5 1 4 0 0 3 3 3 2 3 ...
## $ numOfParkingFeatures : int 2 2 1 2 1 1 1 1 1 2 ...
## $ numOfPatioAndPorchFeatures: int 1 0 0 0 0 2 0 0 1 0 ...
## $ numOfSecurityFeatures : int 3 0 1 0 0 2 0 0 1 0 ...
## $ numOfWaterfrontFeatures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ numOfWindowFeatures : int 1 0 0 0 0 0 0 0 0 0 ...
## $ numOfCommunityFeatures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lotSizeSqFt : num 6011 6185 7840 6098 6708 ...
## $ livingAreaSqFt : int 2601 1768 1478 1678 2132 1446 2432 1422 1870 1422 ...
## $ numOfPrimarySchools : int 1 1 0 1 1 1 1 1 1 1 ...
## $ numOfElementarySchools : int 0 0 2 0 0 0 0 0 0 0 ...
## $ numOfMiddleSchools : int 1 1 1 1 1 1 1 1 1 1 ...
## $ numOfHighSchools : int 1 1 1 1 1 1 1 1 1 1 ...
## $ avgSchoolDistance : num 1.27 1.4 1.2 1.4 1.13 ...
## $ avgSchoolRating : num 2.67 2.67 3 2.67 4 ...
## $ avgSchoolSize : int 1063 1063 1108 1063 1223 1223 1051 1223 1223 1615 ...
## $ MedianStudentsPerTeacher : int 14 14 14 14 14 14 12 14 14 14 ...
## $ numOfBathrooms : num 3 2 2 2 3 2 3 3 2 3 ...
## $ numOfBedrooms : int 4 4 3 3 3 3 4 3 3 3 ...
## $ numOfStories : int 2 1 1 1 2 1 2 2 2 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:2696] 18 23 24 29 30 34 35 38 42 44 ...
## ..- attr(*, "names")= chr [1:2696] "18" "23" "24" "29" ...
Now that we have a clean dataset, without outliers, let’s have a look at the distribution of property prices when plotted over the map of Austin. We notice that the most expensive houses are concentrated near the central part of Austin, with some exceptions for more prestigious areas. Overall, houses are in the $250,000 to $750,000 range.
library(ggmap)
library(ggplot2)
# register google maps API key
register_google(key = "YOUR_GOOGLE_MAPS_API_KEY") # replace with your own API key
## Central co-ordinates of the region we are interested in.
central_location = c(mean(selected_housing_data$longitude),
mean(selected_housing_data$latitude))
## Get map centered on Austin, TX (or the mean of the coordinates in our dataset)
austin_map = ggmap(get_googlemap(
center = central_location,
scale = 1,
zoom = 10
),
extent = "normal")
## Plot heatmap
austin_map + geom_point(
aes(x = longitude, y = latitude, color = latestPrice),
data = selected_housing_data,
alpha = 0.4,
size = 3
) + xlim(range(selected_housing_data$longitude)) + ylim(range(selected_housing_data$latitude)) + scale_color_distiller(palette = "Spectral", labels = scales::comma) + xlab("Longitude") + ylab("Latitude") + ggtitle("Heatmap: latest sale price ($ USD) by property")
We’ll split our dataset into two data frames: one used for training, which will contain 25% of the observations, and one for testing, containing the remaining 75%.
set.seed(19870412)
ratio = 0.25
idx = sample(nrow(selected_housing_data),
size = nrow(selected_housing_data) * ratio)
housing_data_train = selected_housing_data[idx, ]
housing_data_test = selected_housing_data[-idx, ]
Next, we’ll look at the variables in the dataset to investigate if there’s multicollinearity.
library(faraway)
options(max.print = 1000000)
# This is a helper function to get the top n items from a matrix.
# Adjusted from https://stackoverflow.com/questions/32544566/find-the-largest-values-on-a-matrix-in-r
nlargest = function(m, n = 10, sim = TRUE) {
mult = 1
if (sim)
mult = 2
res = order(m, decreasing = TRUE)[seq_len(n) * mult]
pos = arrayInd(res, dim(m), useNames = TRUE)
list(values = m[res],
position = pos)
}
# A correlation cannot be computed for factor variables, so we'll create a copy
# of the data frame without the factor variables to run the collinearity analysis
num_cols = unlist(lapply(housing_data_train, is.numeric))
housing_data_numerical = housing_data_train[, num_cols]
# Pairs won't work with more than 26 variables
# pairs(housing_data_train[,1:26], col="dodgerblue")
# run cor() and store results on a matrix
(coll_matrix = round(cor(housing_data_numerical), 2))
## zipcode latitude longitude propertyTaxRate
## zipcode 1.00 -0.07 -0.16 -0.21
## latitude -0.07 1.00 0.36 0.48
## longitude -0.16 0.36 1.00 -0.01
## propertyTaxRate -0.21 0.48 -0.01 1.00
## garageSpaces -0.01 0.05 -0.06 0.05
## parkingSpaces -0.01 0.04 -0.06 0.05
## yearBuilt -0.02 -0.11 -0.19 0.13
## latestPrice -0.15 0.14 -0.27 -0.03
## numPriceChanges -0.05 0.00 -0.04 -0.02
## latest_saleyear -0.04 -0.04 -0.02 -0.01
## numOfAccessibilityFeatures -0.03 0.00 0.03 -0.01
## numOfAppliances 0.03 -0.01 -0.01 -0.01
## numOfParkingFeatures -0.08 0.19 -0.01 0.31
## numOfPatioAndPorchFeatures -0.02 -0.03 -0.08 -0.01
## numOfSecurityFeatures 0.00 -0.02 -0.06 0.03
## numOfWaterfrontFeatures -0.01 -0.02 -0.02 -0.01
## numOfWindowFeatures 0.00 0.00 -0.13 0.01
## numOfCommunityFeatures -0.01 0.02 -0.03 0.03
## lotSizeSqFt 0.12 0.18 -0.25 0.09
## livingAreaSqFt -0.01 0.14 -0.43 0.19
## numOfPrimarySchools -0.05 -0.07 0.12 0.03
## numOfElementarySchools 0.06 0.13 0.00 -0.04
## numOfMiddleSchools -0.01 -0.02 -0.20 0.00
## numOfHighSchools 0.08 0.11 0.38 -0.03
## avgSchoolDistance 0.05 0.01 -0.04 -0.03
## avgSchoolRating 0.05 0.27 -0.55 0.21
## avgSchoolSize 0.12 0.02 -0.43 0.18
## MedianStudentsPerTeacher 0.11 -0.02 -0.60 -0.03
## numOfBathrooms -0.03 0.05 -0.31 0.12
## numOfBedrooms 0.04 0.09 -0.28 0.15
## numOfStories -0.03 0.01 -0.18 0.07
## garageSpaces parkingSpaces yearBuilt latestPrice
## zipcode -0.01 -0.01 -0.02 -0.15
## latitude 0.05 0.04 -0.11 0.14
## longitude -0.06 -0.06 -0.19 -0.27
## propertyTaxRate 0.05 0.05 0.13 -0.03
## garageSpaces 1.00 1.00 0.04 0.14
## parkingSpaces 1.00 1.00 0.05 0.14
## yearBuilt 0.04 0.05 1.00 -0.08
## latestPrice 0.14 0.14 -0.08 1.00
## numPriceChanges 0.10 0.10 -0.05 -0.05
## latest_saleyear 0.34 0.34 0.01 0.13
## numOfAccessibilityFeatures 0.05 0.05 0.03 0.05
## numOfAppliances 0.15 0.15 0.09 0.04
## numOfParkingFeatures 0.67 0.67 0.00 0.11
## numOfPatioAndPorchFeatures 0.24 0.24 0.02 0.17
## numOfSecurityFeatures 0.17 0.17 0.14 0.11
## numOfWaterfrontFeatures 0.02 0.02 -0.02 0.04
## numOfWindowFeatures 0.03 0.03 0.04 0.14
## numOfCommunityFeatures 0.05 0.02 0.02 0.04
## lotSizeSqFt 0.06 0.06 -0.15 0.31
## livingAreaSqFt 0.13 0.13 0.41 0.48
## numOfPrimarySchools -0.01 -0.01 -0.06 -0.14
## numOfElementarySchools 0.00 0.00 0.04 0.11
## numOfMiddleSchools 0.01 0.01 0.08 0.12
## numOfHighSchools -0.03 -0.02 0.10 -0.23
## avgSchoolDistance 0.02 0.02 0.32 -0.11
## avgSchoolRating 0.08 0.08 0.11 0.46
## avgSchoolSize 0.03 0.03 0.34 0.10
## MedianStudentsPerTeacher 0.07 0.07 0.07 0.35
## numOfBathrooms 0.12 0.12 0.52 0.34
## numOfBedrooms 0.09 0.09 0.24 0.26
## numOfStories 0.06 0.06 0.41 0.18
## numPriceChanges latest_saleyear
## zipcode -0.05 -0.04
## latitude 0.00 -0.04
## longitude -0.04 -0.02
## propertyTaxRate -0.02 -0.01
## garageSpaces 0.10 0.34
## parkingSpaces 0.10 0.34
## yearBuilt -0.05 0.01
## latestPrice -0.05 0.13
## numPriceChanges 1.00 0.00
## latest_saleyear 0.00 1.00
## numOfAccessibilityFeatures -0.04 0.08
## numOfAppliances 0.07 0.09
## numOfParkingFeatures 0.11 0.25
## numOfPatioAndPorchFeatures -0.03 0.53
## numOfSecurityFeatures -0.03 0.41
## numOfWaterfrontFeatures 0.01 0.04
## numOfWindowFeatures 0.00 0.26
## numOfCommunityFeatures 0.04 0.05
## lotSizeSqFt 0.01 -0.05
## livingAreaSqFt 0.09 -0.03
## numOfPrimarySchools -0.02 0.03
## numOfElementarySchools -0.01 -0.04
## numOfMiddleSchools 0.01 -0.03
## numOfHighSchools -0.05 -0.01
## avgSchoolDistance -0.03 0.00
## avgSchoolRating 0.02 -0.03
## avgSchoolSize -0.04 -0.03
## MedianStudentsPerTeacher 0.03 0.00
## numOfBathrooms 0.08 0.00
## numOfBedrooms 0.07 -0.01
## numOfStories 0.06 -0.03
## numOfAccessibilityFeatures numOfAppliances
## zipcode -0.03 0.03
## latitude 0.00 -0.01
## longitude 0.03 -0.01
## propertyTaxRate -0.01 -0.01
## garageSpaces 0.05 0.15
## parkingSpaces 0.05 0.15
## yearBuilt 0.03 0.09
## latestPrice 0.05 0.04
## numPriceChanges -0.04 0.07
## latest_saleyear 0.08 0.09
## numOfAccessibilityFeatures 1.00 0.03
## numOfAppliances 0.03 1.00
## numOfParkingFeatures 0.05 0.16
## numOfPatioAndPorchFeatures 0.11 0.15
## numOfSecurityFeatures 0.05 0.17
## numOfWaterfrontFeatures 0.00 -0.02
## numOfWindowFeatures -0.02 0.10
## numOfCommunityFeatures 0.02 0.08
## lotSizeSqFt -0.05 -0.06
## livingAreaSqFt -0.02 0.03
## numOfPrimarySchools 0.01 0.01
## numOfElementarySchools 0.00 -0.01
## numOfMiddleSchools -0.01 0.01
## numOfHighSchools -0.01 -0.01
## avgSchoolDistance -0.02 0.00
## avgSchoolRating -0.01 0.02
## avgSchoolSize -0.01 0.04
## MedianStudentsPerTeacher -0.01 0.02
## numOfBathrooms -0.01 0.08
## numOfBedrooms -0.04 0.01
## numOfStories 0.03 0.06
## numOfParkingFeatures numOfPatioAndPorchFeatures
## zipcode -0.08 -0.02
## latitude 0.19 -0.03
## longitude -0.01 -0.08
## propertyTaxRate 0.31 -0.01
## garageSpaces 0.67 0.24
## parkingSpaces 0.67 0.24
## yearBuilt 0.00 0.02
## latestPrice 0.11 0.17
## numPriceChanges 0.11 -0.03
## latest_saleyear 0.25 0.53
## numOfAccessibilityFeatures 0.05 0.11
## numOfAppliances 0.16 0.15
## numOfParkingFeatures 1.00 0.14
## numOfPatioAndPorchFeatures 0.14 1.00
## numOfSecurityFeatures 0.11 0.50
## numOfWaterfrontFeatures 0.01 0.03
## numOfWindowFeatures 0.04 0.34
## numOfCommunityFeatures 0.04 0.07
## lotSizeSqFt 0.01 0.00
## livingAreaSqFt 0.09 0.05
## numOfPrimarySchools 0.02 0.01
## numOfElementarySchools -0.02 -0.02
## numOfMiddleSchools 0.01 -0.02
## numOfHighSchools -0.02 -0.04
## avgSchoolDistance -0.05 0.00
## avgSchoolRating 0.12 0.04
## avgSchoolSize 0.04 0.03
## MedianStudentsPerTeacher 0.04 0.06
## numOfBathrooms 0.08 0.04
## numOfBedrooms 0.05 0.04
## numOfStories 0.08 -0.01
## numOfSecurityFeatures numOfWaterfrontFeatures
## zipcode 0.00 -0.01
## latitude -0.02 -0.02
## longitude -0.06 -0.02
## propertyTaxRate 0.03 -0.01
## garageSpaces 0.17 0.02
## parkingSpaces 0.17 0.02
## yearBuilt 0.14 -0.02
## latestPrice 0.11 0.04
## numPriceChanges -0.03 0.01
## latest_saleyear 0.41 0.04
## numOfAccessibilityFeatures 0.05 0.00
## numOfAppliances 0.17 -0.02
## numOfParkingFeatures 0.11 0.01
## numOfPatioAndPorchFeatures 0.50 0.03
## numOfSecurityFeatures 1.00 -0.01
## numOfWaterfrontFeatures -0.01 1.00
## numOfWindowFeatures 0.41 -0.01
## numOfCommunityFeatures 0.03 0.00
## lotSizeSqFt -0.02 0.08
## livingAreaSqFt 0.10 -0.01
## numOfPrimarySchools 0.00 0.01
## numOfElementarySchools -0.02 -0.01
## numOfMiddleSchools -0.01 0.00
## numOfHighSchools -0.02 0.00
## avgSchoolDistance 0.02 0.00
## avgSchoolRating 0.04 0.00
## avgSchoolSize 0.03 0.00
## MedianStudentsPerTeacher 0.03 0.02
## numOfBathrooms 0.10 -0.02
## numOfBedrooms 0.07 0.01
## numOfStories 0.06 -0.03
## numOfWindowFeatures numOfCommunityFeatures
## zipcode 0.00 -0.01
## latitude 0.00 0.02
## longitude -0.13 -0.03
## propertyTaxRate 0.01 0.03
## garageSpaces 0.03 0.05
## parkingSpaces 0.03 0.02
## yearBuilt 0.04 0.02
## latestPrice 0.14 0.04
## numPriceChanges 0.00 0.04
## latest_saleyear 0.26 0.05
## numOfAccessibilityFeatures -0.02 0.02
## numOfAppliances 0.10 0.08
## numOfParkingFeatures 0.04 0.04
## numOfPatioAndPorchFeatures 0.34 0.07
## numOfSecurityFeatures 0.41 0.03
## numOfWaterfrontFeatures -0.01 0.00
## numOfWindowFeatures 1.00 0.01
## numOfCommunityFeatures 0.01 1.00
## lotSizeSqFt 0.05 0.04
## livingAreaSqFt 0.15 0.04
## numOfPrimarySchools -0.02 -0.04
## numOfElementarySchools 0.02 -0.01
## numOfMiddleSchools 0.00 0.06
## numOfHighSchools -0.05 -0.03
## avgSchoolDistance 0.03 0.00
## avgSchoolRating 0.13 0.04
## avgSchoolSize 0.11 -0.01
## MedianStudentsPerTeacher 0.13 0.01
## numOfBathrooms 0.08 0.03
## numOfBedrooms 0.09 0.02
## numOfStories 0.05 -0.02
## lotSizeSqFt livingAreaSqFt numOfPrimarySchools
## zipcode 0.12 -0.01 -0.05
## latitude 0.18 0.14 -0.07
## longitude -0.25 -0.43 0.12
## propertyTaxRate 0.09 0.19 0.03
## garageSpaces 0.06 0.13 -0.01
## parkingSpaces 0.06 0.13 -0.01
## yearBuilt -0.15 0.41 -0.06
## latestPrice 0.31 0.48 -0.14
## numPriceChanges 0.01 0.09 -0.02
## latest_saleyear -0.05 -0.03 0.03
## numOfAccessibilityFeatures -0.05 -0.02 0.01
## numOfAppliances -0.06 0.03 0.01
## numOfParkingFeatures 0.01 0.09 0.02
## numOfPatioAndPorchFeatures 0.00 0.05 0.01
## numOfSecurityFeatures -0.02 0.10 0.00
## numOfWaterfrontFeatures 0.08 -0.01 0.01
## numOfWindowFeatures 0.05 0.15 -0.02
## numOfCommunityFeatures 0.04 0.04 -0.04
## lotSizeSqFt 1.00 0.38 -0.20
## livingAreaSqFt 0.38 1.00 -0.15
## numOfPrimarySchools -0.20 -0.15 1.00
## numOfElementarySchools 0.14 0.11 -0.81
## numOfMiddleSchools 0.15 0.14 -0.41
## numOfHighSchools -0.17 -0.09 0.45
## avgSchoolDistance 0.00 0.16 0.05
## avgSchoolRating 0.29 0.54 -0.19
## avgSchoolSize 0.13 0.44 -0.03
## MedianStudentsPerTeacher 0.23 0.41 -0.01
## numOfBathrooms 0.13 0.75 -0.13
## numOfBedrooms 0.34 0.69 -0.10
## numOfStories -0.10 0.49 -0.05
## numOfElementarySchools numOfMiddleSchools
## zipcode 0.06 -0.01
## latitude 0.13 -0.02
## longitude 0.00 -0.20
## propertyTaxRate -0.04 0.00
## garageSpaces 0.00 0.01
## parkingSpaces 0.00 0.01
## yearBuilt 0.04 0.08
## latestPrice 0.11 0.12
## numPriceChanges -0.01 0.01
## latest_saleyear -0.04 -0.03
## numOfAccessibilityFeatures 0.00 -0.01
## numOfAppliances -0.01 0.01
## numOfParkingFeatures -0.02 0.01
## numOfPatioAndPorchFeatures -0.02 -0.02
## numOfSecurityFeatures -0.02 -0.01
## numOfWaterfrontFeatures -0.01 0.00
## numOfWindowFeatures 0.02 0.00
## numOfCommunityFeatures -0.01 0.06
## lotSizeSqFt 0.14 0.15
## livingAreaSqFt 0.11 0.14
## numOfPrimarySchools -0.81 -0.41
## numOfElementarySchools 1.00 0.31
## numOfMiddleSchools 0.31 1.00
## numOfHighSchools -0.22 -0.36
## avgSchoolDistance -0.04 0.08
## avgSchoolRating 0.11 0.15
## avgSchoolSize 0.07 -0.03
## MedianStudentsPerTeacher -0.01 -0.03
## numOfBathrooms 0.10 0.11
## numOfBedrooms 0.08 0.10
## numOfStories 0.06 0.05
## numOfHighSchools avgSchoolDistance avgSchoolRating
## zipcode 0.08 0.05 0.05
## latitude 0.11 0.01 0.27
## longitude 0.38 -0.04 -0.55
## propertyTaxRate -0.03 -0.03 0.21
## garageSpaces -0.03 0.02 0.08
## parkingSpaces -0.02 0.02 0.08
## yearBuilt 0.10 0.32 0.11
## latestPrice -0.23 -0.11 0.46
## numPriceChanges -0.05 -0.03 0.02
## latest_saleyear -0.01 0.00 -0.03
## numOfAccessibilityFeatures -0.01 -0.02 -0.01
## numOfAppliances -0.01 0.00 0.02
## numOfParkingFeatures -0.02 -0.05 0.12
## numOfPatioAndPorchFeatures -0.04 0.00 0.04
## numOfSecurityFeatures -0.02 0.02 0.04
## numOfWaterfrontFeatures 0.00 0.00 0.00
## numOfWindowFeatures -0.05 0.03 0.13
## numOfCommunityFeatures -0.03 0.00 0.04
## lotSizeSqFt -0.17 0.00 0.29
## livingAreaSqFt -0.09 0.16 0.54
## numOfPrimarySchools 0.45 0.05 -0.19
## numOfElementarySchools -0.22 -0.04 0.11
## numOfMiddleSchools -0.36 0.08 0.15
## numOfHighSchools 1.00 0.18 -0.21
## avgSchoolDistance 0.18 1.00 0.08
## avgSchoolRating -0.21 0.08 1.00
## avgSchoolSize -0.07 0.28 0.63
## MedianStudentsPerTeacher -0.27 0.08 0.76
## numOfBathrooms -0.07 0.13 0.35
## numOfBedrooms -0.04 0.11 0.30
## numOfStories -0.03 0.09 0.22
## avgSchoolSize MedianStudentsPerTeacher
## zipcode 0.12 0.11
## latitude 0.02 -0.02
## longitude -0.43 -0.60
## propertyTaxRate 0.18 -0.03
## garageSpaces 0.03 0.07
## parkingSpaces 0.03 0.07
## yearBuilt 0.34 0.07
## latestPrice 0.10 0.35
## numPriceChanges -0.04 0.03
## latest_saleyear -0.03 0.00
## numOfAccessibilityFeatures -0.01 -0.01
## numOfAppliances 0.04 0.02
## numOfParkingFeatures 0.04 0.04
## numOfPatioAndPorchFeatures 0.03 0.06
## numOfSecurityFeatures 0.03 0.03
## numOfWaterfrontFeatures 0.00 0.02
## numOfWindowFeatures 0.11 0.13
## numOfCommunityFeatures -0.01 0.01
## lotSizeSqFt 0.13 0.23
## livingAreaSqFt 0.44 0.41
## numOfPrimarySchools -0.03 -0.01
## numOfElementarySchools 0.07 -0.01
## numOfMiddleSchools -0.03 -0.03
## numOfHighSchools -0.07 -0.27
## avgSchoolDistance 0.28 0.08
## avgSchoolRating 0.63 0.76
## avgSchoolSize 1.00 0.66
## MedianStudentsPerTeacher 0.66 1.00
## numOfBathrooms 0.32 0.27
## numOfBedrooms 0.29 0.23
## numOfStories 0.24 0.16
## numOfBathrooms numOfBedrooms numOfStories
## zipcode -0.03 0.04 -0.03
## latitude 0.05 0.09 0.01
## longitude -0.31 -0.28 -0.18
## propertyTaxRate 0.12 0.15 0.07
## garageSpaces 0.12 0.09 0.06
## parkingSpaces 0.12 0.09 0.06
## yearBuilt 0.52 0.24 0.41
## latestPrice 0.34 0.26 0.18
## numPriceChanges 0.08 0.07 0.06
## latest_saleyear 0.00 -0.01 -0.03
## numOfAccessibilityFeatures -0.01 -0.04 0.03
## numOfAppliances 0.08 0.01 0.06
## numOfParkingFeatures 0.08 0.05 0.08
## numOfPatioAndPorchFeatures 0.04 0.04 -0.01
## numOfSecurityFeatures 0.10 0.07 0.06
## numOfWaterfrontFeatures -0.02 0.01 -0.03
## numOfWindowFeatures 0.08 0.09 0.05
## numOfCommunityFeatures 0.03 0.02 -0.02
## lotSizeSqFt 0.13 0.34 -0.10
## livingAreaSqFt 0.75 0.69 0.49
## numOfPrimarySchools -0.13 -0.10 -0.05
## numOfElementarySchools 0.10 0.08 0.06
## numOfMiddleSchools 0.11 0.10 0.05
## numOfHighSchools -0.07 -0.04 -0.03
## avgSchoolDistance 0.13 0.11 0.09
## avgSchoolRating 0.35 0.30 0.22
## avgSchoolSize 0.32 0.29 0.24
## MedianStudentsPerTeacher 0.27 0.23 0.16
## numOfBathrooms 1.00 0.55 0.67
## numOfBedrooms 0.55 1.00 0.29
## numOfStories 0.67 0.29 1.00
We don’t observe examples of collinearity between variables above 0.8, except for the relationship between parkingSpaces and garageSpaces. Remember that parkingSpaces is the total number of parking spots, while garageSpaces represents the number of garage spaces as a subset of the parkingSpaces variable. The former may include additional parking spaces provided by common areas.
The first reaction is to think that parkingSpaces and garageSpaces are the same. This is the case in almost all observations, except for 0.16% of the observations in the dataset.
#garageSpaces is not always the same as parkingSpaces
spaces_different = housing_data_train$parkingSpaces != housing_data_train$garageSpaces
# Proportion of observations where parkingSpaces is different than garageSpaces
length(spaces_different[spaces_different == TRUE]) / length(spaces_different)
## [1] 0.001604
Therefore, we’ll eliminate the garageSpaces variable from the dataset.
housing_data_train = subset(housing_data_train, select = -c(garageSpaces))
We’ll further look for multicollinearity with the remaining variables.
num_cols = unlist(lapply(housing_data_train, is.numeric))
housing_data_numerical = housing_data_train[, num_cols]
coll_matrix = round(cor(housing_data_numerical), 2)
This matrix is extensive, so high values are easy to miss. Let’s use a function to look at the largest values in the matrix.
# Look at the top values from coll_matrix that are different than 1:
nlargest(coll_matrix, n = 45)$values[nlargest(coll_matrix, n = 45)$values < 1]
## [1] 0.76 0.75 0.69 0.67 0.67 0.66 0.63 0.55 0.54 0.53 0.52 0.50 0.49 0.48 0.48
## [16] 0.46 0.45 0.44 0.41 0.41 0.41 0.41 0.41 0.38 0.38 0.36 0.35 0.35 0.34 0.34
We now see that no pairwise correlation between the remaining variables exceeds 0.8. Still, we can compute variance inflation factors (VIF) to check whether any predictor is largely explained by the others.
housing_data_model = lm(latestPrice ~ ., data = housing_data_train)
vif = vif(housing_data_model)
sort(vif[which(vif > 5)], decreasing = TRUE)
## homeTypeSingle Family homeTypeCondo
## 164.488 115.289
## homeTypeTownhouse homeTypeMultiple Occupancy
## 36.950 11.364
## homeTypeResidential hasGarageTRUE
## 5.812 5.369
The homeType predictor will be key in our analysis, so we’ve decided to keep it in the model. However, the hasGarage variable shows a VIF greater than 5, which may be a concern. What proportion of the observed variation in hasGarage is explained by a linear relationship with the other predictors?
summary(lm(hasGarage ~ . - latestPrice, data = housing_data_train))$r.squared
## [1] 0.8137
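As a quick sanity check, this R² and the VIF reported above are linked by the identity VIF = 1 / (1 - R²); a minimal sketch:

```r
# VIF and R^2 from the auxiliary regression are linked by VIF = 1 / (1 - R^2)
r2_hasGarage = 0.8137
1 / (1 - r2_hasGarage)  # approximately 5.37, consistent with the VIF of 5.369 above
```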
housing_data_model_non_significant = lm(latestPrice ~ . - hasGarage, data = housing_data_train)
vif_non_significant = vif(housing_data_model_non_significant)
vif_non_significant[which(vif_non_significant > 5)]
## homeTypeCondo homeTypeMultiple Occupancy
## 112.768 11.167
## homeTypeResidential homeTypeSingle Family
## 5.664 161.098
## homeTypeTownhouse
## 36.195
#Finally, compare both models
(anova_results = anova(housing_data_model, housing_data_model_non_significant))
## Analysis of Variance Table
##
## Model 1: latestPrice ~ zipcode + latitude + longitude + propertyTaxRate +
## hasAssociation + hasCooling + hasGarage + hasHeating + hasSpa +
## hasView + homeType + parkingSpaces + yearBuilt + numPriceChanges +
## latest_saleyear + numOfAccessibilityFeatures + numOfAppliances +
## numOfParkingFeatures + numOfPatioAndPorchFeatures + numOfSecurityFeatures +
## numOfWaterfrontFeatures + numOfWindowFeatures + numOfCommunityFeatures +
## lotSizeSqFt + livingAreaSqFt + numOfPrimarySchools + numOfElementarySchools +
## numOfMiddleSchools + numOfHighSchools + avgSchoolDistance +
## avgSchoolRating + avgSchoolSize + MedianStudentsPerTeacher +
## numOfBathrooms + numOfBedrooms + numOfStories
## Model 2: latestPrice ~ (zipcode + latitude + longitude + propertyTaxRate +
## hasAssociation + hasCooling + hasGarage + hasHeating + hasSpa +
## hasView + homeType + parkingSpaces + yearBuilt + numPriceChanges +
## latest_saleyear + numOfAccessibilityFeatures + numOfAppliances +
## numOfParkingFeatures + numOfPatioAndPorchFeatures + numOfSecurityFeatures +
## numOfWaterfrontFeatures + numOfWindowFeatures + numOfCommunityFeatures +
## lotSizeSqFt + livingAreaSqFt + numOfPrimarySchools + numOfElementarySchools +
## numOfMiddleSchools + numOfHighSchools + avgSchoolDistance +
## avgSchoolRating + avgSchoolSize + MedianStudentsPerTeacher +
## numOfBathrooms + numOfBedrooms + numOfStories) - hasGarage
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3075 3.71e+13
## 2 3076 3.71e+13 -1 -79148314 0.01 0.94
When we compare the model with all predictors to one that does not include the hasGarage predictor, the p-value of 0.94 is far from significant, so we fail to reject the null hypothesis that the hasGarage coefficient is zero. Since removing it makes essentially no difference, we’ll continue the analysis with the hasGarage predictor.
Now that we’ve cleaned the dataset by removing outliers and predictors that may not be helpful, we’ll start looking at options to build an optimal model. We will build an additive and an interaction model, then perform model selection and diagnostics to choose a model that best represents our dataset and purpose.
Let’s build an additive model with all available predictors.
model_measures = function(models, names){
df = data.frame(matrix(nrow = 4, ncol = length(names)))
# assign row names
rownames(df) = c("Number Of Predictors", "RSquare", "Adj. RSquare", "LOOCV_RMSE")
# assign column names
colnames(df) = names
for(i in 1:length(models)) {
model = models[[i]]
num_of_predictors = length(coef(model))
adj_rsquare = summary(model)$adj.r.squared
rsquare = summary(model)$r.squared
loocv_rmse = sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
df[i] = c(num_of_predictors, rsquare, adj_rsquare, loocv_rmse) # order must match the row names
}
knitr::kable(df, "simple")
}
housing_data_model = lm(latestPrice ~ ., data = housing_data_train)
model_measures(list(housing_data_model), c("Additive Model"))
|                      | Additive Model |
|----------------------|----------------|
| Number Of Predictors | 43.0000        |
| RSquare              | 0.5565         |
| Adj. RSquare         | 0.5505         |
| LOOCV_RMSE           | Inf            |
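The LOOCV_RMSE formula inside model_measures uses the leave-one-out shortcut for linear models, sqrt(mean((e_i / (1 - h_i))^2)), instead of refitting the model n times. As a sketch, the shortcut can be verified against explicit leave-one-out refits on the built-in mtcars dataset:

```r
# The leave-one-out shortcut for lm() avoids refitting the model n times
fit = lm(mpg ~ wt + hp, data = mtcars)
shortcut = sqrt(mean((resid(fit) / (1 - hatvalues(fit)))^2))

# Explicit leave-one-out refits give the same RMSE
loo_errors = sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i = lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
})
explicit = sqrt(mean(loo_errors^2))

all.equal(shortcut, explicit)  # TRUE
```

The identity holds exactly for ordinary least squares, which is why the shortcut is safe to use here.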
We will find the significant variables with alpha 0.05.
alpha = 0.05
variables_significant = summary(housing_data_model)$coef[, 'Pr(>|t|)'] < alpha
variableNames_significant = names(variables_significant[variables_significant == TRUE][-1])
predictors = paste(variableNames_significant, collapse = "+")
predictors
## [1] "zipcode+longitude+propertyTaxRate+hasAssociationTRUE+homeTypeMultiFamily+yearBuilt+numPriceChanges+latest_saleyear+numOfAccessibilityFeatures+numOfAppliances+numOfPatioAndPorchFeatures+numOfWaterfrontFeatures+lotSizeSqFt+livingAreaSqFt+numOfPrimarySchools+numOfElementarySchools+numOfHighSchools+avgSchoolDistance+avgSchoolRating+avgSchoolSize+numOfBathrooms+numOfBedrooms"
housing_data_model_significant = lm(
latestPrice ~ zipcode + longitude + propertyTaxRate + hasAssociation + yearBuilt +
numPriceChanges + latest_saleyear + numOfPatioAndPorchFeatures + lotSizeSqFt +
livingAreaSqFt + numOfPrimarySchools + numOfElementarySchools + numOfHighSchools +
avgSchoolDistance + avgSchoolRating + avgSchoolSize + numOfBathrooms + numOfBedrooms,
data = housing_data_train
)
anova(housing_data_model, housing_data_model_significant)
## Analysis of Variance Table
##
## Model 1: latestPrice ~ zipcode + latitude + longitude + propertyTaxRate +
## hasAssociation + hasCooling + hasGarage + hasHeating + hasSpa +
## hasView + homeType + parkingSpaces + yearBuilt + numPriceChanges +
## latest_saleyear + numOfAccessibilityFeatures + numOfAppliances +
## numOfParkingFeatures + numOfPatioAndPorchFeatures + numOfSecurityFeatures +
## numOfWaterfrontFeatures + numOfWindowFeatures + numOfCommunityFeatures +
## lotSizeSqFt + livingAreaSqFt + numOfPrimarySchools + numOfElementarySchools +
## numOfMiddleSchools + numOfHighSchools + avgSchoolDistance +
## avgSchoolRating + avgSchoolSize + MedianStudentsPerTeacher +
## numOfBathrooms + numOfBedrooms + numOfStories
## Model 2: latestPrice ~ zipcode + longitude + propertyTaxRate + hasAssociation +
## yearBuilt + numPriceChanges + latest_saleyear + numOfPatioAndPorchFeatures +
## lotSizeSqFt + livingAreaSqFt + numOfPrimarySchools + numOfElementarySchools +
## numOfHighSchools + avgSchoolDistance + avgSchoolRating +
## avgSchoolSize + numOfBathrooms + numOfBedrooms
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3075 3.71e+13
## 2 3099 3.78e+13 -24 -699422042303 2.41 0.00014 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_measures(
list(housing_data_model, housing_data_model_significant),
c("All predictors", "Significant predictors")
)
|                      | All predictors | Significant predictors |
|----------------------|----------------|------------------------|
| Number Of Predictors | 43.0000        | 19.0000                |
| RSquare              | 0.5565         | 0.5482                 |
| Adj. RSquare         | 0.5505         | 0.5456                 |
| LOOCV_RMSE           | Inf            | 110989.6549            |
housing_data_model = housing_data_model_significant
From the \(R^2\) value, about 54.8% of the variation in latestPrice is explained by this model, which uses 19 predictors. Next, we will try to find a “better” model with \(R^2\) greater than 0.5482, or adjusted \(R^2\) greater than 0.5456, while also lowering the LOOCV_RMSE (currently about 110,990). First, we’ll investigate how well backwards AIC and BIC perform on the additive model.
## Additive model AIC and BIC
housing_data_model_aic = step(housing_data_model, direction = "backward", trace = 0)
extractAIC(housing_data_model_aic) # returns both p and AIC
## [1] 19 72436
summary(housing_data_model_aic)$adj.r.squared
## [1] 0.5456
housing_data_model_bic = step(
housing_data_model,
direction = "backward",
trace = 0,
k = log(nrow(housing_data_numerical))
)
extractAIC(housing_data_model_bic) # returns both p and AIC
## [1] 19 72436
summary(housing_data_model_bic)$adj.r.squared
## [1] 0.5456
The adjusted \(R^2\) values are 0.5456 for both the AIC and BIC models; backwards selection retained the same 19 predictors, so it offers no improvement over the additive model with significant predictors.
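The only difference between the two searches is the per-parameter penalty: 2 for AIC and log(n) for BIC, passed via the k argument of step. A minimal sketch of both criteria, using the built-in mtcars data for illustration:

```r
# AIC and BIC share the same likelihood term and differ only in the penalty
fit = lm(mpg ~ wt + hp, data = mtcars)
n = nrow(mtcars)
df = attr(logLik(fit), "df")      # coefficients plus the error variance
ll = as.numeric(logLik(fit))
aic_manual = -2 * ll + 2 * df
bic_manual = -2 * ll + log(n) * df
all.equal(aic_manual, AIC(fit))   # TRUE
all.equal(bic_manual, BIC(fit))   # TRUE
```

Because log(n) > 2 for any reasonably sized dataset, BIC penalizes extra predictors more heavily and tends to select smaller models.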
As another attempt, we’ll use exhaustive search to test every possible model and see if we can find a better one.
library(leaps)
housing_data_model_leaps = summary(regsubsets(latestPrice ~ ., data = housing_data_train))
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 2 linear dependencies found
## Reordering variables and trying again:
housing_data_model_leaps$rss
## [1] 64146841620808 56161407030534 52094506040565 48713182226940 46559324402781
## [6] 44319773668574 42659684991101 41238981978781 40360747739034
housing_data_model_leaps$adjr2
## [1] 0.2338 0.3290 0.3774 0.4176 0.4432 0.4698 0.4895 0.5063 0.5167
housing_data_model_leaps_r2_index = which.max(housing_data_model_leaps$adjr2)
housing_data_model_leaps$which[housing_data_model_leaps_r2_index,]
## (Intercept) zipcode
## TRUE TRUE
## latitude longitude
## FALSE FALSE
## propertyTaxRate hasAssociationTRUE
## TRUE TRUE
## hasCoolingTRUE hasGarageTRUE
## FALSE FALSE
## hasHeatingTRUE hasSpaTRUE
## FALSE FALSE
## hasViewTRUE homeTypeCondo
## FALSE FALSE
## homeTypeMobile / Manufactured homeTypeMultiFamily
## FALSE FALSE
## homeTypeMultiple Occupancy homeTypeOther
## FALSE FALSE
## homeTypeResidential homeTypeSingle Family
## FALSE FALSE
## homeTypeTownhouse homeTypeVacant Land
## FALSE FALSE
## parkingSpaces yearBuilt
## FALSE TRUE
## numPriceChanges latest_saleyear
## TRUE TRUE
## numOfAccessibilityFeatures numOfAppliances
## FALSE FALSE
## numOfParkingFeatures numOfPatioAndPorchFeatures
## FALSE FALSE
## numOfSecurityFeatures numOfWaterfrontFeatures
## FALSE FALSE
## numOfWindowFeatures numOfCommunityFeatures
## FALSE FALSE
## lotSizeSqFt livingAreaSqFt
## FALSE TRUE
## numOfPrimarySchools numOfElementarySchools
## FALSE FALSE
## numOfMiddleSchools numOfHighSchools
## FALSE FALSE
## avgSchoolDistance avgSchoolRating
## FALSE TRUE
## avgSchoolSize MedianStudentsPerTeacher
## TRUE FALSE
## numOfBathrooms numOfBedrooms
## FALSE FALSE
## numOfStories
## FALSE
housing_data_model_leaps_best = lm(
latestPrice ~ zipcode + propertyTaxRate + hasAssociation + latest_saleyear + yearBuilt + numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize +
livingAreaSqFt + avgSchoolRating ,
data = housing_data_train
)
anova(housing_data_model, housing_data_model_leaps_best)[2, "Pr(>F)"]
## [1] 6.387e-38
From the anova results, the p-value (6.4e-38) is extremely small, so the null hypothesis can be rejected: the full model explains significantly more variation than the leaps model. Even so, the exhaustive search identifies a compact core of predictors, and we’ll use it as the base for further model improvement techniques.
We’ll now build an interactive model.
housing_data_model_interaction = lm(
latestPrice ~ (
zipcode + propertyTaxRate + hasAssociation + yearBuilt +
numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize + livingAreaSqFt + latest_saleyear + avgSchoolRating
) ^ 2,
data = housing_data_train
)
summary(housing_data_model_interaction)
##
## Call:
## lm(formula = latestPrice ~ (zipcode + propertyTaxRate + hasAssociation +
## yearBuilt + numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize +
## livingAreaSqFt + latest_saleyear + avgSchoolRating)^2, data = housing_data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -510968 -63943 -5403 52983 497439
##
## Coefficients: (8 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 1.72e+09 1.62e+09 1.06
## zipcode 8.79e+03 1.86e+04 0.47
## propertyTaxRate 3.82e+08 5.78e+08 0.66
## hasAssociationTRUE 1.31e+07 2.88e+07 0.45
## yearBuilt -1.14e+06 6.06e+05 -1.88
## numPriceChanges 1.84e+07 4.12e+06 4.47
## numOfWaterfrontFeatures 2.26e+09 1.37e+09 1.65
## avgSchoolSize -2.83e+05 3.70e+04 -7.63
## livingAreaSqFt 8.38e+04 2.07e+04 4.05
## latest_saleyear -1.16e+06 3.08e+05 -3.76
## avgSchoolRating -8.76e+06 7.67e+06 -1.14
## zipcode:propertyTaxRate -5.38e+03 7.10e+03 -0.76
## zipcode:hasAssociationTRUE 3.74e+02 3.22e+02 1.16
## zipcode:yearBuilt -1.72e+00 6.64e+00 -0.26
## zipcode:numPriceChanges -1.12e+02 4.43e+01 -2.53
## zipcode:numOfWaterfrontFeatures -1.95e+04 1.14e+04 -1.71
## zipcode:avgSchoolSize 4.13e+00 3.89e-01 10.62
## zipcode:livingAreaSqFt -1.11e+00 2.33e-01 -4.77
## zipcode:latest_saleyear NA NA NA
## zipcode:avgSchoolRating 1.08e+02 8.39e+01 1.28
## propertyTaxRate:hasAssociationTRUE 1.75e+05 1.23e+05 1.43
## propertyTaxRate:yearBuilt 1.65e+04 4.57e+03 3.62
## propertyTaxRate:numPriceChanges -1.65e+03 1.59e+04 -0.10
## propertyTaxRate:numOfWaterfrontFeatures NA NA NA
## propertyTaxRate:avgSchoolSize 8.88e+02 2.83e+02 3.13
## propertyTaxRate:livingAreaSqFt -1.39e+02 5.97e+01 -2.33
## propertyTaxRate:latest_saleyear 4.14e+03 4.53e+04 0.09
## propertyTaxRate:avgSchoolRating -1.49e+05 4.56e+04 -3.28
## hasAssociationTRUE:yearBuilt 2.14e+03 3.19e+02 6.71
## hasAssociationTRUE:numPriceChanges 3.40e+03 2.32e+03 1.47
## hasAssociationTRUE:numOfWaterfrontFeatures NA NA NA
## hasAssociationTRUE:avgSchoolSize 1.03e+02 2.41e+01 4.25
## hasAssociationTRUE:livingAreaSqFt -8.07e+01 1.01e+01 -8.03
## hasAssociationTRUE:latest_saleyear -2.33e+04 6.70e+03 -3.49
## hasAssociationTRUE:avgSchoolRating -8.16e+03 4.60e+03 -1.77
## yearBuilt:numPriceChanges -1.84e+01 5.21e+01 -0.35
## yearBuilt:numOfWaterfrontFeatures -3.68e+05 2.45e+05 -1.50
## yearBuilt:avgSchoolSize -4.88e+00 6.16e-01 -7.93
## yearBuilt:livingAreaSqFt -2.95e-02 2.17e-01 -0.14
## yearBuilt:latest_saleyear 6.15e+02 1.52e+02 4.04
## yearBuilt:avgSchoolRating 2.15e+02 1.07e+02 2.01
## numPriceChanges:numOfWaterfrontFeatures NA NA NA
## numPriceChanges:avgSchoolSize 4.84e+00 3.47e+00 1.39
## numPriceChanges:livingAreaSqFt -4.58e+00 1.31e+00 -3.49
## numPriceChanges:latest_saleyear -4.73e+03 9.73e+02 -4.86
## numPriceChanges:avgSchoolRating -8.47e+02 6.11e+02 -1.39
## numOfWaterfrontFeatures:avgSchoolSize NA NA NA
## numOfWaterfrontFeatures:livingAreaSqFt NA NA NA
## numOfWaterfrontFeatures:latest_saleyear NA NA NA
## numOfWaterfrontFeatures:avgSchoolRating NA NA NA
## avgSchoolSize:livingAreaSqFt 4.66e-02 1.63e-02 2.86
## avgSchoolSize:latest_saleyear -1.72e+01 1.01e+01 -1.71
## avgSchoolSize:avgSchoolRating -3.30e+01 5.25e+00 -6.29
## livingAreaSqFt:latest_saleyear 1.96e+00 4.31e+00 0.45
## livingAreaSqFt:avgSchoolRating 3.64e+00 2.33e+00 1.57
## latest_saleyear:avgSchoolRating 1.14e+02 1.85e+03 0.06
## Pr(>|t|)
## (Intercept) 0.28924
## zipcode 0.63615
## propertyTaxRate 0.50897
## hasAssociationTRUE 0.64934
## yearBuilt 0.06066 .
## numPriceChanges 8.0e-06 ***
## numOfWaterfrontFeatures 0.09805 .
## avgSchoolSize 3.1e-14 ***
## livingAreaSqFt 5.3e-05 ***
## latest_saleyear 0.00017 ***
## avgSchoolRating 0.25351
## zipcode:propertyTaxRate 0.44863
## zipcode:hasAssociationTRUE 0.24508
## zipcode:yearBuilt 0.79573
## zipcode:numPriceChanges 0.01145 *
## zipcode:numOfWaterfrontFeatures 0.08732 .
## zipcode:avgSchoolSize < 2e-16 ***
## zipcode:livingAreaSqFt 1.9e-06 ***
## zipcode:latest_saleyear NA
## zipcode:avgSchoolRating 0.19980
## propertyTaxRate:hasAssociationTRUE 0.15236
## propertyTaxRate:yearBuilt 0.00030 ***
## propertyTaxRate:numPriceChanges 0.91715
## propertyTaxRate:numOfWaterfrontFeatures NA
## propertyTaxRate:avgSchoolSize 0.00174 **
## propertyTaxRate:livingAreaSqFt 0.01970 *
## propertyTaxRate:latest_saleyear 0.92709
## propertyTaxRate:avgSchoolRating 0.00106 **
## hasAssociationTRUE:yearBuilt 2.4e-11 ***
## hasAssociationTRUE:numPriceChanges 0.14216
## hasAssociationTRUE:numOfWaterfrontFeatures NA
## hasAssociationTRUE:avgSchoolSize 2.2e-05 ***
## hasAssociationTRUE:livingAreaSqFt 1.4e-15 ***
## hasAssociationTRUE:latest_saleyear 0.00050 ***
## hasAssociationTRUE:avgSchoolRating 0.07625 .
## yearBuilt:numPriceChanges 0.72358
## yearBuilt:numOfWaterfrontFeatures 0.13368
## yearBuilt:avgSchoolSize 3.0e-15 ***
## yearBuilt:livingAreaSqFt 0.89196
## yearBuilt:latest_saleyear 5.5e-05 ***
## yearBuilt:avgSchoolRating 0.04437 *
## numPriceChanges:numOfWaterfrontFeatures NA
## numPriceChanges:avgSchoolSize 0.16341
## numPriceChanges:livingAreaSqFt 0.00049 ***
## numPriceChanges:latest_saleyear 1.2e-06 ***
## numPriceChanges:avgSchoolRating 0.16570
## numOfWaterfrontFeatures:avgSchoolSize NA
## numOfWaterfrontFeatures:livingAreaSqFt NA
## numOfWaterfrontFeatures:latest_saleyear NA
## numOfWaterfrontFeatures:avgSchoolRating NA
## avgSchoolSize:livingAreaSqFt 0.00421 **
## avgSchoolSize:latest_saleyear 0.08695 .
## avgSchoolSize:avgSchoolRating 3.7e-10 ***
## livingAreaSqFt:latest_saleyear 0.65022
## livingAreaSqFt:avgSchoolRating 0.11759
## latest_saleyear:avgSchoolRating 0.95072
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106000 on 3070 degrees of freedom
## Multiple R-squared: 0.59, Adjusted R-squared: 0.584
## F-statistic: 94 on 47 and 3070 DF, p-value: <2e-16
length(coef(housing_data_model_interaction))
## [1] 56
summary(housing_data_model_interaction)$r.squared
## [1] 0.5899
summary(housing_data_model_interaction)$adj.r.squared
## [1] 0.5836
From the \(R^2\) value, about 59% of the variation is explained by this model, which has 56 coefficients. The adjusted \(R^2\) is 0.5836. This model is preferred over the additive model.
Let’s compare the interactive model with the original model:
anova(housing_data_model, housing_data_model_interaction)[2, "Pr(>F)"]
## [1] 1.356e-46
With a p-value of 1.356e-46, we can reject the null hypothesis and choose this model.
Similarly to what we did with the additive model, we’ll investigate how well backwards AIC and BIC performs in the interactive model.
housing_data_model_interaction_aic = step(housing_data_model_interaction,
direction = "backward",
trace = 0)
extractAIC(housing_data_model_interaction_aic) # returns both p and AIC
## [1] 36 72174
housing_data_model_interaction_aic
##
## Call:
## lm(formula = latestPrice ~ zipcode + propertyTaxRate + hasAssociation +
## yearBuilt + numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize +
## livingAreaSqFt + latest_saleyear + avgSchoolRating + zipcode:numPriceChanges +
## zipcode:numOfWaterfrontFeatures + zipcode:avgSchoolSize +
## zipcode:livingAreaSqFt + propertyTaxRate:hasAssociation +
## propertyTaxRate:yearBuilt + propertyTaxRate:avgSchoolSize +
## propertyTaxRate:livingAreaSqFt + propertyTaxRate:avgSchoolRating +
## hasAssociation:yearBuilt + hasAssociation:numPriceChanges +
## hasAssociation:avgSchoolSize + hasAssociation:livingAreaSqFt +
## hasAssociation:latest_saleyear + hasAssociation:avgSchoolRating +
## yearBuilt:numOfWaterfrontFeatures + yearBuilt:avgSchoolSize +
## yearBuilt:latest_saleyear + yearBuilt:avgSchoolRating + numPriceChanges:livingAreaSqFt +
## numPriceChanges:latest_saleyear + avgSchoolSize:livingAreaSqFt +
## avgSchoolSize:latest_saleyear + avgSchoolSize:avgSchoolRating +
## livingAreaSqFt:avgSchoolRating, data = housing_data_train)
##
## Coefficients:
## (Intercept) zipcode
## 2.83e+09 -4.95e+03
## propertyTaxRate hasAssociationTRUE
## -3.50e+07 4.20e+07
## yearBuilt numPriceChanges
## -1.30e+06 1.78e+07
## numOfWaterfrontFeatures avgSchoolSize
## 2.20e+09 -2.99e+05
## livingAreaSqFt latest_saleyear
## 8.04e+04 -1.18e+06
## avgSchoolRating zipcode:numPriceChanges
## -5.30e+04 -1.07e+02
## zipcode:numOfWaterfrontFeatures zipcode:avgSchoolSize
## -1.90e+04 4.28e+00
## zipcode:livingAreaSqFt propertyTaxRate:hasAssociationTRUE
## -1.02e+00 1.83e+05
## propertyTaxRate:yearBuilt propertyTaxRate:avgSchoolSize
## 1.72e+04 8.84e+02
## propertyTaxRate:livingAreaSqFt propertyTaxRate:avgSchoolRating
## -1.27e+02 -1.50e+05
## hasAssociationTRUE:yearBuilt hasAssociationTRUE:numPriceChanges
## 2.15e+03 3.64e+03
## hasAssociationTRUE:avgSchoolSize hasAssociationTRUE:livingAreaSqFt
## 1.11e+02 -8.24e+01
## hasAssociationTRUE:latest_saleyear hasAssociationTRUE:avgSchoolRating
## -2.31e+04 -8.70e+03
## yearBuilt:numOfWaterfrontFeatures yearBuilt:avgSchoolSize
## -3.57e+05 -4.98e+00
## yearBuilt:latest_saleyear yearBuilt:avgSchoolRating
## 6.27e+02 2.13e+02
## numPriceChanges:livingAreaSqFt numPriceChanges:latest_saleyear
## -5.02e+00 -4.62e+03
## avgSchoolSize:livingAreaSqFt avgSchoolSize:latest_saleyear
## 4.67e-02 -1.51e+01
## avgSchoolSize:avgSchoolRating livingAreaSqFt:avgSchoolRating
## -3.09e+01 3.23e+00
summary(housing_data_model_interaction_aic)$adj.r.squared
## [1] 0.5844
housing_data_model_interaction_bic = step(
housing_data_model_interaction,
direction = "backward",
trace = 0,
k = log(nrow(housing_data_numerical))
)
extractAIC(housing_data_model_interaction_bic) # returns both p and AIC
## [1] 25 72189
summary(housing_data_model_interaction_bic)$adj.r.squared
## [1] 0.581
The adjusted \(R^2\) values are 0.5844 for the AIC model, and 0.581 for the BIC model. Both of them are superior to the additive model and the original model. The interaction model with backwards AIC (housing_data_model_interaction_aic) is the preferred model so far.
To perform model diagnostics, we’ll define a helper function that shows the Fitted versus Residuals plot, the Normal Q-Q Plot, and the Histogram of Residuals, and prints the result of the Shapiro-Wilk test for assessing the normality of errors. We’ll run the Breusch-Pagan test for constant variance separately.
diagnostics = function (model) {
par(mfrow = c(1, 3))
plot(
fitted(model),
resid(model),
pch = 20,
xlab = "Fitted Values",
ylab = "Residuals",
main = "Fitted vs Residuals",
col = "grey"
)
abline(h = 0, lwd = 2, col = "orange")
qqnorm(resid(model),
pch = 20,
main = "Normal Q-Q Plot",
col = "grey")
qqline(resid(model), lwd = 2, col = "orange")
hist(
resid(model),
main = "Histogram of Residuals",
col = "orange",
xlab = "Residuals",
ylab = "Frequency"
)
library(lmtest)
# Only the last expression's value is returned; bptest() is run separately below
shapiro.test(resid(model))
}
Having defined the function, let’s visualize the plots for the chosen model housing_data_model_interaction_aic:
diagnostics(housing_data_model_interaction_aic)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.97, p-value <2e-16
The Fitted versus Residuals plot shows residuals spread away from zero on the order of 200,000 across many fitted values. The Q-Q Plot and the Histogram of Residuals show data points away from the line between the -2 and 1 theoretical quantiles. This is a suspect Q-Q plot, leading us to believe that the errors do not follow a normal distribution.
To try to identify the issues shown in the model diagnostics, we’ll look for influential observations that have a large effect on the regression. To measure this, we’ll use Cook’s distance.
cooksd = cooks.distance(housing_data_model_interaction_aic)
plot(cooksd,
pch = "*",
cex = 2,
main = "Influential Observations by Cooks distance") # plot cook's distance
abline(h = 2 * mean(cooksd, na.rm = T), col = "black") # add cutoff line
text(
x = 1:length(cooksd) + 1,
y = cooksd,
labels = ifelse(cooksd > 2 * mean(cooksd, na.rm = T), names(cooksd), ""),
col = "red"
) # add labels
Now that we’ve identified the influential observations, with the distances stored in the cooksd variable, we’ll build a new model without these outliers and run diagnostics again.
housing_data_model_interaction_aic_without_outliers = lm(
latestPrice ~ zipcode + propertyTaxRate + hasAssociation +
yearBuilt + numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize +
livingAreaSqFt + latest_saleyear + avgSchoolRating + zipcode:numPriceChanges +
zipcode:numOfWaterfrontFeatures + zipcode:avgSchoolSize +
zipcode:livingAreaSqFt + propertyTaxRate:hasAssociation +
propertyTaxRate:yearBuilt + propertyTaxRate:avgSchoolSize +
propertyTaxRate:livingAreaSqFt + propertyTaxRate:avgSchoolRating +
hasAssociation:yearBuilt + hasAssociation:numPriceChanges +
hasAssociation:avgSchoolSize + hasAssociation:livingAreaSqFt +
hasAssociation:latest_saleyear + hasAssociation:avgSchoolRating +
yearBuilt:numOfWaterfrontFeatures + yearBuilt:avgSchoolSize +
yearBuilt:latest_saleyear + yearBuilt:avgSchoolRating + numPriceChanges:livingAreaSqFt +
numPriceChanges:latest_saleyear + avgSchoolSize:livingAreaSqFt +
avgSchoolSize:latest_saleyear + avgSchoolSize:avgSchoolRating +
livingAreaSqFt:avgSchoolRating,
data = housing_data_train,
subset = cooksd < 2 * mean(cooksd, na.rm = T)
)
diagnostics(housing_data_model_interaction_aic_without_outliers)
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 1, p-value = 0.00009
bptest(housing_data_model_interaction_aic_without_outliers)
##
## studentized Breusch-Pagan test
##
## data: housing_data_model_interaction_aic_without_outliers
## BP = 322, df = 32, p-value <2e-16
The Fitted versus Residuals plot still shows residuals spread away from zero on the order of hundreds of thousands, but at half the distance from the mean when compared to the previous model. Also, the Q-Q Plot and the Histogram of Residuals show data points much closer to the line; the errors are now approximately normal, although the Shapiro-Wilk test (p-value = 0.00009) still formally rejects normality at \(\alpha = 0.05\).
Finally, we’ll use a Box-Cox transformation on our model to improve the constant variance.
library(MASS)
boxcox(
housing_data_model_interaction_aic_without_outliers,
plotit = TRUE,
lambda = seq(0, 1, by = 0.05)
)
housing_data_model_interaction_aic_without_outliers = lm((((latestPrice ^ 0.5) - 1) / 0.5) ~ zipcode + propertyTaxRate + hasAssociation + yearBuilt + numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize + livingAreaSqFt + latest_saleyear + avgSchoolRating + zipcode:numPriceChanges + zipcode:numOfWaterfrontFeatures + zipcode:avgSchoolSize + zipcode:livingAreaSqFt + propertyTaxRate:hasAssociation + propertyTaxRate:yearBuilt + propertyTaxRate:avgSchoolSize + propertyTaxRate:livingAreaSqFt + propertyTaxRate:avgSchoolRating + hasAssociation:yearBuilt + hasAssociation:numPriceChanges + hasAssociation:avgSchoolSize + hasAssociation:livingAreaSqFt + hasAssociation:latest_saleyear + hasAssociation:avgSchoolRating + yearBuilt:numOfWaterfrontFeatures + yearBuilt:avgSchoolSize + yearBuilt:latest_saleyear + yearBuilt:avgSchoolRating + numPriceChanges:livingAreaSqFt + numPriceChanges:latest_saleyear + avgSchoolSize:livingAreaSqFt + avgSchoolSize:latest_saleyear + avgSchoolSize:avgSchoolRating + livingAreaSqFt:avgSchoolRating,
data = housing_data_train,
subset = cooksd < 2 * mean(cooksd, na.rm = T)
)
diagnostics(housing_data_model_interaction_aic_without_outliers)
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 1, p-value = 0.04
bptest(housing_data_model_interaction_aic_without_outliers)
##
## studentized Breusch-Pagan test
##
## data: housing_data_model_interaction_aic_without_outliers
## BP = 202, df = 32, p-value <2e-16
As seen in the Box-Cox plot, the estimated \(\lambda\) is centered around 0.43. We tried exponents from 0.43 to 0.50 in the transformation and obtained the best model at 0.50. The Shapiro-Wilk p-value improved to 0.04, which is still slightly below 0.05, and the Breusch-Pagan test still shows a low p-value, indicating that some non-constant variance remains. Since the Fitted versus Residuals plot improved substantially relative to the original, we are choosing the interaction model with backwards AIC without outliers (housing_data_model_interaction_aic_without_outliers), with the Box-Cox transformation applied, as our final model.
Lastly, we will calculate the error and noise with the final model for both the train and test data frames. We need to apply the reverse transformation (undoing the Box-Cox transformation applied to the model) to the sigma of the model to get the final error value.
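As a reminder of the transformation being inverted: for \(\lambda = 0.5\) the Box-Cox transform is \(z = (y^{0.5} - 1) / 0.5\), and its exact algebraic inverse is \(y = (0.5z + 1)^2\). A minimal round-trip sketch (the example price is illustrative):

```r
# Box-Cox transform with lambda = 0.5 and its exact inverse
lambda = 0.5
bc = function(y) (y^lambda - 1) / lambda
bc_inv = function(z) (lambda * z + 1)^(1 / lambda)

price = 450000  # illustrative sale price
all.equal(bc_inv(bc(price)), price)  # TRUE
```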
For illustration, here’s the variance for the original model (additive):
sigma(lm(latestPrice ~ ., data = housing_data_train))
## [1] 109900
sigma(lm(latestPrice ~ ., data = housing_data_test))
## [1] 110702
And the error obtained from the chosen model:
error_raw = sigma(housing_data_model_interaction_aic_without_outliers)
error = (error_raw ^2 + 1) * 0.5
error
## [1] 7427
This much smaller error value suggests improved accuracy, though note that it is computed on the Box-Cox-transformed scale.
housing_data_model_interaction_aic_without_outliers = lm((((latestPrice ^ 0.5) - 1) / 0.5) ~ zipcode + propertyTaxRate + hasAssociation + yearBuilt + numPriceChanges + numOfWaterfrontFeatures + avgSchoolSize + livingAreaSqFt + latest_saleyear + avgSchoolRating + zipcode:numPriceChanges + zipcode:numOfWaterfrontFeatures + zipcode:avgSchoolSize + zipcode:livingAreaSqFt + propertyTaxRate:hasAssociation + propertyTaxRate:yearBuilt + propertyTaxRate:avgSchoolSize + propertyTaxRate:livingAreaSqFt + propertyTaxRate:avgSchoolRating + hasAssociation:yearBuilt + hasAssociation:numPriceChanges + hasAssociation:avgSchoolSize + hasAssociation:livingAreaSqFt + hasAssociation:latest_saleyear + hasAssociation:avgSchoolRating + yearBuilt:numOfWaterfrontFeatures + yearBuilt:avgSchoolSize + yearBuilt:latest_saleyear + yearBuilt:avgSchoolRating + numPriceChanges:livingAreaSqFt + numPriceChanges:latest_saleyear + avgSchoolSize:livingAreaSqFt + avgSchoolSize:latest_saleyear + avgSchoolSize:avgSchoolRating + livingAreaSqFt:avgSchoolRating,
data = housing_data_test,
subset = cooksd < 2 * mean(cooksd, na.rm = T)
)
error_raw = sigma(housing_data_model_interaction_aic_without_outliers)
error = (error_raw ^ 2 + 1) * 0.5
error
## [1] 13722
RMSE errors for the train and test data (respectively):
library(Metrics)
predictions_train = predict(housing_data_model_interaction_aic_without_outliers,
housing_data_train)
error_1 = rmse(((housing_data_train$latestPrice ^ 0.5 - 1) / 0.5), predictions_train)
(error_1 ^ 2 + 1) * 0.5
## [1] 13104
predictions_test = predict(housing_data_model_interaction_aic_without_outliers,
housing_data_test)
error_1 = rmse(((housing_data_test$latestPrice ^ 0.5 - 1) / 0.5), predictions_test)
(error_1 ^ 2 + 1) * 0.5
## [1] 13576
Through this project, we’ve built a model to help home buyers predict house prices in Austin, TX, and surrounding cities and towns. Using data obtained from house sale listings on zillow.com, we were able to produce a clean dataset, verify the predictors for relevance and collinearity, and adjust the dataset based on the findings of this analysis.
Model building considered several techniques, including additive and interaction models, and the use of backwards AIC and BIC to find an optimal model. Having identified the model of choice, the interaction model with backwards AIC without outliers, we performed diagnostics and fine-tuning using an analysis of influential observations to reduce errors and increase the accuracy of the model.
Our team is formed by the following individuals:
Jagadeesh Kedarisetty (jk64)
Nilesh Bhandarwar (nileshb2)
Peri Rocha (procha2)
The following libraries were used in the creation of this report:
- ggmap
D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2.
The R Journal, 5(1), 144-161. URL
http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
- ggplot2
H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
Springer-Verlag New York, 2016.
- faraway
Julian Faraway (2016). faraway: Functions and Datasets for Books
by Julian Faraway. R package version 1.0.7.
https://CRAN.R-project.org/package=faraway
- MASS
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York.
ISBN 0-387-95457-0
- lmtest
Achim Zeileis, Torsten Hothorn (2002). Diagnostic Checking in
Regression Relationships. R News 2(3), 7-10. URL
https://CRAN.R-project.org/doc/Rnews/
- leaps
Thomas Lumley based on Fortran code by Alan Miller (2020). leaps: Regression Subset Selection. R package version 3.1. https://CRAN.R-project.org/package=leaps
- Metrics
Ben Hamner and Michael Frasco (2018). Metrics: Evaluation Metrics for Machine Learning. R package version 0.1.4.
https://CRAN.R-project.org/package=Metrics
“How To Succeed As A First-Time Home Buyer In Today’s Market” (https://www.forbes.com/sites/forbesrealestatecouncil/2021/07/19/how-to-succeed-as-a-first-time-home-buyer-in-todays-market/?sh=79e0d37f19f8)↩︎
“Your 4 Most Important Financial Decisions: #1 – The House Purchase” (https://www.retirementstewardship.com/2016/05/28/4-important-financial-decisions-1-house-purchase/)↩︎
“Why hot-desking is a terrible idea” (https://www.msn.com/en-us/lifestyle/career/why-hot-desking-is-a-terrible-idea/ar-AAMjgTM?ocid=BingNewsSearch)↩︎
Kaggle dataset: “Austin, TX House Listings - Features and Images scraped in January 2021” (https://www.kaggle.com/ericpierce/austinhousingprices, austinHousingData.csv).↩︎