2 Importing & wrangling

2.1 HDB data

#Import all files downloaded from HDB. The files' name start with "resale...")
filenames <- list.files(pattern="^resale")

data <- tibble()

for(i in 1:length(filenames)){
  df <- read_csv(filenames[i])
  if ("remaining_lease" %in% colnames(df)) {
    df <- df %>% 
      separate(col = month, into = c("year", "month"), sep = "\\-") %>% 
      mutate(year=as.integer(year), month=as.integer(month)) %>% 
      mutate(remaining_lease=as.integer(str_extract(remaining_lease, "\\d{2}"))) #to standardize year only
  }
  else {
    df <- df %>% 
      separate(col = month, into = c("year", "month"), sep = "\\-") %>% 
      mutate(year=as.integer(year), month=as.integer(month)) %>% 
      mutate(remaining_lease = 99 -(year-as.integer(lease_commence_date))) 
  }
  
  data <- bind_rows(data, df)
  
}

#create column for address to get geocode
data <- data %>% 
  mutate(address=paste(block,street_name, "Singapore", sep = " "))

unique_add <- data %>% 
  distinct(address) %>% 
  ggmap::mutate_geocode(address)

# check for addresses not geocoded
incomplete_addresses <- unique_add %>% 
  filter(is.na(lat) | is.na(lon)) %>%
  select(-c(lat, lon)) %>%
  mutate(address2 = str_replace(address, "C'WEALTH", "COMMONWEALTH")) %>%
  mutate_geocode(address2) %>%
  select(-address2)

# checks for any errors in geocodinng
incorrect_addresses <- unique_add %>% 
  filter(lat < 0 | lat > 2 | lon < 102 | lon > 108) %>% 
  select(-c(lat, lon)) %>%
  left_join(data, by="address") %>% 
  distinct(address, .keep_all = TRUE) %>%
  select(c(town,address)) %>% 
  mutate(address2 = paste("\\'",address, town,"\\'")) %>%
  mutate_geocode(address2) %>%
  select(-c(address2,town))

# merge all address with geocodes
unique_add <- unique_add %>% 
  filter(!(lat < 0 | lat > 2 | lon < 102 | lon > 108)) %>%
  filter(!is.na(lat) & !is.na(lon)) %>%
  bind_rows(incomplete_addresses) %>%
  bind_rows(incorrect_addresses) 


#combine original data with geocodes

data2 <- left_join(data,unique_add, by ="address")

saveRDS(data2, "data.RDS")

2.5 Combining and wrangling all datasets

Following are excluded from the final dataset:

  • Outliers based on resale price
  • Entries without flat type information
  • Entries with town = Lim Chu Kang

2.6 Sample data

Quick peek on the dataset:

year month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price remaining_lease address lon lat mrt mrt_dist near_mall mall_dist near_school school_dist age resale_clean
793934 2019 10 YISHUN 5R 850 YISHUN ST 81 04 TO 06 128 Improved 1988 540000 67 850 YISHUN ST 81 Singapore 103.8360 1.416346 KHATIB MRT STATION 0.3509089 WISTERIA MALL 0.6149974 ORCHID PARK SECONDARY SCHOOL 0.2932485 31 540000
793935 2019 10 YISHUN 5R 874 YISHUN ST 81 01 TO 03 122 Improved 1988 415000 67 874 YISHUN ST 81 Singapore 103.8363 1.415277 KHATIB MRT STATION 0.4408910 WISTERIA MALL 0.6221038 ORCHID PARK SECONDARY SCHOOL 0.2268274 31 415000
793936 2019 10 YISHUN 5R 820 YISHUN ST 81 10 TO 12 122 Improved 1988 520000 67 820 YISHUN ST 81 Singapore 103.8337 1.413093 KHATIB MRT STATION 0.4814526 WISTERIA MALL 0.9972455 ORCHID PARK SECONDARY SCHOOL 0.5737060 31 520000
793937 2019 10 YISHUN 6R 344 YISHUN AVE 11 01 TO 03 146 Maisonette 1988 505000 67 344 YISHUN AVE 11 Singapore 103.8444 1.429038 YISHUN MRT STATION 1.0446595 JUNCTION NINE 0.5393704 HUAMIN PRIMARY SCHOOL 0.1334673 31 505000
793938 2019 10 YISHUN 6R 791 YISHUN AVE 2 04 TO 06 146 Maisonette 1987 550000 66 791 YISHUN AVE 2 Singapore 103.8336 1.420502 KHATIB MRT STATION 0.3522745 NORTHPOINT CITY 0.8750608 NAVAL BASE SECONDARY SCHOOL 0.3207240 32 550000
793939 2019 10 YISHUN 6R 406 YISHUN AVE 6 01 TO 03 148 Maisonette 1988 566000 68 406 YISHUN AVE 6 Singapore 103.8493 1.425908 YISHUN MRT STATION 1.6379015 JUNCTION NINE 1.1636621 NORTH VIEW PRIMARY SCHOOL 0.2169060 31 566000
793940 2019 10 YISHUN 6R 324 YISHUN CTRL 04 TO 06 142 Apartment 1988 510000 67 324 YISHUN CTRL Singapore 103.8419 1.428448 YISHUN MRT STATION 0.7718536 JUNCTION NINE 0.5077439 HUAMIN PRIMARY SCHOOL 0.3236775 31 510000
793941 2019 10 YISHUN 6R 326 YISHUN RING RD 01 TO 03 146 Maisonette 1988 530000 67 326 YISHUN RING RD Singapore 103.8424 1.429692 YISHUN MRT STATION 0.8249148 JUNCTION NINE 0.3802179 HUAMIN PRIMARY SCHOOL 0.3258902 31 530000
793942 2019 10 YISHUN 6R 606 YISHUN ST 61 10 TO 12 146 Maisonette 1987 638000 67 606 YISHUN ST 61 Singapore 103.8362 1.421362 KHATIB MRT STATION 0.5649737 WISTERIA MALL 0.6758045 CHUNG CHENG HIGH SCHOOL (YISHUN) 0.3169986 32 638000
793943 2019 10 YISHUN 6R 824 YISHUN ST 81 04 TO 06 146 Maisonette 1987 628000 67 824 YISHUN ST 81 Singapore 103.8333 1.413682 KHATIB MRT STATION 0.4104465 WISTERIA MALL 1.0074276 PEIYING PRIMARY SCHOOL 0.5324027 32 628000

3 Data Visualization

Overall, the number of HDB resale transactions surged during the Asian financial crisis (1998-1999) and US financial crisis (2008-2009).

3.1 Overview at national level

## NULL

The HDB resale price increases almost 4 times (without taking into account inflation) since 1990. It peaked in 2013, and due to several cooling measures implemented by the government, it then decreased and subsequently stabilized.

3.2 Overview based on HDB town

You can double click on the town name to filter on a particular town.

3.3 Overview dashboard

The dashboard can also be accessed from https://public.tableau.com/profile/irine#!/vizhome/hdbv2/Overview.

3.4 Recommender dashboard

The following dashboard aims to help potential buyers to identify their “dream flat” given their budget, flat types, town and the proximity of the flat to important locations, such as the nearest MRT station, school and shopping mall. The dashboard can also be accessed from https://public.tableau.com/profile/irine#!/vizhome/hdbv2/Recommender.

4 Data Modelling

4.1 Model HDB resale price in 2014

For this data modelling, we will use data from 2014 for our training and testing.

Before performing modelling, let’s check the following:

  • The HDB resale price are normally distributed with slightly longer right tail

  • There is a relationship between HDB resale price with age, town and flat type

var1 var2 estimate n p.value
age mrt_dist 0.1871008 15153 0.0000000
age mall_dist 0.3076847 15153 0.0000000
mrt_dist mall_dist 0.2218621 15153 0.0000000
age school_dist 0.2184335 15153 0.0000000
mrt_dist school_dist 0.0583201 15153 0.0000000
mall_dist school_dist 0.1271302 15153 0.0000000
age resale_clean -0.4567996 15153 0.0000000
mrt_dist resale_clean -0.1048230 15153 0.0000000
mall_dist resale_clean -0.1923928 15153 0.0000000
school_dist resale_clean -0.0229855 15153 0.0046607

Next, we split the dataset between training set (80% of the data) and testing set (20% of the data).

4.1.2 Additional machine learning algorithms

Two additional machine learning algorithms will be explored to further improve the accuracy and error in predicting HDB resale price. Bootstrapping will also be implemented.

  • Gradient Boosting Machines (Boosting)

  • Random forest

Next, let’s test and evaluate their performance.

.metric .estimate ml_method
rmse 62873.3071235 Boosting
rsq 0.6114637 Boosting
mae 45878.5657760 Boosting
rmse 44884.5442618 Random Forest
rsq 0.8067504 Random Forest
mae 33762.9863738 Random Forest

Lastly, let’s visualize the results.

Random Forest led to the most accurate prediction. It has the highest accuracy (R Square) and lowest error (MAE and RMSE).

4.2 Is this flat overpriced?

A flat was sold in Nov 2017 with the following characteristics:

  • Flat type: 4 ROOM
  • Town: Yishun
  • Flat Model: New Generation
  • Storey Range: 10 to 12
  • Floor Area (sqm): 91
  • Lease Commence Date: 1984
  • Resale Price: 550,800

YES, it is overpriced based on resale transactions in 2017 for 4RM flats within the age of 30 - 35 years, located at 10 to 12 storey, with floor area of 90 - 100 sqm. The price paid $ 550,800 is above the maximum price paid for Yishun.

4.3 Predict flat types using other characteristics

Decision tree is adopted to predict flat types based on other characteristics of the flats, such as: floor area sqm, flat model, town and age. Based on the tree below, only floor area sqm and flat model are important. The following model accuracy in predicting flat type is above 96%.

## [1] 0.9653465

5 Policy Analysis

5.1 Are Yishun flats the cheapest?

Yishun has received a negative reputation as a place to stay due to various incidents involving - cat killings, murders, etc. from the past 5 years. Therefore, the following analysis is limited to 2014 to 2018 period.

Based on the median resale price, Yishun flats are the cheapest in Singapore together with Toa Payoh flats. However looking at the price per sqm, Yishun flats are not the cheapest.

5.2 Flat sizes have gotten smaller over the years?

Based on the chart below, majority of the flat size in Singapore have remained the same over the years, except for 7R (i.e. Multi-Generation flats).

5.3 Did the opening of DTL increase the price of HDB flats serviced by this line?

The construction of DTL started in 2009. The 1st stage of DTL started operation in 2013, followed by 2nd stage in 2015 and the last stage in 2017. In this model, the construction is adopted as the time when treatment started.

To use this model, two dummy variables must be created:

  1. A dummy variable called “nearby” to identify the control vs treatment group
  • Treatment group (value = 1): flats that would benefit from the opening of DTL (i.e. flats located within 1KM distance of DTL MRT stations)

  • Control group (value = 0: flats that would not be affected bythe opening of DTL (i.e. flats located more than 1KM distance of DTL MRT stations)

  1. A dummy variable called after_dt to indicate the time when the construction started
  • Value = 1: flats transcated 2010 onwards

  • Value = 0: flats transcated prior to 2010

Lastly, another variable called did is created for the interaction between time and treatment effect.

The following model estimates the did estimator:

## 
## Call:
## lm(formula = resale_clean ~ after_dt + nearby + did, data = data_dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -279621  -80621  -13636   68864  441364 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   329621       1240 265.802 <0.0000000000000002 ***
## after_dt      139839       2558  54.678 <0.0000000000000002 ***
## nearby        -90986       1296 -70.194 <0.0000000000000002 ***
## did            25464       2691   9.462 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 111600 on 120351 degrees of freedom
## Multiple R-squared:  0.2863, Adjusted R-squared:  0.2863 
## F-statistic: 1.609e+04 on 3 and 120351 DF,  p-value: < 0.00000000000000022

The coefficient for did is the differences-in-differences estimator. Both the did and after_dt coefficients are statistically significant and have positive effect. While the treatment variable nearby has a negative effect to the price increment, but also statistically significant.

Based on the time series chart above, the price of HDB flats serviced by the line showed a steady increase from 2010 onwards (i.e. construction stage of DTL) regardless of the proximity to the MRTs. Therefore, it can be concluded that the opening of DTL does influence the price of flats serviced by DTL.

5.4 Are resale prices in HDB estates in areas further away from the city (i.e. Sengkang and Punggol) impacted by Certificate of Entitlement (COE) prices for cars?

In this analysis, distance from the flat to Raffles City MRT station is used as a proxy to estimate the distance from the flat to CBD (or city).

Based on the statistics summary, let’s create five levels of distance to CBD in KM: 0 to 4, 4 to 8, 8 to 12, 12 to 16, and more than 16 KM.

Now, let’s compare the median COE premium VS median HDB resale price over the years.

The original assumption is that when COE premium dropped, people would be buying flats further from the city so that the cost savings can be used for a car. As such, the price of flats further away from city would increase.

Looking at the price trends of HDB located within 12 - 16 KM from CBD (i.e. Punggol or Sengkang), when the COE premium dropped between 2013 to 2018, the resale price did not really increase. The resale price for flats within 12 - 16 KM actually dropped between 2013 to 2015 before stabilizing. However, the number of transactions for HDB located within 12 - 16 KM from CBD within this period (2013 to 2018) did increase.

The resale prices in HDB estates in areas further away from the city (i.e. Sengkang and Punggol) does not seem to be impacted by COE premium, but the number of transactions seems to be impacted.

#References