Play with Airbnb - South Aegean Greek Islands (Part 2)
September 8, 2019
R, data science, data process, visualization
From now on, I'm going to build a model that predicts the price of the Airbnb properties in the South Aegean region.
Libraries used
library(tidyverse)
library(stringr)                # string manipulation (str_extract, str_replace_all)
import::from(magrittr, "%<>%")  # compound assignment pipe
library(Hmisc)                  # rcorr() for correlation matrices
library(ggcorrplot)             # correlation heatmap
library(lubridate)              # time_length() for time spans
Data source
I recycled the same tables I created in Part 1.
# recycle from 1_exploratory
load("./data/parsed_data.Rdata")
Data processing
Before I can construct a prediction model, the first thing I have to do is check and clean the data. This step is crucial; otherwise you could end up with 1) no model that can be constructed at all, or even worse, 2) a wrong model that makes no sense.
Continuous variables
I started with the numeric columns. First, I extracted them as a separate table:
# extract numeric columns
listing_numeric <-
listing %>%
# select_if(is.numeric) # this shorthand cannot be negated directly (!is.numeric)
select_if(function(col) is.numeric(col))
Notice that in the R code, select_if(is.numeric) works exactly the same as select_if(function(col) is.numeric(col)); they both select only the numeric columns. Nevertheless, if you want to select by a more sophisticated criterion, say !is.numeric or is.numeric combined with another condition, you need the latter form; e.g., select_if(function(col) !is.numeric(col)).
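As an illustration, here is a minimal sketch of such a combined criterion (the requirement of more than one distinct value is an arbitrary choice for demonstration):

# hypothetical example: keep only numeric columns that are not constant
listing %>%
  select_if(function(col) is.numeric(col) && n_distinct(col) > 1)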
correlation
Then I checked the correlation between each pair of them:
# check correlation
corr_matrix <-
listing_numeric %>%
as.matrix() %>%
rcorr()
corr_matrix$r %>%
ggcorrplot() +
theme(axis.text.x = element_text(size=5),
axis.text.y = element_text(size=5))
When I checked the pairs with high correlation (shown in red), it seems that price is positively correlated with square_feet, bedrooms, bathrooms, and accommodates, which is pretty intuitive in my opinion.
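If you prefer a list over a plot, a quick sketch like the following pulls out the strongest pairs (the 0.5 cutoff is arbitrary):

# sketch: flatten the correlation matrix and list the strongest pairs
corr_matrix$r %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  gather(key = "var2", value = "r", -var1) %>%
  filter(var1 < var2, abs(r) > 0.5) %>%
  arrange(desc(abs(r)))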
data cleaning
To ease the data cleaning process, I first removed two variables that are merely identification numbers and carry no information about price:
- scrape_id
- host_id
I also removed square_feet because only very few properties have a value for it.
Finally, I decided to keep the review columns, although around 40% of their entries have no corresponding value (i.e., NA). It is simply because I want to see how this factor plays its role in the prediction model. Nevertheless, dropping those incomplete rows costs me around 40% of the data, and that is the price I have to pay.
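As a quick check of these claims, this sketch lists the share of missing values in each numeric column:

# sketch: share of missing values per column, highest first
listing_numeric %>%
  summarise_all(~ mean(is.na(.))) %>%
  gather(key = "column", value = "na_share") %>%
  arrange(desc(na_share))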
listing_numeric_cleaned <-
listing_numeric %>%
# remove unuseful column
select(-scrape_id, -host_id) %>%
# remove square_feet as very few listing has this data
select(-square_feet) %>%
# remove na
na.omit()
amenities
Remember that in Part 1, I created a new table amentities_df? It contains only numbers, so I can merge it with the numeric variables I have at hand.
# merge with amentities
data_predict_model <-
amentities_df %>%
select(-price_numeric) %>%
spread(key = "amenities", value = "value") %>%
right_join(., listing_numeric_cleaned, by = c("id"))
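A quick sanity check on the join (a sketch; assuming amentities_df has one row per id after spreading, the row counts should match because right_join keeps every row of the right-hand table):

# sketch: verify the join kept every cleaned listing
nrow(data_predict_model) == nrow(listing_numeric_cleaned)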
Alright. Now data_predict_model contains the data that I'll use to create my prediction model. To be precise, it contains all the numeric data in listing. My next step is to deal with the textual data.
Textual variables
After quickly checking the textual variables in listing, I picked seven variables that look interesting and might be useful in predicting price:
- room_type (ordinal)
- bed_type (ordinal)
- neighbourhood_cleansed (categorical)
- property_type (categorical)
- cancellation_policy (categorical)
- extra_people (textual number)
- calendar_updated (textual number)
There are several types of textual variables that are commonly used in prediction models, including ordinal variables, categorical variables, and numeric information stored as text.
categorical variables
For the categorical variables, there are several ways to encode them. One possibility is one-hot encoding, also known as dummy variables in statistics. Alternatively, one can use mean encoding, which replaces each category with the mean of the target variable within that category. In this case I chose mean encoding, with price as the target, for the following variables:
- neighbourhood_cleansed
- property_type
- cancellation_policy
# mean encoding for neighbourhood_cleansed, property_type, cancellation_policy
mean_encoding <- function(df, obj_col, obj_Y){
group_var <- enquo(obj_col) # Create quosure
group_varY <- enquo(obj_Y) # Create quosure
mean_name <- paste0("mean_", quo_name(group_var))
df_processed <-
df %>%
group_by(!! group_var) %>%
summarise(!! mean_name := mean(!! group_varY, na.rm = TRUE)) %>%
right_join(., df, by = c(quo_name(group_var)))
# return
df_processed
}
listing_textual_cleaned <-
listing_textual %>%
mean_encoding(., neighbourhood_cleansed, price_numeric) %>%
mean_encoding(., property_type, price_numeric) %>%
mean_encoding(., cancellation_policy, price_numeric)
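A quick way to eyeball the result (a sketch; the mean_ prefix comes from the mean_encoding function above):

# sketch: inspect the mean-encoded neighbourhood values
listing_textual_cleaned %>%
  select(neighbourhood_cleansed, mean_neighbourhood_cleansed) %>%
  distinct() %>%
  arrange(desc(mean_neighbourhood_cleansed))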
ordinal variables
For the ordinal variables, such as room_type in the dataset, there is an apparent order among the values: Shared room -> Private room -> Entire home/apt. Therefore I used label encoding, which maintains the ordinal relationship, for this type of variable:
- room_type
- bed_type
# label encoding for room_type, bed_type
listing_textual_cleaned %<>%
mutate(encode_room_type = if_else(room_type == "Shared room", 1,
if_else(room_type == "Private room", 2, 3))) %>%
mutate(encode_bed_type = if_else(bed_type == "Couch", 1,
if_else(bed_type == "Pull-out Sofa", 2,
if_else(bed_type == "Airbed", 3,
if_else(bed_type == "Futon", 4, 5)))))
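As a side note, the nested if_else calls can be written more readably with dplyr::case_when; this sketch should behave identically for room_type:

# sketch: equivalent label encoding with case_when
listing_textual_cleaned %>%
  mutate(encode_room_type = case_when(
    room_type == "Shared room"  ~ 1,
    room_type == "Private room" ~ 2,
    TRUE                        ~ 3
  ))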
textual numbers
There are two variables that contain numerical information in text format:
- extra_people
- calendar_updated
extra_people is relatively easy because the format is quite regular, such as $40.00. All I need to do is extract the numerical part and save it as a number.
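For instance, a quick check on a sample value (a sketch; note the escaped dot so the regex matches the decimal point literally):

# sketch: extract the numeric part of a price string
as.numeric(str_extract("$40.00", "\\d+\\.\\d+"))  # 40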
calendar_updated is tricky. Some of the values are regular, but come in various formats:
- x days ago
- y weeks ago
- z years ago
Some are human-readable, but rather irregular:
- today
- yesterday
- never
My strategy is to 1) convert all values to a similar span format, such as x days, y weeks, and z years, and then 2) use the function time_length to compute the exact length of each span in days. Hence I converted today to 0 day, yesterday to 1 day, and never to 10 years (I know 10 years is still far from forever, but I think it's long enough compared to the other values in the given dataset).
# extract the numerical values for extra_people, calendar_updated
listing_textual_cleaned %<>%
# escape the dot so it matches the decimal point literally
mutate(extra_people_numeric = as.numeric(str_extract(extra_people, "\\d+\\.\\d+"))) %>%
# change text format for special cases
mutate(calendar_updated = if_else(calendar_updated == "today", "0 day",
if_else(calendar_updated == "yesterday", "1 day",
if_else(calendar_updated == "never", "10 years",
if_else(calendar_updated == "a week ago", "1 week", str_replace_all(calendar_updated, " ago", "")))))) %>%
# estimate days
mutate(calendar_updated_numeric = time_length(calendar_updated, 'day'))
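To see what time_length does with these span strings, here is a quick sketch mirroring the call above (lubridate counts a year as 365.25 days):

# sketch: how time_length interprets the converted spans
time_length("1 week", "day")    # 7
time_length("10 years", "day")  # 3652.5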
Finally, I merged the processed textual data with data_predict_model.
data_predict_model <-
listing_textual_cleaned %>%
select(id, mean_neighbourhood_cleansed, mean_property_type, mean_cancellation_policy, encode_room_type, encode_bed_type, extra_people_numeric, calendar_updated_numeric) %>%
right_join(., data_predict_model, by = c("id"))
Now data_predict_model contains 190 columns (including id), and they are all numeric. Isn't that exciting? This dataset is ready to go for the next step - model construction.
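As a last sanity check before moving on (a sketch; 190 is the column count stated above):

# sketch: confirm dimensions and that every column is numeric
ncol(data_predict_model)                     # expect 190
all(sapply(data_predict_model, is.numeric))  # expect TRUE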
The end (for now, again)
The data processing took a bit longer than I expected, but it is very important to spend time on this step before it's too late. In the next blog:
I’ll use the processed (clean) data to construct a prediction model