Play with Airbnb - South Aegean Greek Islands (Part 2)
September 8, 2019
R, data science, data process, visualization
From now on, I'm going to build a model that predicts the price of the Airbnb properties in the South Aegean region.
Libraries used
library(tidyverse)
library(stringr)                # string manipulation (str_extract, str_replace_all)
import::from(magrittr, "%<>%")  # compound assignment pipe
library(Hmisc)                  # rcorr() for correlation matrices
library(ggcorrplot)             # correlation heatmap
library(lubridate)              # time_length() for time spans
Data source
I recycled the same tables I created in Part 1.
# recycle from 1_exploratory
load("./data/parsed_data.Rdata")
Data processing
Before I can construct a prediction model, the first thing I have to do is check and clean the data. This step is crucial; otherwise you could end up with 1) no model that can be constructed at all, or even worse, 2) a wrong model that makes no sense.
Continuous variables
I started with the numeric columns. First, I extracted them as a separate table:
# extract numeric columns
listing_numeric <-
listing %>%
# select_if(is.numeric) # this shorthand cannot be negated directly (!is.numeric)
select_if(function(col) is.numeric(col))
Notice that in the R code, select_if(is.numeric) works exactly the same as select_if(function(col) is.numeric(col)); they both select only the numeric columns. Nevertheless, if you want to select by a more sophisticated criterion, say !is.numeric or is.numeric combined with another condition, you need the latter form; e.g., select_if(function(col) !is.numeric(col)).
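As an illustration, here is a minimal sketch of such a combined criterion (the requirement of more than one distinct value is an arbitrary choice for demonstration):

# hypothetical example: keep only numeric columns that are not constant
listing %>%
  select_if(function(col) is.numeric(col) && n_distinct(col) > 1)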
correlation
Then I checked the correlation between each pair of them:
# check correlation
corr_matrix <-
listing_numeric %>%
as.matrix() %>%
rcorr()
corr_matrix$r %>%
ggcorrplot() +
theme(axis.text.x = element_text(size=5),
axis.text.y = element_text(size=5))
When I checked the pairs with high correlation (shown in red), it seems that price is positively correlated with square_feet, bedrooms, bathrooms, and accommodates, which is pretty intuitive in my opinion.
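If you prefer a list over a plot, a quick sketch like the following pulls out the strongest pairs (the 0.5 cutoff is arbitrary):

# sketch: flatten the correlation matrix and list the strongest pairs
corr_matrix$r %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  gather(key = "var2", value = "r", -var1) %>%
  filter(var1 < var2, abs(r) > 0.5) %>%
  arrange(desc(abs(r)))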
data cleaning
To ease the data cleaning process, I first removed two variables that are merely identification numbers and carry no information about price:
- scrape_id
- host_id
I also removed square_feet because only very few properties have a value for it.
Finally, I decided to keep the review columns, although around 40% of their entries have no corresponding value (i.e., NA). It is simply because I want to see how this factor plays its role in the prediction model. Nevertheless, dropping those incomplete rows costs me around 40% of the data, and that is the price I have to pay.
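As a quick check of these claims, this sketch lists the share of missing values in each numeric column:

# sketch: share of missing values per column, highest first
listing_numeric %>%
  summarise_all(~ mean(is.na(.))) %>%
  gather(key = "column", value = "na_share") %>%
  arrange(desc(na_share))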
listing_numeric_cleaned <-
listing_numeric %>%
# remove unuseful column
select(-scrape_id, -host_id) %>%
# remove square_feet as very few listing has this data
select(-square_feet) %>%
# remove na
na.omit()
amenities
Remember that in Part 1, I created a new table amentities_df? It contains only numbers, so I can merge it with the numeric variables I have at hand.
# merge with amentities
data_predict_model <-
amentities_df %>%
select(-price_numeric) %>%
spread(key = "amenities", value = "value") %>%
right_join(., listing_numeric_cleaned, by = c("id"))
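A quick sanity check on the join (a sketch; assuming amentities_df has one row per id after spreading, the row counts should match because right_join keeps every row of the right-hand table):

# sketch: verify the join kept every cleaned listing
nrow(data_predict_model) == nrow(listing_numeric_cleaned)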
Alright. Now data_predict_model contains the data that I'll use to create my prediction model. To be precise, it contains all the numeric data in listing. My next step is to deal with the textual data.
Textual variables
After quickly checking the textual variables in listing, I picked seven variables that look interesting and might be useful in predicting price:
- room_type (ordinal)
- bed_type (ordinal)
- neighbourhood_cleansed (categorical)
- property_type (categorical)
- cancellation_policy (categorical)
- extra_people (textual number)
- calendar_updated (textual number)
There are several types of textual variables that are commonly used in prediction models, including ordinal variables, categorical variables, and numeric information stored as text.
categorical variables
For the categorical variables, there are several ways to encode them. One possibility is one-hot encoding, also known as dummy variables in statistics. Alternatively, one can use mean encoding, which replaces each category with the mean of the target variable within that category. In this case I chose mean encoding, with price as the target, for the following variables:
- neighbourhood_cleansed
- property_type
- cancellation_policy
# mean encoding for neighbourhood_cleansed, property_type, cancellation_policy
mean_encoding <- function(df, obj_col, obj_Y){
group_var <- enquo(obj_col) # Create quosure
group_varY <- enquo(obj_Y) # Create quosure
mean_name <- paste0("mean_", quo_name(group_var))
df_processed <-
df %>%
group_by(!! group_var) %>%
summarise(!! mean_name := mean(!! group_varY, na.rm = TRUE)) %>%
right_join(., df, by = c(quo_name(group_var)))
# return
df_processed
}
listing_textual_cleaned <-
listing_textual %>%
mean_encoding(., neighbourhood_cleansed, price_numeric) %>%
mean_encoding(., property_type, price_numeric) %>%
mean_encoding(., cancellation_policy, price_numeric)
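A quick way to eyeball the result (a sketch; the mean_ prefix comes from the mean_encoding function above):

# sketch: inspect the mean-encoded neighbourhood values
listing_textual_cleaned %>%
  select(neighbourhood_cleansed, mean_neighbourhood_cleansed) %>%
  distinct() %>%
  arrange(desc(mean_neighbourhood_cleansed))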
ordinal variables
For the ordinal variables, such as room_type in the dataset, there is an apparent order among the values: Shared room -> Private room -> Entire home/apt. Therefore I used label encoding, which maintains the ordinal relationship, for this type of variable:
- room_type
- bed_type
# label encoding for room_type, bed_type
listing_textual_cleaned %<>%
mutate(encode_room_type = if_else(room_type == "Shared room", 1,
if_else(room_type == "Private room", 2, 3))) %>%
mutate(encode_bed_type = if_else(bed_type == "Couch", 1,
if_else(bed_type == "Pull-out Sofa", 2,
if_else(bed_type == "Airbed", 3,
if_else(bed_type == "Futon", 4, 5)))))
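As a side note, the nested if_else calls can be written more readably with dplyr::case_when; this sketch should behave identically for room_type:

# sketch: equivalent label encoding with case_when
listing_textual_cleaned %>%
  mutate(encode_room_type = case_when(
    room_type == "Shared room"  ~ 1,
    room_type == "Private room" ~ 2,
    TRUE                        ~ 3
  ))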
textual numbers
There are two variables that contain numerical information in text format:
- extra_people
- calendar_updated
extra_people is relatively easy because the format is quite regular, such as $40.00. All I need to do is extract the numerical part and save it as a number.
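For instance, a quick check on a sample value (a sketch; note the escaped dot so the regex matches the decimal point literally):

# sketch: extract the numeric part of a price string
as.numeric(str_extract("$40.00", "\\d+\\.\\d+"))  # 40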
calendar_updated is tricky. Some of the values are regular, but come in various formats:
- x days ago
- y weeks ago
- z years ago
Some are human-readable, but rather irregular:
- today
- yesterday
- never
My strategy is to 1) convert all values to a similar span format, such as x days, y weeks, and z years, and then 2) use the function time_length to compute the exact length of each span in days. Hence I converted today to 0 day, yesterday to 1 day, and never to 10 years (I know 10 years is still far from forever, but I think it's long enough compared to the other values in the given dataset).
# extract the numerical values for extra_people, calendar_updated
listing_textual_cleaned %<>%
# escape the dot so it matches the decimal point literally
mutate(extra_people_numeric = as.numeric(str_extract(extra_people, "\\d+\\.\\d+"))) %>%
# change text format for special cases
mutate(calendar_updated = if_else(calendar_updated == "today", "0 day",
if_else(calendar_updated == "yesterday", "1 day",
if_else(calendar_updated == "never", "10 years",
if_else(calendar_updated == "a week ago", "1 week", str_replace_all(calendar_updated, " ago", "")))))) %>%
# estimate days
mutate(calendar_updated_numeric = time_length(calendar_updated, 'day'))
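To see what time_length does with these span strings, here is a quick sketch mirroring the call above (lubridate counts a year as 365.25 days):

# sketch: how time_length interprets the converted spans
time_length("1 week", "day")    # 7
time_length("10 years", "day")  # 3652.5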
Finally, I merged the processed textual data with data_predict_model.
data_predict_model <-
listing_textual_cleaned %>%
select(id, mean_neighbourhood_cleansed, mean_property_type, mean_cancellation_policy, encode_room_type, encode_bed_type, extra_people_numeric, calendar_updated_numeric) %>%
right_join(., data_predict_model, by = c("id"))
Now data_predict_model contains 190 columns (including id), and they are all numeric. Isn't that exciting? This dataset is ready to go for the next step - model construction.
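As a last sanity check before moving on (a sketch; 190 is the column count stated above):

# sketch: confirm dimensions and that every column is numeric
ncol(data_predict_model)                     # expect 190
all(sapply(data_predict_model, is.numeric))  # expect TRUE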
The end (for now, again)
The data processing took a bit longer than I expected, but it is very important to spend time on this step before it's too late. In the next blog:
I’ll use the processed (clean) data to construct a prediction model