Take-home Exercise 01

Background

The City of Engagement, with a total population of 50,000, is a small city located in the Country of Nowhere. The city serves as a service centre of an agriculture region surrounding the city. The main agriculture of the region is fruit farms and vineyards. The local council of the city is in the process of preparing the Local Plan 2023. A sample survey of 1000 representative residents had been conducted to collect data related to their household demographic and spending patterns, among other things. The city aims to use the data to assist with their major community revitalization efforts, including how to allocate a very large city renewal grant they have recently received.

Data Preparation

There are two datasets for this exercise:

The first one contains various demographic information of survey participants (Participants.csv)
The second contains various financial transactions of survey participants (FinancialJournal.csv)

Data description (click to view)

Participants.csv

Contains information about the residents of City of Engagement that have agreed to participate in this study.

participantId (integer): unique ID assigned to each participant.
householdSize (integer): the number of people in the participant’s household
haveKids (boolean): whether there are children living in the participant’s household.
age (integer): participant’s age in years at the start of the study.
educationLevel (string factor): the participant’s education level, one of: {“Low”, “HighSchoolOrCollege”, “Bachelors”, “Graduate”}
interestGroup (char): a char representing the participant’s stated primary interest group, one of {“A”, “B”, “C”, “D”, “E”, “F”, “G”, “H”, “I”, “J”}. Note: specific topics of interest have been redacted to avoid bias.
joviality (float): a value ranging from [0,1] indicating the participant’s overall happiness level at the start of the study.

FinancialJournal.csv

Contains information about financial transactions.

participantId (integer): unique ID corresponding to the participant affected
timestamp (datetime): the time when the check-in was logged
category (string factor): a string describing the expense category, one of {“Education”, “Food”, “Recreation”, “RentAdjustment”, “Shelter”, “Wage”}
amount (double): the amount of the transaction

While importing the datasets, we will also ensure that the variables are of the correct type (e.g. character, numeric, integer):

Load datasets

p <- read_csv("data/Participants.csv",
                     col_types = cols("i", "i", "l", "i", "c", "c", "d"))
if (any(is.na(p)) == FALSE) {
  cat("Output:", "There are no missing values in Participants.csv", "\n")
} else {
  cat("Output:", "There are missing values in Participants.csv", "\n")
}
p$educationLevel <-
  p$educationLevel %>% ordered(levels =
                                 c("Low",
                                   "HighSchoolOrCollege",
                                   "Bachelors",
                                   "Graduate")) # order of factors may be meaningful, so use ordered()
p_data <- distinct(p) # remove any possible duplicate rows

fin <- read_csv("data/FinancialJournal.csv",
                     col_types = cols("i", "T", "c", "d"))
if (any(is.na(fin)) == FALSE) {
  cat("Output:", "There are no missing values in FinancialJournal.csv", "\n")
} else {
  cat("Output:", "There are missing values in FinancialJournal.csv", "\n")
}
fin$category <-
  fin$category %>% factor(levels =
                            c("Education",
                              "Food",
                              "Recreation",
                              "RentAdjustment",
                              "Shelter",
                              "Wage"))
fin_data <- distinct(fin) # remove any possible duplicate rows

Output: There are no missing values in Participants.csv 
Output: There are no missing values in FinancialJournal.csv

There were 0 (out of 1011) duplicate rows for Participants.csv and 1113 (out of 1513636) duplicate rows for FinancialJournal.csv. They have been removed.

Information about col_types

“i” is for integer, “c” is for character, “l” is for logical, “d” is for double, “T” is for datetime.

Before we move on, let’s check if the financial journal is complete for all participants:

counts <- fin_data %>% group_by(participantId) %>% summarise(n_obs = n())

ggplot(counts, aes(x = "", y = n_obs)) +
  geom_boxplot(fill = "#80b1d3", outlier.shape = NA, width = 0.5) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.5, color = "#FFA07A") +
  scale_y_continuous(expand = c(0.05, 0)) +
  labs(title = "Distribution of Financial Transactions per participantId",
       subtitle = "Some participants have rather low numbers of financial transactions\n(located at the bottom)",
       y = "No. of financial transactions",
       x = "") +
  theme_classic() +
  theme(axis.line.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = element_text(size = 12))

For the group of participants that have rather low numbers of financial transactions, it could be due to exogenous reasons (e.g., data entry error, left the city, gave up on the study halfway). As such, they will be removed for the purposes of this analysis:

new_fin_data <- counts %>%
  filter(n_obs >= 500) %>% # filter out participants with low no. of transactions
  inner_join(fin_data, by = "participantId") %>% # keep only relevant obs in fin_data
  select(!contains("n_obs")) # discard n_obs column

They will also be removed from the demographics dataset, since we are unsure if, for example, they have left the city for good.

new_p_data <- counts %>%
  filter(n_obs >= 500) %>%
  inner_join(p_data, by = "participantId") %>%
  select(!contains("n_obs"))

Participants removed due to low number of check-ins in financial journal

A total of 131 participants, out of the original 1011, were dropped from our datasets.

Getting to Know Our Demographics Dataset

The City of Engagement has a population of 50,000. We have participants’ data with a sample size of 880, which should give us a decent representation of how the population looks like.

Firstly, let’s perform some basic visualisations so that we can get to know our data better. Basic insights are provided in the subtitle of each plot.

Demographics of Participants

Code

# Creating a data.frame containing the frequencies of householdSize in our dataset of 1,000 participants
hs_counts <- data.frame(table(new_p_data$householdSize))
colnames(hs_counts) <- c("householdSize", "Freq")

# Adding a percentage variable based on the frequencies
hs_counts$pct <- percent(hs_counts$Freq / sum(hs_counts$Freq))

# Choosing some nice colors for the plot
nice_colors <- c("#80b1d3", "#8dd3c7", "#bebada")

# Plotting (using coord_polar creates a pie chart from a stacked bar chart)
ggplot(hs_counts, aes(x = "", y = Freq, fill = factor(householdSize))) + 
  geom_bar(stat = "identity", width = 1, color = "white") +
  geom_text(aes(label = pct), position = position_stack(vjust = 0.5)) +
  coord_polar("y", start = 0) +
  labs(title = "Household Size Distribution",
       subtitle = "The most common household size is 1 and least common household size is 3.",
       fill = "Household Size") +
  theme_void() +
  scale_fill_manual(values = nice_colors, guide = guide_legend(reverse = TRUE)) +
  theme(plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = element_text(size = 12))

Code

# Creating a data.frame containing the frequencies of haveKids in our dataset of 1,000 participants
kids_counts <- data.frame(table(new_p_data$haveKids))
colnames(kids_counts) <- c("haveKids", "Freq")

# Adding a percentage variable based on the frequencies
kids_counts$pct <- percent(kids_counts$Freq / sum(kids_counts$Freq))

# Choosing some nice colors for the plot
nice_colors <- c("#80b1d3", "#8dd3c7")

# Plotting (using coord_polar creates a pie chart from a stacked bar chart)
ggplot(kids_counts, aes(x = "", y = Freq, fill = factor(haveKids))) + 
  geom_bar(stat = "identity", width = 1, color = "white") +
  geom_text(aes(label = pct), position = position_stack(vjust = 0.5)) +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Whether Participants Have Kids",
       subtitle = "Most participants do not have kids. Only slightly less than a third have kids.",
       fill = "Do participants have kids?") +
  theme_void() +
  scale_fill_manual(values = nice_colors, guide = guide_legend(reverse = TRUE)) +
  theme(plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = element_text(size = 12))

Code

# Calculate the mean, min, and max age
mean_age <- mean(new_p_data$age)
min_age <- min(new_p_data$age)
max_age <- max(new_p_data$age)

# Choosing some nice colors for the plot
nice_colors <- c("#80b1d3", "#8dd3c7", "#bebada", "#fccde5")

# Plotting
ggplot(new_p_data, aes(x = age)) + 
  geom_histogram(breaks = c(18, 30, 40, 50, 60), fill = nice_colors, color = "white") +
  labs(title = "Age Distribution",
       subtitle = "Participants are relatively evenly distributed across age groups.",
       x = "Age",
       y = "Counts",
       caption = "Note:\nFor geom_histogram(), all bins are left-inclusive and right-exclusive by default, except the rightmost bin,\nwhich is both left and right-inclusive.\nThe leftmost and rightmost bins have a larger binwidth of 12 and 11 years respectively, compared to the\nmiddle two bins with binwidths of 10 years. Though not ideal for comparison, they are intuitive.") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9.5, hjust = 0)) +
  geom_vline(xintercept = mean_age, color = "darkred", linetype = "dashed") +
  geom_vline(xintercept = min_age, color = "darkred", linetype = "dashed") +
  geom_vline(xintercept = max_age, color = "darkred", linetype = "dashed") +
  annotate("text", x = mean_age + 4.2, y = 40, label = paste("Mean age:", round(mean_age, 1)), color = "darkred") +
  annotate("text", x = min_age + 3.5, y = 40, label = paste("Min. age:", round(min_age, 1)), color = "darkred") +
  annotate("text", x = max_age - 4, y = 40, label = paste("Max. age:", round(max_age, 1)), color = "darkred") +
  geom_text(aes(x = 23.5, y = 150, label = "18-29\nyears old"), size = 3.5) +
  geom_text(aes(x = 34, y = 150, label = "30-39\nyears old"), size = 3.5) +
  geom_text(aes(x = 44, y = 150, label = "40-49\nyears old"), size = 3.5) +
  geom_text(aes(x = 54.5, y = 150, label = "50-60\nyears old"), size = 3.5) +
  scale_y_continuous(breaks = breaks_width(50)) # this useful function is from the 'scales' package

Code

# Creating a data.frame containing the frequencies of educationLevel in our dataset of 1,000 participants
edu_counts <- data.frame(table(new_p_data$educationLevel))
colnames(edu_counts) <- c("educationLevel", "Counts")

# Adding a percentage variable based on the frequencies
edu_counts$pct <- percent(edu_counts$Counts / sum(edu_counts$Counts))

# Choosing some nice colors for the plot
nice_colors <- c("#80b1d3", "#8dd3c7", "#bebada", "#fccde5")

# Plot
ggplot(edu_counts, aes(x = educationLevel, y = Counts)) +
  geom_bar(stat = "identity", fill = nice_colors, color = "white") +
  labs(title = "Education Level Distribution", x = "Education Level", y = "Counts",
       subtitle = "This city is rather well-educated.\n93.7% of participants have education higher than 'High School/College'.\n45.7% have education higher than Bachelor's degree.") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = element_text(size = 12)) +
  geom_text(aes(label = pct), vjust = 1.5)

Code

# Creating a data.frame containing the frequencies of interestGroup in our dataset of 1,000 participants
interest_counts <- data.frame(table(new_p_data$interestGroup))
colnames(interest_counts) <- c("interestGroup", "Counts")

# Adding a percentage variable based on the frequencies
interest_counts$pct <- percent(interest_counts$Counts / sum(interest_counts$Counts), accuracy = 0.1)

# Plot
ggplot(interest_counts, aes(x = reorder(interestGroup, -Counts), y = Counts)) +
  geom_bar(stat = "identity", fill = "#80b1d3", color = "white") +
  labs(title = "Interest Group Distribution", x = "Interest Group", y = "Counts",
       subtitle = "Most common interest among participants is 'J', and least common interest is 'E'.") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = element_text(size = 12)) +
  geom_text(aes(label = pct), vjust = 1.5)

Getting to Know Our Financial Journal Dataset

Let’s take a look at the length (no. of rows) of our new_fin_data, which is the financial journal for our 880 participants:

cat("Output:", "There are", nrow(new_fin_data), "rows", "in our dataset.", "\n")

Output: There are 1509897 rows in our dataset.

This is rather large and it is due to the fact that each participant can have multiple transactions across different categories. To make the dataset more manageable without losing our ability to conduct meaningful analysis, we could extract the Month-Year from timestamp column that records the date and time of each record in POSIXct format. After extracting the Month-Year (as month_year), we will sum the total transaction amounts by participantId, category, and month_year:

grouped_data <-
  new_fin_data %>%
  mutate(month_year = floor_date(timestamp, "month")) %>%
  group_by(participantId, category, month_year) %>%
  summarise(total_amount = sum(amount)) %>%
  arrange(participantId, month_year) # sort by ascending

head(grouped_data, 10)

# A tibble: 10 × 4
# Groups:   participantId, category [5]
   participantId category   month_year          total_amount
           <int> <fct>      <dttm>                     <dbl>
 1             0 Education  2022-03-01 00:00:00        -38.0
 2             0 Food       2022-03-01 00:00:00       -268. 
 3             0 Recreation 2022-03-01 00:00:00       -349. 
 4             0 Shelter    2022-03-01 00:00:00       -555. 
 5             0 Wage       2022-03-01 00:00:00      11932. 
 6             0 Education  2022-04-01 00:00:00        -38.0
 7             0 Food       2022-04-01 00:00:00       -266. 
 8             0 Recreation 2022-04-01 00:00:00       -219. 
 9             0 Shelter    2022-04-01 00:00:00       -555. 
10             0 Wage       2022-04-01 00:00:00       8637.

With this grouped dataset, we can clearly see the expenditure/income for each participant, category, and month-year.

cat("Output:", "There are", nrow(grouped_data), "rows", "in our new dataset.", "\n")

Output: There are 44208 rows in our new dataset.

This is a much smaller dataset, which is more manageable, while still providing insights at the monthly interval.

Let’s explore the Financial Journal data over time to see if there are any trends. We will do this by plotting the mean, maximum, and minimum total_amount earned/spent from all participants, grouped by category and month_year, over month_year.

Code

# Calculate the mean, minimum, and maximum total_amount over all participantId for each category
grouped_data_summary <- grouped_data %>%
  group_by(category, month_year) %>%
  summarise(mean_total_amount = mean(abs(total_amount)),
            min_total_amount = min(abs(total_amount)),
            max_total_amount = max(abs(total_amount)))

# Create a time series plot with mean, min, and max for each category using facets
ggplot(grouped_data_summary, aes(x = month_year, y = mean_total_amount, color = category)) +
  geom_line() +
  geom_ribbon(aes(ymin = min_total_amount, ymax = max_total_amount, fill = category), alpha = 0.3) +
  labs(x = "Month/Year", y = "Amount ($)",
       title = "Mean/Max/Min Amount by Category over Time",
       subtitle = "RentAdjustment and Wage are earnings, the rest are expenses") +
  scale_fill_discrete(name = "Category") +
  scale_color_discrete(name = "Category") +
  theme_bw() +
  facet_wrap(~category, nrow = 3, scales = "free_y") +
  guides(color = "none", fill = "none") +
  scale_x_datetime(date_minor_breaks = "1 month", date_labels = "%b-%Y")

We can see that the mean, minimum, and maximum amounts for each category is relatively stable over time. The exceptions are:

Shelter and RentAdjustment experience ‘shocks’ on Mar-2022, and the effects linger for a month until Apr-2022, after which Shelter stabilises while RentAdjustment becomes zero.
Minor ‘shock’ for Wage on Mar-2022 as well.
High volatility for the maximum amount spent on Recreation over time. This makes sense as it is most likely spending on luxury items, as compared to the other expense categories, which are more like necessities (e.g., Education, Food, Shelter).

Since most of the time series are relatively stable, we can move on to conduct our analysis on aggregated data (aggregated over time). This removes one dimension from the analysis, making it less complicated.

Aggregating Financial Journal Data over Time and Merging Datasets

We will take the mean of total_amount over month_year to obtain mean monthly amounts earned/spent by each participant for each category. We will not include RentAdjustment as it seems like a one-off occurrence.

aggregated_data <- grouped_data %>%
  group_by(participantId, category) %>%
  summarise(monthly_total_amount = mean(total_amount))
head(aggregated_data)

# A tibble: 6 × 3
# Groups:   participantId [2]
  participantId category   monthly_total_amount
          <int> <fct>                     <dbl>
1             0 Education                 -38.0
2             0 Food                     -262. 
3             0 Recreation               -365. 
4             0 Shelter                  -555. 
5             0 Wage                     9151. 
6             1 Education                 -38.0

Let’s merge this aggregated data with the demographics dataset so that we can see the earnings/expense by demographics:

merged_data <- inner_join(aggregated_data, new_p_data, by = "participantId")
head(merged_data)

# A tibble: 6 × 9
# Groups:   participantId [2]
  participantId category   monthly_total_amount householdSize haveKids   age
          <int> <fct>                     <dbl>         <int> <lgl>    <int>
1             0 Education                 -38.0             3 TRUE        36
2             0 Food                     -262.              3 TRUE        36
3             0 Recreation               -365.              3 TRUE        36
4             0 Shelter                  -555.              3 TRUE        36
5             0 Wage                     9151.              3 TRUE        36
6             1 Education                 -38.0             3 TRUE        25
# ℹ 3 more variables: educationLevel <ord>, interestGroup <chr>,
#   joviality <dbl>

Who Spends Most on Education?

Intuitively, households with children should spend the most on education. In this section, we will see from the data whether this is really the case.

The following tables/plots show the monthly amount spent on Education when grouped by the respective variables:

tmp <- merged_data %>%
  filter(category == "Education") %>%
  group_by(householdSize) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))
tmp

# A tibble: 1 × 2
  householdSize average_monthly_amount
          <int>                  <dbl>
1             3                  -46.2


 Only those with householdSize = 3 spend on education, which makes sense since kids are present in households of size 3.

tmp <- merged_data %>%
  filter(category == "Education") %>%
  group_by(haveKids) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))
tmp

# A tibble: 1 × 2
  haveKids average_monthly_amount
  <lgl>                     <dbl>
1 TRUE                      -46.2


 Only those who have kids will spend on education.

Code

tmp <- merged_data %>%
  filter(category == "Education") %>%
  group_by(educationLevel) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = educationLevel, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Education by Education Level",
       subtitle = "Graduates spend the most on education.",
       x = "\nEducation Level",
       y = "Average Monthly Amount\nSpent on Education ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Conclusions:

Confirmed that only households with kids spend on education.
Graduates spend the most on education, though not by far.

Who Spends Most on Food?

The following plots show the monthly amount spent on Food when grouped by the respective variables:

Code

tmp <- merged_data %>%
  filter(category == "Food") %>%
  group_by(householdSize) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = householdSize, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Food by Household Size",
       subtitle = "Surprisingly, Household Size and Amount Spent on Food are not positively correlated",
       x = "\nHousehold Size",
       y = "Average Monthly Amount\nSpent on Food ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 11.5))

Code

tmp <- merged_data %>%
  filter(category == "Food") %>%
  group_by(haveKids) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = haveKids, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Food by Presence of Kids",
       subtitle = "Presence of kids does not affect amount spent on food",
       x = "\nParticipants have kids?",
       y = "Average Monthly Amount\nSpent on Food ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

merged_data$ageGroup <- cut(merged_data$age,
                            breaks = c(0, 29, 39, 49, Inf),
                            labels = c("18-29", "30-39", "40-49", "50-60"))

tmp <- merged_data %>%
  filter(category == "Food") %>%
  group_by(ageGroup) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = ageGroup, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Food by Age Group",
       subtitle = "Age does not seem to affect amount spent on food.",
       x = "\nAge Group",
       y = "Average Monthly Amount\nSpent on Food ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

tmp <- merged_data %>%
  filter(category == "Food") %>%
  group_by(educationLevel) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = educationLevel, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Food by Education Level",
       subtitle = "Slight positive correlation between education level and amount spent on food.",
       x = "\nEducation Level",
       y = "Average Monthly Amount\nSpent on Food ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

tmp <- merged_data %>%
  filter(category == "Food") %>%
  group_by(interestGroup) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = reorder(interestGroup, -abs(average_monthly_amount)), y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Food by Interest Group",
       subtitle = "Slight differences in amount spent on food across interest groups",
       x = "\nInterest Group",
       y = "Average Monthly Amount\nSpent on Food ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Conclusions:

householdSize and haveKids do not appear to affect how much is spent monthly on food, which is quite puzzling. Could it be the case that household members eat less to accommodate a larger family?
educationLevel shows a slight positive correlation with amount spent monthly on food (based on the plot). However, a caveat is that educationLevel may also be positively correlated with wage. This is shown to be the case later on.

Who Spends Most on Recreation?

The following plots show the monthly amount spent on Recreation when grouped by the respective variables:

Code

tmp <- merged_data %>%
  filter(category == "Recreation") %>%
  group_by(householdSize) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = householdSize, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Recreation by Household Size",
       subtitle = "Slight increase in spending on recreation when household size increases",
       x = "\nHousehold Size",
       y = "Average Monthly Amount\nSpent on Recreation ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

tmp <- merged_data %>%
  filter(category == "Recreation") %>%
  group_by(haveKids) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = haveKids, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Recreation by Presence of Kids",
       subtitle = "Slight increase in spending on recreation when participants have kids.",
       x = "\nParticipants have kids?",
       y = "Average Monthly Amount\nSpent on Recreation ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

tmp <- merged_data %>%
  filter(category == "Recreation") %>%
  group_by(ageGroup) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = ageGroup, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Recreation by Age Group",
       subtitle = "Roughly equal spending on recreation across age groups.",
       x = "\nAge Group",
       y = "Average Monthly Amount\nSpent on Recreation ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

tmp <- merged_data %>%
  filter(category == "Recreation") %>%
  group_by(educationLevel) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = educationLevel, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Recreation by Education Level",
       subtitle = "Roughly equal spending on recreation across education levels.",
       x = "\nEducation Level",
       y = "Average Monthly Amount\nSpent on Recreation ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

tmp <- merged_data %>%
  filter(category == "Recreation") %>%
  group_by(interestGroup) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = reorder(interestGroup, -abs(average_monthly_amount)), y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Recreation by Interest Group",
       subtitle = "Slight differences in spending on recreation across interest groups.\nCould some interest groups be more expensive than the others?",
       x = "\nInterest Group",
       y = "Average Monthly Amount\nSpent on Recreation ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Conclusions:

Slight increases in spending on recreation when householdSize increases and when haveKids is true.
Some interest groups may cost more than the others. However, it is difficult to ascertain, based on the data, whether spending on recreation equates to spending on interest groups.

Who Spends Most on Shelter?

The following plots show the monthly amount spent on Shelter when grouped by the respective variables:

Code

tmp <- merged_data %>%
  filter(category == "Shelter") %>%
  group_by(householdSize) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = householdSize, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Shelter by Household Size",
       subtitle = "Larger households spend a lot more on shelter as compared to singles.",
       x = "\nHousehold Size",
       y = "Average Monthly Amount\nSpent on Shelter ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

tmp <- merged_data %>%
  filter(category == "Shelter") %>%
  group_by(haveKids) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = haveKids, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Shelter by Presence of Kids",
       subtitle = "Participants who have kids spend more on shelter.",
       x = "\nParticipants have kids?",
       y = "Average Monthly Amount\nSpent on Shelter ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

tmp <- merged_data %>%
  filter(category == "Shelter") %>%
  group_by(ageGroup) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = ageGroup, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Shelter by Age Group",
       subtitle = "Roughly equal spending on shelter across age groups.",
       x = "\nAge Group",
       y = "Average Monthly Amount\nSpent on Shelter ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

tmp <- merged_data %>%
  filter(category == "Shelter") %>%
  group_by(educationLevel) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = educationLevel, y = abs(average_monthly_amount))) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Amount Spent on Shelter by Education Level",
       subtitle = "Spending on shelter increases as education level increases.",
       x = "\nEducation Level",
       y = "Average Monthly Amount\nSpent on Shelter ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Conclusions:

Both householdSize > 1 and haveKids = TRUE lead to an observed increase in spending on shelter. This makes sense, as a larger household requires more living space.
educationLevel shows a positive correlation with amount spent monthly on shelter (based on the plot, similar to the case for food). However, a caveat is that educationLevel may also be positively correlated with wage. This is shown to be the case later on.

Who Earns Most?

Based on our previous plots for spending on food and shelter across education levels, we observe slight positive correlations. In this section, we will check if wage is also positively correlated with education. If that is the case, the increase in spending on food and shelter may also be due to increase in wage.

More generally, we will explore which demographic group earn the most in terms of wages.

Code

tmp <- merged_data %>%
  filter(category == "Wage") %>%
  group_by(householdSize) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = householdSize, y = average_monthly_amount)) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Wages by Household Size",
       subtitle = "Participants with larger households earn more in wages.",
       x = "\nHousehold Size",
       y = "Average Monthly Wages ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

tmp <- merged_data %>%
  filter(category == "Wage") %>%
  group_by(haveKids) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = haveKids, y = average_monthly_amount)) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Wages by Presence of Kids",
       subtitle = "Participants with kids earn more in wages.",
       x = "\nParticipants have kids?",
       y = "Average Monthly Wages ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

tmp <- merged_data %>%
  filter(category == "Wage") %>%
  group_by(ageGroup) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = ageGroup, y = average_monthly_amount)) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Wages by Age Group",
       subtitle = "Wages do not increase with age.",
       x = "\nAge Group",
       y = "Average Monthly Wages ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Code

tmp <- merged_data %>%
  filter(category == "Wage") %>%
  group_by(educationLevel) %>%
  summarise(average_monthly_amount = mean(monthly_total_amount))

ggplot(tmp, aes(x = educationLevel, y = average_monthly_amount)) +
  geom_bar(stat = "identity", fill = "#80b1d3") +
  labs(title = "Average Monthly Wages by Education Level",
       subtitle = "Wages increase as education level increases.",
       x = "\nEducation Level",
       y = "Average Monthly Wages ($)\n") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12))

Conclusions:

Participants earn more in wages when householdSize increases and when they have kids. Could it be that they work harder when they have a family?
Wages do not increase when participants’ age increases. This is rather surprising and seems to suggest that jobs in this city (perhaps due to their agricultural nature) do not come with increments over time, and that having more experience does not command higher wages.
Wages increase as education level of participants increases.

Does Higher Wage Equate More Happiness?

There is still one more variable that we have yet to explore - joviality in the demographics dataset.

Let’s explore the relationship between joviality and average monthly wages using an interactive plot:

# Add wage and ageGroup to our data
wage <- merged_data %>%
  filter(category == "Wage") %>%
  select(participantId, monthly_wage = monthly_total_amount)
tmp <- inner_join(new_p_data, wage, by = "participantId")
tmp$ageGroup <- cut(
  tmp$age,
  breaks = c(0, 29, 39, 49, Inf),
  labels = c("18-29", "30-39", "40-49", "50-60")
)

# Add tooltip information
tooltip <- c(paste("Wage: $", tmp$monthly_wage %>% round(0),
                   "\nJoviality: ", tmp$joviality %>% round(2),
                   "\n",
                   "\nID: ", tmp$participantId,
                   "\nHousehold Size: ", tmp$householdSize,
                   "\nHave kids?: ", tmp$haveKids,
                   "\nAge: ", tmp$age,
                   "\nEducation Level: ", tmp$educationLevel,
                   "\nInterest Group: ", tmp$interestGroup,
                   sep = ""))

Choose the variables to colour by:

Code

plt <- ggplot(tmp, aes(x = monthly_wage,
                       y = joviality)) +
  geom_point_interactive(
    aes(
      tooltip = tooltip,
      data_id = householdSize,
      color = factor(householdSize)
      ),
    size = 1.5) +
  scale_color_brewer(type = "qual", palette = "Set2") +
  labs(
    x = "Monthly Wage",
    y = "Joviality",
    title = "Scatterplot of Joviality vs. Monthly Wage",
    subtitle = "Points coloured by Household Size",
    color = "Household Size"
  ) +
  theme_bw() +
  theme(
    axis.title = element_text(size = 10, face = "bold"),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    legend.title = element_text(size = 10)
  )

girafe(ggobj = plt,
       options = list(opts_hover(css = ''),
                      opts_hover_inv(css = "opacity: 0.3;")))

Code

plt <- ggplot(tmp, aes(x = monthly_wage,
                       y = joviality)) +
  geom_point_interactive(
    aes(
      tooltip = tooltip,
      data_id = haveKids,
      color = factor(haveKids)
      ),
    size = 1.5) +
  scale_color_brewer(type = "qual", palette = "Set2") +
  labs(
    x = "Monthly Wage",
    y = "Joviality",
    title = "Scatterplot of Joviality vs. Monthly Wage",
    subtitle = "Points coloured by Presence of Kids",
    color = "Presence of Kids"
  ) +
  theme_bw() +
  theme(
    axis.title = element_text(size = 10, face = "bold"),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    legend.title = element_text(size = 10)
  )

girafe(ggobj = plt,
       options = list(opts_hover(css = ''),
                      opts_hover_inv(css = "opacity: 0.3;")))

Code

plt <- ggplot(tmp, aes(x = monthly_wage,
                       y = joviality)) +
  geom_point_interactive(
    aes(
      tooltip = tooltip,
      data_id = ageGroup,
      color = factor(ageGroup)
      ),
    size = 1.5) +
  scale_color_brewer(type = "qual", palette = "Set2") +
  labs(
    x = "Monthly Wage",
    y = "Joviality",
    title = "Scatterplot of Joviality vs. Monthly Wage",
    subtitle = "Points coloured by Age Group",
    color = "Age Group"
  ) +
  theme_bw() +
  theme(
    axis.title = element_text(size = 10, face = "bold"),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    legend.title = element_text(size = 10)
  )

girafe(ggobj = plt,
       options = list(opts_hover(css = ''),
                      opts_hover_inv(css = "opacity: 0.3;")))

Code

plt <- ggplot(tmp, aes(x = monthly_wage,
                       y = joviality)) +
  geom_point_interactive(
    aes(
      tooltip = tooltip,
      data_id = educationLevel,
      color = factor(educationLevel)
      ),
    size = 1.5) +
  scale_color_brewer(type = "qual", palette = "Set2") +
  labs(
    x = "Monthly Wage",
    y = "Joviality",
    title = "Scatterplot of Joviality vs. Monthly Wage",
    subtitle = "Points coloured by Education Level",
    color = "Education Level"
  ) +
  theme_bw() +
  theme(
    axis.title = element_text(size = 10, face = "bold"),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    legend.title = element_text(size = 10)
  )

girafe(ggobj = plt,
       options = list(opts_hover(css = ''),
                      opts_hover_inv(css = "opacity: 0.3;")))

Code

plt <- ggplot(tmp, aes(x = monthly_wage,
                       y = joviality)) +
  geom_point_interactive(
    aes(
      tooltip = tooltip,
      data_id = interestGroup,
      color = factor(interestGroup)
      ),
    size = 1.5) +
  scale_color_brewer(type = "qual", palette = "Set3") +
  labs(
    x = "Monthly Wage",
    y = "Joviality",
    title = "Scatterplot of Joviality vs. Monthly Wage",
    subtitle = "Points coloured by Interest Group",
    color = "Interest Group"
  ) +
  theme_bw() +
  theme(
    axis.title = element_text(size = 10, face = "bold"),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    legend.title = element_text(size = 10)
  )

girafe(ggobj = plt,
       options = list(opts_hover(css = ''),
                      opts_hover_inv(css = "opacity: 0.3;")))

Conclusions:

When colouring our data points by educationLevel, we can see that the points are grouped together somewhat by educationLevel. The higher the educationLevel, the higher the monthly wage. However, all variables do not seem to have any clear relationship with joviality (points are evenly spread out across the y-axis).
There is no clear relationship between joviality and monthly wage. Money is not correlated with happiness (at least for this sample of participants).

Verifying Insights from Visual Plots Using Analysis of Variance (ANOVA)

Earlier, based on our earlier plots, we derived the following insights:

householdSize, haveKids, and educationLevel exhibit a visually noticeable positive relationship with monthly wage
Variables did not exhibit any visually noticeable relationships with joviality.

We will now use ANOVA models to verify our earlier conclusions:

ANOVA with `monthly_wage` as dependent variable

model <- aov(monthly_wage ~ householdSize + haveKids + ageGroup + educationLevel + interestGroup,
            data = tmp)
summary(model)

                Df    Sum Sq   Mean Sq F value Pr(>F)    
householdSize    1 2.920e+08 292022588  87.440 <2e-16 ***
haveKids         1 1.360e+07  13602319   4.073 0.0439 *  
ageGroup         3 2.584e+06    861352   0.258 0.8557    
educationLevel   3 1.453e+09 484438161 145.055 <2e-16 ***
interestGroup    9 2.101e+07   2334850   0.699 0.7101    
Residuals      862 2.879e+09   3339693                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

householdSize, haveKids, and educationLevel have p-values < 0.05, indicating that their effects on monthly wage are significant at the 5% level. This verifies our first insight.

ANOVA with `joviality` as dependent variable

model <- aov(joviality ~ householdSize + haveKids + ageGroup + educationLevel + interestGroup,
            data = tmp)
summary(model)

                Df Sum Sq Mean Sq F value  Pr(>F)   
householdSize    1   0.20  0.1992   2.455 0.11749   
haveKids         1   0.00  0.0017   0.021 0.88405   
ageGroup         3   0.52  0.1741   2.146 0.09294 . 
educationLevel   3   1.16  0.3850   4.745 0.00274 **
interestGroup    9   0.73  0.0812   1.001 0.43766   
Residuals      862  69.94  0.0811                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

educationLevel has p-value < 0.05, indicating that its effect on joviality is significant at the 5% level. This is something that we did not notice through our earlier visual plots. We should run a linear regression model to confirm the direction of this effect (positive or negative):

model <- lm(joviality ~ educationLevel, data = tmp)
summary(model)


Call:
lm(formula = joviality ~ educationLevel, data = tmp)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.49725 -0.25024 -0.02126  0.22596  0.55712 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.466797   0.012445  37.510  < 2e-16 ***
educationLevel.L  0.079426   0.029956   2.651  0.00816 ** 
educationLevel.Q  0.001392   0.024889   0.056  0.95541    
educationLevel.C -0.020356   0.018481  -1.101  0.27100    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2854 on 876 degrees of freedom
Multiple R-squared:  0.01651,   Adjusted R-squared:  0.01314 
F-statistic: 4.902 on 3 and 876 DF,  p-value: 0.002199

Since the coefficient on educationLevel.L is positive and statistically significant at the 5% level, we can conclude that a higher education level is associated with higher joviality.

(Note: ‘L’ stands for linear, ‘Q’ stands for quadratic, ‘C’ stands for cubic. Coefficients are as such due to the fact that educationLevel is an ordered factor.)

Background

Data Preparation

Load datasets

Getting to Know Our Demographics Dataset

Demographics of Participants

Getting to Know Our Financial Journal Dataset

Aggregating Financial Journal Data over Time and Merging Datasets

Who Spends Most on Education?

Who Spends Most on Food?

Who Spends Most on Recreation?

Who Spends Most on Shelter?

Who Earns Most?

Does Higher Wage Equate More Happiness?

Choose the variables to colour by:

Verifying Insights from Visual Plots Using Analysis of Variance (ANOVA)

ANOVA with monthly_wage as dependent variable

ANOVA with joviality as dependent variable

ANOVA with `monthly_wage` as dependent variable

ANOVA with `joviality` as dependent variable