Recap exercises: import, preparation and visualization

Published

07 11 2025

The following exercises are meant to help you recap what you learned in the areas of data import, data preparation and visualization.

Customer Satisfaction Tracking

Background: You work for a company that recently implemented new customer service protocols. Customer satisfaction data was collected at three time points: before the change (Q1), and at two follow-up periods (Q2, Q3). Customers rated their satisfaction with service quality and product value on a 1-5 scale. Each row represents one customer’s ratings across all three quarters.

Import data

Download the file customer_satisfaction.csvand place it in your working directory. Import the CSV file and store it in a variable called customers_data.

Prepare data

The data is currently in a wide format - each quarter is a separate column. Transform it to a tidy format where: - One column indicates the quarter (q1, q2, q3) - One column indicates the metric type (service_quality, product_value) - One column contains the rating value

Compute grouped averages

Calculate the mean rating for each combination of quarter and metric type. Store this in a new table called satisfaction_summary.

Visualization

Create a line plot showing how average customer satisfaction changes over quarters. It should have the quarters on the x-axis, the mean rating on the y-axis, and it should show different colors for service quality and product value.

Bonus: compute improvements

Solution

For this exercise we need the following packages:

library(readr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(kableExtra) # <- only for the presentation on the webpage

Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':

    group_rows

Since the csv file is in good shape we can just read it via data.table::fread() or, because it is already very clean, with readr::read_csv():

# Note: adjust to you project and best use here::here()
customers_data <- read_csv(
  file = "Recap8b/customer_satisfaction.csv")
Rows: 50 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (8): customer_id, account_age_months, q1_service_quality, q1_product_val...

β„Ή Use `spec()` to retrieve the full column specification for this data.
β„Ή Specify the column types or set `show_col_types = FALSE` to quiet this message.

This is how the current data looks like:

head(customers_data) |>
  kable()
customer_id account_age_months q1_service_quality q1_product_value q2_service_quality q2_product_value q3_service_quality q3_product_value
1 12 3 5 4 2 5 3
2 25 4 1 4 4 5 3
3 34 4 5 2 3 5 5
4 20 1 5 5 5 3 5
5 16 3 1 2 5 5 5
6 13 5 1 5 4 4 4

To transform it into the desired long format we use tidyr::pivot_longer(). Here we can make our live easier by using the selection helper dplyr::starts_with() to capture all columns starting with q. But a simple application only brings us that far:

customers_data |>
  pivot_longer(
    cols = all_of(starts_with("q"))
    ) |>
  head() |>
  kable()
customer_id account_age_months name value
1 12 q1_service_quality 3
1 12 q1_product_value 5
1 12 q2_service_quality 4
1 12 q2_product_value 2
1 12 q3_service_quality 5
1 12 q3_product_value 3

The column name still contains information about both the quarter and the metric. There are a number of ways we can separate this column.

One way would be to use the argument names_sep of pivot_longer() and then give two column names to names_to. This works in our case but note that it gives a warning that some parts of the column are dropped:

customers_long <- customers_data |>
  pivot_longer(
    cols = starts_with("q"), 
    names_sep = "_", 
    names_to = c("quarter", "metric")
    ) 
Warning: Expected 2 pieces. Additional pieces discarded in 6 rows [1, 2, 3, 4,
5, 6].
head(customers_long)
# A tibble: 6 Γ— 5
  customer_id account_age_months quarter metric  value
        <dbl>              <dbl> <chr>   <chr>   <dbl>
1           1                 12 q1      service     3
2           1                 12 q1      product     5
3           1                 12 q2      service     4
4           1                 12 q2      product     2
5           1                 12 q3      service     5
6           1                 12 q3      product     3

This is because we are telling the function to separate the column using the token _, but this token appears more than once (but we only provide two column names). In our case this is save to ignore, but when you want to avoid it, you can use the function dplyr::separate(), which allows for more sophisticate separation procedures. In our case, we can set extra = "merge", which tells the function to only split on the first underscore and keep everything else together:

customers_long <- customers_data |>
  pivot_longer(
    cols = starts_with("q"),
    names_to = "full_name",
    values_to = "rating"
  ) |>
  separate(
    col = full_name, 
    into = c("quarter", "metric"), 
    sep = "_", 
    extra = "merge")

head(customers_long) |>
  kable()
customer_id account_age_months quarter metric rating
1 12 q1 service_quality 3
1 12 q1 product_value 5
1 12 q2 service_quality 4
1 12 q2 product_value 2
1 12 q3 service_quality 5
1 12 q3 product_value 3

Next we need to compute average ratings for each combination of quarter and metric type:

satisfaction_summary <- customers_long |>
  group_by(quarter, metric) |>
  summarise(avg_rating = mean(rating), .groups = "drop") 

kable(satisfaction_summary)
quarter metric avg_rating
q1 product_value 2.90
q1 service_quality 3.10
q2 product_value 3.52
q2 service_quality 3.44
q3 product_value 3.98
q3 service_quality 4.14

Note: there are different ways of doing this. Instead of the more explicit group_by() you could have used summarize(..., .by=c()).

The final step is then to create the visualization using ggplot2. The only thing to note is that we need to be explicit regarding the grouping and set

ggplot(
  data = satisfaction_summary, 
  mapping = aes(
    x = quarter, 
    y = avg_rating, 
    colour = metric, 
    group = metric)
  ) +
  geom_point() +
  geom_line() +
  theme_light()

When you use geom_line(), ggplot2 needs to know which points to connect. By default, ggplot2 automatically creates groups based on the combination of all discrete variables in your data. In our case these are the two variables quarter and metric.

So ggplot2 creates groups based on the combination of both these variables:

  1. q1 + product_value (1 point)
  2. q1 + service_quality (1 point)
  3. q2 + product_value (1 point)
  4. q2 + service_quality (1 point)
  5. q3 + product_value (1 point)
  6. q3 + service_quality (1 point)

Each group has only one observation, so there’s nothing to connect with a line!

What we actually want is to have separate lines for each metric, i.e. one line for product_value across all quarters and one line for service_quality across all quarters.

So you need to tell ggplot2: β€œGroup by metric only, not by the combination of all discrete variables, which would be quarter and metric!”

Of course, there are many ways to make the plot pretties, but here we may leave it at this!

To also complete the bonus challenge, we need to compute the improvement after the intervention, i.e. from q1 to the other two quarters. We will do this by computer the average for q2 and q3, compute the relative change to q1 and then rank the data according to their relative changes.

Here are all steps conducted separately: first, create a new variable to group periods before and after the intervention:

step1 <- satisfaction_summary |>
  mutate(
    intervention = ifelse(test = quarter=="q1", yes = "No", no = "Yes")
    )
kable(step1)
quarter metric avg_rating intervention
q1 product_value 2.90 No
q1 service_quality 3.10 No
q2 product_value 3.52 Yes
q2 service_quality 3.44 Yes
q3 product_value 3.98 Yes
q3 service_quality 4.14 Yes

Now compute the respective averages:

step2 <- step1 |>
  summarise(
    avg_rating = mean(avg_rating), .by = c("intervention", "metric")
    )
kable(step2)
intervention metric avg_rating
No product_value 2.90
No service_quality 3.10
Yes product_value 3.75
Yes service_quality 3.79

Make it wider:

step3 <- step2 |>
  pivot_wider(names_from = intervention, values_from = avg_rating) 
kable(step3)
metric No Yes
product_value 2.9 3.75
service_quality 3.1 3.79

And now compute the relative change:

step4 <- step3 |>
  mutate(
    absolute_change = Yes - No,
    relative_change = (absolute_change / No)*100
    ) |>
  select(-No, -Yes)
kable(step4)
metric absolute_change relative_change
product_value 0.85 29.31034
service_quality 0.69 22.25806

Or, all in one call:

satisfaction_summary |>
  mutate(
    intervention = ifelse(test = quarter=="q1", yes = "No", no = "Yes")
    ) |>
  summarise(
    avg_rating = mean(avg_rating), .by = c("intervention", "metric")
  ) |>
  pivot_wider(names_from = intervention, values_from = avg_rating) |>
  mutate(
    absolute_change = Yes - No,
    relative_change = (absolute_change / No)*100
    ) |>
  select(-No, -Yes)
# A tibble: 2 Γ— 3
  metric          absolute_change relative_change
  <chr>                     <dbl>           <dbl>
1 product_value              0.85            29.3
2 service_quality            0.69            22.3