A group of friends and I participate in a mini film club together. Just like a book club, we take turns picking a movie and meet to discuss whatever we all watched.
My friends have kindly agreed to share their Letterboxd data with me so that I could do some simple analyses. My focus is going to be less on any individual person’s film habits and more on the descriptives and techniques themselves—as a preview, we’ll see some of our favorite movies, the most controversial movies, a simple recommendation system, each person’s hipsterness score, and some text analysis on written reviews. The code itself is fully flexible and can be used with any set of downloaded Letterboxd data files.
For those not in the know, Letterboxd is a social networking site revolving around movies (much like Goodreads is for books): you can log movies you’ve watched, offer ratings and write reviews, comment on other people’s reviews, etc. It’s quite fun and non-toxic, which is why I enjoy it.
I’ve done these analyses in R. We’ll start by loading in the packages used:
library("tidyverse") # for general data wrangling
library("gtools") # for permutations
library("rvest") # for webscraping
library("lubridate") # for working with dates
library("tidytext") # for working with text
By placing the unzipped Letterboxd folders into the working directory, we can easily identify the members of the analysis by detecting the folders in the directory that match a regular naming pattern.
members <- list.files(pattern = "letterboxd-")
This will allow us to create our first data frame, where rows represent User-Film pairs of all the films the users have ever watched. We’ll also go ahead and add some basic information on the number of people who have watched each movie, and flag those which all users have seen, which only one user has seen, and which all but one user have seen. At this point, one might want to edit the User column so the values are more readable than the full folder names, but in order to keep the code as flexible as possible, I have not done so here (though see the sketch after the next code block).
movies <- lapply(members, function(i)
read.csv(file = paste0(i, "/watched.csv"))
)
names(movies) <- members
movies <- bind_rows(movies, .id = "User") |>
rename(URL = Letterboxd.URI) |>
select(-Date) |>
mutate(Watched = TRUE)
movies <- movies |>
group_by(Name, Year, URL) |>
mutate(N_Watched = sum(!is.na(Watched)),
All_Watched = N_Watched == length(members),
All_Minus1_Watched = N_Watched == length(members) - 1,
Only_1Watched = N_Watched == 1)
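As a sketch of that optional relabeling: a named lookup vector keeps it to one line. The display names below are hypothetical placeholders (supply one per member, in the same order as members), and it’s best applied as a display step only, since later code builds column names from the raw folder names.
# Optional: recode folder names to readable labels
# (display names are hypothetical placeholders; one per member, in folder order)
user_labels <- setNames(c("Ana", "Ben", "Cleo"), members)
movies |>
  mutate(User = unname(user_labels[User]))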
We can use pivot_wider to create a wide version of the data, where rows represent a single film and each member’s watched status becomes its own column; we then collapse these into comma-separated lists of who has and hasn’t seen each film. Our group has watched 1,750 unique movies! (A quick check of that count follows the code below.)
movies_wide <- movies |>
pivot_wider(
names_from = "User",
values_from = "Watched",
names_prefix = "Watched_") |>
mutate(across(starts_with("Watched"),
~ !is.na(.))) |>
pivot_longer(starts_with("Watched"),
names_to = "User",
names_prefix = "Watched_",
values_to = "Watched") |>
mutate(Watched = ifelse(Watched, "Watched", "NotWatched")) |>
group_by(Name, Year, Watched, URL) |>
summarize(Who = paste0(User, collapse = ", ")) |>
pivot_wider(names_from = "Watched",
values_from = "Who",
names_prefix = "Who_") |>
mutate(N_Watched = str_count(Who_Watched, ",") + 1,
All_Watched = N_Watched == length(members),
All_Minus1_Watched = N_Watched == length(members) - 1,
Only_1Watched = N_Watched == 1)
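As a quick sanity check on that count (assuming Name, Year, and URL jointly identify a film):
# Count unique films across the group
movies |>
  distinct(Name, Year, URL) |>
  nrow()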
Some of the quantities of interest involve the Letterboxd average rating for each movie, which is fairly tricky to webscrape. Here’s a function I wrote to get it done. Since it takes quite a while to run, I save the result as an .Rdata file so the scrape only has to happen once (a sketch of a file-existence guard that makes this explicit follows below). Notice I use Sys.sleep and rpois to generate random breaks between reading in the HTML of each film’s page. Using each film’s URL as a key, we can join this to the movies and movies_wide data frames.
# Add Letterboxd averages
read_letterboxd_average <- function(url) {
  page <- read_html(url)
  # The average rating lives in the twitter:data2 meta tag as, e.g., "3.8 out of 5"
  average_rating <- page |>
    html_nodes(xpath = '//meta[@name="twitter:data2"]') |>
    html_attr("content")
  average_rating <- as.numeric(str_replace(average_rating, " out of 5", ""))
  # Pause a random 1-10 seconds between requests to be polite to the server
  Sys.sleep(min(rpois(n = 1, lambda = 1) + 1, 10))
  return(average_rating)
}
letterboxd_averages_raw <- sapply(movies_wide$URL, read_letterboxd_average)
Letterboxd_Rating <- data.frame(letterboxd_averages_raw) |>
rownames_to_column("URL") |>
rename(Letterboxd_Rating = letterboxd_averages_raw)
save(Letterboxd_Rating, file = "Letterboxd_Rating.Rdata")
load("Letterboxd_Rating.Rdata")
movies_wide <- left_join(movies_wide, Letterboxd_Rating)
movies <- left_join(movies, movies_wide)
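One small refinement, sketched with the same objects as above: wrapping the scrape in a file-existence check makes the “only run once” behavior explicit, so re-running the script skips straight to the cached file.
# Only scrape if the cached ratings file doesn't exist yet
if (!file.exists("Letterboxd_Rating.Rdata")) {
  letterboxd_averages_raw <- sapply(movies_wide$URL, read_letterboxd_average)
  Letterboxd_Rating <- data.frame(letterboxd_averages_raw) |>
    rownames_to_column("URL") |>
    rename(Letterboxd_Rating = letterboxd_averages_raw)
  save(Letterboxd_Rating, file = "Letterboxd_Rating.Rdata")
}
load("Letterboxd_Rating.Rdata")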
We’re almost done with the data processing steps; we just need to add our own ratings. These live in files separate from the watched files, since you can watch a movie without rating it (a quick count of unrated films follows below).
# Add our ratings
ratings <- lapply(members, function(i)
read.csv(file = paste0(i, "/ratings.csv"))
)
names(ratings) <- members
ratings <- bind_rows(ratings, .id = "User") |>
rename(URL = Letterboxd.URI) |>
select(-Date)
movies <- left_join(movies, ratings)
ratings_wide <- ratings |>
pivot_wider(
names_from = "User",
values_from = "Rating",
names_prefix = "Rating_") |>
rowwise() |>
mutate(Mean_Rating = mean(c_across(starts_with("Rating")), na.rm = TRUE))
movies_wide <- left_join(movies_wide, ratings_wide)
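Since ratings are optional, here’s a quick sketch on the joined movies frame of how many watched films each person has left unrated:
# Watched-but-unrated films per user
movies |>
  group_by(User) |>
  summarize(n_watched = n(),
            n_unrated = sum(is.na(Rating)))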
Now we’re ready to go! Let’s start by seeing our favorite movies, by average rating. Of the 27 movies we’ve all seen, our top 5 are Tár, City of God, Jeanne Dielman, 23, quai du Commerce, 1080 Bruxelles, Spirited Away, and Aftersun.
# Best movies we've all watched
best <- movies |>
group_by(Name, Year, URL) |>
filter(All_Watched == TRUE) |>
summarize(rating_mean = mean(Rating, na.rm = TRUE)) |>
arrange(desc(rating_mean))
best <- left_join(best, movies_wide) |>
select(Name, Year, rating_mean, starts_with("Rating"), Letterboxd_Rating)
best
How about something a little spicier? We can define our most controversial movies as those with the biggest variance in ratings, both among movies we’ve all watched and more broadly. I’m not going to share these full tables, but one of us gave Godzilla vs. Kong a 4-star rating while another gave it a 1-star rating.
# Most controversial - biggest variance in ratings, among movies we've all watched
controversy <- movies |>
group_by(Name, Year, URL) |>
filter(All_Watched == TRUE) |>
summarize(rating_sd = sd(Rating)) |>
arrange(desc(rating_sd))
controversy <- left_join(controversy, movies_wide) |>
select(Name, Year, rating_sd, starts_with("Rating"), Letterboxd_Rating)
controversy
# Most controversial - biggest variance in ratings, all movies
controversy <- movies |>
group_by(Name, Year, URL) |>
mutate(rating_sd = sd(Rating, na.rm = TRUE)) |>
ungroup() |>
select(Name, Year, rating_sd, Who_Watched) |>
arrange(desc(rating_sd)) |>
distinct()
controversy <- left_join(controversy, movies_wide) |>
select(Name, Year, rating_sd, starts_with("Rating"), Letterboxd_Rating)
controversy
We might also be interested in the “best-missing”—the best movies that only one of us hasn’t watched. Alternatively, we can look at movies only one person has watched, where that person gave a 5-star rating:
# Best movies only one of us hasn't watched
best_missing <- movies |>
group_by(Name) |>
filter(All_Minus1_Watched == TRUE) |>
mutate(rating_mean = mean(Rating, na.rm = TRUE),
rating_sum = sum(Rating, na.rm = TRUE)) |>
select(Name, Year, rating_mean, rating_sum, Who_NotWatched) |>
distinct() |>
arrange(desc(rating_mean), desc(rating_sum)) |>
select(-rating_sum)
best_missing <- left_join(best_missing, movies_wide) |>
select(Name, Year, rating_mean, Who_NotWatched, starts_with("Rating"), Letterboxd_Rating)
best_missing
# Best movies only one of us has watched
best_solo <- movies |>
group_by(Name) |>
filter(Only_1Watched == TRUE) |>
select(Name, Year, Rating, Who_Watched) |>
distinct() |>
arrange(desc(Rating), Who_Watched)
best_solo |>
ungroup() |>
group_by(Who_Watched) |>
slice_max(Rating, n = 2, with_ties = FALSE)
Next we consider deviations of our ratings from the average Letterboxd rating, both by movie and by user, and, most interesting to me, the mean deviation from the Letterboxd average across all movies a person has rated, which could be construed as that person’s “hipsterness” score. Here we keep the sign of the difference rather than using absolute differences, but of course, you could do that as well (a sketch of that variant follows the code below).
movies <- movies |>
mutate(Rating_Diff = Rating - Letterboxd_Rating)
# Biggest outliers by movie
outliers_movies <- movies |>
group_by(Name, Year) |>
mutate(mean_rating_diff = mean(Rating_Diff, na.rm = TRUE)) |>
select(Name, Year, Letterboxd_Rating, mean_rating_diff) |>
distinct() |>
arrange(desc(mean_rating_diff))
outliers_movies <- left_join(outliers_movies, movies_wide) |>
select(Name, Year, mean_rating_diff, Letterboxd_Rating, starts_with("Rating"))
outliers_movies
# Outliers by person for each movie - max
outliers_rater <- movies |>
group_by(User) |>
slice_max(Rating_Diff, n = 10) |>
select(User, Name, Year, Rating, Letterboxd_Rating, Rating_Diff)
outliers_rater <- left_join(outliers_rater, movies_wide) |>
select(User, Name, Year, Letterboxd_Rating, starts_with("Rating"))
outliers_rater
# Outliers by person for each movie - min
outliers_rater <- movies |>
group_by(User) |>
slice_min(Rating_Diff, n = 10) |>
select(User, Name, Year, Rating, Letterboxd_Rating, Rating_Diff)
outliers_rater <- left_join(outliers_rater, movies_wide) |>
select(User, Name, Year, Letterboxd_Rating, starts_with("Rating"))
outliers_rater
# By person (hipsterness)
movies |>
group_by(User) |>
summarize(mean_rating_diff = mean(Rating_Diff, na.rm = TRUE))
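And, as flagged above, the absolute-difference variant, which captures how far a person strays from the Letterboxd consensus in either direction:
# By person (hipsterness), using absolute deviations
movies |>
  group_by(User) |>
  summarize(mean_abs_rating_diff = mean(abs(Rating_Diff), na.rm = TRUE))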
Turning to some slightly more complex procedures, we can see which of us are most similar in how we rate movies by calculating a cosine similarity score for each pair of users. Of course, this is only one similarity metric; you could consider others like Jaccard similarity (on the sets of watched movies, or on some binarization of the ratings), Euclidean distance, and so on (a Jaccard sketch follows the code below).
# Who is the most similar to each other?
combinations_forsimilarity <-
data.frame(combinations(n = length(members),
r = 2,
v = members)) |>
rename(User1 = X1, User2 = X2)
cosine_similarity <- function(x, y) {
dot_product <- sum(x * y, na.rm = TRUE)
magnitude_x <- sqrt(sum(x^2, na.rm = TRUE))
magnitude_y <- sqrt(sum(y^2, na.rm = TRUE))
similarity <- dot_product / (magnitude_x * magnitude_y)
return(similarity)
}
combinations_forsimilarity$cosine_similarity <- sapply(1:nrow(combinations_forsimilarity), function(i)
cosine_similarity(movies_wide[,paste0("Rating_", combinations_forsimilarity[i,]$User1)],
movies_wide[,paste0("Rating_", combinations_forsimilarity[i,]$User2)])
)
combinations_forsimilarity
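For comparison, here’s a sketch of the Jaccard alternative mentioned above, computed on each pair’s sets of watched films (size of the intersection over size of the union):
# Jaccard similarity on the sets of watched films
jaccard_similarity <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}
watched_sets <- split(movies$URL, movies$User)
combinations_forsimilarity$jaccard_similarity <- sapply(1:nrow(combinations_forsimilarity), function(i)
  jaccard_similarity(watched_sets[[combinations_forsimilarity[i,]$User1]],
                     watched_sets[[combinations_forsimilarity[i,]$User2]])
)
combinations_forsimilarity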
Subsequently, this can be the basis of a simple recommendation engine. The function below takes the combinations_forsimilarity object (which has ${n \choose 2}$ rows, where $n$ is the number of users) and pivots it into a similarity object containing just the pairings between the target_user and the other users. It then takes the movies data frame, filters it to movies the target_user has not seen, joins in the other users’ similarity scores, and creates a rating weighted by those scores (if you have many users, you may want to define a threshold above which a user counts as similar). This weighted rating is then rescaled to range between 0 and 1, and the output is the list of movies the target_user has not seen, sorted by the weighted rating.
# A simple recommendation system: weighted rating
recommendations <- function(target_user) {
similarity <- combinations_forsimilarity |>
filter(str_detect(User1, paste0("^", target_user, "$")) |
str_detect(User2, paste0("^", target_user, "$"))) |>
pivot_longer(c(User1, User2),
values_to = "other_rater") |>
filter(other_rater != target_user) |>
mutate(target_user = target_user) |>
select(-name) |>
relocate(target_user, other_rater, cosine_similarity)
output <- movies |>
filter(str_detect(Who_NotWatched, paste0("^", target_user, "$")) |
str_detect(Who_NotWatched, paste0(target_user, ",")) |
str_detect(Who_NotWatched, paste0(target_user, "$"))) |>
group_by(Name, Year, User) |>
rename(other_rater = User) |>
summarize(Rating = Rating)
output <- left_join(output, similarity)
output <- output |>
select(-target_user) |>
group_by(Name, Year) |>
summarize(mean_Rating = mean(Rating),
weighted_Rating = sum(Rating*cosine_similarity))
min_wt_rating <- min(output$weighted_Rating, na.rm = TRUE)
max_wt_rating <- max(output$weighted_Rating, na.rm = TRUE)
output <- output |>
mutate(weighted_Rating = (weighted_Rating-min_wt_rating)
/(max_wt_rating-min_wt_rating))
output <- left_join(output, select(movies_wide,
c("Name", "Year",
starts_with("Rating")))) |>
arrange(desc(weighted_Rating))
output <- output[, colSums(is.na(output)) != nrow(output)]
print(paste0("Printing recommendations for: ", target_user))
return(output)
}
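Calling the function with any member’s folder name then prints their personalized list, for example:
# Example usage: recommendations for the first member in the folder listing
recommendations(members[1])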
Moving on to more general descriptives, I was curious about average ratings over time, both considering the Letterboxd averages (though note the sample consists only of the movies we’ve watched) and our own ratings:
# Average ratings over time
movies_wide |>
mutate(Decade = paste0(Year - Year %% 10, "s")) |>
group_by(Decade) |>
summarize(Mean_OurRating = mean(Mean_Rating, na.rm = TRUE),
Mean_LetterboxdRating = mean(Letterboxd_Rating, na.rm = TRUE),
N_films = n())
movies |>
group_by(Year, User) |>
summarize(Mean_Rating = mean(Rating, na.rm = TRUE),
N_films = n()) |>
ggplot(aes(x = Year, y = Mean_Rating, color = User, size = N_films)) +
geom_point() +
scale_y_continuous(limits = c(0,5))
Next we consider watching behavior. This can be found in the diary files within your Letterboxd data, which contain the movies you’ve logged along with the date you watched them. The most popular single date among our group saw 7 movies watched (split across only two users!). Our most popular month consisted of 64 movies logged, and a single user once watched 33 movies in a month. The most popular movie months are December and January, not surprising given the combination of the holidays and the winter season.
# Read viewing data
diary <- lapply(members, function(i)
read.csv(file = paste0(i, "/diary.csv"))
)
names(diary) <- members
diary <- bind_rows(diary, .id = "User") |>
rename(URL = Letterboxd.URI)
# Single most popular watch date
diary |>
group_by(Watched.Date) |>
summarize(N_films = n(),
film_names = paste0(User, " - ", Name, collapse = ", ")) |>
arrange(desc(N_films))
# Single most popular date by user
diary |>
group_by(User, Watched.Date) |>
summarize(N_films = n(),
film_names = paste0(Name, collapse = ", ")) |>
arrange(desc(N_films))
# Single most popular watch month
diary |>
mutate(Watched.Month = floor_date(ymd(Watched.Date), unit = "month")) |>
group_by(Watched.Month) |>
summarize(N_films = n(),
film_names = paste0(User, " - ", Name, collapse = ", ")) |>
arrange(desc(N_films))
# Single most popular watch month by user
diary |>
mutate(Watched.Month = floor_date(ymd(Watched.Date), unit = "month")) |>
group_by(User, Watched.Month) |>
summarize(N_films = n(),
film_names = paste0(Name, collapse = ", ")) |>
arrange(desc(N_films))
# Most popular months not including year
diary |>
mutate(Month = month(ymd(Watched.Date))) |>
group_by(Month) |>
summarize(N_films = n()) |>
arrange(desc(N_films))
# Most popular months not including year by user
diary |>
mutate(Month = month(ymd(Watched.Date))) |>
group_by(Month, User) |>
summarize(N_films = n()) |>
arrange(desc(N_films))
Last but not least, let’s play around with some text data! We can grab this data from the reviews files. Our longest review was 4,251 characters long.
# Read review data
reviews <- lapply(members, function(i)
read.csv(file = paste0(i, "/reviews.csv"))
)
names(reviews) <- members
reviews <- bind_rows(reviews, .id = "User")
# Get longest review
reviews <- reviews |>
  mutate(length_of_review = str_length(Review))
reviews |>
  slice_max(length_of_review, n = 1) |>
  select(User, Name, length_of_review)
Converting this text data to a friendlier form for text analysis allows us to assess our most used words (which aren’t too interesting: movie, film, story, time, love, fun—all suggesting we could be a little more clever with our reviews) and to do some sentiment analysis. On average, our reviews consisted of 56.3% positive words and 43.7% negative words, and this pattern held for each individual user—all of us use more positive than negative words in our reviews. The two most negative reviews were of Doctor Strange in the Multiverse of Madness and Black Panther: Wakanda Forever, while the most positive review was of Nomadland.
# Clean text data for analysis
# (unnest_tokens tokenizes into words and lowercases by default,
#  so no separate tolower() step is needed)
reviews_clean <- reviews |>
  unnest_tokens(word, Review) |>
  anti_join(stop_words) |>
  select(User, Name, word)
# Most used words
most_used_words <- reviews_clean |>
  group_by(word) |>
  summarize(count = n()) |>
  arrange(desc(count))
most_used_words
# Most used words by user
most_used_words_by_user <- reviews_clean |>
  group_by(word, User) |>
  summarize(count = n()) |>
  arrange(desc(count))
most_used_words_by_user
# Reviews sentiment
reviews_sentiment <- reviews_clean |>
inner_join(get_sentiments("bing"))
reviews_sentiment |>
  group_by(sentiment) |>
  summarize(prop = n()/nrow(reviews_sentiment))
reviews_sentiment |>
group_by(User, sentiment) |>
summarize(count = n()) |>
group_by(User) |>
mutate(prop = count/sum(count))
reviews_sentiment |>
group_by(Name, User, sentiment) |>
summarize(count = n()) |>
group_by(Name, User) |>
mutate(prop = count/sum(count)) |>
arrange(desc(count))