INFO20003 Semester 2, 2025
Assignment 2: SQL
Due: Week 8 - Friday 19th September 2025, 6:00PM Melbourne Time.
Case: The Best Movie Recommendations Ever! (BMRE!)
Inspired by the Netflix Challenge x Community Ratings x Streaming Platforms
Introduction
In the 2000s, Netflix ran a contest ('The Netflix Prize') to find the best ratings prediction algorithm to improve recommendations for its users. This was one of the early releases of a large real-world dataset for data scientists and machine learning/AI researchers. Since then, many other movie-related datasets have been released, both for researchers and for everyday movie enthusiasts (like us)!
Inspired by the Netflix Prize, we present – “The Best Movie Recommendations Ever!” – a new challenge.
Assume you are part of a team of movie fans and AI experts looking to make the best recommendation algorithm. Because you are the database expert in the team, you are placed in charge of the database management and querying.
IMPORTANT: You do not have to do any AI/ML stuff for this assignment, just the SQL. For the purposes of anonymity of Internet users, and respecting the intellectual property of real-world movie platforms, note that the data given for this assignment are synthetic.
Description
The following description explains what data is made available to you. Be careful as many tables share identical attribute names, to simulate real-world datasets. Assume that there are no discrepancies between the different tables (e.g., Netflix year of release is the same as IMDB’s year of release – see below).
Netflix records, for each movie, a unique Netflix Movie ID (e.g., 80234304), the movie title, and the year of release. Each movie is given ratings, where a rating record consists of the numeric rating (0-5 inclusive, integers), the timestamp of the rating, and a unique anonymous user ID. For privacy, no other user data is given.
Now, in this new challenge, we also obtain data from the comprehensive Internet Movie Database (IMDB), with a record for each movie. Each IMDB movie record has an IMDB Movie ID (e.g., tt12584954), the movie title, year of release, average IMDB rating (decimal from 0.0 to 10.0 inclusive), number of people who have rated, and classification (e.g., PG-13, or R). Each movie record is associated with a lead director and up to five main actor(s)/actress(es). Finally, each movie record has at least one genre (e.g., only Comedy, or Comedy + Drama, etc.), and one language (e.g., 'EN' for English, using the 2-letter ISO code for languages).
IMDB also uses some data from MetaCritic, which is a collection of critic scores for each movie. MetaCritic data records comprise the IMDB Movie ID, Source, and Score (0-100 inclusive, integer); a movie can have any number of scores (e.g., one from a newspaper, another from a film blogger).
Another reputable data source is RottenTomatoes, a movie review database. Each RottenTomatoes movie review record has a unique RottenTomatoes ID (e.g., ‘the_lord_of_the_rings_the_fellowship_of_the_ring’), a ‘Tomatometer’ critics score (0-100 inclusive, integer), and a ‘Popcornmeter’ audience score (0-100 inclusive, integer).
Finally, we also draw upon 'tag' data (like hashtags on social media) from the MovieLens recommendation service (inspired by Harper & Konstan, 2015). Anonymised users contribute tags to movies, in the form of records containing the MovieLens User ID, MovieLens Movie ID, the tag, and a timestamp.
Luckily, your teammates in this challenge have also supplied mappings in the following tables: imdb_to_netflix, imdb_to_rottentomatoes, and imdb_to_movielens (all self-explanatory) to link movie records across all data sources/platforms. Note that a movie *may or may not* have corresponding IDs in all tables (e.g., a new movie in cinemas may only have an IMDB record and no other corresponding records).
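As a rough sketch of what one such mapping table might look like (illustrative only: the column names and types here are assumptions, and the authoritative definitions are in BMRE.sql and Figure 1):

```sql
-- Hypothetical sketch of a mapping table. A movie that has no
-- Netflix record simply has no row here, which is why queries that
-- compare platforms must allow for missing mappings (e.g., via
-- LEFT JOIN ... IS NULL or NOT EXISTS) rather than inner joins.
CREATE TABLE imdb_to_netflix (
    imdb_movie_id    VARCHAR(12) NOT NULL,  -- e.g. 'tt12584954'
    netflix_movie_id INT         NOT NULL,  -- e.g. 80234304
    PRIMARY KEY (imdb_movie_id, netflix_movie_id)
);
```

Always check the actual schema created by BMRE.sql before writing your queries; the real tables may differ from this sketch.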
The Data Model
The Data Model from MySQL Workbench is provided in Figure 1.
FIGURE 1. DATA MODEL FOR BMRE!
Assignment 2 Setup
Please pay special attention to the penalties listed [].
A dataset is provided which you can use when developing your solutions. To set up the dataset, download the file BMRE.sql from the Assignment link on Canvas and run it in Workbench. This script creates the database tables and populates them with data.
The sample dataset provided is a basic, synthetic extract and not necessarily the 'full data'. You may find that you need to add some more sample data in Workbench to fully test edge cases for queries.
Note that this dataset is provided for you to experiment with, but it is NOT the same dataset as what your queries will be tested against (the schema will stay the same, but the data itself may be different). This means when designing your queries, you must consider edge cases even if they are not represented in this particular data set.
The script is designed to run against your account on the Engineering IT server (info20003db.eng.unimelb.edu.au). If you want to install the schema on your own MySQL Server installation, uncomment the lines at the beginning of the script.
WARNING: Do NOT disable only_full_group_by mode when completing this assignment. This mode is the default and is turned on in all default installs of MySQL, and we've added a line to the top of BMRE.sql to turn it on every time you run the script in case you disable it! You can check whether it is turned on using the command `SELECT @@sql_mode;`. The command should return a string containing ONLY_FULL_GROUP_BY or ANSI. When testing, our test server WILL have this mode turned on, and if your query fails due to this, you will lose marks.
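To illustrate what this mode enforces (using a hypothetical `rating(movie_id, user_id, rating)` table, not necessarily the BMRE schema): under ONLY_FULL_GROUP_BY, every selected column must either be aggregated or appear in the GROUP BY clause.

```sql
-- Rejected under ONLY_FULL_GROUP_BY: `rating` is neither aggregated
-- nor listed in GROUP BY, so MySQL refuses to run the statement.
SELECT movie_id, rating, COUNT(*)
FROM rating
GROUP BY movie_id;

-- Accepted: every non-aggregated column appears in GROUP BY,
-- and the remaining columns are wrapped in aggregate functions.
SELECT movie_id, COUNT(*) AS num_ratings, AVG(rating) AS avg_rating
FROM rating
GROUP BY movie_id;
```

If a query of yours runs locally but the mode is off, the first form may silently return an arbitrary `rating` per group, so test with the mode enabled as the warning above requires.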
The SQL Tasks
Please pay special attention to the penalties listed [].
In this section are listed 10 questions for you to answer. Write one (single) SQL statement per question. Each statement must end with a semicolon (;). Subqueries and nesting are allowed within a single SQL statement – however, you may be penalised for writing overly complicated SQL statements.
WARNING: DO NOT USE VIEWS (or ‘WITH’ statements/common table expressions) OR VARIABLES to answer questions. Penalties apply.
The Questions
1. List all IMDB movies which have no MetaCritic reviews. Your query should return results of the form (IMDBMovieID, MovieTitle). (1 mark)
2. Find the Netflix movie with the most recent review. Assume there are no ties: only one is the most recent. Your query should return one row (NetflixMovieID, MovieTitle, TimeOfMostRecentRating). (1 mark)
3. List all movies rated by MetaCritic source 'The Washington Post' that have at least 5 Netflix user ratings. Your query should return results of the form (IMDBMovieID, NetflixRatingCount). (1 mark)
4. Find the genre whose movies have the highest average Tomatometer score. If there are ties, then you must return all genres with the highest average. The average score must be rounded to 1 decimal place. Your query should return results of the form (Genre, TomatometerAvgScore), with one row per genre in case of a tie. (2 marks)
5. List all MetaCritic scores and sources for films that feature an actor with more than 2 words in their full name (e.g., "James Earl Jones", but not "Anya Taylor-Joy"). Do not duplicate results if multiple such actors acted in the same film. Your query should return results of the form (Score, Source, IMDBTitle). (2 marks)
6. Find which year has the highest number of 'PG'-rated movies that have at least one MovieLens tag of 'action_thriller'. If there are ties, then you must return all results. Your query should return results of the form (Year, MovieCount), with one row per Year in case of a tie. (2 marks)
7. Find the total number of movies that are in IMDB but not in Netflix (defined as X). Similarly, find the total number of movies that are in IMDB but not in MovieLens (defined as Y). Your query should return one row: (X, Y). (2 marks)
8. We'll refer to a MetaCritic review source that has reviewed at least one movie in each language that currently exists in the IMDB movies list as a 'global-reviewsource'. For each global-reviewsource, evaluate how their review count, average score, and score standard deviation (hint: use the STDDEV operation) vary based on the language of the movie the source reviewed. Average and StdDev must be rounded to 1 decimal place. Your query should return (globalReviewSource, language, countReviewsForLanguage, avgScoreForLanguage, popStdDevScoreForLanguage). (3 marks)
As an example, consider the following dataset:
imdb_movie

| id | language |
|----|----------|
| 1  | 'EN'     |
| 2  | 'ES'     |
| 3  | 'EN'     |
metacritic_review

| source  | imdb_movie_id | score |
|---------|---------------|-------|
| NYT     | 1             | 3     |
| NYT     | 2             | 5     |
| Variety | 1             | 2     |
Currently, there are 2 unique languages in the imdb_movie table. Source 'NYT' has reviewed at least one movie in each of these languages (1x 'EN' movie and 1x 'ES' movie). Thus, NYT is a 'global-reviewsource'. 'Variety' is NOT a global-reviewsource, since they have not reviewed any 'ES' movies.
The returned result from running our query on the given dataset should be:
| Source | Language | CountReviews | AvgScore | StddevScore |
|--------|----------|--------------|----------|-------------|
| NYT    | EN       | 1            | 3        | 0           |
| NYT    | ES       | 1            | 5        | 0           |
9. A 'multi-talented actor' is one who has acted in movies with 5 or more genres. This includes acting in a single movie that has 5+ genres, or across multiple movies with 5+ distinct genres across all of them. Find multi-talented actors, and how many times people have tagged any of their movies. Your query should return results of the form (ActorName, NumberOfUniqueGenres, TotalNumberOfTags). (3 marks)
10. A 'consistent quality' movie is one where the average rating by Netflix users (converted to a percentage from 0-100%) matches the average MetaCritic rating across all sources, within 15 percentage points. For example, movie X has 3 ratings of {3,4,5} on Netflix, which converts to an average of 80.0%. Movie X has 3 MetaCritic reviews with scores {70%, 75%, 80%}, giving an average of 75.0%. Since |80.0 - 75.0| <= 15, movie X is 'consistent quality'. If a movie doesn't have a MetaCritic review, it is not eligible to be consistent.
Of the 'consistent quality' movies released in 2021 or 2022, find the 3 with the highest average Netflix review score. If multiple movies tie, they should all be included. For example, if 4 movies all tie for 1st place, the query should return 4 rows. If 1st and 2nd were distinct, but 2 movies tied for 3rd place, the query should also return 4 rows. The final MetaCritic and Netflix average scores must be rounded to 1 decimal place.
Return as (NetflixMovieID, RoundedAvgMetacriticScore, RoundedAvgNetflixScoreAsPercent). (3 marks)