COMP9727: Recommender Systems
Assignment: Content-Based Movie Recommendation
Due Date: Week 4, Friday, June 21, 5:00 p.m.
Value: 30%
This assignment is inspired by a typical application of recommender systems. The task is to
build a content-based “movie recommender” such as might be used by a streaming service (such
as Netflix) or review site (such as IMDb) to give users a personalized list of movies that match
their interests. The main learning objective for the assignment is to give a concrete example of
the issues that must be faced when building and evaluating a recommender system in a realistic
context. Note that, while movie recommender systems commonly make use of user ratings, our
scenario is not unrealistic as often all that a movie recommender system has are basic summaries
of the movies and the watch histories of the users.
For this assignment, you will be given a collection of 2000 movies that have been labelled as one
of 8 main genres (topics): animation, comedy, drama, family, horror, romance, sci-fi and thriller.
The movies of each genre are in a separate .tsv file named for the genre (such as animation.tsv)
with 7 fields: title, year, genre, director, cast, summary and country.
The assignment is in three parts, corresponding to the components of a content-based recommender
system. The focus throughout is on explanation of choices and evaluation of the various methods
and models, which involves choosing and justifying appropriate metrics. The whole assignment
will be prepared (and submitted) as a Jupyter notebook, similar to those being used in tutorials,
that contains a mixture of running code and tutorial-style explanation.
Part 1 of the assignment is to examine various supervised machine learning methods using a variety
of features and settings to determine what methods work best for topic (genre) classification in
this domain/dataset. For this purpose, simply concatenate all the information for one movie into
a single “document”. You will use Bernoulli Naive Bayes from the tutorial, Multinomial Naive
Bayes from the lecture, and one other machine learning method of your choice from scikit-learn
or another machine learning library, and NLTK for auxiliary functions if needed.
Part 2 of the assignment is to test a potential recommender system that uses the method for
topic classification chosen in Part 1 by “simulating” a recommender system with a variety of
hypothetical users. This involves evaluating a number of techniques for “matching” user profiles
with movies using the similarity measures mentioned in the lecture. As we do not have real users,
for this part of the assignment, we will simply “invent” some (hopefully typical) users and evaluate
how well the recommender system would work for them, using appropriate metrics. Again you
will need to justify the choice of these metrics and explain how you arrived at your conclusions.
Part 3 of the assignment is to run a very small “user study”, which here means finding one person,
preferably not someone in the class, to try out your recommendation method and give some
informal comments on the performance of your system from the user point of view. This does
not require any user interface to be built; the user can simply be shown the output of (or use) the
Jupyter notebook from Parts 1 and 2. However, you will have to decide how many movies to show
the user at any one time, and how to get feedback from them on which movies they would click on
and which movies match their interests. A simple “talk aloud” protocol is a good idea here (this
is where you ask the user to use your system and say out loud what they are thinking/doing at
the same time – however please do not record the user’s voice – for that we need ethics approval).
Note that standard UNSW late penalties apply.
Assignment
Below are a series of questions to guide you through this assignment. Your answer to each question
should be in a separate clearly labelled section of the Jupyter notebook you submit. Each answer
should contain a mixture of explanation and code. Use comments in the code to explain any code
that you think readers will find unclear. The “readers” here are students similar to yourselves
who know something about machine learning and text classification but who may not be familiar
with the details of the methods.
Part 1. Topic (Genre) Classification
1. (2 marks) There are a few simplifications in the Jupyter notebook in the tutorial: (i) the regex
might remove too many special characters, and (ii) the evaluation is based on only one training-test
split rather than using cross-validation. Explain how you are going to fix these mistakes and
then highlight any changes to the code in the answers to the next questions.
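As one illustration of point (i), the sketch below shows a less aggressive cleaning step. The function name clean_text and the specific rule (keep hyphens and apostrophes only when they sit inside a word) are assumptions made for this example; they are not taken from the tutorial notebook.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase text and strip punctuation, keeping hyphens and
    apostrophes only when they appear inside a word (hypothetical rule)."""
    text = text.lower()
    # Replace anything that is not a letter, digit, whitespace, hyphen or apostrophe.
    text = re.sub(r"[^a-z0-9\s'-]", " ", text)
    # Remove hyphens/apostrophes that are not flanked by alphanumerics on both sides.
    text = re.sub(r"(?<![a-z0-9])['-]+|['-]+(?![a-z0-9])", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Sci-Fi: It's 2001 -- a 'space' odyssey!"))
```

Note that CountVectorizer's default token pattern splits on hyphens and apostrophes, so a rule like this only has an effect if the tokenizer or token pattern is changed to match.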
2. (2 marks) Develop a Multinomial Naive Bayes (MNB) model similar to the Bernoulli Naive
Bayes (BNB) model. Now consider all the steps in text preprocessing used prior to classification
with both BNB and MNB. The aim here is to find preprocessing steps that maximize overall accuracy
(under the default settings of the classifiers and using CountVectorizer with the standard
settings). Consider the special characters to be removed (and how and when they are removed),
the definition of a “word”, the stopword list (from either NLTK or scikit-learn), lowercasing and
stemming/lemmatization. Summarize the preprocessing steps that you think work “best” overall
and do not change this for the rest of the assignment.
3. (2 marks) Compare BNB and MNB models by evaluating them using the full dataset with
cross-validation. Choose appropriate metrics from those in the lecture that focus on the overall
accuracy of classification (i.e. not top-N metrics). Briefly discuss the tradeoffs between the various
metrics and then justify your choice of the main metrics for evaluation, taking into account whether
this dataset is balanced or imbalanced. On this basis, conclude whether either of BNB or MNB is
superior. Justify this conclusion with plots/tables.
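A minimal sketch of such a comparison follows. It assumes scikit-learn is available and uses a tiny invented two-genre corpus in place of the real data; the choice of accuracy and macro-averaged F1 is only an example, not a recommended answer. The vectorizer is placed inside the pipeline so that it is refitted on each training fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (invented for illustration; not the assignment data).
docs = [
    "a haunted house terrifies a family", "ghost stalks teenagers at night",
    "zombie outbreak in a small town", "a demon possesses a young girl",
    "two friends fall in love in paris", "a wedding planner finds romance",
    "love letters reunite old sweethearts", "a summer romance by the sea",
]
labels = ["horror"] * 4 + ["romance"] * 4

# Stratified folds keep the genre proportions the same in every split.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scoring = ["accuracy", "f1_macro"]

results = {}
for name, clf in [("BNB", BernoulliNB()), ("MNB", MultinomialNB())]:
    model = make_pipeline(CountVectorizer(), clf)
    scores = cross_validate(model, docs, labels, cv=cv, scoring=scoring)
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}

for name, metrics in results.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```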
4. (2 marks) Consider varying the number of features (words) used by BNB and MNB in the
classification, using the sklearn setting which limits the number to the top N most frequent
words in the Vectorizer. Compare classification results for various values for N and justify, based
on experimental results, one value for N that works well overall and use this value for the rest
of the assignment. Show plots or tables that support your decision. The emphasis is on clear
presentation of the results so do not print out large tables or too many tables that are difficult to
understand.
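The sweep over N could be organised along the following lines. This is a sketch under assumptions: it uses CountVectorizer's max_features parameter (which keeps the most frequent terms), a toy invented corpus, and candidate values chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (invented for illustration; not the assignment data).
docs = [
    "a haunted house terrifies a family", "ghost stalks teenagers at night",
    "zombie outbreak in a small town", "a demon possesses a young girl",
    "two friends fall in love in paris", "a wedding planner finds romance",
    "love letters reunite old sweethearts", "a summer romance by the sea",
]
labels = ["horror"] * 4 + ["romance"] * 4
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

def accuracy_for(n_features):
    """Mean cross-validated accuracy when only the top n_features words are kept."""
    model = make_pipeline(CountVectorizer(max_features=n_features), MultinomialNB())
    return cross_val_score(model, docs, labels, cv=cv, scoring="accuracy").mean()

# None means "use the whole vocabulary".
sweep = {n: accuracy_for(n) for n in (5, 10, 20, None)}
for n, acc in sweep.items():
    print(f"N={n}: accuracy={acc:.3f}")
```

A single small table or line plot of the values in sweep is usually easier to read than per-fold output.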
5. (5 marks) Choose one other machine learning method, perhaps one mentioned in the lecture.
Summarize this method in a single tutorial-style paragraph and explain why you think it is suitable
for topic classification for this dataset (for example, maybe other people have used this method
for a similar problem). Use the implementation of this method from a standard machine learning
library such as sklearn (not other people’s code from the Internet) to implement this method on
the movie dataset using the same text preprocessing as for BNB and MNB. If the method has any
hyperparameters for tuning, explain how you will select those settings (or use the default settings),
and present a concrete hypothesis for how this method will compare to BNB and MNB.
Conduct experiments (and show the code for these experiments) using cross-validation and comment
on whether you confirmed (or not) your hypothesis. Finally, compare this method to BNB
and MNB on the metrics you used in Step 3 and choose one overall “best” method and settings
for topic classification.
Part 2. Recommendation Methods
1. (6 marks) The aim is to use the information retrieval algorithms for “matching” user profiles
to “documents” described in the lecture as a recommendation method. The overall idea is that
the classifier from Part 1 will assign a new movie to one of the 8 genres, and this movie will be
recommended to the user if the tf-idf vector for the movie is similar to the tf-idf vector for the
profile of the user in the predicted genre. The user profile for each genre will consist of the words,
or top M words, representing the interests of the user in that genre, computed as a tf-idf vector
across all movies predicted in that genre of interest to the user.
To get started, assume there is “training data” for the user profiles and “test data” for the
recommender defined as follows. There are 250 movies in each file. Suppose that the order in the
file is the time ordering of the movies, and suppose these movies came from a series of weeks, with
50 movies from each week. Assume Weeks 1–3 (movies 1–150) form the training data and Week 4
(movies 151–200) are the test data. Use TfidfVectorizer on all documents in the training data
to create a tf-idf matrix that defines a vector for each document (movie) in the training set.
Use these tf-idf values to define a user profile, which consists of a vector for each of the 8 genres.
To do this, for each genre, combine the movies from the training set predicted to be in that genre
that the user “likes” into one (larger) document, so there will be 8 documents, one for each genre,
and use the vectorizer defined above to define a tf-idf vector for each such document (genre).
Unfortunately we do not have any real users for our recommender system (because it has not yet
been built!), but we want some idea of how well it would perform. We invent two hypothetical
users, and simulate their use of the system. We specify the interests of each user with a set of
keywords for each genre. These user profiles can be found in the files user1.tsv and user2.tsv
where each line in the file is a genre and (followed by a tab) a list of keywords. All the words are
case insensitive. Important: Although we know the pairing of the genres and keywords,
all the recommender system “knows” is what movies the user liked in each genre.
Develop user profiles for User 1 and User 2 from the simulated training data (not the keywords
used to define their interests) by supposing they liked all the movies from Weeks 1–3 that matched
their interests and were predicted to be in the right category, i.e. assume the true genre is not
known, but instead the topic classifier is used to predict the movie genre, and the movie is shown
to the user listed under that genre. Print the top 20 words in their profiles for each of the genres.
Comment if these words seem reasonable.
Define another hypothetical “user” (User 3) by choosing different keywords across a range of
genres (perhaps those that match your interests or those of someone you know), and print the
top 20 keywords in their profile for each of their topics of interest. Comment if these words seem
reasonable.
2. (6 marks) Suppose a user sees N recommended movies and “likes” some of them. Choose and
justify appropriate metrics to evaluate the performance of the recommendation method. Also
choose an appropriate value for N based on how you think the movies will be presented. Pay
attention to the large variety of movies and the need to obtain useful feedback from the user (i.e.
they must like some movies shown to them).
Evaluate the performance of the recommendation method by testing how well the top N movies
that the recommender suggests for Week 4, based on the user profiles, match the interests of each
user. That is, assume that each user likes all and only those movies in the top N recommendations
that matched their profile for the predicted (not true) genre (where N is your chosen value). State
clearly whether you are showing N movies in total or N movies per genre. As part of the analysis,
consider various values for M, the number of words in the user profile for each genre, compared to
using all words.
Show the metrics for some of the matching algorithms to see which performs better for Users 1,
2 and 3. Explain any differences between the users. On the basis of these results, choose one
algorithm for matching user profiles and movies and explain your decision.
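To make the evaluation loop concrete, the sketch below ranks movies by cosine similarity to a profile vector and computes three common top-N metrics. Cosine similarity and these particular metrics are used only as examples; the vectors, movie ids and function names are invented for the illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_movies(profile_vec, movie_vecs):
    """Movie ids sorted from most to least similar to the profile."""
    return sorted(movie_vecs, key=lambda m: cosine(profile_vec, movie_vecs[m]), reverse=True)

def precision_at_n(ranked, liked, n):
    """Fraction of the top n recommendations that the user liked."""
    return sum(1 for m in ranked[:n] if m in liked) / n

def recall_at_n(ranked, liked, n):
    """Fraction of all liked movies that appear in the top n."""
    return sum(1 for m in ranked[:n] if m in liked) / len(liked) if liked else 0.0

def reciprocal_rank(ranked, liked):
    """1 / rank of the first liked movie, or 0.0 if none is liked."""
    for rank, m in enumerate(ranked, start=1):
        if m in liked:
            return 1.0 / rank
    return 0.0

# Invented 3-dimensional vectors standing in for tf-idf vectors.
profile = [1.0, 0.0, 1.0]
movies = {"m1": [1, 0, 1], "m2": [0, 1, 0], "m3": [1, 1, 0], "m4": [0, 0, 1]}
liked = {"m1", "m3"}

ranked = rank_movies(profile, movies)
print(ranked)
print(precision_at_n(ranked, liked, 2), recall_at_n(ranked, liked, 2), reciprocal_rank(ranked, liked))
```

Averaging such metrics over Users 1, 2 and 3, and over several values of M, gives the kind of table the question asks for.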
Part 3. User Evaluation
1. (5 marks) Conduct a “user study” of a hypothetical recommender system based on the method
chosen in Part 2. Your evaluation in Part 2 will have included a choice of the number N of movies
to show the user at any one time. For simplicity, suppose the user uses your system once per
week. Simulate running the recommender system for 3 weeks and training the model at the end
of Week 3 using interaction data obtained from the user, and testing the recommendations that
would be provided to that user in Week 4.
Choose one friendly “subject” and ask them to view (successively over a period of 4 simulated
weeks) N movies chosen at random for each “week”, for Weeks 1, 2 and 3, and then (after training
the model) the recommended movies from Week 4. The subject could be someone else from the
course, but preferably is someone without knowledge of recommendation algorithms who will give
useful and unbiased feedback.
To be more precise, the user is shown 3 randomly chosen batches of N movies, one batch from
Week 1 (N movies from 1–50), one batch from Week 2 (N movies from 51–100), and one batch
from Week 3 (N movies from 101–150), and says which of these they “like”. This gives training
data from which you can then train a recommendation model using the method in Part 2. The
user is then shown a batch of recommended movies from Week 4 (N movies from 151–200) in rank
order, and metrics are calculated based on which of these movies the user likes. Show all these
metrics in a suitable form (plots or tables).
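The weekly batching described above could be set up roughly as follows. The names week_slice and random_batches, the fixed seed, and the placeholder movie list are assumptions for this sketch; the only facts taken from the text are the 50-movie weeks and the 1–150 / 151–200 split.

```python
import random

WEEK_SIZE = 50  # 50 movies per simulated week, as stated above

def week_slice(movies, week):
    """Movies belonging to a 1-indexed week (week 1 -> movies 1-50, and so on)."""
    start = (week - 1) * WEEK_SIZE
    return movies[start:start + WEEK_SIZE]

def random_batches(movies, n, weeks=(1, 2, 3), seed=0):
    """One random batch of n movies per training week; seeded so the study is reproducible."""
    rng = random.Random(seed)
    return {w: rng.sample(week_slice(movies, w), n) for w in weeks}

# Placeholder list standing in for one file's 250 movies in file order.
movies = [f"movie_{i}" for i in range(1, 251)]

batches = random_batches(movies, n=10)
test_pool = week_slice(movies, 4)  # candidates for the ranked Week 4 recommendations

for week, batch in batches.items():
    print(week, batch[:3], "...")
```

Recording which movies in each batch the subject "likes" yields the training data for the profile; the Week 4 pool is then ranked by the method chosen in Part 2.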
Ask the subject to talk aloud but make sure you find out which movies they are interested in.
Calculate and show the various metrics for the Week 4 recommended movies that you would show
using the model developed in Part 2. Explain any differences between metrics calculated in Part 2
and the metrics obtained from the real user. Finally, mention any general user feedback concerning
the quality of the recommendations.
Submission and Assessment
• Please include your name and zid at the start of the notebook.
• Submit your notebook files using the following command:
give cs9727 asst .ipynb
You can check that your submission has been received using the command:
9727 classrun -check asst
• Assessment criteria include the correctness and thoroughness of code and experimental analysis,
clarity and succinctness of explanations, and presentation quality.
Plagiarism
Remember that ALL work submitted for this assignment must be your own work and no sharing
or copying of code or answers is allowed. You may discuss the assignment with other students but
must not collaborate on developing answers to the questions. You may use code from the Internet
only with suitable attribution of the source. You may not use ChatGPT or any similar software to
generate any part of your explanations, evaluations or code. Do not use public code repositories
on sites such as github or file sharing sites such as Google Drive to save any part of your work –
make sure your code repository or cloud storage is private and do not share any links. This also
applies after you have finished the course, as we do not want next year’s students accessing your
solution, and plagiarism penalties can still apply after the course has finished.
All submitted assignments will be run through plagiarism detection software to detect similarities
to other submissions, including from past years. You should carefully read the UNSW policy on
academic integrity and plagiarism (linked from the course web page), noting, in particular, that
collusion (working together on an assignment, or sharing parts of assignment solutions) is a form
of plagiarism.
Finally, do not use any contract cheating “academies” or online “tutoring” services. This counts
as serious misconduct with heavy penalties up to automatic failure of the course with 0 marks,
and expulsion from the university for repeat offenders.