DSCI553 Foundations and Applications of Data Mining
Spring 2021
Assignment 3
Deadline: Mar. 23rd 11:59 PM PST
1. Overview of the Assignment
In Assignment 3, you will complete three tasks. You will first implement Min-Hash and Locality Sensitive
Hashing (LSH) to find similar businesses efficiently. Then you will implement various types of
recommendation systems.
2. Requirements
2.1 Programming Requirements
a. You must use Python & Spark to implement all tasks. You can only use the standard Python libraries
(i.e., external libraries like numpy or pandas are not allowed).
b. You must use only Spark RDD; a submission that uses Spark DataFrame or DataSet receives no points.
c. There will be a 10% bonus for a Scala implementation of each task. You can get the bonus only when both
the Python and Scala implementations are correct.
2.2 Programming Environment
Python 3.6, Scala 2.11, and Spark 2.3.0
We will use Vocareum to automatically run and grade your submission. You must test your scripts on your
local machine and the Vocareum terminal before submission.
2.3 Write your own code
Do not share code with other students!!
For this assignment to be an effective learning experience, you must write your own code! We emphasize
this point because you may find Python implementations of some of the required functions on the Web.
Please do not look for or at any such code!
Plagiarism detection will combine all the code we can find from the Web (e.g., Github) as well as other
students’ code from this and other (previous) sections. We will report all detected plagiarism to the
university.
3. Yelp Data
For this assignment, we have generated sample review data from the original Yelp review dataset using
some filters, such as the condition: “state” == “CA”. We randomly took 80% of sampled reviews for
training, 10% for testing, and 10% as the blind dataset. (We do not share the blind dataset.) You can access
and download the following JSON files either under the directory on the Vocareum:
resource/asnlib/publicdata/ or on Google Drive (USC email only):
a. train_review.json
b. test_review.json – containing only the target user and business pairs for prediction tasks
c. test_review_ratings.json – containing the ground truth rating for the testing pairs
d. user_avg.json – containing the average stars for the users in the train dataset
e. business_avg.json – containing the average stars for the businesses in the train dataset
f. stopwords
g. We do not share the blind dataset.
4. Tasks
You need to submit the following files on Vocareum: (all in lowercase)
a. Python scripts: task1.py, task2train.py, task2predict.py, task3train.py, task3predict.py
b. Model files: task2.model, task3item.model, task3user.model
c. Result files: task1.res, task2.predict, task3item.predict, task3user.predict
d. Scala scripts: task1.scala, task2train.scala, task2predict.scala, task3train.scala, task3predict.scala; one
jar package: hw3.jar
e. Model files: task2.scala.model, task3item.scala.model, task3user.scala.model
f. Result files: task1.scala.res, task2.scala.predict
g. [OPTIONAL] You can include other scripts to support your programs (e.g., callable functions).
4.1 Task1: Min-Hash + LSH (2pts)
4.1.1 Task description
In this task, you will implement the Min-Hash and Locality Sensitive Hashing algorithms with Jaccard
similarity to find similar business pairs in the train_review.json file. We focus on 0/1 ratings rather than
the actual rating values in the reviews. In other words, if a user has rated a business, the user’s contribution
in the characteristic matrix is 1; otherwise, the contribution is 0 (Table 1). Your task is to identify business
pairs whose Jaccard similarity is >= 0.05.
Table 1: The left table shows the original ratings; the right table shows the converted 0 and 1 ratings.
You can define any collection of hash functions to permute the row entries of the characteristic matrix to
generate Min-Hash signatures. Some potential hash functions are:
f(x) = (a·x + b) % m or f(x) = ((a·x + b) % p) % m
where p is any prime number and m is the number of bins. You can define any combination for the parameters
(a, b, p, and m) in your implementation.
After you have defined all hash functions, you will build the signature matrix using Min-Hash. Then you
will divide the matrix into b bands with r rows each, where b × r = n (n is the number of hash functions).
You need to set b and r properly to balance the number of candidates and the computational cost. Two
businesses become a candidate pair if their signatures are identical in at least one band.
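The signature and banding steps can be illustrated in plain Python. This is only a minimal sketch on made-up toy data, not the required Spark RDD implementation: the hash family (a·x + b) % m, the coefficient values, and the function names `minhash_signature` and `lsh_candidates` are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def minhash_signature(row_ids, hash_params, num_bins):
    """Signature of one column: the minimum hashed row index per hash function."""
    return [min((a * x + b) % num_bins for x in row_ids) for a, b in hash_params]


def lsh_candidates(signatures, bands, rows):
    """Return pairs whose signatures agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[chunk].append(key)
        for members in buckets.values():
            for pair in combinations(sorted(members), 2):
                candidates.add(pair)
    return candidates


# Toy characteristic matrix: business -> set of user row indices (made-up data).
business_users = {
    "b1": {0, 1, 2, 3},
    "b2": {0, 1, 2, 3},
    "b3": {7, 8, 9},
}
# Hypothetical (a, b) coefficients: 4 hash functions = 2 bands x 2 rows.
hash_params = [(1, 1), (3, 2), (5, 4), (7, 3)]
num_bins = 11

signatures = {
    biz: minhash_signature(users, hash_params, num_bins)
    for biz, users in business_users.items()
}
candidates = lsh_candidates(signatures, bands=2, rows=2)
print(sorted(candidates))  # [('b1', 'b2')]
```

More bands with fewer rows each produce more candidate pairs (higher recall, more verification work); fewer bands with more rows do the opposite.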
Lastly, you need to verify the candidate pairs using their original Jaccard similarity. Table 2 shows an
example of calculating the Jaccard similarity between two businesses. Your final outputs will be the
business pairs whose Jaccard similarity is >= 0.05.
            user1  user2  user3  user4
business1     0      1      1      1
business2     0      1      0      0
Table 2: Jaccard similarity (business1, business2) = #intersection / #union = 1/3
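The Table 2 calculation can be reproduced with Python sets. This is a small sketch; the function name `jaccard` is only illustrative.

```python
def jaccard(users_a, users_b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = users_a | users_b
    if not union:
        return 0.0
    return len(users_a & users_b) / len(union)


# Users with a 1 in each row of Table 2.
business1 = {"user2", "user3", "user4"}
business2 = {"user2"}

similarity = jaccard(business1, business2)
print(similarity)  # 0.3333333333333333
```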
4.1.2 Execution commands
Python $ spark-submit task1.py
Scala $ spark-submit --class task1 hw3.jar
Both commands take two arguments, in this order:
Argument 1: the train review set
Argument 2: the output file for the similar business pairs and their similarities
4.1.3 Output format
You must write each business pair and its similarity in JSON format using exactly the same tags as the
example in Figure 1. Each line represents a business pair, e.g., “b1” and “b2”. For each business pair “b1”
and “b2”, you do not need to generate the output for “b2” and “b1” since the similarity value is the same as
for “b1” and “b2”. You do not need to truncate decimals for the ‘sim’ values.
Figure 1: An example output for Task 1 in the JSON format
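As a rough illustration of the one-JSON-object-per-line layout described above, the sketch below serialises pairs with the standard `json` module. The tag names "b1", "b2", and "sim" come from the text; the business IDs, similarity values, and the helper name `format_pair` are made up for the example.

```python
import json


def format_pair(b1, b2, sim):
    """Serialise one business pair as a single JSON line."""
    return json.dumps({"b1": b1, "b2": b2, "sim": sim})


# Hypothetical verified pairs: (business id, business id, Jaccard similarity).
pairs = [("bizA", "bizB", 0.0625), ("bizA", "bizC", 0.5)]

lines = [format_pair(*pair) for pair in pairs]
print("\n".join(lines))
```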
4.1.4 Grading
Your task 1 outputs (1pt) will be graded by precision and recall metrics defined below.
Precision = # true positives / # output pairs, Recall = # true positives / # ground truth pairs
Your precision should be >= 0.95 (0.5pt), and recall should be >= 0.5 (0.5pt). The execution time on
Vocareum should be less than 200 seconds. To evaluate the implementation, you can generate the ground
truth that contains all the business pairs in the train_review.json file whose Jaccard similarity is >=0.05 and
calculate precision and recall by yourself.
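A self-check along these lines could look like the following sketch. It assumes pairs are unordered, so ("b1", "b2") and ("b2", "b1") count as the same pair; the sample pairs and the function name are illustrative.

```python
def precision_recall(output_pairs, truth_pairs):
    """Precision and recall of output pairs against ground-truth pairs."""
    # frozenset makes (b1, b2) and (b2, b1) compare equal.
    output = {frozenset(pair) for pair in output_pairs}
    truth = {frozenset(pair) for pair in truth_pairs}
    true_positives = len(output & truth)
    precision = true_positives / len(output) if output else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Made-up example: 2 of 3 output pairs are correct, 2 of 4 true pairs are found.
output_pairs = [("b1", "b2"), ("b3", "b2"), ("b4", "b5")]
truth_pairs = [("b1", "b2"), ("b2", "b3"), ("b1", "b3"), ("b6", "b7")]

precision, recall = precision_recall(output_pairs, truth_pairs)
print(precision, recall)
```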
4.2 Task2: Content-based Recommendation System (2pts)
4.2.1 Task description
In this task, you will build a content-based recommendation system by generating profiles from review
texts for users and businesses in the train_review.json file. Then you will use the model to predict if a user
prefers to review a given business by computing the cosine similarity between the user and item profile
vectors.
During the training process, you will construct the business and user profiles as follows:
a. Concatenating all reviews for a business as one document and parsing the document, such as removing
punctuation, numbers, and stopwords. Also, you can remove extremely rare words to reduce the
vocabulary size. Rare words could be the ones whose frequency is less than 0.0001% of the total
number of words.
b. Measuring word importance using TF-IDF, i.e., term frequency multiplied by inverse document frequency
c. Using top 200 words with the highest TF-IDF scores to describe the document
d. Creating a Boolean vector with these significant words as the business profile
e. Creating a Boolean vector for representing the user profile by aggregating the profiles of the items that
the user has reviewed
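Steps b through e can be sketched in plain Python as follows. Several details are assumptions, since the handout does not fix them: term frequency is normalised by the most frequent word in the document, IDF is log2(N / document frequency), and a user profile is the union of the profiles of the businesses the user reviewed. The documents are made-up, already-cleaned token lists, and `top_k=2` stands in for the 200 words used in the task.

```python
import math
from collections import Counter


def top_tfidf_words(documents, top_k):
    """documents: name -> list of cleaned tokens. Returns name -> set of top words."""
    num_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))
    profiles = {}
    for name, tokens in documents.items():
        if not tokens:
            profiles[name] = set()
            continue
        counts = Counter(tokens)
        max_count = max(counts.values())
        # Assumed weighting: (count / max count) * log2(N / document frequency).
        scores = {
            word: (count / max_count) * math.log2(num_docs / doc_freq[word])
            for word, count in counts.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        profiles[name] = set(ranked[:top_k])
    return profiles


def user_profile(reviewed_businesses, business_profiles):
    """Assumed aggregation: union of the reviewed businesses' word sets."""
    return set().union(*(business_profiles[b] for b in reviewed_businesses))


documents = {
    "b1": ["pizza", "pizza", "cheese", "good"],
    "b2": ["sushi", "fresh", "good"],
    "b3": ["pizza", "sushi", "good", "good"],
}
profiles = top_tfidf_words(documents, top_k=2)
print(profiles["b1"])  # {'cheese', 'pizza'} (set order may vary)
print(user_profile(["b1", "b2"], profiles))
```

Representing each Boolean vector as the set of words whose entry is 1 keeps the profiles small, because most of the vocabulary is absent from any one document.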
During the prediction process, you will estimate if a user would prefer to review a business by computing
the cosine similarity between the profile vectors. The (user, business) pair is valid if their cosine similarity
is >= 0.01. You should only output these valid pairs.
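For Boolean vectors, the dot product is the number of shared words and each norm is the square root of the number of words, so cosine similarity reduces to a set computation. The sketch below assumes profiles are stored as sets of words; the profile contents and names are made up.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity of two Boolean vectors stored as sets of active words."""
    if not profile_a or not profile_b:
        return 0.0
    shared = len(profile_a & profile_b)
    return shared / math.sqrt(len(profile_a) * len(profile_b))


user = {"pizza", "cheese", "sushi", "fresh"}
business = {"pizza", "sushi"}

sim = cosine_similarity(user, business)
is_valid = sim >= 0.01  # threshold from the task description
print(sim, is_valid)
```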
4.2.2 Execution commands
Training commands:
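Python $ spark-submit task2train.py
Scala $ spark-submit --class task2train hw3.jar <train_file>
Both commands take three arguments, in this order:
Argument 1 (<train_file>): the train review set
Argument 2: the output model
Argument 3: the file containing the stopwords that can be removed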
Predicting commands:
Python $ spark-submit task2predict.py
Scala $ spark-submit --class task2predict hw3.jar
Both commands take three arguments, in this order:
Argument 1: the test review set (only target pairs)
Argument 2: the model generated during the training process
Argument 3: the output results
4.2.3 Output format:
Model format: There is no strict format requirement for the content-based model.
Prediction format:
You must write the results in JSON format using exactly the same tags as the example in Figure 2. Each
line represents a predicted pair of (“user_id”, “business_id”). You do not need to truncate decimals for ‘sim’
values.
Figure 2: An example prediction output for Task 2 in JSON format
4.2.4 Grading
You need to generate the content-based model and the prediction results (1pt). We will grade your
prediction results by calculating precision and recall using the ground truth (i.e., the blind reviews). The
definitions of precision and recall are the same as the ones in task 1. Your precision should be >= 0.8 (0.5pt)
and recall should be >= 0.7 (0.5pt) for the blind datasets. The execution time of the training process on
Vocareum should be less than 600 seconds. The execution time of the predicting process on Vocareum
should be less than 300 seconds.
4.3 Task3: Collaborative Filtering Recommendation System (4pts)
4.3.1 Task description
In this task, you will build collaborative filtering (CF) recommendation systems using the train_review.json
file. After building the systems, you will use the systems to predict the ratings for a user and business pair.
You are required to implement 2 cases:
• Case 1: Item-based CF recommendation system (2pts)
During the training process, you will build a recommendation system by computing the Pearson correlation
for the business pairs with at least three co-rated users. During the predicting process, you will use the
system to predict the rating for a given pair of user and business. You must use at most N business
neighbors who are the top N most similar to the target business for prediction (you can try various N, e.g.,
3 or 5).
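The Pearson correlation between two businesses can be sketched as below. The handout fixes the minimum of three co-rated users; the other choices here are assumptions: averages are taken over the co-rated users only, and a zero denominator (all co-rated ratings identical) yields 0.0. The ratings are made up.

```python
import math


def pearson(ratings_a, ratings_b, min_co_rated=3):
    """ratings_*: user_id -> stars. Returns None when too few users co-rated."""
    co_rated = sorted(set(ratings_a) & set(ratings_b))
    if len(co_rated) < min_co_rated:
        return None
    a = [ratings_a[u] for u in co_rated]
    b = [ratings_b[u] for u in co_rated]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    denominator = (
        math.sqrt(sum((x - mean_a) ** 2 for x in a))
        * math.sqrt(sum((y - mean_b) ** 2 for y in b))
    )
    if denominator == 0:
        return 0.0  # assumption: treat constant ratings as uncorrelated
    return numerator / denominator


business1 = {"u1": 5, "u2": 3, "u3": 1, "u4": 4}
business2 = {"u1": 4, "u2": 3, "u3": 2, "u5": 5}

sim = pearson(business1, business2)  # co-rated users: u1, u2, u3
print(sim)
```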
• Case 2: User-based CF recommendation system with Min-Hash LSH (2pts)
During the training process, you should combine the Min-Hash and LSH algorithms in your user-based CF
recommendation system since the number of potential user pairs might be too large to compute. You need
to (1) identify user pairs’ similarity using their co-rated businesses without considering their rating scores
(similar to Task 1). This process reduces the number of user pairs you need to compare for the final Pearson
correlation score. (2) compute the Pearson correlation for the user pair candidates with Jaccard
similarity >= 0.01 and at least three co-rated businesses. The predicting process is similar to Case 1.
4.3.2 Execution commands
Training commands:
Python $ spark-submit task3train.py
Scala $ spark-submit --class task3train hw3.jar <train_file>
Both commands take three arguments, in this order:
Argument 1 (<train_file>): the train review set
Argument 2: the output model
Argument 3: either “item_based” or “user_based”
Predicting commands:
Python $ spark-submit task3predict.py
Scala $ spark-submit --class task3predict hw3.jar
Both commands take five arguments, in this order:
Argument 1: the train review set
Argument 2: the test review set (only target pairs)
Argument 3: the model generated during the training process
Argument 4: the output results
Argument 5: either “item_based” or “user_based”
4.3.3 Output format:
Model format:
You must write the model in JSON format using exactly the same tags as the example in Figure 3. Each
line represents a business pair (“b1”, “b2”) for the item-based model (Figure 3a) or a user pair (“u1”, “u2”)
for the user-based model (Figure 3b). There is no need to have (“b2”, “b1”) or (“u2”, “u1”). You do not
need to truncate decimals for ‘sim’ values.
Figure 3: (a) is an example of the item-based model and (b) is an example of the user-based model
Prediction format:
You must write each target pair and its prediction in JSON format using exactly the same tags as the
example in Figure 4. Each line represents a predicted pair of (“user_id”, “business_id”). You do not need
to truncate decimals for ‘stars’ values.
Figure 4: An example output for task3 in JSON format
4.3.4 Grading
You need to generate the item-based and user-based CF models. We will grade your model using precision
and recall defined in task 1. For your item-based model, precision should be >= 0.9 (0.25pt) and recall
should be >= 0.9 (0.25pt). For your user-based model, precision should be >= 0.4 (0.25pt) and recall
should be >= 0.5 (0.25pt).
Besides, we will compare your prediction results against the ground truth in both test and blind datasets.
You should output the predictions ONLY generated from the model. Then we use RMSE (Root Mean
Squared Error) defined in the equation below to evaluate the performance. For those pairs that your model
cannot predict (e.g., due to cold start problem or too few co-rated users), we will predict them with the
business average stars for the item-based model and the user average stars for the user-based model. We
provide two files that contain the average stars for users and businesses in the training dataset, respectively. The
value of the “UNK” tag, which can be used for predicting new businesses and users, is the average star
rating over all reviews.
$$RMSE = \sqrt{\frac{1}{n} \sum_{i} (Pred_i - Rate_i)^2}$$
where $Pred_i$ is the prediction for business $i$, $Rate_i$ is the true rating for business $i$, and $n$ is the total
number of user and business pairs.
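The evaluation described above, filling unpredicted pairs with an average and then taking the root of the mean squared difference between predicted and true ratings, can be sketched as follows for the item-based case. The “UNK” tag comes from the handout; the dictionary layout, the sample values, and the function names `fill_missing` and `rmse` are assumptions for illustration.

```python
import math


def fill_missing(predicted, targets, business_avg):
    """Fill unpredicted (user, business) pairs with the business average or 'UNK'."""
    filled = dict(predicted)
    for user_id, business_id in targets:
        if (user_id, business_id) not in filled:
            filled[(user_id, business_id)] = business_avg.get(
                business_id, business_avg["UNK"]
            )
    return filled


def rmse(predicted, actual):
    """Root mean squared error over every pair in `actual`."""
    squared_errors = [(predicted[key] - actual[key]) ** 2 for key in actual]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: (user_id, business_id) -> stars.
business_avg = {"b1": 3.5, "UNK": 3.8}
actual = {("u1", "b1"): 5.0, ("u2", "b1"): 3.5, ("u3", "b9"): 3.8}
predicted = {("u1", "b1"): 4.0}  # the model could only predict one pair

filled = fill_missing(predicted, actual.keys(), business_avg)
score = rmse(filled, actual)
print(score)
```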
The execution time of the training process on Vocareum should be less than 600 seconds. The execution
time of the predicting process on Vocareum should be less than 100 seconds. RMSE for the item-based
model in both test and blind datasets should be <=0.91 (1.5pt), and for the user-based model in both datasets
should be <=1.01 (1.5pt). If the performance on only one of the two datasets reaches the threshold, you will
obtain 1pt.
5. About Vocareum
a. You can use the provided datasets under the directory: resource/asnlib/publicdata/
b. You should upload the required files under your workspace: work/
c. You must test your scripts on both the local machine and the Vocareum terminal before submission.
d. During the submission period, the Vocareum will directly evaluate the following result files: task1.res,
task2.predict, task3item.model, and task3user.model. The Vocareum will also run task3predict scripts
and evaluate the prediction results for both test and blind datasets.
e. During the grading period, the Vocareum will run both train and predict scripts. If the training or
predicting process fails to run, you can get 50% of the score only if the submission report shows
that your submitted models or results are correct (regrading).
f. Here are the commands that you can use to run Python scripts on Vocareum:
g. You will receive a submission report after Vocareum finishes executing your scripts. The submission
report should show precision and recall for each task. We do not test the Scala implementation during
the submission period.
h. Vocareum will automatically run both Python and Scala implementations during the grading period.
i. The total execution time of the submission period should be less than 600 seconds. The execution time
of grading period needs to be less than 3000 seconds.
j. Please start your assignment early! You can resubmit any script on Vocareum. We will only grade on
your last submission.
6. Grading Criteria
(% penalty = % penalty of possible points you get)
a. You can use your free 5-day extension separately or together. You must submit a late-day request via
https://forms.gle/6aDASyXAuBeV3LkWA. This form records the number of late days you use for
each assignment. By default, we will not count late days if no request is submitted.
b. There will be a 10% bonus for each task if your Scala implementations are correct. The Scala bonus is
calculated only when your Python results are correct. There is no partial credit for Scala.
c. There will be no point if your submission cannot be executed on Vocareum.
d. Regrading is limited to grading errors. Once the grade is posted on Blackboard, we will regrade your
assignment only if there is a grading error. No exceptions.
e. There will be a 20% penalty for the late submission within one week and no point after that.