IRDM Course Project Part II
IRDM 2020
March 15, 2021
1 Task Definition
An information retrieval model is an essential component of many applications (e.g., search, question answering, and recommendation). Similar to the first part of the project, your task
in this assignment is to develop an information retrieval model that solves the problem of
passage retrieval, i.e., a model that can effectively and efficiently return a ranked list of short
texts (i.e. passages) relevant to a given query. In this part of the assignment, your goal is to
improve the basic models that you implemented in the first part.
This is an individual project; therefore, everyone is expected to submit their own project report.
2 Data
The dataset from the previous task is available through this URL, and the dataset for training and validation is available through this URL. Our dataset consists of the following files:
• test-queries.tsv is a tab separated file, where each row contains a query ID (qid) and
the query (i.e., query text).
• candidate_passages_top1000.tsv is a tab separated file containing the initial rankings of 1000 passages for each of the queries in test-queries.tsv (as in the first part of the assignment). Each row of this file has the format
qid pid query passage
where qid is the query ID, pid is the ID of the retrieved passage, query is the query text, and passage is the passage text, all tab separated. Figure 1 shows some sample rows from the file.
• train_data.tsv and validation_data.tsv are the datasets you will be using for training and validation. You are expected to train your model on the training set and evaluate your models' performance on the validation set. In these datasets, you are given an additional relevance column indicating the relevance of the passage to the query, which you will need during training and validation.
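For convenience, these files can be loaded, for example, with pandas. The sketch below is illustrative only; the column names (and the assumption that the files carry no header row) are inferred from the descriptions above and should be adjusted to the actual files.

    import pandas as pd

    # Column names are inferred from the descriptions above; adjust the
    # `names`/`header` arguments if the actual files contain a header row.
    test_queries = pd.read_csv("test-queries.tsv", sep="\t",
                               names=["qid", "query"])
    candidates = pd.read_csv("candidate_passages_top1000.tsv", sep="\t",
                             names=["qid", "pid", "query", "passage"])
    train = pd.read_csv("train_data.tsv", sep="\t",
                        names=["qid", "pid", "query", "passage", "relevancy"])
    valid = pd.read_csv("validation_data.tsv", sep="\t",
                        names=["qid", "pid", "query", "passage", "relevancy"])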
3 Subtasks
The course project involves several subtasks that you are required to solve. The four subtasks of this project are described below.
Figure 1: Sample rows from candidate_passages_top1000.tsv file
1. Evaluating Retrieval Quality. (20 marks) Implement methods to compute the average precision and NDCG metrics, and use these metrics to measure the performance of BM25 as the retrieval model. Your marks for this part will mainly depend on your implementation of the metrics (as opposed to your implementation of BM25, since we already focused on that in the first assignment).
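For illustration, per-query average precision and NDCG can be computed along the following lines (a minimal sketch, not a prescribed implementation; the function names, the binary treatment of relevance for AP, and the 2^grade - 1 gain for NDCG are choices of this sketch).

    import math

    def average_precision(ranked_pids, relevance):
        """AP for one query; relevance maps pid -> grade (>0 counts as relevant)."""
        hits, precision_sum = 0, 0.0
        for rank, pid in enumerate(ranked_pids, start=1):
            if relevance.get(pid, 0) > 0:
                hits += 1
                precision_sum += hits / rank
        total_relevant = sum(1 for g in relevance.values() if g > 0)
        return precision_sum / total_relevant if total_relevant else 0.0

    def ndcg_at_k(ranked_pids, relevance, k=10):
        """NDCG@k for one query, using the 2^grade - 1 gain."""
        def dcg(grades):
            return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))
        gains = [relevance.get(pid, 0) for pid in ranked_pids[:k]]
        ideal = dcg(sorted(relevance.values(), reverse=True)[:k])
        return dcg(gains) / ideal if ideal > 0 else 0.0

The figures reported for BM25 (and for the models below) would then be the means of these per-query scores over all queries in the validation set.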
2. Logistic Regression (LR). (25 marks) Represent passages and queries based on a word embedding method (such as Word2Vec, GloVe, FastText, or ELMo). Compute query (/passage) embeddings by averaging the embeddings of all the words in that query (/passage). With these query and passage embeddings as input, implement a logistic regression model to assess the relevance of a passage to a given query (a minimal sketch is given after the notes below). Describe how you perform input processing and representation, or the features used. Using the metrics you have implemented in the previous part, report the performance of your model on the validation data. Analyze the effect of the learning rate on the model's training loss. All implementations of the logistic regression algorithm must be your own for this part.
Important Notes:
• The training data size you are given is quite small, so it should not cause you much difficulty in training; but in case you have any issues with the data size, feel free to use a sample of the training data.
• If you think it is necessary, you are allowed to use negative sampling for generating
a subset of training data (possibly together with other sampling methods if
needed).
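As a rough illustration of the kind of from-scratch implementation expected here, the sketch below averages word embeddings and fits a logistic regression with batch gradient descent. The feature construction (concatenating the two embeddings), the variable names, and the hyper-parameter values are assumptions of the sketch, not requirements.

    import numpy as np

    def embed(text, word_vectors, dim):
        """Average the embeddings of all words in the text (zeros if none are known).
        word_vectors is any mapping word -> vector, e.g. pre-loaded GloVe vectors."""
        vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic_regression(X, y, lr=0.01, epochs=200):
        """Plain batch gradient descent on the binary cross-entropy loss.
        X: one row per (query, passage) pair, e.g. [query_emb, passage_emb] concatenated.
        y: 0/1 relevance labels. Returns weights, bias, and the per-epoch loss curve."""
        w, b = np.zeros(X.shape[1]), 0.0
        losses, n, eps = [], len(y), 1e-12
        for _ in range(epochs):
            p = sigmoid(X @ w + b)
            w -= lr * (X.T @ (p - y)) / n
            b -= lr * np.mean(p - y)
            losses.append(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
        return w, b, losses

Re-running the training with different values of lr and plotting the returned loss curves is one straightforward way to carry out the learning-rate analysis asked for above.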
3. LambdaMART Model (LM). (25 marks) Use the LambdaMART [1] learning-to-rank algorithm (a variant of LambdaRank we have learned in class) from the XGBoost gradient boosting library (https://xgboost.readthedocs.io/en/latest/index.html) to learn a model that can re-rank passages. You can instruct XGBoost to use the LambdaMART algorithm for ranking by setting the appropriate value of the objective parameter, as described in the documentation (https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters). You are expected to carry out hyper-parameter tuning in this task and describe the methodology used in deriving the best-performing model. Using the metrics you have implemented in the first part, report the performance of your model on the validation data. Describe how you perform input processing and representation, or the features used.
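By way of example, the XGBoost scikit-learn wrapper exposes a ranker whose objective can be set to one of the LambdaMART ranking objectives described in the documentation linked above. In the sketch below, X_train, y_train, group_sizes and X_valid_query are assumed placeholders (feature matrix, relevance labels, number of candidate rows per query, and the candidates of one validation query), and the hyper-parameter values are arbitrary starting points to be tuned.

    import numpy as np
    import xgboost as xgb

    # Rows belonging to the same query must be contiguous in X_train/y_train;
    # group_sizes lists how many rows each query contributes.
    ranker = xgb.XGBRanker(
        objective="rank:ndcg",   # LambdaMART-style ranking objective
        learning_rate=0.1,
        n_estimators=300,
        max_depth=6,
        subsample=0.8,
    )
    ranker.fit(X_train, y_train, group=group_sizes)

    # Re-rank the candidate passages of one validation query by predicted score.
    scores = ranker.predict(X_valid_query)
    reranked_order = np.argsort(-scores)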
4. Neural Network Model (NN). (30 marks) Using the same training data representation from the previous question, build a neural-network-based model that can re-rank passages. You may use existing packages, namely TensorFlow or PyTorch, in this subtask. You are expected to justify your neural network architecture by providing the motivation for it and explaining how it fits our problem. You are allowed to use different types of (deep) neural network architectures (e.g. feed-forward, convolutional, recurrent and/or transformer-based neural networks) for this part. Using the metrics you have implemented in the first part, report the performance of your model on the validation data. Describe how you perform input processing and representation, or the features used.
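As one possible starting point (not a prescribed architecture), the sketch below scores each query-passage pair with a small feed-forward network in PyTorch, trained with a pointwise binary cross-entropy objective; input_dim and train_loader are assumptions of the sketch (the dimensionality of the concatenated embeddings and a DataLoader yielding feature/label batches).

    import torch
    import torch.nn as nn

    class PassageScorer(nn.Module):
        """Feed-forward re-ranker: concatenated query/passage embedding in,
        a single relevance score out."""
        def __init__(self, input_dim, hidden_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, x):
            return self.net(x).squeeze(-1)

    model = PassageScorer(input_dim=200)        # e.g. two 100-d embeddings concatenated
    criterion = nn.BCEWithLogitsLoss()          # pointwise relevance loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(10):
        for features, labels in train_loader:   # assumed DataLoader of (embedding, relevance)
            optimizer.zero_grad()
            loss = criterion(model(features), labels.float())
            loss.backward()
            optimizer.step()

At evaluation time, the candidate passages of each query are sorted by the model's scores, and the metrics from the first subtask are applied to the resulting ranking.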
3.1 Submission of Test Results.
You should have one file per model (named LR.txt, LM.txt, and NN.txt, respectively),
where the format of the file is:
qid A1 pid rank score algoname
The width of columns in the format is not important, but it is important to have exactly
six columns per line with at least one space between the columns. In this format:
- The first column is the query number.
- The second column is currently unused and should always be “A1”, to refer to the
fact that this is your submission for Assignment 1.
- The third column is the passage identifier.
- The fourth column is the rank at which the passage/document is retrieved (starting from 1, down to 100).
- The fifth column shows the score (integer or floating point) of the model that generated the ranking.
- The sixth column refers to the algorithm you used for retrieval (either LR, LM or NN, depending on which model you used).
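A minimal sketch of writing such a file is given below; ranked_results is an assumed structure mapping each query ID to its (pid, score) pairs, already sorted by descending score.

    def write_run(path, ranked_results, algoname):
        """Write a six-column run file in the format described above."""
        with open(path, "w") as f:
            for qid, results in ranked_results.items():
                for rank, (pid, score) in enumerate(results[:100], start=1):
                    f.write(f"{qid} A1 {pid} {rank} {score} {algoname}\n")

    # e.g. write_run("LR.txt", lr_results, "LR")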
4 Submission
You are expected to submit all the code you have implemented for all the parts of the assignment (e.g. evaluation metrics, data representation, logistic regression, LambdaMART training, neural network implementation, etc.). All the code should be your own, and you are not allowed to reuse any code that is available online. You are allowed to use either Python or Java as the programming language.
You are also expected to submit a written report whose size should not exceed 6 pages,
including references. Your report should describe the work you have done for each of the
aforementioned steps. Your report should explicitly describe the performance of the models
you have implemented, the input representations (or features) you have used, how you have
used the training and validation sets (any sub-sampling done, etc.), how you have done
hyper-parameter tuning, the neural architecture you have used and why, etc.
You are required to use the SIGIR 2020 style template for your report. You can use either LaTeX or Word, available from the ACM website (use the "sigconf" proceedings template).
Please do not change the template (e.g. reducing or increasing the font size, margins, etc.).
5 Deadline
The deadline will be announced later.