IRDM Course Project Part I
IRDM 2020
1 Task Definition
An information retrieval model is an essential component for many applications (e.g. search,
question answering and recommendation). Your task in this project is to develop an information
retrieval model that solves the problem of passage retrieval, i.e., a model that can
effectively and efficiently return a ranked list of short texts (i.e. passages) relevant to a given
query.
This is an individual project, so everyone is expected to submit their own code and project
reports. This is the first part of a larger project, which consists of two components. In the
second part of the project, we will be building upon this first part and will be working on
building more advanced retrieval models.
In this part of the assignment, our final goal is to build a passage re-ranking system:
given a candidate list of passages for a query (already retrieved using an
initial retrieval model that we have developed), re-rank these candidate passages using the
retrieval models specified in the assignment.
2 Data
The dataset you will be using is available through this url. Our dataset consists of 3 files:
• test-queries.tsv is a tab separated file, where each row contains a query ID (qid) and
the query (i.e., query text).
• passage_collection.txt contains passages in our collection where each row is a passage.
• candidate_passages_top1000.tsv is a tab separated file, containing initial rankings that
contain 1000 passages for each of the given queries in file test-queries.tsv. The format
of this file is <qid pid query passage>, where qid is the query ID, pid is the ID of
the passage retrieved, query is the query text and passage is the passage text, all tab
separated. Figure 1 shows some sample rows from the file.
February 12, 2021
Figure 1: Sample rows from candidate_passages_top1000.tsv file
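The candidate file described above can be read one row at a time and grouped by query. The following is a minimal sketch, not a prescribed loader: it assumes the four tab-separated columns appear in the order in which the text lists them (qid, pid, query, passage), and the function name `load_candidates` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def load_candidates(lines):
    """Group candidate passages by query ID.

    `lines` is any iterable of tab-separated rows, assumed to be in the
    order qid, pid, query, passage. Returns (queries, candidates):
    qid -> query text, and qid -> list of (pid, passage) in file order.
    """
    queries = {}
    candidates = defaultdict(list)
    # QUOTE_NONE: treat quote characters inside passages as plain text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for qid, pid, query, passage in reader:
        queries[qid] = query
        candidates[qid].append((pid, passage))
    return queries, dict(candidates)

# Tiny in-memory stand-in for candidate_passages_top1000.tsv (made-up rows).
sample = io.StringIO(
    "q1\tp10\twhat is zipf law\tZipf's law relates rank and frequency.\n"
    "q1\tp11\twhat is zipf law\tPassages are short texts.\n"
    "q2\tp10\tbm25 ranking\tZipf's law relates rank and frequency.\n"
)
queries, candidates = load_candidates(sample)
```

For the real file, an open file handle (for example `open(path, encoding="utf-8", newline="")`) can be passed in place of `sample`.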
3 Subtasks
The course project involves several subtasks that are required to be solved. The four subtasks
of this project are described below.
1. Text Statistics (20 marks). Perform any type of pre-processing on the collection as
you think is required. Implement a function that counts the frequency of terms from
the provided dataset, plot the distribution of term frequencies and verify if they follow
Zipf’s law. Report the values of the parameters for Zipf’s law for this collection. You
need to use the full collection (file named passage_collection.txt) for this question.
Generate a plot that shows how the results you get using the model based on Zipf’s
law compare with the values you get from the actual collection.
2. Inverted Index (20 marks). Build an inverted index for the collection so that you can
retrieve passages from the initial set of candidate passages in an efficient way. To
implement an effective inverted index, you may consider storing additional information
such as term frequency and term position. Report what type of information you have
stored in your inverted index. Since your task in this project is to focus on re-ranking
candidate passages you were given for each query, you can generate a separate index
for each query by using the candidate list of passages you are provided with for each
query (using the file candidate_passages_top1000.tsv ).
3. Retrieval Models (30 marks). Extract the tf-idf vector representations of the passages
using the inverted index you have constructed. Implement the vector space model and
BM25 using your own implementation and retrieve 100 passages from within the 1000
candidate passages for each query. For both models, submit
the 100 passages you have retrieved in decreasing order of score, with the
top-scoring passage first.
4. Retrieval Models, Language Modelling (30 marks). Implement the query likelihood
language model with i) Dirichlet smoothing, where µ = 2000, ii) Laplace smoothing,
and iii) Lidstone correction with ε = 0.5, using your own implementation, and retrieve
100 passages from within the 1000 candidate passages for each query. For all three
smoothing variants, submit the 100 passages you have retrieved in sorted order (sorted
in decreasing order of score, with the top-scoring passage first).
Which smoothing version do you expect to work better? Explain.
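The three smoothing variants of the query likelihood model in subtask 4 differ only in how the per-term probability is estimated. The sketch below uses the standard textbook estimates and is not a reference solution. Its assumptions: the vocabulary size is the number of distinct terms in the collection, query terms that never occur in the collection are skipped, and the function name `query_log_likelihood` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, coll_tf, smoothing,
                         eps=0.5, mu=2000):
    """Log of P(query | document) under one of three smoothing schemes.

    coll_tf maps each term to its frequency in the whole collection.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    vocab_size = len(coll_tf)            # assumption: |V| = distinct collection terms
    coll_len = sum(coll_tf.values())
    score = 0.0
    for term in query_terms:
        cf = coll_tf.get(term, 0)
        if cf == 0:
            continue                     # assumption: ignore terms unseen everywhere
        tf = doc_tf[term]
        if smoothing == "laplace":
            p = (tf + 1) / (doc_len + vocab_size)
        elif smoothing == "lidstone":
            p = (tf + eps) / (doc_len + eps * vocab_size)
        elif smoothing == "dirichlet":
            p = (tf + mu * cf / coll_len) / (doc_len + mu)
        else:
            raise ValueError(f"unknown smoothing: {smoothing}")
        score += math.log(p)
    return score

# Toy collection of two already-tokenised documents (made up).
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog barked".split(),
}
coll_tf = Counter(t for tokens in docs.values() for t in tokens)
query = ["cat"]
results = {
    name: {d: query_log_likelihood(query, toks, coll_tf, name)
           for d, toks in docs.items()}
    for name in ("laplace", "lidstone", "dirichlet")
}
```

Summing log probabilities rather than multiplying raw probabilities avoids numerical underflow for longer queries and does not change the ranking.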
You should have one file per model (named VS.txt and BM25.txt, LM-Dirichlet.txt,
LM-Laplace.txt, LM-Lindstone.txt, respectively), where the format of the file is:
<qid A1 pid rank score algoname>
...
The width of columns in the format is not important, but it is important to have exactly
six columns per line with at least one space between the columns. In this format:
- The first column is the query number.
- The second column is currently unused and should always be “A1”, to refer to the
fact that this is your submission for Assignment 1.
- The third column is the passage identifier.
- The fourth column is the rank the passage/document is retrieved (starting from 1,
down to 100).
- The fifth column shows the score (integer or floating point) of the model that generated
the ranking.
- The sixth column refers to the algorithm you used for retrieval (for example VS
or BM25, depending on which model you used).
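The six-column layout described above can be produced with a few lines of string formatting. This is only a sketch of one way to do it: the function name `format_run`, the input structure (a mapping from query ID to a list of passage ID and score pairs) and the sample scores are assumptions made for illustration.

```python
def format_run(ranked, algo_name, top_k=100):
    """Return result lines: qid, "A1", pid, rank, score, algorithm name.

    ranked maps each qid to a list of (pid, score) pairs in any order.
    """
    lines = []
    for qid, scored in ranked.items():
        # Highest score first, keep only the top_k passages.
        top = sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
        for rank, (pid, score) in enumerate(top, start=1):
            lines.append(f"{qid} A1 {pid} {rank} {score} {algo_name}")
    return lines

# Made-up scores for a single query; top_k=2 keeps the example short.
scores = {"q1": [("p7", 1.25), ("p3", 4.5), ("p9", 0.5)]}
lines = format_run(scores, "BM25", top_k=2)
```

Writing the lines to the per-model file is then a matter of, for example, `open("BM25.txt", "w").write("\n".join(lines) + "\n")`.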
4 Submission
You are expected to submit all the code you have implemented for text pre-processing, Zipf’s
law, inverted index, and retrieval models. All the code should be your own and you are not
allowed to reuse any code that is available from someone/somewhere else. You may use
either Python or Java as the programming language.
Additionally, you should also submit five files that contain the retrieval results of the
vector space model, BM25 model and language models with the three different smoothing
variants in the format that was described above.
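To illustrate the inverted index and BM25 steps whose results are submitted here, the sketch below builds a small index that stores term frequencies and scores passages with it. It is one possible shape, not the required design. It assumes a common simplified BM25 form (no query-term-frequency factor, and an idf with 1 added inside the logarithm so that it stays non-negative), and the parameter values k1 = 1.2 and b = 0.75 are conventional defaults rather than values given in this document. All names and the toy passages are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_index(passages):
    """passages: pid -> list of tokens.

    Returns (index, doc_len), where index maps term -> {pid: term frequency}
    and doc_len maps pid -> number of tokens.
    """
    index = defaultdict(dict)
    doc_len = {}
    for pid, tokens in passages.items():
        doc_len[pid] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][pid] = tf
    return index, doc_len

def bm25_scores(query_tokens, index, doc_len, k1=1.2, b=0.75):
    """Score every passage that contains at least one query term."""
    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs
    scores = defaultdict(float)
    for term in set(query_tokens):       # repeated query terms count once here
        postings = index.get(term, {})
        df = len(postings)
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        for pid, tf in postings.items():
            norm = k1 * (1 - b + b * doc_len[pid] / avg_len)
            scores[pid] += idf * tf * (k1 + 1) / (tf + norm)
    return dict(scores)

# Toy candidate set for one query (made up, already tokenised).
passages = {
    "p1": ["zipf", "law", "rank", "frequency"],
    "p2": ["inverted", "index", "term", "frequency", "frequency"],
    "p3": ["passage", "ranking"],
}
index, doc_len = build_index(passages)
scores = bm25_scores(["term", "frequency"], index, doc_len)
```

Because only the postings of the query terms are visited, passages that share no term with the query (p3 here) are never touched, which is the efficiency argument for using an inverted index.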
You are also expected to submit a written report whose size should not exceed 4 pages,
including references. Your report should describe the work you have done for each of the
aforementioned steps. Specifically, your report should consist of the following:
1. Describe how you perform the text pre-processing and justify why text pre-processing
is required.
2. Explain how you implement Zipf’s law, provide a plot comparing your model with
the actual collection and report the values of the parameters for Zipf’s law for this
collection.
3. Explain how you implemented the inverted index, what information you have stored
and justify why you decided to store that information.
4. Describe how you implemented the vector space and BM25 models, and what parameters
you have used for BM25.
5. Describe how you implemented the language models, and how you expect their performance
to compare with each other.
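For the Zipf's law comparison requested in item 2, one common approach is to fit a straight line to log normalised frequency against log rank; the negated slope then estimates the exponent s in frequency ≈ C · rank^(-s). The sketch below assumes this least-squares approach, which is only one of several ways to estimate the parameters, and uses a hypothetical function name and a synthetic token list whose counts are exactly proportional to 1/rank.

```python
import math
from collections import Counter

def zipf_fit(tokens):
    """Fit normalised frequency ~ C * rank**(-s) by least squares in log-log space.

    Returns (s, C, freqs), where freqs are the raw counts sorted in
    decreasing order. Needs at least two distinct terms.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    total = sum(freqs)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f / total) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept), freqs

# Synthetic tokens: word k occurs 60 // k times, i.e. 60, 30, 20, 15, 12.
tokens = [f"w{k}" for k in range(1, 6) for _ in range(60 // k)]
s, C, freqs = zipf_fit(tokens)
# Model prediction for each rank, to plot against the observed freqs / total.
predicted = [C * rank ** (-s) for rank in range(1, len(freqs) + 1)]
```

Plotting `predicted` and the observed normalised frequencies on log-log axes gives the model-versus-collection comparison; on real text the fit is usually poorer at the very top and bottom ranks than in the middle.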
You are required to use the SIGIR 2020 style template for your report. You can use either
the LaTeX or the Microsoft Word template available from the ACM website (see footnote 1; use the “sigconf”
proceedings template). Please do not change the template (e.g. by reducing or increasing the
font size, margins, etc.).
5 Deadline
The deadline for this part of the assignment is 4:00pm on 23 March 2021 (based on
GMT timezone). All the material will be submitted via Moodle.
Footnote 1: https://www.acm.org/publications/proceedings-template