首页
网站开发
桌面应用
管理软件
微信开发
App开发
嵌入式软件
工具软件
数据采集与分析
其他
首页
>
> 详细
讲解数据库SQL|讲解留学生Processing|辅导Web开发|辅导Web开发
项目预算:
开发周期:
发布时间:
要求地区:
Assignment 1 �C IR
In this assignment, your task is to index a document collection into an inverted index, and then
measure search performance based on predefined queries.
A new document collection containing more than 10,000 government site descriptions, and a set
of predefined queries, is provided for this assignment. The provided code implements most of an
information retrieval system. This provided code is designed to be simple to understand and
modify, it is not particularly efficient nor scalable. When developing a real-world IR system you
would be better off using high performance software such as Apache Lucene.
Throughout this assignment:
1. You will develop a better understanding of indexing, including the tokenizer, parser, and
normaliser components, and how to improve the search performance given a predefined
evaluation metric,
2. You will develop a better understanding of search algorithms, and how to obtain better
search results, and
3. You will find the best way to combine an indexer and search algorithm to maximise your
performance.
Throughout this assignment you will make changes to the provided code to improve the
information retrieval system. In addition, you will produce an answers file with your responses to
each question. Your answers file must be a .pdf file named u1234567.pdf where u1234567 is
your Uni ID. You should submit both your modified code files and answer.pdf inside a single zip
file.
Your answers to coding questions will be marked based on the quality of your code (is it efficient,
is it readable, is it extendable, is it correct).
Your answers to discussion questions will be marked based on how convincing your
explanations are (are they sufficiently detailed, are they well-reasoned, are they backed by
appropriate evidence, are they clear).
Question 1 �C Running queries (20%)
You have been provided with a simple implementation of an inverted index search system, which
can be found in files inverted_index.py, preprocessor.py and similarity_measures.py. Your first task is
to complete the run_queries.py script. The script should
1. Create an inverted index from gov/documents then
2. Run all of the queries from gov/topics and
3. Write the returned results for all of the queries to runs/retrieved.txt in TREC_EVAL format.
Step 1. has already been implemented for you. You will need to edit inverted_index.py to complete
the run_query function so that it returns the highest scoring documents. You will then need to
complete the run_queryies.py script to implement steps 2. and 3.
The Text Retrieval Conference (TREC) is a popular conference for research on designing better
information retrieval systems. We will use their evaluation tool TREC_EVAL to measure the
performance of your system. The tool expects the returned results to be in a file with a very specific
format, each line in this file must be of the form:
query_id Q0 document_id rank score MY_IR_SYSTEM
Where query_id identifies the query that this document was returned for. Q0 is the literal string Q0.
document_id is the name of the document that was returned. Score is the similarity score between
query and document. Rank is the order this document was retrieved in (e.g. a rank of 0 means that
this document was retrieved first, rank of 1 means second and so on). MY_IR_SYSTEM is the literal
string MY_IR_SYSTEM. Once you have successfully created the file you should be able to run the
evaluate.py file to see how well your retrieved documents matched the ground truth relevant
documents for the given queries.
In your answers file to this question put a list of your system��s scores for each of the evaluation
measures, e.g
map: 0.1
Rprec: 0.1
recip_rank: 0.1
P_5: 0.1
P_10: 0.1
P_15: 0.1
Question 2 �C TF-IDF Similarity (20%)
Currently, the inverted index uses TF similarity to calculate scores. Your next task is to implement TF-
IDF similarity functionality. You will need to complete the methods of the TFIDF_Similarity class in
the similarity_measures.py file. You are encouraged to study the provided class TF_Similarity to see
how data can be extracted from the inverted index. There are many ways of implementing similarity
using TF-IDF; while the lectures suggest weighting both the query and documents using TF-IDF this is
not always the best choice in practice. Here you will implement the scheme described in SMART
notation as lnc.ltc . The details of which are described in one of the course texts `Introduction to
Information Retrieval, Manning C, Raghavan P, Sch��tze H, Draft April 1 2009` in section 6.4.3 and
figure 6.15. The following is a reproduction of this figure (for more details see the textbook):
The mnemonic lnc.ltc corresponds to a weighting scheme of lnc for the document terms and ltc for
the query terms. The first letter of each scheme corresponds to the term frequency weighting, the
second corresponds to the document frequency, and the last corresponds to the normalization.
After you have successfully implemented TF-IDF Similarity as described you can change the sim
variable in run_queries.py to TFIDF_Similarity. Next, Run evaluate.py with your new TFIDF_Similarity
index. Write down your new evaluation scores.
Question 3 �C Evaluation measures (20%)
Did your system perform better with TF_IDF or TF similarity? You will need to decide which of the
provided evaluation measures is most useful for evaluating this system. In your answer you should
state which measure(s) you are basing your decision on, and why your chosen measure(s) are
appropriate specifically for this gov corpus. If you need to make any assumptions about how the
system will be used, then make sure to state them.
You should continue using the best performing similarity measure for the next questions.
Question 4 �C Improving the pre-processor (20%)
The pre-processor defines how raw text is converted into tokens to be indexed. Currently, it does
whitespace tokenization and porter-stemming. Your task is to change the Preprocessor class to
improve your system��s evaluation. You may want to try changing the tokenizer, the stemmer, or any
implement any other technique mentioned in the lectures (eg normalization, stop word removal).
You are strongly encouraged to make use of the NLTK library for this task, documentation can be
found here: https://www.nltk.org/api/nltk.html.
Briefly describe four different settings of Preprocessor that you tried, and what impact they had on
your system��s performance. Explain why you think that each change increased/decreased your
model��s evaluation measures. Write down your evaluation measures for the best performing
settings that you found.
Tip: Make sure you change the used_stored_index flag to False in run_queries.py to force your index
to be rebuilt with your new pre-processing.
Question 5 �C Further Modification (20%)
Your final task is to implement some further modification to the system in order to improve its
performance. Exactly what kind of modification you implement is left up to you, some examples of
modifications you could make:
? Implement BM25 similarity for the system (http://ipl.cs.aueb.gr/stougiannis/bm25.html).
? Adjust the inverted_index to use different weighted fields.
? Implement some domain specific pre-processing.
? Implement a different document and query weighting scheme.
? Anything else you can think of.
In your answers describe your best Information Retrieval system, write down its evaluation
measures, and explain why you think your changes improved the measures.
Academic Misconduct Policy: All submitted written work and code must be your own (except for any
provided starter code, of course) �C submitting work other than your own will lead to both a failure
on the assignment and a referral of the case to the ANU academic misconduct review procedures:
ANU Academic Misconduct Procedures
软件开发、广告设计客服
QQ:99515681
邮箱:99515681@qq.com
工作时间:8:00-23:00
微信:codinghelp
热点项目
更多
代做ceng0013 design of a pro...
2024-11-13
代做mech4880 refrigeration a...
2024-11-13
代做mcd1350: media studies a...
2024-11-13
代写fint b338f (autumn 2024)...
2024-11-13
代做engd3000 design of tunab...
2024-11-13
代做n1611 financial economet...
2024-11-13
代做econ 2331: economic and ...
2024-11-13
代做cs770/870 assignment 8代...
2024-11-13
代写amath 481/581 autumn qua...
2024-11-13
代做ccc8013 the process of s...
2024-11-13
代写csit040 – modern comput...
2024-11-13
代写econ 2070: introduc2on t...
2024-11-13
代写cct260, project 2 person...
2024-11-13
热点标签
mktg2509
csci 2600
38170
lng302
csse3010
phas3226
77938
arch1162
engn4536/engn6536
acx5903
comp151101
phl245
cse12
comp9312
stat3016/6016
phas0038
comp2140
6qqmb312
xjco3011
rest0005
ematm0051
5qqmn219
lubs5062m
eee8155
cege0100
eap033
artd1109
mat246
etc3430
ecmm462
mis102
inft6800
ddes9903
comp6521
comp9517
comp3331/9331
comp4337
comp6008
comp9414
bu.231.790.81
man00150m
csb352h
math1041
eengm4100
isys1002
08
6057cem
mktg3504
mthm036
mtrx1701
mth3241
eeee3086
cmp-7038b
cmp-7000a
ints4010
econ2151
infs5710
fins5516
fin3309
fins5510
gsoe9340
math2007
math2036
soee5010
mark3088
infs3605
elec9714
comp2271
ma214
comp2211
infs3604
600426
sit254
acct3091
bbt405
msin0116
com107/com113
mark5826
sit120
comp9021
eco2101
eeen40700
cs253
ece3114
ecmm447
chns3000
math377
itd102
comp9444
comp(2041|9044)
econ0060
econ7230
mgt001371
ecs-323
cs6250
mgdi60012
mdia2012
comm221001
comm5000
ma1008
engl642
econ241
com333
math367
mis201
nbs-7041x
meek16104
econ2003
comm1190
mbas902
comp-1027
dpst1091
comp7315
eppd1033
m06
ee3025
msci231
bb113/bbs1063
fc709
comp3425
comp9417
econ42915
cb9101
math1102e
chme0017
fc307
mkt60104
5522usst
litr1-uc6201.200
ee1102
cosc2803
math39512
omp9727
int2067/int5051
bsb151
mgt253
fc021
babs2202
mis2002s
phya21
18-213
cege0012
mdia1002
math38032
mech5125
07
cisc102
mgx3110
cs240
11175
fin3020s
eco3420
ictten622
comp9727
cpt111
de114102d
mgm320h5s
bafi1019
math21112
efim20036
mn-3503
fins5568
110.807
bcpm000028
info6030
bma0092
bcpm0054
math20212
ce335
cs365
cenv6141
ftec5580
math2010
ec3450
comm1170
ecmt1010
csci-ua.0480-003
econ12-200
ib3960
ectb60h3f
cs247—assignment
tk3163
ics3u
ib3j80
comp20008
comp9334
eppd1063
acct2343
cct109
isys1055/3412
math350-real
math2014
eec180
stat141b
econ2101
msinm014/msing014/msing014b
fit2004
comp643
bu1002
cm2030
联系我们
- QQ: 9951568
© 2021
www.rj363.com
软件定制开发网!