University of St Andrews
School of Computer Science
CS1003 — Programming with Data
P1— Text Processing
Deadline: 5 February 2021
Credits: 10% of coursework mark
MMS is the definitive source for deadline and credit details
You are expected to have read and understood all the information in this specification
and any accompanying documents at least a week before the deadline. You must contact
the lecturer regarding any queries well in advance of the deadline.
This practical involves reading data from a file and using basic text processing techniques to solve a
specified problem. You will need to decompose the problem into a number of methods as appropriate
and classes if necessary. You will also need to test your solution carefully and write a report.
Task
The task is to write a Java program to perform string similarity search among words stored in a text file.
The code you are going to write is similar to code that is found in spell-checkers. Your program should
accept two command line arguments: the first is the path of a text file (which contains a dictionary of
commonly used English words) and the second is a query string. The program should then read the
text in the file and split it into lines, where each line contains a single word. Then, calculate a similarity
score between the query word and each word read from the file. Finally, print the closest match from
the file (the word with the highest similarity score) to standard output, together with the similarity
score. Place your main method in a class called CS1003P1 (in the file CS1003P1.java). Some example runs are as follows.
Searching for the closest word to ‘strawberry’
> java CS1003P1 ../data/words_alpha.txt strawberry
Result: strawberry
Score: 1.0
Searching for the closest word to ‘stravberry’
> java CS1003P1 ../data/words_alpha.txt stravberry
Result: strawberry
Score: 0.6923077
Searching for the closest word to ‘ztravberry’
> java CS1003P1 ../data/words_alpha.txt ztravberry
Result: strawberry
Score: 0.46666667
String similarity
There are several ways of calculating a similarity score between strings. In this practical, we ask you to
use a Jaccard index on character bigrams. This might sound scary at first, but don’t worry! We will
now define what we mean and give an example.
Jaccard index
The Jaccard index is a similarity measure between sets of objects. It is calculated by dividing the
size of the intersection of the two sets by the size of the union of the same two sets. If the two sets
are very similar, the value of the Jaccard index will be close to 1 (if the two sets are identical it will
be exactly 1). On the other hand, if the two sets are very dissimilar, the value of the Jaccard index
will be close to 0 (if the two sets are disjoint it will be exactly 0). Try drawing a few simple Venn
diagrams to convince yourselves of this! Wikipedia has a good article on the Jaccard index as well:
https://en.wikipedia.org/wiki/Jaccard_index
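As an illustration of the definition above, the following is a minimal sketch of a Jaccard index calculation on two Java sets. The class name JaccardSketch, the method name jaccard, the float return type and the treatment of two empty sets are assumptions made for this example and are not part of the specification. It uses Set.of, so it assumes Java 9 or later.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardSketch {
    // Jaccard index = |A intersect B| / |A union B|.
    // Copies are made so that the caller's sets are not modified.
    static <T> float jaccard(Set<T> a, Set<T> b) {
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);          // keep only elements also in b

        Set<T> union = new HashSet<>(a);
        union.addAll(b);                    // add every element of b

        if (union.isEmpty()) {
            return 1.0f;                    // assumption: two empty sets count as identical
        }
        // Cast before dividing to avoid integer division.
        return (float) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> a = Set.of("x", "y", "z");
        Set<String> b = Set.of("x", "y", "w");
        // Intersection {x, y} has size 2, union {x, y, z, w} has size 4.
        System.out.println(jaccard(a, b)); // prints 0.5
    }
}
```

Identical sets give 1.0 and disjoint sets give 0.0, matching the two boundary cases described above.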
Character bigrams
A character bigram is a sequence of two consecutive characters in a string. Bigrams have applications
in several areas of text processing like linguistics, cryptography, speech recognition, and text search. In
this practical, we will calculate the Jaccard index on sets of bigrams for calculating a similarity score
between strings. For example, the set of bigrams for the string ‘cocoa’ is: ‘co’, ‘oc’, ‘oa’.
Notice that because we generate a set of bigrams, ‘co’ appears only once even though it occurs twice in the string.
Your program should contain a method to create a set of bigrams for a given string.
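One possible shape for such a method is sketched below. The names BigramSketch and bigrams are hypothetical, and the use of LinkedHashSet (to keep the bigrams in first-seen order when printing) is a choice made for this example only; any Set implementation would give the same set of bigrams.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class BigramSketch {
    // Collects every pair of consecutive characters in s into a set.
    // Strings shorter than two characters produce an empty set.
    static Set<String> bigrams(String s) {
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            result.add(s.substring(i, i + 2)); // duplicates are ignored by the set
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("cocoa")); // prints [co, oc, oa]
    }
}
```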
Top and tail
Adding special characters to the start and the end of a string before calculating the set of bigrams can
improve string similarity search. This is often done by adding a ‘^’ character to the beginning and a
‘$’ character to the end of the string. On the same example, ‘cocoa’, we first add the special characters
to either side and get ‘^cocoa$’. The set of bigrams becomes: ‘^c’, ‘co’, ‘oc’, ‘oa’, ‘a$’.
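The top-and-tail step can be sketched as a small helper applied before the bigrams are generated. The class and method names here (TopAndTailSketch, topAndTail, bigrams) are hypothetical, and the caret and dollar sign are used as the start and end markers.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class TopAndTailSketch {
    // Adds a start marker and an end marker around the string.
    static String topAndTail(String s) {
        return "^" + s + "$";
    }

    // Collects every pair of consecutive characters into a set.
    static Set<String> bigrams(String s) {
        Set<String> result = new LinkedHashSet<>(); // first-seen order, for readable output
        for (int i = 0; i + 1 < s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bigrams(topAndTail("cocoa"))); // prints [^c, co, oc, oa, a$]
    }
}
```

With the markers in place, words that share a first or last letter gain a common bigram, which is why the technique can help the similarity search.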
Suggested steps
• Download the text file words_alpha.txt [1] from StudRes and save it to a known location. You
should not submit this file as part of your submission.
• Create a Java class called CS1003P1 and write a program that is able to read the data stored in
this text file line by line. In order to check that this works, print each line to standard output.
See method from the class.
• Write a method to calculate character bigrams of a given string and store them in a set. See the
and classes and the add method that they implement. You may test your method
with the string ‘cocoa’; the output should match the given output above.
• Implement top-and-tail as described above and update your bigram calculation to use this functionality.
• Implement the Jaccard index calculation. Which two sets will you calculate the Jaccard index
on? We suggest that you use the retainAll method for implementing set intersection and the
addAll method for implementing set union. The size method returns the size of a set. If you
calculate the Jaccard index between the set {“1”,“2”,“3”} and the set {“1”,“2”,“4”} (where the
size of the intersection is 2 and the size of the union is 4) you should get 2/4 = 0.5 as the result.
• Combining character bigrams, top-and-tail and Jaccard index you now have a way of calculating
a similarity score between two strings. Use this to calculate the score between the query word
and each word from the file in a loop. Keep track of the best score (and the word that has the
best score!) for reporting at the end.
• We suggest that you print the best matching string and the corresponding similarity score as you
iterate through the dictionary during development. This can help you with testing your program.
[1] Source: https://github.com/dwyl/english-words
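The steps above can be combined roughly as follows. This is only a sketch under stated assumptions, not a reference solution: the class name ClosestWordSketch and the helper names are hypothetical, the file is read with Files.readAllLines rather than line by line, scores are held as float, and ties are resolved in favour of the word seen first.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ClosestWordSketch {
    // Bigrams of the word after adding the start and end markers.
    static Set<String> bigrams(String word) {
        String padded = "^" + word + "$";
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 1 < padded.length(); i++) {
            result.add(padded.substring(i, i + 2));
        }
        return result;
    }

    // Jaccard index on the two bigram sets. The union is never empty
    // because every padded string has at least one bigram.
    static float similarity(String a, String b) {
        Set<String> intersection = bigrams(a);
        intersection.retainAll(bigrams(b));
        Set<String> union = bigrams(a);
        union.addAll(bigrams(b));
        return (float) intersection.size() / union.size();
    }

    // Returns the word with the highest score, or null for an empty list.
    // Assumption: on a tie, the earlier word is kept.
    static String closest(List<String> words, String query) {
        String best = null;
        float bestScore = -1.0f;
        for (String word : words) {
            float score = similarity(query, word);
            if (score > bestScore) {
                bestScore = score;
                best = word;
            }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("Usage: java ClosestWordSketch <dictionary file> <query>");
            return;
        }
        List<String> words = Files.readAllLines(Paths.get(args[0]));
        String best = closest(words, args[1]);
        if (best == null) {
            System.out.println("The dictionary file contains no words.");
            return;
        }
        System.out.println("Result: " + best);
        System.out.println("Score: " + similarity(args[1], best));
    }
}
```

As a check on the arithmetic, ‘stravberry’ and ‘strawberry’ each have 11 bigrams after top-and-tail, 9 of which are shared, so the union has 13 elements and the score is 9/13, approximately 0.6923.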
Auto-checker and Testing
This assignment makes use of the School’s automated checker ‘stacscheck’. You should therefore ensure
that your program can be tested using the auto-checker. It should help you see how well your program
performs on the tests we have made public and will hopefully give you an insight into any issues prior to
submission. The automated checking system is simple to run from the command line in your CS1003-P1
directory:
Make sure to type the command exactly – occasionally copying and pasting from the PDF specification
will not work correctly. If you are struggling to get it working, ask a demonstrator.
The automated checking system will only check the basic operation of your program. It is up to you
to provide evidence that you have thoroughly tested your program.
Submission
Report
Your report must be structured as follows:
• Overview: Give a short overview of the practical: what were you asked to do, and what did you
achieve? Clearly list which parts you have completed, and to what extent.
• Design: Describe the design of your program. Justify the decisions you made. In particular,
describe the classes you chose, the methods they contain, a brief explanation of why you designed
your solution in the way that you did, and any interesting features of your Java implementation.
• Testing: Describe how you tested your program. In particular, describe how you designed different
tests. Your report should include the output from a number of test runs to demonstrate that
your program satisfies the specification. Please note that simply reporting the result of stacscheck
is not enough; you should do further testing and explain in the report how you convinced yourself
that your program works correctly.
• Evaluation: Evaluate the success of your program against what you were asked to do.
• Conclusion: Conclude by summarising what you achieved, what you found difficult, and what
you would like to do given more time.
Don’t forget to add a header including your matriculation number, the name of your tutor and the
date.
Upload
Package up your CS1003-P1 folder and a PDF copy of your report into a zip file as in previous weeks,
and submit it using MMS, in the slot for Practical P1. After doing this, it is important to verify
that you have uploaded your submission correctly by downloading it from MMS. You
should also double check that you have uploaded your work to the correct slot. You can
then run stacscheck directly on your zip file to make sure that your code still passes stacscheck. For
example, if your file is called, save it to your Downloads directory and run:
Rubric
Marking
1-6 Very little evidence of work, software which does not compile or run, or crashes
before doing any useful work. You should seek help from your tutor immediately.
7-10 An acceptable attempt to complete the main task with serious problems such as
not compiling, or crashing often during execution.
11-13 A competent attempt to complete the main task. Serious weaknesses such as using
wrong data types, poor code design, weak testing, or a weak report riddled with
mistakes.
14-16 A good attempt to complete the main task together with good code design, testing
and a report.
17-18 Evidence of an excellent submission with no serious defects, good testing, accompanied
by an excellent report.
19-20 An exceptional submission. A correct implementation of the main task, extensive
testing, accompanied by an excellent report. In addition it goes beyond the basic
specification in a way that demonstrates use of concepts covered in class and other
concepts discovered through self-learning.
See also the standard mark descriptors in the School Student Handbook:
http://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/feedback.html#Mark_Descriptors
Lateness penalty
The standard penalty for late submission applies (Scheme B: 1 mark per 8 hour period, or part
thereof):
http://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/assessment.html#lateness-penalties
Good academic practice
The University policy on Good Academic Practice applies:
https://www.st-andrews.ac.uk/students/rules/academicpractice/
Going Further
Here are some additional questions and pointers for the interested student.
• Character bigrams are a special case. If you are interested, look into character n-grams, which are
sequences of n consecutive characters, where n is not necessarily 2 (as in bigrams).
• N-grams can be constructed at the word level instead of at the character level. Look into applications
of word-level n-grams and think about the use cases of character-level vs word-level
n-grams.
• There are many other string similarity methods! See https://en.wikipedia.org/wiki/String_metric
as a starting point.
• Think about use cases for string similarity methods.
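For the first pointer above, a character n-gram generator could look something like the sketch below, which generalises the bigram idea to any window length. The names NGramSketch and ngrams are hypothetical, and rejecting n below 1 with an exception is an assumption made for this example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class NGramSketch {
    // Collects every run of n consecutive characters in s into a set.
    // With n = 2 this produces character bigrams.
    static Set<String> ngrams(String s, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        Set<String> result = new LinkedHashSet<>(); // first-seen order, for readable output
        for (int i = 0; i + n <= s.length(); i++) {
            result.add(s.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("cocoa", 2)); // prints [co, oc, oa]
        System.out.println(ngrams("cocoa", 3)); // prints [coc, oco, coa]
    }
}
```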