COMP20008 Elements of Data Processing
Assignment 1
March 3, 2021
Due date
The assignment is worth 20 marks (20% of the subject grade) and is due at 8:00am on Thursday
1st April 2021, Australia/Melbourne time.
Background
Learning outcomes
The learning objectives of this assignment are to:
• Gain practical experience in written communication skills for documenting data
science projects.
• Practice a selection of processing and exploratory analysis techniques through visualisation.
• Practice text processing techniques using Python.
• Practice widely used Python libraries and gain experience in consulting additional
documentation from Web resources.
Your tasks
There are three parts in this assignment, Part A, Part B, and Part C. Part A and Part B are
worth 9 marks each and Part C is worth 2 marks.
Getting started
Before starting the assignment you must do the following:
• Create a GitHub account at https://www.github.com if you don’t already have one.
• Visit https://classroom.github.com/a/FSvGXkWI and accept the assignment. This
will create your personal assignment repository on GitHub.
• Clone your assignment repository to your local machine. The repository contains important
files that you will need in order to complete the assignment.
COMP20008 2021 SM1
Part A (Total 9 marks)
For Part A, download the complete “Our World in Data COVID-19 dataset” (“owid-covid-data”)
from https://covid.ourworldindata.org/data/owid-covid-data.csv.
Part A Task 1 Data pre-processing (3 marks)
Write a Python program that produces a dataframe by
1. (2 marks) aggregating the values of the following four variables:
total_cases
new_cases
total_deaths
new_deaths
by month and location in the year 2020.
The dataframe should contain the following columns after completion of this sub-task:
location
month
total_cases
new_cases
total_deaths
new_deaths
Note: if there are no entries for certain combinations of locations and months, there
should be no entry for those combinations in the dataframe.
2. (1 mark) adding a new variable, case_fatality_rate, to the dataframe produced in
sub-task 1. The variable, case_fatality_rate, is defined as the number of deaths per
confirmed case in a given period. Do not impute missing values.
The final dataframe should contain the columns in the following order:
location
month
case_fatality_rate
total_cases
new_cases
total_deaths
new_deaths
and the rows are to be sorted by location and month in ascending order.
Print the first 5 rows of the final dataframe to the standard output.
Save the new dataframe to a CSV file named “owid-covid-data-2020-monthly.csv” in
the same directory as the Python program. Your program should be called from the command
line as follows:
python parta1.py owid-covid-data-2020-monthly.csv
Hint: You will need to choose appropriate aggregation functions based on your understanding
of the variables.
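The aggregation described above can be sketched with pandas roughly as follows. This is only an illustration under stated assumptions, not the expected solution: it assumes the CSV columns are named `location`, `date`, `total_cases`, `new_cases`, `total_deaths` and `new_deaths`, that the cumulative totals are aggregated with the monthly maximum and the daily counts with the monthly sum, and that the case fatality rate is taken as monthly new deaths divided by monthly new cases. The function name `monthly_summary` and the sample rows are made up for the example.

```python
import pandas as pd

COLUMN_ORDER = ["location", "month", "case_fatality_rate",
                "total_cases", "new_cases", "total_deaths", "new_deaths"]


def sum_keep_missing(series):
    # min_count=1 keeps an all-missing month as NaN instead of turning it into 0.
    return series.sum(min_count=1)


def monthly_summary(raw):
    """Aggregate daily rows to one row per (location, month) for 2020."""
    df = raw.assign(date=pd.to_datetime(raw["date"]))
    df = df[df["date"].dt.year == 2020].copy()
    df["month"] = df["date"].dt.month
    monthly = df.groupby(["location", "month"], as_index=False).agg(
        total_cases=("total_cases", "max"),        # assumption: cumulative, so take the peak
        new_cases=("new_cases", sum_keep_missing),
        total_deaths=("total_deaths", "max"),      # assumption: cumulative, so take the peak
        new_deaths=("new_deaths", sum_keep_missing),
    )
    # Assumption: deaths per confirmed case within the month.
    monthly["case_fatality_rate"] = monthly["new_deaths"] / monthly["new_cases"]
    monthly = monthly.sort_values(["location", "month"]).reset_index(drop=True)
    return monthly[COLUMN_ORDER]


# Made-up rows, only to exercise the function.
sample = pd.DataFrame({
    "location": ["A", "A", "A", "B"],
    "date": ["2020-03-30", "2020-03-31", "2020-04-01", "2021-01-01"],
    "total_cases": [10, 30, 35, 5],
    "new_cases": [10, 20, 5, 5],
    "total_deaths": [1, 3, 3, 0],
    "new_deaths": [1, 2, 0, 0],
})
result = monthly_summary(sample)
print(result.head(5))
```

Location/month combinations with no rows never appear in the `groupby` output, which matches the note in sub-task 1. Other aggregation choices (for example taking the last value of a cumulative column) may be equally defensible.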
Part A Task 2 Visualisation (2 marks)
Write a Python program that produces two scatter plots:
1. (1 mark) a scatter plot of case fatality rate (on the y-axis) and confirmed new cases (on
the x-axis) by location in the year 2020.
Output the plot to scatter-a.png in the same directory as the python program.
2. (1 mark) a second scatter plot of the same data with only one change: the x-axis is
changed to a log-scale.
Output the plot to scatter-b.png in the same directory as the python program. For
this plot, apply preprocessing if necessary.
Your program should be called from the command line as follows:
python parta2.py scatter-a.png scatter-b.png
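A minimal matplotlib sketch of the two plots is shown below. It assumes the plotted values are already available as two lists, and that the preprocessing for the log-scale plot consists of dropping points whose x value is not positive (a log axis cannot display them). The helper name `save_scatter` and the numbers are invented for illustration, and the files are written to a temporary directory rather than next to the program.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render straight to files, no display needed
import matplotlib.pyplot as plt


def save_scatter(new_cases, fatality_rates, path, log_x=False):
    """Save a scatter plot and return how many points were drawn."""
    points = list(zip(new_cases, fatality_rates))
    if log_x:
        # Assumption: zero or negative case counts are dropped for the log axis.
        points = [(x, y) for x, y in points if x > 0]
    fig, ax = plt.subplots()
    ax.scatter([x for x, _ in points], [y for _, y in points])
    if log_x:
        ax.set_xscale("log")
    ax.set_xlabel("confirmed new cases" + (" (log scale)" if log_x else ""))
    ax.set_ylabel("case fatality rate")
    fig.savefig(path)
    plt.close(fig)
    return len(points)


# Made-up values, one point per location.
new_cases = [0, 120, 4500, 98000]
fatality_rates = [0.0, 0.05, 0.02, 0.03]

out_dir = tempfile.mkdtemp()
path_a = os.path.join(out_dir, "scatter-a.png")
path_b = os.path.join(out_dir, "scatter-b.png")
plotted_a = save_scatter(new_cases, fatality_rates, path_a)
plotted_b = save_scatter(new_cases, fatality_rates, path_b, log_x=True)
```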
Part A Task 3 Discussion and visual analysis (4 marks)
Write a short report on your visual analysis of the two plots produced in Task 2.
It is expected that the visual analysis would include:
1. (1.5 marks) a brief introduction/description of the raw data, including the source, any
limitations you observe in the data and all preprocessing steps taken on the raw data
to produce the visualisations,
2. (1.5 marks) explanation of the plots and patterns observed, and
3. (1 mark) a discussion contrasting the two scatter plots.
The report is to be 500 - 600 (maximum) words excluding figures, about 1 page, in pdf
format, and must include the two plots, scatter-a.png and scatter-b.png, produced
from Part A Task 2.
The filename of the report must be “owid-covid-2020-visual-analysis.pdf”.
Part B (Total 9 marks)
For Part B, download the cricket dataset from the LMS. This dataset contains a sample of
cricket-related articles from BBC News. We wish to build a search engine that will allow a
user to specify keywords and find all articles related to those keywords.
Part B Task 1: Regular Expressions (1 mark)
Each article contains a document ID which uniquely identifies the document. This document
ID consists of four letters followed by a hyphen, followed by three digits and optionally
ending in a letter. For example, each of the following is a valid document ID:
ABCD-123
ABCD-123V
XKCD-999A
COMP-200
The document IDs are not located in a consistent place in each article. Use a regular expression
to identify the document ID for each document in the dataset. Write a Python program
in partb1.py that produces a CSV file called partb1.csv containing the filenames and Document
IDs for each document in the dataset. Your CSV file should contain the following
columns in the order below:
filename
documentID
Your program should be called from the command line along with the name of the CSV file:
python partb1.py partb1.csv
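One possible regular expression for this ID format is sketched below. It assumes the letters are always uppercase, as in the examples, and uses word boundaries so that longer strings such as ABCDE-123 or ABCD-1234 are not accepted. The file names and article text are invented, and the CSV is written to an in-memory buffer purely to keep the example self-contained.

```python
import csv
import io
import re

# Four uppercase letters, a hyphen, three digits, then an optional uppercase letter.
DOC_ID_PATTERN = re.compile(r"\b[A-Z]{4}-\d{3}[A-Z]?\b")


def find_document_id(text):
    """Return the first document ID found in the text, or None."""
    match = DOC_ID_PATTERN.search(text)
    return match.group(0) if match else None


# Hypothetical article contents keyed by filename.
articles = {
    "cricket001.txt": "England win the toss. Ref: ABCD-123V. Play starts at noon.",
    "cricket002.txt": "XKCD-999A\nRain delays the second day.",
}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["filename", "documentID"])
for filename in sorted(articles):
    writer.writerow([filename, find_document_id(articles[filename])])
csv_text = buffer.getvalue()
```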
Part B Task 2: Preprocessing (1 mark)
We now wish to perform the following preprocessing on each article in the cricket folder in
order to make them easier to search:
• Remove all non-alphabetic characters (for example, numbers and punctuation characters),
except for spacing characters such as whitespaces, tabs and newlines.
• Convert all spacing characters such as tabs and newlines to whitespace and ensure that
only one whitespace character exists between each word.
• Change all uppercase characters to lower case.
Create a Python program in partb2.py that performs this preprocessing.
Your program should be called from the command line along with the filename of a document.
For example:
python partb2.py cricket001.txt
Your program should then load the specified file, perform the preprocessing steps above
and print the results to standard output.
Hint: You may wish to create a function for this preprocessing, as you will need to
perform it as part of each task in Part B.
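The three preprocessing steps can be sketched as a small function like the one below. The name `preprocess` is an assumption, as is the decision to strip leading and trailing whitespace and to treat only the letters A-Z and a-z as alphabetic. Note that removed characters are deleted rather than replaced by a space, so a word such as "don't" becomes "dont".

```python
import re


def preprocess(text):
    """Keep letters and whitespace only, collapse whitespace, lowercase."""
    letters_only = re.sub(r"[^A-Za-z\s]", "", text)    # step 1: drop digits, punctuation
    single_spaced = re.sub(r"\s+", " ", letters_only)  # step 2: tabs/newlines to one space
    return single_spaced.lower().strip()               # step 3: lowercase (strip is an assumption)


cleaned = preprocess("Hello,\tWorld!\n2021  Ashes")
print(cleaned)
```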
Part B Task 3: Basic Search (2 marks)
Create a Python program in partb3.py that will allow the user to search for articles containing
particular keywords. Your program should be called from the command line along
with the keywords being searched for. For example:
python partb3.py keyword1 keyword2 keyword3
You can assume each keyword will be separated by a whitespace character and that
between 1 and 5 keywords will be entered. Your program should then return the document
IDs of the documents that contain all of the keywords in the user’s search query. For this
task:
• You should check for matches after performing the preprocessing in Task 2. For example,
searching for the word 'old' should return articles containing the words 'Old' or 'OLD'.
• The keywords that the user searches for are separate keywords. You are not required to
match exact phrases. For example, if a user searches for the keywords 'captain early',
these words do not need to appear consecutively in the document to constitute a match.
• Only documents that contain the actual keyword should return a match. For example,
searching for the word 'old' should not return articles containing the word 'golden'.
Your program should output the document IDs of each article containing all of the specified
keywords.
Hint: You may wish to load partb1.csv back into your program
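The matching rules above amount to whole-word membership tests on the preprocessed text. A minimal sketch, assuming a `preprocess` helper along the lines of Task 2 (repeated here so the example runs on its own) and a hypothetical function name `contains_all_keywords`:

```python
import re


def preprocess(text):
    """Task 2 style cleanup: letters only, single spaces, lowercase."""
    text = re.sub(r"[^A-Za-z\s]", "", text)
    return re.sub(r"\s+", " ", text).lower().strip()


def contains_all_keywords(text, keywords):
    """True if every keyword appears as a whole word in the cleaned text."""
    words = set(preprocess(text).split())
    return all(preprocess(keyword) in words for keyword in keywords)


# Invented article text for illustration.
article = "The OLD captain arrived early at the ground."
```

Splitting into a set of words, rather than searching for substrings, is what prevents 'old' from matching 'golden'.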
Part B Task 4: Advanced Search (2 marks)
We now wish to expand the search feature to enable inexact matching. For example, a
user should be able to specify the keyword ’missing’ and the search should also return articles
containing the related words ’missed’ or ’miss’. Create a Python program in partb4.py based
on your response to Task 3 that uses a Porter Stemmer to enable this inexact matching. Your
program should be called from the command line along with the keywords being searched for.
For example:
python partb4.py keyword1 keyword2 keyword3
Your program should output the document IDs of each article containing all of the specified
keywords, or words considered by the Porter Stemmer to have the same base. For this task:
• You should check for matches after performing the preprocessing in Task 2. For example,
searching for the word 'old' should return articles containing the words 'Old' or 'OLD'.
• The keywords that the user searches for are separate keywords. You are not required to
match exact phrases. For example, if a user searches for the keywords 'captain early',
these words do not need to appear consecutively in the document to constitute a match.
• Other than inexact matches permitted by the Porter Stemmer, only documents that
contain the actual keyword should return a match. For example, searching for the word
'old' should not return articles containing the word 'golden'.
Note that other than the final point this list of requirements is the same as for Task 3.
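Stemmed matching can be sketched by comparing stems instead of raw words. The example below assumes the NLTK library is installed and uses its `PorterStemmer` class; the assignment does not say which Porter Stemmer implementation is expected, so treat this choice, and the helper names, as assumptions.

```python
import re

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


def preprocess(text):
    """Task 2 style cleanup: letters only, single spaces, lowercase."""
    text = re.sub(r"[^A-Za-z\s]", "", text)
    return re.sub(r"\s+", " ", text).lower().strip()


def stemmed_words(text):
    """Set of Porter stems of the words in the cleaned text."""
    return {stemmer.stem(word) for word in preprocess(text).split()}


def contains_all_stems(text, keywords):
    """True if the stem of every keyword occurs among the document's stems."""
    stems = stemmed_words(text)
    return all(stemmer.stem(preprocess(keyword)) in stems for keyword in keywords)
```

With this approach the keyword 'missing' and the document word 'missed' are both reduced to the same stem before comparison, while 'old' and 'golden' remain distinct.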
Part B Task 5: Search Rankings (3 marks)
We wish to further expand the search feature to enable documents to be ranked, so that
those most relevant to the user’s keywords are displayed at the top of the list. One way
of computing such a ranking is to use TF-IDF along with the cosine similarity measure as
discussed in lectures. Create a Python program in partb5.py based on your response to
Task 4 that ranks articles returned by Task 4 by cosine similarity score.
Your program should be called from the command line along with the keywords being
searched for. For example:
python partb5.py keyword1 keyword2 keyword3
Your program should output:
• The headings 'documentID' and 'score'
• The document IDs of each article containing all of the specified keywords, or words
considered by the Porter Stemmer to have the same base.
• The cosine similarity score between the vector of stemmed keywords and the vector of
stemmed words appearing in the document for each document matched, rounded to
four decimal places.
You should assume that the collection being used by TF-IDF is the complete list of stemmed
words contained in articles returned by your Task 4 search. The output should be sorted in
descending order by cosine similarity score with the search query. For example, one sample
output might look like this:
documentID score
JDKC-105M 0.0618
BTAR-174V 0.0182
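The ranking step can be sketched in plain Python as below. The exact TF-IDF weighting taught in lectures is not stated here, so the formula used (raw term counts multiplied by a smoothed inverse document frequency, with the query represented by its raw term counts) is an assumption, as are the function name, the document IDs and the token lists.

```python
import math
from collections import Counter


def rank_by_cosine(docs, query_terms):
    """docs maps document ID to its list of stemmed tokens.

    Returns (document ID, score) pairs sorted by descending cosine similarity.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    query_vec = Counter(query_terms)  # assumption: query weighted by raw counts
    query_norm = math.sqrt(sum(v * v for v in query_vec.values()))

    scores = []
    for doc_id, tokens in docs.items():
        doc_vec = {term: count * idf[term] for term, count in Counter(tokens).items()}
        doc_norm = math.sqrt(sum(v * v for v in doc_vec.values()))
        dot = sum(weight * doc_vec.get(term, 0.0) for term, weight in query_vec.items())
        score = dot / (query_norm * doc_norm) if query_norm and doc_norm else 0.0
        scores.append((doc_id, round(score, 4)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Invented, already-stemmed documents.
docs = {
    "AAAA-001": ["bat", "ball", "bat"],
    "BBBB-002": ["bat", "ball", "wicket", "umpire", "crowd"],
}
ranked = rank_by_cosine(docs, ["bat"])
print("documentID score")
for doc_id, score in ranked:
    print(doc_id, f"{score:.4f}")
```

Only the documents returned by the Task 4 search are passed in, which is what makes them the collection used for the IDF term.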
Part C (Total 2 marks)
GitHub Submission
Ensure all of your completed code files as well as your report have been pushed to the GitHub
repository you created in the 'Getting started' section. We strongly encourage you to push an
updated version of your code to your GitHub repository each time you make a major change.
Your repository must also contain a README file, which must contain your name and student
ID. It must also contain a brief description of your project and a list of dependencies.
Submission Instructions
Submit all Python scripts and the PDF discussion report via the LMS. A complete submission
includes the following items:
1. parta1.py
2. parta2.py
3. owid-covid-2020-visual-analysis.pdf
4. partb1.py
5. partb2.py
6. partb3.py
7. partb4.py
8. partb5.py
9. A link to your GitHub repository
You must also have pushed the above files to your github repository, which the teaching staff
already have access to.
Extensions and late submission penalties
If requesting an extension due to illness, please submit a medical certificate to the lecturer.
If there are any other exceptional circumstances, please contact the lecturer with plenty of
notice. Late submissions without an approved extension will attract the following penalties:
0 < hourslate <= 24: (2 marks deduction)
24 < hourslate <= 48: (4 marks deduction)
48 < hourslate <= 72: (6 marks deduction)
72 < hourslate <= 96: (8 marks deduction)
96 < hourslate <= 120: (10 marks deduction)
120 < hourslate <= 144: (12 marks deduction)
144 < hourslate: (20 marks deduction)
where hourslate is the elapsed time in hours (or fractions of hours).
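Read as a formula, the schedule above deducts 2 marks for every started 24-hour period up to 144 hours, and all 20 marks beyond that. A small illustrative sketch (the function name `late_penalty` is made up):

```python
import math


def late_penalty(hours_late):
    """Marks deducted for a submission that is hours_late hours overdue."""
    if hours_late <= 0:
        return 0            # on time: no deduction
    if hours_late > 144:
        return 20           # more than six days late: all marks
    return 2 * math.ceil(hours_late / 24)  # 2 marks per started 24-hour period


print(late_penalty(0.5), late_penalty(24), late_penalty(24.5), late_penalty(150))
```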
This project is expected to require 15-20 hours of work.
Academic honesty
You are expected to follow the academic honesty guidelines on the University website
https://academichonesty.unimelb.edu.au
Further information
A project discussion forum has also been created on the Ed forum. Please use this in the
first instance if you have questions, since it will allow discussion and responses to be seen by
everyone. There will also be a list of frequently asked questions on the project page.