COMP9414: Artificial Intelligence
Assignment 2: Topic Classification
Due Date: Week 9, Friday, July 30, 11:59 p.m.
Value: 25%
This assignment is inspired by a typical real-life scenario. Imagine you have been hired as a Data
Scientist by a major news organization. Your job is to analyse the news feed to determine the
topic of incoming news articles so they can be organized and distributed to your readers.
For this assignment, you will be given a collection of BBC news articles and also summaries
of the same articles. The articles have been manually labelled as one of five topics: business,
entertainment, politics, sport and tech. Important: Do not distribute these news articles
on the Internet, as this breaches BBC copyright.
You are expected to assess various supervised machine learning methods using a variety of features and settings to determine what methods work best for topic classification in this domain.
The assignment has two components: programming to produce a collection of models for topic
classification, and a report to evaluate the effectiveness of the models. The programming part
involves development of Python code for data preprocessing of articles and experimentation of
methods using NLP and machine learning toolkits. The report involves evaluating and comparing
the models using various metrics.
You will use the NLTK toolkit for basic language preprocessing, and scikit-learn for feature con-
struction and evaluating the machine learning models. You will be given an example of how to use
NLTK and scikit-learn to define the machine learning methods (example.py), and an example of
how to plot metrics in a graph (plot.py).
Data and Methods
A training dataset is a .tsv (tab separated values) file containing a number of articles, with one
article per line, and linebreaks within articles removed. Each line of the .tsv file has three fields:
instance number, text and topic (business, entertainment, politics, sport, tech).
A test dataset is a .tsv file in the same format as the training dataset except that your code should
ignore the topic field. Training and test datasets can be drawn from supplied files articles.tsv
or summaries.tsv (see below).
For all models, consider an article to be a collection of words, where a word is a string of at least two letters, numbers or the symbols #, @, _, $ or %, delimited by a space, after removing all other characters (two characters is the default minimum word length for CountVectorizer in scikit-learn). URLs should be treated as a space, so they delimit words. Note that deleting “junk” characters may create longer words from pieces that were previously separated by those characters.
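A minimal preprocessing sketch of one way to implement this word definition with CountVectorizer; the URL pattern and the exact regular expressions are our own assumptions, not a prescribed implementation:

import re
from sklearn.feature_extraction.text import CountVectorizer

def preprocess(text):
    # Replace URLs with a space so that they delimit words (pattern is an assumption)
    text = re.sub(r'https?://\S+', ' ', text)
    # Remove every character other than letters, digits, #, @, _, $, % and space
    return re.sub(r'[^A-Za-z0-9#@_$% ]', '', text)

# A word is a string of at least two of the allowed characters; keep the
# original case, since the standard models must not lower-case the text
vectorizer = CountVectorizer(preprocessor=preprocess,
                             token_pattern=r'[A-Za-z0-9#@_$%]{2,}',
                             lowercase=False)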
Use the supervised learning methods discussed in the lectures: Decision Trees (DT), Bernoulli
Naive Bayes (BNB) and Multinomial Naive Bayes (MNB). Do not code these methods: instead
use the implementations from scikit-learn. Read the scikit-learn documentation on Decision Trees1
and Naive Bayes,2 and the linked pages describing the parameters of the methods.
1 https://scikit-learn.org/stable/modules/tree.html
2 https://scikit-learn.org/stable/modules/naive_bayes.html
Look at example.py to see how to use CountVectorizer and train and test the machine learning
algorithms, including how to generate metrics for the models developed, and plot.py to see how
to plot these metrics on a graph for inclusion in your report.
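In the same spirit as plot.py (whose exact contents are not reproduced here), a metrics plot can be produced with matplotlib; the values below are placeholders to be replaced with your measured scores:

import matplotlib.pyplot as plt

models = ['DT', 'BNB', 'MNB']
macro_f1 = [0.0, 0.0, 0.0]  # placeholders: fill in your own results

plt.bar(models, macro_f1)
plt.ylabel('Macro-averaged F1')
plt.title('Model comparison on the test set')
plt.savefig('metrics.png')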
The programming part of the assignment is to produce DT, BNB and MNB models and your own
model for topic classification in Python programs that can be called from the command line to train
and classify articles read from correctly formatted .tsv files. The report part of the assignment
is to analyse these models using a variety of parameters, preprocessing tools and scenarios.
Programming
You will submit four Python programs: (i) DT_classifier.py, (ii) BNB_classifier.py, (iii) MNB_classifier.py and (iv) my_classifier.py. The first three of these are standard models as
defined below. The last is a model that you develop following experimentation with the data. Use
the given datasets (articles.tsv and summaries.tsv) containing 1000 labelled articles and their
summaries to develop and test the models, as described below.
These programs, when called from the command line with two file names as arguments – the first a training dataset and the second a test dataset (i.e. not hard-coded as training.tsv and test.tsv) – should print, to standard output (not a hard-coded file output.txt), the instance number and topic produced by the classifier for each article in the test set when trained on the training set (one per line with a space between the instance number and topic) – each topic being the string “business”, “entertainment”, “politics”, “sport” or “tech”. For example:
python3 DT_classifier.py training.tsv test.tsv > output.txt
should write to the file output.txt the instance number and topic of each article in test.tsv, as
determined by the Decision Tree classifier trained on training.tsv.
When reading in training and test datasets, make sure your code reads all the instances (some Python readers default to the “excel” dialect, which treats double quotes as quoting characters rather than as literal text, and so may not read every instance).
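A minimal command-line skeleton along these lines, assuming pandas is used for reading (the column names and the predictions variable are our own); quoting=csv.QUOTE_NONE makes double quotes read as literal text:

import sys
import csv
import pandas as pd

# Read the file names from the command line - never hard-code them
train_file, test_file = sys.argv[1], sys.argv[2]
cols = ['instance', 'text', 'topic']
train = pd.read_csv(train_file, sep='\t', header=None, names=cols,
                    quoting=csv.QUOTE_NONE)
test = pd.read_csv(test_file, sep='\t', header=None, names=cols,
                   quoting=csv.QUOTE_NONE)

# ... fit a vectorizer and classifier on train['text'] and train['topic'],
# then predict topics for test['text'] (ignoring test['topic']) ...

# Print one "instance topic" pair per line to standard output, e.g.:
# for inst, topic in zip(test['instance'], predictions):
#     print(inst, topic)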
Standard Models
You will develop three standard models. For all models, make sure that scikit-learn does not
convert the text to lower case. For Decision Trees, use scikit-learn’s Decision Tree method with
criterion set to ’entropy’ and with random_state=0. Scikit-learn’s Decision Tree method does not implement pruning; instead, you should ensure that Decision Tree construction stops when a node
covers fewer than 1% of the training set. Decision Trees are prone to fragmentation, so to avoid
overfitting and reduce computation time, for the Decision Tree models use as features only the
1000 most frequent words from the vocabulary, after preprocessing to remove “junk” characters
as described above. Write code to train and test a Decision Tree model in DT_classifier.py.
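A minimal sketch of this standard Decision Tree model, continuing from the reading sketch above (train is the training DataFrame); interpreting the 1% stopping rule as min_samples_split is our assumption about the intended behaviour, so verify it against the specification:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

vec = CountVectorizer(max_features=1000,  # 1000 most frequent words only
                      lowercase=False)    # do not convert to lower case
X_train = vec.fit_transform(train['text'])

# With a float, min_samples_split is a fraction of the training set:
# nodes covering fewer than ceil(0.01 * n_samples) instances are not split
dt = DecisionTreeClassifier(criterion='entropy', random_state=0,
                            min_samples_split=0.01)
dt.fit(X_train, train['topic'])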
For both BNB and MNB, use scikit-learn’s implementations, but use all of the words in the
vocabulary as features. Write two Python programs for training and testing Naive Bayes models, one a BNB model and one an MNB model, in BNB_classifier.py and MNB_classifier.py.
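A minimal sketch of the two Naive Bayes models, again continuing from the reading sketch; no max_features limit is set, so the whole vocabulary is used:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

vec = CountVectorizer(lowercase=False)  # whole vocabulary, original case
X_train = vec.fit_transform(train['text'])

bnb = BernoulliNB().fit(X_train, train['topic'])    # binary word presence
mnb = MultinomialNB().fit(X_train, train['topic'])  # word counts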
Your Model
Develop your best model for topic classification by either varying the number and type of input
features for the learners, the parameters of the learners, and the training/test set split, or by using
another method from scikit-learn. Submit one program, my_classifier.py, that trains and tests
a model in the same way as for the standard models. Conduct new experiments to analyse your
model and present results that justify your choice of this model in the report.
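As one illustration only (not a recommendation – your choice must be justified by your own experiments), a different scikit-learn method could be tried as follows, reusing train and test from the reading sketch; this pairs tf-idf weighting with a linear support vector machine:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(lowercase=False),
                      LinearSVC(random_state=0))
model.fit(train['text'], train['topic'])
predictions = model.predict(test['text'])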
Report
In the report, you will first evaluate the standard models, then present your own model. For
questions 1–4 below, consider two scenarios:
(1) with the full articles in articles.tsv for training and testing, and
(2) with the summaries in summaries.tsv for training and testing.
For evaluating all models, report the results of training on the first 800 instances in the dataset
(the “training set”) and testing on the remaining 200 instances (the “test set”), rather than using
the full datasets of 1000 instances for training – so the 1% stopping rule for Decision Trees applies when nodes cover fewer than 8 instances rather than 10.
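A minimal sketch of this fixed split – the first 800 instances in file order for training and the remaining 200 for testing, with no shuffling:

import csv
import pandas as pd

data = pd.read_csv('articles.tsv', sep='\t', header=None,
                   names=['instance', 'text', 'topic'],
                   quoting=csv.QUOTE_NONE)
train_set = data.iloc[:800]   # the “training set”
test_set = data.iloc[800:]    # the “test set”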
Use the metrics (micro- and macro-accuracy, precision, recall and F1) and classification reports
from scikit-learn. Show the results in Python plots (do not take screenshots of sklearn classification
reports), and write a short response to each question below. The answer to each question should be self-contained. Your report should be at most 10 pages. Do not include appendices.
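A minimal sketch of generating these metrics, assuming y_test and y_pred hold the true and predicted topics for the 200 test instances:

from sklearn.metrics import classification_report, precision_recall_fscore_support

print(classification_report(y_test, y_pred))  # per-class precision/recall/F1

for avg in ('micro', 'macro'):
    p, r, f1, _ = precision_recall_fscore_support(y_test, y_pred, average=avg)
    print(avg, 'precision=%.3f recall=%.3f F1=%.3f' % (p, r, f1))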
1. (3 marks) Develop Decision Tree models for training and testing: (a) with the 1% stopping
criterion (the standard model), and (b) without the 1% stopping criterion.
(i) Show all metrics on the test set for scenario 1 comparing the two models (a) and (b), and
explain any similarities and differences.
(ii) Show all metrics on the test set for scenario 2 comparing the two models (a) and (b), and
explain any similarities and differences.
(iii) Explain any differences in the results between scenarios 1 and 2.
2. (3 marks) Develop BNB and MNB models from the training set using: (a) the whole vocabulary
(standard models), and (b) the most frequent 1000 words from the vocabulary, as defined using
sklearn’s CountVectorizer, after preprocessing by removing “junk” characters.
(i) Show all metrics on the test set for scenario 1 comparing the corresponding models (a) and
(b), and explain any similarities and differences.
(ii) Show all metrics on the test set for scenario 2 comparing the corresponding models (a) and
(b), and explain any similarities and differences.
(iii) Explain any differences in the results between scenarios 1 and 2.
3. (3 marks) Evaluate the effect of preprocessing for the three standard models by comparing models developed with: (a) only the preprocessing described above (standard models), and (b) applying, in addition, Porter stemming using NLTK then English stop word removal using sklearn’s CountVectorizer (one way of combining these is sketched after this question).
(i) Show all metrics on the test set for scenario 1 comparing the corresponding models (a) and
(b), and explain any similarities and differences.
(ii) Show all metrics on the test set for scenario 2 comparing the corresponding models (a) and
(b), and explain any similarities and differences.
(iii) Explain any differences in the results between scenarios 1 and 2.
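A minimal sketch for part (b) of question 3, assuming a custom tokenizer that stems each token; CountVectorizer applies its stop word filter after tokenisation, so stop words are removed from the stemmed tokens. Note that NLTK’s Porter stemmer lower-cases tokens by default, and that scikit-learn may warn that stemmed tokens are inconsistent with its built-in stop word list:

import re
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def stem_tokenizer(doc):
    # Tokenise with the assignment's word pattern, then Porter-stem each token
    return [stemmer.stem(t) for t in re.findall(r'[A-Za-z0-9#@_$%]{2,}', doc)]

vec = CountVectorizer(tokenizer=stem_tokenizer, lowercase=False,
                      stop_words='english')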
4. (3 marks) Evaluate the effect of converting all letters to lower case for the three standard models by comparing models with: (a) no conversion to lower case, and (b) all input text converted to lower case (a minimal sketch follows this question).
(i) Show all metrics on the test set for scenario 1 comparing the corresponding models (a) and
(b), and explain any similarities and differences.
(ii) Show all metrics on the test set for scenario 2 comparing the corresponding models (a) and
(b), and explain any similarities and differences.
(iii) Explain any differences in the results between scenarios 1 and 2.
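For question 4, the only change between (a) and (b) is CountVectorizer’s lowercase flag, for example:

from sklearn.feature_extraction.text import CountVectorizer

vec_a = CountVectorizer(lowercase=False)  # (a) no conversion (standard models)
vec_b = CountVectorizer(lowercase=True)   # (b) all input text lower-cased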
5. (5 marks) Describe your chosen “best” method for topic classification. Give new experimental
results for your method trained on the training sets of 800 articles/summaries and tested on
the test sets of 200 articles/summaries. Explain how this experimental evaluation justifies your
choice of model, including settings and parameters, against a range of alternatives. Provide new
experiments and justifications: do not just refer to previous answers.
Submission
• Make sure your name and zID appear on each page of the report.
• Submit all your files (Python code and report) using a command such as:
give cs9414 ass2 DT*.py BNB*.py MNB*.py my_classifier.py report.pdf
• Your submission should include:
– Your .py files for the specified models and your model, plus any .py “helper” files
– A .pdf file containing your report
• When your files are submitted, a test will be done to ensure that one of your Python files runs on the CSE machine (take note of any error messages printed out).
• When running your code on CSE machines:
– Set SKLEARN_SITE_JOBLIB=TRUE to avoid warning messages
– Do not download NLTK in your code: CSE machines have NLTK installed
• Check that your submission has been received using the command:
9414 classrun -check ass2
Assessment
Marks for this assignment are allocated as follows:
Programming (auto-marked): 8 marks
Report: 17 marks
Late penalty: 5 marks per day (or part-day) late, deducted from the maximum mark obtainable, for up to 3 (calendar) days after the due date.
Assessment Criteria
Correctness: Assessed on standard input tests, using calls such as:
python3 DT_classifier.py training.tsv test.tsv > output.txt
Each such test will give two files, a training dataset and a test dataset, which contain any
number of articles (one on each line) in the correct format. The training and test datasets
can have any names, not just training.tsv and test.tsv, so read the file names from
sys.argv. The output should be a sequence of lines (one line for each article) giving the instance number and classified topic, separated by a space, with no extra spaces on any line and no extra newline characters after the newline that follows the last classification.
There are 2 marks allocated for correctness of each of the three standard models.
For your own method, 2 marks are allocated for correctness of your method on test sets of articles that include unseen examples.
Report: Assessed on correctness and thoroughness of experimental analysis, clarity and
succinctness of explanations, and presentation quality.
There are 12 marks allocated to items 1–4 as above, and 5 marks for item 5. Of these 5
marks, 1 mark is for the description of your model, 2 marks are for new experimental analysis
of your model, and 2 marks are for the justification of your model using new analysis. In
general, if the presentation is of poor quality, at most 50% of the marks can be obtained.
Plagiarism
Remember that ALL work submitted for this assignment must be your own work and no code
sharing or copying is allowed. You may use code from the Internet only with suitable attribution
of the source in your program. Do not use public code repositories on sites such as GitHub – make sure your code repository, if you use one, is private. All submitted assignments
will be run through plagiarism detection software to detect similarities to other submissions,
including from past years. You should carefully read the UNSW policy on academic integrity
and plagiarism (linked from the course web page), noting, in particular, that collusion (working
together on an assignment, or sharing parts of assignment solutions) is a form of plagiarism.
DO NOT USE ANY CODE FROM CONTRACT CHEATING “ACADEMIES” OR ONLINE “TUTORING” SERVICES. THIS COUNTS AS SERIOUS MISCONDUCT WITH A HEAVY PENALTY, UP TO AUTOMATIC FAILURE OF THE COURSE WITH 0 MARKS.