Assignment 2 – Advanced News Classifier
Contents
1 Introduction
1.1 Glove file
2 Task 1 - Glove.java [2.5 Marks]
2.1 Task 1.1 - Glove(String _vocabulary, Vector _vector) [0.5 Marks]
2.2 Task 1.2 - Task 1.5 [0.5 Marks each]
3 Task 2 - NewsArticles.java [3.5 Marks]
3.1 Task 2.1 - Task 2.7 [0.5 Marks each]
4 Task 3 - HtmlParser.java [3 Marks]
4.1 Task 3.1 - getDataType(String _htmlCode) [1.5 Marks]
4.2 Task 3.2 - getLabel(String _htmlCode) [1.5 Marks]
5 Task 4 - Toolkit.java [10 Marks]
5.1 Task 4.1 - loadGlove() [5 Marks]
5.2 Task 4.2 - loadNews() [5 Marks]
6 Task 5 - ArticlesEmbedding [31.5 Marks]
6.1 Task 5.1 - ArticlesEmbedding(String _title, String _content, NewsArticles.DataType _type, String _label) [1 Mark]
6.2 Task 5.2 - setEmbeddingSize(int _size) [0.5 Marks]
6.3 Task 5.3 - getNewsContent() [10 Marks]
6.4 Task 5.4 - getEmbedding() [20 Marks]
7 Task 6 - AdvancedNewsClassifier [44.5 Marks]
7.1 Task 6.1 - createGloveList() [5 Marks]
7.2 Task 6.2 - calculateEmbeddingSize(List _listEmbedding) [5 Marks]
7.3 Task 6.3 - populateEmbedding() [10 Marks]
7.4 Task 6.4 - populateRecordReaders(int _numberOfClasses) [8 Marks]
7.5 Task 6.5 - predictResult(List _listEmbedding) [8 Marks]
7.6 Task 6.6 - printResults() [6.5 Marks]
8 Expected Output
*Rules*
1. For each class refer to its corresponding test to verify field and method naming
conventions.
2. Although there are many ways to construct an application, you are required to
adhere to the rules as stipulated below (to achieve marks).
3. If variable names are not stipulated, you can use your own names for variables.
This shows that you have written the application (we will check for plagiarism).
4. Inclusion of extra imports is strictly prohibited and will lead to a substantial
penalty.
5. Do NOT change or modify files included in the "resources" folder.
6. Do NOT modify the skeleton code. However, you are allowed to create your own
methods if they are needed.
7. You MUST complete this assignment independently – Do NOT discuss or share
your code with others, and Do NOT use ChatGPT! Any cheating behaviour will
result in a zero score for this module and will be subject to punishment by the
University.
8. It is *STRONGLY ADVISED AGAINST* utilizing any translation software (such
as Google Translate) for the translation of this document.
9. The jUnit tests included in the skeleton code are basic and only scratch the surface
in evaluating your code. Passing these tests does not guarantee a full mark.
10. Wrong file structure leads to a substantial penalty. Make sure you have followed
the Submission Instructions on the Canvas page (the assignment page).
11. Creating your own .zip file without using the export function in IntelliJ may lead
to a wrong file structure.
HINT: You can use the TODO window in IntelliJ (View | Tool Windows | TODO) to
quickly jump between tasks.
1 Introduction
In the last assignment, you built a news classifier by using TF-IDF and Cosine Similarity. This approach proved effective in numerous situations, with its primary benefit
being its simplicity in implementation. However, there are several disadvantages, such
as:
Lack of contextual understanding. TF-IDF focuses on the frequency of words but
doesn’t capture the context in which they are used. This can lead to misinterpretation of the text’s meaning, especially with homonyms or phrases where the
meaning depends on the context.
Ignoring word order. TF-IDF treats documents as a "bag of words", meaning it
loses the order of words. This is a significant limitation, as the sequence of words
can drastically change the meaning of sentences.
Computational complexity for large datasets. The method can become computationally intensive as the size of the dataset and vocabulary grows, making it less
efficient for very large corpora.
High dimensionality. TF-IDF can lead to very high-dimensional feature spaces,
especially with large text corpora.
In comparison, more advanced techniques like word embeddings (e.g., Word2Vec
[3–5], GloVe [6]) and transformer-based models (e.g., BERT [1], GPT [7]) provide a
more nuanced understanding of language by capturing contextual meanings, semantic
relationships, and the order of words.
Hence, in this assignment, you are tasked with constructing an advanced news classifier utilizing GloVe Embedding and Machine Learning. You are not required to understand how GloVe works or to have prior knowledge of Machine Learning, as this assignment
provides an existing GloVe file and incorporates two external libraries, Deeplearning4J [8] and NDArray4J [9], which facilitate the Machine Learning processes.
However, you do need to understand the structure of the Glove file to build the input
of the neural network.
1.1 Glove file
The file is called "glove.6B.50d_Reduced.csv" and is located in the "resources" folder.
It was trained on Wikipedia 2014 (https://dumps.wikimedia.org/enwiki/20140102/) + Gigaword 5 (https://catalog.ldc.upenn.edu/LDC2011T07), which together contain 6 billion
tokens. Originally, there were 400,000 words included in this model. For demonstration
purposes, we have reduced its size to only include 38,534 unique words. Below is an
example of how this file is structured:
abacus,0.9102,-0.22416,0.37178,0.81798,...,0.34126
abadan,-0.33432,-0.95664,-0.23116,0.21188,...,-0.23159
abalone,0.34318,-0.8135,-0.99188,0.6452,0.0057126,...,-0.15903
zygote,0.78116,-0.49601,0.02579,0.69854,...,-0.40833
zymogen,-0.34302,-0.76724,0.13492,-0.0059688,...,0.37539
Each line starts with a unique word (so there are 38,534 lines in total), followed by 50
floating-point numbers (separated by ","). These numbers are the vector representation of that word. In other words, each unique word is associated with a vector of length 50.
Elements in this vector must be consistent with the order of the numbers
in the CSV file. Using the word "abacus" as an example, the first element in its vector
representation should be "0.9102", the second element "-0.22416", and so on.
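The line structure above can be illustrated with a minimal sketch. It assumes a line is simply a word followed by comma-separated numbers; the class and method names (GloveLineDemo, parseWord, parseVector) are made up for this illustration and are not part of the skeleton code, and the sample line is shortened to three numbers.

```java
import java.util.Arrays;

class GloveLineDemo {
    // Returns the word stored before the first comma.
    static String parseWord(String line) {
        return line.substring(0, line.indexOf(','));
    }

    // Returns the numbers that follow the word, kept in file order.
    static double[] parseVector(String line) {
        String[] parts = line.split(",");
        double[] vector = new double[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            vector[i - 1] = Double.parseDouble(parts[i]);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Shortened sample: a real line carries 50 numbers.
        String line = "abacus,0.9102,-0.22416,0.37178";
        System.out.println(parseWord(line));
        System.out.println(Arrays.toString(parseVector(line)));
    }
}
```

The first number after the word ends up at index 0 of the array, which matches the ordering requirement described above.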
2 Task 1 - Glove.java [2.5 Marks]
The Glove class represents a single GloVe entry, and you need to complete the following
methods to finish this class. strVocabulary is the attribute holding the word stored in this
Glove object, and vecVector is its vector representation.
Test this class with the GloveTest.java file.
2.1 Task 1.1 - Glove(String _vocabulary, Vector _vector) [0.5 Marks]
This is the constructor of the Glove class.
Complete this constructor by assigning the _vocabulary to the strVocabulary attribute, and _vector to vecVector.
2.2 Task 1.2 - Task 1.5 [0.5 Marks each]
Complete the relevant get and set methods accordingly.
3 Task 2 - NewsArticles.java [3.5 Marks]
This class holds the basic information about the news articles located in the resources\News
folder:
1. newsTitle: stores the title of the news.
2. newsContent: stores the content of the news.
3. newsType: in Machine Learning, it is essential to divide the data into two distinct
subsets: Training and Testing. This particular variable (or attribute) serves the
purpose of identifying whether a given news article is part of the Training set or
the Testing set.
4. newsLabel: in Machine Learning, a "label" refers to the output or target variable
a model tries to predict or classify. It’s an integral part of supervised learning,
and the goal is to learn a mapping from input data to labels based on example
input-output pairs. In this assignment, a label represents which group a given
news article belongs to. For example, if there are two groups, the label should be
either 1 (the first group) or 2.
This assignment initially provides only the training set data with corresponding
labels. The ultimate goal is to develop a machine-learning model that predicts the
labels for the testing set data.
3.1 Task 2.1 - Task 2.7 [0.5 Marks each]
Complete the constructor and the relevant get & set methods accordingly.
4 Task 3 - HtmlParser.java [3 Marks]
Similar to Assignment 1, the HtmlParser class provides various methods to retrieve
related information from news articles. The getNewsTitle(String _htmlCode) and getNewsContent(String _htmlCode) methods are provided already, and this task focuses on
the methods that allow you to get the data type and label information.
4.1 Task 3.1 - getDataType(String _htmlCode) [1.5 Marks]
The data type information is located between the
tag.
If the article does not contain this tag, then consider it as Testing data. Otherwise,
return the data type accordingly.
The return type should be the enum defined in the NewsArticles class.
HINT: Enumerated data type (enum) is introduced in Chapter 8 - Arrays in the
textbook.
4.2 Task 3.2 - getLabel(String _htmlCode) [1.5 Marks]
The label information is located between the
tag.
If the article does not contain this tag, then return "-1" (as a string). Otherwise,
return the label accordingly.
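The "value between tags, or a fallback when the tag is absent" pattern used by both tasks can be sketched as follows. This is only an illustration: the tag name <example> is a placeholder rather than one of the assignment's tags, and the helper name between() is hypothetical.

```java
class TagExtractDemo {
    // Returns the trimmed text between openTag and closeTag,
    // or the fallback when either tag is missing.
    static String between(String html, String openTag, String closeTag, String fallback) {
        int start = html.indexOf(openTag);
        if (start < 0) {
            return fallback;
        }
        start += openTag.length();
        int end = html.indexOf(closeTag, start);
        if (end < 0) {
            return fallback;
        }
        return html.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // <example> is a placeholder tag used only for this sketch.
        System.out.println(between("<p><example> 2 </example></p>", "<example>", "</example>", "-1"));
        System.out.println(between("<p>no tag here</p>", "<example>", "</example>", "-1"));
    }
}
```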
5 Task 4 - Toolkit.java [10 Marks]
The Toolkit class includes methods you need to use/complete to load the Glove and
News data.
5.1 Task 4.1 - loadGlove() [5 Marks]
In this task, you are required to use a BufferedReader (myReader) to read data from
the Glove file (FILENAME_GLOVE) line by line. FILENAME_GLOVE is the name
of the Glove file (the structure of this file can be found in Section 1.1).
Read the file line by line and analyse the result - adding the word to listVocabulary
and its vector representation to listVectors.
Use the Toolkit.getFileFromResource(String _fileName) method to get the correct file path.
If the file doesn’t exist, throw an exception and print out the error message (using
.getMessage() method).
The average execution time should be below 280 milliseconds.
HINT: Remember to use the try...catch()...finally blocks. Do NOT hardcode your
file path.
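As a rough illustration of the try...catch...finally shape mentioned in the hint, the sketch below reads lines through a BufferedReader and always closes it. It reads from an in-memory StringReader instead of the real Glove file so that it runs on its own; the names LineReadingDemo and readWords are assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class LineReadingDemo {
    // Reads every line from the source and collects the first comma-separated field.
    static List<String> readWords(Reader source) {
        List<String> words = new ArrayList<>();
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(source);
            String line;
            while ((line = reader.readLine()) != null) {
                words.add(line.split(",")[0]);
            }
        } catch (IOException e) {
            System.out.println(e.getMessage());
        } finally {
            // The reader is closed whether or not reading succeeded.
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e) {
                    System.out.println(e.getMessage());
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(readWords(new StringReader("abacus,0.9\nabadan,-0.3")));
    }
}
```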
5.2 Task 4.2 - loadNews() [5 Marks]
Similar to Task 4.1, now please load the News data from the resources\News folder.
Check the file name first and only load those with ".htm" extension.
Please use the completed HtmlParser class to retrieve the related information,
then convert it into a NewsArticles object and add it to the listNews variable.
The average execution time should be below 30 milliseconds.
6 Task 5 - ArticlesEmbedding [31.5 Marks]
Task 1 and Task 4.1 allow you to read data from the files and create the associated
Glove objects. Unlike the TF-IDF embedding in the first assignment, these Glove
objects are word-level embeddings (or vectorisations) instead of document-level ones.
(In A1, each document/article has a single TF-IDF embedding; this is called a document-level embedding.) So,
in this task, you are required to construct document-level embeddings based on the
related Glove objects. In other words, each news article has one single embedding
that represents its content.
The ArticlesEmbedding class is a subclass of the NewsArticles class, which was
completed in Task 2. There are three attributes in this class:
1. processedText. In the first assignment, there was a preProcessing() method
for text cleaning, text lemmatization and stop-word removal, which saved the processed text to a string array called newsCleanedContent. In this assignment,
processedText is the equivalent of newsCleanedContent in A1 and is generated
in Task 5.3. The difference is that processedText is a single string instead of an
array.
2. newsEmbedding. This is the attribute for the document-level embedding, which
will be generated in Task 5.4.
3. intSize. Each news article has a different length, but neural networks can only
process inputs of the same shape. Therefore, we need to set the size of the embedding here.
6.1 Task 5.1 - ArticlesEmbedding(String _title, String _content, NewsArticles.DataType _type, String _label) [1 Mark]
This is the constructor of the ArticlesEmbedding class. Complete it accordingly. You
can modify the existing code in this constructor (super("","",null,"");).
6.2 Task 5.2 - setEmbeddingSize(int _size) [0.5 Marks]
This is the set method of the intSize variable. Complete it accordingly.
6.3 Task 5.3 - getNewsContent() [10 Marks]
Override the getNewsContent() method in the NewsArticles class.
The idea here is that when this method has been called, it will automatically retrieve
the original news content from its base and execute the subsequent pre-processing steps
in the following sequence:
1. Text cleaning. Perform the text cleaning tasks by calling the provided textCleaning() method and output the string "***Getnewscontent Process Task***".
2. Text lemmatization. In the first assignment, we considered a simplified scenario.
Here, we will use a proper NLP library called CoreNLP [2], developed by the
NLP Group at Stanford University, for the lemmatization process.
The CoreNLP library (https://stanfordnlp.github.io/CoreNLP) has been included in this project, but you need to learn how
to set up the correct pipeline for text lemmatization by using the documentation
provided on their website.
HINT: There is a specific page about Lemmatization.
3. Stop-words removal. Use the STOPWORDS constant in the Toolkit class to
perform this task.
After these three steps, pass the string to the processedText attribute.
Ensure all the characters in processedText are in lowercase. The .lemma()
method in the CoreNLP library may restore letter cases and produce some unexpected results.
The pre-processing task only needs to be done once. Otherwise, it will have a
huge impact on the performance. In the related jUnit test, the average execution
time should be less than 13000000 nanoseconds.
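The "do the pre-processing only once" idea, together with lowercasing and stop-word removal, can be sketched without CoreNLP as below. Lemmatization is deliberately left out, the stop-word set is a made-up stand-in for the STOPWORDS constant, and the class CachedTextDemo with its runs counter exists only for this illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CachedTextDemo {
    // Hypothetical stop words; the assignment supplies its own list.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "a", "is"));

    private final String raw;
    private String processed = ""; // empty means "not processed yet"
    int runs = 0;                  // counts how often the expensive path executes

    CachedTextDemo(String raw) {
        this.raw = raw;
    }

    String getContent() {
        if (!processed.isEmpty()) {
            return processed; // reuse the cached result
        }
        runs++;
        StringBuilder sb = new StringBuilder();
        for (String word : raw.toLowerCase().split("\\s+")) {
            if (word.isEmpty() || STOP.contains(word)) {
                continue; // drop stop words
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(word);
        }
        processed = sb.toString();
        return processed;
    }

    public static void main(String[] args) {
        CachedTextDemo demo = new CachedTextDemo("The Cat is a Hunter");
        System.out.println(demo.getContent());
        System.out.println(demo.getContent());
        System.out.println("expensive runs: " + demo.runs);
    }
}
```

Calling getContent() twice runs the cleaning loop only on the first call; the second call returns the stored string.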
6.4 Task 5.4 - getEmbedding() [20 Marks]
Before starting this task, it’s essential to have completed Task 6.1 and Task 6.2.
This task involves creating an array using ND4J (N-Dimensional Arrays for Java),
a library included in this project. The array is formed by the embeddings of words
present in the processedText string. For example, if "hello" and "world" have embeddings [0,1,2,3] and [4,5,6,7] respectively, the embedding for "hello world" is 0, 1, 2, 3,
4, 5, 6, 7.
Retrieve word embeddings from the Glove object list created in Task 6.1. Use the
intSize attribute, calculated in Task 6.2, to set the maximum length of the array. You’ll
need to familiarize yourself with ND4J methods such as Nd4j.create() and .putRow()
(see https://deeplearning4j.konduit.ai/nd4j/tutorials/quickstart).
The array’s shape should be [x,y] where x=intSize and y=word vector size.
Additional requirements include:
Throw an InvalidSizeException with the message "Invalid size" if intSize is uninitialized (intSize = -1).
Throw an InvalidTextException with the message "Invalid text" if processedText is
empty (processedText.isEmpty()) and output the string "***Getembedding Process Terminated***".
Limit the length to intSize. If the document exceeds this, only process the first
intSize characters; if it’s shorter, fill the remaining space with 0.
For a specific article, ensure the embedding process is done only once to avoid
performance issues. In jUnit tests, the average execution time should be under 8
milliseconds.
HINT: Only include those words that have an associated Glove object.
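The truncate-or-pad behaviour can be sketched with plain Java arrays standing in for ND4J. This is not the ND4J API: a double[][] replaces the INDArray, a Map replaces the Glove list, length is counted in words (following the hint in Task 6.2), and the names PaddedEmbeddingDemo and embed are invented for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class PaddedEmbeddingDemo {
    // Builds a [size][vectorSize] matrix: one row per known word, zero rows as padding.
    static double[][] embed(String text, Map<String, double[]> vectors, int size, int vectorSize) {
        double[][] matrix = new double[size][vectorSize]; // Java zero-fills new arrays
        int row = 0;
        for (String word : text.split(" ")) {
            if (row == size) {
                break; // truncate documents longer than size
            }
            double[] vector = vectors.get(word);
            if (vector == null) {
                continue; // skip words without a vector
            }
            matrix[row++] = vector.clone();
        }
        return matrix;
    }

    public static void main(String[] args) {
        Map<String, double[]> vectors = new HashMap<>();
        vectors.put("hello", new double[] {0, 1, 2, 3});
        vectors.put("world", new double[] {4, 5, 6, 7});
        // "unknown" has no vector, so it is skipped; the third row stays all zeros.
        System.out.println(Arrays.deepToString(embed("hello unknown world", vectors, 3, 4)));
    }
}
```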
7 Task 6 - AdvancedNewsClassifier [44.5 Marks]
7.1 Task 6.1 - createGloveList() [5 Marks]
Based on Toolkit.listVocabulary and Toolkit.listVectors, create/populate the Glove list.
Only create Glove objects for non-stop words.
7.2 Task 6.2 - calculateEmbeddingSize(List _listEmbedding) [5 Marks]
As explained before, each article has a different length. Hence, it is essential to determine a suitable embedding size. Using the smallest length will limit the ability to
include more semantic information in the document-level embedding. On the other
hand, using the largest length will leave too many 0s in the embedding, which will pollute the semantic representation and increase the training time of the machine-learning model. To balance
these concerns, we choose to use the median document length for embedding.
To calculate the median document length, follow these steps:
1. Determine the length of each document in your corpus/dataset.
2. Add these lengths to a list.
3. Sort the list in ascending order.
4. If the length of the list is even, the median is the average of the lengths at positions
N/2 and (N/2) + 1 in the sorted list.
5. Otherwise, the median is the length at position (N+1)/2 in the sorted list.
HINT: The length of the document is measured by the count of words it contains.
However, only words that have a corresponding Glove object are included in this count.
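The median steps above can be sketched as follows. The positions in the steps are 1-based, so they shift by one when used as Java list indices. Two assumptions are made here: the list is non-empty, and the average of the two middle values uses integer division (the text does not say how to round). MedianDemo and median are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class MedianDemo {
    // Returns the median of a non-empty list of document lengths.
    static int median(List<Integer> lengths) {
        List<Integer> sorted = new ArrayList<>(lengths);
        Collections.sort(sorted); // ascending order
        int n = sorted.size();
        if (n % 2 == 0) {
            // 1-based positions N/2 and N/2 + 1 are indices n/2 - 1 and n/2.
            return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2;
        }
        // 1-based position (N+1)/2 is index (n+1)/2 - 1.
        return sorted.get((n + 1) / 2 - 1);
    }

    public static void main(String[] args) {
        System.out.println(median(Arrays.asList(5, 1, 3)));     // odd count
        System.out.println(median(Arrays.asList(10, 2, 4, 8))); // even count
    }
}
```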
7.3 Task 6.3 - populateEmbedding() [10 Marks]
listEmbedding is an attribute that holds all the ArticlesEmbedding objects, which are
initialised in the loadData() method. Go through this list and call the getEmbedding()
method (completed in Task 5.4) to calculate the embedding for each article.
If an InvalidSizeException occurs, (re)assign the intSize attribute in the ArticlesEmbedding class by calling the setEmbeddingSize() method.
If an InvalidTextException occurs, call the getNewsContent() method to pre-process
the text and output the string "***Generate unexPected resulT***".
At the end of this method, all the objects in the listEmbedding should have a valid
(nonempty) newsEmbedding.
To avoid performance issues, use a single for loop to complete this task.
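One possible way to recover from exceptions inside a single for loop is to repair the object and step the index back so the same item is visited again. The sketch below shows only that control flow, using invented stand-ins (Item, SizeMissing, TextMissing) rather than the assignment's classes and exceptions.

```java
import java.util.ArrayList;
import java.util.List;

class RetryLoopDemo {
    static class SizeMissing extends Exception {}
    static class TextMissing extends Exception {}

    // Hypothetical stand-in for an article that needs a size and a text first.
    static class Item {
        int size = -1;
        String text = "";
        boolean done = false;

        void compute() throws SizeMissing, TextMissing {
            if (size == -1) {
                throw new SizeMissing();
            }
            if (text.isEmpty()) {
                throw new TextMissing();
            }
            done = true;
        }
    }

    static void computeAll(List<Item> items, int size) {
        for (int i = 0; i < items.size(); i++) {
            Item item = items.get(i);
            try {
                item.compute();
            } catch (SizeMissing e) {
                item.size = size;       // repair, then retry the same index
                i--;
            } catch (TextMissing e) {
                item.text = "prepared"; // repair, then retry the same index
                i--;
            }
        }
    }

    public static void main(String[] args) {
        List<Item> items = new ArrayList<>();
        items.add(new Item());
        items.add(new Item());
        computeAll(items, 5);
        System.out.println(items.get(0).done && items.get(1).done);
    }
}
```

Each failure removes its own cause, so every item is retried at most twice and the loop always terminates.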
7.4 Task 6.4 - populateRecordReaders(int _numberOfClasses) [8 Marks]
The actual machine learning process is handled by a given method called buildNeuralNetwork, but you are tasked to construct the training data (trainIter).
For a specific document, its associated DataSet object contains two elements: a) an
input (also called feature) INDArray and b) an output INDArray.
The input INDArray (inputNDArray) is simply the document-level embedding (.getEmbedding() method completed in Task 5.4). The output INDArray (outputNDArray) is
constructed as follows:
The shape of this array is [1, _numberOfClasses]. Assuming that there are 2 classes
(two newsgroups), then create an outputNDArray with the shape [1,2] and assign value
0 to it (outputNDArray=[0,0]). For a specific document, assign value 1 to the *first
element* ([1,0]) if it belongs to the first group (newsLabel="1"). Otherwise, assign
value 1 to the *second element* ([0,1]).
Go through all the items that have been marked as Training data (use the .getNewsType() method, Task 2.3) from the listEmbedding, and initialise their corresponding DataSet objects (DataSet myDataSet = new DataSet(inputNDArray,
outputNDArray)).
Once a DataSet object has been initialised, add it to the listDS.
Your code should be flexible enough to handle more than 2 newsgroups.
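The output array described above is a one-hot encoding of the label. A minimal sketch with a plain double[] in place of an INDArray, assuming labels are the strings "1", "2", ... up to the number of classes (OneHotDemo and oneHot are illustrative names):

```java
import java.util.Arrays;

class OneHotDemo {
    // Returns an array of zeros with a single 1.0 at position (label - 1).
    static double[] oneHot(String label, int numberOfClasses) {
        double[] output = new double[numberOfClasses]; // starts as all zeros
        output[Integer.parseInt(label) - 1] = 1.0;
        return output;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(oneHot("1", 2))); // prints [1.0, 0.0]
        System.out.println(Arrays.toString(oneHot("3", 4))); // prints [0.0, 0.0, 1.0, 0.0]
    }
}
```

Because the index is derived from the label rather than hard-coded, the same code covers any number of newsgroups.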
7.5 Task 6.5 - predictResult(List _listEmbedding) [8 Marks]
The label data is obtained through the .getLabel() method in the HtmlParser class, as
outlined in Task 3.2. Initially, labels are available only for news items marked as Training data/type. The goal is to employ myNeuralNetwork for predicting labels for the
Testing data.
The myNeuralNetwork attribute holds the trained machine learning model. To generate a label for any given input, use its .predict() method.
The parameter of the .predict() method is the document-level embedding of a specific news article. The output is an integer array: 0 means this specific news belongs to
the first group, and 1 means the second group.
Go through the ArticlesEmbedding list (_listEmbedding), and use the .predict()
method to generate a label for all the Testing data.
Add all the predicted labels to the listResult attribute.
Use the .setNewsLabel() method to modify the label information in the associated
ArticlesEmbedding object.
7.6 Task 6.6 - printResults() [6.5 Marks]
Since the label information was updated in the last task, go through the listEmbedding
attribute and print out the grouping result for the Testing data.
Use the related jUnit test to determine the correct string format.
Your code must be flexible enough to handle more than 2 newsgroups.
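Grouping titles under a heading per label, for any number of groups, could look roughly like the sketch below. The "Group N" heading followed by one title per line is modelled on the Expected Output section; the exact string required by the jUnit test is not known here, so treat the format as an assumption. GroupPrintDemo and format are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupPrintDemo {
    // Builds "Group N" blocks in ascending label order, one title per line.
    static String format(List<String> titles, List<Integer> labels) {
        Map<Integer, List<String>> groups = new TreeMap<>(); // keeps labels sorted
        for (int i = 0; i < titles.size(); i++) {
            groups.computeIfAbsent(labels.get(i), k -> new ArrayList<>()).add(titles.get(i));
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, List<String>> entry : groups.entrySet()) {
            sb.append("Group ").append(entry.getKey()).append('\n');
            for (String title : entry.getValue()) {
                sb.append(title).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(format(Arrays.asList("A", "B", "C"), Arrays.asList(2, 1, 2)));
    }
}
```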
8 Expected Output
If all tasks have been completed correctly, the output produced by the main() method
should match the following (ignore the colour):
Group 1
Boris Johnson asked if government ’believes in long COVID’, coronavirus inquiry hears
COVID vaccine scientists win Nobel Prize in medicine
Long COVID risks are ’distorted by flawed research’, study finds
Who is Sam Altman? The OpenAI boss and ChatGPT guru who became one of AI’s biggest players
Sam Altman: Ousted OpenAI boss ’committed to ensuring firm still thrives’ as majority of employees threaten to quit
Sam Altman: Sudden departure of ChatGPT guru raises major questions that should concern us all
ChatGPT creator Sam Altman lands Microsoft job after ousting by OpenAI board
Group 2
COVID inquiry: There could have been fewer coronavirus-related deaths with earlier lockdown, scientist says
Up to 200,000 people to be monitored for COVID this winter to track infection rates
Molnupiravir: COVID drug linked to virus mutations, scientists say
How the chaos at ChatGPT maker OpenAI has unfolded as ousted CEO Sam Altman returns - and why it matters
ChatGPT maker OpenAI agrees deal for ousted Sam Altman to return as chief executive