Don’t be Sentimental!
Due: Monday, February 15, 2021 @ 6 a.m. - pushed to GitHub and release issued.
Introduction
Have you ever read a tweet and thought, “Gee, what a positive outlook!” or “Wow, why so negative, friend?” Can computers make the same determination? They can surely try!
In Machine Learning, the task of assigning a label to a data item is called classification (putting things into different classes or categories). The more specific name for what we’re going to do is sentiment analysis because you’re trying to determine the “sentiment” or attitude based on the words in a tweet. So, Project 1 is to build a sentiment classifier! Aren’t you excited?? ( ← That would be positive sentiment!)
You’ll be given a set of tweets that are already pre-classified as positive or negative based on their content. You’ll analyze the word frequency patterns among all of those tweets to develop a classification algorithm. Using your classification algorithm, you’ll then classify another set of tweets to determine if they are positive or negative.
Building a Classifier
The goal in classification is to assign a class label to each element of a data set. Of course, we would want this done with the highest accuracy possible. For this project, we will have only two classes or labels: positive sentiment and negative sentiment. At a high level, the process to build a classifier (and many other machine learning models) is this:
1.Train
○Use a training data set with pre-classified members.
○Assume you have 10 tweets and each is pre-classified with + or - sentiment. How might you go about analyzing the words in the tweets to find words more commonly associated with negative sentiment and words more commonly associated with positive sentiment?
○The result of the training step will be two lists of words: one list of positive words and one list of negative words.
2.Test
○Now, you give your classifier un-labeled tweets from a testing data set and ask it to output the class it determines.
○But behind the scenes, you already know what class each tweet actually belongs to.
○Compare the predicted class of each tweet (the output of your classifier) and the actual class of each tweet and determine the accuracy. In other words, how correct was your classifier?
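To make the test step concrete, here is a minimal sketch of how two word lists could be used to label a tweet. The function name `isPositive`, the simple +1/-1 scoring, and the tie-breaking rule are all assumptions made for illustration, not a required design. It also uses `std::string` purely for brevity; in the project itself you would swap in your custom string class.

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: scores a tweet by counting how many of its
// whitespace-separated words appear in each list.
// Returns true for positive sentiment. Ties (including tweets with no
// known words) default to positive, an arbitrary choice for this sketch.
bool isPositive(const std::string& text,
                const std::set<std::string>& positiveWords,
                const std::set<std::string>& negativeWords) {
    std::istringstream words(text);
    std::string word;
    int score = 0;
    while (words >> word) {
        if (positiveWords.count(word)) ++score;
        if (negativeWords.count(word)) --score;
    }
    return score >= 0;
}
```

A real classifier would probably also normalize case and punctuation before looking words up; this sketch skips that on purpose.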
The Real Data
The data set we will be using in this project comes from real tweets posted around 11-12 years ago. The original data was retrieved from Kaggle at https://www.kaggle.com/kazanova/sentiment140. I’ve pre-processed it into the file format we are using for this project. For more information, please see:
Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.
Input files
There will be 3 different input files:
1.Training Data
2.Testing Data (no sentiment column)
3.Testing Sentiment (the tweet ID and sentiment for each tweet in the testing data, for you to compare against).
The training data set is formatted as follows:
●A comma-separated-values (CSV) file containing a list of tweets, each one on a separate line. Each line of the data file includes the following fields:
○Sentiment value (negative = 0, positive = 4),
○the tweet id,
○the date the tweet was posted
○Query status (you will ignore this column)
○the twitter username that posted the tweet
○the text of the tweet itself
The testing data set is broken into two files:
●A CSV file formatted just like the training data EXCEPT with no Sentiment column
●A CSV file containing the tweet ID and sentiment for the testing dataset (so you can compare your predictions of sentiment to the actual sentiment, i.e. the ground truth)
Below are two example tweets from the training dataset:
4,1467811594,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,peruna_pony,"Beat TCU"
0,1467811595,Mon Apr 06 22:22:03 PDT 2009,NO_QUERY,the_frog,"Beat SMU"
Here is one tweet from the testing dataset:
1467811596,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,peruna_pony,"SMU > TCU"
The sentiment file for that testing tweet would be:
4, 1467811596
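One way to read a training line like the examples above is to split on the first five commas and treat the rest as the tweet text, so commas inside a tweet are not mistaken for field separators. The sketch below assumes the first five fields never contain commas and that the text is wrapped in one pair of double quotes, as in the examples. The struct and function names are hypothetical, and `std::string` is used only to keep the sketch short; the project requires your own string class.

```cpp
#include <cstddef>
#include <string>

// Hypothetical record type for one line of the training file.
struct TrainingTweet {
    int sentiment;      // 0 = negative, 4 = positive
    std::string id;     // kept as text; some IDs may not fit in a 32-bit int
    std::string date;
    std::string query;  // ignored by the classifier
    std::string user;
    std::string text;
};

// Parses one training line. Returns false if the line has fewer than six
// fields or the sentiment is not "0" or "4".
bool parseTrainingLine(const std::string& line, TrainingTweet& out) {
    std::string fields[5];
    std::size_t start = 0;
    for (int i = 0; i < 5; ++i) {
        std::size_t comma = line.find(',', start);
        if (comma == std::string::npos) return false;
        fields[i] = line.substr(start, comma - start);
        start = comma + 1;
    }
    if (fields[0] != "0" && fields[0] != "4") return false;

    // Everything after the fifth comma is the tweet text.
    std::string text = line.substr(start);
    // Strip one pair of surrounding double quotes, if present.
    if (text.size() >= 2 && text.front() == '"' && text.back() == '"') {
        text = text.substr(1, text.size() - 2);
    }

    out.sentiment = (fields[0] == "4") ? 4 : 0;
    out.id = fields[1];
    out.date = fields[2];
    out.query = fields[3];
    out.user = fields[4];
    out.text = text;
    return true;
}
```

The testing file could be handled the same way with four leading fields instead of five, since it has no Sentiment column.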
Output Files
There will be one output file organized as follows:
●The first line of the output file will contain the accuracy, a single floating point number with exactly 3 decimal places of precision. See the section “How Good is Your Classifier?” below to understand accuracy.
●The remaining lines of the file will contain the Tweet IDs of the tweets from the testing data set that your algorithm incorrectly classified.
Example of the Testing Data tweet classifications file (These tweet IDs are fake):
0.500
2323232323
1132553423
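The layout above can be produced with the standard stream manipulators `std::fixed` and `std::setprecision`. The sketch below builds the file contents as a string so the formatting is easy to check; the function name `formatResults` is hypothetical, and `std::string` again stands in for the custom string class the project requires.

```cpp
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the full text of the output file.
// Line 1 is the accuracy with exactly 3 decimal places; each following
// line is the ID of one incorrectly classified tweet.
std::string formatResults(double accuracy,
                          const std::vector<std::string>& misclassifiedIds) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(3) << accuracy << '\n';
    for (const std::string& id : misclassifiedIds) {
        out << id << '\n';
    }
    return out.str();
}
```

The same two manipulators work directly on a `std::ofstream` if you prefer to write to the file without building a string first.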
Running your Program
Your program will have two modes
●Running Catch TDD tests
○this will be indicated by passing no arguments to the executable
○this is explained more below.
●Training & Testing the Classifier
○this mode will have 4 command line arguments:
■training data set filename - the file with the training tweets
■testing data set filename - tweets that your program will classify
■testing data set sentiment filename - the file with the classes for the testing tweet data
■output file name - see Output Files section above
■Example:
./classifier.out
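Choosing between the two modes comes down to checking the argument count. Here is a minimal sketch, assuming a hypothetical `selectMode` helper and a third "usage error" outcome that the handout does not ask for but that keeps bad invocations from being silently accepted.

```cpp
// Hypothetical names; the handout only specifies the two modes.
enum class Mode { RunTests, TrainAndClassify, UsageError };

// argc counts the executable name itself, so "no arguments" means
// argc == 1, and the four filenames give argc == 5.
Mode selectMode(int argc) {
    if (argc == 1) return Mode::RunTests;
    if (argc == 5) return Mode::TrainAndClassify;
    return Mode::UsageError;
}
```

In `main`, you would call `selectMode(argc)` and then hand off to either the test runner or the classifier. With the argument order listed above, `argv[1]` through `argv[4]` would be the training file, the testing file, the testing sentiment file, and the output file.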
Training Your Classification Algorithm
You’ll use word frequencies to train your classification algorithm. For the tweets in the training data set, you’ll calculate word frequencies for the positive tweets and for the negative tweets. You might expect a few things to happen:
●certain words might appear more frequently in negative tweets. Other words might appear more frequently in positive tweets.
●certain words might appear frequently in both positive and negative tweets. Are these useful at all?
You might also consider the possibility that some people are more positive than other people. Could you use username for classification?
As you develop your classification algorithm, you can use the output file containing the incorrectly classified tweets to refine your solution.
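As a starting point, the word-frequency idea can be sketched as a table that maps each word to how often it appeared in positive and in negative training tweets. Everything here is an assumption for illustration: the names `WordCounts`, `FrequencyTable`, `addTweet`, and `lean`, the whitespace-only tokenizing, and the use of `std::map` and `std::string` (your project must use your custom string class instead).

```cpp
#include <map>
#include <sstream>
#include <string>

// How many times a word appeared in each class of training tweet.
struct WordCounts {
    int positive = 0;
    int negative = 0;
};

using FrequencyTable = std::map<std::string, WordCounts>;

// Adds every whitespace-separated word of one training tweet to the table.
void addTweet(FrequencyTable& table, const std::string& text, bool isPositive) {
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        WordCounts& counts = table[word];
        if (isPositive) {
            ++counts.positive;
        } else {
            ++counts.negative;
        }
    }
}

// Greater than zero if the word was seen more often in positive tweets,
// less than zero if more often in negative tweets, and zero if it is
// balanced or was never seen.
int lean(const FrequencyTable& table, const std::string& word) {
    FrequencyTable::const_iterator it = table.find(word);
    if (it == table.end()) return 0;
    return it->second.positive - it->second.negative;
}
```

A word that shows up equally often in both classes gets a lean of zero, which is one way to think about whether such words are useful at all. Raw count differences are only one possible measure; ratios or thresholds may serve you better as you refine your solution.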
Implementation Requirements
You must implement your own custom string class as part of this project. You may not use the STL string class or any other available string class from the Internet. Additionally, you may NOT use c-strings (null-terminated character arrays) except to implement your custom string class and as a temporary buffer when reading from a file. More on the requirements of this custom class will be in a separate handout.
You will also need to implement a suite of tests for the string class using the CATCH TDD library. More on this in a separate handout in the first week of lab. You’ll provide a testing mode (the mode with no args) which will execute your tests.
Comments about this Project
This is your opportunity to make a great first impression on the Prof/TAs in CSE 2341. We will be looking for simple, elegant, well-designed solutions to this problem. Please do not try to “wow” us with your knowledge of random, esoteric programming practices. Here is a list of things that you might want to consider while tackling this project:
●Procedural vs Object Oriented Design
○A seemingly infinite amount of software has been designed, implemented, deployed, maintained, updated, and redeployed using both of these paradigms. One could argue for days, or weeks even, about which is the “better” paradigm for modern software development. Regardless of which paradigm you choose to use, the most important thing is that you produce an elegant solution following solid software development practices.
●File input and output
○It is so important to be able to read from and write to data files. Think about some software program that doesn't use files...
●Just the right amount of comments in just the right places
●Minimal amount of code in the driver method (main, in the case of C++)
○The code in main should be minimal and only used to “get the ball rolling.”
●Proper use of command-line arguments
●Proper memory management
How Good is Your Classifier?
Training and Testing of a classification algorithm is an iterative process. You’ll develop a training algorithm, test it, evaluate its performance, tweak the algo, retrain, retest, etc. How do you know how good your classifier is after each development iteration, though? We will use accuracy as the metric for evaluation.
Accuracy = (total number of correctly classified tweets from the test data set) / (total number of tweets in the test data set)
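In code, the accuracy calculation is a count of matches divided by the number of test tweets. This sketch assumes the predicted and actual sentiment codes have already been lined up by tweet ID into two parallel vectors; the function name `computeAccuracy` and the choice to return 0.0 for empty or mismatched input are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// predicted[i] and actual[i] are the sentiment codes (0 or 4) for the
// same test tweet. Returns the fraction classified correctly, or 0.0 if
// the inputs are empty or differ in length.
double computeAccuracy(const std::vector<int>& predicted,
                       const std::vector<int>& actual) {
    if (predicted.empty() || predicted.size() != actual.size()) {
        return 0.0;
    }
    std::size_t correct = 0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        if (predicted[i] == actual[i]) {
            ++correct;
        }
    }
    // Cast before dividing so the result is not truncated to an integer.
    return static_cast<double>(correct) / static_cast<double>(predicted.size());
}
```

For example, if 2 of 4 test tweets are classified correctly, the accuracy is 0.5.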
Why should you be interested in this? The TAs will take your final solutions and classify a set of tweets that your algorithm has never seen. They will calculate the accuracies (as defined above) for each submission. The 10 submissions with the highest accuracies will get 10 bonus points.
What to Submit
During lab in the coming weeks, you will cover a basic introduction to GitHub and how to use it to submit your projects. You will need to push your code (including your tests) to GitHub and create a new release by Monday, February 15, 2021 at 6 a.m.
Grading Rubric
                                                  Points Possible   Points Awarded
String Class                                      20
CATCH Tests                                       5
Training Algo Implementation                      25
Classification Algo Implementation                25
Source Code Quality (formatting, comments, etc.)  15
Proper use of Git and GitHub                      10