首页
网站开发
桌面应用
管理软件
微信开发
App开发
嵌入式软件
工具软件
数据采集与分析
其他
首页
>
> 详细
讲解data留学生编程、辅导SQL,Python程序语言 辅导留学生 Statistics统计、回归、迭代|调试Web开发
项目预算:
开发周期:
发布时间:
要求地区:
Homework Assignment #3 (Capturing and Wrangling Twitter Data)
Your next assignment is to become something of an expert on acquiring, saving, and wrangling large volumes of social media data using Twitter APIs and Tweets as an example.
Get Ready...
In addition to PostgreSQL (again) and the mainstream components of Anaconda, you will be making use of the Twitter developer APIs and a popular supporting Python library called Tweepy (http://www.tweepy.org/) for this assignment. Grab Tweepy if you don’t already have it and read up on it a bit. You will also be using some of the techniques covered in class for doing non-query things with PostgreSQL, so make sure that you are comfortable with those. Lastly, to use Tweepy, you will need to have a Twitter account - so it’s time to get yourself one if you don’t already have one. (Since all of the CS170A course projects will likely include a social media component, consider this another investment in project preparation as well - so even if getting a Twitter account hasn’t been high on your to-do list in the past, it’s time for you to join the leaders of countries all over the world and start Tweeting!)
Get (Data) Set...
For this assignment, you are going to be acquiring, archiving, and analyzing Tweets. Because your boss envisions doing this for a LOT of Tweets, you’re going to do this the scalable way - you will be storing and manipulating your Tweets in PostgreSQL, using Python and Tweepy just as a means to that end. Your boss is looking for ways to help Wholoo.com overtake Netflix.com, and wants you to explore the online buzz about two recent Netflix shows, Bridgerton and The Queen's Gambit.
To get started, think of a few keywords that might appear in Tweets about these shows, and play around with Twitter’s interactive search interface (https://twitter.com/search-home) to see what those words turn up (e.g. https://twitter.com/search?q=bridgerton%20netflix). Iterate a bit until you find a search query that yields Tweets whose results you would like to analyze. This is also a great way to get familiar with the query syntax for Twitter searches, so feel free to continue exploring the search capabilities this way. More info about Twitter’s search operators can be found here. Additional note, Twitter has many Tweeters who are not human (bots). You might want to mitigate that by refining your search method (e.g., changing the ‘result_type’ from recent to popular or vice versa).
Go..!
It’s time to get to work. Tackle each of the following questions and problems using Tweepy to access the Twitter APIs and using PostgreSQL as your social data storage and query platform:
1.Design a schema, and a set of CREATE TABLE statements, for storing the data that you are going to gather. Tweets contain a number of different kinds of information, both at the top level and nested within, including information about a given Tweet (e.g., its text), information about the Tweet’s content (e.g., hashtags), static info about the Tweet’s user (e.g., their screen name), and dynamic info about the user (e.g., how many times they have Tweeted). (See https://developer.twitter.com/en/docs/tweets/data-dictionary). You will be capturing your Tweet data in a set of 1NF tables by collecting the Tweets with Tweepy and then “splitting them” up by writing their pieces into these tables. Create interrelated tables to hold the following information:
a.Tweet information tables BridgertonTweetInfo and QueenTweetInfo, including columns with names created_at, tweetid, userId, text, and in_reply_to_status_id. Also include a raw_tweet field in your design that captures the original Tweet (in case you realize later on that there’s more information that you’d like to have from the Tweets).
b.One static user information table StaticUserInfo including columns with names userid, username, screen_name, and userdescription.
c.One dynamic user information table DynamicUserInfo, including columns with names userid, statuses_count and followers_count.
d.One tweet entity information table TweetEntity, with columns tweetid and hashtag. The hashtag column contains the hashtags in that tweet. Note that the table should be in 1NF.
2.Write a small Python program that uses your search query to gather two sets of at least 5,000 Tweets and stores them in your PostgreSQL tables. One set should be Tweets about the Bridgerton show, and the other set should be Tweets about the Queen show. Start by testing your program on small sets of Tweets (say 10-100 or so) and then, once everything seems to be working, run it “at scale” to gather your two sets of 5,000 Tweets. Beware of Twitter’s rate limiting as you do this - your program will need to be respectful of that and will have to run for a while. You will want to get this step of the assignment done 1-2 days before the deadline to leave yourself sufficient time for the social analytics part of this HW assignment! (Seriously!)
3.Working (only) with SQL and PostgreSQL, do some simple analysis on your collection of Tweets - record your work and its results (e.g., by creating a script file that you can run again when you are done exploring your data).
a.What are the top 100 most frequent words appearing in each of your collections of Tweets and what is the occurrence rate of each word in each? What sentiment can you infer by just looking at the words (negative, positive, neutral)? (Hint: Recall that you can tokenize a text field in Postgres using string_to_array.)
b.How many different hashtags do your Tweet collections have overall?
c.What is the average number of hashtags per Tweeting user in each Tweet collection?
d.What hashtags are common to Tweets about both of the Netflix shows?
e.For the top 10 hashtags common to both shows, what are the associated occurrence frequencies for each show?
f.What users (if any) are common to both of your datasets?
g.Who are the top 10 Tweeters (include their user id, user name, screen name, and description) by Tweet volume within your combined dataset-- and what are their Tweet counts and their last known followers’ counts? You might want to create indexes on tweet(user_id) on both datasets to accelerate your query execution time.
h.In order to speed your work, you should create a GIN (generalized inverted index for text indexing) on the text field of your main Tweets table.
i.One (or both) of the shows may have sparked multiple events that are reflected in social media. Iteratively, try to query your data and look for one of the events that went viral, and form a set of keywords that are unique to this event. Use your keyword set to look for tweets that have one or many of those keywords. (That’s where the index can help.) Briefly explain your findings. (Note: This is an open-ended question and it has no right or wrong answer. However, it is a step closer to what you will do in your project.)
What To Turn In
When you have finished your assignment you should use Gradescope to turn in one PDF file that shows your Jupyter notebook work for Part 2 as well as your SQL code and output for Parts 1 and 3. When submitting to Gradescope, be sure to mark which pages correspond to which questions; you may lose credit if this is not done correctly.
Additionally, you should submit both your Jupyter and SQL code files to Canvas.
软件开发、广告设计客服
QQ:99515681
邮箱:99515681@qq.com
工作时间:8:00-23:00
微信:codinghelp
热点项目
更多
代做ceng0013 design of a pro...
2024-11-13
代做mech4880 refrigeration a...
2024-11-13
代做mcd1350: media studies a...
2024-11-13
代写fint b338f (autumn 2024)...
2024-11-13
代做engd3000 design of tunab...
2024-11-13
代做n1611 financial economet...
2024-11-13
代做econ 2331: economic and ...
2024-11-13
代做cs770/870 assignment 8代...
2024-11-13
代写amath 481/581 autumn qua...
2024-11-13
代做ccc8013 the process of s...
2024-11-13
代写csit040 – modern comput...
2024-11-13
代写econ 2070: introduc2on t...
2024-11-13
代写cct260, project 2 person...
2024-11-13
热点标签
mktg2509
csci 2600
38170
lng302
csse3010
phas3226
77938
arch1162
engn4536/engn6536
acx5903
comp151101
phl245
cse12
comp9312
stat3016/6016
phas0038
comp2140
6qqmb312
xjco3011
rest0005
ematm0051
5qqmn219
lubs5062m
eee8155
cege0100
eap033
artd1109
mat246
etc3430
ecmm462
mis102
inft6800
ddes9903
comp6521
comp9517
comp3331/9331
comp4337
comp6008
comp9414
bu.231.790.81
man00150m
csb352h
math1041
eengm4100
isys1002
08
6057cem
mktg3504
mthm036
mtrx1701
mth3241
eeee3086
cmp-7038b
cmp-7000a
ints4010
econ2151
infs5710
fins5516
fin3309
fins5510
gsoe9340
math2007
math2036
soee5010
mark3088
infs3605
elec9714
comp2271
ma214
comp2211
infs3604
600426
sit254
acct3091
bbt405
msin0116
com107/com113
mark5826
sit120
comp9021
eco2101
eeen40700
cs253
ece3114
ecmm447
chns3000
math377
itd102
comp9444
comp(2041|9044)
econ0060
econ7230
mgt001371
ecs-323
cs6250
mgdi60012
mdia2012
comm221001
comm5000
ma1008
engl642
econ241
com333
math367
mis201
nbs-7041x
meek16104
econ2003
comm1190
mbas902
comp-1027
dpst1091
comp7315
eppd1033
m06
ee3025
msci231
bb113/bbs1063
fc709
comp3425
comp9417
econ42915
cb9101
math1102e
chme0017
fc307
mkt60104
5522usst
litr1-uc6201.200
ee1102
cosc2803
math39512
omp9727
int2067/int5051
bsb151
mgt253
fc021
babs2202
mis2002s
phya21
18-213
cege0012
mdia1002
math38032
mech5125
07
cisc102
mgx3110
cs240
11175
fin3020s
eco3420
ictten622
comp9727
cpt111
de114102d
mgm320h5s
bafi1019
math21112
efim20036
mn-3503
fins5568
110.807
bcpm000028
info6030
bma0092
bcpm0054
math20212
ce335
cs365
cenv6141
ftec5580
math2010
ec3450
comm1170
ecmt1010
csci-ua.0480-003
econ12-200
ib3960
ectb60h3f
cs247—assignment
tk3163
ics3u
ib3j80
comp20008
comp9334
eppd1063
acct2343
cct109
isys1055/3412
math350-real
math2014
eec180
stat141b
econ2101
msinm014/msing014/msing014b
fit2004
comp643
bu1002
cm2030
联系我们
- QQ: 9951568
© 2021
www.rj363.com
软件定制开发网!