首页
网站开发
桌面应用
管理软件
微信开发
App开发
嵌入式软件
工具软件
数据采集与分析
其他
首页
>
> 详细
Python程序调试、辅导program编程设计、讲解Python编程语言 辅导Database|调试Matlab程序
项目预算:
开发周期:
发布时间:
要求地区:
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 1/8
HW3
PageRank
What is the most important website on the internet? Who is the "key player" on a sports team? Which countries
are the most central players in the world economy? There is no one correct answer to any of these questions,
but there is a most profitable one. PageRank (https://en.wikipedia.org/wiki/PageRank) is an algorithm for
ranking individual elements of complex systems, invited by Sergey Brin and Larry Page. It was the first and
most famous algorithm used by the Google Search engine, and it is fair to say that the internet as we know it
today would not exist without PageRank.
In this assignment, we will implement PageRank. There are many good ways to implement this algorithm, but in
this assignment we will use our newfound skills with object-oriented programming and iterators.
How it works
For the purposes of this example, let's assume that we are talking about webpages. PageRank works by
allowing a "random surfer" to move around webpages by following links. Each time the surfer lands on a page, it
then looks for all the links on that page. It then picks one at random and follows it, thereby arriving at the next
page, where the process repeats. Eventually, the surfer will visit all the pages one or more times. Pages that the
surfer visits more frequently have higher PageRank scores. Because the surfer moves between linked pages,
PageRank expresses an intuitive idea: important pages are linked to other important pages. This diagram
(https://en.wikipedia.org/wiki/PageRank#/media/File:PageRanks-Example.jpg) from Wikipedia gives a nice
illustration. Note that more important webpages (higher PageRank) tend to be connected to other important
webpages.
A schematic for PageRank.
Data
You can complete this assignment using data from one of two sources.
Option 1: Hamilton
This data set comes from the hit Broadway musical "Hamilton."
The logo of the musical Hamilton, showing a
silhouette dressed in period custom standing on
top of a five-pointed star.
The Hamilton data set
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 2/8
The good folks at The Hamilton Project (http://hamilton.newtfire.org/) analyzed the script for us, obtaining data
on who talks about whom in each of the show's songs. When character A mentions character B, we'll think of
this as a link from A to B, suggesting that B might be important.
If you use this data set, listening to the soundtrack while working is strongly recommended.
Option 2: Global Airline Network
A map of the world, with many curved green
lines connecting different points on the map.
The points are airports, and the curved green
lines are flight routes.
The global airline network
Back in the Before Times, lots of people flew on airplanes. This data set includes a "link" from Airport A to
Airport B whenever there is a flight from B to A. This data set was collected by the OpenFlights Project
(https://openflights.org/data.html).
(A). Define Functions
In this part, all you have to do is hit shift + enter on the code block supplied below. This block defines two
functions. The first one retrieves the data from the internet and saves it to your local computer, while the second
reads in the data, producing a list of tuples. It's not important for you to be familiar with the code in these
functions right now -- we'll discuss them early in Week 4.
In [1]:
import urllib
import csv
def retrieve_data(url):
"""
Retrieve a file from the specified url and save it in a local file
called data.csv. The intended values of url are:
1. https://philchodrow.github.io/PIC16A/homework/HW3-hamilton-data.csv
2. https://philchodrow.github.io/PIC16A/homework/HW3-flights-data.csv
"""
# grab the data and parse it
filedata = urllib.request.urlopen(url)
to_write = filedata.read()
# write to file
with open("data.csv", "wb") as f:
f.write(to_write)
def read_data(path):
"""
read downloaded data from a .csv file, and return a list of tuples.
each tuple represents a link between states.
"""
with open(path, "r") as f:
reader = csv.reader(f)
return [(row[0], row[1]) for row in list(reader)]
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 3/8
(B). Grab the Data
The data live at the following URLs:
Hamilton: https://philchodrow.github.io/PIC16A/homework/HW3-hamilton-data.csv
Airline: https://philchodrow.github.io/PIC16A/homework/HW3-flights-data.csv
In each data set, each row corresponds to a "link" between objects. In Hamilton, the pairs have format
mentioner, mentioned while in the airline network the rows have format origin, destination.
Pick one of these data sets, and set the variable url appropriately by uncommenting one of the two lines
below. Then, call retrieve_data() and read_data() . The path argument for read_data() should be set to
"data.csv" . Create a variable called data to hold the return value of read_data() .
Your solution
In [ ]:
(C). Examine the structure of the data
This would also be a good time to inspect the data to make sure you understand how it is structured. Write a
function describe(n) that describes the meaning of the n th row of the data set you chose. In the Hamilton
data set, your function should do the following:
describe(5)
# output
"Element 5 of the Hamilton data set is ('burr', 'betsy'). This means that Burr mentions Be
tsy in a song."
In context of the airline flights data, your function should instead do this:
describe(5)
# output
"Element 5 of the flights data set is ('SIN', 'BKK'). This means that there is a flight fr
om BKK to SIN."
Please attend to capitalization and formatting. While the standard string concatenation operator + is
completely fine for this task, the fancy str.format() function may make your code somewhat simpler. This
page (https://realpython.com/python-formatted-output/) has some useful examples in case you'd like to try this.
Your Solution
# uncomment the second line if you'd prefer to
# work with the flights data.
url = "https://philchodrow.github.io/PIC16A/homework/HW3-hamilton-data.csv"
# url = "https://philchodrow.github.io/PIC16A/homework/HW3-flights-data.csv"
# Call your functions below
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 4/8
In [ ]:
(D). Data to Dictionary
Write a function called data_to_dictionary that converts the data into a dictionary such that:
1. There is a single key for each character (in Hamilton) or airport (in flights).
2. The value corresponding to each key is a list of the characters/airports to which that key links. The list
should contain repeats if there are multiple links.
Here's an example of the desired behavior on a fake data set.
data = [("a", "b"),
("a", "b"),
("a", "c"),
("b", "c"),
("b", "a")]
data_to_dictionary(data)
# output
{"a" : ["b", "b", "c"], "b" : ["a", "c"]}
Your Solution
In [ ]:
(E). Define a PR_DiGraph class
A directed graph, or DiGraph, is just a set of arrows ("edges") between objects ("nodes"). It is a natural way to
represent data that represents one-way relationships, such as links from one webpage to another or mentions
of one character by another. We already saw a directed graph above when we introduced the idea of
PageRank. Here's a paired-down example.
Six circles, representing nodes, labeled A
through F. There are directed arrows between
certain pairs of nodes.
Example of a directed graph.
Implement a PR_DiGraph class with a custom __init__() method and a linked_by() method. The
__init__() method should accept two arguments: data and iteration_limit . The __init__() method
should then construct an instance variable self.link_dict which is simply the output of data_to_dictionary
applied to the argument data . __init__() should also construct an instance variable
self.iteration_limit , which simply takes the value of iteration_limit supplied to __init__() . Don't
worry about that one for now.
Then, define a method self.linked_by(x) which, when called, returns the value self.link_dict[x] .
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 5/8
Finally, add an __iter__ method, which returns an object of class PR_Iterator . We will define this class in
the next part.
Example session (using Hamilton):
D = PR_DiGraph(data, iteration_limit = 10000)
D.linked_by('peggy')
# output
['peggy', 'schuylerSis']
Your Solution
In [ ]:
(F). Implement PR_Iterator
Define a PR_Iterator class with a custom __next__() method.
The __init__ method of this class should create instance variables to store the PR_DiGraph object from
which it was constructed; a counter i , starting at 0, to log the number of steps taken, and a current_state
variable whose value is one of the keys of the link_dict of the Pr_DiGraph . You can choose its initial value
arbitrarily; in my solution code I chose self.current_state = "hamilton" .
We are going to use iteration to implement the PageRank algorithm. This means we are going to imagine a
surfer who is following the links in our data set. Implement the following two methods:
1. follow_link() .
A. Pick a random new character mentioned by the current character, or new airport served by the current
airport. Let's call this next_state .
B. If next_state != current_state , set current_state to next_state .
C. Otherwise (if next_state == current_state ), teleport (see below).
D. You might run into KeyError s, in which case you should again teleport (use a try-except block).
2. teleport() .
A. Set the current state to a new state (key of the link dict) completely at random.
Hint: use random.choice from the random module to choose elements of lists.
Finally, implement __next__() . __next__() should do follow_link() with 85% probability, and do
teleport() with 15% probability. You should also define a custom StopIteration condition to ensure that
only as many steps are taken as the iteration_limit supplied to the PR_DiGraph initializer.
1. To do something with 85% probability, use the following:
if random.random() < 0.85:
# do the thing
else:
# do the other thing
Example Usage
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 6/8
After you define your class, run the following code and show that it works. Note: your precise sequence may be
different from mine.
D = PR_DiGraph(data, iteration_limit = 5)
for char in D:
print(char)
following link : current state = burr
following link : current state = washington
following link : current state = burr
following link : current state = hamilton
teleporting : current state = washington
I have added printed messages here for you to more clearly see what should be happening, but it is not
necessary for you to do this. It is sufficient for your output to look like:
D = PR_DiGraph(data, iteration_limit = 5)
for char in D:
print(char)
burr
washington
burr
hamilton
washington
Your Solution
In [ ]:
In [ ]:
(G). Compute PageRank
Finally, we are ready to compute the PageRank in our data set. Initialize a PR_DiGraph with a large iteration
limit (say, 1,000,000). Use a for -loop to allow your surfer to randomly move through the data set. The number
of times that the surfer visits state x is the PageRank score of x .
Create a dict which logs how many times a given state appears when iterating through the PR_Digraph . So,
this dictionary holds the PageRank score of each state.
Your Solution
import random
# run the below
D = PR_DiGraph(data, iteration_limit = 5)
for char in D:
print(char)
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 7/8
In [ ]:
(H). Display Your Result
Use your favorite approach to show the results in sorted format, descending by PageRank score. The entries at
the top should be the entries with highest PageRank. What are the most important elements in the data set?
You may show either the complete list or just the top 10.
Check your code by comparing your top 10 to mine. Because we are using a randomized algorithm, your
results will not agree exactly with mine, but they should be relatively close. If your top 10 list is very different,
then you might want to revisit your previous solutions.
For Hamilton, my top 10 were:
[('hamilton', 166062),
('burr', 99180),
('washington', 92246),
('jefferson', 72450),
('eliza', 51485),
('angelica', 48042),
('madison', 37421),
('lafayette', 34297),
('lee', 33678),
('jAdams', 31121)]
For the flights data, my top 10 were:
[('LHR', 18043), # London Heathrow
('ATL', 16370), # Atlanta
('JFK', 14795), # New York JFK
('FRA', 14156), # Frankfurt
('CDG', 14073), # Charles de Gaulle (Paris)
('LAX', 13199), # Los Angeles
('ORD', 12915), # Chicago O'Hare
('PEK', 12525), # Beijing
('AMS', 12410), # Amsterdam Schiphol
('PVG', 11517)] # Shanghai
Your solution
In [ ]:
(I). Submit!
Check that your code is appropriately documented (comments and docstrings), and turn it in.
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 8/8
软件开发、广告设计客服
QQ:99515681
邮箱:99515681@qq.com
工作时间:8:00-23:00
微信:codinghelp
热点项目
更多
代写infosys 110 digital syst...
2024-12-28
代写fbe 506 quantitative met...
2024-12-28
代做part i: (crazy eddie htm...
2024-12-28
代写infosys 110 digital syst...
2024-12-28
代做stats 769 statistics sec...
2024-12-28
代写ece3700j introduction to...
2024-12-28
代做tcm2301 biochemistry代做...
2024-12-28
代做ece5550: applied kalman ...
2024-12-28
代写mth205 introduction to s...
2024-12-28
代写scicomp project 3 week 4...
2024-12-28
代做business operations anal...
2024-12-28
代写mth205 introduction to s...
2024-12-28
代写socs0100 computational t...
2024-12-28
热点标签
mktg2509
csci 2600
38170
lng302
csse3010
phas3226
77938
arch1162
engn4536/engn6536
acx5903
comp151101
phl245
cse12
comp9312
stat3016/6016
phas0038
comp2140
6qqmb312
xjco3011
rest0005
ematm0051
5qqmn219
lubs5062m
eee8155
cege0100
eap033
artd1109
mat246
etc3430
ecmm462
mis102
inft6800
ddes9903
comp6521
comp9517
comp3331/9331
comp4337
comp6008
comp9414
bu.231.790.81
man00150m
csb352h
math1041
eengm4100
isys1002
08
6057cem
mktg3504
mthm036
mtrx1701
mth3241
eeee3086
cmp-7038b
cmp-7000a
ints4010
econ2151
infs5710
fins5516
fin3309
fins5510
gsoe9340
math2007
math2036
soee5010
mark3088
infs3605
elec9714
comp2271
ma214
comp2211
infs3604
600426
sit254
acct3091
bbt405
msin0116
com107/com113
mark5826
sit120
comp9021
eco2101
eeen40700
cs253
ece3114
ecmm447
chns3000
math377
itd102
comp9444
comp(2041|9044)
econ0060
econ7230
mgt001371
ecs-323
cs6250
mgdi60012
mdia2012
comm221001
comm5000
ma1008
engl642
econ241
com333
math367
mis201
nbs-7041x
meek16104
econ2003
comm1190
mbas902
comp-1027
dpst1091
comp7315
eppd1033
m06
ee3025
msci231
bb113/bbs1063
fc709
comp3425
comp9417
econ42915
cb9101
math1102e
chme0017
fc307
mkt60104
5522usst
litr1-uc6201.200
ee1102
cosc2803
math39512
omp9727
int2067/int5051
bsb151
mgt253
fc021
babs2202
mis2002s
phya21
18-213
cege0012
mdia1002
math38032
mech5125
07
cisc102
mgx3110
cs240
11175
fin3020s
eco3420
ictten622
comp9727
cpt111
de114102d
mgm320h5s
bafi1019
math21112
efim20036
mn-3503
fins5568
110.807
bcpm000028
info6030
bma0092
bcpm0054
math20212
ce335
cs365
cenv6141
ftec5580
math2010
ec3450
comm1170
ecmt1010
csci-ua.0480-003
econ12-200
ib3960
ectb60h3f
cs247—assignment
tk3163
ics3u
ib3j80
comp20008
comp9334
eppd1063
acct2343
cct109
isys1055/3412
math350-real
math2014
eec180
stat141b
econ2101
msinm014/msing014/msing014b
fit2004
comp643
bu1002
cm2030
联系我们
- QQ: 9951568
© 2021
www.rj363.com
软件定制开发网!