DS 5230 Unsupervised Machine Learning and
Data Mining – Spring 2021 – Homework 1
Submission Instructions
• It is recommended that you complete these exercises in Python 3 and submit
your solutions as a Jupyter notebook.
• You may use any other language, as long as you include a README with
simple, clear instructions on how to run (and if necessary compile) your code.
• Please upload all files (code, README, written answers, etc.) to GradeScope
in a single zip file.
Exercise 1: Understanding Apriori and FP growth
1. Consider a dataset for frequent set mining as in the following table where we
have 6 binary features and each row represents a transaction.
TID Items
a. Illustrate the first three passes of the Apriori algorithm (set sizes 1, 2 and
3) for support threshold of 3 transactions. For each stage, list the candidate
sets Ck and the frequent sets Lk. What are the maximal frequent sets
discovered in the first 3 levels?
b. Pick one of the maximal sets that has more than 1 item, and check if any
of its subsets are association rules with support (i.e. frequency) at least
0.3 and confidence at least 0.6. Please explain your answer and show your
work.
2. Given the following transaction database, let the min support = 2, answer the
following questions.
a. Construct the FP-tree from the transaction database and draw it here.
b. Draw d’s conditional FP-tree, and find the frequent patterns (i.e., itemsets)
based on d’s conditional FP-tree.
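The level-wise passes in Exercise 1 can be sketched in a few lines of Python. This is a minimal illustration, not a reference solution: the transactions below are hypothetical toy data (not the table from the assignment), and the helper names `apriori` and `confidence` are made up for this sketch.

```python
from itertools import combinations

def apriori(transactions, min_count, max_size=3):
    """Return {k: {itemset: support_count}} of frequent itemsets for k = 1..max_size."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]  # C1: every single item
    frequent = {}
    for k in range(1, max_size + 1):
        # Count how many transactions contain each candidate in Ck.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}  # Lk
        if not level:
            break
        frequent[k] = level
        # Join step: unions of two frequent k-sets that have exactly k+1 items.
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: keep a candidate only if all of its k-subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k))]
    return frequent

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent: count(A and C) / count(A)."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n_ante = sum(1 for t in transactions if antecedent <= frozenset(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= frozenset(t))
    return n_both / n_ante if n_ante else 0.0

# Hypothetical toy transactions, support threshold of 3 transactions.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}, {"d"}]
frequent = apriori(toy, min_count=3)
for k, level in frequent.items():
    print(k, sorted((sorted(s), n) for s, n in level.items()))
print("conf(a -> b) =", confidence(toy, {"a"}, {"b"}))
```

On this toy data every pair of a, b, c is frequent but the triple {a, b, c} appears only twice, so the pairs are the maximal frequent sets; the real table will of course give different results.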
Exercise 2: Probability Basics
1. Let X and Y be two independent random variables with densities p(x) and p(y),
respectively. Show the following two properties:
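\[ \mathbb{E}_{p(x,y)}[X + aY] = \mathbb{E}_{p(x)}[X] + a\,\mathbb{E}_{p(y)}[Y] \tag{1} \]
\[ \operatorname{Var}_{p(x,y)}[X + aY] = \operatorname{Var}_{p(x)}[X] + a^{2}\,\operatorname{Var}_{p(y)}[Y] \tag{2} \]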
for any scalar constant a ∈ R. Hint: use the definition of expectation and
variance,
\[ \mathbb{E}_{p(x)}[X] = \int_{x} p(x)\,x\,dx \tag{3} \]
\[ \operatorname{Var}_{p(x)}[X] = \mathbb{E}_{p(x)}[X^{2}] - \left(\mathbb{E}_{p(x)}[X]\right)^{2} \tag{4} \]
2. Let X be a random variable with Beta distribution,
\[ p(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)} \tag{5} \]
where B(α, β) is the beta function. Prove that
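\[ \mathbb{E}[X] = \frac{\alpha}{\alpha + \beta} \tag{6} \]
\[ \operatorname{Var}[X] = \frac{\alpha\beta}{(\alpha + \beta)^{2}(\alpha + \beta + 1)} \tag{7} \]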
3. Suppose we observe N i.i.d data points D = {x1, x2, ..., xN }, where each
xn ∈ {1, 2, ..., K} is a random variable with categorical (discrete) distribution
parameterized by θ = (θ1, θ2, ..., θK), i.e.,
xn ∼ Cat(θ1, θ2, ..., θK), n = 1, 2, ..., N (8)
In detail, this distribution means that for a specific n, the random variable xn
follows P(xn = k) = θk, k = 1, 2, ..., K.
Equivalently, we can also write the density function of a categorical distribution
as
\[ p(x_n) = \prod_{k=1}^{K} \theta_k^{I[x_n = k]} \tag{9} \]
where I[xn = k] is the indicator function, defined as
\[ I[x_n = k] = \begin{cases} 1, & \text{if } x_n = k \\ 0, & \text{otherwise} \end{cases} \tag{10} \]
a. Now we want to prove that the joint distribution of multiple i.i.d categorical
variables is a multinomial distribution. Show that the density function of
D = {x1, x2, ..., xN } is
\[ p(D \mid \theta) = \prod_{k=1}^{K} \theta_k^{N_k} \tag{11} \]
where $N_k = \sum_{n=1}^{N} I[x_n = k]$ is the number of random variables belonging
to category k. In other words, D = {x1, x2, ..., xN } follows a multinomial
distribution.
b. We often call p(D|θ) the likelihood function, since it indicates how probable
it is to observe this dataset given the model parameters θ. By Bayes' rule, we
can write the posterior as
\[ p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)} \tag{12} \]
where p(θ) is the prior distribution, which encodes our prior knowledge about
the model parameters, and p(D) is the distribution of the observations
(data), which is constant with respect to θ. Thus we can write
p(θ|D) ∝ p(D|θ)p(θ) (13)
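If we assume a Dirichlet prior on θ, i.e.,
\[ p(\theta; \alpha_1, \alpha_2, \ldots, \alpha_K) = \mathrm{Dir}(\theta; \alpha_1, \alpha_2, \ldots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \tag{14} \]
where B(α) is the multivariate beta function and α = (α1, α2, ..., αK).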
Now derive the joint distribution p(D, θ), ignoring the normalizing
term that depends only on α. Show that the posterior is also a Dirichlet
distribution, parameterized as follows:
p(θ|D) = Dir(θ; α1 + N1, α2 + N2, ..., αK + NK) (15)
[In fact, this property is called conjugacy in machine learning. A general
statement is: if the prior distribution is conjugate to the likelihood, then
the posterior belongs to the same family of distributions as the prior. Search
for "conjugate prior" and "exponential family" for more detail if you are interested.]
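Before writing out the proofs, it can help to check the target identities numerically. The sketch below is a sanity check under assumed parameter values, not a proof: it compares Monte Carlo estimates against equations (1), (2), (6) and (7), and applies the count update from equation (15) to a small hypothetical dataset. The helper name `dirichlet_posterior` and all numeric values are chosen for illustration only.

```python
import random
from collections import Counter
from statistics import fmean, pvariance

rng = random.Random(0)  # fixed seed so the check is repeatable
n = 100_000

# Equations (1)-(2): independent X and Y (Gaussians here, an arbitrary choice).
a = 4.0
xs = [rng.gauss(1.0, 2.0) for _ in range(n)]    # E[X] = 1,  Var[X] = 4
ys = [rng.gauss(-3.0, 0.5) for _ in range(n)]   # E[Y] = -3, Var[Y] = 0.25
zs = [x + a * y for x, y in zip(xs, ys)]
expected_mean = 1.0 + a * (-3.0)                # right-hand side of (1)
expected_var = 4.0 + a ** 2 * 0.25              # right-hand side of (2)
print(fmean(zs), expected_mean, pvariance(zs), expected_var)

# Equations (6)-(7): Beta(alpha, beta) mean and variance.
alpha, beta = 2.0, 5.0
bs = [rng.betavariate(alpha, beta) for _ in range(n)]
beta_mean = alpha / (alpha + beta)
beta_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
print(fmean(bs), beta_mean, pvariance(bs), beta_var)

def dirichlet_posterior(prior_alpha, data):
    """Equation (15): add the category counts N_k to the prior parameters alpha_k."""
    counts = Counter(data)
    return [a_k + counts.get(k, 0) for k, a_k in enumerate(prior_alpha, start=1)]

data = [1, 3, 3, 2, 3, 1]                       # hypothetical observations, K = 3
posterior = dirichlet_posterior([1.0, 1.0, 1.0], data)
print(posterior)
```

Agreement between the sampled and closed-form values only suggests the identities hold for these parameters; the exercise still asks for the derivations.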
Before you start on the implementation exercises, install Jupyter and PySpark
by following the instructions in Instructions on PySpark Installation.pdf.
Exercise 3: Exploratory Analysis and Data Visualization
In this exercise, we will be looking at a public citation dataset from Aminer
(https://aminer.org/), a free online service used to index and search academic social
networks. You will perform some exploratory analysis and data visualization for this
dataset. The dataset is up to the year 2012 and can be downloaded in the attached
file called q3 dataset.txt. We show an example item format in README.txt.
The ArnetMiner public citation dataset is a real world dataset containing lots of
noise. For example, you may see a venue name like “The Truth About Managing
People...And Nothing But the Truth”. However, you are not expected to do data
cleaning in this phase.
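A first step is turning the text file into records. The actual item format is given in README.txt, which is not reproduced here, so the sketch below assumes a layout: each record starts with a `#*` title line, followed by `#@` (comma-separated authors), `#t` (year), `#c` (venue), `#index` (publication id) and zero or more `#%` (cited id) lines. These prefixes, the field names, and the sample text are all assumptions; adjust them to match README.txt.

```python
def parse_records(lines):
    """Yield one dict per publication from an iterable of text lines.

    Assumed (hypothetical) layout: '#*' title starts a record, then
    '#@' authors, '#t' year, '#c' venue, '#index' id, '#%' cited id.
    """
    record = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("#*"):            # a title line starts a new record
            if record is not None:
                yield record
            record = {"title": line[2:].strip(), "authors": [], "year": None,
                      "venue": "", "index": None, "refs": []}
        elif record is None:
            continue                         # skip anything before the first title
        elif line.startswith("#@"):
            record["authors"] = [a.strip() for a in line[2:].split(",") if a.strip()]
        elif line.startswith("#t"):
            year_text = line[2:].strip()
            record["year"] = int(year_text) if year_text.isdigit() else None
        elif line.startswith("#c"):
            record["venue"] = line[2:].strip()
        elif line.startswith("#index"):
            record["index"] = line[6:].strip()
        elif line.startswith("#%"):
            ref = line[2:].strip()
            if ref:                          # ignore empty reference lines
                record["refs"].append(ref)
    if record is not None:
        yield record

# Hypothetical two-record sample in the assumed layout.
sample = """#*Toy Paper One
#@Alice Example, Bob Example
#t1994
#cVenue X
#index1

#*Toy Paper Two
#@Alice Example
#t1996
#cVenue Y
#index2
#%1
""".splitlines()

records = list(parse_records(sample))
authors = {a for r in records for a in r["authors"]}
venues = {r["venue"] for r in records}
print(len(records), len(authors), len(venues))
```

For the real file, passing an open file handle (for example `open(path, encoding="utf-8")`) instead of `sample` streams the records without loading the whole file into memory.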
1. Count the number of distinct authors, publication venues (conferences and
journals), and publications in the dataset.
a. List each of the counts.
b. Are these numbers likely to be accurate? As an example, look up all the
publication venue names associated with the conference “Principles and
Practice of Knowledge Discovery in Databases” [1].
c. In addition to the problem in 1.b, what other problems arise when you
try to determine the number of distinct authors in a dataset?
2. We will now look at the publications associated with each author and venue.
a. For each author, construct the list of publications. Plot a histogram of the
number of publications per author (use a logarithmic scale on the y axis).
[1] https://en.wikipedia.org/wiki/ECML_PKDD
b. Calculate the mean and standard deviation of the number of publications
per author. Also calculate the Q1 (1st quartile [2]), Q2 (2nd quartile, or
median) and Q3 (3rd quartile) values. Compare the median to the mean
and explain the difference between the two values based on the standard
deviation and the 1st and 3rd quartiles.
c. Now construct a list of publications for each venue. Plot a histogram of
the number of publications per venue. Also calculate the mean, standard
deviation, median, Q1 and Q3 values. What is the venue with the largest
number of publications in the dataset?
3. Now construct the list of references (that is, the cited publications) for each
publication. Then in turn use this set to calculate the number of citations for
each publication (that is, the number of times a publication is cited).
a. Plot a histogram of the number of references and citations per publication.
What is the publication with the largest number of references? What is
the publication with the largest number of citations?
b. Calculate the so called impact factor for each venue. To do so, calculate
the total number of citations for the publications in the venue, and then
divide this number by the number of publications for the venue. Plot a
histogram of the results.
c. What is the venue with the highest apparent impact factor? Do you believe
this number?
d. Now repeat the calculation from part b, but restrict the calculation to
venues with at least 10 publications. How does your histogram change?
List the citation counts for all publications from the venue with the highest
impact factor. How does the impact factor (mean number of citations)
compare to the median number of citations?
e. Finally, construct a list of publications for each publication year. Use
this list to plot the average number of references and average number of
citations per publication as a function of time. Explain the differences you
see in the trends.
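The summary statistics and the per-venue impact factor described above can be sketched with the standard library. The records here are hypothetical toy data, the helper name `summarize` is made up, and the minimum-publication cutoff is lowered to 3 so that it has a visible effect on five papers (the exercise uses 10). Quartiles are computed with the "inclusive" method of `statistics.quantiles`; other quartile conventions give slightly different values.

```python
from collections import Counter, defaultdict
from statistics import fmean, pstdev, quantiles

# Hypothetical parsed records: (publication id, venue, list of cited ids).
papers = [
    ("p1", "V1", []),
    ("p2", "V1", ["p1"]),
    ("p3", "V2", ["p1", "p2"]),
    ("p4", "V2", ["p1"]),
    ("p5", "V2", ["p1", "p3"]),
]

# Citations per publication: how often each id appears in a reference list.
citations = Counter(ref for _, _, refs in papers for ref in refs)
citation_counts = [citations[pid] for pid, _, _ in papers]

def summarize(values):
    """Mean, population standard deviation, and quartiles of a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"mean": fmean(values), "std": pstdev(values),
            "q1": q1, "median": q2, "q3": q3}

stats = summarize(citation_counts)
print(stats)

# Impact factor as defined in the exercise: mean citations per publication in a venue.
by_venue = defaultdict(list)
for pid, venue, _ in papers:
    by_venue[venue].append(citations[pid])
impact = {v: fmean(c) for v, c in by_venue.items()}

# Same calculation restricted to venues with enough publications.
min_pubs = 3  # toy cutoff; the exercise asks for 10
impact_filtered = {v: fmean(c) for v, c in by_venue.items() if len(c) >= min_pubs}
print(impact, impact_filtered)
```

Even in this tiny example the mean citation count (1.2) sits above the median (1) because one paper collects most of the citations, which is the kind of skew the exercise asks you to explain.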
Exercise 4: Market Basket Analysis of Academic Communities
In this problem, you will try to apply frequent pattern mining techniques to the real
world bibliographic dataset from Aminer (https://aminer.org/). Note that you
are required to consider the whole dataset, rather than running on only
part of it. You may use any Apriori or FP-growth implementation that is
made available in existing libraries. We recommend that you use the implementation
in Spark (http://spark.apache.org/).
[2] https://en.wikipedia.org/wiki/Quartile
1. The dataset included with this problem is q4 dataset.txt. Parse this data, and
comment on how it differs from the previous file (q3 dataset.txt), in terms of
number of publications, authors, venues, references, and years of publication.
2. Coauthor discovery: Please use FP-Growth to analyze coauthor relationships,
treating each paper as a basket of authors.
a. What happens when you successively decrease the support threshold using
the values {1e-4, 1e-5, 0.5e-5, 1e-6}?
b. Keep threshold = 0.5e-5 and report the top 5 co-authors for these researchers:
Rakesh Agrawal, Jiawei Han, Zoubin Ghahramani and Christos Faloutsos
according to frequency.
3. Academic community discovery: In computer science, communities tend to
organize around conferences. Here are 5 key conferences for areas of data science:
• Machine learning: NIPS/NeurIPS (Neural Information Processing Systems) [3]
• Data mining: KDD (Conference on Knowledge Discovery and Data Mining)
• Databases: VLDB (Very Large Data Bases)
• Computer networks: INFOCOM (International Conference on Computer
Communications)
• Natural language processing: ACL (Association for Computational Linguistics)
a. We will now use FP growth to analyze academic communities. To do so,
represent each author as a basket in which the items are the venues in which
the author has at least one publication. What happens as you decrease the
support threshold using values {1e-3, 5e-4, 1e-4}?
b. Keep the threshold=5e-4 and report results. For each of those 5 key
conferences, please report the top 3 venues that authors also publish in.
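To build intuition for how a relative support threshold behaves, the sketch below counts co-occurring pairs by brute force. It is deliberately not FP-growth and will not scale to the full dataset (use a library implementation such as Spark's for that); it assumes support is expressed as a fraction of the number of baskets, and the baskets, item names, and helper names `frequent_pairs` and `top_partners` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import ceil

def frequent_pairs(baskets, min_support):
    """Return {pair: count} for pairs in at least min_support * len(baskets) baskets."""
    min_count = ceil(min_support * len(baskets))
    counts = Counter()
    for basket in baskets:
        # Each unordered pair is counted once per basket.
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}

def top_partners(pairs, item, k=5):
    """The k items that most often share a frequent pair with `item`."""
    partners = [(b if a == item else a, n)
                for (a, b), n in pairs.items() if item in (a, b)]
    return sorted(partners, key=lambda p: (-p[1], p[0]))[:k]

# Hypothetical baskets: each paper is a basket of author names.
baskets = [["A", "B"], ["A", "B", "C"], ["A", "C"], ["A", "B"]]

# Lowering the threshold admits progressively rarer pairs.
sizes = [len(frequent_pairs(baskets, s)) for s in (0.75, 0.5, 0.25)]
print(sizes)

partners_of_a = top_partners(frequent_pairs(baskets, 0.5), "A")
print(partners_of_a)
```

The same idea carries over to the venue baskets in part 3: swap authors for venues, one basket per author, and look up the partners of each key conference.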
[3] NIPS is now abbreviated as NeurIPS, but this dataset only contains references to NIPS.