Week 1 Practical
Introduction to WEKA
What are we doing?
• Download the open-source machine learning tool WEKA and explore its main
features.
• Understand and practise the basic data pre-processing operations that can be
performed using WEKA.
Submission:
You are required to submit one .arff file (after completing the practical task as
instructed in this prac document) via the weekly-practical submission box.
What is WEKA?
WEKA (Waikato Environment for Knowledge Analysis) is a machine learning
toolkit developed at the University of Waikato in Hamilton, New Zealand. The
software provides machine learning algorithms, statistics and other tools
for various types of data mining task, such as classification, cluster detection,
association rule discovery and attribute selection. It also includes
data pre-processing, post-processing and visualisation tools, so that
complete data mining projects can be conducted through several different styles of
user interface. The toolkit is written in Java and can therefore run on various
platforms, such as Linux, Windows and macOS. It is open-source software
distributed under the terms and conditions of the GNU General Public License.
Launching and Starting WEKA
You can find instructions for installing WEKA at
https://waikato.github.io/weka-wiki/downloading_weka/
When you open WEKA you should see a screen like the one below (Figure 1).
[Figure 1]
Select the Explorer option below Applications.
Data Pre-Processing using WEKA
This example illustrates some of the basic data pre-processing operations that can be
performed using WEKA. Unless otherwise indicated, the sample data set used for this
example is the "bank data" set, supplied as a CSV file (Bank Data.csv).
The data contains the following fields:
id            a unique identification number
age           age of customer in years (numeric)
sex           MALE / FEMALE
region        inner_city / rural / suburban / town
income        income of customer (numeric)
married       is the customer married (YES/NO)
children      number of children (numeric)
car           does the customer own a car (YES/NO)
save_acct     does the customer have a savings account (YES/NO)
current_acct  does the customer have a current account (YES/NO)
mortgage      does the customer have a mortgage (YES/NO)
pep           did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
Loading the Data
In addition to its native ARFF data file format, WEKA can read files in ".csv"
format. This is fortunate, since many databases and spreadsheet applications
can save or export data into flat files in this format. An ordinary Microsoft Excel
worksheet can be saved as a CSV file and opened by WEKA. The first row of the
spreadsheet is used to name the attributes, and the data type of each attribute is
inferred automatically, though not always accurately. Once the file is opened, you can
save the data set as an ARFF file in WEKA (by clicking “Save” in the Preprocess tab).
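To make the CSV-to-ARFF step concrete, here is a minimal sketch in plain Python of the kind of conversion described above. It is not WEKA's own loader: the function names, the type-guessing rule and the two sample rows are assumptions made for illustration, and missing values are not handled.

```python
import csv
import io


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="bank-data"):
    """Convert CSV text (header in the first row) to ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            # Every value parses as a number, so guess "numeric".
            lines.append("@attribute " + name + " numeric")
        else:
            # Otherwise list the distinct values as a nominal attribute.
            labels = sorted(set(values))
            lines.append("@attribute " + name + " {" + ",".join(labels) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Two invented rows, used only to exercise the function.
sample = "id,age,sex,income\nID1,48,FEMALE,17546.0\nID2,40,MALE,30085.1\n"
arff = csv_to_arff(sample)
print(arff)
```

The sketch also shows why automatic typing is "not always accurate": a column such as "children" holds only small integers, so a rule like this one guesses numeric even when you would prefer to treat it as a category.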
In this example, we load the data set into WEKA and perform a series of operations using
WEKA's attribute and discretization filters. While all of these operations can be
performed from the command line, we use the graphical interface of the WEKA Explorer.
Initially (in the Preprocess tab) click "Open file..." and navigate to the directory
containing the data file (which is named something like bank-data.csv). This is shown
in [Figure 2].
Once the data is loaded, WEKA will recognize the attributes and during the scan of the
data will compute some basic statistics on each attribute. The left panel in [Figure 3]
shows the list of recognized attributes, while the top panels indicate the names of the
base relation (or table) and the current working relation (which are the same initially).
Note: Recent versions of WEKA have an additional button named “Edit” in the
Preprocess tab for viewing the current contents of the working data set.
Whenever you apply a filter in WEKA, you can see the updated contents in this
viewer. (Alternatively, you can use the “Arff Viewer” tool included in WEKA.
Refer to the WEKA manual for further details.)
[Figure 2]
[Figure 3]
Clicking on any attribute in the left panel will show the basic statistics for that
attribute. For categorical attributes, the frequency of each attribute value is shown,
while for continuous attributes we can obtain the min, max, mean, standard deviation,
etc. As an example, see [Figure 4] below, which shows the result of selecting the
“age” attribute.
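The two kinds of summary can be sketched in a few lines of Python. This is an illustration only, not WEKA code: the function name and sample values are invented, and the use of the sample (n - 1) standard deviation is an assumption.

```python
import statistics
from collections import Counter


def summarize(values):
    """Summarize a column: numeric statistics or label frequencies."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        # Not all values are numeric: report a frequency per label.
        return dict(Counter(values))
    return {
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        "stdev": statistics.stdev(numbers),  # sample standard deviation
    }


ages = ["20", "30", "40", "50"]       # invented values
regions = ["town", "rural", "town"]   # invented values
age_stats = summarize(ages)
region_counts = summarize(regions)
print(age_stats)
print(region_counts)
```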
[Figure 4]
Selecting or Filtering Attributes
In our sample data file, each record is uniquely identified by a customer id (the "id"
attribute). We need to remove this attribute before the data mining step, because a
value that is unique to each record tells us nothing about other records. We can do
this by using the Attribute filters in WEKA.
In the "Filter" panel, click on the "Choose" button.
This will show a popup window with a list of available filters. Scroll down the list and
select the "weka.filters.unsupervised.attribute.Remove" filter as shown in [Figure 5].
Next, click on the text box immediately to the right of the "Choose" button.
In the resulting dialog box, enter the index of the attribute to be filtered out (this can
be a range or a list separated by commas). In this case, we enter 1, which is the index
of the "id" attribute (see the left panel). Make sure that the "invertSelection" option
is set to false (otherwise everything except attribute 1 will be filtered out). Then click
"OK" (see [Figure 6]). Now, in the filter box you will see "Remove -R 1" (see [Figure 7]).
[Figure 5]
[Figure 6]
[Figure 7]
Click the "Apply" button to apply this filter to the data. This will remove the "id"
attribute and create a new working relation (whose name now includes the details of
the filter that was applied). The result is depicted in [Figure 8].
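The effect of the index list and of "invertSelection" can be sketched in plain Python as follows. This is an illustration under stated assumptions, not WEKA's implementation: the function names are invented, and only comma-separated 1-based indices and simple ranges such as "3-5" are parsed.

```python
def parse_indices(spec):
    """Expand a 1-based index spec such as "1,3-5" into a set of ints."""
    selected = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            selected.update(range(int(first), int(last) + 1))
        else:
            selected.add(int(part))
    return selected


def remove_attributes(header, rows, spec, invert_selection=False):
    """Drop the selected columns, or keep only them when inverted."""
    selected = parse_indices(spec)
    keep = [
        i for i in range(len(header))
        if ((i + 1) in selected) == invert_selection
    ]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


header = ["id", "age", "sex"]
rows = [["ID1", "48", "FEMALE"]]   # invented row
print(remove_attributes(header, rows, "1"))
print(remove_attributes(header, rows, "1", invert_selection=True))
```

With the default setting, index 1 removes only "id"; with the selection inverted, "id" is the only attribute left, which is why the option must be false here.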
[Figure 8]
Discretization
Some techniques, such as association rule mining, can only be performed on
categorical data. This requires discretizing numeric or continuous
attributes. (There are 3 such attributes in this data set: "age", "income", and
"children".) Click on the “age” attribute. Again we open the Filter dialog box, but
this time we select the "Discretize" filter from the list (see [Figure 9]).
[Figure 9]
Next, to change the defaults for this filter, click on the box to the right of the "Choose"
button. This will open the Discretize Filter dialog box.
We enter the index of the attribute to be discretized. In this case we enter 1,
corresponding to the attribute "age" (now the first attribute, since "id" has been
removed). We also enter 3 as the number of bins. Note that it is possible to discretize
more than one attribute at the same time by using a list of attribute indexes. Since we
are doing simple binning, all of the other available options are set to "false". The
dialog box is shown in [Figure 10].
Click "Apply" in the Filter panel. This will result in a new working relation with the
selected attribute partitioned into 3 bins (shown in [Figure 11]).
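Simple binning into 3 bins can be sketched as equal-width binning: the range between the smallest and largest value is split into intervals of the same width. The Python below is a minimal illustration with invented function names and sample ages; treating a value that falls exactly on a cut point as part of the lower bin is an assumption.

```python
def equal_width_cut_points(values, bins):
    """Return the bins - 1 interior cut points of equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    return [low + width * k for k in range(1, bins)]


def bin_index(value, cut_points):
    """Return the 0-based bin for value; a cut point belongs to the lower bin."""
    for k, cut in enumerate(cut_points):
        if value <= cut:
            return k
    return len(cut_points)


ages = [18, 25, 34, 41, 52, 66]   # invented values
cuts = equal_width_cut_points(ages, 3)
binned = [bin_index(a, cuts) for a in ages]
print(cuts)    # two cut points split the range 18..66 into three intervals
print(binned)  # bin number for each age
```

Note that equal-width bins need not hold equal numbers of records: here the first bin receives three ages and the second only one.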
Finally, save the file under a name such as "bank-data-final.arff".
Submit this final filtered ARFF file as evidence of your work for this weekly
practical.
[Figure 10]
[Figure 11]
Other Useful Filters in WEKA
WEKA provides many more useful pre-processing filters in addition to the ones we
tried in this exercise. Some of them are briefly described below. You are
encouraged to refer to the WEKA manual for further details and to try applying
some of them to the bank data as an exercise of your own.
In WEKA, data pre-processing is done using attribute or instance filters, which can
be supervised or unsupervised. Attribute filters are applied to attributes
(columns) and instance filters are applied to data objects (rows). Supervised filters
take a class attribute into account, whereas unsupervised filters do not.
(Many unsupervised filters have a supervised counterpart. Supervised filters must be
used with care for classification tasks; test examples must be pre-processed in the
same way as the training examples.)
The many other filters for data pre-processing have not been described here due to
limitations of space. Filters in WEKA are continuously developed and new filters are
constantly added in new versions.
Add attribute filter
Using the “Add” filter, we can create a new attribute (with missing values by
default) and specify its location, name and labels. Once the attribute is created,
its values can be entered manually for each data object in the viewer
window.
New numeric features can be added with the “AddExpression” filter, which
applies a mathematical expression based on the values of other attributes.
Numeric transformation attribute filters
The “MathExpression” filter transforms values using a valid mathematical
expression built from arithmetic operators and built-in functions, such as
absolute value (abs), logarithm (log), square root (sqrt), etc.
The “NumericTransform” filter only allows transformations by methods
supported by the Java math library. Unlike AddExpression, these filters do not
create new attributes but replace the current values with the transformed
values.
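The difference between adding an attribute and transforming one in place can be illustrated with a short Python sketch. The function names, the derived attribute "income_per_year" and the sample rows are all invented for this illustration; this is not the expression syntax WEKA uses.

```python
import math


def add_expression(rows, func, name):
    """Return copies of rows with a new attribute computed by func."""
    return [dict(row, **{name: func(row)}) for row in rows]


def math_expression(rows, attribute, func):
    """Return copies of rows with one attribute's values replaced."""
    return [dict(row, **{attribute: func(row[attribute])}) for row in rows]


# Invented rows, used only to exercise the two functions.
rows = [{"age": 40.0, "income": 10000.0}, {"age": 50.0, "income": 100.0}]
added = add_expression(rows, lambda r: r["income"] / r["age"], "income_per_year")
replaced = math_expression(rows, "income", math.log10)
print(added[0])     # original attributes plus the new one
print(replaced[0])  # same attributes, "income" now on a log scale
```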
Transformation attribute filters
The “Normalize” filter rescales the values of all numeric attributes in the
loaded data set so that they fall within a common range. The default range is [0, 1].
The user can change the target range if needed.
The “Standardize” filter standardizes all numeric attributes to have zero mean
and unit variance.
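The arithmetic behind these two filters can be sketched in Python as below. The function names and sample values are invented; the "scale" and "translation" parameters are modelled on the Normalize filter's options of the same name, while mapping a constant column to the lower bound and using the sample (n - 1) standard deviation are assumptions of this sketch.

```python
import statistics


def normalize(values, scale=1.0, translation=0.0):
    """Min-max rescale values into [translation, translation + scale]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column has no spread; map every value to the lower bound.
        return [translation for _ in values]
    return [(v - low) / (high - low) * scale + translation for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit sample variance."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]


incomes = [10.0, 20.0, 30.0]   # invented values
print(normalize(incomes))
print(normalize(incomes, scale=2.0, translation=-1.0))
print(standardize(incomes))
```

The sketch of standardize assumes the column is not constant; a constant column would make the standard deviation zero and the division fail.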
ReplaceMissingValues filter
This rudimentary filter fills in missing values: numeric values are replaced with
the sample mean and nominal values are replaced with the sample mode. The
user can also fill in missing values manually in the viewer window (opened with
“Edit”). For numeric attributes, the user may enter any value. For nominal
attributes, the user can only select one of the nominal labels that already exist
in the attribute domain. If the label does not exist (for instance, it is a special
code indicating unknown), it can be added to the attribute domain by
using the “AddValues” filter.
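Mean and mode replacement can be sketched in Python as follows. The function name and sample columns are invented, and the sketch assumes that missing values are written as "?" (the ARFF convention) and that each column has at least one value present.

```python
import statistics

MISSING = "?"  # ARFF marks a missing value with a question mark


def replace_missing(values, numeric):
    """Fill MISSING entries with the mean (numeric) or mode (nominal)."""
    present = [v for v in values if v != MISSING]
    if numeric:
        fill = statistics.mean(float(v) for v in present)
        return [fill if v == MISSING else float(v) for v in values]
    fill = statistics.mode(present)
    return [fill if v == MISSING else v for v in values]


ages = ["30", "?", "50"]               # invented values
married = ["YES", "?", "YES", "NO"]    # invented values
print(replace_missing(ages, numeric=True))
print(replace_missing(married, numeric=False))
```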
Resample instance filter
This filter selects a random sample of a given percentage (the sampleSizePercent
parameter) of the loaded data set, with or without replacement (to sample
without replacement, set the noReplacement parameter to True). The
unsupervised Resample filter draws the sample from the entire data set,
reflecting the real distribution of attribute values, including class values. The
supervised Resample filter draws samples according to either the real
distribution of classes (set the biasToUniformClass parameter to 0) or a
uniform distribution of classes (set the biasToUniformClass parameter to 1).
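The difference between sampling with and without replacement can be sketched in Python as below. This covers only the unsupervised case and is not WEKA's implementation: the function, its seed argument and the row values are invented, with parameter names chosen to mirror the filter's options.

```python
import random


def resample(rows, sample_size_percent=100.0, no_replacement=False, seed=1):
    """Draw a random sample of the given percentage of rows."""
    rng = random.Random(seed)
    size = int(len(rows) * sample_size_percent / 100.0)
    if no_replacement:
        # Each row can appear at most once in the sample.
        return rng.sample(rows, size)
    # With replacement, the same row may be drawn several times.
    return [rng.choice(rows) for _ in range(size)]


rows = list(range(10))   # stand-ins for ten data objects
half = resample(rows, 50.0, no_replacement=True)
double = resample(rows, 200.0)
print(half)
print(double)
```

Without replacement the sample cannot be larger than the data set, so in this sketch a percentage above 100 is only valid when sampling with replacement.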