insertyou

Weka Beginner's Tutorial

 
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Contents</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">1. Introduction</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">WEKA stands for Waikato Environment for Knowledge Analysis, and its source code is available at http://www.cs.waikato.ac.nz.sixxs.org/ml/weka. The weka is also a bird found in New Zealand, and WEKA's main developers come from New Zealand.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">As an open data mining workbench, WEKA brings together a large number of machine learning algorithms for data mining tasks, including data preprocessing, classification, regression, clustering, association rules, and visualization in its new interactive interface.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">If you want to implement data mining algorithms yourself, take a look at WEKA's interface documentation. Integrating your own algorithm into WEKA, or even borrowing its approach to build your own visualization tools, is not very difficult.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">In August 2005, at the 11th ACM SIGKDD International Conference, the Weka group at the University of Waikato received the highest service award in the field of data mining and knowledge discovery. The Weka system has won wide recognition, has been hailed as a milestone in the history of data mining and machine learning, and is one of the most complete data mining tools available today (with 11 years of development behind it). Weka is downloaded more than ten thousand times a month.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">--Adapted from <a href="http://www.china-pub.com.sixxs.org/computers/common/info.asp?id=29304">http://www.china-pub.com.sixxs.org/computers/common/info.asp?id=29304</a></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">2. Data format</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Even the cleverest cook cannot make a meal without rice. So let us first look at what format the data used by WEKA should take.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Like many spreadsheet and data analysis programs, WEKA works on a data set that is a two-dimensional table like the one in Figure 1.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Figure 1</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Here we need to introduce some WEKA terminology. A row of the table is called an instance (Instance), equivalent to a sample in statistics or a record in a database. A column is called an attribute (Attribute), equivalent to a variable in statistics or a field in a database. Such a table, or data set, is seen by WEKA as expressing a relation (Relation) among the attributes. Figure 1 contains 14 instances and 5 attributes, and the relation is named "weather".</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">WEKA stores data in ARFF (Attribute-Relation File Format) files, which are ASCII text files. The two-dimensional table shown in Figure 1 is stored in the ARFF file below. This is the "weather.arff" file that ships with WEKA; it can be found in the "data" subdirectory of the WEKA installation directory.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><code><span style="font-size: 12pt;" lang="EN-US"><span style="">% ARFF file for the weather data with some numric features</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">%</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">@relation weather</span></span></code><span lang="EN-US"><br><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">@attribute outlook {sunny, overcast, rainy}</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">@attribute temperature real</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">@attribute humidity real</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">@attribute windy {TRUE, FALSE}</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">@attribute play {yes, no}</span></span></code><span lang="EN-US"><br><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">@data</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">%</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">% 14 instances</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">%</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">sunny,85,85,FALSE,no</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">sunny,80,90,TRUE,no</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 
12pt;" lang="EN-US"><span style="">overcast,83,86,FALSE,yes</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">rainy,70,96,FALSE,yes</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">rainy,68,80,FALSE,yes</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">rainy,65,70,TRUE,no</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">overcast,64,65,TRUE,yes</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">sunny,72,95,FALSE,no</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">sunny,69,70,FALSE,yes</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">rainy,75,80,FALSE,yes</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">sunny,75,70,TRUE,yes</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">overcast,72,90,TRUE,yes</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">overcast,81,75,FALSE,yes</span></span></code><span lang="EN-US"><br></span><code><span style="font-size: 12pt;" lang="EN-US"><span style="">rainy,71,91,TRUE,no</span></span></code></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><code><span style="font-size: 12pt;" lang="EN-US"><span style=""></span></span></code></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Note that when this file is opened in Windows Notepad, the lines may not break correctly because the line-ending conventions differ. A text editor such as UltraEdit is recommended for viewing the contents of ARFF files. <span style="color: #00b050;">Felomeng's note: the line ending is "\n" on Linux/Unix, "\r\n" on Windows, and "\r" on classic Mac OS. In WEKA, "\n" should be used.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">The contents of this file are explained below.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">WEKA relies on line breaks to parse an ARFF file, so lines in such a file must not be broken arbitrarily. Blank lines (or lines consisting only of spaces) are ignored.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Lines beginning with "%" are comments, and WEKA ignores them. If your copy of "weather.arff" has more or fewer lines beginning with "%", it makes no difference.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">With the comments removed, the whole ARFF file can be divided into two parts. The first part is the header information (Header information), which contains the declaration of the relation and the declarations of the attributes. The second part is the data information (Data information), that is, the data given in the data set. Everything from the "@data" marker onward is data information.</span></p>
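<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">The two-part layout can be sketched in a few lines of Python. This is only an illustration of the rules above, not WEKA's own loader; the function name split_arff and the shortened sample text are assumptions made for the example.</span></p>

```python
def split_arff(text):
    """Split ARFF text into (header_lines, data_lines).

    Blank lines and lines starting with "%" are dropped, and everything
    after the "@data" marker is treated as data.
    """
    header, data = [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank lines and comments are ignored
        if not in_data and line.lower() == "@data":
            in_data = True  # the marker itself belongs to neither list
            continue
        (data if in_data else header).append(line)
    return header, data


# Hypothetical, shortened sample in the style of weather.arff
sample = """% a comment
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute play {yes, no}

@data
% another comment
sunny,no
overcast,yes
"""

header, data = split_arff(sample)
print(header)
print(data)
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">The header list then holds the relation and attribute declarations, and the data list holds one string per instance.</span></p>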
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Relation declaration</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">The relation name is defined on the first valid line of the ARFF file, in the format</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">@relation &lt;relation-name&gt; <span style="color: #00b050;">//Felomeng's note:</span> this is just a name; whatever name you choose has no effect on the content of the file.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">&lt;relation-name&gt; is a string. If the string contains spaces, it must be enclosed in quotes (ASCII single or double quotes).</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Attribute declarations</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Attribute declarations are written as a list of statements beginning with "@attribute". Every attribute in the data set has its own "@attribute" statement, which defines its name and data type.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">The order of these declarations matters. First, it determines the position of each attribute in the data section. For example, "humidity" is the third attribute declared, which means that among the comma-separated columns of the data section, the third column, 85 90 86 96 ..., holds the corresponding "humidity" values. Second, the last attribute declared is called the class attribute; in classification or regression tasks it is the default target variable.</span></p>
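<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">A small Python sketch of how declaration order maps to data columns, using the first three rows of the weather data. The variable names are illustrative assumptions, and the splitting is deliberately naive (it assumes unquoted attribute names and values).</span></p>

```python
header = [
    "@attribute outlook {sunny, overcast, rainy}",
    "@attribute temperature real",
    "@attribute humidity real",
    "@attribute windy {TRUE, FALSE}",
    "@attribute play {yes, no}",
]
rows = [
    "sunny,85,85,FALSE,no",
    "sunny,80,90,TRUE,no",
    "overcast,83,86,FALSE,yes",
]

# Attribute names in declaration order: list position == column position.
names = [line.split(None, 2)[1] for line in header]

# "humidity" is declared third, so its values sit in the third column.
col = names.index("humidity")
humidity = [row.split(",")[col] for row in rows]

# The last declared attribute is the default class (target) attribute.
class_attribute = names[-1]

print(col, humidity, class_attribute)
```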
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">The format of an attribute declaration is</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">@attribute &lt;attribute-name&gt; &lt;datatype&gt; <span style="color: #00b050;">//Felomeng's note: &lt;attribute-name&gt;</span> is just a name, and whatever name you choose has no effect on the content of the file, whereas <span style="color: #00b050;">datatype</span> matters.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Here &lt;attribute-name&gt; is a string that must begin with a letter. As with the relation name, if the string contains spaces it must be enclosed in quotes.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">WEKA supports four kinds of &lt;datatype&gt;:</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">numeric-------------------------numeric type</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">&lt;nominal-specification&gt;-----nominal (categorical) type</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">string----------------------------string type</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">date [&lt;date-format&gt;]--------date and time type</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">&lt;nominal-specification&gt; and &lt;date-format&gt; are explained below. The two types "integer" and "real" can also be used, but WEKA treats both of them as "numeric". Note that the keywords "integer", "real", "numeric", "date" and "string" are not case-sensitive, and neither are "@relation", "@attribute" and "@data".</span></p>
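<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">As a rough illustration of the four types, the Python sketch below maps a &lt;datatype&gt; string to its kind and folds "integer" and "real" into "numeric". The function name classify_datatype is hypothetical. The sketch lowercases the keyword before comparing, which is an assumption of this example; it accepts spellings such as the DATE used in the timestamp example later in this section.</span></p>

```python
def classify_datatype(spec):
    """Map an ARFF <datatype> string to numeric, nominal, string or date."""
    spec = spec.strip()
    if spec.startswith("{") and spec.endswith("}"):
        return "nominal"  # a {a, b, c} list of category names
    keyword = spec.split()[0].lower()  # assumption: compare case-insensitively
    if keyword in ("numeric", "integer", "real"):
        return "numeric"  # integer and real are treated as numeric
    if keyword in ("string", "date"):
        return keyword
    raise ValueError(f"unsupported datatype: {spec!r}")


for spec in ["real", "{TRUE, FALSE}", "string", 'DATE "yyyy-MM-dd HH:mm:ss"']:
    print(spec, "->", classify_datatype(spec))
```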
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Numeric attributes</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Numeric attributes can be integers or real numbers, but WEKA treats them all as real numbers.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Nominal attributes <span style="color: #00b050;">//Felomeng's note:</span> in <span style="color: #00b050;">weka</span>, classification can only be performed on attributes of this type.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">A nominal attribute is defined by a &lt;nominal-specification&gt;, which lists the possible category names inside curly braces: {&lt;nominal-name1&gt;, &lt;nominal-name2&gt;, &lt;nominal-name3&gt;, ...}. The value of this attribute in the data set must be one of these categories.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">For example, the following attribute declaration states that the "outlook" attribute has three categories: "sunny", "overcast" and "rainy". The "outlook" value of every instance in the data set must be one of these three.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">@attribute outlook {sunny, overcast, rainy}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">If a category name contains spaces, it must again be enclosed in quotes.</span></p>
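<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">A minimal Python sketch of checking values against a nominal specification. It assumes unquoted category names, as in the weather data, and compares values exactly, so a different capitalisation counts as a different value. The function name and the test values are made up for the example.</span></p>

```python
def nominal_categories(spec):
    """Return the category names of an unquoted {a, b, c} specification."""
    inner = spec.strip().strip("{}")
    return [name.strip() for name in inner.split(",")]


categories = nominal_categories("{sunny, overcast, rainy}")

# Hypothetical column values; only exact matches are accepted.
values = ["sunny", "rainy", "cloudy", "Sunny"]
invalid = [v for v in values if v not in categories]

print(categories)
print(invalid)
```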
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">String attributes</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">String attributes can contain arbitrary text. This type of attribute is very useful in text mining.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Example:</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">@ATTRIBUTE LCC string</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Date and time attributes</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Date and time attributes are both represented by the "date" type, in the format</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">@attribute &lt;name&gt; date [&lt;date-format&gt;]</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Here &lt;name&gt; is the name of the attribute and &lt;date-format&gt; is a string that specifies how date or time values are parsed and displayed. The default string is the ISO-8601 combined date and time format "yyyy-MM-dd'T'HH:mm:ss".</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Strings that express dates in the data section must conform to the format specified in the declaration (an example is given later).</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Data information</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">In the data section, the "@data" marker occupies a line of its own; the rest is the data of the individual instances.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Each instance occupies one line. The attribute values of an instance are separated by commas ",". If the value of an attribute is missing (missing value), it is written as a question mark "?", and this question mark cannot be omitted. For example:</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">@data</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">sunny,85,85,FALSE,no</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">?,78,90,?,yes</span></span></p>
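<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">The handling of "?" can be sketched in Python as follows. The function name parse_row is an assumption, and the plain comma split only works for unquoted values such as those in the two rows above.</span></p>

```python
def parse_row(line):
    """Split one data row on commas; a "?" entry becomes None (missing)."""
    values = [v.strip() for v in line.split(",")]
    return [None if v == "?" else v for v in values]


rows = [parse_row("sunny,85,85,FALSE,no"), parse_row("?,78,90,?,yes")]

# Count the missing values in each instance.
missing_per_row = [sum(v is None for v in row) for row in rows]

print(rows)
print(missing_per_row)
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Because the question mark keeps its position in the row, the remaining values still line up with their attributes.</span></p>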
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">The values of string attributes and nominal attributes are case-sensitive. If a value contains spaces, it must be enclosed in quotes. For example:</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">@relation LCCvsLCSH</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>@attribute LCC string</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>@attribute LCSH string</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>@data</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>AG5, 'Encyclopedias and dictionaries.;Twentieth century.'</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>AS262, 'Science -- Soviet Union -- History.'</span></span></span></p>
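<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Quoted values like these cannot be split on commas blindly. One way to read them, sketched here with Python's standard csv module, is to declare the single quote as the quote character. This is an illustration only and covers the single-quoted style used in the example, not every quoting form ARFF allows.</span></p>

```python
import csv

lines = [
    "AG5, 'Encyclopedias and dictionaries.;Twentieth century.'",
    "AS262, 'Science -- Soviet Union -- History.'",
]

# quotechar="'" keeps each quoted value in one field;
# skipinitialspace=True ignores the space that follows each comma.
records = list(csv.reader(lines, quotechar="'", skipinitialspace=True))

for lcc, lcsh in records:
    print(lcc, "|", lcsh)
```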
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">The values of a date attribute must match the format given in the attribute declaration. For example:</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">@RELATION Timestamps</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss" </span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>@DATA </span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>"2001-04-03 12:12:12"</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>"2001-05-03 12:59:55"</span></span></span></p>
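<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">To show what "matching the declared format" means, here is a Python sketch that parses the two timestamps above. The strptime pattern is a hand-made translation of "yyyy-MM-dd HH:mm:ss" and is an assumption of this example; WEKA itself interprets the pattern string in Java.</span></p>

```python
from datetime import datetime

# Hand-translated Python equivalent of the pattern "yyyy-MM-dd HH:mm:ss"
PY_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(value):
    """Parse one data value after removing the surrounding double quotes."""
    return datetime.strptime(value.strip().strip('"'), PY_FORMAT)


stamps = [
    parse_timestamp('"2001-04-03 12:12:12"'),
    parse_timestamp('"2001-05-03 12:59:55"'),
]
print(stamps)

# A value in another layout is rejected with ValueError.
try:
    parse_timestamp('"03/04/2001"')
except ValueError as err:
    print("rejected:", err)
```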
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Sparse data</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">Sometimes a data set contains a large number of 0 values (in market basket analysis, for example). In that case, storing the data in sparse format saves space.</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">The sparse format only concerns how an individual instance is written in the data section; the other parts of the ARFF file do not need to change. Consider the following data:</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">@data</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>0, X, 0, Y, "class A"</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>0, 0, W, 0, "class B"</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;">In sparse format this becomes</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">@data</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>{1 X, 3 Y, 4 "class A"}</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>{2 W, 4 "class B"}</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Each instance is enclosed in curly braces. Every non-zero attribute value in an instance is written as &lt;index&gt; &lt;space&gt; &lt;value&gt;, where &lt;index&gt; is the position of the attribute, counted from 0, and &lt;value&gt; is the attribute value. Entries are still separated by commas. Within an instance the entries must appear in attribute order: {1 X, 3 Y, 4 "class A"} is valid, whereas {3 Y, 1 X, 4 "class A"} is not.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Note that an attribute value omitted in sparse format is not a missing value; it is the value 0. A missing value must be written explicitly with a question mark.</span></span></p>
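<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The sparse encoding described above can be sketched in a few lines of Python. This is only an illustration of the rules (0-based index, zeros omitted, missing values written as a question mark); the function name to_sparse_row is hypothetical and is not part of WEKA, and the quoting rule is simplified to strings containing a space.</span></span></p>

```python
def to_sparse_row(values):
    """Format one instance as an ARFF sparse-format data line.

    Zero values are omitted; indices are 0-based and stay in attribute
    order because the list is walked from left to right.
    """
    entries = []
    for index, value in enumerate(values):
        if value == 0:
            continue  # an omitted entry means the value 0, not "missing"
        if value is None:
            text = "?"  # a missing value must be written explicitly
        elif isinstance(value, str) and " " in value:
            text = '"' + value + '"'  # quote strings that contain a space
        else:
            text = str(value)
        entries.append(f"{index} {text}")
    return "{" + ", ".join(entries) + "}"


print(to_sparse_row([0, "X", 0, "Y", "class A"]))  # {1 X, 3 Y, 4 "class A"}
print(to_sparse_row([0, 0, "W", 0, "class B"]))    # {2 W, 4 "class B"}
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The two calls reproduce the two sparse data lines of the example.</span></span></p>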
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Relational attributes</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">WEKA 3.5 added an attribute type called Relational, which lets us handle multiple dimensions much as a relational database does. This type is not yet widely used, so we do not cover it here.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">3. Data preparation</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">When using WEKA for data mining, the first problem we usually face is that our data is not in ARFF format. Fortunately, WEKA also supports CSV files, a format that many other programs can produce. In addition, WEKA can read from databases through JDBC.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">In this section we first use Excel and Matlab as examples to show how to obtain a CSV file. We then see how to convert a CSV file into an ARFF file, since ARFF is the format WEKA supports best. Even with an ARFF file in hand, some preprocessing is still needed before mining can begin.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">.* -&gt; .csv </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">We give an example CSV file (bank-data.csv). Opening it in UltraEdit shows that CSV, too, is a plain-text format with comma-separated values, here storing a two-dimensional table.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">An Excel XLS file can hold several two-dimensional tables on different worksheets (Sheet), so each worksheet has to be saved as its own CSV file. Open the XLS file, switch to the worksheet to be converted, save it with the file type CSV, and click “OK” and then “Yes” to dismiss the prompts.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">In Matlab a two-dimensional table is a matrix, and the following command saves a matrix in CSV format.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">csvwrite('filename',matrixname) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Note that a CSV file produced by Matlab usually has no attribute names (one produced by Excel may lack them too). WEKA, however, always reads the attribute names from the first line of a CSV file, so without a header line it treats the first row of values as the attribute names. For a CSV file from Matlab we therefore open it in UltraEdit and add a line of attribute names by hand. The number of names must match the number of attributes in the data, and the names are again separated by commas.</span></span></p>
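<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Adding the header line can also be scripted instead of done by hand. The following Python sketch prepends a line of attribute names to headerless CSV text and checks that the number of names matches the number of columns. The function name add_header and the sample values are made up for illustration.</span></span></p>

```python
import csv
import io


def add_header(csv_text, names):
    """Return csv_text with a line of attribute names prepended."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # the number of names must match the number of columns in the data
    if rows and len(rows[0]) != len(names):
        raise ValueError(
            f"{len(names)} names given for {len(rows[0])} columns")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)   # header line first
    writer.writerows(rows)   # then the original data rows
    return out.getvalue()


# hypothetical headerless data, as csvwrite might produce it
headerless = "48,17546.0,1\n40,30085.1,3\n"
print(add_header(headerless, ["age", "income", "children"]))
```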
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">.csv -&gt; .arff </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The quickest way to convert CSV to ARFF is the command-line tool that ships with WEKA.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Start the main WEKA program; once the GUI appears, the buttons at the bottom open the individual modules. Click “Simple CLI” to open the command-line module. In the input box at the very bottom of the new window (the area above it is read-only), type</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">java weka.core.converters.CSVLoader filename.csv &gt; filename.arff </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">to perform the conversion.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">WEKA 3.5 provides an “Arff Viewer” module, which can open a CSV file for browsing and then save it as an ARFF file.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Alternatively, enter the “Explorer” module, open the CSV file with the buttons at the top, and save it as an ARFF file.</span></span></p>
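<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">To make clear what such a conversion involves, here is a rough Python sketch of turning CSV text into ARFF text. It is not how WEKA's CSVLoader is implemented; it assumes the first line holds the attribute names, treats a column as numeric only if every value parses as a number, and ignores quoting and missing values. The function names are hypothetical.</span></span></p>

```python
import csv
import io


def is_number(text):
    """True if text parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="data"):
    """Convert CSV text (first line = attribute names) to ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    names, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for col, name in enumerate(names):
        column = [row[col] for row in data]
        if all(is_number(v) for v in column):
            kind = "numeric"
        else:
            # nominal: distinct values in order of first appearance
            kind = "{" + ",".join(dict.fromkeys(column)) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# made-up rows in the style of bank-data.csv
sample = "age,sex,pep\n48,FEMALE,YES\n40,MALE,NO\n"
print(csv_to_arff(sample, relation="bank"))
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">For the sample above, “age” is declared numeric while “sex” and “pep” become nominal attributes listing their observed values.</span></span></p>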
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The “Explorer” interface</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">“Explorer” offers many more functions; it is arguably the most used module in WEKA. We first get familiar with its interface and then use it to preprocess the data.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style=""><span style="font-size: small; font-family: Calibri;"></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Figure 2</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Figure 2 shows "bank-data.csv" opened in the version 3.5 "Explorer". We divide the interface into 8 regions according to function.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Region 1 holds the tabs that switch between the panels for different mining tasks. This section uses only “Preprocess”; the other panels are introduced later.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Region 2 contains common buttons for opening, saving and editing data. Here we save "bank-data.csv" as "bank-data.arff".</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">In region 3, “Choose” a “Filter” to select a subset of the data or to apply some transformation to it. Data preprocessing is done mainly through this region.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Region 4 shows some basic information about the dataset.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Region 5 lists all attributes of the dataset. Tick some attributes and click “Remove” to delete them; a deletion can be reverted with the “Undo” button in region 2. The row of buttons at the top of region 5 is for quick selection.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Selecting an attribute in region 5 displays a summary of it in region 6. Note that numeric and nominal attributes are summarized differently. The figure shows the summary for the numeric attribute “income”.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Region 7 is a histogram of the attribute selected in region 5. If the last attribute of the dataset (as mentioned, the default target variable for classification or regression tasks) is nominal (here “pep” is), each bar of the histogram is split into colored segments in proportion to the values of that variable. To segment by a different variable, pick another nominal attribute from the drop-down box above region 7. Choosing “No Class” or a numeric attribute there gives a black-and-white histogram.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Region 8 is the status bar, where the Log can be inspected for errors. If the weka bird on the right is moving, WEKA is running a mining task. Right-clicking the status bar also lets you trigger Java garbage collection.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Preprocessing</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The attributes of the bank-data dataset have the following meanings:</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">id a unique identification number </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">age age of customer in years (numeric) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">sex MALE / FEMALE </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">region inner_city/rural/suburban/town </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">income income of customer (numeric) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">married is the customer married (YES/NO) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">children number of children (numeric) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">car does the customer own a car (YES/NO) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">save_acct does the customer have a saving account (YES/NO) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">current_acct does the customer have a current account (YES/NO) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">mortgage does the customer have a mortgage (YES/NO) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">pep did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">For data mining tasks, information such as an ID is usually useless, so we delete it. In region 5, tick the attribute “id” and click “Remove”. Save the new dataset and open the resulting ARFF file in UltraEdit. In the attribute declaration section, WEKA has already chosen a suitable type for each attribute.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Some algorithms can only handle data in which every attribute is nominal. In that case the numeric attributes must be discretized. This dataset has 3 numeric variables: “age”, “income” and “children”.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Of these, “children” takes only 4 values: 0, 1, 2 and 3. We can therefore edit the ARFF file directly in UltraEdit and change</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">@attribute children numeric </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">to</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">@attribute children {0,1,2,3} </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">That is all that is required.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Reopen “bank-data.arff” in “Explorer” and select the “children” attribute: the “Type” shown in region 6 should now read “Nominal”.</span></span></p>
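<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The same edit of the attribute declaration can be made with a short script instead of a text editor. The Python sketch below rewrites the numeric declaration of one attribute into a nominal one; the function name make_nominal is hypothetical, and the sketch assumes the declaration sits on a single line as in the example above.</span></span></p>

```python
import re


def make_nominal(arff_text, name, values):
    """Rewrite the numeric declaration of attribute `name` as nominal."""
    pattern = re.compile(
        rf"^(@attribute\s+{re.escape(name)}\s+)numeric[ \t]*$",
        re.IGNORECASE | re.MULTILINE)
    labels = "{" + ",".join(str(v) for v in values) + "}"
    # keep the '@attribute name ' prefix, replace only the type
    return pattern.sub(lambda m: m.group(1) + labels, arff_text)


header = "@attribute age numeric\n@attribute children numeric\n@data\n"
print(make_nominal(header, "children", [0, 1, 2, 3]))
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Only the declaration of “children” changes; the data rows stay as they are, because the values 0 to 3 are valid labels of the new nominal attribute.</span></span></p>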
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">To discretize “age” and “income” we use the WEKA Filter named “Discretize”. Click “Choose” in region 3 to bring up the “Filter tree”, navigate down to “weka.filters.unsupervised.attribute.Discretize”, and click it. If the tree does not close, click on the “Explorer” panel anywhere outside the tree.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The text box next to “Choose” should now show “Discretize -B 10 -M -0.1 -R first-last”. Clicking this text box opens a new window in which the discretization parameters can be changed.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">We do not want to discretize every attribute, only the 1st and the 4th (see the numbers to the left of the attribute names in region 5), so we change attributeIndices to “1,4”. We plan to split both attributes into 3 intervals, so we change “bins” to “3”. The other fields can be left alone; click “More” to see what they mean. Click “OK” to return to “Explorer” and then “Apply” to run the filter; “age” and “income” are now discretized into nominal attributes. To discard the discretization, click “Undo” in region 2.</span></span></p>
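<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Splitting an attribute into 3 bins of equal width can be sketched in Python as follows. This is an illustration of equal-width binning, not WEKA's own code; the function names are hypothetical, the ages are made-up values, and the label format only approximates the one WEKA writes.</span></span></p>

```python
def equal_width_cut_points(values, bins):
    """Return the bins-1 interior cut points of equal-width binning."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    return [low + width * i for i in range(1, bins)]


def bin_labels(cuts):
    """Build interval labels in the style '(-inf-34.333333]'."""
    edges = ["-inf"] + [f"{c:.6f}" for c in cuts] + ["inf"]
    labels = []
    for i in range(len(edges) - 1):
        # every interval is closed on the right except the last one
        closing = ")" if i == len(edges) - 2 else "]"
        labels.append(f"({edges[i]}-{edges[i + 1]}{closing}")
    return labels


def discretize(value, cuts, labels):
    """Map a numeric value to the label of the interval containing it."""
    for cut, label in zip(cuts, labels):
        if value <= cut:
            return label
    return labels[-1]


ages = [18, 23, 35, 48, 51, 67]  # illustrative values, not the real data
cuts = equal_width_cut_points(ages, 3)
labels = bin_labels(cuts)
print(labels)
print([discretize(a, cuts, labels) for a in ages])
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">With a minimum of 18 and a maximum of 67 the bin width is 49/3, so the first cut point is 34.333333 and the first label is (-inf-34.333333].</span></span></p>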
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">If cryptic labels such as "(-inf-34.333333]" are unsatisfactory, open the saved ARFF file in UltraEdit and replace every occurrence of '\'(-inf-34.333333]\'' with 0_34. Replace the other labels by hand in the same way.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">We save the dataset obtained from the steps above as bank-data-final.arff.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">----Adapted from http://maya.cs.depaul.edu.sixxs.org/~classes/ect584/WEKA/preprocess.html </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">4. Association rules (market basket analysis)</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Note: at present, WEKA's association rule analysis is suitable only for demonstration, not for mining large datasets.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">We now run an association rule analysis on the “bank-data” data from above. Open “bank-data-final.arff” in “Explorer” and switch to the “Associate” tab. The default algorithm for association rule analysis is Apriori, which we keep, but we click the text box to the right of “Choose” to change its default parameters; in the window that pops up, “More” shows a description of each parameter.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;"><span style="">背景知识</span><span style="font-family: Calibri;"> </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;"><span style="">首先我们来温习一下</span><span lang="EN-US"><span style="font-family: Calibri;">Apriori</span></span><span style="">的有关知识。对于一条关联规则</span><span lang="EN-US"><span style="font-family: Calibri;">L-&gt;R</span></span><span style="">,我们常用支持度(</span><span lang="EN-US"><span style="font-family: Calibri;">Support</span></span><span style="">)和置信度(</span><span lang="EN-US"><span style="font-family: Calibri;">Confidence</span></span><span style="">)来衡量它的重要性。规则的支持度是用来估计在一个购物篮中同时观察到</span><span lang="EN-US"><span style="font-family: Calibri;">L</span></span><span style="">和</span><span lang="EN-US"><span style="font-family: Calibri;">R</span></span><span style="">的概率</span><span lang="EN-US"><span style="font-family: Calibri;">P(L,R)</span></span><span style="">,而规则的置信度是估计购物栏中出现了</span><span lang="EN-US"><span style="font-family: Calibri;">L</span></span><span style="">时也出会现</span><span lang="EN-US"><span style="font-family: Calibri;">R</span></span><span style="">的条件概率</span><span lang="EN-US"><span style="font-family: Calibri;">P(R|L)</span></span><span style="">。关联规则的目标一般是产生支持度和置信度都较高的规则。</span><span style="font-family: Calibri;"> </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="font-size: small;"><span style="">有几个类似的度量代替置信度来衡量规则的关联程度,它们分别是</span><span style="font-family: Calibri;"> </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Lift: P(L,R)/(P(L)P(R)) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Lift=1 means that L and R are independent. The larger the value, the stronger the indication that L and R appearing in the same basket is not a coincidence. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Leverage: P(L,R)-P(L)P(R) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Its meaning is similar to that of Lift. Leverage=0 means that L and R are independent; the larger the Leverage, the more closely L and R are related. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Conviction: P(L)P(!R)/P(L,!R) (!R means that R did not occur) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Conviction also measures the independence of L and R. Its relationship to Lift (negate R, substitute into the Lift formula, then take the reciprocal) shows that Conviction=1 corresponds to independence and that we again want this value to be as large as possible. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Note that Lift and Leverage are symmetric in L and R, whereas Confidence and Conviction are not. </span></span></p>
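<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">To make the five measures concrete, here is a minimal Python sketch that computes them from raw counts using the textbook formulas above. The function name rule_metrics and the counts (600 baskets, 113 containing L, 80 containing R, 61 containing both) are illustrative assumptions, not part of WEKA, and WEKA's own implementation may round or smooth some values differently. </span></span></p>

```python
def rule_metrics(n_total, n_left, n_right, n_both):
    """Support, confidence, lift, leverage and conviction for a rule L -> R.

    n_total: number of baskets; n_left / n_right: baskets containing L / R;
    n_both: baskets containing both L and R. All names are hypothetical.
    """
    p_l = n_left / n_total
    p_r = n_right / n_total
    p_lr = n_both / n_total
    p_l_not_r = (n_left - n_both) / n_total  # P(L, !R)
    return {
        "support": p_lr,
        "confidence": p_lr / p_l,
        "lift": p_lr / (p_l * p_r),
        "leverage": p_lr - p_l * p_r,
        # Conviction is unbounded when L never occurs without R.
        "conviction": p_l * (1 - p_r) / p_l_not_r if p_l_not_r > 0 else float("inf"),
    }


# Illustrative counts (assumed, see the text above).
metrics = rule_metrics(n_total=600, n_left=113, n_right=80, n_both=61)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Swapping n_left and n_right leaves lift and leverage unchanged but changes confidence and conviction, which is the asymmetry noted above. </span></span></p>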
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Parameter settings </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Suppose we want to mine the association rules whose support lies between 10% and 100%, whose lift exceeds 1.5, and whose lift ranks in the top 100. We set “lowerBoundMinSupport” and “upperBoundMinSupport” to 0.1 and 1 respectively, “metricType” to lift, “minMetric” to 1.5, and “numRules” to 100. The other options can keep their defaults. After clicking “OK”, click “Start” in the “Explorer” to run the algorithm; the dataset summary and the mining results appear in the window on the right. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Below are the top 5 mined rules ranked by lift. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Best rules found:</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">1. age=52_max save_act=YES current_act=YES 113 ==&gt; income=43759_max 61 conf:(0.54) &lt; lift:(4.05)&gt; lev:(0.08) [45] conv:(1.85)</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>2. income=43759_max 80 ==&gt; age=52_max save_act=YES current_act=YES 61 conf:(0.76) &lt; lift:(4.05)&gt; lev:(0.08) [45] conv:(3.25)</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>3. income=43759_max current_act=YES 63 ==&gt; age=52_max save_act=YES 61 conf:(0.97) &lt; lift:(3.85)&gt; lev:(0.08) [45] conv:(15.72)</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>4. age=52_max save_act=YES 151 ==&gt; income=43759_max current_act=YES 61 conf:(0.4) &lt; lift:(3.85)&gt; lev:(0.08) [45] conv:(1.49)</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>5. age=52_max save_act=YES 151 ==&gt; income=43759_max 76 conf:(0.5) &lt; lift:(3.77)&gt; lev:(0.09) [55] conv:(1.72) </span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">For each mined rule, WEKA lists four measures of its association strength: confidence (conf), lift, leverage (lev), and conviction (conv). </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Command-line approach </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">We can also complete the mining task from the command line. In the “Simple CLI” module, enter a command of the following form: </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">java weka.associations.Apriori options -t directory-path\bank-data-final.arff </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">This runs the Apriori algorithm. Note that the file path after the “-t” option must not contain spaces. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The options we used earlier were </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">-N 100 -T 1 -C 1.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0 </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Running the command line with these options gives the same result as the GUI run above. </span></span></p>
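<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">As a rough aid for relating these flags to the GUI parameters, the following Python sketch assembles the same command string from a dictionary. The flag meanings in the comments reflect my understanding of WEKA's Apriori options and should be checked against the documentation of your WEKA version; the helper build_command is a hypothetical name introduced only for illustration. </span></span></p>

```python
# Flag meanings are my reading of WEKA's Apriori options (an assumption to verify).
options = {
    "-N": 100,   # numRules
    "-T": 1,     # metricType: 0=confidence, 1=lift, 2=leverage, 3=conviction
    "-C": 1.5,   # minMetric
    "-D": 0.05,  # delta (step by which the minimum support is lowered)
    "-U": 1.0,   # upperBoundMinSupport
    "-M": 0.1,   # lowerBoundMinSupport
    "-S": -1.0,  # significanceLevel (-1.0 leaves significance testing off)
}


def build_command(options, arff_path):
    """Join the option flags and the training file into one command line."""
    parts = ["java", "weka.associations.Apriori"]
    for flag, value in options.items():
        parts += [flag, str(value)]
    parts += ["-t", arff_path]
    return " ".join(parts)


command = build_command(options, r"d:\weka\bank-data-final.arff")
print(command)
```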
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">We can also add the “-I” option to output the frequent itemsets of each size. The command I used was: </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">java weka.associations.Apriori -N 100 -T 1 -C 1.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -I -t d:\weka\bank-data-final.arff </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The mining results are displayed in the upper pane and should look like this file.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">5. Classification and Regression </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Background </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">WEKA places both classification and regression under the “Classify” tab, and for good reason. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Both tasks involve a target attribute (the output variable). We want to predict the target from a set of features (the input variables) of a sample (called an instance in WEKA). To do this we need a training dataset in which the inputs and the output of every instance are known. By examining the training instances we can build a predictive model, and with this model we can make predictions for new instances whose output is unknown. The quality of a model is judged by the accuracy of its predictions. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">In WEKA the target (output) to be predicted is called the Class attribute, a name that presumably comes from the “class” of classification tasks. In general, the task is called classification when the Class attribute is nominal, and regression when it is numeric. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Choosing an algorithm </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">In this section we use the C4.5 decision tree algorithm to build a classification model for bank-data. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Let us look at the original “bank-data.csv” file. The “ID” attribute is certainly not needed. Because C4.5 can handle numeric attributes, we do not have to discretize every variable into a nominal one as we did for association rules. Even so, we still convert the “Children” attribute into a nominal attribute with two values, “YES” and “NO”. In addition, our training set takes only half of the instances of the original dataset; from the other half we draw a number of instances to serve as the instances to be predicted, with their “pep” attribute set to missing. The training set prepared this way can be downloaded here; the set to be predicted can be downloaded here. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Open the training set “bank.arff” in the “Explorer” and check that it has been prepared as described above. Switch to the “Classify” tab and click the “Choose” button; many classification and regression algorithms appear, grouped by category in a tree view. In WEKA 3.5 there is a “Filter...” button below the tree view; clicking it lets you filter out algorithms that do not suit the characteristics of the dataset. The input attributes of our dataset include “Binary” attributes (nominal attributes with only two values) and numeric attributes, and the Class variable is “Binary”, so we tick “Binary attributes”, “Numeric attributes”, and “Binary class”. After clicking “OK” and returning to the tree view, some algorithm names have turned red, meaning they cannot be used. Choose “J48” under “trees”; this is the C4.5 algorithm we need, and fortunately it has not turned red. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Click the text box to the right of “Choose” to open a new window for setting the algorithm's parameters. Click “More” to see the parameter descriptions and “Capabilities” to see what kinds of data the algorithm can handle. Here we keep the default parameters. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Now look at “Test options” in the middle left. We have not set aside a separate test dataset, so to obtain an accuracy estimate that is not inflated by overfitting we use 10-fold cross validation to select and evaluate the model. If cross validation is unfamiliar, you can Google it. </span></span></p>
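<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">For readers who prefer code to a search engine, here is a minimal Python sketch of how k-fold cross validation splits the data. It is a plain, non-stratified split written for illustration only; it is not WEKA's implementation, and the function name k_fold_indices, the seed, and the dataset size of 300 are assumptions. </span></span></p>

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Illustrative dataset size (assumed).
splits = list(k_fold_indices(300, k=10))
for train, test in splits:
    print(len(train), len(test))
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">A model is trained on each train part and scored on the matching test part; the reported accuracy pools the ten test results, so no instance is ever scored by a model that saw it during training. </span></span></p>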
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Modeling results </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">OK, select “Cross-validation” and enter “10” in the “Folds” box. Click the “Start” button to have the algorithm build the decision tree model. Very soon a decision tree in text form, together with an error analysis of the tree and other results, appears in “Classifier output” on the right. At the same time a new entry showing the time of the run and the algorithm name appears in the “Result list” at the lower left. If you switch to another model or change a parameter and click “Start” again, the “Result list” gains another entry. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">One of the cross-validation results for the “J48” algorithm is </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Correctly Classified Instances 206 68.6667 % </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">In other words, the accuracy of this model is only about 69%. We might need to preprocess the original attributes or tune the algorithm's parameters to improve it, but here we leave it alone and continue with this model. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Right-click the entry that just appeared in the “Result list” and choose “Visualize tree” from the pop-up menu; a new window shows the decision tree in graphical form. It helps to maximize this window, then right-click and choose “Fit to screen” to see the tree more clearly. When you are done, take a screenshot or close it :P </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Here we explain what the “Confusion Matrix” means. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">=== Confusion Matrix ===</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>a b &lt;-- classified as</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>74 64 | a = YES</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>30 132 | b = NO </span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The matrix says that of the instances whose “pep” is actually “YES”, 74 were correctly predicted as “YES” and 64 were wrongly predicted as “NO”; of the instances whose “pep” is actually “NO”, 30 were wrongly predicted as “YES” and 132 were correctly predicted as “NO”. 74+64+30+132 = 300 is the total number of instances, and (74+132)/300 = 0.68667 is exactly the proportion of correctly classified instances. The larger the numbers on the diagonal of this matrix, the better the predictions. </span></span></p>
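<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The same arithmetic can be written as a short Python sketch. It takes the four counts from the matrix above, recomputes the accuracy, and also derives per-class recall and precision, two quantities the text does not discuss but which follow directly from the rows and columns. The variable names are illustrative. </span></span></p>

```python
labels = ["YES", "NO"]
# Rows are the actual class, columns are the predicted class.
matrix = [
    [74, 64],   # actual YES
    [30, 132],  # actual NO
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall: share of each actual class that was predicted correctly (row-wise).
recall = {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(len(labels))}
# Precision: share of each predicted class that was correct (column-wise).
precision = {
    labels[j]: matrix[j][j] / sum(row[j] for row in matrix)
    for j in range(len(labels))
}

print(f"total={total} accuracy={accuracy:.5f}")
for label in labels:
    print(f"{label}: recall={recall[label]:.4f} precision={precision[label]:.4f}")
```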
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Applying the model </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Now we use the generated model to make predictions for the dataset to be predicted. Note that the attributes of the dataset to be predicted must be set up exactly as in the training dataset. Even if you do not have values for the Class attribute in the dataset to be predicted, you must still add this attribute; you can set its value to missing in every instance. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">In “Test options”, select “Supplied test set” and “Set” it to the dataset to which you want to apply the model, here the file “bank-new.arff”. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Now right-click the entry just created in the “Result list” and choose “Re-evaluate model on current test set”. Some content is added to the results area on the right, telling you how the model performs on this dataset. If your Class attribute consists entirely of missing values, this content is meaningless; what we care about is the model's predictions on the new dataset. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Now choose “Visualize classifier errors” from the right-click menu; a new window pops up showing a scatter plot of the prediction errors. Click the “Save” button in this window to save an Arff file. Opening this file shows that a new attribute (predictedpep) has been added in the second-to-last position; its values are the model's predictions for each instance. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Using the command line (recommended) </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Although the GUI is convenient for viewing results and setting parameters, the most direct and flexible way to build and apply models is still the command line. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Open the “Simple CLI” module. The command format for running “J48” as above is: </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">java weka.classifiers.trees.J48 -C 0.25 -M 2 -t directory-path\bank.arff -d directory-path\bank.model </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Here the options “-C 0.25” and “-M 2” are the same as those set in the GUI. “-t” is followed by the full path of the training dataset (directory and file name), and “-d” is followed by the full path where the model is saved. Note that this way we can save the model. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">After the command is entered, the resulting tree model and error analysis appear in the upper part of the “Simple CLI” and can be copied and saved to a text file. The error is obtained by applying the model to the training set. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The command format for applying this model to “bank-new.arff” is: </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">java weka.classifiers.trees.J48 -p 9 -l directory-path\bank.model -T directory-path\bank-new.arff </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Here “-p 9” indicates that the actual values of the attribute to be predicted are stored in the 9th attribute (that is, “pep”); here they are all unknown and are therefore all given as missing values. “-l” is followed by the full path of the model, and “-T” by the full path of the dataset to be predicted. </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">After the command is entered, results like the following appear in the upper part of the “Simple CLI”: </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">0 YES 0.75 ?</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">1 NO 0.7272727272727273 ?</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">2 YES 0.95 ?</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">3 YES 0.8813559322033898 ?</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">4 NO 0.8421052631578947 ?</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">... </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The first column is the “Instance_number” mentioned earlier, the second column is the “predictedpep” value from before, and the fourth column is the original “pep” value in “bank-new.arff” (all “?” here, i.e. missing). The third column is the confidence of the prediction. For example, for instance 0 we are 75% confident that its “pep” value is “YES”, and for instance 4 we are 84.2% confident that its “pep” value is “NO”.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">As we can see, using the command line has at least two advantages. First, the model can be saved, so when new data to be predicted arrives we can apply the saved model directly instead of rebuilding it each time. Second, each prediction comes with a confidence, so we can accept predictions selectively, for example only those with a confidence above 85%.</span></span></p>
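<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The selective use of predictions can be sketched in a few lines of Python. This is only an illustration: the function name and the 0.85 default are assumptions, and the two sample lines are copied from the output above (instance number, predicted “pep”, confidence, actual “pep”).</span></span></p>

```python
# The two sample lines are copied from the prediction output shown above.
# Assumed column order: instance number, predicted pep, confidence, actual pep.
SAMPLE_OUTPUT = """\
3 YES 0.8813559322033898 ?
4 NO 0.8421052631578947 ?
"""


def confident_predictions(text, threshold=0.85):
    """Return (instance_number, label, confidence) for rows at or above threshold."""
    kept = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines such as "..."
        number, label, confidence, _actual = parts
        if float(confidence) >= threshold:
            kept.append((int(number), label, float(confidence)))
    return kept


if __name__ == "__main__":
    for number, label, confidence in confident_predictions(SAMPLE_OUTPUT):
        print(number, label, round(confidence, 3))
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">With the default threshold only instance 3 is kept; instance 4, at about 84.2%, falls just below it.</span></span></p>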
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">----Adapted from http://maya.cs.depaul.edu.sixxs.org/~classes/ect584/WEKA/classify.html </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">6. Cluster Analysis</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Principles and Implementation</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">A “cluster” in cluster analysis is not the same thing as a “class” in the classification discussed earlier (in Chinese, “cluster” is more accurately rendered as “簇”). The task of clustering is to assign all instances to a number of clusters so that instances in the same cluster gather around a cluster center and are close to one another, while instances in different clusters are far apart. For instances described by numeric attributes, this distance usually means the Euclidean distance.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">We now run a cluster analysis on the “bank data” used earlier, using the most common algorithm, K-means. The steps of K-means clustering are briefly described below.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">K-means starts by choosing K cluster centers at random. Then: 1) each instance is assigned to its nearest cluster center, giving K clusters; 2) the mean of all instances in each cluster is computed and taken as that cluster's new center. Steps 1) and 2) are repeated until the positions of the K cluster centers no longer change and the cluster assignments no longer change either.</span></span></p>
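<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The two steps can be sketched in plain Python as follows. This is a minimal illustration and not WEKA's SimpleKMeans: it assumes purely numeric instances given as tuples, draws the initial centers from the data points themselves, and keeps the old center if a cluster becomes empty. The names k_means and squared_distance are made up for this sketch, and the same seed will not reproduce WEKA's results.</span></span></p>

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, seed=10, max_iterations=100):
    """Cluster a list of numeric tuples; return (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initial centers are data points
    assignment = None
    for _ in range(max_iterations):
        # Step 1: assign every instance to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the centers are fixed as well
        assignment = new_assignment
        # Step 2: move each center to the mean of the instances assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


if __name__ == "__main__":
    toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, assignment = k_means(toy, k=2)
    print(centers)
    print(assignment)
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The toy points form two well-separated groups, so the loop stops after a few iterations with each group in its own cluster.</span></span></p>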
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The K-means algorithm described above can only handle numeric attributes; a nominal attribute has to be converted into several attributes that take the values 0 and 1. WEKA performs this nominal-to-numeric conversion automatically, and it also normalizes numeric data automatically. Therefore, the only preprocessing we apply to the original data “bank-data.csv” is to delete the attribute “id”, save the data in ARFF format, and then change the attribute “children” to nominal. The resulting data file is “bank.arff”, which contains 600 instances.</span></span></p>
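<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">To make the two conversions concrete, here is a small Python sketch of what turning a nominal attribute into 0/1 attributes and normalizing a numeric attribute could look like. It only illustrates the idea and is not WEKA's internal code; rescaling to the range 0 to 1 is an assumption about the kind of normalization, and the category list and numbers are made-up toy values.</span></span></p>

```python
def one_hot(value, categories):
    """Encode a nominal value as one 0/1 indicator per category."""
    return [1.0 if value == category else 0.0 for category in categories]


def min_max_scale(values):
    """Rescale numeric values linearly so the smallest is 0 and the largest is 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # a constant attribute carries no information
    return [(v - low) / (high - low) for v in values]


if __name__ == "__main__":
    # Hypothetical categories for a nominal attribute such as "children".
    print(one_hot("2", ["0", "1", "2", "3"]))
    # Hypothetical numeric values.
    print(min_max_scale([20, 40, 60]))
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">After both steps every attribute lies between 0 and 1, so no single attribute dominates the Euclidean distance merely because of its units.</span></span></p>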
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Open the “bank.arff” file just obtained in the “Explorer” and switch to the “Cluster” tab. Click the “Choose” button and select “SimpleKMeans”, which is WEKA's implementation of K-means. Click the text box next to it and change “numClusters” to 6, meaning that we want to group these 600 instances into 6 clusters, i.e. K=6. The “seed” parameter below it sets a random seed, which is used to generate the random numbers that determine the initial positions of the K cluster centers. For now we simply leave it at 10.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Under “Cluster Mode”, select “Use training set”, click the “Start” button, and look at the clustering result shown in “Clusterer output” on the right. You can also right-click this run's entry in the “Result list” at the lower left and choose “View in separate window” to browse the result in a new window.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Interpreting the Results</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">First, note the following line in the result:</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Within cluster sum of squared errors: 1604.7416693522332 </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">This is the criterion for judging the quality of the clustering: the smaller the value, the smaller the distances between instances in the same cluster. You may get a different value; in fact, changing the “seed” parameter may change this value. We should try several seeds and adopt the result with the smallest value. For example, setting “seed” to 100 gives</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Within cluster sum of squared errors: 1555.6241507629218 </span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">We should keep this second result. Of course, trying a few more seeds may yield an even smaller value.</span></span></p>
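<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The quantity being compared, and the choice of the best seed, can be sketched as below. The function names are made up for illustration, and the sketch assumes the usual definition (the sum of squared distances from each instance to its cluster center); WEKA computes its figure on its own internally prepared attribute values, so this code is not expected to reproduce the numbers above. The two values in REPORTED are the ones quoted in the text.</span></span></p>

```python
def within_cluster_sse(points, centers, assignment):
    """Sum of squared distances from each instance to its assigned center."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, centers[cluster]))
        for point, cluster in zip(points, assignment)
    )


def best_run(sse_by_seed):
    """Return the (seed, sse) pair with the smallest error."""
    return min(sse_by_seed.items(), key=lambda item: item[1])


# Seed -> "Within cluster sum of squared errors", as reported in the text.
REPORTED = {10: 1604.7416693522332, 100: 1555.6241507629218}

if __name__ == "__main__":
    seed, sse = best_run(REPORTED)
    print(seed, sse)
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Adding the results of further seeds to the dictionary keeps the comparison in one place.</span></span></p>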
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Next, the section after “Cluster centroids:” lists the position of each cluster center. For a numeric attribute, the cluster center is its mean (Mean); for a nominal attribute, it is its mode (Mode), that is, the value taken by the largest number of instances on that attribute. For numeric attributes, the standard deviation within each cluster (Std Devs) is also given.</span></span></p>
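<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">How such a centroid summary can be computed for one cluster is sketched below with Python's standard statistics module. The attribute names and values are hypothetical toy data, summarize_cluster is a made-up helper, and using the sample standard deviation is an assumption; WEKA's figures may be computed differently.</span></span></p>

```python
import statistics


def summarize_cluster(rows, numeric_columns, nominal_columns):
    """Mean and standard deviation for numeric columns, mode for nominal ones."""
    summary = {}
    for name in numeric_columns:
        values = [row[name] for row in rows]
        # Assumption: sample standard deviation; undefined for a single value.
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[name] = {"mean": statistics.mean(values), "std_dev": spread}
    for name in nominal_columns:
        values = [row[name] for row in rows]
        summary[name] = {"mode": statistics.mode(values)}
    return summary


if __name__ == "__main__":
    # Hypothetical members of one cluster.
    cluster_rows = [
        {"age": 30.0, "sex": "FEMALE"},
        {"age": 40.0, "sex": "FEMALE"},
        {"age": 50.0, "sex": "MALE"},
    ]
    print(summarize_cluster(cluster_rows, ["age"], ["sex"]))
```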
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Finally, “Clustered Instances” gives the number and percentage of instances in each cluster.</span></span></p>
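<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">The same kind of table can be derived from any list of cluster assignments. The following sketch uses a made-up helper name and a toy assignment list; it is not WEKA output.</span></span></p>

```python
from collections import Counter


def cluster_sizes(assignment):
    """Map each cluster id to (instance count, percentage of all instances)."""
    counts = Counter(assignment)
    total = len(assignment)
    return {
        cluster: (count, 100.0 * count / total)
        for cluster, count in sorted(counts.items())
    }


if __name__ == "__main__":
    # Toy assignment: the cluster index of each of four instances.
    for cluster, (count, percent) in cluster_sizes([0, 1, 1, 2]).items():
        print(cluster, count, f"{percent:.0f}%")
```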
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">To view the clustering result graphically, right-click the result listed in the “Result list” at the lower left and choose “Visualize cluster assignments”. The window that pops up shows a scatter plot of the instances. The two boxes at the top select the x-axis and y-axis, and “color” on the second row determines how the points are colored; by default, instances are colored according to their cluster (“Cluster”).</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">Here you can click “Save” to save the clustering result as an ARFF file. In this new ARFF file, the “instance_number” attribute is the number of an instance, and the “Cluster” attribute is the cluster to which the clustering algorithm assigned that instance.</span></span></p>