Random Forest

backsnow


Blog category: Machine Learning

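The basic procedure of a random forest is as follows (the training data is an m*n matrix, where m is the number of samples and n is the number of features):

1. Training: build each decision tree while considering only r << n randomly selected features at each split (r is typically about sqrt(n)).
2. Prediction: classify the input with every decision tree, then take a vote; the class that receives the most votes is the predicted class.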
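The voting step can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: it assumes each tree is any callable that maps a sample to a class label, and the name `forest_predict` and the three toy "trees" are hypothetical.

```python
from collections import Counter

def forest_predict(trees, x):
    """Classify x with every tree and return the label with the most votes."""
    votes = Counter(tree(x) for tree in trees)
    # most_common(1) returns [(label, count)] for the top label; ties go to
    # the label seen first, which is an arbitrary choice made for this sketch.
    return votes.most_common(1)[0][0]

# Three toy "trees" (hypothetical stand-ins for trained decision trees).
trees = [
    lambda x: "spam" if x[0] > 0.5 else "ham",
    lambda x: "spam" if x[1] > 0.2 else "ham",
    lambda x: "ham",
]

print(forest_predict(trees, [0.9, 0.7]))  # two of the three trees vote "spam"
```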

A decision tree is constructed as follows; at each node, the most effective feature is the one that maximizes information gain:

How to grow a Decision Tree

source : [3]

LearnUnprunedTree(X,Y)

Input: X a matrix of R rows and M columns where X_ij = the value of the j'th attribute in the i'th input datapoint. Each column consists of either all real values or all categorical values.
Input: Y a vector of R elements, where Y_i = the output class of the i'th datapoint. The Y_i values are categorical.
Output: An Unpruned decision tree

If all records in X have identical values in all their attributes (this includes the case where R<2), return a Leaf Node predicting the majority output, breaking ties randomly. This case also includes
If all values in Y are the same, return a Leaf Node predicting this value as the output
Else
    select m variables at random out of the M variables
    For j = 1 .. m
        If j'th attribute is categorical
            IG_j = IG(Y|X_j) (see Information Gain)
        Else (j'th attribute is real-valued)
            IG_j = IG*(Y|X_j) (see Information Gain)
    Let j* = argmax_j IG_j (this is the splitting attribute we'll use)
    If j* is categorical then
        For each value v of the j*'th attribute
            Let X^v = subset of rows of X in which X_ij* = v. Let Y^v = corresponding subset of Y
            Let Child^v = LearnUnprunedTree(X^v,Y^v)
        Return a decision tree node, splitting on j*'th attribute. The number of children equals the number of values of the j*'th attribute, and the v'th child is Child^v
    Else j* is real-valued and let t be the best split threshold
        Let X^LO = subset of rows of X in which X_ij* <= t. Let Y^LO = corresponding subset of Y
        Let Child^LO = LearnUnprunedTree(X^LO,Y^LO)
        Let X^HI = subset of rows of X in which X_ij* > t. Let Y^HI = corresponding subset of Y
        Let Child^HI = LearnUnprunedTree(X^HI,Y^HI)
        Return a decision tree node, splitting on j*'th attribute. It has two children corresponding to whether the j*'th attribute is above or below the given threshold.

Note: There are alternatives to Information Gain for splitting nodes
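The pseudocode above can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the source's implementation: it handles real-valued attributes only, breaks majority ties by first-seen order rather than randomly, and returns a leaf when every sampled attribute is constant (a case the pseudocode does not address). The names `entropy`, `best_threshold`, `learn_unpruned_tree`, `predict`, and the dictionary layout of a node are all hypothetical.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """H(Y) in bits for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(column, labels):
    """Return (IG*, t); t is None when the column has a single distinct value."""
    base, n = entropy(labels), len(labels)
    best_gain, best_t = -1.0, None
    for t in sorted(set(column))[:-1]:  # skip the max: it would leave HI empty
        lo = [y for x, y in zip(column, labels) if x <= t]
        hi = [y for x, y in zip(column, labels) if x > t]
        gain = base - len(lo) / n * entropy(lo) - len(hi) / n * entropy(hi)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

def learn_unpruned_tree(X, Y, m, rng=random):
    """Grow an unpruned tree on rows X (lists of floats) with labels Y."""
    majority = Counter(Y).most_common(1)[0][0]
    if len(set(Y)) == 1 or all(row == X[0] for row in X):
        return {"leaf": majority}
    best = (-1.0, None, None)  # (gain, attribute index j*, threshold t)
    for j in rng.sample(range(len(X[0])), m):  # m of the M attributes
        gain, t = best_threshold([row[j] for row in X], Y)
        if t is not None and gain > best[0]:
            best = (gain, j, t)
    _, j, t = best
    if j is None:  # assumption: all sampled attributes were constant
        return {"leaf": majority}
    lo = [i for i, row in enumerate(X) if row[j] <= t]
    hi = [i for i, row in enumerate(X) if row[j] > t]
    return {
        "attr": j,
        "threshold": t,
        "lo": learn_unpruned_tree([X[i] for i in lo], [Y[i] for i in lo], m, rng),
        "hi": learn_unpruned_tree([X[i] for i in hi], [Y[i] for i in hi], m, rng),
    }

def predict(tree, x):
    """Follow the thresholds down to a leaf and return its label."""
    while "leaf" not in tree:
        tree = tree["lo"] if x[tree["attr"]] <= tree["threshold"] else tree["hi"]
    return tree["leaf"]

X = [[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]]
Y = ["a", "a", "b", "b"]
tree = learn_unpruned_tree(X, Y, m=2, rng=random.Random(0))
print([predict(tree, row) for row in X])
```

Because the tree is unpruned and the four toy rows are distinct, it reproduces the training labels; that is expected behaviour for a single unpruned tree, not evidence of good generalization.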

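Note that the maximum information gain is computed differently for categorical and real-valued attributes. Only the real-valued case is explained here: IG*(Y|X_j) = max_t IG(Y|X_j:t), where IG(Y|X_j:t) is the information gain obtained by splitting X_j at threshold t. The maximization therefore also determines the best split value t.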
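The post does not spell out IG(Y|X_j:t). A common definition, assumed here and using the same "<= t" versus "> t" convention as the pseudocode, is:

```latex
\begin{aligned}
H(Y \mid X_j : t) &= P(X_j \le t)\, H(Y \mid X_j \le t) + P(X_j > t)\, H(Y \mid X_j > t) \\
IG(Y \mid X_j : t) &= H(Y) - H(Y \mid X_j : t) \\
IG^{*}(Y \mid X_j) &= \max_{t}\, IG(Y \mid X_j : t)
\end{aligned}
```

As a small worked example (made-up data): take four records with X_j = 1, 2, 3, 4 and classes a, a, b, b, so H(Y) = 1 bit. Splitting at t = 1 or t = 3 leaves one pure side and one side with a 1:2 class mix, giving IG of about 0.311 bits. Splitting at t = 2 separates the classes completely, giving IG = 1 bit. Hence IG*(Y|X_j) = 1 and the best split value is t = 2.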

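Entropy is H = -sum_i p_i log(p_i). High entropy means the variable has a "boring", nearly uniform distribution: all of its values are about equally likely. Low entropy means the variable has a varied distribution with peaks and valleys: one or two values are far more likely than the rest. Information gain is IG(Y|X) = H(Y) - H(Y|X). Since H(Y) is fixed at a given node, the goal is to find the feature X with the lowest conditional entropy H(Y|X): the lower H(Y|X) is, the more the class labels concentrate on one or two values within each branch of the split, the more samples the split classifies cleanly, and the earlier that feature should be chosen as a split node.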
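To make the difference between high and low entropy concrete, here is a small Python sketch. It uses base-2 logarithms (bits), and the two probability vectors are made-up examples.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # "boring": all values equally likely
peaked = [0.97, 0.01, 0.01, 0.01]   # "varied": one value dominates

print(entropy(uniform))  # 2.0 bits, the maximum for four outcomes
print(entropy(peaked))   # much lower, roughly 0.24 bits
```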

Information gain

source : [3]

nominal attributes

suppose X can have one of m values V₁,V₂,...,V_m
P(X=V₁)=p₁, P(X=V₂)=p₂,...,P(X=V_m)=p_m

H(X) = -sum_{j=1}^{m} p_j log₂ p_j (the entropy of X)
H(Y|X=v) = the entropy of Y among only those records in which X has value v
H(Y|X) = sum_{j=1}^{m} p_j H(Y|X=V_j)
IG(Y|X) = H(Y) - H(Y|X)
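The formulas for nominal attributes translate directly into code. The following is a minimal sketch; the function names and the toy "outlook / wind / play" data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """H(Y) in bits, estimated from a list of observed labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(Y|X) = H(Y) - sum_j p_j * H(Y|X=V_j) for a nominal attribute X."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)  # records of Y grouped by the value of X
    n = len(ys)
    h_y_given_x = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(ys) - h_y_given_x

# Made-up toy data.
outlook = ["sunny", "sunny", "rain", "rain"]
wind = ["weak", "strong", "weak", "strong"]
play = ["no", "no", "yes", "yes"]

print(information_gain(outlook, play))  # 1.0: outlook fully determines play
print(information_gain(wind, play))     # 0.0: wind says nothing about play
```

A tree grown on this data would split on `outlook` first, since it has the larger information gain.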

real-valued attributes

How to grow a Random Forest

source : [1]

Each tree is grown as follows:

If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
Each tree is grown to the largest extent possible. There is no pruning.
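The two sources of randomness described above, sampling cases with replacement and sampling m variables per node, can be sketched as follows. This is only an illustration using the Python standard library; the helper names `bootstrap_sample` and `sample_features` and the toy data are hypothetical.

```python
import random

def bootstrap_sample(X, Y, rng):
    """Draw N cases at random, with replacement, from the N original cases."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [Y[i] for i in idx]

def sample_features(M, m, rng):
    """Pick m distinct variable indices out of M; done anew at every node."""
    return rng.sample(range(M), m)

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
X = [[i, i * 2, i * 3, i * 4] for i in range(10)]  # N = 10 cases, M = 4
Y = [i % 2 for i in range(10)]

Xb, Yb = bootstrap_sample(X, Y, rng)
print(len(Xb), len(set(map(tuple, Xb))))  # N rows, usually fewer distinct ones
print(sample_features(M=4, m=2, rng=rng))
```

Each tree in the forest would get its own bootstrap sample, while `sample_features` would be called again at every node of every tree with the same m.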

Random Forest parameters

source : [2]
Random Forests are easy to use: the only two parameters a user of the technique has to determine are the number of trees to be used and the number of variables (m) to be randomly selected from the available set of variables.
Breiman's recommendations are to pick a large number of trees and to use the square root of the number of variables for m.
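As a quick illustration of the square-root rule, the sketch below computes m for a few values of M. Rounding to the nearest integer and enforcing a minimum of 1 are assumptions of this sketch, and `default_m` is a hypothetical helper name; the source only says "the square root of the number of variables".

```python
import math

def default_m(M):
    """Rule of thumb: m is about sqrt(M), and never less than 1."""
    return max(1, round(math.sqrt(M)))

for M in (4, 16, 100, 1000):
    print(M, default_m(M))
```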


2011-07-16 23:25