The first supervised learning method we introduce is the multinomial Naive Bayes or multinomial NB model, a probabilistic learning method. The probability of a document $d$ being in class $c$ is computed as

\[ P(c|d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k|c) \tag{113} \]
where $P(t_k|c)$ is the conditional probability of term $t_k$ occurring in a document of class $c$. We interpret $P(t_k|c)$ as a measure of how much evidence $t_k$ contributes that $c$ is the correct class. $P(c)$ is the prior probability of a document occurring in class $c$. If a document's terms do not provide clear evidence for one class versus another, we choose the one that has a higher prior probability. $\langle t_1, t_2, \ldots, t_{n_d} \rangle$ are the tokens in $d$ that are part of the vocabulary we use for classification and $n_d$ is the number of such tokens in $d$. For example, $\langle t_1, t_2, \ldots, t_{n_d} \rangle$ for the one-sentence document Beijing and Taipei join the WTO might be $\langle$Beijing, Taipei, join, WTO$\rangle$, with $n_d = 4$, if we treat the terms and and the as stop words.
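To make this document representation concrete, here is a minimal Python sketch that extracts the token sequence $\langle t_1, \ldots, t_{n_d} \rangle$ and $n_d$ for the example sentence; the stop word list and whitespace tokenization are illustrative assumptions, not part of the original text.

```python
# Minimal sketch: turn a raw document into the token sequence <t_1, ..., t_nd>.
# The stop word list and lowercasing policy are illustrative assumptions.
STOP_WORDS = {"and", "the"}

def tokens_for_classification(text):
    """Whitespace-tokenize and drop stop words; returns the tokens t_k used by NB."""
    return [tok for tok in text.split() if tok.lower() not in STOP_WORDS]

d = "Beijing and Taipei join the WTO"
toks = tokens_for_classification(d)
print(toks, len(toks))  # ['Beijing', 'Taipei', 'join', 'WTO'] 4
```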
In text classification, our goal is to find the best class for the document. The best class in NB classification is the most likely or maximum a posteriori (MAP) class $c_{map}$:

\[ c_{map} = \arg\max_{c \in \mathbb{C}} \hat{P}(c|d) = \arg\max_{c \in \mathbb{C}} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c) \tag{114} \]
We write $\hat{P}$ for $P$ because we do not know the true values of the parameters $P(c)$ and $P(t_k|c)$, but estimate them from the training set as we will see in a moment.
In Equation 114, many conditional probabilities are multiplied, one for each position $1 \le k \le n_d$. This can result in a floating point underflow. It is therefore better to perform the computation by adding logarithms of probabilities instead of multiplying probabilities. The class with the highest log probability score is still the most probable; $\log(xy) = \log(x) + \log(y)$ and the logarithm function is monotonic. Hence, the maximization that is actually done in most implementations of NB is:

\[ c_{map} = \arg\max_{c \in \mathbb{C}} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \Big] \tag{115} \]
Equation 115 has a simple interpretation. Each conditional parameter $\log \hat{P}(t_k|c)$ is a weight that indicates how good an indicator $t_k$ is for $c$. Similarly, the prior $\log \hat{P}(c)$ is a weight that indicates the relative frequency of $c$. More frequent classes are more likely to be the correct class than infrequent classes. The sum of log prior and term weights is then a measure of how much evidence there is for the document being in the class, and Equation 115 selects the class for which we have the most evidence.
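The following Python sketch shows Equation 115 as it is typically implemented: scores are accumulated in log space and the class with the highest score wins. The data structures (dictionaries keyed by class and term) are assumptions made for illustration.

```python
import math

def nb_score(tokens, classes, log_prior, cond_prob):
    """Return the argmax class of Equation 115.

    log_prior[c]    -- log P^(c)
    cond_prob[c][t] -- P^(t|c); tokens not in the vocabulary are skipped
    """
    best_class, best_score = None, float("-inf")
    for c in classes:
        score = log_prior[c]
        for t in tokens:
            if t in cond_prob[c]:          # ignore out-of-vocabulary tokens
                score += math.log(cond_prob[c][t])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```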
We will initially work with this intuitive interpretation of the multinomial NB model and defer a formal derivation to Section 13.4.
How do we estimate the parameters $\hat{P}(c)$ and $\hat{P}(t_k|c)$? We first try the maximum likelihood estimate (MLE), which is simply the relative frequency and corresponds to the most likely value of each parameter given the training data. For the priors this estimate is:

\[ \hat{P}(c) = \frac{N_c}{N} \tag{116} \]
where $N_c$ is the number of documents in class $c$ and $N$ is the total number of documents.
We estimate the conditional probability $\hat{P}(t|c)$ as the relative frequency of term $t$ in documents belonging to class $c$:

\[ \hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}} \tag{117} \]
where $T_{ct}$ is the number of occurrences of $t$ in training documents from class $c$, including multiple occurrences of a term in a document. We have made the positional independence assumption here, which we will discuss in more detail in the next section: $T_{ct}$ is a count of occurrences in all positions $k$ in the documents in the training set. Thus, we do not compute different estimates for different positions and, for example, if a word occurs twice in a document, in positions $k_1$ and $k_2$, then $\hat{P}(t_{k_1}|c) = \hat{P}(t_{k_2}|c)$.
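A minimal sketch of these MLE estimates (Equations 116 and 117), assuming the training data is given as a list of (token list, class label) pairs:

```python
from collections import Counter, defaultdict

def mle_estimates(training_docs):
    """training_docs: iterable of (tokens, class_label) pairs.

    Returns (prior, cond_prob) with prior[c] = N_c / N (Eq. 116) and
    cond_prob[c][t] = T_ct / sum_t' T_ct' (Eq. 117).
    """
    doc_counts = Counter()              # N_c: documents per class
    term_counts = defaultdict(Counter)  # T_ct: term occurrences per class
    for tokens, c in training_docs:
        doc_counts[c] += 1
        term_counts[c].update(tokens)
    N = sum(doc_counts.values())
    prior = {c: doc_counts[c] / N for c in doc_counts}
    cond_prob = {}
    for c, counts in term_counts.items():
        total = sum(counts.values())    # sum_t' T_ct'
        cond_prob[c] = {t: n / total for t, n in counts.items()}
    return prior, cond_prob
```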
The problem with the MLE estimate is that it is zero for a term-class combination that did not occur in the training data. If the term WTO in the training data only occurred in China documents, then the MLE estimates for the other classes, for example UK, will be zero:
\[ \hat{P}(\text{WTO}|UK) = 0 \tag{118} \]
Now, the one-sentence document Britain is a member of the WTO will get a conditional probability of zero for UK because we are multiplying the conditional probabilities for all terms in Equation 113. Clearly, the model should assign a high probability to the UK class because the term Britain occurs. The problem is that the zero probability for WTO cannot be "conditioned away," no matter how strong the evidence for the class UK from other features. The estimate is 0 because of sparseness: the training data are never large enough to represent the frequency of rare events adequately, for example, the frequency of WTO occurring in UK documents.
To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2):

\[ \hat{P}(t|c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + B} \tag{119} \]
where $B = |V|$ is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a term as opposed to the prior probability of a class which we estimate in Equation 116 on the document level.
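Relative to the MLE sketch above, add-one smoothing (Equation 119) only changes how the conditional probabilities are normalized. A minimal sketch, assuming the same `term_counts` structure ($T_{ct}$) and an explicit vocabulary:

```python
def smoothed_cond_prob(term_counts, vocabulary):
    """Return P^(t|c) = (T_ct + 1) / (sum_t' T_ct' + B) for every t in the vocabulary."""
    B = len(vocabulary)
    cond_prob = {}
    for c, counts in term_counts.items():
        denom = sum(counts.values()) + B
        # every vocabulary term gets a nonzero estimate, even when T_ct = 0
        cond_prob[c] = {t: (counts.get(t, 0) + 1) / denom for t in vocabulary}
    return cond_prob
```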
We have now introduced all the elements we need for training and applying an NB classifier. The complete algorithm is described in Figure 13.2.
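Figure 13.2 itself is not reproduced in this excerpt; the following Python sketch is one plausible rendering of the TrainMultinomialNB and ApplyMultinomialNB procedures it describes, combining Equations 116, 119, and 115. The names and data structures are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(classes, documents):
    """documents: list of (tokens, class_label) pairs.

    Returns (vocabulary, prior, cond_prob) with
    prior[c] = N_c / N                                  (Eq. 116) and
    cond_prob[t][c] = (T_ct + 1) / (sum_t' T_ct' + B)   (Eq. 119).
    """
    vocabulary = sorted({t for tokens, _ in documents for t in tokens})
    N = len(documents)
    prior, cond_prob = {}, defaultdict(dict)
    for c in classes:
        docs_in_c = [tokens for tokens, label in documents if label == c]
        prior[c] = len(docs_in_c) / N
        T_c = Counter(t for tokens in docs_in_c for t in tokens)   # T_ct
        denom = sum(T_c.values()) + len(vocabulary)                # + B
        for t in vocabulary:
            cond_prob[t][c] = (T_c[t] + 1) / denom
    return vocabulary, prior, cond_prob

def apply_multinomial_nb(classes, vocabulary, prior, cond_prob, tokens):
    """Return argmax_c of Eq. 115, ignoring tokens outside the vocabulary."""
    vocab = set(vocabulary)
    score = {}
    for c in classes:
        score[c] = math.log(prior[c])
        for t in tokens:
            if t in vocab:
                score[c] += math.log(cond_prob[t][c])
    return max(score, key=score.get)
```

Run on the four training documents in Table 13.1, train_multinomial_nb reproduces the smoothed estimates used in the worked example below, and apply_multinomial_nb assigns test document 5 to China.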
Table 13.1: Data for parameter estimation examples.

|              | docID | words in document                   | in c = China? |
|--------------|-------|-------------------------------------|---------------|
| training set | 1     | Chinese Beijing Chinese             | yes           |
|              | 2     | Chinese Chinese Shanghai            | yes           |
|              | 3     | Chinese Macao                       | yes           |
|              | 4     | Tokyo Japan Chinese                 | no            |
| test set     | 5     | Chinese Chinese Chinese Tokyo Japan | ?             |
Worked example. For the example in Table 13.1, the multinomial parameters we need to classify the test document are the priors $\hat{P}(c) = 3/4$ and $\hat{P}(\overline{c}) = 1/4$ and the following conditional probabilities:

\[ \hat{P}(\text{Chinese}|c) = (5+1)/(8+6) = 6/14 = 3/7 \]
\[ \hat{P}(\text{Tokyo}|c) = \hat{P}(\text{Japan}|c) = (0+1)/(8+6) = 1/14 \]
\[ \hat{P}(\text{Chinese}|\overline{c}) = (1+1)/(3+6) = 2/9 \]
\[ \hat{P}(\text{Tokyo}|\overline{c}) = \hat{P}(\text{Japan}|\overline{c}) = (1+1)/(3+6) = 2/9 \]
The denominators are $(8+6)$ and $(3+6)$ because the lengths of $text_c$ and $text_{\overline{c}}$ are 8 and 3, respectively, and because the constant $B$ in Equation 119 is 6 as the vocabulary consists of six terms.
We then get:

\[ \hat{P}(c|d_5) \propto 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003 \]
\[ \hat{P}(\overline{c}|d_5) \propto 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001 \]
Thus, the classifier assigns the test document to $c$ = China. The reason for this classification decision is that the three occurrences of the positive indicator Chinese in $d_5$ outweigh the occurrences of the two negative indicators Japan and Tokyo. End worked example.
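As a quick check of the arithmetic in the worked example, the two unnormalized scores can be computed directly; this snippet is only a numerical verification, not part of the original text.

```python
# Unnormalized scores for test document d5 = Chinese Chinese Chinese Tokyo Japan
p_china = 3/4 * (3/7)**3 * (1/14) * (1/14)
p_not_china = 1/4 * (2/9)**3 * (2/9) * (2/9)
print(round(p_china, 4), round(p_not_china, 4))  # 0.0003 0.0001
```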
What is the time complexity of NB? The complexity of computing the parameters is $\Theta(|\mathbb{C}||V|)$ because the set of parameters consists of $|\mathbb{C}||V|$ conditional probabilities and $|\mathbb{C}|$ priors. The preprocessing necessary for computing the parameters (extracting the vocabulary, counting terms, etc.) can be done in one pass through the training data. The time complexity of this component is therefore $\Theta(|\mathbb{D}| L_{ave})$, where $|\mathbb{D}|$ is the number of documents and $L_{ave}$ is the average length of a document.
We use $\Theta(|\mathbb{D}| L_{ave})$ as a notation for $\Theta(T)$ here, where $T$ is the length of the training collection. This is nonstandard; $\Theta(\cdot)$ is not defined for an average. We prefer expressing the time complexity in terms of $|\mathbb{D}|$ and $L_{ave}$ because these are the primary statistics used to characterize training collections.
The time complexity of APPLYMULTINOMIALNB in Figure 13.2 is $\Theta(|\mathbb{C}| L_a)$. $L_a$ and $M_a$ are the numbers of tokens and types, respectively, in the test document. APPLYMULTINOMIALNB can be modified to be $\Theta(L_a + |\mathbb{C}| M_a)$ (Exercise 13.6). Finally, assuming that the length of test documents is bounded, $\Theta(L_a + |\mathbb{C}| M_a) = \Theta(|\mathbb{C}| M_a)$ because $L_a < b|\mathbb{C}| M_a$ for a fixed constant $b$.
Table 13.2 summarizes the time complexities. In general, we have $|\mathbb{C}||V| < |\mathbb{D}| L_{ave}$, so both training and testing complexity are linear in the time it takes to scan the data. Because we have to look at the data at least once, NB can be said to have optimal time complexity. Its efficiency is one reason why NB is a popular text classification method.