An Introduction to the Bayesian Algorithm
I. Basic Steps of the Bayesian Filtering Algorithm
1) Collect a large number of spam and non-spam emails, and build a spam email set and a non-spam email set.
2) Extract the independent strings in the email subject and body, e.g. ABC32 or ¥234, as TOKEN strings, and count how many times each extracted TOKEN string occurs, i.e. its frequency. Process every email in the spam set and in the non-spam set in this way.
3) Each email set corresponds to one hash table: hashtable_good for the non-spam set and hashtable_bad for the spam set. Each table stores the mapping from TOKEN strings to their frequencies.
4) Compute the probability of each TOKEN string in each hash table: P = (frequency of the TOKEN string) / (length of the corresponding hash table).
5) Combining hashtable_good and hashtable_bad, infer the probability that a new email is spam given that a certain TOKEN string appears in it. In mathematical terms:
Let A be the event that the email is spam,
and let t1, t2, ..., tn denote TOKEN strings.
Then P(A|ti) is the probability that an email is spam given that the TOKEN string ti appears in it.
Let
P1(ti) = the probability value of ti in hashtable_good (from step 4)
P2(ti) = the probability value of ti in hashtable_bad (from step 4)
Then P(A|ti) = P2(ti) / [P1(ti) + P2(ti)].
(This is Bayes' rule with equal prior probabilities for spam and non-spam: P1(ti) and P2(ti) serve as estimates of how likely ti is to appear in a non-spam and a spam email respectively, so the spam-table value P2(ti) appears in the numerator.)
6) Build a new hash table, hashtable_probability, storing the mapping from each TOKEN string ti to P(A|ti).
7) At this point the learning phase over the spam and non-spam sets is complete. Based on hashtable_probability, we can estimate the likelihood that a newly arrived email is spam.
When a new email arrives, generate its TOKEN strings as in step 2), and look each one up in hashtable_probability to obtain its value.
Suppose the email yields N TOKEN strings t1, t2, ..., tN, whose corresponding values in hashtable_probability are P1, P2, ..., PN.
P(A|t1, t2, ..., tN) is the probability that the email is spam given that the TOKEN strings t1, t2, ..., tN all appear in it.
By the compound probability formula (Bayes' rule, assuming the TOKEN strings occur independently and that spam and non-spam are equally likely a priori):
P(A|t1, t2, ..., tN) = (P1*P2*...*PN) / [P1*P2*...*PN + (1-P1)*(1-P2)*...*(1-PN)]
When P(A|t1, t2, ..., tN) exceeds a predefined threshold, the email is judged to be spam.
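A minimal sketch of steps 1) to 7) in Python is given below. It is an illustration under simplifying assumptions rather than a reference implementation: emails are plain strings, whitespace splitting stands in for the real tokenisation of step 2), the "length" of a hash table in step 4) is interpreted as its number of entries, and the names build_freq_table, build_probability_table and spam_probability are made up for this sketch.

from collections import Counter

def build_freq_table(emails):
    """Steps 2)-3): map each TOKEN string to its frequency over an email set."""
    table = Counter()
    for mail in emails:
        table.update(mail.split())   # simplified tokenisation
    return dict(table)

def build_probability_table(hashtable_good, hashtable_bad):
    """Steps 4)-6): build the mapping from each TOKEN string ti to P(A|ti)."""
    def token_prob(table, token):
        # Step 4): frequency of the TOKEN string / length of the table
        # ("length" is taken here to mean the number of entries).
        return table.get(token, 0) / len(table) if table else 0.0

    hashtable_probability = {}
    for token in set(hashtable_good) | set(hashtable_bad):
        p1 = token_prob(hashtable_good, token)  # P1(ti), from the non-spam table
        p2 = token_prob(hashtable_bad, token)   # P2(ti), from the spam table
        hashtable_probability[token] = p2 / (p1 + p2)  # step 5): P(A|ti)
    return hashtable_probability

def spam_probability(mail, hashtable_probability):
    """Step 7): combine the per-token values of a newly arrived email."""
    probs = [hashtable_probability[t] for t in mail.split()
             if t in hashtable_probability]
    if not probs:
        return 0.5  # no known TOKEN strings: treat as undecided (an assumption)
    prod_p, prod_q = 1.0, 1.0
    for p in probs:
        prod_p *= p         # P1 * P2 * ... * PN
        prod_q *= 1.0 - p   # (1-P1) * (1-P2) * ... * (1-PN)
    denom = prod_p + prod_q
    return prod_p / denom if denom > 0 else 0.5  # guard the 0/0 case (see section II)

A new email would then be flagged as spam when spam_probability(...) exceeds the chosen threshold.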
II. An Example of the Bayesian Filtering Algorithm
For example, take a spam email A containing the text "fa#gong" (the characters 法, #, 功)
and a non-spam email B containing the word "法律" ("law", the characters 法 and 律).
From email A we build hashtable_bad, whose records are:
法: 1 occurrence
#: 1 occurrence
功: 1 occurrence
In this table:
the probability of 法 is 0.3
the probability of # is 0.3
the probability of 功 is 0.3
(each is 1/3, rounded to 0.3; the rounded values are used in the calculations below)
From email B we build hashtable_good, whose records are:
法: 1
律: 1
In this table:
the probability of 法 is 0.5
the probability of 律 is 0.5
Considering the two hash tables together, there are four TOKEN strings in total: 法, #, 功, 律.
When 法 appears in an email, the probability that the email is spam is:
P = 0.3 / (0.3 + 0.5) = 0.375
When # appears:
P = 0.3 / (0.3 + 0) = 1
When 功 appears:
P = 0.3 / (0.3 + 0) = 1
When 律 appears:
P = 0 / (0 + 0.5) = 0
This yields the third hash table, hashtable_probability, whose data are:
法: 0.375
#: 1
功: 1
律: 0
When a new email containing "功律" arrives, we obtain two TOKEN strings: 功 and 律.
Looking them up in hashtable_probability gives:
P(spam | 功) = 1
P(spam | 律) = 0
The probability that this email is spam is then:
P = (0*1) / [0*1 + (1-0)*(1-1)] = 0
(Strictly, both the numerator and the denominator are 0 here, so the expression is undefined; the intended reading is that the zero-probability token 律 forces the result to 0.)
From this we conclude that the email is not spam.
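As a cross-check, the worked example can be run through the sketch given after section I (again an illustration under that sketch's assumptions, not code from the original text):

# Reproducing the example with the sketch above; emails are given pre-split into tokens.
hashtable_bad = build_freq_table(["法 # 功"])    # spam email A
hashtable_good = build_freq_table(["法 律"])     # non-spam email B

hashtable_probability = build_probability_table(hashtable_good, hashtable_bad)
# resulting values: 法 ≈ 0.4, #: 1.0, 功: 1.0, 律: 0.0
# (≈0.4 rather than 0.375 for 法 because the sketch uses the exact 1/3 instead of the rounded 0.3)

print(spam_probability("功 律", hashtable_probability))
# prints 0.5, because the sketch maps the 0/0 case to "undecided";
# the text above reads the same expression as 0 and classifies the email as non-spam.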