sunbin

Bayesian Classification Algorithm

 

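The Bayesian algorithm is mainly used for classification, that is, predicting which category a piece of data belongs to.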
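
As a rough illustration of the idea, Bayes' rule turns how often a word appears in spam and in normal messages into the probability that a message containing that word is spam. The sketch below uses made-up probabilities (they are not measured from the data set), and the function name `posterior_spam` is hypothetical.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) using Bayes' rule for two classes."""
    p_ham = 1.0 - p_spam
    # Evidence: total probability of seeing the word in any message
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / evidence


# Hypothetical numbers chosen only for illustration
p = posterior_spam(p_word_given_spam=0.8, p_word_given_ham=0.1, p_spam=0.2)
print(round(p, 3))
```

Even though only 20% of the messages are assumed to be spam here, a word that is much more common in spam pushes the posterior well above the prior.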

The following is a spam classification example.

 

Data

type,text
ham,00 00 00 are 0089 0089 having a good week. Just checking in
ham,K..give back my thanks.
ham,Am also doing in cbe only. But have to pay.
spam,"complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+"
spam,okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm
ham,Aiya we discuss later lar... Pick u up at 4 is it?
ham,Are you this much buzy
ham,Please ask mummy to call father
spam,Marvel Mobile Play the official Ultimate Spider-man game (£4.50) on ur mobile right now. Text SPIDER to 83338 for the game & we ll send u a FREE 8Ball wallpaper
ham,"fyi I'm at usf now, swing by the room whenever"
ham,"Sure thing big man. i have hockey elections at 6, shouldn't go on longer than an hour though"
ham,I anything lor...
ham,"By march ending, i should be ready. But will call you for sure. The problem is that my capital never complete. How far with you. How's work and the ladies"
ham,"Hmm well, night night "
ham,K I'll be sure to get up before noon and see what's what
ham,Ha ha cool cool chikku chikku:-):-DB-)
ham,Darren was saying dat if u meeting da ge den we dun meet 4 dinner. Cos later u leave xy will feel awkward. Den u meet him 4 lunch lor.
ham,He dint tell anything. He is angry on me that why you told to abi.
ham,Up to u... u wan come then come lor... But i din c any stripes skirt...
spam,"U can WIN £100 of Music Gift Vouchers every week starting NOW Txt the word DRAW to 87066 TsCs www.ldew.com SkillGame,1Winaweek, age16.150ppermessSubscription"
ham,2mro i am not coming to gym machan. Goodnight.
ham,ARR birthday today:) i wish him to get more oscar.
ham,Reading gud habit.. Nan bari hudgi yorge pataistha ertini kano:-)
ham,"I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one me the less expensive ones"
ham,"Could you not read me, my Love ? I answered you"
ham,So what did the bank say about the money?
ham,Well if I'm that desperate I'll just call armand again
ham,"Fuuuuck I need to stop sleepin, sup"
ham,So how's the weather over there?
ham,Ok thanx...
ham,Ok.ok ok..then..whats ur todays plan
ham,1Apple/Day=No Doctor. 1Tulsi Leaf/Day=No Cancer. 1Lemon/Day=No Fat. 1Cup Milk/day=No Bone Problms 3 Litres Watr/Day=No Diseases Snd ths 2 Whom U Care..:-)
ham,"Sorry, I'll call later"
ham,Will do. Was exhausted on train this morning. Too much wine and pie. You sleep well too
spam,U have won a nokia 6230 plus a free digital camera. This is what u get when u win our FREE auction. To take part send NOKIA to 83383 now. POBOX114/14TCR/W1 16
ham,Ron say fri leh. N he said ding tai feng cant make reservations. But he said wait lor.
ham,"Call me when you/carlos is/are here, my phone's vibrate is acting up and I might not hear texts"
ham,Oh k :)why you got job then whats up?
spam,"SPJanuary Male Sale! Hot Gay chat now cheaper, call 08709222922. National rate from 1.5p/min cheap to 7.8p/min peak! To stop texts call 08712460324 (10p/min)"
ham,Yeah you should. I think you can use your gt atm now to register. Not sure but if there's anyway i can help let me know. But when you do be sure you are ready.
ham,Nationwide auto centre (or something like that) on Newport road. I liked them there
ham,He is there. You call and meet him
ham,Yeah sure I'll leave in a min
spam,URGENT! Your Mobile number has been awarded with a £2000 prize GUARANTEED. Call 09061790121 from land line. Claim 3030. Valid 12hrs only 150ppm
ham,"Mah b, I'll pick it up tomorrow"
ham,Then she dun believe wat?
ham,I've sent u my part..
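
Several of the rows above wrap the message in double quotes because the text itself contains commas. A minimal sketch, using two rows copied from the sample, of how naive comma splitting differs from Python's standard `csv` module (the variable names are only illustrative):

```python
import csv
import io

# Header plus two rows copied from the sample above
sample = 'type,text\nham,"Sorry, I\'ll call later"\nham,Ok thanx...\n'

# Naive splitting cuts the quoted message at its internal comma
naive = sample.splitlines()[1].split(",")

# csv.reader respects the quotes and returns exactly two fields per row
rows = list(csv.reader(io.StringIO(sample)))
header, records = rows[0], rows[1:]

print(naive)
print(records)
```

With naive splitting the first message is broken into three pieces, so taking the second field alone would silently drop part of the text.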

 

 

Python implementation

# csv module: parses quoted fields that contain commas
import csv
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer


if __name__ == '__main__':
    corpus = []
    labels = []
    corpus_test = []
    labels_test = []
    # Map the string labels to the integer classes 0 and 1
    label_ids = {"ham": 0, "spam": 1}
    # Read the file as UTF-8 text
    with open("../../sms_spam.txt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        # Skip the header row (type,text)
        next(reader)
        # Count rows from 2 so the 5550 threshold matches the original line counter
        for count, row in enumerate(reader, start=2):
            # Skip blank or malformed rows and rows with an unknown label
            if len(row) < 2 or row[0] not in label_ids:
                continue
            # Target value
            label = label_ids[row[0]]
            # Feature: the message text
            sentence = row[1]
            if count > 5550:
                # Hold out the last rows as the test set
                corpus_test.append(sentence)
                labels_test.append(label)
            else:
                # Earlier rows form the training set, so test rows are never trained on
                corpus.append(sentence)
                labels.append(label)
    # Build the training features
    # CountVectorizer tokenizes the documents and converts them into a sparse matrix of word counts
    # Here the messages in corpus are converted into per-word frequencies
    vectorizer = CountVectorizer()
    fea_train = vectorizer.fit_transform(corpus)
    # The feature dimensions are all words seen in training, in sorted order
    # (get_feature_names_out requires scikit-learn >= 1.0)
    print(vectorizer.get_feature_names_out())
    # Count of each word per message along the feature dimensions
    print(fea_train.toarray())

    # Build the test features
    # Reuse the vocabulary fitted on the training data; words that occur only in the test set are ignored
    fea_test = vectorizer.transform(corpus_test)
    print(fea_test.toarray())
    
    
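    # Create the naive Bayes model and fit it on the training data
    # alpha=1 is Laplace smoothing: 1 is added to every word count
    clf = MultinomialNB(alpha=1)
    clf.fit(fea_train, labels)

    # Feed the test data to the model to get the predictions
    pred = clf.predict(fea_test)
    for p in pred:
        if p == 0:
            print("normal message")
        else:
            print("spam message")
    # Print the predicted and the true label side by side
    for i in range(len(pred)):
        print(pred[i], "\t", labels_test[i])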
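
To make the role of `alpha=1` more concrete, here is a minimal pure-Python sketch of multinomial naive Bayes with Laplace smoothing. It is only an illustration on a made-up four-message corpus, not scikit-learn's implementation, and `train_nb` and `predict_nb` are hypothetical helper names.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({w for d in docs for w in d})
    log_prior, log_like = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        # Laplace smoothing: add alpha to every word count in the vocabulary
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like


def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    log_prior, log_like = model
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped, as with a fixed vocabulary
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc if w in log_like[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy corpus: 1 = spam, 0 = ham
train_docs = [
    "free prize call now".split(),
    "win free cash now".split(),
    "call me later".split(),
    "see you at lunch".split(),
]
train_labels = [1, 1, 0, 0]

model = train_nb(train_docs, train_labels)
pred_spam = predict_nb(model, "free cash prize".split())
pred_ham = predict_nb(model, "see you later".split())
print(pred_spam, pred_ham)
```

Without the added `alpha`, a word that never occurs in one class would get probability zero there, and the logarithm of that probability would be undefined; smoothing keeps every word's probability small but positive.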
           

 

 

 

Spark implementation

package com.sunbin

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.{ Level, Logger }
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.classification.NaiveBayes

object Naive_bayes {
  def main(args: Array[String]): Unit = {
    // 1. Create the Spark context
    val conf=new SparkConf().setMaster("local[2]").setAppName("bayes")
    val sc=new SparkContext(conf)
    Logger.getRootLogger.setLevel(Level.WARN)
    val data_path1 = "sms_spam.txt"
    val lines= sc.textFile(data_path1, 2)
    
    val tf = new HashingTF(numFeatures = 100000)
    
    // Build the data set, skipping the header row and malformed lines
    val parsedData = lines
      // Split on the first comma only, so commas inside the message are kept
      .map(_.split(",", 2))
      .filter(parts => parts.length == 2 && parts(0) != "type")
      .map { parts =>
        // Convert the words of the message into a term-frequency vector
        val features = tf.transform(parts(1).split(" ").toSeq)
        // ham -> 0, anything else (spam) -> 1
        val label = if (parts(0) == "ham") 0.0 else 1.0
        LabeledPoint(label, features)
      }
    parsedData.cache()
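    // Split the data into a training set (90%) and a test set (10%)
    val splits = parsedData.randomSplit(Array(0.9, 0.1), seed = 1L)
    val test = splits(1)
    val train = splits(0)

    // Train the model; lambda = 1.0 is Laplace (additive) smoothing
    val model = NaiveBayes.train(train, lambda = 1.0)
    // Predict on the test data and print each prediction next to its true label
    val predictionAndLabel = test.map { p =>
      val prediction = model.predict(p.features)
      println(s"$prediction ${p.label}")
      (prediction, p.label)
    }
    // count() is an action, so it forces the lazy map above to run
    predictionAndLabel.count()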
  }
  
}
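
`HashingTF` in the Spark code maps each word to a column index by hashing it, so no vocabulary has to be stored. The Python sketch below illustrates the idea under one loud assumption: `zlib.crc32` is used as a stand-in hash, whereas Spark uses its own hash function, so the indices will differ from Spark's. The name `hashing_tf` is hypothetical.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=100000):
    """Map tokens to a sparse {index: count} vector with the hashing trick."""
    counts = Counter()
    for tok in tokens:
        # Assumption: crc32 stands in for Spark's internal hash function
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[idx] += 1
    return dict(counts)


vec = hashing_tf("call me when you call".split())
print(vec)
```

Two different words can land on the same index (a collision); a large `num_features` such as the 100000 used above makes that less likely, at the cost of a wider but still sparse vector.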

 
