A Classic Example of the Naive Bayes Algorithm in Python [Tested and Working]


Posted in Python on June 13, 2018

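This article walks through a Python implementation of the naive Bayes algorithm, shared here for reference. The details are as follows:

The code is based mainly on the book Machine Learning in Action. I have found lately that books by foreign authors tend to be better written than domestic ones: they progress from simple to advanced, and the code is easy to follow. Without further ado, here is the code: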

#encoding:utf-8
'''
Created on September 6, 2015
@author: ZHOUMEIXU204
Naive Bayes implementation walkthrough
'''
# The class labels in this algorithm are 1 and 0; supporting more labels needs only small changes to the code
import numpy as np
path=u"D:\\Users\\zhoumeixu204\\Desktop\\python语言机器学习\\机器学习实战代码  python\\机器学习实战代码\\machinelearninginaction\\Ch04\\"
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  # vocabList.index() returns the word's position in vocabList; the result is a list containing only 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
print(len(myVocabList))
print(myVocabList)
print(setOfWordseVec(myVocabList, listOPosts[0]))
print(setOfWordseVec(myVocabList, listOPosts[3]))
# The code above converts text into a vector: 1 if the word appears in the document, 0 if it does not
def trainNB0(trainMatrix,trainCategory):  # train the naive Bayes classifier
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
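  # Additive smoothing: start every word count at 1 and each denominator at 2 so that no word gets zero probability
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Deom=2.0;p1Deom=2.0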
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Deom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Deom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Deom)  #change to log
  p0vect=np.log(p0Num/p0Deom)  #change to log
  return p0vect,p1vect,pAbusive
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
trainMat=[]
for postinDoc in listOPosts:
  trainMat.append(setOfWordseVec(myVocabList, postinDoc))
p0V,p1V,pAb=trainNB0(trainMat, listClasses)
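if __name__!='__main__':  # note: with != these prints are skipped when the file is run directly; use == to show them
  print("p0 log-probabilities")
  print (p0V)
  print("p1 log-probabilities")
  print (p1V)
  print("pAb probability")
  print (pAb)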

Output:

32
['him', 'garbage', 'problems', 'take', 'steak', 'quit', 'so', 'is', 'cute', 'posting', 'dog', 'to', 'love', 'licks', 'dalmation', 'flea', 'I', 'please', 'maybe', 'buying', 'my', 'stupid', 'park', 'food', 'stop', 'has', 'ate', 'help', 'how', 'mr', 'worthless', 'not']
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
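The conversion above calls vocabList.index() once per word, which scans the list each time. For larger vocabularies, a word-to-position dictionary gives the same kind of 0/1 vector with constant-time lookups. The sketch below is an illustration rather than the book's code: the names build_index and set_of_words_vec are made up for this example, the mini-vocabulary is hypothetical, and it is sorted only so that the output is the same on every run.

```python
def build_index(vocab_list):
    """Map each vocabulary word to its position in the list."""
    return {word: i for i, word in enumerate(vocab_list)}


def set_of_words_vec(index, input_words):
    """Return a 0/1 vector marking which vocabulary words occur in input_words."""
    vec = [0] * len(index)
    for word in input_words:
        pos = index.get(word)
        if pos is not None:
            vec[pos] = 1  # presence only; a repeated word still gives 1
    return vec


# Hypothetical mini-vocabulary, sorted so positions are stable between runs.
vocab = sorted({'my', 'dog', 'stupid', 'garbage', 'love'})
index = build_index(vocab)
demo_vec = set_of_words_vec(index, ['stupid', 'dog', 'dog', 'unknown'])
print(vocab)     # ['dog', 'garbage', 'love', 'my', 'stupid']
print(demo_vec)  # [1, 0, 0, 0, 1]
```

Unlike the original function, this sketch silently ignores words that are not in the vocabulary instead of printing a message.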

# -*- coding:utf-8 -*-
#!python2
# Build the classifier and decide which class testEntry=['love','my','dalmation'] and testEntry=['stupid','garbage'] belong to
import numpy as np
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  # vocabList.index() returns the word's position in vocabList; the result is a list containing only 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  # train the naive Bayes classifier
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
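  # Additive smoothing: start every word count at 1 and each denominator at 2 so that no word gets zero probability
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Deom=2.0;p1Deom=2.0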
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Deom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Deom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Deom)  #change to log
  p0vect=np.log(p0Num/p0Deom)  #change to log
  return p0vect,p1vect,pAbusive
def  classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
  p1=sum(vec2Classify*p1Vec)+np.log(pClass1)
  p0=sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
  if p1>p0:
    return 1
  else:
    return 0
def testingNB():
  listOPosts,listClasses=loadDataSet()
  myVocabList=createVocabList(listOPosts)
  trainMat=[]
  for postinDoc in listOPosts:
    trainMat.append(setOfWordseVec(myVocabList, postinDoc))
  p0V,p1V,pAb=trainNB0(np.array(trainMat),np.array(listClasses))
  print("p0V={0}".format(p0V))
  print("p1V={0}".format(p1V))
  print("pAb={0}".format(pAb))
  testEntry=['love','my','dalmation']
  thisDoc=np.array(setOfWordseVec(myVocabList, testEntry))
  print(thisDoc)
  print("vec2Classify*p0Vec={0}".format(thisDoc*p0V))
  print(testEntry,'classified as :',classifyNB(thisDoc, p0V, p1V, pAb))
  testEntry=['stupid','garbage']
  thisDoc=np.array(setOfWordseVec(myVocabList, testEntry))
  print(thisDoc)
  print(testEntry,'classified as :',classifyNB(thisDoc, p0V, p1V, pAb))
if __name__=='__main__':
  testingNB()

Output:

p0V=[-3.25809654 -2.56494936 -3.25809654 -3.25809654 -2.56494936 -2.56494936
 -3.25809654 -2.56494936 -2.56494936 -3.25809654 -2.56494936 -2.56494936
 -2.56494936 -2.56494936 -1.87180218 -2.56494936 -2.56494936 -2.56494936
 -2.56494936 -2.56494936 -2.56494936 -3.25809654 -3.25809654 -2.56494936
 -2.56494936 -3.25809654 -2.15948425 -2.56494936 -3.25809654 -2.56494936
 -3.25809654 -3.25809654]
p1V=[-2.35137526 -3.04452244 -1.94591015 -2.35137526 -1.94591015 -3.04452244
 -2.35137526 -3.04452244 -3.04452244 -1.65822808 -3.04452244 -3.04452244
 -2.35137526 -3.04452244 -3.04452244 -3.04452244 -3.04452244 -3.04452244
 -3.04452244 -3.04452244 -3.04452244 -2.35137526 -2.35137526 -3.04452244
 -3.04452244 -2.35137526 -2.35137526 -3.04452244 -2.35137526 -2.35137526
 -2.35137526 -2.35137526]
pAb=0.5
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
vec2Classify*p0Vec=[-0.         -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.         -0.
 -1.87180218 -0.         -0.         -2.56494936 -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -2.56494936 -0.         -0.         -0.         -0.        ]
['love', 'my', 'dalmation'] classified as : 0
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
['stupid', 'garbage'] classified as : 1
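The numbers in p0V and p1V can be checked by hand. trainNB0 starts every word count at 1 and every denominator at 2, so each entry is log((1 + number of documents in the class that contain the word) / (2 + total number of words in the class)). The three non-abusive posts contain 7 + 8 + 9 = 24 words and the three abusive posts contain 8 + 5 + 6 = 19, which gives denominators of 26 and 21. The sketch below reproduces the values for ['love', 'my', 'dalmation']; the variable names are illustrative and the document counts were read off the training posts by hand.

```python
import math

# Denominators after smoothing: 2 + total words seen in each class.
denom0 = 2.0 + (7 + 8 + 9)   # class 0 (not abusive)
denom1 = 2.0 + (8 + 5 + 6)   # class 1 (abusive)

# word -> (documents of class 0 containing it, documents of class 1 containing it)
docs_with_word = {
    'love':      (1, 0),
    'my':        (3, 0),
    'dalmation': (1, 0),
}

log_prior = math.log(0.5)  # pAb = 0.5, so both classes share the same prior
score0 = log_prior
score1 = log_prior
for word, (n0, n1) in docs_with_word.items():
    lp0 = math.log((1 + n0) / denom0)
    lp1 = math.log((1 + n1) / denom1)
    print(word, round(lp0, 8), round(lp1, 8))
    score0 += lp0
    score1 += lp1

predicted = 1 if score1 > score0 else 0
print('score0 =', score0, 'score1 =', score1, 'class =', predicted)
```

The class 0 values for the three words (-2.56494936, -1.87180218, -2.56494936) are the same as the nonzero entries of vec2Classify*p0Vec in the output above, and score0 is larger than score1, which agrees with the class 0 result.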

# -*- coding:utf-8 -*-
#! python2
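# Filtering spam email with naive Bayes
# 1. Collect data: text files are provided
# 2. Prepare data: parse the text files into token vectors
# 3. Analyze data: inspect the tokens to make sure parsing is correct
# 4. Train the algorithm: use the trainNB0() function built earlier
# 5. Test the algorithm: use classifyNB() and build a new test function that computes the error rate over a document set
# 6. Use the algorithm: build a complete program that classifies a set of documents and prints the misclassified ones to the screen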
# import re
# mySent='this book is the best book on python or M.L. I hvae ever laid eyes upon.'
# print(mySent.split())
# regEx=re.compile('\\W+')
# print(regEx.split(mySent))
# emailText=open(path+"email\\ham\\6.txt").read()
import numpy as np
path=u"C:\\py\\3waterPyDemo\\src\\Demo\\Ch04\\"
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  # vocabList.index() returns the word's position in vocabList; the result is a list containing only 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  # train the naive Bayes classifier
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
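  # Additive smoothing: start every word count at 1 and each denominator at 2 so that no word gets zero probability
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Deom=2.0;p1Deom=2.0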
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Deom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Deom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Deom)  #change to log
  p0vect=np.log(p0Num/p0Deom)  #change to log
  return p0vect,p1vect,pAbusive
def  classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
  p1=sum(vec2Classify*p1Vec)+np.log(pClass1)
  p0=sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
  if p1>p0:
    return 1
  else:
    return 0
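def textParse(bigString):
  import re
  # Use \W+ rather than \W*: on Python 3.7+ re.split also splits on empty matches, so \W* would break the text into single characters
  listOfTokens=re.split(r'\W+',bigString)
  return [tok.lower() for tok in listOfTokens if len(tok)>2]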
def spamTest():
  docList=[];classList=[];fullText=[]
  for i in range(1,26):
    wordList=textParse(open(path+"email\\spam\\%d.txt"%i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList=textParse(open(path+"email\\ham\\%d.txt"%i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList=createVocabList(docList)
  trainingSet=list(range(50));testSet=[]  # list() so that del also works on Python 3, where range() is not a list
  for i in range(10):
    randIndex=int(np.random.uniform(0,len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del (trainingSet[randIndex])
  trainMat=[];trainClasses=[]
  for  docIndex in trainingSet:
    trainMat.append(setOfWordseVec(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V,p1V,pSpam=trainNB0(np.array(trainMat),np.array(trainClasses))
  errorCount=0
  for  docIndex in testSet:
    wordVector=setOfWordseVec(vocabList, docList[docIndex])
    if classifyNB(np.array(wordVector), p0V, p1V, pSpam)!=classList[docIndex]:
      errorCount+=1
  print('the error rate is : {0}'.format(float(errorCount)/len(testSet)))
if __name__=='__main__':
  spamTest()

Output:

the error rate is : 0.0
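Two steps of spamTest can be tried without the email files: tokenizing a string and drawing the random hold-out set. The sketch below is an illustration under a few assumptions. It uses the pattern \W+ (one or more non-word characters), because on Python 3.7 and later re.split also splits on empty matches, so a pattern that can match the empty string breaks the text into single characters. It wraps range(50) in list() so that elements can be removed on Python 3. The name holdOutSplit is made up for this example. The sample sentence is the one from the commented-out lines of the script.

```python
import random
import re


def textParse(bigString):
    # Split on runs of non-word characters, keep lowercase tokens longer than 2
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]


def holdOutSplit(numDocs, numTest, rng):
    # Move numTest randomly chosen document indices into the test set
    trainingSet = list(range(numDocs))
    testSet = []
    for _ in range(numTest):
        randIndex = rng.randrange(len(trainingSet))
        testSet.append(trainingSet.pop(randIndex))
    return trainingSet, testSet


mySent = 'this book is the best book on python or M.L. I hvae ever laid eyes upon.'
tokens = textParse(mySent)
print(tokens)

trainingSet, testSet = holdOutSplit(50, 10, random.Random(0))
print(len(trainingSet), len(testSet))  # 40 10
```

Because the split is random, the error rate reported by spamTest can change from run to run; averaging it over several runs gives a steadier estimate.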

The Ch04 files that the path variable points to can be downloaded from this site.
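The scripts build file names by appending to a hard-coded Windows path. A more portable option is to join the path components with os.path.join. In the sketch below, the base directory 'Ch04' and the helper name emailPath are assumptions for illustration; point the base directory at wherever the downloaded folder was unpacked.

```python
import os


def emailPath(baseDir, kind, i):
    # kind is 'spam' or 'ham'; i is the file number used in spamTest (1 to 25)
    return os.path.join(baseDir, 'email', kind, '%d.txt' % i)


baseDir = 'Ch04'  # assumed location of the unpacked folder
spamFiles = [emailPath(baseDir, 'spam', i) for i in range(1, 26)]
hamFiles = [emailPath(baseDir, 'ham', i) for i in range(1, 26)]
print(spamFiles[0])
print(len(spamFiles), len(hamFiles))  # 25 25
```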

Note: the algorithm in this article comes from the book Machine Learning in Action.

I hope this article is helpful for your Python programming.
