A classic example of the naive Bayes algorithm implemented in Python [tested and working]


Posted in Python on June 13, 2018

This article walks through a Python implementation of the naive Bayes algorithm, shared here for reference.

The code mainly follows the book Machine Learning in Action, which builds up from the basics with clear, readable code. On to the code:

#encoding:utf-8
'''
Created on September 6, 2015
@author: ZHOUMEIXU204
Naive Bayes implementation
'''
# In this algorithm the class labels are 1 and 0; supporting more labels needs only minor changes
import numpy as np
path=u"D:\\Users\\zhoumeixu204\\Desktop\\python语言机器学习\\机器学习实战代码  python\\机器学习实战代码\\machinelearninginaction\\Ch04\\"
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  # 1 is abusive, 0 is not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  # vocabList.index() gives the word's position in vocabList; the result is a vector of 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
print(len(myVocabList))
print(myVocabList)
print(setOfWordseVec(myVocabList, listOPosts[0]))
print(setOfWordseVec(myVocabList, listOPosts[3]))
# The code above converts text into vector form: an entry is 1 if the word appears in the document, 0 if it does not
def trainNB0(trainMatrix,trainCategory):  # build the naive Bayes classifier
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
  # start counts at 1 and denominators at 2.0 so no conditional probability is zero
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Denom=2.0;p1Denom=2.0
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Denom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Denom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Denom)  # change to log to avoid floating-point underflow
  p0vect=np.log(p0Num/p0Denom)  # change to log
  return p0vect,p1vect,pAbusive
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
trainMat=[]
for postinDoc in listOPosts:
  trainMat.append(setOfWordseVec(myVocabList, postinDoc))
p0V,p1V,pAb=trainNB0(trainMat, listClasses)
if __name__=='__main__':
  print("p0 probabilities")
  print(p0V)
  print("p1 probabilities")
  print(p1V)
  print("pAb probability")
  print(pAb)

Output:

32
['him', 'garbage', 'problems', 'take', 'steak', 'quit', 'so', 'is', 'cute', 'posting', 'dog', 'to', 'love', 'licks', 'dalmation', 'flea', 'I', 'please', 'maybe', 'buying', 'my', 'stupid', 'park', 'food', 'stop', 'has', 'ate', 'help', 'how', 'mr', 'worthless', 'not']
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
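The vectors above use the set-of-words model: each entry records only whether a word appears, not how often. A common variant, the bag-of-words model (also covered in the book), counts occurrences instead; a minimal sketch, with the function name following the book's convention:

```python
def bagOfWords2VecMN(vocabList, inputSet):
    # Like setOfWordseVec, but counts how many times each word occurs
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

print(bagOfWords2VecMN(['my', 'dog', 'stupid'], ['my', 'dog', 'my']))  # [2, 1, 0]
```

With repeated words this keeps frequency information that the 0/1 vectors discard.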

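To make the smoothing in trainNB0 concrete: starting the counts at 1 and the denominators at 2.0 keeps a single unseen word from zeroing out a whole product, but it is the book's simplification; textbook Laplace smoothing adds the vocabulary size to the denominator instead. A comparison on hypothetical counts:

```python
import numpy as np

word_counts = np.array([3, 0, 6, 1])  # hypothetical per-word counts for one class
total = word_counts.sum()             # 10 words seen in that class

p_book = (word_counts + 1) / (total + 2.0)                  # the book's variant
p_laplace = (word_counts + 1) / (total + len(word_counts))  # standard Laplace

print(p_book.sum())     # slightly more than 1 -- not a true distribution
print(p_laplace.sum())  # 1 (up to rounding)
```

Since only the ordering of the per-class scores matters for classification, the simplified denominator still works in practice.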
# -*- coding:utf-8 -*-
# Build a classifier: which class does testEntry=['love','my','dalmation'] belong to, and which does testEntry=['stupid','garbage']?
import numpy as np
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  # 1 is abusive, 0 is not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  # vocabList.index() gives the word's position in vocabList; the result is a vector of 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  # build the naive Bayes classifier
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
  # start counts at 1 and denominators at 2.0 so no conditional probability is zero
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Denom=2.0;p1Denom=2.0
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Denom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Denom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Denom)  # change to log to avoid floating-point underflow
  p0vect=np.log(p0Num/p0Denom)  # change to log
  return p0vect,p1vect,pAbusive
def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
  # log posterior for each class: sum of per-word log-likelihoods plus log prior
  p1=sum(vec2Classify*p1Vec)+np.log(pClass1)
  p0=sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
  if p1>p0:
    return 1
  else:
    return 0
def testingNB():
  listOPosts,listClasses=loadDataSet()
  myVocabList=createVocabList(listOPosts)
  trainMat=[]
  for postinDoc in listOPosts:
    trainMat.append(setOfWordseVec(myVocabList, postinDoc))
  p0V,p1V,pAb=trainNB0(np.array(trainMat),np.array(listClasses))
  print("p0V={0}".format(p0V))
  print("p1V={0}".format(p1V))
  print("pAb={0}".format(pAb))
  testEntry=['love','my','dalmation']
  thisDoc=np.array(setOfWordseVec(myVocabList, testEntry))
  print(thisDoc)
  print("vec2Classify*p0Vec={0}".format(thisDoc*p0V))
  print(testEntry,'classified as :',classifyNB(thisDoc, p0V, p1V, pAb))
  testEntry=['stupid','garbage']
  thisDoc=np.array(setOfWordseVec(myVocabList, testEntry))
  print(thisDoc)
  print(testEntry,'classified as :',classifyNB(thisDoc, p0V, p1V, pAb))
if __name__=='__main__':
  testingNB()

Output:

p0V=[-3.25809654 -2.56494936 -3.25809654 -3.25809654 -2.56494936 -2.56494936
 -3.25809654 -2.56494936 -2.56494936 -3.25809654 -2.56494936 -2.56494936
 -2.56494936 -2.56494936 -1.87180218 -2.56494936 -2.56494936 -2.56494936
 -2.56494936 -2.56494936 -2.56494936 -3.25809654 -3.25809654 -2.56494936
 -2.56494936 -3.25809654 -2.15948425 -2.56494936 -3.25809654 -2.56494936
 -3.25809654 -3.25809654]
p1V=[-2.35137526 -3.04452244 -1.94591015 -2.35137526 -1.94591015 -3.04452244
 -2.35137526 -3.04452244 -3.04452244 -1.65822808 -3.04452244 -3.04452244
 -2.35137526 -3.04452244 -3.04452244 -3.04452244 -3.04452244 -3.04452244
 -3.04452244 -3.04452244 -3.04452244 -2.35137526 -2.35137526 -3.04452244
 -3.04452244 -2.35137526 -2.35137526 -3.04452244 -2.35137526 -2.35137526
 -2.35137526 -2.35137526]
pAb=0.5
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
vec2Classify*p0Vec=[-0.         -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.         -0.
 -1.87180218 -0.         -0.         -2.56494936 -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -2.56494936 -0.         -0.         -0.         -0.        ]
['love', 'my', 'dalmation'] classified as : 0
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
['stupid', 'garbage'] classified as : 1
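The reason trainNB0 returns log probabilities, and classifyNB adds rather than multiplies, is floating-point underflow: the product of a few hundred small per-word probabilities collapses to 0.0, while the sum of their logs stays finite. A quick illustration with a hypothetical per-word probability:

```python
import math

p = 0.001  # a hypothetical small per-word probability
n = 200    # number of words in a document

product = p ** n           # underflows to exactly 0.0 (10**-600 is below float range)
log_sum = n * math.log(p)  # the equivalent log-space value stays finite

print(product)   # 0.0
print(log_sum)   # about -1381.6
```

Comparing `log_sum` values across classes gives the same decision as comparing the underlying products would, without ever hitting zero.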

# -*- coding:utf-8 -*-
# Filtering spam e-mail with naive Bayes
# 1. Collect the data: text files are provided
# 2. Prepare the data: parse the text files into token vectors
# 3. Analyze the data: inspect the tokens to make sure parsing is correct
# 4. Train the algorithm: use the trainNB0() function built earlier
# 5. Test the algorithm: use classifyNB(), and build a new test function that computes the error rate over the document set
# 6. Use the algorithm: build a complete program that classifies a set of documents and prints the misclassified ones to the screen
# import re
# mySent='this book is the best book on python or M.L. I have ever laid eyes upon.'
# print(mySent.split())
# regEx=re.compile('\\W+')
# print(regEx.split(mySent))
# emailText=open(path+"email\\ham\\6.txt").read()
import numpy as np
path=u"C:\\py\\3waterPyDemo\\src\\Demo\\Ch04\\"
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  # 1 is abusive, 0 is not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  # vocabList.index() gives the word's position in vocabList; the result is a vector of 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  # build the naive Bayes classifier
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
  # start counts at 1 and denominators at 2.0 so no conditional probability is zero
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Denom=2.0;p1Denom=2.0
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Denom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Denom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Denom)  # change to log to avoid floating-point underflow
  p0vect=np.log(p0Num/p0Denom)  # change to log
  return p0vect,p1vect,pAbusive
def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
  # log posterior for each class: sum of per-word log-likelihoods plus log prior
  p1=sum(vec2Classify*p1Vec)+np.log(pClass1)
  p0=sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
  if p1>p0:
    return 1
  else:
    return 0
def textParse(bigString):
  import re
  listOfTokens=re.split(r'\W+',bigString)  # r'\W*' can match the empty string and misbehaves in Python 3; \W+ is the correct split pattern
  return [tok.lower() for tok in listOfTokens if len(tok)>2]
def spamTest():
  docList=[];classList=[];fullText=[]
  for i in range(1,26):
    # encoding='latin-1' because a few files in the dataset are not valid UTF-8
    wordList=textParse(open(path+"email\\spam\\%d.txt"%i,encoding='latin-1').read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList=textParse(open(path+"email\\ham\\%d.txt"%i,encoding='latin-1').read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList=createVocabList(docList)
  trainingSet=list(range(50));testSet=[]  # list(), so elements can be deleted under Python 3
  for i in range(10):
    # reserve 10 randomly chosen documents for testing
    randIndex=int(np.random.uniform(0,len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del (trainingSet[randIndex])
  trainMat=[];trainClasses=[]
  for docIndex in trainingSet:
    trainMat.append(setOfWordseVec(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V,p1V,pSpam=trainNB0(np.array(trainMat),np.array(trainClasses))
  errorCount=0
  for docIndex in testSet:
    wordVector=setOfWordseVec(vocabList, docList[docIndex])
    if classifyNB(np.array(wordVector), p0V, p1V, pSpam)!=classList[docIndex]:
      errorCount+=1
  print('the error rate is :',float(errorCount)/len(testSet))
if __name__=='__main__':
  spamTest()

Output:

the error rate is : 0.0
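One note on spamTest under Python 3: range() no longer returns a list, so the training indices must be materialized with list() before elements can be deleted. The hold-out logic can also be factored into a small helper (holdout_split is an illustrative name, not from the book):

```python
import random

def holdout_split(n_docs, n_test, seed=None):
    # Randomly move n_test indices into a test set; the rest stay for training.
    rng = random.Random(seed)
    indices = list(range(n_docs))  # list(), so pop() works under Python 3
    test_set = []
    for _ in range(n_test):
        test_set.append(indices.pop(rng.randrange(len(indices))))
    return indices, test_set

train, test = holdout_split(50, 10, seed=42)
print(len(train), len(test))  # 40 10
```

A single random split gives a noisy estimate (0.0 here); repeating the split with different seeds and averaging the error rate gives a steadier figure.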

The Ch04 data files referenced by path are the Chapter 4 companion files for the book.

Note: the algorithm in this article comes from the book Machine Learning in Action.

Hopefully this article helps with your Python programming.
