A Classic Naive Bayes Example Implemented in Python [Tested and Working]


Posted in Python on June 13, 2018

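This article walks through a Python implementation of the naive Bayes algorithm, shared here for reference. The details are as follows:

The code mainly follows the book Machine Learning in Action. I have found that recent books by foreign authors really are better written than those by Chinese authors: they progress from the basics to advanced topics, and their code is easy to follow. Without further ado, here is the code: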

#encoding:utf-8
'''
Created on September 6, 2015
@author: ZHOUMEIXU204
Naive Bayes implementation walkthrough
'''
# The class labels in this algorithm are 1 and 0; supporting more labels needs only small changes
import numpy as np
path=u"D:\\Users\\zhoumeixu204\\Desktop\\python语言机器学习\\机器学习实战代码  python\\机器学习实战代码\\machinelearninginaction\\Ch04\\"
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  # vocabList.index() returns the position of word in vocabList; the result is a list containing only 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
print(len(myVocabList))
print(myVocabList)
print(setOfWordseVec(myVocabList, listOPosts[0]))
print(setOfWordseVec(myVocabList, listOPosts[3]))
# The code above converts text into vectors: an entry is 1 if the word appears and 0 if it does not
def trainNB0(trainMatrix,trainCategory):  # train the naive Bayes classifier
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Deom=2.0;p1Deom=2.0
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Deom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Deom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Deom)  #change to log
  p0vect=np.log(p0Num/p0Deom)  #change to log
  return p0vect,p1vect,pAbusive
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
trainMat=[]
for postinDoc in listOPosts:
  trainMat.append(setOfWordseVec(myVocabList, postinDoc))
p0V,p1V,pAb=trainNB0(trainMat, listClasses)
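if __name__!='__main__':  # note the '!=': this block is skipped when the file is run as a script; use '==' to print the trained values
  print("p0 log-probabilities")
  print (p0V)
  print("p1 log-probabilities")
  print (p1V)
  print("pAb probability")
  print (pAb)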

Output:

32
['him', 'garbage', 'problems', 'take', 'steak', 'quit', 'so', 'is', 'cute', 'posting', 'dog', 'to', 'love', 'licks', 'dalmation', 'flea', 'I', 'please', 'maybe', 'buying', 'my', 'stupid', 'park', 'food', 'stop', 'has', 'ate', 'help', 'how', 'mr', 'worthless', 'not']
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
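
The two vectors above come from the set-of-words model: one slot per vocabulary word, set to 1 if the word occurs in the post. Below is a minimal sketch of the same idea. The names create_vocab_list and set_of_words_vec are hypothetical stand-ins for createVocabList and setOfWordseVec, and two choices are assumptions of this sketch rather than part of the original code: the vocabulary is sorted (iteration order of a Python set is not guaranteed, so the order printed above may differ on your machine), and a dictionary replaces vocabList.index() for lookups.

```python
def create_vocab_list(dataset):
    """Collect every distinct word; sorting makes the order reproducible."""
    vocab = set()
    for document in dataset:
        vocab |= set(document)
    return sorted(vocab)


def set_of_words_vec(vocab_list, input_set):
    """Return a 0/1 list marking which vocabulary words occur in input_set."""
    index = {word: i for i, word in enumerate(vocab_list)}  # word -> position
    vec = [0] * len(vocab_list)
    for word in input_set:
        if word in index:
            vec[index[word]] = 1
        # Words outside the vocabulary are silently ignored in this sketch.
    return vec


# Two of the posts from loadDataSet(), used here as a tiny corpus.
posts = [
    ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
]
vocab = create_vocab_list(posts)
vec = set_of_words_vec(vocab, ['my', 'dog', 'stupid', 'unknown'])
print(vocab)
print(vec)
```

With the full six-post dataset the vector has 32 slots, matching the vocabulary size printed above.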

# -*- coding:utf-8 -*-
#!python3 (the output shown below matches the behavior of Python 3's print() function)
# Build a classifier for the samples: which class do testEntry=['love','my','dalmation'] and testEntry=['stupid','garbage'] belong to?
import numpy as np
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  # vocabList.index() returns the position of word in vocabList; the result is a list containing only 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  # train the naive Bayes classifier
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Deom=2.0;p1Deom=2.0
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Deom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Deom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Deom)  #change to log
  p0vect=np.log(p0Num/p0Deom)  #change to log
  return p0vect,p1vect,pAbusive
def  classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
  p1=sum(vec2Classify*p1Vec)+np.log(pClass1)
  p0=sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
  if p1>p0:
    return 1
  else:
    return 0
def testingNB():
  listOPosts,listClasses=loadDataSet()
  myVocabList=createVocabList(listOPosts)
  trainMat=[]
  for postinDoc in listOPosts:
    trainMat.append(setOfWordseVec(myVocabList, postinDoc))
  p0V,p1V,pAb=trainNB0(np.array(trainMat),np.array(listClasses))
  print("p0V={0}".format(p0V))
  print("p1V={0}".format(p1V))
  print("pAb={0}".format(pAb))
  testEntry=['love','my','dalmation']
  thisDoc=np.array(setOfWordseVec(myVocabList, testEntry))
  print(thisDoc)
  print("vec2Classify*p0Vec={0}".format(thisDoc*p0V))
  print(testEntry,'classified as :',classifyNB(thisDoc, p0V, p1V, pAb))
  testEntry=['stupid','garbage']
  thisDoc=np.array(setOfWordseVec(myVocabList, testEntry))
  print(thisDoc)
  print(testEntry,'classified as :',classifyNB(thisDoc, p0V, p1V, pAb))
if __name__=='__main__':
  testingNB()

Output:

p0V=[-3.25809654 -2.56494936 -3.25809654 -3.25809654 -2.56494936 -2.56494936
 -3.25809654 -2.56494936 -2.56494936 -3.25809654 -2.56494936 -2.56494936
 -2.56494936 -2.56494936 -1.87180218 -2.56494936 -2.56494936 -2.56494936
 -2.56494936 -2.56494936 -2.56494936 -3.25809654 -3.25809654 -2.56494936
 -2.56494936 -3.25809654 -2.15948425 -2.56494936 -3.25809654 -2.56494936
 -3.25809654 -3.25809654]
p1V=[-2.35137526 -3.04452244 -1.94591015 -2.35137526 -1.94591015 -3.04452244
 -2.35137526 -3.04452244 -3.04452244 -1.65822808 -3.04452244 -3.04452244
 -2.35137526 -3.04452244 -3.04452244 -3.04452244 -3.04452244 -3.04452244
 -3.04452244 -3.04452244 -3.04452244 -2.35137526 -2.35137526 -3.04452244
 -3.04452244 -2.35137526 -2.35137526 -3.04452244 -2.35137526 -2.35137526
 -2.35137526 -2.35137526]
pAb=0.5
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
vec2Classify*p0Vec=[-0.         -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.         -0.
 -1.87180218 -0.         -0.         -2.56494936 -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -2.56494936 -0.         -0.         -0.         -0.        ]
['love', 'my', 'dalmation'] classified as : 0
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
['stupid', 'garbage'] classified as : 1
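
The numbers above can be checked by hand. trainNB0 starts every word count at 1 and every denominator at 2.0, so a word never seen in a class still gets a non-zero probability, and it stores logarithms so that classifyNB can add scores instead of multiplying many small probabilities. The sketch below recomputes the two scores for ['love', 'my', 'dalmation'] directly from the training posts. The helper names smoothed_log_prob and log_score are hypothetical, and the sketch assumes every word in the entry is in the vocabulary (the original code skips unknown words instead).

```python
import math

# Training posts and labels copied from loadDataSet() (1 = abusive, 0 = not).
posts = [
    ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
    ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
    ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
    ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
    ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
]
labels = [0, 1, 0, 1, 0, 1]


def smoothed_log_prob(word, cls):
    """Log of (documents of class cls containing word + 1) / (total word slots + 2)."""
    docs = [set(p) for p, y in zip(posts, labels) if y == cls]
    numerator = 1 + sum(word in d for d in docs)   # counts start at 1, like np.ones
    denominator = 2.0 + sum(len(d) for d in docs)  # starts at 2.0, like p0Deom/p1Deom
    return math.log(numerator / denominator)


def log_score(entry, cls, prior):
    """Sum of per-word log-probabilities plus the log prior, as in classifyNB."""
    return sum(smoothed_log_prob(w, cls) for w in entry) + math.log(prior)


entry = ['love', 'my', 'dalmation']
p0 = log_score(entry, 0, 0.5)  # pAb = 0.5, so both priors are 0.5
p1 = log_score(entry, 1, 0.5)
print(round(p0, 4), round(p1, 4))
print('classified as', 1 if p1 > p0 else 0)
```

For class 0 the three terms are log(2/26), log(4/26) and log(2/26), which are the values -2.56494936, -1.87180218 and -2.56494936 that appear in the vec2Classify*p0Vec line of the output. None of the three words occurs in class 1, so each contributes log(1/21) there and the class 0 score is the larger one.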

# -*- coding:utf-8 -*-
#! python2
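# Filtering spam email with naive Bayes
# 1. Collect data: text files are provided
# 2. Prepare data: parse the text files into token vectors
# 3. Analyze data: inspect the tokens to make sure parsing is correct
# 4. Train the algorithm: use the trainNB0() function built earlier
# 5. Test the algorithm: use classifyNB(), and build a new test function that computes the error rate over the document set
# 6. Use the algorithm: build a complete program that classifies a set of documents and prints the misclassified ones to the screen
# import re
# mySent='this book is the best book on python or M.L. I have ever laid eyes upon.'
# print(mySent.split())
# regEx=re.compile('\\W+')
# print(regEx.split(mySent))
# emailText=open(path+"email\\ham\\6.txt").read()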
import numpy as np
path=u"C:\\py\\3waterPyDemo\\src\\Demo\\Ch04\\"
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  # vocabList.index() returns the position of word in vocabList; the result is a list containing only 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  # train the naive Bayes classifier
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Deom=2.0;p1Deom=2.0
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Deom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Deom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Deom)  #change to log
  p0vect=np.log(p0Num/p0Deom)  #change to log
  return p0vect,p1vect,pAbusive
def  classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
  p1=sum(vec2Classify*p1Vec)+np.log(pClass1)
  p0=sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
  if p1>p0:
    return 1
  else:
    return 0
def textParse(bigString):
  import re
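  listOfTokens=re.split(r'\W+',bigString)  # \W+ rather than \W*: on Python 3.7+ a pattern that can match the empty string splits the text into single characters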
  return [tok.lower() for tok in listOfTokens if len(tok)>2]
def spamTest():
  docList=[];classList=[];fullText=[]
  for i in range(1,26):
    wordList=textParse(open(path+"email\\spam\\%d.txt"%i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList=textParse(open(path+"email\\ham\\%d.txt"%i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList=createVocabList(docList)
  trainingSet=list(range(50));testSet=[]  # list() so that del also works on Python 3, where range() is not a list
  for i in range(10):
    randIndex=int(np.random.uniform(0,len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del (trainingSet[randIndex])
  trainMat=[];trainClasses=[]
  for  docIndex in trainingSet:
    trainMat.append(setOfWordseVec(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V,p1V,pSpam=trainNB0(np.array(trainMat),np.array(trainClasses))
  errorCount=0
  for  docIndex in testSet:
    wordVector=setOfWordseVec(vocabList, docList[docIndex])
    if classifyNB(np.array(wordVector), p0V, p1V, pSpam)!=classList[docIndex]:
      errorCount+=1
  print('the error rate is : {0}'.format(float(errorCount)/len(testSet)))
if __name__=='__main__':
  spamTest()

Output (the 10 test documents are drawn at random, so the error rate can vary from run to run):

the error rate is : 0.0
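
Two steps of spamTest() are easy to try without the email files: tokenizing a string and holding out a random test set. The sketch below is an illustration under stated assumptions, not the book's code. text_parse mirrors textParse but splits on \W+, and holdout_split is a hypothetical helper that uses random.sample in place of the np.random.uniform and del loop. The seed parameter is an addition of this sketch to make a split repeatable. The sentence is adapted from the commented-out lines of the script.

```python
import random
import re


def text_parse(big_string):
    """Split on runs of non-word characters; keep lowercased tokens longer than 2 characters."""
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]


def holdout_split(n_docs, n_test, seed=None):
    """Pick n_test distinct document indices for testing; the rest are for training."""
    rng = random.Random(seed)
    test_idx = rng.sample(range(n_docs), n_test)
    held_out = set(test_idx)
    train_idx = [i for i in range(n_docs) if i not in held_out]
    return train_idx, test_idx


sentence = 'this book is the best book on python or M.L. I have ever laid eyes upon.'
print(text_parse(sentence))

train_idx, test_idx = holdout_split(50, 10, seed=0)
print(len(train_idx), len(test_idx))
```

Because a single random split of 50 emails leaves only 10 for testing, one run reporting 0.0 says little on its own. Calling the split several times with different seeds and averaging the error rates gives a steadier estimate.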

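The Ch04 files that the path variable points to can be downloaded from this site.

Note: the algorithm in this article comes from the book Machine Learning in Action.

I hope this article is helpful for your Python programming.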
