A Classic Naive Bayes Example Implemented in Python [Tested and Working]


Posted in Python on June 13, 2018

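This article walks through a Python implementation of the naive Bayes algorithm, shared here for reference. The details are as follows:

The code is based mainly on the book Machine Learning in Action (《机器学习实战》). I have found lately that books by foreign authors really do tend to be better than those written in China: they build up from the basics and the code is easy to follow. Without further ado, here is the code: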

#encoding:utf-8
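'''
Created on September 6, 2015
@author: ZHOUMEIXU204
Naive Bayes implementation walkthrough
'''
#This algorithm uses the class labels 1 and 0; handling more labels needs only small changes to the code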
import numpy as np
path=u"D:\\Users\\zhoumeixu204\\Desktop\\python语言机器学习\\机器学习实战代码  python\\机器学习实战代码\\machinelearninginaction\\Ch04\\"
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  #vocabList.index() returns the position of word in vocabList; the result is a list containing only 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
print(len(myVocabList))
print(myVocabList)
print(setOfWordseVec(myVocabList, listOPosts[0]))
print(setOfWordseVec(myVocabList, listOPosts[3]))
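#The code above converts text into vector form: an entry is 1 if the word appears in the document and 0 if it does not
def trainNB0(trainMatrix,trainCategory):  #naive Bayes classifier training function
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)  #prior probability of class 1
  #start the word counts at 1 and the denominators at 2.0 so that no word ends up with probability 0
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Deom=2.0;p1Deom=2.0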
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Deom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Deom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Deom)  #change to log
  p0vect=np.log(p0Num/p0Deom)  #change to log
  return p0vect,p1vect,pAbusive
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
trainMat=[]
for postinDoc in listOPosts:
  trainMat.append(setOfWordseVec(myVocabList, postinDoc))
p0V,p1V,pAb=trainNB0(trainMat, listClasses)
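#Note: the condition below is "!=", so these prints are skipped when the file is run as a script,
#which is why they are missing from the output that follows; change it to "==" to print the trained values
if __name__!='__main__':
  print("p0 log-probabilities")
  print (p0V)
  print("p1 log-probabilities")
  print (p1V)
  print("probability of class 1 (pAb)")
  print (pAb)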

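Output (the vocabulary is built from a set, so its order, and with it the positions of the 1s, may differ on your machine):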

32
['him', 'garbage', 'problems', 'take', 'steak', 'quit', 'so', 'is', 'cute', 'posting', 'dog', 'to', 'love', 'licks', 'dalmation', 'flea', 'I', 'please', 'maybe', 'buying', 'my', 'stupid', 'park', 'food', 'stop', 'has', 'ate', 'help', 'how', 'mr', 'worthless', 'not']
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
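
The counting loop in trainNB0 can also be written with NumPy array operations, and the same idea extends to more than two labels, as the header comment of the listing suggests. The sketch below is one possible way to do that and is not the book's code: the function name train_nb and its dictionary return value are assumptions made for illustration. It keeps the same smoothing as trainNB0 (word counts start at 1, denominators at 2.0) and works in log space.

```python
import numpy as np

def train_nb(train_matrix, labels):
    """Hypothetical vectorized trainer; returns {label: (log_prior, log_word_probs)}."""
    X = np.asarray(train_matrix, dtype=float)
    y = np.asarray(labels)
    model = {}
    for label in np.unique(y):
        rows = X[y == label]                    # documents that belong to this class
        word_counts = 1.0 + rows.sum(axis=0)    # counts start at 1, as in trainNB0
        total = 2.0 + rows.sum()                # denominator starts at 2.0, as in trainNB0
        log_prior = np.log(len(rows) / float(len(y)))
        model[label.item()] = (log_prior, np.log(word_counts / total))
    return model

# The six toy posts from loadDataSet(); the vocabulary is sorted for a stable column order.
posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = [0, 1, 0, 1, 0, 1]
vocab = sorted(set(word for post in posts for word in post))
matrix = [[1 if word in post else 0 for word in vocab] for post in posts]

model = train_nb(matrix, labels)
log_prior_1, log_probs_1 = model[1]
print(log_probs_1[vocab.index('stupid')])  # log(4/21), about -1.658
```

Because the labels are discovered with np.unique, the same function accepts three or more classes without further changes; classification then picks the label whose log prior plus summed log word probabilities is largest.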

# -*- coding:utf-8 -*-
#!python2
#Build the classifier and test it: which class do testEntry=['love','my','dalmation'] and testEntry=['stupid','garbage'] belong to?
import numpy as np
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  #vocabList.index() returns the position of word in vocabList; the result is a list containing only 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  #naive Bayes classifier training function
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Deom=2.0;p1Deom=2.0
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Deom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Deom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Deom)  #change to log
  p0vect=np.log(p0Num/p0Deom)  #change to log
  return p0vect,p1vect,pAbusive
def  classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
  p1=sum(vec2Classify*p1Vec)+np.log(pClass1)
  p0=sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
  if p1>p0:
    return 1
  else:
    return 0
def testingNB():
  listOPosts,listClasses=loadDataSet()
  myVocabList=createVocabList(listOPosts)
  trainMat=[]
  for postinDoc in listOPosts:
    trainMat.append(setOfWordseVec(myVocabList, postinDoc))
  p0V,p1V,pAb=trainNB0(np.array(trainMat),np.array(listClasses))
  print("p0V={0}".format(p0V))
  print("p1V={0}".format(p1V))
  print("pAb={0}".format(pAb))
  testEntry=['love','my','dalmation']
  thisDoc=np.array(setOfWordseVec(myVocabList, testEntry))
  print(thisDoc)
  print("vec2Classify*p0Vec={0}".format(thisDoc*p0V))
  print(testEntry,'classified as :',classifyNB(thisDoc, p0V, p1V, pAb))
  testEntry=['stupid','garbage']
  thisDoc=np.array(setOfWordseVec(myVocabList, testEntry))
  print(thisDoc)
  print(testEntry,'classified as :',classifyNB(thisDoc, p0V, p1V, pAb))
if __name__=='__main__':
  testingNB()

Output:

p0V=[-3.25809654 -2.56494936 -3.25809654 -3.25809654 -2.56494936 -2.56494936
 -3.25809654 -2.56494936 -2.56494936 -3.25809654 -2.56494936 -2.56494936
 -2.56494936 -2.56494936 -1.87180218 -2.56494936 -2.56494936 -2.56494936
 -2.56494936 -2.56494936 -2.56494936 -3.25809654 -3.25809654 -2.56494936
 -2.56494936 -3.25809654 -2.15948425 -2.56494936 -3.25809654 -2.56494936
 -3.25809654 -3.25809654]
p1V=[-2.35137526 -3.04452244 -1.94591015 -2.35137526 -1.94591015 -3.04452244
 -2.35137526 -3.04452244 -3.04452244 -1.65822808 -3.04452244 -3.04452244
 -2.35137526 -3.04452244 -3.04452244 -3.04452244 -3.04452244 -3.04452244
 -3.04452244 -3.04452244 -3.04452244 -2.35137526 -2.35137526 -3.04452244
 -3.04452244 -2.35137526 -2.35137526 -3.04452244 -2.35137526 -2.35137526
 -2.35137526 -2.35137526]
pAb=0.5
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
vec2Classify*p0Vec=[-0.         -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.         -0.
 -1.87180218 -0.         -0.         -2.56494936 -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -2.56494936 -0.         -0.         -0.         -0.        ]
['love', 'my', 'dalmation'] classified as : 0
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
['stupid', 'garbage'] classified as : 1
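
To see where these two decisions come from, the scores that classifyNB compares can be reproduced by hand from the word counts in the six training posts. The sketch below is illustrative only: the helper name log_score and the hard-coded counts are assumptions, with the counts read off the training posts (class 0 contains 24 words in total, class 1 contains 19, and the smoothing adds 1 to every count and 2 to every total).

```python
import math

def log_score(word_counts, class_total, prior):
    """Hypothetical helper: log prior plus the sum of smoothed log word probabilities."""
    score = math.log(prior)
    for count in word_counts:
        score += math.log((count + 1.0) / (class_total + 2.0))
    return score

# ['love', 'my', 'dalmation']: 1, 3 and 1 occurrences in class 0, none in class 1.
p0 = log_score([1, 3, 1], 24, 0.5)
p1 = log_score([0, 0, 0], 19, 0.5)
print(p0 > p1)  # True, so the entry is assigned to class 0

# ['stupid', 'garbage']: no occurrences in class 0, 3 and 1 occurrences in class 1.
q0 = log_score([0, 0], 24, 0.5)
q1 = log_score([3, 1], 19, 0.5)
print(q1 > q0)  # True, so the entry is assigned to class 1
```

Summing logarithms instead of multiplying raw probabilities gives the same ranking of the classes while avoiding floating-point underflow on longer documents.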

# -*- coding:utf-8 -*-
#! python2
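#Filtering spam email with naive Bayes
# 1. Collect the data: text files are provided
# 2. Prepare the data: parse the text files into token vectors
# 3. Analyze the data: inspect the tokens to make sure the parsing is correct
# 4. Train the algorithm: use the trainNB0() function built earlier
# 5. Test the algorithm: use classifyNB(), and build a new test function that computes the error rate over the document set
# 6. Use the algorithm: build a complete program that classifies a set of documents and prints the misclassified ones to the screen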
# import re
# mySent='this book is the best book on python or M.L. I hvae ever laid eyes upon.'
# print(mySent.split())
# regEx=re.compile('\\W+')
# print(regEx.split(mySent))
# emailText=open(path+"email\\ham\\6.txt").read()
import numpy as np
path=u"C:\\py\\3waterPyDemo\\src\\Demo\\Ch04\\"
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  #vocabList.index() returns the position of word in vocabList; the result is a list containing only 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  #naive Bayes classifier training function
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Deom=2.0;p1Deom=2.0
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Deom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Deom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Deom)  #change to log
  p0vect=np.log(p0Num/p0Deom)  #change to log
  return p0vect,p1vect,pAbusive
def  classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
  p1=sum(vec2Classify*p1Vec)+np.log(pClass1)
  p0=sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
  if p1>p0:
    return 1
  else:
    return 0
def textParse(bigString):
  import re
  listOfTokens=re.split(r'\W+',bigString)  #\W+ rather than \W*, so the pattern never matches an empty string
  return [tok.lower() for tok in listOfTokens if len(tok)>2]
def spamTest():
  docList=[];classList=[];fullText=[]
  for i in range(1,26):
    wordList=textParse(open(path+"email\\spam\\%d.txt"%i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList=textParse(open(path+"email\\ham\\%d.txt"%i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList=createVocabList(docList)
  trainingSet=list(range(50));testSet=[]  #a list, so that del works on Python 3 as well
  for i in range(10):
    randIndex=int(np.random.uniform(0,len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del (trainingSet[randIndex])
  trainMat=[];trainClasses=[]
  for  docIndex in trainingSet:
    trainMat.append(setOfWordseVec(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V,p1V,pSpam=trainNB0(np.array(trainMat),np.array(trainClasses))
  errorCount=0
  for  docIndex in testSet:
    wordVector=setOfWordseVec(vocabList, docList[docIndex])
    if classifyNB(np.array(wordVector), p0V, p1V, pSpam)!=classList[docIndex]:
      errorCount+=1
  print("the error rate is : {0}".format(float(errorCount)/len(testSet)))
if __name__=='__main__':
  spamTest()

Output (the test set is drawn at random, so the error rate can differ from run to run):

the error rate is : 0.0
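
The tokenizing step in textParse can be tried on its own, without the email files. The sketch below is a standalone illustration rather than the article's code: the name text_parse is an assumption, and the sample sentence is the one from the commented-out lines at the top of the listing. It splits on runs of non-word characters with \W+, because on Python 3.7 and later a pattern that can match the empty string, such as \W*, also splits between every character.

```python
import re

def text_parse(big_string):
    """Hypothetical stand-in for textParse: lowercase tokens longer than two characters."""
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]

my_sent = 'this book is the best book on python or M.L. I hvae ever laid eyes upon.'
print(text_parse(my_sent))
# ['this', 'book', 'the', 'best', 'book', 'python', 'hvae', 'ever', 'laid', 'eyes', 'upon']
```

Short tokens such as "is", "on" and the single letters left over from "M.L." are dropped by the length filter, which also removes the empty string that the split produces after the final period.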

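The Ch04 files that path points to can be downloaded from this site.

Note: the algorithm in this article comes from the book Machine Learning in Action (《机器学习实战》).

I hope this article is helpful for your Python programming.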
