A Classic Example of the Naive Bayes Algorithm Implemented in Python [Tested and Working]


Posted in Python on June 13, 2018

This article walks through a Python implementation of the naive Bayes algorithm, shared here for reference. The details are as follows:

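The code is based mainly on the book Machine Learning in Action. I have found that recent books by foreign authors really are better written than those by Chinese authors: they build up from the basics, and the code is easy to follow. Without further ado, here is the code: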

#encoding:utf-8
'''
Created on September 6, 2015
@author: ZHOUMEIXU204
Naive Bayes implementation walkthrough
'''
# The class labels in this algorithm are 1 and 0; supporting more labels needs only small changes to the code
import numpy as np
path=u"D:\\Users\\zhoumeixu204\\Desktop\\python语言机器学习\\机器学习实战代码  python\\机器学习实战代码\\machinelearninginaction\\Ch04\\"
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  # vocabList.index() returns the position of an element in vocabList; the result is a list containing only 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
print(len(myVocabList))
print(myVocabList)
print(setOfWordseVec(myVocabList, listOPosts[0]))
print(setOfWordseVec(myVocabList, listOPosts[3]))
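# The code above converts text into a vector: an entry is 1 if the word occurs in the document and 0 if it does not
def trainNB0(trainMatrix,trainCategory):  # function that trains the naive Bayes classifier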
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Deom=2.0;p1Deom=2.0
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Deom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Deom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Deom)  #change to log
  p0vect=np.log(p0Num/p0Deom)  #change to log
  return p0vect,p1vect,pAbusive
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
trainMat=[]
for postinDoc in listOPosts:
  trainMat.append(setOfWordseVec(myVocabList, postinDoc))
p0V,p1V,pAb=trainNB0(trainMat, listClasses)
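if __name__!='__main__':  # with this != guard the prints below are skipped when the file is run directly
  print("log-probabilities for class 0:")
  print (p0V)
  print("log-probabilities for class 1:")
  print (p1V)
  print("prior probability pAb:")
  print (pAb)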

Output:

32
['him', 'garbage', 'problems', 'take', 'steak', 'quit', 'so', 'is', 'cute', 'posting', 'dog', 'to', 'love', 'licks', 'dalmation', 'flea', 'I', 'please', 'maybe', 'buying', 'my', 'stupid', 'park', 'food', 'stop', 'has', 'ate', 'help', 'how', 'mr', 'worthless', 'not']
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
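The trainNB0 function above starts every word count at 1 and each denominator at 2.0, and it returns logarithms instead of raw probabilities. The sketch below illustrates the likely motivation with made-up numbers (they are not taken from this article): multiplying many small probabilities underflows to zero, while summing their logs stays finite, and a count that starts at zero gives a probability of exactly zero for any word not seen in a class.

```python
import numpy as np

# Hypothetical per-word probabilities for one class (illustrative values only).
probs = np.full(400, 1e-3)

# Multiplying 400 small numbers underflows to exactly 0.0 in float64.
product = np.prod(probs)

# Summing the logarithms keeps the same information in a usable range.
log_sum = np.sum(np.log(probs))

print(product)
print(log_sum)

# Smoothing: hypothetical word counts for one class, one word never seen.
raw_counts = np.array([3.0, 0.0, 1.0])

# Without smoothing the unseen word gets probability 0, whose log is -inf.
unsmoothed = raw_counts / raw_counts.sum()

# Same constants as trainNB0: counts start at 1, the denominator starts at 2.0.
# The result is not renormalised; it only needs to be non-zero for the log.
smoothed = (raw_counts + 1.0) / (raw_counts.sum() + 2.0)

print(unsmoothed)
print(np.log(smoothed))
```

Because the classifier only compares the two class scores, working in log space does not change which class wins.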

# -*- coding:utf-8 -*-
#!python2
# Build the classifier and test it: which class do testEntry=['love','my','dalmation'] and testEntry=['stupid','garbage'] belong to?
import numpy as np
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  # vocabList.index() returns the position of an element in vocabList; the result is a list containing only 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  # function that trains the naive Bayes classifier
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Deom=2.0;p1Deom=2.0
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Deom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Deom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Deom)  #change to log
  p0vect=np.log(p0Num/p0Deom)  #change to log
  return p0vect,p1vect,pAbusive
def  classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
  p1=sum(vec2Classify*p1Vec)+np.log(pClass1)
  p0=sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
  if p1>p0:
    return 1
  else:
    return 0
def testingNB():
  listOPosts,listClasses=loadDataSet()
  myVocabList=createVocabList(listOPosts)
  trainMat=[]
  for postinDoc in listOPosts:
    trainMat.append(setOfWordseVec(myVocabList, postinDoc))
  p0V,p1V,pAb=trainNB0(np.array(trainMat),np.array(listClasses))
  print("p0V={0}".format(p0V))
  print("p1V={0}".format(p1V))
  print("pAb={0}".format(pAb))
  testEntry=['love','my','dalmation']
  thisDoc=np.array(setOfWordseVec(myVocabList, testEntry))
  print(thisDoc)
  print("vec2Classify*p0Vec={0}".format(thisDoc*p0V))
  print(testEntry,'classified as :',classifyNB(thisDoc, p0V, p1V, pAb))
  testEntry=['stupid','garbage']
  thisDoc=np.array(setOfWordseVec(myVocabList, testEntry))
  print(thisDoc)
  print(testEntry,'classified as :',classifyNB(thisDoc, p0V, p1V, pAb))
if __name__=='__main__':
  testingNB()

Output:

p0V=[-3.25809654 -2.56494936 -3.25809654 -3.25809654 -2.56494936 -2.56494936
 -3.25809654 -2.56494936 -2.56494936 -3.25809654 -2.56494936 -2.56494936
 -2.56494936 -2.56494936 -1.87180218 -2.56494936 -2.56494936 -2.56494936
 -2.56494936 -2.56494936 -2.56494936 -3.25809654 -3.25809654 -2.56494936
 -2.56494936 -3.25809654 -2.15948425 -2.56494936 -3.25809654 -2.56494936
 -3.25809654 -3.25809654]
p1V=[-2.35137526 -3.04452244 -1.94591015 -2.35137526 -1.94591015 -3.04452244
 -2.35137526 -3.04452244 -3.04452244 -1.65822808 -3.04452244 -3.04452244
 -2.35137526 -3.04452244 -3.04452244 -3.04452244 -3.04452244 -3.04452244
 -3.04452244 -3.04452244 -3.04452244 -2.35137526 -2.35137526 -3.04452244
 -3.04452244 -2.35137526 -2.35137526 -3.04452244 -2.35137526 -2.35137526
 -2.35137526 -2.35137526]
pAb=0.5
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
vec2Classify*p0Vec=[-0.         -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.         -0.
 -1.87180218 -0.         -0.         -2.56494936 -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -2.56494936 -0.         -0.         -0.         -0.        ]
['love', 'my', 'dalmation'] classified as : 0
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
['stupid', 'garbage'] classified as : 1
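classifyNB sums the log-probabilities of the words present in a document, adds the log of the class prior, and returns the class with the larger score. The sketch below replays that rule on a three-word vocabulary with made-up probabilities (illustrative numbers, not the values printed above); the helper name `classify_log_scores` is hypothetical and not part of the article's code.

```python
import numpy as np

def classify_log_scores(vec, p0_log, p1_log, p_class1):
    """Return (score0, score1, label) using the same rule as classifyNB."""
    score1 = np.dot(vec, p1_log) + np.log(p_class1)
    score0 = np.dot(vec, p0_log) + np.log(1.0 - p_class1)
    return score0, score1, int(score1 > score0)

# Hypothetical vocabulary: ['love', 'stupid', 'dog']
p0_log = np.log(np.array([0.5, 0.1, 0.4]))  # word probabilities given class 0
p1_log = np.log(np.array([0.1, 0.6, 0.3]))  # word probabilities given class 1

doc_friendly = np.array([1, 0, 1])  # contains 'love' and 'dog'
doc_abusive = np.array([0, 1, 1])   # contains 'stupid' and 'dog'

s0, s1, label_friendly = classify_log_scores(doc_friendly, p0_log, p1_log, 0.5)
_, _, label_abusive = classify_log_scores(doc_abusive, p0_log, p1_log, 0.5)

print(label_friendly, label_abusive)
```

Exponentiating a score recovers the product of the word probabilities and the prior, so for the first document exp(s0) is 0.5 * 0.4 * 0.5 = 0.1.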

# -*- coding:utf-8 -*-
#! python2
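# Filter spam email with naive Bayes
# 1. Collect the data: text files are provided
# 2. Prepare the data: parse the text files into token vectors
# 3. Analyze the data: inspect the tokens to make sure parsing is correct
# 4. Train the algorithm: use the trainNB0() function built earlier
# 5. Test the algorithm: use classifyNB(), and write a new test function to compute the error rate over the document set
# 6. Use the algorithm: build a complete program that classifies a set of documents and prints the misclassified ones to the screen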
# import re
# mySent='this book is the best book on python or M.L. I have ever laid eyes upon.'
# print(mySent.split())
# regEx=re.compile('\\W*')
# print(regEx.split(mySent))
# emailText=open(path+"email\\ham\\6.txt").read()
import numpy as np
path=u"C:\\py\\3waterPyDemo\\src\\Demo\\Ch04\\"
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
  return postingList,classVec
def createVocabList(dataset):
  vocabSet=set([])
  for document in dataset:
    vocabSet=vocabSet|set(document)
  return list(vocabSet)
def setOfWordseVec(vocabList,inputSet):
  returnVec=[0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)]=1  # vocabList.index() returns the position of an element in vocabList; the result is a list containing only 0s and 1s
    else:
      print("the word :%s is not in my Vocabulary!"%word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  # function that trains the naive Bayes classifier
  numTrainDocs=len(trainMatrix)
  numWords=len(trainMatrix[0])
  pAbusive=sum(trainCategory)/float(numTrainDocs)
  p0Num=np.ones(numWords);p1Num=np.ones(numWords)
  p0Deom=2.0;p1Deom=2.0
  for i in range(numTrainDocs):
    if trainCategory[i]==1:
      p1Num+=trainMatrix[i]
      p1Deom+=sum(trainMatrix[i])
    else:
      p0Num+=trainMatrix[i]
      p0Deom+=sum(trainMatrix[i])
  p1vect=np.log(p1Num/p1Deom)  #change to log
  p0vect=np.log(p0Num/p0Deom)  #change to log
  return p0vect,p1vect,pAbusive
def  classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
  p1=sum(vec2Classify*p1Vec)+np.log(pClass1)
  p0=sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
  if p1>p0:
    return 1
  else:
    return 0
def textParse(bigString):
  import re
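  # split on runs of non-word characters; r'\W*' can match the empty string, which Python 3.7+ treats as a split point between every character
  listOfTokens=re.split(r'\W+',bigString)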
  return [tok.lower() for tok in listOfTokens if len(tok)>2]
def spamTest():
  docList=[];classList=[];fullText=[]
  for i in range(1,26):
    wordList=textParse(open(path+"email\\spam\\%d.txt"%i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList=textParse(open(path+"email\\ham\\%d.txt"%i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList=createVocabList(docList)
  trainingSet=list(range(50));testSet=[]  # a list, so that del works on Python 3 as well
  for i in range(10):
    randIndex=int(np.random.uniform(0,len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del (trainingSet[randIndex])
  trainMat=[];trainClasses=[]
  for  docIndex in trainingSet:
    trainMat.append(setOfWordseVec(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V,p1V,pSpam=trainNB0(np.array(trainMat),np.array(trainClasses))
  errorCount=0
  for  docIndex in testSet:
    wordVector=setOfWordseVec(vocabList, docList[docIndex])
    if classifyNB(np.array(wordVector), p0V, p1V, pSpam)!=classList[docIndex]:
      errorCount+=1
  print('the error rate is : {0}'.format(float(errorCount)/len(testSet)))
if __name__=='__main__':
  spamTest()

Output:

the error rate is : 0.0
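spamTest draws 10 of the 50 documents at random as the test set and trains on the remaining 40, so the error rate can change from run to run; 0.0 is the result of one particular split. The sketch below shows one way to produce the same kind of hold-out split with np.random.permutation, without deleting items from the index list one at a time. The helper name `hold_out_split` and its `seed` parameter are assumptions added for illustration and are not part of the article's code.

```python
import numpy as np

def hold_out_split(n_docs, n_test, seed=None):
    """Shuffle document indices and split them into train and test lists."""
    # seed is optional; pass a fixed value to reproduce a particular split
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_docs)
    test_idx = shuffled[:n_test].tolist()
    train_idx = shuffled[n_test:].tolist()
    return train_idx, test_idx

train_idx, test_idx = hold_out_split(50, 10, seed=0)
print(len(train_idx), len(test_idx))
```

Repeating the test over several different splits and averaging the error rates should give a steadier estimate than a single run.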

The Ch04 files referenced by path are available for download from this site.

Note: the algorithm in this article comes from the book Machine Learning in Action.

I hope this article is helpful for your Python programming.
