python实现朴素贝叶斯算法


Posted in Python onNovember 19, 2018

本代码实现了朴素贝叶斯分类器(假设了条件独立的版本),常用于垃圾邮件分类,进行了拉普拉斯平滑。

关于朴素贝叶斯算法原理可以参考博客中原理部分的博文。

#!/usr/bin/python
# -*- coding: utf-8 -*-
from math import log
from numpy import*
import operator
import matplotlib
import matplotlib.pyplot as plt
from os import listdir
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]
  return postingList,classVec
def createVocabList(dataSet):
  vocabSet = set([]) #create empty set
  for document in dataSet:
    vocabSet = vocabSet | set(document) #union of the two sets
  return list(vocabSet)
 
def setOfWords2Vec(vocabList, inputSet):
  returnVec = [0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] = 1
    else: print "the word: %s is not in my Vocabulary!" % word
  return returnVec
def trainNB0(trainMatrix,trainCategory):  #训练模型
  numTrainDocs = len(trainMatrix)
  numWords = len(trainMatrix[0])
  pAbusive = sum(trainCategory)/float(numTrainDocs)
  p0Num = ones(numWords); p1Num = ones(numWords)  #拉普拉斯平滑
  p0Denom = 0.0+2.0; p1Denom = 0.0 +2.0      #拉普拉斯平滑
  for i in range(numTrainDocs):
    if trainCategory[i] == 1:
      p1Num += trainMatrix[i]
      p1Denom += sum(trainMatrix[i])
    else:
      p0Num += trainMatrix[i]
      p0Denom += sum(trainMatrix[i])
  p1Vect = log(p1Num/p1Denom)    #用log()是为了避免概率乘积时浮点数下溢
  p0Vect = log(p0Num/p0Denom)
  return p0Vect,p1Vect,pAbusive
 
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
  p1 = sum(vec2Classify * p1Vec) + log(pClass1)
  p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
  if p1 > p0:
    return 1
  else:
    return 0
 
def bagOfWords2VecMN(vocabList, inputSet):
  returnVec = [0] * len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] += 1
  return returnVec
 
def testingNB():  #测试训练结果
  listOPosts, listClasses = loadDataSet()
  myVocabList = createVocabList(listOPosts)
  trainMat = []
  for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
  p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
  testEntry = ['love', 'my', 'dalmation']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
  testEntry = ['stupid', 'garbage']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
 
def textParse(bigString): # 长字符转转单词列表
  import re
  listOfTokens = re.split(r'\W*', bigString)
  return [tok.lower() for tok in listOfTokens if len(tok) > 2]
 
def spamTest():  #测试垃圾文件 需要数据
  docList = [];
  classList = [];
  fullText = []
  for i in range(1, 26):
    wordList = textParse(open('email/spam/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList = textParse(open('email/ham/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList = createVocabList(docList) 
  trainingSet = range(50);
  testSet = [] 
  for i in range(10):
    randIndex = int(random.uniform(0, len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del (trainingSet[randIndex])
  trainMat = [];
  trainClasses = []
  for docIndex in trainingSet: 
    trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
  errorCount = 0
  for docIndex in testSet: 
    wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
    if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
      errorCount += 1
      print "classification error", docList[docIndex]
  print 'the error rate is: ', float(errorCount) / len(testSet)
 
 
 
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
print myVocabList,'\n'
# print setOfWords2Vec(myVocabList,listOPosts[0]),'\n'
trainMat=[]
for postinDoc in listOPosts:
  trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
print trainMat
p0V,p1V,pAb=trainNB0(trainMat,listClasses)
print pAb
print p0V,'\n',p1V
testingNB()

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
编写Python的web框架中的Model的教程
Apr 29 Python
Python3安装Pymongo详细步骤
May 26 Python
使用python和pygame绘制繁花曲线的方法
Feb 24 Python
python如何派生内置不可变类型并修改实例化行为
Mar 21 Python
Python基于递归和非递归算法求两个数最大公约数、最小公倍数示例
May 21 Python
解决.ui文件生成的.py文件运行不出现界面的方法
Jun 19 Python
Apache,wsgi,django 程序部署配置方法详解
Jul 01 Python
pygame实现俄罗斯方块游戏(基础篇3)
Oct 29 Python
python 实现矩阵按对角线打印
Nov 29 Python
python 项目目录结构设置
Feb 14 Python
如何使用PyCharm引入需要使用的包的方法
Sep 22 Python
分享3个非常实用的 Python 模块
Mar 03 Python
朴素贝叶斯Python实例及解析
Nov 19 #Python
python版大富翁源代码分享
Nov 19 #Python
python获取微信小程序手机号并绑定遇到的坑
Nov 19 #Python
python实现推箱子游戏
Mar 25 #Python
详解python中的Turtle函数库
Nov 19 #Python
python绘制简单彩虹图
Nov 19 #Python
python微信好友数据分析详解
Nov 19 #Python
You might like
php 无限级分类,超级简单的无限级分类,支持输出树状图
2014/06/29 PHP
使用PHP编写发红包程序
2015/07/22 PHP
Ext面向对象开发实践(续)
2008/11/18 Javascript
javascript实现div浮动在网页最顶上并带关闭按钮效果实例
2013/08/13 Javascript
jquery实现移动端点击图片查看大图特效
2020/09/11 Javascript
js实现的鼠标滚轮滚动切换页面效果(类似360默认页面滚动切换效果)
2016/01/27 Javascript
使用JQuery中的trim()方法去掉前后空格
2016/09/16 Javascript
node.js 抓取代理ip实例代码
2017/04/30 Javascript
关于jQuery库冲突的完美解决办法
2017/05/20 jQuery
基于layer.js实现收货地址弹框选择然后返回相应的地址信息
2017/05/26 Javascript
基于node.js之调试器详解
2017/08/22 Javascript
超详细动手搭建一个VuePress 站点及开启PWA与自动部署的方法
2019/01/27 Javascript
JavaScript 实现继承的几种方式
2021/02/19 Javascript
js闭包和垃圾回收机制示例详解
2021/03/01 Javascript
[57:22]完美世界DOTA2联赛PWL S2 FTD vs PXG 第二场 11.27
2020/12/01 DOTA
python 中文字符串的处理实现代码
2009/10/25 Python
python实现划词翻译
2020/04/23 Python
Python使用openpyxl读写excel文件的方法
2017/06/30 Python
基于python绘制科赫雪花
2018/06/22 Python
numpy向空的二维数组中添加元素的方法
2018/11/01 Python
Python设计模式之状态模式原理与用法详解
2019/01/15 Python
django将网络中的图片,保存成model中的ImageField的实例
2019/08/07 Python
Python解释器及PyCharm工具安装过程
2020/02/26 Python
Python+Opencv身份证号码区域提取及识别实现
2020/08/25 Python
Python 操作SQLite数据库的示例
2020/10/16 Python
泰国时尚电商:POMELO Fashion
2020/03/11 全球购物
建筑工程自我鉴定
2013/10/18 职场文书
中文专业毕业生自荐信
2013/10/28 职场文书
个人应聘自我评价分享
2013/11/18 职场文书
大学生简历的个人自我评价
2013/12/04 职场文书
市场开发与营销专业求职信
2013/12/31 职场文书
售房协议书
2014/08/19 职场文书
传承焦裕禄精神思想汇报2014
2014/09/10 职场文书
企业介绍信范文
2015/01/30 职场文书
签字仪式主持词
2015/07/03 职场文书
详细了解MVC+proxy
2021/07/09 Java/Android