Implementing the Naive Bayes Algorithm in Python


Posted in Python on November 19, 2018

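This code implements a naive Bayes classifier (the version that assumes conditional independence between features), a model commonly used for spam classification. Laplace smoothing is applied to the probability estimates.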
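
As a minimal sketch of what Laplace smoothing does, the snippet below compares raw and smoothed estimates of P(word | class) on toy counts. The helper name `smoothed_probs`, the `alpha` parameter, and the counts are assumptions made for illustration. It uses the textbook add-alpha form, which adds the vocabulary size to the denominator, whereas the listing in this post starts its denominators at 2.0, so it shows the general idea and not that exact estimate.

```python
import numpy as np

def smoothed_probs(word_counts, alpha=1.0):
    """Hypothetical helper: add-alpha (Laplace) estimate of P(word | class).

    Adds alpha to every word count and alpha * vocabulary_size to the
    denominator, so the estimates are all positive and still sum to 1.
    """
    counts = np.asarray(word_counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * counts.size)

counts = [3, 0, 1]  # toy counts for a 3-word vocabulary (made up)
unsmoothed = np.array(counts) / float(sum(counts))
smoothed = smoothed_probs(counts)

print(unsmoothed)  # the unseen word gets probability zero
print(smoothed)    # every word gets a small positive probability
```

Without smoothing, a single unseen word would force the product of word probabilities for that class to zero, and its logarithm would be undefined.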

For the theory behind the naive Bayes algorithm, see the posts in the theory section of this blog.
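
For quick reference, the decision rule that the `classifyNB` function in the listing evaluates, together with the smoothed estimate computed by `trainNB0`, can be written as follows. The notation is chosen here for illustration: $x_i$ is the $i$-th entry of the word vector, $N_{i,c}$ is the summed count of word $i$ over the training documents of class $c$, and $N_c$ is the summed count of all words in class $c$.

```latex
\hat{c} = \arg\max_{c \in \{0,1\}} \Big[ \log P(c) + \sum_{i} x_i \log P(w_i \mid c) \Big],
\qquad
P(w_i \mid c) \approx \frac{1 + N_{i,c}}{2 + N_c}
```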

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Explicit NumPy imports; log and sum are the array-aware NumPy versions.
from numpy import array, log, ones, random, sum
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  # 1 = abusive post, 0 = not abusive
  return postingList,classVec
def createVocabList(dataSet):
  vocabSet = set() # start with an empty set
  for document in dataSet:
    vocabSet = vocabSet | set(document) #union of the two sets
  return list(vocabSet)
 
def setOfWords2Vec(vocabList, inputSet):
  returnVec = [0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] = 1
    else:
      print("the word: %s is not in my Vocabulary!" % word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  # train the model
  numTrainDocs = len(trainMatrix)
  numWords = len(trainMatrix[0])
  pAbusive = sum(trainCategory)/float(numTrainDocs)
  p0Num = ones(numWords)  # Laplace smoothing: start every word count at 1
  p1Num = ones(numWords)
  p0Denom = 2.0           # Laplace smoothing: start both denominators at 2
  p1Denom = 2.0
  for i in range(numTrainDocs):
    if trainCategory[i] == 1:
      p1Num += trainMatrix[i]
      p1Denom += sum(trainMatrix[i])
    else:
      p0Num += trainMatrix[i]
      p0Denom += sum(trainMatrix[i])
  p1Vect = log(p1Num/p1Denom)    # log() avoids floating-point underflow when many probabilities are multiplied
  p0Vect = log(p0Num/p0Denom)
  return p0Vect,p1Vect,pAbusive
 
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
  p1 = sum(vec2Classify * p1Vec) + log(pClass1)
  p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
  if p1 > p0:
    return 1
  else:
    return 0
 
def bagOfWords2VecMN(vocabList, inputSet):
  returnVec = [0] * len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] += 1
  return returnVec
 
def testingNB():  # test the trained model
  listOPosts, listClasses = loadDataSet()
  myVocabList = createVocabList(listOPosts)
  trainMat = []
  for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
  p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
  testEntry = ['love', 'my', 'dalmation']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print("%s classified as: %s" % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
  testEntry = ['stupid', 'garbage']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print("%s classified as: %s" % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
 
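def textParse(bigString): # split a long string into a list of lowercase words
  import re
  # Use \W+ rather than \W*: \W* also matches the empty string, which makes
  # re.split break the text into single characters on Python 3.7 and later.
  listOfTokens = re.split(r'\W+', bigString)
  return [tok.lower() for tok in listOfTokens if len(tok) > 2]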
 
def spamTest():  # test on spam e-mails; needs the data files under email/spam and email/ham
  docList = []
  classList = []
  fullText = []
  for i in range(1, 26):
    wordList = textParse(open('email/spam/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)  # 1 = spam
    wordList = textParse(open('email/ham/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)  # 0 = ham
  vocabList = createVocabList(docList)
  trainingSet = list(range(50))  # list() so that entries can be deleted on Python 3
  testSet = []
  for i in range(10):  # hold out 10 randomly chosen documents as the test set
    randIndex = int(random.uniform(0, len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del trainingSet[randIndex]
  trainMat = []
  trainClasses = []
  for docIndex in trainingSet:
    trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
  errorCount = 0
  for docIndex in testSet:
    wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
    if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
      errorCount += 1
      print("classification error %s" % (docList[docIndex],))
  print("the error rate is: %s" % (float(errorCount) / len(testSet)))
 
 
 
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
print(myVocabList)
# print(setOfWords2Vec(myVocabList,listOPosts[0]))
trainMat=[]
for postinDoc in listOPosts:
  trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
print(trainMat)
p0V,p1V,pAb=trainNB0(array(trainMat),array(listClasses))
print(pAb)
print(p0V)
print(p1V)
testingNB()
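
The comment in `trainNB0` says that `log()` is used to avoid floating-point underflow. A small sketch of that problem, using made-up probabilities (the value 0.01 and the count of 1000 are illustrative assumptions, not taken from the data above):

```python
import numpy as np

# 1000 made-up per-word probabilities, each equal to 0.01 (illustrative only)
probs = np.full(1000, 0.01)

# Multiplying them directly gives 0.01 ** 1000, far below the smallest
# positive float64, so the result underflows to zero.
direct_product = np.prod(probs)

# Summing the logarithms represents the same quantity as a finite number.
log_sum = np.sum(np.log(probs))

print(direct_product)
print(log_sum)
```

Because the logarithm is monotonic, comparing the summed log-probabilities of the two classes picks the same class as comparing the products would, which is why `classifyNB` can work entirely in log space.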

That is all for this article. I hope it helps with your studies, and I hope you will continue to support 三水点靠木.
