python实现朴素贝叶斯算法


Posted in Python onNovember 19, 2018

本代码实现了朴素贝叶斯分类器(假设了条件独立的版本),常用于垃圾邮件分类,进行了拉普拉斯平滑。

关于朴素贝叶斯算法原理可以参考博客中原理部分的博文。

#!/usr/bin/python
# -*- coding: utf-8 -*-
from math import log
from numpy import*
import operator
import matplotlib
import matplotlib.pyplot as plt
from os import listdir
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]
  return postingList,classVec
def createVocabList(dataSet):
  vocabSet = set([]) #create empty set
  for document in dataSet:
    vocabSet = vocabSet | set(document) #union of the two sets
  return list(vocabSet)
 
def setOfWords2Vec(vocabList, inputSet):
  returnVec = [0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] = 1
    else: print "the word: %s is not in my Vocabulary!" % word
  return returnVec
def trainNB0(trainMatrix,trainCategory):  #训练模型
  numTrainDocs = len(trainMatrix)
  numWords = len(trainMatrix[0])
  pAbusive = sum(trainCategory)/float(numTrainDocs)
  p0Num = ones(numWords); p1Num = ones(numWords)  #拉普拉斯平滑
  p0Denom = 0.0+2.0; p1Denom = 0.0 +2.0      #拉普拉斯平滑
  for i in range(numTrainDocs):
    if trainCategory[i] == 1:
      p1Num += trainMatrix[i]
      p1Denom += sum(trainMatrix[i])
    else:
      p0Num += trainMatrix[i]
      p0Denom += sum(trainMatrix[i])
  p1Vect = log(p1Num/p1Denom)    #用log()是为了避免概率乘积时浮点数下溢
  p0Vect = log(p0Num/p0Denom)
  return p0Vect,p1Vect,pAbusive
 
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
  p1 = sum(vec2Classify * p1Vec) + log(pClass1)
  p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
  if p1 > p0:
    return 1
  else:
    return 0
 
def bagOfWords2VecMN(vocabList, inputSet):
  returnVec = [0] * len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] += 1
  return returnVec
 
def testingNB():  #测试训练结果
  listOPosts, listClasses = loadDataSet()
  myVocabList = createVocabList(listOPosts)
  trainMat = []
  for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
  p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
  testEntry = ['love', 'my', 'dalmation']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
  testEntry = ['stupid', 'garbage']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
 
def textParse(bigString): # 长字符转转单词列表
  import re
  listOfTokens = re.split(r'\W*', bigString)
  return [tok.lower() for tok in listOfTokens if len(tok) > 2]
 
def spamTest():  #测试垃圾文件 需要数据
  docList = [];
  classList = [];
  fullText = []
  for i in range(1, 26):
    wordList = textParse(open('email/spam/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList = textParse(open('email/ham/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList = createVocabList(docList) 
  trainingSet = range(50);
  testSet = [] 
  for i in range(10):
    randIndex = int(random.uniform(0, len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del (trainingSet[randIndex])
  trainMat = [];
  trainClasses = []
  for docIndex in trainingSet: 
    trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
  errorCount = 0
  for docIndex in testSet: 
    wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
    if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
      errorCount += 1
      print "classification error", docList[docIndex]
  print 'the error rate is: ', float(errorCount) / len(testSet)
 
 
 
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
print myVocabList,'\n'
# print setOfWords2Vec(myVocabList,listOPosts[0]),'\n'
trainMat=[]
for postinDoc in listOPosts:
  trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
print trainMat
p0V,p1V,pAb=trainNB0(trainMat,listClasses)
print pAb
print p0V,'\n',p1V
testingNB()

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
python使用scrapy解析js示例
Jan 23 Python
详解使用pymysql在python中对mysql的增删改查操作(综合)
Jan 18 Python
详解Python自建logging模块
Jan 29 Python
python操作excel的方法(xlsxwriter包的使用)
Jun 11 Python
Pandas读取并修改excel的示例代码
Feb 17 Python
django框架使用orm实现批量更新数据的方法
Jun 21 Python
浅谈Django中view对数据库的调用方法
Jul 18 Python
pygame实现成语填空游戏
Oct 29 Python
Python3 解决读取中文文件txt编码的问题
Dec 20 Python
在python中logger setlevel没有生效的解决
Feb 21 Python
Python3.7在anaconda里面使用IDLE编译器的步骤详解
Apr 29 Python
如何在keras中添加自己的优化器(如adam等)
Jun 19 Python
朴素贝叶斯Python实例及解析
Nov 19 #Python
python版大富翁源代码分享
Nov 19 #Python
python获取微信小程序手机号并绑定遇到的坑
Nov 19 #Python
python实现推箱子游戏
Mar 25 #Python
详解python中的Turtle函数库
Nov 19 #Python
python绘制简单彩虹图
Nov 19 #Python
python微信好友数据分析详解
Nov 19 #Python
You might like
php实现在站点里面添加邮件发送的功能
2020/04/28 PHP
Yii2下点击验证码的切换实例代码
2017/03/14 PHP
PHP cookie与session会话基本用法实例分析
2019/11/18 PHP
javascript函数中参数传递问题示例探讨
2014/07/31 Javascript
JavaScript中的ArrayBuffer详细介绍
2014/12/08 Javascript
Bootstrap模态框(modal)垂直居中的实例代码
2016/08/18 Javascript
jQuery动态创建元素以及追加节点的实现方法
2016/10/20 Javascript
jquery validation验证表单插件
2017/01/07 Javascript
canvas滤镜效果实现代码
2017/02/06 Javascript
javascript 数据存储的常用函数总结
2017/06/01 Javascript
angular4 如何在全局设置路由跳转动画的方法
2017/08/30 Javascript
原生js实现简单的模态框示例
2017/09/08 Javascript
详解Vue项目中实现锚点定位
2019/04/24 Javascript
javascript实现对话框功能警告(alert 消息对话框)确认(confirm 消息对话框)
2019/05/07 Javascript
[51:29]Alliance vs TNC 2019国际邀请赛小组赛 BO2 第二场 8.16
2019/08/18 DOTA
matplotlib subplots 调整子图间矩的实例
2018/05/25 Python
Python 转换文本编码实现解析
2019/08/27 Python
Python偏函数Partial function使用方法实例详解
2020/06/17 Python
python 使用递归的方式实现语义图片分割功能
2020/07/16 Python
matplotlib bar()实现百分比堆积柱状图
2021/02/24 Python
24个canvas基础知识小结
2014/12/17 HTML / CSS
新西兰杂志订阅:isubscribe
2019/08/26 全球购物
德国前卫设计师时装在线商店:Luxury Loft
2019/11/04 全球购物
骨干教师培训制度
2014/01/13 职场文书
服装行业创业计划书范文
2014/02/05 职场文书
四年级评语大全
2014/04/21 职场文书
2014年社区教育工作总结
2014/12/02 职场文书
事业单位财务人员岗位职责
2015/04/14 职场文书
医院见习总结
2015/06/24 职场文书
2015年七夕情人节感言
2015/08/03 职场文书
小学校园广播稿
2015/08/18 职场文书
初中历史教学反思
2016/02/19 职场文书
《月球之谜》教学反思
2016/02/20 职场文书
python简单验证码识别的实现过程
2021/06/20 Python
为了顺利买到演唱会的票用Python制作了自动抢票的脚本
2021/10/16 Python
利用Java连接Hadoop进行编程
2022/06/28 Java/Android