Implementing the Naive Bayes Algorithm in Python


Posted in Python on November 19, 2018

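This code implements a naive Bayes classifier (the version that assumes the features are conditionally independent given the class), a model commonly used for spam filtering. Laplace smoothing is applied to the word counts.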
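
To make the smoothing idea concrete, here is a minimal sketch that is separate from the article's listing. The helper `smoothed_probs` and the toy counts are hypothetical, chosen only for illustration, and the sketch uses the textbook add-one form (count + 1) / (total + vocabulary size). It shows how a word that never appeared in a class gets probability zero without smoothing, which would wipe out the whole product of per-word probabilities.

```python
def smoothed_probs(counts, alpha=1.0):
    """Add-alpha estimate of P(word | class); alpha=1 is Laplace smoothing.

    counts: dict mapping every vocabulary word to its count in one class.
    Hypothetical helper, for illustration only.
    """
    total = sum(counts.values())
    vocab_size = len(counts)
    return {w: (c + alpha) / (total + alpha * vocab_size)
            for w, c in counts.items()}


# Toy counts for one class; 'garbage' was never seen in this class.
toy_counts = {'dog': 3, 'stupid': 1, 'garbage': 0}

total = sum(toy_counts.values())
unsmoothed = {w: c / total for w, c in toy_counts.items()}
smoothed = smoothed_probs(toy_counts)

print(unsmoothed['garbage'])  # zero: one unseen word zeroes the product
print(smoothed['garbage'])    # (0 + 1) / (4 + 3), small but non-zero
```

Note that `trainNB0` in the listing starts its denominators at 2.0 rather than at the vocabulary size, so its estimates are normalised differently from this sketch. The classifier only compares the two class scores, so it still produces a decision.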

For the theory behind the naive Bayes algorithm, see the posts in the theory section of this blog.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import random
import re

import numpy as np


def loadDataSet():
    """Return six tokenized toy posts and their labels (1 = abusive, 0 = normal)."""
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec


def createVocabList(dataSet):
    """Build a list of all unique words that appear in the documents."""
    vocabSet = set()  # start with an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)


def setOfWords2Vec(vocabList, inputSet):
    """Set-of-words model: 1 if the vocabulary word occurs in the document, else 0."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec


def trainNB0(trainMatrix, trainCategory):
    """Train the model: return log P(word|class 0), log P(word|class 1), P(class 1)."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = np.sum(trainCategory) / float(numTrainDocs)
    # Laplace smoothing: every word count starts at 1 ...
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    # ... and the denominators start at 2.0, as in the original code
    # (textbook multinomial smoothing would start them at numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += np.sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += np.sum(trainMatrix[i])
    # Take logs so that products of probabilities become sums
    # and do not underflow in floating point
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Return the class (0 or 1) with the larger log score."""
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0


def bagOfWords2VecMN(vocabList, inputSet):
    """Bag-of-words model: count how many times each vocabulary word occurs."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec


def testingNB():
    """Train on the toy posts and classify two test entries."""
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))


def textParse(bigString):
    """Split a long string into a list of lowercase words longer than 2 characters."""
    # r'\W+' splits on runs of non-word characters; the original r'\W*' also
    # matches the empty string, which on Python 3.7+ splits between every character
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]


def spamTest():
    """Spam test; requires the data files email/spam/1..25.txt and email/ham/1..25.txt."""
    docList = []
    classList = []
    for i in range(1, 26):
        # errors='ignore' drops any bytes that are not valid UTF-8
        with open('email/spam/%d.txt' % i, encoding='utf-8', errors='ignore') as f:
            docList.append(textParse(f.read()))
        classList.append(1)
        with open('email/ham/%d.txt' % i, encoding='utf-8', errors='ignore') as f:
            docList.append(textParse(f.read()))
        classList.append(0)
    vocabList = createVocabList(docList)
    trainingSet = list(range(50))  # a list, so that items can be deleted
    testSet = []
    for i in range(10):  # hold out 10 randomly chosen documents for testing
        randIndex = random.randrange(len(trainingSet))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount) / len(testSet))


if __name__ == '__main__':
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    print(myVocabList, '\n')
    # print(setOfWords2Vec(myVocabList, listOPosts[0]), '\n')
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    print(trainMat)
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    print(pAb)
    print(p0V, '\n', p1V)
    testingNB()
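
The comment on the `log()` calls in `trainNB0` says that logarithms are used to avoid floating-point underflow when probabilities are multiplied. The following small sketch illustrates that point with made-up numbers (400 hypothetical word probabilities of 0.01 each, not taken from the article's data): the direct product collapses to zero, while the sum of logs stays finite and can still be compared between classes.

```python
import math

# Hypothetical per-word probabilities: 400 words, each with probability 0.01.
probs = [0.01] * 400

# Direct product: the true value is 1e-800, far below the smallest positive
# double (about 5e-324), so the running product underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same information, kept in a representable range.
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # 400 * ln(0.01), roughly -1842.07
```

Because the logarithm is monotonically increasing, comparing the two log scores in `classifyNB` picks the same class that comparing the raw products would pick if they could be represented.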

That is all for this article. I hope it helps with your studies, and I hope you will continue to support 三水点靠木.
