python实现朴素贝叶斯算法


Posted in Python onNovember 19, 2018

本代码实现了朴素贝叶斯分类器(假设了条件独立的版本),常用于垃圾邮件分类,进行了拉普拉斯平滑。

关于朴素贝叶斯算法原理可以参考博客中原理部分的博文。

#!/usr/bin/python
# -*- coding: utf-8 -*-
from math import log
from numpy import*
import operator
import matplotlib
import matplotlib.pyplot as plt
from os import listdir
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]
  return postingList,classVec
def createVocabList(dataSet):
  vocabSet = set([]) #create empty set
  for document in dataSet:
    vocabSet = vocabSet | set(document) #union of the two sets
  return list(vocabSet)
 
def setOfWords2Vec(vocabList, inputSet):
  returnVec = [0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] = 1
    else: print "the word: %s is not in my Vocabulary!" % word
  return returnVec
def trainNB0(trainMatrix,trainCategory):  #训练模型
  numTrainDocs = len(trainMatrix)
  numWords = len(trainMatrix[0])
  pAbusive = sum(trainCategory)/float(numTrainDocs)
  p0Num = ones(numWords); p1Num = ones(numWords)  #拉普拉斯平滑
  p0Denom = 0.0+2.0; p1Denom = 0.0 +2.0      #拉普拉斯平滑
  for i in range(numTrainDocs):
    if trainCategory[i] == 1:
      p1Num += trainMatrix[i]
      p1Denom += sum(trainMatrix[i])
    else:
      p0Num += trainMatrix[i]
      p0Denom += sum(trainMatrix[i])
  p1Vect = log(p1Num/p1Denom)    #用log()是为了避免概率乘积时浮点数下溢
  p0Vect = log(p0Num/p0Denom)
  return p0Vect,p1Vect,pAbusive
 
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
  p1 = sum(vec2Classify * p1Vec) + log(pClass1)
  p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
  if p1 > p0:
    return 1
  else:
    return 0
 
def bagOfWords2VecMN(vocabList, inputSet):
  returnVec = [0] * len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] += 1
  return returnVec
 
def testingNB():  #测试训练结果
  listOPosts, listClasses = loadDataSet()
  myVocabList = createVocabList(listOPosts)
  trainMat = []
  for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
  p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
  testEntry = ['love', 'my', 'dalmation']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
  testEntry = ['stupid', 'garbage']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
 
def textParse(bigString): # 长字符转转单词列表
  import re
  listOfTokens = re.split(r'\W*', bigString)
  return [tok.lower() for tok in listOfTokens if len(tok) > 2]
 
def spamTest():  #测试垃圾文件 需要数据
  docList = [];
  classList = [];
  fullText = []
  for i in range(1, 26):
    wordList = textParse(open('email/spam/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList = textParse(open('email/ham/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList = createVocabList(docList) 
  trainingSet = range(50);
  testSet = [] 
  for i in range(10):
    randIndex = int(random.uniform(0, len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del (trainingSet[randIndex])
  trainMat = [];
  trainClasses = []
  for docIndex in trainingSet: 
    trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
  errorCount = 0
  for docIndex in testSet: 
    wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
    if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
      errorCount += 1
      print "classification error", docList[docIndex]
  print 'the error rate is: ', float(errorCount) / len(testSet)
 
 
 
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
print myVocabList,'\n'
# print setOfWords2Vec(myVocabList,listOPosts[0]),'\n'
trainMat=[]
for postinDoc in listOPosts:
  trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
print trainMat
p0V,p1V,pAb=trainNB0(trainMat,listClasses)
print pAb
print p0V,'\n',p1V
testingNB()

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
python学习之编写查询ip程序
Feb 27 Python
从局部变量和全局变量开始全面解析Python中变量的作用域
Jun 16 Python
Django自定义分页效果
Jun 27 Python
python交互式图形编程实例(二)
Nov 17 Python
Python实现获取本地及远程图片大小的方法示例
Jul 21 Python
不到20行代码用Python做一个智能聊天机器人
Apr 19 Python
如何不用安装python就能在.NET里调用Python库
Jul 12 Python
简单了解Django应用app及分布式路由
Jul 24 Python
python GUI库图形界面开发之PyQt5窗口控件QWidget详细使用方法
Feb 26 Python
Python基于pandas绘制散点图矩阵代码实例
Jun 04 Python
Python字符串split及rsplit方法原理详解
Jun 29 Python
Python OpenCV中的numpy与图像类型转换操作
Dec 11 Python
朴素贝叶斯Python实例及解析
Nov 19 #Python
python版大富翁源代码分享
Nov 19 #Python
python获取微信小程序手机号并绑定遇到的坑
Nov 19 #Python
python实现推箱子游戏
Mar 25 #Python
详解python中的Turtle函数库
Nov 19 #Python
python绘制简单彩虹图
Nov 19 #Python
python微信好友数据分析详解
Nov 19 #Python
You might like
一个简单的PHP投票程序源码
2007/03/11 PHP
php array_unique之后json_encode需要注意
2011/01/02 PHP
PHP对MongoDB[NoSQL]数据库的操作
2013/03/01 PHP
ThinkPHP3.2.3数据库设置新特性
2015/03/05 PHP
php文件缓存类用法实例分析
2015/04/22 PHP
表单提交错误后返回内容消失问题的解决方法(PHP网站)
2015/10/20 PHP
php生成酷炫的四个字符验证码
2016/04/22 PHP
Thinkphp 中 distinct 的用法解析
2016/12/14 PHP
laravel框架中路由设置,路由参数和路由命名实例分析
2019/11/23 PHP
node.js中的fs.writeFile方法使用说明
2014/12/14 Javascript
JS或jQuery获取ASP.NET服务器控件ID的方法
2015/06/08 Javascript
浅谈js的url解析函数封装
2016/06/28 Javascript
Bootstrap modal使用及点击外部不消失的解决方法
2016/12/13 Javascript
vue 2.0封装model组件的方法
2017/08/03 Javascript
angularJS开发注意事项
2018/05/26 Javascript
Express之托管静态文件的方法
2018/06/01 Javascript
Vue项目中配置pug解析支持
2019/05/10 Javascript
JS前端知识点offset,scroll,client,冒泡,事件对象的应用整理总结
2019/06/27 Javascript
[01:00:04]DOTA2上海特级锦标赛B组小组赛#1 Alliance VS Spirit第二局
2016/02/26 DOTA
[55:56]NB vs Infamous 2019国际邀请赛淘汰赛 败者组 BO3 第二场 8.22
2019/09/05 DOTA
python字符串替换示例
2014/04/24 Python
初学Python函数的笔记整理
2015/04/07 Python
windows下添加Python环境变量的方法汇总
2018/05/14 Python
详解numpy矩阵的创建与数据类型
2019/10/18 Python
python利用opencv实现SIFT特征提取与匹配
2020/03/05 Python
Python文字截图识别OCR工具实例解析
2020/03/05 Python
python 爬取免费简历模板网站的示例
2020/09/27 Python
英国汽车座椅和婴儿车购物网站:Uber Kids
2017/04/19 全球购物
请编写一个 C 函数,该函数在给定的内存区域搜索给定的字符,并返回该字符所在位置索引值
2014/09/15 面试题
毕业学生推荐信
2013/12/01 职场文书
一名毕业生的自我鉴定
2013/12/04 职场文书
指导教师评语
2014/04/26 职场文书
中华在我心中演讲稿
2014/09/13 职场文书
超市督导岗位职责
2015/04/10 职场文书
红色经典观后感
2015/06/18 职场文书
pytorch训练神经网络爆内存的解决方案
2021/05/22 Python