Implementing the Naive Bayes Algorithm in Python


Posted in Python on November 19, 2018

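This code implements a naive Bayes classifier (the version that assumes conditional independence between features), which is commonly used for spam filtering. Laplace smoothing is applied.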
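To make the Laplace smoothing step concrete, here is a minimal sketch that is separate from the article's code. The function `smoothed_probs`, its parameter `alpha`, and the word counts are all made up for illustration. It shows how adding a constant to every count removes zero probabilities while keeping the estimates normalized.

```python
import numpy as np

def smoothed_probs(word_counts, alpha=1.0):
    """Add-alpha (Laplace) estimate of P(word | class) from raw word counts.

    `smoothed_probs`, `word_counts`, and `alpha` are illustrative names,
    not part of the article's code.
    """
    word_counts = np.asarray(word_counts, dtype=float)
    vocab_size = word_counts.shape[0]
    # Add alpha to every count and alpha * vocab_size to the total,
    # so the estimates still sum to 1 and none of them is zero.
    return (word_counts + alpha) / (word_counts.sum() + alpha * vocab_size)

# Made-up counts for a 4-word vocabulary; the last two words never occur in this class.
counts = [3, 1, 0, 0]
unsmoothed = np.asarray(counts, dtype=float) / sum(counts)
smoothed = smoothed_probs(counts)

print(unsmoothed)  # contains zeros, which would wipe out any product of probabilities
print(smoothed)    # every entry is strictly positive
```

The `trainNB0` function in the article follows the same idea by starting every word count at 1, although it starts the denominator at 2.0 rather than at the vocabulary size.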

For the theory behind the naive Bayes algorithm, see the posts in the theory section of this blog.

#!/usr/bin/python
# -*- coding: utf-8 -*-
# numpy's log works element-wise on arrays; random is the numpy.random module
from numpy import array, log, ones, random
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  # one label per post; 1 is the class whose prior trainNB0 calls pAbusive
  return postingList,classVec
def createVocabList(dataSet):
  vocabSet = set([]) #create empty set
  for document in dataSet:
    vocabSet = vocabSet | set(document) #union of the two sets
  return list(vocabSet)
 
def setOfWords2Vec(vocabList, inputSet):
  returnVec = [0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] = 1
    else:
      print("the word: %s is not in my Vocabulary!" % word)
  return returnVec
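def trainNB0(trainMatrix,trainCategory):  # train the model
  numTrainDocs = len(trainMatrix)
  numWords = len(trainMatrix[0])
  pAbusive = sum(trainCategory)/float(numTrainDocs)  # prior P(class = 1)
  # Laplace smoothing: start every word count at 1 so that no conditional
  # probability is zero. The denominators start at 2.0 here; strict add-one
  # smoothing over the vocabulary would start them at numWords instead.
  p0Num = ones(numWords)
  p1Num = ones(numWords)
  p0Denom = 2.0
  p1Denom = 2.0
  for i in range(numTrainDocs):
    if trainCategory[i] == 1:
      p1Num += trainMatrix[i]
      p1Denom += sum(trainMatrix[i])
    else:
      p0Num += trainMatrix[i]
      p0Denom += sum(trainMatrix[i])
  p1Vect = log(p1Num/p1Denom)    # take logs to avoid floating-point underflow when multiplying probabilities
  p0Vect = log(p0Num/p0Denom)
  return p0Vect,p1Vect,pAbusive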
 
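def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
  # unnormalized log posterior of each class: the log conditional
  # probabilities of the words in the document, plus the log prior
  p1 = sum(vec2Classify * p1Vec) + log(pClass1)
  p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
  if p1 > p0:
    return 1
  else:
    return 0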
 
def bagOfWords2VecMN(vocabList, inputSet):
  returnVec = [0] * len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] += 1
  return returnVec
 
def testingNB():  # test the trained classifier
  listOPosts, listClasses = loadDataSet()
  myVocabList = createVocabList(listOPosts)
  trainMat = []
  for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
  p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
  testEntry = ['love', 'my', 'dalmation']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print("%s classified as: %d" % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
  testEntry = ['stupid', 'garbage']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print("%s classified as: %d" % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
 
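def textParse(bigString): # split a long string into a list of lowercase words
  import re
  # \W+ (not \W*): a pattern that can match the empty string makes
  # re.split cut between every character in Python 3.7 and later
  listOfTokens = re.split(r'\W+', bigString)
  return [tok.lower() for tok in listOfTokens if len(tok) > 2]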
 
def spamTest():  # spam test; requires the email data under email/spam/ and email/ham/
  docList = []
  classList = []
  fullText = []
  for i in range(1, 26):
    wordList = textParse(open('email/spam/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList = textParse(open('email/ham/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList = createVocabList(docList)
  trainingSet = list(range(50))  # a real list, so entries can also be deleted in Python 3
  testSet = []
  for i in range(10):  # hold out 10 randomly chosen documents for testing
    randIndex = int(random.uniform(0, len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del trainingSet[randIndex]
  trainMat = []
  trainClasses = []
  for docIndex in trainingSet:
    trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
  errorCount = 0
  for docIndex in testSet:
    wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
    if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
      errorCount += 1
      print("classification error %s" % (docList[docIndex],))
  print("the error rate is: %s" % (float(errorCount) / len(testSet)))
 
 
 
if __name__ == '__main__':
  listOPosts, listClasses = loadDataSet()
  myVocabList = createVocabList(listOPosts)
  print(myVocabList)
  # print(setOfWords2Vec(myVocabList, listOPosts[0]))
  trainMat = []
  for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
  print(trainMat)
  p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
  print(pAb)
  print(p0V)
  print(p1V)
  testingNB()
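
The comment on `log()` in `trainNB0` mentions floating-point underflow. The following sketch, which is separate from the article's code and uses made-up probabilities, shows the problem: a product of many small probabilities collapses to zero in double precision, while the sum of their logarithms stays finite and can still be compared between classes.

```python
import numpy as np

# Illustrative numbers only: 500 words, each with conditional probability 0.01.
word_probs = np.full(500, 0.01)

# Multiplying the probabilities directly underflows to 0.0 in double precision.
direct_product = np.prod(word_probs)

# Summing the logarithms carries the same information as a finite number.
log_sum = np.sum(np.log(word_probs))

print(direct_product)
print(log_sum)
```

Because the logarithm is monotonically increasing, comparing log scores picks the same class as comparing the original products would, which is why `classifyNB` can work entirely with sums.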

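The listing defines two ways to turn a document into a vector, `setOfWords2Vec` and `bagOfWords2VecMN`, without spelling out the difference. The sketch below restates both ideas under hypothetical names (`set_of_words`, `bag_of_words`) with a made-up vocabulary and document: the set-of-words model records only whether a word is present, while the bag-of-words model counts how many times it occurs.

```python
def set_of_words(vocab, words):
    """Binary vector: 1 if the vocabulary word appears at all, else 0."""
    vec = [0] * len(vocab)
    for w in words:
        if w in vocab:
            vec[vocab.index(w)] = 1
    return vec

def bag_of_words(vocab, words):
    """Count vector: number of times each vocabulary word appears."""
    vec = [0] * len(vocab)
    for w in words:
        if w in vocab:
            vec[vocab.index(w)] += 1
    return vec

vocab = ['dog', 'stupid', 'love']          # made-up vocabulary
doc = ['stupid', 'dog', 'stupid', 'cat']   # 'cat' is not in the vocabulary and is ignored

set_vec = set_of_words(vocab, doc)
bag_vec = bag_of_words(vocab, doc)
print(set_vec)  # [1, 1, 0]
print(bag_vec)  # [1, 2, 0]
```

In the article, `testingNB` uses the set-of-words vectors and `spamTest` uses the bag-of-words vectors.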
That is all for this article. I hope it helps with your studies, and I hope you will continue to support 三水点靠木.
