python实现朴素贝叶斯算法


Posted in Python onNovember 19, 2018

本代码实现了朴素贝叶斯分类器(假设了条件独立的版本),常用于垃圾邮件分类,进行了拉普拉斯平滑。

关于朴素贝叶斯算法原理可以参考博客中原理部分的博文。

#!/usr/bin/python
# -*- coding: utf-8 -*-
from math import log
from numpy import*
import operator
import matplotlib
import matplotlib.pyplot as plt
from os import listdir
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]
  return postingList,classVec
def createVocabList(dataSet):
  vocabSet = set([]) #create empty set
  for document in dataSet:
    vocabSet = vocabSet | set(document) #union of the two sets
  return list(vocabSet)
 
def setOfWords2Vec(vocabList, inputSet):
  returnVec = [0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] = 1
    else: print "the word: %s is not in my Vocabulary!" % word
  return returnVec
def trainNB0(trainMatrix,trainCategory):  #训练模型
  numTrainDocs = len(trainMatrix)
  numWords = len(trainMatrix[0])
  pAbusive = sum(trainCategory)/float(numTrainDocs)
  p0Num = ones(numWords); p1Num = ones(numWords)  #拉普拉斯平滑
  p0Denom = 0.0+2.0; p1Denom = 0.0 +2.0      #拉普拉斯平滑
  for i in range(numTrainDocs):
    if trainCategory[i] == 1:
      p1Num += trainMatrix[i]
      p1Denom += sum(trainMatrix[i])
    else:
      p0Num += trainMatrix[i]
      p0Denom += sum(trainMatrix[i])
  p1Vect = log(p1Num/p1Denom)    #用log()是为了避免概率乘积时浮点数下溢
  p0Vect = log(p0Num/p0Denom)
  return p0Vect,p1Vect,pAbusive
 
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
  p1 = sum(vec2Classify * p1Vec) + log(pClass1)
  p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
  if p1 > p0:
    return 1
  else:
    return 0
 
def bagOfWords2VecMN(vocabList, inputSet):
  returnVec = [0] * len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] += 1
  return returnVec
 
def testingNB():  #测试训练结果
  listOPosts, listClasses = loadDataSet()
  myVocabList = createVocabList(listOPosts)
  trainMat = []
  for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
  p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
  testEntry = ['love', 'my', 'dalmation']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
  testEntry = ['stupid', 'garbage']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
 
def textParse(bigString): # 长字符转转单词列表
  import re
  listOfTokens = re.split(r'\W*', bigString)
  return [tok.lower() for tok in listOfTokens if len(tok) > 2]
 
def spamTest():  #测试垃圾文件 需要数据
  docList = [];
  classList = [];
  fullText = []
  for i in range(1, 26):
    wordList = textParse(open('email/spam/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList = textParse(open('email/ham/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList = createVocabList(docList) 
  trainingSet = range(50);
  testSet = [] 
  for i in range(10):
    randIndex = int(random.uniform(0, len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del (trainingSet[randIndex])
  trainMat = [];
  trainClasses = []
  for docIndex in trainingSet: 
    trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
  errorCount = 0
  for docIndex in testSet: 
    wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
    if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
      errorCount += 1
      print "classification error", docList[docIndex]
  print 'the error rate is: ', float(errorCount) / len(testSet)
 
 
 
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
print myVocabList,'\n'
# print setOfWords2Vec(myVocabList,listOPosts[0]),'\n'
trainMat=[]
for postinDoc in listOPosts:
  trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
print trainMat
p0V,p1V,pAb=trainNB0(trainMat,listClasses)
print pAb
print p0V,'\n',p1V
testingNB()

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
跟老齐学Python之一个免费的实验室
Sep 14 Python
利用Python爬取可用的代理IP
Aug 18 Python
解决Python requests 报错方法集锦
Mar 19 Python
Python操作SQLite数据库的方法详解
Jun 16 Python
Python开发的HTTP库requests详解
Aug 29 Python
python3.5+tesseract+adb实现西瓜视频或头脑王者辅助答题
Jan 17 Python
python入门教程 python入门神图一张
Mar 05 Python
pandas按行按列遍历Dataframe的几种方式
Oct 23 Python
flask框架配置mysql数据库操作详解
Nov 29 Python
python计算auc的方法
Sep 09 Python
Python源码解析之List
May 21 Python
Python实现双向链表基本操作
May 25 Python
朴素贝叶斯Python实例及解析
Nov 19 #Python
python版大富翁源代码分享
Nov 19 #Python
python获取微信小程序手机号并绑定遇到的坑
Nov 19 #Python
python实现推箱子游戏
Mar 25 #Python
详解python中的Turtle函数库
Nov 19 #Python
python绘制简单彩虹图
Nov 19 #Python
python微信好友数据分析详解
Nov 19 #Python
You might like
简单的php写入数据库类代码分享
2011/07/26 PHP
PHP 自定义错误处理函数的使用详解
2013/05/10 PHP
怎么在Windows系统中搭建php环境
2013/08/31 PHP
用PHP将Unicode 转化为UTF-8的实现方法(推荐)
2017/02/08 PHP
Javascript 表单之间的数据传递代码
2008/12/04 Javascript
Extjs入门之动态加载树代码
2010/04/09 Javascript
jquery 之 $().hover(func1, funct2)使用方法
2012/06/14 Javascript
javascript删除一个html元素节点的方法
2014/12/20 Javascript
Jquery中request和request.form和request.querystring的区别
2015/11/26 Javascript
js 判断一组日期是否是连续的简单实例
2016/07/11 Javascript
AngularJS控制器详解及示例代码
2016/08/16 Javascript
JavaScript数组复制详解
2017/02/02 Javascript
jquery replace方法去空格
2017/05/08 jQuery
JS非空验证及邮箱验证的实例
2017/08/11 Javascript
Nodejs+angularjs结合multiparty实现多图片上传的示例代码
2017/09/29 NodeJs
基于Vue的移动端图片裁剪组件功能
2017/11/28 Javascript
[01:28:43]2014 DOTA2华西杯精英邀请赛5 24 DK VS CIS
2014/05/25 DOTA
解决python opencv无法显示图片的问题
2018/10/28 Python
Python绘制二维曲线的日常应用详解
2019/12/04 Python
Python Django2 model 查询介绍(条件、范围、模糊查询)
2020/03/16 Python
Python手动或自动协程操作方法解析
2020/06/22 Python
Jabra捷波朗美国官网:用于办公、车载和运动的无线蓝牙耳麦
2017/02/01 全球购物
HEMA法国:荷兰原创设计
2019/02/21 全球购物
什么是用户模式(User Mode)与内核模式(Kernel Mode) ?
2014/07/21 面试题
PyQt 如何创建自定义QWidget
2021/03/24 Python
《孔子游春》教学反思
2014/02/25 职场文书
预备党员自我批评思想汇报
2014/10/10 职场文书
2014年党的群众路线整改措施思想汇报
2014/10/12 职场文书
2015年客服工作总结范文
2015/04/02 职场文书
2015年学生会部门工作总结
2015/04/21 职场文书
2015年幼儿园国庆节活动总结
2015/07/30 职场文书
高中同学会致辞
2015/08/01 职场文书
MySQL大小写敏感的注意事项
2021/05/24 MySQL
Java基础-封装和继承
2021/07/02 Java/Android
win7配置本地ftp服务器的图文教程
2022/08/05 Servers
ubuntu如何搭建vsftpd服务器
2022/12/24 Servers