Implementing LDA in Python: A Detailed Walkthrough

Posted in Python on October 25, 2017

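LDA (Latent Dirichlet Allocation) is a widely used, general-purpose probabilistic topic model. It is usually fitted with either variational inference or Gibbs sampling. When the authors proposed the model they released C source code for variational inference (a C++ class adapted from it will be posted later). This post presents an LDA class built on the third-party Python module lda, which fits the model with collapsed Gibbs sampling, together with a demo.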
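To make the Gibbs sampling route concrete, here is a minimal pure-Python sketch of a collapsed Gibbs sampler for LDA. It is not the code of the lda package or of the original authors. The function name gibbs_lda, the toy corpus, and the choice of 200 iterations are all assumptions made for illustration.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_iter=200, seed=1):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                                # total tokens per topic
    z = []                                              # topic assigned to every token
    # Start from random topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Take the token out of the counts before resampling it.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # p(z = t | everything else) is proportional to
                # (n_dk + alpha) * (n_kw + eta) / (n_k + V * eta).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + eta) / (n_k[t] + vocab_size * eta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Smoothed point estimates of the two distributions.
    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in n_dk[d]]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(c + eta) / (n_k[k] + vocab_size * eta) for c in n_kw[k]]
        for k in range(n_topics)
    ]
    return doc_topic, topic_word


# Toy corpus: word ids 0-2 and 3-5 never share a document.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
doc_topic, topic_word = gibbs_lda(toy_docs, n_topics=2, vocab_size=6)
for d, dist in enumerate(doc_topic):
    print("Doc %d: %s" % (d, [round(p, 3) for p in dist]))
```

Each row of doc_topic and topic_word sums to 1 by construction. Real implementations run the same loop in compiled code, since the per-token resampling is far too slow in pure Python for anything beyond a toy corpus.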

# coding: utf-8
# Python 3 version; requires the third-party packages numpy, lda and jieba.
import codecs

import jieba
import lda
import numpy as np


class LDA_v20161130():
  def __init__(self, topics=2):
    self.n_topic = topics
    self.corpus = None
    self.vocab = None
    self.ppCountMatrix = None
    # Default stop list: Chinese punctuation and whitespace.
    self.stop_words = ['，', '。', '、', '（', '）', '·', '！', ' ', '：', '“', '”', '\n']
    self.model = None

  def _tokenize(self, text):
    # Segment with jieba, then drop stop words and whitespace-only tokens.
    return [w for w in jieba.cut(text) if w not in self.stop_words and w.strip()]

  def loadCorpusFromFile(self, fn):
    # One document per line; the file is assumed to be UTF-8 encoded.
    with codecs.open(fn, 'r', 'utf-8') as f:
      docs = [self._tokenize(line.strip()) for line in f]
    # Build the vocabulary from every word seen, in order of first appearance.
    word_index = {}
    for doc in docs:
      for word in doc:
        if word not in word_index:
          word_index[word] = len(word_index)
    self.vocab = list(word_index)
    # Count word frequencies per document: rows are documents, columns are words.
    self.ppCountMatrix = np.zeros((len(docs), len(self.vocab)), dtype=np.int64)
    for row, doc in enumerate(docs):
      for word in doc:
        self.ppCountMatrix[row, word_index[word]] += 1
    print("load corpus from %s success!" % fn)

  def setStopWords(self, word_list):
    self.stop_words = word_list

  def fitModel(self, n_iter=1500, _alpha=0.1, _eta=0.01):
    self.model = lda.LDA(n_topics=self.n_topic, n_iter=n_iter, alpha=_alpha, eta=_eta, random_state=1)
    self.model.fit(self.ppCountMatrix)

  def printTopic_Word(self, n_top_word=8):
    for i, topic_dist in enumerate(self.model.topic_word_):
      # argsort is ascending, so read the result backwards for the most probable words.
      topic_words = np.array(self.vocab)[np.argsort(topic_dist)][:-(n_top_word + 1):-1]
      print("Topic: %d\t%s" % (i, " ".join(topic_words)))

  def printDoc_Topic(self):
    for i in range(len(self.ppCountMatrix)):
      print("Doc %d:((top topic:%s) topic distribution:%s)" % (i, self.model.doc_topic_[i].argmax(), self.model.doc_topic_[i]))

  def printVocabulary(self):
    print("vocabulary:")
    print(" ".join(self.vocab))

  def saveVocabulary(self, fn):
    with codecs.open(fn, 'w', 'utf-8') as f:
      for word in self.vocab:
        f.write("%s\n" % word)

  def saveTopic_Words(self, fn, n_top_word=-1):
    # n_top_word == -1 means "write every word, most probable first".
    if n_top_word == -1:
      n_top_word = len(self.vocab)
    with codecs.open(fn, 'w', 'utf-8') as f:
      for i, topic_dist in enumerate(self.model.topic_word_):
        topic_words = np.array(self.vocab)[np.argsort(topic_dist)][:-(n_top_word + 1):-1]
        f.write("Topic:%d\t" % i)
        for word in topic_words:
          f.write("%s " % word)
        f.write("\n")

  def saveDoc_Topic(self, fn):
    with codecs.open(fn, 'w', 'utf-8') as f:
      for i in range(len(self.ppCountMatrix)):
        f.write("Doc %d:((top topic:%s) topic distribution:%s)\n" % (i, self.model.doc_topic_[i].argmax(), self.model.doc_topic_[i]))
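The slice [:-(n_top_word + 1):-1] used in printTopic_Word and saveTopic_Words is terse, so here is a small sketch of what it does. It uses plain Python lists, which slice the same way as NumPy arrays in this case. The helper top_words, the vocabulary, and the probabilities are made up for illustration.

```python
# Hypothetical vocabulary and one made-up topic-word distribution.
vocab = ["trump", "the", "policy", "health", "vote"]
topic_dist = [0.40, 0.05, 0.25, 0.10, 0.20]


def top_words(vocab, topic_dist, n_top_word):
    # Plain-Python equivalent of np.argsort: indices from least to most probable.
    order = sorted(range(len(topic_dist)), key=lambda i: topic_dist[i])
    ranked = [vocab[i] for i in order]
    # Step -1 walks backwards from the last element and stops just before
    # position -(n_top_word + 1), leaving the n_top_word most probable words.
    return ranked[:-(n_top_word + 1):-1]


print(top_words(vocab, topic_dist, 2))           # ['trump', 'policy']
print(top_words(vocab, topic_dist, len(vocab)))  # all five words, most probable first
```

Asking for len(vocab) words, as saveTopic_Words does by default, returns the whole vocabulary in descending order of probability.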

Demo:

For example, take scraped BBC news coverage of Trump's election as the corpus and run the following code:

if __name__=="__main__":
  _lda = LDA_v20161130(topics=20)
  # English punctuation only; the last entry, u'''''', evaluates to the empty string.
  stop = [u'!', u'@', u'#', u',',u'.',u'/',u';',u' ',u'[',u']',u'$',u'%',u'^',u'&',u'*',u'(',u')',
      u'"',u':',u'<',u'>',u'?',u'{',u'}',u'=',u'+',u'_',u'-',u'''''']
  _lda.setStopWords(stop)
  # Raw strings keep the Windows backslashes from being read as escape sequences.
  # The corpus file holds one document per line.
  _lda.loadCorpusFromFile(r'C:\Users\Administrator\Desktop\BBC.txt')
  _lda.fitModel(n_iter=1500)
  _lda.printTopic_Word(n_top_word=10)
  _lda.printDoc_Topic()
  _lda.saveVocabulary(r'C:\Users\Administrator\Desktop\vocab.txt')
  _lda.saveTopic_Words(r'C:\Users\Administrator\Desktop\topic_word.txt')
  _lda.saveDoc_Topic(r'C:\Users\Administrator\Desktop\doc_topic.txt')

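Because the corpus is entirely English, stop_words here contains only English punctuation. The number of topics is set to 20 and the sampler runs for 1500 iterations. The run reports 148 documents, a vocabulary of 1347 words, and 4174 words in total, and it takes 17 s on an i3 machine.
Part of the Topic_words output:

Topic: 0 to will and of he be trumps the what policy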
Topic: 1 he would in said not no with mr this but
Topic: 2 for or can some whether have change health obamacare insurance
Topic: 3 the to that president as of us also first all
Topic: 4 trump to when with now were republican mr office presidential
Topic: 5 the his trump from uk who president to american house
Topic: 6 a to that was it by issue vote while marriage
Topic: 7 the to of an are they which by could from
Topic: 8 of the states one votes planned won two new clinton
Topic: 9 in us a use for obama law entry new interview
Topic: 10 and on immigration has that there website vetting action given

Part of the Doc_Topic output:

Doc 0:((top topic:4) topic distribution:[ 0.02972973 0.0027027 0.0027027 0.16486486 0.32702703 0.19189189
0.0027027 0.0027027 0.02972973 0.0027027 0.02972973 0.0027027
0.0027027 0.0027027 0.02972973 0.0027027 0.02972973 0.0027027
0.13783784 0.0027027 ])
Doc 1:((top topic:18) topic distribution:[ 0.21 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.11 0.01 0.01 0.01
0.01 0.01 0.01 0.01 0.01 0.01 0.31 0.21])
Doc 2:((top topic:18) topic distribution:[ 0.02075472 0.00188679 0.03962264 0.00188679 0.00188679 0.00188679
0.00188679 0.15283019 0.00188679 0.02075472 0.00188679 0.24716981
0.00188679 0.07735849 0.00188679 0.00188679 0.00188679 0.00188679
0.41698113 0.00188679])
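No topic in these distributions has probability exactly zero because the estimates are smoothed by the prior. The numbers printed for Doc 1 are consistent with the estimate (n + alpha) / (N + K * alpha), where n is the number of tokens in the document assigned to a topic, N is the document length, K = 20 and alpha = 0.1. The per-topic counts below are back-derived from the printed output, not read from the corpus, so treat them as an assumption.

```python
alpha = 0.1      # default _alpha of fitModel, as used in the demo
n_topics = 20    # topics=20 in the demo

# Assumed token counts per topic for Doc 1 (8 tokens in total),
# inferred from the printed distribution.
counts = [0] * n_topics
counts[0], counts[8], counts[18], counts[19] = 2, 1, 3, 2

total = sum(counts) + n_topics * alpha
doc_topic = [(c + alpha) / total for c in counts]

print([round(p, 2) for p in doc_topic])
print("top topic:", max(range(n_topics), key=lambda k: doc_topic[k]))
```

This reproduces the values 0.21, 0.11, 0.31 and 0.21 at topics 0, 8, 18 and 19, with 0.01 everywhere else and topic 18 on top. The same reasoning applied to Doc 0 (0.0027027 = 0.1 / 37) suggests a document of 35 tokens. With documents this short, a single token moves a topic's probability noticeably.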

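Of course, for an English corpus most function words and common low-content words, such as it, this, there and that, should be excluded as well, and in practice the parameters need to be chosen with care.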
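One way to drop such words is to extend the stop list. The sketch below is self-contained, so it uses a regular expression instead of jieba to split English text. The FUNCTION_WORDS set is a small hand-picked sample and tokenize_english is a hypothetical helper, neither of which comes from the original code.

```python
import re

# Small hand-picked sample for illustration; a real stop list would be much longer.
FUNCTION_WORDS = {"it", "this", "there", "that", "the", "to", "of", "and",
                  "a", "in", "he", "be", "will"}


def tokenize_english(text, stop_words=FUNCTION_WORDS):
    # Lowercase, keep alphabetic runs (with inner apostrophes), drop stop words.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return [w for w in words if w not in stop_words]


print(tokenize_english("There will be a change in the health insurance policy that he planned."))
# ['change', 'health', 'insurance', 'policy', 'planned']
```

With the class above, the same effect could be approximated by passing the punctuation list plus these words to setStopWords. Note that the class compares tokens case-sensitively, so capitalized forms such as "The" would need their own entries or a lowercasing step.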

Next, try a Chinese corpus: Xi Jinping's message of condolence on the death of Castro, and a news report on Park Geun-hye's resignation.

Topic: 0 的 同志 和 人民 卡斯特罗 菲德尔 古巴 他 了 我
Topic: 1 在 朴槿惠 向 表示 总统 对 将 的 月 国民
Doc 0:((top topic:0) topic distribution:[ 0.91714123 0.08285877])
Doc 1:((top topic:1) topic distribution:[ 0.09200666 0.90799334])

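Function words such as “的”, “和”, “了” and “对” still interfere, but broadly speaking the two news items have clearly distinct topic distributions, and the result is not bad.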

Author: liuph_
