Counting English Word Occurrences in a Plain Text File with Python: A Summary of Methods [Tested and Working]


Posted in Python on July 25, 2018

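This article walks through several ways to count how many times each English word appears in a plain text file using Python. They are shared here for reference, as follows: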

Version 1: inefficient (scans the file character by character)

#!python3
# -*- coding:utf-8 -*-
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
  word = []        # characters of the word currently being built
  words_dict = {}  # word -> number of occurrences
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace():  # whitespace: space, \t, \n
      if word:
        word = ''.join(word).lower()  # convert to lowercase
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
    # any other character (punctuation) is simply skipped
# handle the last word, in case the file does not end with whitespace
if word:
  word = ''.join(word).lower()  # convert to lowercase
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
for k, v in words_dict.items():
  print(k, v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
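
One detail of Version 1 is easy to miss: a character that is neither alphanumeric nor whitespace is dropped without ending the current word. The sketch below reproduces the same scan on an in-memory string to make that visible. The function name count_words_by_char and the sample sentence are made up for illustration and are not part of the original script.

```python
def count_words_by_char(text):
    """Reproduce Version 1's character scan on a string (hypothetical helper)."""
    counts = {}
    letters = []
    for ch in text:
        if ch.isalnum():
            letters.append(ch)
        elif ch.isspace() and letters:
            word = ''.join(letters).lower()
            counts[word] = counts.get(word, 0) + 1
            letters = []
        # Punctuation falls through here: it is skipped and does not end the word.
    if letters:  # flush the last word if the text does not end with whitespace
        word = ''.join(letters).lower()
        counts[word] = counts.get(word, 0) + 1
    return counts


print(count_words_by_char("Don't stop, well-known don't"))
# {'dont': 2, 'stop': 1, 'wellknown': 1}
```

So "day," is counted as "day", which is what we want here, but "don't" becomes "dont" and "well-known" becomes "wellknown". For the test text in this article that makes no difference.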

Version 2:

Drawback: the whole file is read into memory at once, so it performs poorly on large files

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list]  # convert to lowercase
  word_set = set(word_list)  # avoid counting the same word twice
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # more concise form
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
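
Besides reading the whole file at once, Version 2 calls word_list.count(word) once per distinct word, and each call walks the entire list again. A minimal sketch of counting in a single pass instead, assuming the same \w+ pattern and lowercasing as Version 2; the function name count_words_single_pass is hypothetical.

```python
import re


def count_words_single_pass(text):
    """Count lowercase \\w+ tokens in one pass over the text (hypothetical helper)."""
    counts = {}
    for match in re.finditer(r'\w+', text):
        word = match.group().lower()
        # Increment in place instead of rescanning a list with count().
        counts[word] = counts.get(word, 0) + 1
    return counts


print(count_words_single_pass("We are what we are"))
# {'we': 2, 'are': 2, 'what': 1}
```

Because the dictionary is filled in the order words are first seen, the output also follows the order of the text rather than the arbitrary order of a set.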

Version 3: reads the file line by line

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:
    #line_words = word_reg.findall(line)
    # simpler than the regex above, but split() keeps punctuation attached
    # and does not change case, so "day," and "We" are counted as they appear
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list)  # avoid counting the same word twice
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
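
In the output above, "away" and "away," are counted separately, and so are "We" and "we", because split() only cuts on whitespace. If you want to keep split() but get the same words as Versions 1 and 2, one possible approach is to strip punctuation from both ends of each token and lowercase it. This is only a sketch: the helper names normalize and normalized_words are made up, and string.punctuation covers ASCII punctuation only.

```python
import string


def normalize(token):
    """Strip leading/trailing ASCII punctuation and lowercase (hypothetical helper)."""
    return token.strip(string.punctuation).lower()


def normalized_words(lines):
    """Split each line on whitespace and return the cleaned, non-empty tokens."""
    words = []
    for line in lines:
        for token in line.split():
            word = normalize(token)
            if word:  # a token made only of punctuation becomes empty
                words.append(word)
    return words


print(normalized_words(["We are busy all day, we"]))
# ['we', 'are', 'busy', 'all', 'day', 'we']
```

Unlike Version 1, this keeps punctuation inside a word, so "don't" stays "don't".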

Version 4: counting with Counter

#!python3
# -*- coding:utf-8 -*-
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  for line in f:
    # split() keeps punctuation and case, as in Version 3
    line_words = line.split()
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list))  # count with Counter
  for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
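
Version 4 still collects every word into word_list before counting. Counter can also be updated line by line, and its most_common() method returns the words sorted by frequency. The sketch below combines that with the \w+ pattern and lowercasing from Version 2; the function name count_words and the sample line are assumptions for illustration only.

```python
import re
from collections import Counter

WORD_RE = re.compile(r'\w+')


def count_words(lines):
    """Count lowercase words over an iterable of lines (hypothetical helper)."""
    counter = Counter()
    for line in lines:
        # update() adds the counts for this line without keeping a word list.
        counter.update(word.lower() for word in WORD_RE.findall(line))
    return counter


counter = count_words(["The cat and the dog", "and the bird"])
print(counter.most_common(2))
# [('the', 3), ('and', 2)]
```

Since iterating over an open text file yields its lines, a file object can be passed in directly, for example count_words(f) inside the with block.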

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
