Counting English Word Occurrences in a Plain-Text File with Python: A Summary of Methods [Tested and Working]


Posted in Python on July 25, 2018

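This article walks through several ways to count how many times each English word appears in a plain-text file using Python. They are shared here for reference; the details are as follows: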

Version 1: inefficient

#!python3
# -*- coding:utf-8 -*-
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
  word = []        # characters of the word currently being built
  words_dict = {}  # word -> number of occurrences
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace():  # whitespace: space, \t, \n
      if word:
        word = ''.join(word).lower()  # convert to lower case
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
    # any other character (punctuation) is simply skipped
# handle the last word, in case the file does not end with whitespace
if word:
  word = ''.join(word).lower()  # convert to lower case
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k, v in words_dict.items():
  print(k, v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
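
Version 1 repeats its counting logic once inside the loop and once more for the final word. A minimal sketch of the same character-by-character scan, wrapped in a function with a small helper so the tail case is handled in one place, might look like this. The names `count_words_by_char` and `flush` are illustrative and not part of the original listing.

```python
def count_words_by_char(text):
    """Count words by scanning characters one at a time, as in version 1."""
    counts = {}
    current = []  # characters of the word currently being built

    def flush():
        # Record the buffered word (if any) in lower case, then reset the buffer.
        if current:
            word = ''.join(current).lower()
            counts[word] = counts.get(word, 0) + 1
            current.clear()

    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif ch.isspace():
            flush()
        # Any other character (punctuation) is skipped, matching version 1.
    flush()  # the last word may not be followed by whitespace
    return counts


sample = "We are busy all day, we grew up."
print(count_words_by_char(sample))
```

As in version 1, punctuation inside a token is dropped rather than treated as a separator, so a hyphenated token such as `a-b` would be counted as `ab`.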

Version 2:

Drawback: the whole file is read into memory at once, so performance suffers on large files.

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list]  # convert to lower case
  word_set = set(word_list)  # avoid counting the same word more than once
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # more concise form
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
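
Besides reading the whole file at once, version 2 calls `word_list.count(word)` once per distinct word, and each call scans the entire list again. A rough sketch of a single-pass alternative, assuming the same `\w+` pattern, is shown below. The function name `count_words_single_pass` is made up for illustration.

```python
import re

WORD_RE = re.compile(r'\w+')  # same pattern as version 2


def count_words_single_pass(text):
    """Count lower-cased \\w+ matches, visiting each match exactly once."""
    counts = {}
    for match in WORD_RE.finditer(text):
        word = match.group().lower()
        counts[word] = counts.get(word, 0) + 1
    return counts


print(count_words_single_pass("The cat and the hat"))
```

Note that `\w` also matches digits and underscores, so tokens such as `file_1` are counted as words under this pattern.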

Version 3:

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:  # read line by line instead of loading the whole file
    #line_words = word_reg.findall(line)
    # simpler than the regex above, but split() keeps punctuation and case
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list)  # avoid counting the same word more than once
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
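
The output above shows the cost of plain `split()`: `day,` keeps its comma, and `We` and `we` are counted separately. One possible way to keep the line-by-line approach while normalizing tokens is sketched below. It assumes that stripping leading and trailing ASCII punctuation is good enough for the text at hand, and the name `count_words_by_line` is illustrative.

```python
import io
import string


def count_words_by_line(lines):
    """Count words from an iterable of lines, normalizing each split() token."""
    counts = {}
    for line in lines:
        for token in line.split():
            # Drop punctuation at both ends only, then fold case.
            word = token.strip(string.punctuation).lower()
            if word:  # a token made only of punctuation becomes empty
                counts[word] = counts.get(word, 0) + 1
    return counts


# io.StringIO stands in for an open text file here.
demo = io.StringIO("We are busy all day,\nwe have lost themselves.\n")
result = count_words_by_line(demo)
print(result['we'], result['day'], result['themselves'])
```

Because only the ends of each token are stripped, apostrophes and hyphens inside a word (for example `don't`) are preserved.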

Version 4: counting with Counter

#!python3
# -*- coding:utf-8 -*-
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  for line in f:
    line_words = line.split()  # split() keeps punctuation and case
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list))  # count with Counter
  for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
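
Version 4 still builds a full list of tokens before counting, and it inherits the punctuation and case issue from `split()`. A hedged sketch that feeds a `Counter` one line at a time, reusing the `\w+` pattern and lower-casing from version 2, could look like the following. The function name `count_file_words` and the temporary demo file are assumptions made only for this example.

```python
import collections
import os
import re
import tempfile

WORD_RE = re.compile(r'\w+')


def count_file_words(path):
    """Return a Counter of lower-cased words, reading the file line by line."""
    counter = collections.Counter()
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            # Counter.update adds one count per item yielded by the iterable.
            counter.update(word.lower() for word in WORD_RE.findall(line))
    return counter


# Demo on a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write("We are busy all day, like swarms of flies.\n"
              "We grew up, we have lost themselves.\n")
    demo_path = tmp.name

demo_counts = count_file_words(demo_path)
os.remove(demo_path)

# most_common(n) lists the n most frequent words first.
for word, n in demo_counts.most_common(3):
    print(word, n)
```

Keeping the result as a `Counter` rather than converting it to a plain `dict` makes `most_common` available, which is handy when only the most frequent words matter.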

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
