Counting English Word Occurrences in a Plain-Text File with Python: A Summary of Methods [Tested and Working]


Posted in Python on July 25, 2018

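This article demonstrates several ways to count how many times each English word occurs in a plain-text file using Python. They are shared here for reference; details follow: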

Version 1: inefficient

#!python3
# -*- coding: utf-8 -*-
path = 'test.txt'
words_dict = {}
word = []  # characters of the word currently being built
with open(path, encoding='utf-8', newline='') as f:
    for letter in f.read():
        if letter.isalnum():
            word.append(letter)
        elif letter.isspace():  # whitespace: space, \t, \n
            if word:
                key = ''.join(word).lower()  # convert to lowercase
                words_dict[key] = words_dict.get(key, 0) + 1
                word = []
# Handle the last word (the file may not end with whitespace)
if word:
    key = ''.join(word).lower()  # convert to lowercase
    words_dict[key] = words_dict.get(key, 0) + 1
for k, v in words_dict.items():
    print(k, v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1

Version 2:

Drawback: the whole file is read into memory at once, so performance suffers on large files.

#!python3
# -*- coding: utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    data = f.read()  # reads the whole file into memory
word_reg = re.compile(r'\w+')
# word_reg = re.compile(r'\w+\b')
word_list = word_reg.findall(data)
word_list = [word.lower() for word in word_list]  # convert to lowercase
word_set = set(word_list)  # deduplicate so each word is looked up only once
# words_dict = {}
# for word in word_set:
#     words_dict[word] = word_list.count(word)
# Concise form of the loop above
words_dict = {word: word_list.count(word) for word in word_set}
for k, v in words_dict.items():
    print(k, v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
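
Two costs in version 2 can be avoided: f.read() loads the whole file, and word_list.count(word) rescans the full list once for every distinct word. The following is a minimal sketch, not part of the original article, that handles one line at a time and counts in a single pass. The function name count_words_by_line is a hypothetical helper introduced for illustration; it accepts any iterable of strings, and the sample lines are inline so the snippet runs without test.txt.

```python
import re

WORD_REG = re.compile(r'\w+')  # same pattern as version 2


def count_words_by_line(lines):
    """Count lowercase words in a single pass over an iterable of lines."""
    words_dict = {}
    for line in lines:
        # finditer yields matches lazily, so no per-line list is built
        for match in WORD_REG.finditer(line):
            word = match.group().lower()
            words_dict[word] = words_dict.get(word, 0) + 1
    return words_dict


# Inline sample lines (assumption: stands in for the contents of a file)
sample = ['We are busy all day,\n', 'we have lost themselves.\n']
for k, v in count_words_by_line(sample).items():
    print(k, v)
```

Because an open text file is itself an iterable of lines, the same function could be called as count_words_by_line(f) inside a with open(path, encoding='utf-8') as f: block.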

Version 3: read the file line by line

#!python3
# -*- coding: utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    word_reg = re.compile(r'\w+')
    for line in f:  # read one line at a time
        # line_words = word_reg.findall(line)
        # split() is simpler than the regex above, but it splits on
        # whitespace only, so punctuation and case are kept as-is
        line_words = line.split()
        word_list.extend(line_words)
word_set = set(word_list)  # deduplicate so each word is looked up only once
words_dict = {word: word_list.count(word) for word in word_set}
for k, v in words_dict.items():
    print(k, v)

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
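
The output above still carries punctuation ("day,", "soul.") and counts "We" and "we" separately, because str.split() splits on whitespace only. Below is a minimal sketch of one way to normalize the tokens while keeping split(), assuming that only ASCII punctuation at the edges of a token needs removing. The helper name normalize is hypothetical and not from the original article.

```python
import string


def normalize(token):
    """Strip leading/trailing ASCII punctuation and lowercase the token."""
    return token.strip(string.punctuation).lower()


line = 'We are busy all day, like swarms of flies without souls, noisy,'
words = [normalize(token) for token in line.split()]
words = [word for word in words if word]  # drop tokens that were only punctuation
print(words)
```

str.strip() only touches the two ends of a token, so punctuation inside a word (for example the apostrophe in "don't") is kept. The regex \w+ used in version 2 would instead split such a word into two matches.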

Version 4: counting with Counter

#!python3
# -*- coding: utf-8 -*-
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    for line in f:  # read one line at a time
        line_words = line.split()  # whitespace split: punctuation and case are kept
        word_list.extend(line_words)
words_dict = dict(collections.Counter(word_list))  # count with Counter
for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
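
Like version 3, version 4 tokenizes with split(), so its counts differ from versions 1 and 2 ("We 1" and "we 3" instead of "we 4"). A possible combination, sketched here as an illustration rather than as the original author's method, feeds the lowercase regex matches into a Counter one line at a time and then uses Counter.most_common() to list the most frequent words first. The text variable holds an inline excerpt of the test text so the snippet runs without test.txt.

```python
import collections
import re

# Inline excerpt of the test text (assumption: stands in for the file contents)
text = (
    'We are busy all day, like swarms of flies without souls, noisy, '
    'restless, unable to hear the voices of the soul.'
)

word_reg = re.compile(r'\w+')
counter = collections.Counter()
for line in text.splitlines():
    # update() adds the counts of each word found on this line
    counter.update(word.lower() for word in word_reg.findall(line))

# most_common(n) returns the n highest counts as (word, count) pairs
for word, count in counter.most_common(3):
    print(word, count)
```

Updating the Counter per line avoids building the intermediate word_list, and calling most_common() with no argument returns every word ordered from most to least frequent.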

Note: the test text test.txt used here is as follows:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
