Counting English Word Occurrences in a Plain Text File with Python: A Summary of Methods [Tested and Working]


Posted in Python on July 25, 2018

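This article shows, by example, several ways to count how often each English word occurs in a plain text file with Python. They are shared here for reference. The details are as follows: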

Version 1: inefficient

#!python3
# -*- coding:utf-8 -*-
path = 'test.txt'
with open(path,encoding='utf-8',newline='') as f:
  word = []        # characters of the word currently being read
  words_dict= {}   # word -> number of occurrences
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace(): # whitespace (space, \t, \n) ends the current word
      if word:
        word = ''.join(word).lower() # convert to lowercase
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
    # any other character (punctuation) is simply skipped
# handle the last word, which may not be followed by whitespace
if word:
  word = ''.join(word).lower() # convert to lowercase
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k,v in words_dict.items():
  print(k,v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
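
Version 1 repeats its "finish the current word" logic once inside the loop and once after it. As a minimal sketch of the same character-by-character scan, the logic can be moved into a generator so the flush is written only once per case. The name `iter_words` and the sample string are hypothetical and do not come from the original code.

```python
def iter_words(text):
    """Yield lowercase alphanumeric words, scanning one character at a time."""
    word = []
    for letter in text:
        if letter.isalnum():
            word.append(letter)
        elif letter.isspace():
            if word:
                yield ''.join(word).lower()
                word = []
        # any other character (punctuation) is skipped, as in version 1
    if word:  # flush the last word when the text does not end in whitespace
        yield ''.join(word).lower()


sample = "We are busy all day, we grew up."
print(list(iter_words(sample)))
# ['we', 'are', 'busy', 'all', 'day', 'we', 'grew', 'up']
```

The generator can be fed `f.read()` exactly as version 1 does, so it has the same memory behavior; it only tidies the control flow.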

Version 2:

Drawback: the whole file is read into memory at once, so it performs poorly on large files.

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path,'r',encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list] # convert to lowercase
  word_set = set(word_list) # avoid counting the same word more than once
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # the same thing written as a dict comprehension
  words_dict = {word: word_list.count(word) for word in word_set}
  for k,v in words_dict.items():
    print(k,v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
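
Besides memory use, version 2 has a second cost: `word_list.count(word)` walks the whole list again for every distinct word. The sketch below counts in a single pass over the tokens instead. It assumes the same `\w+` pattern and lowercasing as version 2; the function name `count_words_single_pass` and the sample string are hypothetical.

```python
import re

WORD_RE = re.compile(r'\w+')


def count_words_single_pass(text):
    """Count lowercase word tokens with one walk over the token list."""
    counts = {}
    for word in WORD_RE.findall(text):
        word = word.lower()
        counts[word] = counts.get(word, 0) + 1
    return counts


sample = "The voices of the soul"
print(count_words_single_pass(sample))
# {'the': 2, 'voices': 1, 'of': 1, 'soul': 1}
```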

Version 3:

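#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f: # read line by line instead of loading the whole file
    #line_words = word_reg.findall(line)
    # simpler than the regex above, but split() only breaks on whitespace:
    # punctuation stays attached and case is not normalized (see the output)
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list) # avoid counting the same word more than once
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)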

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
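
This output differs from versions 1 and 2 because `line.split()` only splits on whitespace: punctuation stays attached (`day,`, `soul.`) and case is preserved (`We` and `we` are counted separately). One possible way to bring the tokens back in line is sketched below. It assumes that stripping the ASCII punctuation in `string.punctuation` from both ends of each token is enough for this text; `normalize` is a hypothetical helper name.

```python
import string


def normalize(token):
    """Strip leading/trailing ASCII punctuation and lowercase the token."""
    return token.strip(string.punctuation).lower()


line = "We are busy all day, like swarms of flies without souls, noisy,"
# drop tokens that become empty (for example a lone "-")
words = [w for w in (normalize(t) for t in line.split()) if w]
print(words)
# ['we', 'are', 'busy', 'all', 'day', 'like', 'swarms', 'of', 'flies', 'without', 'souls', 'noisy']
```

Unlike versions 1 and 2, this keeps punctuation inside a token, so a word such as `don't` stays in one piece rather than becoming `dont` or `don` and `t`.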

Version 4: counting with Counter

#!python3
# -*- coding:utf-8 -*-
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  for line in f:
    # split() only breaks on whitespace, so punctuation and case are kept
    line_words = line.split()
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list)) # count with Counter
  for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
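
Version 4 still collects every token in `word_list` before counting. As a rough sketch, a `Counter` can instead be updated one line at a time, and `most_common()` then lists the words by frequency. This variant swaps `split()` for the `\w+` regex from version 2 plus lowercasing, so its counts follow versions 1 and 2 rather than version 4. The name `count_lines` and the sample text are hypothetical, and `io.StringIO` merely stands in for an open text file.

```python
import collections
import io
import re

WORD_RE = re.compile(r'\w+')


def count_lines(lines):
    """Update a Counter line by line; no list of all tokens is kept."""
    counter = collections.Counter()
    for line in lines:
        counter.update(word.lower() for word in WORD_RE.findall(line))
    return counter


# io.StringIO stands in for `open(path, encoding='utf-8')` in this sketch
fake_file = io.StringIO("We are busy all day,\nwe grew up, we have lost themselves.\n")
counter = count_lines(fake_file)
for word, n in counter.most_common(1):
    print(word, n)
# we 3
```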

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
