Summary of Ways to Count English Word Occurrences in a Plain-Text File with Python [Tested and Working]


Posted in Python on July 25, 2018

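This article walks through several ways to count how often each English word occurs in a plain-text file using Python. They are shared here for reference; the details are as follows: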

Version 1: inefficient

#!python3
# -*- coding:utf-8 -*-
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
  word = []
  words_dict = {}
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace():  # whitespace: space, \t, \n
      if word:
        word = ''.join(word).lower()  # convert to lowercase
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
# handle the last word (the file may not end with whitespace)
if word:
  word = ''.join(word).lower()  # convert to lowercase
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k, v in words_dict.items():
  print(k, v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
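
Version 1 repeats its counting logic in order to flush the last word. One possible way to avoid that duplication, shown here only as a sketch, is to move the character scan into a generator. The names iter_words and count_words are hypothetical and do not come from the original listing; the rules are meant to mirror version 1 (alphanumeric characters are collected, whitespace ends a word, any other character is dropped).

```python
def iter_words(text):
    """Yield lowercased words using the same rules as version 1."""
    letters = []
    for ch in text:
        if ch.isalnum():
            letters.append(ch)
        elif ch.isspace() and letters:
            # whitespace closes the current word
            yield ''.join(letters).lower()
            letters = []
        # any other character (punctuation) is silently skipped
    if letters:
        # flush the last word when the text does not end in whitespace
        yield ''.join(letters).lower()


def count_words(text):
    """Tally the words produced by iter_words in a plain dict."""
    counts = {}
    for word in iter_words(text):
        counts[word] = counts.get(word, 0) + 1
    return counts


# Inline sample so the sketch runs without test.txt
sample = "We are busy all day, we grew up."
for word, n in count_words(sample).items():
    print(word, n)
```

Because punctuation is skipped rather than treated as a separator, a string such as "a,b" is counted as the single word "ab", exactly as in version 1.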

Version 2:

Drawback: the whole file is read into memory at once, so performance is poor on large files.

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list]  # convert to lowercase
  word_set = set(word_list)  # count each distinct word only once
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # more concise form
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
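
Version 2 relies on the pattern \w+, which also matches digits and underscores and splits a word at an apostrophe. If only English words should be counted, a stricter pattern may be preferable. The comparison below is a sketch under that assumption; the function names tokenize_w and tokenize_english and the letters-only pattern are illustrative choices, not part of the original listing.

```python
import re

# Pattern used in version 2
WORD_RE = re.compile(r'\w+')
# Assumed alternative: ASCII letters with an optional internal apostrophe
ENGLISH_WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize_w(text):
    """Tokenize the way version 2 does: lowercase, then \\w+."""
    return WORD_RE.findall(text.lower())


def tokenize_english(text):
    """Tokenize with the stricter letters-only pattern."""
    return ENGLISH_WORD_RE.findall(text.lower())


text = "Don't count snake_case or 42 as plain English words."
print(tokenize_w(text))
print(tokenize_english(text))
```

With \w+, "Don't" becomes the two tokens "don" and "t", and "snake_case" and "42" are kept as words; the letters-only pattern keeps "don't" whole, splits "snake_case" into two words, and drops the number. Which behavior is right depends on the text being counted.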

Version 3:

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:
    #line_words = word_reg.findall(line)
    # simpler than the regex above, but punctuation stays attached and case is kept
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list)  # count each distinct word only once
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
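
In the output above, punctuation stays attached to the tokens ("day,", "soul.") and "We" is counted separately from "we", because split() only cuts at whitespace. One way to bring these counts in line with version 2, sketched here as an assumption rather than as the author's method, is to normalize each token before counting. The helper names normalize and count_line_words are hypothetical.

```python
import string


def normalize(token):
    """Strip leading/trailing ASCII punctuation and lowercase the token."""
    return token.strip(string.punctuation).lower()


def count_line_words(lines):
    """Count normalized whitespace-separated tokens, one line at a time."""
    counts = {}
    for line in lines:
        for token in line.split():
            word = normalize(token)
            if word:  # skip tokens that consisted only of punctuation
                counts[word] = counts.get(word, 0) + 1
    return counts


# Inline sample lines standing in for an open file object
lines = ["We are busy all day,\n", "we have lost themselves.\n"]
for word, n in count_line_words(lines).items():
    print(word, n)
```

Since str.strip only removes characters from both ends, an internal apostrophe as in "don't" is preserved.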
indulge 1

Version 4: counting with Counter

#!python3
# -*- coding:utf-8 -*-
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  for line in f:
    line_words = line.split()
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list))  # count with Counter
  for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
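
Version 4 still collects every token in a list before counting. A Counter can instead be updated one line at a time, which keeps only the tallies in memory, and its most_common method returns the words sorted by frequency. The following is a minimal sketch of that idea; the function name count_stream is hypothetical, and it assumes the \w+ pattern and lowercasing from version 2 to drop punctuation and merge "We" with "we".

```python
import collections
import io
import re

WORD_RE = re.compile(r'\w+')


def count_stream(lines):
    """Update a Counter line by line instead of building a full word list."""
    counter = collections.Counter()
    for line in lines:
        counter.update(WORD_RE.findall(line.lower()))
    return counter


# io.StringIO stands in for an open text file so the sketch is self-contained
fake_file = io.StringIO(
    "We are busy all day,\n"
    "we grew up, years away a lot of memories\n"
)
counter = count_stream(fake_file)

# Print the three most frequent words first
for word, n in counter.most_common(3):
    print(word, n)
```

With a real file, the same function could be called as count_stream(f) inside a with open(...) block, since iterating over a text file yields its lines.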

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
