Python统计纯文本文件中英文单词出现个数的方法总结【测试可用】


Posted in Python onJuly 25, 2018

本文实例讲述了Python统计纯文本文件中英文单词出现个数的方法。分享给大家供大家参考,具体如下:

第一版: 效率低

# -*- coding:utf-8 -*-
#!python3
path = 'test.txt'
with open(path,encoding='utf-8',newline='') as f:
  word = []
  words_dict= {}
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace(): #空白字符 空格 \t \n
      if word:
        word = ''.join(word).lower() #转小写
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
#处理最后一个单词
if word:
  word = ''.join(word).lower() # 转小写
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k,v in words_dict.items():
  print(k,v)

运行结果:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1

第二版:

缺点:遇到大文件要一次读入内存,性能不好

# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path,'r',encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list] #转小写
  word_set = set(word_list) #避免重复查询
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # 简洁写法
  words_dict = {word: word_list.count(word) for word in word_set}
  for k,v in words_dict.items():
    print(k,v)

运行结果:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1

第三版:

# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:
    #line_words = word_reg.findall(line)
    #比上面的正则更加简单
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list) # 避免重复查询
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

运行结果:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1

第四版:使用Counter统计

# -*- coding:utf-8 -*-
#!python3
import collections
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:
    line_words = line.split()
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list)) #使用Counter统计
  for k, v in words_dict.items():
    print(k, v)

运行结果:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1

注:这里使用的测试文本test.txt如下:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.

Python 相关文章推荐
Python编程实现输入某年某月某日计算出这一天是该年第几天的方法
Apr 18 Python
Python学习笔记之if语句的使用示例
Oct 23 Python
Python enumerate函数功能与用法示例
Mar 01 Python
python3 tkinter实现点击一个按钮跳出另一个窗口的方法
Jun 13 Python
对python中GUI,Label和Button的实例详解
Jun 27 Python
python的slice notation的特殊用法详解
Dec 27 Python
Python中zip()函数的解释和可视化(实例详解)
Feb 16 Python
python实现从尾到头打印单链表操作示例
Feb 22 Python
将 Ubuntu 16 和 18 上的 python 升级到最新 python3.8 的方法教程
Mar 11 Python
如何将PySpark导入Python的放实现(2种)
Apr 26 Python
python调用API接口实现登陆短信验证
May 10 Python
浅谈python 调用open()打开文件时路径出错的原因
Jun 05 Python
基于DataFrame改变列类型的方法
Jul 25 #Python
对pandas中Series的map函数详解
Jul 25 #Python
基于pandas将类别属性转化为数值属性的方法
Jul 25 #Python
Django实现支付宝付款和微信支付的示例代码
Jul 25 #Python
Python走楼梯问题解决方法示例
Jul 25 #Python
python 批量修改/替换数据的实例
Jul 25 #Python
django 实现电子支付功能的示例代码
Jul 25 #Python
You might like
php pthreads多线程的安装与使用
2016/01/19 PHP
php实现的中秋博饼游戏之绘制骰子图案功能示例
2017/11/06 PHP
HTML 自动伸缩的表格Table js实现
2009/04/01 Javascript
ext 列表页面关于多行查询的办法
2010/03/25 Javascript
js几个验证函数代码
2010/03/25 Javascript
JS原型对象通俗"唱法"
2012/12/27 Javascript
js日期相关函数总结分享
2013/10/15 Javascript
js控制鼠标事件移动及移出效果显示
2014/10/19 Javascript
Jquery 全选反选实例代码
2015/11/19 Javascript
js获取本机操作系统类型的两种方法
2015/12/19 Javascript
浅谈javascript的call()、apply()、bind()的用法
2016/02/21 Javascript
JavaScript闭包实例详解
2016/06/03 Javascript
BootStrap智能表单实战系列(三)分块表单配置详解
2016/06/13 Javascript
js 判断一组日期是否是连续的简单实例
2016/07/11 Javascript
BootStrap Typeahead自动补全插件实例代码
2016/08/10 Javascript
Es6 写的文件import 起来解决方案详解
2016/12/13 Javascript
vue中使用vue-router切换页面时滚动条自动滚动到顶部的方法
2017/11/28 Javascript
浅析Vue 防抖与节流的使用
2019/11/14 Javascript
ES6扩展运算符和rest运算符用法实例分析
2020/05/23 Javascript
[01:31:03]DOTA2完美盛典全回顾 见证十五项大奖花落谁家
2017/11/28 DOTA
python实现清屏的方法
2015/04/30 Python
Python机器学习算法之k均值聚类(k-means)
2018/02/23 Python
pytorch 转换矩阵的维数位置方法
2018/12/08 Python
pytest中文文档之编写断言
2019/09/12 Python
wxpython+pymysql实现用户登陆功能
2019/11/19 Python
Python 实现打印单词的菱形字符图案
2020/04/12 Python
使用Python实现批量ping操作方法
2020/05/06 Python
卸载tensorflow-cpu重装tensorflow-gpu操作
2020/06/23 Python
lookfantastic荷兰:在线购买奢华护肤、护发和化妆品
2018/11/27 全球购物
法律专业实习鉴定
2013/12/22 职场文书
社区禁毒工作方案
2014/06/02 职场文书
年度安全生产目标责任书
2014/07/23 职场文书
2014年业务员工作总结范文
2014/11/17 职场文书
会计专业自荐信范文
2015/03/05 职场文书
千与千寻观后感
2015/06/04 职场文书
升级 Win11 还是坚守 Win10?微软 Win11 新系统缺失功能大盘点
2022/04/05 数码科技