Counting English Word Occurrences in a Plain-Text File with Python: A Summary of Methods [Tested]

Posted in Python on July 25, 2018

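This article shows, by example, several ways to count how many times each English word occurs in a plain-text file using Python. The details are as follows: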

Version 1: inefficient (character-by-character scan)

#!python3
# -*- coding:utf-8 -*-
path = 'test.txt'
with open(path,encoding='utf-8',newline='') as f:
  word = []
  words_dict= {}
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace(): # whitespace: space, \t, \n
      if word:
        word = ''.join(word).lower() # convert to lowercase
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
# handle the last word (the file may not end with whitespace)
if word:
  word = ''.join(word).lower() # convert to lowercase
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
# print each word with its count
for k,v in words_dict.items():
  print(k,v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
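
Version 1 repeats its counting logic once more after the loop to catch the final word. As a minimal sketch (the function name `count_words_by_char` is hypothetical, not from the original article), the same character scan can be wrapped in a function that appends a trailing space so the last word is flushed inside the loop. Like version 1, it ignores punctuation rather than treating it as a separator.

```python
def count_words_by_char(text):
    """Count words by scanning characters, mirroring version 1's rules."""
    counts = {}
    word = []
    # A trailing space forces the final word to be flushed inside the loop,
    # so no separate "last word" block is needed.
    for ch in text + ' ':
        if ch.isalnum():
            word.append(ch)
        elif ch.isspace() and word:
            key = ''.join(word).lower()  # convert to lowercase
            counts[key] = counts.get(key, 0) + 1
            word = []
        # any other character (punctuation) is skipped, as in version 1
    return counts
```

It could be called as `count_words_by_char(open('test.txt', encoding='utf-8').read())`, assuming the same test.txt as above.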

Version 2:

Drawback: the whole file is read into memory at once, which performs poorly on large files.

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path,'r',encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list] # convert to lowercase
  word_set = set(word_list) # so each distinct word is counted only once
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # more concise form
  words_dict = {word: word_list.count(word) for word in word_set}
  for k,v in words_dict.items():
    print(k,v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
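
In version 2, `word_list.count(word)` rescans the entire list once for every distinct word. As a rough sketch of an alternative (the function name `count_words_single_pass` is hypothetical), the tally can be built in a single pass over the regex matches with a plain dict. It still takes the full text as one string, so it does not address the memory drawback noted above.

```python
import re

def count_words_single_pass(text):
    """Tally lowercase \\w+ matches in one pass instead of calling list.count."""
    counts = {}
    for match in re.finditer(r'\w+', text):
        word = match.group().lower()  # convert to lowercase
        counts[word] = counts.get(word, 0) + 1
    return counts
```

Note that `\w` also matches digits and underscores, so tokens such as `abc_1` count as words here, just as in version 2.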

Version 3:

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f: # read line by line instead of loading the whole file
    #line_words = word_reg.findall(line)
    # simpler than the regex above, but punctuation stays attached
    # to the words and case is preserved
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list) # so each distinct word is counted only once
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
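
The output above shows that `split()` leaves punctuation attached ("day,", "soul.") and keeps "We" and "we" apart. One possible way to bring the results back in line with versions 1 and 2 is to normalize each token before counting. The sketch below is an assumption layered on version 3, not part of the original article; `normalize` and `count_split_words` are hypothetical names, and only ASCII punctuation from `string.punctuation` is stripped.

```python
import string

def normalize(token):
    """Strip leading/trailing ASCII punctuation and lowercase the rest."""
    return token.strip(string.punctuation).lower()

def count_split_words(lines):
    """Count whitespace-separated tokens after normalizing each one."""
    counts = {}
    for line in lines:
        for token in line.split():
            word = normalize(token)
            if word:  # skip tokens that were pure punctuation
                counts[word] = counts.get(word, 0) + 1
    return counts
```

Because `lines` can be any iterable of strings, an open file object can be passed directly, which keeps the line-by-line reading of version 3.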

Version 4: counting with Counter

#!python3
# -*- coding:utf-8 -*-
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  for line in f:
    # split on whitespace; punctuation and case are left untouched
    line_words = line.split()
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list)) # count with Counter
  for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
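
Version 4 still collects every token in a list before counting. As a hedged sketch of how `Counter` might be used more directly (the function name `top_words` and its parameter `n` are hypothetical), the counter can be updated one line at a time and then queried with `most_common` to list the most frequent words first. Here the regex `\w+` and lowercasing from version 2 are swapped in for `split()`, so punctuation and case no longer produce separate entries.

```python
import re
from collections import Counter

def top_words(lines, n=3):
    """Return the n most frequent lowercase words as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        # update() adds this line's words to the running totals
        counts.update(re.findall(r'\w+', line.lower()))
    return counts.most_common(n)
```

A call such as `top_words(f, n=10)` inside `with open('test.txt', encoding='utf-8') as f:` would print-ready return the ten most frequent words without holding the full word list in memory.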

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.

- Author -

wanlifeipeng
