Counting English Word Occurrences in a Plain Text File with Python: A Summary of Methods [Tested and Working]


Posted in Python on July 25, 2018

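This article walks through several ways to count how many times each English word appears in a plain text file using Python. They are shared here for reference; the details follow.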

Version 1: inefficient

#!python3
# -*- coding:utf-8 -*-
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
  word = []
  words_dict = {}
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace():  # whitespace (space, \t, \n) ends the current word
      if word:
        word = ''.join(word).lower()  # normalize to lowercase
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
# Handle the last word when the file does not end with whitespace
if word:
  word = ''.join(word).lower()  # normalize to lowercase
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k, v in words_dict.items():
  print(k, v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
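
The first version repeats its counting logic after the loop in order to catch the final word. As a minimal sketch of one way to avoid that duplication, the character scan can be moved into a generator so the counting code appears only once. The names `iter_words` and `count_words` are illustrative and not part of the original; the scanning rules (alphanumeric characters build a word, whitespace ends it, other characters are ignored) are kept the same as above.

```python
def iter_words(text):
    """Yield lowercase words, scanning the text one character at a time."""
    buf = []
    for ch in text:
        if ch.isalnum():
            buf.append(ch)
        elif ch.isspace() and buf:
            # Whitespace ends the current word
            yield ''.join(buf).lower()
            buf = []
    if buf:
        # Flush the last word when the text does not end with whitespace
        yield ''.join(buf).lower()


def count_words(text):
    """Return a dict mapping each word to its number of occurrences."""
    counts = {}
    for word in iter_words(text):
        counts[word] = counts.get(word, 0) + 1
    return counts


# A short in-memory string stands in for the contents of test.txt
for word, n in count_words("We are busy, we").items():
    print(word, n)
```

This is still a character-by-character scan, so it is likely no faster than the original; it only tidies the structure.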

Version 2:

Drawback: the whole file is read into memory at once, which performs poorly on large files.

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list]  # normalize to lowercase
  word_set = set(word_list)  # count each distinct word only once
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # Concise form of the loop above
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
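
The memory drawback noted for the second version could be reduced by reading the file in fixed-size chunks instead of all at once. The sketch below is one possible approach, not the original author's method: `count_words_chunked` and its `chunk_size` parameter are hypothetical names, and a trailing partial word is held back so that a word split across two chunks is still counted once.

```python
import io
import re
from collections import Counter

WORD_RE = re.compile(r'\w+')  # same pattern as the second version


def count_words_chunked(stream, chunk_size=64 * 1024):
    """Count lowercase words from a text stream, reading chunk_size characters at a time."""
    counts = Counter()
    tail = ''  # partial word carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = tail + chunk
        # Walk back over trailing word characters; they may continue in the next chunk
        cut = len(text)
        while cut > 0 and (text[cut - 1].isalnum() or text[cut - 1] == '_'):
            cut -= 1
        tail = text[cut:]
        counts.update(w.lower() for w in WORD_RE.findall(text[:cut]))
    if tail:
        # Whatever is left at end of input is a complete word
        counts[tail.lower()] += 1
    return counts


# io.StringIO stands in for an open text file; the tiny chunk size forces
# words to be split across chunk boundaries
demo = count_words_chunked(io.StringIO("We are busy, we are"), chunk_size=3)
for word, n in demo.items():
    print(word, n)
```

With a real file, the stream would be the object returned by `open(path, encoding='utf-8')`.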

Version 3:

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:
    #line_words = word_reg.findall(line)
    # Simpler than the regex above, but split() keeps attached punctuation
    # and the original case, so "day," and "We" are counted as they appear
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list)  # count each distinct word only once
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
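
The output above shows that `split()` leaves punctuation attached ("day,", "soul.") and treats "We" and "we" as different words. As a rough sketch, assuming the regex and lowercasing from the second version are acceptable, the line-by-line loop can be combined with them so the file is still read one line at a time. The function name `count_words_by_line` is illustrative only.

```python
import re

WORD_RE = re.compile(r'\w+')  # same pattern as the second version


def count_words_by_line(lines):
    """Count lowercase words from any iterable of text lines."""
    counts = {}
    for line in lines:
        for word in WORD_RE.findall(line):
            word = word.lower()
            counts[word] = counts.get(word, 0) + 1
    return counts


# In-memory lines stand in for an open file; a file object opened with
# open(path, encoding='utf-8') can be passed in the same way
sample_lines = ["We are busy all day,\n", "we have lost themselves.\n"]
for word, n in count_words_by_line(sample_lines).items():
    print(word, n)
```

Counting in a single pass with a dict also avoids calling `word_list.count()` once per distinct word, which rescans the whole list each time.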

Version 4: counting with Counter

#!python3
# -*- coding:utf-8 -*-
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  for line in f:
    # split() keeps attached punctuation and the original case
    line_words = line.split()
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list))  # count with Counter
  for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
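
`Counter` can also be updated line by line, which avoids building the intermediate `word_list`, and its `most_common()` method returns the words sorted by frequency. The following is a small sketch under those assumptions; `count_words_counter` is a hypothetical helper name, and stripping `string.punctuation` from each token is one simple way to keep the `split()` approach while merging "day," with "day".

```python
import string
from collections import Counter


def count_words_counter(lines):
    """Count lowercase words from an iterable of lines using Counter.update."""
    counts = Counter()
    for line in lines:
        # Strip leading/trailing ASCII punctuation from each token, then lowercase
        tokens = (t.strip(string.punctuation).lower() for t in line.split())
        # Skip tokens that consisted only of punctuation
        counts.update(t for t in tokens if t)
    return counts


# In-memory lines stand in for an open file object
sample_lines = [
    "We are busy all day, like swarms of flies without souls,\n",
    "we grew up, we have lost themselves.\n",
]
counts = count_words_counter(sample_lines)
for word, n in counts.most_common(3):  # the three most frequent words
    print(word, n)
```

Note that `strip()` only removes punctuation at the ends of a token, so a word such as "don't" keeps its apostrophe.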

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
