Summary of Methods for Counting English Word Occurrences in a Plain-Text File with Python [Tested]


Posted in Python on July 25, 2018

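This article walks through several ways to count how often each English word occurs in a plain-text file using Python. The versions are shared here for reference: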

Version 1: inefficient

# -*- coding:utf-8 -*-
#!python3
path = 'test.txt'
with open(path,encoding='utf-8',newline='') as f:
  word = []
  words_dict= {}
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace(): # whitespace: space, \t, \n
      if word:
        word = ''.join(word).lower() # convert to lowercase
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
# handle the last word (the file may not end with whitespace)
if word:
  word = ''.join(word).lower() # convert to lowercase
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k,v in words_dict.items():
  print(k,v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
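
The final-word handling in the listing above repeats the counting logic from inside the loop. One way to avoid that, shown here as a minimal sketch rather than the author's code, is to move the character scan into a generator and count whatever it yields. The names `iter_words_scan` and `count_words_scan` are hypothetical. Like the original, the sketch skips punctuation without ending the word, so `don't` would be counted as `dont`.

```python
def iter_words_scan(text):
    """Yield lowercase words from text, scanning one character at a time."""
    buf = []
    for ch in text:
        if ch.isalnum():
            buf.append(ch)
        elif ch.isspace() and buf:
            # whitespace ends the current word
            yield ''.join(buf).lower()
            buf = []
        # any other character (punctuation) is skipped, as in the original
    if buf:
        # flush the last word when the text does not end with whitespace
        yield ''.join(buf).lower()


def count_words_scan(text):
    """Return a dict mapping each word to its number of occurrences."""
    counts = {}
    for word in iter_words_scan(text):
        counts[word] = counts.get(word, 0) + 1
    return counts


sample = "We are busy all day, like swarms of flies. We are"
print(count_words_scan(sample))
```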

Version 2:

Drawback: the whole file is read into memory at once, so it performs poorly on large files.

# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path,'r',encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list] # convert to lowercase
  word_set = set(word_list) # deduplicate so each word is counted only once
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # more concise form
  words_dict = {word: word_list.count(word) for word in word_set}
  for k,v in words_dict.items():
    print(k,v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
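
Besides reading the whole file at once, the second version calls `word_list.count(word)` once per distinct word, so the list is rescanned for every entry in `word_set`. A single pass over the matches is enough. The sketch below keeps the same `\w+` pattern and lowercase normalization; the function name `count_words_regex` is an assumption made for illustration.

```python
import re

WORD_RE = re.compile(r'\w+')


def count_words_regex(text):
    """Count lowercase word matches in text with one pass over the matches."""
    counts = {}
    for match in WORD_RE.finditer(text):
        word = match.group().lower()
        counts[word] = counts.get(word, 0) + 1
    return counts


text = "The voices of the soul. The soul."
for word, n in count_words_regex(text).items():
    print(word, n)
```

Note that `\w` also matches digits and underscores, so tokens such as `2018` or `foo_bar` are counted as words.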

Version 3:

# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:
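    #line_words = word_reg.findall(line)
    # simpler than the regex above, but punctuation stays attached to words and case is preserved
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list) # deduplicate so each word is counted only once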
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
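
As the output above shows, `line.split()` leaves punctuation attached (`day,`, `soul.`) and keeps case, so `We` and `we` are counted separately. If `split()` is preferred over the regex, each token can be cleaned before counting. This is a minimal sketch, assuming that stripping leading and trailing ASCII punctuation is sufficient; `normalize_token` is a hypothetical helper, not part of the original code.

```python
import string


def normalize_token(token):
    """Strip leading/trailing ASCII punctuation and lowercase the token."""
    return token.strip(string.punctuation).lower()


line = "We are busy all day, like swarms of flies without souls, noisy,"
# drop tokens that become empty after stripping (e.g. a lone "--")
words = [w for w in (normalize_token(t) for t in line.split()) if w]
print(words)
```

Unlike the character scan in the first version, this keeps internal punctuation, so `don't` stays `don't`.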

Version 4: counting with Counter

# -*- coding:utf-8 -*-
#!python3
import collections
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:
    line_words = line.split()
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list)) # count with Counter
  for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
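
`Counter` can also be fed line by line, which avoids building the intermediate `word_list`, and its `most_common()` method returns entries sorted by count. The sketch below combines this with the `\w+` pattern and lowercasing from the second version, so that punctuation and case do not split the counts. `count_words_stream` is a hypothetical name; it accepts any iterable of lines, including an open file object.

```python
import collections
import io
import re

WORD_RE = re.compile(r'\w+')


def count_words_stream(lines):
    """Return a Counter of lowercase words, consuming lines one at a time."""
    counter = collections.Counter()
    for line in lines:
        counter.update(WORD_RE.findall(line.lower()))
    return counter


# io.StringIO stands in for an open text file in this example
demo = io.StringIO("We are busy all day,\nwe have lost themselves.\n")
counts = count_words_stream(demo)
for word, n in counts.most_common(3):
    print(word, n)
```

With a real file, the call would look like `with open('test.txt', encoding='utf-8') as f: counts = count_words_stream(f)`.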

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
