Summary of Methods for Counting English Word Occurrences in a Plain-Text File with Python [Tested and Working]


Posted in Python on July 25, 2018

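This article walks through several ways to count how many times each English word occurs in a plain-text file using Python. They are shared here for reference, as follows: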

Version 1: inefficient

# -*- coding:utf-8 -*-
#!python3
path = 'test.txt'
with open(path,encoding='utf-8',newline='') as f:
  word = []
  words_dict= {}
  # Walk the file one character at a time; punctuation is neither
  # alphanumeric nor whitespace, so it is simply skipped
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace(): # whitespace: space, \t, \n
      if word:
        word = ''.join(word).lower() # convert to lowercase
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
# Handle the last word (the file may not end with whitespace)
if word:
  word = ''.join(word).lower() # convert to lowercase
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k,v in words_dict.items():
  print(k,v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
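
The character-by-character approach above repeats its tallying logic for the final word. A minimal sketch of one way to avoid that duplication follows. The function name `count_words` is hypothetical, and the sketch makes one assumption that differs from the listing above: any non-alphanumeric character (not only whitespace) ends the current word.

```python
def count_words(text):
    """Tally lowercase alphanumeric runs in text, one character at a time."""
    counts = {}
    current = []
    # A trailing space acts as a sentinel so the final word is flushed too.
    for ch in text + ' ':
        if ch.isalnum():
            current.append(ch)
        elif current:
            word = ''.join(current).lower()
            # dict.get supplies 0 for a word seen for the first time
            counts[word] = counts.get(word, 0) + 1
            current = []
    return counts

sample = 'We are busy. We have lost themselves.'
print(count_words(sample))
# {'we': 2, 'are': 1, 'busy': 1, 'have': 1, 'lost': 1, 'themselves': 1}
```

Using `dict.get(word, 0) + 1` replaces the explicit `if word not in ...` branch, and the sentinel removes the need for a second copy of the tallying code after the loop.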

Version 2:

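Drawback: the whole file is read into memory at once, which performs poorly on large files. In addition, word_list.count() rescans the entire list once for every distinct word.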

# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path,'r',encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list] # convert to lowercase
  word_set = set(word_list) # deduplicate so each word is looked up only once
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # Concise form of the loop above
  words_dict = {word: word_list.count(word) for word in word_set}
  for k,v in words_dict.items():
    print(k,v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
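
The `\w+` pattern used above matches digits and underscores as well as letters, and it splits contractions at the apostrophe. The sketch below compares it with a stricter pattern. The alternative pattern `ENGLISH_WORD` is an assumption about what should count as an "English word" and is not taken from the original article.

```python
import re

text = "snake_case isn't 42"

# \w+ keeps underscores and digits, and breaks at the apostrophe
print(re.findall(r'\w+', text))
# ['snake_case', 'isn', 't', '42']

# Hypothetical stricter pattern: ASCII letters, optionally followed by
# an apostrophe and more letters (so "isn't" stays in one piece)
ENGLISH_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
print(ENGLISH_WORD.findall(text))
# ['snake', 'case', "isn't"]
```

For the test text used in this article both patterns give the same words, since it contains no digits, underscores, or apostrophes.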

Version 3:

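# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f: # read line by line instead of loading the whole file
    #line_words = word_reg.findall(line)
    # Simpler than the regex above, but punctuation stays attached
    # to the words and case is preserved (see the output below)
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list) # deduplicate so each word is looked up only once
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)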

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
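
In the output above, `split()` leaves punctuation attached ("day,", "soul.") and keeps "We" and "we" as separate keys. One possible way to normalize the tokens while still splitting on whitespace is sketched below. The helper name `clean_tokens` is hypothetical, and stripping `string.punctuation` only covers ASCII punctuation at the edges of each token.

```python
import string

def clean_tokens(line):
    """Split on whitespace, then trim punctuation and lowercase each token."""
    cleaned = []
    for token in line.split():
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that consisted of punctuation only
            cleaned.append(word)
    return cleaned

print(clean_tokens('As time goes by, childhood away, we grew up,'))
# ['as', 'time', 'goes', 'by', 'childhood', 'away', 'we', 'grew', 'up']
```

Replacing `line.split()` with a helper like this would merge "We"/"we" and "away"/"away," into single entries.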
indulge 1

Version 4: counting with Counter

# -*- coding:utf-8 -*-
#!python3
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  for line in f:
    # Whitespace split: punctuation stays attached and case is preserved
    line_words = line.split()
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list)) # tally with Counter in a single pass
  for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
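
The ideas from the versions above can be combined: read line by line, extract words with the `\w+` regex, lowercase them, and let `Counter` do the tallying. The following is a minimal sketch under those assumptions, not the article's own code. The function name `count_words_in_lines` is hypothetical, and `io.StringIO` stands in for a real file so the example runs on its own.

```python
import collections
import io
import re

WORD_RE = re.compile(r'\w+')

def count_words_in_lines(lines):
    """Count lowercase words from any iterable of text lines."""
    counts = collections.Counter()
    for line in lines:
        # Counter.update adds one to the tally of each item it is given
        counts.update(WORD_RE.findall(line.lower()))
    return counts

# An in-memory stand-in for an open text file
demo = io.StringIO('We are busy all day,\nwe have lost themselves.\n')
counts = count_words_in_lines(demo)
print(counts.most_common(1))
# [('we', 2)]
```

With a real file, the open file object can be passed directly, for example `with open('test.txt', encoding='utf-8') as f: counts = count_words_in_lines(f)`. Only one line is held in memory at a time, and `most_common()` returns the words sorted by frequency.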

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
