8g内存用python读取10文件_面试题-python 如何读取一个大于 10G 的txt文件?


Posted in Python onMay 28, 2021

前言

用python 读取一个大于10G 的文件,自己电脑只有8G内存,一运行就报内存溢出:MemoryError

python 如何用open函数读取大文件呢?

读取大文件

首先可以自己先制作一个大于10G的txt文件

a = '''
2021-02-02 21:33:31,678 [django.request:93] [base:get_response] [WARNING]- Not Found: /http:/123.125.114.144/
2021-02-02 21:33:31,679 [django.server:124] [basehttp:log_message] [WARNING]- "HEAD http://123.125.114.144/ HTTP/1.1" 404 1678
2021-02-02 22:14:04,121 [django.server:124] [basehttp:log_message] [INFO]- code 400, message Bad request version ('HTTP')
2021-02-02 22:14:04,122 [django.server:124] [basehttp:log_message] [WARNING]- "GET ../../mnt/custom/ProductDefinition HTTP" 400 -
2021-02-02 22:16:21,052 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/login HTTP/1.1" 301 0
2021-02-02 22:16:21,123 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/login/ HTTP/1.1" 200 3876
2021-02-02 22:16:21,192 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/img/main_bg.png HTTP/1.1" 200 2801
2021-02-02 22:16:21,196 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/iconfont/style.css HTTP/1.1" 200 1638
2021-02-02 22:16:21,229 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/img/bg.jpg HTTP/1.1" 200 135990
2021-02-02 22:16:21,307 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/iconfont/fonts/icomoon.ttf?u4m6fy HTTP/1.1" 200 6900
2021-02-02 22:16:23,525 [django.server:124] [basehttp:log_message] [INFO]- "POST /api/login/ HTTP/1.1" 302 0
2021-02-02 22:16:23,618 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/index/ HTTP/1.1" 200 18447
2021-02-02 22:16:23,709 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/js/commons.js HTTP/1.1" 200 13209
2021-02-02 22:16:23,712 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/css/admin.css HTTP/1.1" 200 19660
2021-02-02 22:16:23,712 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/css/common.css HTTP/1.1" 200 1004
2021-02-02 22:16:23,714 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/js/app.js HTTP/1.1" 200 20844
2021-02-02 22:16:26,509 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/report_list/1/ HTTP/1.1" 200 14649
2021-02-02 22:16:51,496 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/test_list/1/ HTTP/1.1" 200 24874
2021-02-02 22:16:51,721 [django.server:124] [basehttp:log_message] [INFO]- "POST /api/add_case/ HTTP/1.1" 200 0
2021-02-02 22:16:59,707 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/test_list/1/ HTTP/1.1" 200 24874
2021-02-03 22:16:59,909 [django.server:124] [basehttp:log_message] [INFO]- "POST /api/add_case/ HTTP/1.1" 200 0
2021-02-03 22:17:01,306 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/edit_case/1/ HTTP/1.1" 200 36504
2021-02-03 22:17:06,265 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/add_project/ HTTP/1.1" 200 17737
2021-02-03 22:17:07,825 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/project_list/1/ HTTP/1.1" 200 29789
2021-02-03 22:17:13,116 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/add_config/ HTTP/1.1" 200 24816
2021-02-03 22:17:19,671 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/config_list/1/ HTTP/1.1" 200 19532
'''

while True:
with open("xxx.log", "a", encoding="utf-8") as fp:
fp.write(a)

循环写入到 xxx.log 文件,运行 3-5 分钟,pycharm 打开查看文件大小大于 10G

8g内存用python读取10文件_面试题-python 如何读取一个大于 10G 的txt文件?

于是我用open函数 直接读取

f = open("xxx.log", 'r')
print(f.read())
f.close()

抛出内存溢出异常:MemoryError

Traceback (most recent call last):
File "D:/2021kecheng06/demo/txt.py", line 35, in
print(f.read())
MemoryError

运行的时候可以看下自己电脑的内存已经占了100%, cpu高达91% ,不挂掉才怪了!

8g内存用python读取10文件_面试题-python 如何读取一个大于 10G 的txt文件?

这种错误的原因在于,read()方法执行操作是一次性的都读入内存中,显然文件大于内存就会报错。

read() 的几种方法

1.read() 方法可以带参数 n, n 是每次读取的大小长度,也就是可以每次读一部分,这样就不会导致内存溢出

f = open("xxx.log", 'r')
print(f.read(2048))
f.close()

运行结果

2019-10-24 21:33:31,678 [django.request:93] [base:get_response] [WARNING]- Not Found: /http:/123.125.114.144/
2019-10-24 21:33:31,679 [django.server:124] [basehttp:log_message] [WARNING]- "HEAD http://123.125.114.144/ HTTP/1.1" 404 1678
2019-10-24 22:14:04,121 [django.server:124] [basehttp:log_message] [INFO]- code 400, message Bad request version ('HTTP')
2019-10-24 22:14:04,122 [django.server:124] [basehttp:log_message] [WARNING]- "GET ../../mnt/custom/ProductDefinition HTTP" 400 -
2019-10-24 22:16:21,052 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/login HTTP/1.1" 301 0
2019-10-24 22:16:21,123 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/login/ HTTP/1.1" 200 3876
2019-10-24 22:16:21,192 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/img/main_bg.png HTTP/1.1" 200 2801
2019-10-24 22:16:21,196 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/iconfont/style.css HTTP/1.1" 200 1638
2019-10-24 22:16:21,229 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/img/bg.jpg HTTP/1.1" 200 135990
2019-10-24 22:16:21,307 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/iconfont/fonts/icomoon.ttf?u4m6fy HTTP/1.1" 200 6900
2019-10-24 22:16:23,525 [django.server:124] [basehttp:log_message] [INFO]- "POST /api/login/ HTTP/1.1" 302 0
2019-10-24 22:16:23,618 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/index/ HTTP/1.1" 200 18447
2019-10-24 22:16:23,709 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/js/commons.js HTTP/1.1" 200 13209
2019-10-24 22:16:23,712 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/css/admin.css HTTP/1.1" 200 19660
2019-10-24 22:16:23,712 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/css/common.css HTTP/1.1" 200 1004
2019-10-24 22:16:23,714 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/js/app.js HTTP/1.1" 200 20844

这样就只读取了2048个字符,全部读取的话,循环读就行

f = open("xxx.log", 'r')
while True:
block = f.read(2048)
print(block)
if not block:
break
f.close()
2.readline():每次读取一行,这个方法也不会报错
f = open("xxx.log", 'r')
while True:
line = f.readline()
print(line, end="")
if not line:
break
f.close()

3.readlines():读取全部的行,生成一个list,通过list来对文件进行处理,显然这种方式依然会造成:MemoyError

真正 Pythonic 的方法

真正 Pythonci 的方法,使用 with 结构打开文件,fp 是一个可迭代对象,可以用 for 遍历读取每行的文件内容

with open("xxx.log", 'r') as fp:

for line in fp:

print(line, end="")

yield 生成器读取大文件

前面一篇讲yield 生成器的时候提到读取大文件,函数返回一个可迭代对象,用next()方法读取文件内容

def read_file(fpath):

BLOCK_SIZE = 1024

with open(fpath, 'rb') as f:

while True:

block = f.read(BLOCK_SIZE)

if block:

yield block

else:

return

if __name__ == '__main__':

a = read_file("xxx.log")

print(a) # generator objec

print(next(a)) # bytes类型

print(next(a).decode("utf-8")) # str

运行结果

b'\r\n2019-10-24 21:33:31,678 [django.request:93] [base:get_response] [WARNING]- Not Found: /http:/123.125.114.144/\r\n2019-10-24 21:33:31,679 [django.server:124] [basehttp:log_message] [WARNING]- "HEAD http://123.125.114.144/ HTTP/1.1" 404 1678\r\n2019-10-24 22:14:04,121 [django.server:124] [basehttp:log_message] [INFO]- code 400, message Bad request version (\'HTTP\')\r\n2019-10-24 22:14:04,122 [django.server:124] [basehttp:log_message] [WARNING]- "GET ../../mnt/custom/ProductDefinition HTTP" 400 -\r\n2019-10-24 22:16:21,052 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/login HTTP/1.1" 301 0\r\n2019-10-24 22:16:21,123 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/login/ HTTP/1.1" 200 3876\r\n2019-10-24 22:16:21,192 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/img/main_bg.png HTTP/1.1" 200 2801\r\n2019-10-24 22:16:21,196 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/iconfont/style.css HTTP/1.1" 200 1638\r\n2019-10-24 22:16:21,229 [django.server:124] '

Python 相关文章推荐
python实现简单ftp客户端的方法
Jun 28 Python
Python基础篇之初识Python必看攻略
Jun 23 Python
Python获取文件所在目录和文件名的方法
Jan 12 Python
Python使用正则表达式实现文本替换的方法
Apr 18 Python
python生成器,可迭代对象,迭代器区别和联系
Feb 04 Python
用TensorFlow实现戴明回归算法的示例
May 02 Python
Python遍历字典方式就实例详解
Dec 28 Python
tensorflow模型继续训练 fineturn实例
Jan 21 Python
Python安装与卸载流程详细步骤(图解)
Feb 20 Python
Django Admin设置应用程序及模型顺序方法详解
Apr 01 Python
python用Configobj模块读取配置文件
Sep 26 Python
教你使用Python pypinyin库实现汉字转拼音
May 27 Python
用python画城市轮播地图
用Python实现一个打字速度测试工具来测试你的手速
解决Pytorch dataloader时报错每个tensor维度不一样的问题
May 28 #Python
pytorch锁死在dataloader(训练时卡死)
Python趣味爬虫之用Python实现智慧校园一键评教
Pytorch 如何加速Dataloader提升数据读取速度
在前女友婚礼上,用Python破解了现场的WIFI还把名称改成了
You might like
PHP 中文处理技巧
2010/04/25 PHP
PHP中文分词的简单实现代码分享
2011/07/17 PHP
[原创]php集成安装包wampserver修改密码后phpmyadmin无法登陆的解决方法
2016/11/23 PHP
PDO实现学生管理系统
2020/03/21 PHP
JavaScript Timer实现代码
2010/02/17 Javascript
仿jQuery的siblings效果的js代码
2011/08/09 Javascript
分享一个asp.net pager分页控件
2012/01/04 Javascript
js前台判断开始时间是否小于结束时间
2012/02/23 Javascript
jQuery登陆判断简单实现代码
2013/04/21 Javascript
js在IE与firefox的差异集锦
2014/11/11 Javascript
AngularJS基础知识
2014/12/21 Javascript
javascript操作符"!~"详解
2015/02/10 Javascript
深入学习AngularJS中数据的双向绑定机制
2016/03/04 Javascript
javascript数组对象常用api函数小结(连接,插入,删除,反转,排序等)
2016/09/20 Javascript
关于javascript事件响应的基础语法总结(必看篇)
2016/12/26 Javascript
vue axios基于常见业务场景的二次封装的实现
2018/09/21 Javascript
vue实现直播间点赞飘心效果的示例代码
2019/09/20 Javascript
使用python解析xml成对应的html示例分享
2014/04/02 Python
跟老齐学Python之用Python计算
2014/09/12 Python
从Python的源码浅要剖析Python的内存管理
2015/04/16 Python
python使用tensorflow深度学习识别验证码
2018/04/03 Python
Python代码太长换行的实现
2019/07/05 Python
使用python客户端访问impala的操作方式
2020/03/28 Python
Pycharm安装第三方库失败解决方案
2020/11/17 Python
CSS3实现图片抽屉式效果的示例代码
2019/11/06 HTML / CSS
HTML5给汉字加拼音收起展开组件的实现代码
2020/04/08 HTML / CSS
里程积分管理买卖交换平台:Points.com
2017/01/13 全球购物
Blue Nile中国官网:全球知名的钻石和珠宝网络零售商
2020/03/22 全球购物
消防安全责任书范本
2014/04/15 职场文书
开展党的群众路线教育实践活动领导班子对照检查材料
2014/09/25 职场文书
九寨沟导游词
2015/02/02 职场文书
2016年六一儿童节开幕词
2016/03/04 职场文书
Nginx服务器如何设置url链接
2021/03/31 Servers
电脑开机弹出documents文件夹怎么回事?弹出documents文件夹解决方法
2022/04/08 数码科技
MySQL 语句执行顺序举例解析
2022/06/05 MySQL
JS前端使用canvas实现物体的点选示例
2022/08/05 Javascript