Extracting text from HTML in Python

Posted in Python on May 20, 2021

When working on NLP problems, you sometimes need a large collection of text. The internet is the largest source of text, but unfortunately extracting text from arbitrary HTML pages is a hard and painful task.
Suppose we need to extract the full text from a variety of web pages and strip all HTML tags. The usual default is the get_text method from the BeautifulSoup package, here configured to use lxml as its parser. It is a well-tested solution, but it can be very slow when processing tens of thousands of HTML documents.
By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free! Here is a simple benchmark that parses 10,000 HTML pages from commoncrawl (https://commoncrawl.org/):

# coding: utf-8

from time import time

import warc
from bs4 import BeautifulSoup
from selectolax.parser import HTMLParser


def get_text_bs(html):
    tree = BeautifulSoup(html, 'lxml')

    body = tree.body
    if body is None:
        return None

    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()

    text = body.get_text(separator='\n')
    return text


def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='\n')
    return text


def read_doc(record, parser=get_text_selectolax):
    url = record.url
    text = None

    if url:
        payload = record.payload.read()
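        # Split the HTTP headers from the body. partition() returns an empty
        # body instead of raising when a record has no blank separator line.
        header, _, html = payload.partition(b'\r\n\r\n')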
        html = html.strip()

        if len(html) > 0:
            text = parser(html)

    return url, text


def process_warc(file_name, parser, limit=10000):
    warc_file = warc.open(file_name, 'rb')
    t0 = time()
    n_documents = 0
    for i, record in enumerate(warc_file):
        # Check the limit first so that skipped records cannot bypass it.
        if i >= limit:
            break

        url, doc = read_doc(record, parser)

        if not doc or not url:
            continue

        n_documents += 1

    warc_file.close()
    print('Parser: %s' % parser.__name__)
    print('Parsing took %s seconds and produced %s documents\n' % (time() - t0, n_documents))

>>> ! wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
>>> file_name = "CC-MAIN-20180116070444-20180116090444-00000.warc.gz"
>>> process_warc(file_name, get_text_selectolax, 10000)
Parser: get_text_selectolax
Parsing took 16.170367002487183 seconds and produced 3317 documents
>>> process_warc(file_name, get_text_bs, 10000)
Parser: get_text_bs
Parsing took 432.6902508735657 seconds and produced 3283 documents

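Clearly this is not the best way to benchmark something, but it gives an idea: in this run selectolax was about 27 times faster than BeautifulSoup backed by lxml (16.2 seconds versus 432.7 seconds).
selectolax is a good fit for stripping HTML down to plain text. Suppose I have more than 10,000 HTML fragments that need to be indexed into Elasticsearch as plain text. (Elasticsearch has an html_strip character filter, but it is not what I want or need in this context.) It turns out that stripping HTML to plain text at this scale can be quite inefficient. So what is the most efficient way to do it?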

PyQuery

from pyquery import PyQuery as pq

text = pq(html).text()

selectolax

from selectolax.parser import HTMLParser

text = HTMLParser(html).text()

Regular expressions

import re

clean_regex = re.compile(r'<.*?>')
text = clean_regex.sub('', html)

Results

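I wrote a script to measure the time. It iterates over 10,000 files containing HTML fragments. Note: these fragments are not complete <html> documents (with <head>, <body>, and so on), just small pieces of HTML. The average size is 10,314 bytes (median 5,138 bytes). The results are as follows: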

pyquery
  SUM:    18.61 seconds
  MEAN:   1.8633 ms
  MEDIAN: 1.0554 ms
selectolax
  SUM:    3.08 seconds
  MEAN:   0.3149 ms
  MEDIAN: 0.1621 ms
regex
  SUM:    1.64 seconds
  MEAN:   0.1613 ms
  MEDIAN: 0.0881 ms

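I ran this many times and the results are very stable. The key point: selectolax is roughly 6 to 7 times faster than PyQuery (about 6.0x by total time and 6.5x by median).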
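The timing script itself is not shown in the article. As a rough sketch of how such a harness could be written with only the standard library, the block below times one stripping function per document and prints the same SUM / MEAN / MEDIAN layout as the results above. The names `benchmark`, `report`, `strip_regex`, and the placeholder `documents` list are assumptions made for illustration; the original read 10,000 HTML fragments from files.

```python
import re
import statistics
from time import perf_counter

clean_regex = re.compile(r'<.*?>')


def strip_regex(html):
    """Remove anything that looks like a tag (the regex approach)."""
    return clean_regex.sub('', html)


def benchmark(func, documents):
    """Call func on every document; return per-call durations in milliseconds."""
    timings = []
    for doc in documents:
        t0 = perf_counter()
        func(doc)
        timings.append((perf_counter() - t0) * 1000)
    return timings


def report(name, timings):
    """Print totals in the same layout as the result tables."""
    print(name)
    print('  SUM:    %.2f seconds' % (sum(timings) / 1000))
    print('  MEAN:   %.4f ms' % statistics.mean(timings))
    print('  MEDIAN: %.4f ms' % statistics.median(timings))


# Placeholder input (assumption): stands in for the 10,000 fragment files.
documents = ['<p>Hello <b>world</b></p>'] * 1000
timings = benchmark(strip_regex, documents)
report('regex', timings)
```

To compare libraries, the same `benchmark` call can be repeated with a PyQuery-based or selectolax-based function in place of `strip_regex`.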

Do regular expressions work well? Really?

For the most basic HTML blobs it may work well enough. In practice, though, if the HTML is <p>Foo &amp; Bar</p>, I expect the plain-text conversion to be Foo & Bar, not Foo &amp; Bar.
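The entity problem can be reproduced in a few lines. The sketch below shows that the tag-stripping regex leaves `&amp;` untouched, and that the standard library's `html.unescape` can decode it afterwards. The helper name `strip_tags_regex` is hypothetical, and this is only a partial workaround: it does not address the tag-removal limitation of the regex approach.

```python
import html
import re

clean_regex = re.compile(r'<.*?>')


def strip_tags_regex(markup):
    """Remove anything that looks like a tag, then decode HTML entities."""
    without_tags = clean_regex.sub('', markup)
    return html.unescape(without_tags)


raw = clean_regex.sub('', '<p>Foo &amp; Bar</p>')
decoded = strip_tags_regex('<p>Foo &amp; Bar</p>')
print(raw)      # Foo &amp; Bar
print(decoded)  # Foo & Bar
```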
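More importantly, PyQuery and selectolax support something very specific that matters for my use case: before continuing, I need to remove certain tags (and their content). For example: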

<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<div style="display: none">This should also get stripped.</div>

A simple tag-stripping regular expression cannot do this.
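To make the limitation concrete, here is a small sketch that runs the tag-stripping regex from earlier on the three-line sample. The variable names are chosen for illustration. The regex removes the markup but keeps the text inside the elements that were supposed to disappear.

```python
import re

clean_regex = re.compile(r'<.*?>')

sample = (
    '<h4 class="warning">This should get stripped.</h4>\n'
    '<p>Please keep.</p>\n'
    '<div style="display: none">This should also get stripped.</div>'
)

# Only the tags are removed; the unwanted element content survives.
text = clean_regex.sub('', sample)
print(text)
```

All three sentences are still present in `text`, which is why a parser that understands the document tree is needed for this requirement.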

Version 2.0

My requirements may change, but basically I want to remove certain tags, for example <div class="warning">, <div class="hidden">, and <div style="display: none">. So let's implement that:

PyQuery

import re

from pyquery import PyQuery as pq

# Matches inline styles such as "display: none" or "display:none".
_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
doc.remove('div.warning, div.hidden')
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()

selectolax

import re

from selectolax.parser import HTMLParser

# Matches inline styles such as "display: none" or "display:none".
_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()

This actually works. When I now run the same benchmark on the 10,000 fragments, the new results are:

pyquery
  SUM:    21.70 seconds
  MEAN:   2.1701 ms
  MEDIAN: 1.3989 ms
selectolax
  SUM:    3.59 seconds
  MEAN:   0.3589 ms
  MEDIAN: 0.2184 ms
regex
  Skip

Again, selectolax beats PyQuery by a factor of about 6.

Conclusion

Regular expressions are fast but limited in what they can do. The efficiency of selectolax is impressive.


- Author -

Python中文社区
