
Matching an image URL in a web page with Python's re module

Posted in Python on December 20, 2018

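I recently wrote a Python program that grabs the background image from the Bing home page (http://cn.bing.com/) and sets it as my desktop wallpaper. While matching the image URL with a regular expression, I ran into a match that kept failing.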

The image URL to extract appears in the page source inside a g_img={url: "..."} assignment.

At first I used this pattern:

reg = re.compile('.*g_img={url: "(http.*?jpg)"')

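No matter what I tried, it never matched. I then saved the page source, opened it in Notepad++, and searched with Notepad++'s regular-expression search, which found the URL straight away.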


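Next I wrote a test script that stores the line containing the image URL in a string, and that matched immediately. In the code below, data (the whole page) does not match, whereas line does.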

# -*- coding: utf-8 -*-
# Python 3 version of the test script.
import re

# The single line of the page source that contains the image URL.
line = r'''Bnp.Internal.Close(0,0,60056); } });;g_img={url: "https://az12410.vo.msecnd.net/homepage/app/2016hw/BingHalloween_BkgImg.jpg",id:'bgDiv',d:'200',cN'''

# The whole saved page: many lines, so it contains newline characters.
with open('bing.html', 'r', encoding='utf-8', errors='ignore') as f:
    data = f.read()

reg = re.compile(r'.*g_img={url: "(http.*?jpg)"')

# match() anchors at the start of the string, so the leading '.*'
# has to consume everything that comes before g_img.
m1 = reg.match(data)
if m1:
    print(m1.group(1))
else:
    print("data Not match .")

print(20 * '-')

m2 = reg.match(line)
if m2:
    print(m2.group(1))
else:
    print("line Not match .")

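So line and data clearly differ. The difference is that data spans multiple lines and contains newline characters, whereas line is a single line with none. When I added a newline to the string line, it stopped matching as well.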
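The difference can be reproduced without the saved page. The sketch below is written for Python 3; the shortened URL (https://example.com/bg.jpg) and the names single_line and multi_line are made up for illustration and stand in for line and data. It only shows that the same pattern matches a one-line string and fails once the target sits on a later line.

```python
import re

reg = re.compile(r'.*g_img={url: "(http.*?jpg)"')

# Hypothetical stand-ins for `line` and `data`; the URL is a placeholder.
single_line = 'foo();g_img={url: "https://example.com/bg.jpg",id:\'bgDiv\''
multi_line = '<html>\n<script>\n' + single_line + '\n</script>'

# match() anchors at position 0, and '.' does not match '\n' by default,
# so the leading '.*' can only consume the first line of multi_line.
single_match = reg.match(single_line)
multi_match = reg.match(multi_line)

print(single_match.group(1) if single_match else "single_line: no match")
print(multi_match.group(1) if multi_match else "multi_line: no match")
```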

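That makes the cause clear. It lies in this statement:

re.compile('.*g_img={url: "(http.*?jpg)"')

By default '.' matches any character except a newline, and re.match only matches at the start of the string, so the leading '.*' cannot get past the first line of data.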

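Looking through the Python documentation, I found that re.compile() takes an optional second argument, flags. Its value is built from constants defined in re, which include the following: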

re.DEBUG: Display debug information about compiled expression.

re.I, re.IGNORECASE: Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale.

re.L, re.LOCALE: Make \w, \W, \b, \B, \s and \S dependent on the current locale.

re.M, re.MULTILINE: When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

re.S, re.DOTALL: Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

re.U, re.UNICODE: Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. New in version 2.0.

re.X, re.VERBOSE: This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
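To make a few of these flags concrete, here is a small Python 3 sketch. The sample string text and all variable names are invented for illustration; each call only shows the effect of one flag.

```python
import re

text = "First line\nsecond LINE"

# re.I: case-insensitive matching.
ignore_case = re.findall(r'line', text, re.I)

# re.M: '^' also matches immediately after each newline.
line_starts = re.findall(r'^\w+', text, re.M)

# re.S: '.' also matches the newline, so one match can span both lines.
default_dot = re.match(r'.*', text).group(0)
dotall_dot = re.match(r'.*', text, re.S).group(0)

# re.X: whitespace and '#' comments inside the pattern are ignored.
verbose = re.compile(r'''
    (\w+)   # first word
    \s+     # separator
    (\w+)   # second word
''', re.X)
first_two = verbose.match(text).groups()

print(ignore_case, line_starts, first_two)
print(repr(default_dot), repr(dotall_dot))
```

Flags can be combined with the | operator, for example re.I | re.S.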

What we need here is re.S, which makes '.' match every character, including newlines. Change the regular expression to:

reg = re.compile('.*g_img={url: "(http.*?jpg)"', re.S)

and the problem is solved.
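As a rough end-to-end check of the fix, the Python 3 sketch below wraps the corrected pattern in a helper. The function name extract_bg_url, the constant BG_PATTERN, and the sample_page string with its example.com URL are all hypothetical; only the pattern and the re.S flag come from the article.

```python
import re

# The corrected pattern from the article: re.S lets the leading '.*'
# cross newlines, so match() can reach g_img on any line.
BG_PATTERN = re.compile(r'.*g_img={url: "(http.*?jpg)"', re.S)


def extract_bg_url(html):
    """Return the background image URL found in html, or None."""
    m = BG_PATTERN.match(html)
    return m.group(1) if m else None


# Made-up page source; the real Bing page is much larger.
sample_page = (
    '<html>\n<head>\n<script>\n'
    'init();g_img={url: "https://example.com/homepage/bg.jpg",id:\'bgDiv\'};\n'
    '</script>\n</head>\n</html>'
)

url = extract_bg_url(sample_page)
print(url)

# Variant that is not from the article: search() scans the whole string,
# so dropping the leading '.*' also works without any flag.
alt = re.search(r'g_img={url: "(http.*?jpg)"', sample_page)
print(alt.group(1) if alt else None)
```

The search() variant is only an aside; it avoids the greedy '.*' scanning the whole page, which may matter on large inputs.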


- Author -

Arckal
