Python Crawler Example: Scraping Web Page Content [URL and Regex Modules]

Posted in Python on May 18, 2017

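This article walks through an example of scraping web page content with a Python crawler. It is shared here for reference; the details follow.

To fetch and parse web pages we need the following modules. (The code in this article is Python 2; in Python 3 the functionality of urllib2 moved into urllib.request and urllib.error.)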

import urllib
import urllib2
import re

As a first experiment, we can read a website line by line with the readline method, for example Baidu:

def test():
    f = urllib.urlopen('http://www.baidu.com')
    while True:
        line = f.readline()
        # readline() returns an empty string at end of file; stop there
        if not line:
            break
        print line
    f.close()
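
For readers on Python 3, the same read-until-empty loop might look like the sketch below. It is not the article's code: `read_lines` is a hypothetical helper, and an in-memory `io.BytesIO` object stands in for the network response so the example runs offline.

```python
import io


def read_lines(stream):
    """Yield decoded lines from a binary file-like object until end of file."""
    while True:
        line = stream.readline()
        # readline() returns an empty bytes object once the stream is exhausted
        if not line:
            break
        yield line.decode('utf-8', errors='replace').rstrip('\r\n')


# Stand-in for a network response, so the sketch runs without a connection
fake_response = io.BytesIO(b'<html>\n<head></head>\n</html>\n')
for text in read_lines(fake_response):
    print(text)
```

In Python 3 the object returned by `urllib.request.urlopen(url)` also provides `readline()`, so it could presumably be passed to the same helper.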

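Next, let's look at how to scrape content from a page, using Baidu Tieba as the example.

There are roughly a few things to do:

First, fetch the page and its HTML source. We want to handle multiple pages, which means the URL changes from page to page, so we pass in a page number: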

def getPage(self, pageNum):
    try:
        url = self.baseURL + self.seeLZ + '&pn=' + str(pageNum)
        # Create the request object
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        # print 'URL:' + url
        return response.read()
    except Exception as e:
        # Report the error and return an empty page instead of None
        print e
        return ''
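Next we need the content of the novel, which we split into the title and the body text. The title appears on every page, so we only need to fetch it once.

Open the site in a browser and press F12 to see how the title tag is structured. On Baidu Tieba, for example, it is <title>…………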

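So we compile the pattern reg=re.compile(r'<title>(.*?)。') to capture it: the group matches everything after <title> up to the first Chinese full stop (。).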
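
To see what this pattern captures, here is a minimal Python 3 sketch run against a made-up HTML string. The sample markup and the `extract_title` helper are assumptions for illustration; only the regular expression comes from the article.

```python
import re

# Same pattern as in the article: capture everything between <title> and the first "。"
TITLE_RE = re.compile(r'<title>(.*?)。')


def extract_title(html):
    """Return the captured title text, or an empty string if nothing matches."""
    match = TITLE_RE.search(html)
    return match.group(1) if match else ''


# Hypothetical page source, not real Tieba output
sample = '<html><head><title>Sample novel。More text。_Site name</title></head></html>'
print(extract_title(sample))
```

Because `.*?` is non-greedy, the match stops at the first 。 rather than the last one, so anything after the first sentence of the title tag is left out.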

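Once the title is captured, we move on to the body text. The body consists of many posts, so we loop over all of the matched items. One thing to note here:

The output file must be opened and closed outside the loop; only the write calls go inside it. We also strip hyperlinks, <br> tags and similar markup along the way.

Finally, we call everything from the main routine.

Complete code: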
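
The cleaning and writing steps described above can be sketched in Python 3 as follows. The two regular expressions are the ones used in the article's code; the helper names `extract_posts` and `save_posts`, the sample markup, and the UTF-8 output encoding are assumptions for illustration. Unlike the article's code, this sketch does not skip the first matched post.

```python
import re

# Patterns taken from the article's code
POST_RE = re.compile(r'"d_post_content j_d_post_content ">(.*?)</div>')
LINK_RE = re.compile(r'<a.*?>|</a>')


def extract_posts(html):
    """Return the text of each post with hyperlink tags and <br> removed."""
    cleaned = []
    for post in POST_RE.findall(html):
        post = LINK_RE.sub('', post)    # drop <a ...> and </a>, keep the link text
        post = post.replace('<br>', '')  # drop line-break tags
        cleaned.append(post)
    return cleaned


def save_posts(path, posts):
    """Append the posts to a file, opening it once outside the loop."""
    with open(path, 'a', encoding='utf-8') as f:
        for post in posts:
            f.write('\n\n' + post)


# Hypothetical page fragment, not real Tieba output
sample = ('<div class="d_post_content j_d_post_content ">Intro <a href="#">link</a></div>'
          '<div class="d_post_content j_d_post_content ">Line one<br>Line two</div>')
print(extract_posts(sample))
```

Calling `save_posts('output.txt', extract_posts(page_source))` once per page would append each page's posts to the same file, with the file opened a single time per call rather than once per post.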

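# -*- coding:utf-8 -*-
import sys
# Python 2 only: make UTF-8 the default codec so that calling
# .encode('gbk') on the UTF-8 byte strings below decodes them correctly
reload(sys)
sys.setdefaultencoding('utf8')
# Crawler example: scraping web page content
# Modules required: urllib, re, urllib2
import urllib
import urllib2
import re

# Test function: read a page line by line
# def test():
#     f = urllib.urlopen('http://www.baidu.com')
#     while True:
#         line = f.readline()
#         if not line:
#             break
#         print line


# Fetch the original poster's novel text from the first ten pages of a Baidu Tieba thread
class BDTB:
    def __init__(self, baseUrl, seeLZ):
        # Member variables
        self.baseURL = baseUrl
        self.seeLZ = '?see_lz=' + str(seeLZ)

    # Fetch the HTML source of one page of the thread
    def getPage(self, pageNum):
        try:
            url = self.baseURL + self.seeLZ + '&pn=' + str(pageNum)
            # Create the request object
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            # print 'URL:' + url
            return response.read()
        except Exception as e:
            # Report the error and return an empty page instead of None
            print e
            return ''

    # Match the title
    def Title(self):
        html = self.getPage(1)
        # Precompile the pattern for more efficient matching
        reg = re.compile(r'<title>(.*?)。')
        # findall returns a list
        items = re.findall(reg, html)
        f = open('output.txt', 'w+')
        item = ''.join(items)
        f.write('\t\t\t\t\t' + item.encode('gbk'))
        f.close()

    # Match the body text
    def Text(self, pageNum):
        html = self.getPage(pageNum)
        # Precompile the patterns for more efficient matching
        reg = re.compile(r'"d_post_content j_d_post_content ">(.*?)</div>')
        # Pattern for stripping hyperlink tags
        removeAddr = re.compile('<a.*?>|</a>')
        # findall returns a list
        items = re.findall(reg, html)
        # Open the file once, outside the loop
        f = open('output.txt', 'a+')
        # Slice with [1:] to drop the first element, which is not needed
        for i in items[1:]:
            # Replace hyperlink tags with ""
            i = re.sub(removeAddr, "", i)
            # Remove <br> tags
            i = i.replace('<br>', '')
            f.write('\n\n' + i.encode('gbk'))
        f.close()


# Entry point
baseURL = 'http://tieba.baidu.com/p/4638659116'
bdtb = BDTB(baseURL, 1)
print 'Starting the crawler....'
# Title first, then pages 1 to 10
bdtb.Title()
print 'Finished scraping the title!'
for i in range(1, 11):
    print 'Scraping page %02d' % i
    bdtb.Text(i)
print 'Finished scraping the body text!'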

- Author -

九日王朝
