Python Web Scraping with BeautifulSoup: A Worked Example (Part 3)


Posted in Python on June 17, 2018

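This article builds a scraper step by step that collects jokes from Qiushibaike (糗事百科).

We start without the BeautifulSoup package and parse the page by hand.

Step 1: request the URL and fetch the page source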

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 20:17:13

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URL and fetch the page source
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  try:
    request = urllib2.Request(url = url, headers = headers)
    response = urllib2.urlopen(request)
    content = response.read()
  except urllib2.HTTPError as e:
    print e
    exit()
  except urllib2.URLError as e:
    print e
    exit()
  print content.decode('utf-8')
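
The listing above is Python 2 (`urllib2`, the `print` statement). For readers on Python 3, a minimal sketch of the same request might look like the following. It assumes `urllib.request` as the replacement for `urllib2`; the helper names `build_request` and `fetch` are made up for illustration and are not part of the original script.

```python
import urllib.error
import urllib.request

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36')


def build_request(url):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url=url, headers={'User-Agent': USER_AGENT})


def fetch(url, timeout=10):
    """Return the decoded page source, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as e:  # HTTPError is a subclass of URLError
        print(e)
        return None
```

Calling `fetch('http://www.qiushibaike.com/textnew/page/1/?s=4941357')` would then play the role of the `try` block above, returning `None` instead of exiting when the request fails.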

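Step 2: extract the information with a regular expression

First look at the page source to find where the content you need sits and which markup identifies it.
Then write a regular expression that matches that markup and captures the content.
Note that by default . in a regular expression does not match \n, so the pattern has to be compiled with a flag (re.S here) that allows it.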
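
The effect of the flag can be seen on a small sample. The sketch below is Python 3 and applies the article's pattern to a hand-written HTML fragment (an assumption about the page's shape, not a copy of the real page), once without and once with `re.S`.

```python
import re

# Hypothetical fragment shaped like the markup the pattern targets.
sample = '''<div class="content">
<span>
first line<br/>second line
</span>
</div>'''

pattern = '<div class="content">.*?<span>(.*?)</span>.*?</div>'

# Without re.S, . stops at every newline, so nothing matches.
without_flag = re.findall(pattern, sample)

# With re.S (also spelled re.DOTALL), . matches newlines too.
with_flag = re.findall(pattern, sample, re.S)

print(without_flag)
print(with_flag)
```

Because the pattern has exactly one capture group, `re.findall` returns only the text between `<span>` and `</span>` for each match.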

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 20:17:13

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URL and fetch the page source
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  try:
    request = urllib2.Request(url = url, headers = headers)
    response = urllib2.urlopen(request)
    content = response.read()
  except urllib2.HTTPError as e:
    print e
    exit()
  except urllib2.URLError as e:
    print e
    exit()

  regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
  items = re.findall(regex, content)

  # Print the extracted items
  # (re.S in the pattern above is what lets . match newlines)
  for item in items:
    print item

Step 3: clean the data and save it to files

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 21:41:32

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URL and fetch the page source
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  try:
    request = urllib2.Request(url = url, headers = headers)
    response = urllib2.urlopen(request)
    content = response.read()
  except urllib2.HTTPError as e:
    print e
    exit()
  except urllib2.URLError as e:
    print e
    exit()

  regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
  items = re.findall(regex, content)

  # Save the extracted items, one file per item
  # (re.S in the pattern above is what lets . match newlines)
  path = './qiubai'
  if not os.path.exists(path):
    os.makedirs(path)
  count = 1
  for item in items:
    # Clean the data: drop \n, then replace <br/> with \n
    item = item.replace('\n', '').replace('<br/>', '\n')
    filepath = path + '/' + str(count) + '.txt'
    with open(filepath, 'w') as f:
      f.write(item)
    count += 1
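
The cleaning and saving logic can also be tried offline. The following Python 3 sketch separates it into two helpers, `clean_item` and `save_items`; both names are hypothetical, and the demo writes into a temporary directory rather than `./qiubai`.

```python
import os
import tempfile


def clean_item(item):
    """Drop raw newlines, then turn <br/> tags into real line breaks."""
    return item.replace('\n', '').replace('<br/>', '\n')


def save_items(items, path):
    """Write each cleaned item to path/1.txt, path/2.txt, ... and return the paths."""
    os.makedirs(path, exist_ok=True)
    written = []
    for count, item in enumerate(items, start=1):
        filepath = os.path.join(path, str(count) + '.txt')
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(clean_item(item))
        written.append(filepath)
    return written


# Demo with made-up items, written under a temporary directory.
demo_dir = os.path.join(tempfile.mkdtemp(), 'qiubai')
demo_files = save_items(['\nfirst line<br/>second line\n', '\nanother joke\n'], demo_dir)
```

The order of the two `replace` calls matters: the raw newlines from the HTML source are removed first, so the only line breaks left in the file are the ones the author marked with `<br/>`.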

Step 4: scrape the content of multiple pages

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 20:17:13

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Prepare the output directory, request headers, and pattern
  path = './qiubai'
  if not os.path.exists(path):
    os.makedirs(path)
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
  count = 1
  for cnt in range(1, 35):
    print 'Page ' + str(cnt)
    url = 'http://www.qiushibaike.com/textnew/page/' + str(cnt) + '/?s=4941357'
    try:
      request = urllib2.Request(url = url, headers = headers)
      response = urllib2.urlopen(request)
      content = response.read()
    except urllib2.HTTPError as e:
      print e
      exit()
    except urllib2.URLError as e:
      print e
      exit()
    # print content

    # Extract the data
    # (the pattern was compiled with re.S so that . matches newlines)
    items = re.findall(regex, content)

    # Save the items
    for item in items:
      #  print item
      # Clean the data: drop \n, then replace <br/> with \n
      item = item.replace('\n', '').replace('<br/>', '\n')
      filepath = path + '/' + str(count) + '.txt'
      with open(filepath, 'w') as f:
        f.write(item)
      count += 1

  print 'Done'
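
The paging loop can be sketched without touching the network by passing the download step in as a function. This Python 3 version is only an illustration: `page_url`, `scrape_pages`, and `fake_fetch` are hypothetical names, and unlike the script above it skips a failed page instead of exiting.

```python
import re

REGEX = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)


def page_url(page):
    """Build the URL for one listing page, following the article's URL scheme."""
    return 'http://www.qiushibaike.com/textnew/page/' + str(page) + '/?s=4941357'


def scrape_pages(fetch, first, last):
    """Collect items from pages first..last (inclusive) using the given fetch function."""
    items = []
    for page in range(first, last + 1):
        content = fetch(page_url(page))
        if content is None:  # skip a failed page rather than abort the whole run
            continue
        items.extend(REGEX.findall(content))
    return items


def fake_fetch(url):
    """Stand-in for a real HTTP fetch; returns canned HTML so the sketch runs offline."""
    if '/page/2/' in url:
        return None  # simulate a page that failed to download
    return '<div class="content">\n<span>joke from ' + url + '</span>\n</div>'


jokes = scrape_pages(fake_fetch, 1, 3)
print(len(jokes))
```

In a real run, `fake_fetch` would be replaced by a function that downloads the page and returns its source as a string, or `None` on error.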

Parsing the source with BeautifulSoup

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 21:34:02

import urllib
import urllib2
import re
import os
from bs4 import BeautifulSoup

if __name__ == '__main__':
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  request = urllib2.Request(url = url, headers = headers)
  response = urllib2.urlopen(request)
  # print response.read()
  soup_packetpage = BeautifulSoup(response, 'lxml')
  items = soup_packetpage.find_all("div", class_="content")

  for item in items:
    try:
      content = item.span.string
    except AttributeError as e:
      print e
      exit()

    if content:
      print content + "\n"
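
One detail of the script above is worth seeing in isolation: `.string` returns `None` when a tag has more than one child, which is why the `if content:` check is there and why jokes containing `<br/>` are silently skipped. The Python 3 sketch below shows this on a hand-written fragment (an assumed page shape). It uses the built-in `'html.parser'` instead of `'lxml'` only so that it runs without the extra dependency, and `get_text` is offered as one possible alternative, not as what the original script does.

```python
from bs4 import BeautifulSoup

html = '''
<div class="content"><span>single line joke</span></div>
<div class="content"><span>first line<br/>second line</span></div>
'''

soup = BeautifulSoup(html, 'html.parser')
spans = [div.span for div in soup.find_all('div', class_='content')]

# .string only works when the tag has exactly one child.
strings = [span.string for span in spans]

# get_text joins every text node under the tag, here with a newline separator.
texts = [span.get_text('\n') for span in spans]

print(strings)
print(texts)
```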

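The code below uses BeautifulSoup to scrape book titles and their prices.
Comparing it with the script above shows how bs4 reads tags and the content inside them.
(I have not studied this part properly myself yet, so for now I am only writing it by imitation.)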

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 20:37:38
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 21:27:30
import urllib2
import urllib
import re 

from bs4 import BeautifulSoup 


url = "https://www.packtpub.com/all"
try:
  html = urllib2.urlopen(url) 
except urllib2.HTTPError as e:
  print e
  exit()

soup_packtpage = BeautifulSoup(html, 'lxml') 
all_book_title = soup_packtpage.find_all("div", class_="book-block-title") 

price_regexp = re.compile(u"\s+\$\s\d+\.\d+") 

for book_title in all_book_title: 
  try:
    print "Book's name is " + book_title.string.strip()
  except AttributeError as e:
    print e
    exit()
  book_price = book_title.find_next(text=price_regexp) 
  try:
    print "Book's price is "+ book_price.strip()
  except AttributeError as e:
    print e
    exit()
  print ""
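
To see how `find_next` pairs each title with its price, here is a small offline Python 3 sketch. The HTML fragment and the class name `book-block-price` are invented for the demo, and `string=` is the newer spelling of the `text=` argument used above (assuming a reasonably recent bs4).

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment; the real page layout may differ.
html = '''
<div class="book-block-title"> Learning Python </div>
<div class="book-block-price">
  $ 39.99
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
price_regexp = re.compile(r'\s+\$\s\d+\.\d+')

book_title = soup.find('div', class_='book-block-title')

# find_next walks forward in document order and returns the first
# text node that the regular expression matches.
book_price = book_title.find_next(string=price_regexp)

name = book_title.string.strip()
price = book_price.strip()
print(name, price)
```

Because the search starts from the title element and moves forward through the document, it finds the price that follows that particular title rather than the first price on the page.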

That is all for this article. I hope it helps with your studies, and I hope you will keep supporting 三水点靠木.
