Python Crawler Package BeautifulSoup Examples (3)


Posted in Python on June 17, 2018

Build a crawler instance step by step that scrapes jokes from Qiushibaike (糗事百科).

First, let's do the parsing without the beautifulsoup package.

Step 1: request the URL and fetch the page source

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 20:17:13

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URL and fetch the page source
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  try:
    request = urllib2.Request(url = url, headers = headers)
    response = urllib2.urlopen(request)
    content = response.read()
  except urllib2.HTTPError as e:
    print e
    exit()
  except urllib2.URLError as e:
    print e
    exit()
  print content.decode('utf-8')
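
Note that urllib2 only exists in Python 2. If you are working in Python 3, a rough equivalent sketch (assuming only the standard-library urllib.request and urllib.error modules) would be:

# Python 3 sketch: urllib2 was split into urllib.request and urllib.error
import urllib.request
import urllib.error

url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
headers = {'User-Agent': user_agent}
try:
  request = urllib.request.Request(url=url, headers=headers)
  response = urllib.request.urlopen(request)
  content = response.read()
except (urllib.error.HTTPError, urllib.error.URLError) as e:
  print(e)
  raise SystemExit
print(content.decode('utf-8'))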

Step 2: extract the information with regular expressions

First, study the page source to see where the content you need is located and how to identify it.
Then use a regular expression to match and extract it.
Note that . in a regular expression does not match \n by default, so you need to set the matching mode (re.S) accordingly.
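
As a quick standalone illustration of that matching mode (a minimal sketch, separate from the full script below):

import re

html = '<span>line one\nline two</span>'
# Without re.S, '.' stops at the newline, so nothing is matched
print re.findall('<span>(.*?)</span>', html)        # []
# With re.S (re.DOTALL), '.' also matches '\n'
print re.findall('<span>(.*?)</span>', html, re.S)  # ['line one\nline two']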

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 20:17:13

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URL and fetch the page source
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  try:
    request = urllib2.Request(url = url, headers = headers)
    response = urllib2.urlopen(request)
    content = response.read()
  except urllib2.HTTPError as e:
    print e
    exit()
  except urllib2.URLError as e:
    print e
    exit()

  regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
  items = re.findall(regex, content)

  # Extract the data
  # Note the newlines: re.S lets . match \n as well
  for item in items:
    print item

Step 3: clean up the data and save it to files

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 21:41:32

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URL and fetch the page source
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  try:
    request = urllib2.Request(url = url, headers = headers)
    response = urllib2.urlopen(request)
    content = response.read()
  except urllib2.HTTPError as e:
    print e
    exit()
  except urllib2.URLError as e:
    print e
    exit()

  regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
  items = re.findall(regex, content)

  # Extract the data
  # Note the newlines: re.S lets . match \n as well
  path = './qiubai'
  if not os.path.exists(path):
    os.makedirs(path)
  count = 1
  for item in items:
    # Clean up the data: drop '\n' and turn <br/> into real newlines
    item = item.replace('\n', '').replace('<br/>', '\n')
    filepath = path + '/' + str(count) + '.txt'
    f = open(filepath, 'w')
    f.write(item)
    f.close()
    count += 1
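
To make the cleanup step explicit, here is a minimal sketch run on a hypothetical captured string (it assumes the ./qiubai directory already exists, which the script above guarantees); a with block is used so the file is closed even if the write fails:

# A hypothetical raw capture, roughly what the regex would return
item = 'first line\n<br/>second line\n<br/>third line'
# Drop the stray '\n' characters, then turn each <br/> into a real newline
item = item.replace('\n', '').replace('<br/>', '\n')
# 'with' closes the file automatically, even if write() raises
with open('./qiubai/demo.txt', 'w') as f:
  f.write(item)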

Step 4: scrape the content from multiple pages

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 20:17:13

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URLs and fetch the page source
  path = './qiubai'
  if not os.path.exists(path):
    os.makedirs(path)
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
  count = 1
  for cnt in range(1, 35):
    print 'Round ' + str(cnt)
    url = 'http://www.qiushibaike.com/textnew/page/' + str(cnt) + '/?s=4941357'
    try:
      request = urllib2.Request(url = url, headers = headers)
      response = urllib2.urlopen(request)
      content = response.read()
    except urllib2.HTTPError as e:
      print e
      exit()
    except urllib2.URLError as e:
      print e
      exit()
    # print content

    # Extract the data
    # Note the newlines: re.S lets . match \n as well
    items = re.findall(regex, content)

    # Save the results
    for item in items:
      #  print item
      # Clean up the data: drop '\n' and turn <br/> into real newlines
      item = item.replace('\n', '').replace('<br/>', '\n')
      filepath = path + '/' + str(count) + '.txt'
      f = open(filepath, 'w')
      f.write(item)
      f.close()
      count += 1

  print 'Done'

Parsing the page source with BeautifulSoup

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 21:34:02

import urllib
import urllib2
import re
import os
from bs4 import BeautifulSoup

if __name__ == '__main__':
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent':user_agent}
  request = urllib2.Request(url = url, headers = headers)
  response = urllib2.urlopen(request)
  # print response.read()
  soup_packetpage = BeautifulSoup(response, 'lxml')
  items = soup_packetpage.find_all("div", class_="content")

  for item in items:
    try:
      content = item.span.string
    except AttributeError as e:
      print e
      exit()

    if content:
      print content + "\n"
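
One caveat about .string: it returns None whenever the tag holds more than one child node (for example, when the <span> contains <br/> tags), which is why the loop above has to check if content. A possible alternative sketch for the same loop uses get_text(), a standard BeautifulSoup method that joins all the text inside a tag:

  for item in items:
    span = item.find('span')
    if span is None:
      # some content blocks may not contain a <span>
      continue
    # get_text() returns all text inside the tag, even when it holds nested
    # tags such as <br/>, whereas .string would be None in that case
    print span.get_text().strip() + "\n"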

Below is code that uses BeautifulSoup to scrape book titles and their prices.
Comparing it with the example above shows how bs4 reads tags and how it reads the contents of a tag.
(I have not studied this part properly yet, so for now I am just following the pattern.)

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 20:37:38
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 21:27:30
import urllib2
import urllib
import re 

from bs4 import BeautifulSoup 


url = "https://www.packtpub.com/all"
try:
  html = urllib2.urlopen(url) 
except urllib2.HTTPError as e:
  print e
  exit()

soup_packtpage = BeautifulSoup(html, 'lxml') 
all_book_title = soup_packtpage.find_all("div", class_="book-block-title") 

price_regexp = re.compile(u"\s+\$\s\d+\.\d+") 

for book_title in all_book_title: 
  try:
    print "Book's name is " + book_title.string.strip()
  except AttributeError as e:
    print e
    exit()
  book_price = book_title.find_next(text=price_regexp) 
  try:
    print "Book's price is "+ book_price.strip()
  except AttributeError as e:
    print e
    exit()
  print ""

That is all for this article. I hope it helps with your studies, and I hope you will continue to support 三水点靠木.
