A Brief Introduction to talonspider, a Python Crawler Framework

Posted in Python on June 09, 2017

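1. Why write this?

Some simple pages do not need a heavyweight framework to crawl, yet writing everything by hand is tedious.

talonspider was written to meet that need. It provides:

•1. Item extraction for a single page
•2. A spider module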

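2. Introduction and usage

2.1. item

This module can be used on its own. For sites whose requests are simple (for example, only GET requests are needed), this module alone is enough to quickly write the crawler you want. The examples below use Python 3; for Python 2, see the examples directory.

2.1.1. Single page, single target

For example, to get the book details, cover, and other information from http://book.qidian.com/info/1004608738, you can write: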

from pprint import pprint

from talonspider import Item, TextField, AttrField

class TestSpider(Item):
  title = TextField(css_select='.book-info>h1>em')
  author = TextField(css_select='a.writer')
  cover = AttrField(css_select='a#bookImg>img', attr='src')

  def tal_title(self, title):
    return title

  def tal_cover(self, cover):
    # The extracted src lacks a scheme, so prepend one
    return 'http:' + cover

if __name__ == '__main__':
  item_data = TestSpider.get_item(url='http://book.qidian.com/info/1004608738')
  pprint(item_data)

See qidian_details_by_item.py for the full example.
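The tal_cover hook above prepends 'http:' to the extracted src, which suggests the page serves a protocol-relative URL (one starting with //). As a hedged alternative that is not part of talonspider, the standard library's urllib.parse.urljoin can resolve such a value against the page URL and also copes with absolute and root-relative paths. The helper name absolute_url and the sample image paths are made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = 'http://book.qidian.com/info/1004608738'


def absolute_url(page_url, src):
    """Resolve a relative or protocol-relative src against page_url."""
    return urljoin(page_url, src.strip())


if __name__ == '__main__':
    # Hypothetical src value; the real attribute on the page will differ.
    print(absolute_url(PAGE_URL, '//example.com/cover/180.jpg'))
```

A function like this could be called from tal_cover in place of the string concatenation, so the hook keeps working if the site switches to absolute URLs.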

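2.1.2. Single page, multiple targets

For example, to get the 25 movies shown on the first page of the Douban Top 250, where a single page contains 25 targets, you can write: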

from talonspider import Item, TextField, AttrField
from pprint import pprint

# Define a crawler class that inherits from Item
class DoubanSpider(Item):
  target_item = TextField(css_select='div.item')
  title = TextField(css_select='span.title')
  cover = AttrField(css_select='div.pic>a>img', attr='src')
  abstract = TextField(css_select='span.inq')

  def tal_title(self, title):
    if isinstance(title, str):
      return title
    else:
      return ''.join([i.text.strip().replace('\xa0', '') for i in title])

if __name__ == '__main__':
  items_data = DoubanSpider.get_items(url='https://movie.douban.com/top250')
  result = []
  for item in items_data:
    result.append({
      'title': item.title,
      'cover': item.cover,
      'abstract': item.abstract,
    })
  pprint(result)

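See douban_page_by_item.py for the full example.

2.2. spider

When you need to crawl pages with a hierarchy, such as all the movies in the Douban Top 250, the spider part comes into play: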
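The tal_title hook in this example handles two cases: the field may already be a string, or it may be a list of matched nodes whose text has to be cleaned and joined. The sketch below isolates that logic so it can be tried without any network access. It uses SimpleNamespace objects as stand-ins for the matched elements (only the .text attribute is read), and the sample strings are illustrative rather than captured from the live page.

```python
from types import SimpleNamespace


def clean_title(title):
    """Mirror the tal_title logic: pass strings through, join node texts otherwise."""
    if isinstance(title, str):
        return title
    # Strip surrounding whitespace and drop non-breaking spaces in each part.
    return ''.join(node.text.strip().replace('\xa0', '') for node in title)


if __name__ == '__main__':
    # Stand-ins for matched elements; only the .text attribute is used.
    nodes = [
        SimpleNamespace(text='肖申克的救赎'),
        SimpleNamespace(text='\xa0/\xa0The Shawshank Redemption'),
    ]
    print(clean_title(nodes))
```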

#!/usr/bin/env python
from talonspider import Spider, Item, TextField, AttrField, Request
from talonspider.utils import get_random_user_agent


# Define a crawler class that inherits from Item
class DoubanItem(Item):
  target_item = TextField(css_select='div.item')
  title = TextField(css_select='span.title')
  cover = AttrField(css_select='div.pic>a>img', attr='src')
  abstract = TextField(css_select='span.inq')

  def tal_title(self, title):
    if isinstance(title, str):
      return title
    else:
      return ''.join([i.text.strip().replace('\xa0', '') for i in title])


class DoubanSpider(Spider):
  # Start URLs (required)
  start_urls = ['https://movie.douban.com/top250']
  # Request configuration
  request_config = {
    'RETRIES': 3,
    'DELAY': 0,
    'TIMEOUT': 20
  }
  # Parse function (required)
  def parse(self, html):
    # Convert the HTML into an etree
    etree = self.e_html(html)
    # Extract the target values to build new URLs
    pages = [i.get('href') for i in etree.cssselect('.paginator>a')]
    pages.insert(0, '?start=0&filter=')
    headers = {
      "User-Agent": get_random_user_agent()
    }
    for page in pages:
      url = self.start_urls[0] + page
      yield Request(url, request_config=self.request_config, headers=headers, callback=self.parse_item)

  def parse_item(self, html):
    items_data = DoubanItem.get_items(html=html)
    # result = []
    for item in items_data:
      # result.append({
      #   'title': item.title,
      #   'cover': item.cover,
      #   'abstract': item.abstract,
      # })
      # Save the title
      with open('douban250.txt', 'a+', encoding='utf-8') as f:
        f.write(item.title + '\n')


if __name__ == '__main__':
  DoubanSpider.start()

Console output:

/Users/howie/anaconda3/envs/work3/bin/python /Users/howie/Documents/programming/python/git/talonspider/examples/douban_page_by_spider.py
2017-06-07 23:17:30,346 - talonspider - INFO: talonspider started
2017-06-07 23:17:30,693 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250
2017-06-07 23:17:31,074 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=25&filter=
2017-06-07 23:17:31,416 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=50&filter=
2017-06-07 23:17:31,853 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=75&filter=
2017-06-07 23:17:32,523 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=100&filter=
2017-06-07 23:17:33,032 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=125&filter=
2017-06-07 23:17:33,537 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=150&filter=
2017-06-07 23:17:33,990 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=175&filter=
2017-06-07 23:17:34,406 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=200&filter=
2017-06-07 23:17:34,787 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=225&filter=
2017-06-07 23:17:34,809 - talonspider - INFO: Time usage：0:00:04.462108

Process finished with exit code 0

A douban250.txt file is now generated in the current directory. See douban_page_by_spider.py for the full example.
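The page URLs in the log follow a regular pattern: the start parameter grows by 25 per page. The spider itself reads these links from the paginator, but assuming the pattern in the log holds, the same list can be built directly, which is handy for checking what parse should yield. The function name page_urls and its total and page_size parameters are hypothetical and not part of talonspider.

```python
from urllib.parse import urljoin

START_URL = 'https://movie.douban.com/top250'


def page_urls(start_url, total=250, page_size=25):
    """Build one URL per result page using the ?start=N&filter= pattern."""
    return [
        urljoin(start_url, '?start={}&filter='.format(offset))
        for offset in range(0, total, page_size)
    ]


if __name__ == '__main__':
    for url in page_urls(START_URL):
        print(url)
```

Using urljoin rather than plain concatenation keeps the result correct even if the start URL already carries a query string.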

3. Notes

This is a learning project and much remains to be improved. Suggestions are welcome. Project page: talonspider.

- Author -

howie6879
