Scrapy Framework: Basic Commands and settings.py Configuration


Posted in Python on February 06, 2020

This article walks through the basic Scrapy commands and the settings in a project's settings.py, with examples for reference.

Basic Scrapy commands

1. Create a crawler project

scrapy startproject [project name]

2. Create a spider file

scrapy genspider [spider name] [domain]

3. Run a spider (crawl)

scrapy crawl [spider name]
# -o (output) writes the scraped items to a file
scrapy crawl [spider name] -o zufang.json
scrapy crawl [spider name] -o zufang.csv
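
Once a crawl has written its items with -o, the file can be read back with the standard library alone. The following is a minimal sketch, assuming the JSON export is a single JSON array of items and the CSV export starts with a header row of field names; the field names `title` and `price` and the sample rows are made up for illustration and do not come from the original spider.

```python
import csv
import io
import json


def parse_json_feed(text):
    """Parse the contents of a file written with `-o <name>.json`."""
    items = json.loads(text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return items


def parse_csv_feed(text):
    """Parse the contents of a file written with `-o <name>.csv` into dicts."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical sample data; real field names depend on the spider's items.
sample_json = '[{"title": "Room A", "price": "3000"}, {"title": "Room B", "price": "4500"}]'
sample_csv = "title,price\nRoom A,3000\nRoom B,4500\n"

print(parse_json_feed(sample_json)[0]["title"])  # Room A
print(parse_csv_feed(sample_csv)[1]["price"])    # 4500
```

For a real export, read the file first, for example `parse_json_feed(open("zufang.json", encoding="utf-8").read())`.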

4. check: check the project's spiders for errors (runs their contract checks)

scrapy check

5. list: list all spiders in the project

scrapy list

6. view: download a page and open it in the browser, as Scrapy sees it

scrapy view http://www.baidu.com

7. shell: open an interactive shell for the given URL

scrapy shell https://www.baidu.com

8. runspider: run a spider contained in a single Python file, without creating a project

scrapy runspider zufang_spider.py
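
If these commands need to be driven from a script rather than typed by hand, one option is to build the argument list in Python and hand it to `subprocess`. This is only a sketch: the helper names `scrapy_argv` and `run_scrapy` are invented here, the spider name `zufang` is assumed from the file names used above, and the `scrapy` executable is assumed to be on the PATH.

```python
import subprocess


def scrapy_argv(subcommand, *args, output=None):
    """Build the argument list for a scrapy command, e.g. crawl or list."""
    argv = ["scrapy", subcommand, *args]
    if output is not None:
        # -o writes the scraped items to the given file
        argv += ["-o", output]
    return argv


def run_scrapy(subcommand, *args, output=None, cwd=None):
    """Run the command in cwd (the project directory) and raise if it fails."""
    argv = scrapy_argv(subcommand, *args, output=output)
    return subprocess.run(argv, cwd=cwd, check=True)


print(scrapy_argv("crawl", "zufang", output="zufang.json"))
```

Calling `run_scrapy("crawl", "zufang", output="zufang.json", cwd="path/to/project")` would then start the crawl; the path is a placeholder.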

Scrapy framework: settings.py configuration

# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
# USER_AGENT takes a single string; a batch of user agents cannot be set here
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'
# A newly generated project obeys robots.txt by default; it is turned off here
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Write the log to a file
LOG_FILE="maitian.log"
# Five log levels, from least to most severe: DEBUG, INFO, WARNING, ERROR, CRITICAL
# The higher the level, the less log output is produced
# LOG_LEVEL="INFO"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# With DOWNLOAD_DELAY = 3, Scrapy waits about 3 seconds between consecutive requests to the same website
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default, i.e. COOKIES_ENABLED defaults to True)
#COOKIES_ENABLED = False
# Remote access through the Telnet console
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Spider middlewares
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}
# Downloader middlewares
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
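# Item pipelines are enabled here, in the settings file
# Priorities conventionally range from 0 to 1000; the lower the value, the earlier the pipeline runs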
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'maitian.pipelines.MaitianPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
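
Two of the annotated settings above are easier to see in code: how a LOG_LEVEL threshold cuts down the log output, and in what order the entries of ITEM_PIPELINES run. The sketch below uses only the standard library and does not call Scrapy itself; it relies on the level names matching those of Python's `logging` module. The helper names and the two extra pipeline paths (`CleanPipeline`, `JsonWriterPipeline`) are hypothetical; only `maitian.pipelines.MaitianPipeline` appears in the settings file.

```python
import logging

# The five level names, from least to most severe.
LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def levels_logged(threshold):
    """Return the level names that still get logged at the given LOG_LEVEL."""
    cutoff = getattr(logging, threshold)
    return [name for name in LEVELS if getattr(logging, name) >= cutoff]


def pipeline_order(item_pipelines):
    """Return pipeline paths sorted by priority value, lowest (earliest) first."""
    return [path for path, _ in sorted(item_pipelines.items(), key=lambda kv: kv[1])]


# Hypothetical ITEM_PIPELINES value; only MaitianPipeline is from the article.
example_pipelines = {
    "maitian.pipelines.JsonWriterPipeline": 800,
    "maitian.pipelines.MaitianPipeline": 300,
    "maitian.pipelines.CleanPipeline": 100,
}

print(levels_logged("INFO"))
print(pipeline_order(example_pipelines))
```

With LOG_LEVEL set to "INFO", DEBUG messages are dropped and the other four levels remain; with the example dictionary, the pipeline valued 100 would see each item first and the one valued 800 last.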

I hope this article is helpful for Python programming based on the Scrapy framework.
