Scrapy Basic Commands and settings.py Configuration


Posted in Python on February 06, 2020

This article walks through the basic Scrapy commands and the options in settings.py, with examples, for your reference:

Basic Scrapy commands

1. Create a crawler project

scrapy startproject [project name]

2. Create a spider file

scrapy genspider [spider name] [domain]

3. Run a spider (crawl)

scrapy crawl [spider name]
# -o (output): write the scraped items to a file
scrapy crawl [spider name] -o zufang.json
scrapy crawl [spider name] -o zufang.csv
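
Once a crawl has written zufang.json or zufang.csv, the files can be read back with the standard library. The following is a minimal sketch, not part of the original project: it assumes the JSON export is a single array of item objects and that the CSV export starts with a header row, and the `title` and `price` fields are hypothetical. It writes small stand-in files so it runs without a crawl.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_json_items(path):
    """Load items from a JSON export (a single array of objects)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_csv_items(path):
    """Load items from a CSV export; the first row holds the field names."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Stand-in files so the sketch runs without a crawl.
# "title" and "price" are hypothetical item fields.
workdir = Path(tempfile.mkdtemp())
(workdir / "zufang.json").write_text(
    '[{"title": "Room A", "price": "3000"}, {"title": "Room B", "price": "4500"}]',
    encoding="utf-8",
)
(workdir / "zufang.csv").write_text(
    "title,price\nRoom A,3000\nRoom B,4500\n", encoding="utf-8"
)

json_items = load_json_items(workdir / "zufang.json")
csv_items = load_csv_items(workdir / "zufang.csv")
print(len(json_items), json_items[0]["title"])  # 2 Room A
print(len(csv_items), csv_items[1]["price"])    # 2 4500
```

Note that values read from CSV are always strings, so numeric fields need an explicit conversion before use.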

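4. check: check the spiders for errors (runs their contract checks)

scrapy check

5. list: list every spider in the project

scrapy list

6. view: download a page and open the saved copy in the browser

scrapy view http://www.baidu.com

7. shell: open an interactive console for a URL

scrapy shell https://www.baidu.com

8. runspider: run a spider from a single Python file, without a project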

scrapy runspider zufang_spider.py
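
The commands above are usually typed in a terminal, but they can also be launched from a Python script, for example from a scheduled job. The sketch below is one possible wrapper around `scrapy crawl`, using only the `-o` option shown earlier. The helper names and the spider name `zufang` are assumptions made for illustration.

```python
import shlex
import subprocess


def build_crawl_argv(spider_name, output_file=None):
    """Build the argument list for `scrapy crawl`, optionally with `-o`."""
    argv = ["scrapy", "crawl", spider_name]
    if output_file is not None:
        argv += ["-o", output_file]
    return argv


def run_crawl(spider_name, output_file=None, project_dir="."):
    """Run the crawl from the project directory (the one holding scrapy.cfg).

    Requires Scrapy to be installed; raises CalledProcessError on failure.
    """
    argv = build_crawl_argv(spider_name, output_file)
    return subprocess.run(argv, cwd=project_dir, check=True)


# Show the command that would be executed, without running it.
# "zufang" is an assumed spider name.
print(shlex.join(build_crawl_argv("zufang", "zufang.json")))
# scrapy crawl zufang -o zufang.json
```

Calling `run_crawl("zufang", "zufang.json", project_dir="path/to/project")` would start the actual crawl. Passing the arguments as a list rather than a single string avoids shell quoting problems with file names.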

Scrapy: settings.py configuration

# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
# Only a single User-Agent string can be set here, not a list
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'
# The generated project obeys robots.txt by default (True); it is turned off here
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Write the log to a file
LOG_FILE = "maitian.log"
# Five log levels: 1. DEBUG 2. INFO 3. WARNING 4. ERROR 5. CRITICAL
# The higher the level, the less log output
# LOG_LEVEL = "INFO"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
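# With DOWNLOAD_DELAY = 3, Scrapy waits about 3 seconds between consecutive
# requests to the same site (the delay applies per request, not per batch of 16)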
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (COOKIES_ENABLED defaults to True)
#COOKIES_ENABLED = False
# Telnet console: remote access to the running crawler
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Enable item pipelines in the settings file
# Priority values range from 0 to 1000; the smaller the value, the higher the priority
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'maitian.pipelines.MaitianPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
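
Scrapy's logging is built on Python's standard logging module, so the note next to LOG_LEVEL in the settings above (the higher the level, the less output) can be checked without Scrapy. The following is a small sketch of that behaviour; the helper function and the `demo.*` logger names are illustrative and not part of the maitian project.

```python
import io
import logging

LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def emitted_levels(threshold):
    """Log one message at every level and return the levels that got through."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{threshold}")
    logger.setLevel(getattr(logging, threshold))
    logger.propagate = False  # keep the demo output out of the root logger

    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger.addHandler(handler)
    try:
        for name in LEVELS:
            logger.log(getattr(logging, name), "message")
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().split()


for threshold in LEVELS:
    print(threshold, len(emitted_levels(threshold)))
# DEBUG 5
# INFO 4
# WARNING 3
# ERROR 2
# CRITICAL 1
```

With `LOG_LEVEL = "INFO"`, for instance, DEBUG messages are dropped and the other four levels are still written to the log.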

I hope this article helps with your Python programming based on the Scrapy framework.
