Scrapy Basic Commands and settings.py Configuration


Posted in Python on February 06, 2020

This article walks through the basic Scrapy commands and the settings.py configuration file, with examples for reference. Details follow:

Basic Scrapy commands

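1. Create a crawler project

scrapy startproject [project name]

2. Create a spider file

scrapy genspider [spider name] [domain]

3. Run a spider (crawl)

scrapy crawl [spider name]
# -o (output) writes the scraped items to a file
scrapy crawl [spider name] -o zufang.json
scrapy crawl [spider name] -o zufang.csv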
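
The files written by -o can be read back with the Python standard library. The following is a minimal sketch, not part of the original project: the field names "title" and "price" are hypothetical (the real fields depend on what the spider yields), and the two sample strings stand in for the contents of zufang.json (a JSON array of item objects) and zufang.csv (a header row followed by one row per item).

```python
import csv
import io
import json

# Stand-ins for the exported files. The field names "title" and "price"
# are hypothetical; the real ones depend on the items the spider yields.
SAMPLE_JSON = '[{"title": "Flat A", "price": "3500"}, {"title": "Flat B", "price": "4200"}]'
SAMPLE_CSV = "title,price\nFlat A,3500\nFlat B,4200\n"


def load_json_items(text):
    """Parse a .json export: one JSON array containing one object per item."""
    return json.loads(text)


def load_csv_items(text):
    """Parse a .csv export: the header row names the fields."""
    return list(csv.DictReader(io.StringIO(text)))


json_items = load_json_items(SAMPLE_JSON)
csv_items = load_csv_items(SAMPLE_CSV)
print(len(json_items), json_items[0]["title"])
print(len(csv_items), csv_items[1]["price"])
```

For a real run, the sample strings would be replaced by the text read from the exported files.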

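4. check: check the spiders for errors (runs their contract checks)

scrapy check

5. list: list all spiders in the project

scrapy list

6. view: download a page and open it in a browser, as Scrapy sees it

scrapy view http://www.baidu.com

7. shell: open the interactive Scrapy shell for a URL

scrapy shell https://www.baidu.com

8. runspider: run a spider from a single Python file, without a project

scrapy runspider zufang_spider.py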
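
The commands above can also be launched from a Python script, for example in a scheduled job. The sketch below is one possible way to do it and is not from the original article: the helper names crawl_command and run_if_available are made up for illustration, and the spider name "zufang" is an assumption based on the file names used above. It only builds the argument list; the command runs when run_if_available is called and the scrapy executable is found on PATH.

```python
import shutil
import subprocess


def crawl_command(spider_name, output_file=None):
    """Build the argument list for `scrapy crawl`, optionally with -o."""
    argv = ["scrapy", "crawl", spider_name]
    if output_file is not None:
        argv += ["-o", output_file]
    return argv


def run_if_available(argv):
    """Run the command and return its exit code, or None if scrapy is not on PATH."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=False).returncode


# "zufang" is an assumed spider name; use the name printed by `scrapy list`.
print(crawl_command("zufang", "zufang.json"))
```

Note that scrapy crawl has to be started from inside the project directory, so a real script would also pass a suitable cwd to subprocess.run.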

Scrapy: settings.py configuration

# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
# USER_AGENT takes a single string; several values cannot be set here at once
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'
# Projects created by startproject obey robots.txt (True); it is turned off here
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Write the log to a file
LOG_FILE="maitian.log"
# There are five log levels: 1. DEBUG 2. INFO 3. WARNING 4. ERROR 5. CRITICAL
# The higher the level, the less log output
# LOG_LEVEL="INFO"
# Maximum number of concurrent requests in Scrapy; the default is 16
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Delay in seconds between consecutive requests to the same website (not a pause between batches of 16 requests)
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# COOKIES_ENABLED defaults to True; setting it to False turns cookies off
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Telnet console: remote access to the running crawler process
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Default headers sent with every request
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Spider middlewares
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}
# Downloader middlewares
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Enable item pipelines here in the settings file
# Order values conventionally range from 0 to 1000; a lower value runs earlier
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'maitian.pipelines.MaitianPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
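
Two of the comments in the settings file describe ordering rules that are easy to check in plain Python: items pass through the pipelines in ITEM_PIPELINES from the lowest number to the highest, and LOG_LEVEL hides every message below the chosen level. The sketch below only illustrates those two rules and does not use Scrapy itself. SavePipeline and DedupPipeline are hypothetical class names added for the example, and the helper functions are made up for illustration.

```python
import logging

# MaitianPipeline comes from the settings file above; the other two
# entries are hypothetical and only there to show the ordering.
ITEM_PIPELINES = {
    "maitian.pipelines.SavePipeline": 800,
    "maitian.pipelines.MaitianPipeline": 300,
    "maitian.pipelines.DedupPipeline": 500,
}

LOG_LEVEL = "INFO"


def pipeline_run_order(pipelines):
    """Return the pipeline paths in the order items pass through them (lowest number first)."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


def emitted_levels(threshold):
    """Return the standard level names that are still logged at the given threshold."""
    names = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
    minimum = getattr(logging, threshold)
    return [name for name in names if getattr(logging, name) >= minimum]


print(pipeline_run_order(ITEM_PIPELINES))
print(emitted_levels(LOG_LEVEL))
```

With LOG_LEVEL set to "INFO", only DEBUG messages are dropped, which is usually enough to keep maitian.log readable.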

I hope this article helps with your Python programming on the Scrapy framework.
