Scrapy Basic Commands and settings.py Configuration


Posted in Python on February 06, 2020

This article walks through the basic Scrapy commands and the settings in settings.py, with examples for reference.

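Basic Scrapy commands

1. Create a crawler project

scrapy startproject [project name]

2. Create a spider file

scrapy genspider [spider name] [domain]

3. Run a spider (crawl)

scrapy crawl [spider name]
# -o (output) writes the scraped items to a file
scrapy crawl [spider name] -o zufang.json
scrapy crawl [spider name] -o zufang.csv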
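
Once a crawl has written zufang.json or zufang.csv, the exported items can be read back with the Python standard library alone. The following is a minimal sketch, assuming the JSON feed is a single JSON array of items and the CSV feed has a header row; the helper names load_json_feed and load_csv_feed are hypothetical and not part of Scrapy.

```python
import csv
import json
from pathlib import Path


def load_json_feed(path):
    """Return the list of items stored in a JSON feed file (one JSON array)."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def load_csv_feed(path):
    """Return the rows of a CSV feed as a list of dicts keyed by column name."""
    with Path(path).open(encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))
```

Calling load_json_feed("zufang.json") would then return a list of dicts, one per scraped item. Note that -o appends to an existing file, so a JSON file written to by two separate runs is no longer a single valid JSON array; delete the old file before re-running the crawl.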

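4. check: run contract checks on the spiders

scrapy check

5. list: list all spiders in the project

scrapy list

6. view: download a page and open it in the browser as Scrapy sees it

scrapy view http://www.baidu.com

7. shell: open the interactive Scrapy shell for a URL

scrapy shell https://www.baidu.com

8. runspider: run a spider from a single file, without a project

scrapy runspider zufang_spider.py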

Scrapy: settings.py configuration

# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
# Only a single User-Agent string can be set here, not a list of them
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'
# Obey robots.txt rules (projects generated by startproject set this to True;
# False here means robots.txt is ignored)
ROBOTSTXT_OBEY = False
# Write the log to a file
LOG_FILE = "maitian.log"
# Five log levels, from least to most severe: DEBUG, INFO, WARNING, ERROR, CRITICAL
# The higher the level, the less log output
# LOG_LEVEL = "INFO"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# The delay (in seconds) applies between consecutive requests to the same site,
# not between batches of 16 requests
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default, i.e. COOKIES_ENABLED defaults to True)
#COOKIES_ENABLED = False
# Disable Telnet Console, used to inspect a running crawler over telnet (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines: pipelines are enabled here in the settings file
# Values conventionally range from 0 to 1000; pipelines with lower values run first
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'maitian.pipelines.MaitianPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
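
Most of the throttling-related settings in the file above are commented out. As a rough illustration of how they fit together, the fragment below sketches one possible conservative configuration. The setting names come from the file above; every value is an illustrative assumption, not a recommendation from the original project.

```python
# Illustrative values only; tune them for the target site.
ROBOTSTXT_OBEY = True                  # honor robots.txt
CONCURRENT_REQUESTS = 16               # global cap on concurrent requests (Scrapy's default)
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # cap on concurrent requests per domain
DOWNLOAD_DELAY = 3                     # seconds between consecutive requests to the same site

# AutoThrottle adjusts the delay from observed latency and
# never goes below DOWNLOAD_DELAY.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound under high latency, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server

LOG_LEVEL = "INFO"                     # hide DEBUG output
```

These lines would go into the project's settings.py in place of the corresponding commented-out entries.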

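
The ITEM_PIPELINES setting in settings.py maps each pipeline class path to an integer, and pipelines with lower values run first. The sketch below illustrates that ordering in plain Python, together with a minimal pipeline class. StripWhitespacePipeline and run_order are hypothetical names introduced for illustration; the process_item(item, spider) signature is the one Scrapy used at the time of this article, and newer releases may differ.

```python
class StripWhitespacePipeline:
    """Hypothetical pipeline: trims whitespace around every string field."""

    def process_item(self, item, spider):
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()
        return item  # returning the item hands it to the next pipeline


# Example setting: the second entry is an assumption added for illustration.
ITEM_PIPELINES = {
    "maitian.pipelines.MaitianPipeline": 300,
    "maitian.pipelines.StripWhitespacePipeline": 200,
}


def run_order(item_pipelines):
    """Return the pipeline paths sorted by value, lowest value first."""
    return [path for path, _ in sorted(item_pipelines.items(), key=lambda kv: kv[1])]
```

With these values, run_order(ITEM_PIPELINES) lists StripWhitespacePipeline before MaitianPipeline, so items are cleaned before the project's own pipeline sees them.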

I hope this article helps with your Scrapy-based Python programming.
