Scrapy框架基本命令与settings.py设置


Posted in Python onFebruary 06, 2020

本文实例讲述了Scrapy框架基本命令与settings.py设置。分享给大家供大家参考,具体如下:

Scrapy框架基本命令

1.创建爬虫项目

scrapy startproject [项目名称]

2.创建爬虫文件

scrapy genspider +文件名+网址

3.运行(crawl)

scrapy crawl 爬虫名称
# -o output 输出数据到文件
scrapy crawl [爬虫名称] -o zufang.json
scrapy crawl [爬虫名称] -o zufang.csv

4.check检查错误

scrapy check

5.list返回项目所有spider

scrapy list

6.view 存储、打开网页

scrapy view http://www.baidu.com

7.scrapy shell, 进入终端

scrapy shell https://www.baidu.com

8.scrapy runspider

scrapy runspider zufang_spider.py

Scrapy框架: settings.py设置

# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
#不能批量设置
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'
#默认遵守robots协议
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
#设置日志文件
LOG_FILE="maitian.log"
#日志等级分为5种:1.DEBUG 2.INFO 3.Warning 4.ERROR 5.CRITICAL
#等级越高 输出的日志越少
# LOG_LEVEL="INFO"
#scrapy设置最大并发数 默认16
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
#设置批量延迟请求16 等待3秒再发16 秒
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
#cookie 不生效 默认是True
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
#远程
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
#加载默认的请求头
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
#爬虫中间件
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}
#下载中间件
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
#在配置文件 开启管道
#优先级的范围 0--1000;值越小 优先级越高
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'maitian.pipelines.MaitianPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

更多相关内容可查看本站专题:《Python Socket编程技巧总结》、《Python正则表达式用法总结》、《Python数据结构与算法教程》、《Python函数使用技巧总结》、《Python字符串操作技巧汇总》、《Python入门与进阶经典教程》及《Python文件与目录操作技巧汇总》

希望本文所述对大家基于Scrapy框架的Python程序设计有所帮助。

Python 相关文章推荐
Python求两个list的差集、交集与并集的方法
Nov 01 Python
Python检测QQ在线状态的方法
May 09 Python
python实现备份目录的方法
Aug 03 Python
举例讲解Python中的迭代器、生成器与列表解析用法
Mar 20 Python
python通过cookie模拟已登录状态的初步研究
Nov 09 Python
python字符串str和字节数组相互转化方法
Mar 18 Python
由浅入深讲解python中的yield与generator
Apr 05 Python
python模拟点击网页按钮实现方法
Feb 25 Python
Django-migrate报错问题解决方案
Apr 21 Python
基于python实现ROC曲线绘制广场解析
Jun 28 Python
python中使用 unittest.TestCase单元测试的用例详解
Aug 30 Python
Python日志模块logging用法
Jun 05 Python
python opencv圆、椭圆与任意多边形的绘制实例详解
Feb 06 #Python
Python输出指定字符串的方法
Feb 06 #Python
python实现简单飞行棋
Feb 06 #Python
python实现飞行棋游戏
Feb 05 #Python
以SQLite和PySqlite为例来学习Python DB API
Feb 05 #Python
Python操作Sqlite正确实现方法解析
Feb 05 #Python
Tensorflow矩阵运算实例(矩阵相乘,点乘,行/列累加)
Feb 05 #Python
You might like
PHP中用接口、抽象类、普通基类实现“面向接口编程”与“耦合方法”简述
2011/03/23 PHP
PHP性能优化工具篇Benchmark类调试执行时间
2011/12/06 PHP
php实现分页功能的详细实例方法
2019/09/29 PHP
PHP实现微信公众号验证Token的示例代码
2019/12/16 PHP
javascript 自动填写表单的实现方法
2010/04/09 Javascript
Fastest way to build an HTML string(拼装html字符串的最快方法)
2011/08/20 Javascript
iframe异步加载实现点击左边菜单加载右边内容实例讲解
2013/03/04 Javascript
Extjs的FileUploadField文件上传出现了两个上传按钮
2014/04/29 Javascript
文本框只能输入数字的实现方法(兼容IE火狐)
2016/06/25 Javascript
Webpack性能优化 DLL 用法详解
2017/08/10 Javascript
Vue实现textarea固定输入行数与添加下划线样式的思路详解
2018/06/28 Javascript
Vue注册组件命名时不能用大写的原因浅析
2019/04/25 Javascript
React传值 组件传值 之间的关系详解
2019/08/26 Javascript
微信小程序新闻网站详情页实例代码
2020/01/10 Javascript
javascript设计模式 ? 建造者模式原理与应用实例分析
2020/04/10 Javascript
js中延迟加载和预加载的具体使用
2021/01/14 Javascript
Python3基础之基本数据类型概述
2014/08/13 Python
python控制台中实现进度条功能
2015/11/10 Python
Python 专题六 局部变量、全局变量global、导入模块变量
2017/03/20 Python
python计算auc指标实例
2017/07/13 Python
Python 2.x如何设置命令执行的超时时间实例
2017/10/19 Python
python sklearn库实现简单逻辑回归的实例代码
2019/07/01 Python
python实现两个经纬度点之间的距离和方位角的方法
2019/07/05 Python
Python实现的企业粉丝抽奖功能示例
2019/07/26 Python
浅谈tensorflow使用张量时的一些注意点tf.concat,tf.reshape,tf.stack
2020/06/23 Python
python属于哪种语言
2020/08/16 Python
利用简洁的图片预加载组件提升html5移动页面的用户体验
2016/03/11 HTML / CSS
高街生活方式全球在线商店:AZBRO
2017/08/26 全球购物
澳洲的服装老品牌:SABA
2018/02/06 全球购物
卡拉威高尔夫官方网站:Callaway Golf
2020/09/16 全球购物
师范大学生求职信
2014/06/13 职场文书
2015年元宵节活动总结
2015/02/06 职场文书
幼儿园校车安全责任书
2015/05/08 职场文书
新员工辞职信范文
2015/05/12 职场文书
年终奖金发放管理制度,中小企业适用,拿去救急吧!
2019/07/12 职场文书
使用MybatisPlus打印sql语句
2022/04/22 SQL Server