Scrapy Basic Commands and settings.py Configuration


Posted in Python on February 06, 2020

This article walks through the basic Scrapy commands and the options in a project's settings.py, using a concrete example for reference.

Basic Scrapy commands

1. Create a crawler project

scrapy startproject [project name]

2. Create a spider file

scrapy genspider [spider name] [domain]

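3. Run a spider (crawl)

scrapy crawl [spider name]
# -o (output) writes the scraped items to a file; the format follows the file extension
scrapy crawl [spider name] -o zufang.json
scrapy crawl [spider name] -o zufang.csv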
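Once a crawl has written zufang.json or zufang.csv, the items can be read back with the standard library. The following is a minimal sketch, assuming the JSON export is a single array of objects and the CSV export starts with a header row; the function name load_items is illustrative and not part of Scrapy.

```python
import csv
import json
from pathlib import Path


def load_items(path):
    """Load items exported by `scrapy crawl <spider> -o <path>`.

    Handles the two formats used above: .json (one JSON array of
    objects) and .csv (a header row followed by one row per item).
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    if suffix == ".csv":
        # newline="" is what the csv module recommends when reading files
        with path.open(encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"unsupported feed format: {suffix}")
```

Note that CSV values always come back as strings, so numeric fields need converting after loading.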

4. check: check the spiders for errors (runs their contract checks)

scrapy check

5. list: list all spiders in the project

scrapy list

6. view: download a page and open it in the browser as Scrapy sees it

scrapy view http://www.baidu.com

7. shell: open the interactive Scrapy shell for a URL

scrapy shell https://www.baidu.com

8. runspider: run a spider contained in a single Python file, without creating a project

scrapy runspider zufang_spider.py
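The commands above can also be driven from a Python script, for example in a scheduled job. This is a rough sketch that only wraps the command line with subprocess; the helper names scrapy_command and run are made up for illustration, and it assumes the scrapy executable is on PATH and that the script runs inside the project directory.

```python
import shlex
import subprocess


def scrapy_command(subcommand, *args, output=None):
    """Build the argv list for a `scrapy` invocation.

    `output` maps to the -o option, which applies to the crawl and
    runspider subcommands shown above.
    """
    argv = ["scrapy", subcommand, *args]
    if output is not None:
        argv += ["-o", output]
    return argv


def run(argv, cwd=None):
    """Echo and run the command; raises CalledProcessError on a non-zero exit."""
    print("$", shlex.join(argv))
    return subprocess.run(argv, cwd=cwd, check=True)


# Example calls (not executed here, since they need a Scrapy project):
#   run(scrapy_command("list"))
#   run(scrapy_command("crawl", "zufang", output="zufang.json"))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a URL or file name contains special characters.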

Scrapy: settings.py configuration

# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
# Only a single User-Agent string can be set here, not a list
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'
# The generated project template obeys robots.txt (True); it is turned off here
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Write the log to a file
LOG_FILE="maitian.log"
# There are five log levels: 1. DEBUG 2. INFO 3. WARNING 4. ERROR 5. CRITICAL
# The higher the level, the less log output is produced
# LOG_LEVEL="INFO"
# Maximum number of concurrent requests in Scrapy; the default is 16
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Delay in seconds between consecutive requests to the same site (not a pause between batches of 16 requests)
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Cookies are enabled by default (True); set this to False to disable them
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Telnet console for remote inspection of the running crawler
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Default request headers
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Spider middlewares
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}
# Downloader middlewares
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
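# Enable item pipelines here in the settings file
# Priorities conventionally range from 0 to 1000; the lower the value, the earlier the pipeline runs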
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'maitian.pipelines.MaitianPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
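Two of the comments in the settings file describe ordering rules that are easy to get backwards: item pipelines with lower numbers run first, and a higher LOG_LEVEL produces less output. The sketch below imitates both rules with the standard library only; it does not call Scrapy, the three pipeline class paths are hypothetical, and the helper names run_order and levels_emitted are made up for illustration.

```python
import io
import logging

# Hypothetical pipeline paths; only the numbers matter for the ordering.
ITEM_PIPELINES = {
    "maitian.pipelines.SavePipeline": 800,
    "maitian.pipelines.CleanPipeline": 300,
    "maitian.pipelines.DedupePipeline": 500,
}


def run_order(pipelines):
    """Return the pipeline paths in the order items pass through them (ascending value)."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


def levels_emitted(threshold):
    """Log one message at each of the five levels and return the level names that get through."""
    stream = io.StringIO()
    logger = logging.getLogger(f"settings_demo.{threshold}")
    logger.setLevel(threshold)   # plays the role of LOG_LEVEL
    logger.propagate = False     # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger.addHandler(handler)
    for level in (logging.DEBUG, logging.INFO, logging.WARNING,
                  logging.ERROR, logging.CRITICAL):
        logger.log(level, "message")
    logger.removeHandler(handler)
    return stream.getvalue().split()


print(run_order(ITEM_PIPELINES))
print(levels_emitted("DEBUG"))
print(levels_emitted("INFO"))
print(levels_emitted("ERROR"))
```

With these values the cleaning pipeline (300) runs before deduplication (500) and saving (800), and raising the threshold from DEBUG to ERROR cuts the emitted levels from five to two.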

I hope this article is helpful for Python programming with the Scrapy framework.
