Scrapy Basic Commands and settings.py Configuration


Posted in Python on February 06, 2020

This article walks through the basic Scrapy commands and the options in settings.py, with examples for reference.

Basic Scrapy commands

1. Create a crawler project

scrapy startproject [project name]

2. Create a spider file

scrapy genspider [spider name] [domain]

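3. Run a spider (crawl)

scrapy crawl [spider name]
# -o (output) writes the scraped items to a file; the format follows the file extension
scrapy crawl [spider name] -o zufang.json
scrapy crawl [spider name] -o zufang.csv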
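
To check what a crawl actually exported, the JSON file can be read back with the standard library. The sketch below assumes the export is a single JSON array of item objects, which is what a first run of `-o zufang.json` writes; the helper name `load_items` and the field names in the usage note are hypothetical.

```python
import json
from pathlib import Path


def load_items(path):
    """Read a JSON export produced by `scrapy crawl <spider> -o <file>.json`.

    Assumes the file holds one JSON array of item objects.
    """
    with Path(path).open(encoding="utf-8") as fh:
        items = json.load(fh)
    if not isinstance(items, list):
        raise ValueError(f"{path} does not contain a JSON array of items")
    return items
```

For example, `load_items("zufang.json")` returns a list of dicts, so `item["title"]` or `item["price"]` can be read directly if the spider yields those fields. Note that `-o` appends to an existing file, which leaves a `.json` export unparseable after a second run; deleting the file first, or using `-O` (overwrite) where the installed Scrapy version supports it, avoids this.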

4. check: check for errors (runs the spiders' contract checks)

scrapy check

5. list: list all spiders in the project

scrapy list

6. view: download a page and open it in the browser as Scrapy sees it

scrapy view http://www.baidu.com

7. shell: open an interactive shell for the given URL

scrapy shell https://www.baidu.com

8. runspider: run a spider from a standalone file, without creating a project

scrapy runspider zufang_spider.py

Scrapy settings.py configuration

# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
# Only a single User-Agent string can be set here, not a list of them
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'
# A newly generated project obeys robots.txt (True); False here means robots.txt is ignored
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Write the log to a file
LOG_FILE = 'maitian.log'
# Five log levels, from most to least verbose: DEBUG, INFO, WARNING, ERROR, CRITICAL
# The higher the level, the less log output
#LOG_LEVEL = 'INFO'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Seconds to wait between consecutive requests to the same website
# (the delay applies per request, not per batch of 16; default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Cookies are enabled by default; uncomment the line below to disable them
#COOKIES_ENABLED = False
# Disable the Telnet console used for remote inspection (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines: they only run once enabled here
# Priority values conventionally range from 0 to 1000; lower values run first
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'maitian.pipelines.MaitianPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
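
To make the pipeline priority rule concrete, the sketch below defines a small item pipeline and registers it ahead of the template's `MaitianPipeline`. The class `PriceCleanerPipeline` and the `price` field are hypothetical, and the code assumes the spider yields plain dict items; in a real project the class would live in maitian/pipelines.py and the dictionary in settings.py.

```python
class PriceCleanerPipeline:
    """Hypothetical pipeline: reduce a raw price string to an integer.

    Assumes items are plain dicts with an optional 'price' field.
    Only whole numbers are handled; decimal prices would need more care.
    """

    def process_item(self, item, spider):
        raw = str(item.get("price", ""))
        digits = "".join(ch for ch in raw if ch.isdigit())
        item["price"] = int(digits) if digits else None
        # Returning the item hands it on to the next pipeline in priority order.
        return item


# settings.py fragment: the lower value (200) runs before the higher one (300),
# so prices are cleaned before MaitianPipeline sees the item.
ITEM_PIPELINES = {
    "maitian.pipelines.PriceCleanerPipeline": 200,
    "maitian.pipelines.MaitianPipeline": 300,
}
```

With this ordering, an item such as `{"price": "3,500元/月"}` would reach `MaitianPipeline` with `price` already converted to the integer 3500.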


I hope this article helps with Python programming based on the Scrapy framework.
