Scrapy框架基本命令与settings.py设置


Posted in Python onFebruary 06, 2020

本文实例讲述了Scrapy框架基本命令与settings.py设置。分享给大家供大家参考,具体如下:

Scrapy框架基本命令

1.创建爬虫项目

scrapy startproject [项目名称]

2.创建爬虫文件

scrapy genspider +文件名+网址

3.运行(crawl)

scrapy crawl 爬虫名称
# -o output 输出数据到文件
scrapy crawl [爬虫名称] -o zufang.json
scrapy crawl [爬虫名称] -o zufang.csv

4.check检查错误

scrapy check

5.list返回项目所有spider

scrapy list

6.view 存储、打开网页

scrapy view http://www.baidu.com

7.scrapy shell, 进入终端

scrapy shell https://www.baidu.com

8.scrapy runspider

scrapy runspider zufang_spider.py

Scrapy框架: settings.py设置

# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
#不能批量设置
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'
#默认遵守robots协议
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
#设置日志文件
LOG_FILE="maitian.log"
#日志等级分为5种:1.DEBUG 2.INFO 3.Warning 4.ERROR 5.CRITICAL
#等级越高 输出的日志越少
# LOG_LEVEL="INFO"
#scrapy设置最大并发数 默认16
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
#设置批量延迟请求16 等待3秒再发16 秒
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
#cookie 不生效 默认是True
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
#远程
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
#加载默认的请求头
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
#爬虫中间件
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}
#下载中间件
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
#在配置文件 开启管道
#优先级的范围 0--1000;值越小 优先级越高
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'maitian.pipelines.MaitianPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

更多相关内容可查看本站专题:《Python Socket编程技巧总结》、《Python正则表达式用法总结》、《Python数据结构与算法教程》、《Python函数使用技巧总结》、《Python字符串操作技巧汇总》、《Python入门与进阶经典教程》及《Python文件与目录操作技巧汇总》

希望本文所述对大家基于Scrapy框架的Python程序设计有所帮助。

Python 相关文章推荐
Python中用format函数格式化字符串的用法
Apr 08 Python
Python生成数字图片代码分享
Oct 31 Python
Python入门之三角函数tan()函数实例详解
Nov 08 Python
python执行使用shell命令方法分享
Nov 08 Python
python使用pandas实现数据分割实例代码
Jan 25 Python
pycharm 配置远程解释器的方法
Oct 28 Python
Python中*args和**kwargs的区别详解
Sep 17 Python
win7下 python3.6 安装opencv 和 opencv-contrib-python解决 cv2.xfeatures2d.SIFT_create() 的问题
Oct 24 Python
python 生成器需注意的小问题
Sep 29 Python
史上最详细的Python打包成exe文件教程
Jan 17 Python
python实现scrapy爬虫每天定时抓取数据的示例代码
Jan 27 Python
Jupyter Notebook内使用argparse报错的解决方案
Jun 03 Python
python opencv圆、椭圆与任意多边形的绘制实例详解
Feb 06 #Python
Python输出指定字符串的方法
Feb 06 #Python
python实现简单飞行棋
Feb 06 #Python
python实现飞行棋游戏
Feb 05 #Python
以SQLite和PySqlite为例来学习Python DB API
Feb 05 #Python
Python操作Sqlite正确实现方法解析
Feb 05 #Python
Tensorflow矩阵运算实例(矩阵相乘,点乘,行/列累加)
Feb 05 #Python
You might like
在PHP中利用XML技术构造远程服务(下)
2006/10/09 PHP
phpstorm配置Xdebug进行调试PHP教程
2014/12/01 PHP
Javascript hasOwnProperty 方法 & in 关键字
2008/11/26 Javascript
深入理解Javascript闭包 新手版
2010/12/28 Javascript
在iframe里的页面编写js,实现在父窗口上创建动画效果展开和收缩的div(不变动iframe父窗口代码)
2011/12/20 Javascript
你必须知道的JavaScript 变量命名规则详解
2013/05/07 Javascript
JQuery ztree 异步加载实例讲解
2016/02/25 Javascript
JS创建对象几种不同方法详解
2016/03/01 Javascript
BootstrapValidator超详细教程(推荐)
2016/12/07 Javascript
JS实现简单的二元方程计算器功能示例
2017/01/03 Javascript
Vue组件中prop属性使用说明实例代码详解
2018/05/31 Javascript
初试vue-cli使用HBuilderx打包app的坑
2019/07/17 Javascript
Layui事件监听的实现(表单和数据表格)
2019/10/17 Javascript
JavaScript setInterval()与setTimeout()计时器
2019/12/27 Javascript
js cavans实现静态滚动弹幕
2020/05/21 Javascript
Vue3.0的优化总结
2020/10/16 Javascript
[48:39]Ti4主赛事胜者组第一天 EG vs NEWBEE 2
2014/07/19 DOTA
基于wxpython实现的windows GUI程序实例
2015/05/30 Python
Python面向对象特殊成员
2017/04/24 Python
Django中ORM表的创建和增删改查方法示例
2017/11/15 Python
使用python存储网页上的图片实例
2018/05/22 Python
运用Python的webbrowser实现定时打开特定网页
2019/02/21 Python
对Python强大的可变参数传递机制详解
2019/06/13 Python
关于torch.optim的灵活使用详解(包括重写SGD,加上L1正则)
2020/02/20 Python
Django接收照片储存文件的实例代码
2020/03/07 Python
Python离线安装各种库及pip的方法
2020/11/28 Python
html svg生成环形进度条的实现方法
2019/09/23 HTML / CSS
Shell如何接收变量输入
2016/08/06 面试题
Ajax的工作原理
2015/12/04 面试题
最新离婚协议书范本
2014/08/19 职场文书
硕士毕业论文导师评语
2014/12/31 职场文书
党支部书记岗位职责
2015/02/15 职场文书
初中教师德育工作总结2015
2015/05/12 职场文书
TensorFlow中tf.batch_matmul()的用法
2021/06/02 Python
简单聊聊TypeScript只读修饰符
2022/04/06 Javascript
MySQL慢查询中的commit慢和binlog中慢事务的区别
2022/06/16 MySQL