Scrapy Basic Commands and settings.py Configuration

Posted in Python on February 06, 2020

This article walks through the basic Scrapy commands and the settings in a project's settings.py, with examples for reference.

Basic Scrapy commands

1. Create a crawler project

scrapy startproject [project name]

2. Create a spider file

scrapy genspider [spider name] [domain]

3. Run a spider (crawl)

scrapy crawl [spider name]
# -o (output) writes the scraped items to a file
scrapy crawl [spider name] -o zufang.json
scrapy crawl [spider name] -o zufang.csv

4. check: run the spiders' contract checks to catch errors

scrapy check

5. list: list all spiders in the project

scrapy list

6. view: download a page and open it in the browser as Scrapy sees it

scrapy view http://www.baidu.com

7. shell: open an interactive Scrapy shell for the given URL

scrapy shell https://www.baidu.com

8. runspider: run a spider from a standalone file, without creating a project

scrapy runspider zufang_spider.py
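
The crawl command and its -o option can also be driven from a script. The following is a minimal sketch, not part of the original article: the helper names build_crawl_command, run_crawl and load_items are hypothetical, and it assumes the scrapy executable is on the PATH and that the feed was written with a .json or .csv extension as in step 3.

```python
import csv
import json
import subprocess
from pathlib import Path


def build_crawl_command(spider_name, output_path=None):
    """Return the argument list for `scrapy crawl`, optionally with -o."""
    command = ["scrapy", "crawl", spider_name]
    if output_path is not None:
        command += ["-o", str(output_path)]
    return command


def run_crawl(spider_name, output_path, project_dir="."):
    """Run the crawl inside the project directory (hypothetical helper).

    check=True raises CalledProcessError if Scrapy exits with an error.
    """
    subprocess.run(
        build_crawl_command(spider_name, output_path),
        cwd=project_dir,
        check=True,
    )


def load_items(path):
    """Load items exported with -o, picking the parser from the extension."""
    path = Path(path)
    if path.suffix == ".json":
        # The .json feed is a single JSON array of item objects.
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)
    if path.suffix == ".csv":
        # The .csv feed has a header row; every value comes back as a string.
        with path.open(encoding="utf-8", newline="") as handle:
            return list(csv.DictReader(handle))
    raise ValueError(f"unsupported feed format: {path.suffix}")
```

A typical use would be run_crawl("zufang", "zufang.json") followed by load_items("zufang.json"), where "zufang" stands in for whatever name the spider declares.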

Scrapy settings.py configuration

# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
# USER_AGENT takes a single string; a list of user agents cannot be set here
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'
# Projects created by startproject obey robots.txt by default; it is switched off here
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
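# Write log output to this file instead of the console
LOG_FILE = "maitian.log"
# Five log levels, from lowest to highest: DEBUG, INFO, WARNING, ERROR, CRITICAL
# The higher the level, the less log output is produced
# LOG_LEVEL = "INFO"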
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# With DOWNLOAD_DELAY = 3, Scrapy waits about 3 seconds between consecutive requests to the same site
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default, i.e. COOKIES_ENABLED defaults to True)
#COOKIES_ENABLED = False
# Disable the Telnet console used for remote inspection (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines; uncomment ITEM_PIPELINES to enable them
# Order values conventionally range from 0 to 1000; lower values run first
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'maitian.pipelines.MaitianPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
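
The ITEM_PIPELINES entry above points at a class in the project's pipelines.py. The following is a minimal sketch of what such a class could look like, assuming the usual open_spider / process_item / close_spider hooks; the output file name items.jl, the path argument, and the pipeline_order helper are hypothetical and only illustrate how the order values are compared.

```python
import json


class MaitianPipeline:
    """Sketch of the pipeline named in ITEM_PIPELINES: one JSON object per line."""

    def __init__(self, path="items.jl"):
        # Hypothetical default output file; Scrapy creates the pipeline without arguments.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every scraped item; return it so later pipelines receive it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


def pipeline_order(item_pipelines):
    """Return pipeline paths lowest order value first, skipping entries set to None."""
    enabled = {path: order for path, order in item_pipelines.items() if order is not None}
    return sorted(enabled, key=enabled.get)
```

With two entries such as 'maitian.pipelines.MaitianPipeline': 300 and a hypothetical second pipeline at 100, the one at 100 would process each item first.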

We hope this article helps with your Scrapy-based Python programming.


- Author -

hankleo
