Scrapy Framework: Basic Commands and settings.py Configuration


Posted in Python on February 06, 2020

This article walks through the basic Scrapy commands and the settings in settings.py, with examples, for your reference.

Basic Scrapy Commands

1. Create a crawler project

scrapy startproject [project name]

2. Create a spider file

scrapy genspider [spider name] [domain]

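3. Run a spider (crawl)

scrapy crawl [spider name]
# -o (output) writes the scraped items to a file; the format is inferred from the file extension
scrapy crawl [spider name] -o zufang.json
scrapy crawl [spider name] -o zufang.csv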
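
Once a crawl has written zufang.json or zufang.csv, the files can be read back with the Python standard library. The following is only a minimal sketch: the field names title and price are hypothetical stand-ins for whatever fields the spider's items actually define, and the script writes its own sample files so that it runs without a crawl.

```python
import csv
import json
import os
import tempfile


def load_json_items(path):
    """Load items from a JSON feed; the file holds one array of objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_csv_items(path):
    """Load items from a CSV feed; the first row holds the field names."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Hypothetical sample items; real fields depend on the spider.
sample = [
    {"title": "Two-bedroom flat", "price": "5200"},
    {"title": "Studio", "price": "3100"},
]

# Write stand-in export files so the sketch is runnable on its own.
workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, "zufang.json")
csv_path = os.path.join(workdir, "zufang.csv")

with open(json_path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(sample)

json_items = load_json_items(json_path)
csv_items = load_csv_items(csv_path)
print(len(json_items), "items from JSON,", len(csv_items), "items from CSV")
```

Note that every value read from the CSV file is a string, so numeric fields such as the price need converting before use.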

4. check: check spiders for errors (runs their contract checks)

scrapy check

5. list: list all spiders in the project

scrapy list

6. view: download a page and open it in the browser as Scrapy sees it

scrapy view http://www.baidu.com

7. shell: open the interactive Scrapy shell for a URL

scrapy shell https://www.baidu.com

8. runspider: run a spider from a standalone Python file, without a project

scrapy runspider zufang_spider.py
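
The commands above can also be launched from a Python script, for example from a scheduled job. The sketch below is one possible wrapper built on the standard subprocess module; build_crawl_command and run_crawl are hypothetical helper names, and the spider name zufang is assumed for illustration.

```python
import shutil
import subprocess


def build_crawl_command(spider_name, output_path=None):
    """Return the argument list for `scrapy crawl`, optionally with -o."""
    command = ["scrapy", "crawl", spider_name]
    if output_path is not None:
        command += ["-o", output_path]
    return command


def run_crawl(spider_name, output_path=None, project_dir="."):
    """Run the crawl inside project_dir and return the process exit code."""
    if shutil.which("scrapy") is None:
        raise RuntimeError("scrapy executable not found on PATH")
    completed = subprocess.run(
        build_crawl_command(spider_name, output_path),
        cwd=project_dir,  # `scrapy crawl` must run inside the project directory
        check=False,
    )
    return completed.returncode


if __name__ == "__main__":
    # Only show the command here; call run_crawl() from inside a real project.
    print(" ".join(build_crawl_command("zufang", "zufang.json")))
```

Passing the arguments as a list avoids shell quoting problems when a spider name or output path contains spaces.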

Scrapy Framework: settings.py Configuration

# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
# Only a single user agent string can be set here, not a list
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'
# The generated project template sets ROBOTSTXT_OBEY = True; it is set to False here, so robots.txt is ignored
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Write the log to a file
LOG_FILE = "maitian.log"
# There are five log levels, from least to most severe: DEBUG, INFO, WARNING, ERROR, CRITICAL
# The higher the level, the less log output is produced
# LOG_LEVEL = "INFO"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay, in seconds, between consecutive requests to the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default, i.e. COOKIES_ENABLED defaults to True)
#COOKIES_ENABLED = False
# Disable the Telnet console used to inspect the running crawler remotely (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Spider middlewares
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}
# Downloader middlewares
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Enable item pipelines in the settings file
# Priority values range from 0 to 1000; pipelines with lower values run first
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'maitian.pipelines.MaitianPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
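
To make the effect of DOWNLOAD_DELAY more concrete: when Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting is left at its default of True, the wait between two requests to the same site is drawn at random from 0.5 to 1.5 times DOWNLOAD_DELAY. The sketch below only imitates that rule with the standard library; sample_delay is a hypothetical helper and not part of Scrapy.

```python
import random

DOWNLOAD_DELAY = 3  # seconds, the value shown in the commented-out setting


def sample_delay(base_delay, rng=random):
    """Draw one wait time between 0.5x and 1.5x of the configured delay."""
    return rng.uniform(0.5 * base_delay, 1.5 * base_delay)


# Five example waits, each somewhere between 1.5 and 4.5 seconds.
delays = [sample_delay(DOWNLOAD_DELAY) for _ in range(5)]
print([round(d, 2) for d in delays])
```

The delay applies between individual requests rather than between batches, so a delay of 3 seconds limits a single site to roughly one request every 3 seconds on average.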

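The numbers in ITEM_PIPELINES decide the order in which each item passes through the pipelines: lower values run first. The following is a small sketch of that ordering. Only MaitianPipeline appears in the settings; the other two class paths are hypothetical, and pipeline_order is an illustrative helper rather than a Scrapy function.

```python
ITEM_PIPELINES = {
    "maitian.pipelines.SaveToDatabasePipeline": 800,  # hypothetical pipeline
    "maitian.pipelines.MaitianPipeline": 300,
    "maitian.pipelines.CleanPricePipeline": 100,  # hypothetical pipeline
}


def pipeline_order(pipelines):
    """Return pipeline paths in processing order, lowest value first.

    Entries mapped to None are treated as disabled and skipped.
    """
    enabled = {path: order for path, order in pipelines.items() if order is not None}
    return sorted(enabled, key=enabled.get)


for path in pipeline_order(ITEM_PIPELINES):
    print(ITEM_PIPELINES[path], path)
```

With these values an item would be cleaned first, then handled by MaitianPipeline, and stored last.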
I hope this article helps you with Python programming based on the Scrapy framework.
