Python Big Data: How to Scrape Data from Web Pages (Detailed Walkthrough)


Posted in Python on November 16, 2019

This article walks through a working example of scraping data from a web page with Python and Scrapy. It is shared here for reference; the details are as follows.

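The files below assume a standard Scrapy project named jredu, created with Scrapy's own scaffolding (scrapy startproject jredu); the project name and layout are inferred from the settings and imports shown later, and would look roughly like this:

jredu/
  scrapy.cfg
  jredu/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
      __init__.py
      myspider.py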

myspider.py:

#!/usr/bin/python
# -*- coding:utf-8 -*-
from scrapy.spiders import Spider
from lxml import etree
from jredu.items import JreduItem
class JreduSpider(Spider):
  name = 'tt'  # the spider's name: required, and must be unique within the project
  allowed_domains = ['sohu.com']
  start_urls = [
    'http://www.sohu.com'
  ]
  def parse(self, response):
    content = response.body.decode('utf-8')
    dom = etree.HTML(content)
    for ul in dom.xpath("//div[@class='focus-news-box']/div[@class='list16']/ul"):
      lis = ul.xpath("./li")
      for li in lis:
        item = JreduItem()  # create an item to hold one headline
        if ul.index(li) == 0:
          # the first <li> wraps its headline text in a <strong> tag
          strong = li.xpath("./a/strong/text()")
          item['title'] = strong[0]
          item['href'] = li.xpath("./a/@href")[0]
        else:
          # the other <li> elements keep the headline text directly in the last <a>
          la = li.xpath("./a[last()]/text()")
          item['title'] = la[0]
          item['href'] = li.xpath("./a[last()]/@href")[0]
        yield item
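
As a side note, Scrapy responses expose their own .xpath() selectors, so the manual decode() + etree.HTML() step is not strictly required. A minimal alternative sketch of parse(), assuming the same sohu.com markup as above:

  def parse(self, response):
    # Scrapy's built-in selectors replace the manual decode() + etree.HTML() step
    for ul in response.xpath("//div[@class='focus-news-box']/div[@class='list16']/ul"):
      for index, li in enumerate(ul.xpath("./li")):
        item = JreduItem()
        if index == 0:
          item['title'] = li.xpath("./a/strong/text()").extract_first()
          item['href'] = li.xpath("./a/@href").extract_first()
        else:
          item['title'] = li.xpath("./a[last()]/text()").extract_first()
          item['href'] = li.xpath("./a[last()]/@href").extract_first()
        yield item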

items.py:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class JreduItem(scrapy.Item):  # comparable to an entity class in Java
  # define the fields for your item here like:
  # name = scrapy.Field()
  title = scrapy.Field()  # declare a Field for the headline text
  href = scrapy.Field()   # and one for its link URL
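
A Scrapy Item behaves much like a dictionary with a fixed set of allowed keys, which is how the spider and pipeline use it. A small illustrative snippet (the values are hypothetical):

item = JreduItem(title=u"a headline", href=u"http://www.sohu.com/a-headline")
print(item['title'])   # fields are read and written with dict-style access
dict(item)             # convert to a plain dict, as pipelines.py does before json.dumps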

middlewares.py:

# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
class JreduSpiderMiddleware(object):
  # Not all methods need to be defined. If a method is not defined,
  # scrapy acts as if the spider middleware does not modify the
  # passed objects.
  @classmethod
  def from_crawler(cls, crawler):
    # This method is used by Scrapy to create your spiders.
    s = cls()
    crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
    return s
  def process_spider_input(self, response, spider):
    # Called for each response that goes through the spider
    # middleware and into the spider.
    # Should return None or raise an exception.
    return None
  def process_spider_output(self, response, result, spider):
    # Called with the results returned from the Spider, after
    # it has processed the response.
    # Must return an iterable of Request, dict or Item objects.
    for i in result:
      yield i
  def process_spider_exception(self, response, exception, spider):
    # Called when a spider or process_spider_input() method
    # (from other spider middleware) raises an exception.
    # Should return either None or an iterable of Response, dict
    # or Item objects.
    pass
  def process_start_requests(self, start_requests, spider):
    # Called with the start requests of the spider, and works
    # similarly to the process_spider_output() method, except
    # that it doesn't have a response associated.
    # Must return only requests (not items).
    for r in start_requests:
      yield r
  def spider_opened(self, spider):
    spider.logger.info('Spider opened: %s' % spider.name)
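
Note that this file is the unmodified template generated by scrapy startproject; the middleware does nothing unless it is enabled in settings.py, roughly as in the block that is commented out there by default:

SPIDER_MIDDLEWARES = {
  'jredu.middlewares.JreduSpiderMiddleware': 543,
}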

pipelines.py:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json
class JreduPipeline(object):
  def __init__(self):
    # open the output file once, when the pipeline is instantiated
    self.fill = codecs.open("data.txt", encoding="utf-8", mode="wb")
  def process_item(self, item, spider):
    # write each item as one JSON object per line
    line = json.dumps(dict(item)) + "\n"
    self.fill.write(line)
    return item
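
One thing the pipeline never does is close the file handle. Scrapy calls close_spider() on each pipeline when the crawl finishes, so an optional small addition would flush and release the output file:

  def close_spider(self, spider):
    # called by Scrapy once the spider closes; release the output file
    self.fill.close()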

settings.py:

# -*- coding: utf-8 -*-
# Scrapy settings for jredu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   http://doc.scrapy.org/en/latest/topics/settings.html
#   http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#   http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'jredu'
SPIDER_MODULES = ['jredu.spiders']
NEWSPIDER_MODULE = 'jredu.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jredu (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'jredu.middlewares.JreduSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'jredu.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
  'jredu.pipelines.JreduPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Finally, we need an entry point to run the spider:

main.py:

#!/usr/bin/python
# -*- coding:utf-8 -*-
# entry point for launching the spider programmatically
from scrapy import cmdline
cmdline.execute("scrapy crawl tt".split())
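
Running main.py is equivalent to executing scrapy crawl tt from the project directory. Scrapy's standard -o option can additionally export the scraped items to a feed file; a variant of the entry point (the output file name here is arbitrary):

#!/usr/bin/python
# -*- coding:utf-8 -*-
# same entry point, but also export the items to a JSON feed via Scrapy's -o option
from scrapy import cmdline
cmdline.execute("scrapy crawl tt -o headlines.json".split())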


We hope this article is helpful to readers working on Python programming.
