An Example of Crawling Web Page Content with Scrapy in Python

Posted in Python on May 21, 2018

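Last week I spent a full week learning Python and Scrapy and built a complete web crawler from scratch. The research was painful at times, but I enjoyed it; that is what working in tech is like.

First, install Python. There are plenty of pitfalls, and you have to climb out of them one by one. I work on Windows (I can't afford a Mac), so I ran into all sorts of problems during installation, most of them caused by missing dependencies.

I won't repeat the installation tutorial here. If the installation fails with an error saying that Windows C/C++ is required, the usual cause is a missing Windows build environment. Most tutorials online tell you to install Visual Studio, which is overkill; installing the Windows SDK is enough.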

Here is my crawler code.

The main spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from zjf.FsmzItems import FsmzItem


# Circle (site section): relationships and life
class MySpider(scrapy.Spider):
    # Spider name, used by "scrapy crawl"
    name = "MySpider"
    # Domains the spider is allowed to crawl
    allowed_domains = ["nvsheng.com"]
    # Start URLs; filled in by __init__ from the command line
    start_urls = []
    # Flag: stays 0 until the first page has been parsed
    x = 0

    # Parse callback
    def parse(self, response):
        item = FsmzItem()
        item['title'] = response.xpath('//h1/text()').extract()
        item['text'] = response.xpath('//*[@class="content"]/p/text()').extract()
        item['imags'] = response.xpath(
            '//div[@id="content"]/p/a/img/@src|//div[@id="content"]/p/img/@src').extract()
        # Only the first page schedules the remaining pages of the article
        if MySpider.x == 0:
            for page_single in self.getUrl(response):
                yield Request(page_single)
        MySpider.x += 1
        yield item

    # init: accept the start URL dynamically
    # Command line usage: scrapy crawl MySpider -a start_url="http://some_url"
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

    # Collect the URLs of the other pages from the pagination bar
    def getUrl(self, response):
        url_list = []
        page_list_tmp = response.xpath(
            '//div[@class="viewnewpages"]/a[not(@class="next")]/@href').extract()
        for page_tmp in page_list_tmp:
            page_url = "http://www.nvsheng.com/emotion/px/" + page_tmp
            # Compare full URLs so that duplicates are actually skipped
            if page_url not in url_list:
                url_list.append(page_url)
        return url_list
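
The pagination step in getUrl can be tried out without Scrapy. The following is a minimal sketch, not the author's code: the helper name build_page_urls and the sample hrefs are hypothetical, and it uses urllib.parse.urljoin instead of string concatenation to turn relative links into absolute ones.

```python
from urllib.parse import urljoin

# Prefix used by the spider above for the article pages
BASE_URL = "http://www.nvsheng.com/emotion/px/"


def build_page_urls(hrefs, base_url=BASE_URL):
    """Turn relative hrefs into absolute URLs, keeping order and dropping duplicates."""
    urls = []
    for href in hrefs:
        url = urljoin(base_url, href)
        # Compare absolute URLs, so a repeated link is added only once
        if url not in urls:
            urls.append(url)
    return urls


# Hypothetical hrefs, as they might come out of the pagination XPath
sample_hrefs = ["123_2.html", "123_3.html", "123_2.html"]
for page_url in build_page_urls(sample_hrefs):
    print(page_url)
```

Inside a spider, each returned URL would be wrapped in a Request, as parse does above.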

The pipeline class:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import random
import urllib.request

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

from zjf import settings


class MyPipeline(object):
    # 1 while the first item (the first page) has not been processed yet
    flag = 1
    post_title = ''
    post_text = []
    post_text_imageUrl_list = []
    # Content segments of the post, one per crawled page
    cs = []
    user_id = ''

    def __init__(self):
        # Pick a random user_id to publish the post as
        MyPipeline.user_id = MyPipeline.getRandomUser('37619,18441390,18441391')

    # Process the data
    def process_item(self, item, spider):
        # Join the paragraphs into the body text
        text_str_tmp = "".join(item['text'])
        # Take the title from the first page only
        if MyPipeline.flag == 1:
            title = item['title']
            MyPipeline.post_title = MyPipeline.post_title + title[0]
        # Download each image, then upload it
        text_insert_pic = ''
        text_insert_pic_w = ''
        text_insert_pic_h = ''
        for imag_url in item['imags']:
            img_name = imag_url.replace('/', '').replace('.', '').replace('|', '').replace(':', '')
            pic_dir = settings.IMAGES_STORE + '%s.jpg' % (img_name)
            urllib.request.urlretrieve(imag_url, pic_dir)
            # Upload the image; the API returns JSON
            upload_img_result = MyPipeline.uploadImage(pic_dir, 'image/jpeg')
            # Read the stored image URL and its size from the JSON
            text_insert_pic = upload_img_result['result']['image_url']
            text_insert_pic_w = upload_img_result['result']['w']
            text_insert_pic_h = upload_img_result['result']['h']
        # Assemble the JSON segment for this page
        if MyPipeline.flag == 1:
            cs_json = {"c": text_str_tmp, "i": "", "w": text_insert_pic_w, "h": text_insert_pic_h}
        else:
            cs_json = {"c": text_str_tmp, "i": text_insert_pic, "w": text_insert_pic_w, "h": text_insert_pic_h}
        MyPipeline.cs.append(cs_json)
        MyPipeline.flag += 1
        return item

    # Called when the spider is opened
    def open_spider(self, spider):
        pass

    # Called when the spider is closed: publish the collected post
    def close_spider(self, spider):
        strcs = json.dumps(MyPipeline.cs)
        jsonData = {"apisign": "99ea3eda4b45549162c4a741d58baa60", "user_id": MyPipeline.user_id,
                    "gid": 30, "t": MyPipeline.post_title, "cs": strcs}
        MyPipeline.uploadPost(jsonData)

    # Upload an image
    @staticmethod
    def uploadImage(img_path, content_type):
        "uploadImage functions"
        # UPLOAD_IMG_URL = "http://api.qa.douguo.net/robot/uploadpostimage"
        UPLOAD_IMG_URL = "http://api.douguo.net/robot/uploadpostimage"
        # Send the image as multipart/form-data; the file stays open until the request is done
        with open(img_path, 'rb') as img_file:
            m = MultipartEncoder(
                fields={'user_id': MyPipeline.user_id,
                        'apisign': '99ea3eda4b45549162c4a741d58baa60',
                        'image': ('filename', img_file, content_type)}
            )
            r = requests.post(UPLOAD_IMG_URL, data=m, headers={'Content-Type': m.content_type})
        return r.json()

    # Create the post
    @staticmethod
    def uploadPost(jsonData):
        CREATE_POST_URL = "http://api.douguo.net/robot/uploadimagespost"
        reqPost = requests.post(CREATE_POST_URL, data=jsonData)
        return reqPost

    # Pick one user_id at random from a comma-separated string
    @staticmethod
    def getRandomUser(userStr):
        user_list = str(userStr).split(',')
        return random.choice(user_list)
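
The two pieces of pure logic in the pipeline, choosing a user and assembling the post payload, can be checked without any network access. This is a minimal sketch under stated assumptions: pick_user and build_post_payload are hypothetical helper names, the sample segment is made up, and the apisign field is left out on purpose.

```python
import json
import random


def pick_user(user_str):
    """Pick one user id at random from a comma-separated string."""
    return random.choice(user_str.split(","))


def build_post_payload(segments, title, user_id, gid=30):
    """Assemble the form fields for the post; "cs" is sent as a JSON string."""
    return {
        "user_id": user_id,
        "gid": gid,
        "t": title,
        "cs": json.dumps(segments),
    }


# Hypothetical content: one text-only segment, as produced for the first page
segments = [{"c": "first page text", "i": "", "w": "", "h": ""}]
payload = build_post_payload(segments, "Sample title", pick_user("37619,18441390,18441391"))
print(payload["cs"])
```

Keeping these helpers separate from the HTTP calls makes it possible to inspect the payload before anything is posted.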

The Items class that holds the scraped fields:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class FsmzItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    # tutor = scrapy.Field()
    # strongText = scrapy.Field()
    text = scrapy.Field()
    imags = scrapy.Field()
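
The pipeline only runs if it is registered in the project settings, and it reads settings.IMAGES_STORE when saving images. The post does not show its settings file, so the following is only a sketch: the module path zjf.pipelines and the folder D:\pics\ are assumptions.

```python
# settings.py (sketch; values marked "assumed" are not from the original post)

# Register the pipeline. The module path "zjf.pipelines" is assumed;
# 300 is its order relative to other pipelines (lower numbers run first).
ITEM_PIPELINES = {
    "zjf.pipelines.MyPipeline": 300,
}

# Folder for downloaded images (assumed path). The pipeline concatenates
# IMAGES_STORE and the file name directly, so it must end with a separator.
IMAGES_STORE = "D:\\pics\\"
```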

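Type the following on the command line:

scrapy crawl MySpider -a start_url=http://www.aaa.com

This crawls the content under aaa.com. The URL must include the scheme (http://); Scrapy rejects a request URL without one.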
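
Scrapy needs an absolute URL, scheme included, for every request. A minimal sketch of a helper that adds a default scheme when the argument lacks one; normalize_start_url is a hypothetical name and not part of the original spider, where it could be applied to the start_url argument in __init__.

```python
def normalize_start_url(url, default_scheme="http"):
    """Return url unchanged if it has a scheme, else prepend default_scheme."""
    url = url.strip()
    if "://" not in url:
        url = default_scheme + "://" + url
    return url


print(normalize_start_url("www.aaa.com"))
print(normalize_start_url("https://www.aaa.com"))
```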


- Author -

止鱼
