python使用scrapy发送post请求的坑


Posted in Python onSeptember 04, 2018

使用requests发送post请求

先来看看使用requests来发送post请求是多少好用,发送请求

Requests 简便的 API 意味着所有 HTTP 请求类型都是显而易见的。例如,你可以这样发送一个 HTTP POST 请求:

>>>r = requests.post('http://httpbin.org/post', data = {'key':'value'})

使用data可以传递字典作为参数,同时也可以传递元祖

>>>payload = (('key1', 'value1'), ('key1', 'value2'))
>>>r = requests.post('http://httpbin.org/post', data=payload)
>>>print(r.text)
{
 ...
 "form": {
  "key1": [
   "value1",
   "value2"
  ]
 },
 ...
}

传递json是这样

>>>import json

>>>url = 'https://api.github.com/some/endpoint'
>>>payload = {'some': 'data'}

>>>r = requests.post(url, data=json.dumps(payload))

2.4.2 版的新加功能:

>>>url = 'https://api.github.com/some/endpoint'
>>>payload = {'some': 'data'}

>>>r = requests.post(url, json=payload)

也就是说,你不需要对参数做什么变化,只需要关注使用data=还是json=,其余的requests都已经帮你做好了。

使用scrapy发送post请求

通过源码可知scrapy默认发送的get请求,当我们需要发送携带参数的请求或登录时,是需要post、请求的,以下面为例

from scrapy.spider import CrawlSpider
from scrapy.selector import Selector
import scrapy
import json
class LaGou(CrawlSpider):
  name = 'myspider'
  def start_requests(self):
    yield scrapy.FormRequest(
      url='https://www.******.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false',
      formdata={
        'first': 'true',#这里不能给bool类型的True,requests模块中可以
        'pn': '1',#这里不能给int类型的1,requests模块中可以
        'kd': 'python'
      },这里的formdata相当于requ模块中的data,key和value只能是键值对形式
      callback=self.parse
    )
  def parse(self, response):
    datas=json.loads(response.body.decode())['content']['positionResult']['result']
    for data in datas:
      print(data['companyFullName'] + str(data['positionId']))

官方推荐的 Using FormRequest to send data via HTTP POST

return [FormRequest(url="http://www.example.com/post/action",
          formdata={'name': 'John Doe', 'age': '27'},
          callback=self.after_post)]

这里使用的是FormRequest,并使用formdata传递参数,看到这里也是一个字典。

但是,超级坑的一点来了,今天折腾了一下午,使用这种方法发送请求,怎么发都会出问题,返回的数据一直都不是我想要的

return scrapy.FormRequest(url, formdata=(payload))

在网上找了很久,最终找到一种方法,使用scrapy.Request发送请求,就可以正常的获取数据。

return scrapy.Request(url, body=json.dumps(payload), method='POST', headers={'Content-Type': 'application/json'},)

参考:Send Post Request in Scrapy

my_data = {'field1': 'value1', 'field2': 'value2'}
request = scrapy.Request( url, method='POST', 
             body=json.dumps(my_data), 
             headers={'Content-Type':'application/json'} )

FormRequest 与 Request 区别

在文档中,几乎看不到差别,

The FormRequest class adds a new argument to the constructor. The remaining arguments are the same as for the Request class and are not documented here.
Parameters: formdata (dict or iterable of tuples) ? is a dictionary (or iterable of (key, value) tuples) containing HTML Form data which will be url-encoded and assigned to the body of the request.

说FormRequest新增加了一个参数formdata,接受包含表单数据的字典或者可迭代的元组,并将其转化为请求的body。并且FormRequest是继承Request的

class FormRequest(Request):

  def __init__(self, *args, **kwargs):
    formdata = kwargs.pop('formdata', None)
    if formdata and kwargs.get('method') is None:
      kwargs['method'] = 'POST'

    super(FormRequest, self).__init__(*args, **kwargs)

    if formdata:
      items = formdata.items() if isinstance(formdata, dict) else formdata
      querystr = _urlencode(items, self.encoding)
      if self.method == 'POST':
        self.headers.setdefault(b'Content-Type', b'application/x-www-form-urlencoded')
        self._set_body(querystr)
      else:
        self._set_url(self.url + ('&' if '?' in self.url else '?') + querystr)
      ###


def _urlencode(seq, enc):
  values = [(to_bytes(k, enc), to_bytes(v, enc))
       for k, vs in seq
       for v in (vs if is_listlike(vs) else [vs])]
  return urlencode(values, doseq=1)

最终我们传递的{‘key': ‘value', ‘k': ‘v'}会被转化为'key=value&k=v' 并且默认的method是POST,再来看看Request

class Request(object_ref):

  def __init__(self, url, callback=None, method='GET', headers=None, body=None,
         cookies=None, meta=None, encoding='utf-8', priority=0,
         dont_filter=False, errback=None, flags=None):

    self._encoding = encoding # this one has to be set first
    self.method = str(method).upper()

默认的方法是GET,其实并不影响。仍然可以发送post请求。这让我想起来requests中的request用法,这是定义请求的基础方法。

def request(method, url, **kwargs):
  """Constructs and sends a :class:`Request <Request>`.

  :param method: method for the new :class:`Request` object.
  :param url: URL for the new :class:`Request` object.
  :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
  :param data: (optional) Dictionary or list of tuples ``[(key, value)]`` (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
  :param json: (optional) json data to send in the body of the :class:`Request`.
  :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
  :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
  :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
    ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
    or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
    defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
    to add for the file.
  :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
  :param timeout: (optional) How many seconds to wait for the server to send data
    before giving up, as a float, or a :ref:`(connect timeout, read
    timeout) <timeouts>` tuple.
  :type timeout: float or tuple
  :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
  :type allow_redirects: bool
  :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
  :param verify: (optional) Either a boolean, in which case it controls whether we verify
      the server's TLS certificate, or a string, in which case it must be a path
      to a CA bundle to use. Defaults to ``True``.
  :param stream: (optional) if ``False``, the response content will be immediately downloaded.
  :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
  :return: :class:`Response <Response>` object
  :rtype: requests.Response

  Usage::

   >>> import requests
   >>> req = requests.request('GET', 'http://httpbin.org/get')
   <Response [200]>
  """

  # By using the 'with' statement we are sure the session is closed, thus we
  # avoid leaving sockets open which can trigger a ResourceWarning in some
  # cases, and look like a memory leak in others.
  with sessions.Session() as session:
    return session.request(method=method, url=url, **kwargs)

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
Python实现类似jQuery使用中的链式调用的示例
Jun 16 Python
python利用正则表达式提取字符串
Dec 08 Python
python数据抓取分析的示例代码(python + mongodb)
Dec 25 Python
Python中static相关知识小结
Jan 02 Python
Python编程求解二叉树中和为某一值的路径代码示例
Jan 04 Python
ubuntu安装sublime3并配置python3环境的方法
Mar 15 Python
python版本的仿windows计划任务工具
Apr 30 Python
学习python的前途 python挣钱
Feb 27 Python
python爬虫项目设置一个中断重连的程序的实现
Jul 26 Python
django项目环境搭建及在虚拟机本地创建django项目的教程
Aug 02 Python
Python 解决火狐浏览器不弹出下载框直接下载的问题
Mar 09 Python
python代码实现将列表中重复元素之间的内容全部滤除
May 22 Python
解决win64 Python下安装PIL出错问题(图解)
Sep 03 #Python
Python全局变量与局部变量区别及用法分析
Sep 03 #Python
Python wxPython库Core组件BoxSizer用法示例
Sep 03 #Python
深入浅析Python中list的复制及深拷贝与浅拷贝
Sep 03 #Python
Python wxPython库使用wx.ListBox创建列表框示例
Sep 03 #Python
Python wxPython库消息对话框MessageDialog用法示例
Sep 03 #Python
Python中关键字global和nonlocal的区别详解
Sep 03 #Python
You might like
关于手调机和数调机的选择
2021/03/02 无线电
多数据表共用一个页的新闻发布
2006/10/09 PHP
php 无限级缓存的类的扩展
2009/03/16 PHP
有关JSON以及JSON在PHP中的应用
2010/04/09 PHP
php中DOMDocument简单用法示例代码(XML创建、添加、删除、修改)
2010/12/19 PHP
PHP使用函数用法详解
2018/09/30 PHP
js调用flash的效果代码
2008/04/26 Javascript
chrome原生方法之数组
2011/11/30 Javascript
放弃用你的InnerHTML来输出HTML吧 jQuery Tmpl不详细讲解
2013/04/20 Javascript
jquery ajax 简单范例(界面+后台)
2013/11/19 Javascript
浅谈Vue.js 关于页面加载完成后执行一个方法的问题
2019/04/01 Javascript
jQuery实现倒计时功能完整示例
2020/06/01 jQuery
Vue项目开发常见问题和解决方案总结
2020/09/11 Javascript
JS如何操作DOM基于表格动态展示数据
2020/10/15 Javascript
JS前端基于canvas给图片添加水印
2020/11/11 Javascript
JS创建自定义对象的六种方法总结
2020/12/15 Javascript
[02:25]专访DOTA2负责人Erik 国际邀请赛暂不会离开西雅
2014/07/21 DOTA
[58:42]DOTA2上海特级锦标赛C组败者赛 Newbee VS Archon第一局
2016/02/27 DOTA
[27:08]完美世界DOTA2联赛PWL S2 SZ vs Rebirth 第二场 11.21
2020/11/23 DOTA
python文件操作相关知识点总结整理
2016/02/22 Python
numpy添加新的维度:newaxis的方法
2018/08/02 Python
python3获取控制台输入的数据的具体实例
2020/08/16 Python
Kneipp克奈圃美国官网:德国百年精油配方的传承
2018/02/07 全球购物
真正的英国宝藏:Mappin & Webb
2019/05/05 全球购物
Antler英国官网:购买安特丽行李箱、拉杆箱
2019/08/25 全球购物
运动会通讯稿500字
2014/02/20 职场文书
护理专业学生职业生涯规划范文
2014/03/11 职场文书
幼儿园秋游感想
2014/03/12 职场文书
保护环境标语
2014/06/09 职场文书
学校地质灾害防治方案
2014/06/10 职场文书
优秀大专毕业生求职信
2014/08/04 职场文书
教师师德师风自我剖析材料
2014/09/29 职场文书
设备收款委托书范本
2014/10/02 职场文书
陈斌强事迹观后感
2015/06/17 职场文书
公司职员入党自传书
2015/06/26 职场文书
pandas中关于apply+lambda的应用
2022/02/28 Python