python用700行代码实现http客户端


Posted in Python onJanuary 14, 2021

本文用python在TCP的基础上实现一个HTTP客户端, 该客户端能够复用TCP连接, 使用HTTP1.1协议. 

一. 创建HTTP请求

HTTP是基于TCP连接的, 它的请求报文格式如下:

python用700行代码实现http客户端

因此, 我们只需要创建一个到服务器的TCP连接, 然后按照上面的格式写好报文并发给服务器, 就实现了一个HTTP请求.

1. HTTPConnection类

基于以上的分析, 我们首先定义一个HTTPConnection类来管理连接和请求内容:

class HTTPConnection:
  default_port = 80
  _http_vsn = 11
  _http_vsn_str = 'HTTP/1.1'

  def __init__(self, host: str, port: int = None) -> None:
    self.sock = None
    self._buffer = []
    self.host = host
    self.port = port if port is not None else self.default_port
    self._state = _CS_IDLE
    self._response = None
    self._method = None
    self.block_size = 8192

  def _output(self, s: Union[str, bytes]) -> None:
    if hasattr(s, 'encode'):
      s = s.encode('latin-1')
    self._buffer.append(s)

  def connect(self) -> None:
    self.sock = socket.create_connection((self.host, self.port))

对于这个HTTPConnection对象, 我们只需要创建TCP连接, 然后按照HTTP协议的格式把请求数据写入buffer中, 最后把buffer中的数据发送出去就行了.

2. 编写请求行

请求行的内容比较简单, 就是说明请求方法, 请求路径和HTTP协议. 使用下面的方法来编写一个请求行:

def put_request(self, method: str, url: str) -> None:
  self._method = method

  url = url or '/'

  request = f'{method} {url} {self._http_vsn_str}'
  self._output(request)

3. 添加请求头

HTTP请求头和python的字典类似, 每行都是一个字段名与值的映射关系. HTTP协议并不要求设置所有合法的请求头的值, 我们只需要按照需要, 设置特定的请求头即可. 使用如下代码添加请求头:

def put_header(self, header: Union[bytes, str], value: Union[bytes, str, int]) -> None:
  if hasattr(header, 'encode'):
    header = header.encode('ascii')

  if hasattr(value, 'encode'):
    value = value.encode('latin-1')
  elif isinstance(value, int):
    value = str(value).encode('ascii')

  header = header + b': ' + value
  self._output(header)

此外, 在HTTP请求中, Host请求头字段是必须的, 否则网站可能会拒绝响应. 因此, 如果用户没有设置这个字段, 这里就应该主动把它加上去:

def _add_host(self, url: str) -> None:
  # 所有HTTP / 1.1请求报文中必须包含一个Host头字段
  # 如果用户没给,就调用这个函数来生成
  netloc = ''
  if url.startswith('http'):
    nil, netloc, nil, nil, nil = urllib.parse.urlsplit(url)

  if netloc:
    try:
      netloc_enc = netloc.encode('ascii')
    except UnicodeEncodeError:
      netloc_enc = netloc.encode('idna')
    self.put_header('Host', netloc_enc)
  else:
    host = self.host
    port = self.port

    try:
      host_enc = host.encode('ascii')
    except UnicodeEncodeError:
      host_enc = host.encode('idna')

    # 对IPv6的地址进行额外处理
    if host.find(':') >= 0:
      host_enc = b'[' + host_enc + b']'

    if port == self.default_port:
      self.put_header('Host', host_enc)
    else:
      host_enc = host_enc.decode('ascii')
      self.put_header('Host', f'{host_enc}:{port}')

4. 发送请求正文

我们接受两种形式的body数据: 一个基于io.IOBase的可读文件对象, 或者是一个能通过迭代得到数据的对象. 在传输数据之前, 我们首先要确定数据是否采用分块传输:

def request(self, method: str, url: str, headers: dict = None, body: Union[io.IOBase, Iterable] = None,
      encode_chunked: bool = False) -> None:
  ...
  if 'content-length' not in header_names:
    if 'transfer-encoding' not in header_names:
      encode_chunked = False
      content_length = self._get_content_length(body, method)
      if content_length is None:
        if body is not None:
          # 在这种情况下, body一般是个生成器或者可读文件之类的东西,应该分块传输
          encode_chunked = True
          self.put_header('Transfer-Encoding', 'chunked')
      else:
        self.put_header('Content-Length', str(content_length))
    else:
      # 如果设置了transfer-encoding,则根据用户给的encode_chunked参数决定是否分块
      pass
  else:
    # 只要给了content-length,那么一定不是分块传输
    encode_chunked = False
  ...


@staticmethod
def _get_content_length(body: Union[str, bytes, bytearray, Iterable, io.IOBase], method: str) -> Optional[int]:
  if body is None:
    # PUT,POST,PATCH三个方法默认是有body的
    if method.upper() in _METHODS_EXPECTING_BODY:
      return 0
    else:
      return None

  if hasattr(body, 'read'):
    return None

  try:
    # 对于bytes或者bytearray格式的数据,通过memoryview获取它的长度
    return memoryview(body).nbytes
  except TypeError:
    pass

  if isinstance(body, str):
    return len(body)

  return None

在确定了是否分块之后, 就可以把正文发出去了. 如果body是一个可读文件的话, 就调用_read_readable方法把它封装为一个生成器:

def _send_body(self, message_body: Union[str, bytes, bytearray, Iterable, io.IOBase], encode_chunked: bool) -> None:
  if hasattr(message_body, 'read'):
    chunks = self._read_readable(message_body)
  else:
    try:
      memoryview(message_body)
    except TypeError:
      try:
        chunks = iter(message_body)
      except TypeError:
        raise TypeError(
          f'message_body should be a bytes-like object or an iterable, got {repr(type(message_body))}')
    else:
      # 如果是字节类型的,通过一次迭代把它发出去
      chunks = (message_body,)

  for chunk in chunks:
    if not chunk:
      continue

    if encode_chunked:
      chunk = f'{len(chunk):X}\r\n'.encode('ascii') + chunk + b'\r\n'
    self.send(chunk)

  if encode_chunked:
    self.send(b'0\r\n\r\n')


def _read_readable(self, readable: io.IOBase) -> Generator[bytes, None, None]:
  need_encode = False
  if isinstance(readable, io.TextIOBase):
    need_encode = True
  while True:
    data_block = readable.read(self.block_size)
    if not data_block:
      break
    if need_encode:
      data_block = data_block.encode('utf-8')
    yield data_block

二. 获取响应数据

HTTP响应报文的格式与请求报文大同小异, 它大致是这样的:

python用700行代码实现http客户端

因此, 我们只要用HTTPConnection的socket对象读取服务器发送的数据, 然后按照上面的格式对数据进行解析就行了.

1. HTTPResponse类

我们首先定义一个简单的HTTPResponse类. 它的属性大致上就是socket的文件对象以及一些请求的信息等等, 调用它的begin方法来解析响应行和响应头的数据, 然后调用read方法读取响应正文:

class HTTPResponse:

  def __init__(self, sock: socket.socket, method: str = None) -> None:
    self.fp = sock.makefile('rb')
    self._method = method
    self.headers = None
    self.version = _UNKNOWN
    self.status = _UNKNOWN
    self.reason = _UNKNOWN
    self.chunked = _UNKNOWN
    self.chunk_left = _UNKNOWN
    self.length = _UNKNOWN
    self.will_close = _UNKNOWN

  def begin(self) -> None:
    ...

  def read(self, amount: int = None) -> bytes:
    ...

2. 解析状态行

状态行的解析比较简单, 我们只需要读取响应的第一行数据, 然后把它解析为HTTP协议版本,状态码和原因短语三部分就行了:

def _read_status(self) -> Tuple[str, int, str]:
  line = str(self._read_line(), 'latin-1')
  if not line:
    raise RemoteDisconnected('Remote end closed connection without response')
  try:
    version, status, reason = line.split(None, 2)
  except ValueError:
    # reason只是给人看的, 一般和status对应, 所以它有可能不存在
    try:
      version, status = line.split(None, 1)
      reason = ''
    except ValueError:
      version, status, reason = '', '', ''
  if not version.startswith('HTTP/'):
    self._close_conn()
    raise BadStatusLine(line)

  try:
    status = int(status)
    if status < 100 or status > 999:
      raise BadStatusLine(line)
  except ValueError:
    raise BadStatusLine(line)
  return version, status, reason.strip()

如果状态码为100, 则客户端需要解析多个响应状态行. 它的原理是这样的: 在请求数据过大的时候, 有的客户端会先不发送请求数据, 而是先在header中添加一个Expect: 100-continue, 如果服务器愿意接收数据, 会返回100的状态码, 这时候客户端再把数据发过去. 因此, 如果读取到100的状态码, 那么后面往往还会收到一个正式的响应数据, 应该继续读取响应头. 这部分的代码如下:

def begin(self) -> None:
  while True:
    version, status, reason = self._read_status()
    if status != HTTPStatus.CONTINUE:
      break
    # 跳过100状态码部分的响应头
    while True:
      skip = self._read_line().strip()
      if not skip:
        breakself.status = status
  self.reason = reason
  if version in ('HTTP/1.0', 'HTTP/0.9'):
    self.version = 10
  elif version.startswith('HTTP/1.'):
    self.version = 11
  else:
    # HTTP2还没研究, 这里就不写了
    raise UnknownProtocol(version)

  ...

3. 解析响应头

解析响应头比响应行还要简单. 因为每个header字段占一行, 我们只需要一直调用read_line方法读取字段, 直到读完header为止就行了.

def _parse_header(self) -> None:
  headers = {}
  while True:
    line = self._read_line()
    if len(headers) > _MAX_HEADERS:
      raise HTTPException('got more than %d headers' % _MAX_HEADERS)
    if line in _EMPTY_LINE:
      break
    line = line.decode('latin-1')
    i = line.find(':')
    if i == -1:
      raise BadHeaderLine(line)
    # 这里默认没有重名的情况
    key, value = line[:i].lower(), line[i + 1:].strip()
    headers[key] = value
  self.headers = headers

4. 接收响应正文

在接收响应正文之前, 首先要确定它的传输方式和长度:

def _set_chunk(self) -> None:
  transfer_encoding = self.get_header('transfer-encoding')
  if transfer_encoding and transfer_encoding.lower() == 'chunked':
    self.chunked = True
    self.chunk_left = None
  else:
    self.chunked = False


def _set_length(self) -> None:
  # 首先要知道数据是否是分块传输的
  if self.chunked == _UNKNOWN:
    self._set_chunk()

  # 如果状态码是1xx或者204(无响应内容)或者304(使用上次缓存的内容),则没有响应正文
  # 如果这是个HEAD请求,那么也不能有响应正文
  if (self.status == HTTPStatus.NO_CONTENT or
      self.status == HTTPStatus.NOT_MODIFIED or
      100 <= self.status < 200 or
      self._method == 'HEAD'):
    self.length = 0
    return

  length = self.get_header('content-length')
  if length and not self.chunked:
    try:
      self.length = int(length)
    except ValueError:
      self.length = None
    else:
      if self.length < 0:
        self.length = None
  else:
    self.length = None

 

然后, 我们实现一个read方法, 从body中读取指定大小的数据:

def read(self, amount: int = None) -> bytes:
  if self.is_closed():
    return b''
  if self._method == 'HEAD':
    self.close()
    return b''
  if amount is None:
    return self._read_all()
  return self._read_amount(amount)

如果没有指定需要的数据大小, 就默认读取所有数据:

def _read_all(self) -> bytes:
  if self.chunked:
    return self._read_all_chunk()
  if self.length is None:
    s = self.fp.read()
  else:
    try:
      s = self._read_bytes(self.length)
    except IncompleteRead:
      self.close()
      raise
    self.length = 0
  self.close()
  return s


def _read_all_chunk(self) -> bytes:
  assert self.chunked != _UNKNOWN
  value = []
  try:
    while True:
      chunk = self._read_chunk()
      if chunk is None:
        break
      value.append(chunk)
    return b''.join(value)
  except IncompleteRead:
    raise IncompleteRead(b''.join(value))


def _read_chunk(self) -> Optional[bytes]:
  try:
    chunk_size = self._read_chunk_size()
  except ValueError:
    raise IncompleteRead(b'')
  if chunk_size == 0:
    self._read_and_discard_trailer()
    self.close()
    return None
  chunk = self._read_bytes(chunk_size)
  # 每块的结尾会有一个\r\n,这里把它读掉
  self._read_bytes(2)
  return chunk


def _read_chunk_size(self) -> int:
  line = self._read_line(error_message='chunk size')
  i = line.find(b';')
  if i >= 0:
    line = line[:i]
  try:
    return int(line, 16)
  except ValueError:
    self.close()
    raise


def _read_and_discard_trailer(self) -> None:
  # chunk的尾部可能会挂一些额外的信息,比如MD5值,过期时间等等,一般会在header中用trailer字段说明
  # 当chunk读完之后调用这个函数, 这些信息就先舍弃掉得了
  while True:
    line = self._read_line(error_message='chunk size')
    if line in _EMPTY_LINE:
      break

否则的话, 就读取部分数据, 如果正好是分块数据的话, 就比较复杂了. 简单来说, 就是用bytearray制造一个所需大小的数组, 然后依次读取chunk把数据往里面填, 直到填满或者没数据为止.  然后用chunk_left记录下当前块剩余的量, 以便下次读取.

def _read_amount(self, amount: int) -> bytes:
  if self.chunked:
    return self._read_amount_chunk(amount)
  if isinstance(self.length, int) and amount > self.length:
    amount = self.length
  container = bytearray(amount)
  n = self.fp.readinto(container)
  if not n and container:
    # 如果读不到字节了,也就可以关了
    self.close()
  elif self.length is not None:
    self.length -= n
    if not self.length:
      self.close()
  return memoryview(container)[:n].tobytes()


def _read_amount_chunk(self, amount: int) -> bytes:
  # 调用这个方法,读取amount大小的chunk类型数据,不足就全部读取
  assert self.chunked != _UNKNOWN
  total_bytes = 0
  container = bytearray(amount)
  mvb = memoryview(container)
  try:
    while True:
      # mvb可以理解为容器的空的那一部分
      # 这里一直调用_full_readinto把数据填进去,让mvb越来越小,同时记录填入的量
      # 等没数据或者当前数据足够把mvb填满之后,跳出循环
      chunk_left = self._get_chunk_left()
      if chunk_left is None:
        break
      if len(mvb) <= chunk_left:
        n = self._full_readinto(mvb)
        self.chunk_left = chunk_left - n
        total_bytes += n
        break
      temp_mvb = mvb[:chunk_left]
      n = self._full_readinto(temp_mvb)
      mvb = mvb[n:]
      total_bytes += n
      self.chunk_left = 0

  except IncompleteRead:
    raise IncompleteRead(bytes(container[:total_bytes]))

  return memoryview(container)[:total_bytes].tobytes()


def _full_readinto(self, container: memoryview) -> int:
  # 返回读取的量.如果没能读满,这个方法会报警
  amount = len(container)
  n = self.fp.readinto(container)
  if n < amount:
    raise IncompleteRead(bytes(container[:n]), amount - n)
  return n


def _get_chunk_left(self) -> Optional[int]:
  # 如果当前块读了一半,那么直接返回self.chunk_left就行了
  # 否则,有三种情况
  # 1). chunk_left为None,说明body压根没开始读,于是返回当前这一整块的长度
  # 2). chunk_left为0,说明这块读完了,于是返回下一块的长度
  # 3). body数据读完了,返回None,顺便做好善后工作
  chunk_left = self.chunk_left
  if not chunk_left:
    if chunk_left == 0:
      # 如果剩余零,说明上一块已经读完了,这里把\r\n读掉
      # 如果是None,就说明chunk压根没开始读
      self._read_bytes(2)
    try:
      chunk_left = self._read_chunk_size()
    except ValueError:
      raise IncompleteRead(b'')
    if chunk_left == 0:
      self._read_and_discard_trailer()
      self.close()
      chunk_left = None
    self.chunk_left = chunk_left
  return chunk_left

三. 复用TCP连接

HTTP通信本质上是基于TCP连接发送和接收HTTP请求和响应, 因此, 只要TCP连接不断开, 我们就可以继续用它进行HTTP请求, 这样就避免了创建和销毁TCP连接产生的消耗.

python用700行代码实现http客户端

1. 判断连接是否会断开

在下面几种情况中, 服务端会自动断开连接:

  • HTTP协议小于1.1且没有在头部设置了keep-alive
  • HTTP协议大于等于1.1但是在头部设置了connection: close
  • 数据没有分块传输, 也没有说明数据的长度, 这种情况下, 服务器一般会在发送完成后断开连接, 让客户端知道数据发完了

根据上面列出来的几种情况, 通过下面的代码来判断连接是否会断开:

def _check_close(self) -> bool:
  conn = self.get_header('connection')

  if not self.chunked and self.length is None:
    return True

  if self.version == 11:
    if conn and 'close' in conn.lower():
      return True
    return False
  else:
    if self.headers.get('keep-alive'):
      return False

    if conn and 'keep-alive' in conn.lower():
      return False

  return True

2. 正确地关闭HTTPResponse对象

由于TCP连接的复用, 一个HTTPConnection可以产生多个HTTPResponse对象, 而这些对象在同一个TCP连接上, 会共用这个连接的读缓冲区. 这就导致, 如果上一个HTTPResponse对象没有把它的那部分数据读完, 就会对下一个响应产生影响.

另一方面来看, 我们也需要及时地关闭与这个TCP关联的文件对象来避免占用资源. 因此, 我们定义如下的close方法关闭一个HTTPResponse对象:

def close(self) -> None:
  if self.is_closed():
    return
  fp = self.fp
  self.fp = None
  fp.close()


def is_closed(self) -> bool:
  return self.fp is None

用户调用HTTPResponse对象的read方法, 把缓冲区数据读完之后, 就会自动调用close方法(具体实现见上一章的第四节: 读取响应数据这部分). 因此, 在获取下一个响应数据之前, 我们只需要调用这个对象的is_closed方法, 就能判断读缓冲区是否已经读完, 能否继续接收响应了.

3. HTTP请求的生命周期

不使用管道机制的话, 不同的HTTP请求必须按次序进行, 相互之间不能重叠. 基于这个原因, 我们为HTTPConnection对象设置IDLE, REQ_STARTED和REQ_SENT三种状态, 一个完整的请求应该经历这几种状态:

python用700行代码实现http客户端

根据上面的流程, 对HTTPConnection中对应的方法进行修改:

def get_response(self) -> HTTPResponse:
  if self._response and self._response.is_closed():
    self._response = None
  if self._state != _CS_REQ_SENT or self._response:
    raise ResponseNotReady(self._state)

  response = HTTPResponse(self.sock, method=self._method)

  try:
    try:
      response.begin()
    except ConnectionError:
      self.close()
      raise
    assert response.will_close != _UNKNOWN
    self._state = _CS_IDLE

    if response.will_close:
      self.close()
    else:
      self._response = response

    return response
  except Exception as _:
    response.close()
    raise

def put_request(self, method: str, url: str) -> None:
  # 调用这个函数开始新一轮的请求,它负责写好请求行输出到缓存里面去
  # 调用它的前提是当前处于空闲状态
  # 如果之前的response还在并且已结束,会自动把它消除掉
  if self._response and self._response.is_closed():
    self._response = None

  if self._state == _CS_IDLE:
    self._state = _CS_REQ_STARTED
  else:
    raise CannotSendRequest(self._state)

  ...

def put_header(self, header: Union[bytes, str], value: Union[bytes, str, int]) -> None:
  if self._state != _CS_REQ_STARTED:
    raise CannotSendHeader()

  ...

def end_headers(self, message_body=None, encode_chunked=False) -> None:
  if self._state == _CS_REQ_STARTED:
    self._state = _CS_REQ_SENT
  else:
    raise CannotSendHeader()
  ...

需要注意的是, 如果第二个请求已经进入到获取响应的阶段了, 而上一个请求的响应还没关闭, 那么就应该直接报错, 否则读取到的会是上一个请求剩余的响应部分数据, 导致解析响应出现问题.

事实上, HTTP1.1开始支持管道化技术, 也就是一次提交多个HTTP请求, 然后等待响应, 而不是在接收到上一个请求的响应后, 才发送后面的请求.
基于这种处理模式, 管道化技术理论上可以减少IO时间的损耗, 提升效率, 不过, 需要服务端的支持, 而且会增加程序的复杂程度, 这里就不实现了.

四. 总结

1. 完整代码

HTTPConnection的完整代码如下:

class HTTPConnection:
  default_port = 80
  _http_vsn = 11
  _http_vsn_str = 'HTTP/1.1'

  def __init__(self, host: str, port: int = None) -> None:
    self.sock = None
    self._buffer = []
    self.host = host
    self.port = port if port is not None else self.default_port
    self._state = _CS_IDLE
    self._response = None
    self._method = None
    self.block_size = 8192

  def request(self, method: str, url: str, headers: dict = None, body: Union[io.IOBase, Iterable] = None,
        encode_chunked: bool = False) -> None:
    self.put_request(method, url)
    headers = headers or {}
    header_names = frozenset(k.lower() for k in headers.keys())
    if 'host' not in header_names:
      self._add_host(url)

    if 'content-length' not in header_names:
      if 'transfer-encoding' not in header_names:
        encode_chunked = False
        content_length = self._get_content_length(body, method)
        if content_length is None:
          if body is not None:
            encode_chunked = True
            self.put_header('Transfer-Encoding', 'chunked')
        else:
          self.put_header('Content-Length', str(content_length))
      else:
        # 如果设置了transfer-encoding,则根据用户给的encode_chunked参数决定是否分块
        pass
    else:
      # 只要给了content-length,那么一定不是分块传输
      encode_chunked = False

    for hdr, value in headers.items():
      self.put_header(hdr, value)
    if isinstance(body, str):
      body = _encode(body)
    self.end_headers(body, encode_chunked=encode_chunked)

  def send(self, data: bytes) -> None:
    if self.sock is None:
      self.connect()

    self.sock.sendall(data)

  def get_response(self) -> HTTPResponse:
    if self._response and self._response.is_closed():
      self._response = None
    if self._state != _CS_REQ_SENT or self._response:
      raise ResponseNotReady(self._state)

    response = HTTPResponse(self.sock, method=self._method)

    try:
      try:
        response.begin()
      except ConnectionError:
        self.close()
        raise
      assert response.will_close != _UNKNOWN
      self._state = _CS_IDLE

      if response.will_close:
        self.close()
      else:
        self._response = response

      return response
    except Exception as _:
      response.close()
      raise

  def connect(self) -> None:
    self.sock = socket.create_connection((self.host, self.port))

  def close(self) -> None:
    self._state = _CS_IDLE
    try:
      sock = self.sock
      if sock:
        self.sock = None
        sock.close()
    finally:
      response = self._response
      if response:
        self._response = None
        response.close()

  def put_request(self, method: str, url: str) -> None:
    # 调用这个函数开始新一轮的请求,它负责写好请求行输出到缓存里面去
    # 调用它的前提是当前处于空闲状态
    # 如果之前的response还在并且已结束,会自动把它消除掉
    if self._response and self._response.is_closed():
      self._response = None

    if self._state == _CS_IDLE:
      self._state = _CS_REQ_STARTED
    else:
      raise CannotSendRequest(self._state)

    self._method = method

    url = url or '/'

    request = f'{method} {url} {self._http_vsn_str}'
    self._output(request)

  def put_header(self, header: Union[bytes, str], value: Union[bytes, str, int]) -> None:
    if self._state != _CS_REQ_STARTED:
      raise CannotSendHeader()

    if hasattr(header, 'encode'):
      header = header.encode('ascii')

    if hasattr(value, 'encode'):
      value = value.encode('latin-1')
    elif isinstance(value, int):
      value = str(value).encode('ascii')

    header = header + b': ' + value
    self._output(header)

  def end_headers(self, message_body=None, encode_chunked=False) -> None:
    if self._state == _CS_REQ_STARTED:
      self._state = _CS_REQ_SENT
    else:
      raise CannotSendHeader()
    self._send_output(message_body, encode_chunked=encode_chunked)

  def _add_host(self, url: str) -> None:
    # 所有HTTP / 1.1请求报文中必须包含一个Host头字段
    # 如果用户没给,就调用这个函数来生成
    netloc = ''
    if url.startswith('http'):
      nil, netloc, nil, nil, nil = urlsplit(url)

    if netloc:
      try:
        netloc_enc = netloc.encode('ascii')
      except UnicodeEncodeError:
        netloc_enc = netloc.encode('idna')
      self.put_header('Host', netloc_enc)
    else:
      host = self.host
      port = self.port

      try:
        host_enc = host.encode('ascii')
      except UnicodeEncodeError:
        host_enc = host.encode('idna')

      # 对IPv6的地址进行额外处理
      if host.find(':') >= 0:
        host_enc = b'[' + host_enc + b']'

      if port == self.default_port:
        self.put_header('Host', host_enc)
      else:
        host_enc = host_enc.decode('ascii')
        self.put_header('Host', f'{host_enc}:{port}')

  def _output(self, s: Union[str, bytes]) -> None:
    # 将数据添加到缓冲区
    if hasattr(s, 'encode'):
      s = s.encode('latin-1')
    self._buffer.append(s)

  def _send_output(self, message_body=None, encode_chunked=False) -> None:
    # 发送并清空缓冲数据.然后,如果有请求正文,就也顺便发送

    self._buffer.extend((b'', b''))
    msg = b'\r\n'.join(self._buffer)
    self._buffer.clear()
    self.send(msg)

    if message_body is not None:
      self._send_body(message_body, encode_chunked)

  def _send_body(self, message_body: Union[bytes, str, bytearray, Iterable, io.IOBase], encode_chunked: bool) -> None:
    if hasattr(message_body, 'read'):
      chunks = self._read_readable(message_body)
    else:
      try:
        memoryview(message_body)
      except TypeError:
        try:
          chunks = iter(message_body)
        except TypeError:
          raise TypeError(
            f'message_body should be a bytes-like object or an iterable, got {repr(type(message_body))}')
      else:
        # 如果是字节类型的,通过一次迭代把它发出去
        chunks = (message_body,)

    for chunk in chunks:
      if not chunk:
        continue

      if encode_chunked:
        chunk = f'{len(chunk):X}\r\n'.encode('ascii') + chunk + b'\r\n'
      self.send(chunk)

    if encode_chunked:
      self.send(b'0\r\n\r\n')

  def _read_readable(self, readable: io.IOBase) -> Generator[bytes, None, None]:
    need_encode = False
    if isinstance(readable, io.TextIOBase):
      need_encode = True
    while True:
      data_block = readable.read(self.block_size)
      if not data_block:
        break
      if need_encode:
        data_block = data_block.encode('utf-8')
      yield data_block

  @staticmethod
  def _get_content_length(body: Union[str, bytes, bytearray, Iterable, io.IOBase], method: str) -> Optional[int]:
    if body is None:
      # PUT,POST,PATCH三个方法默认是有body的
      if method.upper() in _METHODS_EXPECTING_BODY:
        return 0
      else:
        return None

    if hasattr(body, 'read'):
      return None

    try:
      # 对于bytes或者bytearray格式的数据,通过memoryview获取它的长度
      return memoryview(body).nbytes
    except TypeError:
      pass

    if isinstance(body, str):
      return len(body)

    return None

HTTPResponse的完整代码如下:

class HTTPResponse:

  def __init__(self, sock: socket.socket, method: str = None) -> None:
    self.fp = sock.makefile('rb')
    self._method = method
    self.headers = None
    self.version = _UNKNOWN
    self.status = _UNKNOWN
    self.reason = _UNKNOWN
    self.chunked = _UNKNOWN
    self.chunk_left = _UNKNOWN
    self.length = _UNKNOWN
    self.will_close = _UNKNOWN

  def begin(self) -> None:
    if self.headers is not None:
      return
    self._parse_status_line()
    self._parse_header()
    self._set_chunk()
    self._set_length()
    self.will_close = self._check_close()

  def _read_line(self, limit: int = _MAX_LINE + 1, error_message: str = '') -> bytes:
    # 注意,这个方法默认不去除line尾部的\r\n
    line = self.fp.readline(limit)
    if len(line) > _MAX_LINE:
      raise LineTooLong(error_message)
    return line

  def _read_bytes(self, amount: int) -> bytes:
    data = self.fp.read(amount)
    if len(data) < amount:
      raise IncompleteRead(data, amount - len(data))
    return data

  def _parse_status_line(self) -> None:
    while True:
      version, status, reason = self._read_status()
      if status != HTTPStatus.CONTINUE:
        break
      while True:
        skip = self._read_line(error_message='header line').strip()
        if not skip:
          break

    self.status = status
    self.reason = reason
    if version in ('HTTP/1.0', 'HTTP/0.9'):
      self.version = 10
    elif version.startswith('HTTP/1.'):
      self.version = 11
    else:
      raise UnknownProtocol(version)

  def _read_status(self) -> Tuple[str, int, str]:
    line = str(self._read_line(error_message='status line'), 'latin-1')
    if not line:
      raise RemoteDisconnected('Remote end closed connection without response')
    try:
      version, status, reason = line.split(None, 2)
    except ValueError:
      # reason只是给人看的, 和status对应, 所以它有可能不存在
      try:
        version, status = line.split(None, 1)
        reason = ''
      except ValueError:
        version, status, reason = '', '', ''
    if not version.startswith('HTTP/'):
      self.close()
      raise BadStatusLine(line)

    try:
      status = int(status)
      if status < 100 or status > 999:
        raise BadStatusLine(line)
    except ValueError:
      raise BadStatusLine(line)
    return version, status, reason.strip()

  def _parse_header(self) -> None:
    headers = {}
    while True:
      line = self._read_line(error_message='header line')
      if len(headers) > _MAX_HEADERS:
        raise HTTPException('got more than %d headers' % _MAX_HEADERS)
      if line in _EMPTY_LINE:
        break
      line = line.decode('latin-1')
      i = line.find(':')
      if i == -1:
        raise BadHeaderLine(line)
      # 这里默认没有重名的情况
      key, value = line[:i].lower(), line[i + 1:].strip()
      headers[key] = value
    self.headers = headers

  def _set_chunk(self) -> None:
    transfer_encoding = self.get_header('transfer-encoding')
    if transfer_encoding and transfer_encoding.lower() == 'chunked':
      self.chunked = True
      self.chunk_left = None
    else:
      self.chunked = False

  def _set_length(self) -> None:
    # 首先要知道数据是否是分块传输的
    if self.chunked == _UNKNOWN:
      self._set_chunk()

    # 如果状态码是1xx或者204(无响应内容)或者304(使用上次缓存的内容),则没有响应正文
    # 如果这是个HEAD请求,那么也不能有响应正文
    assert isinstance(self.status, int)
    if (self.status == HTTPStatus.NO_CONTENT or
        self.status == HTTPStatus.NOT_MODIFIED or
        100 <= self.status < 200 or
        self._method == 'HEAD'):
      self.length = 0
      return

    length = self.get_header('content-length')
    if length and not self.chunked:
      try:
        self.length = int(length)
      except ValueError:
        self.length = None
      else:
        if self.length < 0:
          self.length = None
    else:
      self.length = None

  def _check_close(self) -> bool:
    conn = self.get_header('connection')

    if not self.chunked and self.length is None:
      return True

    if self.version == 11:
      if conn and 'close' in conn.lower():
        return True
      return False
    else:
      if self.headers.get('keep-alive'):
        return False

      if conn and 'keep-alive' in conn.lower():
        return False

    return True

  def close(self) -> None:
    if self.is_closed():
      return
    fp = self.fp
    self.fp = None
    fp.close()

  def is_closed(self) -> bool:
    return self.fp is None

  def read(self, amount: int = None) -> bytes:
    if self.is_closed():
      return b''
    if self._method == 'HEAD':
      self.close()
      return b''
    if amount is None:
      return self._read_all()
    print(amount, amount is None)
    return self._read_amount(amount)

  def _read_all(self) -> bytes:
    if self.chunked:
      return self._read_all_chunk()
    if self.length is None:
      s = self.fp.read()
    else:
      try:
        s = self._read_bytes(self.length)
      except IncompleteRead:
        self.close()
        raise
      self.length = 0
    self.close()
    return s

  def _read_all_chunk(self) -> bytes:
    assert self.chunked != _UNKNOWN
    value = []
    try:
      while True:
        chunk = self._read_chunk()
        if chunk is None:
          break
        value.append(chunk)
      return b''.join(value)
    except IncompleteRead:
      raise IncompleteRead(b''.join(value))

  def _read_chunk(self) -> Optional[bytes]:
    try:
      chunk_size = self._read_chunk_size()
    except ValueError:
      raise IncompleteRead(b'')
    if chunk_size == 0:
      self._read_and_discard_trailer()
      self.close()
      return None
    chunk = self._read_bytes(chunk_size)
    # 每块的结尾会有一个\r\n,这里把它读掉
    self._read_bytes(2)
    return chunk

  def _read_chunk_size(self) -> int:
    line = self._read_line(error_message='chunk size')
    i = line.find(b';')
    if i >= 0:
      line = line[:i]
    try:
      return int(line, 16)
    except ValueError:
      self.close()
      raise

  def _read_and_discard_trailer(self) -> None:
    # chunk的尾部可能会挂一些额外的信息,比如MD5值,过期时间等等,一般会在header中用trailer字段说明
    # 当chunk读完之后调用这个函数, 这些信息就先舍弃掉得了
    while True:
      line = self._read_line(error_message='chunk size')
      if line in _EMPTY_LINE:
        break

  def _read_amount(self, amount: int) -> bytes:
    if self.chunked:
      return self._read_amount_chunk(amount)
    if isinstance(self.length, int) and amount > self.length:
      amount = self.length
    container = bytearray(amount)
    n = self.fp.readinto(container)
    if not n and container:
      # 如果读不到字节了,也就可以关了
      self.close()
    elif self.length is not None:
      self.length -= n
      if not self.length:
        self.close()
    return memoryview(container)[:n].tobytes()

  def _read_amount_chunk(self, amount: int) -> bytes:
    # 调用这个方法,读取amount大小的chunk类型数据,不足就全部读取
    assert self.chunked != _UNKNOWN
    total_bytes = 0
    container = bytearray(amount)
    mvb = memoryview(container)
    try:
      while True:
        # mvb可以理解为容器的空的那一部分
        # 这里一直调用_full_readinto把数据填进去,让mvb越来越小,同时记录填入的量
        # 等没数据或者当前数据足够把mvb填满之后,跳出循环
        chunk_left = self._get_chunk_left()
        if chunk_left is None:
          break
        if len(mvb) <= chunk_left:
          n = self._full_readinto(mvb)
          self.chunk_left = chunk_left - n
          total_bytes += n
          break
        temp_mvb = mvb[:chunk_left]
        n = self._full_readinto(temp_mvb)
        mvb = mvb[n:]
        total_bytes += n
        self.chunk_left = 0

    except IncompleteRead:
      raise IncompleteRead(bytes(container[:total_bytes]))

    return memoryview(container)[:total_bytes].tobytes()

  def _full_readinto(self, container: memoryview) -> int:
    # 返回读取的量.如果没能读满,这个方法会报警
    amount = len(container)
    n = self.fp.readinto(container)
    if n < amount:
      raise IncompleteRead(bytes(container[:n]), amount - n)
    return n

  def _get_chunk_left(self) -> Optional[int]:
    # 如果当前块读了一半,那么直接返回self.chunk_left就行了
    # 否则,有三种情况
    # 1). chunk_left为None,说明body压根没开始读,于是返回当前这一整块的长度
    # 2). chunk_left为0,说明这块读完了,于是返回下一块的长度
    # 3). body数据读完了,返回None,顺便做好善后工作
    chunk_left = self.chunk_left
    if not chunk_left:
      if chunk_left == 0:
        # 如果剩余零,说明上一块已经读完了,这里把\r\n读掉
        # 如果是None,就说明chunk压根没开始读
        self._read_bytes(2)
      try:
        chunk_left = self._read_chunk_size()
      except ValueError:
        raise IncompleteRead(b'')
      if chunk_left == 0:
        self._read_and_discard_trailer()
        self.close()
        chunk_left = None
      self.chunk_left = chunk_left
    return chunk_left

  def get_header(self, name, default: str = None) -> Optional[str]:
    if self.headers is None:
      raise ResponseNotReady()
    return self.headers.get(name, default)

  @property
  def info(self) -> str:
    return repr(self.headers)

这两个类应该放到同一个py文件中, 同时这个文件内还有其他一些辅助性质的代码:

import io
import socket
from typing import Generator, Iterable, Optional, Tuple, Union
from urllib.parse import urlsplit

_CS_IDLE = 'Idle'
_CS_REQ_STARTED = 'Request-started'
_CS_REQ_SENT = 'Request-sent'

_METHODS_EXPECTING_BODY = {'PATCH', 'POST', 'PUT'}
_UNKNOWN = 'UNKNOWN'

_MAX_LINE = 65536
_MAX_HEADERS = 100

_EMPTY_LINE = (b'\r\n', b'\n', b'')


class HTTPStatus:
  CONTINUE = 100
  SWITCHING_PROTOCOLS = 101
  PROCESSING = 102
  OK = 200
  CREATED = 201
  ACCEPTED = 202
  NON_AUTHORITATIVE_INFORMATION = 203
  NO_CONTENT = 204
  RESET_CONTENT = 205
  PARTIAL_CONTENT = 206
  MULTI_STATUS = 207
  ALREADY_REPORTED = 208
  IM_USED = 226
  MULTIPLE_CHOICES = 300
  MOVED_PERMANENTLY = 301
  FOUND = 302
  SEE_OTHER = 303
  NOT_MODIFIED = 304
  USE_PROXY = 305
  TEMPORARY_REDIRECT = 307
  PERMANENT_REDIRECT = 308
  BAD_REQUEST = 400
  UNAUTHORIZED = 401
  PAYMENT_REQUIRED = 402
  FORBIDDEN = 403
  NOT_FOUND = 404
  METHOD_NOT_ALLOWED = 405
  NOT_ACCEPTABLE = 406
  PROXY_AUTHENTICATION_REQUIRED = 407
  REQUEST_TIMEOUT = 408
  CONFLICT = 409
  GONE = 410
  LENGTH_REQUIRED = 411
  PRECONDITION_FAILED = 412
  REQUEST_ENTITY_TOO_LARGE = 413
  REQUEST_URI_TOO_LONG = 414
  UNSUPPORTED_MEDIA_TYPE = 415
  REQUESTED_RANGE_NOT_SATISFIABLE = 416
  EXPECTATION_FAILED = 417
  MISDIRECTED_REQUEST = 421
  UNPROCESSABLE_ENTITY = 422
  LOCKED = 423
  FAILED_DEPENDENCY = 424
  UPGRADE_REQUIRED = 426
  PRECONDITION_REQUIRED = 428
  TOO_MANY_REQUESTS = 429
  REQUEST_HEADER_FIELDS_TOO_LARGE = 431
  UNAVAILABLE_FOR_LEGAL_REASONS = 451
  INTERNAL_SERVER_ERROR = 500
  NOT_IMPLEMENTED = 501
  BAD_GATEWAY = 502
  SERVICE_UNAVAILABLE = 503
  GATEWAY_TIMEOUT = 504
  HTTP_VERSION_NOT_SUPPORTED = 505
  VARIANT_ALSO_NEGOTIATES = 506
  INSUFFICIENT_STORAGE = 507
  LOOP_DETECTED = 508
  NOT_EXTENDED = 510
  NETWORK_AUTHENTICATION_REQUIRED = 511


class HTTPResponse:
  ...


class HTTPConnection:
  ...


def _encode(data: str, encoding: str = 'latin-1', name: str = 'data') -> bytes:
  # 给请求正文等不知道能怎么转码的东西转码时用这个,默认使用latin-1编码
  # 它的好处是,转码失败后能抛出详细的错误信息,一目了然
  try:
    return data.encode(encoding)
  except UnicodeEncodeError as err:
    raise UnicodeEncodeError(
      err.encoding,
      err.object,
      err.start,
      err.end,
      "{} ({:.20!r}) is not valid {}. Use {}.encode('utf-8') if you want to send it encoded in UTF-8.".format(
        name.title(), data[err.start:err.end], encoding, name)
    ) from None


class HTTPException(Exception):
  pass


class ImproperConnectionState(HTTPException):
  pass


class CannotSendRequest(ImproperConnectionState):
  pass


class CannotSendHeader(ImproperConnectionState):
  pass


class CannotCloseStream(ImproperConnectionState):
  pass


class ResponseNotReady(ImproperConnectionState):
  pass


class LineTooLong(HTTPException):
  def __init__(self, line_type):
    HTTPException.__init__(self, 'got more than %d bytes when reading %s'
                % (_MAX_LINE, line_type))


class BadStatusLine(HTTPException):
  def __init__(self, line):
    if not line:
      line = repr(line)
    self.args = line,
    self.line = line


class BadHeaderLine(HTTPException):
  def __init__(self, line):
    if not line:
      line = repr(line)
    self.args = line,
    self.line = line


class RemoteDisconnected(ConnectionResetError, BadStatusLine):
  def __init__(self, *args, **kwargs):
    BadStatusLine.__init__(self, '')
    ConnectionResetError.__init__(self, *args, **kwargs)


class UnknownProtocol(HTTPException):
  def __init__(self, version):
    self.args = version,
    self.version = version


class UnknownTransferEncoding(HTTPException):
  pass


class IncompleteRead(HTTPException):
  def __init__(self, partial, expected=None):
    self.args = partial,
    self.partial = partial
    self.expected = expected

  def __repr__(self):
    if self.expected is not None:
      e = f', {self.expected} more expected'
    else:
      e = ''
    return f'{self.__class__.__name__}({len(self.partial)} bytes read{e})'

  __str__ = object.__str__

2. 需要注意的点

总的来说, 本文的内容不算复杂, 毕竟HTTP属于不难理解, 但知识点很多很杂的类型. 这里把本文中一些需要注意的点总结一下:

  • 请求和响应数据的结构大致相同, 都是状态行+头部+正文, 状态行和头部的每个字段都用一个\r\n分割, 与正文之间用两个分割;
  • 状态行是必须的, 请求头则最少需要host这个字段, 同时为了大家的方便, 你最好也设置一下Accept-encoding和Accept来限制服务器返回给你的数据内容和格式;
  • 正文不是必须的, 特别是对于除了3P(PATCH, POST, PUT)之外的方法来说. 如果你有正文, 你最好在header中使用Content-Length说明正文的长度, 如果是分块发送, 则使用Transfer-Encoding字段说明;
  • 如果对正文使用分块传输, 每块的格式是: 16进制的数据长度+\r\n+数据+\r\n, 使用0\r\n\r\n来收尾. 收尾之后, 你还可以放一个trailer, 里面放数据的MD5值或者过期时间什么的, 这时候最好在header中设置trailer字段;
  • 在一个请求的生命周期完成后, TCP连接是否会断开取决于三点: 响应数据的HTTP版本, 响应头中的Connection和Keep-Alive字段, 是否知道响应正文的长度;
  • 最最重要的一点, HTTP协议只是一个约定而非限制, 这就和矿泉水的建议零售价差不多, 你可以选择遵守, 也可以不遵守, 后果自负. 

3. 结果测试

首先, 我们用tornado写一个简单的服务器, 它会显示客户端的地址和接口;

import tornado.web
import tornado.ioloop

class IndexHandler(tornado.web.RequestHandler):

  def get(self) -> None:
    print(f'new connection from {self.request.connection.context.address}')
    self.write('hello world')


app = tornado.web.Application([(r'/', IndexHandler)])
app.listen(8888)
tornado.ioloop.IOLoop.current().start()

然后, 使用我们刚写好的客户端进行测试:

from client import HTTPConnection


def fetch(conn: HTTPConnection, url: str = '') -> None:
  conn.request('GET', url)
  res = conn.get_response()
  print(res.read())


connection = HTTPConnection('127.0.0.1', 8888)
for i in range(10):
  fetch(connection)

结果如下:

python用700行代码实现http客户端

以上就是python用700行代码实现http客户端的详细内容,更多关于python http客户端的资料请关注三水点靠木其它相关文章!

Python 相关文章推荐
从零学python系列之教你如何根据图片生成字符画
May 23 Python
python读取二进制mnist实例详解
May 31 Python
django在接受post请求时显示403forbidden实例解析
Jan 25 Python
Python实现矩阵相乘的三种方法小结
Jul 26 Python
Django实现auth模块下的登录注册与注销功能
Oct 10 Python
python 两个数据库postgresql对比
Oct 21 Python
使用pytorch搭建AlexNet操作(微调预训练模型及手动搭建)
Jan 18 Python
python图形开发GUI库pyqt5的基本使用方法详解
Feb 14 Python
python+selenium+Chrome options参数的使用
Mar 18 Python
使用scrapy ImagesPipeline爬取图片资源的示例代码
Sep 28 Python
Django集成MongoDB实现过程解析
Dec 01 Python
python之json文件转xml文件案例讲解
Aug 07 Python
python批量生成身份证号到Excel的两种方法实例
Jan 14 #Python
Django扫码抽奖平台的配置过程详解
Jan 14 #Python
如何用python实现一个HTTP连接池
Jan 14 #Python
如何用python写个模板引擎
Jan 14 #Python
opencv python 对指针仪表读数识别的两种方式
Jan 14 #Python
详解如何使用Pytest进行自动化测试
Jan 14 #Python
matplotlib对象拾取事件处理的实现
Jan 14 #Python
You might like
PHP加Nginx实现动态裁剪图片方案
2014/03/10 PHP
JavaScript中的Screen屏幕对象
2008/01/16 Javascript
Ajax执行顺序流程及回调问题分析
2012/12/10 Javascript
js冒泡法和数组转换成字符串示例代码
2013/08/14 Javascript
JQuery Mobile实现导航栏和页脚
2016/03/09 Javascript
Jquery删除css属性的简单方法
2016/12/04 Javascript
Bootstrap BootstrapDialog使用详解
2017/02/17 Javascript
vue修改对象的属性值后页面不重新渲染的实例
2018/08/09 Javascript
vue-cli3项目升级到vue-cli4 的方法总结
2020/03/19 Javascript
通过angular CDK实现页面元素拖放的步骤详解
2020/07/01 Javascript
[14:03]2017DOTA2亚洲邀请赛开幕式:12神兵演绎水墨中华
2017/04/01 DOTA
python创建和使用字典实例详解
2013/11/01 Python
Pyramid Mako模板引入helper对象的步骤方法
2013/11/27 Python
Python open()文件处理使用介绍
2014/11/30 Python
django+js+ajax实现刷新页面的方法
2017/05/22 Python
Python 2.x如何设置命令执行的超时时间实例
2017/10/19 Python
基于Django的ModelForm组件(详解)
2017/12/07 Python
使用python为mysql实现restful接口
2018/01/05 Python
Python利用字典将两个通讯录文本合并为一个文本实例
2018/01/16 Python
Django实现组合搜索的方法示例
2018/01/23 Python
Python3的介绍、安装和命令行的认识(推荐)
2018/10/20 Python
使用python itchat包爬取微信好友头像形成矩形头像集的方法
2019/02/21 Python
python实现密码强度校验
2020/03/18 Python
用Python自动清理系统垃圾的实现
2021/01/18 Python
利用CSS3的border-radius绘制太极及爱心图案示例
2016/05/17 HTML / CSS
HTML5 Canvas绘制圆点虚线实例
2015/01/01 HTML / CSS
全球知名提供各类营养保健品的零售商:Vitamin Shoppe
2016/10/09 全球购物
2019年分享net面试的经历和题目
2016/08/07 面试题
简述进程的启动、终止的方式以及如何进行进程的查看
2013/07/12 面试题
Servlet面试题库
2015/07/18 面试题
重阳节登山活动方案
2014/02/03 职场文书
森林防火工作方案
2014/02/14 职场文书
2016年教师节特级教师获奖感言
2015/12/09 职场文书
合作合同协议书
2016/03/21 职场文书
如何写好竞聘报告
2019/04/03 职场文书
Java elasticsearch安装以及部署教程
2021/06/28 Java/Android