python实现知乎高颜值图片爬取


Posted in Python onAugust 12, 2019

导入相关包

import time
import pydash
import base64
import requests
from lxml import etree
from aip import AipFace
from pathlib import Path

百度云 人脸检测 申请信息

#唯一必须填的信息就这三行
APP_ID = "xxxxxxxx"
API_KEY = "xxxxxxxxxxxxxxxx"
SECRET_KEY = "xxxxxxxxxxxxxxxx"
# 过滤颜值阈值,存储空间大的请随意
BEAUTY_THRESHOLD = 55
AUTHORIZATION = "oauth c3cef7c66a1843f8b3a9e6a1e3160e20"
# 如果权限错误,浏览器中打开知乎,在开发者工具复制一个,无需登录
# 建议最好换一个,因为不知道知乎的反爬虫策略,如果太多人用同一个,可能会影响程序运行

以下皆无需改动

# 每次请求知乎的讨论列表长度,不建议设定太长,注意节操
LIMIT = 5
# 这是话题『美女』的 ID,其是『颜值』(20013528)的父话题
SOURCE = "19552207"

爬虫假装下正常浏览器请求

USER_AGENT = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.5 Safari/534.55.3"
REFERER = "https://www.zhihu.com/topic/%s/newest" % SOURCE
# 某话题下讨论列表请求 url
BASE_URL = "https://www.zhihu.com/api/v4/topics/%s/feeds/timeline_activity"
# 初始请求 url 附带的请求参数
URL_QUERY = "?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.comment_count&limit=" + str(
  LIMIT)

HEADERS = {
  "User-Agent": USER_AGENT,
  "Referer": REFERER,
  "authorization": AUTHORIZATION

指定 url,获取对应原始内容 / 图片

def fetch_image(url):
  try:
    response = requests.get(url, headers=HEADERS)
  except Exception as e:
    raise e
  return response.content

指定 url,获取对应 JSON 返回 / 话题列表

def fetch_activities(url):
  try:
    response = requests.get(url, headers=HEADERS)
  except Exception as e:
    raise e
  return response.json()

处理返回的话题列表

def parser_activities(datums, face_detective):
  for data in datums["data"]:
    target = data["target"]
    if "content" not in target or "question" not in target or "author" not in target:
      continue
    html = etree.HTML(target["content"])
    seq = 0
    title = target["question"]["title"]
    author = target["author"]["name"]
    images = html.xpath("//img/@src")
    for image in images:
      if not image.startswith("http"):
        continue
      image_data = fetch_image(image)
      score = face_detective(image_data)
      if not score:
        continue
      name = "{}--{}--{}--{}.jpg".format(score, author, title, seq)
      seq = seq + 1
      path = Path(__file__).parent.joinpath("image").joinpath(name)
      try:
        f = open(path, "wb")
        f.write(image_data)
        f.flush()
        f.close()
        print(path)
        time.sleep(2)
      except Exception as e:
        continue
  if not datums["paging"]["is_end"]:
    return datums["paging"]["next"]
  else:
    return None

初始化颜值检测工具

def init_detective(app_id, api_key, secret_key):
  client = AipFace(app_id, api_key, secret_key)
  options = {"face_field": "age,gender,beauty,qualities"}
  def detective(image):
    image = str(base64.b64encode(image), "utf-8")
    response = client.detect(str(image), "BASE64", options)
    response = response.get("result")
    if not response:
      return
    if (not response) or (response["face_num"] == 0):
      return
    face_list = response["face_list"]
    if pydash.get(face_list, "0.face_probability") < 0.6:
      return
    if pydash.get(face_list, "0.beauty") < BEAUTY_THRESHOLD:
      return
    if pydash.get(face_list, "0.gender.type") != "female":
      return
    score = pydash.get(face_list, "0.beauty")
    return score
  return detective

程序入口

def main():
  face_detective = init_detective(APP_ID, API_KEY, SECRET_KEY)
  url = BASE_URL % SOURCE + URL_QUERY
  while url is not None:
    datums = fetch_activities(url)
    url = parser_activities(datums, face_detective)
    time.sleep(5)
if __name__ == '__main__':
  main()

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
对于Python的框架中一些会话程序的管理
Apr 20 Python
对Python 2.7 pandas 中的read_excel详解
May 04 Python
对python中Json与object转化的方法详解
Dec 31 Python
numpy下的flatten()函数用法详解
May 27 Python
Python with用法:自动关闭文件进程
Jul 10 Python
python命令 -u参数用法解析
Oct 24 Python
python3中pip3安装出错,找不到SSL的解决方式
Dec 12 Python
python 用 xlwings 库 生成图表的操作方法
Dec 22 Python
python 日志 logging模块详细解析
Mar 31 Python
Python可以实现栈的结构吗
May 27 Python
Python3爬虫带上cookie的实例代码
Jul 28 Python
Python控制台输出俄罗斯方块的方法实例
Apr 17 Python
python3 enum模块的应用实例详解
Aug 12 #Python
Python一键查找iOS项目中未使用的图片、音频、视频资源
Aug 12 #Python
django+echart数据动态显示的例子
Aug 12 #Python
Flask框架学习笔记之使用Flask实现表单开发详解
Aug 12 #Python
Flask框架学习笔记之表单基础介绍与表单提交方式
Aug 12 #Python
python内存管理机制原理详解
Aug 12 #Python
Flask框架学习笔记之路由和反向路由详解【图文与实例】
Aug 12 #Python
You might like
开发大型 PHP 项目的方法
2007/01/02 PHP
php的POSIX 函数以及进程测试的深入分析
2013/06/03 PHP
浅析memcache启动以及telnet命令详解
2013/06/28 PHP
基于ThinkPHP+uploadify+upload+PHPExcel 无刷新导入数据
2015/09/23 PHP
PHP各种异常和错误的拦截方法及发生致命错误时进行报警
2016/01/19 PHP
PHP实现网站应用微信登录功能详解
2019/04/11 PHP
Web版彷 Visual Studio 2003 颜色选择器
2007/01/09 Javascript
深入理解JavaScript中的传值与传引用
2013/12/09 Javascript
js中遍历Map对象的简单实例
2016/08/08 Javascript
AngularJS常见过滤器用法实例总结
2017/07/06 Javascript
深入探究angular2 UI组件之primeNG用法
2017/07/26 Javascript
js移动端事件基础及常用事件库详解
2017/08/15 Javascript
Node.js中DNS模块学习总结
2018/02/28 Javascript
Vue实现自定义下拉菜单功能
2018/07/16 Javascript
JavaScript解析机制与闭包原理实例详解
2019/03/08 Javascript
用vscode开发vue应用的方法步骤
2019/05/06 Javascript
Vue实现点击箭头上下移动效果
2020/06/11 Javascript
[03:55]显微镜下的DOTA2特别篇——430灰烬之灵神级操作
2014/06/24 DOTA
[55:42]VG vs VGJ.T 2018国际邀请赛淘汰赛BO1 8.21
2018/08/22 DOTA
python3 图片referer防盗链的实现方法
2018/03/12 Python
Python 字符串与二进制串的相互转换示例
2018/07/23 Python
利用rest framework搭建Django API过程解析
2019/08/31 Python
numpy库ndarray多维数组的维度变换方法(reshape、resize、swapaxes、flatten)
2020/04/28 Python
CSS3教程(1):什么是CSS3
2009/04/02 HTML / CSS
英国Radley包德国官网:Radley London德国
2019/11/18 全球购物
中科前程Java笔试题
2016/11/20 面试题
应届生体育教师自荐信
2013/10/03 职场文书
小学少先队活动方案
2014/02/18 职场文书
环境工程专业自荐信
2014/03/03 职场文书
法人代表授权委托书
2014/04/08 职场文书
小学校长汇报材料
2014/08/20 职场文书
2015年护士节活动总结
2015/02/10 职场文书
2015年发展党员工作总结报告
2015/03/31 职场文书
傅雷家书读书笔记
2015/06/29 职场文书
2016新春团拜会致辞
2015/08/01 职场文书
SQL语法CONSTRAINT约束操作详情
2022/01/18 MySQL