Scraping High-Attractiveness Images from Zhihu with Python


Posted in Python on August 12, 2019

Import the required packages

import time
import pydash
import base64
import requests
from lxml import etree
from aip import AipFace
from pathlib import Path

Baidu Cloud face detection credentials

# These three lines are the only values you must fill in
APP_ID = "xxxxxxxx"
API_KEY = "xxxxxxxxxxxxxxxx"
SECRET_KEY = "xxxxxxxxxxxxxxxx"
# Attractiveness score threshold for filtering; lower it freely if you have plenty of storage
BEAUTY_THRESHOLD = 55
AUTHORIZATION = "oauth c3cef7c66a1843f8b3a9e6a1e3160e20"
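# If you get a permission error, open Zhihu in a browser and copy a value from the developer tools; no login is needed
# It is best to replace it with your own: Zhihu's anti-crawling policy is unknown, and too many people sharing one value may affect the program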

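Nothing below needs to be changed

# Number of discussions fetched per request to Zhihu; a large value is not recommended, so be considerate
LIMIT = 5
# ID of the topic 『美女』 (beautiful women), the parent topic of 『颜值』 (attractiveness, 20013528)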
SOURCE = "19552207"

Make the crawler's requests look like normal browser requests

USER_AGENT = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.5 Safari/534.55.3"
REFERER = "https://www.zhihu.com/topic/%s/newest" % SOURCE
# Request URL for the discussion list under a topic
BASE_URL = "https://www.zhihu.com/api/v4/topics/%s/feeds/timeline_activity"
# Query parameters attached to the initial request URL
URL_QUERY = "?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.comment_count&limit=" + str(
  LIMIT)
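
The `include` parameter above is percent-encoded, which makes it hard to see what is actually being requested. As a small aside, the standard library can decode it. The fragment below is copied from the start of `URL_QUERY`; the variable names are only illustrative.

```python
from urllib.parse import unquote

# A short fragment copied from the include parameter in URL_QUERY
fragment = "data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized"

# unquote turns %5B into "[", %3F into "?", %28 into "(", %3D into "=" and %2C into ","
decoded = unquote(fragment)
print(decoded)
```

Decoding the whole string the same way shows that it is a list of field selectors for answers, articles, people and questions, followed by the `limit` parameter.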

HEADERS = {
  "User-Agent": USER_AGENT,
  "Referer": REFERER,
  "authorization": AUTHORIZATION
}

Fetch the raw content (an image) at a given URL

def fetch_image(url):
  # Any network error raised by requests propagates to the caller
  response = requests.get(url, headers=HEADERS)
  return response.content

Fetch the JSON response (the topic's discussion list) at a given URL

def fetch_activities(url):
  # Any network error raised by requests propagates to the caller
  response = requests.get(url, headers=HEADERS)
  return response.json()

Process the returned discussion list

def parser_activities(datums, face_detective):
  for data in datums["data"]:
    target = data["target"]
    if "content" not in target or "question" not in target or "author" not in target:
      continue
    html = etree.HTML(target["content"])
    seq = 0
    title = target["question"]["title"]
    author = target["author"]["name"]
    images = html.xpath("//img/@src")
    for image in images:
      if not image.startswith("http"):
        continue
      image_data = fetch_image(image)
      score = face_detective(image_data)
      if not score:
        continue
      name = "{}--{}--{}--{}.jpg".format(score, author, title, seq)
      seq = seq + 1
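      path = Path(__file__).parent.joinpath("image").joinpath(name)
      try:
        # Create the image directory if it is missing, otherwise open() fails
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "wb") as f:
          f.write(image_data)
        print(path)
        time.sleep(2)
      except Exception as e:
        # Report the failure and skip this image instead of failing silently
        print("failed to save {}: {}".format(path, e))
        continue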
  if not datums["paging"]["is_end"]:
    return datums["paging"]["next"]
  else:
    return None
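
One thing to watch in the function above: the file name is built from the question title and the author name, which may contain characters such as `/` or `?` that are not valid in file names on some systems. A minimal sketch of a sanitizing helper follows. `safe_filename` is a hypothetical name that is not part of the original script, and the set of characters it replaces is an assumption.

```python
import re

# Characters that are invalid in Windows file names, plus "/" for Unix-like systems
_INVALID = re.compile(r'[\\/:*?"<>|\r\n\t]')

def safe_filename(name, replacement="_"):
    """Replace characters that commonly break file names (hypothetical helper)."""
    cleaned = _INVALID.sub(replacement, name).strip()
    # Fall back to a fixed name if nothing usable is left
    return cleaned or "untitled"

# Same format string as in parser_activities, with made-up values
name = "{}--{}--{}--{}.jpg".format(72.5, "someone", "What / why?", 0)
print(safe_filename(name))
```

If used, it would wrap `name` before the path is built, so a title containing a slash no longer points into a directory that does not exist.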

Initialize the attractiveness detector

def init_detective(app_id, api_key, secret_key):
  client = AipFace(app_id, api_key, secret_key)
  options = {"face_field": "age,gender,beauty,qualities"}
  def detective(image):
    # The image is sent to the Baidu API as a base64 string
    image = str(base64.b64encode(image), "utf-8")
    response = client.detect(image, "BASE64", options)
    response = response.get("result")
    if (not response) or (response["face_num"] == 0):
      return
    face_list = response["face_list"]
    # Default to 0 so a missing field is filtered out instead of raising TypeError
    if pydash.get(face_list, "0.face_probability", 0) < 0.6:
      return
    if pydash.get(face_list, "0.gender.type") != "female":
      return
    score = pydash.get(face_list, "0.beauty", 0)
    if score < BEAUTY_THRESHOLD:
      return
    return score
  return detective
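
The filtering inside `detective` can be tried out without calling the Baidu API. The sketch below applies the same checks to a hand-written dictionary containing only the fields the code above reads (`face_num`, `face_list`, `face_probability`, `beauty`, `gender.type`). The sample values are made up, and `pick_score` is a hypothetical name.

```python
BEAUTY_THRESHOLD = 55

def pick_score(result, threshold=BEAUTY_THRESHOLD):
    """Return the beauty score of the first face, or None if it is filtered out."""
    if not result or result.get("face_num", 0) == 0:
        return None
    face = result["face_list"][0]
    # Skip detections that are unlikely to be a real face
    if face.get("face_probability", 0) < 0.6:
        return None
    if face.get("gender", {}).get("type") != "female":
        return None
    score = face.get("beauty", 0)
    return score if score >= threshold else None

# Made-up sample that only mimics the fields used by the script
sample = {
    "face_num": 1,
    "face_list": [
        {"face_probability": 0.98, "beauty": 61.2, "gender": {"type": "female"}}
    ],
}
print(pick_score(sample))
```

Keeping the filter as a plain function like this makes it easy to adjust the threshold and see which images would be kept before spending API quota.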

Program entry point

def main():
  face_detective = init_detective(APP_ID, API_KEY, SECRET_KEY)
  url = BASE_URL % SOURCE + URL_QUERY
  while url is not None:
    datums = fetch_activities(url)
    url = parser_activities(datums, face_detective)
    time.sleep(5)
if __name__ == '__main__':
  main()
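
The loop in `main` keeps following `paging.next` until `paging.is_end` is true. A minimal offline sketch of that control flow is shown below, using an in-memory stand-in for the Zhihu responses. The page contents and URLs are invented for illustration, and `fetch`, `next_url` and `crawl` are hypothetical names.

```python
# Fake pages keyed by URL; the structure mirrors the "data" and "paging" keys used above
FAKE_PAGES = {
    "page-1": {"data": ["a", "b"], "paging": {"is_end": False, "next": "page-2"}},
    "page-2": {"data": ["c"], "paging": {"is_end": True, "next": "page-3"}},
}

def fetch(url):
    """Stand-in for fetch_activities; returns a canned response."""
    return FAKE_PAGES[url]

def next_url(datums):
    """Same rule as the end of parser_activities: stop when is_end is true."""
    paging = datums["paging"]
    return None if paging["is_end"] else paging["next"]

def crawl(start):
    """Collect items from every page, following next links until the last page."""
    items = []
    url = start
    while url is not None:
        datums = fetch(url)
        items.extend(datums["data"])
        url = next_url(datums)
    return items

print(crawl("page-1"))
```

Note that the last page still carries a `next` value, which is why the check has to look at `is_end` rather than at whether `next` is present.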

That is all for this article. I hope it helps with your learning, and I hope you will continue to support 三水点靠木.
