Scraping Good-Looking Photos from Zhihu with Python


Posted in Python on August 12, 2019

Import the required packages

import time
import pydash
import base64
import requests
from lxml import etree
from aip import AipFace
from pathlib import Path

Baidu Cloud face detection credentials

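# These three lines are the only values you must fill in
APP_ID = "xxxxxxxx"
API_KEY = "xxxxxxxxxxxxxxxx"
SECRET_KEY = "xxxxxxxxxxxxxxxx"
# Beauty-score threshold for filtering; lower it freely if you have plenty of storage
BEAUTY_THRESHOLD = 55
# If you get a permission error, open Zhihu in a browser and copy an authorization
# value from the developer tools; no login is required.
# It is best to replace this one anyway: Zhihu's anti-crawler policy is unknown, and
# too many people sharing the same value might stop the program from working.
AUTHORIZATION = "oauth c3cef7c66a1843f8b3a9e6a1e3160e20"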

Nothing below needs to be changed

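# Number of feed items per request to Zhihu; keep it small and be polite to the server
LIMIT = 5
# ID of the topic 「美女」 (beautiful women), the parent topic of 「颜值」 (looks, 20013528)
SOURCE = "19552207"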

Make the crawler look like a normal browser request

USER_AGENT = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.5 Safari/534.55.3"
REFERER = "https://www.zhihu.com/topic/%s/newest" % SOURCE
# Request URL for the feed of a given topic
BASE_URL = "https://www.zhihu.com/api/v4/topics/%s/feeds/timeline_activity"
# Query parameters appended to the initial request URL
URL_QUERY = "?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.comment_count&limit=" + str(
  LIMIT)

HEADERS = {
  "User-Agent": USER_AGENT,
  "Referer": REFERER,
  "authorization": AUTHORIZATION,
}

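The percent-encoded `include` parameter is hard to read at a glance. As a minimal sketch, the standard library can decode such a query string and encode it again. The `sample_query` value below is a shortened, illustrative sample rather than the full `URL_QUERY`, and the helper names `decode_query` and `encode_query` are hypothetical.

```python
from urllib.parse import parse_qs, quote, urlencode

# A shortened, illustrative sample of the query string (not the full URL_QUERY).
sample_query = (
  "include=data%5B%3F%28target.type%3Danswer%29%5D.target.content"
  "%2Cvoteup_count&limit=5"
)

def decode_query(query):
  """Return the query string as a dict of decoded parameter values."""
  return {key: values[0] for key, values in parse_qs(query).items()}

def encode_query(params):
  """Percent-encode a dict of parameters back into a query string."""
  return urlencode(params, quote_via=quote)

params = decode_query(sample_query)
print(params["include"])
print(params["limit"])
```

Keeping the parameters in a dict and encoding them at request time would make it easier to see which fields are requested from the API.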
Fetch the raw content (an image) at a given URL

def fetch_image(url):
  # A timeout keeps a stalled connection from hanging the crawler;
  # any request error simply propagates to the caller
  response = requests.get(url, headers=HEADERS, timeout=10)
  return response.content

Fetch the JSON response (the topic feed) at a given URL

def fetch_activities(url):
  # A timeout keeps a stalled connection from hanging the crawler;
  # any request error simply propagates to the caller
  response = requests.get(url, headers=HEADERS, timeout=10)
  return response.json()
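Neither fetch function recovers from a transient network error. One possible way to handle that, sketched here under assumptions, is a small retry wrapper. The name `fetch_with_retry` and its `attempts` and `delay` parameters are hypothetical and not part of the original script; the `fetch` argument can be any callable that takes a URL, such as `fetch_image` or `fetch_activities`.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
  """Call fetch(url), retrying on any exception up to `attempts` times."""
  if attempts < 1:
    raise ValueError("attempts must be at least 1")
  last_error = None
  for attempt in range(attempts):
    try:
      return fetch(url)
    except Exception as error:
      last_error = error
      # Wait before the next try, but not after the final failure
      if attempt < attempts - 1:
        time.sleep(delay)
  raise last_error
```

With this in place, a call such as `fetch_with_retry(fetch_activities, url)` would retry twice before giving up and re-raising the last error.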

Process the returned topic feed

def parser_activities(datums, face_detective):
  for data in datums["data"]:
    target = data["target"]
    if "content" not in target or "question" not in target or "author" not in target:
      continue
    html = etree.HTML(target["content"])
    seq = 0
    title = target["question"]["title"]
    author = target["author"]["name"]
    images = html.xpath("//img/@src")
    for image in images:
      if not image.startswith("http"):
        continue
      image_data = fetch_image(image)
      score = face_detective(image_data)
      if not score:
        continue
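      name = "{}--{}--{}--{}.jpg".format(score, author, title, seq)
      seq = seq + 1
      # A "/" in the title or author would be read as a path separator
      name = name.replace("/", "_")
      image_dir = Path(__file__).parent.joinpath("image")
      image_dir.mkdir(exist_ok=True)  # open() fails if the directory is missing
      path = image_dir.joinpath(name)
      try:
        with open(path, "wb") as f:
          f.write(image_data)
        print(path)
        time.sleep(2)
      except OSError as e:
        # Skip images that cannot be written, but report why
        print("skipped {}: {}".format(path, e))
        continue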
  if not datums["paging"]["is_end"]:
    return datums["paging"]["next"]
  else:
    return None
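The file name is built from the question title and the author name, both of which are free text and may contain characters that are not valid in a path. A minimal sketch of a sanitizing helper follows; the name `safe_filename` and the `max_title` limit are assumptions made for illustration and do not appear in the original script.

```python
import re

# Characters that are invalid in file names on common file systems
INVALID_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')

def safe_filename(score, author, title, seq, max_title=60):
  """Build the image file name with unsafe characters replaced by "_"."""
  def clean(text):
    return INVALID_CHARS.sub("_", str(text))
  # Only the title is truncated, so the sequence number is always kept
  return "{}--{}--{}--{}.jpg".format(
    score, clean(author), clean(title)[:max_title], seq)

print(safe_filename(80, "someone", "What/Why?", 0))
```

Truncating the title rather than the whole name keeps the score and sequence number intact, so two images from the same answer still get distinct names.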

Initialize the face-scoring tool

def init_detective(app_id, api_key, secret_key):
  client = AipFace(app_id, api_key, secret_key)
  options = {"face_field": "age,gender,beauty,qualities"}
  def detective(image):
    # The API expects the image as a base64-encoded string
    image = str(base64.b64encode(image), "utf-8")
    response = client.detect(image, "BASE64", options)
    response = response.get("result")
    if (not response) or (response["face_num"] == 0):
      return
    face_list = response["face_list"]
    # Default to 0 so a missing field is filtered out instead of raising TypeError
    if pydash.get(face_list, "0.face_probability", 0) < 0.6:
      return
    score = pydash.get(face_list, "0.beauty", 0)
    if score < BEAUTY_THRESHOLD:
      return
    if pydash.get(face_list, "0.gender.type") != "female":
      return
    return score
  return detective
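The filtering rules inside `detective` can be tried out without calling the Baidu API. The sketch below applies the same three checks with plain dictionary access. The `sample` dictionary is made up for illustration and contains only the fields the script above reads; it is not a real API response, and the function name `beauty_score` is hypothetical.

```python
def beauty_score(result, threshold=55, min_probability=0.6):
  """Return the beauty score of the first face, or None if it is filtered out."""
  if not result or not result.get("face_list"):
    return None
  face = result["face_list"][0]
  # Reject detections that are unlikely to be a face at all
  if face.get("face_probability", 0) < min_probability:
    return None
  if face.get("gender", {}).get("type") != "female":
    return None
  beauty = face.get("beauty")
  if beauty is None or beauty < threshold:
    return None
  return beauty

# Made-up sample shaped after the fields the script reads
sample = {
  "face_num": 1,
  "face_list": [
    {"face_probability": 0.98, "beauty": 71.5, "gender": {"type": "female"}},
  ],
}
print(beauty_score(sample))
```

Separating the filter from the API client in this way makes the threshold logic easy to test with hand-written inputs.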

Program entry point

def main():
  face_detective = init_detective(APP_ID, API_KEY, SECRET_KEY)
  url = BASE_URL % SOURCE + URL_QUERY
  while url is not None:
    datums = fetch_activities(url)
    url = parser_activities(datums, face_detective)
    time.sleep(5)

if __name__ == '__main__':
  main()
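The loop in `main` follows the `paging.next` link of each response until `paging.is_end` is true. The same pattern can be written as a generator, sketched here with a fake in-memory feed standing in for Zhihu. The name `iter_pages` and the `fake_feed` data are assumptions for illustration only.

```python
def iter_pages(first_url, fetch):
  """Yield each page of results, following paging.next until is_end."""
  url = first_url
  while url is not None:
    page = fetch(url)
    yield page
    paging = page.get("paging", {})
    # Stop when the API reports the last page or gives no next link
    url = None if paging.get("is_end", True) else paging.get("next")

# Fake two-page feed keyed by URL, used in place of a real request
fake_feed = {
  "u1": {"data": [1], "paging": {"is_end": False, "next": "u2"}},
  "u2": {"data": [2], "paging": {"is_end": True, "next": "u3"}},
}
for page in iter_pages("u1", fake_feed.get):
  print(page["data"])
```

In the real script, `fetch` would be `fetch_activities`, and the delay between requests would still be needed inside the consuming loop.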

That is all for this article. I hope it helps with your studies, and I hope you will continue to support 三水点靠木.
