python实现知乎高颜值图片爬取


Posted in Python onAugust 12, 2019

导入相关包

import time
import pydash
import base64
import requests
from lxml import etree
from aip import AipFace
from pathlib import Path

百度云 人脸检测 申请信息

#唯一必须填的信息就这三行
APP_ID = "xxxxxxxx"
API_KEY = "xxxxxxxxxxxxxxxx"
SECRET_KEY = "xxxxxxxxxxxxxxxx"
# 过滤颜值阈值,存储空间大的请随意
BEAUTY_THRESHOLD = 55
AUTHORIZATION = "oauth c3cef7c66a1843f8b3a9e6a1e3160e20"
# 如果权限错误,浏览器中打开知乎,在开发者工具复制一个,无需登录
# 建议最好换一个,因为不知道知乎的反爬虫策略,如果太多人用同一个,可能会影响程序运行

以下皆无需改动

# 每次请求知乎的讨论列表长度,不建议设定太长,注意节操
LIMIT = 5
# 这是话题『美女』的 ID,其是『颜值』(20013528)的父话题
SOURCE = "19552207"

爬虫假装下正常浏览器请求

USER_AGENT = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.5 Safari/534.55.3"
REFERER = "https://www.zhihu.com/topic/%s/newest" % SOURCE
# 某话题下讨论列表请求 url
BASE_URL = "https://www.zhihu.com/api/v4/topics/%s/feeds/timeline_activity"
# 初始请求 url 附带的请求参数
URL_QUERY = "?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.comment_count&limit=" + str(
  LIMIT)

HEADERS = {
  "User-Agent": USER_AGENT,
  "Referer": REFERER,
  "authorization": AUTHORIZATION

指定 url,获取对应原始内容 / 图片

def fetch_image(url):
  try:
    response = requests.get(url, headers=HEADERS)
  except Exception as e:
    raise e
  return response.content

指定 url,获取对应 JSON 返回 / 话题列表

def fetch_activities(url):
  try:
    response = requests.get(url, headers=HEADERS)
  except Exception as e:
    raise e
  return response.json()

处理返回的话题列表

def parser_activities(datums, face_detective):
  for data in datums["data"]:
    target = data["target"]
    if "content" not in target or "question" not in target or "author" not in target:
      continue
    html = etree.HTML(target["content"])
    seq = 0
    title = target["question"]["title"]
    author = target["author"]["name"]
    images = html.xpath("//img/@src")
    for image in images:
      if not image.startswith("http"):
        continue
      image_data = fetch_image(image)
      score = face_detective(image_data)
      if not score:
        continue
      name = "{}--{}--{}--{}.jpg".format(score, author, title, seq)
      seq = seq + 1
      path = Path(__file__).parent.joinpath("image").joinpath(name)
      try:
        f = open(path, "wb")
        f.write(image_data)
        f.flush()
        f.close()
        print(path)
        time.sleep(2)
      except Exception as e:
        continue
  if not datums["paging"]["is_end"]:
    return datums["paging"]["next"]
  else:
    return None

初始化颜值检测工具

def init_detective(app_id, api_key, secret_key):
  client = AipFace(app_id, api_key, secret_key)
  options = {"face_field": "age,gender,beauty,qualities"}
  def detective(image):
    image = str(base64.b64encode(image), "utf-8")
    response = client.detect(str(image), "BASE64", options)
    response = response.get("result")
    if not response:
      return
    if (not response) or (response["face_num"] == 0):
      return
    face_list = response["face_list"]
    if pydash.get(face_list, "0.face_probability") < 0.6:
      return
    if pydash.get(face_list, "0.beauty") < BEAUTY_THRESHOLD:
      return
    if pydash.get(face_list, "0.gender.type") != "female":
      return
    score = pydash.get(face_list, "0.beauty")
    return score
  return detective

程序入口

def main():
  face_detective = init_detective(APP_ID, API_KEY, SECRET_KEY)
  url = BASE_URL % SOURCE + URL_QUERY
  while url is not None:
    datums = fetch_activities(url)
    url = parser_activities(datums, face_detective)
    time.sleep(5)
if __name__ == '__main__':
  main()

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
python类和函数中使用静态变量的方法
May 09 Python
python制作花瓣网美女图片爬虫
Oct 28 Python
rabbitmq(中间消息代理)在python中的使用详解
Dec 14 Python
分数霸榜! python助你微信跳一跳拿高分
Jan 08 Python
Python面向对象程序设计之继承与多继承用法分析
Jul 13 Python
python+selenium实现自动化百度搜索关键词
Jun 03 Python
python中break、continue 、exit() 、pass终止循环的区别详解
Jul 08 Python
Python matplotlib生成图片背景透明的示例代码
Aug 30 Python
Pytoch之torchvision.transforms图像变换实例
Dec 30 Python
pandas的resample重采样的使用
Apr 24 Python
使用Python matplotlib作图时,设置横纵坐标轴数值以百分比(%)显示
May 16 Python
python urllib库的使用详解
Apr 13 Python
python3 enum模块的应用实例详解
Aug 12 #Python
Python一键查找iOS项目中未使用的图片、音频、视频资源
Aug 12 #Python
django+echart数据动态显示的例子
Aug 12 #Python
Flask框架学习笔记之使用Flask实现表单开发详解
Aug 12 #Python
Flask框架学习笔记之表单基础介绍与表单提交方式
Aug 12 #Python
python内存管理机制原理详解
Aug 12 #Python
Flask框架学习笔记之路由和反向路由详解【图文与实例】
Aug 12 #Python
You might like
摩卡咖啡
2021/03/03 咖啡文化
一周学会PHP(视频)Http下载
2006/12/12 PHP
php操作mysql数据库的基本类代码
2014/02/25 PHP
浅谈PHP定义命令空间的几个注意点(推荐)
2016/10/29 PHP
什么是PHP7中的孤儿进程与僵尸进程
2019/04/14 PHP
高性能web开发 如何加载JS,JS应该放在什么位置?
2010/05/14 Javascript
jQuery实现列表自动循环滚动鼠标悬停时停止滚动
2013/09/06 Javascript
javascript制作loading动画效果 loading效果
2014/01/14 Javascript
JavaScript中匿名、命名函数的性能测试
2014/09/04 Javascript
javascript性能优化之事件委托实例详解
2015/12/12 Javascript
jQuery UI结合Ajax创建可定制的Web界面
2016/06/22 Javascript
简单好用的nodejs 爬虫框架分享
2017/03/26 NodeJs
angular学习之ngRoute路由机制
2017/04/12 Javascript
微信小程序websocket聊天室的实现示例代码
2019/02/12 Javascript
JS实现滚动条触底加载更多
2019/09/19 Javascript
Vue实现跑马灯效果
2020/05/25 Javascript
jQuery实现回到顶部效果
2020/10/19 jQuery
快速解决vue2+vue-cli3项目ie兼容的问题
2020/11/17 Vue.js
用C++封装MySQL的API的教程
2015/05/06 Python
Python3实现将文件归档到zip文件及从zip文件中读取数据的方法
2015/05/22 Python
python实现list由于numpy array的转换
2018/04/04 Python
Python列表的切片实例讲解
2019/08/20 Python
PyTorch 普通卷积和空洞卷积实例
2020/01/07 Python
在python中利用pycharm自定义代码块教程(三步搞定)
2020/04/15 Python
Python实现常见的几种加密算法(MD5,SHA-1,HMAC,DES/AES,RSA和ECC)
2020/05/09 Python
怎么快速自学python
2020/06/22 Python
python和js交互调用的方法
2020/06/23 Python
python之语音识别speech模块
2020/09/09 Python
阿拉伯世界最大的电子商务网站:Souq沙特阿拉伯
2016/10/28 全球购物
维多利亚的秘密阿联酋官网:Victoria’s Secret阿联酋
2019/12/07 全球购物
自动化职业生涯规划书范文
2014/01/03 职场文书
2014年情人节活动方案
2014/02/16 职场文书
镇创先争优活动总结
2014/08/28 职场文书
护理医院见习报告
2014/11/03 职场文书
八年级作文之我的母亲
2019/12/10 职场文书
MySQL日期时间函数知识汇总
2022/03/17 MySQL