python实现知乎高颜值图片爬取


Posted in Python onAugust 12, 2019

导入相关包

import time
import pydash
import base64
import requests
from lxml import etree
from aip import AipFace
from pathlib import Path

百度云 人脸检测 申请信息

#唯一必须填的信息就这三行
APP_ID = "xxxxxxxx"
API_KEY = "xxxxxxxxxxxxxxxx"
SECRET_KEY = "xxxxxxxxxxxxxxxx"
# 过滤颜值阈值,存储空间大的请随意
BEAUTY_THRESHOLD = 55
AUTHORIZATION = "oauth c3cef7c66a1843f8b3a9e6a1e3160e20"
# 如果权限错误,浏览器中打开知乎,在开发者工具复制一个,无需登录
# 建议最好换一个,因为不知道知乎的反爬虫策略,如果太多人用同一个,可能会影响程序运行

以下皆无需改动

# 每次请求知乎的讨论列表长度,不建议设定太长,注意节操
LIMIT = 5
# 这是话题『美女』的 ID,其是『颜值』(20013528)的父话题
SOURCE = "19552207"

爬虫假装下正常浏览器请求

USER_AGENT = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.5 Safari/534.55.3"
REFERER = "https://www.zhihu.com/topic/%s/newest" % SOURCE
# 某话题下讨论列表请求 url
BASE_URL = "https://www.zhihu.com/api/v4/topics/%s/feeds/timeline_activity"
# 初始请求 url 附带的请求参数
URL_QUERY = "?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.comment_count&limit=" + str(
  LIMIT)

HEADERS = {
  "User-Agent": USER_AGENT,
  "Referer": REFERER,
  "authorization": AUTHORIZATION

指定 url,获取对应原始内容 / 图片

def fetch_image(url):
  try:
    response = requests.get(url, headers=HEADERS)
  except Exception as e:
    raise e
  return response.content

指定 url,获取对应 JSON 返回 / 话题列表

def fetch_activities(url):
  try:
    response = requests.get(url, headers=HEADERS)
  except Exception as e:
    raise e
  return response.json()

处理返回的话题列表

def parser_activities(datums, face_detective):
  for data in datums["data"]:
    target = data["target"]
    if "content" not in target or "question" not in target or "author" not in target:
      continue
    html = etree.HTML(target["content"])
    seq = 0
    title = target["question"]["title"]
    author = target["author"]["name"]
    images = html.xpath("//img/@src")
    for image in images:
      if not image.startswith("http"):
        continue
      image_data = fetch_image(image)
      score = face_detective(image_data)
      if not score:
        continue
      name = "{}--{}--{}--{}.jpg".format(score, author, title, seq)
      seq = seq + 1
      path = Path(__file__).parent.joinpath("image").joinpath(name)
      try:
        f = open(path, "wb")
        f.write(image_data)
        f.flush()
        f.close()
        print(path)
        time.sleep(2)
      except Exception as e:
        continue
  if not datums["paging"]["is_end"]:
    return datums["paging"]["next"]
  else:
    return None

初始化颜值检测工具

def init_detective(app_id, api_key, secret_key):
  client = AipFace(app_id, api_key, secret_key)
  options = {"face_field": "age,gender,beauty,qualities"}
  def detective(image):
    image = str(base64.b64encode(image), "utf-8")
    response = client.detect(str(image), "BASE64", options)
    response = response.get("result")
    if not response:
      return
    if (not response) or (response["face_num"] == 0):
      return
    face_list = response["face_list"]
    if pydash.get(face_list, "0.face_probability") < 0.6:
      return
    if pydash.get(face_list, "0.beauty") < BEAUTY_THRESHOLD:
      return
    if pydash.get(face_list, "0.gender.type") != "female":
      return
    score = pydash.get(face_list, "0.beauty")
    return score
  return detective

程序入口

def main():
  face_detective = init_detective(APP_ID, API_KEY, SECRET_KEY)
  url = BASE_URL % SOURCE + URL_QUERY
  while url is not None:
    datums = fetch_activities(url)
    url = parser_activities(datums, face_detective)
    time.sleep(5)
if __name__ == '__main__':
  main()

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
使用cx_freeze把python打包exe示例
Jan 24 Python
Python过滤列表用法实例分析
Apr 29 Python
python监控linux内存并写入mongodb(推荐)
Sep 11 Python
python统计字母、空格、数字等字符个数的实例
Jun 29 Python
python将一组数分成每3个一组的实例
Nov 14 Python
获取Pytorch中间某一层权重或者特征的例子
Aug 17 Python
python基础 range的用法解析
Aug 23 Python
Python通过cv2读取多个USB摄像头
Aug 28 Python
Python代码生成视频的缩略图的实例讲解
Dec 22 Python
Jupyter notebook 远程配置及SSL加密教程
Apr 14 Python
快速解决jupyter notebook启动需要密码的问题
Apr 21 Python
python查询MySQL将数据写入Excel
Oct 29 Python
python3 enum模块的应用实例详解
Aug 12 #Python
Python一键查找iOS项目中未使用的图片、音频、视频资源
Aug 12 #Python
django+echart数据动态显示的例子
Aug 12 #Python
Flask框架学习笔记之使用Flask实现表单开发详解
Aug 12 #Python
Flask框架学习笔记之表单基础介绍与表单提交方式
Aug 12 #Python
python内存管理机制原理详解
Aug 12 #Python
Flask框架学习笔记之路由和反向路由详解【图文与实例】
Aug 12 #Python
You might like
基于ubuntu下nginx+php+mysql安装配置的具体操作步骤
2013/04/28 PHP
PHP获取音频文件的相关信息
2015/06/22 PHP
详解Grunt插件之LiveReload实现页面自动刷新(两种方案)
2015/07/31 PHP
WordPress中用于创建以及获取侧边栏的PHP函数讲解
2015/12/29 PHP
php数组分页实现方法
2016/04/30 PHP
PHP设置images目录不充许http访问的方法
2016/11/01 PHP
thinkPHP3.2.3结合Laypage实现的分页功能示例
2018/05/28 PHP
Firefox下提示illegal character并出现乱码的原因
2010/03/25 Javascript
动态添加option及createElement使用示例
2014/01/26 Javascript
浏览器缩放检测的js代码
2014/09/28 Javascript
js阻止事件追加的具体实现
2014/10/15 Javascript
js对象的复制继承实例
2015/01/10 Javascript
详解Vue-基本标签和自定义控件
2017/03/24 Javascript
JS库particles.js创建超炫背景粒子插件(附源码下载)
2017/09/13 Javascript
原生js的ajax和解决跨域的jsonp(实例讲解)
2017/10/16 Javascript
JS常见构造模式实例对比分析
2018/08/27 Javascript
小程序实现自定义导航栏适配完美版
2019/04/02 Javascript
微信小程序云开发实现云数据库读写权限
2019/05/17 Javascript
nodejs对项目下所有空文件夹创建gitkeep的方法
2019/08/02 NodeJs
小程序如何写动态标签的实现方法
2020/02/05 Javascript
原生js实现日历效果
2020/03/02 Javascript
vue登录页实现使用cookie记住7天密码功能的方法
2021/02/18 Vue.js
python 算法 排序实现快速排序
2012/06/05 Python
Python读写Json涉及到中文的处理方法
2016/09/12 Python
Python基于matplotlib实现绘制三维图形功能示例
2018/01/18 Python
根据DataFrame某一列的值来选择具体的某一行方法
2018/07/03 Python
详解Python中的正则表达式
2018/07/08 Python
python tkinter实现界面切换的示例代码
2019/06/14 Python
使用Python计算玩彩票赢钱概率
2019/06/26 Python
Python数据库小程序源代码
2019/09/15 Python
python GUI库图形界面开发之PyQt5信号与槽多窗口数据传递详细使用方法与实例
2020/03/08 Python
市场部规章制度
2014/01/24 职场文书
2015年实习班主任工作总结
2015/04/23 职场文书
小学少先队工作总结2015
2015/05/26 职场文书
《和时间赛跑》读后感3篇
2019/12/16 职场文书
使用Nginx+Tomcat实现负载均衡的全过程
2022/05/30 Servers