Getting the next-page link in a Python crawler

Posted in Python on March 13, 2020

Let's start with the example code:

from time import sleep

import faker
import requests
from lxml import etree

fake = faker.Faker()

base_url = "http://angelimg.spbeen.com"

def get_next_link(url):
  content = downloadHtml(url)
  html = etree.HTML(content)
  next_url = html.xpath("//a[@class='ch next']/@href")
  if next_url:
    return base_url + next_url[0]
  else:
    return False

def downloadHtml(url):
  user_agent = fake.user_agent()
  headers = {'User-Agent': user_agent,"Referer":"http://angelimg.spbeen.com/"}
  response = requests.get(url, headers=headers)
  return response.text

def getImgUrl(content):
  html = etree.HTML(content)
  img_url = html.xpath('//*[@id="content"]/a/img/@src')
  title = html.xpath(".//div[@class='article']/h2/text()")

  return img_url[0],title[0]

def saveImg(title,img_url):
  if img_url is not None and title is not None:
    user_agent = fake.user_agent()
    headers = {'User-Agent': user_agent,"Referer":"http://angelimg.spbeen.com/"}
    # Download first so a failed request does not leave an empty file behind
    content = requests.get(img_url, headers=headers)
    #request_view(content)
    # The "txt/" directory must already exist; the with block closes the file
    with open("txt/"+str(title)+".jpg",'wb') as f:
      f.write(content.content)

def request_view(response):
  # Debug helper: save the response to tmp.html and open it in the browser
  import webbrowser
  request_url = response.url
  # Inject a <base> tag so relative links resolve against the original URL
  base_url = '<head><base href="%s">' %(request_url)
  base_url = base_url.encode()
  content = response.content.replace(b"<head>",base_url)
  with open('tmp.html','wb') as tem_html:
    tem_html.write(content)
  webbrowser.open_new_tab('tmp.html')

def crawl_img(url):
  content = downloadHtml(url)
  res = getImgUrl(content)
  title = res[1]
  img_url = res[0]
  saveImg(title,img_url)

if __name__ == "__main__":
  url = "http://angelimg.spbeen.com/ang/4968/1"

  while url:
    print(url)
    crawl_img(url)
    url = get_next_link(url)
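
The loop above stops only when a page has no next link, so a page that links back to an earlier one would keep it running forever. The following is a minimal sketch of the same loop with a visited list and a page limit. It is not the author's code: it uses the standard-library HTMLParser instead of lxml so that it runs without extra packages, and the names NextLinkParser, find_next_url, crawl_pages and FAKE_PAGES, as well as the example.com URLs, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Remember the href of the first <a> whose class attribute is 'ch next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("class") == "ch next" and self.next_href is None:
            self.next_href = attr_map.get("href")


def find_next_url(page_url, html_text):
    """Return the absolute next-page URL, or None when the page has no next link."""
    parser = NextLinkParser()
    parser.feed(html_text)
    if parser.next_href is None:
        return None
    return urljoin(page_url, parser.next_href)


def crawl_pages(start_url, fetch, max_pages=50):
    """Follow next-page links from start_url and return the visited URLs in order.

    fetch is any callable that maps a URL to its HTML text.
    """
    visited = []
    url = start_url
    # Stop on a missing link, a repeated URL, or when the page limit is reached
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        url = find_next_url(url, fetch(url))
    return visited


# Hypothetical in-memory pages standing in for real HTTP responses
FAKE_PAGES = {
    "http://example.com/ang/1": '<a class="ch next" href="/ang/2">next</a>',
    "http://example.com/ang/2": '<a class="ch next" href="/ang/3">next</a>',
    "http://example.com/ang/3": "<p>last page</p>",
}

if __name__ == "__main__":
    for page in crawl_pages("http://example.com/ang/1", FAKE_PAGES.__getitem__):
        print(page)
```

In a real crawler, fetch could be a function such as downloadHtml from the listing above; passing it in as a parameter keeps the loop testable without network access.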

How a Python crawler can follow next-page links automatically to load text

from bs4 import BeautifulSoup
import requests
import time
from lxml import etree
import os
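# This demo shows how to scrape some text with BeautifulSoup
def start():
  # Send the network request
  html=requests.get('http://www.baidu.com')
  # Use the encoding detected from the response body
  html.encoding=html.apparent_encoding
  # Create the soup object
  soup=BeautifulSoup(html.text,'html.parser')
  print(type(soup))
  print('Printing elements')
  print(soup.prettify())
  # Store the title; the editor offers no completion hint here, but it works directly
  title=soup.head.title.string
  print(title)
  # Write it to a text file (explicit encoding so non-ASCII titles are saved correctly)
  with open(r'C:/Users/a/Desktop/a.txt','w',encoding='utf-8') as f:
    f.write(title)
  print(time.localtime())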
 
url_2 = 'http://news.gdzjdaily.com.cn/zjxw/politics/sz_4.shtml'
def get_html_from_bs4(url):
  # response = requests.get(url,headers=data,proxies=ip).content.decode('utf-8')
  response = requests.get(url).content.decode('utf-8')
  soup = BeautifulSoup(response, 'html.parser')
  # The 9th link inside #displaypagenum is taken to be the next-page link
  next_page = soup.select('#displaypagenum a:nth-of-type(9)')[0].get('href')
  print(next_page)
  # Build the absolute URL of the next page and return it to the caller
  next2='http://news.gdzjdaily.com.cn/zjxw/politics/'+next_page
  return next2


def get_html_from_etree(url):
  response = requests.get(url).content.decode('utf-8')
  html= etree.HTML(response)
  # The 8th link with class "PageNum" is taken to be the next-page link
  next_page = html.xpath('.//a[@class="PageNum"][8]/@href')[0]
  print(next_page)
  # next2='http://news.gdzjdaily.com.cn/zjxw/politics/'+next_page

if __name__ == '__main__':
  # Keep both demo calls under the main guard so importing this file has no side effects
  get_html_from_etree(url_2)
  start()
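
Both functions above pick the next-page link by its position (the 9th or 8th link), which breaks as soon as the pager shows a different number of links. A possible alternative, sketched here under stated assumptions, is to pick the link by its visible text and resolve it against the current page URL. The link text "下一页" ("next page"), the sample pager HTML and the names LinkTextParser and next_page_url are hypothetical and not taken from the real site; the sketch uses only the standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTextParser(HTMLParser):
    """Record (href, text) for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside an <a> that has an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def next_page_url(page_url, html_text, label="下一页"):
    """Return the absolute URL of the link whose text equals label, else None."""
    parser = LinkTextParser()
    parser.feed(html_text)
    for href, text in parser.links:
        if text == label:
            # urljoin handles relative and absolute hrefs alike
            return urljoin(page_url, href)
    return None


if __name__ == "__main__":
    # Hypothetical pager markup, used only to exercise the function
    sample = (
        '<div id="displaypagenum">'
        '<a href="sz_3.shtml">上一页</a>'
        '<a href="sz_5.shtml">下一页</a>'
        "</div>"
    )
    current = "http://news.gdzjdaily.com.cn/zjxw/politics/sz_4.shtml"
    print(next_page_url(current, sample))
```

Resolving with urljoin also removes the need to hard-code the 'http://news.gdzjdaily.com.cn/zjxw/politics/' prefix, since the base is derived from the URL of the page being parsed.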


- Author -

brady.wang
