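Building a Newspaper Reader in Python by Parsing Web Pages


Posted in Python on August 04, 2014

The example in this article uses Python to find the image edition of the newspaper Cankao Xiaoxi (《参考消息》, "Reference News") and automatically download the current day's page images to a local folder for viewing. The complete code is as follows: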

# coding=gbk
import urllib2
import socket
import re
import time
import os

# Global socket timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

home_url = "http://www.hqck.net"
home_page = ""
try:
  home_page_context = urllib2.urlopen(home_url)
  home_page = home_page_context.read()

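  print "Read home page finished."
  print "-------------------------------------------------"
except urllib2.HTTPError, e:
  # HTTPError carries an HTTP status code
  print "HTTP error:", e.code
  exit()
except urllib2.URLError, e:
  # A plain URLError has a reason but no status code
  print "URL error:", e.reason
  exit()
except Exception, e:
  print "Unexpected error:", e
  exit()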

# [^>]* tolerates extra attributes (such as rel="...") after the href
reg_str = r'<a class="item-baozhi" href="/arc/jwbt/ckxx/\d{4}/\d{4}/\w+\.html"[^>]*><span class.+>.+</span></a>'

news_url_reg = re.compile(reg_str)

today_cankao_news = news_url_reg.findall(home_page)

if len(today_cankao_news) == 0:
  print "Cannot find today's news!"
  exit()

my_news = today_cankao_news[0]
print "Latest news link = " + my_news
print

url_s = my_news.find("/arc/")
url_e = my_news.find(".html")
url_e = url_e + 5

print "Link index = [" + str(url_s) + "," + str(url_e) + "]"
my_news = my_news[url_s:url_e]
print "part url = " + my_news

full_news_url = home_url + my_news
print "full url = " + full_news_url
print

image_folder = "E:\\new_folder\\"

today_num = time.strftime('%Y-%m-%d')
image_folder = image_folder + today_num + "\\"
# os.makedirs also creates any missing parent folders
if not os.path.exists(image_folder):
  os.makedirs(image_folder)
print "News image folder = " + image_folder
print

context_uri = full_news_url[0:-5]

first_page_url = context_uri + ".html"
try:
  first_page_context = urllib2.urlopen(first_page_url)
  first_page = first_page_context.read()
except urllib2.URLError, e:
  # URLError also covers HTTPError, which is its subclass
  print e
  exit()

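# Locate the "共N页" (N pages in total) marker; "共" is 2 bytes in GBK
tot_page_index = first_page.find("共")
if tot_page_index == -1:
  print "Cannot find total page count!"
  exit()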

tmp_str = first_page[tot_page_index:tot_page_index+10]
end_s = tmp_str.find("页")

page_num = tmp_str[2:end_s]
print page_num

page_count = int(page_num)
print "Total " + page_num + " pages:"
print

page_index = 1
download_suc = True
while page_index <= page_count:
  page_url = context_uri
  if page_index > 1:
    page_url = page_url + "_" + str(page_index)
  page_url = page_url + ".html"
  print "News page link = " + page_url

  try:
    news_img_page_context = urllib2.urlopen(page_url)
  except urllib2.URLError,e:
    print e.reason
    download_suc = False
    break
  
  news_img_page = news_img_page_context.read()

  #f = open("e:\\page.html", "w")
  #f.write(news_img_page)
  #f.close()

  reg_str = r'http://image\S+jpg'
  image_reg = re.compile(reg_str)
  image_results = image_reg.findall(news_img_page)
  if len(image_results) == 0:
    print "Cannot find image on news page " + str(page_index) + "!"
    download_suc = False
    break
  
  image_url = image_results[0]

  print "News image url = " + image_url
  image_name = image_folder + "page_" + str(page_index) + ".jpg"
  print "Getting image..."
  try:
    news_image_context = urllib2.urlopen(image_url)
    imgf = open(image_name, 'wb')
    try:
      # Copy the image to disk in 10 KB chunks
      while True:
        data = news_image_context.read(1024*10)
        if not data:
          break
        imgf.write(data)
    finally:
      # Close the file even if the download fails midway
      imgf.close()
  except Exception, e:
    download_suc = False
    print "Save image " + str(page_index) + " failed!"
    print "Unexpected error: " + str(e)
  else:
    print "Save image " + str(page_index) + " succeeded!"
    print
  page_index = page_index + 1

if download_suc:
  print "News download succeeded! Path = \"" + str(image_folder) + "\""
  print "Enjoy it! ^^"
else:
  print "News download failed!"
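
The page-count parsing and page-URL construction used in the script can also be pulled out into small functions that are easy to check without touching the network. The following is only a minimal sketch, not the author's code: the function names `extract_page_count` and `build_page_urls` are hypothetical, the sample URL and page text are made up, the page text is assumed to be already decoded to Unicode, and the sketch is written for Python 3 whereas the script above targets Python 2.

```python
import re


def extract_page_count(page_text):
    """Return the number found between "共" and "页", or None if absent.

    Assumes page_text is decoded text, not raw GBK bytes.
    """
    match = re.search(r"共\s*(\d+)\s*页", page_text)
    return int(match.group(1)) if match else None


def build_page_urls(first_page_url, page_count):
    """Build the URL of every page using the script's naming scheme:
    page 1 is "<base>.html", page N (N > 1) is "<base>_N.html".
    """
    suffix = ".html"
    if not first_page_url.endswith(suffix):
        raise ValueError("expected a URL ending in .html")
    base = first_page_url[:-len(suffix)]
    urls = [first_page_url]
    for index in range(2, page_count + 1):
        urls.append("%s_%d%s" % (base, index, suffix))
    return urls


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample_text = "<div>共3页</div>"
    sample_url = "http://www.hqck.net/arc/jwbt/ckxx/2014/0804/12345.html"
    count = extract_page_count(sample_text)
    for url in build_page_urls(sample_url, count):
        print(url)
```

Using a regular expression instead of fixed byte offsets means the parsing does not depend on how many bytes "共" occupies in a given encoding, and a missing marker yields `None` rather than a failed `int()` conversion.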