Implementing a Newspaper Reader in Python by Parsing Web Pages


Posted in Python on August 04, 2014

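The example in this article uses Python to view the image-based newspaper 《参考消息》 (Reference News): it finds the current day's issue and automatically downloads its page images to a local folder for reading. The full code is as follows: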

# coding=gbk
import urllib2
import socket
import re
import time
import os

# Default socket timeout, in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

home_url = "http://www.hqck.net"
home_page = ""
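try:
  home_page_context = urllib2.urlopen(home_url)
  home_page = home_page_context.read()

  print "Read home page finished."
  print "-------------------------------------------------"
except urllib2.HTTPError, e:
  # HTTPError carries the HTTP status code
  print e.code
  exit()
except urllib2.URLError, e:
  # A plain URLError (DNS failure, refused connection, ...) has a reason but no code
  print e.reason
  exit()
except Exception, e:
  # Anything else, such as a socket timeout while reading
  print e
  exit()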

reg_str = r'<a class="item-baozhi" href="/arc/jwbt/ckxx/\d{4}/\d{4}/\w+\.html" rel="external nofollow" ><span class.+>.+</span></a>'

news_url_reg = re.compile(reg_str)

today_cankao_news = news_url_reg.findall(home_page)

if len(today_cankao_news) == 0:
  print "Cannot find today's news!"
  exit()

my_news = today_cankao_news[0]
print "Latest news link = " + my_news
print

url_s = my_news.find("/arc/")
url_e = my_news.find(".html")
url_e = url_e + 5

print "Link index = [" + str(url_s) + "," + str(url_e) + "]"
my_news = my_news[url_s:url_e]
print "part url = " + my_news

full_news_url = home_url + my_news
print "full url = " + full_news_url
print

image_folder = "E:\\new_folder\\"

if not os.path.exists(image_folder):
  os.makedirs(image_folder)
# One sub-folder per day, named like 2014-08-04
today_num = time.strftime('%Y-%m-%d')
image_folder = image_folder + today_num + "\\"
if not os.path.exists(image_folder):
  os.makedirs(image_folder)
print "News image folder = " + image_folder
print

context_uri = full_news_url[0:-5]

first_page_url = context_uri + ".html"
try:
  first_page_context = urllib2.urlopen(first_page_url)
  first_page = first_page_context.read()
except urllib2.URLError, e:
  # Also covers HTTPError, which is a subclass of URLError
  print e
  exit()

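# Locate the pager text "共N页" ("N pages in total") in the page
tot_page_index = first_page.find("共")

tmp_str = first_page[tot_page_index:tot_page_index+10]
end_s = tmp_str.find("页")

# This source file is GBK-encoded (see the coding line), so "共" is 2 bytes
# long and the page number starts at offset 2
page_num = tmp_str[2:end_s]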
print page_num

page_count = int(page_num)
print "Total " + page_num + " pages:"
print

page_index = 1
download_suc = True
while page_index <= page_count:
  page_url = context_uri
  if page_index > 1:
    page_url = page_url + "_" + str(page_index)
  page_url = page_url + ".html"
  print "News page link = " + page_url

  try:
    news_img_page_context = urllib2.urlopen(page_url)
  except urllib2.URLError,e:
    print e.reason
    download_suc = False
    break
  
  news_img_page = news_img_page_context.read()

  #f = open("e:\\page.html", "w")
  #f.write(news_img_page)
  #f.close()

  reg_str = r'http://image\S+jpg'
  image_reg = re.compile(reg_str)
  image_results = image_reg.findall(news_img_page)
  if len(image_results) == 0:
    print "Cannot find news page " + str(page_index) + "!"
    download_suc = False
    break
  
  image_url = image_results[0]

  print "News image url = " + image_url
  news_image_context = urllib2.urlopen(image_url)

  image_name = image_folder + "page_" + str(page_index) + ".jpg"
  imgf = open(image_name, 'wb')
  print "Getting image..."
  try:
    while True:
      # Read the image in 10 KB chunks until the stream is exhausted
      data = news_image_context.read(1024*10)
      if not data:
        break
      imgf.write(data)
  except Exception, e:
    download_suc = False
    print "Save image " + str(page_index) + " failed!"
    print "Unexpected error: " + str(e)
  else:
    print "Save image " + str(page_index) + " succeeded!"
    print
  finally:
    # Close the file whether or not the download worked
    imgf.close()
  page_index = page_index + 1

if download_suc:
  print "News download succeeded! Path = \"" + str(image_folder) + "\""
  print "Enjoy it! ^^"
else:
  print "News download failed!"
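
The page-count and page-URL steps in the listing above can be pulled out into small helpers that are easy to test without touching the network. The following is only a minimal sketch, written in Python 3 syntax (the listing itself is Python 2) and assuming the page has already been decoded to a text string. The function names `parse_page_count` and `build_page_urls`, the sample pager fragment, and the sample article path are all hypothetical and do not come from the original script.

```python
import re

# Pager text on the article page looks like "共N页" ("N pages in total").
# Allowing optional whitespace around the number is an assumption added here.
PAGER_PATTERN = re.compile(r"共\s*(\d+)\s*页")


def parse_page_count(html_text):
    """Return the page count found in the pager text, or None if it is absent."""
    match = PAGER_PATTERN.search(html_text)
    if match is None:
        return None
    return int(match.group(1))


def build_page_urls(first_page_url, page_count):
    """Build the URL of every page of an article.

    Follows the naming used in the listing: page 1 is "name.html",
    page n (n > 1) is "name_n.html".
    """
    suffix = ".html"
    if not first_page_url.endswith(suffix):
        raise ValueError("expected a URL ending in .html")
    base = first_page_url[:-len(suffix)]
    urls = [first_page_url]
    for index in range(2, page_count + 1):
        urls.append(base + "_" + str(index) + suffix)
    return urls


# Made-up inputs for illustration only.
sample_html = "<div class='pages'>共3页</div>"
sample_url = "http://www.hqck.net/arc/jwbt/ckxx/2014/0804/12345.html"

count = parse_page_count(sample_html)
for url in build_page_urls(sample_url, count):
    print(url)
```

Returning `None` when the pager text is missing lets the caller stop with a clear message, instead of passing an unrelated slice of the page to `int()` as the index arithmetic in the listing would.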