Implementing a Newspaper-Reading Program in Python by Parsing Web Pages


Posted in Python on August 04, 2014

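The example in this article uses Python to read the image-based newspaper 《参考消息》 (Reference News): it finds the current day's issue on the site and automatically downloads its page images to a local folder for viewing. The full implementation is as follows: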

# coding=gbk
import urllib2
import socket
import re
import time
import os

# Default timeout, in seconds, for all socket operations
timeout = 10
socket.setdefaulttimeout(timeout)

home_url = "http://www.hqck.net"
home_page = ""
try:
  home_page_context = urllib2.urlopen(home_url)
  home_page = home_page_context.read()

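  print "Read home page finished."
  print "-------------------------------------------------"
except urllib2.URLError, e:
  # A plain URLError has no .code attribute (only HTTPError does),
  # so print the exception itself
  print "Failed to read home page:", e
  exit()
except Exception, e:
  print "Unexpected error:", e
  exit()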

reg_str = r'<a class="item-baozhi" href="/arc/jwbt/ckxx/\d{4}/\d{4}/\w+\.html" rel="external nofollow" ><span class.+>.+</span></a>'

news_url_reg = re.compile(reg_str)

today_cankao_news = news_url_reg.findall(home_page)

if len(today_cankao_news) == 0:
  print "Cannot find today's news!"
  exit()

my_news = today_cankao_news[0]
print "Latest news link = " + my_news
print

url_s = my_news.find("/arc/")
url_e = my_news.find(".html")
url_e = url_e + 5

print "Link index = [" + str(url_s) + "," + str(url_e) + "]"
my_news = my_news[url_s:url_e]
print "part url = " + my_news

full_news_url = home_url + my_news
print "full url = " + full_news_url
print

image_folder = "E:\\new_folder\\"

# Create the base folder, then a per-day subfolder named YYYY-MM-DD
if not os.path.exists(image_folder):
  os.makedirs(image_folder)
today_num = time.strftime('%Y-%m-%d')
image_folder = image_folder + today_num + "\\"
if not os.path.exists(image_folder):
  os.makedirs(image_folder)
print "News image folder = " + image_folder
print

context_uri = full_news_url[0:-5]

first_page_url = context_uri + ".html"
try:
  first_page_context = urllib2.urlopen(first_page_url)
  first_page = first_page_context.read()
except urllib2.HTTPError, e:
  print e.code
  exit()

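# The page count appears in the page as "共N页" ("N pages in total").
# With the gbk source encoding declared at the top, "共" is a 2-byte string.
tot_page_index = first_page.find("共")
if tot_page_index == -1:
  print "Cannot find the total page count!"
  exit()

tmp_str = first_page[tot_page_index:tot_page_index+10]
end_s = tmp_str.find("页")

# Skip the 2-byte "共" and keep the digits before "页"
page_num = tmp_str[2:end_s]
print page_num

page_count = int(page_num)
print "Total " + page_num + " pages:"
print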

page_index = 1
download_suc = True
while page_index <= page_count:
  page_url = context_uri
  if page_index > 1:
    page_url = page_url + "_" + str(page_index)
  page_url = page_url + ".html"
  print "News page link = " + page_url

  try:
    news_img_page_context = urllib2.urlopen(page_url)
  except urllib2.URLError,e:
    print e.reason
    download_suc = False
    break
  
  news_img_page = news_img_page_context.read()

  #f = open("e:\\page.html", "w")
  #f.write(news_img_page)
  #f.close()

  reg_str = r'http://image\S+jpg'
  image_reg = re.compile(reg_str)
  image_results = image_reg.findall(news_img_page)
  if len(image_results) == 0:
    print "Cannot find the image for news page " + str(page_index) + "!"
    download_suc = False
    break
  
  image_url = image_results[0]

  print "News image url = " + image_url
  image_name = image_folder + "page_" + str(page_index) + ".jpg"
  print "Getting image..."
  try:
    news_image_context = urllib2.urlopen(image_url)
    imgf = open(image_name, 'wb')
    try:
      # Copy the image to disk in 10 KB chunks
      while True:
        data = news_image_context.read(1024*10)
        if not data:
          break
        imgf.write(data)
    finally:
      # Close the file even if the download fails midway
      imgf.close()
  except Exception, e:
    download_suc = False
    print "Save image " + str(page_index) + " failed!"
    print "Unexpected error: " + str(e)
  else:
    print "Save image " + str(page_index) + " succeeded!"
    print
  page_index = page_index + 1

if download_suc:
  print "News download succeeded! Path = \"" + str(image_folder) + "\""
  print "Enjoy it! ^^"
else:
  print "News download failed!"
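
The parsing and download steps in the script above can also be written as small functions that are easy to check without touching the network. The following is only a sketch, assuming Python 3 (the script itself targets Python 2) and assuming the page text has already been decoded to a string. The helper names `extract_article_path`, `parse_page_count`, `build_page_urls`, and `copy_stream` are hypothetical, and the sample article path in the demo is made up for illustration. The URL pattern and the "共N页" page-count marker are taken from the script, not from any documented interface of the site.

```python
import io
import re

# Patterns mirror the script above; they are assumptions about the site's HTML.
ARTICLE_PATH_RE = re.compile(r'/arc/jwbt/ckxx/\d{4}/\d{4}/\w+\.html')
PAGE_COUNT_RE = re.compile(r'共\s*(\d+)\s*页')


def extract_article_path(anchor_html):
    """Return the first article path found in an <a> tag, or None."""
    match = ARTICLE_PATH_RE.search(anchor_html)
    return match.group(0) if match else None


def parse_page_count(page_text):
    """Return the page count from text such as '共16页', or None if absent."""
    match = PAGE_COUNT_RE.search(page_text)
    return int(match.group(1)) if match else None


def build_page_urls(first_page_url, page_count):
    """Page 1 keeps the original URL; page N (N > 1) gets a '_N' suffix."""
    base = first_page_url[:-len(".html")]
    urls = [first_page_url]
    for index in range(2, page_count + 1):
        urls.append("%s_%d.html" % (base, index))
    return urls


def copy_stream(source, target, chunk_size=10 * 1024):
    """Copy a binary stream in fixed-size chunks; return the bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        target.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # Made-up sample path, for illustration only
    sample = "http://www.hqck.net/arc/jwbt/ckxx/2014/0804/12345.html"
    for url in build_page_urls(sample, 3):
        print(url)
    # In-memory streams stand in for the HTTP response and the output file
    source, target = io.BytesIO(b"x" * 25000), io.BytesIO()
    print(copy_stream(source, target))
```

Using a regular expression with a capture group avoids the byte-offset arithmetic around "共" and "页", and returning `None` instead of exiting lets the caller decide how to report a missing link or page count.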