Implementing a Newspaper-Reading Program in Python by Parsing Web Pages


Posted in Python on August 04, 2014

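The example in this article uses Python to read the image-based newspaper 《参考消息》 (Reference News): it parses the site's web pages and automatically downloads the current day's page images to a local folder for viewing. The implementation is as follows: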

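# coding=gbk
# The GBK source encoding matters: the Chinese byte-string literals used
# below are compared against the raw bytes of the downloaded pages.
import urllib2
import socket
import re
import time
import os

# Network timeout in seconds, applied to every urllib2 request.
timeout = 10
socket.setdefaulttimeout(timeout)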

home_url = "http://www.hqck.net"
home_page = ""
try:
  home_page_context = urllib2.urlopen(home_url)
  home_page = home_page_context.read()

  print "Read home page finished."
  print "-------------------------------------------------"
except urllib2.URLError, e:
  # A URLError only has a code when it is an HTTPError, so print the error itself.
  print e
  exit()
except Exception, e:
  print e
  exit()

reg_str = r'<a class="item-baozhi" href="/arc/jwbt/ckxx/\d{4}/\d{4}/\w+\.html" rel="external nofollow" ><span class.+>.+</span></a>'

news_url_reg = re.compile(reg_str)

today_cankao_news = news_url_reg.findall(home_page)

if len(today_cankao_news) == 0:
  print "Cannot find today's news!"
  exit()

my_news = today_cankao_news[0]
print "Latest news link = " + my_news
print

url_s = my_news.find("/arc/")
url_e = my_news.find(".html")
url_e = url_e + 5

print "Link index = [" + str(url_s) + "," + str(url_e) + "]"
my_news = my_news[url_s:url_e]
print "part url = " + my_news

full_news_url = home_url + my_news
print "full url = " + full_news_url
print

image_folder = "E:\\new_folder\\"

if not os.path.exists(image_folder):
  os.makedirs(image_folder)
# Store each day's pages in a subfolder named after the date, e.g. 2014-08-04.
today_num = time.strftime('%Y-%m-%d')
image_folder = image_folder + today_num + "\\"
if not os.path.exists(image_folder):
  os.makedirs(image_folder)
print "News image folder = " + image_folder
print

context_uri = full_news_url[0:-5]

first_page_url = context_uri + ".html"
try:
  first_page_context = urllib2.urlopen(first_page_url)
  first_page = first_page_context.read()
except urllib2.URLError, e:
  # URLError also covers HTTPError, so connection failures are handled too.
  print e
  exit()

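# The page count appears in the page text as "共N页" ("N pages in total").
tot_page_index = first_page.find("共")
if tot_page_index == -1:
  print "Cannot find the page count!"
  exit()

tmp_str = first_page[tot_page_index:tot_page_index+10]
end_s = tmp_str.find("页")
if end_s == -1:
  print "Cannot find the page count!"
  exit()

# Skip the two GBK bytes of "共" and keep the digits before "页".
page_num = tmp_str[2:end_s]
print page_num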

page_count = int(page_num)
print "Total " + page_num + " pages:"
print

page_index = 1
download_suc = True
while page_index <= page_count:
  page_url = context_uri
  if page_index > 1:
    page_url = page_url + "_" + str(page_index)
  page_url = page_url + ".html"
  print "News page link = " + page_url

  try:
    news_img_page_context = urllib2.urlopen(page_url)
  except urllib2.URLError, e:
    print e
    download_suc = False
    break
  
  news_img_page = news_img_page_context.read()

  #f = open("e:\\page.html", "w")
  #f.write(news_img_page)
  #f.close()

  reg_str = r'http://image\S+jpg'
  image_reg = re.compile(reg_str)
  image_results = image_reg.findall(news_img_page)
  if len(image_results) == 0:
    print "Cannot find the image on news page " + str(page_index) + "!"
    download_suc = False
    break
  
  image_url = image_results[0]

  print "News image url = " + image_url
  image_name = image_folder + "page_" + str(page_index) + ".jpg"
  print "Getting image..."
  try:
    news_image_context = urllib2.urlopen(image_url)
    imgf = open(image_name, 'wb')
    try:
      # Copy the image to disk in 10 KB chunks.
      while True:
        data = news_image_context.read(1024*10)
        if not data:
          break
        imgf.write(data)
    finally:
      # Close the file even if the download fails partway through.
      imgf.close()
  except Exception, e:
    download_suc = False
    print "Save image " + str(page_index) + " failed!"
    print "Unexpected error: " + str(e)
  else:
    print "Save image " + str(page_index) + " succeeded!"
    print
  page_index = page_index + 1

if download_suc:
  print "News download succeeded! Path = \"" + str(image_folder) + "\""
  print "Enjoy it! ^^"
else:
  print "News download failed!"
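
The three parsing steps in the script (finding the article link, reading the "共N页" page count, and building the per-page URLs) can be tried out without any network access. The following is only a minimal sketch, not the author's code: it assumes Python 3 and text that has already been decoded to `str`, and the names `extract_article_path`, `parse_page_count` and `build_page_urls` are hypothetical helpers introduced here for illustration. It also uses regular-expression capture groups where the script uses `find()` and slicing.

```python
import re

# Hypothetical helpers for illustration; none of these names appear in the script above.
# Assumes Python 3 and page text already decoded to str.
ARTICLE_PATH_RE = re.compile(r'href="(/arc/jwbt/ckxx/\d{4}/\d{4}/\w+\.html)"')
PAGE_COUNT_RE = re.compile(r'共\s*(\d+)\s*页')


def extract_article_path(home_html):
    """Return the first article path found in the home page text, or None."""
    match = ARTICLE_PATH_RE.search(home_html)
    return match.group(1) if match else None


def parse_page_count(article_html):
    """Return N from a '共N页' marker as an int, or None if no marker is found."""
    match = PAGE_COUNT_RE.search(article_html)
    return int(match.group(1)) if match else None


def build_page_urls(first_page_url, page_count):
    """Page 1 is 'x.html'; page n > 1 is 'x_n.html', the scheme the script uses."""
    base = first_page_url[:-len(".html")]
    urls = []
    for index in range(1, page_count + 1):
        suffix = "" if index == 1 else "_" + str(index)
        urls.append(base + suffix + ".html")
    return urls


if __name__ == "__main__":
    # Made-up sample markup, only to exercise the helpers.
    sample = '<a class="item-baozhi" href="/arc/jwbt/ckxx/2014/0804/abc123.html">x</a>'
    path = extract_article_path(sample)
    for url in build_page_urls("http://www.hqck.net" + path, 3):
        print(url)
```

Returning `None` when nothing matches lets the caller decide whether to stop, which mirrors the "Cannot find today's news!" check in the script.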