How to build a newspaper-reading program in Python by parsing web pages


Posted in Python on August 04, 2014

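The example in this article uses Python to fetch the image-based newspaper 参考消息 (Reference News) and automatically download the current day's page images to a local folder for reading. The implementation is as follows: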

# coding=gbk
import urllib2
import socket
import re
import time
import os

# Timeout in seconds, applied to all socket operations
timeout = 10
socket.setdefaulttimeout(timeout)

home_url = "http://www.hqck.net"
home_page = ""
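try:
  home_page_context = urllib2.urlopen(home_url)
  home_page = home_page_context.read()

  print "Read home page finished."
  print "-------------------------------------------------"
except urllib2.URLError, e:
  # A plain URLError has no .code attribute (only HTTPError does),
  # so print the exception itself
  print e
  exit()
except Exception, e:
  # Any other failure, e.g. a socket timeout while reading
  print e
  exit()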

reg_str = r'<a class="item-baozhi" href="/arc/jwbt/ckxx/\d{4}/\d{4}/\w+\.html" rel="external nofollow" ><span class.+>.+</span></a>'

news_url_reg = re.compile(reg_str)

today_cankao_news = news_url_reg.findall(home_page)

if len(today_cankao_news) == 0:
  print "Cannot find today's news!"
  exit()

my_news = today_cankao_news[0]
print "Latest news link = " + my_news
print

url_s = my_news.find("/arc/")
url_e = my_news.find(".html")
url_e = url_e + 5

print "Link index = [" + str(url_s) + "," + str(url_e) + "]"
my_news = my_news[url_s:url_e]
print "part url = " + my_news

full_news_url = home_url + my_news
print "full url = " + full_news_url
print

image_folder = "E:\\new_folder\\"

if not os.path.exists(image_folder):
  os.makedirs(image_folder)
# One sub-folder per day, named like 2014-08-04
today_num = time.strftime('%Y-%m-%d')
image_folder = image_folder + today_num + "\\"
if not os.path.exists(image_folder):
  os.makedirs(image_folder)
print "News image folder = " + image_folder
print

context_uri = full_news_url[0:-5]

first_page_url = context_uri + ".html"
try:
  first_page_context = urllib2.urlopen(first_page_url)
  first_page = first_page_context.read()
except urllib2.URLError, e:
  # URLError also covers HTTPError, so connection failures are handled too
  print e
  exit()

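# Locate the pager text "共N页" ("N pages in total")
tot_page_index = first_page.find("共")
if tot_page_index == -1:
  print "Cannot find the total page count!"
  exit()

tmp_str = first_page[tot_page_index:tot_page_index+10]
end_s = tmp_str.find("页")

# With "# coding=gbk", "共" is a 2-byte string, so the digits start at offset 2
page_num = tmp_str[2:end_s]
print page_num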

page_count = int(page_num)
print "Total " + page_num + " pages:"
print

page_index = 1
download_suc = True
while page_index <= page_count:
  page_url = context_uri
  if page_index > 1:
    page_url = page_url + "_" + str(page_index)
  page_url = page_url + ".html"
  print "News page link = " + page_url

  try:
    news_img_page_context = urllib2.urlopen(page_url)
  except urllib2.URLError,e:
    print e.reason
    download_suc = False
    break
  
  news_img_page = news_img_page_context.read()

  #f = open("e:\\page.html", "w")
  #f.write(news_img_page)
  #f.close()

  reg_str = r'http://image\S+jpg'
  image_reg = re.compile(reg_str)
  image_results = image_reg.findall(news_img_page)
  if len(image_results) == 0:
    print "Cannot find the image on news page " + str(page_index) + "!"
    download_suc = False
    break
  
  image_url = image_results[0]

  print "News image url = " + image_url

  image_name = image_folder + "page_" + str(page_index) + ".jpg"
  print "Getting image..."
  try:
    # Open the image URL inside the try block so network errors are caught too
    news_image_context = urllib2.urlopen(image_url)
    imgf = open(image_name, 'wb')
    try:
      # Copy the image to disk in 10 KB chunks
      while True:
        data = news_image_context.read(1024*10)
        if not data:
          break
        imgf.write(data)
    finally:
      # Close the file even if a read or write fails
      imgf.close()
  except Exception, e:
    download_suc = False
    print "Save image " + str(page_index) + " failed!"
    print "Unexpected error: " + str(e)
  else:
    print "Save image " + str(page_index) + " succeeded!"
    print
  page_index = page_index + 1

if download_suc:
  print "News download succeeded! Path = \"" + str(image_folder) + "\""
  print "Enjoy it! ^^"
else:
  print "News download failed!"
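
The parsing steps in the script above (finding the latest issue link, reading the page count from the "共N页" pager text, building the per-page URLs, and picking out the image URL) can be separated from the network code so they are easier to test. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used above. The function names and the simplified regular expressions are assumptions made for illustration and are not part of the original script. It also assumes the HTML has already been decoded to text.

```python
import re

HOME_URL = "http://www.hqck.net"  # same site root as in the script above

# Simplified patterns (assumptions): they only capture the parts the script uses.
LINK_RE = re.compile(r'href="(/arc/jwbt/ckxx/\d{4}/\d{4}/\w+\.html)"')
PAGE_COUNT_RE = re.compile(r'共\s*(\d+)\s*页')
IMAGE_RE = re.compile(r'http://image\S+?jpg')  # non-greedy variant of the original


def latest_issue_url(home_html):
    """Return the absolute URL of the first issue link, or None if absent."""
    match = LINK_RE.search(home_html)
    return HOME_URL + match.group(1) if match else None


def page_count(first_page_html):
    """Return N from the pager text "共N页", or 0 if the pager is missing."""
    match = PAGE_COUNT_RE.search(first_page_html)
    return int(match.group(1)) if match else 0


def page_urls(issue_url, count):
    """Page 1 is name.html; page i (i > 1) is name_i.html, as in the script."""
    base = issue_url[:-len(".html")]
    urls = []
    for i in range(1, count + 1):
        suffix = "" if i == 1 else "_" + str(i)
        urls.append(base + suffix + ".html")
    return urls


def first_image_url(page_html):
    """Return the first image URL found on a news page, or None."""
    match = IMAGE_RE.search(page_html)
    return match.group(0) if match else None
```

With helpers like these, the download loop only has to fetch each URL returned by page_urls and save the image found by first_image_url, and the parsing can be checked against saved HTML without touching the network.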