Crawling Duzhe (读者) Magazine with Python and Turning It into a PDF

Posted in Python on March 10, 2015

After learning a bit of BeautifulSoup, I wrote a small web crawler that fetches Duzhe (读者) magazine and uses reportlab to turn it into a PDF.
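
The two scripts in this post exchange a single nested structure: getCatalog returns a mapping from column name to a list of article dicts with the keys title, writer, from and context, and writePDF walks that mapping twice. As a rough illustration only (written for Python 3, with made-up sample values and a hypothetical helper named contents_lines that is not part of the original scripts), the shape and the contents-page pass over it look roughly like this:

```python
from collections import OrderedDict

# Made-up sample in the shape returned by getCatalog(); real values come from the site.
sample_issue = OrderedDict([
    ("Column A", [
        {"title": "First article", "writer": "Writer 1",
         "from": "Source 1", "context": "Body text 1"},
        {"title": "Second article", "writer": "Writer 2",
         "from": "Source 2", "context": "Body text 2"},
    ]),
    ("Column B", [
        {"title": "Third article", "writer": "Writer 3",
         "from": "Source 3", "context": "Body text 3"},
    ]),
])


def contents_lines(duzhe):
    """Return (column, [titles]) pairs in insertion order (hypothetical helper)."""
    return [(column, [article["title"] for article in articles])
            for column, articles in duzhe.items()]


for column, titles in contents_lines(sample_issue):
    print(column, titles)
```

An ordered mapping is used here so that columns keep the order in which they were crawled.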

crawler.py

#!/usr/bin/env python
#coding=utf-8
"""
    Author:         Anemone
    Filename:       getmain.py
    Last modified:  2015-02-19 16:47
    E-mail:         anemone@82flex.com
"""
import urllib2
from bs4 import BeautifulSoup
from collections import OrderedDict
import re
import sys

# Python 2 only: make utf-8 the default codec for implicit str/unicode conversions.
reload(sys)
sys.setdefaultencoding('utf-8')

MAX_RETRIES = 5  # give up on an article after this many failed attempts

def getEachArticle(url):
    """Download one article page and return its title, writer, source and body text."""
    # Example url: http://www.52duzhe.com/2015_01/duzh20150104.html
    response = urllib2.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html)
    title=soup.find("h1").string
    writer=soup.find(id="pub_date").string.strip()
    _from=soup.find(id="media_name").string.strip()
    text=soup.get_text()
    # Split the page text on the BAIDU_CLB...; marker and keep the part
    # right after the first match, which holds the article body.
    main=re.split("BAIDU_CLB.*;",text)
    result={"title":title,"writer":writer,"from":_from,"context":main[1]}
    return result

def getCatalog(issue):
    """Crawl one issue (e.g. "201501") and return {column name: [article dict, ...]}."""
    url="http://www.52duzhe.com/"+issue[:4]+"_"+issue[-2:]+"/"
    # OrderedDict keeps the columns in the order they appear on the site.
    duzhe=OrderedDict()
    # Follow the first link in the index page's first table; the page it
    # points to lists the columns and their articles.
    response = urllib2.urlopen(url+"index.html")
    html = response.read()
    soup=BeautifulSoup(html)
    firstUrl=url+soup.table.a.get("href")
    response = urllib2.urlopen(firstUrl)
    html = response.read()
    soup = BeautifulSoup(html)
    # Each <h2> is a column heading; the links under its parent are its articles.
    for heading in soup.find_all("h2"):
        print heading.string
        duzhe[heading.string]=list()
        for link in heading.parent.find_all("a"):
            href=url+link.get("href")
            print href
            # Retry failed downloads a bounded number of times instead of forever.
            article=None
            for attempt in range(MAX_RETRIES):
                try:
                    article=getEachArticle(href)
                    break
                except Exception:
                    continue  # try again
            if article is not None:
                duzhe[heading.string].append(article)
    return duzhe

def readDuZhe(duzhe):
    """Print the title of every crawled article."""
    for eachColumn in duzhe:
        for eachArticle in duzhe[eachColumn]:
            print eachArticle["title"]

if __name__ == '__main__':
#    issue=raw_input("issue(201501):")

    readDuZhe(getCatalog("201424"))
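
The URL construction and body extraction in crawler.py can be tried out without any network access. The following is a minimal sketch in Python 3 (the listing above is Python 2). The helper names issue_base_url and extract_body are hypothetical and not part of the original script, and the sample page text is made up. Unlike the original, the sketch rejects malformed issue codes, strips surrounding whitespace, and returns None when the marker is missing instead of raising IndexError.

```python
import re

BASE_URL = "http://www.52duzhe.com/"
# Same pattern the crawler splits on.
MARKER = re.compile(r"BAIDU_CLB.*;")


def issue_base_url(issue):
    """Build the issue directory URL from a six-digit code such as "201424"."""
    if not re.fullmatch(r"\d{6}", issue):
        raise ValueError("issue must be six digits, e.g. 201501")
    return "{0}{1}_{2}/".format(BASE_URL, issue[:4], issue[-2:])


def extract_body(page_text):
    """Return the text right after the first marker match, or None if absent."""
    parts = MARKER.split(page_text)
    if len(parts) < 2:
        return None
    return parts[1].strip()


# Made-up page text, only to show the split.
sample_text = "Site header\nBAIDU_CLB_sample('1');\nFirst line.\nSecond line.\n"
print(issue_base_url("201424"))
print(extract_body(sample_text))
```

Returning None for a page without the marker lets the caller skip that article rather than retry a page that will never parse.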

getpdf.py

#!/usr/bin/env python
#coding=utf-8
"""
    Author:         Anemone
    Filename:       writetopdf.py
    Last modified:  2015-02-20 19:19
    E-mail:         anemone@82flex.com
"""
import reportlab.rl_config
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.lib import fonts
import copy
from reportlab.platypus import Paragraph, SimpleDocTemplate,flowables
from reportlab.lib.styles import getSampleStyleSheet
import crawler

def writePDF(issue,duzhe):
    """Write the crawled issue to duzhe<issue>.pdf: a contents page, then each article on a new page."""
    reportlab.rl_config.warnOnMissingFontGlyphs = 0
    # simsun.ttc (SimSun) and msyh.ttc (Microsoft YaHei) are Windows fonts;
    # reportlab must be able to find both files.
    pdfmetrics.registerFont(TTFont('song',"simsun.ttc"))
    pdfmetrics.registerFont(TTFont('hei',"msyh.ttc"))
    # addMapping(family, bold, italic, font name): plain and italic text in
    # the 'song' family use 'song', bold text (<b>) uses 'hei'.
    fonts.addMapping('song', 0, 0, 'song')
    fonts.addMapping('song', 0, 1, 'song')
    fonts.addMapping('song', 1, 0, 'hei')
    fonts.addMapping('song', 1, 1, 'hei')
    stylesheet=getSampleStyleSheet()
    # Body text
    normalStyle = copy.deepcopy(stylesheet['Normal'])
    normalStyle.fontName ='song'
    normalStyle.fontSize = 11
    normalStyle.leading = 11
    normalStyle.firstLineIndent = 20
    # Column and article titles
    titleStyle = copy.deepcopy(stylesheet['Normal'])
    titleStyle.fontName ='song'
    titleStyle.fontSize = 15
    titleStyle.leading = 20
    # Issue title on the first page
    firstTitleStyle = copy.deepcopy(stylesheet['Normal'])
    firstTitleStyle.fontName ='song'
    firstTitleStyle.fontSize = 20
    firstTitleStyle.leading = 20
    firstTitleStyle.firstLineIndent = 50
    # Writer and source line under each article title
    smallStyle = copy.deepcopy(stylesheet['Normal'])
    smallStyle.fontName ='song'
    smallStyle.fontSize = 8
    smallStyle.leading = 8
    story = []
    # Contents page: issue title, then every column with its article titles.
    story.append(Paragraph("<b>读者{0}期</b>".format(issue), firstTitleStyle))
    for eachColumn in duzhe:
        story.append(Paragraph('__'*28, titleStyle))
        story.append(Paragraph('<b>{0}</b>'.format(eachColumn), titleStyle))
        for eachArticle in duzhe[eachColumn]:
            story.append(Paragraph(eachArticle["title"],normalStyle))
    story.append(flowables.PageBreak())
    # Articles: title, writer and source, body paragraphs, then a page break.
    for eachColumn in duzhe:
        for eachArticle in duzhe[eachColumn]:
            story.append(Paragraph("<b>{0}</b>".format(eachArticle["title"]),titleStyle))
            story.append(Paragraph(" {0}  {1}".format(eachArticle["writer"],eachArticle["from"]),smallStyle))
            # One Paragraph per line of body text. The published listing had
            # split(""), which raises ValueError (empty separator); splitting
            # on line breaks is assumed to be what was intended.
            para=eachArticle["context"].splitlines()
            for eachPara in para:
                story.append(Paragraph(eachPara,normalStyle))
            story.append(flowables.PageBreak())
    doc = SimpleDocTemplate("duzhe"+issue+".pdf")
    print "Writing PDF..."
    doc.build(story)

def main(issue):
    duzhe=crawler.getCatalog(issue)
    writePDF(issue,duzhe)

if __name__ == '__main__':
    issue=raw_input("Enter issue(201501):")

    main(issue)
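
One fragile spot in writePDF is that the article text goes straight into Paragraph, which treats its text as XML-style markup (that is why the b tags in the titles work). Body text containing a literal & or < can therefore be misread as markup. A possible pre-processing step, sketched here in Python 3 with a hypothetical helper named to_paragraph_markup (not part of the original script), splits the body on line breaks, drops blank lines, and escapes the special characters with the standard library:

```python
from xml.sax.saxutils import escape


def to_paragraph_markup(body):
    """Split body text into non-empty lines and escape &, < and > in each one."""
    paragraphs = []
    for line in body.splitlines():
        line = line.strip()
        if line:  # skip blank lines so they do not become empty paragraphs
            paragraphs.append(escape(line))
    return paragraphs


# Made-up body text, only to show the transformation.
sample_body = "  Tom & Jerry \n\n1 < 2\n"
for paragraph in to_paragraph_markup(sample_body):
    print(paragraph)
```

Each returned string could then be passed to Paragraph with normalStyle in place of the raw line. Titles and writer names would need the same escaping before being wrapped in b tags.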

That is all for this article. I hope you find it useful.

- Author -

hebedich
