A Targeted Python Crawler for Campus Forum Post Information


Posted in Python on July 23, 2018

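Introduction

I wrote this small crawler mainly to collect internship postings from the campus forum. It is built mainly on the Requests library, with lxml handling the HTML parsing.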

Source code

URLs.py

Given an initial URL that contains a page-number parameter, this module returns the list of URLs from the current page number up to pageNum.

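import re

def getURLs(url, attr, pageNum=1):
  # Build the URLs for pages <current page>..pageNum by rewriting "<attr>=<n>" in url
  all_links = []
  try:
    pattern = re.escape(attr) + r'=(\d+)'
    now_page_number = int(re.search(pattern, url).group(1))
    for i in range(now_page_number, pageNum + 1):
      new_url = re.sub(pattern, '%s=%d' % (attr, i), url)
      all_links.append(new_url)
  except TypeError:
    print("arguments TypeError: attr should be a string.")
  except AttributeError:
    # re.search returned None: url has no "<attr>=<number>" parameter
    print("no '%s=<number>' parameter found in url." % attr)
  return all_links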
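
As an alternative to regex substitution, the page parameter can also be rewritten by parsing the query string. The following is a minimal Python 3 sketch using only the standard library. The function build_page_urls and its parameter names are hypothetical and are not part of the original URLs.py; it assumes the parameter appears exactly once with an integer value.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def build_page_urls(url, attr, last_page):
    """Return URLs for pages current..last_page by rewriting one query parameter.

    Hypothetical helper (not from the original module); it assumes `attr`
    appears exactly once in the query string with an integer value.
    """
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as "action="
    query = parse_qsl(parts.query, keep_blank_values=True)
    current = int(dict(query)[attr])

    urls = []
    for page in range(current, last_page + 1):
        new_query = [(k, str(page) if k == attr else v) for k, v in query]
        urls.append(urlunsplit(parts._replace(query=urlencode(new_query))))
    return urls


if __name__ == "__main__":
    start = "http://www.cc98.org/list.asp?boardid=459&page=1&action="
    for u in build_page_urls(start, "page", 3):
        print(u)
```

Because the query string is parsed rather than pattern-matched, a parameter whose name merely ends in "page" is left untouched.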

uni_2_native.py

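Chinese text in the pages fetched from the forum appears as numeric character references of the form &#XXXX; (the decimal Unicode code point of each character), so the downloaded content has to be converted back to readable characters.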

import re

try:
  unichr  # Python 2 built-in
except NameError:
  unichr = chr  # Python 3 folded unichr into chr

# Decimal numeric character references such as &#20013;
NUMERIC_REF = re.compile(r'&#(\d+);')

def get_native(raw):
  # Replace every reference in a single pass; the callback returns the
  # character for the captured code point
  return NUMERIC_REF.sub(lambda m: unichr(int(m.group(1))), raw)
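
For comparison, Python 3's standard library can decode these references directly with html.unescape, which also handles hexadecimal references and named entities that a decimal-only conversion does not. A minimal sketch, assuming Python 3; the sample string is made up for illustration and is not taken from the forum.

```python
import html

# Made-up sample: two decimal references, one hexadecimal reference,
# and one named entity
sample = "&#23454;&#20064; &#x4E2D; &amp; cc98"

decoded = html.unescape(sample)
print(decoded)
```

On Python 3, a call like this could stand in for get_native, assuming the page contains only standard HTML character references.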

Saving to a MySQL database: saveInfo.py

# -*- coding: utf-8 -*-

import MySQLdb


class saveSqlite():
  # Collects post records in memory and writes them to MySQL in one batch
  # (the class name mentions SQLite, but the storage backend is MySQL)
  def __init__(self):
    self.infoList = []

  def saveSingle(self, author=None, title=None, date=None, url=None, reply=0, view=0):
    # Queue one post; author, title, date and url are required
    if author is None or title is None or date is None or url is None:
      print("No info saved!")
    else:
      self.infoList.append({
        'author': author,
        'title': title,
        'date': date,
        'url': url,
        'reply': reply,
        'view': view,
      })

  def toMySQL(self):
    # Replace the connection settings (user, password, database name) with your own
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
    cursor = conn.cursor()
    try:
      # Clear the previous crawl so the table only holds the latest results
      cursor.execute("delete from info")
      conn.commit()

      sql = "insert into info(title,author,url,date,reply,view) values (%s,%s,%s,%s,%s,%s)"
      params = []
      for each in self.infoList:
        params.append((each['title'], each['author'], each['url'], each['date'], each['reply'], each['view']))
      cursor.executemany(sql, params)
      conn.commit()
    finally:
      # Release the cursor and connection even if a statement fails
      cursor.close()
      conn.close()


  def show(self):
    # Print every queued record, for debugging
    for each in self.infoList:
      print("author: " + each['author'])
      print("title: " + each['title'])
      print("date: " + each['date'])
      print("url: " + each['url'])
      print("reply: " + str(each['reply']))
      print("view: " + str(each['view']))
      print('\n')

if __name__ == '__main__':
  save = saveSqlite()
  save.saveSingle('网','aaa','2008-10-10 10:10:10','www.baidu.com',1,1)
  # save.show()
  save.toMySQL()
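
If a MySQL server is not available, the same records could be stored with Python's built-in sqlite3 module. This is a minimal sketch under that assumption, not the author's implementation. The function save_to_sqlite, the in-memory database, and the column types are illustrative choices; only the column names are taken from the insert statement above.

```python
import sqlite3


def save_to_sqlite(conn, records):
    """Replace the contents of table `info` with `records` (a list of dicts).

    Hypothetical helper; the schema is assumed and mirrors the columns
    used by the MySQL insert statement.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            'CREATE TABLE IF NOT EXISTS info ('
            'title TEXT, author TEXT, url TEXT, date TEXT, '
            'reply INTEGER, "view" INTEGER)'
        )
        conn.execute('DELETE FROM info')
        # Named ':key' placeholders are filled from each record dict
        conn.executemany(
            'INSERT INTO info (title, author, url, date, reply, "view") '
            'VALUES (:title, :author, :url, :date, :reply, :view)',
            records,
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_to_sqlite(conn, [{
        "title": "aaa", "author": "网", "url": "www.baidu.com",
        "date": "2008-10-10 10:10:10", "reply": 1, "view": 1,
    }])
    print(conn.execute("SELECT title, author, reply FROM info").fetchall())
    conn.close()
```

The column name view is quoted because VIEW is an SQL keyword. Passing a file path instead of ":memory:" to sqlite3.connect would persist the data to disk.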

Main crawler code

import requests
from lxml import etree
from cc98 import uni_2_native, URLs, saveInfo

# Fill in request headers that imitate a browser visiting the site you want to crawl
headers ={
  'Accept': '',
  'Accept-Encoding': '',
  'Accept-Language': '',
  'Connection': '',
  'Cookie': '',
  'Host': '',
  'Referer': '',
  'Upgrade-Insecure-Requests': '',
  'User-Agent': ''
}
url = 'http://www.cc98.org/list.asp?boardid=459&page=1&action='
cc98 = 'http://www.cc98.org/'

print("get information from cc98...")

urls = URLs.getURLs(url, "page", 50)
savetools = saveInfo.saveSqlite()

for url in urls:
  r = requests.get(url, headers=headers)
  html = uni_2_native.get_native(r.text)

  selector = etree.HTML(html)
  content_tr_list = selector.xpath('//form/table[@class="tableborder1 list-topic-table"]/tbody/tr')

  for each in content_tr_list:
    # xpath() always returns a list, so check that each one is non-empty before indexing
    href = each.xpath('./td[2]/a/@href')
    title_author_time = each.xpath('./td[2]/a/@title')
    hot = each.xpath('./td[4]/text()')
    if not (href and title_author_time and hot):
      continue

    link = cc98 + href[0]

    # The title attribute holds three lines: title, author and post time
    info_split = title_author_time[0].split('\n')
    # The cell text has the form "replies/views"
    reply_view = hot[0].strip().split('/')
    if len(info_split) < 3 or len(reply_view) < 2:
      continue

    title = info_split[0][1:-1]  # drop the first and last character
    author = info_split[1][3:]   # skip the 3-character prefix
    date = info_split[2][3:]
    reply, view = reply_view[0], reply_view[1]
    savetools.saveSingle(author=author, title=title, date=date, url=link, reply=reply, view=view)

print("All got! Now saving to Database...")
# savetools.show()
savetools.toMySQL()
print("ALL CLEAR! Have Fun!")
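
The string handling inside the loop can be isolated into small functions that are easy to test without network access. This is a sketch under assumptions: parse_topic_title and parse_reply_view are hypothetical names, and the sample strings only imitate the layout implied by the slicing in the crawler (a first line wrapped in one character on each side, then two lines with a three-character prefix). The real attribute text on the forum may differ.

```python
def parse_topic_title(info):
    """Split a three-line title attribute into (title, author, date).

    Mirrors the crawler's slicing: the first line loses its first and last
    character, and the next two lines lose a three-character prefix.
    Returns None when fewer than three lines are present.
    """
    lines = info.split('\n')
    if len(lines) < 3:
        return None
    return lines[0][1:-1], lines[1][3:], lines[2][3:]


def parse_reply_view(cell_text):
    """Split a "replies/views" cell into two integers, or None if malformed."""
    parts = cell_text.strip().split('/')
    if len(parts) != 2:
        return None
    try:
        return int(parts[0]), int(parts[1])
    except ValueError:
        return None


if __name__ == "__main__":
    # Made-up sample data, for illustration only
    sample = "《Summer internship》\n作者:alice\n时间:2018-07-23 10:00:00"
    print(parse_topic_title(sample))
    print(parse_reply_view(" 12/345 "))
```

Unlike the crawler, which stores the reply and view counts as strings, this sketch converts them to integers so that malformed cells are detected early.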

That is all for this article. I hope it is helpful for your studies.
