A Targeted Python Crawler for Campus Forum Post Information


Posted in Python on July 23, 2018

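Introduction

I wrote this small crawler mainly to scrape internship postings from a campus forum. It relies mostly on the Requests library, with lxml for parsing the pages. The code below is written for Python 2.

Source code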

URLs.py

Given an initial URL that contains a page parameter, this module builds the list of page URLs from the current page number up to pageNum.

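import re

def getURLs(url, attr, pageNum=1):
  # Return the URLs for pages from the one named in url up to pageNum (inclusive).
  all_links = []
  try:
    # Raw strings keep \d from being read as a string escape.
    now_page_number = int(re.search(attr + r'=(\d+)', url).group(1))
    for i in range(now_page_number, pageNum + 1):
      # re.sub takes count as its fourth positional argument, so no flags are passed here.
      new_url = re.sub(attr + r'=\d+', attr + '=%s' % i, url)
      all_links.append(new_url)
    return all_links
  except TypeError:
    print "arguments TypeError: attr should be a string."
    # Return an empty list so that callers can still iterate over the result.
    return []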
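
As an alternative to regex substitution, the page parameter can also be rewritten by parsing the query string. The following is a minimal Python 3 sketch, not the article's implementation. The name page_urls is hypothetical, and the sketch assumes the parameter is already present in the URL.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def page_urls(url, attr, last_page):
    """Return URLs for pages from the one in `url` up to last_page (inclusive)."""
    parts = urlsplit(url)
    # keep_blank_values keeps parameters such as "action=" that have no value
    query = parse_qsl(parts.query, keep_blank_values=True)
    current = int(dict(query)[attr])  # raises KeyError if attr is missing
    links = []
    for page in range(current, last_page + 1):
        new_query = [(k, str(page) if k == attr else v) for k, v in query]
        links.append(urlunsplit(parts._replace(query=urlencode(new_query))))
    return links

for link in page_urls('http://www.cc98.org/list.asp?boardid=459&page=1&action=', 'page', 3):
    print(link)
```

Parsing the query string means a parameter whose name merely ends in "page" is not rewritten by accident, which a plain regex on the raw URL cannot guarantee.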

uni_2_native.py

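The Chinese text in pages fetched from the forum arrives as numeric character references of the form &#XXXX;, where XXXX is the decimal Unicode code point. The page content therefore has to be converted back to readable characters after it is downloaded.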

import sys
import re

# Python 2 only: make utf-8 the default codec for implicit str/unicode conversions.
reload(sys)
sys.setdefaultencoding('utf-8')

def get_native(raw):
  # Replace every decimal reference such as &#20013; with its character in one pass.
  # Matching digits only keeps int() from failing on other text between &# and ;.
  return re.sub(r'&#(\d+);', lambda m: unichr(int(m.group(1))), raw)
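
For readers on Python 3, the same conversion can be sketched as follows. This is an illustration rather than the article's code: decode_numeric_refs is a hypothetical name, and the sample strings are made up. The standard library's html.unescape covers the same case and also handles hexadecimal and named references.

```python
import html
import re

def decode_numeric_refs(text):
    """Replace decimal references like &#20013; with their characters in one pass."""
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)

sample = '&#20013;&#25991; forum'
print(decode_numeric_refs(sample))

# html.unescape also handles hexadecimal and named references
print(html.unescape('&#x4e2d;&#25991; &amp; more'))
```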

Saving to a MySQL database: saveInfo.py

Despite the class name saveSqlite, the code connects to MySQL through MySQLdb.

# -*- coding: utf-8 -*-

import MySQLdb


class saveSqlite():
  def __init__(self):
    self.infoList = []

  def saveSingle(self, author=None, title=None, date=None, url=None,reply=0, view=0):
    if author is None or title is None or date is None or url is None:
      print "No info saved!"
    else:
      singleDict = {}
      singleDict['author'] = author
      singleDict['title'] = title
      singleDict['date'] = date
      singleDict['url'] = url
      singleDict['reply'] = reply
      singleDict['view'] = view
      self.infoList.append(singleDict)

  def toMySQL(self):
    # The connection settings are placeholders; adjust them for your own server.
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
    cursor = conn.cursor()
    try:
      # sql = "select * from info"
      # n = cursor.execute(sql)
      # for row in cursor.fetchall():
      #   for r in row:
      #     print r
      #   print '\n'

      # Clear the table so that each run stores a fresh snapshot.
      sql = "delete from info"
      cursor.execute(sql)
      conn.commit()

      # Parameterized insert: the driver escapes the values.
      sql = "insert into info(title,author,url,date,reply,view) values (%s,%s,%s,%s,%s,%s)"
      params = []
      for each in self.infoList:
        params.append((each['title'], each['author'], each['url'], each['date'], each['reply'], each['view']))
      cursor.executemany(sql, params)
      conn.commit()
    finally:
      # Release the cursor and connection even if a statement fails.
      cursor.close()
      conn.close()


  def show(self):
    for each in self.infoList:
      print "author: "+each['author']
      print "title: "+each['title']
      print "date: "+each['date']
      print "url: "+each['url']
      print "reply: "+str(each['reply'])
      print "view: "+str(each['view'])
      print '\n'

if __name__ == '__main__':
  save = saveSqlite()
  save.saveSingle('网','aaa','2008-10-10 10:10:10','www.baidu.com',1,1)
  # save.show()
  save.toMySQL()
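
For readers without a MySQL server at hand, the same clear-then-bulk-insert flow can be sketched with the standard-library sqlite3 module. This is a Python 3 illustration under stated assumptions: the table schema and the function name save_posts are hypothetical, since the article does not show how the info table was created.

```python
import sqlite3

# Hypothetical schema: the article never shows how the info table was created.
SCHEMA = """
create table if not exists info (
  title  text,
  author text,
  url    text,
  date   text,
  reply  integer,
  "view" integer
)
"""

INSERT = 'insert into info(title, author, url, date, reply, "view") values (?,?,?,?,?,?)'

def save_posts(conn, posts):
    """Replace the contents of the info table with the given post dicts."""
    rows = [(p['title'], p['author'], p['url'], p['date'], p['reply'], p['view'])
            for p in posts]
    with conn:  # commits on success, rolls back if a statement raises
        conn.execute(SCHEMA)
        conn.execute("delete from info")
        conn.executemany(INSERT, rows)  # sqlite3 uses ? placeholders, not %s

conn = sqlite3.connect(':memory:')  # in-memory database, for demonstration only
save_posts(conn, [{'title': 'aaa', 'author': 'bbb', 'url': 'www.baidu.com',
                   'date': '2008-10-10 10:10:10', 'reply': 1, 'view': 1}])
print(conn.execute("select count(*) from info").fetchone()[0])
```

The column name view is quoted because VIEW is an SQL keyword.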

Main crawler code

import requests
from lxml import etree
from cc98 import uni_2_native, URLs, saveInfo

# Forge request headers for the site you want to crawl; fill in the values from your own browser session
headers ={
  'Accept': '',
  'Accept-Encoding': '',
  'Accept-Language': '',
  'Connection': '',
  'Cookie': '',
  'Host': '',
  'Referer': '',
  'Upgrade-Insecure-Requests': '',
  'User-Agent': ''
}
url = 'http://www.cc98.org/list.asp?boardid=459&page=1&action='
cc98 = 'http://www.cc98.org/'

print "get information from cc98..."

urls = URLs.getURLs(url, "page", 50)
savetools = saveInfo.saveSqlite()

for url in urls:
  r = requests.get(url, headers=headers)
  html = uni_2_native.get_native(r.text)

  selector = etree.HTML(html)
  content_tr_list = selector.xpath('//form/table[@class="tableborder1 list-topic-table"]/tbody/tr')

  for each in content_tr_list:
    href = each.xpath('./td[2]/a/@href')
    title_author_time = each.xpath('./td[2]/a/@title')
    # Skip rows without a topic link or title attribute, such as header rows.
    # Checking before indexing avoids IndexError on empty xpath results and
    # keeps values from a previous row from being saved again.
    if not href or not title_author_time:
      continue
    link = cc98 + href[0]

    # The title attribute is expected to hold three lines: title, author, date.
    info_split = title_author_time[0].split('\n')
    if len(info_split) < 3:
      continue
    title = info_split[0][1:-1]  # drop the first and last character around the title
    author = info_split[1][3:]   # drop the 3-character label prefix
    date = info_split[2][3:]

    # td[4] holds "reply/view"; fall back to 0 when it is missing or malformed.
    reply, view = 0, 0
    for hot_num in each.xpath('./td[4]/text()'):
      reply_view = hot_num.strip().split('/')
      if len(reply_view) >= 2:
        reply, view = reply_view[0], reply_view[1]
    savetools.saveSingle(author=author, title=title, date=date, url=link, reply=reply, view=view)

print "All got! Now saving to Database..."
# savetools.show()
savetools.toMySQL()
print "ALL CLEAR! Have Fun!"
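
The string handling inside the loop can be isolated into two small functions, which makes it easy to check without touching the network. The following Python 3 sketch is an illustration: the function names and sample values are hypothetical, and the layout of the title attribute (a wrapped title, then two lines with 3-character labels) is inferred from the slicing in the code above rather than confirmed against the forum.

```python
def parse_title_attr(info):
    """Split a three-line title attribute into (title, author, date)."""
    lines = info.split('\n')
    if len(lines) < 3:
        return None
    title = lines[0][1:-1]   # strip the one-character wrapper on each side
    author = lines[1][3:]    # strip a 3-character label such as "作者:"
    date = lines[2][3:]
    return title, author, date

def parse_hot(text):
    """Split "reply/view" into two integers, or return (0, 0) if malformed."""
    parts = text.strip().split('/')
    if len(parts) != 2 or not all(p.strip().isdigit() for p in parts):
        return 0, 0
    return int(parts[0]), int(parts[1])

# Hypothetical sample values, not copied from the forum
print(parse_title_attr('《Internship at X》\n作者:alice\n时间:2018-07-23 10:00:00'))
print(parse_hot(' 12/345 '))
```

Returning None or (0, 0) for malformed input lets the caller decide whether to skip the row, instead of failing partway through a 50-page crawl.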

That is all for this article. I hope it helps with your studies.
