Python targeted crawler for campus forum post information


Posted in Python on July 23, 2018

Introduction

This small crawler was written mainly to scrape internship postings from the campus forum, and it relies primarily on the Requests library.

Source code

URLs.py

Its main job is to take an initial URL (containing a page parameter) and build the list of URLs from the current page number up to pageNum; a usage sketch follows the code.

import re

def getURLs(url, attr, pageNum=1):
  all_links = []
  try:
    # extract the current page number from the initial URL, e.g. "page=1" -> 1
    now_page_number = int(re.search(attr + r'=(\d+)', url).group(1))
    for i in range(now_page_number, pageNum + 1):
      # substitute the page number to build the URL of page i
      # (note: re.sub's 4th positional argument is count, not flags, so re.S is not passed here)
      new_url = re.sub(attr + r'=\d+', attr + '=%d' % i, url)
      all_links.append(new_url)
    return all_links
  except TypeError:
    print "arguments TypeError: attr should be a string."

uni_2_native.py

Because the Chinese text in the pages fetched from the forum arrives as numeric HTML entities of the form &#XXXX;, the content has to be converted back to readable characters after crawling; a small sanity check follows the code.

import sys
import re

# Python 2 hack so that mixing UTF-8 byte strings and unicode does not raise UnicodeDecodeError
reload(sys)
sys.setdefaultencoding('utf-8')

def get_native(raw):
  tostring = raw
  while True:
    # find the next numeric HTML entity, e.g. &#20013;
    obj = re.search('&#(.*?);', tostring, flags=re.S)
    if obj is None:
      break
    else:
      # group(0) is the whole entity, group(1) is its decimal code point
      raw, code = obj.group(0), obj.group(1)
      tostring = re.sub(raw, unichr(int(code)), tostring)
  return tostring
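
A quick sanity check of get_native (the entity string below is made up for illustration; 20013 and 25991 are the decimal code points of 中 and 文):

print get_native(u'hello &#20013;&#25991;')
# hello 中文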

Saving to the database: saveInfo.py (note: despite the class name saveSqlite, the data is actually written to MySQL through MySQLdb)

# -*- coding: utf-8 -*-

import MySQLdb


class saveSqlite():
  def __init__(self):
    self.infoList = []

  def saveSingle(self, author=None, title=None, date=None, url=None,reply=0, view=0):
    if author is None or title is None or date is None or url is None:
      print "No info saved!"
    else:
      singleDict = {}
      singleDict['author'] = author
      singleDict['title'] = title
      singleDict['date'] = date
      singleDict['url'] = url
      singleDict['reply'] = reply
      singleDict['view'] = view
      self.infoList.append(singleDict)

  def toMySQL(self):
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
    cursor = conn.cursor()
    # sql = "select * from info"
    # n = cursor.execute(sql)
    # for row in cursor.fetchall():
    #   for r in row:
    #     print r
    #   print '\n'
    sql = "delete from info"
    cursor.execute(sql)
    conn.commit()

    sql = "insert into info(title,author,url,date,reply,view) values (%s,%s,%s,%s,%s,%s)"
    params = []
    for each in self.infoList:
      params.append((each['title'], each['author'], each['url'], each['date'], each['reply'], each['view']))
    cursor.executemany(sql, params)

    conn.commit()
    cursor.close()
    conn.close()


  def show(self):
    for each in self.infoList:
      print "author: "+each['author']
      print "title: "+each['title']
      print "date: "+each['date']
      print "url: "+each['url']
      print "reply: "+str(each['reply'])
      print "view: "+str(each['view'])
      print '\n'

if __name__ == '__main__':
  save = saveSqlite()
  save.saveSingle('网','aaa','2008-10-10 10:10:10','www.baidu.com',1,1)
  # save.show()
  save.toMySQL()
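
toMySQL assumes that the target database already contains a table named info with the six columns used in the INSERT statement. A possible one-off setup sketch (the column types here are my assumptions, not taken from the article, so adjust them to your data):

# run once to create the table; column types are assumptions
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS info (
        `title`  VARCHAR(255),
        `author` VARCHAR(64),
        `url`    VARCHAR(255),
        `date`   VARCHAR(32),
        `reply`  INT,
        `view`   INT
    )
""")
conn.commit()
cursor.close()
conn.close()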

Main crawler code

import requests
from lxml import etree
from cc98 import uni_2_native, URLs, saveInfo

# forge a request header to match the site you are crawling
headers = {
  'Accept': '',
  'Accept-Encoding': '',
  'Accept-Language': '',
  'Connection': '',
  'Cookie': '',
  'Host': '',
  'Referer': '',
  'Upgrade-Insecure-Requests': '',
  'User-Agent': ''
}
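# (note, not from the original article: in practice at least a real 'User-Agent' is usually
#  required, plus a valid login 'Cookie' if the board is only visible to logged-in users;
#  the empty values above are placeholders to be copied from your own browser's request)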
url = 'http://www.cc98.org/list.asp?boardid=459&page=1&action='
cc98 = 'http://www.cc98.org/'

print "get infomation from cc98..."

urls = URLs.getURLs(url, "page", 50)
savetools = saveInfo.saveSqlite()

for url in urls:
  r = requests.get(url, headers=headers)
  html = uni_2_native.get_native(r.text)

  selector = etree.HTML(html)
  content_tr_list = selector.xpath('//form/table[@class="tableborder1 list-topic-table"]/tbody/tr')

  for each in content_tr_list:
    href = each.xpath('./td[2]/a/@href')
    if len(href) == 0:
      continue
    else:
      # print len(href)
      # using a for loop over a single-element list is not ideal,
      # but indexing the list directly did not work here
      for each_href in href:
        link = cc98 + each_href
      title_author_time = each.xpath('./td[2]/a/@title')

      # print len(title_author_time)
      for info in title_author_time:
        info_split = info.split('\n')
        title = info_split[0][1:len(info_split[0])-1]
        author = info_split[1][3:]
        date = info_split[2][3:]

      hot = each.xpath('./td[4]/text()')
      # print len(hot)
      for hot_num in hot:
        reply_view = hot_num.strip().split('/')
        reply, view = reply_view[0], reply_view[1]
      savetools.saveSingle(author=author, title=title, date=date, url=link, reply=reply, view=view)

print "All got! Now saving to Database..."
# savetools.show()
savetools.toMySQL()
print "ALL CLEAR! Have Fun!"

That is all for this article. I hope it helps with your learning, and please keep supporting 三水点靠木.
