Targeted crawling of campus forum post information with Python


Posted in Python on July 23, 2018

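Introduction

I wrote this small crawler mainly to collect internship postings from a campus forum. It relies primarily on the Requests library.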

Source code

URLs.py

Given an initial URL that contains a page-number parameter, this module returns the list of page URLs running from the current page number up to pageNum.

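import re

def getURLs(url, attr, pageNum=1):
  # Return the page URLs from the page number found in url up to pageNum.
  all_links = []
  try:
    pattern = re.escape(attr) + r'=(\d+)'
    match = re.search(pattern, url)
    if match is None:
      print "getURLs: parameter not found in url."
      return all_links
    now_page_number = int(match.group(1))
    for i in range(now_page_number, pageNum + 1):
      new_url = re.sub(pattern, '%s=%d' % (attr, i), url)
      all_links.append(new_url)
  except TypeError:
    print "arguments TypeError: attr should be a string."
  return all_links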
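As a quick illustration of the pagination logic, here is a self-contained Python 3 sketch of the same idea. The name build_page_urls is hypothetical and not part of the original module; the URL is the forum list URL used later in this post.

```python
import re

def build_page_urls(url, attr, last_page):
    """Return the URLs from the page number found in `url` up to `last_page`."""
    pattern = re.escape(attr) + r'=(\d+)'
    match = re.search(pattern, url)
    if match is None:
        # The parameter is missing, so there is nothing to paginate.
        return []
    first_page = int(match.group(1))
    return [
        re.sub(pattern, lambda m, n=n: '%s=%d' % (attr, n), url)
        for n in range(first_page, last_page + 1)
    ]

urls = build_page_urls(
    'http://www.cc98.org/list.asp?boardid=459&page=1&action=', 'page', 3)
for u in urls:
    print(u)
```

Because the pattern is anchored on the parameter name, other numeric parameters such as boardid are left untouched.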

uni_2_native.py

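The Chinese text in pages fetched from the forum arrives as numeric character references of the form &#XXXX; (Unicode code points), so the page content has to be converted back to native characters after it is downloaded.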

import sys
import re
# Python 2 only: make utf-8 the default codec for implicit str/unicode conversion.
reload(sys)
sys.setdefaultencoding('utf-8')

def get_native(raw):
  # Replace every decimal reference such as &#20013; with the character it encodes.
  return re.sub(r'&#(\d+);', lambda m: unichr(int(m.group(1))), raw)
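For readers on Python 3, where unichr and sys.setdefaultencoding no longer exist, the same conversion can be sketched as below. This is an alternative under that assumption, not the author's code, and decode_refs is a hypothetical name. The standard library's html.unescape covers the same case and also handles hexadecimal and named references.

```python
import html
import re

def decode_refs(raw):
    """Replace decimal references such as &#20013; with their characters."""
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), raw)

sample = '&#20013;&#25991; internship'
decoded = decode_refs(sample)

# Both approaches should agree on purely decimal references.
print(decoded == html.unescape(sample))
```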

Saving to the database: saveInfo.py (despite the class name saveSqlite, the code writes to MySQL through MySQLdb)

# -*- coding: utf-8 -*-

import MySQLdb


class saveSqlite():
  def __init__(self):
    self.infoList = []

  def saveSingle(self, author=None, title=None, date=None, url=None,reply=0, view=0):
    if author is None or title is None or date is None or url is None:
      print "No info saved!"
    else:
      singleDict = {}
      singleDict['author'] = author
      singleDict['title'] = title
      singleDict['date'] = date
      singleDict['url'] = url
      singleDict['reply'] = reply
      singleDict['view'] = view
      self.infoList.append(singleDict)

  def toMySQL(self):
    # Connection settings are placeholders; adjust user, passwd and db for your setup.
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
    cursor = conn.cursor()
    try:
      # Clear the table, then insert the collected rows in the same transaction.
      cursor.execute("delete from info")
      sql = "insert into info(title,author,url,date,reply,view) values (%s,%s,%s,%s,%s,%s)"
      params = []
      for each in self.infoList:
        params.append((each['title'], each['author'], each['url'], each['date'], each['reply'], each['view']))
      if params:
        cursor.executemany(sql, params)
      conn.commit()
    except MySQLdb.Error:
      conn.rollback()
      raise
    finally:
      cursor.close()
      conn.close()


  def show(self):
    for each in self.infoList:
      print "author: "+each['author']
      print "title: "+each['title']
      print "date: "+each['date']
      print "url: "+each['url']
      print "reply: "+str(each['reply'])
      print "view: "+str(each['view'])
      print '\n'

if __name__ == '__main__':
  save = saveSqlite()
  save.saveSingle('网','aaa','2008-10-10 10:10:10','www.baidu.com',1,1)
  # save.show()
  save.toMySQL()
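The class above is named saveSqlite but talks to MySQL. If a file-based SQLite database is actually wanted, the same clear-then-insert step could look roughly like the Python 3 sketch below, which uses the standard-library sqlite3 module. The function name save_to_sqlite, the column types, and the in-memory database are assumptions for illustration; only the table and column names come from the code above.

```python
import sqlite3

def save_to_sqlite(conn, info_list):
    """Replace the contents of the info table with the rows in info_list."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            'CREATE TABLE IF NOT EXISTS info ('
            'title TEXT, author TEXT, url TEXT, "date" TEXT, '
            'reply INTEGER, "view" INTEGER)'
        )
        conn.execute('DELETE FROM info')
        conn.executemany(
            'INSERT INTO info (title, author, url, "date", reply, "view") '
            'VALUES (:title, :author, :url, :date, :reply, :view)',
            info_list,
        )

# An in-memory database keeps the example self-contained (assumption).
conn = sqlite3.connect(':memory:')
rows = [{'author': 'aaa', 'title': 'demo', 'date': '2008-10-10 10:10:10',
         'url': 'www.baidu.com', 'reply': 1, 'view': 1}]
save_to_sqlite(conn, rows)
saved = conn.execute('SELECT title, author, reply, "view" FROM info').fetchall()
print(saved)
```

Passing a file path instead of ':memory:' to sqlite3.connect would persist the rows between runs.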

Main crawler code

import requests
from lxml import etree
from cc98 import uni_2_native, URLs, saveInfo

# Forge request headers to suit the site you want to crawl (fill in the empty values below)
headers ={
  'Accept': '',
  'Accept-Encoding': '',
  'Accept-Language': '',
  'Connection': '',
  'Cookie': '',
  'Host': '',
  'Referer': '',
  'Upgrade-Insecure-Requests': '',
  'User-Agent': ''
}
url = 'http://www.cc98.org/list.asp?boardid=459&page=1&action='
cc98 = 'http://www.cc98.org/'

print "get information from cc98..."

urls = URLs.getURLs(url, "page", 50)
savetools = saveInfo.saveSqlite()

for url in urls:
  r = requests.get(url, headers=headers)
  html = uni_2_native.get_native(r.text)

  selector = etree.HTML(html)
  content_tr_list = selector.xpath('//form/table[@class="tableborder1 list-topic-table"]/tbody/tr')

  for each in content_tr_list:
    href = each.xpath('./td[2]/a/@href')
    title_author_time = each.xpath('./td[2]/a/@title')
    hot = each.xpath('./td[4]/text()')
    # Skip any row that lacks one of the three fields; otherwise values
    # left over from the previous row would be saved again.
    if not href or not title_author_time or not hot:
      continue

    link = cc98 + href[0]

    # The title attribute holds three lines: title, author, date.
    info_split = title_author_time[0].split('\n')
    if len(info_split) < 3:
      continue
    title = info_split[0][1:-1]  # drop the first and last character around the title
    author = info_split[1][3:]   # drop the 3-character label
    date = info_split[2][3:]     # drop the 3-character label

    # The fourth cell reads "reply/view".
    reply_view = hot[0].strip().split('/')
    if len(reply_view) < 2:
      continue
    reply, view = reply_view[0], reply_view[1]
    savetools.saveSingle(author=author, title=title, date=date, url=link, reply=reply, view=view)

print "All got! Now saving to Database..."
# savetools.show()
savetools.toMySQL()
print "ALL CLEAR! Have Fun!"
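The row-parsing step in the loop above depends on the exact layout of the link's title attribute, so it helps to try it in isolation. The Python 3 sketch below does that with invented data: the sample string, its bracket and label characters, and the name parse_row are all assumptions chosen to match the slicing in the crawler, not the forum's real markup.

```python
def parse_row(title_attr, hot_text):
    """Split the title attribute and the "reply/view" cell into fields."""
    lines = title_attr.split('\n')
    if len(lines) < 3 or '/' not in hot_text:
        # Malformed row: let the caller skip it.
        return None
    reply, view = hot_text.strip().split('/')[:2]
    return {
        'title': lines[0][1:-1],  # drop the characters wrapping the title
        'author': lines[1][3:],   # drop a 3-character label
        'date': lines[2][3:],     # drop a 3-character label
        'reply': int(reply),
        'view': int(view),
    }

# Hypothetical input shaped like "wrapped title / author line / date line".
sample_title = '[Summer internship]\nby:alice\nat:2018-07-20 10:00:00'
record = parse_row(sample_title, ' 12/345 ')
print(record)
```

Returning None for malformed rows lets the caller skip them instead of saving stale values.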

That is all for this article. I hope it helps with your learning, and I hope you will continue to support 三水点靠木.
