Scraping campus forum posts with a targeted Python crawler


Posted in Python on July 23, 2018

Introduction

I wrote this little crawler mainly to scrape internship postings from the campus forum; it is built mainly on the Requests library.

Source code

URLs.py

Its job is to take an initial URL (one that contains a page parameter) and build the list of URLs for every page from the current page number up to pageNum.

import re

def getURLs(url, attr, pageNum=1):
  """Build the list of page URLs from the current page number up to pageNum."""
  all_links = []
  try:
    # read the current page number out of the URL, e.g. 'page=1' -> 1
    now_page_number = int(re.search(attr + r'=(\d+)', url, re.S).group(1))
    for i in range(now_page_number, pageNum + 1):
      # swap the page number in the URL for each page index in turn
      # (note: re.sub's fourth positional argument is count, not flags)
      new_url = re.sub(attr + r'=\d+', attr + '=%s' % i, url)
      all_links.append(new_url)
    return all_links
  except TypeError:
    print "arguments TypeError: attr should be a string."

uni_2_native.py

The Chinese text on the pages fetched from the forum comes back as numeric character references of the form &#XXXX;, so the downloaded content still has to be converted before use.

import sys
import re
reload(sys)
sys.setdefaultencoding('utf-8')

def get_native(raw):
  """Replace every numeric character reference (&#XXXX;) with the character it encodes."""
  tostring = raw
  while True:
    # find the next numeric entity, e.g. '&#20013;'
    obj = re.search(r'&#(\d+);', tostring, flags=re.S)
    if obj is None:
      break
    else:
      entity, code = obj.group(0), obj.group(1)
      # substitute the entity text with the decoded character
      tostring = re.sub(entity, unichr(int(code)), tostring)
  return tostring
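
A minimal example of what the conversion does (again assuming the module is importable from the cc98 package):

# -*- coding: utf-8 -*-
from cc98 import uni_2_native

raw = 'title: &#23454;&#20064;'       # '实习' written as numeric character references
print uni_2_native.get_native(raw)    # -> title: 实习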

Saving the results to the database: saveInfo.py (despite the saveSqlite class name, the code below writes to MySQL via MySQLdb)

# -*- coding: utf-8 -*-

import MySQLdb


class saveSqlite():
  def __init__(self):
    self.infoList = []

  def saveSingle(self, author=None, title=None, date=None, url=None, reply=0, view=0):
    if author is None or title is None or date is None or url is None:
      print "No info saved!"
    else:
      singleDict = {}
      singleDict['author'] = author
      singleDict['title'] = title
      singleDict['date'] = date
      singleDict['url'] = url
      singleDict['reply'] = reply
      singleDict['view'] = view
      self.infoList.append(singleDict)

  def toMySQL(self):
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
    cursor = conn.cursor()
    # sql = "select * from info"
    # n = cursor.execute(sql)
    # for row in cursor.fetchall():
    #   for r in row:
    #     print r
    #   print '\n'
    sql = "delete from info"
    cursor.execute(sql)
    conn.commit()

    sql = "insert into info(title,author,url,date,reply,view) values (%s,%s,%s,%s,%s,%s)"
    params = []
    for each in self.infoList:
      params.append((each['title'], each['author'], each['url'], each['date'], each['reply'], each['view']))
    cursor.executemany(sql, params)

    conn.commit()
    cursor.close()
    conn.close()


  def show(self):
    for each in self.infoList:
      print "author: "+each['author']
      print "title: "+each['title']
      print "date: "+each['date']
      print "url: "+each['url']
      print "reply: "+str(each['reply'])
      print "view: "+str(each['view'])
      print '\n'

if __name__ == '__main__':
  save = saveSqlite()
  save.saveSingle('网','aaa','2008-10-10 10:10:10','www.baidu.com',1,1)
  # save.show()
  save.toMySQL()
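
Note that toMySQL() assumes the database and the info table already exist. The schema is not shown in the post, so the snippet below is only a guess that matches the columns used by the INSERT statement; adjust the names, types and connection parameters to your own setup:

# -*- coding: utf-8 -*-
import MySQLdb

# hypothetical schema matching the columns saveInfo.toMySQL() expects
conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
cursor = conn.cursor()
cursor.execute("""
  CREATE TABLE IF NOT EXISTS info (
    title  VARCHAR(255),
    author VARCHAR(64),
    url    VARCHAR(255),
    date   DATETIME,
    reply  INT,
    view   INT
  )
""")
conn.commit()
cursor.close()
conn.close()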

Main crawler code

import requests
from lxml import etree
from cc98 import uni_2_native, URLs, saveInfo

# forge a request header for the site you want to crawl (fill in the values yourself)
headers ={
  'Accept': '',
  'Accept-Encoding': '',
  'Accept-Language': '',
  'Connection': '',
  'Cookie': '',
  'Host': '',
  'Referer': '',
  'Upgrade-Insecure-Requests': '',
  'User-Agent': ''
}
url = 'http://www.cc98.org/list.asp?boardid=459&page=1&action='
cc98 = 'http://www.cc98.org/'

print "get infomation from cc98..."

urls = URLs.getURLs(url, "page", 50)
savetools = saveInfo.saveSqlite()

for url in urls:
  r = requests.get(url, headers=headers)
  html = uni_2_native.get_native(r.text)

  selector = etree.HTML(html)
  content_tr_list = selector.xpath('//form/table[@class="tableborder1 list-topic-table"]/tbody/tr')

  for each in content_tr_list:
    href = each.xpath('./td[2]/a/@href')
    if len(href) == 0:
      continue
    else:
      # href is a one-element list; loop over it to pull out the link
      # (the original author notes that indexing it directly did not work for them)
      for each_href in href:
        link = cc98 + each_href
      title_author_time = each.xpath('./td[2]/a/@title')

      # the title attribute packs three lines: the post title (wrapped in brackets),
      # the author and the post time; strip the brackets / leading labels from each
      for info in title_author_time:
        info_split = info.split('\n')
        title = info_split[0][1:len(info_split[0])-1]
        author = info_split[1][3:]
        date = info_split[2][3:]

      hot = each.xpath('./td[4]/text()')
      # the fourth column holds 'replies/views'; split it on '/'
      for hot_num in hot:
        reply_view = hot_num.strip().split('/')
        reply, view = reply_view[0], reply_view[1]
      savetools.saveSingle(author=author, title=title, date=date, url=link, reply=reply, view=view)

print "All got! Now saving to Database..."
# savetools.show()
savetools.toMySQL()
print "ALL CLEAR! Have Fun!"

That's all for this article. I hope it helps with your studies, and I hope you will keep supporting 三水点靠木.
