A Targeted Python Crawler for Campus Forum Post Information

Posted in Python on July 23, 2018

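Introduction

I wrote this small crawler mainly to collect internship postings from a campus forum. It fetches pages with the Requests library and parses them with lxml. All of the code below is written for Python 2.

Source code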

URLs.py

Given a starting URL that contains a page parameter, this module returns the list of URLs running from the page number in that URL through pageNum.

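import re

def getURLs(url, attr, pageNum=1):
  # Return the URLs for every page from the one named in `url` up to pageNum.
  all_links = []
  try:
    pattern = re.escape(attr) + r'=(\d+)'
  except TypeError:
    print "arguments TypeError:attr should be string."
    return all_links
  match = re.search(pattern, url)
  if match is None:
    print "parameter '%s=<number>' not found in url." % attr
    return all_links
  now_page_number = int(match.group(1))
  for i in range(now_page_number, pageNum + 1):
    # no flags are passed: the fourth positional argument of re.sub is count
    new_url = re.sub(pattern, '%s=%d' % (attr, i), url)
    all_links.append(new_url)
  return all_links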
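The same page list can also be built without regular expressions. The following is a minimal sketch in Python 3 (not the author's code) that rewrites the query string with the standard urllib.parse module; the function name build_page_urls is hypothetical and only used for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def build_page_urls(url, attr, last_page):
    """Return URLs for pages from the one in `url` through `last_page`."""
    parts = urlsplit(url)
    # keep_blank_values preserves empty parameters such as "action="
    query = parse_qsl(parts.query, keep_blank_values=True)
    current = next((int(v) for k, v in query if k == attr), None)
    if current is None:
        # the page parameter is missing, so there is nothing to enumerate
        return []
    urls = []
    for page in range(current, last_page + 1):
        new_query = [(k, str(page) if k == attr else v) for k, v in query]
        urls.append(urlunsplit(parts._replace(query=urlencode(new_query))))
    return urls


pages = build_page_urls(
    'http://www.cc98.org/list.asp?boardid=459&page=1&action=', 'page', 3)
for page_url in pages:
    print(page_url)
```

Working on parsed query parameters means only the parameter whose name is exactly `attr` gets rewritten, whereas a regular expression can also match inside a longer parameter name.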

uni_2_native.py

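The Chinese text in pages fetched from the forum arrives as numeric character references of the form &#XXXX; (Unicode code points), so the page content has to be converted after it is downloaded.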

import sys
import re

# Python 2 only: make utf-8 the default codec for implicit str/unicode conversion
reload(sys)
sys.setdefaultencoding('utf-8')

def get_native(raw):
  # Replace each decimal reference &#NNNN; with the character it encodes.
  # Matching digits only keeps int() from failing on text such as "&#x4e2d;".
  return re.sub(r'&#(\d+);', lambda m: unichr(int(m.group(1))), raw)
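For readers on Python 3, the standard library already covers this conversion. The short sketch below (an alternative, not the author's method) uses html.unescape, which decodes decimal, hexadecimal and named references in one call; the sample string is made up for the demonstration.

```python
import html

# Made-up sample mixing decimal, named and hexadecimal references.
sample = '&#26657;&#22253; BBS &amp; &#x5B9E;&#x4E60;'
decoded = html.unescape(sample)
print(decoded)
```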

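Saving to a MySQL database: saveInfo.py

Note that the class is named saveSqlite, but the code stores the data in MySQL through MySQLdb.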

# -*- coding: utf-8 -*-

import MySQLdb


class saveSqlite():
  def __init__(self):
    self.infoList = []

  def saveSingle(self, author=None, title=None, date=None, url=None,reply=0, view=0):
    if author is None or title is None or date is None or url is None:
      print "No info saved!"
    else:
      singleDict = {}
      singleDict['author'] = author
      singleDict['title'] = title
      singleDict['date'] = date
      singleDict['url'] = url
      singleDict['reply'] = reply
      singleDict['view'] = view
      self.infoList.append(singleDict)

  def toMySQL(self):
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
    cursor = conn.cursor()
    try:
      # empty the table first so that each run stores a fresh snapshot
      sql = "delete from info"
      cursor.execute(sql)
      conn.commit()

      sql = "insert into info(title,author,url,date,reply,view) values (%s,%s,%s,%s,%s,%s)"
      params = []
      for each in self.infoList:
        params.append((each['title'], each['author'], each['url'], each['date'], each['reply'], each['view']))
      if params:
        cursor.executemany(sql, params)
      conn.commit()
    finally:
      # release the connection even if one of the statements fails
      cursor.close()
      conn.close()


  def show(self):
    for each in self.infoList:
      print "author: "+each['author']
      print "title: "+each['title']
      print "date: "+each['date']
      print "url: "+each['url']
      print "reply: "+str(each['reply'])
      print "view: "+str(each['view'])
      print '\n'

if __name__ == '__main__':
  save = saveSqlite()
  save.saveSingle('网','aaa','2008-10-10 10:10:10','www.baidu.com',1,1)
  # save.show()
  save.toMySQL()
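The class above is called saveSqlite even though it writes to MySQL. If actual SQLite storage is wanted, something along these lines might work. This is a Python 3 sketch using the standard sqlite3 module; the function name save_posts and the column types are assumptions, while the table and column names follow the insert statement above.

```python
import sqlite3


def save_posts(conn, posts):
    """Replace the contents of the info table with `posts` (a list of dicts)."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS info ('
        'title TEXT, author TEXT, url TEXT, date TEXT, '
        'reply INTEGER, "view" INTEGER)')
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute('DELETE FROM info')
        conn.executemany(
            'INSERT INTO info (title, author, url, date, reply, "view") '
            'VALUES (:title, :author, :url, :date, :reply, :view)',
            posts)


conn = sqlite3.connect(':memory:')  # pass a file path instead to keep the data
save_posts(conn, [{
    'title': 'aaa', 'author': 'bbb', 'url': 'www.baidu.com',
    'date': '2008-10-10 10:10:10', 'reply': 1, 'view': 1,
}])
rows = conn.execute('SELECT title, author, reply, "view" FROM info').fetchall()
print(rows)
```

Running the delete and the inserts inside one `with conn:` block means a failed insert does not leave the table empty.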

Main crawler code

import requests
from lxml import etree
from cc98 import uni_2_native, URLs, saveInfo

# Fill in browser-like request headers that suit the site you want to crawl
headers ={
  'Accept': '',
  'Accept-Encoding': '',
  'Accept-Language': '',
  'Connection': '',
  'Cookie': '',
  'Host': '',
  'Referer': '',
  'Upgrade-Insecure-Requests': '',
  'User-Agent': ''
}
url = 'http://www.cc98.org/list.asp?boardid=459&page=1&action='
cc98 = 'http://www.cc98.org/'

print "get information from cc98..."

urls = URLs.getURLs(url, "page", 50)
savetools = saveInfo.saveSqlite()

for url in urls:
  r = requests.get(url, headers=headers)
  html = uni_2_native.get_native(r.text)

  selector = etree.HTML(html)
  content_tr_list = selector.xpath('//form/table[@class="tableborder1 list-topic-table"]/tbody/tr')

  for each in content_tr_list:
    href = each.xpath('./td[2]/a/@href')
    title_author_time = each.xpath('./td[2]/a/@title')
    hot = each.xpath('./td[4]/text()')
    # rows without a topic link (a header row, for example) yield empty lists,
    # so check before indexing to avoid an IndexError
    if not href or not title_author_time or not hot:
      continue

    # each XPath result is expected to hold one element; take the last match
    link = cc98 + href[-1]

    # the title attribute holds three lines: title, author and post time
    info_split = title_author_time[-1].split('\n')
    if len(info_split) < 3:
      continue
    title = info_split[0][1:-1]
    author = info_split[1][3:]
    date = info_split[2][3:]

    # the fourth cell reads "replies/views"
    reply_view = hot[-1].strip().split('/')
    if len(reply_view) < 2:
      continue
    reply, view = reply_view[0], reply_view[1]
    savetools.saveSingle(author=author, title=title, date=date, url=link, reply=reply, view=view)

print "All got! Now saving to Database..."
# savetools.show()
savetools.toMySQL()
print "ALL CLEAR! Have Fun!"
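The field extraction in the loop can be tested without touching the network if it is moved into small helpers. The following Python 3 sketch mirrors the slicing used in the crawler. The helper names are hypothetical, and the sample attribute text is invented to fit the slicing offsets, since the article does not show what the forum's title attribute actually contains.

```python
def parse_title_attr(info):
    """Split a three-line title attribute into (title, author, date).

    Mirrors the slicing in the crawler: drop the first and last character
    of line 1 and the first three characters of lines 2 and 3.
    """
    lines = info.split('\n')
    if len(lines) < 3:
        return None
    return lines[0][1:-1], lines[1][3:], lines[2][3:]


def parse_reply_view(cell_text):
    """Turn a "replies/views" cell such as " 12/345 " into two integers."""
    parts = cell_text.strip().split('/')
    if len(parts) != 2 or not all(p.strip().isdecimal() for p in parts):
        return None
    return int(parts[0]), int(parts[1])


# Hypothetical attribute text, invented to match the slicing offsets above.
sample = '<Summer internship>\nby:alice\nat:2018-07-20 10:00:00'
print(parse_title_attr(sample))
print(parse_reply_view(' 12/345 '))
```

Returning None for malformed input lets the caller skip a row instead of reusing values left over from the previous one.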

- Author -

lannooooooooooo
