A Python Script for Scraping Discuz! Usernames

Posted in Python on December 30, 2013

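I have been learning Python recently, so I used it to write a script that scrapes Discuz! usernames. The code is short but crude. The idea is simple: match the page title with a regular expression, extract the username, and write it to a text file. The program uses the Baidu Webmaster Community as its example (more than 400,000 users in total). I left it running on a VPS and stopped watching it. Although I had added a delay between requests, I later found that it had been banned after scraping only a little over 50,000 usernames.
The code is as follows: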

# -*- coding: utf-8 -*-
# Author: 天一
# Blog: http://www.90blog.org
# Version: 1.0
# Purpose: scrape usernames from the Baidu Webmaster Community (Python 2)
import urllib
import urllib2
import re
import time


def BiduSpider():
    # The username is the part of the page title before the fixed suffix
    pattern = re.compile(r'<title>(.*)的个人资料  百度站长社区 </title>')
    uid = 1
    while uid < 400000:
        theUrl = "http://bbs.zhanzhang.baidu.com/home.php?mod=space&uid=" + str(uid)
        uid += 1
        theResponse = urllib2.urlopen(theUrl)
        thePage = theResponse.read()
        # Match the username with the regular expression
        theFindall = re.findall(pattern, thePage)
        # Wait 0.5 seconds so that frequent requests are less likely to be blocked
        time.sleep(0.5)
        if theFindall:
            # Re-encode from UTF-8 to GBK to avoid garbled Chinese output
            thedatas = theFindall[0].decode('utf-8').encode('gbk')
            # Append the username to the text file
            with open('theUid.txt', 'a') as f:
                f.write(thedatas + '\n')


if __name__ == '__main__':
    BiduSpider()
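
For readers on Python 3, where `urllib2` no longer exists, the same idea can be sketched roughly as follows. This is only an illustration under some assumptions: the names `extract_username`, `fetch_page`, and `crawl` are made up for this sketch, the pages are assumed to be UTF-8, and skipping a uid when a request fails is one possible policy, not something the original script does. Separating the regex step into its own function also makes it possible to check the matching logic without touching the network.

```python
import re
import time
import urllib.request

# Title suffix copied from the script above; the site's real markup may differ.
TITLE_PATTERN = re.compile(r"<title>(.*?)的个人资料  百度站长社区 </title>")
BASE_URL = "http://bbs.zhanzhang.baidu.com/home.php?mod=space&uid={}"


def extract_username(html):
    """Return the username found in the page title, or None if there is none."""
    match = TITLE_PATTERN.search(html)
    return match.group(1) if match else None


def fetch_page(uid, timeout=10):
    """Download one profile page and decode it as UTF-8 (assumed encoding)."""
    with urllib.request.urlopen(BASE_URL.format(uid), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_uid, end_uid, out_path="theUid.txt", delay=0.5):
    """Append usernames for uids in [start_uid, end_uid) to out_path."""
    with open(out_path, "a", encoding="utf-8") as out:
        for uid in range(start_uid, end_uid):
            try:
                name = extract_username(fetch_page(uid))
            except OSError as exc:
                # urllib.error.URLError and socket timeouts are OSError subclasses.
                # Assumed policy for this sketch: report the failure and move on.
                print("uid {}: request failed: {}".format(uid, exc))
                name = None
            if name:
                out.write(name + "\n")
            # Same fixed delay as the original script.
            time.sleep(delay)
```

Calling `crawl(1, 11)` would try the first ten uids and append any usernames it finds to `theUid.txt`. Writing the file as UTF-8 avoids the GBK re-encoding step, at the cost of needing a UTF-8 aware editor on Windows.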

The final result is as follows: [Figure: the final result]

