Scraping Web Page Data with a Python 2.7 Crawler

Posted in Python on May 25, 2018

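I recently started learning Python and wrote a simple crawler. I am sharing it as a small demo in the hope that it helps other beginners like me.

The code runs on Python 2.7 and scrapes job titles, company names, locations, salaries, posting dates and similar fields from 51job.

Here is the code. The comments should be fairly clear. If you do not have MySQL installed, comment out the database-related code: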

#!/usr/bin/python 
# -*- coding: UTF-8 -*- 
 
from bs4 import BeautifulSoup 
import urllib 
import urllib2 
import codecs 
import re 
import time 
import logging 
import MySQLdb 
 
 
class Jobs(object): 
 
  """Scrape 51job search results into a text file and a MySQL table."""

  # Set up logging and the database connection
  def __init__(self):
    super(Jobs, self).__init__() 
     
    logging.basicConfig(level=logging.DEBUG, 
         format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s') 
    # Database setup; comment this out if MySQL is not installed
    self.db = MySQLdb.connect('127.0.0.1','root','rootroot','MySQL_Test',charset='utf8') 
    self.cursor = self.db.cursor() 
 
    # Logger that also writes to log.txt
    self.logger = logging.getLogger("sjk") 
 
    self.logger.setLevel(level=logging.DEBUG) 
 
    formatter = logging.Formatter( 
      '%(asctime)s - %(name)s - %(levelname)s - %(message)s') 
    handler = logging.FileHandler('log.txt') 
    handler.setFormatter(formatter) 
    handler.setLevel(logging.DEBUG) 
    self.logger.addHandler(handler) 
 
    self.logger.info('Initialization complete')
 
  # Request one page of search results
  def jobshtml(self, key, page='1'):
    try:
      self.logger.info('Requesting page ' + page)
      # Search URL template; {key} is the search keyword, {page} the page number
      searchurl = "https://search.51job.com/list/040000,000000,0000,00,9,99,{key},2,{page}.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="

      user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:59.0) Gecko/20100101 Firefox/59.0'
      # Request headers
      header = {'User-Agent': user_agent, 'Host': 'search.51job.com',
           'Referer': 'https://www.51job.com/'}
      # Fill the keyword and page number into the URL
      finalUrl = searchurl.format(key=key, page=page)

      request = urllib2.Request(finalUrl, headers=header)

      response = urllib2.urlopen(request)
      # Pause between requests so the crawler does not hammer the server
      time.sleep(3)
      # The page is GBK-encoded
      info = response.read().decode('gbk')

      self.logger.info('Page received')

      self.decodeHtml(info=info, key=key, page=page)

    except urllib2.HTTPError as e:
      self.logger.error('HTTP error %s on page %s', e.code, page)
 
  # Parse the page
  def decodeHtml(self, info, key, page):
    self.logger.info('Parsing page data')
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(info, 'html.parser')
    # Find the tags whose class starts with t1, t2, t3, t4 or t5
    ps = soup.find_all(attrs={"class": re.compile(r'^t[1-5].*')})
    # Open the output file; a+ appends, so all pages end up in the same file
    f = codecs.open(key + '.txt', 'a+', 'UTF-8')

    f.write('\n------------' + page + '--------------\n')

    count = 1

    arr = []
    # Join every five cells into one line per record, e.g.
    # iOS开发工程师 有限公司 深圳-南山区 0.9-1.6万/月 05-16
    for pi in ps:
      spe = " "
      finalstr = pi.getText().strip()
      arr.append(finalstr)
      if count % 5 == 0:
        # Insert each complete record into the database;
        # comment this line out if MySQL is not installed
        self.connectMySQL(arr=arr)
        arr = []
        spe = "\n"
      writestr = finalstr + spe
      count += 1
      f.write(writestr)
    f.close()

    self.logger.info('Parsing complete')
 
  # Database step; comment this out if MySQL is not installed
  def connectMySQL(self, arr):
    work = arr[0]
    company = arr[1]
    place = arr[2]
    salary = arr[3]
    pub_time = arr[4]  # not named "time", to avoid shadowing the time module

    # Pass the values as parameters so the driver escapes them
    query = "select * from Jobs_tab where \
    company_name=%s and work_name=%s and work_place=%s \
    and salary=%s and time=%s"
    self.cursor.execute(query, (company, work, place, salary, pub_time))

    queryresult = self.cursor.fetchall()
    # Insert the record only if it is not in the database yet;
    # updating existing rows is not implemented here
    if len(queryresult) == 0:
      sql = "insert into Jobs_tab(work_name,company_name,work_place,salary\
          ,time) values(%s,%s,%s,%s,%s)"

      try:
        self.cursor.execute(sql, (work, company, place, salary, pub_time))
        self.db.commit()

      except Exception as e:
        self.db.rollback()
        self.logger.error('Failed to write to the database: %s', e)
     
 
  # Simulated login (not implemented)
  # def login(self):
  #   data = {'action':'save','isread':'on','loginname':'<phone number>','password':'<password>'}
 
 
  # Entry point: start crawling
  def run(self, key):

    # Fetch only the first 5 pages; key is the job keyword to search for (iOS here)
    for x in xrange(1, 6):
      self.jobshtml(key=key, page=str(x))

    self.logger.info('Finished writing to the database')
 
    self.db.close() 
 
if __name__ == '__main__': 
 
  Jobs().run(key='iOS')
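
The database step in connectMySQL comes down to "store a record only if an identical one is not already there". Below is a minimal sketch of that check, not the crawler's actual code: it swaps in the standard-library sqlite3 module for MySQL so it runs without a server, and the table definition is an assumption, since the article never shows the real schema. The helper name insert_if_absent and the sample row are also made up for illustration.

```python
import sqlite3


def insert_if_absent(conn, record):
    """Insert one (work, company, place, salary, pub_time) record unless stored.

    Returns True if a row was inserted, False if an identical row existed.
    """
    cur = conn.cursor()
    # Placeholders keep quotes in scraped text from breaking the statement.
    # sqlite3 uses "?" placeholders; MySQLdb uses "%s" instead.
    cur.execute(
        "SELECT 1 FROM Jobs_tab WHERE work_name=? AND company_name=? "
        "AND work_place=? AND salary=? AND time=?",
        record,
    )
    if cur.fetchone() is not None:
        return False
    cur.execute(
        "INSERT INTO Jobs_tab(work_name, company_name, work_place, salary, time) "
        "VALUES (?, ?, ?, ?, ?)",
        record,
    )
    conn.commit()
    return True


# Assumed schema: five text columns matching the names used in the queries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Jobs_tab (work_name TEXT, company_name TEXT, "
    "work_place TEXT, salary TEXT, time TEXT)"
)

# Hypothetical sample record.
row = ("iOS Developer", "Example Co., Ltd.", "Shenzhen-Nanshan", "9-16k/month", "05-16")
first = insert_if_absent(conn, row)   # the row is new, so it is inserted
second = insert_if_absent(conn, row)  # an identical row exists, so it is skipped
```

With MySQLdb the same two statements would use %s placeholders and the parameters would be passed as the second argument to cursor.execute.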

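The crawler writes the scraped data one record per line, in the order job title, company, location, salary, posting date, for example:

iOS开发工程师 有限公司 深圳-南山区 0.9-1.6万/月 05-16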
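
decodeHtml builds each record by counting cells and starting a new line after every fifth one. The same grouping can be sketched as a standalone helper, which is easier to try out without a network connection. The function name group_records and the sample cells are hypothetical. Unlike the crawler, which still writes leftover cells to the text file, this sketch drops a trailing incomplete group.

```python
def group_records(cells, size=5):
    """Split a flat list of cell texts into consecutive records of `size` cells.

    A trailing group with fewer than `size` cells is dropped as incomplete.
    """
    full = len(cells) - len(cells) % size
    return [cells[i:i + size] for i in range(0, full, size)]


# Hypothetical cell texts in page order: title, company, place, salary, date.
cells = [
    "iOS Developer", "Company A", "Shenzhen-Nanshan", "9-16k/month", "05-16",
    "iOS Engineer", "Company B", "Shenzhen-Futian", "10-15k/month", "05-15",
    "dangling cell",
]
records = group_records(cells)
# One space-separated line per record, as in the crawler's text output.
lines = [" ".join(record) for record in records]
```

Each entry in records could then be passed to the database step as one unit, which keeps the parsing loop free of counter bookkeeping.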

That is all for this article. I hope it helps with your learning.

- Author -

aasdsjk
