A Simple Way to Parse HTML Tables in Python


Posted in Python on June 15, 2015

This article shows, by example, a simple way to parse HTML tables in Python. It is shared here for reference. The details are as follows:

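The code depends on libxml2dom, so make sure it is installed first. Import the module into your script and call the parse_tables() function with the following arguments: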

1. source = a string containing the HTML source code. You can pass in just the table or the code of the entire page.

2. headers = a list of ints OR a list of strings.
If the headers are ints, the table is treated as having no header row: list the 0-based indexes of the columns from which you want to extract data.
If the headers are strings, the table is treated as having header cells (marked up with th tags): data is pulled from the columns whose header text matches.

3. table_index = the 0-based index of the table in the source code. If there are multiple tables and the table you want to parse is the third table in the code, pass in 2.

The function returns a list of lists. Each inner list contains the data parsed from one row.
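
The listing below chooses between the two modes by looking only at headers[0]. As a minimal sketch of a stricter check, written for Python 3 and not part of the original code, the hypothetical helper classify_headers confirms that every element has the same type before a mode is chosen:

```python
def classify_headers(headers):
    """Return "index" for a list of ints, "name" for a list of strings,
    and None for an empty or mixed list.

    Hypothetical helper for illustration; the original listing has no such function.
    """
    if not headers:
        return None
    # bool is a subclass of int, so exclude it explicitly
    if all(isinstance(h, int) and not isinstance(h, bool) for h in headers):
        return "index"
    if all(isinstance(h, str) for h in headers):
        return "name"
    return None


print(classify_headers([0, 2]))            # list of column indexes
print(classify_headers(["Name", "City"]))  # list of header names
print(classify_headers([0, "City"]))       # mixed list is rejected
```

A caller could refuse to parse when the helper returns None, which also covers the empty-list case where headers[0] would raise an IndexError.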

The code is as follows:

#The goal of table parser is to get specific information from specific
#columns in a table.
#Input: source code from a typical website
#Arguments: a list of headers the user wants to return
#Output: A list of lists of the data in each row
import libxml2dom
def parse_tables(source, headers, table_index):
  """parse_tables(string source, list headers, table_index)
    headers may be a list of strings if the table has headers defined, or
    a list of ints if no headers are defined; the ints select cells by
    their 0-based index within each row.
    This method returns a list of lists.
    """
  #Route on the type of the first header; the rest of the list is
  #assumed to be of the same type
  print('Printing headers: %s' % (headers,))
  #route to the correct function
  #if the header type is int
  if isinstance(headers[0], int):
    #run no_header function
    return no_header(source, headers, table_index)
  #if the header type is string
  elif isinstance(headers[0], str):
    #run the header_given function
    return header_given(source, headers, table_index)
  else:
    #return None if the headers aren't correct
    return None
#This function takes in the source code of the whole page a string list of
#headers and the index number of the table on the page. It returns a list of
#lists with the scraped information
def header_given(source, headers, table_index):
  #initiate a list to hold the return list
  return_list = []
  #initiate a list to hold the index numbers of the data in the rows
  header_index = []
  #get a document object out of the source code
  doc = libxml2dom.parseString(source,html=1)
  #get the tables from the document
  tables = doc.getElementsByTagName('table')
  try:
    #try to get focus on the desired table
    main_table = tables[table_index]
  except IndexError:
    #if the table doesn't exist then return an error
    return ['The table index was not found']
  #get a list of headers in the table
  table_headers = main_table.getElementsByTagName('th')
  #need a sentry value for the header loop
  loop_sentry = 0
  #loop through each header looking for matches
  for header in table_headers:
    #if the header is in the desired headers list 
    if header.textContent in headers:
      #add it to the header_index
      header_index.append(loop_sentry)
    #add one to the loop_sentry
    loop_sentry+=1
  #get the rows from the table
  rows = main_table.getElementsByTagName('tr')
  #sentry value detecting if the first row is being viewed
  row_sentry = 0
  #loop through the rows in the table, skipping the first row
  for row in rows:
    #if row_sentry is 0 this is our first row
    if row_sentry == 0:
      #make the row_sentry not 0
      row_sentry = 1337
      continue
    #get all cells from the current row
    cells = row.getElementsByTagName('td')
    #initiate a list to append into the return_list
    cell_list = []
    #iterate through all of the header indexes
    for i in header_index:
      #append the cell's text content, skipping rows that are too short
      if i < len(cells):
        cell_list.append(cells[i].textContent)
    #append the cell_list to the return_list
    return_list.append(cell_list)
  #return the return_list
  return return_list
#This function takes in the source code of the whole page an int list of
#headers indicating the index number of the needed item and the index number
#of the table on the page. It returns a list of lists with the scraped info
def no_header(source, headers, table_index):
  #initiate a list to hold the return list
  return_list = []
  #get a document object out of the source code
  doc = libxml2dom.parseString(source, html=1)
  #get the tables from document
  tables = doc.getElementsByTagName('table')
  try:
    #try to get focus on the desired table
    main_table = tables[table_index]
  except IndexError:
    #if the table doesn't exist then return an error
    return ['The table index was not found']
  #get all of the rows out of the main_table
  rows = main_table.getElementsByTagName('tr')
  #loop through each row
  for row in rows:
    #get all cells from the current row
    cells = row.getElementsByTagName('td')
    #initiate a list to append into the return_list
    cell_list = []
    #loop through the list of desired headers
    for i in headers:
      try:
        #try to add text from the cell into the cell_list
        cell_list.append(cells[i].textContent)
      except IndexError:
        #the row has no cell at this index, so skip it
        continue
    #append the data scraped into the return_list    
    return_list.append(cell_list)
  #return the return list
  return return_list
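
For readers who cannot install libxml2dom, the same idea can be approximated with the html.parser module from the Python 3 standard library. The following is a minimal sketch and not the author's method: TableCollector and parse_tables_stdlib are hypothetical names, the code assumes tables are not nested and that every tr, td and th tag is explicitly closed, and it returns None for a missing table rather than the message list used above.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each table as a list of rows; a row is a list of (tag, text) cells."""

    def __init__(self):
        super().__init__()
        self.tables = []   # tables in document order
        self._row = None   # row currently being filled, or None
        self._cell = None  # [tag, text_parts] for the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
            self.tables[-1].append(self._row)
        elif tag in ("td", "th") and self._row is not None:
            self._cell = [tag, []]

    def handle_data(self, data):
        # keep text only while a cell is open
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append((self._cell[0], "".join(self._cell[1]).strip()))
            self._cell = None
        elif tag == "tr":
            self._row = None


def parse_tables_stdlib(source, headers, table_index):
    """Mirror the parse_tables interface using only the standard library."""
    collector = TableCollector()
    collector.feed(source)
    collector.close()
    try:
        rows = collector.tables[table_index]
    except IndexError:
        return None  # assumption: signal a missing table with None
    if headers and all(isinstance(h, str) for h in headers):
        if not rows:
            return []
        # map header names to column positions using the th cells of the first row
        names = [text for tag, text in rows[0] if tag == "th"]
        columns = [i for i, name in enumerate(names) if name in headers]
        rows = rows[1:]  # skip the header row, as header_given does
    else:
        columns = list(headers)
    result = []
    for row in rows:
        texts = [text for tag, text in row if tag == "td"]
        # silently skip indexes the row does not have, as no_header does
        result.append([texts[i] for i in columns if i < len(texts)])
    return result


sample = ("<table><tr><th>Name</th><th>Age</th><th>City</th></tr>"
          "<tr><td>Ann</td><td>31</td><td>Oslo</td></tr>"
          "<tr><td>Bob</td><td>45</td><td>Rome</td></tr></table>")
print(parse_tables_stdlib(sample, ["Name", "City"], 0))
```

As in header_given, the selected columns come back in table order, not in the order the names were passed in. In index mode the header row contributes an empty inner list, because it contains no td cells; this matches the behavior of no_header.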

I hope this article helps with your Python programming.
