Processing Tabular Data with Python


Posted in Python on April 13, 2021

Technical Background

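Data processing is a very active research area: by modeling data collected from large real-world scenarios, we can predict what is likely to happen next. As an example, consider gold price data covering 2002 to 2018.

[Figure: screenshot of the gold price table]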

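The data comes from an open-source project on Gitee. It contains the following fields: date, opening price, closing price, highest price, lowest price, trading volume, and turnover. If we analyze it with a machine learning model, we may be able to predict gold prices that are not present in the data, and if the predictions fit well, that would matter a great deal for some people's investment strategies. Real-world data of this kind, however, is often very large. The data used here is only a bit over 300 KB, but more often we have to think about processing 10 GB or even more than 1 TB of data. If we cannot even process the data, how can we build a model on it?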

Processing Excel Spreadsheets with Python

Let us start with the simplest case and set performance aside. We can use the xlrd package to open and load an Excel spreadsheet in Python:

# table.py

def read_table_by_xlrd():
    import xlrd
    # Open the workbook and list the names of its sheets
    workbook = xlrd.open_workbook(r'data.xls')
    sheet_names = workbook.sheet_names()
    print('All sheets in the file data.xls are: {}'.format(sheet_names))
    # Inspect the first sheet: a single cell, the header row, and the length of one column
    sheet = workbook.sheet_by_index(0)
    print('The cell value of row index 0 and col index 1 is: {}'.format(sheet.cell_value(0, 1)))
    print('The elements of row index 0 are: {}'.format(sheet.row_values(0)))
    print('The length of col index 1 are: {}'.format(len(sheet.col_values(1))))

if __name__ == '__main__':
    read_table_by_xlrd()

The code above prints:

[dechin@dechin-manjaro gold]$ python3 table.py 
All sheets in the file data.xls are: ['Sheet1', 'Sheet2', 'Sheet3']
The cell value of row index 0 and col index 1 is: 开
The elements of row index 0 are: ['时间', '开', '高', '低', '收', '量', '额']
The length of col index 1 are: 3923

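We have now loaded an xls spreadsheet into Python's memory and can analyze the data. If the data needs to be modified, the openpyxl library is an option (note that it works with the newer .xlsx format rather than .xls), but we will not go into detail here.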

Another widely used and very powerful Python library for tabular data is pandas. Here is a brief demonstration in IPython of handling the spreadsheet with pandas:

[dechin@dechin-manjaro gold]$ ipython
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: !ls -l
总用量 368
-rw-r--r-- 1 dechin dechin 372736  3月 27 21:31 data.xls
-rw-r--r-- 1 dechin dechin    563  3月 27 21:42 table.py

In [3]: data = pd.read_excel('data.xls', 'Sheet1') # read the Excel file

In [4]: data.to_csv('data.csv', encoding='utf-8') # convert it to a csv file

In [7]: !ls -l
总用量 588
-rw-r--r-- 1 dechin dechin 221872  3月 27 21:52 data.csv
-rw-r--r-- 1 dechin dechin 372736  3月 27 21:31 data.xls
-rw-r--r-- 1 dechin dechin    563  3月 27 21:42 table.py

In [8]: !head -n 10 data.csv # show the first 10 lines of the csv file
,时间,开,高,低,收,量,额
0,2002-10-30,83.98,92.38,82.0,83.52,352,29373370
1,2002-10-31,83.9,83.92,83.9,83.91,66,5537480
2,2002-11-01,84.5,84.65,84.0,84.51,77,6502510
3,2002-11-04,84.9,85.06,84.9,84.99,95,8076330
4,2002-11-05,85.1,85.2,85.1,85.13,61,5193650
5,2002-11-06,84.9,84.9,84.9,84.9,1,84900
6,2002-11-07,85.0,85.15,85.0,85.14,26,2212310
7,2002-11-08,85.25,85.28,85.1,85.16,35,2981780
8,2002-11-11,85.18,85.19,85.18,85.19,65,5537050

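In IPython we can run not only Python statements but also system commands, simply by prefixing them with !, which is very convenient. A csv file is plain text in which commas separate the fields and newlines separate the rows, in place of the commonly used tab character (\t).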
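To make the comma-versus-tab point concrete, here is a minimal sketch using only the standard library csv module. The sample rows are copied from the data.csv excerpt above; the English header names and the helper parse_rows are illustrative assumptions, not part of the original data or article.

```python
import csv
import io

# Two data rows copied from the data.csv excerpt; the English header
# names are an illustrative stand-in for the original Chinese ones.
SAMPLE = """,date,open,high,low,close,volume,amount
0,2002-10-30,83.98,92.38,82.0,83.52,352,29373370
1,2002-10-31,83.9,83.92,83.9,83.91,66,5537480
"""


def parse_rows(text, delimiter=","):
    """Split delimited text into a header list and a list of row lists."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    return header, list(reader)


header, rows = parse_rows(SAMPLE)
print(header)
print(rows[0])

# The same data with tabs instead of commas parses to identical rows
# once the delimiter is changed accordingly.
tsv_header, tsv_rows = parse_rows(SAMPLE.replace(",", "\t"), delimiter="\t")
print(tsv_rows == rows)
```

Note that every field comes back as a string; converting columns to numbers is left to the caller (or to a library such as pandas).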

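However, whether we use xlrd or pandas, we face the same problem: all of the data has to be loaded into memory before it can be processed. A typical personal computer has only 8 GB to 16 GB of memory, and even a fairly large 64 GB machine can only handle files smaller than 64 GB in memory, which is far from enough for big-data scenarios. vaex, introduced in the next section, is a good solution to this. As an aside, on Linux the local memory and its usage can be checked as follows: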

[dechin@dechin-manjaro gold]$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b 交换 空闲 缓冲 缓存   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 35812168 328340 2904872    0    0    20    27  362  365  8  4 88  0  0
[dechin@dechin-manjaro gold]$ vmstat 2 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b 交换 空闲 缓冲 缓存   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 35810916 328356 2905844    0    0    20    27  362  365  8  4 88  0  0
 0  0      0 35811916 328364 2904952    0    0     0     6  613  688  1  1 99  0  0
 0  0      0 35812168 328364 2904856    0    0     0     0  672  642  0  1 99  0  0

The free memory is roughly 36 GB. This machine has 40 GB of memory in total, which is fairly large.
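The conversion from the raw vmstat numbers to gigabytes can be sketched as follows. This assumes vmstat is reporting in its default unit of KiB (1024 bytes) and that the columns appear in the order shown in the output above; the helper vmstat_memory_gb is a hypothetical name introduced for illustration.

```python
def kib_to_gb(kib):
    """Convert KiB (1024-byte units) to decimal gigabytes."""
    return kib * 1024 / 1e9


def vmstat_memory_gb(line):
    """Convert the memory columns of one vmstat data line to GB.

    Assumed column order (as in the output above):
    r, b, swpd, free, buff, cache, ...
    """
    fields = line.split()
    swpd, free, buff, cache = (int(value) for value in fields[2:6])
    return {
        "swpd": kib_to_gb(swpd),
        "free": kib_to_gb(free),
        "buff": kib_to_gb(buff),
        "cache": kib_to_gb(cache),
    }


# First data line of the vmstat output shown above.
LINE = " 0  0      0 35812168 328340 2904872    0    0    20    27  362  365  8  4 88  0  0"
mem = vmstat_memory_gb(LINE)
print({name: round(value, 2) for name, value in mem.items()})
```

Under these assumptions the free column works out to about 36.7 GB, which matches the figure quoted in the text.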

Installing and Using vaex

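vaex provides a memory-mapped approach to data processing: instead of loading the whole data file into memory, we operate directly on the data stored on disk. In other words, the size of the files we can process is no longer limited by the amount of memory; any file that fits within the available disk space can be processed.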
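The general idea of memory mapping can be illustrated with Python's standard mmap module. This is only a sketch of the concept, not how vaex is implemented internally: the flat binary layout of float64 values and the helper names write_column and read_value are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.Struct("<d")  # one little-endian float64 per value


def write_column(path, values):
    """Store a column of floats as a flat binary file."""
    with open(path, "wb") as f:
        for value in values:
            f.write(ITEM.pack(value))


def read_value(path, index):
    """Read one value through a memory map.

    The operating system pages in only the part of the file that is
    actually touched, so the whole file never has to be read into memory.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return ITEM.unpack_from(mm, index * ITEM.size)[0]


path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_column(path, [83.98, 83.9, 84.5, 84.9, 85.1])
print(read_value(path, 3))
```

The same access pattern works unchanged whether the file holds five values or billions, which is what makes the approach attractive for data larger than memory.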
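Even the smallest disks in today's personal PCs are usually 128 GB, far more than memory can hold. Of course, depending on how the disk is partitioned, not all of that disk space is necessarily usable from a given directory. The available disk space of the partition containing the current directory can be checked as follows: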

[dechin@dechin-manjaro gold]$ df -hl .
文件系统        容量  已用  可用 已用% 挂载点
/dev/nvme0n1p9  144G   57G   80G   42% /

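We still have 80 GB of disk space available. In other words, if we placed an 80 GB table file in the current directory, neither pandas nor xlrd could process it, because it far exceeds what the memory can hold. With vaex, however, we could still process that file, provided it is stored in a format that vaex can memory-map, such as HDF5.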
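The same check can be done from Python with the standard library function shutil.disk_usage. The sketch below reports the free space in GiB (powers of 1024) so that the number is comparable to what df -h prints; the helper name free_disk_gib is introduced here for illustration.

```python
import shutil


def free_disk_gib(path="."):
    """Return the free space, in GiB, of the partition containing path."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3


print(f"{free_disk_gib():.1f} GiB free in the current directory's partition")
```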

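The official vaex documentation also describes how vaex works and what its advantages are.

[Figure: screenshot from the official vaex documentation]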

Installing vaex

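Like most third-party Python packages, vaex can be downloaded and managed with pip. Quite a few files are downloaded, so the process can be slow; we just need to wait patiently: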

[dechin@dechin-manjaro gold]$ python3 -m pip install vaex
Collecting vaex
  Downloading vaex-4.1.0-py3-none-any.whl (4.5 kB)
Collecting vaex-ml<0.12,>=0.11.0
  Downloading vaex_ml-0.11.1-py3-none-any.whl (95 kB)
     |████████████████████████████████| 95 kB 81 kB/s
Collecting vaex-core<5,>=4.1.0
  Downloading vaex_core-4.1.0-cp38-cp38-manylinux2010_x86_64.whl (2.5 MB)
     |████████████████████████████████| 2.5 MB 61 kB/s
Collecting vaex-viz<0.6,>=0.5.0
  Downloading vaex_viz-0.5.0-py3-none-any.whl (19 kB)
Collecting vaex-astro<0.9,>=0.8.0
  Downloading vaex_astro-0.8.0-py3-none-any.whl (20 kB)
Collecting vaex-hdf5<0.8,>=0.7.0
  Downloading vaex_hdf5-0.7.0-py3-none-any.whl (15 kB)
Collecting vaex-server<0.5,>=0.4.0
  Downloading vaex_server-0.4.0-py3-none-any.whl (13 kB)
Collecting vaex-jupyter<0.7,>=0.6.0
  Downloading vaex_jupyter-0.6.0-py3-none-any.whl (42 kB)
     |████████████████████████████████| 42 kB 82 kB/s
Requirement already satisfied: traitlets in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-ml<0.12,>=0.11.0->vaex) (5.0.5)
Requirement already satisfied: numba in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-ml<0.12,>=0.11.0->vaex) (0.51.2)
Requirement already satisfied: jinja2 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-ml<0.12,>=0.11.0->vaex) (2.11.2)
Requirement already satisfied: psutil>=1.2.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (5.7.2)
Requirement already satisfied: six in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (1.15.0)
Requirement already satisfied: cloudpickle in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (1.6.0)
Requirement already satisfied: numpy>=1.16 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (1.20.1)
Requirement already satisfied: dask[array] in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (2.30.0)
Collecting pyarrow>=3.0
  Downloading pyarrow-3.0.0-cp38-cp38-manylinux2014_x86_64.whl (20.7 MB)
     |████████████████████████████████| 20.7 MB 86 kB/s
Requirement already satisfied: pandas in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (1.1.3)
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)")': /simple/tabulate/                                       
Collecting tabulate>=0.8.3
  Downloading tabulate-0.8.9-py3-none-any.whl (25 kB)
Requirement already satisfied: pyyaml in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (5.3.1)
Collecting frozendict
  Downloading frozendict-1.2.tar.gz (2.6 kB)
Collecting aplus
  Downloading aplus-0.11.0.tar.gz (3.7 kB)
Requirement already satisfied: requests in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (2.24.0)
Requirement already satisfied: nest-asyncio>=1.3.3 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (1.4.2)
Collecting progressbar2
  Downloading progressbar2-3.53.1-py2.py3-none-any.whl (25 kB)
Requirement already satisfied: future>=0.15.2 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (0.18.2)
Requirement already satisfied: matplotlib>=1.3.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-viz<0.6,>=0.5.0->vaex) (3.3.4)
Requirement already satisfied: pillow in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-viz<0.6,>=0.5.0->vaex) (8.0.1)
Requirement already satisfied: astropy in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-astro<0.9,>=0.8.0->vaex) (4.0.2)
Requirement already satisfied: h5py>=2.9 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-hdf5<0.8,>=0.7.0->vaex) (2.10.0)
Collecting cachetools
  Downloading cachetools-4.2.1-py3-none-any.whl (12 kB)
Requirement already satisfied: tornado>4.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-server<0.5,>=0.4.0->vaex) (6.0.4)
Collecting xarray
  Downloading xarray-0.17.0-py3-none-any.whl (759 kB)
     |████████████████████████████████| 759 kB 28 kB/s
Collecting ipympl
  Downloading ipympl-0.7.0-py2.py3-none-any.whl (106 kB)
     |████████████████████████████████| 106 kB 39 kB/s
Collecting ipyleaflet
  Downloading ipyleaflet-0.13.6-py2.py3-none-any.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 75 kB/s
Collecting ipyvuetify<2,>=1.2.2
  Downloading ipyvuetify-1.6.2-py2.py3-none-any.whl (11.7 MB)
     |████████████████████████████████| 11.7 MB 173 kB/s
Collecting ipyvolume>=0.4
  Downloading ipyvolume-0.5.2-py2.py3-none-any.whl (2.9 MB)
     |████████████████████████████████| 2.9 MB 66 kB/s
Collecting bqplot>=0.10.1
  Downloading bqplot-0.12.23-py2.py3-none-any.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 175 kB/s
Requirement already satisfied: ipython-genutils in /home/dechin/anaconda3/lib/python3.8/site-packages (from traitlets->vaex-ml<0.12,>=0.11.0->vaex) (0.2.0)
Requirement already satisfied: setuptools in /home/dechin/anaconda3/lib/python3.8/site-packages (from numba->vaex-ml<0.12,>=0.11.0->vaex) (50.3.1.post20201107)
Requirement already satisfied: llvmlite<0.35,>=0.34.0.dev0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from numba->vaex-ml<0.12,>=0.11.0->vaex) (0.34.0)
Requirement already satisfied: MarkupSafe>=0.23 in /home/dechin/anaconda3/lib/python3.8/site-packages (from jinja2->vaex-ml<0.12,>=0.11.0->vaex) (1.1.1)
Requirement already satisfied: toolz>=0.8.2; extra == "array" in /home/dechin/anaconda3/lib/python3.8/site-packages (from dask[array]->vaex-core<5,>=4.1.0->vaex) (0.11.1)
Requirement already satisfied: pytz>=2017.2 in /home/dechin/anaconda3/lib/python3.8/site-packages (from pandas->vaex-core<5,>=4.1.0->vaex) (2020.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /home/dechin/anaconda3/lib/python3.8/site-packages (from pandas->vaex-core<5,>=4.1.0->vaex) (2.8.1)
Requirement already satisfied: certifi>=2017.4.17 in /home/dechin/anaconda3/lib/python3.8/site-packages (from requests->vaex-core<5,>=4.1.0->vaex) (2020.6.20)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from requests->vaex-core<5,>=4.1.0->vaex) (1.25.11)
Requirement already satisfied: idna<3,>=2.5 in /home/dechin/anaconda3/lib/python3.8/site-packages (from requests->vaex-core<5,>=4.1.0->vaex) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /home/dechin/anaconda3/lib/python3.8/site-packages (from requests->vaex-core<5,>=4.1.0->vaex) (3.0.4)
Collecting python-utils>=2.3.0
  Downloading python_utils-2.5.6-py2.py3-none-any.whl (12 kB)
Requirement already satisfied: cycler>=0.10 in /home/dechin/anaconda3/lib/python3.8/site-packages (from matplotlib>=1.3.1->vaex-viz<0.6,>=0.5.0->vaex) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from matplotlib>=1.3.1->vaex-viz<0.6,>=0.5.0->vaex) (1.3.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /home/dechin/anaconda3/lib/python3.8/site-packages (from matplotlib>=1.3.1->vaex-viz<0.6,>=0.5.0->vaex) (2.4.7)
Collecting ipywidgets>=7.6.0
  Downloading ipywidgets-7.6.3-py2.py3-none-any.whl (121 kB)
     |████████████████████████████████| 121 kB 175 kB/s
Requirement already satisfied: ipykernel>=4.7 in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (5.3.4)
Collecting branca<0.5,>=0.3.1
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Collecting shapely
  Downloading Shapely-1.7.1-cp38-cp38-manylinux1_x86_64.whl (1.0 MB)
     |████████████████████████████████| 1.0 MB 98 kB/s
Collecting traittypes<3,>=0.2.1
  Downloading traittypes-0.2.1-py2.py3-none-any.whl (8.6 kB)
Collecting ipyvue<2,>=1.5
  Downloading ipyvue-1.5.0-py2.py3-none-any.whl (2.7 MB)
     |████████████████████████████████| 2.7 MB 80 kB/s
Collecting ipywebrtc
  Downloading ipywebrtc-0.5.0-py2.py3-none-any.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 99 kB/s
Collecting pythreejs>=1.0.0
  Downloading pythreejs-2.3.0-py2.py3-none-any.whl (3.4 MB)
     |████████████████████████████████| 3.4 MB 30 kB/s
Requirement already satisfied: widgetsnbextension~=3.5.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (3.5.1)
Requirement already satisfied: nbformat>=4.2.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (5.0.8)
Requirement already satisfied: ipython>=4.0.0; python_version >= "3.3" in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (7.19.0)
Collecting jupyterlab-widgets>=1.0.0; python_version >= "3.6"
  Downloading jupyterlab_widgets-1.0.0-py3-none-any.whl (243 kB)
     |████████████████████████████████| 243 kB 115 kB/s
Requirement already satisfied: jupyter-client in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipykernel>=4.7->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (6.1.7)
Collecting ipydatawidgets>=1.1.1
  Downloading ipydatawidgets-4.2.0-py2.py3-none-any.whl (275 kB)
     |████████████████████████████████| 275 kB 73 kB/s
Requirement already satisfied: notebook>=4.4.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (6.1.4)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbformat>=4.2.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (3.2.0)
Requirement already satisfied: jupyter-core in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbformat>=4.2.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (4.6.3)
Requirement already satisfied: backcall in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.2.0)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (3.0.8)
Requirement already satisfied: pickleshare in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.7.5)
Requirement already satisfied: pexpect>4.3; sys_platform != "win32" in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (4.8.0)
Requirement already satisfied: pygments in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (2.7.2)
Requirement already satisfied: jedi>=0.10 in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.17.1)
Requirement already satisfied: decorator in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (4.4.2)
Requirement already satisfied: pyzmq>=13 in /home/dechin/anaconda3/lib/python3.8/site-packages (from jupyter-client->ipykernel>=4.7->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (19.0.2)
Requirement already satisfied: terminado>=0.8.3 in /home/dechin/anaconda3/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.9.1)
Requirement already satisfied: argon2-cffi in /home/dechin/anaconda3/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (20.1.0)
Requirement already satisfied: Send2Trash in /home/dechin/anaconda3/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (1.5.0)
Requirement already satisfied: nbconvert in /home/dechin/anaconda3/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (6.0.7)
Requirement already satisfied: prometheus-client in /home/dechin/anaconda3/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.8.0)
Requirement already satisfied: pyrsistent>=0.14.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.17.3)
Requirement already satisfied: attrs>=17.4.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (20.3.0)
Requirement already satisfied: wcwidth in /home/dechin/anaconda3/lib/python3.8/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.2.5)
Requirement already satisfied: ptyprocess>=0.5 in /home/dechin/anaconda3/lib/python3.8/site-packages (from pexpect>4.3; sys_platform != "win32"->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.6.0)
Requirement already satisfied: parso<0.8.0,>=0.7.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from jedi>=0.10->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.7.0)
Requirement already satisfied: cffi>=1.0.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (1.14.3)
Requirement already satisfied: mistune<2,>=0.8.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.8.4)
Requirement already satisfied: testpath in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.4.4)
Requirement already satisfied: pandocfilters>=1.4.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (1.4.3)
Requirement already satisfied: jupyterlab-pygments in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.1.2)
Requirement already satisfied: bleach in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (3.2.1)
Requirement already satisfied: entrypoints>=0.2.2 in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.3)
Requirement already satisfied: defusedxml in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.6.0)
Requirement already satisfied: nbclient<0.6.0,>=0.5.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.5.1)
Requirement already satisfied: pycparser in /home/dechin/anaconda3/lib/python3.8/site-packages (from cffi>=1.0.0->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (2.20)
Requirement already satisfied: webencodings in /home/dechin/anaconda3/lib/python3.8/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.5.1)
Requirement already satisfied: packaging in /home/dechin/anaconda3/lib/python3.8/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (20.4)
Requirement already satisfied: async-generator in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (1.10)
Building wheels for collected packages: frozendict, aplus
  Building wheel for frozendict (setup.py) ... done
  Created wheel for frozendict: filename=frozendict-1.2-py3-none-any.whl size=3148 sha256=1ae5d8fe0d670f73bf3ee88453978246919197a616f0e08e601c84cc244cb238
  Stored in directory: /home/dechin/.cache/pip/wheels/9b/9b/56/5713233cf7226423ab6c58c08081551a301b5863e343ba053c
  Building wheel for aplus (setup.py) ... done
  Created wheel for aplus: filename=aplus-0.11.0-py3-none-any.whl size=4412 sha256=9762d51c5ece813b0c5a27ff6ebc1a86e709d55edb7003dcc11272c954dd39c7
  Stored in directory: /home/dechin/.cache/pip/wheels/de/93/23/3db69e1003030a764c9827dc02137119ec5e6e439afd64eebb
Successfully built frozendict aplus
Installing collected packages: pyarrow, tabulate, frozendict, aplus, python-utils, progressbar2, vaex-core, vaex-ml, vaex-viz, vaex-astro, vaex-hdf5, cachetools, vaex-server, xarray, jupyterlab-widgets, ipywidgets, ipympl, branca, shapely, traittypes, ipyleaflet, ipyvue, ipyvuetify, ipywebrtc, ipydatawidgets, pythreejs, ipyvolume, bqplot, vaex-jupyter, vaex
  Attempting uninstall: ipywidgets
    Found existing installation: ipywidgets 7.5.1
    Uninstalling ipywidgets-7.5.1:
      Successfully uninstalled ipywidgets-7.5.1
Successfully installed aplus-0.11.0 bqplot-0.12.23 branca-0.4.2 cachetools-4.2.1 frozendict-1.2 ipydatawidgets-4.2.0 ipyleaflet-0.13.6 ipympl-0.7.0 ipyvolume-0.5.2 ipyvue-1.5.0 ipyvuetify-1.6.2 ipywebrtc-0.5.0 ipywidgets-7.6.3 jupyterlab-widgets-1.0.0 progressbar2-3.53.1 pyarrow-3.0.0 python-utils-2.5.6 pythreejs-2.3.0 shapely-1.7.1 tabulate-0.8.9 traittypes-0.2.1 vaex-4.1.0 vaex-astro-0.8.0 vaex-core-4.1.0 vaex-hdf5-0.7.0 vaex-jupyter-0.6.0 vaex-ml-0.11.1 vaex-server-0.4.0 vaex-viz-0.5.0 xarray-0.17.0

Once the words Successfully installed appear, the installation has succeeded and vaex is ready to use.

Performance Comparison

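Since other tools can also open and read table files without trouble, we use IPython to compare the time it takes to open the data with each tool, to show where vaex has the advantage: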

[dechin@dechin-manjaro gold]$ ipython
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import vaex

In [2]: import xlrd

In [3]: %timeit xlrd.open_workbook(r'data.xls')
46.4 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit vaex.open('data.csv')
4.95 ms ± 48.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit vaex.open('data.hdf5')
1.34 ms ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

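The results show that opening the same data takes nearly 50 ms with xlrd, while vaex needs as little as about 1 ms. (Note that the three measurements use different file formats: xls, csv, and hdf5.) A performance advantage this large makes vaex well worth a closer look. As for comparisons with other libraries, others have already done them in a linked benchmark: even against pandas, vaex is reported to read data more than 1000 times faster, while computation is sped up several-fold, so overall it performs very well.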
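Outside IPython the %timeit magic is not available, but a comparable measurement can be sketched with the standard timeit module. The helper time_call and the stand-in workload parse_numbers are assumptions for illustration; in practice the workload would be replaced by a call such as the xlrd or vaex open calls timed above.

```python
import timeit


def time_call(func, repeat=7, number=100):
    """Return (mean, best) seconds per call, similar in spirit to %timeit."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return sum(per_call) / len(per_call), min(per_call)


def parse_numbers():
    """Stand-in workload: parse one row of prices from text."""
    return [float(x) for x in "83.98,92.38,82.0,83.52".split(",")]


mean, best = time_call(parse_numbers)
print(f"mean {mean * 1e6:.2f} us, best {best * 1e6:.2f} us per call")
```

Reporting both the mean and the best run helps separate the cost of the call itself from noise caused by other processes on the machine.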

Data Format Conversion

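The test in the previous section used a file we had not mentioned before, data.hdf5, which was in fact converted from data.csv. This section explains how to convert data into a format that vaex can open and recognize. The first option is to use pandas to convert the csv file directly into hdf5, much as we converted the xls file into csv in the section on processing Excel spreadsheets with Python: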

[dechin@dechin-manjaro gold]$ ipython
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [4]: data = pd.read_csv('data.csv')

In [10]: data.to_hdf('data.hdf5','data',mode='w',format='table')

In [11]: !ls -l
总用量 932
-rw-r--r-- 1 dechin dechin 221872  3月 27 21:52 data.csv
-rw-r--r-- 1 dechin dechin 348524  3月 27 22:17 data.hdf5
-rw-r--r-- 1 dechin dechin 372736  3月 27 21:31 data.xls
-rw-r--r-- 1 dechin dechin    563  3月 27 21:42 table.py

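When this finishes, an hdf5 file has been created in the current directory. The drawback of this approach is that the resulting hdf5 file is not directly compatible with vaex. If we read it directly with df = vaex.open('data.hdf5'), the output looks like this: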

In [3]: df
Out[3]: 
#      table
0      '(0, [83.98, 92.38, 82.  , 83.52], [       0,   ...
1      '(1, [83.9 , 83.92, 83.9 , 83.91], [      1,    ...
2      '(2, [84.5 , 84.65, 84.  , 84.51], [      2,    ...
3      '(3, [84.9 , 85.06, 84.9 , 84.99], [      3,    ...
4      '(4, [85.1 , 85.2 , 85.1 , 85.13], [      4,    ...
...    ...
3,917  '(3917, [274.65, 275.35, 274.6 , 274.61], [     ...
3,918  '(3918, [274.4, 275.2, 274.1, 275. ], [      391...
3,919  '(3919, [275.  , 275.01, 274.  , 274.19], [     ...
3,920  '(3920, [275.2, 275.2, 272.6, 272.9], [      392...
3,921  '(3921, [272.96, 273.73, 272.5 , 272.93], [     ...

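In this form the most important information, the column names, has been lost. The values themselves are all preserved correctly, but they are very inconvenient to read. We therefore recommend the second conversion method, which uses vaex itself to convert the format: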

[dechin@dechin-manjaro gold]$ ipython
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import vaex

In [2]: df = vaex.from_csv('data.csv')

In [3]: df.export_hdf5('vaex_data.hdf5')

In [4]: !ls -l
总用量 1220
-rw-r--r-- 1 dechin dechin 221856  3月 27 22:34 data.csv
-rw-r--r-- 1 dechin dechin 348436  3月 27 22:34 data.hdf5
-rw-r--r-- 1 dechin dechin 372736  3月 27 21:31 data.xls
-rw-r--r-- 1 dechin dechin    563  3月 27 21:42 table.py
-rw-r--r-- 1 dechin dechin 293512  3月 27 22:52 vaex_data.hdf5

This creates a vaex_data.hdf5 file in the current directory. Let us try reading this new hdf5 file:

[dechin@dechin-manjaro gold]$ ipython
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import vaex

In [2]: df = vaex.open('vaex_data.hdf5')

In [3]: df
Out[3]: 
#      i     t             s       h       l      e       n      a
0      0     '2002-10-30'  83.98   92.38   82.0   83.52   352    29373370
1      1     '2002-10-31'  83.9    83.92   83.9   83.91   66     5537480
2      2     '2002-11-01'  84.5    84.65   84.0   84.51   77     6502510
3      3     '2002-11-04'  84.9    85.06   84.9   84.99   95     8076330
4      4     '2002-11-05'  85.1    85.2    85.1   85.13   61     5193650
...    ...   ...           ...     ...     ...    ...     ...    ...
3,917  3917  '2018-11-23'  274.65  275.35  274.6  274.61  13478  3708580608
3,918  3918  '2018-11-26'  274.4   275.2   274.1  275.0   13738  3773763584
3,919  3919  '2018-11-27'  275.0   275.01  274.0  274.19  13984  3836845568
3,920  3920  '2018-11-28'  275.2   275.2   272.6  272.9   15592  4258130688
3,921  3921  '2018-11-28'  272.96  273.73  272.5  272.93  592    161576336

In [4]: df.s
Out[4]: 
Expression = s
Length: 3,922 dtype: float64 (column)
-------------------------------------
   0   83.98
   1    83.9
   2    84.5
   3    84.9
   4    85.1
    ...     
3917  274.65
3918   274.4
3919     275
3920   275.2
3921  272.96

In [11]: df.plot(df.i, df.s, show=True) # plot
/home/dechin/anaconda3/lib/python3.8/site-packages/vaex/viz/mpl.py:311: UserWarning: `plot` is deprecated and it will be removed in version 5.x. Please `df.viz.heatmap` instead.
  warnings.warn('`plot` is deprecated and it will be removed in version 5.x. Please `df.viz.heatmap` instead.')

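Note that in the new hdf5 file the column names have changed from Chinese names such as 高 (high) and 低 (low) to English letters such as h and l. To make the data easier to work with, we manually changed the column names in the csv file to English before converting it to hdf5. Finally, we used vaex's built-in plotting to draw the gold price over the years covered by the data.

[Figure: gold price plotted with vaex]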
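The header renaming was done by hand in the article, but it could also be scripted. The sketch below uses only the standard csv module; the RENAME mapping is inferred from the column order in the two outputs shown earlier (the csv header and the vaex table), and the helper rename_header is a hypothetical name introduced for illustration.

```python
import csv
import io

# Mapping inferred from the column order of the csv header and the vaex
# output; the empty first header (the pandas index column) becomes "i".
RENAME = {"": "i", "时间": "t", "开": "s", "高": "h",
          "低": "l", "收": "e", "量": "n", "额": "a"}


def rename_header(src, dst, mapping=RENAME):
    """Copy csv data from src to dst, renaming only the header fields."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    writer.writerow([mapping.get(name, name) for name in header])
    writer.writerows(reader)  # data rows are passed through unchanged


src = io.StringIO(
    ",时间,开,高,低,收,量,额\n"
    "0,2002-10-30,83.98,92.38,82.0,83.52,352,29373370\n"
)
dst = io.StringIO()
rename_header(src, dst)
print(dst.getvalue())
```

With real files, src and dst would be file objects opened with newline='' and encoding='utf-8', and the rows are streamed one at a time, so the file never needs to fit in memory.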

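vaex comes with relatively few built-in plotting methods:

[Figure: overview of vaex's built-in plotting methods]

The most commonly used one is the heatmap, which is why the gold price plot drawn here has a heatmap look. Still, the functionality is fairly complete and the performance is exceptionally strong.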

Summary

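In this article we introduced three Python libraries for processing tabular data: xlrd, pandas, and vaex, with particular emphasis on vaex's excellent performance and its value for big-data applications. With a few simple examples we got a first impression of what each library offers, so that the right one can be chosen for a given real-world scenario.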
