Python数据分析模块pandas用法详解


Posted in Python onSeptember 04, 2019

本文实例讲述了Python数据分析模块pandas用法。分享给大家供大家参考,具体如下:

一 介绍

pandas(Python Data Analysis Library)是基于numpy的数据分析模块,提供了大量标准数据模型和高效操作大型数据集所需要的工具,可以说pandas是使得Python能够成为高效且强大的数据分析环境的重要因素之一。

pandas主要提供了3种数据结构:

1)Series,带标签的一维数组。

2)DataFrame,带标签且大小可变的二维表格结构。

3)Panel,带标签且大小可变的三维数组。

二 代码

1、生成一维数组

>>>import pandas as pd
>>>import numpy as np
>>> x = pd.Series([1,3,5, np.nan])
>>>print(x)
01.0
13.0
25.0
3NaN
dtype: float64

2、生成二维数组

>>> dates = pd.date_range(start='20170101', end='20171231', freq='D')#间隔为天
>>>print(dates)
DatetimeIndex(['2017-01-01','2017-01-02','2017-01-03','2017-01-04',
'2017-01-05','2017-01-06','2017-01-07','2017-01-08',
'2017-01-09','2017-01-10',
...
'2017-12-22','2017-12-23','2017-12-24','2017-12-25',
'2017-12-26','2017-12-27','2017-12-28','2017-12-29',
'2017-12-30','2017-12-31'],
dtype='datetime64[ns]', length=365, freq='D')
>>> dates = pd.date_range(start='20170101', end='20171231', freq='M')#间隔为月
>>>print(dates)
DatetimeIndex(['2017-01-31','2017-02-28','2017-03-31','2017-04-30',
'2017-05-31','2017-06-30','2017-07-31','2017-08-31',
'2017-09-30','2017-10-31','2017-11-30','2017-12-31'],
dtype='datetime64[ns]', freq='M')
>>> df = pd.DataFrame(np.random.randn(12,4), index=dates, columns=list('ABCD'))
>>>print(df)
A B C D
2017-01-31-0.6825560.2441020.4508550.236475
2017-02-28-0.6300600.5906670.4824380.225697
2017-03-311.0669890.3193391.0949531.716053
2017-04-300.334944-0.053049-1.009493-1.039470
2017-05-31-0.380778-0.0444290.0756470.931243
2017-06-300.8675400.872197-0.738974-1.114596
2017-07-310.423371-1.0863860.183820-0.438921
2017-08-311.2851630.634134-0.4729731.281057
2017-09-30-1.002832-0.888122-1.316014-0.070637
2017-10-311.735617-0.2538150.5544031.536211
2017-11-302.0303840.6675561.0126980.239479
2017-12-312.059718-0.0890501.4205170.224578
>>> df = pd.DataFrame([[np.random.randint(1,100)for j in range(4)]for i in range(12)], index=dates, columns=list('ABCD'))
>>>print(df)
A B C D
2017-01-317532522
2017-02-2870997098
2017-03-3199477567
2017-04-3033701749
2017-05-3162886891
2017-06-3019751844
2017-07-3150856582
2017-08-315628776
2017-09-306173111
2017-10-318296692
2017-11-306359194
2017-12-3179586933
>>> df = pd.DataFrame({'A':[np.random.randint(1,100)for i in range(4)],
'B':pd.date_range(start='20130101', periods=4, freq='D'),
'C':pd.Series([1,2,3,4],index=list(range(4)),dtype='float32'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(["test","train","test","train"]),
'F':'foo'})
>>>print(df)
A B C D E F
0152013-01-011.03 test foo
1112013-01-022.03 train foo
2912013-01-033.03 test foo
3912013-01-044.03 train foo
>>> df = pd.DataFrame({'A':[np.random.randint(1,100)for i in range(4)],
'B':pd.date_range(start='20130101', periods=4, freq='D'),
'C':pd.Series([1,2,3,4],index=['zhang','li','zhou','wang'],dtype='float32'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(["test","train","test","train"]),
'F':'foo'})
>>>print(df)
A B C D E F
zhang 362013-01-011.03 test foo
li 862013-01-022.03 train foo
zhou 102013-01-033.03 test foo
wang 792013-01-044.03 train foo
>>>

3、二维数据查看

>>> df.head() #默认显示前5行
A B C D E F
zhang 362013-01-011.03 test foo
li 862013-01-022.03 train foo
zhou 102013-01-033.03 test foo
wang 792013-01-044.03 train foo
>>> df.head(3) #查看前3行
A B C D E F
zhang 362013-01-011.03 test foo
li 862013-01-022.03 train foo
zhou 102013-01-033.03 test foo
>>> df.tail(2) #查看最后2行
A B C D E F
zhou 102013-01-033.03 test foo
wang 792013-01-044.03 train foo

4、查看二维数据的索引、列名和数据

>>> df.index
Index(['zhang','li','zhou','wang'], dtype='object')
>>> df.columns
Index(['A','B','C','D','E','F'], dtype='object')
>>> df.values
array([[36,Timestamp('2013-01-01 00:00:00'),1.0,3,'test','foo'],
[86,Timestamp('2013-01-02 00:00:00'),2.0,3,'train','foo'],
[10,Timestamp('2013-01-03 00:00:00'),3.0,3,'test','foo'],
[79,Timestamp('2013-01-04 00:00:00'),4.0,3,'train','foo']], dtype=object)

5、查看数据的统计信息

>>> df.describe() #平均值、标准差、最小值、最大值等信息
A C D
count 4.0000004.0000004.0
mean 52.7500002.5000003.0
std 36.0682221.2909940.0
min 10.0000001.0000003.0
25%29.5000001.7500003.0
50%57.5000002.5000003.0
75%80.7500003.2500003.0
max 86.0000004.0000003.0

6、二维数据转置 

>>> df.T
zhang li zhou \
A 368610
B 2013-01-0100:00:002013-01-0200:00:002013-01-0300:00:00
C 123
D 333
E test train test
F foo foo foo
wang
A 79
B 2013-01-0400:00:00
C 4
D 3
E train
F foo

7、排序 

>>> df.sort_index(axis=0, ascending=False)#对轴进行排序
A B C D E F
zhou 102013-01-033.03 test foo
zhang 362013-01-011.03 test foo
wang 792013-01-044.03 train foo
li 862013-01-022.03 train foo
>>> df.sort_index(axis=1, ascending=False)
F E D C B A
zhang foo test 31.02013-01-0136
li foo train 32.02013-01-0286
zhou foo test 33.02013-01-0310
wang foo train 34.02013-01-0479
>>> df.sort_index(axis=0, ascending=True)
A B C D E F
li 862013-01-022.03 train foo
wang 792013-01-044.03 train foo
zhang 362013-01-011.03 test foo
zhou 102013-01-033.03 test foo
>>> df.sort_values(by='A')#对数据进行排序
A B C D E F
zhou 102013-01-033.03 test foo
zhang 362013-01-011.03 test foo
wang 792013-01-044.03 train foo
li 862013-01-022.03 train foo
>>> df.sort_values(by='A', ascending=False)#降序排列
A B C D E F
li 862013-01-022.03 train foo
wang 792013-01-044.03 train foo
zhang 362013-01-011.03 test foo
zhou 102013-01-033.03 test foo

8、数据选择

>>> df['A']#选择列
zhang 1
li 1
zhou 60
wang 58
Name: A, dtype: int64
>>> df[0:2]#使用切片选择多行
A B C D E F
zhang 12013-01-011.03 test foo
li 12013-01-022.03 train foo
>>> df.loc[:,['A','C']]#选择多列
A C
zhang 11.0
li 12.0
zhou 603.0
wang 584.0
>>> df.loc[['zhang','zhou'],['A','D','E']]#同时指定多行与多列进行选择
A D E
zhang 13 test
zhou 603 test
>>> df.loc['zhang',['A','D','E']]
A 1
D 3
E test
Name: zhang, dtype: object

9、数据修改和设置

>>> df.iat[0,2]=3#修改指定行、列位置的数据值
>>>print(df)
A B C D E F
zhang 12013-01-013.03 test foo
li 12013-01-022.03 train foo
zhou 602013-01-033.03 test foo
wang 582013-01-044.03 train foo
>>> df.loc[:,'D']=[np.random.randint(50,60)for i in range(4)]#修改某列的值
>>>print(df)
A B C D E F
zhang 12013-01-013.057 test foo
li 12013-01-022.052 train foo
zhou 602013-01-033.057 test foo
wang 582013-01-044.056 train foo
>>> df['C']=-df['C']#对指定列数据取反
>>>print(df)
A B C D E F
zhang 12013-01-01-3.057 test foo
li 12013-01-02-2.052 train foo
zhou 602013-01-03-3.057 test foo
wang 582013-01-04-4.056 train foo

10、缺失值处理

>>> df1 = df.reindex(index=['zhang','li','zhou','wang'], columns=list(df.columns)+['G'])
>>>print(df1)
A B C D E F G
zhang 12013-01-01-3.057 test foo NaN
li 12013-01-02-2.052 train foo NaN
zhou 602013-01-03-3.057 test foo NaN
wang 582013-01-04-4.056 train foo NaN
>>> df1.iat[0,6]=3#修改指定位置元素值,该列其他元素为缺失值NaN
>>>print(df1)
A B C D E F G
zhang 12013-01-01-3.057 test foo 3.0
li 12013-01-02-2.052 train foo NaN
zhou 602013-01-03-3.057 test foo NaN
wang 582013-01-04-4.056 train foo NaN
>>> pd.isnull(df1)#测试缺失值,返回值为True/False阵列
A B C D E F G
zhang FalseFalseFalseFalseFalseFalseFalse
li FalseFalseFalseFalseFalseFalseTrue
zhou FalseFalseFalseFalseFalseFalseTrue
wang FalseFalseFalseFalseFalseFalseTrue
>>> df1.dropna()#返回不包含缺失值的行
A B C D E F G
zhang 12013-01-01-3.057 test foo 3.0
>>> df1['G'].fillna(5, inplace=True)#使用指定值填充缺失值
>>>print(df1)
A B C D E F G
zhang 12013-01-01-3.057 test foo 3.0
li 12013-01-02-2.052 train foo 5.0
zhou 602013-01-03-3.057 test foo 5.0
wang 582013-01-04-4.056 train foo 5.0

11、数据操作

>>> df1.mean()#平均值,自动忽略缺失值
A 30.0
C -3.0
D 55.5
G 4.5
dtype: float64
>>> df.mean(1)#横向计算平均值
zhang 18.333333
li 17.000000
zhou 38.000000
wang 36.666667
dtype: float64
>>> df1.shift(1)#数据移位
A B C D E F G
zhang NaNNaTNaNNaNNaNNaNNaN
li 1.02013-01-01-3.057.0 test foo 3.0
zhou 1.02013-01-02-2.052.0 train foo 5.0
wang 60.02013-01-03-3.057.0 test foo 5.0
>>> df1['D'].value_counts()#直方图统计
572
561
521
Name: D, dtype: int64
>>>print(df1)
A B C D E F G
zhang 12013-01-01-3.057 test foo 3.0
li 12013-01-02-2.052 train foo 5.0
zhou 602013-01-03-3.057 test foo 5.0
wang 582013-01-04-4.056 train foo 5.0
>>> df2 = pd.DataFrame(np.random.randn(10,4))
>>>print(df2)
0123
0-0.939904-1.856658-0.2819650.203624
10.3501620.060674-0.9148080.135735
2-1.031384-1.6112740.341546-0.363671
30.139464-0.050959-0.810610-0.772648
4-1.146810-0.7916081.488790-0.490004
5-0.100707-0.763545-0.071274-0.298142
6-0.2120140.8097090.6931960.980568
7-0.812985-0.000325-0.675101-0.217394
80.066969-0.084609-0.4330990.535616
9-0.319120-0.5328541.321712-1.751913
>>> p1 = df2[:3] >>> print(p1) 0 1 2 3 0 -0.939904 -1.856658 -0.281965 0.203624 1 0.350162 0.060674 -0.914808 0.135735 2 -1.031384 -1.611274 0.341546 -0.363671 >>> p2 = df2[3:7] >>> print(p2) 0 1 2 3 3 0.139464 -0.050959 -0.810610 -0.772648 4 -1.146810 -0.791608 1.488790 -0.490004 5 -0.100707 -0.763545 -0.071274 -0.298142 6 -0.212014 0.809709 0.693196 0.980568 >>> p3 = df2[7:] >>> print(p3) 0 1 2 3 7 -0.812985 -0.000325 -0.675101 -0.217394 8 0.066969 -0.084609 -0.433099 0.535616 9 -0.319120 -0.532854 1.321712 -1.751913 >>> df3 = pd.concat([p1, p2, p3]) #数据行合并 >>> print(df3) 0 1 2 3 0 -0.939904 -1.856658 -0.281965 0.203624 1 0.350162 0.060674 -0.914808 0.135735 2 -1.031384 -1.611274 0.341546 -0.363671 3 0.139464 -0.050959 -0.810610 -0.772648 4 -1.146810 -0.791608 1.488790 -0.490004 5 -0.100707 -0.763545 -0.071274 -0.298142 6 -0.212014 0.809709 0.693196 0.980568 7 -0.812985 -0.000325 -0.675101 -0.217394 8 0.066969 -0.084609 -0.433099 0.535616 9 -0.319120 -0.532854 1.321712 -1.751913 >>> df2 == df3 0 1 2 3 0 True True True True 1 True True True True 2 True True True True 3 True True True True 4 True True True True 5 True True True True 6 True True True True 7 True True True True 8 True True True True 9 True True True True >>> df4 = pd.DataFrame({'A':[np.random.randint(1,5) for i in range(8)], 'B':[np.random.randint(10,15) for i in range(8)], 'C':[np.random.randint(20,30) for i in range(8)], 'D':[np.random.randint(80,100) for i in range(8)]}) >>> print(df4) A B C D 0 4 11 24 91 1 1 13 28 95 2 2 12 27 91 3 1 12 20 87 4 3 11 24 96 5 1 13 21 99 6 3 11 22 95 7 2 13 26 98 >>> >>> df4.groupby('A').sum() #数据分组计算 B C D A 1 38 69 281 2 25 53 189 3 22 46 191 4 11 24 91 >>> >>> df4.groupby(['A','B']).mean() C D A B 1 12 20.0 87.0 13 24.5 97.0 2 12 27.0 91.0 13 26.0 98.0 3 11 23.0 95.5 4 11 24.0 91.0

12、结合matplotlib绘图

>>>import pandas as pd
>>>import numpy as np
>>>import matplotlib.pyplot as plt
>>> df = pd.DataFrame(np.random.randn(1000,2), columns=['B','C']).cumsum()
>>>print(df)
B C
00.0898860.511081
11.3237661.584758
21.489479-0.438671
30.831331-0.398021
4-0.2482330.494418
5-0.0130850.684518
60.666951-1.422161
71.768838-0.658786
82.6610800.648505
91.9517510.836261
103.5387851.657475
113.2540342.052609
124.2486201.568401
134.0771730.055622
143.452590-0.200314
152.627620-0.408829
163.690537-0.210440
173.1849240.365447
183.646556-0.150044
194.164563-0.023405
202.3914470.517872
212.8651530.686649
223.6231830.663927
231.5451170.151044
243.5959240.903619
253.0138041.855083
264.4388011.014572
275.1552160.882628
284.4314570.741509
292.8419490.709991
........
970-7.910646-13.738689
971-7.318091-14.811335
972-9.144376-15.466873
973-9.538658-15.367167
974-9.061114-16.822726
975-9.803798-17.368350
976-10.180575-17.270180
977-10.601352-17.671543
978-10.804909-19.535919
979-10.397964-20.361419
980-10.979640-20.300267
981-8.738223-20.202669
982-9.339929-21.528973
983-9.780686-20.902152
984-11.072655-21.235735
985-10.849717-20.439201
986-10.953247-19.708973
987-13.032707-18.687553
988-12.984567-19.557132
989-13.508836-18.747584
990-13.420713-19.883180
991-11.718125-20.474092
992-11.936512-21.360752
993-14.225655-22.006776
994-13.524940-20.844519
995-14.088767-20.492952
996-14.169056-20.666777
997-14.798708-19.960555
998-15.766568-19.395622
999-17.281143-19.089793
[1000 rows x 2 columns]
>>> df['A']= pd.Series(list(range(len(df))))
>>>print(df)
B C A
00.0898860.5110810
11.3237661.5847581
21.489479-0.4386712
30.831331-0.3980213
4-0.2482330.4944184
5-0.0130850.6845185
60.666951-1.4221616
71.768838-0.6587867
82.6610800.6485058
91.9517510.8362619
103.5387851.65747510
113.2540342.05260911
124.2486201.56840112
134.0771730.05562213
143.452590-0.20031414
152.627620-0.40882915
163.690537-0.21044016
173.1849240.36544717
183.646556-0.15004418
194.164563-0.02340519
202.3914470.51787220
212.8651530.68664921
223.6231830.66392722
231.5451170.15104423
243.5959240.90361924
253.0138041.85508325
264.4388011.01457226
275.1552160.88262827
284.4314570.74150928
292.8419490.70999129
...........
970-7.910646-13.738689970
971-7.318091-14.811335971
972-9.144376-15.466873972
973-9.538658-15.367167973
974-9.061114-16.822726974
975-9.803798-17.368350975
976-10.180575-17.270180976
977-10.601352-17.671543977
978-10.804909-19.535919978
979-10.397964-20.361419979
980-10.979640-20.300267980
981-8.738223-20.202669981
982-9.339929-21.528973982
983-9.780686-20.902152983
984-11.072655-21.235735984
985-10.849717-20.439201985
986-10.953247-19.708973986
987-13.032707-18.687553987
988-12.984567-19.557132988
989-13.508836-18.747584989
990-13.420713-19.883180990
991-11.718125-20.474092991
992-11.936512-21.360752992
993-14.225655-22.006776993
994-13.524940-20.844519994
995-14.088767-20.492952995
996-14.169056-20.666777996
997-14.798708-19.960555997
998-15.766568-19.395622998
999-17.281143-19.089793999
[1000 rows x 3 columns]
>>> plt.figure()
<matplotlib.figure.Figure object at 0x000002A2A0B10F28>
>>> df.plot(x='A')
<matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A12FE7F0>
>>> plt.show()

运行结果
Python数据分析模块pandas用法详解
 

>>> df = pd.DataFrame(np.random.rand(10,4), columns=['a','b','c','d'])
>>>print(df)
a b c d
00.5044340.1908750.0016870.327372
10.4068440.6020290.9120750.815889
20.8285340.9859100.0946620.552089
30.1988430.8187850.7506490.967054
40.4984940.1513780.4175060.264438
50.6552880.6727880.0886160.433270
60.4931270.0092540.1794790.396655
70.4193860.9109860.0200040.229063
80.6714690.6121890.3749200.407093
90.4149780.0334990.7560250.717849
>>> df.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A17BD7B8>
>>> plt.show()

运行结果

Python数据分析模块pandas用法详解
 

>>> df = pd.DataFrame(np.random.rand(10,4), columns=['a','b','c','d'])
>>> df.plot(kind='barh', stacked=True)
<matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A3784390>
>>> plt.show()

Python数据分析模块pandas用法详解
       

希望本文所述对大家Python程序设计有所帮助。

Python 相关文章推荐
python查询mysql中文乱码问题
Nov 09 Python
关于Python 3中print函数的换行详解
Aug 08 Python
python 读文件,然后转化为矩阵的实例
Apr 23 Python
Python3使用turtle绘制超立方体图形示例
Jun 19 Python
浅析Python3中的对象垃圾收集机制
Jun 06 Python
python模块导入的方法
Oct 24 Python
opencv3/C++ 平面对象识别&amp;透视变换方式
Dec 11 Python
np.random.seed() 的使用详解
Jan 14 Python
python多线程semaphore实现线程数控制的示例
Aug 10 Python
解决python绘图使用subplots出现标题重叠的问题
Apr 30 Python
Python自动化测试PO模型封装过程详解
Jun 22 Python
OpenCV 图像梯度的实现方法
Jul 25 Python
Python实现TCP探测目标服务路由轨迹的原理与方法详解
Sep 04 #Python
基于python全局设置id 自动化测试元素定位过程解析
Sep 04 #Python
Django框架 querySet功能解析
Sep 04 #Python
Django框架 查询Extra功能实现解析
Sep 04 #Python
Django框架 Pagination分页实现代码实例
Sep 04 #Python
python 动态迁移solr数据过程解析
Sep 04 #Python
Django框架 信号调度原理解析
Sep 04 #Python
You might like
加速XP搜索功能堪比vista
2007/03/22 PHP
php simplexmlElement操作xml的命名空间实现代码
2011/01/04 PHP
PHP设计模式 注册表模式(多个类的注册)
2012/02/05 PHP
PHPStudy下如何为Apache安装SSL证书的方法步骤
2019/01/23 PHP
TP5框架实现上传多张图片的方法分析
2020/03/29 PHP
两种简单实现菜单高亮显示的JS类代码
2010/06/27 Javascript
js 控制下拉菜单刷新的方法
2013/03/03 Javascript
图标线性回归斜着移动到指定的位置
2013/08/16 Javascript
Javascript和Java获取各种form表单信息的简单实例
2014/02/14 Javascript
了不起的node.js读书笔记之mongodb数据库交互
2014/12/22 Javascript
jQuery消息提示框插件Tipso
2015/05/04 Javascript
js限制文本框只能输入中文的方法
2015/08/11 Javascript
javascript函数式编程程序员的工具集
2015/10/11 Javascript
基于JavaScript实现动态添加删除表格的行
2016/02/01 Javascript
Bootstrap布局组件应用实例讲解
2016/02/17 Javascript
图文详解JavaScript的原型对象及原型链
2016/08/02 Javascript
基于input框覆盖掉数字英文的实例讲解
2017/07/21 Javascript
react开发教程之React 组件之间的通信方式
2017/08/12 Javascript
JS禁止浏览器右键查看元素或按F12审查元素自动关闭页面示例代码
2017/09/07 Javascript
详解vue3.0 的 Composition API 的一种使用方法
2020/10/26 Javascript
JavaScript实现移动小精灵的案例代码
2020/12/12 Javascript
[05:03]显微镜下的DOTA2第十期——Ti3豪之超神幽鬼
2014/06/23 DOTA
Python 检查数组元素是否存在类似PHP isset()方法
2014/10/14 Python
[原创]使用豆瓣提供的国内pypi源
2017/07/02 Python
基于python 处理中文路径的终极解决方法
2018/04/12 Python
Python函数装饰器实现方法详解
2018/12/22 Python
Django实现跨域请求过程详解
2019/07/25 Python
Pytorch实现将模型的所有参数的梯度清0
2020/06/24 Python
解决Pycharm 中遇到Unresolved reference 'sklearn'的问题
2020/07/13 Python
前端canvas动画如何转成mp4视频的方法
2019/06/17 HTML / CSS
中国茶叶、茶具一站式网上购物商城:醉品茶城
2018/07/03 全球购物
大整数数相乘的问题
2012/07/22 面试题
财会专业毕业生自荐信
2014/07/09 职场文书
校园会短篇的广播稿
2014/10/21 职场文书
小学中等生评语
2014/12/29 职场文书
USB TYPE-C 或将成为所有智能手机充电标准
2022/04/21 数码科技