import pandas as pd
import numpy as np
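Basic data structures
- Series
A one-dimensional labeled array, similar to a NumPy array or a Python list. A Series can hold values of different types: strings, booleans, and numbers can all be stored in one.
- DataFrame
A two-dimensional, table-like data structure. Much of its functionality resembles R's data.frame. A DataFrame can be thought of as a container of Series.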
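As a minimal sketch of the two descriptions above (the variable names and sample values are made up for illustration), a Series that mixes types falls back to the generic `object` dtype, and selecting one column of a DataFrame gives back a Series that shares the DataFrame's index:

```python
import pandas as pd

# A Series holding a string, a boolean, and a float is stored with dtype "object".
mixed = pd.Series(["text", True, 3.14])

# A DataFrame acts like a container of Series: each column is a Series,
# and all columns share the same row index.
frame = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})
column = frame["score"]

print(mixed.dtype)
print(type(column).__name__)
```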
1. The Series type in pandas
A one-dimensional Series
It can be initialized from a list:
s = pd.Series([1,3,5,np.nan,6,8])
s
Out[]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
By default the index is a run of integers starting at 0, but it can be changed:

s = pd.Series([1,3,5,np.nan,6,8],index=['a','b','c','d','e','f'])
s
Out[]:
a 1.0
b 3.0
c 5.0
d NaN
e 6.0
f 8.0
dtype: float64
Index: the row labels of the data
s.index
Out[]:
Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
Values:
s.values
Out[]:
array([ 1., 3., 5., nan, 6., 8.])
s.iloc[2]  # access by position; plain s[2] on a non-integer index is deprecated in recent pandas
Out[]:
5.0
s['a']
Out[]:
1.0
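Position-based and label-based lookups can also be written explicitly with `.iloc` and `.loc`, which removes any doubt about whether an integer is meant as a position or as a label. The sketch below rebuilds the same Series so that it runs on its own; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=list("abcdef"))

by_position = s.iloc[2]   # third element, counting from 0
by_label = s.loc["a"]     # element whose label is 'a'

print(by_position, by_label)
```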
Slicing
s[2:5]  # positional slice: start included, end excluded (left-closed, right-open)
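The two slicing styles differ at the end point: a positional slice excludes its end, while a label slice includes it. A small sketch, assuming the same letter-indexed Series as in the earlier cells:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=list("abcdef"))

positional = s.iloc[2:5]      # positions 2, 3, 4 (position 5 is excluded)
by_label = s.loc["c":"e"]     # labels 'c', 'd', 'e' (the end label is included)

print(list(positional.index))
print(list(by_label.index))
```

Both slices select the same three rows here, even though the positional one stops before position 5 and the label one runs through 'e'.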
Assigning to the index:
s.index.name='index'
s
Out[]:
index
a 1.0
b 3.0
c 5.0
d NaN
e 6.0
f 8.0
dtype: float64
s.index=[1,2,3,4,5,6]
s
Out[]:
1 1.0
2 3.0
3 5.0
4 NaN
5 6.0
6 8.0
dtype: float64
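Once the index consists of integers, `s[1]` is a label lookup, not a positional one, so it no longer matches `s.iloc[1]`. The sketch below illustrates this, and also shows that the new index must have the same length as the data. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s.index = [1, 2, 3, 4, 5, 6]

label_lookup = s[1]          # the row labeled 1, i.e. the first element
position_lookup = s.iloc[1]  # the second element by position

# Assigning an index of the wrong length is rejected.
try:
    s.index = [1, 2, 3]
    length_error = False
except ValueError:
    length_error = True

print(label_lookup, position_lookup, length_error)
```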
2. The DataFrame type in pandas
First, build a sequence of dates:
date = pd.date_range('20190130',periods=6)
date
Out[]:
DatetimeIndex(['2019-01-30', '2019-01-31', '2019-02-01', '2019-02-02',
'2019-02-03', '2019-02-04'],
dtype='datetime64[ns]', freq='D')
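`pd.date_range` can describe the same range in more than one way. The sketch below shows the start-plus-periods form used above, an equivalent start-plus-end form, and a non-default step; the `freq="2D"` value is only an illustrative choice.

```python
import pandas as pd

# Six consecutive days, daily frequency by default.
daily = pd.date_range("2019-01-30", periods=6)

# The same six days, given as a start and an end date.
same_range = pd.date_range(start="2019-01-30", end="2019-02-04")

# Every second day, three entries.
every_other = pd.date_range("2019-01-30", periods=3, freq="2D")

print(daily.equals(same_range))
print(list(every_other.strftime("%Y-%m-%d")))
```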
Create a DataFrame:
# Default row and column labels
df = pd.DataFrame(np.random.randn(6,4))
df
Out[]:
          0         1         2         3
0 -0.763025  0.300867  0.122908  0.553115
1  0.096414 -0.824670 -1.151473  0.774615
2 -0.858495  0.409791  0.425232  0.841326
3 -0.587433 -1.175765  1.409728  0.693573
4 -0.008313  0.986666  0.313205 -0.108581
5 -1.652439 -1.683581  0.515152  1.062231
# Specify the row labels
df = pd.DataFrame(np.random.randn(6,4),index=date)
df
Out[]:
                   0         1         2         3
2019-01-30 -0.032291  1.034165 -3.318587 -0.278843
2019-01-31  0.601556  0.762580  0.546680 -0.960772
2019-02-01  0.432969  0.818542  2.796366  0.920975
2019-02-02 -0.742998 -0.868394  1.671997  1.156050
2019-02-03  1.443198 -1.539142 -0.165005  2.061008
2019-02-04 -0.467950  0.206220  0.044831  0.380925
# Specify the column names
df = pd.DataFrame(np.random.randn(6,4),index=date,columns=list('ABCD'))
df
Out[]:
                   A         B         C         D
2019-01-30 -0.137763  0.015278  0.079426 -1.793683
2019-01-31  1.412634 -0.423254 -0.188469 -1.971741
2019-02-01  0.925228 -0.416434  0.432288  0.700189
2019-02-02  1.575157 -0.276338  0.425863 -0.332686
2019-02-03 -1.097524 -0.098180  0.134335 -2.618035
2019-02-04  0.872670  0.104345 -0.759217 -1.540087
# From a dictionary: each key becomes a column
df2 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20191130'),
    'C': pd.Series(1, index=list(range(4)), dtype=float),
    'D': np.array([3]*4, dtype=int),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'abc',
})
df2
Out[]:
     A          B    C  D      E    F
0  1.0 2019-11-30  1.0  3   test  abc
1  1.0 2019-11-30  1.0  3  train  abc
2  1.0 2019-11-30  1.0  3   test  abc
3  1.0 2019-11-30  1.0  3  train  abc
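In the dictionary form, scalar values such as `1.` and `'abc'` are repeated on every row, and the row index comes from the array-like entries. A reduced sketch of that behavior follows, with three rows and made-up variable names. It also shows that a dictionary of scalars alone is rejected, because pandas cannot tell how many rows to create.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    "A": 1.0,                                         # scalar, repeated on each row
    "C": pd.Series(1, index=range(3), dtype=float),   # supplies the row index
    "D": np.array([3] * 3),                           # must match that length
    "F": "abc",                                       # scalar, repeated on each row
})

# With only scalars there is no length to broadcast to.
try:
    pd.DataFrame({"A": 1.0, "F": "abc"})
    scalars_only_ok = True
except ValueError:
    scalars_only_ok = False

print(df_small.shape, scalars_only_ok)
```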
Viewing data
- First and last rows
The head and tail methods show the first and last few rows, respectively; the default is 5 rows.
df.tail(3)
Out[]:
                   A         B         C         D
2019-02-02  1.575157 -0.276338  0.425863 -0.332686
2019-02-03 -1.097524 -0.098180  0.134335 -2.618035
2019-02-04  0.872670  0.104345 -0.759217 -1.540087
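A short sketch of how the row count argument behaves, using a small deterministic frame (not the random data above) so the results are predictable. A negative count means "all rows except that many".

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("ABCD"))

first_five = df_demo.head()         # default: first 5 rows
last_three = df_demo.tail(3)        # last 3 rows
all_but_last_two = df_demo.head(-2) # every row except the last 2

print(len(first_five), len(last_three), len(all_but_last_two))
```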
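- Data types: the dtypes attribute (the integer width reported for column D depends on the platform and NumPy version)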
df2.dtypes
Out[]:
A float64
B datetime64[ns]
C float64
D int32
E category
F object
dtype: object
- Row labels: the index attribute
df.index
Out[]:
DatetimeIndex(['2019-01-30', '2019-01-31', '2019-02-01', '2019-02-02',
'2019-02-03', '2019-02-04'],
dtype='datetime64[ns]', freq='D')
- Column labels: the columns attribute
df.columns
Out[]:
Index(['A', 'B', 'C', 'D'], dtype='object')
- Values: the values attribute
df.values
Out[]:
array([[-0.13776298, 0.01527819, 0.07942575, -1.79368276],
[ 1.41263423, -0.42325441, -0.18846901, -1.97174079],
[ 0.92522826, -0.41643405, 0.43228816, 0.70018932],
[ 1.57515712, -0.27633845, 0.42586307, -0.33268635],
[-1.09752417, -0.09818006, 0.1343351 , -2.61803483],
[ 0.87266954, 0.10434451, -0.75921666, -1.54008747]])
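The pandas documentation recommends `DataFrame.to_numpy()` as the method-style counterpart of `.values`. The sketch below, with made-up frames, shows one thing to watch for with either: an all-float frame converts to a float64 array, while a frame that mixes numbers and strings falls back to an `object` array.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["abc", "abc"]})

numeric_array = numeric.to_numpy()  # homogeneous columns keep a numeric dtype
mixed_array = mixed.to_numpy()      # mixed columns become a generic object array

print(numeric_array.dtype, mixed_array.dtype)
```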