Pandas Basics
One-dimensional arrays in Pandas: Series
>>> import pandas as pd
>>> data = pd.Series([0.25, 0.5, 0.75, 1.0])
>>> data
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
>>> data.values
array([0.25, 0.5 , 0.75, 1. ])
>>> data.index
RangeIndex(start=0, stop=4, step=1)
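A minimal sketch of how the values and the default index fit together. The variable names `values`, `second` and `middle` are illustrative only and do not come from the notes.

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# The underlying values are exposed as a plain NumPy array.
values = data.values
print(type(values))

# With the default RangeIndex, labels and positions coincide,
# so element access and slicing look just like NumPy.
second = data[1]
middle = data[1:3]
print(second)
print(list(middle.index), list(middle.values))
```

With a default index, the Series behaves much like a one-dimensional NumPy array; the differences only show up once the index is set explicitly.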
Index
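- The key difference between a Series and a one-dimensional NumPy array is the index: a NumPy array has an implicit integer position for each element, while a Series has an explicitly defined index attached to its values.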
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
>>>data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
>>>data['b']
0.5
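The index does not have to be strings or consecutive numbers. The sketch below uses an arbitrary, non-sequential integer index (chosen only for illustration) to show that square-bracket access looks up the label, while `iloc` looks up the position.

```python
import pandas as pd

# Hypothetical non-sequential integer labels, for illustration only.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data[5]          # the element whose label is 5
by_position = data.iloc[1]  # the element at position 1 (second element)

print(by_label, by_position)
```

Both expressions pick the same element here, because label 5 happens to sit at position 1.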
Dictionaries and Series
- Because of its index, a Series can be viewed as a specialized dictionary that maps index labels to values.
>>>population_dict = {
'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
>>>population = pd.Series(population_dict)
>>>population
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
dtype: int64
>>> population['California':'Illinois']
California 38332521
Florida 19552860
Illinois 12882135
dtype: int64
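The alphabetical order in the output above depends on the pandas version: to my knowledge, recent releases keep the dictionary's insertion order instead of sorting the keys. A hedged sketch that gives the same slice regardless of version is to sort the index explicitly first (`by_name` and `subset` are illustrative names):

```python
import pandas as pd

population_dict = {
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135,
}
population = pd.Series(population_dict)

# Sort by label so that label-based slicing is well defined
# and independent of the original dictionary order.
by_name = population.sort_index()

# Unlike positional slicing, label slicing includes the end label.
subset = by_name['California':'Illinois']
print(subset)
```

Note that the slice includes `'Illinois'` itself, which is how label-based slicing differs from ordinary Python slicing.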
Creating a Series
>>>pd.Series([2, 4, 6])
0 2
1 4
2 6
dtype: int64
>>>pd.Series(5, index=[100, 200, 300])
100 5
200 5
300 5
dtype: int64
>>> pd.Series({2: 'a', 1: 'b', 3: 'c'})
1 b
2 a
3 c
dtype: object
>>> pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
3 c
2 a
dtype: object
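When an explicit index is passed together with a dictionary, only the listed keys are kept. As a small sketch of the opposite case, a label that is absent from the dictionary (here the hypothetical label `4`) is filled with a missing value:

```python
import pandas as pd

# Label 4 is not a key of the dictionary, so its value is missing (NaN).
letters = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])
print(letters)

has_missing = letters.isna().tolist()
print(has_missing)
```

So the `index` argument acts both as a filter and as the final ordering of the result.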
DataFrame in Pandas
population_dict = {
'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
area_dict = {
'California': 423967,
'Texas': 695662,
'New York': 141297,
'Florida': 170312,
'Illinois': 149995}
area = pd.Series(area_dict)
states = pd.DataFrame({'population': population, 'area': area})
>>> states
              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193
>>> states.index
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
>>> states.columns
Index(['area', 'population'], dtype='object')
>>>states['area']
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
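A DataFrame can be thought of as a dictionary of Series that share one index. The sketch below, using a shortened two-state version of the data, illustrates that the columns are matched by label rather than by position, so the two Series do not need to list the states in the same order:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
# Same labels as above, deliberately given in a different order.
area = pd.Series({'Texas': 695662, 'California': 423967})

states = pd.DataFrame({'population': population, 'area': area})

# Each column is itself a Series, and values stay attached to their labels.
area_column = states['area']
print(type(area_column))
print(states.loc['Texas', 'area'], states.loc['California', 'population'])
```

Alignment by label is what makes it safe to combine Series that were built separately.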
Creating a DataFrame
- From a single Series
>>> pd.DataFrame(population, columns=['population'])
            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
- From a list of dictionaries (each dictionary can be thought of as describing one sample, i.e. one row)
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
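The second call above uses dictionaries whose keys do not all match. A short sketch of what that produces: the columns are the union of all keys, and any entry a row does not provide is filled with NaN (the name `df` is illustrative).

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
print(df)

# Row 0 has no 'c' and row 1 has no 'a'; both become missing values.
missing_mask = df.isna()
print(missing_mask)
```

Because NaN is a floating-point value, the columns `a` and `c` are stored as floats, while the complete column `b` stays integer.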
- From a dictionary of Series
>>> pd.DataFrame({'population': population, 'area': area})
              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193
- From a two-dimensional NumPy array
import numpy as np
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
        foo  bar
a  0.529692  ...
b  0.391235  ...
c  0.440382  ...
- From a NumPy structured array
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
>>> A
array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])
>>>pd.DataFrame(A)
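As a sketch of what the structured-array case produces: the field names of the array become the column names, and each column keeps the dtype of its field (`df` is an illustrative name).

```python
import numpy as np
import pandas as pd

# A structured array with an 8-byte integer field 'A' and an 8-byte float field 'B'.
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])

df = pd.DataFrame(A)
print(df)
print(df.dtypes)
```

The result has three rows of zeros and a default integer index, since the structured array carries no row labels.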
Index
>>>ind = pd.Index([2, 3, 5, 7, 11])
>>>ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')
>>>ind[::2]
Int64Index([2, 5, 11], dtype='int64')
>>>ind.shape
(5,)
>>>ind.ndim
1
>>>ind.dtype
dtype('int64')
>>>ind[1] = 0
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-41-906a9fa1424c> in <module>()
----> 1 ind[1] = 0
~/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
1722
1723 def __setitem__(self, key, value):
-> 1724 raise TypeError("Index does not support mutable operations")
1725
1726 def __getitem__(self, key):
TypeError: Index does not support mutable operations
>>>indA = pd.Index([1, 3, 5, 7, 9])
>>>indB = pd.Index([2, 3, 5, 7, 11])
>>>indA & indB
Int64Index([3, 5, 7], dtype='int64')
>>>indA | indB
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
>>>indA ^ indB
Int64Index([1, 2, 9, 11], dtype='int64')
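The operator forms above reflect the pandas version these notes were written with. As far as I know, newer releases treat `&`, `|` and `^` on an Index as element-wise operations rather than set operations, so a more portable sketch uses the explicit set methods (the result names are illustrative):

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # labels present in both
either = indA.union(indB)                   # labels present in at least one
only_one = indA.symmetric_difference(indB)  # labels present in exactly one

print(list(common))
print(list(either))
print(list(only_one))
```

The printed class name of the result (`Int64Index` versus `Index`) may also differ between versions, but the contained labels are the same.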
Data indexing and selection in Pandas
Data selection in Series
Dictionary-style selection
import pandas as pd
>>>data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
>>>data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
>>>'a' in data
True
>>>data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
>>>data.items()
<zip at 0x110f11e48>
>>>list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
Data can be added to a Series just as it would be added to a dictionary
>>>d