Pandas

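This article covers the basics of Pandas: creating Series and DataFrame objects, and selecting and manipulating data. It explains indexing with loc and iloc, as well as data alignment, missing-value handling, joining and merging, and set operations, and is meant as a reference for learning Pandas.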


Pandas Basics

One-dimensional arrays in Pandas: Series

import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

>>> data
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

>>> data.values  # returns the underlying NumPy array
array([0.25, 0.5 , 0.75, 1.  ])

>>> data.index
RangeIndex(start=0, stop=4, step=1)
Index
  • The core difference between a Series and a one-dimensional NumPy vector is the Index
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
>>>data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

>>>data['b']
0.5
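To make the difference from a NumPy vector concrete, here is a minimal sketch. The non-contiguous integer labels `[2, 5, 3, 7]` are an assumption chosen for illustration. A NumPy array is addressed only by implicit position, while a Series can also be addressed by its explicit labels.

```python
import numpy as np
import pandas as pd

arr = np.array([0.25, 0.5, 0.75, 1.0])

# Hypothetical labels: integers that are neither sorted nor contiguous
s = pd.Series(arr, index=[2, 5, 3, 7])

by_position_numpy = arr[1]      # NumPy: implicit integer position
by_label = s.loc[3]             # Series: explicit label 3 (third entry)
by_position_series = s.iloc[1]  # Series: implicit position, like NumPy

print(by_position_numpy, by_label, by_position_series)
```

Using `loc` and `iloc` explicitly avoids any ambiguity about whether an integer is meant as a label or as a position.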
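Dictionaries and Series
  • Because of its index, a Series can be thought of as a dictionary
>>> population_dict = {'California': 38332521,
...                    'Texas': 26448193,
...                    'New York': 19651127,
...                    'Florida': 19552860,
...                    'Illinois': 12882135}
>>> # sort_index() gives the alphabetical order shown below regardless of pandas version
>>> population = pd.Series(population_dict).sort_index()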

>>>population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

>>> population['California':'Illinois']  # label slices include the end label
California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
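Unlike a plain dictionary, a Series supports slicing, and slicing by label behaves differently from slicing by position. The sketch below uses a small hypothetical Series (not the population data) to show that a label slice includes its end label, while a positional slice excludes its end position.

```python
import pandas as pd

# Hypothetical data; the index is sorted so the label slice is well defined
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s['a':'c']       # label slice: end label 'c' is included
by_position = s.iloc[0:2]   # positional slice: end position 2 is excluded

print(by_label.tolist())
print(by_position.tolist())
```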
Creating a Series
>>>pd.Series([2, 4, 6])
0    2
1    4
2    6
dtype: int64

>>>pd.Series(5, index=[100, 200, 300])
100    5
200    5
300    5
dtype: int64

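>>> # Older pandas releases sorted the dict keys, as shown here;
>>> # recent releases keep the insertion order (2, 1, 3)
>>> pd.Series({2: 'a', 1: 'b', 3: 'c'})
1    b
2    a
3    c
dtype: object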


>>> pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
3    c
2    a
dtype: object
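When an explicit index is passed together with a dictionary, the index decides which entries are kept and in what order. As a minimal sketch, assuming an extra label `4` that is not a key of the dictionary, the missing entry is filled with NaN:

```python
import pandas as pd

# Label 4 is a hypothetical key that does not exist in the dict
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])

print(s)
print(s.isna().tolist())
```

Key `1` is dropped because it is not in the requested index, and label `4` appears with a missing value.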

DataFrame in Pandas

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
# sort_index() gives the alphabetical row order shown below regardless of pandas version
population = pd.Series(population_dict).sort_index()

area_dict = {'California': 423967,
             'Texas': 695662,
             'New York': 141297,
             'Florida': 170312,
             'Illinois': 149995}
area = pd.Series(area_dict).sort_index()

# In recent pandas the column order follows the dict's key order
states = pd.DataFrame({'area': area, 'population': population})

>>> states
              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193

>>> states.index
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

>>> states.columns
Index(['area', 'population'], dtype='object')

>>>states['area']
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
Creating a DataFrame
  • From a single Series
>>> pd.DataFrame(population, columns=['population'])
            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193

  • From a list of dictionaries (each dictionary can be read as describing one sample)

>>> data = [{'a': i, 'b': 2 * i} for i in range(3)]
>>> pd.DataFrame(data)
   a  b
0  0  0
1  1  2
2  2  4
>>> pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0

  • From multiple Series

>>> pd.DataFrame({'area': area, 'population': population})
              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193

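  • From a two-dimensional NumPy array

>>> import numpy as np
>>> # the values are random, so they differ from run to run
>>> pd.DataFrame(np.random.rand(3, 2),
...              columns=['foo', 'bar'],
...              index=['a', 'b', 'c'])
        foo  bar
a  0.529692  ...
b  0.391235  ...
c  0.440382  ...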

  • From a NumPy structured array

>>> A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
>>> A
array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

>>> pd.DataFrame(A)
   A    B
0  0  0.0
1  0  0.0
2  0  0.0
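The construction methods above all rely on index alignment: when the inputs do not share the same labels, Pandas takes the union of the labels and fills the gaps with NaN. A minimal sketch, assuming two small hypothetical Series `x` and `y` whose indexes only partly overlap:

```python
import pandas as pd

# Hypothetical Series with partly overlapping labels
x = pd.Series({'a': 1, 'b': 2})
y = pd.Series({'b': 20, 'c': 30})

df = pd.DataFrame({'x': x, 'y': y})

print(df)
print(df.isna().sum().to_dict())   # number of missing values per column
```

Only label `'b'` is present in both inputs, so each column ends up with exactly one missing value.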
Index
>>>ind = pd.Index([2, 3, 5, 7, 11])
>>>ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')

>>>ind[::2]
Int64Index([2, 5, 11], dtype='int64')
>>>ind.shape
(5,)
>>>ind.ndim
1
>>>ind.dtype
dtype('int64')

# An Index is an immutable array; it cannot be modified in place
>>>ind[1] = 0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-41-906a9fa1424c> in <module>()
----> 1 ind[1] = 0

~/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   1722 
   1723     def __setitem__(self, key, value):
-> 1724         raise TypeError("Index does not support mutable operations")
   1725 
   1726     def __getitem__(self, key):

TypeError: Index does not support mutable operations
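Because an Index cannot be changed in place, the usual approach is to build a new Index instead. The sketch below catches the TypeError shown above and then derives a modified copy with `Index.delete` and `Index.insert`; the variable names `error_message` and `new_ind` are illustrative.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

error_message = None
try:
    ind[1] = 0                      # in-place assignment is not allowed
except TypeError as exc:
    error_message = str(exc)

# Build a new Index: drop position 1, then insert 0 at position 1
new_ind = ind.delete(1).insert(1, 0)

print(error_message)
print(new_ind.tolist())
```

The original `ind` is left untouched, which is what makes it safe to share one Index between several Series or DataFrame objects.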

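# An Index also behaves like an ordered set
# (older pandas releases accepted &, | and ^ for these set operations;
#  the methods below work across versions)
>>> indA = pd.Index([1, 3, 5, 7, 9])
>>> indB = pd.Index([2, 3, 5, 7, 11])
>>> indA.intersection(indB)
Int64Index([3, 5, 7], dtype='int64')
>>> indA.union(indB)
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
>>> indA.symmetric_difference(indB)
Int64Index([1, 2, 9, 11], dtype='int64')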

Data indexing and selection in Pandas

Data selection in Series

Dictionary-style data selection
>>> import pandas as pd
>>> data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
>>>data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

>>>'a' in data
True

>>>data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
>>>data.items()
<zip at 0x110f11e48>
>>>list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
Adding data to a Series works just like adding to a dictionary
>>>d
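A minimal sketch of dictionary-style assignment, assuming the `data` Series defined above; the new label `'e'` and the values assigned are illustrative, not taken from the original example.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

data['e'] = 1.25   # assigning to a new label appends an entry
data['a'] = 0.0    # assigning to an existing label overwrites its value

print(data)
```

As with a dictionary, the same syntax covers both inserting a new key and updating an existing one.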