Pandas

Latest recommended article published on 2025-06-25 00:03:24


License: CC 4.0 BY-SA

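This article covers the basics of Pandas, including how to create Series and DataFrame objects and how to select and manipulate data. It explains indexing with loc and iloc, as well as data alignment, missing-value handling, joining and merging data, and set operations, making it a useful reference for learning Pandas.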


Pandas Basics

One-dimensional arrays in Pandas: Series

import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])

>>> data
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

>>> data.values  # returns the underlying NumPy array
array([0.25, 0.5 , 0.75, 1.  ])

>>> data.index
RangeIndex(start=0, stop=4, step=1)

Index

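The core difference between a Series and a one-dimensional NumPy array is the index: a Series carries explicit labels, whereas a NumPy array only has implicit integer positions.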

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
>>>data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

>>>data['b']
0.5
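To make the difference concrete, here is a minimal sketch comparing positional access on a NumPy array with label-based access on a Series. The variable names (`values`, `labeled`, `sparse`) and the non-contiguous integer index are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

values = np.array([0.25, 0.5, 0.75, 1.0])
labeled = pd.Series(values, index=['a', 'b', 'c', 'd'])

# A NumPy array can only be addressed by integer position.
second_by_position = values[1]

# A Series can be addressed by its explicit label...
second_by_label = labeled['b']

# ...and still by position, through iloc.
second_by_iloc = labeled.iloc[1]

# Labels do not have to be contiguous or sorted (hypothetical index).
sparse = pd.Series(values, index=[2, 5, 3, 7])
value_at_label_5 = sparse[5]  # looks up the label 5, not position 5

print(second_by_position, second_by_label, second_by_iloc, value_at_label_5)
```

All four lookups return the same element, which shows that the label is just another way of naming a position.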

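Dictionaries and Series

Because of its index, a Series can be viewed as a dictionary that maps labels to values.

>>> population_dict = {'California': 38332521,
...                    'Texas': 26448193,
...                    'New York': 19651127,
...                    'Florida': 19552860,
...                    'Illinois': 12882135}
>>> population = pd.Series(population_dict)

The output below was produced by an older pandas release, which sorted the dict keys. Recent releases keep the dict's insertion order, so the row order (and the rows returned by the slice) can differ.

>>> population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

Unlike a plain dictionary, a Series also supports slicing by label:

>>> population['California':'Illinois']
California    38332521
Florida       19552860
Illinois      12882135
dtype: int64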
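The dictionary analogy and the label slice can be sketched as follows. The explicit `sort_index()` call is an assumption added here so that the row order is alphabetical regardless of the pandas version; it is not in the original code.

```python
import pandas as pd

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

# sort_index() is added for this sketch so the order is alphabetical.
population = pd.Series(population_dict).sort_index()

# Dictionary-style lookup by key.
california = population['California']

# Label-based slices include BOTH endpoints.
by_label = population['California':'Illinois']

# Position-based slices exclude the stop position, as in plain Python.
by_position = population.iloc[0:2]

print(by_label.index.tolist())
print(by_position.index.tolist())
```

The inclusive endpoint is the main thing to remember: a label slice ending at 'Illinois' contains 'Illinois', while `iloc[0:2]` stops before position 2.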

Creating a Series

>>>pd.Series([2, 4, 6])
0    2
1    4
2    6
dtype: int64

>>>pd.Series(5, index=[100, 200, 300])
100    5
200    5
300    5
dtype: int64

>>> pd.Series({2: 'a', 1: 'b', 3: 'c'})
1    b
2    a
3    c
dtype: object

>>> pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
3    c
2    a
dtype: object
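The constructor forms above can be summarized in one short sketch. The label `9`, which is absent from the dictionary, is a hypothetical addition to show what happens when the requested index contains a key the data does not have.

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# A scalar is repeated for every label in the index.
fives = pd.Series(5, index=[100, 200, 300])

# With an explicit index, only the requested keys are kept, in that order.
# A requested label that is missing from the dict (here 9) becomes NaN.
picked = pd.Series(letters, index=[3, 2, 9])

print(fives.tolist())
print(picked.index.tolist())
print(picked.isna().tolist())
```

So the `index` argument acts as both a filter and an ordering for dictionary data.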

DataFrame in Pandas

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

area_dict = {'California': 423967,
             'Texas': 695662,
             'New York': 141297,
             'Florida': 170312,
             'Illinois': 149995}
area = pd.Series(area_dict)

states = pd.DataFrame({'population': population, 'area': area})

>>> states
              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193

>>> states.index
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

>>> states.columns
Index(['area', 'population'], dtype='object')

>>>states['area']
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
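A DataFrame built from several Series matches rows by label rather than by position. The sketch below illustrates this with deliberately mismatched inputs; the reduced three-state data and the missing 'New York' entry in `area` are assumptions made for the illustration.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127})

# Different key order, and no entry for New York (hypothetical gap).
area = pd.Series({'Texas': 695662, 'California': 423967})

states = pd.DataFrame({'population': population, 'area': area})

# Rows are aligned on the index labels, so Texas gets the Texas area.
texas_area = states.loc['Texas', 'area']

# A label missing from one Series is filled with NaN.
new_york_area_missing = pd.isna(states.loc['New York', 'area'])

# Selecting one column returns a Series that shares the DataFrame's index.
area_column = states['area']

print(texas_area, new_york_area_missing, type(area_column).__name__)
```

Because of the NaN, the `area` column is stored as floating point in this sketch, which is why `texas_area` prints with a decimal point.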

Creating a DataFrame

- From a single Series

>>> pd.DataFrame(population, columns=['population'])
            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193

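- From a list of dicts (each dict can be read as describing one sample)

>>> data = [{'a': i, 'b': 2 * i} for i in range(3)]
>>> pd.DataFrame(data)
   a  b
0  0  0
1  1  2
2  2  4

Keys that are missing from some of the dicts are filled with NaN:

>>> pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0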

- From multiple Series

>>> pd.DataFrame({'population': population, 'area': area})
              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193

- From a two-dimensional NumPy array

import numpy as np
>>> pd.DataFrame(np.random.rand(3, 2),
...              columns=['foo', 'bar'],
...              index=['a', 'b', 'c'])
        foo  bar
a  0.529692  ...
b  0.391235  ...
c  0.440382  ...

- From a NumPy structured array

>>> A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])

>>> A
array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

>>> pd.DataFrame(A)
   A    B
0  0  0.0
1  0  0.0
2  0  0.0
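The construction routes listed above are interchangeable when they describe the same data. The following sketch builds one small table three ways and compares the results; the variable names and the integer field types chosen for the structured array are assumptions for the illustration.

```python
import numpy as np
import pandas as pd

# 1. From a list of dicts: one dict per row.
from_dicts = pd.DataFrame([{'a': i, 'b': 2 * i} for i in range(3)])

# 2. From a two-dimensional NumPy array plus column names.
from_array = pd.DataFrame(np.array([[0, 0], [1, 2], [2, 4]]),
                          columns=['a', 'b'])

# 3. From a NumPy structured array: field names become column names.
records = np.zeros(3, dtype=[('a', 'i8'), ('b', 'i8')])
records['a'] = [0, 1, 2]
records['b'] = [0, 2, 4]
from_records = pd.DataFrame(records)

# Compare plain Python values so integer width differences do not matter.
expected = [[0, 0], [1, 2], [2, 4]]
all_match = all(frame.values.tolist() == expected
                for frame in (from_dicts, from_array, from_records))
print(all_match)
```

Which route is most convenient usually depends on where the data comes from: row-oriented records suit the list of dicts, while numeric arrays suit the NumPy routes.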

Index

>>>ind = pd.Index([2, 3, 5, 7, 11])
>>>ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')

>>>ind[::2]
Int64Index([2, 5, 11], dtype='int64')
>>>ind.shape
(5,)
>>>ind.ndim
1
>>>ind.dtype
dtype('int64')

# An Index is an immutable array: its elements cannot be modified in place
>>>ind[1] = 0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-41-906a9fa1424c> in <module>()
----> 1 ind[1] = 0

~/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   1722 
   1723     def __setitem__(self, key, value):
-> 1724         raise TypeError("Index does not support mutable operations")
   1725 
   1726     def __getitem__(self, key):

TypeError: Index does not support mutable operations
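A short sketch of what immutability means in practice: item assignment fails, while methods such as `delete` and `insert` return a new Index and leave the original untouched. Combining those two methods to "replace" an element is one possible approach chosen for this illustration, not the only one.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# In-place item assignment is rejected with a TypeError.
raised = False
try:
    ind[1] = 0
except TypeError:
    raised = True

# To "change" an entry, build a new Index instead.
updated = ind.delete(1).insert(1, 0)

print(raised)
print(ind.tolist())      # the original is unchanged
print(updated.tolist())
```

Immutability makes it safe to share one Index between several Series and DataFrames without side effects.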

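# An Index also behaves like an ordered set
>>> indA = pd.Index([1, 3, 5, 7, 9])
>>> indB = pd.Index([2, 3, 5, 7, 11])
# Older pandas also accepted indA & indB, indA | indB and indA ^ indB here;
# pandas 2.0 turned those operators into element-wise logical operations,
# so the explicit set methods are the portable spelling.
>>> indA.intersection(indB)
Int64Index([3, 5, 7], dtype='int64')
>>> indA.union(indB)
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
>>> indA.symmetric_difference(indB)
Int64Index([1, 2, 9, 11], dtype='int64')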
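One place where set operations on an Index are useful is comparing the labels of two Series before combining them. The sketch below uses the set methods of `Index`; the two small Series and their names are hypothetical.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels present in both, in either, or only in the left Series.
shared = left.index.intersection(right.index)
all_labels = left.index.union(right.index)
only_left = left.index.difference(right.index)

# Adding directly aligns on labels and leaves NaN where one side is missing.
naive_sum = left + right
missing_count = int(naive_sum.isna().sum())

# Restricting both operands to the shared labels avoids the NaN entries.
total = left[shared] + right[shared]

print(sorted(shared), sorted(all_labels), sorted(only_left))
print(missing_count, total.to_dict())
```

Here `a` and `d` each appear in only one Series, so the naive sum has two missing values, while the restricted sum has none.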

Data Indexing and Selection in Pandas

Data selection in Series

Dictionary-style data selection

import pandas as pd
>>>data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
>>>data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

>>>'a' in data
True

>>>data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
>>>data.items()
<zip at 0x110f11e48>
>>>list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
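The dictionary-style interface can be sketched a little further. Two points are easy to miss: the `in` operator tests labels rather than values, and assigning to a new label extends the Series just as assigning to a new key extends a dict. The new label `'e'` and its value are illustrative assumptions.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Membership is tested against the index labels, like dict keys...
has_label = 'a' in data

# ...so a value is not "in" the Series in this sense.
has_value = 0.25 in data

# items() yields (label, value) pairs, so it converts cleanly to a dict.
as_dict = dict(data.items())

# Assigning to a new label adds an entry.
data['e'] = 1.25

print(has_label, has_value)
print(as_dict)
print(data.index.tolist())
```

To test for a value instead, one option is `0.25 in data.values`.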