Traceback (most recent call last):
  File "E:\QtToos\scr\core\unified_data_processing.py", line 205, in <module>
    main()
  File "E:\QtToos\scr\core\unified_data_processing.py", line 188, in main
    table = pa.Table.from_pandas(final_df)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow\\table.pxi", line 4795, in pyarrow.lib.Table.from_pandas
  File "D:\Python\pyPath\QtToos\Lib\site-packages\pyarrow\pandas_compat.py", line 650, in dataframe_to_arrays
    arrays[i] = maybe_fut.result()
                ^^^^^^^^^^^^^^^^^^
  File "D:\Python\Python311\Lib\concurrent\futures\_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "D:\Python\Python311\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "D:\Python\Python311\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Python\pyPath\QtToos\Lib\site-packages\pyarrow\pandas_compat.py", line 625, in convert_column
    raise e
  File "D:\Python\pyPath\QtToos\Lib\site-packages\pyarrow\pandas_compat.py", line 619, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow\\array.pxi", line 365, in pyarrow.lib.array
  File "pyarrow\\array.pxi", line 91, in pyarrow.lib._ndarray_to_array
  File "pyarrow\\error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'float' object", 'Conversion failed for column bss_cssr with type object')
Time: 2025-08-21 09:54:11
The error you are hitting is:
```
pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'float' object", 'Conversion failed for column bss_cssr with type object')
```
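This error is raised by `pyarrow.Table.from_pandas(df)`. It means that while converting the `pandas.DataFrame` to a `pyarrow.Table`, PyArrow found **a column whose values do not share one type**. Specifically:
- the column has dtype `object`;
- but it contains `float` values;
- or it mixes strings and floats (e.g. `"95.2%"`, `95.2`, `NaN`).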
---
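## 🔍 Problem analysis
### 1. **Inconsistent column types**
PyArrow expects every value in a column to have the same type. For an `object` column it infers the type from the values; once it has settled on string, a real `float` such as `95.2` in the same column makes the conversion fail. (`NaN` on its own is converted to null and does not trigger this error; it is the non-null floats that do.)
For example: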
```python
df['bss_cssr'] = ['95.2%', 95.2, np.nan, '94.3%']  # mixed str/float values: from_pandas raises ArrowTypeError
```
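Before changing anything, it can help to see which Python types each `object` column actually holds. The sketch below is one way to do that; the helper name `python_types_per_object_column` and the sample data are made up for illustration and are not part of your script.

```python
import numpy as np
import pandas as pd


def python_types_per_object_column(df: pd.DataFrame) -> dict:
    """Map each object-dtype column to a count of the Python types it holds."""
    report = {}
    for col in df.select_dtypes(include="object").columns:
        # type(v).__name__ is 'str' for text and 'float' for 95.2 and for NaN
        counts = df[col].map(lambda v: type(v).__name__).value_counts()
        report[col] = counts.to_dict()
    return report


# Hypothetical sample that mimics the failing column
sample = pd.DataFrame({
    "bss_cssr": ["95.2%", 95.2, np.nan, "94.3%"],
    "region": ["A1", "A1", "B2", "B2"],
})
report = python_types_per_object_column(sample)
print(report)
```

A column that reports more than one type (here `str` and `float` for `bss_cssr`) is a candidate for the fixes described in this answer.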
---
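## ✅ Solutions
### ✅ Method 1: Normalize the column to a single type (string or float)
#### ✅ 1. Cast the column to string (recommended for data that contains percent signs or other text)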
```python
df['bss_cssr'] = df['bss_cssr'].astype(str)
```
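One thing to be aware of: `.astype(str)` also turns missing values into the literal text `"nan"`. If you would rather keep them as missing (so they become nulls in Arrow), a possible variant is to stringify only the non-null entries. This is a minimal sketch on made-up data, not the only way to do it.

```python
import numpy as np
import pandas as pd

s = pd.Series(["95.2%", 95.2, np.nan, "94.3%"], dtype=object)

# Plain cast: every value becomes a str, including NaN -> "nan"
naive = s.astype(str)

# Keep missing values missing: where() keeps the original value where the
# condition is True (the NaN rows) and takes the string form everywhere else
safe = s.where(s.isna(), s.astype(str))

print(naive.tolist())
print(safe.tolist())
```

Which behavior you want depends on whether downstream code should see a missing value or the text `"nan"`.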
#### ✅ 2. Or convert the column to float (suitable for purely numeric data)
```python
df['bss_cssr'] = pd.to_numeric(df['bss_cssr'], errors='coerce').fillna(0.0)
```
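> `errors='coerce'` turns values that cannot be parsed into `NaN`, and `fillna` then replaces them with 0.0. Note that a string such as `"95.2%"` is not parseable as-is, so it would also end up as 0.0 unless the percent sign is stripped first.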
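If the column really does contain values like `"95.2%"`, stripping the percent sign before calling `pd.to_numeric` avoids silently losing them. The helper below is a sketch under two assumptions: that `"95.2%"` and a bare `95.2` are on the same scale, and that unparseable values should stay `NaN` rather than become 0.0. The name `percent_to_float` is hypothetical.

```python
import numpy as np
import pandas as pd


def percent_to_float(col: pd.Series) -> pd.Series:
    """Parse values such as '95.2%' or 95.2 into floats; anything else becomes NaN."""
    # Work on the string form so text and numbers go through the same path
    text = col.astype(str).str.strip().str.rstrip("%")
    return pd.to_numeric(text, errors="coerce")


raw = pd.Series(["95.2%", 95.2, np.nan, "n/a"], dtype=object)
parsed = percent_to_float(raw)
print(parsed)
```

Leaving the gaps as `NaN` keeps a missing success rate distinguishable from a real 0.0; add `.fillna(...)` afterwards only if a default value makes sense for your data.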
---
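### ✅ Method 2: Pass an explicit schema to `from_pandas`
If you know each column's type, you can define the schema manually. A schema does not clean the data: every value must already be convertible to the declared type, so a column that still contains `"95.2%"` will fail against `pa.float64()`.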
```python
import pyarrow as pa
schema = pa.schema([
    ('bss_cssr', pa.float64()),
    ('date', pa.date32()),
    ('rat', pa.string()),
    ('region', pa.string()),
    ('year', pa.int16()),
    ('month', pa.int8()),
    # other fields...
])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
```
---
## ✅ Updating the logic in your `main()` function
### 🔧 After `final_df = pd.concat(...)`, add a type-normalization step:
```python
# Cast every object-dtype column to string
for col in final_df.select_dtypes(include='object').columns:
    final_df[col] = final_df[col].astype(str)
```
Or, for a specific column only:
```python
final_df['bss_cssr'] = pd.to_numeric(final_df['bss_cssr'], errors='coerce').fillna(0.0)
```
---
## ✅ Final recommended snippet
```python
import os
import re

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def main(folder_path, output_path='output.parquet'):
    # process_csv(file_path, rat, region) is your existing helper, defined elsewhere.
    processed_dfs = []
    for foldername, subfolders, filenames in os.walk(folder_path):
        for filename in filenames:
            file_path = os.path.join(foldername, filename)
            print(f"Processing file: {file_path}")
            try:
                match = re.search(r".*?(\D\d).*?\((\d\w)\)\.[cvxls]", filename)
                if not match:
                    print(f"{file_path} not recognized")
                    continue
                region, rat = match.groups()
            except Exception as e:
                print(f"{file_path} regex parsing failed: {e}")
                continue
            df = process_csv(file_path, rat, region)
            if isinstance(df, pd.DataFrame):
                processed_dfs.append(df)
            else:
                print(df)  # not a DataFrame, so print whatever process_csv returned
    if not processed_dfs:
        print("No data to process")
        return
    final_df = pd.concat(processed_dfs, ignore_index=True)
    # Fix the column-type problem: cast every object-dtype column to string
    for col in final_df.select_dtypes(include='object').columns:
        final_df[col] = final_df[col].astype(str)
    # Or handle only the specific column:
    # final_df['bss_cssr'] = pd.to_numeric(final_df['bss_cssr'], errors='coerce').fillna(0.0)
    # Convert to a PyArrow Table
    table = pa.Table.from_pandas(final_df)
    # Write partitioned Parquet. pq.write_table() has no partition_cols parameter;
    # pq.write_to_dataset() does, and it treats output_path as a directory.
    pq.write_to_dataset(
        table,
        root_path=output_path,
        partition_cols=['rat', 'region', 'year', 'month']
    )
    print(f"Parquet dataset written to: {output_path}")
```
---
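## ✅ Summary
| Cause | Solution |
|----------|----------|
| Column contains mixed types (e.g. strings and floats) | Use `.astype(str)` or `pd.to_numeric(..., errors='coerce')` |
| `from_pandas()` called without a schema | Define a schema manually or normalize the column types |
| Data contains `NaN` or invalid characters | Clean the data with `fillna()` or `replace()` |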
---