Analyzing the Winners List of Beijing's Supplementary New-Energy Vehicle Quota Lottery with Python

Article link: https://blog.csdn.net/formaever/article/details/108538274

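This post uses Python to analyze the data on the additional new-energy vehicle quotas that Beijing issued to car-free households in August, processing the PDF file with tabula and pandas to show what the selected households look like.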

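In August, Beijing issued an additional 20,000 new-energy vehicle quotas to car-free households. I applied as well and, as expected, was not selected. So today I downloaded the list of winners and ran some simple statistics on it in Python to see what kinds of households were selected.

The list comes as a PDF file.

The file consists entirely of tables, so the first step is to extract them from the PDF. After a quick search online, I found that the tabula module handles this task well.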

pip install tabula-py

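This module depends on pandas, numpy, and Java. pip normally pulls in pandas and numpy along with tabula-py, but Java is not a Python package and has to be installed separately, so make sure all three are in place first.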
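Since a missing Java installation is the one dependency pip cannot fix, it may help to check for it before going further. The following is a minimal sketch, not part of the original workflow; the helper name `find_java` is made up for illustration, and it only checks whether a `java` executable is on the PATH.

```python
import shutil


def find_java():
    """Return the full path of the `java` executable on PATH, or None."""
    # find_java is a hypothetical helper, not part of tabula-py.
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("No java executable found on PATH; install a Java runtime first.")
else:
    print("Found java at:", java_path)
```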

Once everything is installed, I first extract the table data from the PDF and save it as a CSV file, which is very easy to do with tabula.

import tabula

tabula.convert_into('./20200911.pdf', './20200911.csv', output_format="csv", pages='all')

When this finishes, we have the winners list as a CSV file, which can then be processed with pandas.
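Tables that span many PDF pages sometimes repeat their header row on every page, and those rows can end up in the CSV as data. Whether this particular file has that problem is not stated here, so the following is only a sketch under that assumption. It uses a tiny in-memory sample instead of the real file, and `drop_repeated_headers` is a made-up helper name; only the two column names are taken from the actual list.

```python
import io

import pandas as pd


def drop_repeated_headers(df):
    """Drop rows whose cells merely repeat the column names."""
    header = [str(c) for c in df.columns]
    is_header = df.astype(str).apply(lambda row: list(row) == header, axis=1)
    return df[~is_header].reset_index(drop=True)


# Invented sample data: the second data row is a repeated header.
sample_csv = "家庭人数,家庭总积分\n3,120\n家庭人数,家庭总积分\n4,96\n"

# Read everything as text first so a stray header row cannot break parsing.
raw = pd.read_csv(io.StringIO(sample_csv), dtype=str)
clean = drop_repeated_headers(raw)

# Convert the numeric columns back to numbers after cleaning.
for col in ["家庭人数", "家庭总积分"]:
    clean[col] = pd.to_numeric(clean[col])

print(clean)
```

If the real CSV has no repeated header rows, this step simply leaves the data unchanged.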

# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the CSV that tabula produced from the PDF.
df = pd.read_csv('./20200911.csv', encoding='gb18030')
# print(df.head())
# print(df.tail())
column_headers = list(df.columns.values)
print(column_headers)

# Count how often each household size, generation count, and total score occurs.
family_number = df['家庭人数'].value_counts()
family_generate = df['家庭代际数'].value_counts()
family_scores = df['家庭总积分'].value_counts().sort_index()
id_string = df['主申请人证件号码']

separator = '***************************************'
print(family_number)
print(separator)
print(family_generate)
print(separator)
print(family_scores)
print(separator)

# Sum the counts per score band. Comparing the index directly also
# handles non-integer scores, which `x in range(...)` would skip.
scores = family_scores.index
print('Number of family scores [0-100): ', family_scores[(scores >= 0) & (scores < 100)].sum())
print('Number of family scores [100-200): ', family_scores[(scores >= 100) & (scores < 200)].sum())
print('Number of family scores over 200: ', family_scores[scores >= 200].sum())
print(separator)

# ID numbers that start with 110 were issued in Beijing.
# astype(str) keeps a missing value from raising an error.
is_beijing_id = id_string.astype(str).str.startswith('110')
print('Number of Beijing Id: ', int(is_beijing_id.sum()))
print(separator)
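As an aside, the three score bands can also be computed in one step with `pandas.cut`, which assigns each score to a labeled interval. This is a sketch on invented scores rather than the real list, meant only to show the binning technique; the variable names and labels are chosen for illustration.

```python
import pandas as pd

# Invented scores for illustration; the real values come from df['家庭总积分'].
scores = pd.Series([36, 99, 100, 150, 199, 200, 260], name='家庭总积分')

# right=False makes each band closed on the left and open on the right,
# matching the [0, 100) / [100, 200) / 200-and-over split used in the script.
bins = [0, 100, 200, float('inf')]
labels = ['[0, 100)', '[100, 200)', '200 and over']
bands = pd.cut(scores, bins=bins, labels=labels, right=False)

band_counts = bands.value_counts().sort_index()
print(band_counts)
```

With `right=False`, a score of exactly 100 falls into the second band and a score of exactly 200 into the third, which is the same boundary behavior as the comparisons in the script.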

Output of the analysis script: