Quickly Scaffolding a Machine Learning Project Structure with cookiecutter

Article link: https://blog.csdn.net/Bit_Coders/article/details/113617650

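This article explains how to use Cookiecutter to create a standardized project structure that improves the reproducibility and engineering efficiency of data analysis and machine learning projects. Using a concrete template as an example, it describes the purpose of each folder and file.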


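Contents

Why adopt a standardized project structure?
Generating a better project structure with cookiecutter
- Quick start
Cookiecutter templates for data science and machine learning
References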

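Why adopt a standardized project structure?

We often run into this kind of problem: when trying to reproduce a data analysis experiment from a few months or a few years ago, we find our own old code bewildering and do not know where to start. How should the data be loaded and processed? Which files are intermediate results? And so on.
A good project structure should make it easier to return to earlier work. It has the following advantages:

Separates code, data, and other artifacts, and follows a standardized processing workflow
Adopts engineering best practices, using tools such as version control and Docker
Improves the reproducibility of results in machine learning and data analysis projects
Provides a well-organized directory and file template for machine learning projects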

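Generating a better project structure with cookiecutter

Cookiecutter creates projects from existing project templates, for example a Python package project. It gives you a standardized file structure, which makes it easier to build the project and to share the analysis.

Official documentation: https://cookiecutter.readthedocs.io/en/latest/index.html

Quick start

1. Install cookiecutter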

pip install cookiecutter

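2. Start a new project from a project template

Enter the following command at the command line to use a template for data science work in Python. It creates a logical, reasonably standardized, yet flexible project structure.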

> cookiecutter https://github.com/drivendata/cookiecutter-data-science

If GitHub is not reachable from your network, you can use a similar template hosted on Gitee:
> cookiecutter https://gitee.com/whuhenry/cookiecutter-data-science-poetry.git

Then enter the project information when prompted, and the project is created automatically under the current path.
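
The same generation step can also be scripted instead of answered interactively. The following is a minimal sketch, assuming the cookiecutter package is installed. `generate_project` is a hypothetical helper name, and the keys passed in `context` are assumptions that must match the variables the chosen template defines in its cookiecutter.json.

```python
from pathlib import Path


def generate_project(template: str, output_dir: str, context: dict) -> Path:
    """Render a cookiecutter template without interactive prompts."""
    # Imported lazily so this module can be loaded without cookiecutter installed.
    from cookiecutter.main import cookiecutter

    # no_input=True skips the prompts; extra_context overrides template defaults.
    project_dir = cookiecutter(
        template,
        no_input=True,
        extra_context=context,
        output_dir=output_dir,
    )
    return Path(project_dir)
```

Calling `generate_project(template, ".", {"project_name": "my-analysis"})`, with `template` set to the template address used in the command above, would create the project under the current directory. `project_name` is an assumed variable name; check the template's cookiecutter.json for the real ones.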

The command above generates a folder in the current directory with the following structure:

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e .`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
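
After generating a project, a quick check that the expected folders exist can catch a failed or partial run. The following is a minimal sketch using only the standard library. `EXPECTED_DIRS` is copied from the tree above, and `missing_dirs` is a hypothetical helper name, not part of cookiecutter.

```python
from pathlib import Path
from typing import List

# Directories taken from the tree shown above.
EXPECTED_DIRS = [
    "data/external", "data/interim", "data/processed", "data/raw",
    "docs", "models", "notebooks", "references", "reports/figures",
    "src/data", "src/features", "src/models", "src/visualization",
]


def missing_dirs(project_root: str) -> List[str]:
    """Return the expected directories that are absent under project_root."""
    root = Path(project_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

An empty list means every directory from the template is present. Other templates, or other versions of this one, may use a different layout, so adjust `EXPECTED_DIRS` accordingly.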

On Windows, running tree /f >list.txt at the command line writes the printed directory tree to the file list.txt.
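
Because `tree /f` is specific to the Windows command prompt, a short Python script can serve as a cross-platform alternative. The sketch below is a stand-in and does not reproduce `tree`'s exact output format. `render_tree` and `write_tree` are hypothetical helper names.

```python
from pathlib import Path
from typing import List


def render_tree(root: Path, prefix: str = "") -> List[str]:
    """Return the lines of a tree-style listing for everything under root."""
    # Directories first, then files, each group sorted by name.
    entries = sorted(root.iterdir(), key=lambda p: (p.is_file(), p.name))
    lines = []
    for index, entry in enumerate(entries):
        is_last = index == len(entries) - 1
        lines.append(prefix + ("└── " if is_last else "├── ") + entry.name)
        if entry.is_dir():
            # Children of the last entry are indented with spaces, others with a bar.
            child_prefix = prefix + ("    " if is_last else "│   ")
            lines.extend(render_tree(entry, child_prefix))
    return lines


def write_tree(root: str, out_file: str) -> None:
    """Write the listing for root to out_file, one entry per line."""
    text = "\n".join(render_tree(Path(root))) + "\n"
    Path(out_file).write_text(text, encoding="utf-8")
```

Calling `write_tree(".", "list.txt")` from the project root produces a file comparable to the one created by the command above.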

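3. Common files and directories

As a rule, anyone should be able to reproduce the results using only the code in src and the data in data/raw.
data directory: an analysis often involves long-running steps to preprocess data or train models. Once these steps have run, you can store their output in a location such as the data/interim directory instead of waiting for them to rerun every time.
notebooks directory: holds notebooks produced by tools such as Jupyter, which are very effective for exploring and communicating a data analysis. Notebook file names can follow the pattern <step>-<ghuser>-<description>.ipynb (e.g., 0.3-bull-visualize-distributions.ipynb). Note: do not write code that performs the same task in multiple notebooks; refactor it into reusable components instead. For example, a data preprocessing task belongs in the pipeline at src/data/make_dataset.py, and notebooks then load the resulting data from data/interim. Any other utility code should be refactored into src.
Makefile: uses make to manage steps that depend on each other, building a program's executables and other non-source files from its source files. See the basic Makefile documentation.
.env: stores secrets and configuration variables. This file is generally not committed to the version control repository.
.gitignore: tells git which files should not be committed to the version control repository, such as files containing local private data or files that take up a lot of storage.
requirements.txt: records all current dependencies; generate it with the pip freeze > requirements.txt command.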
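
The data/interim advice above amounts to caching the output of slow steps. The following is a minimal sketch with the standard library, assuming the intermediate result is JSON-serializable. `load_or_build` and the names in the usage note are hypothetical and are not part of the template.

```python
import json
from pathlib import Path
from typing import Any, Callable


def load_or_build(cache_path: Path, build: Callable[[], Any]) -> Any:
    """Load a cached intermediate result, or build and store it on first use."""
    if cache_path.exists():
        # A previous run already produced this result, so reuse it.
        return json.loads(cache_path.read_text(encoding="utf-8"))
    result = build()
    # Create data/interim (or whichever folder is used) if it does not exist yet.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

For example, `load_or_build(Path("data/interim/cleaned.json"), clean_raw_data)` would run a hypothetical `clean_raw_data` function only when the cache file is missing. Delete the file to force a rebuild after the raw data or the cleaning code changes.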

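Cookiecutter templates for data science and machine learning

Cookiecutter Docker Science: generates an initial directory layout, built around Docker containers, that suits simple machine learning tasks.
cookiecutter-reproducible-science: a cookiecutter template for starting a reproducible and transparent science project, including data, models, analysis, and reports (for example, your scientific paper).
CookieCutter Pip-Project: generates a project that can be installed with pip install.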

References

http://drivendata.github.io/cookiecutter-data-science/
https://www.cnblogs.com/taceywong/p/10506032.html