Sebastian Nozzi

Posted on May 2

Codebase Onboarding Documentation

#documentation #literateprogramming #onboarding #codebase

I've often faced the problem of getting familiar with big, complex, existing codebases.

The size and complexity seem insurmountable. It is not clear how things work or how they relate to each other, and one does not know where to start learning and getting clarity.

Some projects have at best a README file, and maybe one or two additional files. Others have a Wiki, but it often covers conceptual aspects and therefore doesn't help much in understanding the code itself.

More often than not, code is poorly documented, if documented at all. Even if individual methods or classes are documented, it does not always help in understanding the "big picture" and knowing how everything fits together.

This situation, together with the accompanying frustration, led me to look for solutions, one of them very extreme.

Enter ...

Literate Programming

What if we were doing the whole thing wrong?

When I initially read about literate programming I was hooked. I thought: this is it!

For those unfamiliar with it, the idea of literate programming is that you don't start with code, which you then "document".

You start with a "document" through which you explain your program, one code-snippet at a time. That document takes the form of a complex book, with pages and chapters. The code-snippets, which initially live inside the document, are themselves inter-related.

Then, you run a typical "literate programming" tool, which does two things:

It produces a beautiful, pleasant to read document (often as a PDF or a webpage, together with images, fonts, table of contents, links and so on).
It assembles your final codebase from the inter-related snippets. After that, you are free to compile your code to produce your executable, or whatever deliverable your code produces.
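
The second step (often called "tangling") can be illustrated with a small Python sketch. The snippet format used here is made up for illustration: a snippet starts with `<<name>>=`, ends with `@`, and may reference other snippets as `<<name>>`. Real tools have their own syntax and many more features, so treat this only as a rough picture of the idea.

```python
import re

# Hypothetical mini-format: "<<name>>=" opens a snippet, "@" closes it,
# and a line consisting of "<<name>>" references another snippet.
DOCUMENT = """\
We start with the overall shape of the program.

<<main>>=
<<define greeting>>
print(greet("world"))
@

Only now do we explain how the greeting is built.

<<define greeting>>=
def greet(name):
    return "Hello, " + name + "!"
@
"""

def collect_snippets(text):
    """Return a dict mapping snippet names to their lines of code."""
    snippets = {}
    current = None
    for line in text.splitlines():
        header = re.fullmatch(r"<<(.+)>>=", line)
        if header:
            current = header.group(1)
            snippets.setdefault(current, [])
        elif line == "@":
            current = None
        elif current is not None:
            snippets[current].append(line)
    return snippets

def tangle(snippets, name, indent=""):
    """Expand a snippet recursively, replacing references with their code."""
    out = []
    for line in snippets[name]:
        ref = re.fullmatch(r"(\s*)<<(.+)>>", line)
        if ref:
            # Keep the indentation of the referencing line.
            out.extend(tangle(snippets, ref.group(2), indent + ref.group(1)))
        else:
            out.append(indent + line)
    return out

program = "\n".join(tangle(collect_snippets(DOCUMENT), "main"))
print(program)
```

Note that the document introduces `main` before `define greeting`, while the assembled program puts the function definition first. The order of explanation and the order of the code are decoupled. The sketch does not guard against snippets that reference each other in a cycle.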

The strength of this approach is that you, as the "book author" can decide what to explain when. Our programming languages and frameworks often dictate how files and folders are to be structured in a codebase. But if you start with a document, you are free to introduce "files" and code-snippets in any order you want.

I could say more about this approach, but I will leave it at that for now.

Suffice it to say that, while the approach seems to have many advantages and to be a good fit for my "onboarding problem", it does impose big sacrifices:

One can kiss familiar tool integration (IDEs) goodbye
Refactoring becomes much harder
Code navigation becomes much harder
Maintenance becomes harder / slower (maybe a good thing?)

Realistically, I could not imagine this being the ultimate solution. I must admit that I never seriously tried it on a big project, and maybe I should. But the cost-benefit outcome did not seem quite right to me.

Nevertheless, the spirit of the approach seemed good, and it contained good ideas and noble goals.

What if there was a hybrid approach, or something inspired by it ...?

I think I found an approach that might be just that ...

What about Wikis?

Earlier I mentioned that some projects do contain Wikis.

The nice thing about Wikis is that they are easy to use. Documents are "cheap". Creating new pages is easy. Editing existing pages is easy. Especially if one is using Markdown.

But I didn't like the fact that most Wikis "live" separate from the codebase.

What I wanted is to navigate a codebase, and have a series of documents, IN the codebase, telling me about important aspects of the code I was dealing with.

Local "Wikis"?

At some point I became aware of a "trick" (is it really?) that some open-source projects use: interlinking of Markdown documents.

Typical examples would be a "README.md" file that links to a "LICENCE.txt" file, or to "Contributing.md" or "Installation.md" files, and so on ...

The linking per se is only part of the equation, however. The viewing environment is what has to honor the links and make the navigation possible. Fortunately, both GitHub (the website) and ~~IDEs~~ code editors like VSCode support this feature.

Armed with this knowledge I had this idea: one could strategically put Markdown files in the appropriate places and have them linked. One could navigate them from within a code-editor (or repo viewer like GitHub) and they would provide context-sensitive meta-information.

Think for example:

Documenting the deployment process
Documenting installation
Documenting architecture
Documenting packages / modules / subprojects
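
Interlinked files are only useful while the links resolve, so a small check can help once such documents are spread across a repository. The following Python sketch is an illustration under assumptions, not part of the approach described here: it only handles inline links of the form `[text](target)`, and all file names in the demo are made up.

```python
import re
import tempfile
from pathlib import Path

# Matches inline Markdown links such as [Installation](docs/Installation.md).
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")

def broken_links(root):
    """Yield (markdown file, target) pairs whose relative target does not exist."""
    root = Path(root)
    for md in sorted(root.rglob("*.md")):
        for target in LINK.findall(md.read_text(encoding="utf-8")):
            if re.match(r"[a-z][a-z0-9+.-]*:", target, re.I) or target.startswith("#"):
                continue  # skip external URLs and in-page anchors
            path = target.split("#", 1)[0]  # drop a trailing "#section" part
            if not (md.parent / path).exists():
                yield md.relative_to(root).as_posix(), target

# Demo on a throwaway repository layout (all file names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "deploy").mkdir()
    (repo / "README.md").write_text(
        "See [deployment](deploy/README.md) and [architecture](docs/Architecture.md).\n",
        encoding="utf-8",
    )
    (repo / "deploy" / "README.md").write_text(
        "Back to [overview](../README.md).\n", encoding="utf-8"
    )
    result = list(broken_links(repo))

print(result)
```

In the demo, the link to `deploy/README.md` and the link back to `../README.md` both resolve, while `docs/Architecture.md` is reported because that file was never created.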

I think it's a worthy idea. And for some projects I would use it as is. But I was still missing something from the "literate programming" approach.

I was missing "beauty".

In search of Beauty Lost

Let's go back in time to my experiments with "literate programming".

Remember that I previously mentioned one of the outputs, apart from the resulting codebase, was "beautiful content"?

But what do I mean by "beautiful"? What I mean is "more than just plain text". Think: different font styles, images, diagrams, colors, footnotes, admonitions, table-of-contents, etc. Things we are used to from (technical) books, documentation and webpages. Markdown (if rendered beyond its plain-text version) does support some of that, but not all ...

When first conceived, the first "literate programming" tools used LaTeX as their content-generation language. That is: the produced document was LaTeX-based, with all that it implies (personal summary: very powerful features, not so friendly to write, geared more towards print media, especially academic papers).

But when exploring "literate programming", my idea of the final document was never a PDF or LaTeX file, but a webpage. More specifically a modern, navigable online documentation. That is: HTML based.

One tool for producing HTML-based technical documentation is Sphinx. It originated in the Python world, where it is very widespread, but it can be used for whatever purpose one sees fit.

Fortunately I found out that there have been "literate programming" extensions for Sphinx.

That alone made me familiar with Sphinx as a document-authoring and generator tool. Remember: one writes Sphinx pages in "code" itself, the variants being reStructuredText or Markdown.

So there I was. On the one hand, I had my realization about the ability to interlink Markdown pages. On the other hand, I was missing the advanced features found in document generators like Sphinx.

Until I had an idea ... what if I combined the interlinked Markdown-pages approach with Sphinx-generated documentation ...?

"Embedded Doctree"

For lack of a better name, I call this the "embedded Doctree" approach. Let me explain ...

Through some trickery (which I am not going to go into in detail right now) I managed to have a codebase be both:

The project codebase, with its normal file structure
A Sphinx project codebase, with its document pages

The code files and the documentation files are intermixed. They live in the same file-hierarchy.

Thus, I can put Sphinx/Markdown files wherever I want, have them linked however I want, and they will produce a nice HTML-based technical page. The nice-to-read HTML-based documentation can be read and navigated in the browser (and even be published somewhere).

And on the other hand, Sphinx/Markdown files are developer-friendly when exploring a codebase. They can still be viewed and edited as plain text and even still support the direct linking mentioned earlier (with some minor exceptions).
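
The "trickery" is not spelled out here, so the following is only a guess at one possible setup, not the actual configuration used. It assumes a `conf.py` at the repository root, so that the whole codebase doubles as the Sphinx source directory, and it assumes the MyST-Parser extension is installed for Markdown support. The project name and the excluded directories are hypothetical.

```python
# conf.py, placed at the repository root so that the whole codebase
# doubles as the Sphinx source directory (an assumption, not the author's setup).

project = "example-project"  # hypothetical project name

# MyST-Parser lets Sphinx read Markdown files next to reStructuredText ones.
extensions = ["myst_parser"]

source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}

# The top-level page that links to the documents scattered through the tree.
root_doc = "index"

# Keep Sphinx away from directories that hold no documentation.
exclude_patterns = [
    "_build",
    ".git",
    ".venv",
    "node_modules",
    "**/__pycache__",
]
```

With a configuration along these lines, running `sphinx-build -b html . _build/html` from the repository root would pick up Markdown pages from anywhere in the tree. Sphinx warns about pages that are not included in any toctree, so each embedded document would still need to be reachable from the root page.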

Conclusion

I think I found an interesting middle-ground between the beautifully produced final-documents promised by the "literate programming" approach and useful, pragmatic, and IDE-friendly context-relevant local (document) files.

With this approach one gets to keep using existing IDEs and the usual code-navigation, while having interlinked local in-depth explanations for vital aspects of a codebase - both as plain-text (Markdown) files and pleasant-to-read navigable online documentation.

Top comments (3)

Sebastian Nozzi • May 2

I regret that I cannot show a working example at the moment, since my first experiment with this approach is done for a company project. I hope to be able to upload a public proof-of-concept project soon.

Sebastian Nozzi • May 2

Here is the "literate programming" extension for Sphinx I used:

sphinx-litprog.readthedocs.io/en/s...

Sahha • May 7

Interesting to read