This project began with a question: "Can machines understand the text within SICP?"
The result is a graph visualization of the interconnectedness between the sections of SICP.
Similarity is computed using a hybrid approach combining semantic and lexical methods:
- Semantic similarity: OpenAI's `text-embedding-3-small` embeddings with cosine similarity
- Lexical similarity: BM25 (Okapi BM25) for keyword-based matching
The final similarity score combines both: 0.7 * semantic + 0.3 * lexical. This captures both conceptual relationships (via embeddings) and shared terminology (via BM25).
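The weighting above can be sketched as follows. This is a minimal illustration, not the code in `connectome.py`: it assumes the embeddings and pairwise BM25 scores are already computed, and it assumes the BM25 scores are min-max scaled to [0, 1] before mixing (raw BM25 is unbounded, and the scaling actually used by the project is not stated here). The function names and toy data are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T


def min_max_scale(scores: np.ndarray) -> np.ndarray:
    """Rescale scores to [0, 1]. Assumed normalization, not confirmed by the project."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores, dtype=float)
    return (scores - lo) / (hi - lo)


def hybrid_similarity(embeddings, bm25_scores, w_semantic=0.7, w_lexical=0.3):
    """Combine the two signals as 0.7 * semantic + 0.3 * lexical."""
    semantic = cosine_similarity_matrix(np.asarray(embeddings, dtype=float))
    lexical = min_max_scale(np.asarray(bm25_scores, dtype=float))
    return w_semantic * semantic + w_lexical * lexical


# Toy data: three "sections" with 2-d embeddings and made-up pairwise BM25 scores.
embeddings = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
bm25_scores = np.array([[0.0, 4.0, 1.0],
                        [4.0, 0.0, 0.0],
                        [1.0, 0.0, 0.0]])
scores = hybrid_similarity(embeddings, bm25_scores)
```

In this toy case, sections 0 and 1 share both an embedding direction and the highest BM25 score, so their combined score is the maximum; sections 0 and 2 are orthogonal in embedding space and are linked only through the lexical term.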
A k-NN threshold is applied to keep only the top-k most similar neighbors per chapter, resulting in a sparse similarity graph that is visualized with a D3.js force-directed layout.
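The top-k step can be sketched like this. It is one plausible reading of the description above, under stated assumptions: self-similarity is excluded, and an edge is kept if either endpoint ranks the other among its k nearest neighbors. The function name `top_k_edges` and the toy matrix are hypothetical.

```python
import numpy as np


def top_k_edges(similarity: np.ndarray, k: int) -> list[tuple[int, int, float]]:
    """Keep each node's k most similar other nodes; return undirected edges."""
    n = similarity.shape[0]
    k = min(k, n - 1)  # a node cannot have more than n - 1 neighbors
    edges = {}
    for i in range(n):
        row = similarity[i].astype(float)  # astype returns a copy
        row[i] = -np.inf                   # exclude the node itself
        neighbors = np.argsort(row)[::-1][:k]
        for j in neighbors:
            j = int(j)
            key = (min(i, j), max(i, j))   # undirected: store each pair once
            edges[key] = float(similarity[i, j])
    return [(a, b, w) for (a, b), w in sorted(edges.items())]


# Toy similarity matrix for four nodes.
similarity = np.array([[1.0, 0.9, 0.2, 0.1],
                       [0.9, 1.0, 0.3, 0.4],
                       [0.2, 0.3, 1.0, 0.8],
                       [0.1, 0.4, 0.8, 1.0]])
edges = top_k_edges(similarity, k=1)
```

With `k=1`, the dense 4x4 matrix reduces to two edges, which is the kind of sparse structure a force-directed layout can draw legibly.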
- `docs` contains all the data and the static pages rendered on gh-pages
- `texts` contains the cleaned HTML and markdown versions of each text
- `connectome.py` computes semantic similarity using OpenAI embeddings
- `connectome_tfidf.py` is the legacy TF-IDF approach (preserved for comparison)
The graph visualization has since been extended to other texts:
- Structure and Interpretation of Computer Programs
- Structure and Interpretation of Classical Mechanics
  - Lagrangian mechanics is more related to rigid bodies than Hamiltonian mechanics.
- The Principles of Quantum Mechanics: chapters, sections
- The Society of Mind. Sourced from http://www.aurellem.org/minsky/
  - The sections are highly correlated with one another.