An exploration of low-impact indexing using embeddings.
A Chrome extension that automatically captures web pages as MHTML, extracts their text content, splits it into chunks, and computes embeddings with ONNX Runtime Web so the pages can be searched offline. Includes a basic UI for searching and managing indexed content.
- Automatic Page Capture: Uses `chrome.pageCapture.saveAsMHTML()` to capture pages on load
- MHTML Processing: Extracts HTML content from MHTML using `mhtml-to-html`
- Text Chunking: Splits content into 510 token segments with 50 token overlap using BERT WordPiece tokenization
- Embedding Computation: Uses all-MiniLM-L6-v2 model via ONNX Runtime Web with WebGPU/WASM support
- OPFS Storage: Persists embedding vectors and chunk data in Origin Private File System
- Chrome Storage: Lightweight metadata storage for quick lookups
The extension captures and indexes pages as you browse. No user interaction is required.
- OPFS:
  - Chunks stored as JSON files
  - Embedding vectors stored as binary data (Float32Array serialized)
- Chrome Storage: Metadata (URLs, titles, timestamps, chunk counts)
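
As a rough illustration of the binary layout, the sketch below packs equal-length vectors into one contiguous buffer and reads them back. The helper names `packEmbeddings` and `unpackEmbeddings`, and the idea of one buffer per page, are assumptions made for this example and may not match the extension's actual code. The default of 384 is the output dimension of all-MiniLM-L6-v2.

```typescript
// Hypothetical helpers (not the extension's real API): pack embedding
// vectors into a single binary buffer and split that buffer back apart.
const EMBEDDING_DIM = 384; // output dimension of all-MiniLM-L6-v2

function packEmbeddings(
  vectors: Float32Array[],
  dim: number = EMBEDDING_DIM
): ArrayBufferLike {
  const packed = new Float32Array(vectors.length * dim);
  vectors.forEach((vec, i) => {
    if (vec.length !== dim) {
      throw new Error(`vector ${i} has length ${vec.length}, expected ${dim}`);
    }
    // Vector i occupies floats [i * dim, (i + 1) * dim).
    packed.set(vec, i * dim);
  });
  return packed.buffer;
}

function unpackEmbeddings(
  buffer: ArrayBufferLike,
  dim: number = EMBEDDING_DIM
): Float32Array[] {
  const packed = new Float32Array(buffer);
  if (packed.length % dim !== 0) {
    throw new Error(`buffer holds ${packed.length} floats, not a multiple of ${dim}`);
  }
  const vectors: Float32Array[] = [];
  for (let offset = 0; offset < packed.length; offset += dim) {
    // subarray creates a view on the same memory; no copy is made.
    vectors.push(packed.subarray(offset, offset + dim));
  }
  return vectors;
}
```

A buffer produced this way could be written to an OPFS file as-is and passed to `unpackEmbeddings` when the page's vectors are needed again.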
- Target size: 510 content tokens per chunk (plus 2 special tokens: [CLS] and [SEP])
- Overlap: 50 tokens between chunks
- Method: BERT WordPiece tokenization with token-based chunking
- Features: Text reconstruction for displaying readable snippets in search results
- Note: Tokenizes entire text first, then chunks based on actual token boundaries
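
The chunking rules above can be sketched as a sliding window over token IDs that have already been produced by the tokenizer. This is a minimal sketch, not the extension's implementation: `chunkTokenIds` and `addSpecialTokens` are hypothetical names, and the IDs 101 and 102 for [CLS] and [SEP] assume the standard BERT uncased vocabulary.

```typescript
// Hypothetical sketch: split a token-ID sequence into overlapping windows.
function chunkTokenIds(
  tokenIds: number[],
  chunkSize: number = 510,
  overlap: number = 50
): number[][] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const stride = chunkSize - overlap; // 460 with the defaults
  const chunks: number[][] = [];
  for (let start = 0; start < tokenIds.length; start += stride) {
    chunks.push(tokenIds.slice(start, start + chunkSize));
    // Stop once a window reaches the end, so no chunk is pure overlap.
    if (start + chunkSize >= tokenIds.length) break;
  }
  return chunks;
}

// Wrap a content chunk in the two special tokens, giving at most 512 IDs.
// 101 and 102 are assumed [CLS]/[SEP] IDs from the BERT uncased vocabulary.
function addSpecialTokens(
  chunk: number[],
  clsId: number = 101,
  sepId: number = 102
): number[] {
  return [clsId, ...chunk, sepId];
}
```

With the defaults, a 1000-token page yields three chunks of 510, 510, and 80 content tokens, starting at token offsets 0, 460, and 920.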
- Compute query embedding using same model
- Load all page embeddings from OPFS
- Compute cosine similarity for all chunks
- Rank results by similarity score
- Return top K results with metadata
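
The search steps above amount to a brute-force scan over every stored chunk. The sketch below shows one way that could look, assuming the query embedding has already been computed and the chunk embeddings are already loaded into memory. The `IndexedChunk` and `SearchHit` shapes and the function names are illustrative assumptions, not the extension's actual types.

```typescript
// Hypothetical shapes for a stored chunk and a search result.
interface IndexedChunk {
  url: string;
  title: string;
  text: string;
  embedding: Float32Array;
}

interface SearchHit {
  url: string;
  title: string;
  text: string;
  score: number;
}

// Cosine similarity: dot(a, b) / (|a| * |b|), defined as 0 for a zero vector.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the query, sort descending, keep the top K.
function searchTopK(
  query: Float32Array,
  chunks: IndexedChunk[],
  k: number = 5
): SearchHit[] {
  return chunks
    .map((c) => ({
      url: c.url,
      title: c.title,
      text: c.text,
      score: cosineSimilarity(query, c.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A linear scan like this costs time proportional to the number of stored chunks, which is simple and needs no extra index structure but grows with the size of the browsing history.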
- Quick Stats: Pages indexed, chunks stored, storage usage
- Recent Pages: Last 5 indexed pages with favicons
- Quick Actions: View all pages, search index, settings
- Content Script: Use `window.offlineIndexer` in console
- Background: Check service worker logs in DevTools
- Worker: Monitor worker messages in background script
- Storage: Inspect OPFS and chrome.storage in DevTools
- UI: Use browser DevTools for popup and side panel debugging
- Only works on HTTP/HTTPS pages
- Requires OPFS support (Chrome 86+)
- Model download required on first run (~25MB)
- WebGPU support recommended for performance
- Storage limited by browser quotas
- Side panel requires Chrome 114+
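
To illustrate the first limitation, a capture step might skip any tab whose URL is not HTTP or HTTPS before attempting to save it. The helper below is a hypothetical sketch of such a check; `isIndexableUrl` is not a name taken from the extension.

```typescript
// Hypothetical guard: only http: and https: pages are candidates for capture.
function isIndexableUrl(rawUrl: string): boolean {
  try {
    const { protocol } = new URL(rawUrl);
    return protocol === "http:" || protocol === "https:";
  } catch {
    // new URL() throws on strings that are not valid absolute URLs.
    return false;
  }
}
```

Under this check, internal pages such as `chrome://extensions` and local `file://` documents would be ignored.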
MIT License - see LICENSE file for details.