MeMemo: On-device Retrieval Augmentation for Private and Personalized Text Generation

Zijie J. Wang (ORCID 0000-0003-4360-1423), Georgia Institute of Technology, Atlanta, Georgia, USA
Duen Horng Chau (ORCID 0000-0001-9824-3323), Georgia Institute of Technology, Atlanta, Georgia, USA
(2024)
Abstract.

Retrieval-augmented text generation (RAG) addresses the common limitations of large language models (LLMs), such as hallucination, by retrieving information from an updatable external knowledge base. However, existing approaches often require dedicated backend servers for data storage and retrieval, thereby limiting their applicability in use cases that require strict data privacy, such as personal finance, education, and medicine. To address the pressing need for client-side dense retrieval, we introduce MeMemo, the first open-source JavaScript toolkit that adapts the state-of-the-art approximate nearest neighbor search technique HNSW to browser environments. Developed with modern and native Web technologies, such as IndexedDB and Web Workers, our toolkit leverages client-side hardware capabilities to enable researchers and developers to efficiently search through millions of high-dimensional vectors in the browser. MeMemo enables exciting new design and research opportunities, such as private and personalized content creation and interactive prototyping, as demonstrated in our example application RAG Playground. Reflecting on our work, we discuss the opportunities and challenges for on-device dense retrieval. MeMemo is available at https://github.com/poloclub/mememo.

Keywords: Neural information retrieval, On-device, Large language models
Published in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24), July 14–18, 2024, Washington, DC, USA. Copyright: rights retained. DOI: 10.1145/3626772.3657662. ISBN: 979-8-4007-0431-4/24/07.
CCS Concepts: Information systems → Information retrieval; Human-centered computing → Human computer interaction (HCI); Computing methodologies → Machine learning
Fig. 1. MeMemo is the first open-source JavaScript toolkit for in-browser dense neural retrieval. We demonstrate the capabilities of MeMemo by developing RAG Playground that enables AI developers to prototype retrieval-augmented text generation (RAG) apps locally in their browsers. With RAG Playground, developers can (A) enter various user queries, (B) search for semantically similar documents from an in-browser vector database, and (C) augment a text prompt with retrieved documents. (D) This allows developers to rapidly test if in-browser large language models generate more reliable responses to the query.

1. Introduction

Retrieval augmented generation (RAG) (Lewis et al., 2020) with large language models (LLMs) has gained immense popularity from both practitioners and researchers, especially in applications such as domain-specific chatbots (Semnani et al., 2023; Prince et al., 2023), code generation (Soare et al., 2022; Zhou et al., 2023), and interactive agents (Hsieh et al., 2023; Ruan et al., 2023). RAG can improve the accuracy and reliability of LLMs’ generated text (Shuster et al., 2021), by providing these models, such as GPT-4 (OpenAI, 2023) and Llama 2 (Touvron et al., 2023), with context information retrieved from an updatable and external knowledge base. Compared to other techniques, such as fine-tuning (Hu et al., 2021) and prompt tuning (Lester et al., 2021), that aim to improve LLM’s performance on new or specific domains, RAG is often favored by AI practitioners (Martineau, 2023) due to its ease of implementation, flexibility in maintenance, and superior performance (Ovadia et al., 2024).

However, current RAG systems rely on dedicated backend servers to store and retrieve external documents relevant to the user’s query. This is often achieved through nearest neighbor search using dense embedding vector representations of documents (Balaguer et al., 2024; Li et al., 2023a). The need for centralized backend servers limits the applicability of RAG in domains that prioritize data privacy, such as personal finance, education, and medicine (e.g., Chung et al., 2023; Wutschitz et al., 2023; Fuchsbauer et al., 2021; Ghodratnama and Zakershahrak, 2023). Furthermore, implementing and hosting a vector storage and dense retriever pose additional challenges for AI novices and everyday LLM users (Draxler et al., 2023; Zamfirescu-Pereira et al., 2023), thereby increasing the barrier to entry for learning and applying RAG.

To address these pressing challenges, we present MeMemo, the first JavaScript toolkit that offloads vector storage and dense retrieval to the client—empowering a broader range of audiences to leverage cutting-edge retrieval techniques to enhance their LLM experiences. Our work makes the following key contributions:

  • MeMemo, the first scalable JavaScript library that enables users to store and retrieve large vector databases directly in their browsers. Our toolkit adapts the state-of-the-art approximate nearest neighbor search Hierarchical Navigable Small World graphs (HNSW) (Malkov and Yashunin, 2020) to the Web environment. By leveraging a novel prefetching strategy and modern Web technologies, such as IndexedDB and Web Workers, MeMemo empowers users to retrieve dense vectors with both privacy and efficiency (§ 3).

  • RAG Playground, an example application of on-device dense retrieval. We demonstrate the capabilities of MeMemo by developing RAG Playground (Fig. 1), a novel client-side tool using on-device retrieval to enable interactive learning about RAG and rapid prototyping of RAG applications (§ 2). We highlight the benefit of on-device retrieval regarding privacy, ubiquity, and interactivity. Finally, we discuss the opportunities and challenges for future research on client-side retrieval augmentation and personalized text generation (§ 5). RAG Playground is publicly accessible at https://poloclub.github.io/mememo.

  • An open-source implementation (MeMemo code: https://github.com/poloclub/mememo) that lowers the barrier for researchers and developers to apply retrieval augmentation to improve text generation on the client side. We provide comprehensive documentation and an example application to help users use MeMemo to implement on-device retrieval augmentation across different Web environments. MeMemo is developed with minimal dependencies and TypeScript, a statically typed programming language, making it a maintainable and easy-to-use resource for the information retrieval community.

We hope our work will inspire the design, research, and development of on-device retrieval, enabling everyone to use text-generative models and other AI technologies more easily and privately.

2. MeMemo in Action

We present two hypothetical usage scenarios, developing (§ 2.1) and using (§ 2.2) RAG Playground, to demonstrate how researchers and practitioners can use MeMemo to easily develop client-side applications that take advantage of on-device RAG.

2.1. Developing In-browser RAG Tools

Motivations. Consider a scenario where Mei, a machine learning (ML) consultant, is developing an LLM-based chatbot for a large design studio. The chatbot’s purpose is to help newly hired designers familiarize themselves with the company’s internal design systems and tools. To ensure accurate and reliable responses, Mei integrates RAG into this onboarding chatbot. This integration allows the responses to be grounded in relevant documentation, design documents, and code. Initially, Mei uses Jupyter Notebooks (Kluyver et al., 2016) to prototype the chatbot through prompt engineering in Python. However, she realizes that this workflow is not ideal for collaborating with designers and introducing RAG to her clients, because many of the collaborators and stakeholders are not experienced in programming and setting up notebook environments. Therefore, Mei decides to develop RAG Playground (Fig. 1), a web-based no-code RAG prototyping tool. This tool will enable her collaborators, who come from diverse backgrounds, to easily access and prototype RAG features for their chatbot through their web browsers.

import { HNSW } from 'mememo';

// Create a new index that compares vectors by cosine distance
const index = new HNSW({ distanceFunction: 'cosine' });

// Insert elements in bulk; keys: string[], values: number[][]
await index.bulkInsert(keys, values);

// Find the k nearest neighbors; query: number[], k: number
// neighborKeys: string[], distances: number[]
// (the result is renamed so it does not clash with the inserted `keys`)
const { keys: neighborKeys, distances } = await index.query(query, k);
Code 1: Example TypeScript code that uses MeMemo to create an HNSW index and search for k-nearest neighbors.

Vector storage and retrieval with MeMemo. Mei uses MeMemo, a JavaScript library, to enable dense vector storage and retrieval directly in the browser. By installing the library with a single command npm install mememo, Mei can easily import it into her web app, regardless of her web development stack (e.g., JavaScript, TypeScript, React (Facebook, 2013), Svelte (Harris, 2016), or Lit (Google, 2015)). With just a few lines of code (Code 1), Mei can create an HNSW vector index (Malkov and Yashunin, 2020) and efficiently search through millions of embedding vectors entirely within her browser. Mei also uses MeMemo’s exportIndex() and loadIndex() functions to export an index she has created into persistent local storage or as a JSON file. This allows her collaborators to quickly load the HNSW index without the need to recreate it every time they use RAG Playground.
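For intuition about what the query in Code 1 approximates, the following minimal sketch performs an exact k-nearest neighbor search under cosine distance by scanning every vector. It is an illustrative baseline only, not MeMemo's implementation; the helper names `cosineDistance` and `bruteForceQuery` are hypothetical, and the return shape merely mirrors the `keys` and `distances` fields shown in Code 1.

```typescript
// Illustrative baseline only; this is NOT how MeMemo searches.
// cosineDistance and bruteForceQuery are hypothetical helper names.

/** Cosine distance = 1 - cosine similarity. Assumes equal-length, non-zero vectors. */
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Exact k-nearest neighbors: score every stored vector, sort, keep the top k. */
function bruteForceQuery(
  keys: string[],
  values: number[][],
  query: number[],
  k: number
): { keys: string[]; distances: number[] } {
  const scored = keys.map((key, i) => ({
    key,
    distance: cosineDistance(values[i], query),
  }));
  scored.sort((x, y) => x.distance - y.distance);
  const top = scored.slice(0, k);
  return {
    keys: top.map((s) => s.key),
    distances: top.map((s) => s.distance),
  };
}
```

A linear scan like this costs one distance computation per stored vector for every query, which is the cost an approximate index such as HNSW is designed to avoid on large collections.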

Smooth integration with existing Web ML technologies. Mei seamlessly integrates MeMemo with other Web ML technologies. For example, she uses IndexedDB, a client-side key-value browser storage, to store the raw documents. Using the same keys, Mei creates the HNSW index with MeMemo. Then, Mei uses FlexSearch (Wilkerling, 2019) to implement fast full-text lexical search in the browser. To enable semantic search, Mei first uses GTE-Small (Li et al., 2023b) to encode all documents into 384-dimensional dense vectors in Python with SentenceTransformers (Reimers and Gurevych, 2019). For encoding the user’s query (Fig. 1A), Mei uses ONNX (Bai et al., 2019) and Transformers.js (Lochner, 2023) to run the same GTE-Small model in the browser. After augmenting a text prompt with retrieved documents (Fig. 1C), Mei runs the prompt with open-source LLMs, such as Llama 2 (Touvron et al., 2023) and Phi 2 (Abdin et al., 2023), in the browser through Web LLM (MLC team, 2023). By combining MeMemo with existing Web ML technologies, Mei quickly develops RAG Playground and shares it with her collaborators. With this tool, Mei’s team makes great progress, as stakeholders from diverse backgrounds can easily experiment with different user queries and prompts to improve their onboarding chatbot.

2.2. Prototyping with RAG Playground

Motivations. Robaire, a graduate student studying human-computer interaction, is designing an interactive visualization tool to assist researchers in brainstorming and literature review. After discovering RAG online, Robaire becomes interested in integrating it into his prototype. The objective is to allow users to input a large corpus of academic papers and use natural language queries to discover related papers and visualize the connections between them. Since Robaire has never implemented RAG before, he turns to RAG Playground to learn about the concept and prototype for his tool.

[Figure: RAG Playground with remote and local LLMs, e.g., GPT-4 and Llama 2]

Learning and experimenting with RAG. After opening RAG Playground in the browser, Robaire creates a MeMemo database (Fig. 1B) by uploading a JSON file containing the abstracts of 120k arXiv ML papers and 384-dimensional embeddings of the abstracts. Robaire then takes the role of his end-users and types a natural language query in the User Query View, such as “how to integrate information retrieval into ML?” (Fig. 1A). In addition, he writes a simple system prompt template in the Prompt View (Fig. 1C) with placeholders {{user}} and {{context}}. After clicking the [icon] button, Robaire sees 10 relevant paper abstracts with their cosine distances highlighted in the Database View (Fig. 1B). He also finds that the two placeholders in the Prompt View are replaced with the user query and relevant documents. Robaire then sees the LLM’s output in the Output View (Fig. 1D). Finding the output helpful and grounded in the documents retrieved by MeMemo, Robaire experiments with more prompts and both remote and local LLMs (e.g., GPT-4 and Llama 2 shown in the figure above) in RAG Playground and gains a better understanding of RAG. This increased understanding gives him more confidence to implement RAG in his tool.
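The placeholder substitution described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not RAG Playground's actual code: the function name `fillPromptTemplate` is hypothetical, and joining retrieved documents with blank lines is an arbitrary choice; only the placeholder names {{user}} and {{context}} come from the text.

```typescript
// Hypothetical helper; only the placeholder names {{user}} and {{context}}
// are taken from the paper. The joining format is an assumption.
function fillPromptTemplate(
  template: string,
  userQuery: string,
  retrievedDocs: string[]
): string {
  const context = retrievedDocs.join('\n\n');
  // A single pass with a replacer function: inserted text is never re-scanned
  // for placeholders, and "$" sequences in the inputs are kept literally.
  return template.replace(
    /\{\{(user|context)\}\}/g,
    (_match: string, name: string) => (name === 'user' ? userQuery : context)
  );
}

const examplePrompt = fillPromptTemplate(
  'Answer using the context.\nContext: {{context}}\nQuestion: {{user}}',
  'how to integrate information retrieval into ML?',
  ['Abstract of paper 1', 'Abstract of paper 2']
);
```

Doing the substitution in one pass means a user query that happens to contain the text {{context}} is passed through unchanged rather than being expanded a second time.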

3. MeMemo Design and Implementation

MeMemo is the first JavaScript toolkit that enables dense retrieval in the browser. To enable fast and reliable retrieval for RAG, our tool adapts the state-of-the-art approximate nearest neighbor search technique HNSW (§ 3.1). MeMemo leverages modern and native Web technologies, such as IndexedDB and Web Workers (§ 3.2), to optimize for browser environments. To help researchers and developers adopt MeMemo, we have open-sourced it and provided detailed documentation, tutorials, and an example application (§ 3.3).

3.1. Adapting HNSW

HNSW is a state-of-the-art approximate k-nearest neighbor search technique introduced by Malkov and Yashunin (2020). It is inspired by the greedy graph routing used in navigable small world networks (Kleinberg, 2000; Boguñá et al., 2009) and the stochastic hierarchical structure of 1D probabilistic skip lists (Pugh, 1990). HNSW uses a multilayered graph structure to connect high-dimensional dense vectors. During insertion, each new element is assigned a layer level at random, which determines its position within the graph’s multilayered hierarchy. The algorithm then finds the element’s closest neighbors, starting from the top layer and working downwards with a greedy search. Searching for the nearest neighbors of a query element follows a similar procedure: it starts from the top layer and uses the connections established during insertion to guide the search downwards.
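The two ingredients described above, random layer assignment and top-down greedy routing, can be sketched as follows. This is a deliberately simplified illustration rather than MeMemo's implementation: all identifiers are hypothetical, Euclidean distance is chosen arbitrarily, and the walk tracks only a single best candidate per layer, whereas the algorithm of Malkov and Yashunin keeps a list of candidates. The level formula (the floor of -ln(u) times 1/ln(M)) follows their paper.

```typescript
// Simplified sketch of two HNSW ingredients; all identifiers are hypothetical.
type Layer = Record<string, string[]>;   // node key -> neighbor keys on one layer
type Vectors = Record<string, number[]>; // node key -> vector

function euclidean(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += (a[i] - b[i]) * (a[i] - b[i]);
  }
  return Math.sqrt(sum);
}

/** Draw a layer level: floor(-ln(u) * mL), with u in (0, 1] and mL = 1 / ln(M). */
function randomLevel(M: number, rng: () => number = Math.random): number {
  const mL = 1 / Math.log(M);
  const u = 1 - rng(); // rng() is in [0, 1), so u is in (0, 1] and ln(u) is finite
  return Math.floor(-Math.log(u) * mL);
}

/** Greedy walk on one layer: keep moving to a closer neighbor until none exists. */
function greedySearchLayer(
  layer: Layer,
  vectors: Vectors,
  entry: string,
  query: number[]
): string {
  let current = entry;
  let best = euclidean(vectors[current], query);
  let improved = true;
  while (improved) {
    improved = false;
    for (const neighbor of layer[current] || []) {
      const d = euclidean(vectors[neighbor], query);
      if (d < best) {
        best = d;
        current = neighbor;
        improved = true;
      }
    }
  }
  return current;
}

/** Descend from the top layer to layer 0, reusing each result as the next entry point. */
function searchTopDown(
  layers: Layer[], // layers[0] is the bottom layer containing every node
  vectors: Vectors,
  entry: string,
  query: number[]
): string {
  let current = entry;
  for (let level = layers.length - 1; level >= 0; level--) {
    current = greedySearchLayer(layers[level], vectors, current, query);
  }
  return current;
}
```

Because the level distribution decays exponentially, most elements land on layer 0 and only a few reach the sparse upper layers, which is what lets the descent skip over large parts of the graph before the fine-grained search at the bottom.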

We use HNSW as the approximate nearest neighbor search technique in MeMemo because it is state of the art in construction and query efficiency (Malkov and Yashunin, 2020). Additionally, HNSW has gained immense popularity among retrieval and AI practitioners and has been integrated into popular retrieval and RAG toolkits such as FAISS (Douze et al., 2024), Pyserini (Lin et al., 2021), PGVector (Kane, 2021), and LangChain (Chase, 2022). Our goal with MeMemo is to integrate seamlessly into users’ existing workflows and preferences, providing a smooth and familiar experience when developing in-browser retrieval applications.

3.2. Optimizing for the Browsers

Memory management. Memory management is one of the main challenges for developing in-browser toolkits. Depending on the device and browser, a webpage tab might have a RAM limit as low as 256MB (Maitre, 2018). This means that, even ignoring all other memory usage on a webpage, it can store at most 83k 384-dimensional vectors in RAM. Additionally, for security reasons, browsers do not allow access to the operating system’s file system, so MeMemo cannot directly store data on the user’s disk. To overcome these challenges, MeMemo leverages IndexedDB (MDN, 2021), a cross-browser key-value storage that can use up to 80% of the client’s disk size (MDN, 2023a). MeMemo stores all vector values in IndexedDB and keeps only the keys and HNSW graphs in RAM.
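The 83k figure can be reproduced with a short calculation. The sketch below assumes that each vector component occupies 8 bytes (a 64-bit floating-point number, the size of a JavaScript number) and that 256MB means 256 × 10^6 bytes; both are our assumptions about how the estimate was obtained, and the function name is hypothetical.

```typescript
// Back-of-the-envelope estimate; the 8-byte component size and the decimal
// megabyte are assumptions that happen to reproduce the 83k figure.
function maxVectorsInRam(
  ramBytes: number,
  dimensions: number,
  bytesPerValue: number
): number {
  return Math.floor(ramBytes / (dimensions * bytesPerValue));
}

const float64Limit = maxVectorsInRam(256e6, 384, 8); // 83333, i.e. about 83k
const float32Limit = maxVectorsInRam(256e6, 384, 4); // 166666 with 4-byte components
console.log(float64Limit, float32Limit);
```

Under these assumptions, a collection of one million 384-dimensional vectors would need roughly 3GB for the vector values alone, well beyond such a tab limit, which motivates keeping the values outside RAM.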

Prefetching for efficient data access. While IndexedDB addresses the memory constraints in the browser, reading or writing a large amount of data to IndexedDB with consecutive transactions is extremely slow (RxDB, 2021). Dexie.js (Fahlander, 2021) introduces techniques for fast batched reads and writes to IndexedDB. However, the HNSW construction process requires consecutive reads and writes of vector values, as the algorithm relies on the previously constructed index for finding good neighbors (Mendel-Gleason, 2024; Malkov and Yashunin, 2020). To address this challenge, MeMemo introduces a prefetching mechanism. When inserting multiple elements, MeMemo first uses a batched write to store all vectors in IndexedDB. During construction and search, MeMemo maintains a cache of p vector values in RAM. If it needs to read a vector value that is not in the cache, MeMemo prefetches p neighbors of that element on the current graph layer from IndexedDB to RAM. This mechanism reduces the number of IndexedDB transactions. The parameter p is automatically determined by the vector dimension and can be configured by users.
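The effect of prefetching on the number of storage transactions can be illustrated with a small simulation. Everything in this sketch is hypothetical: `SimulatedStore` is an in-memory stand-in for IndexedDB that only counts batched reads, the cache uses simple first-in-first-out eviction, and on a miss it fetches the requested vector plus up to p - 1 neighbors so that the batch fits in a cache of p entries. MeMemo's actual cache policy may differ.

```typescript
// Simulation only: SimulatedStore stands in for IndexedDB; all names are hypothetical.
type VectorTable = Record<string, number[]>;

class SimulatedStore {
  transactions = 0;
  private data: VectorTable;

  constructor(data: VectorTable) {
    this.data = data;
  }

  // One batched read counts as one transaction, however many keys it covers.
  getMany(keys: string[]): VectorTable {
    this.transactions += 1;
    const found: VectorTable = {};
    for (const key of keys) {
      if (key in this.data) found[key] = this.data[key];
    }
    return found;
  }
}

class PrefetchingReader {
  private cache: VectorTable = {};
  private order: string[] = []; // insertion order, used for FIFO eviction
  private store: SimulatedStore;
  private layer: Record<string, string[]>; // adjacency of the current graph layer
  private p: number;

  constructor(store: SimulatedStore, layer: Record<string, string[]>, p: number) {
    this.store = store;
    this.layer = layer;
    this.p = p;
  }

  read(key: string): number[] {
    if (key in this.cache) return this.cache[key];
    // Cache miss: fetch the vector together with some of its neighbors on the
    // current layer, since graph traversal is likely to visit them next.
    const neighbors = (this.layer[key] || []).slice(0, Math.max(0, this.p - 1));
    const fetched = this.store.getMany([key].concat(neighbors));
    if (!(key in fetched)) throw new Error('Unknown key: ' + key);
    for (const k of Object.keys(fetched)) {
      if (!(k in this.cache)) this.order.push(k);
      this.cache[k] = fetched[k];
    }
    // Keep at most p vectors in memory by evicting the oldest entries.
    while (this.order.length > this.p) {
      const oldest = this.order.shift() as string;
      delete this.cache[oldest];
    }
    return fetched[key];
  }
}
```

In this simulation, reading a node and then its neighbors costs a single transaction instead of one per vector, which is the saving the prefetching mechanism aims for during graph traversal.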

3.3. Open-source and Easy to Use

To help researchers and developers easily adopt MeMemo, we open-source our implementation and design APIs similar to popular HNSW Python libraries (e.g., Malkov and Yashunin, 2020; Douze et al., 2024; Zhu, 2016). Users can easily configure all HNSW parameters, such as M (the number of neighbors a graph node can have) and efConstruction (the number of nodes to search during construction). With just a few lines of code (Code 1), users can quickly implement dense retrieval in web browsers using MeMemo. We provide detailed documentation and tutorials. Additionally, we offer an open-source example application RAG Playground that demonstrates the integration of MeMemo with existing Web ML technologies (§ 2.1). RAG Playground also shows how to use MeMemo with modern Web APIs, including Web Workers (MDN, 2023c) to prevent blocking the main thread and the Streams API (MDN, 2023b) for creating an HNSW index incrementally from small network-received chunks. MeMemo is published in the popular Web package repository npm Registry, and can be easily installed and used in both browser and Node.js (Dahl, 2009) environments.

4. Related Work

Retrieval-augmented text generation. There has been a long history of using information retrieval to enhance text generation, such as developing language models through retrieval (Lavrenko and Croft, 2001), using a retrieve-and-edit framework to improve code generation (Hashimoto et al., 2018), and incorporating knowledge graphs to enhance language representation in language models (Zhang et al., 2019). The concept of RAG was popularized by Lewis et al. (2020), who introduced a model that combines a dense passage retriever and sequence-to-sequence models. More recent approaches (e.g., Neelakantan et al., 2022; Qu et al., 2021; Izacard and Grave, 2021; Cuconasu et al., 2024) use pre-trained embedding models to encode external documents as dense vectors and retrieve relevant documents using dense retrievers such as HNSW (Malkov and Yashunin, 2020), PQ (Jégou et al., 2011), and FAISS (Douze et al., 2024). MeMemo builds upon these works and extends RAG to the client side for more private and personalized text generation.

On-device retrieval and machine learning. Traditional retrieval and machine learning (ML) systems are typically deployed on remote servers, and their outputs are sent to client devices. However, there has been a recent surge of interest in deploying ML models directly on edge devices in the pursuit of private, ubiquitous, and interactive ML experiences. Tools such as TensorFlow.js (Smilkov et al., 2019), ONNX (Bai et al., 2019), MLC (MLC, 2023; Chen et al., 2018), and Core ML (Apple, 2017) have significantly reduced the barriers to running complex ML models in browsers and mobile devices. Researchers have proposed various on-device systems, including information retrieval (Kamvar et al., 2009; Lam et al., 2023), recommender systems (Gong et al., 2020; Xia et al., 2023), prediction explanation (Wang et al., 2022, 2023b; Wang and Chau, 2023), speech recognition (Macoskey et al., 2021b, a), translation (Tan et al., 2022), and writing assistants (Wang et al., 2024). Our tool contributes to the growing body of on-device ML research by introducing the first adaptation of dense retrieval to browsers.

5. Discussion and Future Work

Reflecting on our development of MeMemo, we highlight the opportunities and challenges for in-browser dense retrieval.

Opportunities. Enabling dense retrieval and RAG in browsers offers significant advantages regarding privacy, ubiquity, and interactivity. With the browser’s ubiquity, MeMemo is accessible on various devices, including laptops, mobile phones, and IoT appliances like smart refrigerators. Future research directions include:

  • Intelligent personal information management. There is a large body of research on collecting all of one’s personal information into a searchable database (e.g., Freeman and Gelernter, 1996; Cai et al., 2005; Bell, 2001; Chau et al., 2008; Kiesel et al., 2018). Researchers can leverage on-device dense storage and retrieval to design browser extensions that automatically and privately encode and store a user’s visited web pages, photos, and academic papers. These extensions can serve as an intelligent “second brain” (Forte, 2022) to help users capture and review knowledge.

  • Private and personalized content creation. If users maintain a personal vector database in browsers, content creators, such as book writers, can use on-device RAG to tailor their content privately based on readers’ preferences and reading history.

  • Interactive RAG prototyping. Future researchers can enhance the design of RAG Playground to improve interactive RAG prototyping experience, such as supporting collaborative prompt editing (Feng et al., 2023) and interactive embedding visualizations (Wang et al., 2023a).

Challenges. Due to limited computation resources in browsers, MeMemo is slower than heavily optimized libraries like HNSWLIB (Malkov and Yashunin, 2020) in terms of index creation and search. In Chrome on a 64GB RAM MacBook, it took about 94 minutes to insert 1 million 384-dimensional vectors (M=5, efConstruction=20). However, querying this index with 1M items is still performed in real time. Future researchers can optimize in-browser dense retrieval further by implementing parallelization and smarter prefetching techniques.
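To put the reported construction time in perspective, the following rough calculation converts it into an average insertion rate. It uses only the figures stated above (about 94 minutes for 1 million vectors), so the results are approximate averages rather than measured per-insert latencies.

```latex
\[
\frac{10^{6}\ \text{vectors}}{94\ \text{min} \times 60\ \text{s/min}}
= \frac{10^{6}\ \text{vectors}}{5640\ \text{s}}
\approx 177\ \text{vectors/s},
\qquad
\frac{5640\ \text{s}}{10^{6}\ \text{vectors}} \approx 5.6\ \text{ms per vector}.
\]
```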

Conclusions. We present MeMemo, an open-source library that enables in-browser dense retrieval using HNSW and modern Web technologies. We introduce RAG Playground, a novel client-side RAG prototyping tool, to demonstrate the capabilities of MeMemo. We hope MeMemo will be an easy-to-use resource for the information retrieval and ML communities, inspiring future research and development of on-device retrieval and RAG applications.

References