The Secret to Context Engineering for Spatial AI is Accuracy Transfer
Creating a compelling Spatial AI demo is a fairly manageable task given the quality of LLMs available today. The next step, creating a reliable MCP for that service that works globally, is a much more challenging leap. The problem of bridging the gap from “cheap demo” to “magical product” has lately come to be called “context engineering”. More specifically:
Context Engineering is the discipline of designing and building dynamic systems that provide the right information and tools, in the right format, at the right time, to give an LLM everything it needs to accomplish a task.
This really resonated with our work enabling Spatial AI. Our focus isn’t so much building a new LLM as providing the right spatial data and tools, in a structure that lets LLMs perform orientation and navigation tasks in real time. The key to doing this well is providing high-accuracy map data and imagery that is well aligned with the localization of your device, then providing an interface that lets the LLM consume that data fluidly.
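To make that concrete, here is a minimal sketch of the kind of tool an LLM could call through an interface like MCP. The `Place` record, function names, and field-of-view filter are illustrative assumptions rather than our actual API; the point is that the tool hands the model a pre-oriented slice of the map instead of raw coordinates.

```python
from dataclasses import dataclass
from math import atan2, cos, degrees, radians, sin

@dataclass
class Place:
    """Hypothetical place record; field names are illustrative."""
    name: str
    lat: float
    lon: float

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees [0, 360)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dlon = radians(lon2 - lon1)
    y = sin(dlon) * cos(phi2)
    x = cos(phi1) * sin(phi2) - sin(phi1) * cos(phi2) * cos(dlon)
    return (degrees(atan2(y, x)) + 360.0) % 360.0

def places_in_view(lat, lon, heading, places, fov_deg=60.0):
    """Return places inside the device camera's horizontal field of view,
    with each place's signed angular offset from the view center
    (negative = left of center). This pre-oriented list is what the LLM consumes."""
    visible = []
    for p in places:
        offset = (bearing_deg(lat, lon, p.lat, p.lon) - heading + 180.0) % 360.0 - 180.0
        if abs(offset) <= fov_deg / 2.0:
            visible.append({"name": p.name, "offset_deg": round(offset, 1)})
    return visible
```

Because the filtering and orientation math happens in the tool, the model only has to reason over “what is in front of the user,” not raw geometry.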
The Accuracy Challenge
We’ve seen that LLMs do quite well when we provide real-time localization of what a user is looking at, and with GNSS augmentation and some sensor fusion magic our localization can be quite precise. By way of example, we benchmarked our localization accuracy against Google’s ARCore VPS and were within 7 degrees with a 99.5% heading correlation.
When you combine this with GNSS accuracy improvements you get a very competitive localization service without the VPS baggage (e.g. battery drain, an open camera, expensive 3D mapping and compute). The downside of sensor-based localization is that you need geospatially accurate vector map data to localize against. While buildings and roads largely match reality, the places (a.k.a. points of interest) we do commerce with are far less accurate. We did a series of posts (#1 and #2) where we tested different techniques for improving the positions of those places.
Further complicating the problem, open street-level imagery often suffers from poor positioning and camera pose data derived from subpar smartphone measurements. This includes dedicated street-level projects like Mapillary and KartaView as well as user-generated images of places taken with smartphones. Ideally we want high-accuracy position and pose for both street-level imagery and user-generated photos.
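For context, the position and pose that ship with a typical smartphone photo live in its EXIF tags, and pulling them out makes the quality problem easy to inspect. Below is a minimal sketch using Pillow; the file path and error handling are illustrative, and GPSImgDirection (the compass heading) is frequently coarse or missing entirely.

```python
from PIL import Image
from PIL.ExifTags import GPSTAGS, TAGS

def photo_position_and_heading(path):
    """Read GPS position and compass heading from a photo's EXIF tags.
    Returns (lat, lon, heading_deg); heading is NaN when the tag is absent."""
    exif = Image.open(path)._getexif() or {}
    gps_raw = next((v for k, v in exif.items() if TAGS.get(k) == "GPSInfo"), None)
    if not gps_raw:
        return None  # no GPS block at all
    gps = {GPSTAGS.get(k, k): v for k, v in gps_raw.items()}

    def to_deg(dms, ref):
        d, m, s = (float(x) for x in dms)  # degrees/minutes/seconds rationals
        return (-1.0 if ref in ("S", "W") else 1.0) * (d + m / 60.0 + s / 3600.0)

    lat = to_deg(gps["GPSLatitude"], gps["GPSLatitudeRef"])
    lon = to_deg(gps["GPSLongitude"], gps["GPSLongitudeRef"])
    heading = float(gps.get("GPSImgDirection", float("nan")))
    return lat, lon, heading
```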
Accuracy Transfer
The motivation to generate better foundational accuracy for data collected with smartphones is to help facilitate accuracy transfer. A big reason roads and buildings are so accurate is that their real-world position is derived from aerial and satellite imagery, which has rigorous orthorectification and absolute accuracy specifications. Unfortunately, this georectification is totally lacking for “places”. We can potentially bridge this gap with street-level imagery and photos of places by reframing the georectification problem for vector map data in a novel way. In this process we take the improved accuracy of a smartphone’s position and pose when it takes a photo, and transfer that accuracy to the vector map data (e.g. places) we generate from those photos.
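As a rough illustration of the transfer step, the sketch below projects a feature spotted in a photo onto the map using only the camera’s position and heading plus a range estimate. The function name and inputs are hypothetical; the point is that any error in the camera’s position and pose flows straight into the derived place coordinate, which is why the foundational accuracy matters.

```python
from math import asin, atan2, cos, degrees, radians, sin

EARTH_RADIUS_M = 6_371_000.0

def transfer_position(cam_lat, cam_lon, cam_heading_deg, pixel_offset_deg, range_m):
    """Project a feature seen in a photo onto the map.

    cam_lat/cam_lon/cam_heading_deg come from the smartphone's (improved)
    position and pose; pixel_offset_deg is the feature's horizontal angle
    from the image center (from camera intrinsics); range_m is an estimated
    distance to the feature. Uses the standard great-circle destination formula."""
    bearing = radians((cam_heading_deg + pixel_offset_deg) % 360.0)
    phi1, lam1 = radians(cam_lat), radians(cam_lon)
    d = range_m / EARTH_RADIUS_M  # angular distance on the sphere
    phi2 = asin(sin(phi1) * cos(d) + cos(phi1) * sin(d) * cos(bearing))
    lam2 = lam1 + atan2(sin(bearing) * sin(d) * cos(phi1),
                        cos(d) - sin(phi1) * sin(phi2))
    return degrees(phi2), degrees(lam2)
```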
Traditionally, the mapping of street-level imagery has been done with “structure from motion” (SfM) pipelines that turn 2D photos into 3D point clouds that can then be geo-registered (RIP Pixel8earth). This approach has been popular for creating 3D feature databases to power city-scale augmented reality (AR). The problem is that these 3D feature databases for AR don’t play particularly well with vector maps or the data we’d like to feed an LLM.
Geographically Anchored Visual Data
Instead of point clouds derived from images, we want geographically anchored visual data. For Spatial AI, we only need to anchor the contents of an image in the real world, rather than the centimeter-scale feature geometries computed in an SfM process. While an SfM reconstruction can be aligned geographically to let us back out the position and orientation of each image, we’ve found it to be an unnecessarily heavy lift. At its best, this approach can decompose the pixels into a precise 3D geographic representation of what’s in the picture. However, doing this with crowdsourced images taken with different cameras, in different conditions, at different times can be quite brittle.
Precise absolute position and pose from a smartphone allow us to avoid SfM complexity and computational cost when triangulating map features, and to avoid the VPS battery drain when localizing against the map on device. The core of this process was covered in our relocalization blog posts, but there is an additional bundle adjustment step in order to do fun things like map entrances. This approach also lets us explicitly link visual data with vector map data through GERS IDs, and better conflate places with buildings in the process. Below is an example of work we’ve done in our test area to geographically anchor images while conflating their context with Overture places and buildings through GERS.
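As a simplified sketch of the triangulation behind that kind of anchoring (not our actual pipeline, which adds the bundle adjustment step above), the code below intersects bearing rays from two anchored images in a local east-north frame to position a feature such as an entrance. The coordinates are illustrative, as is the GERS linkage noted in the final comment.

```python
import math

def ray_intersection_enu(p1, brg1_deg, p2, brg2_deg):
    """Intersect two bearing rays in a local east-north plane (meters).

    p1 and p2 are (east, north) camera positions from two anchored images;
    the bearings are compass headings to the same feature. Returns the
    feature's (east, north) position, or None if the rays are parallel."""
    # Compass bearing -> unit direction vector in (east, north).
    d1 = (math.sin(math.radians(brg1_deg)), math.cos(math.radians(brg1_deg)))
    d2 = (math.sin(math.radians(brg2_deg)), math.cos(math.radians(brg2_deg)))
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None  # parallel rays: no well-defined intersection
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t = (dx * d2[1] - dy * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

# Two photos of the same entrance, taken ~20 m apart along a street:
entrance = ray_intersection_enu((0.0, 0.0), 45.0, (20.0, 0.0), 315.0)
# entrance ≈ (10.0, 10.0); the triangulated point can then be attached to
# the matching building/place record via its GERS ID.
```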
Conclusion
In the buzz surrounding generative AI we often get focused on foundational models and our need for one specific to the geospatial domain. While this is critical work, I think we are missing the equally important piece of “context engineering” needed to geospatially enable existing LLMs. Andrej Karpathy summarizes this process of productionizing LLM capabilities well.
For the geospatial use case the challenge multiplies because, in addition to the standard LLM problems, you are also bridging spatial operations with linguistic relationships. The key to executing this well is accurate and aligned data coming from both the device and the LLM/RAG. Stay tuned for an MCP to help facilitate this goal, and for more content on how we can better geographically anchor visual data.