We just exhibited at PyTorch Conference 2025. So many developers came by eager to try NexaSDK, curious about how we're making local inference practical across NPU, GPU, and CPU. You could see the idea of "run any model on any device" really clicking. We were featured at both the Qualcomm and AMD booths, showing our latest advancements:
1. Nexa Profiling Tool for fine-grained performance insights on NPU models
2. Fully NPU-based agentic RAG pipelines
3. NPU-accelerated image generation
All powered by one unified runtime. It was great connecting with partners, open-source maintainers, and teams across the stack who share the same goal: bringing AI closer to the device. Thanks to PyTorch, Qualcomm, and AMD for creating a space where the next generation of AI developers can build and learn together. Chun-Po Chang, Madhura Chatterjee, Nick Debeurre, Neel Kishan, Mark Zhong, Victoria Godsoe
Nexa AI
Software Development
Cupertino, California · 5,500 followers
On-Device AI Deployment and Research | NexaSDK: github.com/NexaAI/nexa-sdk | Hyperlink App: https://hyperlink.nexa.ai/
About us
Nexa AI is an on-device AI deployment and research company. We craft optimized foundation models and an on-device inference framework that runs any model on any device, across any backend, within minutes. Our mission is to make on-device AI friction-free and production-ready.
- Website: https://nexa.ai/
- Industry: Software Development
- Company size: 11-50 employees
- Headquarters: Cupertino, California
- Type: Privately Held
- Founded: 2023
Locations
- Primary: Cupertino, California 95014, US
Updates
-
NexaSDK Python Binding is here. Getting started with NPU inference for SOTA GenAI is now as easy as: pip install nexaai. Run local inference on the Qualcomm Hexagon NPU for LLMs, VLMs, Embeddings, ASR, and Rerankers, all directly from Python. We've released full Python bindings and an example Jupyter notebook so anyone can build on-device AI projects in Python: RAG systems, chatbots, or full agentic workflows, all powered by the NexaML engine. Jupyter Notebook: https://lnkd.in/gRNaDcG2 Docs: https://lnkd.in/gQRxiMTV Manoj Khilnani, Chun-Po Chang, Dr. Vinesh Sukumar, Srinivasa Deevi, Devang Aggarwal, Madhura Chatterjee, Neeraj Pramod, Heeseon Lim, Justin Lee
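For a feel of the workflow, here is a minimal sketch of loading a model and generating text with the new bindings. The import path, class, and method names below (nexaai, LLM, create, generate) are illustrative assumptions, not the confirmed API; the linked notebook and docs show the real surface.

```python
# Minimal sketch of local NPU inference via the nexaai Python bindings.
# NOTE: the names used here (nexaai.LLM, create, generate, device="npu")
# are illustrative assumptions; check the official docs/notebook for
# the actual API.
from nexaai import LLM  # hypothetical import path

# Load an NPU-ready model by its hub identifier.
llm = LLM.create("NexaAI/phi4-mini-npu-turbo", device="npu")

# Generate a completion fully on-device; no network calls involved.
print(llm.generate("Summarize on-device AI in one sentence."))
```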
-
The best on-device multimodal model just came to Snapdragon devices. Qwen3-VL, the latest release from Qwen, now runs locally on the Qualcomm Oryon CPU, Adreno GPU, and, most importantly, the Hexagon NPU using NexaSDK, powered by NexaML, the first and currently only framework to support Qwen3-VL on CPU, GPU, or NPU. Qwen3-VL brings state-of-the-art multimodal reasoning (understanding images, text, and layouts together) to local devices. With Nexa, Snapdragon devices can now power visual computer-use agents, intelligent OCR, and visually context-aware assistants fully locally: no cloud, ultra-low latency, and optimized for battery efficiency. Run it with one line in NexaSDK: nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU Demo below. Star NexaSDK for more NPU-first model releases. Manoj Khilnani, Chun-Po Chang, Dr. Vinesh Sukumar, Srinivasa Deevi, Devang Aggarwal, Madhura Chatterjee, Neeraj Pramod, Heeseon Lim, Justin Lee
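As a rough companion to the CLI one-liner above, a Python sketch of the same multimodal inference might look like this. The VLM class and its argument names are assumptions for illustration, not the confirmed binding API.

```python
# Sketch: asking Qwen3-VL about a local image, fully on-device.
# The VLM class and the create()/generate() signatures are assumed
# for illustration; consult the NexaSDK docs for the real Python API.
from nexaai import VLM  # hypothetical import path

vlm = VLM.create("NexaAI/Qwen3-VL-4B-Instruct-NPU", device="npu")

# Pass an image path alongside the text prompt (argument name assumed).
answer = vlm.generate(
    "What text appears in this screenshot?",
    images=["screenshot.png"],
)
print(answer)
```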
-
🔥 LFM2-1.2B just got a major speed boost: 52 tokens/sec on Snapdragon X Elite. We've optimized Liquid AI's hybrid LFM2-1.2B with our NexaML Turbo Engine, achieving real-time inference fully on the Qualcomm Hexagon NPU. LFM2's new multiplicative-gate + convolution architecture isn't trivial to run; it demanded hardware-aware graph optimization. NexaML Turbo squeezes every bit of NPU performance for faster, smoother on-device AI. This update shows what happens when great model design meets a purpose-built inference engine. Thrilled to be partnering with Liquid AI, and even more excited for what's next. Ramin Hasani, Mathias Lechner, Alexander Amini, Daniela Rus, Jeffrey Li, Manoj Khilnani, Chun-Po Chang
-
Phi-4-mini, Microsoft's newest 3.8B-parameter model with partial RoPE, now runs fully on the Qualcomm Hexagon NPU for the first time, powered by the NexaML engine through NexaSDK. This brings a major AI performance lift to Snapdragon devices, delivering ~20 tokens/sec on Snapdragon X Elite, with richer reasoning and extended context, all directly on-device. Phi-4-mini isn't just small; it's clever. It packs reasoning, math, coding, and function-calling capabilities that rival much larger models into a fraction of the size. With NexaML's NPU-optimized runtime, developers can now build continuous, context-aware reasoning experiences that stay fully local. Run Phi-4-mini locally with one line: nexa infer NexaAI/phi4-mini-npu-turbo ⭐ us on GitHub for more NPU-first model releases. Manoj Khilnani, Chun-Po Chang, Dr. Vinesh Sukumar, Srinivasa Deevi, Devang Aggarwal, Madhura Chatterjee, Neeraj Pramod, Heeseon Lim, Justin Lee
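To make "continuous, context-aware reasoning" concrete, here is a rough sketch of a multi-turn local chat loop built on the Python bindings. The LLM class and generate() call are assumed names; treat this as pseudocode against the real API.

```python
# Sketch: a multi-turn, fully local chat loop with Phi-4-mini on the NPU.
# Class and method names (LLM.create, generate) are illustrative
# assumptions; the real binding API may differ.
from nexaai import LLM  # hypothetical import path

llm = LLM.create("NexaAI/phi4-mini-npu-turbo", device="npu")
history = []  # running conversation kept as plain prompt text

while True:
    user = input("you> ")
    if user in {"exit", "quit"}:
        break
    history.append(f"User: {user}")
    # Feed the accumulated context back in so replies stay context-aware.
    reply = llm.generate("\n".join(history) + "\nAssistant:")
    history.append(f"Assistant: {reply}")
    print(f"phi-4-mini> {reply}")
```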
-
LFM2-1.2B models from Liquid AI are now running on the Qualcomm Hexagon NPU in NexaSDK, powered by the NexaML engine. Four new edge-ready variants:
- LFM2-1.2B: general chat and reasoning
- LFM2-1.2B-RAG: retrieval-augmented local chat
- LFM2-1.2B-Tool: structured tool calling and agent workflows
- LFM2-1.2B-Extract: ultra-fast document parsing
LFM2 is a brand-new hybrid architecture combining transformer and SSM components. Most inference frameworks can't even run it yet. NexaML can. That means these models now run fully accelerated on Qualcomm Hexagon NPUs, hitting real-time speeds with tiny memory footprints for popular edge-intelligence tasks, perfect for phones, wearables, and other edge devices. We're already working with customers like Brilliant Labs on what this unlocks next in AR/VR glasses. Model link in comments. And if you want to follow new model drops, star NexaSDK; it helps us deliver faster! Manoj Khilnani, Chun-Po Chang, Dr. Vinesh Sukumar, Srinivasa Deevi, Devang Aggarwal, Madhura Chatterjee, Neeraj Pramod, Bobak Tavangar, Heeseon Lim, Justin Lee
-
Nexa AI is now SOC 2 Type 2 certified! Dual-layer enterprise-level security by design: - Device Layer: AI runs 100% locally on your hardware. Zero data collection, zero cloud dependency. - Organization Layer: SOC 2 Type 2 certified operations ensuring enterprise-grade security controls, monitoring, and compliance. Local intelligence. Enterprise trust.
-
NVIDIA sent us a 5090 so we could demo Qwen3-VL 4B & 8B GGUF. You can now run it in our desktop UI, Hyperlink, powered by the NexaML Engine, the first and only framework that supports Qwen3-VL GGUF right now. We tried the same demo examples from the Qwen2.5-32B blog; the new Qwen3-VL 4B & 8B are insane. Benchmarks on RTX 5090 (Q4):
- Qwen3-VL-8B: 187 tok/s, ~8 GB VRAM
- Qwen3-VL-4B: 267 tok/s, ~6 GB VRAM
Thanks, NVIDIA and Qwen: local multimodal just went beast mode. More optimizations are coming. Run it yourself in Hyperlink: one-click install, fully local, beautiful UI. What interesting Qwen3-VL use cases will you discover? Download link below.