We just exhibited at PyTorch Conference 2025. So many developers came by eager to try NexaSDK, curious about how we're making local inference practical across NPU, GPU, and CPU. You could see the idea of "run any model on any device" really clicking. We were featured at both the Qualcomm and AMD booths, showing our latest advancements:
1. Nexa Profiling Tool for fine-grained performance insights on NPU models
2. Fully NPU-based agentic RAG pipelines
3. NPU-accelerated image generation
All powered by one unified runtime. It was great connecting with partners, open-source maintainers, and teams across the stack who share the same goal: bringing AI closer to the device. Thanks to PyTorch, Qualcomm, and AMD for creating a space where the next generation of AI developers can build and learn together. Chun-Po Chang, Madhura Chatterjee, Nick Debeurre, Neel Kishan, Mark Zhong, Victoria Godsoe
Nexa AI
Software Development
Cupertino, California · 5,500 followers
On-Device AI Deployment and Research | NexaSDK: github.com/NexaAI/nexa-sdk | Hyperlink App: https://hyperlink.nexa.ai/
About us
Nexa AI is an on-device AI deployment and research company. We craft optimized foundation models and an on-device inference framework that runs any model on any device, across any backend, within minutes. Our mission is to make on-device AI friction-free and production-ready.
- Website: https://nexa.ai/
- Industry: Software Development
- Company size: 11-50 employees
- Headquarters: Cupertino, California
- Type: Privately Held
- Founded: 2023
Locations
- Primary: Cupertino, California 95014, US
Updates
-
NexaSDK Python Binding is here. Getting started with NPU inference for SOTA GenAI is now as easy as: pip install nexaai. Run local inference on the Qualcomm Hexagon NPU for LLMs, VLMs, Embeddings, ASR, and Rerankers, all directly from Python. We've released full Python bindings and an example Jupyter notebook so anyone can build on-device AI projects in Python: RAG systems, chatbots, or full agentic workflows, all powered by the NexaML engine. Jupyter Notebook: https://lnkd.in/gRNaDcG2 Docs: https://lnkd.in/gQRxiMTV Manoj Khilnani, Chun-Po Chang, Dr. Vinesh Sukumar, Srinivasa Deevi, Devang Aggarwal, Madhura Chatterjee, Neeraj Pramod, Heeseon Lim, Justin Lee
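For a feel of the workflow, here is a minimal sketch of loading a model and generating text with the new bindings. The import path, class, and method names below (nexaai, LLM, create, generate) are illustrative assumptions, not the confirmed API; the linked notebook and docs show the real surface.

```python
# Minimal sketch of local NPU inference via the nexaai Python bindings.
# NOTE: the names used here (nexaai.LLM, create, generate, device="npu")
# are illustrative assumptions; check the official docs/notebook for
# the actual API.
from nexaai import LLM  # hypothetical import path

# Load an NPU-ready model by its hub identifier.
llm = LLM.create("NexaAI/phi4-mini-npu-turbo", device="npu")

# Generate a completion fully on-device; no network calls involved.
print(llm.generate("Summarize on-device AI in one sentence."))
```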
-
The best on-device multimodal model just came to Snapdragon devices. Qwen3-VL, the latest release from Qwen, now runs locally on the Qualcomm Oryon CPU, Adreno GPU, and, most importantly, the Hexagon NPU using NexaSDK, powered by NexaML, the first and currently only framework to support Qwen3-VL on CPU, GPU, or NPU. Qwen3-VL brings state-of-the-art multimodal reasoning (understanding images, text, and layouts together) to local devices. With Nexa, Snapdragon devices can now power visual computer-use agents, intelligent OCR, and visually context-aware assistants fully locally: no cloud, ultra-low latency, and optimized for battery efficiency. Run it with one line in NexaSDK: nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU Demo below. Star NexaSDK for more NPU-first model releases. Manoj Khilnani, Chun-Po Chang, Dr. Vinesh Sukumar, Srinivasa Deevi, Devang Aggarwal, Madhura Chatterjee, Neeraj Pramod, Heeseon Lim, Justin Lee
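As a rough companion to the CLI one-liner above, a Python sketch of the same multimodal inference might look like this. The VLM class and its argument names are assumptions for illustration, not the confirmed binding API.

```python
# Sketch: asking Qwen3-VL about a local image, fully on-device.
# The VLM class and the create()/generate() signatures are assumed
# for illustration; consult the NexaSDK docs for the real Python API.
from nexaai import VLM  # hypothetical import path

vlm = VLM.create("NexaAI/Qwen3-VL-4B-Instruct-NPU", device="npu")

# Pass an image path alongside the text prompt (argument name assumed).
answer = vlm.generate(
    "What text appears in this screenshot?",
    images=["screenshot.png"],
)
print(answer)
```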
-
🔥 LFM2-1.2B just got a major speed boost: 52 tokens/sec on Snapdragon X Elite. We've optimized Liquid AI's hybrid LFM2-1.2B with our NexaML Turbo Engine, achieving real-time inference fully on the Qualcomm Hexagon NPU. LFM2's new multiplicative-gate + convolution architecture isn't trivial to run; it demanded hardware-aware graph optimization. NexaML Turbo squeezes every bit of NPU performance for faster, smoother on-device AI. This update shows what happens when great model design meets a purpose-built inference engine. Thrilled to be partnering with Liquid AI, and even more excited for what's next. Ramin Hasani, Mathias Lechner, Alexander Amini, Daniela Rus, Jeffrey Li, Manoj Khilnani, Chun-Po Chang
-
Phi-4-mini, Microsoft's newest 3.8B-parameter model with partial RoPE, now runs fully on the Qualcomm Hexagon NPU for the first time, powered by the NexaML engine through NexaSDK. This brings a major AI performance lift to Snapdragon devices, delivering ~20 tokens/sec on Snapdragon X Elite, with richer reasoning and extended context, all directly on-device. Phi-4-mini isn't just small; it's clever. It packs reasoning, math, coding, and function-calling capabilities that rival much larger models into a fraction of the size. With NexaML's NPU-optimized runtime, developers can now build continuous, context-aware reasoning experiences that stay fully local. Run Phi-4-mini locally with one line: nexa infer NexaAI/phi4-mini-npu-turbo ⭐ us on GitHub for more NPU-first model releases. Manoj Khilnani, Chun-Po Chang, Dr. Vinesh Sukumar, Srinivasa Deevi, Devang Aggarwal, Madhura Chatterjee, Neeraj Pramod, Heeseon Lim, Justin Lee
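To make "continuous, context-aware reasoning" concrete, here is a rough sketch of a multi-turn local chat loop built on the Python bindings. The LLM class and generate() call are assumed names; treat this as pseudocode against the real API.

```python
# Sketch: a multi-turn, fully local chat loop with Phi-4-mini on the NPU.
# Class and method names (LLM.create, generate) are illustrative
# assumptions; the real binding API may differ.
from nexaai import LLM  # hypothetical import path

llm = LLM.create("NexaAI/phi4-mini-npu-turbo", device="npu")
history = []  # running conversation kept as plain prompt text

while True:
    user = input("you> ")
    if user in {"exit", "quit"}:
        break
    history.append(f"User: {user}")
    # Feed the accumulated context back in so replies stay context-aware.
    reply = llm.generate("\n".join(history) + "\nAssistant:")
    history.append(f"Assistant: {reply}")
    print(f"phi-4-mini> {reply}")
```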
-
LFM2-1.2B models from Liquid AI are now running on the Qualcomm Hexagon NPU in NexaSDK, powered by the NexaML engine. Four new edge-ready variants:
- LFM2-1.2B: general chat and reasoning
- LFM2-1.2B-RAG: retrieval-augmented local chat
- LFM2-1.2B-Tool: structured tool calling and agent workflows
- LFM2-1.2B-Extract: ultra-fast document parsing
LFM2 is a brand-new hybrid architecture combining transformer and SSM components. Most inference frameworks can't even run it yet. NexaML can. That means these models now run fully accelerated on Qualcomm Hexagon NPUs, hitting real-time speeds with tiny memory footprints for popular edge-intelligence tasks, perfect for phones, wearables, and other edge devices. We're already working with customers like Brilliant Labs on what this unlocks next in AR/VR glasses. Model link in comments. And if you want to follow new model drops, star NexaSDK; it helps us deliver faster! Manoj Khilnani, Chun-Po Chang, Dr. Vinesh Sukumar, Srinivasa Deevi, Devang Aggarwal, Madhura Chatterjee, Neeraj Pramod, Bobak Tavangar, Heeseon Lim, Justin Lee
-
Nexa AI is now SOC 2 Type 2 certified! Dual-layer enterprise-level security by design: - Device Layer: AI runs 100% locally on your hardware. Zero data collection, zero cloud dependency. - Organization Layer: SOC 2 Type 2 certified operations ensuring enterprise-grade security controls, monitoring, and compliance. Local intelligence. Enterprise trust.
-
NVIDIA sent us a 5090 so we could demo Qwen3-VL 4B & 8B GGUF. You can now run it in our desktop UI, Hyperlink, powered by the NexaML Engine, the first and only framework that supports Qwen3-VL GGUF right now. We tried the same demo examples from the Qwen2.5-32B blog; the new Qwen3-VL 4B & 8B are insane. Benchmarks on RTX 5090 (Q4):
- Qwen3-VL-8B: 187 tok/s, ~8 GB VRAM
- Qwen3-VL-4B: 267 tok/s, ~6 GB VRAM
Thanks, NVIDIA and Qwen: local multimodal just went beast mode. More optimizations are coming. Run it yourself in Hyperlink: one-click install, fully local, beautiful UI. What interesting Qwen3-VL use cases will you discover? Download link below.