🧐 Multimodal with just fine-tuning? ByteDance makes it happen with VoRA. The new Vision as LoRA (VoRA) paper introduces a bold, streamlined approach to building multimodal LLMs: no vision encoder, no connector, no architecture bloat. Just LoRA. Instead of bolting a vision tower onto an LLM, VoRA injects vision understanding directly into the LLM using Low-Rank Adaptation (LoRA). That means: 🔹 No new inference overhead: LoRA layers are merged into the LLM after training. 🔹 Frozen base LLM: Only LoRA + visual embeddings (~6M params) are trained, preserving language ability and ensuring stability. 🔹 Image inputs at native resolution: No resizing, no tiling hacks. VoRA leverages the LLM’s flexible token handling. 🔹 Bidirectional attention for vision: Instead of using causal masks across all tokens, VoRA allows vision tokens to attend freely, boosting context modeling. To teach the LLM visual features: 🔹 VoRA uses block-wise distillation from a pretrained ViT, aligning intermediate hidden states across layers. This improves visual alignment while keeping the LLM’s core untouched. 🔹 The training objective combines distillation loss (cosine similarity between ViT + LLM visual features) and standard language modeling loss over image-caption pairs. What does this get you? 🔹 A modality-agnostic architecture ready for extension to audio, point clouds, and beyond. This might be one of the most efficient takes yet on vision-language modeling. Excited to see how this evolves. 📄 Paper + Code: https://lnkd.in/gwZr33Vj Follow Aman Chadha and me for more updates!
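To make two of the claims in this post concrete, here is a minimal pure-Python sketch, not the VoRA code. It shows why merging a LoRA update into the base weight adds no inference cost (the merged matrix gives the same output as base plus adapter), and what a cosine-similarity distillation loss combined with a language-modeling loss could look like. The matrices, the `distill_weight` parameter, and averaging the distillation loss over blocks are all assumptions made for illustration.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def merge_lora(W, A, B, scale=1.0):
    """Fold the low-rank update into the base weight: W + scale * (B @ A)."""
    delta = [[sum(b * a for b, a in zip(b_row, a_col)) for a_col in zip(*A)]
             for b_row in B]
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def cosine_distill_loss(student, teacher):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norms = (math.sqrt(sum(s * s for s in student))
             * math.sqrt(sum(t * t for t in teacher)))
    return 1.0 - dot / norms

def total_loss(lm_loss, distill_losses, distill_weight=1.0):
    """LM loss plus the mean per-block distillation loss (weighting is assumed)."""
    return lm_loss + distill_weight * sum(distill_losses) / len(distill_losses)

# Toy values, invented for illustration only.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2x2)
A = [[0.5, -0.5]]             # LoRA down-projection, rank 1 (1x2)
B = [[1.0], [2.0]]            # LoRA up-projection, rank 1 (2x1)
x = [3.0, 1.0]

# During training: base path plus adapter path.
adapter_out = [w + d for w, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]
# After training: a single merged matrix, so no extra layers at inference.
merged_out = matvec(merge_lora(W, A, B), x)
```

Because `merged_out` matches `adapter_out`, the adapter can be discarded after merging, which is the sense in which LoRA adds no inference overhead.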
Multimodal AI Developments
-
-
The AI landscape has rapidly evolved beyond just large language models. Today’s systems rely on a wide range of foundational model types—each designed for specific modalities, tasks, and constraints. This visual covers 12 foundational AI models and their core workflows. This is intended for engineers, researchers, and builders who want a structured view of the ecosystem. Here’s a breakdown of what’s included: → LLM (Large Language Models) – GPT, LLaMA Trained using transformer architecture to generate coherent, human-like text. The workflow involves data collection, tokenization, pattern learning, fine-tuning, and deployment. → SLM (Small Language Models) – Phi, TinyLLaMA Lightweight and efficient for on-device or low-resource environments. Focuses on model compression, compact training, and benchmarking. → VLM (Vision-Language Models) – CLIP, Flamingo Learns joint understanding between images and text. Ideal for tasks like image captioning and visual QA. → MLLM (Multimodal Large Language Models) – Gemini Designed to process and align multiple modalities such as text, image, audio, and video. → LAM (Large Action Models) – RT-2, InstructDiffusion Generates sequences of executable actions using behavioral and reinforcement learning data. → LRM (Large Reasoning Models) – DeepSeek-R1 Structured for tool use, chain-of-thought reasoning, and test-time modularity in logic-heavy tasks. → MoE (Mixture of Experts) – Mixtral Activates a subset of specialized models per input to reduce computation cost and improve performance. → SSM (State Space Models) – Mamba, RetNet Efficient at long-context sequence modeling using dynamic systems and parallelism. → RNN (Recurrent Neural Networks) – LSTM, GRU Uses hidden states to process time-dependent data, maintaining memory across input sequences. → CNN (Convolutional Neural Networks) – EfficientNet Learns spatial patterns in image data via convolution layers, pooling, and hierarchical stacking. 
→ SAM (Segment Anything Model) – Meta Segments objects from images based on prompts (text, points, or boxes), making it useful for dynamic image understanding. → LNN (Liquid Neural Networks) – LFMs Leverages differential equations to adapt in real-time, supporting applications in time-sensitive environments. This chart is designed to help you understand not just what these models are, but how they work under the hood. If you're working in AI, this foundational understanding is crucial for making informed architectural decisions.
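As one illustration of the "how they work under the hood" point, the Mixture of Experts entry can be sketched in a few lines of Python. This is a toy top-k router under stated assumptions: the experts are plain functions, the gate logits are given rather than learned, and the names `moe_forward` and `k` are made up for this example. It is not how Mixtral or any specific model implements routing.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_logits, experts, k=2):
    """Run only the k highest-scoring experts and mix their outputs."""
    # Pick the indices of the k largest gate logits (ties keep index order).
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Renormalize the gate scores over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top

# Four toy "experts"; only two of them run for any given input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
out, used = moe_forward(3.0, [0.1, 2.0, -1.0, 2.0], experts, k=2)
```

The compute saving comes from the fact that the two unselected experts are never called, even though all four contribute capacity to the model.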
-
🔧 AI agents are taking off. But we may be building them all wrong. NVIDIA’s latest research suggests we’ve been scaling agents inefficiently: ➡️ It’s not large language models (LLMs) that will scale agentic AI. ➡️ It’s Small Language Models (SLMs) — compact, local, and radically cheaper. That insight forced me to stop and rethink everything. I’ve seen too many teams build agents that call GPT-4… for everything. Even for basic, predictable tasks like: → Formatting JSON → Extracting a few values → Generating API calls Why? Because it’s easy. But it’s also wasteful. We're burning compute — and budgets — on jobs that don’t need a genius to do them. 🔍 NVIDIA’s findings are a wake-up call: ⚡SLMs like Phi-3 and DeepSeek-7B are crushing older LLMs ⚙️ Toolformer (6.7B) outperformed GPT-3 (175B) 🧠 DeepSeek-7B beat GPT-4o on reasoning 📉 40–70% of LLM calls can already be swapped for SLMs And the upside? ✅ 10–30x cheaper inference ✅ No GPUs, no clusters — run on laptops ✅ Fine-tune overnight (LoRA/QLoRA) ✅ Less hallucination, better structure ✅ More modular and scalable system design 🛠 What this means for us: The AI industry has poured billions into LLM infrastructure. But that may soon feel like building a spaceship to cross the street. I’m rethinking my own approach: → Start with SLMs for agent sub-tasks → Only fall back to LLMs when truly necessary → Embrace modular, specialized design Because here’s the truth: Bigger isn’t always better. Smaller is often smarter. Curious to hear your take: Are we finally reaching the post-LLM era for agent design? 🔗 Full paper > https://zurl.co/kCwWU #AI #AgenticAI #SLMs #Automation #FutureOfWork #NVIDIA #LLMs #AIEngineering #CostEfficiency #AIArchitecture
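The "start with SLMs, fall back to LLMs" idea can be sketched as a small router. This is a hedged illustration only: `fake_slm` and `fake_llm` are stand-in stubs, not real model clients, and using "does the output parse as JSON" as the acceptance check is an assumption chosen because the post mentions JSON formatting as a typical sub-task.

```python
import json

def is_valid_json(text):
    """Acceptance check: does the model output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False

def route(task, small_model, large_model, is_valid):
    """Try the small model first; escalate only if its output fails the check."""
    draft = small_model(task)
    if is_valid(draft):
        return draft, "slm"
    return large_model(task), "llm"

# Stub models, invented for illustration. Real code would call actual endpoints.
def fake_slm(task):
    return '{"city": "Paris"}' if "extract" in task else "not sure"

def fake_llm(task):
    return '{"answer": "detailed"}'

easy_result, easy_source = route("extract the city", fake_slm, fake_llm, is_valid_json)
hard_result, hard_source = route("plan a migration", fake_slm, fake_llm, is_valid_json)
```

The design choice worth noting is that escalation is driven by a cheap, deterministic check on the output rather than by guessing task difficulty up front.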
-
I've been saying for over a year that multimodal large language models will become the ultimate interface between physicians and a range of AI-based solutions. Here is the proof! In this study, the authors developed and evaluated an autonomous clinical AI agent leveraging GPT-4 with multimodal precision oncology tools to support personalized clinical decision-making. They used multiple sources such as histopathology slides, radiological images and search tools like OncoKB, PubMed and Google. "Evaluated on 20 realistic multimodal patient cases, the AI agent autonomously used appropriate tools with 87.5% accuracy, reached correct clinical conclusions in 91.0% of cases and accurately cited relevant oncology guidelines 75.5% of the time. Compared to GPT-4 alone, the integrated AI agent drastically improved decision-making accuracy from 30.3% to 87.2%." Source: https://lnkd.in/dwjGvxcH
-
Multimodal AI is shaping a shift in healthcare by combining different kinds of patient data to improve care across diagnostics, treatment, and monitoring. 1️⃣ It links data from imaging, wearables, clinical notes, genomics, and more to create a fuller picture of patient health. 2️⃣ Imaging, physiological signals, and clinical notes are the most commonly used data types, especially in oncology, cardiovascular, and neurological disorders. 3️⃣ Intermediate fusion is the most used integration method, combining data at the feature level for better balance between complexity and interpretability. 4️⃣ These systems enable early diagnosis, prognosis, treatment planning, and real-time monitoring, with growing applications in areas like digital twins and automated reporting. 5️⃣ Personalized medicine is a major driver, with multimodal models supporting tailored treatment decisions by analyzing combined molecular, physiological, and behavioral data. 6️⃣ Despite progress, challenges remain: data heterogeneity, privacy concerns, lack of benchmarks, and regulatory constraints slow adoption. 7️⃣ Explainability is key for clinical trust. Emerging models include attention maps, concept attribution, and human-in-the-loop feedback for better transparency. 8️⃣ Energy demands of training large models have sparked interest in "green AI", focusing on efficiency and scalability in clinical settings. 9️⃣ Future systems may rely more on self-supervised and federated learning to handle data gaps and maintain privacy across institutions. 🔟 Clinical validation and regulatory reform are needed for multimodal systems to move from labs into widespread practice. ✍🏻 Florenc Demrozi, Mina Farmanbar, Kjersti Engan. Multimodal AI for Next-Generation Healthcare: Data Domains, Algorithms, Challenges, and Future Perspectives. Current Opinion in Biomedical Engineering. 2025. DOI: 10.1016/j.cobme.2025.100632 (pre-proof)
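Point 3 above, intermediate fusion, is easiest to see in code. The sketch below is a toy, not taken from the cited review: the two "encoders" are hand-written summary statistics, and the function names and input values are invented. What it does show is the defining step, which is that each modality is first turned into features and those features are then joined before a shared head sees them.

```python
def encode_imaging(pixels):
    """Toy imaging encoder: mean and max intensity."""
    return [sum(pixels) / len(pixels), max(pixels)]

def encode_signal(samples):
    """Toy physiological-signal encoder: mean and range."""
    return [sum(samples) / len(samples), max(samples) - min(samples)]

def fuse_intermediate(*feature_vectors):
    """Feature-level fusion: concatenate per-modality features into one vector."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused

def joint_head(features, weights, bias=0.0):
    """A single linear head operating on the fused features."""
    return sum(f * w for f, w in zip(features, weights)) + bias

# Invented inputs for illustration only.
image_features = encode_imaging([1.0, 2.0, 3.0])
signal_features = encode_signal([60.0, 70.0, 80.0])
fused = fuse_intermediate(image_features, signal_features)
score = joint_head(fused, [1.0, 0.0, 0.0, 0.5])
```

By contrast, early fusion would concatenate the raw inputs before any encoding, and late fusion would combine separate per-modality predictions after the fact.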
-
Enterprises today are drowning in multimodal data - text, images, audio, video, time-series, and more. Large multimodal LLMs promise to make sense of this, but in practice, embeddings alone often collapse nuance and context. You get fluency without grounding, answers without reasoning, “black boxes” where transparency matters most. That’s why the new IEEE paper “Building Multimodal Knowledge Graphs: Automation for Enterprise Integration” by Ritvik G, Joey Yip, Revathy Venkataramanan, and Dr. Amit Sheth really resonates with me. Instead of forcing LLMs to carry the entire cognitive burden, their framework shows how automated Multi Modal Knowledge Graphs (MMKGs) can bring structure, semantics, and provenance into the picture. What excites me most is the way the authors combine two forces that usually live apart. On one side, bottom-up context extraction - pulling meaning directly from raw multimodal data like text, images, and audio. On the other, top-down schema refinement - bringing in structure, rules, and enterprise-specific ontologies. Together, this creates a feedback loop between emergence and design: the graph learns from the data but also stays grounded in organizational needs. And this isn’t just theoretical elegance. In their Nourich case study, the framework shows how a food image, ingredient list, and dietary guidelines can be linked into a multimodal knowledge graph that actually reasons about whether a recipe is suitable for a diabetic vegetarian diet - and then suggests structured modifications. That’s enterprise relevance in action. To me, this signals a bigger shift: LLMs alone won’t carry enterprise AI into the future. The future is neurosymbolic, multimodal, and automated. Enterprises that invest in these hybrid architectures will unlock explainability, scale, and trust in ways current “all-LLM” strategies simply cannot. 
Link to the paper -> https://lnkd.in/gv93znbQ #KnowledgeGraphs #MultimodalAI #NeurosymbolicAI #EnterpriseAI #KnowledgeGraphLifecycle #MMKG #AIResearch #Automation #EnterpriseIntegration
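A rough feel for why a graph helps with the diet example: once ingredients and dietary rules are explicit triples, a suitability check becomes a traceable graph walk rather than an opaque model output. The sketch below is an assumption-heavy toy, not the paper's framework. Every entity, predicate name, and the `violations` helper are invented for illustration, and the multimodal extraction step (getting ingredients out of an image) is skipped entirely.

```python
# (subject, predicate, object) triples. All entries are invented examples.
TRIPLES = {
    ("paneer_curry", "has_ingredient", "paneer"),
    ("paneer_curry", "has_ingredient", "sugar"),
    ("paneer", "is_a", "dairy"),
    ("sugar", "is_a", "added_sugar"),
    ("chicken", "is_a", "meat"),
    ("vegetarian", "excludes", "meat"),
    ("diabetic", "excludes", "added_sugar"),
}

def objects(subject, predicate):
    """All objects linked to a subject by a given predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def violations(recipe, diets):
    """Return (ingredient, category, diet) paths that break a dietary rule."""
    found = []
    for ingredient in sorted(objects(recipe, "has_ingredient")):
        for category in sorted(objects(ingredient, "is_a")):
            for diet in diets:
                if category in objects(diet, "excludes"):
                    found.append((ingredient, category, diet))
    return found

problems = violations("paneer_curry", ["diabetic", "vegetarian"])
```

Each returned tuple is itself the explanation: it names the ingredient, the category it belongs to, and the diet that excludes it, which is the kind of provenance the post contrasts with embedding-only approaches.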
-
I consider prompting techniques some of the lowest-hanging fruit for achieving a step-change improvement in model performance. This isn’t to say that “typing better instructions” is that simple. As a matter of fact, it can be quite complex. Prompting has evolved into a full discipline with frameworks, reasoning methods, multimodal techniques, and role-based structures that dramatically change how models think, plan, analyse, and create. This guide breaks down every major prompting category you need to build powerful, reliable, and structured AI workflows: 1️⃣ Core Prompting Techniques The foundational methods include few-shot, zero-shot, one-shot, and style prompts. They teach the model patterns, tone, and structure. 2️⃣ Reasoning-Enhancing Techniques Approaches like Chain-of-Thought, Graph-of-Thought, ReAct, and Deliberate prompting help LLMs reason more clearly, avoid shortcuts, and solve complex tasks step-by-step. 3️⃣ Instruction & Role-Based Prompting Define the task clearly or assign the model a “role” such as planner, analyst, engineer, or teacher to get more predictable, domain-focused outputs. 4️⃣ Prompt Composition Techniques Methods like prompt chaining, meta-prompting, dynamic variables, and templates help you build multi-step, modular workflows used in real agent systems. 5️⃣ Tool-Augmented Prompting Combine prompts with vector search, retrieval (RAG), planners, executors, or agent-style instructions to turn LLMs into decision-making systems rather than passive responders. 6️⃣ Optimization & Safety Techniques Guardrails, verification prompts, bias checks, and error-correction prompts improve reliability, factual accuracy, and trustworthiness. These are essential for production systems. 7️⃣ Creativity-Enhancing Techniques Analogy prompts, divergent prompts, story prompts, and spatial diagrams unlock creative reasoning, exploration, and alternative problem-solving paths. 
8️⃣ Multimodal Prompting Use images, audio, video, transcripts, diagrams, code, or mixed-media prompts (text + JSON + tables) to build richer and more intelligent multimodal workflows. Modern prompting has evolved into designing thinking systems. When you combine reasoning techniques, structured instructions, memory, tools, and multimodal inputs, you unlock a level of performance that can avoid costly fine-tuning. What best practices have you used when designing prompts for your LLM? #LLM
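Two of the categories above, few-shot prompting and prompt chaining, are mechanical enough to sketch. The example below is a minimal illustration under assumptions: `echo_model` is a stub that wraps its input so the chaining is visible, and the function names, templates, and example reviews are all invented. A real workflow would pass the prompts to an actual model client.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Review: {text}\nSentiment: {label}")
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)

def run_chain(templates, model, initial_input):
    """Prompt chaining: feed each step's output into the next step's template."""
    value = initial_input
    for template in templates:
        value = model(template.format(input=value))
    return value

# Stub model for illustration: it just wraps the prompt it receives.
def echo_model(prompt):
    return f"<{prompt}>"

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review.",
    [("The battery died in a day.", "negative"),
     ("Setup took two minutes.", "positive")],
    "The screen is gorgeous.",
)
chained = run_chain(["Summarize: {input}", "Translate to French: {input}"],
                    echo_model, "hello")
```

Keeping templates as data rather than hard-coded strings is what makes the chain modular: steps can be reordered or swapped without touching the runner.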
-
A child gathers more data in their first four years than all the text ever published online. That’s not just a fun stat. It highlights a core limitation in how modern AI is built. Most AI systems are trained on natural language data. They learn by extracting statistical patterns from language, not through embodied experience or real-world interaction. Compare that to how humans learn: → Multimodal sensory input processed in parallel → Continuous physical interaction with dynamic environments → Emotional and contextual feedback shaping understanding in real time Natural language is a compressed abstraction of experience. It encodes meaning, but strips away direct context, causality, and sensory nuance. That’s why language models excel at: Summarizing information at scale Extracting patterns from structured data Generating coherent, fluent responses …but often fail at: Grounding responses in real-world causality Navigating ambiguity or incomplete information Adapting to evolving, unstructured scenarios Even state-of-the-art models can: Confidently output factually incorrect information Misinterpret intent in natural instructions Break down when context isn’t explicitly encoded We’re training systems to imitate comprehension, using only the shadows of real experience. So what’s the next frontier? True progress in AI will require a leap beyond language: → Multisensory data (audio, video, spatial signals) → Embodied interaction → Context-aware models Language is an entry point. But if the goal is adaptive, human-like intelligence, grounded experience is essential.
-
For a long time, many companies built AI systems around a simple idea: choose the most powerful large language model available and use it across the entire workflow. One large model handling classification, summarization, routing, reasoning, and generation. What I am seeing now, especially going into 2026, is a clear architectural shift. Teams are moving away from the “one giant model does everything” approach. Instead, they are decomposing workflows and assigning different models to different layers of the system. Smaller, more specialized models are being used for well-defined tasks, while larger models are reserved for complex reasoning where their breadth actually matters. For those who are newer to this space, an SLM typically refers to a model in the 1B to 12B parameter range. These models are optimized for efficiency, lower latency, and narrower domains. They are not designed to replace frontier-scale models, but to handle specific tasks extremely well. There are two practical reasons why I believe 2026 will be a high-adoption year for SLMs: ✦ Cheaper, faster, and more customizable For tasks like classification, structured extraction, lightweight reasoning, or domain-specific summarization, a smaller model is often more than sufficient. It runs with lower latency, costs less to scale, and if it is open source, it can be fine-tuned and adapted to your internal data and workflows. That level of customization gives teams real control over performance and differentiation. ✦ On-device and edge intelligence As more AI moves closer to the user, on-device and edge inference become critical. Mobile assistants, IoT systems, and privacy-sensitive enterprise applications cannot always rely on sending every request to a large cloud model. Small models make local inference feasible, improving both responsiveness and privacy. Large models are still essential for open-ended reasoning and complex generation. But the most mature systems will not rely on a single model. 
They will be orchestrated systems, where each model is chosen based on what it is best at. Model size is no longer the strategy; architecture is.
-
LLM inference concepts (2/n): Vision language models are different from LLMs in that they have an image encoder to capture and represent image/video information as tokens. We can take that encoder and serve it on separate workers for multimodal serving. Multimodal inference has three different phases: 1. Encoder > This is where images are processed (e.g. ViT). It’s a one-shot, compute-bound stage with high variance. 2. Prefill > Text + embeddings are loaded into the model. Memory-bandwidth heavy, large matrix multiplications. 3. Decode > Tokens are generated one at a time. Long-lived, real-time, and memory-bound. As in my post yesterday, many VLM systems currently run encoder + prefill + decode on the same GPUs. This means that: > Encoder work blocks text requests → jittery latency > Text-only requests wait behind image jobs > Compute-bound encoder and memory-bound decode fight for the same hardware > You must scale all GPUs for rare multimodal spikes Disaggregating encoder inference from the other parts fixes this. By separating the visual encoder into its own service on separate GPUs: > Encoder runs in parallel with prefill/decode for other requests > Text-only requests completely bypass the encoder > The system becomes pipeline-parallel instead of serial This unlocks: > smoother latency > higher throughput > independent scaling per stage It also enables encoder output caching: > Common images are encoded once and reused > Cache hits skip the encoder entirely > TTFT drops and encoder load shrinks over time This nice post from vLLM dives into the implementation details 👇 - figure is from the LLaVA paper.
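The encoder caching and text-only bypass described above can be sketched in plain Python. This is a toy model of the idea, not vLLM's implementation: `EncoderCache`, `handle_request`, and `toy_encoder` are hypothetical names, the "embedding" is a made-up two-number summary, and keying the cache on a SHA-256 of the image bytes is an assumption about how one might identify repeated images.

```python
import hashlib

class EncoderCache:
    """Cache encoder outputs keyed by a content hash of the image bytes."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self.store:
            self.hits += 1    # cache hit: the encoder is skipped entirely
        else:
            self.misses += 1  # cache miss: encode once and remember the result
            self.store[key] = self.encode_fn(image_bytes)
        return self.store[key]

def handle_request(text, image_bytes, cache):
    """Text-only requests never touch the encoder or its cache."""
    embeddings = cache.get(image_bytes) if image_bytes is not None else None
    return {"text": text, "image_embeddings": embeddings}

# Stand-in for a real vision encoder, invented for illustration.
def toy_encoder(image_bytes):
    return [len(image_bytes), sum(image_bytes) % 256]

cache = EncoderCache(toy_encoder)
first = handle_request("describe this", b"\x01\x02\x03", cache)
repeat = handle_request("what color is it?", b"\x01\x02\x03", cache)
text_only = handle_request("hello", None, cache)
```

In a disaggregated deployment the `cache.get` call would be a request to the separate encoder service, so a hit saves both the encoder compute and the network round trip to it.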