𝐖𝐚𝐧𝐭 𝐭𝐨 𝐭𝐫𝐚𝐢𝐧 𝐲𝐨𝐮𝐫 𝐨𝐰𝐧 𝐋𝐋𝐌 𝐚𝐠𝐞𝐧𝐭𝐬 𝐰𝐢𝐭𝐡 𝐑𝐞𝐢𝐧𝐟𝐨𝐫𝐜𝐞𝐦𝐞𝐧𝐭 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠? 🤖✨ Meet 𝐀𝐠𝐞𝐧𝐭 𝐑𝐞𝐢𝐧𝐟𝐨𝐫𝐜𝐞𝐦𝐞𝐧𝐭 𝐓𝐫𝐚𝐢𝐧𝐞𝐫 (𝐀𝐑𝐓) – an open-source framework that makes training multi-step, reliable LLM agents a breeze. ART uses 𝐆𝐑𝐏𝐎 (𝐆𝐫𝐨𝐮𝐩 𝐑𝐞𝐥𝐚𝐭𝐢𝐯𝐞 𝐏𝐨𝐥𝐢𝐜𝐲 𝐎𝐩𝐭𝐢𝐦𝐢𝐳𝐚𝐭𝐢𝐨𝐧) to build robust, real-world agents—without the headache of writing manual reward functions. Instead, it leverages 𝐑𝐔𝐋𝐄𝐑 (𝐑𝐞𝐥𝐚𝐭𝐢𝐯𝐞 𝐔𝐧𝐢𝐯𝐞𝐫𝐬𝐚𝐥 𝐋𝐋𝐌-𝐄𝐥𝐢𝐜𝐢𝐭𝐞𝐝 𝐑𝐞𝐰𝐚𝐫𝐝𝐬), an LLM-based evaluator that assigns rewards automatically, eliminating tedious hand-crafted reward engineering.

Why 𝐀𝐑𝐓 is a game-changer:
• ⚡ 2–3x faster development – skip reward function engineering entirely
• 🧰 General-purpose – works across tasks with no modification needed
• 📈 Proven performance – matches or exceeds hand-crafted rewards in 3/4 benchmarks
• 🧩 Easy integration – drop-in replacement for manual reward functions
• 🤖 Model agnostic – works with Qwen, Llama, GPT-style LLMs, and more

And the best part? It’s 100% open source. 📌 Link to the GitHub repo is in the comments!

#ReinforcementLearning #LLMAgents #AITraining #OpenSourceAI #MachineLearning #AICommunity #ArtificialIntelligence
Train LLM Agents with ART: A Game-Changing Framework
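To make the RULER idea concrete, here is a minimal sketch of how an LLM judge can stand in for a hand-written reward function by scoring a group of rollouts relative to each other. This is not the ART API; `call_judge_llm`, `ruler_style_rewards`, and the prompt format are illustrative assumptions.

```python
import json

def call_judge_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to a judge LLM and return its raw text reply."""
    raise NotImplementedError

def ruler_style_rewards(task: str, trajectories: list[str]) -> list[float]:
    """Score each rollout relative to its peers in the same group (0.0 to 1.0)."""
    prompt = (
        f"Task: {task}\n\n"
        + "\n\n".join(f"[Trajectory {i}]\n{t}" for i, t in enumerate(trajectories))
        + "\n\nScore each trajectory from 0 to 1 relative to the others. "
          "Reply with only a JSON list of floats, one per trajectory."
    )
    scores = json.loads(call_judge_llm(prompt))
    assert len(scores) == len(trajectories), "judge returned the wrong number of scores"
    return [float(s) for s in scores]
```

The relative scores can then be fed to a GRPO-style trainer in place of a manually engineered reward function.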
🤖 One LLM, Two Agents: How MATPO Solves the Multi-Agent Memory Problem

New research introduces MATPO (Multi-Agent Tool-Integrated Policy Optimization)—a breakthrough RL method that trains planner AND worker agents inside a single LLM, avoiding the memory nightmare of deploying multiple models.

The Problem: Single-agent LLMs struggle with complex, multi-turn reasoning tasks because of limited context windows and noisy tool responses. The obvious fix? Deploy separate planner and worker agents. But that’s memory-intensive and expensive.

The MATPO Solution: Instead of running multiple LLM instances, MATPO uses role-specific prompts to create distinct planner and worker personalities within ONE model, then trains both via reinforcement learning with a principled credit assignment mechanism.

Performance Gains:
• +18.38% average relative improvement over single-agent baselines
• Tested on GAIA-text, WebWalkerQA, and FRAMES benchmarks
• Greater robustness to noisy tool outputs

Why This Matters: For AI agents handling knowledge-intensive tasks—web navigation, complex reasoning, research automation—this approach delivers multi-agent benefits (better context management, role specialization) without the infrastructure cost of deploying separate models.

Key Innovation: The credit assignment mechanism properly attributes rewards across planner and worker rollouts, enabling stable and efficient multi-agent RL training within a single model instance. This is especially relevant for production deployments where memory constraints and inference costs are real concerns. MATPO proves you can have your multi-agent cake and eat it too. 🎂

📄 Paper: arxiv.org/abs/2510.04678
💻 Code: github.com/mzf666/MATPO

#AI #MachineLearning #ReinforcementLearning #LLM #MultiAgent #AIAgents #Research #DeepLearning
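A rough illustration of the "two roles, one model" idea, not the MATPO codebase: both the planner and the worker are just different system prompts over the same underlying model (`generate`, the prompts, and `solve` are hypothetical).

```python
# Illustrative only: role-specific prompts carve planner and worker agents out of one model.
PLANNER_PROMPT = "You are the planner. Break the task into subtasks, one per line."
WORKER_PROMPT = "You are the worker. Solve the given subtask using the available tools."

def generate(system_prompt: str, user_message: str) -> str:
    """Hypothetical wrapper around a single shared LLM instance."""
    raise NotImplementedError

def solve(task: str) -> str:
    plan = generate(PLANNER_PROMPT, task)  # planner rollout
    results = [generate(WORKER_PROMPT, step) for step in plan.splitlines() if step.strip()]
    return generate(PLANNER_PROMPT, f"Task: {task}\nWorker results: {results}\nWrite the final answer.")
```

During training, the credit assignment mechanism described in the paper attributes the reward for the final answer back across both the planner and worker rollouts, so a single set of weights improves at both roles.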
Copy, Paste, Chaos 🔄

They copied AI-generated code from the browser, pasted it into the IDE… and broke the project. Version control? None. Testing? Zero. Frustration? 100%. 😅

Teaching integration is as important as teaching generation. AI is powerful inside the workflow — not floating outside it. Tools don’t replace process. They enhance it — if you let them.

#SoftwareWorkflow #AIintegration #BestPractices
In the first post, we introduced the idea of chunking and shared the 4 main techniques. Now in Part 2 of this series, here’s a visual breakdown of those techniques — designed to make it easier to compare them side by side:

🔹 Fixed-size Chunking → Splits text into equal parts (e.g., 500 tokens)
🔹 Semantic Chunking → Splits by meaning, preserving context
🔹 Recursive Chunking → Breaks text hierarchically (headings → sentences)
🔹 Sliding Window Chunking → Overlaps chunks for smoother flow

✅ These techniques remain at the core of RAG pipelines, chatbots, and document Q&A systems. Sometimes, a clear visual is the best way to understand the differences.

👉 Which of these techniques do you find most practical in real-world projects?
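For reference, a minimal word-based sketch of two of these techniques (fixed-size and sliding-window); production pipelines usually split on tokens rather than words and tune the sizes, so treat the numbers as placeholders.

```python
def fixed_size_chunks(text: str, size: int = 500) -> list[str]:
    """Split text into equal-sized chunks (word-based for simplicity)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def sliding_window_chunks(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Like fixed-size, but consecutive chunks share `overlap` words for smoother context."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```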
The Future Stack Is Probabilistic

Software used to be deterministic. Now it’s probabilistic, and most teams haven’t caught up.

For decades, software behaved like physics. Same input → same output. If it didn’t, you found the bug, fixed the logic, and moved on. That predictability was our safety net. It built entire industries of testing, QA, compliance, and reliability engineering. You could trust code because code was law.

But AI doesn’t play by those rules anymore. Same input → infinite plausible outputs. That single shift breaks the oldest contract between humans and machines: reliability. You can’t “debug” a probabilistic system because nothing’s broken. You can only constrain it. And that’s not a software mindset. That’s a systems mindset.

The next generation of engineering won’t be about writing code - it’ll be about orchestrating uncertainty. Testing frameworks will need to measure probabilities, not pass/fail. Version control will track behavior drift, not just code diffs. Product teams will ship models that evolve faster than their documentation can update.

In this new reality, the question isn’t “How do we control AI?” It’s: How much uncertainty can your system and your organization tolerate?

Thorsignia POV: In our work with large-scale AI deployments, we’ve seen it firsthand: the trust gap doesn’t come from model performance, it comes from human expectation. When teams apply deterministic thinking to probabilistic systems, friction follows. That’s where trust gets lost, and where great architecture must evolve. Because the real challenge isn’t technical. It’s psychological.

Takeaway: Deterministic thinking built the old web. Probabilistic thinking will build the intelligent one. The companies that adapt fastest won’t be those with the best models, but those who engineer for uncertainty.

How are you testing systems that can’t guarantee the same answer twice? Let’s compare notes.

#AIEngineering #SystemsDesign #ProbabilisticAI #Thorsignia
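One concrete way to "measure probabilities, not pass/fail" is to assert on a pass rate over repeated samples instead of a single output. A minimal sketch, assuming a hypothetical `generate_answer` wrapper around the system under test:

```python
def generate_answer(prompt: str) -> str:
    """Hypothetical wrapper around the non-deterministic system under test."""
    raise NotImplementedError

def test_refund_policy_pass_rate():
    """Sample the model several times and require a minimum success rate, not a single pass."""
    n, passes = 20, 0
    for _ in range(n):
        if "30 days" in generate_answer("What is our refund window?"):
            passes += 1
    assert passes / n >= 0.9, f"pass rate {passes / n:.0%} is below the 90% threshold"
```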
If you think your team isn't using LLMs to code, you're wrong 🤷♂️

So how should you approach hiring and managing your existing teams? If you are still running a purely manual code review process, you are doing your business a disservice. Every popular version control tool has integrations for "AI Code Reviews". These don't just speed up code reviews, they add guard rails for your team and code.

Here's what's important for me when it comes to managing the destructive potential of the current tool ecosystem:

𝐂𝐮𝐫𝐫𝐞𝐧𝐭 𝐓𝐞𝐚𝐦:
• Encourage training on the latest AI coding tools (Claude Code, OpenCode + GLM)
• Give your team direction by including markdown files with system prompts in your repositories.
• Identify common patterns of AI jank and keep improving the system prompt to mitigate them.

𝐇𝐢𝐫𝐢𝐧𝐠:
• Instead of testing raw technical capabilities, focus on testing critical thinking and share hypothetical scenarios that lean into the real-world problems your business is facing.
• Give the candidate an existing "vibe-coded" project and test their ability to clean it up.

These are my priorities at the moment when tackling the problems within the current climate.

#CodeReview #LLMs #AI #ClaudeCode #VibeCoding
The Art of Evaluating LLMs: A New Approach to Testing

What if we treated evaluating LLMs with the same rigor as traditional software testing? 🤔

As the adoption of LLMs continues to skyrocket, it’s crucial we establish tailored testing frameworks that not only assess performance but also ensure safety and reliability. I’ve been exploring the idea that evaluating an LLM shouldn’t be a one-time event but an ongoing process. Just like with any complex software, we need to engage in continuous dialogue, exploring edge cases and unexpected behaviors.

Think about it: what if we could enhance their performance by applying a methodology similar to that used in traditional QA testing? The potential benefits are immense — trust in our AI systems could soar if we fully understand their capabilities and limitations.

I’d love to hear your thoughts on this! How are you evaluating your models? Are there specific frameworks you’ve found effective? Let’s learn together!

#artificialintelligence #machinelearning #ai #technology #innovation #llmtesting #aievaluation #promptengineering #aitools #futureofai #aidevelopment
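As one possible starting point, here is a minimal sketch of treating LLM evaluation like a regression suite: a fixed set of cases with programmatic checks, re-run on every model or prompt change. `ask_model` and the example cases are illustrative assumptions.

```python
def ask_model(prompt: str) -> str:
    """Hypothetical wrapper around the LLM being evaluated."""
    raise NotImplementedError

EVAL_CASES = [
    {"prompt": "Summarize in one sentence: The meeting moved to Friday.",
     "check": lambda out: "Friday" in out},
    {"prompt": "Translate 'bonjour' to English.",
     "check": lambda out: "hello" in out.lower()},
]

def run_eval() -> float:
    """Return the fraction of cases that pass; track this number across releases."""
    passed = sum(1 for case in EVAL_CASES if case["check"](ask_model(case["prompt"])))
    print(f"{passed}/{len(EVAL_CASES)} cases passed")
    return passed / len(EVAL_CASES)
```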
Writing unit tests has always been a balancing act: achieving full coverage without turning the process into an unmanageable effort. For complex systems, generating and maintaining exhaustive test cases manually is unrealistic. That’s where AI comes in.

By integrating AI into the testing workflow, I can automatically generate unit tests with comprehensive test vectors. The AI explores edge cases and combinations that might escape human thinking, providing a safety net I didn’t have before. The key isn’t replacing developers; it’s enabling us to focus on high-level design, code reliability, and innovation while AI handles exhaustive scenarios.

The result? Faster test creation, more thorough coverage, and fewer surprises in production. It’s not just about saving time; it’s also about improving confidence in every release and reducing risk across the codebase. AI is a powerful tool to raise the overall quality and reliability of software.

#UnitTesting #AIinEngineering #TestAutomation #SoftwareQuality #Programming #Innovation
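One pattern that keeps this workflow reviewable is to have the AI generate the test vectors offline and commit them as data, so the tests themselves stay deterministic. A sketch, where the file name, schema, and `parse_duration` are hypothetical:

```python
import json
import pytest

from mypackage import parse_duration  # hypothetical function under test

# AI-generated vectors, committed and reviewed like any other fixture,
# e.g. [{"input": "1h30m", "expected": 5400}, ...]
with open("tests/ai_generated_vectors.json") as f:
    VECTORS = json.load(f)

@pytest.mark.parametrize("case", VECTORS, ids=lambda c: c["input"])
def test_parse_duration(case):
    assert parse_duration(case["input"]) == case["expected"]
```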
AI coding agents don’t fail because of bad code. They fail because of bad context. Most teams are solving the wrong problem.

Teams throw bigger prompts, longer histories, or more memory at the problem, which only amplifies noise and compounds early mistakes. Treating context like a storage problem (more tokens, bigger windows) is the wrong approach.

The ACE/FCA approach says the fix isn’t “bigger prompts,” it’s a Research → Plan → Implement pipeline with frequent, intentional compaction and human checkpoints, so the model always works from the right slice of reality.

What works for us is context PRs (CPRs): before any code diff, the agent opens a tiny PR that contains a machine-readable “context snapshot” (files touched, invariants, dependencies, assumptions). Humans approve the snapshot first; CI then blocks if “context drift” is detected (repo changed, tests/trace probes reveal new facts), auto-triggering re-research instead of shipping slop.

This works because it turns context from vibe → contract, stops early-phase error amplification, and makes multi-agent work composable. If you’re betting on code agents, add CPRs to your pipeline before you add more tokens.

#FutureOfCoding #AIEngineering #DeveloperExperience #Innovation #AIAgents
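To make the CPR idea concrete, here is a minimal sketch of a machine-readable context snapshot with a naive drift check; the field names and hashing scheme are assumptions, not a published spec.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class ContextSnapshot:
    files_touched: list[str]
    invariants: list[str]
    dependencies: list[str]
    assumptions: list[str]
    file_hashes: dict[str, str] = field(default_factory=dict)

    def capture(self) -> None:
        """Record a content hash for every file the agent's plan depends on."""
        for path in self.files_touched:
            self.file_hashes[path] = hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def drifted_files(self) -> list[str]:
        """Return files whose contents changed since the snapshot was approved."""
        return [
            p for p, h in self.file_hashes.items()
            if hashlib.sha256(Path(p).read_bytes()).hexdigest() != h
        ]
```

In CI, the approved snapshot would be loaded and the build blocked (with re-research triggered) whenever `drifted_files()` is non-empty.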
🤖 Agentic AI in engineering: what to delegate—and what to keep human

Teams are moving past autocomplete to agents that plan, run and iterate on code. The question isn’t can they do it—it’s should they? Here’s a split that works in practice:

Let agents handle (with review):
• Boilerplate: scaffolding, CRUD endpoints, SDK/API wrappers
• Generate tests and stabilise flaky ones
• Refactoring with explicit specs (rename, extract, deduplicate)
• Routine dependency upgrades and simple CI/CD tweaks
• Typed, well-covered migrations
• First drafts of docs, READMEs and changelogs
• Triage: summarising issues, reproducing known bugs, codebase Q&A

Keep human judgement on:
• Architecture choices, design trade-offs and roadmaps
• Security-sensitive code (auth/crypto, secrets)
• Performance-critical paths and tricky concurrency
• Irreversible changes (production schema edits, data retention)
• Incident command and stakeholder comms; compliance/privacy calls
• Novel algorithms, ambiguous requirements and product judgement

Best as a duo:
• Bug fixes with a failing test: agent proposes; human validates
• Multi-file edits: agent drafts; human scopes, reviews and lands
• Code reviews: agent catches nits; human assesses intent and risk

Safety rails that make this work:
• Sandboxed environments and least privilege
• Small diffs, mandatory tests and review gates
• Auto-rollback runbooks and clear escalation paths
• Transparent logs of prompts/actions
• Team norms: label agent-authored code, track outcomes, iterate

Bottom line: use agents to accelerate the mechanical work; keep humans on meaning, risk and ethics. The win isn’t replacing engineers—it’s upgrading how we engineer.

Where do you draw the line? 👇

#AI #SoftwareEngineering #DevTools #PlatformEngineering #MLOps #EngineeringLeadership
A few months ago, I was at a conference when a senior exec said: “Prompt engineering is really just good communication.”

At the time, it made sense. But after reading the Prompt Engineering Guide and trying to build one myself... it turns out there's a lot more to it. Prompting isn't just about asking the right question, it's about designing an interaction.

Some things I’ve learned to keep in mind:
✔️ Clarity outperforms cleverness. If the prompt is vague, the output will be too
✔️ Context matters: what the model knows is what it responds to
✔️ Structure guides output, format is part of the instruction
✔️ You’re not just talking to AI but programming behavior with language

I’ve been experimenting by writing a “meta-prompt”, a prompt that teaches an AI how to improve other prompts. If you're curious what that looks like or want to try it yourself, here’s the template below 👇

Let me know what you’re learning or experimenting with too. I’m early in the journey, but already seeing how deep this rabbit hole goes.

#PromptEngineering #LLM #LearningInPublic