Recouping training costs with AI advancements

Rust & Golang | Engineering Philanthropist | I’m the guy that wrote Goblin.ai alone.

You must recoup your training cost within 6 months, or a free model comes out that blows yours out of the water. Doom-pickle-2 was a silly story I wrote in jest about this: a theoretical model that popped up 6 months after GPT-5 and was comparable to it. Instead, no joke, I have seen probably 11 steps forward comparable to the theoretical model I had in mind. This is probably the best example I’ve seen, though. This and sparse attention.

Aymeric Roucher

Building Agents, formerly at Hugging Face | Polytechnique - Cambridge

1w Edited

STOP EVERYTHING NOW - we might finally have a radical architecture improvement over Transformers!!! 🚨

A lone scientist just proposed Tiny Recursive Model (TRM), and it is literally the most impressive model that I've seen this year.
➡️ Tiny Recursive Model is 7M parameters
➡️ On ARC-AGI, it beats flagship models like Gemini-2.5-pro

Consider how wild this is: Gemini-2.5-pro must be over 10,000x bigger and had 1,000x as many authors 😂 (Alexia is alone on the paper)

What's this sorcery? In short: it's a very tiny Transformer, but it loops over itself at two different frequencies, updating two latent variables (i.e. two vectors): one is the proposed answer and the other is... the reasoning. Representing reasoning with a vector makes sense: it's much more efficient than building reasoning by generating loads of tokens.

Alexia Jolicoeur-Martineau started from the paper Hierarchical Reasoning Model, published a few months ago, which already showed breakthrough improvement on ARC-AGI for its small size (27M).

Hierarchical Reasoning Model had introduced one main feature:
🔎 Deep supervision
In their model, one part (here one layer) would run at high frequency, and another would run at lower frequency, only every n steps. They had used a recurrent architecture, where these layers would repeat many times; but to make it work they had to make many approximations, including not fully backpropagating the loss through all layers.

Alexia studied what was useful and what wasn't, and cleaned up the architecture as follows:

Why use a recurrent architecture, when you can just make it a loop?
➡️ She made the network recursive, looping over itself

Why use 2 latent variables?
➡️ She provides a crystal clear explanation: the one that changes frequently is the reasoning, the one that changes at low frequency is the proposed answer.
➡️ She runs ablation studies to validate that 2 is indeed optimal.

Like with all great research, when reading this paper I felt like everything fell into place naturally: this new setup is a much more elegant way to process reasoning than generating huge chains of tokens as all flagship models currently do.

One caveat: TRM does not generate text, it works on fixed-length outputs, like the grids of sudoku or ARC. But there's no real blocker to adapting it to text, and I see a high probability this gets done over the next weeks.

This might be the breakthrough that we've awaited for so long!
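
The two-frequency loop described in the post can be sketched in a few lines of Python. This is only an illustration of the control flow, not the paper's implementation: the function name `recursive_refine`, the loop counts, and the two toy update functions are assumptions made for this example, and a real model would use one small trained network in place of the toy functions.

```python
from typing import Callable, List

Vector = List[float]


def recursive_refine(
    x: Vector,
    update_reasoning: Callable[[Vector, Vector, Vector], Vector],
    update_answer: Callable[[Vector, Vector], Vector],
    n_inner: int = 4,   # arbitrary illustrative value
    n_outer: int = 3,   # arbitrary illustrative value
) -> Vector:
    """Loop a model over itself at two frequencies (illustrative sketch)."""
    y = [0.0] * len(x)  # proposed answer: changes at low frequency
    z = [0.0] * len(x)  # latent reasoning: changes at high frequency
    for _ in range(n_outer):
        for _ in range(n_inner):
            z = update_reasoning(x, y, z)  # fast loop: refine the reasoning
        y = update_answer(y, z)            # slow loop: revise the answer once
    return y


# Toy stand-ins (hypothetical), chosen so the result is easy to follow by hand.
def toy_reasoning(x: Vector, y: Vector, z: Vector) -> Vector:
    # The "reasoning" is simply the current error of the proposed answer.
    return [xi - yi for xi, yi in zip(x, y)]


def toy_answer(y: Vector, z: Vector) -> Vector:
    # Move the answer halfway along the reasoning vector.
    return [yi + 0.5 * zi for yi, zi in zip(y, z)]


print(recursive_refine([8.0], toy_reasoning, toy_answer))  # [7.0]
```

With these toy updates the answer goes 0 → 4 → 6 → 7 over the three outer steps, which shows the division of labour: the reasoning vector is recomputed many times, and the answer is revised only once per outer step.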


More Relevant Posts

Saeed Kasmani, Ph.D.

Let’s Innovative with AI | AI Leader | Advisor | Mentor |Ex-Redhatter |Ex-CSIRO researcher
1w
🤯 Mind-blowing work! A 7M parameter Tiny Recursive Model outperforming giants like Gemini-2.5 is a reminder that innovation isn’t just about scale — it’s about smarter architecture and elegant reasoning loops. This could be a real turning point in AI research. 🚀 #AI #DeepLearning #Innovation #Research #AGI
Ibrahim Sobh - PhD

🎓 Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer
1w
🔥 The Big LLM Architecture Comparison

1. DeepSeek V3/R1: Multi-head latent attention compresses KV cache, mixture of experts (256 fine-grained experts, 8 active), 671B parameters with only 37B active during inference.
2. OLMo 2: Post-norm placement inside residuals for training stability, QK normalization, transparent model with shared training data, competitive performance with 7B parameters.
3. Gemma 3: Sliding window attention (1024 tokens) with 5:1 local-global ratio, both pre-norm and post-norm layers, 27B optimal size for local inference.
4. Mistral Small 3.1: Wider but shorter architecture (40 transformer blocks vs 62), larger feedforward modules (32K intermediate), faster inference with fewer sequential layers.
5. Llama 4: Mixture of experts with fewer larger experts (70B active), Maverick 400B model with wider feedforward layers but only moderate benchmark performance.
6. Qwen3: Deeper architecture (28 blocks), Apache 2 license, hybrid instruction/reasoning model with think token, multiple sizes (0.6B-1T), excellent benchmark performance.
7. SmolLM3: Highly transparent 3B model, uses NOPE (no position embeddings) in every fourth layer for better length generalization, similar performance to Qwen3.
8. Kimi 2: 1 trillion parameters, 512 experts (1 shared, 8 active), trained with Muon optimizer achieving smooth loss curve, only 32B parameters active.
9. GPT-OSS: First OpenAI open-weight model in 6 years, trained for function calling, 32 experts (wider but fewer), unusually includes bias vectors.
10. Grok 2.5: Production model (270B) with 8 experts, additional dense feedforward as shared expert implementation, wider experts following older paradigm from last year.
11. GLM-4.5: Very deep architecture (92 blocks), 335B parameters with 8 experts plus shared expert, top benchmark performance, fewer active parameters than Qwen3.
Illya Yalovyy

It takes fastidiousness to write code that doesn't just do the right thing but also says the right thing.
1w
This article explains how to apply hexagonal architecture in Rust to build cleaner, more maintainable applications. It shows how separating domain logic from external concerns through traits and adapters can make code easier to test and evolve in theory. The author does a good job refactoring a messy example into a well-structured design without getting lost in abstractions.

But in real projects, things are rarely that simple: swapping a database or major dependency is never just about changing an adapter. It usually involves deep redesign, refactoring, and embracing the unique capabilities of the new system, which no architecture can fully abstract away. Hexagonal architecture is a nice conceptual model, but in large, complex applications, it often remains more of an ideal than a practical day-to-day tool.

https://lnkd.in/gnJzJWEN #rust #rustlang #architecture #softwaredesign #softwaredesignpattern

Master Hexagonal Architecture in Rust howtocodeit.com
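
For readers who have not seen the pattern, here is a minimal sketch of the port-and-adapter split the post refers to. It is written in Python rather than the article's Rust, with `typing.Protocol` playing the role a Rust trait would play. All names (`Author`, `AuthorRepository`, `AuthorService`, `InMemoryAuthorRepository`, `DuplicateAuthor`) are hypothetical and chosen for this example, not taken from the article.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Author:
    name: str


class DuplicateAuthor(Exception):
    """Raised by the domain when a name is already registered."""


class AuthorRepository(Protocol):
    """Port: what the domain needs from storage, with no mention of a database."""

    def save(self, author: Author) -> None: ...

    def find(self, name: str) -> Optional[Author]: ...


class AuthorService:
    """Domain logic: depends only on the port, never on a concrete store."""

    def __init__(self, repo: AuthorRepository) -> None:
        self._repo = repo

    def register(self, name: str) -> Author:
        if self._repo.find(name) is not None:
            raise DuplicateAuthor(name)
        author = Author(name)
        self._repo.save(author)
        return author


class InMemoryAuthorRepository:
    """Adapter: one implementation of the port; a SQL-backed one would be another."""

    def __init__(self) -> None:
        self._rows: Dict[str, Author] = {}

    def save(self, author: Author) -> None:
        self._rows[author.name] = author

    def find(self, name: str) -> Optional[Author]:
        return self._rows.get(name)


service = AuthorService(InMemoryAuthorRepository())
print(service.register("ada"))  # Author(name='ada')
```

The sketch also shows where the post's objection bites: swapping the adapter is cheap only as long as `save` and `find` are everything the domain relies on. Once the domain starts depending on transactions, query capabilities, or consistency guarantees specific to one store, the port itself has to change.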
Kourosh Ebrahiminejad

Senior .NET Developer | C#, ASP.NET Core, EF Core | Microservices | Integration Expert | Open to Remote Opportunities
2w
Good architecture whispers. Bad architecture shouts.

That’s exactly how I see the Interface Segregation Principle in action. You don’t need one giant interface doing everything - you need small, focused contracts that quietly keep your domain clean.

Here’s how I keep things clean:

    public interface ISoftDeletable
    {
        bool IsDeleted { get; set; }
    }

    public interface IExpirable
    {
        DateTime? PublishDateTime { get; set; }
        DateTime? ExpiryDateTime { get; set; }
        bool IsPublished { get; }
        bool IsExpired { get; }
    }

    public interface ITrackable
    {
        DateTime CreatedAt { get; set; }
        DateTime? ModifiedAt { get; set; }
    }

And in my domain:

    public class Page : Int32EntityBase, IExpirable, ITrackable
    {
        // Properties omitted for brevity
        public DateTime? PublishDateTime { get; set; }
        public DateTime? ExpiryDateTime { get; set; }
        public bool IsPublished { get; private set; }
        public bool IsExpired { get; private set; }
        public DateTime CreatedAt { get; set; }
        public long CreatedById { get; set; }
        public DateTime? ModifiedAt { get; set; }
        public long ModifiedById { get; set; }
    }

Each entity gets only what it really needs. No extra noise. No “NotImplementedException” drama.

✅ Why it works
- Clean separation of concerns
- Easier to test and refactor
- More composable and reusable
- Better readability and intent clarity

Because at the end of the day: Good design doesn’t scream for attention - it just works.

#DotNet #CSharp #SOLID #InterfaceSegregation #CodeQuality #CleanCode
Amir Charkhand

Software Engineer & .NET Developer | ASP .NET Core web API, Blazor, MAUI | Software architecture enthusiast | Deutsch Lerner
1w
When you ask what interface segregation is, people usually say "single responsibility, but for interfaces". However, this is a good explanation by Kourosh Ebrahiminejad

Yunus Oktay

iOS Developer
6d
Are you feeling the "spaghetti code" creep in your SwiftUI navigation? Ever struggled to deep-link users from a push notification, or pop back to the root after a purchase? My new article, "The SwiftUI Navigation Architecture That Will Save Your Projects: The Router Pattern" offers a comprehensive solution. Learn how to build a robust and maintainable navigation system with NavigationStack that will save your projects from chaos. #SwiftUI #iOSDevelopment #CleanCode #Navigation #Architecture #Programming

iCommunity

1,360 followers
6d

The SwiftUI Navigation Architecture That Will Save Your Projects: The Router Pattern by Yunus Oktay https://lnkd.in/dcms5xKV

The SwiftUI Navigation Architecture That Will Save Your Projects: The Router Pattern medium.com
Milan Jovanović

Practical .NET and Software Architecture Tips | Microsoft MVP
1w
What is the core idea of Clean Architecture? The Dependency Rule. Managing coupling between components. This rule states that source code dependencies can only point "inwards". But what does inwards mean? Our domain entities and business rules are at the center of the system. They don't depend on anything, which makes them more stable. Things around the Domain can change, but that shouldn't affect the domain entities or business rules. We can achieve this using layers, but you don't need them. The layers can be entirely logical.

The Missing Chapter explains this in-depth: https://lnkd.in/eYAt5GuH

Have you read the Clean Architecture book? What do you think of it?

---

Sign up for the .NET Weekly with 74K+ other engineers, and get a free Clean Architecture template: https://lnkd.in/edGSAdiR
Rakesh Gohel

Scaling with AI Agents | Expert in Agentic AI & Cloud Native Solutions| Builder | Author of Agentic AI: Reinventing Business & Work with AI Agents | Driving Innovation, Leadership, and Growth | Let’s Make It Happen! 🤝
3d
Stanford just used an agentic architecture to replace fine-tuning and the results are shocking. Here's why ACE is so important...

Stanford just dropped the ACE (Agentic Context Engineering) Framework.

📌 ACE adopts an agentic architecture with three specialized components: a Generator, a Reflector, and a Curator.

1. Generator
- Creates new context items based on the agent’s experience or task performance.
- These items can be strategies, instructions, examples, or evidence.

2. Reflector
- Evaluates the usefulness of generated items. (Just like the reflection loop on agents)
- Identifies redundancy, brevity bias (oversimplification), or context collapse (loss of detail over time).

3. Curator
- Merges high-value items into the evolving context.
- Maintains coherence and relevance while preventing overload or degradation.

But why the agentic architecture?
They used an agentic architecture to let specialized roles collaboratively evolve context in real time. This prevents collapse or oversimplification and enables continual adaptation without fine-tuning. This contrasts with similar frameworks like Dynamic Cheatsheet, which relied on a single evolving memory and was prone to brevity bias (preferring shorter, oversimplified outputs) and context degradation.

But how can ACE replace fine-tuning for LLMs?

📌 Traditional fine-tuning involves retraining model weights on domain-specific data, which is:
- Expensive (computationally and financially)
- Slow (requires retraining cycles)
- Static (doesn’t adapt in real time)

📌 ACE sidesteps these issues by:
- Editing context instead of weights, which makes updates lightweight and fast.
- Supporting continual improvement: agents evolve their strategies over time.
- Avoiding brevity bias and context collapse, which preserves domain insights better than naive summarization.

📌 Proofs? The gains include:
1/ +10.6% improvement on agent tasks
2/ +8.6% boost in financial reasoning
3/ ~86.9% latency reduction compared to strong context-adaptation baselines

What do you think about this framework? Let me know in the comments below 👇

Save 💾 ➞ React 👍 ➞ Share ♻️ & follow for everything related to AI Agents
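
The Generator, Reflector, and Curator roles described above can be sketched as a plain loop over a list of context strings. This is a toy illustration of the division of roles as the post describes it, not the ACE framework's actual code: `ace_step` and the three `toy_*` functions are hypothetical names, and in a real system each role would be backed by an LLM call.

```python
from typing import Callable, List

Context = List[str]


def ace_step(
    context: Context,
    task: str,
    generate: Callable[[str, Context], List[str]],
    reflect: Callable[[str, Context], bool],
    curate: Callable[[Context, List[str]], Context],
) -> Context:
    """One round of context evolution; no model weights are involved."""
    candidates = generate(task, context)                        # Generator: propose items
    accepted = [c for c in candidates if reflect(c, context)]   # Reflector: keep useful ones
    return curate(context, accepted)                            # Curator: merge into context


# Toy stand-ins (hypothetical).
def toy_generate(task: str, context: Context) -> List[str]:
    # One fresh lesson, plus a repeat of the oldest known item to exercise the Reflector.
    return [f"lesson from {task}"] + context[:1]


def toy_reflect(item: str, context: Context) -> bool:
    # Reject anything the context already contains (redundancy check).
    return item not in context


def toy_curate(context: Context, accepted: List[str], max_items: int = 50) -> Context:
    # Append accepted items and cap the size so the context cannot grow without bound.
    return (context + accepted)[-max_items:]


context: Context = []
for task in ["task-1", "task-2"]:
    context = ace_step(context, task, toy_generate, toy_reflect, toy_curate)
print(context)  # ['lesson from task-1', 'lesson from task-2']
```

The only state that changes between tasks is the `context` list, which is the sense in which this approach edits context instead of weights.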