How to Learn AI Evals for PMs in 5 Minutes

Aakash Gupta

The AI PM Guy 🚀 | Helping you land your next job + succeed in your career

OpenAI's CPO says evals are becoming a core skill for PMs.

PM in 2025 is changing fast. PMs need to learn brand-new skills:

1. AI Evals (https://lnkd.in/eGbzWMxf)
2. AI PRDs (https://lnkd.in/eMu59p_z)
3. AI Strategy (https://lnkd.in/egemMhMF)
4. AI Discovery (https://lnkd.in/e7Q6mMpc)
5. AI Prototyping (https://lnkd.in/eJujDhBV)

And evals are among the deepest topics. There are 3 steps to them:

1. Observing (https://lnkd.in/e3eQBdMp)
2. Analyzing Errors (https://lnkd.in/eEG83W5D)
3. Building LLM Judges (https://lnkd.in/ez3stJRm)

- - - - - -

Here's your simple guide to evals in 5 minutes:

(Repost this before anything else ♻️)

𝟭. 𝗕𝗼𝗼𝘁𝘀𝘁𝗿𝗮𝗽 𝗬𝗼𝘂𝗿 𝗗𝗮𝘁𝗮𝘀𝗲𝘁
Start with 100 diverse traces of your LLM pipeline. Use real data if you can, or systematic synthetic data generation across key dimensions if you can't. Quality over quantity here: aggressive filtering beats volume.

𝟮. 𝗔𝗻𝗮𝗹𝘆𝘇𝗲 𝗧𝗵𝗿𝗼𝘂𝗴𝗵 𝗢𝗽𝗲𝗻 𝗖𝗼𝗱𝗶𝗻𝗴
Read every trace carefully and label failure modes without preconceptions. Look for the first upstream failure in each trace. Continue until you hit theoretical saturation: the point where new traces reveal no fundamentally new error types.

𝟯. 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗬𝗼𝘂𝗿 𝗙𝗮𝗶𝗹𝘂𝗿𝗲 𝗠𝗼𝗱𝗲𝘀
Group similar failures into coherent, binary categories through axial coding. Focus on Gulf of Generalization failures (where clear instructions are misapplied) rather than Gulf of Specification issues (ambiguous prompts you can fix easily).

𝟰. 𝗕𝘂𝗶𝗹𝗱 𝗔𝘂𝘁𝗼𝗺𝗮𝘁𝗲𝗱 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗼𝗿𝘀
Create a dedicated evaluator for each failure mode. Use code-based checks when possible (regex, schema validation, execution tests).
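The code-based checks just mentioned can be sketched in a few lines of standard-library Python. This is a minimal illustration, not any particular framework's API; the trace format (a JSON string with `answer` and `confidence` fields) and the placeholder pattern are hypothetical examples of what you might check.

```python
import json
import re

def check_no_placeholder_text(output: str) -> bool:
    """Fail if the model left template placeholders like [NAME] or {{var}} in its output."""
    return re.search(r"\[[A-Z_]+\]|\{\{.*?\}\}", output) is None

def check_valid_json_schema(output: str) -> bool:
    """Pass only if the output parses as JSON with the expected fields and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data.get("answer"), str) and isinstance(data.get("confidence"), (int, float))

# Each trace gets a binary pass/fail per evaluator -- one evaluator per failure mode.
trace = '{"answer": "Refund issued", "confidence": 0.92}'
results = {
    "placeholder_check": check_no_placeholder_text(trace),
    "schema_check": check_valid_json_schema(trace),
}
print(results)  # both checks pass for this trace
```

Note the checks are deliberately binary, matching the "coherent, binary categories" from step 3: each one answers a single yes/no question about a single failure mode.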
For subjective judgments, build LLM-as-Judge evaluators with clear Pass/Fail criteria, few-shot examples, and structured JSON outputs.

𝟱. 𝗗𝗲𝗽𝗹𝗼𝘆 𝘁𝗵𝗲 𝗜𝗺𝗽𝗿𝗼𝘃𝗲𝗺𝗲𝗻𝘁 𝗙𝗹𝘆𝘄𝗵𝗲𝗲𝗹
Integrate evals into CI/CD, monitor production with bias-corrected success rates, and cycle through Analyze → Measure → Improve continuously. New failure modes found in production feed back into your evaluation artifacts.

Evals are now a core skill for AI PMs. This is your map.

- - - - -

I learned this from Hamel Husain and Shreya Shankar. Get 35% off their course: https://lnkd.in/e5DSNJtM

📌 Want our step-by-step guide to evals? Comment 'steps' + DM me. Repost to cut the line.

➕ Follow Aakash Gupta to stay on top of AI x PM.
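The "bias-corrected success rates" in step 5 address a real problem: an LLM judge's raw pass rate is biased by the judge's own error rates. One standard way to correct for this (a sketch of the Rogan–Gladen estimator, not necessarily the exact method the course teaches) assumes you've measured the judge's true-positive and true-negative rates against human labels on a held-out set:

```python
def corrected_success_rate(observed_pass_rate: float, judge_tpr: float, judge_tnr: float) -> float:
    """
    Estimate the true pass rate from an imperfect LLM judge's observed pass rate.

    judge_tpr: P(judge says pass | truly passing), measured on human-labeled traces.
    judge_tnr: P(judge says fail | truly failing), measured the same way.
    Assumes judge errors are independent of which trace is being judged.
    """
    denom = judge_tpr + judge_tnr - 1.0
    if denom <= 0:
        raise ValueError("Judge is no better than random; correction is undefined.")
    theta = (observed_pass_rate + judge_tnr - 1.0) / denom
    return min(1.0, max(0.0, theta))  # clamp the estimate to [0, 1]

# Example: the judge reports an 80% pass rate, with TPR=0.95 and TNR=0.90
# measured on labeled data. The corrected estimate is (0.80 + 0.90 - 1) / 0.85.
print(corrected_success_rate(0.80, 0.95, 0.90))  # ≈ 0.824
```

The intuition: a lenient judge (low TNR) inflates the observed pass rate, a harsh one (low TPR) deflates it, and the formula inverts that distortion. A perfect judge (TPR = TNR = 1) leaves the rate unchanged.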

Aakash Gupta


1mo

Have a great Friday evening! Excited to share this weekend’s newsletter with you 🫡

Alex Rechevskiy

I help PMs land $700K+ product roles 🚀 Follow for daily posts on growing your product skills & career 🛎️ Join our exclusive group coaching program for ambitious PMs 👇

1mo

Evals. Modern speak for QA.

Dean Peters

Product Management Trainer, Consultant, & Mentor | Innovation Coach & AI Tamer | Hakawati (حكواتي)

1mo

Once again, I feel like we're overhyping tools and techniques. Evals are important, but just like being good at writing Gherkin-style acceptance criteria, they won't make you a great product manager. And just because Lenny Rachitsky's latest podcast pumps it up doesn't mean it necessarily needs to be the most important thing we stop, drop, and roll toward. Don't get me wrong, evals are important, and Lenny does a great job with his podcasts, but the hype cycles we're putting ourselves through are all about tools and techniques ... Oh my goodness people ... We're all going to spin ourselves into the ground.

Sachin Sharma

Become Elite PM In 90 Days ~ Product Career Coach : Mentor IT Professionals to Break into Product Management Role || Aspiring PMs Resources & 1:1 Call (Demo Call) ↓

1mo

This is gold for every PM stepping into AI. Clear, crisp, and super actionable. Thanks for sharing.

Larry Imgrund

Senior Product Manager at Gainsight || Author on CommunityFREQ

1mo

I learned the importance of evals through painstakingly coding prototypes / PRDs in VS Code with Claude Code: creating an ideal interaction environment with the LLM for an end user, and getting accurate suggestions and insights with high confidence from your data. It’s a skill to practice now!

Aarzoo Bhatia

Product @ Devrev.ai | Ex-Freshworks, 1mg | SPJIMR'21 | Carnegie Mellon

1mo

overhyping evals.

Will Scardino

AI Product Leader 🔥 Driving $168M+ for 100M users | Agentic AI @Verizon | Top 6% Voice | ✨AI PM BY DESIGN | Ex-Grubhub, Acxiom, Humana, FEMA

1mo

Great post 👍 Before jumping to LLM-as-a-Judge, it's essential that PMs conduct error analysis on tonality, personality, and response accuracy, clustering traces for main use cases + edge cases. LLMs are horrible at understanding numeric ranges (e.g. 1-5). Binary's the way to go, folks ~ happy Fri! Also code evals 👌 Which eval package do you recommend? Arize, LangSmith, and Galileo are some of the popular third-party tools.

Jason Knight

Fractional CPO | Coaching, Training, and Scaling B2B Product Teams | Building AI Products Since Before it was Cool

1w

I mean, maybe, but that's what I'd say if I was OpenAI's CPO even if it wasn't true 🤔

Rafik Matta

Applied ML & AI Leader

1mo

I think it's misplaced to expect PMs to prepare evals, unless we mean evals in the typical product sense (i.e. product metrics). A PM with a good stats background can probably do this, but we're basically asking PMs to be data scientists.
