OpenAI CPO: Evals are becoming a core skill for PMs.

PM in 2025 is changing fast. PMs need to learn brand new skills:

1. AI Evals (https://lnkd.in/eGbzWMxf)
2. AI PRDs (https://lnkd.in/eMu59p_z)
3. AI Strategy (https://lnkd.in/egemMhMF)
4. AI Discovery (https://lnkd.in/e7Q6mMpc)
5. AI Prototyping (https://lnkd.in/eJujDhBV)

And evals are among the deepest topics. There are 3 steps to them:

1. Observing (https://lnkd.in/e3eQBdMp)
2. Analyzing Errors (https://lnkd.in/eEG83W5D)
3. Building LLM Judges (https://lnkd.in/ez3stJRm)

- - - - - -

Here's your simple guide to evals in 5 minutes:

(Repost this before anything else ♻️)

𝟭. 𝗕𝗼𝗼𝘁𝘀𝘁𝗿𝗮𝗽 𝗬𝗼𝘂𝗿 𝗗𝗮𝘁𝗮𝘀𝗲𝘁

Start with 100 diverse traces of your LLM pipeline. Use real data if you can, or systematic synthetic data generation across key dimensions if you can't. Quality over quantity here: aggressive filtering beats volume. (A generation sketch follows the post.)

𝟮. 𝗔𝗻𝗮𝗹𝘆𝘇𝗲 𝗧𝗵𝗿𝗼𝘂𝗴𝗵 𝗢𝗽𝗲𝗻 𝗖𝗼𝗱𝗶𝗻𝗴

Read every trace carefully and label failure modes without preconceptions. Look for the first upstream failure in each trace. Continue until you hit theoretical saturation: the point where new traces reveal no fundamentally new error types.

𝟯. 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗬𝗼𝘂𝗿 𝗙𝗮𝗶𝗹𝘂𝗿𝗲 𝗠𝗼𝗱𝗲𝘀

Group similar failures into coherent, binary categories through axial coding. Focus on Gulf of Generalization failures (where clear instructions are misapplied) rather than Gulf of Specification issues (ambiguous prompts you can fix easily).

𝟰. 𝗕𝘂𝗶𝗹𝗱 𝗔𝘂𝘁𝗼𝗺𝗮𝘁𝗲𝗱 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗼𝗿𝘀

Create a dedicated evaluator for each failure mode. Use code-based checks when possible (regex, schema validation, execution tests). For subjective judgments, build LLM-as-Judge evaluators with clear Pass/Fail criteria, few-shot examples, and structured JSON outputs. (See the evaluator sketch after the post.)

𝟱. 𝗗𝗲𝗽𝗹𝗼𝘆 𝘁𝗵𝗲 𝗜𝗺𝗽𝗿𝗼𝘃𝗲𝗺𝗲𝗻𝘁 𝗙𝗹𝘆𝘄𝗵𝗲𝗲𝗹

Integrate evals into CI/CD, monitor production with bias-corrected success rates (see the correction sketch below), and cycle through Analyze → Measure → Improve continuously. New failure modes found in production feed back into your evaluation artifacts.

Evals are now a core skill for AI PMs. This is your map.

- - - - -

I learned this from Hamel Husain and Shreya Shankar. Get 35% off their course: https://lnkd.in/e5DSNJtM

📌 Want our step-by-step guide to evals? Comment 'steps' + DM me. Repost to cut the line.

➕ Follow Aakash Gupta to stay on top of AI x PM.
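A minimal sketch of step 1's "systematic synthetic data generation across key dimensions." The dimensions and values (persona, scenario, difficulty for a support-bot pipeline) are hypothetical illustrations, not from the post:

```python
# Sketch: bootstrap ~100 diverse test inputs by crossing key dimensions.
# The dimensions and values below are hypothetical examples.
import itertools
import random

DIMENSIONS = {
    "persona": ["new user", "power user", "frustrated customer"],
    "scenario": ["billing question", "bug report", "feature request", "refund"],
    "difficulty": ["simple", "multi-step", "ambiguous"],
}

def bootstrap_dataset(n: int = 100, seed: int = 0) -> list[dict]:
    """Sample n combinations from the dimension grid (with replacement
    only when the grid is smaller than n), one test case each."""
    rng = random.Random(seed)
    grid = list(itertools.product(*DIMENSIONS.values()))
    picks = rng.choices(grid, k=n) if n > len(grid) else rng.sample(grid, n)
    return [dict(zip(DIMENSIONS.keys(), combo)) for combo in picks]

for case in bootstrap_dataset(5):
    print(case)  # e.g. {'persona': 'power user', 'scenario': 'refund', ...}
```

Each combination would then be expanded into a concrete prompt (often with an LLM's help) and aggressively filtered for quality, per the post's "quality over quantity" note.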
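And a sketch of step 4's two evaluator types. The failure modes shown (malformed order IDs, over-promised actions) are illustrative, and `call_llm` is a hypothetical stand-in for your provider's completion call:

```python
# Sketch of step 4: a code-based check plus an LLM-as-Judge evaluator.
import json
import re

def check_order_id_format(output: str) -> bool:
    """Code-based check: every cited order ID must match the known schema."""
    return all(re.fullmatch(r"ORD-\d{6}", oid)
               for oid in re.findall(r"ORD-[\w-]+", output))

JUDGE_PROMPT = """You are evaluating a support-bot reply for one failure mode:
"The reply promises an action the bot cannot actually perform."

Examples:
- Reply: "I've gone ahead and refunded you." -> {{"verdict": "Fail", "reason": "Bot cannot issue refunds."}}
- Reply: "I've flagged this for the billing team." -> {{"verdict": "Pass", "reason": "Escalation is within scope."}}

Return JSON only: {{"verdict": "Pass" | "Fail", "reason": "<one sentence>"}}

Reply to evaluate:
{reply}
"""

def judge(reply: str) -> dict:
    """LLM-as-Judge: one failure mode, binary verdict, structured JSON output."""
    raw = call_llm(JUDGE_PROMPT.format(reply=reply))  # hypothetical client call
    return json.loads(raw)
```

One judge per failure mode keeps the Pass/Fail criteria crisp and lets you measure each judge's agreement with human labels separately, which matters for the next step.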
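On step 5's "bias-corrected success rates": one common correction (the Rogan-Gladen estimator; verify it matches your course materials) measures the judge's true-positive and true-negative rates against human labels on a holdout, then solves for the true pass rate. A sketch under that assumption:

```python
# Sketch: bias-correct an LLM judge's observed pass rate, assuming you've
# measured the judge's sensitivity (TPR) and specificity (TNR) against
# human labels on a holdout set.
def corrected_success_rate(observed_pass_rate: float,
                           judge_tpr: float,
                           judge_tnr: float) -> float:
    """Estimate the true pass rate from the judge's observed pass rate.

    observed = true*TPR + (1-true)*(1-TNR)  =>  solve for `true`.
    """
    denom = judge_tpr + judge_tnr - 1.0
    if denom <= 0:
        raise ValueError("Judge is no better than random; correction undefined.")
    true_rate = (observed_pass_rate + judge_tnr - 1.0) / denom
    return min(1.0, max(0.0, true_rate))  # clip to a valid probability

# Judge reports 82% pass, but it catches only 90% of real passes (TPR)
# and correctly fails 85% of real failures (TNR):
print(corrected_success_rate(0.82, judge_tpr=0.90, judge_tnr=0.85))  # ~0.893
```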
Evals. Modern speak for QA.
Once again, I feel like we're overhyping tools and techniques. Evals are important, but just like being good at writing Gherkin-style acceptance criteria, they're not going to make you a great product manager. And just because Lenny Rachitsky's latest podcast pumps it up doesn't mean it needs to be the most important thing we stop, drop, and roll towards. Don't get me wrong: evals matter, and Lenny does a great job with his podcasts. But the hype cycles we're putting ourselves through are all about tools and techniques... oh my goodness, people... we're going to spin ourselves into the ground.
This is gold for every PM stepping into AI. Clear, crisp, and super actionable. Thanks for sharing.
I learned the importance of evals by painstakingly coding prototypes/PRDs in VS Code with Claude Code: creating an ideal interaction environment with the LLM for an end user, and getting accurate suggestions and insights from your data with high confidence. It's a skill to practice now!
overhyping evals.
Great post 👍 Before jumping to LLMaaJ, it's essential that PMs conduct error analysis on tonality, personality, and response accuracy, clustering traces for the main use cases + edge cases. LLMs are terrible at understanding numeric ranges (e.g. 1-5); binary is the way to go, folks ~ happy Fri! Also, code evals 👌 Which eval package do you recommend? Arize, LangSmith, and Galileo are some of the popular third-party tools.
I mean, maybe, but that's what I'd say if I were OpenAI's CPO even if it wasn't true 🤔
I think it's misplaced to expect PMs to prepare evals, unless we mean evals in the typical product sense (i.e. product metrics). A PM with a good stats background can probably do this, but we're basically asking PMs to be data scientists.
Have a great Friday evening! Excited to share this weekend's newsletter with you 🫡