First Hands-On Experience with Promptfoo for LLM Testing

Yesterday, I shared how automating LLM testing can save time and scale evaluation. Today, I want to share my first hands-on experience with Promptfoo, an open-source tool for prompt-based testing. The goal wasn’t to build a complex system, but to take the first, simple step from theory to practice.

The setup was surprisingly straightforward. The heart of Promptfoo is a single configuration file where you define your prompts, the AI models to test, and your success criteria. For my first test, I kept it basic: I asked a model to explain AI testing and created a simple rule to check whether the response contains the words “quality” and “software testing.” Later, this can be extended to gold-standard comparisons or scoring for accuracy, relevance, or tone.

The magic wasn’t in the complexity of the check itself, but in the workflow. What was once a manual “copy, paste, and read” task is now a repeatable command I can run in my terminal anytime. It’s the “Hello, World!” of LLM testing, but it fundamentally changes how you work: from random spot-checks to a structured, engineering-led approach.

This simple check is just the start. The real power comes when you move beyond keywords and start evaluating meaning and tone automatically. More on that soon!

I’ve uploaded the basic setup and .yaml configuration here if you want to explore or try it yourself: https://lnkd.in/dZFWmRk9

What's the first simple rule you would set for an AI's response?

#AITesting #LLMtesting #Promptfoo #TestAutomation #OpenSource #QualityAssurance #aitester #careergrowth #llmevaluation
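For anyone who wants a concrete picture before opening the linked repo, here is a minimal sketch of what a promptfooconfig.yaml along these lines could look like. The prompt wording, the provider entry (openai:gpt-4o-mini), and the commented-out llm-rubric assertion are illustrative assumptions on my part, not the exact contents of the uploaded config:

```yaml
# promptfooconfig.yaml: minimal sketch (illustrative values, not the author's exact config)
description: "Hello, World! of LLM testing"

prompts:
  - "Explain what AI testing is and why it matters for software quality."

providers:
  - openai:gpt-4o-mini   # assumed model; swap in whichever provider you actually use

tests:
  - assert:
      # Simple keyword rules, as described in the post (case-insensitive substring checks)
      - type: icontains
        value: "quality"
      - type: icontains
        value: "software testing"
      # A possible next step: model-graded scoring for accuracy, relevance, or tone
      # - type: llm-rubric
      #   value: "The answer is accurate, relevant, and written in a neutral, professional tone."
```

With a file like this in place, the repeatable terminal step is typically "npx promptfoo@latest eval" to run the checks, and "npx promptfoo@latest view" to browse the results in a local web UI.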
More Relevant Posts
No-code tools are the new coding of the development world. But what does that mean for the competitors (Make, n8n, Zapier)? Here is what I think.

OpenAI’s new AgentKit changes what “automation” means. It brings models, tools, and memory into one place, turning AI agents into full-fledged doers. That doesn’t put Make.com or n8n in danger. It opens new ground. They still excel at what AgentKit can’t... yet:

✅ Deep integrations across systems (CRMs, third-party apps)
✅ Reliable workflows with complex branching
✅ Flexibility for builders who need that extra oomph of control

The real opportunity? Collaboration, not competition. Think of AgentKit as the brain and Make or n8n as the nervous system connecting it to the world.

Automation isn’t getting replaced. It’s getting upgraded. The next wave belongs to those who bridge AI intelligence with real-world execution.

What do you think about this new plot twist?
Evaluations for LLMs often get treated as an afterthought. But without them, you’re essentially shipping blind. Honestly, evaluating LLMs is not straightforward. Some of the biggest challenges I see teams run into:

→ No consistent framework for benchmarking models across tasks
→ Evals are run ad-hoc, not integrated into CI/CD pipelines
→ Hard to simulate multi-turn, agentic workflows with tools + memory
→ Regression detection (“did I break the prompt?”) is manual and error-prone
→ Test suites don’t evolve as systems retrain and scale

Eval Protocol (EP) directly addresses these pain points. It’s an open protocol + SDK built by Fireworks AI that brings engineering rigor (unit tests, regression checks) and CI/CD pipelines to LLM applications.

Why it matters:
→ EP supports both single-turn and multi-turn/agentic evals
→ You can define evals with thresholds, rollout processors, and even simulate agents with MCP configs
→ It scales with your workflow, from local model selection to long-term system maintenance

If you’re building GenAI applications at scale, you should check this out: evalprotocol.io

Because reliable systems don’t just depend on good prompts or models, they depend on disciplined evaluation.

How are you currently approaching evals in your workflow?

〰️〰️〰️
Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
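One concrete way to act on the CI/CD point above, whichever eval framework you pick, is to run the eval suite on every pull request. As an illustration, the sketch below wires the Promptfoo config from the post at the top of this page into a GitHub Actions job rather than using Eval Protocol's own SDK; the workflow file name, Node version, and secret name are assumptions.

```yaml
# .github/workflows/llm-evals.yml: hypothetical CI job that runs prompt evals on every PR
name: llm-evals
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Run the eval suite; if assertions fail and the command exits non-zero,
      # the job (and the PR check) fails, which is exactly the regression signal you want
      - run: npx promptfoo@latest eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}   # assumed secret name
```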
🔥 Last week, we launched AgentCompass and our team is getting questions, feedback, appreciation - left, right and centre. The inbound rush has been crazy (and energizing). But in the middle of that, we kept getting one genuine question: “How exactly do you categorize errors in agent traces? And how does it help in root cause analysis?”

Here’s the short version ⬇️

AgentCompass introduces a structured taxonomy of 5 trace error categories:
🧠 Thinking & Response Issues -> e.g. hallucinations, misinterpretations, poor decision logic
🛡️ Safety & Security Risks -> e.g. leaks, biased or dangerous content
🔧 Tool & System Failures -> e.g. API errors, misconfigurations, runtime faults
📉 Workflow & Task Gaps -> e.g. lost context, goal drift, orchestration errors
🔍 Reflection Gaps -> meta-level failures like no self-correction or blind retries of failed actions

👉 Why this matters:
1. Instead of “your agent failed,” you now know what failed, where it failed, and why.
2. That means faster root cause discovery, automated clustering of recurring issues, and fewer hours spent digging through raw traces.
3. For builders, this means debugging agents like debugging code: structured, repeatable, measurable.

That’s the part we are most excited about. Reliable AI isn’t about more metrics, it’s about the right categories that tell you where to look.

Run Compass ONCE. You’ll know where your agent falters most often and why: https://lnkd.in/giceCBmC
Every week, I get questions like — “What exactly is an AI Agent?” “Isn’t it just a bot with an LLM?”

Not really. An AI Agent is more like a 𝘀𝘆𝘀𝘁𝗲𝗺 𝘁𝗵𝗮𝘁 𝗰𝗮𝗻 𝘁𝗵𝗶𝗻𝗸, 𝗿𝗲𝗮𝘀𝗼𝗻, 𝗮𝗻𝗱 𝗮𝗰𝘁. The LLM is just one part — it gives the brain power.

• 𝗢𝗿𝗰𝗵𝗲𝘀𝘁𝗿𝗮𝘁𝗶𝗼𝗻 𝗟𝗮𝘆𝗲𝗿: breaks high-level goals into smaller reasoning and execution steps
• 𝗧𝗼𝗼𝗹𝘀 & 𝗘𝘅𝘁𝗲𝗻𝘀𝗶𝗼𝗻𝘀: allow real-world actions — querying databases, calling APIs, automating workflows
• 𝗗𝗮𝘁𝗮 𝗦𝘁𝗼𝗿𝗲𝘀: ground the agent in facts and real-time context
• 𝗠𝗲𝗺𝗼𝗿𝘆 𝗟𝗮𝘆𝗲𝗿𝘀:
--𝗦𝗵𝗼𝗿𝘁-𝘁𝗲𝗿𝗺 𝗺𝗲𝗺𝗼𝗿𝘆 holds the current conversation or task context
--𝗟𝗼𝗻𝗴-𝘁𝗲𝗿𝗺 𝗺𝗲𝗺𝗼𝗿𝘆 helps the agent retain insights across sessions for adaptive behavior
• 𝗚𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲 & 𝗢𝗯𝘀𝗲𝗿𝘃𝗮𝗯𝗶𝗹𝗶𝘁𝘆: ensure the agent operates transparently, tracks lineage, and respects policies
• 𝗧𝗲𝗹𝗲𝗺𝗲𝘁𝗿𝘆 & 𝗟𝗼𝗴𝗴𝗶𝗻𝗴: provide insight into what the agent is doing, how decisions are made, and when to intervene
• 𝗣𝗿𝗼𝘁𝗼𝗰𝗼𝗹𝘀 (𝗠𝗖𝗣, 𝗔𝟮𝗔): connect agents, tools, and systems for smooth coordination

I put everything in one place — 𝘁𝗵𝗶𝘀 𝗔𝗜 𝗔𝗴𝗲𝗻𝘁𝘀 𝗖𝗵𝗲𝗮𝘁 𝗦𝗵𝗲𝗲𝘁 — to help you understand each component and how it all connects.

If you’re experimenting with multi-agent systems, or building orchestration layers around LLMs, this will help you see the big picture before you dive into the code.
“Zapier always breaks though, doesn't it? I've heard Make.com is better.”

I've heard this twice now this week (and many times before). Both can break, and so do a lot of other platforms I've used. This shouldn't stop you using them - they have different strengths depending on your need. I use both interchangeably all the time. Make.com is brilliant for complex workflows, but Zapier is quicker for simple ones (the AI Copilot is really good for this). Both are fine so long as you design with failure in mind.

Because here’s the truth: no automation tool is ‘set and forget’. APIs change, triggers misfire, data fields get edited. It’s not the tool’s fault - it’s just software being software and people being people.

What matters isn’t if it breaks but:
→ Do you know when it happens?
→ Can you recover without losing data?
→ Have you built a way to roll back or re-run?
→ Most importantly: have you trained your team on how it works?

It's worth saying I do love both tools, but right now my favourite is Gumloop - powerful, flexible, great for AI-first ops. But give it time and I'm sure I'll hear someone else say the same about it too.
⚙️ Best Practices, Tips & Pro Tricks

✅ Follow These Rules:
Use descriptive names (getCurrentDateTime)
Write natural-language descriptions
Separate Info Tools & Action Tools
Use directReturn = true for fast responses
Keep output simple & human-readable

🧠 Analogy: Each tool = microservice → clean input/output → single purpose

💡 Pro Tip:
Declarative (@Tool) = Simplicity
Programmatic (ToolCallback) = Control

🎯 Conclusion: Organized tools make AI code cleaner, faster, and more explainable.

#SpringAI #AIEngineering #BestPractices #LLM #FunctionCalling #QAAutomation #JavaDevelopment #AITools #CleanCode #PromptEngineering
Amazing how far we’ve come in not even 3 years. 🔥

I have finally tested the new Perplexity Comet Browser. ☄️ I got the invitation and decided to give it a shot after Emanuel Stadler and Jakub Popluhar strongly recommended it. And this might be the beginning of how we actually start working with AI like a teammate.

I recorded a video showing exactly how it works. The Comet Browser doesn’t just answer questions.
👉 It tracks your research.
👉 It understands your context.
👉 And it stays with you while you move across tabs, projects, and tools.

In the video, I ask Comet to help me plan a flowchart for Email Automation. Comet creates the plan, breaks it down, and builds a flowchart inside the canvas. 🤯

This is the first time I’ve felt like an AI tool is thinking with me instead of just reacting to me. We’re used to AI giving outputs. This feels more like a teammate.

You can watch the demo in the video I posted. Curious what this could replace for you? 👀

Happy building 🧑💻
Just shipped selfvcs: a lightweight, local VCS that supercharges your dev workflow with AI smarts! No more bland commits: get semantic messages, change insights (feat/fix/refactor?), risk levels, auto-version bumps, and changelog gen in one CLI.

Core Implementations
1) Local Repository System: Each project is initialized into a structured repository that stores snapshots of code versions locally using file hashing and timestamp-based version mapping.
2) Commit & History Management: The system supports both manual and AI-assisted commit messages (via a local LLM), automatically logging metadata such as time, user, and file change summary.
3) Change Detection Algorithm: A custom comparison module tracks file additions, deletions, and modifications using hashing for efficiency, simulating Git’s diff mechanism in a simplified, educational way.
4) Pull & Restore Functionality: Users can revert to any previous commit safely, restoring project files to their exact state at that point in history.
5) Secure Offline Operation: Every process, from commit creation to history management, runs completely offline. No data leaves the machine, ensuring privacy and full local autonomy.

pip install selfvcs

Key perks:
1) Private & local: Everything stays on your machine.
2) AI-powered: LLM detects changes, suggests pro messages, tags impacts.
3) Simple: Drop-in CLI with core commands like push, log, and more.

All commands:
1) selfvcs init - Initialize repository
2) selfvcs push <files> -m "<message>" - Push with manual message
3) selfvcs push <files> --auto-detect - Push with AI-generated message
4) selfvcs log -n <num> - Show last N commits
5) selfvcs pull <commit_id> -o <dir> - Restore commit to directory
6) selfvcs delete <commit_id> - Delete a commit
7) selfvcs status - Show repo status
8) selfvcs auth set <groq-key> - Set GROQ API key
9) selfvcs changelog -n <num> - Generate changelog from last N commits
10) selfvcs insight <commit_id> - Show insights for a commit

Install & try! Open to feedback!

#ai #groq #innovation
This guy literally builds n8n AI Agents with 1 prompt.

I'm sharing his prompt below, but first some thoughts on how he wrote it:

1. Context Dump Everything
> Don't hold back. Your use case, data sources, desired outputs... throw it all in there. The AI needs to understand your entire ecosystem, not just the happy path.

2. Demand Production-Ready Code
> Explicitly ask for error handling, retry logic, and fallback mechanisms. The difference between a demo and something you can actually ship? Those three things right there.

3. Request the Full Stack
> Node configurations. JSON structures. Testing checklists. Environment variables. When you ask for everything upfront, you get a complete blueprint, not just fragments you'll struggle to piece together later.

4. Include Edge Cases
> Ask for "common edge cases and solutions" in your prompt. This one line saves you from those 2am debugging sessions when everything breaks in production.

5. Make It Maintainable
> Specify "make it maintainable by someone else." This forces cleaner architecture, proper naming conventions, and documentation that your future self (or teammate) will actually thank you for.

Check out his full prompt in comments and use it to build AI Agents in n8n.

P.S. "This Guy" = X / alex_prompter
Comment (Lead Software Engineer @ Carousell, Core member of TAQELAH): Here is a better UI for Promptfoo ;)