Understanding AI SDR Hallucinations and Nuance Challenges

Explore top LinkedIn content from expert professionals.

  • Last week, an AI medical record summary failed to capture critical information about my dad's condition and next steps in his care. Why do AI tools sometimes "hallucinate" lab results or omit critical context?

    There are many known (and unknown) risks with AI tools in healthcare, and most of these risks are embedded at the research and development phase. 🔍 That means this is the phase where scrutiny is warranted, because once a tool is deployed into clinical workflows it's too late. Yet in so many conversations about AI risk in research, I still hear:

    💬 “The only real risk is a data breach,” or
    💬 “AI is just basic statistics, like regression.”

    The worst excuse I've ever heard was:

    💬 "Doctors make the same mistakes all the time."

    These statements concern me, and hopefully they concern you too. While I agree many AI tools are relatively low risk, not all are. For example, deep learning and GenAI tools used to summarize patient records can behave in unpredictable and non-linear ways. These #ComplexSystems operate in dynamic, high-stakes clinical environments. They can have real-world consequences for patients and #ResearchParticipants.

    ⚠️ A small prompt tweak or formatting change in a generative AI summary tool can ripple into misdiagnoses, missed safety alerts, or inappropriate clinical decisions. These aren’t random bugs; they emerge from complex system interactions, like:

    🫥 FEEDBACK LOOPS that reinforce incorrect predictions. Examples: “low-risk” labels lead to less monitoring; AI is used to screen certain groups for study eligibility even though historical screening has systematically excluded minority groups and non-English-speaking patients.
    ⚖️ EMBEDDED/HISTORICAL BIASES in training data amplify health disparities across race, gender, or disability.
    📉 DATA DRIFT: evolving EHR inputs cause the model to misinterpret new formats or trends.
    🥴 HALLUCINATION: fabricating patient details or omitting critical nuances due to token limits or flawed heuristics.
    ... and so much more...

    ⚠️ These risks affect patient and research participant safety and jeopardize #ResearchIntegrity. 🏨 If institutions adopt these tools without recognizing their system-level vulnerabilities, the consequences can be profound and hard to trace. That’s why research institutions need:

    ✅ More technical and algorithmic audits.
    ✅ Governance frameworks that translate these complex behaviors into plain-language, IRB-ready guidance that centers safety, ethics, and compliance.
    ✅ To demystify the system-level risks behind these tools.

    💡 Fortunately, there's a solution 💡 With the right SMEs, we can craft practical, plain-language approaches to improve #IRB review and ethical oversight. Is anyone else working on this at the IRB level? I’d love to compare notes (or maybe even partner on the work!?).

    #AIinHealthcare #ComplexSystems #IRB #GenerativeAI #ClinicalAI #DigitalHealth #ResponsibleAI #AIEthics #HRPP #AIHSR #SaMD

  • View profile for Sohrab Rahimi

    Partner at McKinsey & Company | Head of Data Science Guild in North America

    20,094 followers

    Hallucination in LLMs refers to generating factually incorrect information. This is a critical issue because LLMs are increasingly used in areas where accurate information is vital, such as medical summaries, customer support, and legal advice. Errors in these applications can have significant consequences, underscoring the need to address hallucinations effectively. This paper (https://lnkd.in/ergsBcGP) presents a comprehensive overview of the current research and methodologies addressing hallucination in LLMs. It categorizes over thirty-two different approaches, emphasizing the importance of Retrieval-Augmented Generation (RAG), Knowledge Retrieval, and other advanced techniques. These methods represent a structured approach to understanding and combating the issue of hallucination, which is critical in ensuring the reliability and accuracy of LLM outputs in various applications. Here are the three most effective and practical strategies that data scientists can implement currently:

    1. Prompt Engineering: Adjusting prompts to provide specific context and expected outcomes, improving the accuracy of LLM responses.
    2. Retrieval-Augmented Generation (RAG): Enhancing LLM responses by accessing external, authoritative knowledge bases, which helps in generating current, pertinent, and verifiable responses.
    3. Supervised Fine-Tuning (SFT): Aligning LLMs with specific tasks using labeled data to increase the faithfulness of model outputs. This helps in better matching the model's output with input data or ground truth, reducing errors and hallucinations.
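    A minimal sketch of strategy 2 (RAG), assuming the OpenAI Python SDK; the toy document list, keyword retriever, model name, and prompt wording are illustrative assumptions, not the paper's method.

```python
# Hedged RAG sketch: ground the answer in retrieved passages and instruct the
# model to refuse when the context is insufficient. The document store and
# retriever here are toy placeholders for a real vector store.
from openai import OpenAI

client = OpenAI()

DOCS = [
    "Metformin is a first-line oral medication for type 2 diabetes.",
    "Lisinopril is an ACE inhibitor used to treat high blood pressure.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Toy retrieval: rank documents by word overlap with the query.
    words = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:k]

def grounded_answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-completion model works here
        temperature=0,        # low temperature discourages creative filling-in
        messages=[{
            "role": "user",
            "content": (
                "Answer strictly from the context below. If the context does "
                "not contain the answer, reply 'I don't know.'\n\n"
                f"Context:\n{context}\n\nQuestion: {question}"
            ),
        }],
    )
    return resp.choices[0].message.content

print(grounded_answer("What class of drug is lisinopril?"))
```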

  • View profile for Sid J (Siddhartha Reddy Jonnalagadda)

    LLMs @ Meta | Past: Gemini, NotebookLM @ Google; Amazon Alexa; Microsoft Cortana; UC Berkeley Lecturer; Northwestern Professor; Mayo Researcher; PhD in AI; IIT CS

    15,940 followers

    A recent survey paper (https://lnkd.in/gxmdQQET) has meticulously categorized the wealth of strategies developed to address the phenomenon of 'hallucinations' in Large Language Models (LLMs). This term refers to the instances where LLMs, despite their linguistic prowess, generate content that sounds credible but is actually unfounded or incorrect. The survey provides a high-level taxonomy of hallucination mitigation techniques, dividing them into two principal domains: 'Prompt Engineering' and 'Model Development'.

    'Prompt Engineering' is about fine-tuning the interaction between the user and the AI, ensuring the prompts lead to more accurate outputs. It includes well-known methods such as Retrieval Augmented Generation, where the model pulls in external information to improve response accuracy, and Self-Refinement through Feedback and Reasoning, which enables models to iteratively refine their outputs based on feedback mechanisms.

    'Model Development', on the other hand, gets into the architectural nuts and bolts of LLMs. It spans from introducing new decoding strategies that guide the model's generation phase, using Knowledge Graphs to provide a structured database of facts, to devising new loss functions that reward outputs for their faithfulness to factual input data, and Supervised Fine-Tuning that aligns models more closely with human-labeled data. By understanding and applying these techniques, developers and researchers can make LLMs more reliable, trustworthy, and ultimately more useful for everyone.
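    A minimal sketch of the Self-Refinement through Feedback and Reasoning idea mentioned above, assuming the OpenAI Python SDK; the model name, prompts, and single-pass draft-critique-revise loop are illustrative assumptions, not the survey's exact recipe.

```python
# Hedged sketch of a common self-refinement loop: draft, critique against the
# provided context, then revise to drop unsupported claims.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption: any chat-completion model

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def self_refine(question: str, context: str) -> str:
    draft = ask(
        f"Context:\n{context}\n\nQuestion: {question}\n"
        "Answer using only the context."
    )
    critique = ask(
        f"Context:\n{context}\n\nAnswer:\n{draft}\n\n"
        "List any claims in the answer that the context does not support."
    )
    return ask(
        f"Context:\n{context}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
        "Rewrite the draft, removing every unsupported claim."
    )
```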

  • View profile for Tomasz Tunguz
    Tomasz Tunguz is an Influencer
    401,924 followers

    When a person asks a question of an LLM, the LLM responds. But there's a good chance of some error in the answer. Depending on the model or the question, it could be a 10% chance or 20% or much higher. The inaccuracy could be a hallucination (a fabricated answer) or a wrong answer or a partially correct answer. So a person can enter in many different types of questions & receive many different types of answers, some of which are correct & some of which are not. In this chart, the arrow out of the LLM represents a correct answer. Askew arrows represent errors.

    Today, when we use LLMs, most of the time a human checks the output after every step. But startups are pushing the limits of these models by asking them to chain work. Imagine I ask an LLM-chain to make a presentation about the best cars to buy for a family of 5 people. First, I ask for a list of those cars, then I ask for a slide on the cost, another on fuel economy, yet another on color selection. The AI must plan what to do at each step. It starts with finding the car names. Then it searches the web, or its memory, for the data necessary, then it creates each slide.

    As AI chains these calls together, the universe of potential outcomes explodes. If the LLM errs at the first step (it finds 4 cars that exist, 1 car that is hallucinated, & a boat), then the remaining effort is wasted. The error compounds from the first step & the deck is useless.

    As we build more complex workloads, managing errors will become a critical part of building products. Design patterns for this are early. I imagine it this way (third chart): at the end of every step, another model validates the output of the AI. Perhaps this is a classical ML classifier that checks the output of the LLM. It could also be an adversarial network (a GAN) that tries to find errors in the output. The effectiveness of the overall chained AI system will be dependent on minimizing the error rate at each step. Otherwise, AI systems will make a series of unfortunate decisions & their work won't be very useful.
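    A back-of-the-envelope illustration of the compounding argument above, plus a sketch of the validate-after-every-step pattern; the function names and the validator hook are hypothetical, not any specific product's design.

```python
# If each step in a chain is correct with probability p, n independent steps
# all succeed with probability p**n, so errors compound quickly.
def chain_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

print(round(chain_success(0.90, 5), 2))  # ~0.59: five 90%-accurate steps chained

# Hedged sketch of per-step validation: stop the chain as soon as an
# intermediate output fails a check, instead of wasting the remaining steps.
# `validate` could be a classical ML classifier, a rules check, or another model.
def run_chain(steps, validate):
    output = None
    for i, step in enumerate(steps):
        output = step(output)
        if not validate(output):
            raise ValueError(f"Step {i} failed validation; stopping the chain.")
    return output
```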

  • View profile for Stacie Buck, RHIA, CCS-P, CPCO, CIRCC, CCC, RCC, RCCIR

    👑 Queen of IR Coding™ | Renowned CIRCC Course Creator | Expert Interventional & Diagnostic Radiology Coding Auditor & Educator | Author of Cracking the IR Code™ : Your Comprehensive Guide to Mastering IR Coding

    17,820 followers

    What are AI Hallucinations? An AI hallucination refers to a situation where an artificial intelligence system, while processing data or generating output, produces incorrect, unexpected, or nonsensical results. This phenomenon typically occurs in complex AI models like those used in natural language processing or image recognition. These hallucinations happen because AI systems, especially those relying on machine learning, base their outputs on patterns they've learned from their training data. If the training data is limited, biased, or doesn't cover certain scenarios, the AI might "fill in the gaps" inappropriately, leading to these odd or incorrect results. AI doesn't think like humans and doesn't have real-world understanding or common sense, so its errors can sometimes be quite strange or funny.

    Imagine an AI as a smart computer program that assists in picking codes for different health problems. Let's say you're playing a game where you have to match descriptions of health problems with the right code from a huge list. You have a robot assistant in the game to help suggest which code matches each description. Most of the time, this robot is really good at its job and suggests the right code. But sometimes, it gets confused. Maybe the way the health problem is described is a bit tricky, or maybe the robot remembers something that's similar but not exactly the same. This is similar to what we call an "AI hallucination" in real life.

    For example, if a doctor says a patient has a "cough and fever," the AI might remember that these symptoms usually mean the patient has a common cold, which has its own ICD-10 code. But what if the patient actually has something else that also causes a cough and fever? The AI might still suggest the code for a common cold because it's not great at noticing the small details that make this patient's situation different.

    Imagine a doctor notes that a patient is experiencing shortness of breath and fatigue. The AI system, trained on medical data, might recall that these symptoms often align with a condition like asthma, which has a specific ICD-10 code. Therefore, the AI suggests the asthma code. However, what if the patient's symptoms are actually due to heart disease? Shortness of breath and fatigue can also be symptoms of heart conditions, but the AI might miss this because it's focusing on the most common associations in its training data. It doesn't pick up on other subtle signs in the patient's records that point to heart disease rather than asthma. That's how an AI can sometimes choose the wrong ICD-10 code. It's like your game's robot giving you an answer it's confident about, but it's actually based on a little mix-up.

    #AIFriday #AIinHealthcare #MedicalCoding #medicalbilling #revenuecyclemanagement #autonomouscoding #medicalai #artificialintelligence #ChatGPT
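    A toy illustration of the failure mode described above: a suggester that always returns the most common code for a symptom pattern will miss rarer causes. The symptom-to-code table and its frequency weights are invented for illustration; only the ICD-10 code meanings are real.

```python
# Hypothetical frequency table: for each symptom pattern, candidate ICD-10
# codes with made-up prevalence weights (J00 = common cold, J18.9 = pneumonia,
# J45.909 = unspecified asthma, I50.9 = heart failure, unspecified).
SYMPTOM_TO_CODES = {
    ("cough", "fever"): [("J00", 0.7), ("J18.9", 0.3)],
    ("shortness of breath", "fatigue"): [("J45.909", 0.6), ("I50.9", 0.4)],
}

def naive_suggest(symptoms: tuple[str, ...]) -> str:
    # Always picks the most common association, ignoring the rest of the chart.
    candidates = SYMPTOM_TO_CODES[symptoms]
    return max(candidates, key=lambda pair: pair[1])[0]

print(naive_suggest(("shortness of breath", "fatigue")))  # J45.909 (asthma)
# ...even when other notes in the record actually point to heart failure (I50.9).
```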

  • View profile for Elena Gurevich

    AI Policy-Curious Attorney | AI Legal Strategy, Governance & Compliance | EU GPAI Code of Practice Working Groups | Owner @ EG Legal Services | Board Member, Center for Art Law

    9,329 followers

    A new study by the Stanford Institute for Human-Centered Artificial Intelligence (HAI) examines "legal hallucinations" in LLMs (GPT-3.5, Llama 2, and PaLM 2). One of the findings showed that large language models' performance "deteriorates when dealing with more complex tasks that require a nuanced understanding of legal issues or interpretation of legal texts. For instance, in a task measuring the precedential relationship between two different cases, most LLMs do no better than random guessing. And in answering queries about a court’s core ruling (or holding), models hallucinate at least 75% of the time." Link to paper in comments.

  • View profile for Rodney W. Zemmel
    Rodney W. Zemmel is an Influencer

    Global Head of the Blackstone Operating Team

    39,936 followers

    Don't be afraid of hallucinations! It's usually an early question in most talks I give on GenAI: "But doesn't it hallucinate? How do you use a technology that makes things up?" It's a real issue, but it's a manageable one.

    1. Decide what level of accuracy you really need in your GenAI application. For many applications it just needs to be better than a human, or good enough for a human first draft. It may not need to be perfect.
    2. Control your inputs. If you do your "context engineering" well, you can better point the model to the data you want. Well-written prompts will also reduce the need for unwanted creativity!
    3. Pick a "temperature". You can select a model setting that is more "creative" or one that sticks more narrowly to the facts. This adjusts the internal probabilities. The "higher temperature" results can often be more human-like and more interesting.
    4. Cite your sources. RAG and other approaches allow you to be transparent about what the answers are based on, to give a degree of comfort to the user.
    5. AI in the loop. You can build an AI "checker" to assess the quality of the output.
    6. Human in the loop. You aren't going to just rely on the AI checker, of course!

    In the course of a few months we've seen concern around hallucinations go from a "show stopper" to a "technical parameter to be managed" for many business applications. It's by no means a fully solved problem, but we are highly encouraged by the pace of progress. #mckinseydigital #quantumblack #generativeai
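    A minimal sketch of points 2 through 4 above (control the inputs, turn the temperature down, cite the sources), assuming the OpenAI Python SDK; the model name, prompt wording, and source dictionary are illustrative assumptions.

```python
# Hedged sketch: supply the context yourself, keep temperature low, and return
# which sources were given so the answer can be checked against them.
from openai import OpenAI

client = OpenAI()

def answer_with_sources(question: str, sources: dict[str, str]) -> dict:
    context = "\n\n".join(f"[{name}]\n{text}" for name, text in sources.items())
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # assumption: any chat-completion model
        temperature=0.1,       # low temperature = fewer creative detours
        messages=[{
            "role": "user",
            "content": (
                "Using only the sources below, answer the question and name "
                "the source that supports each claim.\n\n"
                f"{context}\n\nQuestion: {question}"
            ),
        }],
    )
    return {"answer": resp.choices[0].message.content, "sources": list(sources)}
```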

  • View profile for Umer Khan M.

    Physician | Futurist | Angel Investor | Custom Software Development | AI in Healthcare Free Course | Digital Health Consultant | YouTuber | AI Integration Consultant | In the pursuit of constant improvement

    15,120 followers

    🚀🧠 𝗡𝗮𝘃𝗶𝗴𝗮𝘁𝗶𝗻𝗴 𝗛𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗶𝗼𝗻𝘀 𝗶𝗻 𝗟𝗮𝗿𝗴𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀: 𝗧𝗼𝘄𝗮𝗿𝗱𝘀 𝗥𝗲𝗮𝗹-𝗪𝗼𝗿𝗹𝗱 𝗥𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆

    As we venture deeper into the realm of AI and Large Language Models (LLMs), one challenge stands out: their tendency to '𝗵𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗲'. This phenomenon, while sometimes a boon in creative fields, often emerges as a hurdle in practical applications where precision is paramount.

    🔍 𝗛𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗶𝗼𝗻𝘀 𝗶𝗻 𝗟𝗟𝗠𝘀: 𝗔 𝗗𝗼𝘂𝗯𝗹𝗲-𝗘𝗱𝗴𝗲𝗱 𝗦𝘄𝗼𝗿𝗱
    Hallucinations in LLMs can spark innovation in exploratory fields, but in most real-world scenarios, they pose a risk to accuracy and reliability. Addressing this is crucial for the broader adoption of these advanced models.

    🛠️ 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀 𝘁𝗼 𝗠𝗶𝘁𝗶𝗴𝗮𝘁𝗲 𝗛𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗶𝗼𝗻𝘀
    Recent advancements have introduced effective ways to mitigate these hallucinations. A combination of techniques like RAG (Retrieval-Augmented Generation) and prompt tuning is proving successful. By instructing LLMs to adhere strictly to provided context, we're seeing a significant leap in accuracy and applicability.

    📖 𝗘𝘅𝗽𝗹𝗼𝗿𝗶𝗻𝗴 𝗖𝘂𝘁𝘁𝗶𝗻𝗴-𝗘𝗱𝗴𝗲 𝗧𝗲𝗰𝗵𝗻𝗶𝗾𝘂𝗲𝘀
    A survey paper titled "A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models" offers a deep dive into these methods. They can be categorized broadly into:
    𝟭. 𝗣𝗿𝗼𝗺𝗽𝘁 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴-𝗕𝗮𝘀𝗲𝗱 𝗠𝗲𝘁𝗵𝗼𝗱𝘀: These involve crafting input prompts to minimize hallucinations and self-refinement through feedback and reasoning, enhancing the model's output accuracy.
    𝟮. 𝗗𝗲𝘃𝗲𝗹𝗼𝗽𝗺𝗲𝗻𝘁𝗮𝗹 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀 𝗳𝗼𝗿 𝗟𝗟𝗠𝘀: Adjustments in model architecture and training, like new decoding strategies, integrating knowledge graphs, and implementing faithfulness-based loss functions, are key. Supervised Fine-Tuning (SFT) on quality datasets further ensures factual consistency.

    💡 𝗧𝗵𝗲 𝗙𝘂𝘁𝘂𝗿𝗲 𝗼𝗳 𝗟𝗟𝗠𝘀
    Incorporating these techniques in new LLMs suggests a future with minimized hallucinations, enhancing their practicality in various sectors. The nuanced use of hallucination properties can still be beneficial in certain creative or exploratory applications.

    I'd love to hear your thoughts. How do you see these developments shaping the use of LLMs in your industry? What are the potential impacts of more accurate and reliable AI models in your field? Join the conversation about shaping a future where AI not only innovates but also informs with precision.

    #AI #LargeLanguageModels #MachineLearning #TechInnovation #AIin2024 #DigitalTransformation

    Source: https://lnkd.in/dZkFX8Ti

    ------
    Reza Eghbal Manuel Mitola, MBA Philippe GERWILL Dylan Johannes Boshkow Dr. Martha Boeckenfeld Paul Blocchi Patrick Cheng Deep D. Les Shute Kai Saeger Katrina Delargy Timothy Riffe Alok Anand Sai Pradeep Srinivasa Ghadeer A. Elliott A. Vishal Falke

  • View profile for Sireesha Pulipati

    Staff Data Engineer 🌀 Shopify 🌐 Ex-Google 📘 Author 🎓 Stanford GSB 👩💻 Google Developer Expert 🤝 Mentor

    4,330 followers

    Hallucinations - those moments when AI systems invent their own "facts" - are a growing concern as #LLMs power more applications. Hallucinations are inevitable. Intuit, building a comprehensive AI-powered finance intelligence platform, is heavily invested in reliable LLM development. Their research focuses on:
    ⚡ LLM Reliability: Making sure AI outputs are trustworthy.
    ⚡ LLM Optimization: Fine-tuning AI for specific tasks.
    ⚡ LLM-Optimized Applications: Building the best tools possible.

    The key to tackling #hallucinations? Accurate and reliable detection! At the recent #IntuitDevMeetup, Jiaxin Zhang, an Intuit Staff Research Scientist, presented a groundbreaking approach co-developed with Vanderbilt University: Semantic-aware cross-check consistency (https://lnkd.in/gzKvaWkT). Here is the gist:
    💡 Many fact-checking approaches require access to the output probability distribution, which may not be available for black-box systems such as #ChatGPT.
    💡 The alternative is a sampling approach based on the idea that if an #LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts. For hallucinated facts, however, stochastically sampled responses are likely to diverge and contradict one another.
    💡 Self-check consistency approaches still do not work well where the responses are consistent but factually wrong.
    💡 Cross-check approaches such as cross-question, cross-model, or a combination of the two provide much better detection of non-factual hallucinations.

    #AI #research
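    A minimal sketch of the sampling-and-consistency idea described above, assuming the OpenAI Python SDK; using a second LLM call as the consistency judge is a common simplification here, not the exact method from the linked paper, and the model name is an assumption.

```python
# Hedged sketch: draw several stochastic answers to the same question and
# check whether they support a candidate answer. Divergent samples are a
# signal that the candidate may be hallucinated.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption: any chat-completion model

def sample_answers(question: str, n: int = 5) -> list[str]:
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=1.0,  # high temperature produces diverse samples
        n=n,              # number of independent completions
        messages=[{"role": "user", "content": question}],
    )
    return [choice.message.content for choice in resp.choices]

def consistent(answer: str, samples: list[str]) -> bool:
    joined = "\n---\n".join(samples)
    verdict = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        messages=[{"role": "user", "content": (
            f"Candidate answer:\n{answer}\n\nIndependent samples:\n{joined}\n\n"
            "Do the samples support the candidate answer? Reply YES or NO."
        )}],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("YES")
```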

  • View profile for Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    40,586 followers

    OpenAI says reusing three key parameters can substantially reduce hallucinations and encourage deterministic generations. tl;dr set the same seed and temperature parameters with each GPT API call to mitigate LLMs' indeterministic nature.

    How?
    (1) Set a seed by choosing any number and using it consistently across API requests
    (2) Ensure all other parameters (prompt, temperature, top-p) are identical for each call
    (3) Monitor the system_fingerprint field and ensure it doesn't change

    𝗘𝗹𝗮𝗯𝗼𝗿𝗮𝘁𝗲𝗱 𝗲𝘅𝗽𝗹𝗮𝗻𝗮𝘁𝗶𝗼𝗻
    Many developers don’t know that every GPT API call returns an extra parameter called system_fingerprint, which is OpenAI's identifier for the currently running GPT model configuration. Storing and reusing the seed parameter for future API calls is likely to return the same result for the same system_fingerprint. Setting the same temperature would further increase the likelihood of consistent results.

    What do these three parameters have to do with reducing hallucinations?
    (a) It is easier to identify hallucination patterns when responses are more consistent, i.e. similar, and to employ safety nets that mitigate downstream implications
    (b) More consistent generations also reduce the probability of a new hallucination pattern slipping through the already-deployed safety nets

    Combined with advanced prompt engineering techniques, hallucinations can be significantly diminished: https://lnkd.in/g7_6eP6y

    I’d be excited to see researchers publish the seed, system_prompt, temperature, and prompt in an AIConfig [0] format so others can easily reproduce their results. This would foster more reliable and trustworthy research in times when the AI community questions the credibility of reported benchmarks.

    [0] https://lnkd.in/gmvNTf8g from LastMile AI
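    A minimal sketch of the recipe above using the OpenAI Python SDK: fix the seed and temperature, keep the prompt identical, and watch system_fingerprint for backend changes that would explain result drift. The model name and default seed value are illustrative assumptions.

```python
# Hedged sketch: reuse seed and temperature across calls, and surface
# system_fingerprint so callers can tell when the backend configuration changed.
from openai import OpenAI

client = OpenAI()

def reproducible_call(prompt: str, seed: int = 42) -> tuple[str, str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # assumption: any chat-completion model
        seed=seed,             # same seed across calls
        temperature=0,         # same temperature across calls
        messages=[{"role": "user", "content": prompt}],
    )
    # If system_fingerprint differs between calls, outputs may differ even
    # with identical parameters, because the serving configuration changed.
    return resp.choices[0].message.content, resp.system_fingerprint
```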
