Applications of Large Language Models in
Materials Discovery and Design
Anubhav Jain
Lawrence Berkeley National Laboratory
MRS Fall meeting, Nov 2023
Slides (already) posted to hackingmaterials.lbl.gov
Today is the 1-year birthday of ChatGPT!
To celebrate the occasion, I used ChatGPT to generate an image of a birthday cake for itself
Somehow, the results tell you a lot of what you need to know about the current state of these kinds of models
Prior to LLMs, we trained custom models to perform simple NLP tasks and did just “OK”
• A little over a year ago, even simple tasks like labeling words into categories (named entity recognition, “NER”) required custom models
• The models took time to develop and train
• For example, we tried a custom BERT model that took 1 month to train on 8 NVIDIA V100 GPUs … and got slightly better performance than simpler models
Weston, L.; Tshitoyan, V.; Dagdelen, J.; Kononova, O.; Trewartha, A.; Persson, K. A.; Ceder, G.; Jain, A. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. 2019.
The NER was also just the first step to more complex data extraction
[Figure: multi-step extraction pipeline; step labels not recoverable. Accuracy annotations from the figure:]
• About 80%–90% accuracy achieved
• ~60% accuracy (based on our internal testing)
• Accuracy unclear, as good test sets unavailable. Maybe 70%?
“Structured information extraction from complex scientific text with fine-tuned large language models”, in review, https://arxiv.org/abs/2212.05238
Things are much easier today …
• We no longer design the LLMs ourselves
• Training / fine-tuning is done via an API
• We mainly focus on domain-specific labeling and labeling efficiency …
• Others use “zero-shot” LLMs, so they don’t even need to label or fine-tune!
“Structured information extraction from complex scientific text with fine-tuned large language models”, in review, https://arxiv.org/abs/2212.05238
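To make the zero-shot idea concrete, here is a minimal Python sketch of the two pieces that stay under our control when the model sits behind an API: building the prompt and validating the reply. Everything named here (`PROMPT_TEMPLATE`, `build_prompt`, `parse_completion`, `fake_llm`, the JSON layout, and the example sentence) is an assumption made for illustration and is not taken from the cited paper. `fake_llm` stands in for whichever hosted model would actually be called.

```python
import json

# Hypothetical zero-shot prompt; the JSON schema is an assumption for this sketch.
PROMPT_TEMPLATE = (
    "Extract every host material and its dopants from the text below.\n"
    'Answer with JSON only: a list of {{"host": str, "dopants": [str]}} objects.\n\n'
    "Text: {text}\n"
)


def build_prompt(text: str) -> str:
    """Fill the template with one abstract or sentence."""
    return PROMPT_TEMPLATE.format(text=text.strip())


def parse_completion(completion: str) -> list:
    """Keep only well-formed records; return [] if the reply is not valid JSON."""
    try:
        records = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if not isinstance(records, list):
        return []
    clean = []
    for rec in records:
        if (
            isinstance(rec, dict)
            and isinstance(rec.get("host"), str)
            and isinstance(rec.get("dopants"), list)
        ):
            clean.append({"host": rec["host"], "dopants": [str(d) for d in rec["dopants"]]})
    return clean


def fake_llm(prompt: str) -> str:
    """Stand-in for a real API call; returns a canned reply."""
    return '[{"host": "Lu2O3", "dopants": ["Eu", "Yb"]}]'


# Made-up example sentence, not a real abstract.
abstract = "Eu- and Yb-doped Lu2O3 ceramics were sintered and characterized."
records = parse_completion(fake_llm(build_prompt(abstract)))
print(records)
```

Validating the reply before storing it matters because a zero-shot model is not guaranteed to follow the requested format.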
This means we can focus on applications!
E.g., doping
• Doping is difficult to calculate, and there is no large doping database
• It is therefore a good application for NLP data extraction
Mapping the doping in specific materials
[Figure labels: Mn-doped (52 mentions), Cr-doped (83 mentions), N-doped (46 mentions), Fe-doped (80 mentions)]
• Based on parsing ~350,000 scientific abstracts
• Final data set contains >200,000 host-dopant links with F1 score ~0.8
• Using the data set, we can look up the doping data for any material composition, along with applications tied to each specific dopant
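The lookup described in the last bullet can be pictured as an index from host to dopant counts and applications. The Python sketch below is only an illustration of that idea: the `LINKS` records are invented toy entries, not rows from the actual data set, and `build_index` and `lookup` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Toy (host, dopant, application) links, invented for illustration only.
LINKS = [
    ("TiO2", "N", "photocatalysis"),
    ("TiO2", "Fe", "photocatalysis"),
    ("TiO2", "N", "solar cells"),
    ("ZnO", "Al", "transparent conductors"),
]


def build_index(links):
    """Group links by host: mention counts per dopant, plus applications per dopant."""
    counts = defaultdict(Counter)
    apps = defaultdict(lambda: defaultdict(set))
    for host, dopant, app in links:
        counts[host][dopant] += 1
        apps[host][dopant].add(app)
    return counts, apps


def lookup(host, counts, apps):
    """Return (dopant, mentions, applications) tuples, most-mentioned dopant first."""
    return [(d, n, sorted(apps[host][d])) for d, n in counts[host].most_common()]


counts, apps = build_index(LINKS)
for dopant, mentions, applications in lookup("TiO2", counts, apps):
    print(f"{dopant}-doped ({mentions} mentions): {', '.join(applications)}")
```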
Predicting dopants
Given partial information about a material’s dopants, we can predict what other dopants may be likely using collaborative filtering
[Figure: worked example for Lu2O3 dopants]
Training: dopant counts by element, in decreasing frequency: Eu, Yb, Er, Tm, … (legend: masked solution vs. what the algorithm sees); n = 3 (three masked solutions)
Prediction: in decreasing recommendation strength: Sr, Y, Eu, Ni, Yb; k = 5 (5 guesses allowed)
3rd & 5th predictions correct; 1st, 2nd, & 4th predictions wrong
2 of 3 solutions recovered (~67%) in k = 5 guesses
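The mask-and-recover evaluation can be sketched in a few lines of Python. This is not the model from the talk: the recommender is a simple co-occurrence form of collaborative filtering, chosen only because it is short, and the host names and dopant lists in `TOY_HOSTS` are invented. The slide does not make clear which third dopant was masked, so `"Er"` in `slide_masked` is a placeholder assumption.

```python
from collections import Counter
from itertools import permutations


def cooccurrence(host_dopants):
    """Count how often each ordered pair of dopants is reported for the same host."""
    co = Counter()
    for dopants in host_dopants.values():
        for a, b in permutations(set(dopants), 2):
            co[(a, b)] += 1
    return co


def recommend(visible, co, k):
    """Rank unseen dopants by total co-occurrence with the visible ones; keep the top k."""
    scores = Counter()
    for (a, b), n in co.items():
        if a in visible and b not in visible:
            scores[b] += n
    ranked = sorted(scores, key=lambda d: (-scores[d], d))  # ties broken by name
    return ranked[:k]


def recall_at_k(masked, predictions):
    """Fraction of masked dopants that appear among the predictions."""
    return sum(1 for d in masked if d in predictions) / len(masked)


# Toy training data (invented): host label -> known dopants.
TOY_HOSTS = {"A": ["Eu", "Yb", "Er"], "B": ["Eu", "Yb"], "C": ["Eu", "Tm"]}
co = cooccurrence(TOY_HOSTS)
print(recommend({"Eu"}, co, k=2))

# Scoring the slide's example: k = 5 guesses against n = 3 masked dopants.
slide_predictions = ["Sr", "Y", "Eu", "Ni", "Yb"]
slide_masked = ["Eu", "Yb", "Er"]  # "Er" is an assumed placeholder
print(recall_at_k(slide_masked, slide_predictions))
```

With two of the three masked dopants among the five guesses, `recall_at_k` returns 2/3, which is the "2 of 3 solutions" figure on the slide.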
Model does OK – although room for improvement
If you mask the 3 top known dopants and try to re-predict them in 5 guesses, you recover ~35% of them (about 1 of the 3)
Data across >2000 hosts
We will share the full data set with the community so they can also try to make models
Thoughts on the future - RAG
• Previously, ChatGPT tried to answer all questions “from memory”
• Led to hallucination and other issues
• Now, ChatGPT can search the web to answer questions (retrieval-augmented generation, or RAG)
• One could also search code documentation, user manuals, long reports, journal articles, etc. to produce answers
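The retrieval half of RAG can be illustrated without any model at all: find the passages most similar to the question, then paste them into the prompt. The Python sketch below uses plain bag-of-words cosine similarity, which is an assumption made for brevity (production systems typically use learned embeddings). The `HANDBOOK` passages are invented, and `retrieve` and `build_prompt` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, passages, top_k=2):
    """Return the top_k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(passages, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:top_k]


def build_prompt(question, passages, top_k=2):
    """Assemble the text that would be sent to the language model."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, top_k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}\n"


# Invented handbook snippets, for illustration only.
HANDBOOK = [
    "Group meetings are held weekly and every member presents once per semester.",
    "Request computing time by emailing the cluster administrator.",
    "Travel reimbursements must be submitted within 30 days of the trip.",
]

question = "How do I request computing time?"
print(build_prompt(question, HANDBOOK, top_k=1))
```

Grounding the answer in retrieved text, rather than in the model's memory, is what reduces the hallucination problem mentioned above.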
Example – turning our group handbook into a Q&A tool in ~1 hour using GPT Apps
Too much reading for most people … So many words!
Training GPT (via conversation) to deliver information from the handbook via Q&A
Examples of the GPT tool
How will this change materials science in the next few years?
• One change will be a transformation of user interfaces
• Materials databases will be natively integrated with LLM interfaces
• APIs will be easier to use since LLMs will help translate human intent to API calls
“Show me materials from Materials Project that contain Ca, have a band gap >1.2 eV, and have a bulk modulus >100 GPa.”
“Also include materials from OQMD, JARVIS, and any other materials databases you are aware of.”
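One way to picture "translating human intent to API calls" is as a two-step process: the LLM turns the sentence into a small structured query, and ordinary code executes that query. The Python sketch below covers only the second step, under stated assumptions. `MaterialQuery`, its field names, and the JSON in `LLM_OUTPUT` are hypothetical and do not correspond to any real database client. The `RECORDS` entries carry placeholder numbers, not real property data.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MaterialQuery:
    """Hypothetical structured form of the first request on the slide."""
    elements: list
    min_band_gap_ev: float = 0.0
    min_bulk_modulus_gpa: float = 0.0
    databases: list = field(default_factory=lambda: ["Materials Project"])

    def matches(self, record):
        """True if the record satisfies every filter in the query."""
        return (
            record["database"] in self.databases
            and set(self.elements) <= set(record["elements"])
            and record["band_gap_ev"] > self.min_band_gap_ev
            and record["bulk_modulus_gpa"] > self.min_bulk_modulus_gpa
        )


# What an LLM might emit for the first request (written by hand for this sketch).
LLM_OUTPUT = (
    '{"elements": ["Ca"], "min_band_gap_ev": 1.2, '
    '"min_bulk_modulus_gpa": 100, "databases": ["Materials Project"]}'
)
query = MaterialQuery(**json.loads(LLM_OUTPUT))

# Placeholder records; the numbers are invented, not real property data.
RECORDS = [
    {"id": "toy-1", "database": "Materials Project", "elements": ["Ca", "O"],
     "band_gap_ev": 3.0, "bulk_modulus_gpa": 110.0},
    {"id": "toy-2", "database": "Materials Project", "elements": ["Ca", "F"],
     "band_gap_ev": 7.0, "bulk_modulus_gpa": 80.0},
    {"id": "toy-3", "database": "OQMD", "elements": ["Ca", "Ti", "O"],
     "band_gap_ev": 2.0, "bulk_modulus_gpa": 150.0},
]

first_hits = [r["id"] for r in RECORDS if query.matches(r)]

# The follow-up request only widens the list of databases.
query.databases += ["OQMD", "JARVIS"]
second_hits = [r["id"] for r in RECORDS if query.matches(r)]
print(first_hits, second_hits)
```

Keeping the query as explicit data, rather than letting the model call the database directly, makes it possible to inspect and validate the request before it runs.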
Acknowledgements
• Alex Dunn
• John Dagdelen
• Nick Walker
• Sanghoon Lee
• Amalie Trewartha
• Leigh Weston
• Kristin Persson
• Gerbrand Ceder
Funded by Toyota Research Institute and DOE-BES Materials Project program
