The Status of ML Algorithms
for Structure-property Relationships
Using Matbench as a Test Protocol
Anubhav Jain
Lawrence Berkeley National Laboratory
TMS Spring 2022, March 2022
Slides (already) posted to hackingmaterials.lbl.gov
ML is quickly becoming a standard tool for
materials screening
[Figure: screening funnel narrowing millions of candidates through machine learning, high-throughput DFT, expensive calculation, and experiment]
There are many new algorithms being published
for ML in materials –
New ones constantly reported!
Q: Which one is the “best”
based on the literature?
A: Can’t tell! They’re nearly
all done on different data.
Difficulty of comparing ML algorithms
• Different data sets (studies A, B, and C each use their own)
  • Source (e.g., OQMD vs. MP)
  • Quantity (e.g., MP 2018 vs. MP 2019)
  • Subset / data filtering (e.g., ehull < X)
• Different evaluation metrics
  • Test set vs. cross-validation?
  • Different test set fraction?
• Often no runnable version of a published algorithm
Example: is "MAE, 5-fold CV = 0.102 eV" better or worse than "RMSE, test set = 0.098 eV"? The two numbers cannot be compared directly.
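To make the metric mismatch concrete, here is a minimal sketch that scores one set of predictions with both MAE and RMSE. The energies are invented for illustration and do not come from any study.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical energies in eV (invented values, one large outlier error).
y_true = [0.00, 0.10, 0.20, 0.30, 0.40]
y_pred = [0.05, 0.05, 0.25, 0.25, 0.80]

print(f"MAE  = {mae(y_true, y_pred):.3f} eV")
print(f"RMSE = {rmse(y_true, y_pred):.3f} eV")
```

RMSE is never smaller than MAE on the same predictions, so a lower RMSE reported on one data set says little about how it compares with an MAE reported on another.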
What’s needed – an “ImageNet” for materials science
https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/
What does a standard data set do for a field?
One reason computer science / machine learning seems to advance so quickly is that the field decouples data generation from algorithm development.
This lets groups focus on algorithm development without the data generation, data cleaning, etc. that often make up the majority of an end-to-end data science project.
The ingredients of the Matbench benchmark
☐ Standard data sets
☐ Standard test splits according to a nested cross-validation procedure
☐ An online leaderboard that encourages reproducible results
How to design good data sets for materials science?
• There is no single type of problem that materials scientists are trying to solve
• For now, focus on materials property prediction (from structure or composition)
• We want a test set that contains a diverse array of problems
  • Smaller data versus larger data
  • Different applications (electronic, mechanical, etc.)
  • Composition-only or structure information available
  • Experimental vs. ab initio
  • Classification or regression
Matbench includes 13 different ML tasks
Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138. https://doi.org/10.1038/s41524-020-00406-3.
The tasks encompass a variety of problems
The ingredients of the Matbench benchmark
✓ Standard data sets
☐ Standard test splits according to a nested cross-validation procedure
☐ An online leaderboard that encourages reproducible results
The most common method: a single hold-out test set
• Training/validation is used for model selection
• Test/hold-out is used only for error estimation (i.e., final score)
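As a rough illustration of the single hold-out scheme, the sketch below partitions sample indices into training, validation, and test sets. The 60/20/20 fractions and the fixed seed are assumptions chosen for the example, not values prescribed by the talk.

```python
import random

def holdout_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle sample indices once and cut them into train / validation / test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test = indices[:n_test]                # used only for the final score
    val = indices[n_test:n_test + n_val]   # used for model selection
    train = indices[n_test + n_val:]       # used for fitting
    return train, val, test

train_idx, val_idx, test_idx = holdout_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The final score then depends on which samples happened to land in the one test set, which is the weakness that nested CV addresses.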
Nested CV as a standard scoring metric
Nested CV is like hold-out, but varies the hold-out set. Think of it as k different “universes”: in each universe we have a different training + validation of the model and a different hold-out set.
“A nested CV procedure provides an almost unbiased estimate of the true error.”
Varma and Simon, Bias in error estimation when using cross-validation for model selection (2006)
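The "k universes" picture can be sketched as index bookkeeping: an outer loop rotates the hold-out fold, and an inner loop splits the remaining data into training and validation sets. The fold counts, the seeding, and the helper names `kfold_indices` and `nested_cv_splits` are hypothetical choices for this sketch and are not Matbench's actual settings.

```python
import random

def kfold_indices(indices, k, seed=0):
    """Shuffle `indices` and deal them into k nearly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_fold, train, val, test) index lists for nested CV."""
    outer_folds = kfold_indices(range(n_samples), outer_k, seed)
    for i, test in enumerate(outer_folds):
        # Everything outside the current hold-out fold is available for model selection.
        train_val = [idx for j, fold in enumerate(outer_folds) if j != i for idx in fold]
        inner_folds = kfold_indices(train_val, inner_k, seed + i + 1)
        for val in inner_folds:
            val_set = set(val)
            train = [idx for idx in train_val if idx not in val_set]
            yield i, train, val, test

splits = list(nested_cv_splits(n_samples=30))
print(len(splits))  # outer_k * inner_k splits in total
```

Each sample appears in exactly one outer hold-out fold, so averaging the k outer scores uses every sample for testing once while keeping it out of the model selection that produced its prediction.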
The ingredients of the Matbench benchmark
✓ Standard data sets
✓ Standard test splits according to a nested cross-validation procedure
☐ An online leaderboard that encourages reproducible results
Matbench Website – now complete!
https://matbench.materialsproject.org
Matbench compares ML algorithms
[Figure: relative performance of algorithms on each task; annotations: bigger datasets, better relative performance]
Access to Datasets/ML tasks
• Interactively, via Materials Project: ml.materialsproject.org
• Programmatically via matbench in Python (2 lines; loads all 13 tasks)
• Programmatically via matminer in Python (2 lines): https://github.com/hackingmaterials/matminer
• Direct download, via matbench.materialsproject.org
Preferred/easiest method!
Programmatic Access and Analysis of Submissions
• Run a benchmark on your own algorithm in ~10 lines of code
• Run on any combination, or all, of the 13 existing tasks
• If your entry outperforms an existing entry, submit your algorithm in a pull request!
Existing notebooks/code and software requirements for reproducing any benchmark, e.g.:
{'python': [['crabnet==1.2.1', 'scikit_learn==1.0.2', 'matbench==0.5']]}
Comprehensive raw data on all benchmarks (accessible via the matbench Python package or any JSON-capable language), publicly available to anyone!
In-depth performance metrics for individual ML tasks for all submissions, available both visually on the website and programmatically
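Because the software requirements are stored as plain data, they can be read with a few lines in any language. The sketch below turns the requirements dictionary shown on this slide into install commands. Reading each inner list as one group of pinned packages to install together is an assumption, and `pip_commands` is a hypothetical helper, not part of the matbench package.

```python
# Requirements record as shown on the slide.
requirements = {
    "python": [["crabnet==1.2.1", "scikit_learn==1.0.2", "matbench==0.5"]]
}

def pip_commands(reqs):
    """Build one `pip install` command string per group of pinned packages."""
    return ["pip install " + " ".join(group) for group in reqs.get("python", [])]

commands = pip_commands(requirements)
for command in commands:
    print(command)
```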
The ingredients of the Matbench benchmark
✓ Standard data sets
✓ Standard test splits according to a nested cross-validation procedure
✓ An online leaderboard that encourages reproducible results
What algorithms have been tested on the Matbench data set so far?
• Magpie + sine Coulomb matrix random forest (feature-based random forests)
  • Ward, L., Agrawal, A., Choudhary, A. et al. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2, 16028 (2016). https://doi.org/10.1038/npjcompumats.2016.28
  • Faber, Felix, et al. "Crystal structure representations for machine learning models of formation energies." International Journal of Quantum Chemistry 115.16 (2015): 1094-1101.
• Automatminer (feature-based AutoML)
  • Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138.
• CGCNN (graph neural network)
  • Xie, T.; Grossman, J. C. Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. Phys. Rev. Lett. 2018, 120 (14), 145301.
• MEGNet (graph neural network)
  • Chen, C.; Ye, W.; Zuo, Y.; Zheng, C.; Ong, S. P. Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals. Chemistry of Materials 2019, 31 (9), 3564–3572.
• MODNet (feature-based neural network)
  • De Breuck, P.-P.; Evans, M. L.; Rignanese, G.-M. Robust Model Benchmarking and Bias-Imbalance in Data-Driven Materials Science: A Case Study on MODNet. arXiv:2102.02263 [cond-mat] 2021.
• CrabNet (attention-based composition neural network)
  • Wang, A.; Kauwe, S.; Murdock, R.; Sparks, T. Compositionally-Restricted Attention-Based Network for Materials Property Prediction; ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.11869026.v1.
• ALIGNN (graph neural network with bond angles)
  • Choudhary, Kamal, and Brian DeCost. "Atomistic Line Graph Neural Network for improved materials property predictions." npj Computational Materials 7.1 (2021): 1-8.
Insights from standardized comparisons
• Originally, we found that traditional “hand-crafted” feature models generally performed best when $n < 10^4$
• So it seemed that materials science data (typically small datasets, especially experimental ones) was best modelled by traditional ML/feature methods, e.g., random forest
• Clever developments in neural networks have improved GNN models on smaller datasets, in part powered by competition on the Matbench leaderboard
• A standardized platform has made it easier to identify techniques that work well for certain problems, and those that do not
Insights from standardized comparisons
Errors Predicting Final Phonon DOS Peak Frequencies

Algorithm                Mean MAE (cm⁻¹)   Mean RMSE (cm⁻¹)   Maximum max_error (cm⁻¹)
ALIGNN (2022)            29.5385           53.501             615.3466
MODNet v0.1.10 (2021)    38.7524           78.222             1031.8168
CrabNet (2021)           55.1114           138.3775           1452.7562
AMMExpress (2020)        56.1706           109.7048           1151.557
CGCNN (2019)             57.7635           141.7018           2504.8743

[Figure: Mean Absolute Error $\mu_{MAE} \pm \sigma_{MAE}$ Predicting Final PhDOS Peaks; annotations: Structural GNN (2022), Composition GNN (2021), SoTA early 2020]

Same data, same test; so, why are some algorithms best?
• ALIGNN: Incorporation of bond angle into crystal graph
• Bond angle/local environment importance for vibrational properties?
• Matbench enables these sorts of “instant” ablation studies
Insights from standardized comparisons
Errors Predicting Expt. $E_{gap}$

Algorithm           Mean MAE (eV)   Std. MAE (eV)   Mean RMSE (eV)
CrabNet             0.3463          0.0088          0.8504
MODNet (v0.1.10)    0.347           0.0222          0.7437
CrabNet v1.2.1      0.3757          0.0207          0.8805
AMMExpress v2020    0.4161          0.0194          0.9918

[Figure: Mean Absolute Error $\mu_{MAE} \pm \sigma_{MAE}$ Predicting Expt. $E_{gap}$; annotations: Composition GNN, Traditional Features + Encoding/selection, SoTA early 2020]

Same data, same test; so, why are some algorithms best?
• CrabNet: Importance of attention mechanism for compositional properties; low variability across folds
• MODNet: Normalized Mutual Information feature selection results in high performance, at the risk of higher variability across folds
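The "Mean MAE" and "Std. MAE" columns summarize one MAE per test fold. A minimal sketch of that aggregation follows. The per-fold values are invented, and using the population standard deviation (`statistics.pstdev`) is an assumption: the slides do not say which estimator the leaderboard uses.

```python
import statistics

def summarize_folds(fold_scores):
    """Return the mean and spread of a per-fold score such as MAE."""
    return {
        "mean": statistics.mean(fold_scores),
        "std": statistics.pstdev(fold_scores),  # assumption: population std
    }

# Hypothetical per-fold MAE values in eV (invented for illustration).
fold_mae = [0.34, 0.35, 0.33, 0.36, 0.35]
summary = summarize_folds(fold_mae)
print(f"MAE = {summary['mean']:.4f} ± {summary['std']:.4f} eV")
```

A small standard deviation relative to the gap between two algorithms' means is what makes a ranking between them trustworthy across folds.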
Improvements to Materials ML Benchmarks
Two directions: standardized uncertainty quantification (UQ), and more datasets + better tasks.
Standardized Uncertainty Quantification
• ML-driven materials design is improved by UQ of each prediction
• Enables adaptive design
• Practical: modern models (e.g., MODNet) produce UQ estimates naturally
• Useful: Can analyze UQ to tell us how often samples' true values actually fall outside the UQ range
• In progress: Coming soon to the matbench package!
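One way to check how often true values fall outside the UQ range is to count misses against a prediction interval. The sketch below assumes each prediction comes with a standard deviation and treats pred ± 1.96·std as a nominal 95% interval. That Gaussian interval, the function name `interval_miss_rate`, and the data are all assumptions for illustration, not the matbench package's UQ interface.

```python
def interval_miss_rate(y_true, y_pred, y_std, z=1.96):
    """Fraction of samples whose true value lies outside pred ± z * std."""
    misses = sum(1 for t, p, s in zip(y_true, y_pred, y_std) if abs(t - p) > z * s)
    return misses / len(y_true)

# Hypothetical targets, predictions, and predicted standard deviations.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 2.5, 2.9, 4.0]
y_std = [0.1, 0.1, 0.2, 0.05]

miss_rate = interval_miss_rate(y_true, y_pred, y_std)
print(f"{miss_rate:.0%} of true values fall outside the interval")
```

For a well-calibrated model the miss rate should sit near the nominal level (about 5% for z = 1.96); a much higher rate indicates overconfident uncertainty estimates.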
More Datasets + Better Tasks!
• Impossible to represent the full field of materials design in a single set of benchmarks
• However… can we come close? Aim to include a wider variety of properties and sources:
• Expt. load-dependent Vickers hardness
• Expt. superconductor Tc
• Expt. $\Delta H_f^0$ from crystal structure
• Expt. UV-Vis measurements of metal oxides
• Unique, domain-specific procedures for each task
• For example: segregation of CV samples into clusters
based on structure/composition (LOCOCV)
• Evaluation procedures which most closely resemble
real world usage of these algorithms in the most
computationally feasible fashion
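The cluster-based splitting mentioned above (LOCOCV, leave-one-cluster-out cross-validation) can be sketched as follows. The cluster labels are assumed to be precomputed, for example by clustering on structure or composition features; the labels and the helper `loco_cv_splits` are hypothetical.

```python
from collections import defaultdict

def loco_cv_splits(cluster_labels):
    """Yield (held_out_cluster, train_indices, test_indices), one split per cluster."""
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)
    for label in sorted(members):
        test = members[label]
        train = [i for other in sorted(members) if other != label for i in members[other]]
        yield label, train, test

# Hypothetical cluster assignment for six samples.
labels = ["oxide", "oxide", "nitride", "sulfide", "nitride", "oxide"]
splits = list(loco_cv_splits(labels))
for held_out, train, test in splits:
    print(held_out, "train:", train, "test:", test)
```

Because every test sample belongs to a cluster the model never saw during training, this split estimates extrapolation to new chemistries rather than interpolation within familiar ones.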
Conclusions and future
• As the community develops more and more algorithms for machine learning of materials properties, a standard way to test these algorithms is needed
• Matbench represents such a standard and allows you to test your algorithms against others
• Matbench also allows us to measure overall progress in the field
• We hope to see you on the leaderboard!
Acknowledgements
Alex Dunn (lead developer)
Qi Wang
Alex Ganose
Daniel Dopp
