Diverse Explanations From Data-Driven and Domain-Driven Perspectives in the Physical Sciences

Sichao Li, Xin Wang & Amanda Barnard School of Computing, Australian National University, Canberra, Australia [email protected]
Abstract

Machine learning methods have been remarkably successful in material science, providing novel scientific insights, guiding future laboratory experiments, and accelerating materials discovery. Despite the promising performance of these models, understanding the decisions they make is also essential to ensure the scientific value of their outcomes. However, there is a recent and ongoing debate about the diversity of explanations, which potentially leads to scientific inconsistency. This Perspective explores the sources and implications of these diverse explanations in ML applications for physical sciences. Through three case studies in materials science and molecular property prediction, we examine how different models, explanation methods, levels of feature attribution, and stakeholder needs can result in varying interpretations of ML outputs. Our analysis underscores the importance of considering multiple perspectives when interpreting ML models in scientific contexts and highlights the critical need for scientists to maintain control over the interpretation process, balancing data-driven insights with domain expertise to meet specific scientific needs. By fostering a comprehensive understanding of these inconsistencies, we aim to contribute to the responsible integration of eXplainable Artificial Intelligence (XAI) into physical sciences and improve the trustworthiness of ML applications in scientific discovery.

  • October 2024

1 Introduction

The physical science studies rely heavily on the domain knowledge of scientists and often involve complex and computationally expensive simulations or economically expensive high-throughput experiments 1, 2, 3. The outcomes strongly depend on how comprehensively the search space is sampled, and this traditional domain-driven approach has clear limitations in prediction accuracy and efficiency. Machine learning (ML) approaches have received increasing attention as promising tools for materials research, particularly with the rise of deep neural networks (DNN) and numerous novel model structures proposed to enhance model predictability 4, 5, 6, 7.

ML models adopt a data-driven approach by analysing a large volume of historical data to derive new insights, based on various inputs such as physicochemical structure, state variables, or raw characterisation data, to reduce the dependency on domain knowledge or specific infrastructure. These models also demonstrate high accuracy in predicting a wide range of materials’ physical, mechanical, optoelectronic, and thermal properties, such as crystal structure, melting temperature, formation enthalpy, and bandgap 8, 5, 9, 10, 11, 12. Their success has driven the rapid adoption of ML models in scientific areas 2, 13, 14.

Refer to caption

Figure 1: The potential conflicts from data-driven and domain-driven explanations in the decision-making process involving well-trained ML models. The dashed line denotes the conventional pipeline of XAI in the scientific domain, where stakeholders utilise ML models and retrieve data-driven explanations to analyse results. In practice, it is common that data-driven explanations conflict with domain knowledge, misleading researchers, and requiring a new approach.

However, the most accurate ML models, such as DNNs, are frequently difficult to explain and are often referred to as “black-boxes”. The lack of explainability of ML methods has hindered their potential impact in many domains 15. Methods of explainability/interpretability for ML models are intensively studied in the field of computing, and there is an increasing demand for applying XAI to ML predictions in science 16, 17, 1, 18, 19. For instance, a typical scenario where a new material property is predicted by an ML model based on the structure can be challenging to optimise and translate to manufacturers without understanding how specific material characteristics influence the prediction. For this reason, many scientists find black-box ML models difficult to trust.

The problem is exacerbated by the fact that recent studies have demonstrated that many different ML models can perform similarly on the same dataset, even with different functional forms 20, 21, 22. As a result, ML models with comparable effectiveness can offer diverse explanations for the same task and an intuitive question easily arises: which model should we trust? Many researchers seek to identify which samples or features are important in describing the model effects 23, 24. However, when faced with the existence of multiple “equally-good” models, training and explaining with a single model becomes problematic, and results in inconsistencies between interpretability and established scientific principles 25, 26. Figure 1 illustrates the potential conflicts between data-driven and domain-driven explanations in the decision-making process involving well-trained ML models, where the term “well-trained” models refer to ML models that demonstrate promising performance on their respective tasks. The performance of these models is crucial, as the reliability of explanations derived from a model is contingent on its predictive capability.

In this Perspective, we argue that, even though there are a large number of XAI studies in scientific domains, trust has yet to be established. We focus on well-trained models to ensure that the explanations we analyse are based on models with predictions that can be reasonably trusted. We identify and explore diversities between data-driven and domain-driven explanations in well-trained ML models from different perspectives, shown in Fig. 2. Through three case studies in materials science, nanotechnology and molecular property prediction, we highlight how different models, explanation methods, levels of feature attribution, and stakeholder needs can result in varying interpretations for identical tasks. This illuminates the multifaceted nature of explanations in scientific applications, and emphasises the crucial role of scientists in guiding the interpretation process.

2 Background and Concepts

2.1 Explainability and Interpretability in Science

Explainability and interpretability are closely related concepts in XAI, often used interchangeably to describe the ability of humans to understand model predictions. In scientific contexts, these concepts are particularly crucial as they bridge the gap between data-driven ML models and domain-driven scientific understanding 27, 28, 29, 30, 21, 31, 32, 33, 34. In scientific applications, the need for explainable and interpretable ML models is crucial 35, 20. Scientists not only need accurate predictions but also require insights into the underlying mechanisms. This aligns with the scientific method, where understanding the reason behind a prediction is as important as observing the phenomenon itself.

2.2 The Foundation of Trust: Performance and Explanation

Trust in ML models in scientific and technology is built on two pillars: performance and explanation 20. Performance metrics such as Mean Absolute Error (MAE), coefficient of determination (R2superscript𝑅2{R^{2}}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT), and Area Under the Receiver Operating Characteristic curve (ROC-AUC) quantify how well a model predictions aligns with ground truth. However, these metrics alone do not provide insights into the scientific validity of the underlying decision-making process. The main challenge lies in the fact that high-performing models, especially complex neural networks, are often difficult to interpret, creating a trade-off between performance and explainability.

2.3 Explanation Categories in Science: Domain-Driven and Data-Driven

Explanation methods in ML can be categorised in many ways, e.g., feature-based vs. instance-based, post-hoc vs. ante-hoc, or output-focused vs. model-focused 36, 37, 32, 27, 1, 26, 31, 26. In this Perspective, we focus on two primary categories of explanations in scientific ML: domain-driven and data-driven approaches. This categorisation provides a framework for understanding the sources of diverse explanations in scientific applications of ML.

Data-Driven Explanations

primarily rely on patterns and relationships discovered in the training data. These approaches are commonly used when integrating ML methods into scientific domains, as they can uncover insights directly from large volumes of data. Data-driven explanations often employ post-hoc interpretation methods on complex, well-performing ML models 38, 1, 39, 40, 41. Data-driven explanations offer significant advantages, such as the ability to uncover hidden patterns in big data. The flexibility of data-driven methods also support a variety of application without the need for building complex, domain-specific ML models. However, their limitation lies in the potential to produce explanations that, while mathematically sound, may lack clear scientific interpretation or may even conflict with established theories.

Domain-Driven Explanations

leverage expert knowledge and established scientific principles to construct domain-specific ML models or interpret ML model outputs. These explanations aim to align model behavior with existing theories and physical laws. For instance, attention-based models such as CrabNet 42 can provide element-wise contributions to property predictions, aligning with domain knowledge about elemental influences. The strength of domain-driven explanations is their scientific consistency and interpretability within the context of existing knowledge. They often result in explanations that are more readily accepted and understood by domain experts. However, they can be limited by their dependence on both current scientific understanding and ML modeling, which may inadvertently constrain the model performance, and ability to discover novel patterns or relationships not yet recognised in the field 20.

Interplay Between Approaches

does not mean that domain-driven and data-driven explanations are mutually exclusive categories. In practice, many effective explanation methods incorporate elements of both approaches. Inherently interpretable models can be data-driven while incorporating domain knowledge, such as physics-informed neural networks that embed known physical laws into their architecture. Conversely, complex black-box models can be designed with domain-specific constraints or analysed using explanation methods that align with domain understanding. However, balancing these approaches can lead to explanations that are both scientifically consistent and capable of revealing novel insights, potentially driving forward our understanding in sciences.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2: Visualisation of four sources of inconsistent explanations.

2.4 Explanation Inconsistency

Scientific consistency 43 was introduced as an essential component for model learning in 2017, among the fundamental requirements for producing trustworthy outcomes in scientific applications 25. This means that the results obtained from ML models must be consistent with established scientific principles 26, 44. This concept bridges the gap between data-driven predictions and domain knowledge. Instead of scientific consistency, which is subjective and hard to maintain, we explore scientific inconsistency through the lens of data-driven and domain-driven explanations. Our focus is on feature-based explanations in human-understandable terms, providing visualisations of feature importance rankings to illustrate diverse explanations.

2.5 Feature-Based Explanations in ML

Feature-based explanations play a crucial role in interpreting model behavior in human-understandable terms 38, 1, 39, 40, 41. In this Perspective, we focus specifically on feature importance as a key aspect of these explanations.

Feature Importance Measurements

are of the most commonly used methods for explaining ML models is feature importance ranking, derived from input-output relationships 40, 41. In regression tasks, feature importance is calculated based on how much a feature contributes to the predicted value; while in classification, feature importance is calculated based on the contribution to each class. Although complex neural networks are hard to interpret, some simpler models can inherently provide explanations. For example, a decision tree is a rule-based classification algorithm that splits at each node according to metrics such as the Gini index, which is computed as:

Gini=1i=1Cpi2Gini1superscriptsubscript𝑖1𝐶superscriptsubscript𝑝𝑖2\text{Gini}=1-\sum_{i=1}^{C}p_{i}^{2}Gini = 1 - ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_C end_POSTSUPERSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

where pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the current percentage of class i𝑖iitalic_i and C𝐶Citalic_C is the number of classes. The difference of Gini index at each split is calculated as the difference between the parent node and the weighted average of child nodes, such that:

ΔGini=Giniparent(NleftN×Ginileft+NrightN×Giniright)ΔGinisubscriptGiniparentsubscript𝑁left𝑁subscriptGinileftsubscript𝑁right𝑁subscriptGiniright\Delta\text{Gini}=\text{Gini}_{\text{parent}}-\left(\frac{N_{\text{left}}}{N}% \times\text{Gini}_{\text{left}}+\frac{N_{\text{right}}}{N}\times\text{Gini}_{% \text{right}}\right)roman_Δ Gini = Gini start_POSTSUBSCRIPT parent end_POSTSUBSCRIPT - ( divide start_ARG italic_N start_POSTSUBSCRIPT left end_POSTSUBSCRIPT end_ARG start_ARG italic_N end_ARG × Gini start_POSTSUBSCRIPT left end_POSTSUBSCRIPT + divide start_ARG italic_N start_POSTSUBSCRIPT right end_POSTSUBSCRIPT end_ARG start_ARG italic_N end_ARG × Gini start_POSTSUBSCRIPT right end_POSTSUBSCRIPT )

where N𝑁Nitalic_N denotes the number of instances in the node. Decision trees45 iterate through all features and possible values to find the one that maximises the Gini index difference and uses it for the split. After training, feature importance can be calculated as the sum of Gini index differences among all nodes where the feature is used for split, and divided by the sum of Gini index differences among all features.

More sophisticated methods such as random forests and XGBoost46 combine decision trees with additional mechanisms such as regularisation. The importance of features is calculated by the average gain across all nodes of all trees where the feature is used for splits. In the absence of these intrinsic model explanations, some well-established methods can offer universal explanations for both simple and complex models. The explanation methods highlighted in this study are summarised in Table 1.

Explanation Methods Description Usage
Shapley Additive Explanations (SHAP)47 A game theory-based method that computes the marginal contribution of each feature by calculating the Shapley value. Case study 1 Sec.3.2
Permutation Importance (PI)48 Evaluate feature importance by invalidating features and measuring the difference in model performance. Case study 1 Sec.3.2
Local Interpretable Model-agnostic Explanations (LIME)49 Generates new data instances near a specific instance and trains a simple model on these to obtain local explanations. Case study 3 Sec.3.3
Integrated Gradients (IG)50 Computes feature importance by integrating the gradients of the model’s output with respect to the input features. Case study 1 in Sec.3.2
Feature Interaction Score (FIS)23 Measures the performance change of feature attributions from the baseline, quantifying different levels of feature attributions. Case study 2 Sec.3.1
Connection Weights (NN) 51, 52 Analyses the weights of connections in neural networks to determine feature importance. Case study 1 in Sec.3.2
Gini Importance (Decision Trees)45 Calculates feature importance in decision trees based on the Gini impurity criterion. Case study 3 Sec.3.3
Average Gain (XGBoost) 46 Computes feature importance in XGBoost models based on the average gain across all splits where the feature is used. Case study 3 Sec.3.3
Table 1: Summary of explanation methods highlighted in this perspective.

2.5.1 Feature Attribution Scores

can be used to ensure consistency when comparing feature importance across different models and methods 53, 49, 50, 54. This approach allows for a fair comparison of feature importance in a broader context, especially when dealing with complex feature interactions. Feature attribution refers to the process of assigning importance or contribution values to input features with respect to a model prediction, indicating the importance of a feature. In particular, feature interaction score (FIS) is a convenient way quantify different levels of feature attributions 22, 23, 55, 56, such that:

φi(fref)=𝔼[L(fref(𝐗s),𝐲)]𝔼[L(fref(𝐗),𝐲)]subscript𝜑𝑖subscript𝑓𝑟𝑒𝑓𝔼delimited-[]𝐿subscript𝑓𝑟𝑒𝑓subscript𝐗𝑠𝐲𝔼delimited-[]𝐿subscript𝑓𝑟𝑒𝑓𝐗𝐲\varphi_{i}(f_{ref})=\mathbb{E}[L(f_{ref}(\mathbf{X}_{\setminus s}),\mathbf{y}% )]-\mathbb{E}[L(f_{ref}(\mathbf{X}),\mathbf{y})]italic_φ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_f start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT ) = blackboard_E [ italic_L ( italic_f start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT ( bold_X start_POSTSUBSCRIPT ∖ italic_s end_POSTSUBSCRIPT ) , bold_y ) ] - blackboard_E [ italic_L ( italic_f start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT ( bold_X ) , bold_y ) ]

where 𝐗ssubscript𝐗𝑠\mathbf{X}_{\setminus s}bold_X start_POSTSUBSCRIPT ∖ italic_s end_POSTSUBSCRIPT denotes the input matrix when the feature of interest is replaced by an independent variable. This method measures the performance change of feature attributions from a baseline Lref=𝔼[L(fref(𝐗),𝐲)]subscript𝐿𝑟𝑒𝑓𝔼delimited-[]𝐿subscript𝑓𝑟𝑒𝑓𝐗𝐲L_{ref}=\mathbb{E}[L(f_{ref}(\mathbf{X}),\mathbf{y})]italic_L start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT = blackboard_E [ italic_L ( italic_f start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT ( bold_X ) , bold_y ) ].

The “level” of feature attribution indicates the complexity of the relationships being explained. Individual feature importance is referred to as first-order feature importance, while the importance of relationships between two features is referred to as second-order feature interaction. Higher-order interactions involve combinations of three or more features, resulting in higher-order feature interactions. Similarly, higher-order feature attributions can be represented by:

φI(f)=𝔼[L(fref(𝐗I),𝐲)]𝔼[L(fref(X),𝐲)]subscript𝜑𝐼𝑓𝔼delimited-[]𝐿subscript𝑓𝑟𝑒𝑓subscript𝐗𝐼𝐲𝔼delimited-[]𝐿subscript𝑓𝑟𝑒𝑓𝑋𝐲\varphi_{I}(f)=\mathbb{E}[L(f_{ref}(\mathbf{X}_{\setminus I}),\mathbf{y})]-% \mathbb{E}[L(f_{ref}({X}),\mathbf{y})]italic_φ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( italic_f ) = blackboard_E [ italic_L ( italic_f start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT ( bold_X start_POSTSUBSCRIPT ∖ italic_I end_POSTSUBSCRIPT ) , bold_y ) ] - blackboard_E [ italic_L ( italic_f start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT ( italic_X ) , bold_y ) ]

where I𝐼Iitalic_I is a set of features, formalised as |I|>1𝐼1|I|>1| italic_I | > 1. In practice, one can permute the features of interest multiple times to achieve a similar measurement 57. With this in mind, the feature attribution score is defined as the difference between the loss change of replacing features simultaneously and the sum of the loss change of replacing multiple features individually:

FISI(fref)=φI(fref)iIφi(fref).𝐹𝐼subscript𝑆𝐼subscript𝑓𝑟𝑒𝑓subscript𝜑𝐼subscript𝑓𝑟𝑒𝑓subscript𝑖𝐼subscript𝜑𝑖subscript𝑓𝑟𝑒𝑓FIS_{I}(f_{ref})=\varphi_{I}(f_{ref})-\sum_{i\in I}\varphi_{i}(f_{ref}).italic_F italic_I italic_S start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( italic_f start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT ) = italic_φ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( italic_f start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT ) - ∑ start_POSTSUBSCRIPT italic_i ∈ italic_I end_POSTSUBSCRIPT italic_φ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_f start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT ) .

For common feature importance, FISs(fref)=φs(fref)𝐹𝐼subscript𝑆𝑠subscript𝑓𝑟𝑒𝑓subscript𝜑𝑠subscript𝑓𝑟𝑒𝑓FIS_{s}(f_{ref})=\varphi_{s}(f_{ref})italic_F italic_I italic_S start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_f start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT ) = italic_φ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_f start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT ) and the loss can be approximated by the model performance metrics mentioned above.

Both first-order and second-order explanations have value in the physical sciences, where importances and interactions can be underpinned or anticipated by domain knowledge. These sorts of explanations are often intuitive for researchers, but must be explicitly calculated for ML models, leading to potential inconsistencies.

2.5.2 Implications for Scientific ML

The ability to calculate and compare feature attribution scores across different levels, models, and methods allows researchers to identify and analyse various types of inconsistencies in explanations:

  • Data-driven Inconsistencies: By quantifying feature importance using different data-driven approaches (e.g., SHAP, PI, and LIME), researchers can directly compare and contrast explanations derived from the same data but different methods.

  • Domain-driven Inconsistencies: Different stakeholders within the same domain may have specific requirements, needs, and aims, potentially leading to conflicting explanations based on their individual perspectives.

  • Data-driven vs. Domain-driven Inconsistencies: This type of inconsistency arises when explanations derived from data-driven methods do not align with domain knowledge or stakeholder expectations.

These inconsistencies underscore the complexity of interpreting ML models in scientific contexts. It is often impractical, if not impossible, to satisfy all stakeholders with their diverse aims and needs, or to force users to accept a single model or explanation method as definitive. In the following case studies, we demonstrate how feature-based explanation methods can lead to diverse interpretations of the same scientific phenomena. This exploration highlights the critical importance of considering multiple perspectives when applying ML in scientific domains. Additionally, we identify and discuss common factors contributing to this diversity of explanations, as revealed through our case studies.

3 Case Studies and Insights

The following collection of examples highlights some inconsistencies in explanations derived from ML models in the physical sciences, based on four key sources of explanation diversity:

  • Model Selection: Accurate ML models trained for the same task can produce diverse explanations at both global and instance levels.

  • Explanation Method Choice: Well-established XAI methods 111In this study, we use well-established explanation methods, which we define as those that are widely accepted and commonly used in the ML community, such as SHAP, LIME, and Integrated Gradients. often generate diverse explanations even when applied to the same model.

  • Feature Attribution Level: Different levels of feature attributions (e.g., first-order vs. higher-order interactions) can result in diverse explanations of the same phenomenon.

  • Stakeholder Perspective: Diverse needs and priorities of different stakeholders can lead to varied interpretations of model outputs.

3.1 AFLOW Bulk Modulus Benchmark

Our first example explores the prediction of property trends in alloys using their crystal structure, a fundamental problem in materials informatics. A public open-source software in this domain is the AFLOW (Automatic Flow) framework, which enables automated property determination and serves as a benchmark in the field 58, 59. In a recent study 42, the Compositionally Restricted Attention-Based network (CrabNet) was introduced for predicting materials properties structure-agnostic of crystals. This model achieves state-of-the-art performance by using mat2vec embeddings 60 for encoding chemical information and provides heat map style explanations. The authors compared CrabNet with other ML models, including Roost 61, ElemNet 62, and random forest (RF) models.

Properties Roost CrabNet HotCrab ElemNet RF MLP
AFLOW Bulk modulus 8.82 8.69 9.10 12.12 11.91 11.30
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 3: The average contribution of all elements to bulk modulus predictions, computed from the AFLOW bulk modulus dataset. (a) MAE scores of Roost, CrabNet, one-hot encoded CrabNet (HotCrab), ElemNet, and MLP on the held-out test datasets, compared with the random forest (RF) baseline for the property. (b) Figure reprinted from 42 under the CC BY 4.0 license 63. (c) First-order feature attribution calculated based on the well-trained MLP. The lighter-coloured elements in the periodic table contribute more towards a compound’s bulk modulus value.

Here we demonstrate model results evaluated using MAE on the test set in Fig. 3(a). The models were trained on the same dataset but utilised different encoding representations: mat2vec encoded chemical information (Roost, CrabNet), one-hot encoding without chemical information (one-hot encoded CrabNet (HotCrab), ElemNet, multilayer perception (MLP)), and a Magpie-featurised CBFV 64 for RF. Here we can see that a simple MLP achieves comparable performance without the need for complex fine-tuning processes. While MLP may offer a slight advantage over RF and ElemNet in performance, we do not anticipate it as well as complex neural networks in benchmark tests. Our focus in this study is on explanations from ML models, particularly the CrabNet and MLP black-boxes. The self-attention mechanism enables the CrabNet to preserve the elemental identity within a compound and thus directly predict the element’s contribution to the property prediction Fig. 3(b) illustrates the average contributions of each element in a CrabNet model, with lighter elements indicating higher contributions.

Similarly, we compare this to the first-order feature attribution from the MLP model shown in Fig. 3 (b) and (c), where each element is coloured according to its attribution value. For this case, we split the data set into train, validation, and test sets using a fixed random seed, where the training and validation sets were employed for model training and hyperparameter tuning, respectively, and the test set was used for the evaluation of model performance metrics. We employed Mendeleev encoding 65 which uses the chemical formula exclusively, and has been shown to achieve comparable performance to other ML models 66. Second-order feature interactions in the Supporting Information to highlight the deeper insights into how these interactions contribute to predictions.

This is significant because different levels of feature attributions lead to diverse explanations. At the first-order level, elements such as iridium (Ir) and platinum (Pt) were found to be independently important, indicating a strong reliance on these elements for accurate predictions. However, the second-order feature interactions (see Supporting Information) show that seemingly less significant elements can play crucial roles when considered in combination. Potassium (K) and chlorine (Cl) have a higher impact on second-order feature attributions despite their lower importance at the first-order level. This diversity in feature attributions across different levels highlights the complexity of the underlying relationships in the data. It also offers researchers the flexibility to select the most appropriate level of feature attribution for their specific research questions or domain knowledge. From a physical perspective, these findings align with our understanding of material properties. The importance of Ir and Pt at the first-order level is consistent with their known high bulk moduli due to strong interatomic bonds. The significance of K and Cl in second-order interactions reflects their role in compound formation and their influence on crystal structures, which can dramatically alter bulk properties 67, 68.

3.2 Metallic Nanoparticle Property Prediction

Table 2: Representative questions for five four stakeholders in the case of metallic nanoparticles
Feature sets Representatives Stakeholders Representative questions Representative
expectations
Important Computer
scientist
Developers Which features should be included in the model to achieve the best performance? I believe that feature importance should depend solely on the model and data.
Controllable Experimental
materials
scientist
Scientists Which features are controllable and can be measured using microanalysis methods in the lab? I anticipate that controllable features are responsible for functional properties.
Structural Computational
materials
scientist
Scientists Which features are related to the structure and represent important inputs into my simulations? I expect structural features to be related to physical properties.
Experimental Manufacturer Professionals Which controllable features are related to the synthesis conditions and attributes that are inputs for industrial processes? I believe focusing on processing features will save me money.

This case study uses a dataset of metallic nanoparticles originally generated through molecular dynamics simulations that model the sintering and coarsening processes of palladium (Pd), gold (Au), and platinum (Pt) at varying temperatures and atomic deposition rates 69, 70, 71, 72. The fractal dimension has recently been proposed as a superior way of describing the complexity of the surfaces that is relevant to catalysis. This dimension is calculated using a box-counting algorithm, using the Sphractal Python package 73, 74 which is capable of estimating the fractal dimensions for surfaces composed of precise mathematical objects or atomistic coordinates.

In this example, we highlight the impact of using ML models for the same task, including RF and XGBoost, and examine connection weights from a well-trained MLP 51, 52 as an exemplar. One approach to intuitively explain the MLP is by directly analysing the gradients, using IG, given that the model learns through parameter optimisation with respect to the training data. To show how this works we considered four domain-based scenarios Important, Controllable, Structural, and Experimental, which are described in 72. The Processing category is not considered because it cannot be fitted well, resulting in unreliable predictions. Each remaining scenario corresponds to the needs of different potential stakeholders and involves a different feature sub-set. We also include detailed feature sub-sets and descriptions in Supporting Information.

Using these scenarios, we compare three ML models (XGBoost, MLP and RF) of similar high performance 75. All models are trained on the same dataset and evaluated using the same metrics (R2superscript𝑅2{R^{2}}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and MAE), yielding the following results: XGBoost (R2superscript𝑅2{R^{2}}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT: 0.74, MAE: 0.03), MLP (R2superscript𝑅2{R^{2}}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT: 0.75, MAE: 0.03), and RF (R2superscript𝑅2{R^{2}}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT: 0.74, MAE: 0.03). The outcome is illustrated in Fig. 4(top). The normalised importance values from PI and SHAP are also compared, noting the focus here is on the relative rankings rather than the absolute values. These outcomes are shown in Fig. 4(bottom), and trained four ML models from each scenario separately to predict the formation energy to compare the rankings of features has displayed in Fig. 5.

Refer to caption
Figure 4: (a) Feature importance rankings of metallic nanoparticles from accurate models, including XGBoost, MLP, and RF, and (b) feature importance rankings (MLP) from different well-established explanation methods including PI, SHAP and IG. The x-axis displays features ordered by their importance ranking from RF, which serves as a baseline. Other rankings are plotted according to this order. The y-axis denotes the importance score, normalised between 0 and 1.

Our example of metallic nanoparticle properties demonstrates that it’s possible to identify multiple well-performing models for the same task, leading to diverse explanations. As shown in Fig. 4(a), the feature importance rankings from RF and XGBoost (which are both tree-based) are similar, whereas MLP presents notably different explanations. This variance underscores the challenge in determining which model explanation should be trusted, even when all models perform similarly well on the prediction task, and highlights the need for careful consideration when interpreting ML models in scientific contexts 22, 23, 20, 23. Relying on a single model’s explanation may provide an incomplete or potentially misleading understanding of the underlying phenomena.

In addition to this, Well-established XAI methods lead to diverse explanations from the same model in this example. Based on our well-trained MLP model, example results for IG, SHAP, and PI are shown in Fig. 4 (b). We can see here that while all methods agree on the importance of certain features, they differ in their ranking of others. This inconsistency demonstrates that the choice of XAI method can significantly impact the resulting explanation, and highlights the importance of choosing an approach that best aligns with domain knowledge 76.

Refer to caption
Figure 5: Feature importance rankings of metallic nanoparticles fractal dimension predictions from four stakeholders (scenarios). Features ordered by their importance rankings (left to right), serving as a baseline. Other rankings are plotted according to this order. The y𝑦yitalic_y-axis denotes the importance score, normalised between 0 and 1.

The metallic nanoparticles example also highlights the diversity corresponding to different potential stakeholders with varying needs, aims, or requirements, as summarised in Table 2. As illustrated in Fig. 5, the feature importance rankings differ significantly across these scenarios. Diverse needs can lead to diverse explanations even when the predictive performance remains constant. For instance, a manufacturer might focus on features specifically related to synthesis conditions, while model developers might advocate for including features that optimise performance, even if they’re beyond laboratory control. A prime example of this divergence is the factor Time, which is crucial for laboratory scientists but less significant for computational materials scientists. These competing interests impact both model training and the resulting explanation, and emphasise the importance of considering the intended use and audience when developing and interpreting ML models in scientific contexts  32, 36. Ultimately scientific researchers are the determining factor in all attempts to explain models, showing consistency with other fields, e.g., a recent perspective in healthcare 77.

3.3 MoleculeNet BACE-1 Classification Benchmark

The biophysical BACE dataset 78 comprises 1513 compounds with physicochemical properties used for binary classification focused on inhibitors of human β𝛽\betaitalic_β-secretase 1 (BACE-1). It includes both quantitative (IC50 values) and qualitative (binary labels) binding results, specifically split into active (IC50 \leq 100 nM) and inactive classes. This dataset integrates experimental values reported in scientific literature, some with detailed crystal structures available. The compounds associated with their 2D structures and binary labels are provided by the benchmark MoleculeNet 79, whihc implements ECFP (Extended-Connectivity Fingerprints) featurisation method to decompose molecules into sub-modules from heavy atoms, with an assigned unique identifier. These identifiers extend through bonds to form larger sub-structures, which are then hashed into fixed-length binary fingerprints which encode the topological characteristics of molecules, enabling applications such as similarity searching and activity prediction. In this example, we use the DeepChem framework for data generation and feature representation 80, and present four high-performing RF, XGBoost, Support Vector Machine (SVM), and MLP, based on their performance comparison in MoleculeNet. These models were retrained on the BACE dataset, using the same data for training, and evaluated using accuracy, ROC-AUC, recall, and precision scores.

Table 3: The cross-validation performance results of four models in BACE-1 classification
Models Accuracy ROC-AUC Recall Precision
RF 0.813 ±plus-or-minus\pm± 0.025 0.888 ±plus-or-minus\pm± 0.026 0.795 ±plus-or-minus\pm± 0.029 0.801 ±plus-or-minus\pm± 0.046
XGBoost 0.802 ±plus-or-minus\pm± 0.024 0.876 ±plus-or-minus\pm± 0.027 0.788 ±plus-or-minus\pm± 0.033 0.785 ±plus-or-minus\pm± 0.043
SVM 0.817 ±plus-or-minus\pm± 0.029 0.884 ±plus-or-minus\pm± 0.027 0.801 ±plus-or-minus\pm± 0.027 0.803 ±plus-or-minus\pm± 0.045
MLP 0.765 ±plus-or-minus\pm± 0.061 0.862 ±plus-or-minus\pm± 0.033 0.819 ±plus-or-minus\pm± 0.066 0.730 ±plus-or-minus\pm± 0.084

3.3.1 Example

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Refer to caption
(d)
Figure 6: The confusion matrices of two classifiers in predicting bindings result in a set of inhibitors of human beta-secretase (BACE-1). The models include RF, XGBoost, SVM, and MLP, arranged from (a-d).

The overall performance of all four models are presented in confusion matrices shown in Fig. 6, where accuracy, ROC-AUC, recall and precision scores are included in Table 3. These performances are similar to those reported in MoleculeNet benchmark 79. In this case, our objective is to highlight individual predictions and their associated explanations by applying LIME, which creates locally perturbed datasets and fits linear surrogate models based on these perturbations, shown in 7, from the equally well-trained models, as illustrated in Fig. 8. In this case, while the RF and MLP show promising overall accuracy, they misclassified both of the two samples. In contrast, the XGBoost model accurately classified both samples and SVM misclassified one out of the two samples.

(a)Refer to caption (b)Refer to caption

Figure 7: Two samples from BACE dataset misclassified by some of four well-trained models (a) Chemical structure: Fc1cc(cc(F)c1)CC(NC(=O)c1c2cc(ccc2n(c1)C(=O)N(CCCC)C)C(OCc
1ccccc1)=O)C(O)C[NH2+]Cc1cc(OC)ccc1. The ground truth classification is inactive and is misclassified by RF, SVM, and MLP. (b) Chemical structure: s1c(ccc1-c1cc(C#N)c(F)cc1)C12N=C(N)N(C)C(=O)C1CN(C2)c1ccccc1.
Refer to caption
Figure 8: LIME Local explanations for four well-trained models applied to sample a (a-d) and b (e-h) from RF, XGBoost, SVM, and MLP model, respectively. Misclassified samples are labelled and it is noted that fragments vary across models due to featurisation.

Our analysis of the BACE-1 dataset using multiple high-performing models (RF, XGBoost, SVM, and MLP) highlights an important phenomenon known as predictive multiplicity 24. This concept captures the potential for individual-level discrepancies introduced by the arbitrary choice of a single model, even when the overall performance metrics are similar. As illustrated in Fig. 8, we observe significant variations in predictions and explanations across different models for the same molecular samples. For instance, the two samples presented are classified differently by various models, despite all models achieving comparable overall performance metrics (Fig. 6). This divergence in individual predictions underscores the limitations of relying solely on aggregate performance measures when selecting models for critical scientific applications. The LIME explanations highlight this issue by revealing how different models prioritise various molecular substructures in their decision-making processes. For example, the fragment cC(C)(C)N=C(N)N in the sample (b) contributes positively to the prediction in some models but negatively in others. Such inconsistencies at the local level can have significant implications in domains such as drug discovery, where decisions about individual compounds can have far-reaching consequences.

This predictive multiplicity in BACE-1 inhibitor classification aligns with the known complexity of protein-ligand interactions, where subtle changes in molecular structure can significantly alter binding affinity 81. The variation in predictions across models for the same input highlights the inherent uncertainty in ML approaches, especially when applied to complex molecular systems. This suggests the need for ensemble approaches, where considering predictions from multiple high-performing models may provide a more robust and nuanced understanding of the underlying relationships in the data. Tools like LIME, which provide instance-level explanations, are crucial for understanding and potentially mitigating the effects of predictive multiplicity.

3.4 Insights Into Diverse Explanations

Our case studies across different domains in physical science (materials science, nanotechnology and chemistry) reveal several key insights into the sources of diverse explanations. Whether predicting bulk modulus, nanoparticle properties, or molecular activities, we consistently observed diverse explanations from data-driven explanations, domain-driven explanations, and data-driven vs. domain-driven explanations. This ubiquity underscores the importance of considering explanation diversity in scientific ML applications.

We have illustrated the four high-level sources of explanation diversity: model selection, explanation method choice, feature attribution level, and stakeholder perspective. These findings align with discoveries in other fields, suggesting a broader applicability of these concepts. Each model (e.g., MLP, RF, and XGBoost) and explanation method (e.g., SHAP, LIME, and IG) is theoretically sound but operates under distinct assumptions, which influence feature importance and lead to diverse explanations. Another factor contributing to this diversity is the complex dependencies between input features 82, 83, which can significantly impact how different explanation methods attribute importance. This corresponds to the challenges in higher-order feature attributions. Recent research 84, 85, 86, 87, 88, 89 has begun exploring causal models in materials informatics to address this limitation, showing promise for extracting deeper materials physics insights. The existence of multiple well-performing models for the same task aligns with the concept of Rashomon sets 90, 24, while the variability in explanations from different methods has been noted in other studies 91.

3.5 Scientists-Centric Perspective

In light of the inherent diversity in explanations derived from ML, we advocate for a scientists-centric approach to interpretation. The complexity of scientific research often leads to diverse requirements that are challenging to resolve within a single model or explanation method. Empowering scientists to maintain control over the explanation process 41, 30 includes recognising that different stakeholders in scientific contexts often have varied needs and perspectives when employing ML models. This perspective discourages blind trust in model-generated explanations, and positions explanations as tools serving scientists, ensuring that insights derived from ML models are relevant and actionable for each specific scientific context or stakeholder.

The inherent diversity in ML models and explanation methods can be leveraged to achieve consistency across these varied perspectives. A potential pipeline to address this could involve two key steps: 1. initial exploration of a set of well-performing models, effectively sampling the model space to capture a range of interpretations. 2. application of optimization algorithms to identify models that best align with scientists’ expectations and domain knowledge. By embracing this diversity, researchers can address different needs independently, as we saw in the nanoparticle case study. Similarly, in molecular property prediction tasks, as illustrated in our BACE-1 classification example, scientists can utilise multiple high-performing models and local explanation methods like LIME to gain insights into individual predictions. This approach allows for a balance between data-driven insights and domain expertise, producing interpretations that are scientifically sound and practically useful for diverse needs.

4 Summary and Opportunities

In this Perspective, we discuss diverse explanations from both data-driven and domain-driven points of view, and the inconsistencies that hinder ML’s potential impact in the physical sciences. We illustrate this through three examples drawn from public research: AFLOW for property prediction, metallic nanoparticle property prediction, and BACE-1 classification. We identified four sources contributing to diverse explanations, including different levels of feature attributions, different well-established explanation methods, different well-trained ML models, and different requirements for stakeholders. All of these sources rely on high-performing ML models yet result in different explanations. We advocate for a scientists-centric approach, embracing this diversity as an opportunity to enhance scientific understanding. This approach allows researchers to select methods aligning with their specific needs and domain expertise, maintaining human oversight in the interpretation process 36.

Looking forward, we suggest that considering sets of models rather than single models offers a promising direction, potentially providing a range of feature attributions without compromising performance. Integrating various explanation methods (such as concept-based, example-based, and feature-based approaches) also shows potential for addressing diverse needs in scientific research. By fostering an understanding of diverse explanations in scientific ML, we can also contribute to the responsible integration of XAI into scientific discovery, enhancing trustworthiness and deepening our comprehension of complex physical phenomena.

Data Availability Statement

This Perspective does not present primary research results or introduce new data, software, or code. To ensure reproducibility we provide access to the following resources:

These resources allow readers to reproduce our findings and further explore the concepts discussed in this Perspective.

Acknowledgements

We gratefully acknowledge the National Computational Infrastructure (NCI) for providing access to their computing facilities in model training under project number p00.

References

  • 1 Xiaoting Zhong, Brian Gallagher, Shusen Liu, Bhavya Kailkhura, Anna Hiszpanski, and T Yong-Jin Han. Explainable machine learning in materials science. npj Computational Materials, 8(1):204, 2022.
  • 2 Bhavya Kailkhura, Brian Gallagher, Sookyung Kim, Anna Hiszpanski, and T Yong-Jin Han. Reliable and explainable machine-learning methods for accelerated material discovery. npj Computational Materials, 5(1):108, 2019.
  • 3 Yuanfeng Xu, Luis Elcoro, Zhi-Da Song, Benjamin J Wieder, MG Vergniory, Nicolas Regnault, Yulin Chen, Claudia Felser, and B Andrei Bernevig. High-throughput calculations of magnetic topological materials. Nature, 586(7831):702–707, 2020.
  • 4 Keith T Butler, Daniel W Davies, Hugh Cartwright, Olexandr Isayev, and Aron Walsh. Machine learning for molecular and materials science. Nature, 559(7715):547–555, 2018.
  • 5 Jonathan Schmidt, Mário RG Marques, Silvana Botti, and Miguel AL Marques. Recent advances and applications of machine learning in solid-state materials science. npj computational materials, 5(1):83, 2019.
  • 6 Sichao Li and Amanda S Barnard. Inverse design of mxenes for high-capacity energy storage materials using multi-target machine learning. Chemistry of Materials, 34(11):4964–4974, 2022.
  • 7 Yue Liu, Tianlu Zhao, Wangwei Ju, and Siqi Shi. Materials discovery and design using machine learning. Journal of Materiomics, 3(3):159–177, 2017.
  • 8 Jörg Behler. Perspective: Machine learning potentials for atomistic simulations. The Journal of chemical physics, 145(17):170901, 2016.
  • 9 Kristof T Schütt, Henning Glawe, Felix Brockherde, Antonio Sanna, Klaus-Robert Müller, and Eberhard KU Gross. How to represent crystal structures for machine learning: Towards fast prediction of electronic properties. Physical Review B, 89(20):205118, 2014.
  • 10 Zijiang Yang, Yuksel C Yabansu, Dipendra Jha, Wei-keng Liao, Alok N Choudhary, Surya R Kalidindi, and Ankit Agrawal. Establishing structure-property localization linkages for elastic deformation of three-dimensional high contrast composites using deep learning approaches. Acta Materialia, 166:335–345, 2019.
  • 11 Amanda S. Barnard and George Opletal. Selecting machine learning models for metallic nanoparticles. Nano Futures, 4:035003, 2020.
  • 12 Amanda J. Parker and Amanda S. Barnard. Unsupervised structure classes vs. supervised property classes of silicon quantum dots using neural networks. Nanoscale horizons, 6:277–282, 2021.
  • 13 Jenna Wiens and Erica S Shenoy. Machine learning for healthcare: on the verge of a major shift in healthcare epidemiology. Clinical infectious diseases, 66(1):149–153, 2018.
  • 14 Rodrigo P Carvalho, Cleber FN Marchiori, Daniel Brandell, and C Moyses Araujo. Artificial intelligence driven in-silico discovery of novel organic lithium-ion battery cathodes. Energy storage materials, 44:313–325, 2022.
  • 15 Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. Big data, 5(3):246–255, 2017.
  • 16 José Jiménez-Luna, Francesca Grisoni, and Gisbert Schneider. Drug discovery with explainable artificial intelligence. Nature Machine Intelligence, 2(10):573–584, 2020.
  • 17 Weitong Huang, Hanna Suominen, Tommy Liu, Gregory Rice, Carlos Salomon, and Amanda S Barnard. Explainable discovery of disease biomarkers: The case of ovarian cancer to illustrate the best practice in machine learning and Shapley analysis. Journal of Biomedical Informatics, 141:104365, 2023.
  • 18 Amanda S Barnard and Bronwyn L Fox. Importance of Structural Features and the Influence of Individual Structures of Graphene Oxide Using Shapley Value Analysis. Chemistry of Materials, 35(21):8840–8856, 2023.
  • 19 Amanda S. Barnard. Explainable prediction of N-V-related defects in nanodiamond using neural networks and Shapley values. Cell Reports Physical Science, 3(1):100696, 2022.
  • 20 Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature machine intelligence, 1(5):206–215, 2019.
  • 21 Sichao Li and Amanda Barnard. Variance Tolerance Factors For Interpreting All Neural Networks. In 2023 International Joint Conference on Neural Networks (IJCNN), pages 1–9, 2023.
  • 22 Aaron Fisher, Cynthia Rudin, and Francesca Dominici. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. J. Mach. Learn. Res., 20(177):1–81, 2019.
  • 23 Sichao Li, Rong Wang, Quanling Deng, and Amanda Barnard. Exploring the cloud of feature interaction scores in a Rashomon set. arXiv preprint arXiv:2305.10181, 2023.
  • 24 Hsiang Hsu and Flavio Calmon. Rashomon Capacity: A Metric for Predictive Multiplicity in Classification. Advances in Neural Information Processing Systems, 35:28988–29000, 2022.
  • 25 Markus Reichstein, Gustau Camps-Valls, Bjorn Stevens, Martin Jung, Joachim Denzler, Nuno Carvalhais, and fnm Prabhat. Deep learning and process understanding for data-driven Earth system science. Nature, 566(7743):195–204, 2019.
  • 26 Ribana Roscher, Bastian Bohn, Marco F Duarte, and Jochen Garcke. Explainable machine learning for scientific insights and discoveries. Ieee Access, 8:42200–42216, 2020.
  • 27 Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable ai: A review of machine learning interpretability methods. Entropy, 23(1):18, 2020.
  • 28 Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57, 2018.
  • 29 Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. In Examples are not enough, learn to criticize! criticism for interpretability, volume 29, 2016.
  • 30 Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial intelligence, 267:1–38, 2019.
  • 31 Fergus Imrie, Robert Davis, and Mihaela van der Schaar. Multiple stakeholders drive diverse interpretability requirements for machine learning in healthcare. Nature Machine Intelligence, 5(8):824–829, 2023.
  • 32 Amina Adadi and Mohammed Berrada. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6:52138–52160, 2018.
  • 33 Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
  • 34 Tommy Liu and Amanda S. Barnard. The emergent role of explainable artificial intelligence in the materials sciences. Cell Reports Physical Science, 4:101630, 2023.
  • 35 Alex A Freitas. Comprehensible classification models: a position paper. ACM SIGKDD explorations newsletter, 15(1):1–10, 2014.
  • 36 Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA), pages 80–89. IEEE, 2018.
  • 37 Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM computing surveys (CSUR), 51(5):1–42, 2018.
  • 38 Meike Nauta, Jan Trienes, Shreyasi Pathak, Elisa Nguyen, Michelle Peters, Yasmin Schmitt, Jörg Schlötterer, Maurice van Keulen, and Christin Seifert. From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI. ACM Computing Surveys, 55(13s):1–42, 2023.
  • 39 Jessica Gola, Dominik Britz, Thorsten Staudt, Marc Winter, Andreas Simon Schneider, Marc Ludovici, and Frank Mücklich. Advanced microstructure classification by data mining methods. Computational Materials Science, 148:324–335, 2018.
  • 40 Praveen Pankajakshan, Suchismita Sanyal, Onno E de Noord, Indranil Bhattacharya, Arnab Bhattacharyya, and Umesh Waghmare. Machine learning and statistical analysis for materials science: stability and transferability of fingerprint descriptors and chemical insights. Chemistry of Materials, 29(10):4190–4201, 2017.
  • 41 Ribana Roscher, Bastian Bohn, Marco F. Duarte, and Jochen Garcke. Explainable Machine Learning for Scientific Insights and Discoveries. IEEE Access, 8:42200–42216, 2020.
  • 42 Anthony Yu-Tung Wang, Steven K Kauwe, Ryan J Murdock, and Taylor D Sparks. Compositionally restricted attention-based network for materials property predictions. Npj Computational Materials, 7(1):77, 2021.
  • 43 Anuj Karpatne, Gowtham Atluri, James H Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, and Vipin Kumar. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on knowledge and data engineering, 29(10):2318–2331, 2017.
  • 44 Jia-Zhong Zhang. Avoiding spurious correlation in analysis of chemical kinetic data. Chemical communications, 47 24:6861–6863, 2011.
  • 45 Leo Breiman. Classification and regression trees. Routledge, 2017.
  • 46 Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, Rory Mitchell, Ignacio Cano, Tianyi Zhou, et al. Xgboost: extreme gradient boosting. R package version 0.4-2, 1(4):1–4, 2015.
  • 47 Scott M Lundberg and Su-In Lee. In A unified approach to interpreting model predictions, volume 30, 2017.
  • 48 Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
  • 49 Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. ” why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016.
  • 50 Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic Attribution for Deep Networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR, 06–11 Aug 2017.
  • 51 Marcus W Beck. NeuralNetTools: Visualization and analysis tools for neural networks. Journal of statistical software, 85(11):1, 2018.
  • 52 Julian D Olden, Michael K Joy, and Russell G Death. An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecological modelling, 178(3-4):389–397, 2004.
  • 53 Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386, 2016.
  • 54 Joseph D Janizek, Pascal Sturmfels, and Su-In Lee. Explaining explanations: Axiomatic feature interactions for deep networks. The Journal of Machine Learning Research, 22(1):4687–4740, 2021.
  • 55 Sichao Li, Amanda S Barnard, and Quanling Deng. Practical attribution guidance for rashomon sets. arXiv preprint arXiv:2407.18482, 2024.
  • 56 Jiayun Dong and Cynthia Rudin. Exploring the cloud of variable importance for the set of all good models. Nature Machine Intelligence, 2(12):810–824, 2020.
  • 57 Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE symposium on security and privacy (SP), pages 598–617. IEEE, 2016.
  • 58 Stefano Curtarolo, Wahyu Setyawan, Gus LW Hart, Michal Jahnatek, Roman V Chepulskii, Richard H Taylor, Shidong Wang, Junkai Xue, Kesong Yang, Ohad Levy, et al. Aflow: An automatic framework for high-throughput materials discovery. Computational Materials Science, 58:218–226, 2012.
  • 59 Conrad L Clement, Steven K Kauwe, and Taylor D Sparks. Benchmark aflow data sets for machine learning. Integrating Materials and Manufacturing Innovation, 9(2):153–156, 2020.
  • 60 Vahe Tshitoyan, John Dagdelen, Leigh Weston, Alexander Dunn, Ziqin Rong, Olga Kononova, Kristin A Persson, Gerbrand Ceder, and Anubhav Jain. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature, 571(7763):95–98, 2019.
  • 61 Rhys EA Goodall and Alpha A Lee. Predicting materials properties without crystal structure: Deep representation learning from stoichiometry. Nature communications, 11(1):6280, 2020.
  • 62 Dipendra Jha, Logan Ward, Arindam Paul, Wei-keng Liao, Alok Choudhary, Chris Wolverton, and Ankit Agrawal. Elemnet: Deep learning the chemistry of materials from only elemental composition. Scientific reports, 8(1):17593, 2018.
  • 63 Creative Commons. Creative commons attribution 4.0 international license, 2013.
  • 64 Steven K Kauwe, Jake Graser, Antonio Vazquez, and Taylor D Sparks. Machine learning prediction of heat capacity for solid inorganics. Integrating Materials and Manufacturing Innovation, 7:43–51, 2018.
  • 65 Zixin Zhuang and Amanda S. Barnard. Structure-free mendeleev encodings of material compounds for machine learning. Chemistry of Materials, 35:9325–9338, 2023.
  • 66 Zixin Zhuang and Amanda S. Barnard. Classification of battery compounds using structure-free mendeleev encodings. J. Cheminform., 16:47, 2024.
  • 67 Hyunchae Cynn, John E Klepeis, Choong-Shik Yoo, and David A Young. Osmium has the lowest experimentally determined compressibility. Physical review letters, 88(13):135701, 2002.
  • 68 Destiny E Charlie, Hitler Louis, Goodness J Ogunwale, Ismail O Amodu, Providence B Ashishie, Ernest C Agwamba, and Adedapo S Adeyinka. Effects of alkali-metals (x= li, na, k) doping on the electronic, optoelectronic, thermodynamic, and x-ray spectroscopic properties of x–sni3 halide perovskites. Computational Condensed Matter, 35:e00798, 2023.
  • 69 Amanda Barnard, Baichuan Sun, and George Opletal. Platinum Nanoparticle Data Set. v2. CSIRO. Data Collection., 2018. doi: 10.25919/5d3958d9bf5f7.
  • 70 Amanda Barnard and George Opletal. Gold Nanoparticle Data Set. v1. CSIRO. Data Collection., 2019. doi: 10.25919/5d395ef9a4291.
  • 71 Amanda Barnard and George Opletal. Palladium Nanoparticle Data Set. v2. CSIRO. Data Collection., 2023. doi: 10.25919/epxd-8p61.
  • 72 Sichao Li, Jonathan YC Ting, and Amanda S Barnard. The impact of domain-driven and data-driven feature selection on the inverse design of nanoparticle catalysts. Journal of Computational Science, 65:101896, 2022.
  • 73 Jonathan Yik Chang Ting, George Opletal, and Amanda S. Barnard. Fractal characterisation of simulated metal nanocatalysts in 3d. Small Science, 2022.
  • 74 Jonathan Yik Chang Ting, Andrew Thomas Agars Wood, and Amanda Susan Barnard. Sphractal: Estimating the Fractal Dimension of Surfaces Computed from Precise Atomic Coordinates via Box-Counting Algorithm, 2024.
  • 75 Davide Chicco, Matthijs J Warrens, and Giuseppe Jurman. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Computer Science, 7:e623, 2021.
  • 76 Amanda S. Barnard, Benyamin Motevalli, Amanda J Parker, Meli Fischer, Chris Feigl, and George Opletal. Nanoinformatics, and the big challenges for the science of small things. Nanoscale, 11:19190–19201, 2019.
  • 77 Tim Miller. Explainable AI is Dead, Long Live Explainable AI! Hypothesis-driven Decision Support using Evaluative AI. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 333–342, 2023.
  • 78 Govindan Subramanian, Bharath Ramsundar, Vijay Pande, and Rajiah Aldrin Denny. Computational modeling of β𝛽\betaitalic_β-secretase 1 (bace-1) inhibitors using ligand based approaches. Journal of chemical information and modeling, 56(10):1936–1949, 2016.
  • 79 Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
  • 80 Bharath Ramsundar, Peter Eastman, Patrick Walters, Vijay Pande, Karl Leswing, and Zhenqin Wu. Deep Learning for the Life Sciences. O’Reilly Media, 2019. https://www.amazon.com/Deep-Learning-Life-Sciences-Microscopy/dp/1492039837.
  • 81 Anantha Krishnan Dhanabalan, Manish Kesherwani, Devadasan Velmurugan, and Krishnasamy Gunasekaran. Identification of new bace1 inhibitors using pharmacophore and molecular dynamics simulations approach. Journal of Molecular Graphics and Modelling, 76:56–69, 2017.
  • 82 Christoph Molnar, S Gruber, and P Kopper. Limitations of interpretable machine learning methods, 2020.
  • 83 Kjersti Aas, Martin Jullum, and Anders Løland. Explaining individual predictions when features are dependent: More accurate approximations to shapley values. Artificial Intelligence, 298:103502, 2021.
  • 84 Ayana Ghosh and Saurabh Ghosh. Mapping causal pathways with structural modes fingerprint for perovskite oxides. Machine Learning: Science and Technology, 2024.
  • 85 Adrien Feix and Časlav Brukner. Quantum superpositions of common-causeand direct-cause causal structures. New Journal of Physics, 19(12):123028, 2017.
  • 86 Zachary R Fox and Ayana Ghosh. Active causal learning for decoding chemical complexities with targeted interventions. arXiv preprint arXiv:2404.04224, 2024.
  • 87 Nikolai Miklin, Alastair A Abbott, Cyril Branciard, Rafael Chaves, and Costantino Budroni. The entropic approach to causal correlations. New Journal of Physics, 19(11):113041, 2017.
  • 88 Ayana Ghosh. Towards physics-informed explainable machine learning and causal models for materials research. Computational Materials Science, 233:112740, 2024.
  • 89 Anik Saha, Oktie Hassanzadeh, Alex Gittens, Jian Ni, Kavitha Srinivas, and Bulent Yener. A cross-domain evaluation of approaches for causal knowledge extraction. arXiv preprint arXiv:2308.03891, 2023.
  • 90 Sichao Li and Amanda S Barnard. Multi-target neural network predictions of MXenes as high-capacity energy storage materials in a Rashomon set. Cell Reports Physical Science, 4(11):101675, 2023.
  • 91 Satyapriya Krishna, Tessa Han, Alex Gu, Javin Pombra, Shahin Jabbari, Steven Wu, and Himabindu Lakkaraju. The disagreement problem in explainable machine learning: A practitioner’s perspective. arXiv preprint arXiv:2202.01602, 2022.