An overview of our pipeline for developing and benchmarking trustworthy models at different stages of the machine learning lifecycle.


Understanding Training Dynamics of Large Scale Models

A myriad of machine learning models have been trained under supervised, unsupervised, or self-supervised settings, optimizing for different predictive and trustworthiness objectives. While these models achieve state-of-the-art performance on downstream predictive and generative tasks, we have little understanding of their exact training dynamics, e.g., which examples are most important for the model to converge? Which examples does the model learn first? Do harder examples end up getting memorized? How do the early and late stages of training differ? In addition, previous research has shown that it is possible to train deep neural networks with fairness, robustness, and explainability properties, but it has not yet delivered a computationally efficient way to train large-scale visual-language and large language models. Further, the different pillars of trustworthy ML promise great success independently, but they are primarily studied in dedicated silos, and it remains unclear whether and how they relate to one another. Hence, we focus on three key questions with respect to training large-scale trustworthy models:

  1. When, and in which model parameters, does a model i) memorize training examples, ii) start hallucinating, iii) become prone to noise, and iv) learn biased features during the training process?
  2. How can we incorporate and enhance trustworthiness properties at different stages (early or late) of the training pipeline?
  3. What are the interconnections and trade-offs between different trustworthiness properties?

Addressing these questions will help us understand the training dynamics of a model, and, moving forward, we aim to develop architectures and training algorithms that lead to more trustworthy models. For instance, we have developed techniques to understand the training dynamics of an ML model, where we use gradient explanations to identify samples that are easier or harder for the model to learn (see VoG), as sketched below.
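As a rough illustration of the VoG idea, the sketch below scores each example by the variance of its input gradients across training checkpoints (generic PyTorch; the checkpoint list and tensor shapes are assumptions for illustration, not the exact VoG implementation):

```python
import torch

def vog_scores(checkpoints, inputs, labels):
    """Score each example by the variance of its input gradients across
    training checkpoints; higher scores tend to indicate harder examples."""
    per_ckpt_grads = []
    for model in checkpoints:                       # models saved at different epochs
        model.eval()
        x = inputs.clone().requires_grad_(True)
        logits = model(x)
        # gradient of the true-class logit with respect to the input pixels
        true_class_score = logits.gather(1, labels.view(-1, 1)).sum()
        grad = torch.autograd.grad(true_class_score, x)[0]   # (N, C, H, W)
        per_ckpt_grads.append(grad.detach())
    grads = torch.stack(per_ckpt_grads)             # (K, N, C, H, W)
    # variance over checkpoints, averaged over pixels: one score per example
    return grads.var(dim=0, unbiased=False).mean(dim=(1, 2, 3))
```

Ranking the training set by this score surfaces candidate easy and hard examples for the kinds of training-dynamics analyses described above.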

A Tale of Two Long Tails
Towards a Unified Framework for Fair and Stable Graph Representation Learning
Estimating Example Difficulty using Variance of Gradients
Towards Training GNNs using Explanation Directed Message Passing

Read More


Making Pre-Trained Models More Trustworthy during Inference

Recent years have seen an enormous growth in the use of large-scale models that are trained on broad, uncurated datasets using self-supervised learning and adapted to a wide range of downstream tasks. In contrast to supervised models, self-supervised pretrained models learn general patterns and features from the data and embed them in a densely packed representation space. In most cases we are only provided with the output representations of these models, and we have little to no understanding of how to interpret them; until we unfold this representation space, it is very hard to understand what these closed models learn. Our view is that large models embed the features and patterns they learn in small pockets of this densely packed representation space that can be used for specific downstream tasks, and that tools like in-context learning (ICL) and chain-of-thought (CoT) prompting help us navigate this complex space. The question then becomes: how unique and trustworthy are these representations?

Most organizations training visual-language and large language models provide their representations as an encoder service, where users employ the representations from these models for their respective downstream tasks. Hence, it is important to disentangle the representations from these models and understand their trustworthiness properties. We explore different robustness, explainability, uncertainty, and fairness properties using inference- and API-level access to visual-language and large language models. For example, with RL frameworks being deployed at scale and acting autonomously, it becomes imperative to incorporate explainability into them. In particular, we develop counterfactual methods that generate explanations by asking: “What least change to the current policy would improve or worsen it to a new policy with a specified target return?” The generated counterfactual policies provide direct insight into how a policy can be modified to achieve better results, as well as what to avoid so as not to deteriorate performance (see figure).
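Schematically (in our simplified notation, not the exact formulation from the paper), the counterfactual question can be posed as a constrained optimization problem:

```latex
% Sketch: smallest change to policy \pi that attains a specified target return
\pi' \;=\; \arg\min_{\tilde{\pi}}\; D\bigl(\tilde{\pi},\, \pi\bigr)
\quad \text{subject to} \quad J(\tilde{\pi}) = R_{\mathrm{target}}
```

where D measures the divergence between the counterfactual and the original policy (e.g., a KL divergence averaged over visited states) and J denotes the expected return of a policy.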

Towards Fair Knowledge Distillation using Student Feedback
Towards Safe Large Language Models for Medicine
On the Trade-offs between Adversarial Robustness and Actionable Explanations
Towards the Unification and Robustness of Perturbation and Gradient Based Explanations
Exploring Counterfactual Explanations Through the Lens of Adversarial Examples: A Theoretical and Empirical Analysis
DeAR: Debiasing Vision-Language Models with Additive Residuals
Counterfactual Explanation Policies in RL
Certifying LLM Safety against Adversarial Prompting

Read More


Scalable and Novel Algorithms for Explaining Large Multimodal Models

A popular strategy for explaining the decision of a predictive or generative model is to attribute that decision to either i) the input features (i.e., feature attribution) and/or ii) samples from the training dataset (i.e., data attribution). Intuitively, an attribution map over the input is a heatmap that highlights the input features (pixels in images, tokens in language, nodes/edges in graphs, temporal windows in time series, and trajectories for RL agents) that serve as evidence for and against the model's output. A widely used technique for constructing attribution maps is to approximate the attribution value of an input region by the change in the output probability when that region is absent, i.e., removed from the input.
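For instance, a minimal occlusion-style attribution for an image classifier might look like the following (generic PyTorch sketch; the patch size and constant fill value are illustrative choices, and real inputs may need padding when dimensions are not divisible by the patch size):

```python
import torch

def occlusion_attribution(model, image, target_class, patch=16, fill=0.0):
    """Attribution of each patch = drop in target-class probability
    when that patch is removed (replaced by a constant fill value)."""
    model.eval()
    with torch.no_grad():
        base_prob = model(image.unsqueeze(0)).softmax(-1)[0, target_class]
        _, H, W = image.shape
        heatmap = torch.zeros(H // patch, W // patch)
        for i in range(0, H, patch):
            for j in range(0, W, patch):
                occluded = image.clone()
                occluded[:, i:i + patch, j:j + patch] = fill   # "remove" the region
                prob = model(occluded.unsqueeze(0)).softmax(-1)[0, target_class]
                heatmap[i // patch, j // patch] = base_prob - prob
    return heatmap   # positive values: evidence for the prediction
```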

While removing an input feature to measure its attribution is a principled approach (i.e., an “intervention” in causal reasoning), a key open question is: how should features be removed? The removal of input features often leads to out-of-distribution (OoD) inputs (e.g., noisy images, perturbed tokens, disconnected graphs) on which the underlying models were never trained. Because ML models are easily fooled by unusual input patterns, we hypothesize that such examples might yield unfaithful explanations, i.e., explanations that do not represent the internal reasoning of the model. We develop several attribution-based techniques that generate explanations that are faithful to the underlying model and interpretable to human users and stakeholders. For example, we introduced generative attribution strategies that can be combined with existing feature attribution algorithms to produce more faithful and accurate explanations, and we extend these techniques to RL agents, where we attribute the policy decisions of a trained agent to the trajectories it encountered during training.
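A minimal sketch of the generative removal idea, assuming a hypothetical `inpainter` model that re-synthesizes a masked region from its surroundings (the name and interface are placeholders, not our released implementation):

```python
import torch

def generative_attribution(model, inpainter, image, mask, target_class, n_samples=8):
    """Attribution of the masked region = expected drop in target-class
    probability when the region is replaced by in-distribution infills
    drawn from a generative inpainting model (hypothetical `inpainter`)."""
    model.eval()
    with torch.no_grad():
        base = model(image.unsqueeze(0)).softmax(-1)[0, target_class]
        drops = []
        for _ in range(n_samples):
            # `inpainter(image, mask)` is assumed to return a plausible image
            # with the masked region re-synthesized from its surroundings
            infilled = inpainter(image, mask)
            prob = model(infilled.unsqueeze(0)).softmax(-1)[0, target_class]
            drops.append(base - prob)
    return torch.stack(drops).mean()
```

Replacing the region with plausible in-distribution content, rather than a constant, keeps the perturbed inputs closer to the data the model was trained on.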

In our work, we develop new explainability algorithms to study the behavior of complex black-box unimodal and multimodal models. Research on multimodal explanation methods is at a nascent stage, and we aim to introduce novel algorithms that generate actionable explanations for multimodal models.

Are Large Language Models Post Hoc Explainers?
Explaining Image Classifiers by Removing Input Features Using Generative Models
Explaining RL Decisions with Trajectories
Intriguing Properties of Visual-Language Model Explanations
Towards the Unification and Robustness of Perturbation- and Gradient-based Explanations

Read More


Large Scale Benchmarking of Post Hoc Explainers for Diverse Data Modalities

The increasing use of predictive and generative models in high-stakes domains, including healthcare, law, and finance, requires us to develop algorithms and techniques that can explain the behavior of these complex machine learning models. There is a growing demand for explaining the behavior of these large-scale models so that both model developers and stakeholders can understand the rationale behind their predictions or responses and determine if and when to rely on them. To this end, various techniques have been proposed in recent literature to generate post hoc explanations of individual predictions made by complex ML models trained on different data modalities such as images, text, graphs, and tables. In general, these explanation methods output the influence of each input feature on the model's output, which is used as a proxy for a model explanation. However, it is critical to answer: which explanation methods are effective, with respect to what notions of reliability, and under what conditions? Answering this question will ensure that the explanations generated by these methods are reliable, so that relevant stakeholders and decision makers are provided with credible information about the underlying models.

We take the first step towards answering this question by designing benchmarks with a broad ecosystem of data loaders, data processing functions, explanation methods, evaluation metrics (e.g., accuracy, faithfulness, stability, fairness), and explanation visualizers to reliably benchmark the quality of any given explanation for models trained on different modalities. We provide open-source implementations and ready-to-use API interfaces for different post hoc explanation methods, and we release a collection of predictive models trained on synthetic and real-world datasets for benchmarking explanation methods. This promotes transparency and allows users to easily compare the performance of multiple explanation methods across a wide variety of synthetic and real-world datasets, evaluation metrics, and predictive models.
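As an illustration of the kind of metric such a benchmark computes, the sketch below estimates a prediction-gap-style faithfulness score for a tabular model (generic PyTorch; the function name, perturbation scheme, and defaults are illustrative, not the exact metric implementation in our benchmark):

```python
import torch

def prediction_gap_on_important(model, x, attribution, top_k=5, noise=0.1, n_samples=20):
    """Illustrative faithfulness score: perturb only the top-k most important
    features (per the explanation) and measure the average change in the
    model's prediction. Larger gaps suggest a more faithful explanation."""
    model.eval()
    with torch.no_grad():
        base = model(x.unsqueeze(0)).softmax(-1)[0]
        top = attribution.abs().topk(top_k).indices       # most important features
        gaps = []
        for _ in range(n_samples):
            perturbed = x.clone()
            perturbed[top] += noise * torch.randn(top_k)  # perturb only those features
            gaps.append((model(perturbed.unsqueeze(0)).softmax(-1)[0] - base).abs().sum())
    return torch.stack(gaps).mean()
```

Averaging such scores over a test set, and pairing them with complementary metrics (e.g., stability under input perturbations), gives a comparable quality profile for each explanation method.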

We aim to develop easy-to-use, reproducible benchmarks (synthetic datasets and evaluation metrics) to quantify explanations generated for multimodal models such as LVLMs, where explanations are not limited to a single data modality.

Quantifying Uncertainty in Natural Language Explanations of LLMs
OpenXAI: Towards a Transparent Evaluation of Post hoc Model Explanations
Probing GNN Explainers: A Rigorous Theoretical and Empirical Analysis of GNN Explanation Methods
Evaluating Explainability for Graph Neural Networks
SAM: The Sensitivity of Attribution Methods to Hyperparameters

Read More