
Improving AI models' ability to explain their predictions

A new approach could help users understand how far they can trust a model's predictions in safety-critical applications such as health care and autonomous driving.

In critical areas such as medical diagnostics, users often seek to understand the reasoning behind a computer vision model's predictions to assess its reliability.

Concept bottleneck modeling is one approach that allows artificial intelligence systems to clarify their decision-making processes. This method compels a deep-learning model to utilize a set of human-understandable concepts for making predictions. Recent research by MIT computer scientists has introduced a technique that encourages the model to achieve improved accuracy along with clearer and more concise explanations.

The concepts utilized by the model are typically defined beforehand by human experts. For example, a clinician might suggest concepts like “clustered brown dots” and “variegated pigmentation” to indicate that a medical image may show melanoma.

However, predefined concepts may be irrelevant to a given task or lack the detail it requires, which can reduce the model's accuracy. The new method instead extracts concepts the model has already learned during training for that specific task and compels the model to use them, yielding better explanations than standard concept bottleneck models.

This approach employs a pair of specialized machine-learning models that automatically extract knowledge from a target model and translate it into easily understandable concepts. Ultimately, their technique can transform any pretrained computer vision model into one that can explain its reasoning using concepts.

“Essentially, we want to understand the thought processes of these computer vision models. A concept bottleneck model allows users to discern what the model is thinking and why it made a specific prediction. Our method, which employs better concepts, can enhance accuracy and improve the accountability of black-box AI models,” explains lead author Antonio De Santis, a graduate student at Polytechnic University of Milan who conducted this research while visiting the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT.

He is co-authoring a paper on this work alongside Schrasing Tong SM ’20, PhD ’26; Marco Brambilla, a professor of computer science and engineering at Polytechnic University of Milan; and senior author Lalana Kagal, a principal research scientist at CSAIL. The research will be presented at the International Conference on Learning Representations.

Enhancing the bottleneck

Concept bottleneck models (CBMs) are widely used to improve AI explainability. These techniques introduce an intermediate step where a computer vision model predicts the concepts present in an image before making a final prediction.

This intermediate step, or “bottleneck,” aids users in understanding the model’s reasoning.

For instance, a model identifying bird species might select concepts like “yellow legs” and “blue wings” before concluding that the bird is a barn swallow.
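The two-stage structure can be sketched in a few lines. This is a minimal illustration, not the researchers' implementation: the concept names, the 16-dimensional "embedding," and the random weights are all hypothetical stand-ins.

```python
import numpy as np

# Hypothetical concept vocabulary chosen in advance, as in a standard CBM.
CONCEPTS = ["yellow legs", "blue wings", "forked tail", "red throat"]
SPECIES = ["barn swallow", "blue jay"]

rng = np.random.default_rng(0)
W_concept = rng.normal(size=(len(CONCEPTS), 16))          # stage 1: image -> concepts
W_label = rng.normal(size=(len(SPECIES), len(CONCEPTS)))  # stage 2: concepts -> label

def predict(image_features: np.ndarray):
    """Two-stage CBM forward pass: the final label depends ONLY on the
    intermediate concept scores -- that restriction is the 'bottleneck'."""
    scores = 1 / (1 + np.exp(-W_concept @ image_features))  # sigmoid per concept
    label = SPECIES[int(np.argmax(W_label @ scores))]
    return dict(zip(CONCEPTS, scores.round(2))), label

features = rng.normal(size=16)  # stand-in for a real image embedding
concept_scores, label = predict(features)
print(concept_scores)
print("prediction:", label)
```

Because the label is computed only from the concept scores, a user can read off exactly which concepts drove the prediction.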

However, since these concepts are often generated in advance by humans or large language models (LLMs), they may not be tailored to the specific task. Additionally, even with predefined concepts, the model may still utilize irrelevant learned information, a problem known as information leakage.

“These models are designed to maximize performance, so they might inadvertently use concepts that are not apparent to us,” De Santis notes.

The MIT researchers proposed a different approach: recognizing that the model, trained on extensive data, may have learned the necessary concepts for making accurate predictions for the specific task. They aimed to create a CBM by extracting this existing knowledge and translating it into human-readable text.

In the initial step of their method, a specialized deep-learning model called a sparse autoencoder selectively identifies the most relevant features learned by the model and reconstructs them into a limited set of concepts. A multimodal LLM then describes each concept in simple language.
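The core mechanics of a sparse autoencoder can be sketched as follows. This is a toy with untrained random weights and invented dimensions; in the actual method the weights are trained to reconstruct the target model's internal activations, and a multimodal LLM then names each active code unit.

```python
import numpy as np

rng = np.random.default_rng(1)

class SparseAutoencoder:
    """Toy sparse autoencoder: expands a dense activation vector into a
    wider, mostly-zero code whose active units can be read as 'concepts'."""
    def __init__(self, d_in: int, d_code: int, k: int):
        self.W_enc = rng.normal(size=(d_code, d_in)) / np.sqrt(d_in)
        self.W_dec = rng.normal(size=(d_in, d_code)) / np.sqrt(d_code)
        self.k = k  # number of code units allowed to stay active

    def encode(self, x: np.ndarray) -> np.ndarray:
        z = np.maximum(self.W_enc @ x, 0.0)  # ReLU code
        cutoff = np.sort(z)[-self.k]         # keep only the top-k units
        return np.where(z >= cutoff, z, 0.0)

    def decode(self, z: np.ndarray) -> np.ndarray:
        return self.W_dec @ z

sae = SparseAutoencoder(d_in=32, d_code=128, k=5)
activations = rng.normal(size=32)       # pretend hidden-layer activations
code = sae.encode(activations)
print("active concept units:", np.flatnonzero(code))
recon = sae.decode(code)                # reconstruction from the sparse code
```

The sparsity is what makes the code interpretable: each input activates only a handful of units, so each unit tends to correspond to a distinct, describable feature.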

This multimodal LLM also annotates images in the dataset by identifying which concepts are present and absent in each image. The researchers use this annotated dataset to train a concept bottleneck module to recognize these concepts.

They integrate this module into the target model, compelling it to make predictions using only the set of learned concepts that the researchers extracted.
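A rough sketch of training the bottleneck module, under simplifying assumptions not taken from the paper: the "LLM annotations" here are synthetic binary labels, the module is a per-concept logistic probe on frozen features, and all sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: 200 images, 32-dim features from the frozen target
# model, and binary annotations for 6 extracted concepts.
n, d, c = 200, 32, 6
features = rng.normal(size=(n, d))
true_W = rng.normal(size=(c, d))
annotations = (features @ true_W.T > 0).astype(float)  # stand-in LLM labels

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Train the concept bottleneck module: one logistic probe per concept,
# fit by gradient descent on the frozen features.
W = np.zeros((c, d))
for _ in range(500):
    probs = sigmoid(features @ W.T)
    grad = (probs - annotations).T @ features / n
    W -= 1.0 * grad

acc = ((sigmoid(features @ W.T) > 0.5) == annotations).mean()
print(f"concept annotation accuracy: {acc:.2f}")
# The target model's original head is then replaced, so final predictions
# are a function of these concept scores only.
```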

Managing the concepts

Throughout the development of this method, the researchers faced various challenges, including ensuring accurate annotation of concepts by the LLM and verifying that the sparse autoencoder identified concepts understandable to humans.

To prevent the model from using unknown or irrelevant concepts, they limit it to five concepts for each prediction. This restriction encourages the model to select the most relevant concepts and enhances the clarity of the explanations.
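The five-concept restriction described above amounts to keeping only the strongest concept activations per prediction. A minimal sketch (the function name and scores are illustrative, not from the paper):

```python
import numpy as np

def restrict_to_top_k(concept_scores: np.ndarray, k: int = 5) -> np.ndarray:
    """Zero out all but the k strongest concept activations, so every
    prediction can be explained with at most k concepts."""
    if concept_scores.size <= k:
        return concept_scores
    keep = np.argpartition(concept_scores, -k)[-k:]  # indices of top-k scores
    mask = np.zeros_like(concept_scores)
    mask[keep] = 1.0
    return concept_scores * mask

scores = np.array([0.9, 0.05, 0.7, 0.6, 0.2, 0.8, 0.1, 0.65])
sparse = restrict_to_top_k(scores)
print(np.flatnonzero(sparse))  # indices of the 5 concepts actually used
```

Forcing the final prediction through only the surviving scores pushes the model toward its most relevant concepts and keeps explanations short.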

When comparing their approach to leading CBMs on tasks such as predicting bird species and identifying skin lesions in medical images, their method achieved the highest accuracy while providing more precise explanations.

Their approach also generated concepts that were more relevant to the images in the dataset.

“We’ve demonstrated that extracting concepts from the original model can outperform other CBMs, but there remains a tradeoff between interpretability and accuracy that needs to be addressed. Black-box models that lack interpretability still outperform ours,” De Santis remarks.

Looking ahead, the researchers aim to explore potential solutions to the information leakage issue, possibly by adding more concept bottleneck modules to prevent unwanted concepts from leaking through. They also plan to scale up their method by utilizing a larger multimodal LLM to annotate a more extensive training dataset, which could enhance performance.

“I’m excited about this work because it advances interpretable AI in a promising direction and creates a natural connection to symbolic AI and knowledge graphs,” says Andreas Hotho, professor and head of the Data Science Chair at the University of Würzburg, who was not involved in this research. “By deriving concept bottlenecks from