A Case Study for Intent Recognition
Modern AI models, including large language models, can perform a wide range of tasks with minimal adaptation. However, this flexibility can come at the cost of transparency and interpretability: why did the model give the answer it did? At BluelightAI we are working to make AI systems more transparent so that models can be deployed in critical roles while maintaining trust and observability. This is especially important in regulated industries like banking and insurance, where frameworks such as SR 11-7 require documented proof of what drove each model decision before deployment can proceed.
In this blog post we’ll outline one approach we are developing for interpretable classification and present a case study applying it to intent recognition. The approach combines LLMs with cross-layer transcoders (CLTs), which decode the internal activations of an LLM into thousands of distinct, human-interpretable features. These features can power sophisticated methods for tracing a model’s computations, but here we use them as the foundation for a simple interpretable classifier: the LLM and CLT together act as a feature extractor that decomposes input text into interpretable features, which then serve as input to a simple, transparent classifier while maintaining high performance.
Key Findings on Intent Recognition
We evaluated this method on the DSTC11 Track 2 dataset, a realistic benchmark for intent induction in customer service dialogue developed by Amazon Science, which was used as part of a 2023 challenge with 34 participating teams. The dataset is drawn from insurance-domain interactions and features 913 labeled utterances across 22 intent classes (e.g., GetQuote, FileClaim).
We trained a logistic regression classifier, a simple linear model, on the extracted features and achieved 95.6% accuracy on the held-out test set, far above the random-chance baseline of 4.5%.
The core technical success lies in the feature design and the classifier’s sparsity:
Feature Extraction: We used the Qwen3-1.7B-Base model paired with our CLT to extract 20,480 sparse features per layer. To manage dimensionality, we restricted analysis to the last 13 transformer layers (15-27) and then pruned to the top 5% of features by variance, resulting in 13,312 dimensions per utterance.
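The variance-based pruning step can be sketched in a few lines. This is a hedged illustration, not our production pipeline: the random matrix below stands in for the real CLT activations (one row per utterance, columns being the concatenated per-layer features), and the sizes are shrunk so the sketch runs quickly.

```python
import numpy as np

# Stand-in for CLT activations: in the real pipeline this would be
# 13 layers x 20,480 features = 266,240 columns per utterance.
rng = np.random.default_rng(0)
n_utterances, n_features = 100, 2000
acts = rng.random((n_utterances, n_features))

# Keep only the top 5% of features by variance across the dataset.
k = int(0.05 * n_features)
variances = acts.var(axis=0)
top_idx = np.argsort(variances)[-k:]   # indices of the k highest-variance features
pruned = acts[:, top_idx]

print(pruned.shape)  # (100, 100)
```

With the real dimensions, the same operation takes 266,240 columns down to the 13,312 retained dimensions per utterance.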
Model Sparsity: We enforced sparsity by applying L1 regularization to the model, resulting in only 563 non-zero weights out of over 5.7 million total parameters. This means each prediction relies on a small, identifiable set of features, on average about 25 features per intent class, making the model’s decision-making process transparent.
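An L1-penalized logistic regression of this kind is straightforward to fit with scikit-learn. The snippet below is a minimal sketch on synthetic data (our actual training setup, regularization strength, and solver settings may differ); the point is that the L1 penalty drives most weights to exactly zero, leaving a small, inspectable set of features per class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the pruned CLT feature matrix
# (913 utterances x 13,312 dims in the post; smaller here).
rng = np.random.default_rng(0)
X = rng.random((300, 500))
y = np.arange(300) % 22          # 22 intent classes, all represented

# penalty="l1" with the saga solver yields a sparse weight matrix.
clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=2000)
clf.fit(X, y)

nonzero = np.count_nonzero(clf.coef_)
print(f"{nonzero} non-zero weights out of {clf.coef_.size} total")
```

Tuning `C` (the inverse regularization strength) trades off sparsity against accuracy; smaller values zero out more weights.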
Interpretability
The resulting model is both locally and globally interpretable. Each CLT feature activates in a well-defined, interpretable set of contexts, examples of which can be seen in our feature explorer. For any given prediction, we can:
- Identify the features that contributed most strongly to the decision.
- Inspect their maximum-activating examples to understand the underlying semantic concept the feature represents (e.g., legal language, pricing context).
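For a linear model, the per-prediction attribution above is simple to compute: the contribution of feature j to class c on input x is just `coef[c, j] * x[j]`. The sketch below uses random stand-in weights and activations to illustrate the mechanics; the shapes and names are illustrative, not our trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
coef = rng.normal(size=(22, 50))   # (n_classes, n_features) learned weights
x = rng.random(50)                 # pruned CLT activations for one utterance

# Predicted class is the one with the highest linear score.
predicted_class = int(np.argmax(coef @ x))

# Element-wise products give each feature's contribution to that score.
contributions = coef[predicted_class] * x

# Top-5 features driving this prediction, most influential first.
top5 = np.argsort(contributions)[::-1][:5]
for j in top5:
    print(f"feature {j}: contribution {contributions[j]:.2f}")
```

Mapping each top index back to its (layer, feature) pair and that feature's maximum-activating examples yields explanations like the ones in the examples below.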
Example 1: “I need to file a claim for my car accident”
Predicted Intent: FileClaim (correct)
Top Contributing Features:
Layer 23, Feature 10794 (contribution: 2.28): This feature fires strongly on legal language involving claims and disputes. Max-activating examples include “asserted a cross-claim against Morse/Diesel” and “cause injury or damage… a charge of defamation.”
Layer 19, Feature 16199 (contribution: 0.75): Activates on text about filing legal documents. Examples include “filed a pro se petition for writ of habeas corpus” and “plaintiff filed an action with this court.”
Layer 23, Feature 4955 (contribution: 0.61): Fires on court filings and complaints. Examples: “filed counterclaims alleging malicious abuse of process” and “filed a complaint against the City.”
The model correctly identifies “file a claim” by detecting features associated with legal filing language. While the CLT was trained on general text (not insurance data), it learned features for concepts like “legal claims” and “filing documents” that transfer effectively to the insurance domain. This kind of per-prediction attribution shows exactly which semantic concepts drove a classification. Model risk teams in banking and insurance require this type of documentation before approving AI-assisted workflows such as collections recommendations, fraud case narratives, or AML behavioral conclusions.
Example 2: “Can I get a quote for auto insurance?”
Predicted Intent: GetQuote (correct)
Top Contributing Features:
Layer 21, Feature 12630 (contribution: 0.87): Fires on real estate and pricing contexts. Examples discuss property pricing and purchasing decisions.
Layer 24, Feature 19798 (contribution: 0.70): Activates on sales and brochure language. Examples include “offers for sales, brochures, advertisements” and product/service offerings.
Here the model picks up on features related to pricing, offers, and sales inquiries, semantic concepts that generalize from the CLT’s training data to the “get a quote” intent.
Conclusion
There are many scenarios where we need an AI model to reliably make a selection from a predefined set of options based on an unstructured text input. Off-the-shelf LLMs can often perform this task very well, but offer very limited visibility into the reasons behind their decisions. In settings where auditability and explainability are required for regulatory, trust, or reliability reasons, this can prevent adoption of a promising AI application.
In banking and insurance, where AI is increasingly recommending collections settlement offers, synthesizing fraud case narratives, and triggering KYC reviews, stakeholders need to understand and trust each decision the model makes. The approach we outline here is a way to leverage the LLM’s deep understanding of natural language without relying on its opaque internal reasoning. By using the LLM alongside a CLT to extract interpretable features from unstructured text, we transform an opaque AI system into a transparent one that gives every stakeholder, from the data scientist to the chief risk officer, a clear answer to the question: why did the model make this decision?