How Reliable Are AI Concepts in Real-World Domain Adaptation Tasks?
Table of Links
Abstract and 1 Introduction
2 Related Work
3 Methodology and 3.1 Representative Concept Extraction
3.2 Self-supervised Contrastive Concept Learning
3.3 Prototype-based Concept Grounding
3.4 End-to-end Composite Training
4 Experiments and 4.1 Datasets and Networks
4.2 Hyperparameter Settings
4.3 Evaluation Metrics and 4.4 Generalization Results
4.5 Concept Fidelity and 4.6 Qualitative Visualization
5 Conclusion and References
Appendix
4.3 Evaluation Metrics
We use the following metrics to evaluate each component of the concept discovery framework.
• Generalization: We start by quantitatively evaluating the quality of the learned concepts by measuring how well they generalize to new domains. To this end, we compare our proposed method against the aforementioned baselines in domain adaptation settings.
• Concept Fidelity: To evaluate consistency in the learned concepts, we compute the intersection over union of the concept sets associated with two data points x_i and x_j from the same class, as defined in Equation 10 (a sketch of this formula follows the list):
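Equation 10 is not reproduced in this excerpt; the following is a plausible reconstruction of the IoU-based fidelity score implied by the description above, where $\mathcal{C}(x)$ denotes the set of concepts activated for input $x$:

\[
\mathrm{Fidelity}(x_i, x_j) \;=\; \frac{\lvert \mathcal{C}(x_i) \cap \mathcal{C}(x_j) \rvert}{\lvert \mathcal{C}(x_i) \cup \mathcal{C}(x_j) \rvert}, \qquad \text{for } y_i = y_j .
\]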
4.4 Generalization Results
Tables 2, 3, and 4 report the domain adaptation results on the OfficeHome, DomainNet, VisDA, and Digit datasets. The notation X→Y denotes a model trained with X as the source domain (with abundant data) and Y as the target domain (with limited data), and evaluated on the test set of domain Y. The best statistically significant accuracy is shown in bold. The last three rows of each table list the performance of the RCE framework, RCE trained with regularization (RCE+PCG), and RCE trained with both regularization and the contrastive learning paradigm (RCE+PCG+CCL).
Comparison with baselines. The first row in each table lists the performance of a standard neural network trained in the setting described in [Yu and Lin, 2023] (S+T). As a standard NN is not inherently explainable, we treat this setting as a baseline that indicates the upper bound of the performance-explainability tradeoff.
The second and third rows in each table list the performance of SENN and DiSENN, respectively. SENN performs worse than the S+T setting in almost all cases, except in a handful of settings where its performance matches S+T. This is expected, as SENN is formulated as an overparameterized version of a standard NN with regularization. Recall that DiSENN replaces the autoencoder in SENN with a VAE and, as such, does not generalize to larger datasets without domain engineering. DiSENN performs the worst among all approaches on all datasets due to poor VAE generalization.
Recall that UnsupervisedCBM is an improved version of the SENN architecture with a discriminator in addition to the aggregation function. In most cases, it performs slightly better than SENN and is on par with S+T. However, in particular settings on the OfficeHome data (R→A) and DomainNet (S→P), UnsupervisedCBM performs best. We attribute this result to two factors: first, the Art (A) and Sketch (S) domains differ significantly from the Real (R) and Picture (P) domains, as the former are hand-drawn while the latter are photographed, as noted in [Yu and Lin, 2023]. Second, the discriminator proposed in UnsupervisedCBM helps enforce domain invariance in those cases.
BotCL explicitly attempts to improve concept fidelity and applies contrastive learning to discover concepts. However, its contrastive loss formulation is rather basic and does not target domain invariance. BotCL's performance is similar to S+T for the most part, except on the OfficeHome data (C→A), where it narrowly outperforms all other approaches. One possible reason is that the Clipart domain is significantly less noisy, so the basic transformations used by BotCL work well.
As the last row of each table demonstrates, our proposed framework RCE+PCG+CCL outperforms all baselines in the vast majority of settings across all four datasets and is comparable to the SOTA baselines in the remaining settings.
Ablation studies. We also report the performance of the various components of our proposed approach. We observe that the performance of RCE is almost identical to SENN, which is expected, as both use very weak regularization. In almost all cases, adding prototype-based grounding regularization (RCE+PCG) improves performance over RCE, while models trained with both PCG regularization and contrastive learning (RCE+PCG+CCL) outperform all approaches in the vast majority of settings across all datasets. Note that the RCE+CCL setting is not reported, as it defeats the fundamental motivation of maintaining concept fidelity.
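To make the ablation concrete, the following is a minimal PyTorch-style sketch of how such a composite objective could be assembled. The model interface, loss weights, temperature, and the specific forms of the grounding and contrastive terms are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def composite_loss(model, x_src, y_src, x_view1, x_view2, prototypes,
                   lambda_pcg=0.1, lambda_ccl=0.1, temperature=0.1):
    """Hypothetical composite objective: task loss + prototype grounding + contrastive term.
    x_view1 / x_view2 are two augmented views of the same (unlabelled) images."""
    # Task (classification) loss on labelled source data.
    concepts_src, logits_src = model(x_src)        # assumed interface: (concept activations, class logits)
    task_loss = F.cross_entropy(logits_src, y_src)

    # PCG: pull each concept activation vector towards its nearest prototype
    # (one plausible form of prototype-based grounding).
    dists = torch.cdist(concepts_src, prototypes)  # (batch, n_prototypes)
    pcg_loss = dists.min(dim=1).values.mean()

    # CCL: InfoNCE between concept activations of two views of the same image,
    # with the rest of the batch acting as negatives (a simplified contrastive term).
    z1, _ = model(x_view1)
    z2, _ = model(x_view2)
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    ccl_loss = F.cross_entropy(logits, targets)

    return task_loss + lambda_pcg * pcg_loss + lambda_ccl * ccl_loss
```

Dropping either auxiliary term (setting lambda_pcg or lambda_ccl to zero) corresponds to the RCE+CCL or RCE+PCG ablations discussed above.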
Effect of the number of concepts and dimensions. We observe no significant differences in performance when varying the number of concepts or their dimension. For all reported results, the number of concepts is set to the number of classes in the dataset and the concept dimension is set to 1. For results with varying numbers of concepts and dimensions, refer to the Appendix.
4.5 Concept Fidelity
As the RCE framework is explicitly regularized with a concept fidelity regularizer and grounded using prototypes, we expect high fidelity scores. Table 5 lists the fidelity scores for the aforementioned baselines and our proposed method. Fidelity scores are averaged over each domain when used as the target (e.g., for domain (A) in DomainNet, the score is the average of C→A, P→A, and R→A). As expected, our method and BotCL, both of which use explicit fidelity regularization, outperform all other baselines. Our method outperforms BotCL in most settings, except when the target domain is Art in DomainNet or Clipart in OfficeHome, due to significant domain dissonance.
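As a concrete illustration, a fidelity score of this kind can be computed by averaging the pairwise IoU of activated concept sets over all same-class pairs. The sketch below assumes precomputed concept activations and a thresholding rule for deciding which concepts are "active"; both are our own assumptions rather than the paper's exact procedure.

```python
import itertools
import numpy as np

def concept_set(activations, threshold=0.5):
    """Indices of concepts considered active for one sample (threshold is an assumed choice)."""
    return set(np.flatnonzero(activations > threshold))

def fidelity_score(concept_activations, labels):
    """Average IoU of activated concept sets over all same-class pairs (Equation 10-style metric).
    concept_activations: (n_samples, n_concepts) array; labels: (n_samples,) class labels."""
    scores = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        for i, j in itertools.combinations(idx, 2):
            a, b = concept_set(concept_activations[i]), concept_set(concept_activations[j])
            union = a | b
            scores.append(len(a & b) / len(union) if union else 1.0)
    return float(np.mean(scores)) if scores else 0.0
```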
4.6 Qualitative Visualization
Domain Alignment. We examine the extent to which models trained with both concept grounding and contrastive learning maintain concept consistency, not only within the source domain but also across the target domain. To understand what discriminative information a particular concept captures, Figure 5 shows the most important prototypes selected from the training sets of both the source and target domains for five randomly selected concepts. We observe that the prototypes explaining each concept are visually similar. For more results, refer to the Appendix.
Explanation using prototypes. For a given input sample, we also plot the prototypes associated with the most highly activated concept, i.e., the most important concept. Figure 6 shows the prototypes associated with the concepts most responsible for the prediction (highest relevance scores). As can be seen, the prototypes exhibit distinct features, e.g., they capture the round face of an alarm clock. More results are reported in the Appendix.
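One plausible way to produce such prototype explanations is to select the most strongly activated concept for the input and retrieve the training images that activate that concept most strongly. The sketch below uses |activation| as a relevance proxy and a cached matrix of training-set concept activations; both are illustrative assumptions rather than the authors' exact visualization procedure.

```python
import torch

def explain_with_prototypes(model, x, train_images, train_concepts, top_k=3):
    """Pick the most relevant concept for input x and return the training images
    that activate it most strongly (used here as its prototypes).
    train_concepts: (n_train, n_concepts) cached concept activations for the training set."""
    concepts, _ = model(x.unsqueeze(0))                      # assumed interface: (concept activations, logits)
    top_concept = concepts.squeeze(0).abs().argmax().item()  # |activation| as a simple relevance proxy

    # Training samples with the strongest activation of the chosen concept serve as its prototypes.
    proto_idx = torch.topk(train_concepts[:, top_concept], k=top_k).indices.tolist()
    return top_concept, [train_images[i] for i in proto_idx]
```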