AITTI: Learning Adaptive Inclusive Token for Text-to-Image Generation

S-Lab, Nanyang Technological University
arXiv, 2024

Abstract

Despite the high-quality results of text-to-image generation, stereotypical biases have been observed in the generated content, compromising the fairness of generative models. In this work, we propose to learn adaptive inclusive tokens to shift the attribute distribution of the final generative outputs. Unlike existing de-biasing approaches, our method requires neither explicit attribute specification nor prior knowledge of the bias distribution. Specifically, the core of our method is a lightweight adaptive mapping network, which customizes the inclusive tokens for the concepts to be de-biased, making the tokens generalizable to unseen concepts regardless of their original bias distributions. This is achieved by tuning the adaptive mapping network with a handful of balanced and inclusive samples using an anchor loss. Experimental results demonstrate that our method outperforms previous bias mitigation approaches that do not require attribute specification, while preserving the alignment between generative results and text descriptions. Moreover, our method achieves comparable performance to models that require specific attributes or editing directions for generation. Extensive experiments showcase the effectiveness of our adaptive inclusive tokens in mitigating stereotypical bias in text-to-image generation. The code will be made publicly available.

The main goal of this paper is to devise a bias mitigation method for biased concepts. An ideal inclusive T2I model yields results with evenly distributed sensitive attributes across all attribute classes, e.g., 50% male and 50% female for gender, when no attribute-related instructions are provided. A crucial aspect of a fair model is its ability to generate inclusive outcomes without direct instruction regarding the target attribute class. Moreover, users should not be expected to know in advance which biases are associated with a target concept.

Therefore, we argue that a good de-biasing algorithm should:

  1. Achieve fairer results without explicit specification of the target attribute class during generation.
  2. Require no prior knowledge of the original bias distribution associated with the concept (e.g., the doctor concept is stereotypically biased towards males).
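To make the notion of an evenly distributed attribute outcome concrete, the snippet below sketches how such a fairness score can be computed as the KL divergence between the attribute distribution detected over generated images (e.g., with an off-the-shelf attribute classifier) and the uniform distribution. We assume the DKL metric reported in the tables below is of this form; the function name and example counts are purely illustrative.

```python
import numpy as np

def kl_to_uniform(counts, eps=1e-8):
    """KL divergence D_KL(p || u) between the empirical attribute
    distribution p (from detected attribute counts) and the uniform
    distribution u over the attribute classes. Lower is fairer."""
    counts = np.asarray(counts, dtype=np.float64)
    p = counts / counts.sum()              # empirical distribution
    u = np.full_like(p, 1.0 / len(p))      # uniform target
    return float(np.sum(p * np.log((p + eps) / u)))

# Example: 80 male vs. 20 female faces detected for one prompt.
print(kl_to_uniform([80, 20]))   # biased   -> ~0.19
print(kl_to_uniform([50, 50]))   # balanced -> 0.0
```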

Framework of our proposed adaptive inclusive token for text-to-image generation. Blue indicates frozen weights; green indicates trainable weights. Left: the single training stage. Right: details of the text model with the adaptive mapping network. The adaptive inclusive token is concept-specific. Token IDs are for illustration only.
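For readers who prefer code, below is a minimal PyTorch sketch of the idea illustrated above: a small trainable mapping network turns the frozen embedding of the concept to be de-biased into a concept-specific inclusive token, which is then inserted into the prompt before the frozen text encoder. The module name, layer sizes, and insertion point are illustrative assumptions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class AdaptiveMappingNetwork(nn.Module):
    """Illustrative sketch of a lightweight mapping network: it maps the
    (frozen) embedding of a concept, e.g. "doctor", to an adaptive
    inclusive token embedding of the same dimension."""

    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, concept_emb: torch.Tensor) -> torch.Tensor:
        # concept_emb: (batch, dim) frozen concept embedding
        return self.net(concept_emb)  # (batch, dim) inclusive token embedding


# Usage sketch: the predicted token is placed in front of the concept in the
# prompt embedding sequence, e.g. "A photo of a <token> doctor", and only the
# mapping network (green in the figure) receives gradients during training.
mapper = AdaptiveMappingNetwork()
concept_emb = torch.randn(1, 768)        # stand-in for the "doctor" embedding
inclusive_token = mapper(concept_emb)    # concept-specific, shape (1, 768)
```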

Quantitative Results

Comparisons with baseline methods on fairness, quality, and text alignment of generative results across three bias attributes. Abbreviations are used for brevity: SD1.5: Stable Diffusion v1.5; TIME: Text-to-Image Model Editing; FD: Fair Diffusion; EI: Ethical Intervention; FM: Fair Mapping; TI: revised Textual Inversion. * indicates editing-based methods that require careful tuning of editing strengths to achieve reasonable results. The best KL divergence (DKL) results in each category are marked in bold.

| Methods | Gender DKL | FID ↓ | CLIP ↑ | Race DKL | FID ↓ | CLIP ↑ | Age DKL | FID ↓ | CLIP ↑ |
|---|---|---|---|---|---|---|---|---|---|
| SD1.5 [18] | 0.3584 | 281.12 | 0.2823 | 0.5973 | 281.12 | 0.2823 | 0.2319 | 281.12 | 0.2823 |
| *With attribute specification or prior knowledge of the bias distribution* | | | | | | | | | |
| ITI-GEN [24] | **0.0078** | 278.21 | 0.2753 | **0.3699** | 247.05 | 0.2679 | **0.1560** | 243.09 | 0.2648 |
| TIME* [15] | 0.2908 | 277.79 | 0.2733 | 0.5463 | 270.03 | 0.2663 | 0.2285 | 271.09 | 0.2738 |
| FD* [5] | 0.2420 | 278.10 | 0.2718 | 0.4987 | 277.64 | 0.2738 | 0.2246 | 280.33 | 0.2740 |
| *Without attribute specification* | | | | | | | | | |
| EI [1] | 0.1666 | 283.52 | 0.2758 | 0.6033 | 281.11 | 0.2745 | 0.2258 | 289.82 | 0.2773 |
| FM [14] | **0.1174** | 222.82 | 0.2341 | 0.3722 | 220.37 | 0.2391 | 0.3823 | 255.72 | 0.2402 |
| TI [6] | 0.2590 | 283.38 | 0.2777 | 0.8065 | 275.32 | 0.2799 | 0.3113 | 286.22 | 0.2823 |
| Ours | 0.1298 | 272.35 | 0.2789 | **0.3625** | 277.15 | 0.2808 | **0.2168** | 268.53 | 0.2798 |

Citation IDs follow the ones in our arXiv paper.

Qualitative Results

Ablation Studies

Ablation studies on proposed components for single-bias mitigation.
| L_anchor | F_am | Gender DKL | Race DKL | Age DKL |
|---|---|---|---|---|
| | | 0.2590 | 0.8065 | 0.3113 |
| | | 0.1822 | 0.3540 | 0.3999 |
| | | 0.3981 | 0.8090 | 0.2581 |
| | | 0.1298 | 0.3625 | 0.2168 |


Performance of combining adaptive inclusive tokens to mitigate multiple biases simultaneously.

| <ig> | <ir> | <ia> | Gender DKL | Race DKL | Age DKL | FID ↓ | CLIP ↑ |
|---|---|---|---|---|---|---|---|
| | | | 0.3584 | 0.5973 | 0.2319 | 281.12 | 0.2823 |
| | | | 0.1518 | 0.3141 | - | 270.19 | 0.2780 |
| | | | - | 0.5691 | 0.2079 | 267.61 | 0.2782 |
| | | | 0.1568 | - | 0.2848 | 263.37 | 0.2793 |
| | | | 0.1417 | 0.5932 | 0.2272 | 262.95 | 0.2769 |
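As a usage illustration only: assuming each learned inclusive token can be exported as a textual-inversion-style embedding (the actual release format and loading procedure may differ, since the token is produced by the adaptive mapping network), the tokens could be stacked in a single prompt with diffusers roughly as follows. The file names are hypothetical.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical embedding files for the gender, race, and age tokens.
for path, token in [("aitti_gender.bin", "<ig>"),
                    ("aitti_race.bin", "<ir>"),
                    ("aitti_age.bin", "<ia>")]:
    pipe.load_textual_inversion(path, token=token)

# Combine all three inclusive tokens in front of the biased concept.
image = pipe("A photo of a <ig> <ir> <ia> doctor").images[0]
image.save("doctor.png")
```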


Performance of adaptive inclusive tokens in complex scenes. Reported metrics are on gender bias. Additional prompts are appended to "A photo of a <ig> {occupation}".

| Methods | + "drinking coffee." DKL | FID ↓ | CLIP ↑ | + "reading a book." DKL | FID ↓ | CLIP ↑ | + "listening to music." DKL | FID ↓ | CLIP ↑ |
|---|---|---|---|---|---|---|---|---|---|
| SD1.5 | 0.3567 | 266.75 | 0.3116 | 0.3992 | 326.80 | 0.3045 | 0.4246 | 274.67 | 0.3033 |
| Ours | 0.2404 | 264.88 | 0.3082 | 0.2669 | 312.87 | 0.3020 | 0.2195 | 275.71 | 0.3009 |
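For completeness, a minimal sketch of how these complex-scene prompts are assembled from the template in the caption above; the occupation list here is a placeholder.

```python
suffixes = ["drinking coffee.", "reading a book.", "listening to music."]
occupations = ["doctor", "firefighter"]   # placeholder occupations

prompts = [f"A photo of a <ig> {occupation} {suffix}"
           for occupation in occupations
           for suffix in suffixes]
# e.g. "A photo of a <ig> doctor drinking coffee."
```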

BibTeX


        @article{hou2024aitti,
          author    = {Hou, Xinyu and Li, Xiaoming and Loy, Chen Change},
          title     = {{AITTI}: Learning Adaptive Inclusive Token for Text-to-Image Generation},
          journal   = {arXiv preprint arXiv:2406.12805},
          year      = {2024},
        }