Transformers have been shown to perform well not only in language-related tasks but also in vision. Despite their strong performance, however, transformer-based models are notoriously difficult to train on small datasets, especially for vision-related tasks.

To address this problem, a paper by Asher Trockman and J. Zico Kolter proposes mimetic initialization. As its name suggests, the self-attention weights of the transformer are initialized in a way that mimics those of trained counterparts, allowing for faster convergence.

This is a summary of the paper with our key takeaways and conclusion.

Observation of Transformer Weights

As discussed above, mimetic initialization aims to replicate the weight patterns of pre-trained self-attention layers. Naturally, the first step is to observe those patterns. The weights are defined as follows: the query and key weights W_Q and W_K have dimension d × k, and the value and projection weights W_V and W_proj have dimension d × d (considered to be full rank). Here, d is the embedding dimension of the transformer and k is the head dimension, i.e. d divided by the number of heads. Given these weights, for an input X (n × d) with additive positional embeddings P (n × d), the attention map is defined as follows.
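For reference, the attention map referred to here takes roughly the following form (our reconstruction from the notation above; the 1/√k scaling is the standard one):

$$
A(X) = \mathrm{softmax}\!\left(\frac{(X + P)\, W_Q W_K^\top\, (X + P)^\top}{\sqrt{k}}\right),
\qquad W_Q, W_K \in \mathbb{R}^{d \times k}
$$

The value and projection weights then act on the attended tokens through their product W_V W_proj, which is why the products W_Q W_K^⊤ and W_V W_proj are the quantities examined below.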

After formally defining the weights involved in self-attention, a pretrained ViT-Tiny is dissected for analysis, as shown below. A common pattern emerges: the product of the query and key weights exhibits a positive diagonal, while the product of the value and projection weights exhibits a negative diagonal.

Figure 1. Self-attention weights of an ImageNet-pretrained ViT-Tiny model

For language models, on the other hand, a similar diagonal pattern is noted, but with the signs flipped: the query-key product is now negative and the value-projection product positive. Note, however, that the value-projection product only gradually becomes positive in the deeper layers.

Figure 2. Self-attention weights of a pretrained GPT-2 model

Initialization

Given the above observations, a simple initialization can be constructed to satisfy this diagonality (the initializations below are for ViT transformers). Using the fact that a random normal matrix is approximately orthogonal (for Z of dimension d × k with Z ~ N(0, I/k), the product of Z and its transpose is approximately the identity matrix), the following can be defined.
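Concretely, the construction described here amounts to something like the following (our rendering):

$$
W_Q = W_K = Z, \qquad Z \sim \mathcal{N}\!\left(0, \tfrac{1}{k} I\right) \in \mathbb{R}^{d \times k}
\quad\Longrightarrow\quad W_Q W_K^\top = Z Z^\top \approx I .
$$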

Note that the normal matrix is scaled by the transformer head dimension to maintain a unit average diagonal magnitude. For the negative diagonal in the case of the value-projection product, the sign of the projection weight can simply be inverted, yielding the equation below.
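In other words (again our rendering, with Z' a fresh d × d draw):

$$
W_V = Z', \qquad W_{\mathrm{proj}} = -Z'^{\top}, \qquad Z' \sim \mathcal{N}\!\left(0, \tfrac{1}{d} I\right) \in \mathbb{R}^{d \times d}
\quad\Longrightarrow\quad W_V W_{\mathrm{proj}} = -Z' Z'^{\top} \approx -I .
$$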

However, such an initialization offers no control over how strongly the diagonal is accentuated. Thus, the equations can be further modified as follows.
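The modified initialization targets the following weight products (our reading; the individual factors W_Q, W_K, W_V, W_proj are then recovered from these targets):

$$
W_Q W_K^\top = \alpha_1 Z_1 + \beta_1 I,
\qquad
W_V W_{\mathrm{proj}} = \alpha_2 Z_2 - \beta_2 I .
$$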

In the above equations, Z₁, Z₂ ~ N(0, I/d) and the α and β values lie in [0, 1]; these parameters explicitly determine the strength of the diagonal. With this modification, the attention map can be rewritten.
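Substituting the query-key product into the attention map gives, roughly:

$$
A(X) \approx \mathrm{softmax}\!\left(\frac{\alpha_1 (X+P)\, Z_1\, (X+P)^\top + \beta_1 \left(X X^\top + X P^\top + P X^\top + P P^\top\right)}{\sqrt{k}}\right)
$$

The P P^⊤ term depends only on the positional embeddings, which is where their choice enters the picture.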

From this form of the attention-map equation, it is clear that the type of positional embedding can affect mimetic initialization and thus its performance.
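For concreteness, below is a minimal NumPy sketch of how such an initialization could be implemented. This is our own illustration rather than the authors' code: the α/β defaults are placeholders, the dimensions correspond to ViT-Tiny (d = 192, k = 64), and the SVD-based factorization of each target product is one reasonable way to recover the individual weight matrices.

```python
import numpy as np

def mimetic_init(d, k, alpha1=0.7, beta1=0.7, alpha2=0.4, beta2=0.4, seed=0):
    """Illustrative mimetic initialization of one self-attention head's weights.

    d: embedding dimension, k: head dimension.
    Returns W_q, W_k of shape (d, k) and W_v, W_proj of shape (d, d).
    NOTE: the alpha/beta defaults are placeholders, not the paper's tuned values.
    """
    rng = np.random.default_rng(seed)

    # Target query-key product: Gaussian noise plus a positive diagonal.
    Z1 = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))
    A1 = alpha1 * Z1 + beta1 * np.eye(d)

    # Factor the target into two rank-k matrices via a truncated SVD,
    # so that W_q @ W_k.T approximates A1.
    U, S, Vt = np.linalg.svd(A1)
    W_q = U[:, :k] * np.sqrt(S[:k])
    W_k = Vt[:k, :].T * np.sqrt(S[:k])

    # Target value-projection product: Gaussian noise minus a diagonal.
    Z2 = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))
    A2 = alpha2 * Z2 - beta2 * np.eye(d)
    U2, S2, V2t = np.linalg.svd(A2)
    W_v = U2 * np.sqrt(S2)           # full-rank factors, W_v @ W_proj == A2
    W_proj = (V2t.T * np.sqrt(S2)).T

    return W_q, W_k, W_v, W_proj


W_q, W_k, W_v, W_proj = mimetic_init(d=192, k=64)
print(np.round(np.diag(W_q @ W_k.T)[:5], 2))   # roughly positive diagonal
print(np.round(np.diag(W_v @ W_proj)[:5], 2))  # roughly negative diagonal
```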

Results

The novel initialization method was shown to work consistently on relatively small datasets (compared to JFT-300M), such as Tiny ImageNet, CIFAR-100, and SVHN, improving classification accuracy by 5.63, 6.39, and 0.39 points, respectively. Notably, the use of sinusoidal positional embeddings is necessary, underscoring their role in mimetic initialization (as also indicated by the attention-map equation). The paper attributes the success of the method to its mathematical similarity to other techniques used to improve ViT training. The image below shows attention maps obtained from ViT under various pretraining and initialization conditions. Notice the similarity between the mimetic-initialized model's attention maps and those of the ImageNet- and CIFAR-10-pretrained models.

Figure 3. Attention maps from one CIFAR-10 batch for the 1st, 4th, and 11th transformer layers of ViT-Tiny: (a) untrained, (b) CIFAR-10 trained, (c) ImageNet pretrained, (d) mimetic initialization, (e) mimetic initialization + CIFAR-10 trained.

Mimetic initialization was also shown to work with language models. While the improvements in the metrics are not as pronounced as those for vision models, results on both small-scale and medium-scale tasks show a drop in perplexity and BPC (bits per character, a concept similar to cross-entropy for language models).
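For reference, BPC is simply the per-character cross-entropy expressed in bits:

$$
\mathrm{BPC} = -\frac{1}{N}\sum_{i=1}^{N} \log_2 p\!\left(c_i \mid c_{<i}\right) = \frac{\text{cross-entropy in nats}}{\ln 2} .
$$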

Figure 4. Impact of mimetic initialization on language tasks

Conclusion

The key takeaways for this paper are as follows:

  1. Vision transformers are known for their poor performance when trained on small datasets. Mimetic initialization addresses this problem by mimicking the weight patterns of pretrained self-attention layers, offering faster convergence on smaller datasets.
  2. The method requires neither additional training nor modifications to the original model. It is also found analytically that the initialization shares properties with existing techniques that aid transformer training.
  3. The initialization yields improvements not only for vision models but also for language models.

The results of this paper could enable faster training of current TTS models, which rely heavily on transformers, since the method showed gains not only on visual data but also on sequential data (and is thus potentially applicable to speech). Specifically, it could be beneficial when training new architectures on smaller datasets for which no pre-trained model exists.

Link: https://arxiv.org/abs/2305.09828