We follow on from our two previous posts:

Opportunities and Risks of foundation models

Understanding self supervised learning

In this post, we look at the taxonomy of T-PTLMs – Transformer-based pre-trained language models

The post is based on a paper which covers this topic extensively:

AMMUS : A Survey of Transformer-based Pretrained Models in Natural …

Katikapalli Subramanyam Kalyan, Ajit Rajasekharan, and Sivanesan Sa…

Transformer-based pre-trained language models (T-PTLMs) are a complex and fast-growing area of AI, so I recommend this paper as a good way to understand and navigate the landscape.

We can classify T-PTLMs from four perspectives:

  • Pretraining Corpus
  • Model Architecture
  • Type of SSL (self-supervised learning)
  • Extensions

Pretraining Corpus-based models

General pretraining: models like GPT-1 and BERT are pretrained on general corpora. For example, GPT-1 is pretrained on the Books corpus, while BERT and UniLM are pretrained on English Wikipedia and the Books corpus.

This form of training is more general because it draws on multiple broad sources of information.

Social media-based: models could instead be pretrained on social media text.

Language-based: models could be pretrained on a single language (monolingual) or on many languages (multilingual).


T-PTLMs can also be classified based on their architecture: a T-PTLM can be pretrained using a stack of encoders, a stack of decoders, or both.

Hence, you could have architectures that are:

  • Encoder-based
  • Decoder-based
  • Encoder-Decoder based
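The practical difference between these families comes down to how tokens are allowed to attend to each other. A minimal toy sketch (not from the paper, plain Python only): an encoder such as BERT uses a bidirectional attention mask, while a decoder such as GPT uses a causal mask; an encoder-decoder model like T5 combines the two and adds cross-attention.

```python
# Toy attention masks illustrating the architecture families.
# mask[i][j] == 1 means position i may attend to position j.

def encoder_mask(n):
    """Bidirectional (BERT-style): every token sees every token."""
    return [[1] * n for _ in range(n)]

def decoder_mask(n):
    """Causal (GPT-style): token i sees only tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in encoder_mask(4):
    print(row)  # all ones: fully bidirectional
for row in decoder_mask(4):
    print(row)  # lower-triangular: no peeking at future tokens
```

An encoder-decoder model would use `encoder_mask` on the input side, `decoder_mask` on the output side, and let decoder positions attend to all encoder positions via cross-attention.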

Self-supervised learning (SSL) is one of the key ingredients in building T-PTLMs.

A T-PTLM can be developed by pretraining with generative, contrastive, adversarial, or hybrid SSL. Hence, based on the SSL used, you could have:

  • Generative SSL
  • Contrastive SSL
  • Adversarial SSL
  • Hybrid SSL
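To make the generative flavour concrete, here is a minimal sketch of the masked-language-modelling idea behind BERT-style generative SSL: hide a fraction of the input tokens and ask the model to reconstruct them. This is an illustrative simplification (real MLM, e.g. BERT's 80/10/10 replacement rule, is more involved), and the function name is my own.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Replace ~mask_rate of tokens with [MASK]; keep originals as targets."""
    rng = rng or random.Random(0)   # seeded for reproducibility
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets.append(tok)     # the model must predict this token
        else:
            masked.append(tok)
            targets.append(None)    # no loss at unmasked positions
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens)
print(masked)
```

Contrastive SSL would instead train the model to pull representations of related inputs together and push unrelated ones apart, and adversarial SSL (as in ELECTRA-style training) to detect replaced tokens; hybrid SSL combines these objectives.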

Based on extensions, you can classify T-PTLMs into the following categories:

  • Compact T-PTLMs: aim to reduce the size of the T-PTLMs and make them faster using a variety of model compression techniques like pruning, parameter sharing, knowledge distillation, and quantization.
  • Character-based T-PTLMs: build word representations from characters rather than a fixed subword vocabulary. For example, CharacterBERT uses a CharCNN + Highway layer to generate word representations from character embeddings and then applies transformer encoder layers; AlphaBERT is another example.
  • Green T-PTLMs: focus on environmentally friendly methods
  • Sentence-based T-PTLMs: extend T-PTLMs like BERT to generate quality sentence embeddings.
  • Tokenization-free T-PTLMs: avoid the use of explicit tokenizers to split input sequences, to cater for languages such as…
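The compression idea behind compact T-PTLMs can be illustrated with a toy post-training quantization sketch (my own illustrative example, not from the paper): store each float32 weight as an int8 plus a shared scale, trading a little precision for a 4x size reduction. Real quantized or distilled models use far more careful schemes, but the trade-off is the same.

```python
def quantize(weights):
    """Symmetric linear quantization: float list -> int8 values + scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 values."""
    return [x * scale for x in q]

weights = [0.52, -1.27, 0.003, 0.98]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q)         # small integers: 1 byte each instead of 4
print(restored)  # close to, but not exactly, the originals
```

Pruning, parameter sharing, and knowledge distillation attack the same size/speed problem from different angles: removing weights, reusing them across layers, or training a small student model to mimic a large teacher.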

Continue reading: http://www.datasciencecentral.com/xn/detail/6448529:BlogPost:1066508

Source: www.datasciencecentral.com