This post follows on from our two previous posts.
In this post, we look at the taxonomy of T-PTLMs – Transformer-based pre-trained language models.
The post is based on a survey paper that covers this topic extensively.
Transformer-based pre-trained language models (T-PTLMs) are a complex and fast-growing area of AI, so I recommend the paper as a good way to understand and navigate the landscape.
We can classify T-PTLMs from four perspectives:
- Pretraining Corpus
- Model Architecture
- Type of SSL (self-supervised learning)
- Extensions
Pretraining Corpus-based models
General pretraining: Models like GPT-1 and BERT are pretrained on general corpora. For example, GPT-1
is pretrained on the Books corpus, while BERT and UniLM are pretrained on English Wikipedia and the Books corpus.
This form of training draws on general-purpose text from multiple sources of information.
Social Media-based: Models could also be pretrained on social media text, such as tweets.
Language-based: Models could be pretrained on a single language (monolingual) or on many languages (multilingual).
T-PTLMs can also be classified by their architecture. A T-PTLM can be pretrained using a stack of encoders, a stack of decoders, or both.
Hence, you could have architectures that are
- Encoder-based
- Decoder-based
- Encoder-Decoder based
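A sketch may help here (my own illustration, not from the paper): the key practical difference between an encoder stack (BERT-style) and a decoder stack (GPT-style) is the self-attention mask each layer applies. An encoder lets every token attend to every other token; a decoder restricts each token to earlier positions.

```python
# Illustrative sketch: attention masks for encoder vs decoder stacks.
# 1 = position may be attended to, 0 = masked out.

def encoder_mask(seq_len):
    """Bidirectional (encoder): every token may attend to every token."""
    return [[1] * seq_len for _ in range(seq_len)]

def decoder_mask(seq_len):
    """Causal (decoder): token i may only attend to tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

print(encoder_mask(3))  # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(decoder_mask(3))  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
```

An encoder-decoder model (e.g. T5, BART) combines both: bidirectional attention on the input side and causal attention on the output side.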
Self-supervised learning (SSL) is one of the key ingredients in building T-PTLMs.
A T-PTLM can be developed by pretraining with Generative, Contrastive, Adversarial, or Hybrid SSL. Hence, based on the SSL used, you could have
- Generative SSL
- Contrastive SSL
- Adversarial SSL
- Hybrid SSL
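To make the generative case concrete, here is a minimal sketch (my own, not from the paper) of the masked-language-modeling setup used by BERT-style models: a fraction of input tokens is replaced with a mask symbol, and the pretraining objective is to reconstruct the originals. The `mask_tokens` helper, the 15% rate, and the `[MASK]` string follow the common BERT convention.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Generative SSL sketch: corrupt ~mask_prob of the tokens; the model
    is trained to predict the original token at each masked position."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)    # model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)   # no loss computed at this position
    return inputs, labels

inputs, labels = mask_tokens("the cat sat on the mat".split(), seed=42)
```

Contrastive SSL instead trains the model to pull representations of related inputs together, and adversarial SSL (as in ELECTRA) trains a discriminator to detect replaced tokens; hybrid SSL combines several of these objectives.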
Based on extensions, you can classify T-PTLMs into the following categories:
- Compact T-PTLMs: aim to reduce the size of the T-PTLMs and make them faster using a variety of model compression techniques like pruning, parameter sharing, knowledge distillation, and quantization.
- Character-based T-PTLMs: build word representations from characters. For example, CharacterBERT uses a CharCNN + Highway layer to generate word representations from character embeddings and then applies transformer encoder layers; AlphaBERT is another example.
- Green T-PTLMs: focus on environmentally friendly, energy-efficient pretraining methods.
- Sentence-based T-PTLMs: extend T-PTLMs like BERT to generate high-quality sentence embeddings.
- Tokenization-Free T-PTLMs: avoid the use of explicit tokenizers to split input sequences, to cater for languages such as…
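Of the compression techniques mentioned for compact T-PTLMs, knowledge distillation is perhaps the easiest to illustrate. Below is a minimal pure-Python sketch (my own, not from the paper) of one common distillation objective, as popularized by models like DistilBERT: the student is trained to match the teacher's temperature-softened output distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened
    distribution; minimized when the student matches the teacher."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = [math.log(p)
                         for p in softmax(student_logits, temperature)]
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))
```

The temperature of 2.0 is a typical choice: a higher temperature softens the teacher's distribution so the student also learns from the relative probabilities of incorrect classes, not just the top prediction.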
Continue reading: http://www.datasciencecentral.com/xn/detail/6448529:BlogPost:1066508