Let’s say we want to translate from English to German.
We cannot translate the sentence by translating each word separately. For example, the English word “the” can be translated into “der” or “die” depending on the gender of the noun it relates to. Also, the word “to” is not translated into German at all, because there is no infinitive in the German sentence. There are many more examples of how the context of a word affects its translation.
We somehow need to feed information about our whole input sentence into our machine translation model, so that it can understand the context of words.
And since most machine translation models output one word at a time, we also have to give the model information about which parts it has already translated.
In the past, machine translation was mainly done with recurrent neural networks such as LSTMs or GRUs. However, they struggled to learn dependencies between words that are far apart in a sentence, because the number of computation steps between two words increases with their distance.
Transformers were introduced to address this problem by removing recurrence and replacing it with an attention mechanism. I will go through the inner workings of the architecture proposed in the famous paper “Attention Is All You Need”.
The transformer model predicts one word/token at a time. It takes as input the source sentence we want to translate and the part of the sentence it has already translated. The transformer then outputs the next word.
The transformer has two parts, called the “encoder” and the “decoder”. The input sentence is fed into the encoder, while the already translated part is fed into the decoder, which also generates the output.
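To make the word-by-word prediction concrete, here is a minimal sketch of the translation loop in Python. It is not the implementation from the paper: the function names greedy_translate and toy_model, the "<eos>" end marker, and the example sentence are all assumptions made for illustration, and toy_model is just a lookup that stands in for a trained transformer.

```python
from typing import Callable, List


def greedy_translate(
    next_token: Callable[[List[str], List[str]], str],
    source: List[str],
    max_len: int = 20,
    end_token: str = "<eos>",  # hypothetical end-of-sentence marker
) -> List[str]:
    """Build a translation one token at a time."""
    translated: List[str] = []  # the part translated so far
    for _ in range(max_len):
        # The model sees the whole source sentence (encoder input)
        # and the partial translation (decoder input).
        token = next_token(source, translated)
        if token == end_token:
            break
        translated.append(token)
    return translated


def toy_model(source: List[str], translated: List[str]) -> str:
    """Stand-in for a trained transformer (assumption, not a real model).

    It returns the next word of a fixed translation, based only on how
    many words have already been produced.
    """
    target = ["die", "Katze", "trinkt", "Milch", "<eos>"]
    return target[len(translated)]


result = greedy_translate(toy_model, ["the", "cat", "drinks", "milk"])
print(result)
```

The important part is the loop: every call receives the full source sentence plus everything generated so far, and the new token is appended before the next call.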
The attention mechanism is at the core of the Transformer architecture, and it is inspired by attention in the human brain. Imagine yourself at a party. You can recognize your name being spoken on the other side of the room, even though it should get lost in all the other noise. Your brain can focus on things it considers important and filter out all unnecessary information.
Attention in transformers is implemented with the help of queries, keys, and values.
Key: A key is a label of a word and is used to distinguish between different words.
Query: A query checks all available keys and selects the one that matches best. So it represents an active request for specific…
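As a rough illustration of how a query is matched against keys, the following sketch computes scaled dot-product attention for a single query in plain Python. The helper names softmax and attend and the toy vectors are assumptions for this example. Note that instead of picking exactly one key, the softmax produces a soft selection: every value contributes, weighted by how well its key matches the query.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query."""
    d_k = len(query)
    # Similarity between the query and every key, scaled by sqrt(d_k).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    # Weighted average of the values, one output dimension at a time.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Toy data (made up): two keys with one value each.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
query = [5.0, 0.0]  # points in the direction of the first key

output = attend(query, keys, values)
print(output)  # dominated by the first value, since the first key matches best
```

Because the query aligns with the first key, the first value receives most of the weight, while the second value still contributes a small amount.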