{"id":143,"date":"2024-04-27T10:50:09","date_gmt":"2024-04-27T10:50:09","guid":{"rendered":"http:\/\/localhost:8888\/sawberries\/2024\/04\/27\/understanding-state-of-art-language-html\/"},"modified":"2024-04-27T10:50:09","modified_gmt":"2024-04-27T10:50:09","slug":"understanding-state-of-art-language-html","status":"publish","type":"post","link":"http:\/\/localhost:8888\/sawberries\/2024\/04\/27\/understanding-state-of-art-language-html\/","title":{"rendered":"Understanding SoTA Language Models (BERT, RoBERTA, ALBERT, ELECTRA)"},"content":{"rendered":"

Hi everyone,

There are a ton of language models out there today, many of which have their own unique way of learning “self-supervised” language representations that can be used by downstream tasks.

In this article, I decided to summarize the current trends and share some key insights to glue all these novel approaches together. 😃 (Slide credits: Devlin et al., Stanford CS224n)


Problem: Context-free / Atomic Word Representations

We started with context-free approaches like word2vec and GloVe embeddings in my previous post. The drawback of these approaches is that they do not account for the surrounding context, e.g. “open a bank account” vs. “on the river bank”. The word bank has different meanings depending on the context it is used in.
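To make the problem concrete, here is a tiny toy sketch (made-up vectors and a hand-written lookup table, not real word2vec/GloVe code): a context-free embedding returns the exact same vector for bank in both sentences, no matter what surrounds it.

```python
import numpy as np

# Toy, made-up embedding table standing in for word2vec / GloVe vectors.
static_embeddings = {
    "open":    np.array([0.1, 0.3]),
    "a":       np.array([0.0, 0.1]),
    "bank":    np.array([0.7, 0.2]),   # a single vector, regardless of word sense
    "account": np.array([0.6, 0.3]),
    "on":      np.array([0.2, 0.0]),
    "the":     np.array([0.1, 0.1]),
    "river":   np.array([0.4, 0.9]),
}

def embed(sentence):
    # Context-free lookup: every occurrence of a word maps to the same fixed vector.
    return [static_embeddings[w] for w in sentence.lower().split()]

v_financial = embed("open a bank account")[2]   # "bank" as in money
v_river     = embed("on the river bank")[3]     # "bank" as in riverside
print(np.array_equal(v_financial, v_river))     # True: identical vectors for both senses
```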

\"\"<\/a><\/div>\n

Solution #1: Contextual Word Representations

With ELMo, the community started building forward (left-to-right) and backward (right-to-left) sequence language models, and used the (concatenated) embeddings extracted from both models as pre-trained embeddings for downstream modeling tasks like classification (sentiment, etc.).
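Here is a minimal sketch of that idea (toy dimensions and randomly initialized PyTorch LSTMs; real ELMo also uses a character CNN and learned layer weighting, which I'm skipping): run one LM left-to-right, one right-to-left, and concatenate their hidden states per token.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 100, 16, 32
emb = nn.Embedding(vocab_size, emb_dim)
fwd_lm = nn.LSTM(emb_dim, hidden, batch_first=True)   # reads the sentence left-to-right
bwd_lm = nn.LSTM(emb_dim, hidden, batch_first=True)   # reads the sentence right-to-left

tokens = torch.tensor([[5, 17, 42, 8]])          # one toy sentence of 4 token ids
x = emb(tokens)                                  # (1, 4, emb_dim)

h_fwd, _ = fwd_lm(x)                             # (1, 4, hidden)
h_bwd, _ = bwd_lm(torch.flip(x, dims=[1]))       # run the reversed sequence
h_bwd = torch.flip(h_bwd, dims=[1])              # re-align to the original token order

# ELMo-style contextual embedding: concatenate both directions for every token.
contextual = torch.cat([h_fwd, h_bwd], dim=-1)   # (1, 4, 2 * hidden)
print(contextual.shape)
```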

\"\"<\/a><\/div>\n
<\/div>\n

Potential drawback:

ELMo can be considered a “weakly bi-directional” model, since two separate models (one per direction) are trained here.

Solution #2: Truly Bi-directional Contextual Representations

To solve the drawback of the “weakly bi-directional” approach, as well as the information bottleneck that comes with LSTMs / recurrent approaches, the Transformer architecture was developed. Transformers, unlike LSTMs/RNNs, are entirely feedforward networks. Here is a quick summary of the architecture:
\"\"<\/a><\/div>\n
Tip: If you are new to transformers but are familiar with a vanilla Multi-Layer Perceptron (MLP) or fully connected neural network, you can think of transformers as being similar to an MLP/standard NN with fancy bells and whistles on top.
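To make that analogy a little more concrete, here is a rough numpy sketch of what one (heavily simplified, single-head) transformer block computes; the helper names and weight shapes are my own for illustration, and details like multiple heads, biases, positional encodings, and dropout are left out.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # One of the "bells and whistles": normalize each token vector.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(x, W_q, W_k, W_v):
    # Single-head attention; spelled out in more detail after the two key ideas below.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row-wise softmax
    return weights @ v

def transformer_block(x, p):
    # "MLP with bells and whistles": attention plus a plain 2-layer fully connected
    # network, each wrapped in a residual connection and a layer norm.
    x = layer_norm(x + self_attention(x, p["W_q"], p["W_k"], p["W_v"]))
    h = np.maximum(0, x @ p["W_1"])               # fully connected layer + ReLU
    return layer_norm(x + h @ p["W_2"])           # fully connected layer + residual

d = 8
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(d, d)) for k in ["W_q", "W_k", "W_v", "W_1", "W_2"]}
tokens = rng.normal(size=(5, d))                  # 5 toy token vectors
print(transformer_block(tokens, p).shape)         # (5, 8)
```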

But what makes the transformer so much more effective?

2 key ideas:

1. Every word has an opportunity to learn a representation with respect to every other word in the sentence (truly bi-directional); think of every word as a feature given as input to a fully connected network. To further build on this idea, let's consider the transformer as a fully connected network with 1 hidden layer, as shown below:
\"\"<\/a><\/div>\n
source: Stackoverflow<\/a><\/div>\n
If x1 and x5 are two words/tokens from my earlier example (“on the river bank”), then x1 has access to x5 regardless of the distance between them (the word on can learn a representation depending on the context provided by the word bank).

2. Since every layer can be computed as one big matrix multiplication (parallel computation), rather than the one multiplication per token that happens in an LSTM, the transformer is much faster than an LSTM (see the sketch below).
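Here is a minimal numpy sketch of both ideas together (toy dimensions, random weights, a single attention head, variable names of my own choosing). The attention-weight matrix is n x n, so token x1 gets a weight on token x5 no matter how far apart they sit, and the whole layer is just a few matrix multiplications over all tokens at once instead of a token-by-token loop.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                              # 5 tokens, 8-dimensional vectors
X = rng.normal(size=(n, d))              # vectors for the whole sentence at once
W_q, W_k, W_v = [rng.normal(size=(d, d)) for _ in range(3)]

# Key idea 2: everything below is plain matrix multiplication over ALL tokens in
# parallel, unlike an LSTM, which must step through the sentence one token at a time.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)                                # (n, n) pairwise scores
A = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)   # softmax over each row

# Key idea 1: row 0 of A holds a weight for EVERY token, including the last one,
# so x1 can draw context directly from x5 regardless of the distance between them.
print(A.shape)        # (5, 5)
print(A[0])           # attention weights of the first token over all 5 tokens

contextual = A @ V    # each output row is a context-weighted mix of all token values
```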


Problem with bi-directional models:

But aren't language models (LMs) supposed to model P(w_t+1 | w_1..w_t)? How does the model learn anything if you expose all the words to it?
BERT builds upon this idea, using transformers to learn Masked Language Modeling (MLM), which translates the task to P(w_masked | w_1..w_t).
Tradeoff: In MLM, you only mask and predict ~15% of the words in the sentence, whereas in a left-to-right LM you are predicting 100% of the words in the sentence (so left-to-right LMs have higher sample efficiency per sentence).
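Here is a rough sketch of just the masking step (a hypothetical mask_for_mlm helper working on a toy whitespace-tokenized sentence; the real BERT recipe also sometimes swaps a chosen token for a random one or keeps it unchanged, which I'm omitting):

```python
import random

MASK = "[MASK]"

def mask_for_mlm(tokens, mask_prob=0.15, seed=1):
    """Mask roughly 15% of tokens; the model is only asked to predict those positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)     # hide the token from the model
            labels.append(tok)      # remember the original word as the prediction target
        else:
            masked.append(tok)
            labels.append(None)     # no loss is computed for unmasked positions
    return masked, labels

sentence = "open a bank account with the local bank branch today".split()
masked, labels = mask_for_mlm(sentence)
print(masked)   # a few tokens replaced by [MASK]
print(labels)   # the original words at exactly those positions
```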

There are some changes in the input to the model compared to the previous LSTM-based approach. The input now has 3 embeddings:

1. Token embeddings – (same as the embeddings fed into the LSTM model)

2. Segment Embeddings –
\n