{"id":143,"date":"2024-04-27T10:50:09","date_gmt":"2024-04-27T10:50:09","guid":{"rendered":"http:\/\/localhost:8888\/sawberries\/2024\/04\/27\/understanding-state-of-art-language-html\/"},"modified":"2024-04-27T10:50:09","modified_gmt":"2024-04-27T10:50:09","slug":"understanding-state-of-art-language-html","status":"publish","type":"post","link":"http:\/\/localhost:8888\/sawberries\/2024\/04\/27\/understanding-state-of-art-language-html\/","title":{"rendered":"Understanding SoTA Language Models (BERT, RoBERTA, ALBERT, ELECTRA)"},"content":{"rendered":"
Hi everyone,
There are a ton of language models out there today, many of which have their own way of learning "self-supervised" language representations that can then be reused by downstream tasks.
In this article, I decided to summarize the current trends and share some key insights to glue all these novel approaches together. 😃 (Slide credits: Devlin et al., Stanford CS224n)
Problem: Context-free / Atomic Word Representations

We started with context-free approaches like word2vec and GloVe embeddings in my previous post. The drawback of these approaches is that they do not account for syntactic context, e.g. "open a bank account" vs. "on the river bank". The word bank means different things depending on the context it is used in, yet it always gets the same vector.

Solution #1: Contextual Word Representations

With ELMo, the community started building forward (left-to-right) and backward (right-to-left) sequence language models, and used the embeddings extracted from both models (concatenated together) as pre-trained representations for downstream tasks like sentiment classification.

Potential drawback: ELMo can be considered a "weakly bi-directional" model, since two separate one-directional models are trained and only combined at the end.
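To make the "two separate models" point concrete, here is a minimal PyTorch sketch. It is not the actual ELMo implementation (which uses character CNNs and multi-layer LSTMs); the class name and layer sizes are made up for illustration. Two independent LSTMs read the sentence in opposite directions, and their hidden states are simply concatenated.

```python
import torch
import torch.nn as nn

class WeaklyBidirectionalLM(nn.Module):
    """Toy ELMo-style model: two independent one-directional LSTMs, concatenated."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.forward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.backward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                          # (batch, seq_len)
        x = self.embed(token_ids)
        fwd_out, _ = self.forward_lstm(x)                   # left-to-right states
        bwd_out, _ = self.backward_lstm(torch.flip(x, dims=[1]))
        bwd_out = torch.flip(bwd_out, dims=[1])              # re-align right-to-left states
        return torch.cat([fwd_out, bwd_out], dim=-1)         # (batch, seq_len, 2 * hidden_dim)

# Usage: contextual embeddings for a toy batch of token ids
model = WeaklyBidirectionalLM(vocab_size=1000)
embeddings = model(torch.randint(0, 1000, (2, 10)))          # shape: (2, 10, 512)
```

Note that each direction only ever sees one side of the context while training, which is exactly why the result is "weakly" rather than truly bi-directional.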
Solution #2: Truly Bi-directional Contextual Representations

This is where BERT comes in: instead of two separate one-directional language models, a single Transformer encoder is trained with a masked language modeling objective, so every token can attend to both its left and right context at once.

There are some changes to the model input with respect to the previous LSTM-based approach. The input is now built from 3 embeddings (see the sketch after this list):

- Token embeddings for the WordPiece tokens themselves
- Segment embeddings marking whether a token belongs to sentence A or sentence B
- Position embeddings encoding where the token sits in the sequence
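A rough sketch of how these are combined, assuming BERT-base sizes (a ~30k WordPiece vocabulary, 768-dimensional hidden states) and toy token ids: the three embeddings are simply summed element-wise before being fed to the Transformer layers.

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768
token_emb    = nn.Embedding(vocab_size, hidden)   # one vector per WordPiece token
segment_emb  = nn.Embedding(2, hidden)            # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)      # learned absolute positions

token_ids   = torch.tensor([[101, 2023, 2003, 1037, 7279, 102]])   # toy example ids
segment_ids = torch.zeros_like(token_ids)                           # all sentence A
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

# Element-wise sum of the three embeddings gives the input representation
input_repr = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(input_repr.shape)   # torch.Size([1, 6, 768])
```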
One recurring theme across these models:

"The bigger the LM, the better it is."

RoBERTa's central idea was to train the same BERT model for longer (more epochs) and on more data. The evaluation results show that it does better than the standard BERT model we saw earlier.
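Because the architecture is unchanged, swapping RoBERTa in for BERT downstream is typically a one-line change. A small usage sketch, assuming the Hugging Face transformers library is installed and using its public roberta-base / bert-base-uncased checkpoints:

```python
from transformers import AutoTokenizer, AutoModel

# Swap in "bert-base-uncased" here to compare against the original BERT
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# Contextual embeddings for our running "bank" example
outputs = model(**tokenizer("open a bank account", return_tensors="pt"))
print(outputs.last_hidden_state.shape)   # (1, num_tokens, 768)
```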
XLNet introduced relative position embeddings in place of the static (absolute) position embeddings we saw earlier. These start out as linear relationships between positions and are combined in deeper layers, so the model can learn a non-linear attention function over positions.

Additionally, instead of going just left-to-right, XLNet introduced Permutation Language Modelling (PLM): the factorization order is randomly permuted for every training sentence. You are still predicting one "masked" word at a time, just given some permutation of the rest of the input, which gives much better sample efficiency.
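A toy sketch of the permutation idea, covering only the masking logic (the real XLNet adds two-stream attention and the relative position scheme mentioned above): sample a random factorization order, then build a mask so each position may only attend to positions that come earlier in that order.

```python
import torch

seq_len = 5
perm = torch.randperm(seq_len)               # a random factorization order, e.g. [2, 0, 4, 1, 3]

# rank[i] = where token i appears in the sampled order
rank = torch.empty(seq_len, dtype=torch.long)
rank[perm] = torch.arange(seq_len)

# attend_mask[i, j] is True if token i is allowed to see token j
# (i.e. j comes strictly earlier than i in the permutation)
attend_mask = rank.unsqueeze(1) > rank.unsqueeze(0)
print(perm)
print(attend_mask)
```

Over many sampled permutations, every token eventually gets to condition on every other token, which is how XLNet obtains bi-directional context without ever inserting [MASK] tokens into the input.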