[Paper Review] Improving Language Understanding by Generative Pre-Training

Paper Info

Title: “Improving Language Understanding by Generative Pre-Training”
Publisher: OpenAI
Initial release: June 2018
Repository: huggingface, github

Reason for Reading

Exploring well-known machine learning papers

Summary

Explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning.
Their general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task.
Make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.

Scopes of Work

Natural language processing (NLP)
Semi-supervised learning for NLP
Generative pre-training and discriminative fine-tuning
Unsupervised pre-training and supervised fine-tuning

Motivation Problem

> The ability to learn effectively from raw text, such as pre-trained word embeddings, is crucial in NLP.

In NLP, the dependence on supervised learning should be alleviated.
Most deep learning methods require substantial amounts of manually labeled data. In many domains, their applicability suffers from a dearth of annotated resources.
Gathering more annotations can be time-consuming and expensive.
Models that can leverage linguistic information from unlabeled data provide a valuable alternative.
Even in cases where considerable supervision is available, learning good representations in an unsupervised fashion can provide a significant performance boost.
Pre-trained word embeddings improve performance on a range of NLP tasks.

> Word embeddings have benefits; however, they mainly transfer word-level information, whereas the aim is to capture higher-level semantics.

Recent approaches have investigated learning and utilizing more than word-level semantics from unlabeled data.

> Challenges in leveraging more than word-level information from unlabeled text.

First, it is unclear what type of optimization objectives are most effective at learning text representations that are useful for transfer.
Second, there is no consensus on the most effective way to transfer these learned representations to the target task. These uncertainties have made it difficult to develop effective semi-supervised learning approaches for language processing.

Key Idea

> Explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning.

Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks.
Generative pre-training (unsupervised pre-training): A language modeling objective on a large corpus of unlabeled text to learn the initial parameters of a neural network model.
Discriminative fine-tuning (Supervised fine-tuning): Adapting the initial parameters to a target task using several datasets with manually annotated training examples, utilizing the corresponding supervised objective.
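
Roughly, the two stages correspond to the following objectives from the paper (an auxiliary language-modeling term is also kept during fine-tuning):

```latex
% Unsupervised pre-training: language-modeling likelihood over an unlabeled
% corpus U = {u_1, ..., u_n}, with context window k and parameters \Theta
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)

% Supervised fine-tuning on a labeled dataset C of token sequences x^1, ..., x^m with label y
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)

% Combined objective: the auxiliary LM term (weight \lambda) improves
% generalization and accelerates convergence
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
```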

> Use the Transformer Architecture.

The Transformer model provides a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks.
Task-specific input adaptations, treating structured text as a single token sequence, enable effective fine-tuning with minimal changes to the pre-trained model.
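
As a minimal sketch of what "minimal changes" means here (assuming a hypothetical `backbone` module standing in for the pre-trained decoder-only Transformer; this is not the paper's released code), fine-tuning only adds one linear output layer over the final token's hidden state:

```python
import torch
import torch.nn as nn


class FineTunedClassifier(nn.Module):
    """Pre-trained Transformer plus a single linear head: the only new parameters at fine-tuning time."""

    def __init__(self, backbone: nn.Module, d_model: int = 768, num_labels: int = 2):
        super().__init__()
        self.backbone = backbone                      # hypothetical pre-trained decoder-only Transformer
        self.classifier = nn.Linear(d_model, num_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(token_ids)             # (batch, seq_len, d_model) hidden states
        final_state = hidden[:, -1, :]                # state at the end-of-sequence (extract) token
        return self.classifier(final_state)           # task logits, trained with the supervised objective
```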

> Aim to capture higher-level semantics than word-level information.

Word embeddings trained on unlabeled corpora improve task performance but mainly transfer word-level information. Recent approaches have investigated learning and utilizing more than word-level semantics from unlabeled data.

> Task-specific input transformations.

Certain tasks have structured inputs. Previous work suggested learning task-specific architectures on top of transferred representations, but this requires significant customization and does not apply transfer learning to the additional components. We use a traversal-style approach, where we convert structured inputs into an ordered sequence that our pre-trained model can process.
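
A rough sketch of this traversal-style conversion (the token strings below are illustrative placeholders, not the paper's actual vocabulary entries):

```python
# Illustrative special tokens; the paper adds randomly initialized start, delimiter,
# and extract embeddings during fine-tuning.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def entailment_input(premise, hypothesis):
    # Premise and hypothesis become one ordered token sequence for the language model.
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(text_a, text_b):
    # Similarity has no inherent ordering, so both orderings are processed and their
    # final representations are combined before the output layer.
    return (
        [START] + text_a + [DELIM] + text_b + [EXTRACT],
        [START] + text_b + [DELIM] + text_a + [EXTRACT],
    )

def multiple_choice_inputs(context, answers):
    # Each (context, candidate answer) pair becomes its own sequence; a softmax over
    # the resulting per-sequence scores picks the answer.
    return [[START] + context + [DELIM] + answer + [EXTRACT] for answer in answers]
```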

Note for Remembering

Generative pre-training (unsupervised pre-training)
Discriminative fine-tuning (supervised fine-tuning)

117M parameters
12-layer decoder-only Transformer with masked self-attention heads (768-dimensional states, 12 attention heads)
3072-dimensional inner states for the position-wise feed-forward networks
Adam optimization
cosine learning-rate schedule
contiguous sequences of 512 tokens
GELU
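
Collecting these into one place, a small config sketch with the values listed above (key names are my own, not from any released code):

```python
# Hyperparameters reported in the paper, gathered into a dictionary (key names are illustrative).
GPT1_CONFIG = {
    "n_params": "117M",         # total parameter count
    "n_layers": 12,             # decoder-only Transformer blocks with masked self-attention
    "d_model": 768,             # hidden state dimensionality
    "n_heads": 12,              # attention heads per layer
    "d_ff": 3072,               # position-wise feed-forward inner dimension
    "context_length": 512,      # contiguous tokens per training sequence
    "activation": "gelu",       # Gaussian Error Linear Unit
    "optimizer": "adam",
    "lr_schedule": "cosine",    # learning rate annealed with a cosine schedule
}
```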
