LSTM Is Dead. Long Live Transformers!
AI · MACHINE LEARNING


Natural Language Processing (NLP) has experienced a radical transformation with the emergence of transformer models, redefining the strategies and techniques used in NLP tasks. As the pivotal advancement in this field, transformers have resolved several limitations of their predecessors and have notably spurred the large language model revolution. This article illuminates the breakthrough of transformer models, their benefits, and the new era they have brought about.
In supervised learning, one of the fundamental settings of NLP, a document serves as the input and the goal is to predict an associated output, such as whether the document is spam. The inherent challenge lies in converting a variable-length document into a fixed-size vector. Earlier strategies for this included the 'bag of words' model and n-grams, but both fall short: bag of words discards word order entirely, while n-grams cause an explosion in dimensionality.
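To make both failure modes concrete, here is a minimal sketch using scikit-learn's CountVectorizer (the library choice is illustrative, not prescribed by the argument above):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog bit the man", "the man bit the dog"]

# Bag of words: both documents map to the *same* fixed-size vector,
# so word order (and hence meaning) is lost.
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
# [[1 1 1 2]
#  [1 1 1 2]]  <- identical rows despite opposite meanings

# Bigrams recover some order (the rows now differ), but the vocabulary
# grows quickly: with V distinct words there are up to V**2 possible bigrams.
bigrams = CountVectorizer(ngram_range=(2, 2))
print(bigrams.fit_transform(docs).toarray())
```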
Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) then emerged as potential solutions. While these models showed promise, they were burdened with vanishing and exploding gradients and were complex to train. Still, their contributions to NLP cannot be overstated: they served as stepping stones toward the development of transformers.
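For reference, here is what the sequential bottleneck looks like in a minimal PyTorch LSTM (the sizes are a hypothetical toy configuration):

```python
import torch
import torch.nn as nn

# A minimal PyTorch LSTM over a toy batch. The recurrence is inherently
# sequential: each hidden state depends on the previous one, so the T time
# steps cannot be computed in parallel, and gradients must flow back through
# all T steps, where they can vanish or explode.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 50, 8)        # (batch, seq_len, features)
output, (h_n, c_n) = lstm(x)     # output: (4, 50, 16)

# The final hidden state is the fixed-size summary of the whole sequence,
# e.g. the input to a downstream classifier.
print(h_n.shape)                 # torch.Size([1, 4, 16])
```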
The birth of transformers signaled a new era in NLP. They overcame many of the challenges that hampered earlier models, ushering in new possibilities for NLP tasks. Transformers introduced two groundbreaking innovations, positional encoding and multi-headed attention, fundamentally altering the approach to NLP.
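Here is a minimal sketch of both ideas in PyTorch: the sinusoidal positional encoding from "Attention Is All You Need" and the scaled dot-product attention at the heart of each attention head (shown as a single head, without the learned projections a full multi-head layer would add):

```python
import torch
import torch.nn.functional as F

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding: even dimensions get sin, odd get cos, at
    # geometrically spaced frequencies, so every position receives a
    # unique, order-revealing "timestamp".
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)         # (d_model/2,)
    angles = pos / (10000 ** (dim / d_model))                      # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

def scaled_dot_product_attention(q, k, v):
    # Every token attends to every other token in one matrix multiply:
    # no recurrence, so all positions are processed in parallel.
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(50, 64) + positional_encoding(50, 64)  # embeddings + positions
out = scaled_dot_product_attention(x, x, x)            # single-head self-attention
print(out.shape)                                       # torch.Size([50, 64])
```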
Unlike their predecessors, transformers don't require sequence-dependent processing, which significantly boosts their computational efficiency. They also do away with the saturating sigmoid and tanh activations that LSTM gates depend on; while historically beneficial, these functions squash gradients and complicate training. Instead, transformers employ the Rectified Linear Unit (ReLU) in their feed-forward layers, facilitating efficient training and easier gradient computation.
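A tiny experiment makes the contrast visible: tanh saturates, so its gradient collapses toward zero for large inputs, while ReLU passes gradients through unattenuated wherever a unit is active (a toy sketch, not a full training comparison):

```python
import torch

x = torch.tensor([-10.0, 0.5, 10.0], requires_grad=True)

# tanh saturates: its gradient is near 0 for large |x|, one source of
# vanishing gradients in deep and recurrent networks.
torch.tanh(x).sum().backward()
print(x.grad)   # tensor([~0.0000, 0.7864, ~0.0000])

x.grad = None
# ReLU's gradient is exactly 1 wherever the unit is active, so gradients
# pass through without shrinking.
torch.relu(x).sum().backward()
print(x.grad)   # tensor([0., 1., 1.])
```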
This computational efficiency, paired with transformers' ability to handle a wider range of tasks, is precisely what fueled the explosion of large language models. Because transformers process tokens independently and in parallel, they can handle far larger amounts of data, making it feasible to train models with billions of parameters. This advance paved the way for models like NVIDIA's Megatron and Facebook's RoBERTa, showcasing the remarkable scalability of transformers.
Moreover, leveraging transformers has become far more accessible thanks to libraries like Hugging Face's transformers, which provides both PyTorch and TensorFlow implementations. With a few lines of code, you can fine-tune a pre-trained language model, reflecting the reusable and fine-tunable nature of these models.
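A fine-tuning run can look like the following sketch; the checkpoint (distilbert-base-uncased), dataset (imdb), and hyperparameters are illustrative choices, not recommendations:

```python
# Minimal fine-tuning sketch with Hugging Face transformers and datasets.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

# The pre-trained body is reused; only a small classification head is new.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
)
trainer.train()
```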
The crowning achievement of transformers, however, lies in their proficiency at transfer learning. Unlike LSTMs, transformers can be pre-trained effectively on substantial volumes of unlabeled text, demonstrating their exceptional power for large-scale applications.
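The fruits of that unsupervised pre-training are easy to see with a fill-mask pipeline (bert-base-uncased here is just one illustrative masked-language-model checkpoint):

```python
from transformers import pipeline

# Pre-training on raw text alone gives the model broad language knowledge,
# which it can then transfer to downstream tasks.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Transformers are pre-trained on large amounts of [MASK] text."):
    print(pred["token_str"], round(pred["score"], 3))
```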
In conclusion, transformers have revolutionized NLP, offering computational efficiency, ease of training, and robust transfer learning capabilities. Their emergence has ushered in the large language model era, fundamentally changing our understanding and processing of language. While models like LSTMs still have their niche applications, transformers have indisputably paved the way for the future of NLP.