Transformers
Overview
Transformers are key to pre-training and focus on relationships between words across long sentences.
- Handle large text efficiently
- Capture connections between distant words
- Improve context and meaning in sentences
Transformers were introduced in the 2017 research paper “Attention Is All You Need”, which changed how models learn language.
- Learn how words relate to each other
- Use attention to capture long-range dependencies
A transformer contains four main parts that work together to help the model understand text deeply and generate accurate responses.
- Pre-processing
- Positional encoding
- Encoders
- Decoders
How Transformers Work
Consider the sample input:
"Bob, who lives in New York and works as a software"
The transformer processes this step by step:
- Converts the text into numbers
- Adds word position information
- Uses encoders to understand meaning
- Uses decoders to predict the next words
Each step builds on the last, ensuring that the model keeps context and meaning intact.
After processing, the model completes the sentence as:
"Bob, who lives in New York and works as a software engineer, loves exploring new restaurants in the city."
Transformer Architecture
Text Pre-Processing and Representation
Before learning, the text needs to be broken into smaller pieces called tokens and represented numerically.
- Breaks sentences into tokens
- Removes unnecessary words and simplifies forms
- Converts tokens into numbers using word embeddings
Example in Python:
from transformers import AutoTokenizer

# Load the tokenizer that ships with the bert-base-uncased model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Split the sentence into the model's tokens
tokens = tokenizer.tokenize("Transformers learn relationships between words.")
print(tokens)
Expected output:
['transformers', 'learn', 'relationships', 'between', 'words', '.']
This process helps the transformer see sentences as patterns of numbers and prepares the text for deeper processing.
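Building on the tokenizer example above, the following sketch shows the rest of this step: each token is mapped to an integer ID from the model's vocabulary, and the model's embedding layer turns each ID into a dense vector. The model and attribute names follow the Hugging Face API for BERT.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokens become integer IDs from the vocabulary
inputs = tokenizer("Transformers learn relationships between words.", return_tensors="pt")
print(inputs["input_ids"])

# The embedding layer maps each ID to a dense vector
embeddings = model.embeddings.word_embeddings(inputs["input_ids"])
print(embeddings.shape)  # (1, number_of_tokens, 768): one vector per token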
Positional Encoding
Since word order matters, transformers use positional encoding to give each token a sense of position.
- Adds position information to each word
- Helps the model understand sentence structure
- Keeps relationships between distant words
Without positional encoding, the model would treat:
“The cat sat on the mat”
the same as:
“On the mat sat the cat.”
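A minimal sketch of the sinusoidal positional encoding described in the original paper: even dimensions use a sine wave and odd dimensions a cosine, so every position gets a unique pattern of values. NumPy is used here purely for illustration.

import numpy as np

def positional_encoding(seq_len, d_model):
    # Angle for each (position, dimension) pair
    pos = np.arange(seq_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)

    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions: sine
    enc[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions: cosine
    return enc

# Positional patterns for a 6-word sentence like "The cat sat on the mat"
print(positional_encoding(6, 8).round(2))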
Encoders
Encoders help the model understand the input sentence through layers of attention and neural networks.
- Use attention mechanism to focus on important words
- Feed-forward neural networks process the features that attention highlights
- Contain multiple layers that extract meaning
- Pass learned features to the next stage
Encoders help the model understand complex structures, like how “New York” refers to one place, not two separate words.
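As a sketch of that idea, an encoder's output vectors are contextual: the vector for each token depends on the words around it, so the pieces of “New York” carry information about each other. BERT is used below purely as an example encoder.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Bob lives in New York", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One context-dependent vector per token from the final encoder layer
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768)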
Decoders
Decoders take the encoded information and turn it into the final output text.
- Predict words based on the encoded data
- Use attention to refine predictions
- Build sentences step by step
Decoders ensure that generated text sounds natural and remains consistent with the original context.
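A minimal greedy-decoding sketch, assuming a decoder-only model in the GPT-2 family: at each step the model scores every possible next token, the most likely one is appended, and the loop repeats.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("Bob, who lives in New York and works as a software", return_tensors="pt").input_ids
for _ in range(5):  # generate five tokens, one step at a time
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[0, -1].argmax()  # most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))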
Long-Range Relationships
Transformers excel at linking distant words that relate to each other.
- Capture meaning across long sentences
- Connect distant phrases for better context
- Improve overall understanding and accuracy
For example, in:
"Bob, who lives in New York and works as a software engineer, loves exploring new restaurants in the city."
The model links “Bob” with “loves exploring new restaurants”, creating a more complete understanding of the subject.
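One way to see these links is to inspect the attention weights directly; Hugging Face models can return them with output_attentions=True. The sketch below averages the final layer's heads and prints how strongly each token attends to “bob” (BERT is used here as an example, so the exact weights are illustrative).

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "Bob, who lives in New York, loves exploring new restaurants."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# attentions: one tensor per layer, shaped (batch, heads, tokens, tokens)
last_layer = outputs.attentions[-1][0]  # final layer, first item in batch
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
to_bob = last_layer.mean(dim=0)[:, tokens.index("bob")]  # average over heads

for token, weight in zip(tokens, to_bob.tolist()):
    print(f"{token:>12} -> bob: {weight:.3f}")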
Parallel Processing Advantage
Unlike older models that read one word at a time, transformers can process multiple words at once.
- Handle words simultaneously
- Reduce training and response time
- Improve efficiency for large datasets
Consider the sentence:
"The cat sat on the mat"
The transformer can process all the words at once, rather than reading each word one by one.
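A toy sketch of why this is possible: self-attention is a matrix operation, so every word's similarity to every other word is computed in one product rather than in a word-by-word loop. Here queries, keys, and values are all the raw embeddings, a simplification of the full mechanism.

import numpy as np

seq_len, d = 6, 8  # six words, as in "The cat sat on the mat"
np.random.seed(0)
X = np.random.randn(seq_len, d)  # one toy embedding per word

scores = X @ X.T / np.sqrt(d)  # every word compared with every other word at once
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
output = weights @ X  # all six words updated in a single pass

print(output.shape)  # (6, 8)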