
Building Blocks

Updated Sep 21, 2024

Using Text Data

Computers themselves cannot read or understand text like humans do.

  • They only process numbers, not words or emotions
  • Sentences like “I am a data scientist” have no meaning to them
  • They rely on patterns and structure rather than understanding language directly

Natural Language Processing (NLP) bridges this gap by converting text into numerical form. It enables machines to detect context, recognize meaning, and make sense of language. This allows LLMs to transform raw text into intelligent, human-like responses.

Linguistic Subtleties

LLMs extend NLP’s capabilities by recognizing subtle meaning in language. They can detect linguistic subtleties like:

  • Irony
  • Humor
  • Puns
  • Sarcasm
  • Intonation
  • Intent

For example, if asked “What’s your favorite book?”, an LLM might say:

“That’s a tough one. My all-time favorite is To Kill a Mockingbird. Have you read it?”

This makes interactions sound natural, not robotic.

How LLMs Are Trained

LLMs are called “large” because they are trained on huge amounts of data and contain many parameters. Parameters are the patterns and rules learned during training, and more parameters mean the model can understand more complex relationships and produce more accurate responses.

As models grow in scale, they start developing new abilities not seen in smaller ones. Scale depends on two main factors:

  • The size of the training data
  • The number of model parameters

When this scale crosses a certain point, performance can suddenly improve, leading to new skills and deeper understanding. To reach that stage, LLMs go through several key training steps:

  • Text preprocessing
  • Text representation
  • Pre-training on large datasets
  • Fine-tuning for specific tasks
  • Advanced fine-tuning for higher accuracy

These steps help LLMs learn structure, meaning, and context, allowing them to respond intelligently and adapt to complex language patterns.

Text Preprocessing

Text preprocessing organizes and simplifies raw text before analysis.

  • Includes tokenization, stop word removal, and lemmatization
  • Steps can happen in any order depending on the task
  • Each step reduces noise and highlights useful words

This process ensures that only meaningful words remain, making analysis more accurate.

Tokenization

Tokenization breaks sentences into smaller parts called tokens.

  • Each word or punctuation mark becomes a token
  • Turns text into a list of separate items

Example:

text = "Working with natural language processing is tricky."
tokens = ["Working", "with", "natural", "language", "processing", "is", "tricky", "."]

This step turns sentences into a structured list that computers can easily handle.
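
A minimal sketch of how this can be done in Python, assuming NLTK is installed and its punkt tokenizer data has been downloaded:

# Requires: pip install nltk, then nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "Working with natural language processing is tricky."
tokens = word_tokenize(text)
print(tokens)
# ['Working', 'with', 'natural', 'language', 'processing', 'is', 'tricky', '.']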

Stop Word Removal

Some common words add little meaning to a sentence.

  • Words like “is”, “with”, “the” are called stop words
  • Removing them helps focus on meaningful content

Example:

Before stop word removal:

["Working", "with", "natural", "language", "processing", "is", "tricky", "."]

After removal, only the key parts of the sentence remain:

["Working", "natural", "language", "processing", "tricky", "."]

Lemmatization

Lemmatization simplifies different forms of a word into their base form.

  • Groups similar words like “talked”, “talking”, and “talk”
  • Reduces redundancy and improves pattern recognition

Example:

from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lem = WordNetLemmatizer()
# pos="v" tells the lemmatizer to treat each word as a verb;
# without it, the default part of speech is noun and the words stay unchanged
print(lem.lemmatize("talked", pos="v"))
print(lem.lemmatize("talking", pos="v"))

Expected output:

talk
talk

This keeps word meaning consistent across variations.

Text Representation

Once preprocessed, text must be converted into numbers that machines understand.

  • Text representation turns words into numerical values
  • Common methods include bag-of-words and word embeddings
  • Enables LLMs to process and learn from large text datasets

This transformation makes human language machine-readable. Both methods are covered in the sections below.

Bag-of-Words

Bag-of-words counts how often each word appears.

  • Creates a matrix showing word frequency
  • Treats each sentence as a collection of words, not a sequence

Using these sentences as an example:

"The cat chased the mouse swiftly", "The mouse chased the cat"

Code:

from sklearn.feature_extraction.text import CountVectorizer
sentences = ["The cat chased the mouse swiftly", "The mouse chased the cat"]
vectorizer = CountVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())
print(matrix.toarray())

Expected output:

['cat' 'chased' 'mouse' 'swiftly']
[[1 1 1 1]
 [1 1 1 0]]

While simple, bag-of-words can’t capture context or relationships between words.

  • Misses opposite meanings in similar sentences
  • Treats related words as unrelated
  • Fails to understand sentence structure

Because of these limits, it’s often replaced by more advanced methods like word embeddings.
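
As a quick illustration of the first limitation, here is a small sketch (reusing CountVectorizer from the example above) in which two sentences with opposite meanings produce identical count vectors:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat chased the mouse", "The mouse chased the cat"]
vectorizer = CountVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(sentences)
print(matrix.toarray())
# [[1 1 1]
#  [1 1 1]]  <- identical rows: bag-of-words cannot tell who chased whom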

Word Embeddings

Word embeddings represent words using numbers that reflect their meanings.

  • Similar words have similar number patterns
  • Captures relationships like “cat hunts mouse” or “tiger hunts deer”
  • Each word becomes a vector (a list of numbers)

Example (simplified):

cat = [-0.9, 0.9, 0.9]
mouse = [0.8, -0.7, 0.7]

This helps LLMs understand not just words but also their relationships and context.
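
To make "similar number patterns" concrete, here is a small sketch that compares toy vectors with cosine similarity; the vectors for "cat" and "mouse" come from the example above, while the one for "rat" is an invented value assumed to sit close to "mouse":

import numpy as np

# Toy vectors from the example above, plus an invented vector for "rat"
cat   = np.array([-0.9,  0.9, 0.9])
mouse = np.array([ 0.8, -0.7, 0.7])
rat   = np.array([ 0.7, -0.6, 0.8])   # assumed to be similar to "mouse"

def cosine(a, b):
    # Cosine similarity: values near 1 mean the vectors point the same way
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(mouse, rat))   # close to 1 -> related words
print(cosine(cat, mouse))   # much lower -> less related in this toy space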

Fine-Tuning LLMs

Fine-tuning allows smaller teams to benefit from many existing pre-trained models without needing massive resources. It improves a model’s understanding of a particular topic.

  • A pre-trained model already knows general language patterns
  • Fine-tuning teaches it specialized terms and styles
  • The process adjusts the model slightly rather than rebuilding it

It’s like a person who already speaks a language learning new words in a specific field, such as medicine or law. This helps the model communicate better in that domain.
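
As a rough sketch of what "adjusting the model slightly" can look like in practice, the snippet below fine-tunes a small pre-trained model with the Hugging Face transformers library; the model name, labels, and the tiny in-memory dataset are illustrative assumptions, not a production recipe:

# Minimal fine-tuning sketch; requires: pip install transformers torch accelerate
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny made-up dataset: 1 = legal-domain text, 0 = general text
texts = ["The contract was terminated for breach.", "Great movie, loved it!"]
labels = [1, 0]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # pre-trained weights, new task head

class TinyDataset(Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# A light training pass over the new data adjusts the existing weights
args = TrainingArguments(output_dir="ft-demo", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=TinyDataset(texts, labels)).train()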

Challenges with Large Models

LLMs are powerful but also difficult and expensive to manage.

  • Require huge computing power and storage
  • Need large-scale infrastructure
  • Depend on vast amounts of data and time

Building and training them from scratch demands advanced hardware and reliable systems, which most organizations cannot afford.

Efficient Model Training

Training must also be efficient to save time and costs.

  • Large models can take weeks or months to train
  • Efficient training reduces time using better algorithms
  • Parallel processing shortens overall training duration

For instance, training a huge LLM on one GPU could take centuries, but optimized setups finish the same job within weeks.

Data Availability

Another challenge is the need for high-quality training data to accurately learn the complexities and subtleties of language.

  • LLMs are trained on hundreds of gigabytes of text
  • This equals millions of books and online articles
  • Poor-quality data leads to inaccurate or biased results

Since fine-tuning uses much smaller datasets, data quality matters even more to ensure reliable performance.

Overcoming These Challenges

Fine-tuning helps manage the complexity of LLMs by:

  • Adapting a general model to a focused use case
  • Reducing the need for massive computing power
  • Making AI accessible to smaller teams and projects

Because fine-tuned models already understand general language, they can quickly learn specialized knowledge with minimal resources.

Fine-Tuning vs Pre-Training

Fine-tuning and pre-training differ mainly in scale and purpose.

  • Fine-tuning uses fewer resources and less time
  • Requires only one CPU or GPU in many cases
  • Uses small datasets (hundreds of MBs to a few GBs)
  • Pre-training uses massive data (hundreds of GBs) and thousands of GPUs

Fine-tuning is faster, cheaper, and ideal for adapting existing models, while pre-training builds new models from scratch. Both are important, but fine-tuning makes advanced language models practical for everyday use.

Transfer Learning

Transfer learning allows a model to use what it has already learned to perform new, related tasks.

  • Applies existing knowledge to new problems
  • Saves time and training resources
  • Improves performance on small datasets

For example, someone who learns to play the piano can easily pick up the guitar because of shared concepts like rhythm and notes. Similarly, an LLM trained on general language can apply that understanding to specific tasks with little extra data.

Common learning techniques:

  • Zero-shot learning - Performs a new task with no task-specific examples
  • Few-shot learning - Learns a new task from only a few examples
  • Multi-shot learning - Uses more examples than few-shot learning

Zero-Shot Learning

Zero-shot learning lets LLMs perform new tasks without being explicitly trained for them.

  • No specific examples are used
  • Relies on existing language understanding
  • Transfers general knowledge to new situations

Example:

If a model knows what a “horse” is and is told a “zebra” looks like a striped horse, it can identify a zebra correctly without prior examples.
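
For LLMs, zero-shot usually means the prompt contains only an instruction and the new input, with no worked examples; the task and wording below are a hypothetical illustration:

# Zero-shot prompt: instruction + input, no examples of the task
prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)
# Sent to an LLM, this relies entirely on the model's general language
# understanding to produce the answer.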

Few-Shot Learning

Few-shot learning helps models learn a new task with only a few examples.

  • Builds on previous knowledge
  • Learns from limited samples
  • Reduces the need for massive datasets

Example:

A student recalls lessons from class and answers a similar question in an exam, even without extra studying. Similarly, an LLM uses a few examples to adapt to new tasks.
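
In prompting terms, few-shot means the prompt itself includes a handful of worked examples before the new input; the examples below are hypothetical:

# Few-shot prompt: two labeled examples, then the new input to complete
prompt = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: I loved every minute of it. Sentiment: positive\n"
    "Review: Total waste of money. Sentiment: negative\n"
    "Review: The battery died after two days. Sentiment:"
)
# The model infers the pattern from the two examples and completes the last line.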

Multi-Shot Learning

Multi-shot learning is similar to few-shot learning but uses more examples for better accuracy.

  • Uses more examples per task
  • Builds stronger understanding
  • Improves precision and generalization

Example:

If the model sees several images of Golden Retrievers, it can learn to recognize them and later identify similar breeds, like Labradors, with higher confidence.