Skip to main content

Natural Language Processing

Updated May 04, 2023 ·

Overview

Natural Language Processing (NLP) allows computers to understand human language, which makes it possible for them to identify and categorize entities in text.

In the example below, words are sorted into categories such as names of persons and locations.

Bag of Words

When dealing with text data, we use different techniques to extract features for machine learning models.

  • Text data can be represented by counting the frequency of important words
  • This technique is known as the bag of words

To understand bag of words, consider analyzing sentences for word counts.

N-grams

N-grams improve the bag of words technique by considering sequences of words.

  • Counting sequences of words helps capture more contextual information
  • Example: Counting "This is" together instead of just "This"

Diagram:

Limitations

There are limitations to the bag of words approach, such as handling synonyms.

  • Word counts alone do not account for synonyms

  • Different words for "blue" like "navy-blue", "cobalt", "vivid cerulean" should ideally be grouped

Word Embeddings

Word embeddings address some limitations of the bag of words by grouping similar words.

  • Word embeddings create similar features for similar words
  • Mathematical representations of words that follow intuitive rules
  • Example: "King" - "man" + "woman" ≈ "Queen"

A more advanced example is using word embeddings together with dimensionality reduction.

Language Translation

Mapping words or sentences to numbers allows neural networks to perform tasks like language translation.

  • Techniques like bag of words and word embeddings are used

  • Example: Translating from Spanish to English

Applications

NLP powers many common applications, making our interaction with technology more intuitive.

  • Language translation apps (e.g., Google Translate)
  • Chatbots for customer service
  • Personal assistant apps (e.g., Siri, Alexa)
  • Sentiment analysis to gauge emotions in text