Parsing with Grok

Updated Dec 30, 2022

Structured and Unstructured Data

Structured data is highly organized and stored in a predefined format like rows and columns, making it easy to search, analyze, and process. Logstash can efficiently parse this type of data and transform it for indexing in Elasticsearch or other destinations.

For example, consider the following CSV file.

id,name,email
1,John Doe,johndoe@example.com
2,Jane Smith,janesmith@example.com
3,Bob Johnson,bobjohnson@example.com

This file contains three fields:

  • id (unique identifier)
  • name (full name of the individual)
  • email (contact email)

The structure allows straightforward processing and data extraction.
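
Logstash can ingest this file with its built-in csv filter. The pipeline below is a minimal sketch; the file path, the skip_header option, and the Elasticsearch host are assumptions made for illustration.

input {
  file {
    path => "/tmp/users.csv"            # hypothetical path to the CSV above
    start_position => "beginning"
  }
}

filter {
  csv {
    separator => ","
    columns => ["id", "name", "email"]  # map each column to a named field
    skip_header => true                 # drop the id,name,email header row
  }
}

output {
  elasticsearch { hosts => ["localhost:9200"] }  # assumed local cluster
}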

On the other hand, unstructured data lacks a predefined format, which makes it harder to organize and process. Examples include text logs, emails, or multimedia files that require advanced parsing techniques. Parsing unstructured data can be complex due to inconsistent formats or mixed information types.

Consider, for instance, Linux system logs like the following:

Dec 31 10:12:45 localhost kernel: [ 1234.567890] CPU: 2 PID: 1234 Comm: bash Not tainted
Dec 31 10:12:46 localhost sshd[5678]: Accepted password for user from 192.168.1.100 port 22 ssh2
Dec 31 10:12:47 localhost systemd[1]: Started Session c1 of user root.

These logs are a mix of timestamps, system events, and metadata. Parsing such logs requires tools capable of handling inconsistent formats and extracting key details like timestamps, process IDs, or event descriptions.

Grokking

Grok filtering is a powerful Logstash feature for extracting structure from unstructured data. It uses patterns to match and parse specific fields from logs or text, making them easier to process.

  • Simplifies parsing of complex log files.
  • Helps structure unorganized data for indexing and analysis.

Grok uses regular expressions (regex) to identify patterns in text. It includes prebuilt patterns for common log types, making it versatile for processing diverse data formats.

As an example, let's say we want a regex pattern to match the following email addresses:

john@xyz.com  
ted2024@abc.edu
abby21street@gov.au

The following regex matches all three addresses:

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
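
When no predefined pattern fits, a raw regex like this can be embedded in a grok filter as a named capture. The sketch below assumes the raw line arrives in Logstash's default message field; the field name client_email is illustrative.

filter {
  grok {
    # (?<name>regex) captures the matched text into a field called "name"
    match => {
      "message" => "(?<client_email>[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})"
    }
  }
}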

Predefined Grok Patterns

Grok includes predefined patterns that simplify the parsing of common data formats, such as emails, IP addresses, or timestamps. These patterns save time and make it easy to extract structured data from unstructured logs.

  • Predefined patterns cover a wide range of use cases.
  • Easy to use and customize for specific fields in log data.

For example, to extract an email address from a log:

%{EMAILADDRESS:client_email}
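
In a pipeline, this one-liner replaces the hand-written regex from the previous section. Again, a sketch assuming the address sits somewhere in the default message field:

filter {
  grok {
    # EMAILADDRESS is one of Grok's shipped patterns
    match => { "message" => "%{EMAILADDRESS:client_email}" }
  }
}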

For more information, please see Grokking grok.

Generic Grok Syntax

The Grok syntax uses a standard format to match and extract data from logs. Each Grok expression is composed of patterns and identifiers to capture specific fields.

%{PATTERN:identifier}
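
For instance, %{IP:client_ip} applies the predefined IP pattern and stores the match in a field named client_ip. Expressions can be chained to describe a whole line; the field names below are illustrative:

%{IP:client_ip} %{WORD:http_method} %{URIPATHPARAM:request}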

Using Grok

Consider the sample log below. Its timestamp is in ISO 8601 format, a standard commonly used in logs.

2022-12-18T18:34:30.12+02:00 DEBUG Printing sample log  

This line has three components:

  • Timestamp: Represents the log's creation time.

    %{TIMESTAMP_ISO8601:timestamp}
  • Log Level: Indicates the log's severity or type.

    %{LOGLEVEL:log_level}
  • Message: Contains the details of the log entry.

    %{GREEDYDATA:message}

To extract these components, the Grok filter can be written as:

%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:log_level} %{GREEDYDATA:message}
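
Dropped into a pipeline, the complete filter could look like the sketch below. The overwrite option is a standard grok setting; here it replaces the original raw line in message with just the captured message text:

filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:log_level} %{GREEDYDATA:message}"
    }
    overwrite => ["message"]  # keep only the extracted message text
  }
}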

Testing Grok Patterns

We can use a Grok debugger, such as the Grok Debugger built into Kibana's Dev Tools, to test and refine Grok patterns.

  • Input the sample log and the Grok pattern.
  • Enable the "Named Captures Only" option to see only the extracted fields.

Using the previous example, we can enter the sample log and the pattern.
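
With "Named Captures Only" enabled, the debugger returns the extracted fields as structured output, roughly as follows:

{
  "timestamp": "2022-12-18T18:34:30.12+02:00",
  "log_level": "DEBUG",
  "message": "Printing sample log"
}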