Apache Hadoop
Overview
When handling petabytes of data, Hadoop enables efficient processing by using multiple computers to work on the data in parallel.
- Expand storage capacity as data grows
- Each computer added expands the Hadoop Distributed File System (HDFS)
- Built to continue operations even if some computers fail
- Distributes data across separate hardware
This lab demonstrates how to integrate Elasticsearch with Apache Hadoop to process large datasets.
- Use an Apache access log as an example of big data.
- Write a simple MapReduce job to ingest the file with Hadoop
- Index the ingested data into Elasticsearch
Lab Environment
| Node | Hostname | IP Address |
|---|---|---|
| Node 1 | elasticsearch | 192.168.56.101 |
| Node 2 | logstash | 192.168.56.102 |
| Node 3 | kibana | 192.168.56.103 |
| Node 4 | hadoop | 192.168.56.104 |
Setup details:
- The nodes are created in VirtualBox using Vagrant.
- An SSH key is generated on the Elasticsearch node
Pre-requisites
- Create the nodes in VirtualBox
- Install Elasticsearch on node 1
- Install Logstash on node 2
- Configure SSL on Elasticsearch
- Install jq on the nodes
Choosing the Right Tool
Hadoop, Elasticsearch, and Logstash each have unique strengths for data management and analysis.
- Hadoop
  - Handles large-scale data ingestion from billions of sources
  - Processes data in parallel across clusters
  - Suitable for initial collection and raw data processing
- Elasticsearch
  - Stores and indexes data for fast retrieval
  - Optimized for full-text searches and complex queries
  - Ideal for quick data access and user-friendly search results
- Logstash
  - Collects and processes real-time data
  - Converts raw data into structured formats
  - Acts as a bridge between data sources and Elasticsearch
Working together, Hadoop collects and processes vast amounts of data, Logstash organizes and forwards it, and Elasticsearch stores and retrieves it quickly for user searches.
How MapReduce Works
MapReduce processes data through three stages: map, shuffle, and reduce. It allows fine-tuned customization and optimization for efficient data processing.
Initially, the data is split into smaller chunks that are distributed across different computing nodes.
- Nodes map the data into key-value pairs.
- Values with the same keys are grouped together during the shuffle stage.
- The reduce stage processes the grouped values and aggregates them.
- The output is the final aggregated result, as sketched in the word-count example below.
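To make the stages concrete, here is the classic word-count job in Java: the map stage emits (word, 1) pairs, the shuffle groups them by word automatically, and the reduce stage sums the counts. This example is only illustrative and is not part of this lab's repository.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map stage: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce stage: sum the counts that the shuffle grouped under each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```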
Offline Installation
In a production enterprise setup, Hadoop nodes are typically placed in a private network to enhance security and control access. Offline installation is generally recommended for such setups.
- On a computer with internet access, download the latest Hadoop version.
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
For more information, please see Hadoop downloads.
- Copy the files to the virtual machine. If you are using VirtualBox on your computer, you can map a local folder to a file share in your VM.
- Log in to your VM, create the `hadoop` user, and add it to the sudoers group.
sudo useradd hadoop
sudo passwd hadoop
sudo usermod -aG sudo hadoop
- Proceed to the directory where the Hadoop package is stored and extract the archive.
cd /mnt/fileshare/hadoop
cp hadoop* /tmp
tar -xvzf hadoop-3.4.1.tar.gz -C /usr/local
mv /usr/local/hadoop-3.4.1 /usr/local/hadoop
- Change the ownership of the Hadoop directory.
sudo chown -R hadoop:hadoop /usr/local/hadoop
- Before continuing, verify that the required packages are installed on the VM.
- Check if Java is installed.
java -version
dpkg -l | grep openjdk
If Java is installed, you'll see output similar to:
openjdk version "11.0.25" 2024-10-15
OpenJDK Runtime Environment (build 11.0.25+9-post-Ubuntu-1ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.25+9-post-Ubuntu-1ubuntu122.04, mixed mode, sharing)
If Java is not installed, you can install it:
sudo apt install -y openjdk-11-jdk
sudo apt install -y openjdk-8-jdk   # (Optional) Install Java 8 if needed
You can choose between installed Java versions if multiple are available.
sudo update-alternatives --config java
Output:
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-11-openjdk-amd64/bin/java
Nothing to configure.
- Check if `libsnappy` is installed:
dpkg -l | grep libsnappy
If not installed, you can install it:
sudo apt install -y libsnappy-dev
- Check if `zlib` is installed:
dpkg -l | grep zlib
If not installed, you can install it:
sudo apt install zlib1g-dev
- Verify the Python installation.
python3 --version
You can install Python 3 on Ubuntu using:
sudo apt update
sudo apt install python3
After installing the pre-requisites, proceed with the installation.
- Add the following to `~/.bashrc` or `/etc/profile`:
cat >> ~/.bashrc
Enter the following:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
- Reload the shell.
source ~/.bashrc
- Verify the installation:
hadoop version
Output:
Hadoop 3.4.1
Source code repository https://github.com/apache/hadoop.git -r 4d7825309348956336b8f06a08322b78422849b1
Compiled by mthakur on 2024-10-09T14:57Z
Compiled on platform linux-x86_64
Compiled with protoc 3.23.4
From source with checksum 7292fe9dba5e2e44e3a9f763fce3e680
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.4.1.jar
Download the Sample Log File
A sample log file will be used to simulate big data for this project. It will serve as the input for the MapReduce job and will then be indexed into Elasticsearch for further processing.
Download the file here: hadoop-apache-access.log
After you download the log file, transfer it to your node. If you are using VirtualBox on your computer, you can map a local folder to a file share in your VM.
MapReduce Project
This section covers the theory behind the MapReduce project.
To save time, you can clone the GitHub repository that contains the sample source code.
For more information, please see Clone the Project Repository.
To compile the MapReduce code into a JAR file, we'll need to use Maven. In a real-world scenario, there are additional steps:
- Use an IDE to create the project and write the code.
- Compile the code into a JAR file with Maven on your local machine.
- Transfer the JAR file to the Hadoop instance.
Maven Setup
- Create an empty Maven project using your IDE.
- Skip the archetype selection since you only need an empty project.
- After the project is created, modify the `pom.xml` file with the following properties and dependencies:
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.2.1</version>
</dependency>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-hadoop-mr</artifactId>
<version>7.8.0</version>
</dependency>
<dependency>
<groupId>commons-httpclient</groupId>
<artifactId>commons-httpclient</artifactId>
<version>3.1</version>
</dependency>
</dependencies>
Notes:
- Update the version numbers based on your setup.
- The `elasticsearch-hadoop-mr` library is used to write to Elasticsearch.
- The `commons-httpclient` library allows Hadoop to make REST calls to the Elasticsearch server.
- Define the access log mapper class and override the default map method with the logic you need (a rough sketch of the complete job follows the skeleton below):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class AccessLogIndexIngestion {
public static class AccessLogMapper extends Mapper<Object, Object, Text, IntWritable> {
@Override
protected void map(Object key, Object value, Context context) throws IOException, InterruptedException {
// Implement your map logic here
}
}
public static void main(String[] args) {
// Main method logic
}
}
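The repository contains the actual implementation. Purely as a sketch, assuming a standard combined access-log layout and the `logs` index created later in this lab, the map logic and job wiring might look roughly like the following. The field positions, JSON construction, and connection settings (`es.nodes`, the index name) are illustrative assumptions, not the repository's code.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class AccessLogIndexIngestion {

  public static class AccessLogMapper extends Mapper<Object, Text, NullWritable, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Hypothetical parsing: split one access-log line on spaces and build a JSON
      // document matching the "logs" index mapping. Adjust the field positions to
      // the actual layout of the sample log file.
      String[] f = value.toString().split(" ");
      String json = "{\"ip\":\"" + f[0] + "\","
          + "\"dateTime\":\"" + f[3].replace("[", "") + "\","
          + "\"url\":\"" + f[6] + "\","
          + "\"responseCode\":\"" + f[8] + "\","
          + "\"size\":\"" + f[9] + "\"}";
      context.write(NullWritable.get(), new Text(json));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("es.nodes", "192.168.56.101:9200");          // Elasticsearch node from the lab table
    conf.set("es.resource", "logs");                       // index created later in this lab
    conf.set("es.input.json", "yes");                      // the mapper emits ready-made JSON strings
    conf.setBoolean("mapreduce.map.speculative", false);   // avoid duplicate writes to Elasticsearch

    Job job = Job.getInstance(conf, "access log ingestion");
    job.setJarByClass(AccessLogIndexIngestion.class);
    job.setMapperClass(AccessLogMapper.class);
    job.setNumReduceTasks(0);                              // map-only job; no reduce stage needed
    job.setOutputFormatClass(EsOutputFormat.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Since the lab's Elasticsearch node uses TLS and basic authentication, settings such as `es.net.http.auth.user`, `es.net.http.auth.pass`, and the `es.net.ssl` options would likely be needed as well.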
Build the JAR File
GitHub repository: test-maven-project
- On a computer with internet access, clone the repository:
git clone https://github.com/joseeden/test-maven-project.git
- Install Maven:
sudo apt install -y maven
- Proceed to the project directory and build the JAR file.
mvn clean package
- If successful, you should see a build success message:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 36.095 s
[INFO] Finished at: 2025-01-05T17:03:43+08:00
[INFO] ------------------------------------------------------------------------
- Copy the JAR file onto the Hadoop node. If you are using VirtualBox on your computer, you can map a local folder to a file share in your VM.
Create the Elasticsearch Index
Log in to the Elasticsearch node and run the following command to create an Elasticsearch index. Make sure to change the password and set the endpoint to your Elasticsearch node's address.
curl -s -u elastic:elastic \
-H 'Content-Type: application/json' \
-X PUT "$ELASTIC_ENDPOINT:9200/logs?pretty" -d'
{
"mappings": {
"properties": {
"ip": { "type": "keyword" },
"dateTime": { "type": "date", "format": "dd/MMM/yyyy:HH:mm:ss" },
"httpstatus": { "type": "keyword" },
"url": { "type": "keyword" },
"responseCode": { "type": "keyword" },
"size": { "type": "integer" }
}
}
}' | jq
The `dateTime` field is defined as a date, which allows us to visualize various metrics using Kibana. The date format is also specified to ensure that values passed to Elasticsearch are parsed correctly; a small parsing check is sketched at the end of this section.
If the index is created correctly, you should see:
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "logs"
}
Another way is to check the indices:
curl -u elastic:elastic --insecure \
-X GET "$ELASTIC_ENDPOINT:9200/_cat/indices?v"
Output:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
yellow open logs wpxLIsW1TCOqAnpUl2SGKg 1 1 0 0 227b 227b 227b
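As a quick sanity check on that date format, the standalone snippet below parses a hypothetical access-log timestamp with the same pattern declared in the mapping; a value that deviates from `dd/MMM/yyyy:HH:mm:ss` fails to parse, much as Elasticsearch would reject the document. The class name and sample value are illustrative only.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class DateFormatCheck {
    public static void main(String[] args) {
        // Hypothetical timestamp in the form produced by an Apache access log,
        // with the time-zone offset already stripped off.
        String raw = "10/Oct/2024:13:55:36";

        // Same pattern declared in the "logs" index mapping above.
        DateTimeFormatter fmt =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);

        // Throws DateTimeParseException if the value does not match the pattern.
        LocalDateTime parsed = LocalDateTime.parse(raw, fmt);
        System.out.println(parsed); // 2024-10-10T13:55:36
    }
}
```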
Running the Hadoop Job
Log in to the Hadoop node.
- Switch to the `hadoop` user. Ensure both the sample log file and the JAR file are copied over.
su - hadoop
mkdir /tmp/hadoop
cp /mnt/fileshare/logs/hadoop-apache-access.log /tmp/hadoop/
cp /mnt/fileshare/hadoop/test-maven-project-1.0.jar /tmp/hadoop/
cd /tmp/hadoop
In this step, I created the `/tmp/hadoop` directory and copied over both files.
- Execute the Hadoop JAR file with the log file as input.
hadoop jar test-maven-project-1.0.jar hadoop-apache-access.log
Verify the Index
Log in to the Elasticsearch node and switch to root.
Verify that the index now contains the ingested documents:
curl -u elastic:elastic --insecure \
-X GET "$ELASTIC_ENDPOINT:9200/_cat/indices?v"
Output:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
yellow open logs wpxLIsW1TCOqAnpUl2SGKg 1 1 10000 0 805.2kb 805.2kb
Visualize in Kibana
TODO....
Store the Elasticsearch endpoint and credentials in variables:
ELASTIC_ENDPOINT="https://your-elasticsearch-endpoint"
ELASTIC_USER="your-username"
ELASTIC_PW="your-password"
Cleanup
Use the command below to delete the indices after the lab. Make sure to replace `enter-name` with the index name.
curl -s -u $ELASTIC_USER:$ELASTIC_PW \
-H 'Content-Type: application/json' \
-XDELETE "$ELASTIC_ENDPOINT:9200/enter-name" | jq