Batch Processing
Overview
Batch processing means handling data in groups rather than one by one. Each batch starts, runs, and finishes before new data is added.
- Processes data in groups
- Runs from start to finish as one task
- Triggered by time or events
- Uses a set batch size
- Often called a "job" when running
This approach allows consistent, predictable processing. Each batch is self-contained, making it easier to manage and troubleshoot.
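As a rough sketch of the pattern, the Python below groups incoming records into fixed-size batches and processes each one as a self-contained unit. The batch size of 100 and the `process` function are illustrative assumptions, not part of any particular system.

```python
# Minimal sketch of a batch job: collect records into fixed-size
# groups and process each group as one self-contained unit.
# BATCH_SIZE and process() are illustrative placeholders.

BATCH_SIZE = 100

def process(batch):
    # Placeholder: a real job might transform records, write them
    # to a database, or hand them to another system.
    print(f"Processed batch of {len(batch)} records")

def run_batch_job(records):
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == BATCH_SIZE:
            process(batch)   # the batch runs start to finish
            batch = []       # then a fresh batch begins
    if batch:                # handle the final partial batch
        process(batch)

run_batch_job(range(250))    # two full batches plus one of 50
```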
Common Examples
These examples show how batching simplifies work by grouping data. Once a batch finishes, the system moves on to the next one with no leftover state to track.
- Reading files (see the sketch after this list)
  - Files are often read in chunks
  - Each chunk represents part of the file being processed
- Email handling
  - Emails are checked and updated in groups
  - Queued messages are sent once a connection is available
- Printing
  - Each print job runs separately
  - Multiple jobs wait their turn to avoid mixing pages
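To make the file-reading example concrete, here is a minimal sketch of reading a file in fixed-size chunks. The 64 KB chunk size and the `input.log` file name are assumptions chosen for illustration.

```python
# Sketch of chunked file reading: each 64 KB chunk is one unit of
# work, so the whole file never has to fit in memory at once.
# "input.log" and CHUNK_SIZE are illustrative assumptions.

CHUNK_SIZE = 64 * 1024  # 64 KB per chunk

def read_in_chunks(path):
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:      # empty read means end of file
                break
            yield chunk

for i, chunk in enumerate(read_in_chunks("input.log")):
    print(f"chunk {i}: {len(chunk)} bytes")
```

Because each chunk is processed and discarded before the next is read, memory use stays flat no matter how large the file is.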
Why Use Batch Processing
Batch processing has been around since the early days of computing when people submitted tasks using punch cards.
- Easy to understand and manage
- Runs consistently every time
- Simple to scale for better performance
Batch processing helps improve reliability and efficiency. It’s not always the best fit, but for many repetitive and predictable tasks, it gets the job done cleanly and effectively.
Scaling
Scaling means increasing how much work a process can complete. It helps us finish the same work in less time or handle more data in the same amount of time.
- Improves speed of data processing
- Handles more data efficiently
- Focuses on better performance
Vertical Scaling
Vertical scaling means upgrading the power of a single system to make it faster. It is often the easiest way to boost performance. The system just runs faster on better hardware without modifying how it works.
- Uses stronger hardware
  - Faster CPUs or more cores
  - More or quicker RAM
- Improves data access speed
  - Uses SSDs instead of hard drives
  - Adds faster network components
- Simple to apply
  - Usually doesn’t require code changes
While vertical scaling is simple, it has limits and costs.
- Limited growth
  - Hardware has a maximum capacity
  - You can’t always get a faster CPU or more RAM
- High cost
  - Faster components cost much more
  - Price increases don’t always match performance gains
- Unpredictable improvements
  - Hardware speed may not grow fast enough for your data growth
Vertical scaling works well at first, but it can’t grow endlessly. Eventually, you reach a point where adding more power isn’t practical or affordable.
Horizontal Scaling
Horizontal scaling means spreading tasks across multiple systems instead of relying on one.
- Uses multiple machines
  - Tasks are divided between computers
  - Can also mean adding CPUs or cores to share the load
- Handles parallel work
  - Each task runs independently
  - Ideal for jobs that can be split easily (see the sketch after this list)
- Efficient for large workloads
  - Scales out as data grows
  - Often cost-effective
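As a minimal sketch of the idea on a single machine, Python’s standard-library `multiprocessing` module can divide independent tasks across worker processes; frameworks like Spark or Dask apply the same pattern across many machines. The `transform` function, worker count, and input data here are placeholder assumptions.

```python
# Sketch of horizontal scaling on one machine: independent tasks are
# split across worker processes. Frameworks like Spark or Dask extend
# this pattern across multiple machines.
from multiprocessing import Pool

def transform(record):
    # Placeholder for independent per-record work.
    return record * record

if __name__ == "__main__":
    records = list(range(1_000))
    with Pool(processes=4) as pool:   # 4 workers share the load
        results = pool.map(transform, records, chunksize=100)
    print(f"processed {len(results)} records")
```

This only works because each record can be transformed on its own; a job whose steps depend on one another could not be split this way.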
Horizontal scaling is powerful but comes with added complexity.
- Needs a framework
  - Requires tools like Spark or Dask
  - Helps coordinate tasks across systems
- Requires strong communication
  - Systems must exchange data efficiently
  - Increases network and setup needs
- Not suitable for all tasks
  - Sequential jobs can’t easily be split
  - Some steps depend on the results of earlier ones
Delays
Delays are common in batch processing. They affect how fast data moves from collection to final output.
- Waiting for data to be ready
  - All required data must be collected first
  - Missing data can delay the entire batch
- Waiting to start processing
  - Scheduled jobs may have to wait for the next interval
  - Data arriving right after a run waits until the next one
- Processing time itself
  - Larger data means longer processing
  - Depends on system speed and scaling
- Waiting for final output
  - Results need to be copied or transferred
  - Users see data only after this step completes
Every stage adds time to the total delay. Reducing each step helps make the whole batch faster and more efficient.
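As a back-of-the-envelope illustration, the total delay is simply the sum of the stage delays. Every hour value below is invented for the example; only the additive structure comes from the list above.

```python
# Illustrative only: each stage's delay is an assumed value, but the
# point stands that end-to-end delay is the sum of every stage.
delays_hours = {
    "waiting for data": 2.0,
    "waiting to start": 1.0,   # e.g. until the next scheduled run
    "processing": 4.0,
    "delivering output": 1.5,
}

total = sum(delays_hours.values())
print(f"total delay: {total} hours")  # 8.5 hours from collection to output
```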
Example 1: Waiting For Data
Sometimes the batch must wait for its data before processing can even begin.
- Data must come from multiple sources
  - Each source might send logs at different times
  - Systems often send data when idle to save resources
- Busy systems slow data collection
  - When systems are overloaded, logs arrive late
  - The batch waits until all data is available
In this case, waiting for all machines to send data creates a delay. The process can’t start until everything is ready.
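One common way to implement this waiting step is to poll until every expected input has arrived. The file names and 60-second poll interval in this sketch are assumptions for illustration only.

```python
# Sketch of waiting for all sources before a batch starts. The file
# names and 60-second poll interval are illustrative assumptions.
import os
import time

EXPECTED_FILES = ["server1.log", "server2.log", "server3.log"]

def wait_for_all_sources(poll_seconds=60):
    while True:
        missing = [f for f in EXPECTED_FILES if not os.path.exists(f)]
        if not missing:
            return                # everything arrived; batch can start
        print(f"still waiting for: {missing}")
        time.sleep(poll_seconds)  # one slow source delays the whole batch

wait_for_all_sources()
print("all data present, starting batch")
```

Note how a single slow source holds up the entire batch: the loop cannot exit until the last file appears.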
Example 2: Slow Processing Time
Processing speed can also be a major cause of delay.
- Large data volumes
  - Example: 100 GB of logs taking 23 hours to process
  - Processing speed: about 4.3 GB per hour
- Growing data
  - If data increases by 5% monthly, next month takes about 24 hours
  - After two months, it takes about 25 hours to finish
If data grows faster than the system can handle, one day’s batch might take longer than a day to finish. This creates a backlog that gets worse over time.
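The arithmetic above can be reproduced in a few lines. The 5% monthly growth and the fixed throughput come from the example itself, not from a measured system.

```python
# Reproduces the example's arithmetic: data grows 5% per month while
# throughput stays fixed, so the daily batch eventually exceeds 24 hours.
volume_gb = 100.0
throughput = 100.0 / 23.0   # ~4.3 GB/hour, from the 23-hour example

for month in range(4):
    hours = volume_gb / throughput
    flag = "  <-- longer than a day" if hours > 24 else ""
    print(f"month {month}: {volume_gb:6.1f} GB -> {hours:4.1f} h{flag}")
    volume_gb *= 1.05       # 5% monthly growth
```

By month one the daily batch already needs just over 24 hours, so each day’s data arrives before the previous day’s has finished and the backlog compounds.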
Example 3: Delayed Reports
Even after processing finishes, final outputs like reports can take extra time.
- Multiple steps to complete
  - Data must first be processed
  - Then moved to the analytics system
- Report generation
  - The report can only start when all data is available
  - Even a short report may depend on long data transfers
For example, if each stage takes several hours, a daily report might show data that’s already 1.5 days old. The more stages there are, the longer users must wait to see updated results.
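As a rough illustration of how that staleness accumulates, the stage durations below are invented; only the structure (stages run one after another, and the report covers the previous day’s data) follows the example.

```python
# Illustrative staleness calculation: the report covers yesterday's
# data, so pipeline delays stack on top of the data's age. The stage
# durations are assumptions, not measurements.
stages_hours = [
    ("processing the batch", 8),
    ("transfer to analytics system", 3),
    ("report generation", 1),
]

pipeline = sum(h for _, h in stages_hours)  # 12 hours of pipeline delay
oldest_age_at_close = 24   # the first records of the day are a full day old
total = oldest_age_at_close + pipeline
print(f"oldest data in today's report: {total} h (~{total / 24:.1f} days)")
```

With these assumed durations, the oldest records in the report are 36 hours old, which is the 1.5 days mentioned above; every extra stage pushes that number higher.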