Loading Data in Chunks
When dealing with large datasets, loading everything at once may not be possible. Instead, we can load data in smaller chunks, process each chunk, and discard it before moving to the next.
- Useful for large files, databases, or API responses (a database sketch follows the example below)
- Uses pandas.read_csv() with the chunksize argument
- Each chunk is processed separately to save memory
Example:
import pandas as pd
# Load data in chunks
chunks = pd.read_csv("data.csv", chunksize=1000)
for chunk in chunks:
    print(chunk.head())  # Process each chunk separately
Output (First 5 rows of each chunk):
A B C
0 1.2 2.3 3.1
1 4.5 5.6 6.2
...
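As noted above, the same pattern works for databases: pandas.read_sql_query() accepts the same chunksize argument and then returns an iterator of DataFrames. Here is a minimal sketch using SQLite, where the file data.db and the table records are hypothetical names used only for illustration:

import sqlite3
import pandas as pd

# "data.db" and the "records" table are hypothetical names for illustration
conn = sqlite3.connect("data.db")
for chunk in pd.read_sql_query("SELECT * FROM records", conn, chunksize=1000):
    print(chunk.head())  # Process each chunk of rows separately
conn.close()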
Summing a Column
If a CSV has a column x with numbers, we can sum the values without loading everything into memory.
- Read data in chunks
- Sum values in each chunk
- Store partial results and combine them later
Example:
partial_sums = []
for chunk in pd.read_csv("data.csv", usecols=["x"], chunksize=1000):
    # Store each chunk's sum as a partial result
    partial_sums.append(chunk["x"].sum())
total_sum = sum(partial_sums)  # Combine the partial results
print("Total sum:", total_sum)
Output:
Total sum: 12345678
Summing Without a List
Instead of storing partial results in a list, we can accumulate the total directly as we iterate.
- No extra memory needed for a list of partial sums
- Each chunk's sum is added to the total during iteration
Example:
total = sum(chunk["x"].sum() for chunk in pd.read_csv("data.csv", usecols=["x"], chunksize=1000))
print("Total sum:", total)
Output:
Total sum: 12345678
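The running-total pattern extends to other statistics that can be built up chunk by chunk. A mean, for example, needs a running sum and a running row count. Here is a minimal sketch, assuming the same data.csv and that column x has no missing values:

import pandas as pd

total = 0.0
count = 0
for chunk in pd.read_csv("data.csv", usecols=["x"], chunksize=1000):
    total += chunk["x"].sum()  # Add this chunk's sum to the running total
    count += len(chunk)        # Add this chunk's row count
print("Mean:", total / count)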
Example: Processing Twitter Data
Large datasets can't always fit into memory, so we process them in chunks. In the example below, we analyze a CSV file of Twitter data by processing 10 entries at a time.
- Use pd.read_csv() with chunksize=10
- Count occurrences of languages in tweets
- Store results in a dictionary
Download the Twitter dataset here: tweets.csv
Solution:
import pandas as pd
counts_dict = {}
# Iterate, chunk by chunk
for chunk in pd.read_csv("tweets.csv", chunksize=10):
# Iterate over Language column
for entry in chunk['lang']:
if entry in counts_dict.keys():
counts_dict[entry] += 1
else:
counts_dict[entry] = 1
print(counts_dict)
Output:
{'en': 97, 'et': 1, 'und': 2}
The output is a dictionary where the keys are the language codes found in the dataset and the values indicate the number of times each language appears in the tweets. For example, 'en': 97 means that 97 tweets were in English. This confirms that the script correctly counted the occurrences of each language while processing the data in chunks.
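As a side note, the manual if/else bookkeeping can be delegated to collections.Counter, which adds per-chunk tallies from value_counts() into a running total. A sketch of this alternative:

import pandas as pd
from collections import Counter

counts = Counter()
for chunk in pd.read_csv("tweets.csv", chunksize=10):
    # value_counts() tallies languages within one chunk;
    # Counter.update() adds those tallies to the running totals
    counts.update(chunk['lang'].value_counts().to_dict())
print(dict(counts))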
Example: Making Code Reusable
Instead of rewriting the same code for similar tasks, it's better to use functions. The example below defines a function to count occurrences of values in a specific column while processing a CSV file in chunks.
- Reads the file in chunks using pd.read_csv()
- Counts occurrences of values in a given column
- Returns the results as a dictionary
Download the Twitter dataset here: tweets.csv
Solution:
import pandas as pd
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with the count of occurrences of each value in the given column."""
    counts_dict = {}
    # Process the file in chunks
    for chunk in pd.read_csv(csv_file, chunksize=c_size):
        for entry in chunk[colname]:
            if entry in counts_dict:
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1
    return counts_dict

result_counts = count_entries('tweets.csv', 10, 'lang')
print(result_counts)
Output:
{'en': 97, 'et': 1, 'und': 2}
The output is a dictionary where each key represents a language code, and the value indicates how many tweets were in that language. 'en': 97 means there were 97 English tweets, 'et': 1 means 1 tweet was in Estonian, and 'und': 2 means 2 tweets had an undefined language.
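One further polish worth knowing: the if/else inside count_entries collapses into a single line with dict.get(), a common counting idiom. A sketch of that variant:

import pandas as pd

def count_entries(csv_file, c_size, colname):
    """Return a dictionary of value counts for one column, read in chunks."""
    counts_dict = {}
    for chunk in pd.read_csv(csv_file, chunksize=c_size):
        for entry in chunk[colname]:
            # get() returns 0 when the key is not yet in the dictionary
            counts_dict[entry] = counts_dict.get(entry, 0) + 1
    return counts_dict

print(count_entries('tweets.csv', 10, 'lang'))

Because the column name is a parameter, the same function call works unchanged for any other column in the file.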