Counting Problems in DataFrame

Updated Oct 28, 2019

Problem

You are given a CSV file named tweets.csv that contains Twitter data. One of the columns in the dataset is lang, which stores the language of each tweet.

Real Twitter data

The dataset contains real Twitter data and may include offensive content. Get the dataset here: tweets.csv

Task

Count how many tweets exist for each language.

Requirements – Part 1

Import the pandas library as pd
Load tweets.csv into a DataFrame named df
Iterate over the lang column
Build a dictionary and print the result

Requirements – Part 2

Define a function named count_entries
The function should accept a DataFrame and a column name
Return the dictionary instead of printing it
Call the function using the lang column

Details on the dictionary

Keys represent language codes
Values represent the number of tweets in that language

Thought process

Without using a function

Start with an empty dictionary
Loop through each value in the lang column
If the language already exists, increase its count
If it does not exist, add it with a count of 1
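
The steps above can be tried on a small list before touching the CSV file. This is a minimal sketch: the sample values are made up, and it uses dict.get to fold the "exists" and "does not exist" branches into one line, which is a variation on the explicit if/else used in the solution.

```python
# Hypothetical sample values standing in for the lang column.
langs = ['en', 'en', 'et', 'und', 'en']

# Start with an empty dictionary.
langs_count = {}

for entry in langs:
    # get() returns 0 when the key is missing, so this one line
    # both adds new languages and increments existing ones.
    langs_count[entry] = langs_count.get(entry, 0) + 1

print(langs_count)  # {'en': 3, 'et': 1, 'und': 1}
```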

Using a function

Move the dictionary and loop logic inside a function
Use the column name as a parameter to make the function reusable
Return the result so it can be stored or reused

Solution – Part 1

See project files here: GitHub.

Install the required packages from requirements.txt:

pip install -r requirements.txt

The first step is to solve the problem without using a function. This helps verify that the logic works correctly.

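## count_langs_v1.py
import pandas as pd

# Load the tweets into a DataFrame.
df = pd.read_csv('tweets.csv')

# Maps each language code to the number of tweets in that language.
langs_count = {}

col = df['lang']

for entry in col:
    if entry in langs_count:
        # Language already seen: increase its count.
        langs_count[entry] += 1
    else:
        # First occurrence: add it with a count of 1.
        langs_count[entry] = 1

print(langs_count)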

At this stage, the dictionary contains the count of tweets per language.

Running the script:

python count_langs_v1.py

Output:

{'en': 97, 'et': 1, 'und': 2}
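
One way to sanity-check the loop is to compare it against the built-in Series.value_counts() in pandas. The sketch below uses a small in-memory DataFrame with made-up values instead of tweets.csv, so the numbers are illustrative only. Note that value_counts() skips missing values by default, so results could differ on a column that contains them.

```python
import pandas as pd

# Hypothetical in-memory data standing in for tweets.csv.
df = pd.DataFrame({'lang': ['en', 'en', 'et', 'und', 'en']})

# Manual count, same logic as count_langs_v1.py.
langs_count = {}
for entry in df['lang']:
    if entry in langs_count:
        langs_count[entry] += 1
    else:
        langs_count[entry] = 1

# Built-in count; to_dict() turns the resulting Series into a plain dictionary.
builtin_count = df['lang'].value_counts().to_dict()

# Dictionary comparison ignores key order, so this checks the counts only.
print(langs_count == builtin_count)
```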

Solution – Part 2

Once the logic is confirmed, the next step is to convert it into a reusable function.

## count_langs_v2.py
import pandas as pd

# Load the tweets into a DataFrame.
df = pd.read_csv('tweets.csv')

def count_entries(df, col_name):
    """Return a dictionary that maps each value in df[col_name] to its count."""
    langs_count = {}
    col = df[col_name]

    for entry in col:
        if entry in langs_count:
            # Value already seen: increase its count.
            langs_count[entry] += 1
        else:
            # First occurrence: add it with a count of 1.
            langs_count[entry] = 1

    return langs_count

result = count_entries(df, 'lang')
print(result)

Running the script:

python count_langs_v2.py

Output:

{'en': 97, 'et': 1, 'und': 2}
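
Because the column name is a parameter, the same function can count values in any column. The sketch below illustrates this with a small in-memory DataFrame. The data and the source column are invented for the example and are not taken from tweets.csv.

```python
import pandas as pd

def count_entries(df, col_name):
    """Return a dictionary that maps each value in df[col_name] to its count."""
    counts = {}
    for entry in df[col_name]:
        if entry in counts:
            counts[entry] += 1
        else:
            counts[entry] = 1
    return counts

# Hypothetical data; the 'source' column is an assumption for this example.
df = pd.DataFrame({
    'lang': ['en', 'en', 'et'],
    'source': ['web', 'mobile', 'web'],
})

# The returned dictionaries can be stored and reused instead of only printed.
lang_counts = count_entries(df, 'lang')
source_counts = count_entries(df, 'source')

print(lang_counts)    # {'en': 2, 'et': 1}
print(source_counts)  # {'web': 2, 'mobile': 1}
```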
