Counting Problems in DataFrame
Problem
You are given a CSV file named tweets.csv that contains Twitter data. One of the columns in the dataset is lang, which stores the language of each tweet.
The dataset contains real Twitter data and may include offensive content. Get the dataset here: tweets.csv
Task
- Count how many tweets exist for each language.
Requirements – Part 1
- Import the pandas library as pd
- Load tweets.csv into a DataFrame named df
- Iterate over the lang column
- Build a dictionary and print the result
Requirements – Part 2
- Define a function named count_entries
- The function should accept a DataFrame and a column name
- Return the dictionary instead of printing it
- Call the function using the lang column
Details on the dictionary
- Keys represent language codes
- Values represent the number of tweets in that language
Thought process
Without using a function
- Start with an empty dictionary
- Loop through each value in the lang column
- If the language already exists, increase its count
- If it does not exist, add it with a count of 1
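The steps above can be tried out on a few hand-written values before touching the CSV file. This is only a minimal sketch: the list sample_langs is a made-up stand-in for the lang column, not data from tweets.csv.

```python
# Hypothetical sample values standing in for the lang column (not real data).
sample_langs = ["en", "en", "et", "und", "en"]

counts = {}  # start with an empty dictionary

for lang in sample_langs:
    if lang in counts:
        counts[lang] += 1  # language seen before: increase its count
    else:
        counts[lang] = 1   # first occurrence: add it with a count of 1

print(counts)  # {'en': 3, 'et': 1, 'und': 1}
```

Because the loop only relies on iteration, the same logic should carry over unchanged once the list is swapped for a DataFrame column.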
Using a function
- Move the dictionary and loop logic inside a function
- Use the column name as a parameter to make the function reusable
- Return the result so it can be stored or reused
Solution – Part 1
See project files here: GitHub.
Install the packages listed in requirements.txt:
pip install -r requirements.txt
The first step is to solve the problem without using a function. This helps verify that the logic works correctly.
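## count_langs_v1.py
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('tweets.csv')

# Dictionary mapping each language code to its tweet count
langs_count = {}

# Extract the lang column
col = df['lang']

for entry in col:
    if entry in langs_count:
        # Language already seen: increase its count
        langs_count[entry] += 1
    else:
        # First occurrence: add it with a count of 1
        langs_count[entry] = 1

print(langs_count)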
At this stage, the dictionary contains the count of tweets per language.
Running the script:
python count_langs_v1.py
Output:
{'en': 97, 'et': 1, 'und': 2}
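As a side note, the standard library offers collections.Counter, which performs the same tallying in one call. The sketch below is an alternative to the manual loop, not the approach this exercise asks for, and it again uses a made-up list (sample_langs) rather than the real dataset.

```python
from collections import Counter

# Hypothetical sample values standing in for the lang column (not real data).
sample_langs = ["en", "en", "et", "und", "en"]

# Counter tallies each distinct value; dict() converts it to a plain dictionary.
langs_count = dict(Counter(sample_langs))

print(langs_count)  # {'en': 3, 'et': 1, 'und': 1}
```

Writing the loop by hand first is still worthwhile here, since the exercise is about understanding how the dictionary is built.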
Solution – Part 2
Once the logic is confirmed, the next step is to convert it into a reusable function.
## count_langs_v2.py
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('tweets.csv')


def count_entries(df, col_name):
    """Return a dictionary with the count of each value in a column."""
    langs_count = {}

    # Extract the requested column
    col = df[col_name]

    for entry in col:
        if entry in langs_count:
            langs_count[entry] += 1
        else:
            langs_count[entry] = 1

    return langs_count


result = count_entries(df, 'lang')
print(result)
Running the script:
python count_langs_v2.py
Output:
{'en': 97, 'et': 1, 'und': 2}
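One way to gain confidence in the function is to compare its result with pandas' built-in value_counts() and to call it on a second column to confirm it is reusable. The sketch below assumes a small hand-made DataFrame (sample_df, with an invented source column) instead of tweets.csv, and it writes the loop with dict.get for brevity.

```python
import pandas as pd


def count_entries(df, col_name):
    """Return a dictionary with the count of each value in df[col_name]."""
    counts = {}
    for entry in df[col_name]:
        # dict.get returns 0 for a key that has not been seen yet
        counts[entry] = counts.get(entry, 0) + 1
    return counts


# Hypothetical stand-in for tweets.csv; both columns are invented for illustration.
sample_df = pd.DataFrame({
    "lang": ["en", "en", "et", "und", "en"],
    "source": ["web", "app", "web", "web", "app"],
})

manual = count_entries(sample_df, "lang")

# value_counts() returns a Series of counts; to_dict() turns it into a dictionary.
builtin = sample_df["lang"].value_counts().to_dict()

print(manual == builtin)                    # True
print(count_entries(sample_df, "source"))   # {'web': 3, 'app': 2}
```

Note that value_counts() drops missing values by default, while the manual loop does not skip them, so the two results may differ on a column that contains missing entries.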