Data Quality

Updated May 12, 2023

Overview

This design phase evaluates data quality and decides how data will be collected and processed. Data quality measures how well data serves its intended purpose.

  • Poor-quality data leads to inaccurate predictions.
  • Quality is evaluated along four dimensions: accuracy, completeness, consistency, and timeliness.

Data Quality Dimensions

A good dataset should meet the following criteria:

| Dimension | Description | Example violation |
| --- | --- | --- |
| Accuracy | Data should reflect reality | A customer’s age recorded as 18 when they are actually 32 |
| Completeness | Missing data should be minimal | A customer’s last name is missing |
| Consistency | Data should be uniform across sources | The customer is marked active in one database but inactive in another |
| Timeliness | Data should be available when needed | Orders sync at the end of the day but are not available in real time |
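Two of these dimensions, completeness and consistency, are easy to quantify directly. As a minimal sketch (the record layout, field names, and the two hypothetical source systems below are invented for illustration):

```python
# Hypothetical customer records from a primary database; None marks a missing field.
crm_records = {
    1: {"first_name": "Ada", "last_name": "Lovelace", "status": "active"},
    2: {"first_name": "Alan", "last_name": None, "status": "inactive"},
}
# Status for the same customers in a second (billing) system.
billing_status = {1: "active", 2: "active"}

def completeness(records):
    """Fraction of fields that are populated (non-None)."""
    fields = [v for rec in records.values() for v in rec.values()]
    return sum(v is not None for v in fields) / len(fields)

def consistency(records, other_status):
    """Fraction of customers whose status matches across the two sources."""
    matches = sum(rec["status"] == other_status.get(cid)
                  for cid, rec in records.items())
    return matches / len(records)

print(completeness(crm_records))              # 5 of 6 fields are populated
print(consistency(crm_records, billing_status))  # customer 2 disagrees across sources
```

Scores like these can be tracked over time to flag datasets that drift below an acceptable threshold.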

If data quality is low in one or more dimensions, we can:

  • Collect more data
  • Fill missing values
  • Standardize definitions
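Of these remedies, filling missing values is the most mechanical. A simple sketch using mean imputation (the ages below are made up; in practice the imputation strategy should fit the data):

```python
# Hypothetical customer ages, with None marking missing entries.
ages = [32, 41, None, 25, None]

# Compute the mean of the observed values, then substitute it for each gap.
observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)
filled = [a if a is not None else round(mean_age) for a in ages]

print(filled)  # → [32, 41, 33, 25, 33]
```

Mean imputation keeps the overall average stable but shrinks variance, so it is only a starting point.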

Data Ingestion

Once we know what quality data looks like, we need a way to collect and process it. Automated data pipelines extract, transform, and store data efficiently using the ETL process:

  • Extract: Get data from sources.
  • Transform: Convert it into the right format.
  • Load: Store it in a database.
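The three steps above can be sketched as a tiny pipeline. This is a toy example, not a production design: the raw records are hypothetical, and SQLite stands in for a real warehouse.

```python
import sqlite3

# Hypothetical raw order records as they might arrive from an upstream source:
# everything is a string, and names are inconsistently formatted.
raw_orders = [
    {"id": "1", "amount": "19.99", "customer": " Ada "},
    {"id": "2", "amount": "5.00", "customer": "alan"},
]

def extract():
    """Extract: get data from the source (here, an in-memory list)."""
    return raw_orders

def transform(rows):
    """Transform: convert values into the right types and clean the text."""
    return [
        (int(r["id"]), float(r["amount"]), r["customer"].strip().title())
        for r in rows
    ]

def load(rows, conn):
    """Load: store the cleaned rows in a database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL, customer TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT customer FROM orders ORDER BY id").fetchall())
# → [('Ada',), ('Alan',)]
```

Keeping each stage as a separate function makes the pipeline easy to test in isolation and to swap out (e.g., replacing the in-memory source with an API call).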