
Data Quality

Updated May 12, 2023

Overview

This design phase assesses the quality of the available data and determines how to collect and process it.

Data Quality

Data quality measures how well data serves its purpose.

  • Poor-quality data leads to inaccurate predictions.
  • It is evaluated along four dimensions: accuracy, completeness, consistency, and timeliness.

Data Quality Dimensions

A good dataset should meet the following criteria (a small quality-check sketch follows the list):

  • Accuracy

    • Data should reflect reality.
    • Example: A customer’s age recorded as 18 when they are actually 32 is inaccurate.
  • Completeness

    • Missing data should be minimal.
    • Example: A customer’s last name is missing.
  • Consistency

    • Data should be uniform across sources.
    • Example: One department defines "active customer" differently than another.
  • Timeliness

    • Data should be available when needed.
    • Example: Daily order updates are not real-time.
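
As a rough illustration, the sketch below computes simple proxies for three of these dimensions on a small, hypothetical customer table using pandas. The column names, the values, and the 0-120 plausible age range are assumptions for the example, not a standard.

```python
import pandas as pd

# Hypothetical customer table; all columns and values are made up.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [18, 45, 150, 32],  # 150 is an implausible value
    "last_name": ["Smith", None, "Lee", None],
    "updated_at": pd.to_datetime(
        ["2023-05-10", "2023-05-11", "2023-05-12", "2023-05-12"]
    ),
})

# Completeness: share of non-missing values per column.
completeness = df.notna().mean()

# Accuracy (proxy): flag values outside a plausible range.
implausible_ages = df[(df["age"] < 0) | (df["age"] > 120)]

# Timeliness: how stale is the most recent record relative to "now"?
staleness = pd.Timestamp("2023-05-13") - df["updated_at"].max()

print(completeness)
print(implausible_ages)
print(staleness)
```

Consistency is harder to check in isolation, since it means comparing definitions and values across sources, for example joining two tables on customer_id and diffing their status flags.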

Examples

Below are some examples:

| Dimension | Sample Question | Example of Low Quality |
|---|---|---|
| Accuracy | Does our data correctly describe the customer? | The customer's age in the data is 18, but is actually 32. |
| Completeness | Is any customer data missing? | For 80% of the customers, we don't have a last name. |
| Consistency | Is the definition of a customer the same across the company? | The customer is marked active in one database but inactive in another. |
| Timeliness | When is the customer order data available? | Orders sync at the end of the day but are not available in real time. |

If data quality is low in one or more dimensions, we can (an imputation sketch follows the list):

  • Collect more data
  • Fill missing values
  • Standardize definitions
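
For the second option, here is a minimal sketch of filling missing values with pandas, assuming a hypothetical table with last_name and age columns; median and placeholder imputation are only one strategy among many.

```python
import pandas as pd

# Hypothetical table with gaps in both a numeric and a text column.
df = pd.DataFrame({
    "last_name": ["Smith", None, "Lee", None],
    "age": [18.0, None, 32.0, 27.0],
})

# Fill missing numeric values with the column median, and missing
# text values with an explicit placeholder.
df["age"] = df["age"].fillna(df["age"].median())
df["last_name"] = df["last_name"].fillna("unknown")
```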

Data Ingestion

Once we have quality data, we need a way to collect and process it. Automated data pipelines help extract, process, and store data efficiently using the ETL process (a minimal sketch follows the list):

  • Extract: Get data from sources.
  • Transform: Convert it into the right format.
  • Load: Store it in a database.
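
A minimal ETL sketch, assuming the source is a CSV file named orders.csv, the target is a local SQLite database, and the data has an order_date column; all three names are hypothetical.

```python
import sqlite3

import pandas as pd

# Extract: read raw data from a source (a hypothetical CSV file).
raw = pd.read_csv("orders.csv")

# Transform: convert it into the right format, e.g. normalize
# column names and parse dates.
raw.columns = [c.strip().lower() for c in raw.columns]
raw["order_date"] = pd.to_datetime(raw["order_date"])  # hypothetical column

# Load: store the result in a database table.
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("orders", conn, if_exists="replace", index=False)
```

In a production pipeline, these same three steps would typically run on a scheduler or orchestrator rather than as a one-off script.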