This article defines data quality, describes its dimensions, and shows how to quickly implement validations with Python libraries.
In the current digital environment, the amount of available data is overwhelming. However, the true cornerstone for making informed decisions lies in the quality of this data. In this article, we will explore the crucial importance of data quality, analyzing the inherent challenges that organizations face in managing information. Although often overlooked, data quality plays a fundamental role in the reliability and usefulness of the information that underpins our strategic decisions.
Data quality
Data quality measures how well a dataset complies with the criteria of accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose, and is fundamental for all data governance initiatives within an organization. Data quality standards ensure that companies make decisions based on data to achieve their business objectives.
source: IBM
source: DataCamp cheat sheet
The following table highlights the various dimensions of data quality, from accuracy to fitness, providing an essential guide for assessing and enhancing the robustness of datasets:
Dimensions | Description |
---|---|
🎯 Accuracy | Data accuracy, or how close data is to reality or truth. Accurate data is that which faithfully reflects the information it seeks to represent. |
🧩 Completeness | Measures how much of the expected data is present. A complete dataset is one that has no missing values or significant gaps. Completeness is crucial for gaining a comprehensive and accurate understanding. |
✅ Validity | Indicates whether the data conforms to defined rules and standards. Valid data complies with the established constraints and criteria for a specific dataset. |
🔄 Consistency | Refers to the uniformity of data over time and across different datasets. Consistent data does not exhibit contradictions or discrepancies when compared with each other. |
📇 Uniqueness | Evaluates whether the data is free of duplicates. Unique data ensures that each entity or element is represented only once in a dataset. |
⌛ Timeliness | Refers to how current the data is. Timely information is that which is available when needed, without unnecessary delays. |
🏋️ Fitness | This aspect evaluates the relevance and usefulness of data for the intended purpose. Data should be suitable and applicable to the specific objectives of the organization or analysis being conducted. |
Next, we provide an example where some issues with an e-commerce-based use case can be observed.
Transaction ID | Customer ID | Product | Quantity | Unit Price | Total |
---|---|---|---|---|---|
⚪ 1 | 10234 | Laptop HP | 1 | $800 | $800 |
🟣 2 | | Wireless Headphones | 2 | $50 | $100 |
🔵 3 | 10235 | Smartphone | -1 | $1000 | -$1000 |
🟢 4 | 10236 | Wireless Mouse | 3 | $30 | $90 |
🟢 4 | 10237 | Wireless Keyboard | 2 | $40 | $80 |
🟣 Row 2 (Completeness): The customer ID is missing, so the row is incomplete and the transaction cannot be traced back to a specific customer.
🔵 Row 3 (Accuracy and Consistency): The quantity is negative, which cannot reflect a real purchase (accuracy) and contradicts the values expected in a transaction dataset (consistency).
🟢 Row 4 (Uniqueness): The introduction of a second row with the same transaction ID (Transaction ID = 4) violates the uniqueness principle. Each transaction should have a unique identifier, and having two rows with the same Transaction ID creates duplicates, affecting the uniqueness of transactions.
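The three issues above can also be detected programmatically. The following is a minimal sketch in plain pandas, assuming the table is loaded into a DataFrame; the snake_case column names and the rule names (`missing_customer`, `non_positive_quantity`, `duplicate_transaction`) are illustrative choices made here, not part of the original example.

```python
import pandas as pd

# The e-commerce table from the example, with hypothetical column names
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 4],
    "customer_id": [10234, None, 10235, 10236, 10237],
    "product": ["Laptop HP", "Wireless Headphones", "Smartphone",
                "Wireless Mouse", "Wireless Keyboard"],
    "quantity": [1, 2, -1, 3, 2],
    "unit_price": [800, 50, 1000, 30, 40],
})

# One boolean column per rule; True means the row violates the rule
issues = pd.DataFrame({
    # Completeness: the customer ID must be present
    "missing_customer": transactions["customer_id"].isna(),
    # Accuracy / consistency: a purchase cannot have a quantity <= 0
    "non_positive_quantity": transactions["quantity"] <= 0,
    # Uniqueness: keep=False flags every row that shares a transaction ID
    "duplicate_transaction": transactions["transaction_id"].duplicated(keep=False),
})

# Transaction IDs of the rows that break at least one rule
flagged = transactions.loc[issues.any(axis=1), "transaction_id"].tolist()
print(flagged)  # [2, 3, 4, 4]
```

Keeping one boolean column per rule makes it easy to see which dimension each row fails, which is the same idea the validation libraries below implement in a more complete way.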
The following are some of the Python libraries that can be used to perform data quality validations:
Framework | Description |
---|---|
Great Expectations | Great Expectations is an open-source library for data validation. It enables the definition, documentation, and validation of expectations about data, ensuring quality and consistency in data science and analysis projects |
Pandera | Pandera is a data validation library for data structures in Python, specifically designed to work with pandas DataFrames. It allows you to define schemas and validation rules to ensure data conformity |
Dora | Dora is a Python library designed to automate data exploration and perform exploratory data analysis. |
Let’s analyze some of the metrics that can be observed in their GitHub repositories, taking into account that the metrics were obtained on 2023-11-12.
Metrics | Great Expectations | Pandera | Dora |
---|---|---|---|
👥 Members | 399 | 109 | 106 |
⚠️ Issues: Open | 112 | 273 | 1 |
🟢 Issues: Closed | 1642 | 419 | 7 |
⭐ Stars | 9000 | 2700 | 623 |
📺 Watching | 78 | 17 | 42 |
🔎 Forks | 1400 | 226 | 63 |
📬 Open PR | 43 | 19 | 0 |
🐍 Python version | >=3.8 | >=3.7 | Not specified |
📄 Number of versions | 233 | 76 | 3 |
📄 Latest version | 0.18.2 | 0.17.2 | 0.0.3 |
📆 Latest version date | 9 Nov 2023 | 30 Sep 2023 | 30 Jun 2020 |
📄 License type | Apache-2.0 | MIT | MIT |
Notification of Changes:
Apache 2.0: Requires notification of changes made to the source code when distributing the software.
MIT: Does not require specific notification of changes.
Compatibility:
Apache 2.0: Includes an explicit patent grant and is compatible with GPLv3, but not with GPLv2.
MIT: Compatible with most other licenses, including GPLv2 and GPLv3, but contains no explicit patent grant.
Attribution:
Apache 2.0: Requires attribution and the inclusion of a copyright notice.
MIT: Requires attribution to the original authorship but may have less strict requirements in terms of how that attribution is displayed.
Considering these currently analyzed metrics, let’s proceed with an example implementation using Pandera and Great Expectations.
For the development of this example, we will use the dataset named ‘tips’. You can download the dataset from the following link.
The ‘tips’ dataset contains information about tips given in a restaurant, along with details about the total bill, the gender of the person who paid the bill, whether the customer is a smoker, the day of the week, and the meal’s time.
Column | Description |
---|---|
total_bill | The total amount of the bill (including the tip). |
tip | The amount of tip given. |
sex | The gender of the bill payer (male or female). |
smoker | Whether the customer is a smoker or not. |
day | The day of the week when the meal took place. |
time | The time of day (lunch or dinner). |
size | The size of the group that shared the meal. |
Below is a table with the first 5 rows of the dataset:
total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|
16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Next, we will provide an example of implementing Pandera using the dataset described earlier.
pip install pandas pandera
Import pandas and pandera
import pandas as pd
import pandera as pa
Load the CSV file into a DataFrame
path = 'data/tips.csv'
data = pd.read_csv(path)
print(f"Number of columns: {data.shape[1]}, Number of rows: {data.shape[0]}")
print(f"Column names: {list(data.columns)}")
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null object
3 smoker 244 non-null object
4 day 244 non-null object
5 time 244 non-null object
6 size 244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB
Now, let’s create the schema object that contains all the validations we want to perform.
You can find additional validations that can be performed at the following link: <https://pandera.readthedocs.io/en/stable/dtype_validation.html>
schema = pa.DataFrameSchema({
    "total_bill": pa.Column(float, checks=pa.Check.le(50)),
    "tip"       : pa.Column(float, checks=pa.Check.between(0,30)),
    "sex"       : pa.Column(str, checks=[pa.Check.isin(['Female','Male'])]),
    "smoker"    : pa.Column(str, checks=[pa.Check.isin(['No','Yes'])]),
    "day"       : pa.Column(str, checks=[pa.Check.isin(['Sun','Sat'])]),
    "time"      : pa.Column(str, checks=[pa.Check.isin(['Dinner','Lunch'])]),
    "size"      : pa.Column(int, checks=[pa.Check.between(1,4)])
})
# lazy=True collects every failed check instead of stopping at the first one
try:
    schema.validate(data, lazy=True)
except pa.errors.SchemaErrors as e:
    print(e)
    error = e
Schema None: A total of 3 schema errors were found.
Error Counts
------------
- SchemaErrorReason.SCHEMA_COMPONENT_CHECK: 3
Schema Error Summary
--------------------
schema_context column check failure_cases n_failure_cases
Column day isin(['Sun', 'Sat']) [Thur, Fri] 2
size in_range(1, 4) [5, 6] 2
total_bill less_than_or_equal_to(50) [50.81] 1
def get_errors(error, dtype_dict=True):
    # Summarize each schema error: column, failed check, and failing rows
    response = []
    for error_item in error.schema_errors:
        response.append(
            {'column'     : error_item.schema.name,
             'check_error': error_item.schema.checks[0].error,
             'num_cases'  : error_item.failure_cases.index.shape[0],
             'check_rows' : error_item.failure_cases.to_dict()
            })
    # Return a list of dictionaries or a DataFrame, depending on the flag
    if dtype_dict:
        return response
    else:
        return pd.DataFrame(response)

get_errors(error, dtype_dict=True)
[{'column': 'total_bill',
'check_error': 'less_than_or_equal_to(50)',
'num_cases': 1,
'check_rows': {'index': {0: 170}, 'failure_case': {0: 50.81}}},
{'column': 'day',
'check_error': "isin(['Sun', 'Sat'])",
'num_cases': 81,
'check_rows': {'index': {0: 77,
1: 78,
2: 79,
3: 80,
4: 81,
5: 82,
6: 83,
7: 84,
...
5: 156,
6: 185,
7: 187,
8: 216},
'failure_case': {0: 6, 1: 6, 2: 5, 3: 6, 4: 5, 5: 6, 6: 5, 7: 5, 8: 5}}}]
Great Expectations is an open-source Python-based library for validating, documenting, and profiling your data. It helps maintain data quality and improve communication about data across teams.
source : <https://docs.greatexpectations.io/docs/>
Therefore, we can describe Great Expectations as an open source tool designed to guarantee the quality and reliability of data in various sources, such as databases, tables, files and dataframes. Its operation is based on the creation of validation groups that specify the expectations or rules that the data must comply with.
The following are the steps that we must define when using this framework:
Definition of Expectations: Specify the expectations you have for the data. These expectations can include simple constraints, such as value ranges, or more complex rules about data coherence and quality.
Connecting to Data Sources: In this step, define the connections you need to make to various data sources, such as databases, tables, files, or dataframes.
Generation of Validation Suites: Based on the defined expectations, Great Expectations generates validation suites, which are organized sets of rules to be applied to the data.
Execution of Validations: Validation suites are applied to the data to verify if they meet the defined expectations. This can be done automatically in a scheduled workflow or interactively as needed.
Generation of Analysis and Reports: Great Expectations provides advanced analysis and reporting capabilities. This includes detailed data quality profiles and reports summarizing the overall health of the data based on expectations.
Alerts and Notifications: If the data does not meet the defined expectations, Great Expectations can generate alerts or notifications, allowing users to take immediate action to address data quality issues.
Overall, Great Expectations offers a comprehensive solution to ensure data quality over time, facilitating early detection of problems and providing confidence in the integrity and usefulness of data used in analysis and decision-making.
!pip install great_expectations==0.17.22 seaborn matplotlib numpy pandas
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import re
import great_expectations as gx
from ruamel.yaml import YAML
from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource
from great_expectations.core.expectation_configuration import ExpectationConfiguration
print(f"* great expectations version:{gx.__version__}")
print(f"* seaborn version:{sns.__version__}")
print(f"* numpy version:{np.__version__}")
print(f"* pandas:{pd.__version__}")
* great expectations version:0.17.22
* seaborn version:0.13.0
* numpy version:1.26.1
* pandas:2.1.3
path = 'data/tips.csv'
data_gx = gx.read_csv(path)
# List every expectation method available on the dataset object
list_expectations = pd.DataFrame([item for item in dir(data_gx) if item.find('expect_')==0],columns=['expectation'])
# Classify each expectation by the kind of object it applies to
list_expectations['expectation_type'] = np.select(
    [
        list_expectations.expectation.str.find('_table_')>0,
        list_expectations.expectation.str.find('_column_')>0,
        list_expectations.expectation.str.find('_multicolumn_')>0,
    ],
    ['table','column','multicolumn'],
    default='other'
)
plt.figure(figsize=(20,6))
sns.countplot(x=list_expectations.expectation_type)
plt.show()
In the resulting plot, it can be observed that the available expectations are mainly applied to columns (for example: `expect_column_max_to_be_between`) and tables (for example: `expect_table_columns_to_match_set`), although an expectation based on the values of multiple columns can also be applied (for example: `expect_multicolumn_values_to_be_unique`).
# The following list contains the columns that the dataframe must have:
columns = ['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']
data_gx.expect_table_columns_to_match_set(column_set = columns)
{
"success": true,
"result": {
"observed_value": [
"total_bill",
"tip",
"sex",
"smoker",
"day",
"time",
"size"
]
},
"meta": {},
"exception_info": {
"raised_exception": false,
"exception_traceback": null,
"exception_message": null
}
}
# Now, we remove two columns, 'time' and 'size', from the expected list to validate the outcome
columns = ['total_bill', 'tip', 'sex', 'smoker', 'day']
data_gx.expect_table_columns_to_match_set(column_set = columns)
The result is now false, and the details list the columns that the dataframe contains in addition to those expected.
{"success": false,
"result": {
"observed_value": [
"day",
"sex",
"size",
"smoker",
"time",
"tip",
"total_bill"
],"details": {
"mismatched": {
"unexpected": [
"size",
"time"
]
}
}
},"meta": {},
"exception_info": {
"raised_exception": false,
"exception_traceback": null,
"exception_message": null
} }
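To make the "unexpected" entry in the result above easier to read, here is a minimal sketch of the comparison in plain Python sets. It is not the Great Expectations implementation, only an illustration of the logic under the assumption that an exact match between both sets is required; the variable names are chosen here for the example.

```python
# Columns found in the dataframe and columns listed in the expectation
observed = ['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']
expected = ['total_bill', 'tip', 'sex', 'smoker', 'day']

# Columns present in the data but not expected, and the other way around
unexpected = sorted(set(observed) - set(expected))
missing = sorted(set(expected) - set(observed))

# The check succeeds only when both differences are empty
success = not unexpected and not missing

print(success)     # False
print(unexpected)  # ['size', 'time']
print(missing)     # []
```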
data_gx['total_bill_group'] = pd.cut(data_gx['total_bill'],
                              bins=[0,10,20,30,40,50,float('inf')],
                              labels=['0-10', '10-20', '20-30', '30-40', '40-50', '>50'],
                              right=False,
                              include_lowest=True)
# Now, let's validate if 3 categories exist within the dataset
data_gx.expect_column_distinct_values_to_contain_set(column='total_bill_group',
                              value_set=['0-10','10-20', '20-30'],
                              result_format='BASIC')
{"success": true,
"result": {
"observed_value": [
"0-10",
"10-20",
"20-30",
"30-40",
"40-50",
">50"
],"element_count": 244,
"missing_count": null,
"missing_percent": null
},"meta": {},
"exception_info": {
"raised_exception": false,
"exception_traceback": null,
"exception_message": null
} }
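The binning and the containment check above can be reproduced on a few values with plain pandas. This is only a sketch of the idea, not how Great Expectations evaluates the expectation internally; the sample bills are made-up values chosen to fall on and around the bin edges.

```python
import pandas as pd

# Hypothetical bills chosen to show how the bin edges behave
bills = pd.Series([9.99, 10.0, 25.5, 50.81])

# right=False makes each bin closed on the left: [0, 10), [10, 20), ...
groups = pd.cut(bills,
                bins=[0, 10, 20, 30, 40, 50, float('inf')],
                labels=['0-10', '10-20', '20-30', '30-40', '40-50', '>50'],
                right=False)

# The expectation succeeds when every required value appears in the column;
# additional observed values (such as '>50') do not make it fail
required = {'0-10', '10-20', '20-30'}
observed = set(groups.astype(str))
contains_all = required.issubset(observed)

print(groups.astype(str).tolist())  # ['0-10', '10-20', '20-30', '>50']
print(contains_all)                 # True
```

Note that a bill of exactly 10.0 falls into the '10-20' group because the bins are closed on the left.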
data_gx.expect_column_values_to_not_be_null('sex')
{"success": true,
"result": {
"element_count": 244,
"unexpected_count": 0,
"unexpected_percent": 0.0,
"unexpected_percent_total": 0.0,
"partial_unexpected_list": []
},"meta": {},
"exception_info": {
"raised_exception": false,
"exception_traceback": null,
"exception_message": null
} }
Now, let’s generate a Great Expectations project to run a group of validations based on one or more datasets.
!yes Y | great_expectations init
___ _ ___ _ _ _
/ __|_ _ ___ __ _| |_ | __|_ ___ __ ___ __| |_ __ _| |_(_)___ _ _ ___
| (_ | '_/ -_) _` | _| | _|\ \ / '_ \/ -_) _| _/ _` | _| / _ \ ' \(_-<
\___|_| \___\__,_|\__| |___/_\_\ .__/\___\__|\__\__,_|\__|_\___/_||_/__/
|_|
~ Always know what to expect from your data ~
Let's create a new Data Context to hold your project configuration.
Great Expectations will create a new directory with the following structure:
great_expectations
|-- great_expectations.yml
|-- expectations
|-- checkpoints
|-- plugins
|-- .gitignore
|-- uncommitted
|-- config_variables.yml
|-- data_docs
|-- validations
OK to proceed? [Y/n]:
================================================================================
Congratulations! You are now ready to customize your Great Expectations configuration.
You can customize your configuration in many ways. Here are some examples:
Use the CLI to:
- Run `great_expectations datasource new` to connect to your data.
- Run `great_expectations checkpoint new <checkpoint_name>` to bundle data with Expectation Suite(s) in a Checkpoint for later re-validation.
- Run `great_expectations suite --help` to create, edit, list, profile Expectation Suites.
- Run `great_expectations docs --help` to build and manage Data Docs sites.
Edit your configuration in great_expectations.yml to:
- Move Stores to the cloud
- Add Slack notifications, PagerDuty alerts, etc.
- Customize your Data Docs
Please see our documentation for more configuration options!
!cp -r data gx
# Let's print the contents of the folder
def print_directory_structure(directory_path, indent=0):
    current_dir = os.path.basename(directory_path)
    print(" |" + " " * indent + f"-- {current_dir}")
    indent += 1
    # Recurse into subfolders and print file names at the current depth
    with os.scandir(directory_path) as entries:
        for entry in entries:
            if entry.is_dir():
                print_directory_structure(entry.path, indent)
            else:
                print(" |" + " " * indent + f"-- {entry.name}")

print_directory_structure('gx')
|-- gx
| -- great_expectations.yml
| -- plugins
| -- custom_data_docs
| -- renderers
| -- styles
| -- data_docs_custom_styles.css
| -- views
| -- checkpoints
| -- expectations
| -- .ge_store_backend_id
| -- profilers
| -- .gitignore
| -- data
| -- tips.csv
| -- uncommitted
| -- data_docs
| -- config_variables.yml
| -- validations
| -- .ge_store_backend_id
Here are some clarifications about the files and folders generated in this directory:
Files/Folders | Description |
---|---|
📄 great_expectations.yml | This file contains the main configuration of the project. Details such as storage locations and other configuration parameters are specified here. |
📂 plugins | Contains the custom_data_docs folder, with the renderers, styles, and views used to customize Data Docs. |
📂 checkpoints | This folder contains the definitions of checkpoints, which are points in the data flow where specific validations can be performed. |
📂 expectations | This is where the expectations defined for the data are stored. This directory may contain various subfolders and files, depending on the project’s organization. |
📂 profilers | It can contain configurations for data profilers, which produce detailed analyses of data statistics. |
📄 .gitignore | A Git configuration file that specifies files and folders to be ignored when tracking and committing changes. |
📂 data | It contains the data used in the project, in this case, the file tips.csv. |
📂 uncommitted | Contains the data_docs and validations folders and the config_variables.yml file. |
Configuration of datasource and data connectors:
DataSource: It is the data source used (can be a file, API, database, etc.).
Data Connectors: These are the connectors that facilitate the connection to data sources and where access credentials, location, etc., should be defined.
datasource_name_file = 'tips.csv'
datasource_name = 'datasource_tips'
dataconnector_name = 'connector_tips'
# Let's create the configuration for the datasource
context = gx.data_context.DataContext()
my_datasource_config = f"""
name: {datasource_name}
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  {dataconnector_name}:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: data
    default_regex:
      group_names:
        - data_asset_name
      pattern: (.*)
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    assets:
      my_runtime_asset_name:
        batch_identifiers:
          - runtime_batch_identifier_name
"""
yaml = YAML()
context.add_datasource(**yaml.load(my_datasource_config))
sanitize_yaml_and_save_datasource(context, my_datasource_config, overwrite_existing=True)
In the following code snippet, the configuration of three expectations is presented.
In particular, the last one includes a parameter called ‘mostly’ with a value of 0.75. This parameter indicates that the expectation can fail in up to 25% of cases, as by default, 100% compliance is expected unless specified otherwise.
Additionally, an error message can be specified in markdown format, as shown in the last expectation.
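To make the arithmetic behind ‘mostly’ concrete, the following sketch reproduces the idea for a not-null check in plain pandas. It is only an illustration under the assumption that the check passes when the fraction of valid values is at least `mostly`; the function `passes_mostly` and the sample values are hypothetical and not part of Great Expectations.

```python
import pandas as pd

def passes_mostly(series, mostly=1.0):
    """Return (success, fraction of non-null values) for a not-null check."""
    non_null_fraction = float(series.notna().mean())
    return non_null_fraction >= mostly, non_null_fraction

# Hypothetical 'size' values: one null out of four rows
size_one_null = pd.Series([2, 3, None, 4])
print(passes_mostly(size_one_null, mostly=0.75))  # (True, 0.75)

# Two nulls out of four rows: only 50% of the values are valid
size_two_nulls = pd.Series([2, None, None, 4])
print(passes_mostly(size_two_nulls, mostly=0.75))  # (False, 0.5)
```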
expectation_configuration_table = ExpectationConfiguration(
    expectation_type="expect_table_columns_to_match_set",
    kwargs={
        "column_set": ['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']
    },
    meta={}
)
expectation_configuration_total_bill = ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_between",
    kwargs={
        "column": "total_bill",
        "min_value": 0,
        "max_value": 100
    },
    meta={}
)
expectation_configuration_size = ExpectationConfiguration(
    expectation_type="expect_column_values_to_not_be_null",
    kwargs={
        "column": "size",
        "mostly": 0.75,
    },
    meta={
        "notes": {
            "format": "markdown",
            "content": "Expectation to validate column `size` does not have null values."
        }
    }
)
expectation_suite_name = "tips_expectation_suite"
expectation_suite = context.create_expectation_suite(
    expectation_suite_name=expectation_suite_name,
    overwrite_existing=True
)
# Add expectations
expectation_suite.add_expectation(expectation_configuration=expectation_configuration_table)
expectation_suite.add_expectation(expectation_configuration=expectation_configuration_total_bill)
expectation_suite.add_expectation(expectation_configuration=expectation_configuration_size)
# save expectation_suite
context.save_expectation_suite(expectation_suite=expectation_suite,
                               expectation_suite_name=expectation_suite_name)
data-quality/gx/expectations/tips_expectation_suite.json
Within the ‘expectations’ folder, a JSON file is created with all the expectations generated earlier.
checkpoint_name = 'tips_checkpoint'
config_checkpoint = f"""
name: {checkpoint_name}
config_version: 1
class_name: SimpleCheckpoint
expectation_suite_name: {expectation_suite_name}
validations:
  - batch_request:
      datasource_name: {datasource_name}
      data_connector_name: {dataconnector_name}
      data_asset_name: {datasource_name_file}
      batch_spec_passthrough:
        reader_method: read_csv
        reader_options:
          sep: ","
      data_connector_query:
        index: -1
    expectation_suite_name: {expectation_suite_name}
"""
# Validate if the YAML structure is correct
context.test_yaml_config(config_checkpoint)
# Add the checkpoint to the generated context
context.add_checkpoint(**yaml.load(config_checkpoint))
response = context.run_checkpoint(checkpoint_name=checkpoint_name)
response.to_json_dict()
{'run_id': {'run_name': None, 'run_time': '2023-11-12T20:39:23.346946+01:00'},
 'run_results': {'ValidationResultIdentifier::tips_expectation_suite/__none__/20231112T193923.346946Z/722b2e93e32fd7222c8ad9339f3e0e1d': {'validation_result': {'success': True,
'results': [{'success': True,
'expectation_config': {'expectation_type': 'expect_table_columns_to_match_set',
'kwargs': {'column_set': ['total_bill',
'tip',
'sex',
'smoker',
'day',
'time',
'size'],
'batch_id': '722b2e93e32fd7222c8ad9339f3e0e1d'},
'meta': {}},
'result': {'observed_value': ['total_bill',
'tip',
'sex',
'smoker',
'day',
'time',
'size']},
'meta': {},
'exception_info': {'raised_exception': False,
'exception_traceback': None,
'exception_message': None}},
'success': True,
{
...'notify_on': None,
'default_validation_id': None,
'site_names': None,
'profilers': []},
'success': True}
context.open_data_docs()
By executing this code chunk, an HTML file with the results of the validations will open at
gx/uncommitted/data_docs/local_site/validations/tips_expectation_suite/__none__/20231112T192529.002401Z/722b2e93e32fd7222c8ad9339f3e0e1d.html
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Mendez (2023, Nov. 12). Romina Mendez: Data Quality. Retrieved from https://r0mymendez.github.io/posts_en/2023-11-17-data-quality/
BibTeX citation
@misc{mendez2023data, author = {Mendez, Romina}, title = {Romina Mendez: Data Quality}, url = {https://r0mymendez.github.io/posts_en/2023-11-17-data-quality/}, year = {2023} }