ADQuaTe: An Automated Data Quality Test Approach for Constraint Discovery and Fault Detection

Opportunity

Available for Licensing

IP Status

US Utility Patent Pending (Not Yet Published)

Inventors

Hajar Homayouni
Sudipto Ghosh
Indrakshi Ray

At A Glance

Researchers at Colorado State University have developed ADQuaTe, an automated data quality test approach that uses an unsupervised machine learning technique to discover constraints that may have been missed by experts. ADQuaTe also marks records that violate the discovered constraints as suspicious and explains the violations.

For more detail, please contact our office.

Licensing Director

Mandana Ashouri
Mandana.Ashouri@colostate.edu
970-491-7100

Reference No.:  19-100

Background

Enterprises use databases and data warehouses to store, manage, and query data for making critical decisions.  Records can become corrupted due to how the data is collected, transformed, and managed, in addition to malicious activities.  Incorrect records may violate constraints pertaining to the attributes and records, while inaccurate data can lead to incorrect decisions.  Thus, rigorous data quality testing approaches are required to ensure that the data is correct.

Traditionally, data quality tests validate the data to check for violations of syntactic and semantic constraints.  Syntactic constraint validations check for the conformance of an attribute with the structural specifications in the data model.  For example, in a health data store, patient age must take numeric values.  Semantic constraint validations check for the conformance of the attribute values with the specifications stated by domain experts.  Semantic constraints can exist over single attributes (e.g., patient age >= 0) or multiple attributes (e.g., pregnancy status = true → patient gender = female).
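The two kinds of validation described above can be illustrated with a minimal sketch. This is not ADQuaTe's implementation; it only hand-codes the three example constraints from the paragraph above. The record layout (`age`, `gender`, `pregnant`), the constraint labels, and the `violations` helper are all hypothetical names chosen for illustration.

```python
from numbers import Real

def is_numeric(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)

# Hypothetical, hand-written constraints mirroring the examples in the text.
# Each check returns True when the record satisfies the constraint.
CONSTRAINTS = {
    # Syntactic: the attribute conforms to the data model (age is numeric).
    "syntactic: age is numeric": lambda r: is_numeric(r["age"]),
    # Semantic, single attribute: only evaluated when age is numeric.
    "semantic: age >= 0": lambda r: not is_numeric(r["age"]) or r["age"] >= 0,
    # Semantic, multiple attributes: pregnant implies gender = female.
    "semantic: pregnant -> gender = female":
        lambda r: (not r["pregnant"]) or r["gender"] == "female",
}

def violations(record):
    """Return the labels of all constraints that the record violates."""
    return [name for name, check in CONSTRAINTS.items() if not check(record)]

# Made-up sample records, for illustration only.
records = [
    {"age": 34, "gender": "female", "pregnant": True},
    {"age": -2, "gender": "male", "pregnant": False},
    {"age": "n/a", "gender": "male", "pregnant": True},
]

for rec in records:
    print(rec, "->", violations(rec) or "no violations")
```

Every constraint in this sketch had to be written by hand, which is the step where a domain expert can overlook a rule.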

These data quality tests rely on the specification of constraints, which are typically defined by domain experts, who may miss important constraints. Tools that automatically generate syntactic constraints also exist, but they only check for trivial ones, such as the not-null check. Existing machine learning-based approaches can automatically discover non-trivial semantic constraints from the data and report the faulty records as outliers, but they do not explain which constraints those records violate.

There is a true need for better data quality systems to ensure data accuracy.

Benefits
  • Automates the constraint discovery process to capture essential constraints in any input data set
  • Machine learning techniques identify which constraints the detected faulty records violate
  • Interactive learning techniques incorporate domain knowledge in the constraint discovery and fault detection phases to avoid false alarms
  • Easy-to-use program that does not require a programming background
  • Does not require prior knowledge to detect faulty records in data stores
  • Has been evaluated on several real-world databases (e.g., a health data warehouse and a plant diagnosis database)
Applications
  • Enterprises that need data quality assurance mechanisms to ensure the correctness of their data and the accuracy of data-based decisions
Publications or References

H. Homayouni, S. Ghosh, and I. Ray, “ADQuaTe: An Automated Data Quality Test Approach for Constraint Discovery and Fault Detection,” 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), Los Angeles, CA, USA, 2019, pp. 61-68.

Last updated: March 2020
