Data Quality, lightning fast

New outlier detection algorithms

The dino repository was updated and has three additional algorithms for outlier detection: HBOS, k-NN and Soda. In addition, ledc now offers support for KD-trees to make searching in metric spaces more efficient in high volume datasets. Read all about it in the documentation.

Author image Antoon Bronselaer


We have a new tool in the ledc framework! Rulebox, a toolbox for rules, combines the powers of many of the ledc repositories in one single place. Here are the main features at a glance:

  • It offers exploration of a dataset, from simple statistics to algorithms for finding outliers and generation or rules. Currently, we support Postgres databases and csv files.

  • There is support for encoding validations with regexes and grexes, sigma rules and functional dependencies in a simple text-based file format (.rbx). Once you have such a file, you can immediately generate simple reports on where constraints are violated.

  • It can communicate with instances of ledc server to find information about validation.

  • A wide range of repair engines to produce corrections of a dataset are at the tips of your fingers.

  • There is support for a wide range of reasoning steps on constraints, including database normalization and Field Code Forest (FCF) variable elimination.

Code, documentation and license information can be found in the rulebox repository on gitlab. A precompiled jar file is available in the package registry.

Author image Antoon Bronselaer

Sigma rules and association analysis

Association analysis is a tool that is often used in data quality applications, for example to spot highly correlated patterns in data. In a recent study, we investigated the connection between association rules and sigma rules in a dynamic setting. That is, we search for association rules, use the set-cover method to repair violations of the rules we found and repeat this until no more violations are found. The full publication is available as an Open Access article and code is part of ledc-dino Some findings of this study are the following:

  • You need to have high certainty that rules you find, are correct. That means, violations of the rule are factual errors and not just rare cases. If not, the accuracy of the repeated search and repair proces quickly deteriorates. If you find correct rules with high precision, then there is great benefit in the dynamic approach as far more errors are detected. We’ve shown a number of methods to spot the difference between rare cases and errors.

  • Association rules are positive IF … THEN… rules and their expressivity is therefore bounded. A pleasant consequence is that variable eleminination (a fundamental step in the construction of a sufficient set), turns out to be quadratic in many practical cases and we can easily recognize those cases easily.

  • We observed that association rules as a method to detect errors in a dataset are complementary to other approaches for automated error detection.

Author image Antoon Bronselaer

Use case: the disbiome database

sqlmerge is a simple CLI tool to merge (duplicate) rows in a relational database. The problem is simple, the solution is not (if you want to know why, continue reading here or see some examples here).

Very recently, we’ve used this tool to clean disbiome: a database that records associations of microbial composition changes with different kinds of diseases. This database is manually curated: researchers inspect new literature on disease-microbiome associations and input all relevant information that can be found in new papers. This includes information about diseases, detection methods, organisms, samples, control groups and of course experimental outcomes, which may be quantitative or qualitative. Over a thousand scientific publications have been added to this database, resulting in almost 11K experimental outcomes.

A while ago, we were contacted by a user to report some publication being added twice. As such, we started a deduplication procedure and found that about 40 publications were added twice to the database. Then, the real problem came up: if we merge two publications, then we must merge information about experiments, quality reports on the publications, sample information … On top of that, if information about two duplicate publications is conflicting we should resolve those conflicts in a correct manner. The bottom line is: we’re dealing with potential changes in tens of tables at the same time.

Here is where sqlmerge comes to the rescue! You tell it you want to merge two rows in the publication table and it will inspect the entire database to see in which other tables this has implications. If it spots any inconsistencies, it will first try to resolve them automatically (you can provide it with some resolution strategies). If this is not possible, user input is requested. This continues up to the point where all issues are resolved. The final outcome is a nicely organised SQL script that encodes all steps necessary to implement the desired merge operation. You can simply run this script on your database and … done.

To appreciate the abilities of this little tool, here’s a number: to correctly merge 40 publications, the tool produced a script of 1096 SQL statements.

Author image Antoon Bronselaer