ledc

Data Quality, lightning fast

ledc-sigma

Errors in data… You can be pretty sure they exist, but how do you locate them? And if you locate them, how do you make your data error-free in the best possible way?

These questions are certainly not easy to answer, but we’ve built a module in ledc to help you with them. This new module is called sigma and introduces a generic type of rule called a sigma-rule. In simple terms, a sigma-rule provides a concise way to describe rows in your data that are not permitted. For such rules, there exists an elegant and efficient method to find minimal repairs of the data: error-free corrections obtained by changing the original data only minimally. The sigma module provides several implementations of repair engines and a wide range of cost models to encode specific error patterns that might be present in your data. In particular, the powerful parker engine allows you to combine sigma-rules with key constraints.
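To make the idea concrete, here is a toy sketch of the two concepts. This is not the ledc-sigma API; the rule, the column names and the repair choice are invented purely for illustration. A sigma-rule flags rows that are not permitted, and a minimal repair changes as few cells as possible so that no rule matches any more.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Toy illustration (not the ledc-sigma API): a sigma-rule describes
// forbidden rows; a minimal repair changes as few cells as possible
// so that the rule no longer matches.
public class SigmaDemo {

    // Hypothetical rule: rows with age < 18 AND category = "adult" are not permitted.
    static Predicate<Map<String, String>> rule =
        row -> Integer.parseInt(row.get("age")) < 18
            && row.get("category").equals("adult");

    // Repair the row by changing a single cell. Changing one cell already
    // makes the rule no longer match, so no smaller change suffices.
    static Map<String, String> repair(Map<String, String> row) {
        if (!rule.test(row)) {
            return row;                        // already consistent, nothing to do
        }
        Map<String, String> fixed = new HashMap<>(row);
        fixed.put("category", "minor");        // one-cell change
        return fixed;
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("age", "15");
        row.put("category", "adult");
        System.out.println(rule.test(repair(row)));  // prints: false
    }
}
```

A real repair engine must of course decide *which* cells to change; that is exactly where the cost models mentioned above come in.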

Code, documentation, examples and license information can be found in the sigma repository on GitLab.

Antoon Bronselaer

ledc-sqlmerge

Merge two rows in a database. A simple problem, right?

In practice, it turns out this problem can be more complicated than you would think at first glance. When you dig deeper, the two main difficulties you will find are (a) avoiding loss of information and (b) maintaining consistency at all times.

The first problem arises because you must try to represent, in a single row, information previously held by two rows. This might not be possible (e.g., the rows you merge can differ too much), so choices must be made about which information to keep and which to discard. Once those choices are made, it is not so difficult to turn them into an update of one row and a delete of the other.

This leads to the second problem: updates and deletes change the database, and those changes can violate constraints that are present. Foreign key dependencies are especially troublesome in this regard, and propagating the changes made by merging two rows is a tedious task.

The ledc framework offers some help in solving this problem. The new ledc module sqlmerge is a simple tool that helps you produce an SQL script that models a merge in one table as well as all propagated changes in other tables. It ensures “consistency at all times” without making assumptions about constraint deferring. Code, documentation, examples and license information can be found in the sqlmerge repository on GitLab. If you just want to try sqlmerge on your database, you can download the executable in the registries section.
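The general shape of such a script can be sketched as follows. This is a hypothetical example, not sqlmerge’s actual output: the tables (a `person` table referenced by `orders`), the keys and the merge choices are all made up. The point is the ordering: references are repointed first, so the final delete never violates a foreign key and no constraint deferring is needed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not the sqlmerge API or output): merging two rows of
// "person" (primary key "id"), where "orders" references person(id).
public class MergeScript {

    static List<String> merge(int keepId, int dropId) {
        List<String> script = new ArrayList<>();
        // 1. Repoint all referencing rows to the surviving row,
        //    so the dropped row becomes unreferenced.
        script.add("UPDATE orders SET person_id = " + keepId
                 + " WHERE person_id = " + dropId + ";");
        // 2. Copy over any information kept from the dropped row
        //    (the choices about what to keep were made beforehand).
        script.add("UPDATE person SET email = COALESCE(email, "
                 + "(SELECT email FROM person WHERE id = " + dropId + "))"
                 + " WHERE id = " + keepId + ";");
        // 3. Only now is the dropped row safe to delete: nothing references it.
        script.add("DELETE FROM person WHERE id = " + dropId + ";");
        return script;
    }

    public static void main(String[] args) {
        merge(1, 2).forEach(System.out::println);
    }
}
```

Note that at every point between two statements the database satisfies all its constraints, which is the “consistency at all times” property described above.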

Antoon Bronselaer

ledc-core

In preparation for some more complicated modules, a core package for ledc has been released. This package provides a simple data model with a hierarchical structure similar to JSON and some SQL-like operations. It offers simple bindings from and to CSV files and Postgres databases. Code, documentation and license information can be found in the ledc-core repository on GitLab.

Antoon Bronselaer

ledc-pi

The first module of ledc has been published! Ledc-pi is built to measure the quality of individual data values, to find valid instances in free text and to convert between properties. It is a simple and highly expressive module for defining ordinal-scaled quality measures for individual columns in your dataset.
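As a rough illustration of what an ordinal-scaled quality measure means, consider the sketch below. It is not the ledc-pi API; the property (a Belgian postal code), the three levels and the validation logic are assumptions made for the example. The key idea is that each value is mapped to one of a small, ordered set of quality levels rather than a simple valid/invalid flag.

```java
import java.util.regex.Pattern;

// Illustrative sketch (not the ledc-pi API): an ordinal-scaled quality
// measure assigns each value of a column one of an ordered set of levels.
public class PostalCodeMeasure {

    // Ordinal scale, from lowest to highest quality.
    enum Level { INVALID, PLAUSIBLE, VALID }

    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Hypothetical measure for Belgian postal codes: VALID if in 1000-9999,
    // PLAUSIBLE if at least numeric, INVALID otherwise.
    static Level measure(String value) {
        if (!DIGITS.matcher(value).matches()) {
            return Level.INVALID;
        }
        int code = Integer.parseInt(value);
        return (code >= 1000 && code <= 9999) ? Level.VALID : Level.PLAUSIBLE;
    }

    public static void main(String[] args) {
        System.out.println(measure("9000"));   // prints: VALID (Ghent)
        System.out.println(measure("123"));    // prints: PLAUSIBLE
        System.out.println(measure("abc"));    // prints: INVALID
    }
}
```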

Code, documentation and license information can be found in the ledc-pi repository on GitLab. We are currently working on a use case in which the usage of ledc-pi is demonstrated in a real-world setting, so more news is coming soon.

Antoon Bronselaer

About

What

The Lightweight Engine for Data quality Control (ledc for short) is a simple and fast engine for monitoring and improving data quality. It is an initiative to bring innovations in data quality closer to practice. All modules are implemented in Java and require Java 8 or higher.

The ledc framework is not one big chunk of software, but is designed in a modular fashion. Each module is targeted at solving a specific data quality task. The main idea recurring in all modules is to keep a balance between expressiveness and simplicity. This makes it possible to build powerful tools that operate lightning fast on your data.

Who

Ledc is the result of academic research in the field of data quality performed at the DDCM lab of Ghent University. The repositories that contain the code also mention scientific publications where the ideas used in ledc are explained.

Contact

If you have a question about the code, suggestions for improvement, or require assistance in setting up ledc, please contact us at ledc@ugent.be.

Antoon Bronselaer