ledc

Data Quality, lightning fast

A ledc server

When we introduced ledc-pi a while ago, we wrote on the readme page how cumbersome it can be to find information about the validation of identifiers. The whole idea behind ledc-pi is to provide a simple framework in which such information can be communicated easily and interpreted by machines as well.

We have now set up a simple REST interface for ledc-pi and collected a first bundle of information on this little web server. The REST interface runs at ledc.ugent.be:8828. You can use it to test some validation steps and eventually fetch the necessary information to use locally. We’ll be expanding the number of identifiers in the future and we’re working on a client application too.
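To make the idea of machine-interpretable validation information concrete, here is a hedged sketch: the IdentifierSpec class and its layout are our own illustration, not the ledc-pi data model. A specification could pair a regular expression for the syntax with a checksum predicate for the semantics, shown here with the standard ISBN-10 rules.

```python
import re

# Hypothetical shape of a machine-interpretable identifier
# specification: a regular expression for the syntax plus a checksum
# predicate for the semantics. The ISBN-10 rules below are standard;
# the class itself is illustrative, not ledc-pi API.
class IdentifierSpec:
    def __init__(self, pattern, checksum):
        self.pattern = re.compile(pattern)
        self.checksum = checksum

    def is_valid(self, value):
        return bool(self.pattern.fullmatch(value)) and self.checksum(value)

def isbn10_checksum(value):
    # The weighted sum 10*d1 + 9*d2 + ... + 1*d10 must be divisible
    # by 11; a trailing 'X' stands for the value 10.
    digits = [10 if c == 'X' else int(c) for c in value]
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0

isbn10 = IdentifierSpec(r'\d{9}[\dX]', isbn10_checksum)
print(isbn10.is_valid('0306406152'))  # True: a valid ISBN-10
print(isbn10.is_valid('0306406153'))  # False: wrong check digit
```

Publishing such specifications in a shared, machine-readable form is exactly the gap the ledc-pi server aims to fill.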

Antoon Bronselaer

ledc-fundy

Is it a contraction of functional dependency? Or does it relate to fun with dependencies? We’re not quite sure about that yet… What matters is: the new ledc module has arrived and it’s called ledc-fundy!

The idea behind ledc-fundy is simple: to provide an implementation of functional dependencies and algorithms for their fundamental problems. It’s basically a playground for these simple yet fascinating constraints. You’ll find algorithms for discovery, violation detection, implication, normalisation and repair.
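As a taste of what violation detection amounts to, here is a minimal sketch (the underlying idea only, not the ledc-fundy API): a functional dependency X → Y holds on a dataset exactly when rows that agree on X also agree on Y.

```python
from collections import defaultdict

# Illustrative check of a functional dependency lhs -> rhs on a list
# of records (plain dicts); a hedged sketch, not the ledc-fundy API.
def fd_violations(records, lhs, rhs):
    groups = defaultdict(set)
    for row in records:
        key = tuple(row[a] for a in lhs)
        groups[key].add(tuple(row[a] for a in rhs))
    # every lhs value mapping to more than one rhs value is a violation
    return {k: v for k, v in groups.items() if len(v) > 1}

data = [
    {'zip': '9000', 'city': 'Ghent'},
    {'zip': '9000', 'city': 'Gent'},      # conflicts with the row above
    {'zip': '1000', 'city': 'Brussels'},
]
violations = fd_violations(data, ['zip'], ['city'])
print(violations)  # only the key ('9000',) maps to two conflicting cities
```

Discovery, implication, normalisation and repair all build on this same grouping view of the data.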

Code and license information can be found in the ledc-fundy repository on gitlab. Documentation and examples to follow soon.

Antoon Bronselaer

The Parker engine

In the sigma module, several repair engines can be found to fix errors in data. One of these engines, the Parker engine, has the ability to fix violations of sigma rules and partial keys in one go. The combination of these two constraint types provides a rich formalism to enforce constraints. Yet, repairing them can still be done using the set-cover approach, albeit in a modified version.
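The set-cover view can be sketched in a few lines (a simplified illustration, not the actual Parker implementation): every violated rule yields a set of cells whose modification can resolve it, a repair must touch at least one cell per violated rule, and the classic greedy heuristic repeatedly picks the cell that resolves the most remaining violations.

```python
# Greedy set cover over violated rules; a hedged sketch of the general
# approach, not the modified version used by the Parker engine.
def greedy_cover(violations):
    # violations: list of sets of cell identifiers
    remaining = [set(v) for v in violations]
    chosen = set()
    while remaining:
        # count how many unresolved violations each cell occurs in
        counts = {}
        for violation in remaining:
            for cell in violation:
                counts[cell] = counts.get(cell, 0) + 1
        best = max(counts, key=counts.get)
        chosen.add(best)
        remaining = [v for v in remaining if best not in v]
    return chosen

# Changing 'age' resolves the first two violations in one go, so the
# greedy cover needs only two cells instead of three.
print(greedy_cover([{'age', 'birthyear'}, {'age', 'status'}, {'name'}]))
```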

This is the main conclusion of a research cooperation between the DDCM lab and prof. Maribel Acosta (RUB, Germany); the results are now available on arXiv. Some examples of how the Parker engine can be used can be found in the example folder on gitlab.

Antoon Bronselaer

ledc-dino

Errors in data, part 2!

A few weeks ago, the sigma module was launched, featuring methods to model and repair sigma rules. Following this methodology, an end-user still needs to encode constraints in the form of sigma rules. In some cases, this can be troublesome: the number of rules can be very high or some rules might be missed, allowing errors to remain unnoticed.

In our newest module, some algorithms are implemented to automatically discover sigma rules on a given dataset. This module is called dino, short for Discovery of Inconsistencies and Outliers. Code, documentation, examples and license information can be found in the dino repository on gitlab.
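The discovery idea can be illustrated in a hedged, generic form (these are not the actual dino algorithms): enumerate candidate rules and keep only those that are never violated on the given dataset.

```python
# Rule discovery by filtering candidates against the data; a generic
# illustration of the idea, not the dino implementation.
def discover(records, candidates):
    # candidates maps a rule name to a predicate that returns True
    # when a row violates the candidate rule
    return [name for name, violates in candidates.items()
            if not any(violates(row) for row in records)]

data = [
    {'age': 25, 'minor': False},
    {'age': 12, 'minor': True},
    {'age': 17, 'minor': False},   # a 17-year-old marked as non-minor
]
candidates = {
    'age >= 0':           lambda r: r['age'] < 0,
    'minor = (age < 18)': lambda r: (r['age'] < 18) != r['minor'],
}
print(discover(data, candidates))  # ['age >= 0']
```

The second candidate is dropped because the third row violates it, which is precisely why discovered rules must always be reviewed: an error in the data can mask a rule that actually should hold.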

Antoon Bronselaer

New year, new publication

A new year, a new publication!

One of the pillars of the sigma repository is the ability to find implicit rules in which variables are eliminated. This is a computationally intensive task, but in our latest publication, entitled Efficient edit rule implication for nominal and ordinal data, we propose new algorithms for nominal and ordinal data that improve the runtime of this task. The full publication can be found here.
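The kind of implication at play can be shown with a small Fellegi-Holt style sketch for nominal data (a generic illustration, not the optimised algorithms from the paper). Writing an edit rule as {attribute: set of failing values}, a record violates the rule when every listed attribute takes a value from its set; eliminating a variable combines two rules whose value sets for that variable jointly cover its whole domain.

```python
# Variable elimination on nominal edit rules; a hedged sketch of the
# classic generation step, not the algorithms from the publication.
def eliminate(rule_a, rule_b, var, var_domain):
    if rule_a.get(var, set()) | rule_b.get(var, set()) != var_domain:
        return None  # the rules do not cover the domain of var
    implied = {}
    for attr in (set(rule_a) | set(rule_b)) - {var}:
        if attr in rule_a and attr in rule_b:
            common = rule_a[attr] & rule_b[attr]
            if not common:
                return None  # empty failing region: nothing is implied
            implied[attr] = common
        else:
            # an attribute restricted by only one rule keeps that set
            implied[attr] = rule_a.get(attr) or rule_b.get(attr)
    return implied

# "married children" and "singles with a spouse" are both invalid, so
# a child with a spouse is invalid regardless of marital status.
e1 = {'marital': {'married'}, 'age': {'child'}}
e2 = {'marital': {'single'}, 'relation': {'spouse'}}
print(eliminate(e1, e2, 'marital', {'married', 'single'}))
```

Each eliminated variable can spawn many such implied rules, which is what makes the task computationally intensive and runtime improvements worthwhile.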

Antoon Bronselaer