Association analysis is often used in data quality applications, for example to spot highly correlated patterns in data. In a recent study, we investigated the connection between association rules and sigma rules in a dynamic setting: we search for association rules, use the set-cover method to repair violations of the rules we found, and repeat until no more violations are found. The full publication is available as an Open Access article, and the code is part of ledc-dino. Some findings of this study are the following:
- You need high certainty that the rules you find are correct, meaning that violations of a rule are factual errors and not just rare cases. If not, the accuracy of the repeated search-and-repair process quickly deteriorates. If you find correct rules with high precision, the dynamic approach pays off, as far more errors are detected. We’ve shown a number of methods to tell rare cases apart from errors.
- Association rules are positive IF … THEN … rules, so their expressivity is bounded. A pleasant consequence is that variable elimination (a fundamental step in the construction of a sufficient set) turns out to be quadratic in many practical cases, and those cases are easy to recognize.
- We observed that association rules, as a method to detect errors in a dataset, are complementary to other approaches for automated error detection.
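
To make the dynamic setting more concrete, here is a minimal Python sketch of a search-and-repair loop. It is not the ledc-dino code: all function names and thresholds are hypothetical, the rule miner only considers rules with a single attribute on each side, and the repair step simply overwrites the consequent value, which is a deliberate simplification and not the set-cover method used in the study.

```python
from collections import Counter
from itertools import permutations


def mine_rules(rows, min_support=3, min_confidence=0.75):
    """Find rules IF a == x THEN b == y (hypothetical, single-attribute miner)."""
    attrs = list(rows[0].keys())
    rules = []
    for a, b in permutations(attrs, 2):
        antecedent = Counter(r[a] for r in rows)
        pair = Counter((r[a], r[b]) for r in rows)
        for (x, y), n in pair.items():
            # Keep a rule only if it is frequent and almost always holds.
            if n >= min_support and n / antecedent[x] >= min_confidence:
                rules.append((a, x, b, y))
    return rules


def find_violations(rows, rules):
    """Return (row index, rule) pairs where the IF part holds but the THEN part fails."""
    return [
        (i, rule)
        for i, r in enumerate(rows)
        for rule in rules
        if r[rule[0]] == rule[1] and r[rule[2]] != rule[3]
    ]


def repair(rows, violations):
    """Simplified repair: overwrite the consequent (NOT the set-cover method)."""
    for i, (_a, _x, b, y) in violations:
        rows[i][b] = y


def dynamic_repair(rows, max_rounds=10):
    """Repeat mine -> detect -> repair until no violations remain."""
    rows = [dict(r) for r in rows]  # work on a copy, leave the input untouched
    for round_no in range(1, max_rounds + 1):
        found = find_violations(rows, mine_rules(rows))
        if not found:
            return rows, round_no - 1  # number of rounds that changed data
        repair(rows, found)
    return rows, max_rounds  # stopped by the safety limit


if __name__ == "__main__":
    data = (
        [{"city": "Ghent", "zip": "9000"}] * 4
        + [{"city": "Ghent", "zip": "9999"}]
        + [{"city": "Bruges", "zip": "8000"}] * 3
    )
    repaired, rounds = dynamic_repair(data)
    print(rounds, repaired[4])
```

The confidence threshold is where the first finding above shows up: if it is set too low, rare but valid rows are treated as violations and overwritten, and every later round then mines rules from data that has already been damaged. The `max_rounds` limit is only a safety net for this sketch, since naive overwriting can oscillate when two rules disagree about the same attribute.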