ledc

Data Quality, lightning fast

Use case: the disbiome database

sqlmerge is a simple CLI tool to merge (duplicate) rows in a relational database. The problem is simple, the solution is not (if you want to know why, continue reading here or see some examples here).

Very recently, we’ve used this tool to clean disbiome: a database that records associations of microbial composition changes with different kinds of diseases. This database is manually curated: researchers inspect new literature on disease-microbiome associations and input all relevant information that can be found in new papers. This includes information about diseases, detection methods, organisms, samples, control groups and of course experimental outcomes, which may be quantitative or qualitative. Over a thousand scientific publications have been added to this database, resulting in almost 11K experimental outcomes.

A while ago, we were contacted by a user to report some publication being added twice. As such, we started a deduplication procedure and found that about 40 publications were added twice to the database. Then, the real problem came up: if we merge two publications, then we must merge information about experiments, quality reports on the publications, sample information … On top of that, if information about two duplicate publications is conflicting we should resolve those conflicts in a correct manner. The bottom line is: we’re dealing with potential changes in tens of tables at the same time.

Here is where sqlmerge comes to the rescue! You tell it you want to merge two rows in the publication table and it will inspect the entire database to see in which other tables this has implications. If it spots any inconsistencies, it will first try to resolve them automatically (you can provide it with some resolution strategies). If this is not possible, user input is requested. This continues up to the point where all issues are resolved. The final outcome is a nicely organised SQL script that encodes all steps necessary to implement the desired merge operation. You can simply run this script on your database and … done.

To appreciate the abilities of this little tool, here’s a number: to correctly merge 40 publications, the tool produced a script of 1096 SQL statements.