Monthly Archives: May 2015

Data Deduplication in Relational Databases

A huge part of database related work is to make sure that the data is consistent. In real world data is never ideal, and whenever you need using data from existing data sources, you have to understand what is right and what is wrong there, and know how to circumvent data quality issues.  Two most frequent data integrity issues in relational databases are missing date and duplicate data. A record/document is missing if it was not written to the database by an application, or was mistakenly deleted. A record/document is duplicated if it was recorded more than once.

Why does application write same record more than once? A user or an upstream code could send same document twice, and an application doesn’t handle this case. Or a user could send incorrect record the first time, and a corrected one later: an application could be designed to save all records instead of modifying existing ones.

Continue reading