Data Deduplication in Relational Databases

A huge part of database-related work is making sure that data is consistent. Real-world data is never ideal, and whenever you need to use data from existing data sources, you have to understand what is right and what is wrong there, and know how to work around data quality issues. The two most frequent data integrity issues in relational databases are missing data and duplicate data. A record/document is missing if it was never written to the database by an application, or was mistakenly deleted. A record/document is duplicated if it was recorded more than once.

Why would an application write the same record more than once? A user or an upstream system could send the same document twice, and the application doesn’t handle this case. Or a user could send an incorrect record the first time and a corrected one later: the application could be designed to save all records instead of modifying existing ones.

If you are working on an application which reads this data for reporting or for transaction processing purposes, it is a good idea to start by testing whether there are any duplicates in the data source. Let’s say you have an SQL Server table:

CREATE TABLE dbo.Payment
(
  PaymentID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
  TransactionID VARCHAR(32) NOT NULL,
  ...
)

PaymentID is an identity column and the primary key. TransactionID is an external key, controlled by a payment processor. If there is no unique constraint on the TransactionID column, it is possible to have duplicate payment records. You can find out whether there are duplicate TransactionID values in this table (all queries in this post are for SQL Server 2008 or later, though other database servers may support them too):

SELECT TransactionID, COUNT(*)
FROM Payment
GROUP BY TransactionID
HAVING COUNT(*) > 1

This query returns every TransactionID used in more than one row (GROUP BY groups rows by TransactionID, and HAVING keeps only groups with more than one row). If there are any, you can get into details: eyeball these duplicate records to understand whether they are full duplicates, why they could have been inserted twice, and what the best deduplication strategy is (a quick way to check for full duplicates is sketched after the next query). This query returns all rows with duplicate TransactionID values:

SELECT *
FROM Payment
WHERE TransactionID IN
  (SELECT TransactionID
  FROM Payment
  GROUP BY TransactionID
  HAVING COUNT(*) > 1)
ORDER BY TransactionID, PaymentID
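
If there are many columns, instead of eyeballing you can quickly check whether the duplicates are full duplicates, i.e. identical in everything except PaymentID, by grouping on the business columns as well. This is only a sketch: Amount and Status below are hypothetical stand-ins for the actual columns of dbo.Payment.

-- Amount and Status are placeholders for the real payment columns
SELECT TransactionID, Amount, Status, COUNT(*)
FROM Payment
GROUP BY TransactionID, Amount, Status
HAVING COUNT(*) > 1

A TransactionID that shows up here has at least two rows that are identical apart from PaymentID; a TransactionID that shows up only in the first query has duplicates that differ in content, so you have to decide which version is correct.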

If, after investigation, you decide that you need to query only the latest row (based on PaymentID) for each TransactionID, you can use a CTE (common table expression) with ROW_NUMBER to do it:

;WITH dedupPayments AS
(
  SELECT PaymentID, 
    ROW_NUMBER() OVER(PARTITION BY TransactionID ORDER BY PaymentID DESC) rowNumber
  FROM Payment
)
SELECT p.*
FROM Payment p
INNER JOIN dedupPayments d ON d.PaymentID = p.PaymentID AND d.rowNumber = 1

Here the CTE returns all rows, adding a row number partitioned by TransactionID and ordered by PaymentID descending. Partitioning makes the numbering restart for each TransactionID. As we need only the latest row for each TransactionID, the descending order produces rowNumber = 1 for the latest row of each partition. Then we query the Payment table, join it with the CTE, and keep only the rows with rowNumber = 1.
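
If the goal is to actually remove the older rows rather than filter them out at query time, the same ROW_NUMBER pattern can drive a DELETE. This is only a sketch, assuming you have decided that the row with the highest PaymentID is the one to keep for each TransactionID; check the affected row count inside a transaction before committing.

;WITH rankedPayments AS
(
  SELECT PaymentID,
    ROW_NUMBER() OVER(PARTITION BY TransactionID ORDER BY PaymentID DESC) rowNumber
  FROM Payment
)
-- deletes every row except the latest one (rowNumber = 1) in each TransactionID group
DELETE FROM rankedPayments
WHERE rowNumber > 1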

In some cases, a CTE can be replaced with a temporary table for performance reasons.
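
For example, the row numbers can be materialized into a temporary table first, which gives the optimizer accurate statistics and lets you index the intermediate result. Below is a sketch of the same query rewritten this way (#dedupPayments is just a name used in this sketch); whether it is actually faster depends on the data volume and the query plans, so compare both versions.

-- materialize row numbers per TransactionID into a temporary table
SELECT PaymentID,
  ROW_NUMBER() OVER(PARTITION BY TransactionID ORDER BY PaymentID DESC) rowNumber
INTO #dedupPayments
FROM Payment;

-- keep only the latest payment for each TransactionID
SELECT p.*
FROM Payment p
INNER JOIN #dedupPayments d ON d.PaymentID = p.PaymentID AND d.rowNumber = 1;

DROP TABLE #dedupPayments;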

  • Evgeny Hramov

    CTEs are strange). In some cases they are fast, in others very slow. And if you join a CTE with itself twice, it may be a bad idea for performance. You can use a simpler query

    SELECT TransactionID, MAX(PaymentID) AS m_id
    FROM Payment
    GROUP BY TransactionID

    to select the *latest* records. And of course you know that IDs don’t have to go in a straight 1, 2, 3, 4… sequence, and a bigger ID doesn’t necessarily mean a later record.

    • surmenok

      Performance can be weird, agreed. CTEs are often used to force SQL Server to use a better query plan. But this is a very simple case; usually a query has several CTEs chained together.
      Thanks for the MAX(PaymentID) idea.
      Yes, I understand that this approach relies on IDs increasing over time; that’s why I made it an IDENTITY field in the example. In some cases, when an application inserts records in the wrong order, you have to order by something else, some kind of timestamp perhaps.
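
      For instance, if dbo.Payment had a CreatedAt datetime column (a hypothetical column, not part of the example table above), the ordering inside the window function could use it instead, with PaymentID as a tie-breaker:

      ;WITH dedupPayments AS
      (
        SELECT PaymentID,
          -- CreatedAt is hypothetical; PaymentID breaks ties for rows with the same timestamp
          ROW_NUMBER() OVER(PARTITION BY TransactionID ORDER BY CreatedAt DESC, PaymentID DESC) rowNumber
        FROM Payment
      )
      SELECT p.*
      FROM Payment p
      INNER JOIN dedupPayments d ON d.PaymentID = p.PaymentID AND d.rowNumber = 1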