One of the DBA's biggest annoyances is dealing with duplicate data. No matter how much we try to guard against it, duplicates always mange to find their way into our tables. Duplicate data is a big problem because it can affect application views (where each item is supposed to be unique), skew statistics, and, in severe cases, increase server overhead.
In this tip, we'll learn how to recognize duplicate data in MySQL, as well as how to delete them without removing precious valid data.
Duplicate Types
Most of the duplicate records that you'll encounter are one of two distinct types: Duplicate Meaning and Non-unique Keys. In this instalment we'll be dealing with Duplicate Meaning; we'll address Non-unique Keys in the next one.
When a Duplicate is not a Duplicate
Duplicate Meaning is the most common type of duplicate. It's a situation where two or more fields' contents are not the same, but their meaning is. You could think of it as a semantic duplicate.
Consider the following table excerpt:
movie_name | media |
--------------------------- | |
ACADEMY DINOSAUR | Theatre |
ACE GOLDFINGER | Television |
ADAPTATION HOLES | Theatre |
AFFAIR PREJUDICE | Theatre |
AFRICAN EGG | TV |
In the media column, the entries "Television" and "TV" have the same connotation, but expressed differently. This issue is often caused by the use of free-text input where a limited dropdown would have been a better choice.
This type of duplication can be very challenging to deal with because you can't exclude duplicates using a SELECT DISTINCT.
There are two ways to deal with this problem:
- Select data using REPLACE() to swap out values that we don't want with those that we want to see instead:
- Update the actual table data. Here's a statement that updates all instances of “TV” with the preferred “TELEVISION” value:
SELECT DISTINCT | movie_name, |
REPLACE(media, "TV", "TELEVISION") as media, | |
FROM films; |
UPDATE films |
SET media = REPLACE(media, "TV", "TELEVISION") |
WHERE media = "TV"; |
Here's a real-life example that I came across only a month ago!
Somehow, some unwanted curly apostrophes found their way into our data. Notice the “O'BRIEN” and “O'BRIEN” entries:
first_name | last_name |
--------------------- | |
PENELOPE | GUINESS |
CONAN | O'BRIEN |
ED | CHASE |
JENNIFER | DAVIS |
CONAN | O'BRIEN |
We can deal with this problem in the same way we did above:
- Select data using REPLACE() to swap out curly apostrophes with regular single quotes so that we're always dealing with the same character:
- Update the actual table data. This statement updates all apostrophes in the last_name column with regular single quotes:
SELECT DISTINCT | first_name, |
REPLACE(last_name, "'", "'") as last_name, | |
FROM actors | WHERE REPLACE(last_name, "'", "'") like "O'BRIEN"; |
UPDATE actors |
SET last_name = REPLACE(last_name, "'", "'") |
WHERE last_name like "%'%"; |
Conclusion
Duplicate records, doubles, redundant data, duplicate rows; whatever you want to call them, they are one of the biggest banes in a DBA's life. Nevertheless, it's crucial that you weed them out on a regular basis, lest you want to generate faulty statistics and confuse your users who interact with the database.