`
snowtigersoft
  • 浏览: 55902 次
  • 性别: Icon_minigender_1
  • 来自: 上海
社区版块
存档分类
最新评论

MySQL Handling Duplicates

阅读更多

Tables or result sets sometimes contain duplicate records. Sometime it is allowed but sometime it is required to stop duplicate records. Sometime it is required to identify duplicate records and remove them from the table. This chapter will describe how to prevent duplicate records occurring in a table and how to remove already existing duplicate records.

Preventing Duplicates from Occurring in a Table:

You can use a PRIMARY KEY or UNIQUE Index on a table with appropriate fields to stop duplicate records. Lets take one example, The following table contains no such index or primary key, so it would allow duplicate records for first_name and last_name

CREATE TABLE person_tbl
(
    first_name CHAR(20),
    last_name CHAR(20),
    sex CHAR(10)
);

To prevent multiple records with the same first and last name values from being created in this table, add a PRIMARY KEY to its definition. When you do this, it's also necessary to declare the indexed columns to be NOT NULL, because a PRIMARY KEY does not allow NULL values:

CREATE TABLE person_tbl
(
   first_name CHAR(20) NOT NULL,
   last_name CHAR(20) NOT NULL,
   sex CHAR(10)
   PRIMARY KEY (last_name, first_name)
);

The presence of a unique index in a table normally causes an error to occur if you insert a record into the table that duplicates an existing record in the column or columns that define the index.

Use INSERT IGNORE rather than INSERT . If a record doesn't duplicate an existing record, MySQL inserts it as usual. If the record is a duplicate, the IGNORE keyword tells MySQL to discard it silently without generating an error.

Following example does not error out and same time it will not insert duplicate records.

mysql> INSERT IGNORE INTO person_tbl (last_name, first_name)
    -> VALUES( 'Jay', 'Thomas');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT IGNORE INTO person_tbl (last_name, first_name)
    -> VALUES( 'Jay', 'Thomas');
Query OK, 0 rows affected (0.00 sec)

Use REPLACE rather than INSERT. If the record is new, it's inserted just as with INSERT. If it's a duplicate, the new record replaces the old one:

mysql> REPLACE INTO person_tbl (last_name, first_name)
    -> VALUES( 'Ajay', 'Kumar');
Query OK, 1 row affected (0.00 sec)
mysql> REPLACE INTO person_tbl (last_name, first_name)
    -> VALUES( 'Ajay', 'Kumar');
Query OK, 2 rows affected (0.00 sec)

INSERT IGNORE and REPLACE should be chosen according to the duplicate-handling behavior you want to effect. INSERT IGNORE keeps the first of a set of duplicated records and discards the rest. REPLACE keeps the last of a set of duplicates and erase out any earlier ones.

Another way to enforce uniqueness is to add a UNIQUE index rather than a PRIMARY KEY to a table.

CREATE TABLE person_tbl
(
   first_name CHAR(20) NOT NULL,
   last_name CHAR(20) NOT NULL,
   sex CHAR(10)
   UNIQUE (last_name, first_name)
);

Counting and Identifying Duplicates:

Following is the query to count duplicate records with first_name and last_name in a table.

mysql> SELECT COUNT(*) as repetitions, last_name, first_name
    -> FROM person_tbl
    -> GROUP BY last_name, first_name
    -> HAVING repetitions > 1;

This query will return a list of all the duplicate records in person_tbl table. In general, to identify sets of values that are duplicated, do the following:

  • Determine which columns contain the values that may be duplicated.

  • List those columns in the column selection list, along with COUNT(*).

  • List the columns in the GROUP BY clause as well.

  • Add a HAVING clause that eliminates unique values by requiring group counts to be greater than one.

Eliminating Duplicates from a Query Result:

You can use DISTINCT along with SELECT statement to find out unique records available in a table.

mysql> SELECT DISTINCT last_name, first_name
    -> FROM person_tbl
    -> ORDER BY last_name;

An alternative to DISTINCT is to add a GROUP BY clause that names the columns you're selecting. This has the effect of removing duplicates and selecting only the unique combinations of values in the specified columns:

mysql> SELECT last_name, first_name
    -> FROM person_tbl
    -> GROUP BY (last_name, first_name);

Removing Duplicates Using Table Replacement:

If you have duplicate records in a table and you want to remove all the duplicate records from that table then here is the procedure.

mysql> CREATE TABLE tmp SELECT last_name, first_name, sex
    ->                  FROM person_tbl;
    ->                  GROUP BY (last_name, first_name);
mysql> DROP TABLE person_tbl;
mysql> ALTER TABLE tmp RENAME TO person_tbl;

An easy way of removing duplicate records from a table is that add an INDEX or PRIMAY KEY to that table. Even if this table is already available you can use this technique to remove duplicate records and you will be safe in future as well.

mysql> ALTER IGNORE TABLE person_tbl
    -> ADD PRIMARY KEY (last_name, first_name);
分享到:
评论

相关推荐

    Delete Duplicates for Outlook Express 5.8.1

    "Delete Duplicates for Outlook Express 5.8.1" 是一款专为Outlook Express设计的软件,用于帮助用户查找并删除邮箱中的重复邮件。Outlook Express是微软在早期推出的一款电子邮件客户端,它允许用户接收、发送电子...

    MySQL结巴中文分词插件SqlJieba.zip

    Records: 3 Duplicates: 0 Warnings: 0 mysql> SELECT MATCH(c) AGAINST('上海') FROM t; ---------------------------- | MATCH(c) AGAINST('上海') | ---------------------------- | 0 | | ...

    linux下mysql数据库

    Records: 0 Duplicates: 0 Warnings: 0 ``` - 查看修改后的表结构: ```sql mysql> describe student; +-------+----------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra...

    Python 重复数据处理(df.drop-duplicates方法)Python源码

    Python 重复数据处理(df.drop_duplicates方法)Python源码Python 重复数据处理(df.drop_duplicates方法)Python源码Python 重复数据处理(df.drop_duplicates方法)Python源码Python 重复数据处理(df.drop_...

    Delete Duplicates for Outlook Express and Windows Mail v3.8.0.1注册版

    Delete Duplicates for Outlook Express and Windows Mail v3.8.0.1注册版程序

    Informatica case study RemoveDuplicates

    《Informatica案例研究:RemoveDuplicates》 在信息技术领域,数据清洗和去重是至关重要的步骤,尤其是在大数据处理中。Informatica作为一个强大的数据集成工具,提供了RemoveDuplicates功能,以确保数据的准确性和...

    Pandas之drop_duplicates:去除重复项方法

    在本次分享中,我们将详细探讨Pandas库中的`drop_duplicates`方法,以及它的应用场景和重要参数。 首先,`drop_duplicates`方法是专为`DataFrame`格式的数据设计的,它允许我们从数据集中移除那些重复的行。该方法...

    Remove_duplicates.cpp

    Remove_duplicates.cpp

    Python代码源码-实操案例-框架案例-重复数据处理(df.drop duplicates方法).zip

    这个压缩包文件"Python代码源码-实操案例-框架案例-重复数据处理(df.drop duplicates方法).zip"显然包含了一些示例代码,用于解释如何使用Pandas的`df.drop_duplicates()`方法来处理重复的数据。`df.drop_duplicates...

    LeetCode Remove Duplicates from Sorted Array解决方案

    "LeetCode Remove Duplicates from Sorted Array解决方案" 本文将详细介绍 LeetCode 中的 Remove Duplicates from Sorted Array 解决方案,包括问题描述、解决方案和关键知识点。 问题描述: 给定一个排序的数组 ...

    remove_duplicates:从数组中删除重复条目的功能

    "remove_duplicates"功能专注于解决一个特定的问题:从数组中去除重复的条目。这个功能对于那些需要保持数组元素唯一性的场景非常有用,比如在数据存储、数据分析或者前端开发中避免显示重复的数据。 在JavaScript...

    pandas.DataFrame.drop_duplicates 用法介绍

    `pandas.DataFrame.drop_duplicates` 是一个非常实用的功能,用于处理数据框 (`DataFrame`) 中的重复行。这个函数在数据预处理阶段尤其重要,因为它可以帮助确保分析的数据是唯一的,避免因重复记录导致的错误统计。...

    FindDuplicates:用于查找重复项和合并标题的 Calibre 插件

    而“FindDuplicates”是 Calibre 的一款插件,专门用于帮助用户查找和处理重复的书籍条目,确保用户的电子书库保持整洁有序。这款插件通过比较元数据、文件内容或标题来识别可能的重复项,并提供了合并标题的功能,...

    Delete Duplicates for Outlook Express and Windows Mail v3.8.0.1注册码

    “Delete Duplicates for Outlook Express and Windows Mail v3.8.0.1”是一款专门针对Outlook Express和Windows Mail设计的去重工具。该软件的主要功能在于帮助用户清理邮箱中重复的邮件,从而有效提高邮箱空间的...

    formatted_task097_conala_remove_duplicates.json

    formatted_task097_conala_remove_duplicates.json

    详解pandas使用drop_duplicates去除DataFrame重复项参数

    Pandas之drop_duplicates:去除重复项 方法 DataFrame.drop_duplicates(subset=None, keep='first', inplace=False) 参数 这个drop_duplicate方法是对DataFrame格式的数据,去除特定列下面的重复行。返回...

    Python DataFrame使用drop_duplicates()函数去重(保留重复值,取重复值)

    `drop_duplicates()`函数是DataFrame中用于去重的关键方法,本文将详细介绍如何使用这个函数来实现这两种需求。 首先,让我们了解如何创建一个包含重复值的DataFrame。DataFrame可以由字典、列表、数组或其他数据...

    find_duplicates

    I have two seperate directory structures `dirA` and `dirB` that have possible duplicates between them. I want to delete all duplicates from `dirB` 在这种情况下, finddupes_in_dir.py是最佳解决方案。 您...

Global site tag (gtag.js) - Google Analytics