remove-duplicate-rows

Description

remove-duplicate-rows adaptor removes duplicate rows from a datatable.

If a duplicate is found, those duplicated rows are placed in a separate datatable.

Inputs

data Type: datatable Required: Yes The datatable containing duplicate rows.

column names Type: list Required: No A list of columns to compare for duplicate values. If unspecified, entire rows will be compared.

case sensitive Type: boolean Required: No When set to True, lowercase and uppercase letters are treated as different. When set to False, lowercase and uppercase letters are treated as equivalent. If unspecified, defaults to False

Outputs

data Type: datatable A datatable containing only unique rows, including first instance of duplicate rows.

duplicates Type: datatable A datatable containing duplicate rows.

Examples

Example 1: Default behaviour.

Inputs:

data:

id
code
name

1

GB

United Kingdom

2

TR

Turkey

3

US

United States

4

IND

India

5

IND

India

6

us

United States

column name: null (empty)

case sensitive: null (empty)

Outputs:

data:

id
code
name

1

GB

United Kingdom

2

TR

Turkey

3

US

United States

4

IND

India

5

IND

India

6

us

United States

duplicates: Empty Table

-> By default all the rows are compared, since all the values in the id column are unique no duplicate rows were removed.

Example 2: Specify column names to compare.

Inputs:

data:

id
code
name

1

GB

United Kingdom

2

TR

Turkey

3

US

United States

4

IND

India

5

IND

India

6

us

United States

column name:

  1. code

  2. name

case sensitive: null (empty)

Outputs:

data:

id
code
name

1

GB

United Kingdom

2

TR

Turkey

3

US

United States

4

IND

India

duplicates:

id
code
name

1

IND

India

2

us

United States

-> Removed the duplicate rows 5 and 6.

Example 3: Specify case sensitive column names to compare.

Inputs:

data:

id
code
name

1

GB

United Kingdom

2

TR

Turkey

3

US

United States

4

IND

India

5

IND

India

6

us

United States

column name:

  1. code

  2. name

case sensitive: True

Outputs:

data:

id
code
name

1

GB

United Kingdom

2

TR

Turkey

3

US

United States

4

IND

India

5

us

United States

duplicates:

id
code
name

1

IND

India

-> Removed only the duplicate row 5 as row 6 code column is in lowercase and does not match row 3 .

Use case

  • Removing duplicated rows after using append-datatables adaptor on two partially-overlapping datasets.

  • Determining which rows are duplicated, and how many times they are duplicated.

Last updated