Data Duplication Troubleshooting

Introduction

When processing data through data factory, data calculation is often repeated due to improper processing. There are three reasons for data duplication:

Possible reasons

Example

Incomplete join conditions

Originally, distinguishing the uniqueness needs to connect three fields: Product name, Product style, and Unit, but only the first two fields are connected. At this time, there may be repeated records with the same product name and style, and different units.

Select a subform in Input

If both form fields and subfields are selected at the same time in Input, the form fields will be repeated according to the number of rows in the subform. The time of form fields repeated is the same as the number of rows in the subform.

Repeated data

If the data itself is repeated, the output result may be repeated.

Incomplete join conditions

Description

Each field must be connected with the corresponding field through Join. Each field in the left form must be only connected with that in the right form, or the calculated data may be repeated.

For example, distinguishing the uniqueness needs to connect three fields: Product name, Product style, and Unit, but only the first two fields are connected. At this time, there may be repeated records with the same product name and style, and different units.

Original record:

Correct example

Incorrect example

Solution

Add a Join node and all fields that can determine the uniqueness need to be used as connection fields. As the example above shows, connection fields need to be set for the left and right form as: Product name = Product name, Product style = Product style, and Unit = Unit.

Selecting a Subform in Input

Description

When both form fields and subfields are set as selected fields in Input, form fields will be repeated, so form fields cannot distinguish the uniqueness of the data. If you process data by Join, connection fields cannot simply be set to form fields.

The storage order number of form fields is all the same as follows:

Solutions

For the data in the example above, if a Join node needs to be added, the connection fields need to be set: Order number = Order number (order number related to the order), Product name = Product name, Product style = Product model, and Unit = Unit.

Repeated Data

Description

When there are subforms in the orignial record and the record is calculated in Data Factory, each sub-record in the subform will be split into separate pieces of data. The original record is as follows:

In the data factory, the original data becomes as follows, where there are two repeated pieces of data:

Solution

When processing data, it is necessary to ensure the uniqueness of the data and no repeated data. You can deduplicate the repeated data through Group by.

  • Add grouping fields: Inbound order number, Product name, Product style, and Unit
  • Add aggregation fields: Inbound quantity

Settings are as follows:

The processed data is shown as follows and there are no more duplicate data:

Was this information helpful?
Yes
NoNo
Need more help? Contact support