Data Duplication Troubleshooting
Introduction
When processing data through data factory, data calculation is often repeated due to improper processing. There are three reasons for data duplication:
Possible reasons | Example |
Incomplete join conditions | Originally, distinguishing the uniqueness needs to connect three fields: Product name, Product style, and Unit, but only the first two fields are connected. At this time, there may be repeated records with the same product name and style, and different units. |
Select a subform in Input | If both form fields and subfields are selected at the same time in Input, the form fields will be repeated according to the number of rows in the subform. The time of form fields repeated is the same as the number of rows in the subform. |
Repeated data | If the data itself is repeated, the output result may be repeated. |
Incomplete join conditions
Description
Each field must be connected with the corresponding field through Join. Each field in the left form must be only connected with that in the right form, or the calculated data may be repeated.
For example, distinguishing the uniqueness needs to connect three fields: Product name, Product style, and Unit, but only the first two fields are connected. At this time, there may be repeated records with the same product name and style, and different units.
Original record:
Correct example | |
Incorrect example |
Solution
Add a Join node and all fields that can determine the uniqueness need to be used as connection fields. As the example above shows, connection fields need to be set for the left and right form as: Product name = Product name, Product style = Product style, and Unit = Unit.
Selecting a Subform in Input
Description
When both form fields and subfields are set as selected fields in Input, form fields will be repeated, so form fields cannot distinguish the uniqueness of the data. If you process data by Join, connection fields cannot simply be set to form fields.
The storage order number of form fields is all the same as follows:
Solutions
For the data in the example above, if a Join node needs to be added, the connection fields need to be set: Order number = Order number (order number related to the order), Product name = Product name, Product style = Product model, and Unit = Unit.
Repeated Data
Description
When there are subforms in the orignial record and the record is calculated in Data Factory, each sub-record in the subform will be split into separate pieces of data. The original record is as follows:
In the data factory, the original data becomes as follows, where there are two repeated pieces of data:
Solution
When processing data, it is necessary to ensure the uniqueness of the data and no repeated data. You can deduplicate the repeated data through Group by.
- Add grouping fields: Inbound order number, Product name, Product style, and Unit
- Add aggregation fields: Inbound quantity
Settings are as follows:
The processed data is shown as follows and there are no more duplicate data: