Alteryx Designer Cloud Discussions

TrifactaUsers · ‎08-01-2018

We have a dataset with a PrimaryKey, called AddressUuid, see http://prntscr.com/kdcxzo. A job runs every night with new statusses of the same AddressUuid. I want to dedouplicate based on AddressUuid and a date field. I found some documentation but was not able to https://cloud.google.com/dataprep/docs/html/Deduplicate-Data_57344590. Can you give me the correct recipe?

Row 1

AddressUuid xxx

Date 8-1-2018

Row 2

AddressUuid xxx

Date 7-31-2018

I only want to keep row 1, row 2 should be deleted..

Trifacta_Alumni · ‎08-01-2018

Hi John,

it would take two steps to do this. The first is to create a ranking column, the second is to delete values where ranking column not equal to 1. This basically indicates your primary addressUuid (based on most recent date).

1) create a new column indicating the rank of each row for each AddressUuid. We will use the rownumber() function for this.

derive value: ROWNUMBER() group: AddressUuid order: -Date

the -Date ensures the column is descending, so more recent date first.

2) filter out the duplicates (values greater than 1)

filter type: greaterThan col: Accepts_CreditCards greaterThan: 1 action: Delete

Hope this helps!

Trifacta_Alumni · ‎12-29-2020

Updated with the following how-to article based on this question: https://community.trifacta.com/s/article/Remove-duplicate-rows-based-on-a-primary-key

Alteryx Designer Cloud Discussions

Recipe for deleting duplicate rows based on 1 column (PrimaryKey)