Get Inspire insights from former attendees in our AMA discussion thread on Inspire Buzz. ACEs and other community members are on call all week to answer!

Alteryx Designer Cloud Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Cloud.
SOLVED

Recipe for deleting duplicate rows based on 1 column (PrimaryKey)

TrifactaUsers
10 - Fireball

We have a dataset with a PrimaryKey, called AddressUuid, see http://prntscr.com/kdcxzo. A job runs every night with new statusses of the same AddressUuid. I want to dedouplicate based on AddressUuid and a date field. I found some documentation but was not able to https://cloud.google.com/dataprep/docs/html/Deduplicate-Data_57344590. Can you give me the correct recipe? 

 

Row 1

AddressUuid xxx

Date 8-1-2018

 

Row 2

AddressUuid xxx

Date 7-31-2018

 

I only want to keep row 1, row 2 should be deleted..

2 REPLIES 2
Trifacta_Alumni
Alteryx Alumni (Retired)

Hi John,

 

it would take two steps to do this. The first is to create a ranking column, the second is to delete values where ranking column not equal to 1. This basically indicates your primary addressUuid (based on most recent date).

 

1) create a new column indicating the rank of each row for each AddressUuid. We will use the rownumber() function for this.

derive value: ROWNUMBER() group: AddressUuid order: -Date

the -Date ensures the column is descending, so more recent date first.

 

2) filter out the duplicates (values greater than 1)

filter type: greaterThan col: Accepts_CreditCards greaterThan: 1 action: Delete

 

Hope this helps!

Trifacta_Alumni
Alteryx Alumni (Retired)

Updated with the following how-to article based on this question: https://community.trifacta.com/s/article/Remove-duplicate-rows-based-on-a-primary-key