Alteryx Designer Cloud Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Cloud.
SOLVED

How to marge two data sets?

TrifactaUsers
10 - Fireball

Hi everyone,

 

I have two time series data sets. The first one contains all the historical values and need to be updated. The second data set contains new values as well as some historical values (corrections) that differ from the historical values contained in the first data set. The idea is to update the first data set with the new values and to replace the "new" historical values contained in the first data set (you can think of it as a correction made in a second moment).

 

How can I achieve that? How can I replace the value contained in the first data set with the ones in the second data set?

 

To give you an idea, I would like to drop all the rows of the first data set whose date is contained also in the second data set..

 

Thank you :)

10 REPLIES 10
Trifacta_Alumni
Alteryx Alumni (Retired)

Hi @Daniel Zito? This would definitely require a join. You can choose Join task from toolbar and initiate a join.

I would like to be thorough if i could see the data, but I can only guess a couple of things at the moment. Here's h0w i would go about solving this.

 

  1. I am going to assume that you would like to update a string column data from 1st dataset(colA) with 2nd dataset(colB). To do that there has be a positive correlation between both columns based on an identifier field column(colC).
  2. Left join 1st dataset with 2nd dataset based on identifier field column(colC) as primary keys. Bring the clean column(colB) from dataset 2 into the join output.
  3. Next, i'll search transformation edit column with formula and put colA. In the formula field i'll write IF($col !=colB, colB,$col), Group by colC. Add step

 

Best

Vardan

TrifactaUsers
10 - Fireball

Hi @Vardan Negi , thank you for your replay. Really appreciated ;)

 

In the picture below, you can find en example of my situation. I have already tried with joins but it didn´t work out.

 

 

 

As you see, I want to update data set 1 with the values of data set 2. On the right, you can see the desired output.

 

Thank you a lot :)

 

Daniel

Trifacta_Alumni
Alteryx Alumni (Retired)

Sure no problem. On this screenshot i would like to know why Type A is replaced with Type B3. There should be a way to uniquely identify rows from both dataset. Is there a primary key id for these records in bot datasets?

 

 

I am also thinking if there is a way you would want to assign a composite key to combination of (Date,Car and Type) for both datasets and then left join Dataset 2 with Dataset1.

TrifactaUsers
10 - Fireball

No. There is no primary key. The idea is to drop the values in yellow and to append the new values in green. It´s like replacing the old values with the new ones contained in the data set 2.

 

You can think of it like copying all the values in green and pasting these values in the data set 1 matching the first date in data set 2 (09.05.2020 in the pic)

TrifactaUsers
10 - Fireball

For a better understanding:

 

 

 

One idea might be to unify both the data sets (Source "A" stands for data set 1 and "B" for data set 2) and then remove the rows where Source is "A" and date is between MIN and MAX date where Source is "B".

 

In pseudo code:

 

IF(Source == "A"):

FOR(i in date):

IF(i >= MIN(date(WHERE Source == "B") AND i <= MAX(date(WHERE Source == "B")):

REMOVE row

Trifacta_Alumni
Alteryx Alumni (Retired)

Hi @Daniel Zito? I am still working on a way to do this through a union. Meanwhile, it can be done using two join statements.

 

On Source A create a recipe and name it transform_A. Paste these 3 steps on the recipe. You can copy each step and paste or copy all steps and from recipe editor, select More(ellipsis) task and paste 3 steps.

 

settype col: Date customType: 'Datetime','dd-mm-yy','dd*mm*yyyy' type: custom

derive type: multiple value: rownumber() group: Date order: Date as: 'rownum'

merge col: Date,rownum with: ':' as: 'pid'

 

Do exactly the above for Source B and name the recipe transform_B

 

On flow view again, add a new recipe to transform_B. Name this recipe join B. edit recipe.

Choose join from the toolbar, left join, pid as joining key.

Output columns, unselect all from single click checkbox and select only Date, Car, Type and Sells columns coming from Source B(denoted by orange) and not Source A(denoted by green). Add step.

 

On flow view again, add a new recipe to transform_A. Name this recipe join A. edit recipe.

Choose join from the toolbar, left join, pid as joining key.

Output columns, unselect all from single click checkbox and select Date, Car, Type and Sells columns coming from Source A(denoted by orange) and Date from source B(denoted by green). Go to advanced options, Prefix current data as sourceA_ and Prefix Joined-in data as sourceB_. Save and Continue, Add step to recipe.

 

Next, copy and paste the below 4 steps in the same recipe:

 

set col: sourceB_Date value: fill(sourceB_Date, -1, 0) order: sourceA_Date

filter type: custom rowType: single row: ismissing([sourceB_Date]) action: Keep

drop col: sourceB_Date action: Drop

rename type: findAndReplace col: * on: `sourceA_` with: '' matchAll: true

 

Choose Union and select recipe join B to union with join A. This should be the desired outcome.

 

I am still looking at better options.

 

Best

Vardan

 

 

Trifacta_Alumni
Alteryx Alumni (Retired)

Hi @Daniel Zito? Apologies for the above inefficient answer. I was overthinking too much.

Here's an easier way using Union. On both datasets add new recipes and add below 2 steps

 

settype col: Date customType: 'Datetime','dd-mm-yy','dd*mm*yyyy' type: custom

derive type: single value: 'B' as: 'source'. (Change the value to A for source A recipe)

 

  • Union both datasets from any recipe.
  • Next from the toolbar choose Functions -> Aggregations -> COUNTDISTINCT. Input source column as argument for the function and Group by Date. Add step to recipe.
  • Search transform, filter rows custom formula. Paste this on Conditions box countdistinct_source == 2 && source == 'A' , Choose delete matching rows. Add step

 

Hope this helps.

 

Best

Vardan

 

TrifactaUsers
10 - Fireball

Hi @Vardan Negi?  thank you very much for your massive effort and knowledge in data wrangling :)

 

Could you please re-write how I can proceed with the second method, I got a bit confused.

 

Thank you again ;)

TrifactaUsers
10 - Fireball

That was great! Congratulations! It seems working. Thanks a lot @Vardan Negi? :) That was also my idea but I didn't know how to put it into actions.

 

Thank you a lot ;)