Alteryx Designer Cloud Discussions

48caa5e6b034c457c5c5 · ‎01-16-2020

Assume I have a data set as follows:

If I perform a group-by and count, as follows:

I obtain the following result:

However, my end-users/customers require the following:

How can my desired outcome be achieved within Trifacta?

Trifacta_Alumni · ‎01-17-2020

Hi @Ellsworth Campbell? I have written a recipe to solve the above problem and it might need a bit of an explanation from my end. I am posting the steps here as below;

1. derive type: multiple value: ROWNUMBER() order: $sourcerownumber as: 'rownumber'

2.derive type: multiple value: COUNTIF(gender == 'M') group: {cluster id} as: 'male count'

3.derive type: multiple value: COUNTIF(gender == 'F') group: {cluster id} as: 'female count'

4.derive type: single value: IF({female count} == 0, {male count}, {female count}) as: 'non zero gender count'

5.derive type: multiple value: IF(rownumber == COUNT() && ({female count} == 0 || {male count} == 0), {non zero gender count} + 1, {non zero gender count}) group: {cluster id} as: 'range gap'

6.derive type: single value: RANGE({non zero gender count}, {range gap} + 1, 1) as: 'range output'

7.flatten col: {range output}

8.derive type: multiple value: {range output} - {non zero gender count} as: 'gender impute'

9.case condition: customConditions cases: [gender == 'M' && {gender impute} == 1,'F'],[gender == 'F' && {gender impute} == 1,'M'] default: gender as: 'Gen'

10.derive type: multiple value: COUNT() group: {cluster id},Gen as: 'Count'

11.set col: Count value: IF({gender impute} == 1, 0, $col)

12.drop col: rownumber, {female count}, {male count}, {non zero gender count}, {range gap}, {range output}, {gender impute}, gender, id action: Drop

You can copy a single step and paste them by clicking on Search Transformation or New Step inside the Transformer. I would advice you to please go over each step carefully to fully understand the step and it's purpose.

Please also note this may not the the most optimal way. But to save you time for now, you can work with this recipe.

Do let me know if you need clarity at any point.

Best,

Vardan

48caa5e6b034c457c5c5 · ‎01-17-2020

Hi Vardan, that is quite complex, and will be difficult to scale back up to the original problem but I can give it a shot.

That being said, some of the syntax here suggests that this may only be available in a later version of Trifacta than I currently have installed. Can you determine whether or not this is available in version: 6.4.1+251.20190903160643.f5987d6.

Trifacta_Alumni · ‎01-17-2020

Hi @Ellsworth Campbell? I agree the answer may look overwhelming at first glance. However there are actually 2 steps i.e Step 5 and Step 6 that form the backbone of the solution. A good rule of thumb to remember when confronted with cases to create additional rows based on a logic; Arrays especially Range Function are very useful.

All other steps only amplify the underlying logic to help us set up the solution and get there.

Having said that, the steps has no special bearing with version 6.8 or above. I would like to know scalability aspect of the problem if you don't mind. It would help to know the size of the datasets( in terms of columns and rows), a little background of your business use-case. There are also best practices to shape and format the dataset before and after we apply the above steps depending upon on the case.

Best

Vardan

48caa5e6b034c457c5c5 · ‎01-17-2020

The scalability aspect is that I need to perform a similar computation across ~12 different variables, with multiple categories per field. Note that, for the purposes of this example, I have removed many of the additional group-by characteristics.

What follows is a slightly more complex example that includes the features of interest.

For example, assume a data structure:

We want to group by the cluster, state, jurisdictionCount, and ageCategory. The desired result is:

I've also attached these as CSV files, in case that is more helpful.

We are working with a dataset that has 10's of millions of rows and roughly ~250 columns. I am calculating aggregate demographic attributes for ~12 columns (age, gender, risk factor, etc).

I work in public health and infectious disease surveillance.

48caa5e6b034c457c5c5 · ‎01-17-2020

Including example input and output as CSVs.

Trifacta_Alumni · ‎01-17-2020

I might do it like this:

pivot col: clusterID group: gender value: counta(gender) limit: 50

unpivot col: counta_gender_1,counta_gender_2 groupEvery: 1

replacepatterns col: key with: '' on: `{start}{alphanum-underscore}{14}`

sort order: key

rename type: manual mapping: [key,'cluster_id'],[value,'count']

Trifacta_Alumni · ‎01-17-2020

I'll take a look at your 2nd example and see if I can replicate it.

Trifacta_Alumni · ‎01-17-2020

pivot col: ageCategory group: clusterID,stateJurisdiction,jurisdictionCount value: counta(stateJurisdiction) limit: 50

unpivot col: {counta_stateJurisdiction_13-19},{counta_stateJurisdiction_20-29},{counta_stateJurisdiction_40-49},{counta_stateJurisdiction_60+} groupEvery: 1

replacepatterns col: key with: '' on: `{start}{alphanum-underscore}{25}`

48caa5e6b034c457c5c5 · ‎01-17-2020

Zach,

Thank you! I think that this may get me moving in the right direction. Because we'll be leveraging sampling, we expect that all possible values will be represented in the Trifacta sample. From there I can leverage the Pivot/Unpivot capabilities to count/group, respectively.

Hopefully this will get us past this hurdle!

--Ells

Alteryx Designer Cloud Discussions

Achieving zero counts with group-by function