Random Sample to Aggregate leads to missing information

I am using an on-premises installation of Trifacta Enterprise (version 4.0.1+441).

 

We are migrating a SAS workflow to Trifacta and are currently validating the results of a workflow concerning the presence/absence of HIV drug resistance in public health databases.

 

I have attached a snippet of the raw data and highlighted a sequence of interest in yellow. I have highlighted a single mutation for this same individual in orange.

 

 

In brief, I am attempting to do the following:

  1. Create a column for each unique value in 'observedMutation'
  2. Place a 1 or 0 in said mutation column, depending on whether the mutation was observed for that row.

 

I have accomplished this with the following command:

pivot col: columnHeaders value: count() group: id limit: 142
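(For anyone reading along outside of Wrangle, the transformation I'm after is roughly equivalent to the pandas sketch below. It is illustrative only, with made-up values, and is not Trifacta syntax.)

    import pandas as pd

    # Toy stand-in for the raw data: one row per (id, observedMutation) pair.
    df = pd.DataFrame({
        "id": [1, 1, 2, 3],
        "observedMutation": ["K103N", "M184V", "K103N", "Y181C"],
    })

    # One indicator column per unique mutation observed anywhere in the data,
    # with a 1 if that id carries the mutation and a 0 otherwise.
    indicators = (
        pd.crosstab(df["id"], df["observedMutation"])
          .clip(upper=1)
          .reset_index()
    )
    print(indicators)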

 

This works as expected. However, when a certain mutation is sufficiently rare that it does not appear in the random sample, the pivot does not generate a new column for that 'unobserved' mutation.

 

Note: there are 142 unique mutation types that the WHO monitors, hence the limit of 142. Unfortunately, ~27 columns are being 'dropped' even though the data suggest they should not be.

 

 

I am not sure if this is helpful, but in these instances I am running it on the local Trifacta server (rather than our Cloudera cluster).

 

Can you tell me whether this is expected behavior for my current version of Trifacta? Can you give any suggestions for how to avoid dropping information that is not captured in the first sample?

 

Thank you in advance,

Ells

Trifacta_Alumni
Alteryx Alumni (Retired)

Hey Ells,

 

This is expected, as the pivot transformation is sample dependent. Since you are on version 4.0, you don't have access to some of the newer sampling methods that solve issues like this. Starting in version 4.1, you can take a stratified sample, which pulls in at least one occurrence of every unique value in a column (so you would have at least one row for every mutation in your dataset). Then, when you perform a pivot, it would create a new column for each mutation.
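To make the idea concrete, a stratified sample guarantees something like the following (a rough pandas sketch of the concept, not how Trifacta implements it internally; the function and column names are just placeholders):

    import pandas as pd

    def stratified_sample(df, strat_col, n_total, seed=0):
        # Guarantee at least one row for every unique value in strat_col
        # (e.g. one representative row per mutation) ...
        representatives = df.groupby(strat_col).head(1)
        # ... then top the sample up with random rows until n_total is reached.
        remainder = df.drop(representatives.index)
        n_fill = max(n_total - len(representatives), 0)
        filler = remainder.sample(n=min(n_fill, len(remainder)), random_state=seed)
        return pd.concat([representatives, filler])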

 

If you are interested, we would love to get you up to the latest version of Trifacta so you have access to the new sampling methods that solve this particular use case. Let me know if this is something you would like to follow up on.

 

Best,

David

 

 

Hi David, I expected that this would be the response but wanted to make sure.

 

Because we are a government institution, it can be difficult to get folks badged and onto campus to perform the upgrade. Given the big changes after our version, we need to upgrade our Cloudera prior to upgrading Trifacta...which is a heavy lift that requires on-site attention.

 

In the meantime, we're going to move ahead to validate another section of our SAS -> Trifacta pipeline migration.

Actually, I have one more question before closing this out:

I'm running Cloudera CDH version 5.7.5 (#3 built by jenkins on 20161101-2028 git: 56bbe2710d05f83c2f42f198fdb9436d08691716)

 

Can I upgrade to Trifacta 4.1 with this CDH to get around this problem, while we work through our physical access issues to complete the upgrade to the latest version?

Trifacta_Alumni
Alteryx Alumni (Retired)

Hey Ells,

 

I'm not sure whether that version is supported or not; let me ask @Sebastian Cyris, who has more expertise in that realm 🙂

 

One thing I was thinking might be a possible way around your first question (bear with me, this is a bit of a hack and I haven't tested it myself on version 4.0) is to do the following:

 

  1. derive value: rownumber() order: <choose any column, but if you have a date column that's a good one to use> group: <unobserved_mutations> as: rownumber
  2. keep value: rownumber == 1
  3. take a new random sample
  4. pivot the unobserved_mutations
  5. delete the keep step from your recipe

 

My thought is that if you force the sample to contain one of each of the unique values in your unobserved mutations column, then the pivot will have each unique value hardcoded into it. Then, when you run the job, you won't run into the issue of the column being sample dependent. It's basically a way of collecting a stratified sample (which we now have natively in the product).
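For what it's worth, steps 1 and 2 above boil down to something like this in pandas terms (purely illustrative; 'collectionDate' is a hypothetical stand-in for whatever ordering column you pick):

    import pandas as pd

    # Toy stand-in for the full dataset.
    df = pd.DataFrame({
        "id": [1, 1, 2, 3],
        "observedMutation": ["K103N", "M184V", "K103N", "Y181C"],
        "collectionDate": ["2016-01-03", "2016-01-03", "2016-02-11", "2016-03-29"],
    })

    # Step 1: number the rows within each mutation group (the rownumber()/group step).
    df = df.sort_values("collectionDate")
    df["rownumber"] = df.groupby("observedMutation").cumcount() + 1

    # Step 2: keep only the first row per mutation, which forces every mutation
    # to be represented before the new random sample is drawn.
    forced = df[df["rownumber"] == 1]
    print(forced)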

 

Give this a shot and let me know if it works! (I hope it works and I don't send you down a rabbit hole of trying this out 😂 )

Thanks David, I welcome the suggestion and will give it a shot early next week and let you know. Division rules are such that these data cannot be accessed off-site...and the weekend has arrived, so we'll have to hold 'til Monday! That being said, I see no reason why it won't work; it's just a matter of forcing representation.


Another thought...

These 30,000 rows are a subset of about 700,000 rows in a larger dataset. Might it be easier to get minimal representation of each mutation by throwing more data at it? Thoughts on whether this improves or reduces the odds...my intuition fails me here...
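If it helps frame the question, here's the rough back-of-the-envelope I've been staring at (assuming the sample size stays fixed and each mutation keeps roughly the same relative frequency in the larger set; the counts and sample size below are made up):

    # Chance that a mutation with k occurrences among N rows is missed entirely
    # by a simple random sample of n rows; roughly (1 - n/N) ** k when k << N.
    def p_missed(N, k, n):
        return (1 - n / N) ** k

    # 30,000-row subset vs. the ~700,000-row superset, same fixed sample size
    # and a mutation at ~0.05% frequency in both (k = 15 vs. k = 350).
    print(p_missed(30_000, 15, 10_000))    # ~0.002
    print(p_missed(700_000, 350, 10_000))  # ~0.007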


I welcome any thoughts regarding versioning from @Sebastian Cyris as well!


Have a great weekend, guys!


Trifacta_Alumni
Alteryx Alumni (Retired)

Hi @Ellsworth Campbell,

We only certified CDH 5.10 - 5.12 with 4.1.x. For reference, see the 4.1 System Requirements docs at https://docs.trifacta.com/display/r041/System+Requirements?os_username=tr9r041usr%24&os_password=^gTa%403)04k--

 

That said, we have seen systems that have been set up with older releases. Keep in mind that your mileage may vary, since this configuration has not been QA'd. For a quick sanity check, I would install your CDH version and Trifacta in a test environment to ensure that it works against all your ingestion and publishing endpoints.

 

Another option would be spinning up a test environment with the latest supported CDH and Trifacta 5.0.1. See https://docs.trifacta.com/display/r050/System+Requirements?os_username=tr29r050usr%26&os_password=*1ads%40flfg%40npa(%3DPPKz for the System Requirements of the latest 5.x version.

While one could argue that, as a test environment, this is not production ready, it would allow you to validate your needs against the latest release. Overall, it would expedite your migration to the latest version while letting you validate the latest Wrangle features. I would recommend checking in with your designated Customer Success Manager, so they can identify the best options and also provide guidance on how to take advantage of our Architects and Services teams.

 

@David McNamara Close, but no cigar. It did exactly what you suggested it would; however, now we're running into another issue...

 

The final output of the run has exactly the same number of rows as the final random sample, which is about 20,000 fewer than the full count. This is true even after removing the keep and rownumber derivation steps.

 

Any idea of the cause of this new behavior? I can't seem to trace it back...