Alteryx Designer Cloud Discussions

SOLVED

Hi everyone, I'm trying to join two datasets, but the inner join does not return all rows.

gjunhao96
8 - Asteroid

The left dataset contains customer information and has 1000 rows (1000 customers). The right dataset contains order information, including the customer ID, and has 5000 rows (5000 orders). I'm sure that every customer ID in the order dataset exists in the customer dataset, so the inner join should return 5000 rows.
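To make the expectation concrete, here is a minimal pandas sketch of the same situation. It does not use Trifacta at all; the column names (`customer_id`, `order_id`, `name`) and the generated data are assumptions made up for illustration, not the real datasets.

```python
import pandas as pd

# Hypothetical stand-in for the customer dataset: 1000 unique customers.
customers = pd.DataFrame({"customer_id": range(1, 1001)})
customers["name"] = "customer_" + customers["customer_id"].astype(str)

# Hypothetical stand-in for the order dataset: 5000 orders,
# each pointing at one of the 1000 known customers.
orders = pd.DataFrame({
    "order_id": range(1, 5001),
    "customer_id": [(i % 1000) + 1 for i in range(5000)],
})

# Check the assumption that every order references a known customer.
all_ids_known = orders["customer_id"].isin(customers["customer_id"]).all()

# Inner join; validate="one_to_many" raises if customer_id is not unique on the left.
joined = customers.merge(orders, on="customer_id", how="inner", validate="one_to_many")

print(all_ids_known, len(joined))
```

If every order matches exactly one customer, the inner join keeps one row per order, which is why 5000 rows is the expected result.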

 

However, what I get from Trifacta is only 1790 rows. If I run the entire flow, my results return the intended 5000 rows. Also, if I collect a new sample of the data, the new sample returns 5000 rows as well.

Thus, I was wondering why Trifacta chooses to return only 1790 rows instead of 5000. I'm assuming that 5000 rows is not very large, so there should be no need to sample the data?

Also, is there a way to force Trifacta to return all 5000 rows?

Attached is a screenshot of the join step for reference. Thank you.

3 REPLIES
AMiller_Tri
Alteryx Alumni (Retired)

Hi @Jun Hao Goh, thank you for your question =]

In this scenario, Trifacta prefers to take you back to the transformer grid as soon as possible.

The fact that a newly collected sample shows all 5,000 records is proof that Trifacta's transformer grid can "handle" that amount of data; perhaps the data doesn't cross the 10 MB sample threshold.

However, when Trifacta 'trims' the output of a join so that the grid view isn't overloaded with too large a sample, it might reduce the sample to something smaller than its upper limit. This fits the logic of 'taking you back to the action' as soon as possible.

As the documentation notes:

"NOTE: Unnest, union, or join transforms may significantly increase the number of rows or columns in your dataset. To prevent overloading the browser's memory, the application may apply a limit function to the results to artificially limit the number of rows displayed in your sample. You can generate a new sample if desired. This limitation is not applied during the job execution."
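As a rough illustration of the behavior that note describes, here is a small pandas sketch, not Trifacta's actual implementation. The function `join_orders`, the constant `PREVIEW_ROW_LIMIT`, and the column names are all hypothetical, and the limit value is simply chosen to mirror the 1790 rows seen in the question; the real limit is internal to the application.

```python
import pandas as pd

# Hypothetical display limit, chosen only to mirror the 1790 rows in the question.
PREVIEW_ROW_LIMIT = 1790


def join_orders(customers, orders, row_limit=None):
    """Inner-join orders to customers; optionally cap the rows, like a preview would."""
    joined = customers.merge(orders, on="customer_id", how="inner")
    if row_limit is not None:
        # Preview path: keep only the first row_limit rows of the join output.
        joined = joined.head(row_limit)
    return joined


# Made-up data with the same shape as the question: 1000 customers, 5000 orders.
customers = pd.DataFrame({"customer_id": range(1, 1001)})
orders = pd.DataFrame({
    "order_id": range(1, 5001),
    "customer_id": [(i % 1000) + 1 for i in range(5000)],
})

preview = join_orders(customers, orders, row_limit=PREVIEW_ROW_LIMIT)  # what the grid might show
full_job = join_orders(customers, orders)                              # what a job run would produce

print(len(preview), len(full_job))
```

The join logic is identical in both calls; only the preview path applies the cap, which is why running the flow (or collecting a new sample) gives the full 5000 rows.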

 

I hope that makes sense; please feel free to ask any additional questions.

 

Thanks,

Amit.

umaidk
8 - Asteroid

Good question

umaidk
8 - Asteroid

Good question