
SOLVED

Performance: remove columns or rows first?

Hi all,

 

My dataset is a Hive file with approx. 2,000 columns and several million rows. I need Trifacta to keep only some of the columns (about 500) and approx. 10% of the rows, based on a filter on the values of one column.

What is more efficient, removing useless columns first or filtering rows first?

(The column used for filtering is kept in the final output.)

 

Thanks for your advice.

MM

2 REPLIES
Trifacta_Alumni
Alteryx Alumni (Retired)

Hi, Michael--

 

That's a wide dataset. I would drop columns first.

 

Keep in mind that what you see on screen in the Transformer page is a sample. The sample includes every column in the dataset in some form, so with this many columns your initial sample will contain fewer rows. After you drop your columns, you should take another sample, which will bring back a larger number of rows.
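To make the sampling point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the sample is capped by a fixed size budget; the 10 MB budget and the 8 bytes per cell are illustrative assumptions, not documented product values, and `rows_in_sample` is a hypothetical helper.

```python
def rows_in_sample(budget_bytes, n_columns, avg_bytes_per_cell):
    """Estimate how many rows fit in a sample of a fixed byte size."""
    bytes_per_row = n_columns * avg_bytes_per_cell
    return budget_bytes // bytes_per_row


BUDGET = 10 * 1024 * 1024  # assumed sample budget (illustrative only)
AVG_CELL = 8               # assumed average bytes per cell (illustrative only)

wide = rows_in_sample(BUDGET, 2000, AVG_CELL)   # before dropping columns
narrow = rows_in_sample(BUDGET, 500, AVG_CELL)  # after keeping about 500 columns

print(f"rows in sample with 2000 columns: {wide}")
print(f"rows in sample with 500 columns:  {narrow}")
```

Under these assumptions, keeping a quarter of the columns lets roughly four times as many rows fit in the same sample, which is why a fresh sample after the drop is worthwhile.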

 

Here's a good topic on removing data from your dataset.

 

https://docs.trifacta.com/display/SS/Remove+Data

 

You can remove a range of columns in a single step. The operative character is the tilde (~), which is how you specify a range of columns. See the second example here.

 

https://docs.trifacta.com/display/SS/Remove+Data#RemoveData-Dropcolumns
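As an illustration of what a column range means, the Python sketch below expands a `first~last` spec into the inclusive run of columns between the two names, then drops them. This only mimics the idea; it is not how the product implements it, and `expand_range`, `drop_columns`, and the column names are hypothetical. The exact syntax to use in a recipe is in the linked documentation.

```python
def expand_range(columns, spec):
    """Expand 'first~last' into the inclusive run of column names between them."""
    first, last = spec.split("~")
    start, end = columns.index(first), columns.index(last)
    if start > end:
        raise ValueError(f"{first!r} comes after {last!r} in the column order")
    return columns[start:end + 1]


def drop_columns(columns, spec):
    """Return the columns that remain after dropping the given range."""
    doomed = set(expand_range(columns, spec))
    return [c for c in columns if c not in doomed]


cols = ["id", "name", "addr1", "addr2", "addr3", "status"]
remaining = drop_columns(cols, "addr1~addr3")
print(remaining)
```

The range depends on the current column order, so it is worth checking the order before dropping a large span in one step.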

 

After you have narrowed your dataset, you can work on filtering out rows.
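For what it's worth, the order only changes the intermediate work and what you see in the sample, not the final result, as long as the filter column is one of the columns you keep. A minimal Python sketch on made-up data (the rows, column names, and helper functions are all hypothetical):

```python
rows = [
    {"id": 1, "status": "active", "junk_a": "x", "junk_b": "y"},
    {"id": 2, "status": "closed", "junk_a": "x", "junk_b": "y"},
    {"id": 3, "status": "active", "junk_a": "x", "junk_b": "y"},
]
KEEP = ["id", "status"]  # the filter column "status" is kept


def keep_columns(rows, keep):
    """Project each row down to the columns in `keep`."""
    return [{k: r[k] for k in keep} for r in rows]


def filter_rows(rows, column, value):
    """Keep only the rows whose `column` equals `value`."""
    return [r for r in rows if r[column] == value]


# Drop columns first, then filter rows.
cols_first = filter_rows(keep_columns(rows, KEEP), "status", "active")
# Filter rows first, then drop columns.
rows_first = keep_columns(filter_rows(rows, "status", "active"), KEEP)

print(cols_first == rows_first)
```

Both orders produce the same two rows here, so the choice comes down to which order is more comfortable to work with in the sample.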

 

Hope that helps.

 

Cheers,

-SteveO

 

Hello Steve,

Thanks for this thorough answer. It is very useful.

 

Have a nice day.