
Alteryx Designer Cloud Discussions


How to fail the Flow when expected columns are missing from the data source.

HariB1
5 - Atom

Let's say we are reading an Excel file from S3 (an external process updates this Excel daily). Assume that in the Flow's first run it has 20 columns, and in the next run it has only 19 because one column is missing. How can we verify the column differences and fail the flow run automatically?

AMiller_Tri
Alteryx Alumni (Retired)

Hi @Hari Bachala, thank you for your question! =]

 

Option 1 - Schema validation feature

The straightforward way would be to use the Schema validation feature, which exists in newer versions of Trifacta.

However, this feature is currently limited: schema validation applies only to sources that have published schemas (relational data sources and schematized file types). In future versions, schema validation will also support CSVs.

You can read through the schema management and validation documentation here.

Also, see this great article about schema drift validation.

 

Option 2 - Add a validating recipe step

An alternative solution is to add a step that doesn't necessarily affect any column but refers to all of them.

For example, a find-and-replace step that searches for a character that never appears, or a renaming step that keeps all of the columns' original names.

 

That way, if a column is missing in a subsequent run, you can decide to make the job fail, as the step described above would turn red (due to the missing column).
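The idea behind this "step that references every column" can also be expressed outside Trifacta as a fail-fast column check. A minimal Python sketch, where the column names are purely hypothetical stand-ins for your 20 expected columns:

```python
# Minimal sketch of a fail-fast column check, mirroring the idea of a recipe
# step that references every expected column. Column names are hypothetical.
EXPECTED_COLUMNS = {"order_id", "customer", "amount"}

def check_columns(actual_columns):
    """Raise if any expected column is absent, so the run fails loudly."""
    missing = EXPECTED_COLUMNS - set(actual_columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return True

check_columns(["order_id", "customer", "amount"])  # all present: passes
```

Extra columns are ignored here; only a missing column makes the check raise, which matches the "step turns red when its column disappears" behavior.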

 

Option 3 - Schema drift detection template

You can use this template flow from our templates page to validate your file data against an expected schema, or to detect when the data's schema has drifted from what was expected. It makes use of Trifacta's ability to import data as-is, without applying the inferred row-splitting technique, and compares it to an expected schema's headers through a join.
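The header comparison that the template performs through a join can be sketched as a set comparison. This is an illustrative Python snippet with placeholder names, not the template's actual implementation:

```python
def schema_drift(expected_headers, actual_headers):
    """Compare two header lists and report columns removed or added."""
    expected, actual = set(expected_headers), set(actual_headers)
    return {
        "missing": sorted(expected - actual),     # columns the new file lost
        "unexpected": sorted(actual - expected),  # columns the new file gained
    }

report = schema_drift(["a", "b", "c"], ["a", "c", "d"])
# report == {"missing": ["b"], "unexpected": ["d"]}
```

An empty report on both sides means the schema matches; anything else is drift you can act on (for example, by failing the run).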

 

Option 4 - Use plans to first try to write your Excel to an already existing table

You can create a table, even an empty one (or one with just a single record), that abides by your desired schema. For example, write the original Excel data into a table in whatever native database you're working with (you gave an example of S3, so I'd assume you have access to AWS's Redshift database as well). Then, create a plan. The plan's first step will simply be to try to write an empty recipe (derived from the updated Excel) to that table. If columns were added or removed from that Excel, the job (the plan task) would fail, since schemas in relational databases are static.

From there, you can branch depending on whether the first task failed or succeeded. A plan is flagged as failed if at least one task has failed, which will be the situation once the schema changes (again, a change in column count or column names).
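The principle this option relies on (a relational table rejecting writes that don't match its schema) can be demonstrated with any database. A small sketch using SQLite, where the table and column names are made up:

```python
import sqlite3

# A static-schema table stands in for the Redshift table in the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE landing (order_id INTEGER NOT NULL, "
    "customer TEXT NOT NULL, amount REAL NOT NULL)"
)

def write_row(columns, values):
    """Attempt a write; a schema mismatch makes the database reject it."""
    placeholders = ", ".join("?" for _ in values)
    conn.execute(
        f"INSERT INTO landing ({', '.join(columns)}) VALUES ({placeholders})",
        values,
    )

write_row(["order_id", "customer", "amount"], [1, "acme", 9.5])  # matches: ok

try:  # an added column is rejected outright
    write_row(["order_id", "customer", "amount", "extra"], [2, "x", 1.0, "?"])
except sqlite3.OperationalError as e:
    print("added column rejected:", e)

try:  # a removed NOT NULL column is rejected too
    write_row(["order_id", "customer"], [3, "y"])
except sqlite3.IntegrityError as e:
    print("removed column rejected:", e)
```

Note the NOT NULL constraints: without them, a removed column would silently load as NULL rather than failing, so only the added-column case would be caught.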

 

I hope that makes sense; please feel free to ask any additional questions.

 

Thanks,

Amit.

HariB1
5 - Atom

Thank you, Amit, for your quick response.

Options 1 and 3 are not available in version 8.2 of Trifacta; I will try option 2.

Are there any major changes after v8.2?

 

 

AMiller_Tri
Alteryx Alumni (Retired)

Hi @Hari Bachala, options 1 and 3 will be available in newer versions.

However, I'd suggest you also try option 4, as it will also account for columns that are added, not just for columns that are removed.

 

Thanks,

Amit.

HariB1
5 - Atom

Sure. However, option 4 may not work for us, as we are planning to schedule the flow using Airflow. I see there are many missing features in the custom AMI (Trifacta installed in our AWS account), for example: 1. schema validation, 2. environment variable parameterization, 3. recipe output parameterization ( {dbname}.table ).