Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Cloud Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Cloud.
SOLVED

How to extract data from Unstructured format (Multiple Flat file - Daily files needs to be consolidated as one Monthly report).

TrifactaUsers
10 - Fireball

Conditions: Need to look for 'Report-47' and then 'Monetary Instrument Sales'. It it's then extract the values for 'Purchaser Account Number', 'Cash In', 'Reason Code', 'STAT' as highlighted. The flat file is just 1% of daily report so we have to read through complete file for each day then consolidate other daily files to get a monthly report. I am a new Trifacta user. Any advise on this will be highly appreciated!. Thanks in advance!

 

7 REPLIES 7
Trifacta_Alumni
Alteryx Alumni (Retired)

Hey Sundar!

 

Attached is a Recipe to parse this file. Here is output + recipe panel opened

So the essence of working with unstructured data like this is to find patterns and positions that you can use to start to break the data out into rows and columns. I'm not going to walk through every step in the recipe in depth because it is extensive, but to summarize the steps taken:

  1. Remove Structure from the dataset. Trifacta tries to infer structure (works well with csv's, excel, json, etc... not so much with unstructured text 🙂), we want to remove that structure to get all of the data into a single cell
  2. splitrows on 'MONETARY INSTRUMENT SALES'. the splitrows step converts data into rows, using MONETARY INSTRUMENT SALES allows us to create a new row every time a new monetary instrument sale is hit.
  3. Create an array, where each element is separated by \r\n (carriage return and new line). This will allow us to then flatten the array (converting each element into a single row) thus getting the data to a state that we can start to split into columns. We're also nesting this array with an index, that we will use later on.
  4. Flatten the array as noted above, unnest to get the index into it's own column
  5. split columns based on positions. This assumes that the relative format of the above file stays the same (as in, the system producing these files produces them in the same unstructured format every time. So containing some structure...)
  6. remove the noise. only keep information on the 7th and 9th lines of each instrument sales invoice. These lines contain the info on 'Purchaser Account Number', 'Cash In', 'Reason Code', 'STAT' 
  7. back fill values so there is at least one row containing all of the above values from step 6 (for instance, at index 1).
  8. keep row where index is 1 (thus getting only one row per sales invoice)
  9. delete unimportant columns and rename fields

 

Thanks to @Alon Bartur?  who came up with this solution. Shameless pitch, but the great part about using Trifacta to solve this was that seeing a preview of every step we are creating really allowed us to validate that the steps we were taking were moving us in the right direction. All in all it didn't take too long to solve (i'd be hopeless trying to do this with python 😅).

 

Go ahead and give it a shot and let us know if it helps!!

Trifacta_Alumni
Alteryx Alumni (Retired)

Wow, outstanding solution @David McNamara? 

The common strategy with this and similar “structured” data is to turn it all down and rebuild.

TrifactaUsers
10 - Fireball

?Excellent! Thank you so much @David McNamara?  for the solution. I really appreciate all your efforts for such a quick turnaround. I am preparing recipe now using release 4.2 and post the update very soon. Thanks again!

TrifactaUsers
10 - Fireball

Hi David, Can you post the image of the remaining steps of the recipe. I can see until step 17 and want to see the remaining steps for validation. Thanks @David McNamara? ?

Trifacta_Alumni
Alteryx Alumni (Retired)

Hey @Sundar Jagan? , no problem!

Here are the last few steps:

I've also attached the flow in version 4.2 to this comment. You should be able to import the flow (go to flow view, select more options (...), click Import, and choose the attached .zip file) and then simply replace the source file with you flat file example, that way you don't have to recreate the recipe from scratch 🙂

 

Hope this helps!

David

TrifactaUsers
10 - Fireball

Amazing! Thanks a lot @David McNamara? . Really appreciate for attaching the flow for version 4.2 it's very much helpful. Hats off again to both of you @David McNamara?  and @Alon Bartur? . Just wondering if you would recommend any resources to extract the data in a same way from PDF.?

Trifacta_Alumni
Alteryx Alumni (Retired)

Hi Sundar,

 

No resources that I know of, unfortunately. I can try pull together some information and send it along to you. I will add to my todo for next week.

 

Have a nice weekend :)

David