Alteryx Machine Learning Discussions

caltang · 06-30-2023

I've got a use case that requires the PDF to Text tool and its OCR capabilities. The problem is that the files are not standardized because they include handwriting, which is often cursive, sometimes unintelligible, and sometimes written over the printed parts of the file.

The end goal is to parse out certain information from the files. I've processed a few and got some results, but I'd say that's only about 10% of the full stack...

How would one handle such a use case? Are there any examples out there from the Maveryx community?

P.S.: Sorry, I cannot share the PDFs. They contain sensitive PII that I cannot disclose. Looking for advice and guidance from the community!

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/

acarter881 · 07-07-2023

Hello, @caltang.

It depends on how many you have to do, how standardized the PDFs are, etc. I don't have much experience with the Intelligence Suite; however, your use case sounds too complex for a standard setup within Designer.

I suggest trying Google's Document AI: https://cloud.google.com/document-ai. You can upload some documents and test how well it's performing. There are other solutions, even others from Google, such as Cloud Vision: https://cloud.google.com/vision. If I were to try this in a programming language, I'd go for Python. It will likely involve a lot of setup, iterating, and research.
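For anyone who wants to try Document AI from Python, here is a minimal sketch, not a definitive implementation. It assumes the `google-cloud-documentai` package is installed, credentials are already configured, and an OCR processor has already been created in the Google Cloud console. The `project_id`, `location`, and `processor_id` arguments are placeholders you would fill in with your own values, and the helper names are made up for illustration.

```python
def api_endpoint_for(location):
    """Document AI uses regional endpoints, e.g. us-documentai.googleapis.com."""
    return f"{location}-documentai.googleapis.com"


def ocr_pdf_with_document_ai(pdf_path, project_id, location, processor_id):
    """Send one PDF to an existing Document AI processor and return its text.

    All three identifiers are placeholders for your own Google Cloud setup.
    """
    # Imported lazily so this file can be loaded without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=api_endpoint_for(location))
    )
    # Full resource name of the processor that will handle the request.
    name = client.processor_path(project_id, location, processor_id)

    # Read the PDF bytes and wrap them for online (synchronous) processing.
    with open(pdf_path, "rb") as f:
        raw_document = documentai.RawDocument(
            content=f.read(), mime_type="application/pdf"
        )

    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    result = client.process_document(request=request)
    # document.text holds the full recognized text, printed and handwritten.
    return result.document.text
```

Running it on a handful of representative files is probably the quickest way to judge how well it copes with the handwriting before committing to more setup.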

caltang · 07-07-2023

I’ll check out Document AI! Unfortunately, I don’t have an R&D team, and I don’t think the PDF to Text tool is advanced enough at this stage to handle that. I guess I’ll have to look outside of Alteryx for an alternative.

Thanks, @acarter881!

acarter881 · 07-07-2023

You're welcome, @caltang.

Some of the other large tech companies, such as Amazon (https://aws.amazon.com/textract/), have their equivalent services. I've found Document AI to be pretty impressive. Good luck! This is a fascinating topic. AI seems like the solution. :)

caltang · 07-07-2023

It seems to be a paid service... I'll have a look and see. Thanks, @acarter881!

gjjadhao · 07-12-2023

@caltang Python scripts can be useful for extracting text from PDFs. Libraries such as pdfminer, tabula, and camelot can be used for this purpose.
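As a rough illustration of the pdfminer route, here is a minimal sketch that assumes the `pdfminer.six` package. Note that pdfminer only reads the text layer already embedded in a PDF and does not perform OCR, so scanned or handwritten pages will come back nearly empty and would still need an OCR service. The `looks_scanned` heuristic, its `min_chars` threshold, and the field patterns are all hypothetical and would need adapting to the real documents.

```python
import re


def extract_embedded_text(pdf_path):
    """Return the embedded text layer of a PDF via pdfminer.six (no OCR)."""
    # Imported lazily so the helpers below work without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def looks_scanned(text, min_chars=20):
    """Heuristic: almost no extractable text suggests an image-only PDF."""
    return len("".join(text.split())) < min_chars


# Hypothetical patterns; the real field labels depend on the documents.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\w+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*[:\-]?\s*(\d{2}[/-]\d{2}[/-]\d{4})", re.IGNORECASE),
}


def parse_fields(text, patterns=FIELD_PATTERNS):
    """Pull the first match for each named pattern, or None if not found."""
    results = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        results[name] = match.group(1) if match else None
    return results
```

A typical flow would be to call `extract_embedded_text` on each file, route anything where `looks_scanned` returns True to an OCR service, and run `parse_fields` on the rest.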

roughchr · 11-01-2023

@gjjadhao Thanks for the tip. Are you able to share any more specifics, e.g. sample code for extracting text using libraries like pdfminer, tabula, or camelot?

Yiqundu · 11-08-2023

hi
