Alteryx Machine Learning Discussions

SOLVED

PDF to Text

caltang
17 - Castor

I've got a use case that requires the PDF to Text tool and OCR capabilities. The catch is that the files are not standardized: they include handwriting, some of it cursive or illegible, and sometimes written over the printed parts of the page.

The end goal is to parse certain information out of each file. I've tried a few and got some results, but I'd say it's about 10% of the full stack...

How would one handle a use case like this? Are there any examples out there from the Maveryx community?

P.S.: Sorry, I can't share the PDFs; they contain sensitive PII that I cannot disclose. Looking for advice and guidance from the community!

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
7 REPLIES
acarter881
12 - Quasar

Hello, @caltang.

It depends on how many files you have to process, how standardized the PDFs are, and so on. I don't have much experience with the Intelligence Suite; however, your use case sounds too complex for a standard setup within Designer.

I suggest trying Google's Document AI: https://cloud.google.com/document-ai. You can upload some documents and test how well it performs. There are other solutions, including others from Google such as Cloud Vision: https://cloud.google.com/vision. If I were to try this in a programming language, I'd go for Python. It will likely involve a lot of setup, iteration, and research.
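
For a sense of what the Python route might look like, here is a minimal sketch, assuming the `google-cloud-documentai` client library is installed and an OCR processor already exists in a Google Cloud project. The project ID, location, and processor ID are placeholders you would supply; `document_lines` and `split_by_confidence` are hypothetical helper names; and the 0.8 threshold is an arbitrary starting point, not a recommended value. The idea is to keep the lines the service is confident about and send the rest (often the handwritten parts) to manual review.

```python
def process_pdf(path, project_id, location, processor_id):
    """Send one PDF to a Document AI processor and return the parsed document."""
    # Imported lazily so the helpers below work without the Google libraries.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai  # pip install google-cloud-documentai

    # Processors are regional, so the endpoint is built from the location.
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    client = documentai.DocumentProcessorServiceClient(client_options=opts)
    name = client.processor_path(project_id, location, processor_id)

    with open(path, "rb") as f:
        raw = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

    request = documentai.ProcessRequest(name=name, raw_document=raw)
    return client.process_document(request=request).document


def document_lines(document):
    """Flatten a Document AI result into (text, confidence) pairs, one per line."""
    pairs = []
    for page in document.pages:
        for line in page.lines:
            segments = line.layout.text_anchor.text_segments
            text = "".join(
                document.text[int(s.start_index):int(s.end_index)] for s in segments
            )
            pairs.append((text.strip(), line.layout.confidence))
    return pairs


def split_by_confidence(pairs, threshold=0.8):
    """Split lines into (accepted, needs_review). The threshold is an assumption."""
    accepted = [p for p in pairs if p[1] >= threshold]
    needs_review = [p for p in pairs if p[1] < threshold]
    return accepted, needs_review
```

Usage would be along the lines of `split_by_confidence(document_lines(process_pdf("form.pdf", "my-project", "us", "my-processor-id")))`, where every argument is a placeholder. Testing this on a few non-sensitive sample files first should show whether the threshold needs tuning.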

caltang
17 - Castor

I’ll check out Document AI! Unfortunately, I don’t have an R&D team, nor do I think the PDF to Text tool is advanced enough at this stage to do that. I guess I’ll have to look outside Alteryx for an alternative.

Thanks, @acarter881!

acarter881
12 - Quasar

You're welcome, @caltang.

Some of the other large tech companies, such as Amazon (https://aws.amazon.com/textract/), have equivalent services. I've found Document AI to be pretty impressive. Good luck! This is a fascinating topic. AI seems like the solution. :)

caltang
17 - Castor

It seems to be a paid service... I'll take a look. Thanks, @acarter881!

gjjadhao
9 - Comet

@caltang Python scripts can be useful for extracting text from PDFs. Libraries such as pdfminer, tabula, and camelot can be used for this purpose.

roughchr
6 - Meteoroid

@gjjadhao Thanks for the tip. Are you able to share any more specifics, e.g. sample code for extracting text with libraries such as pdfminer, tabula, or camelot?
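
As a starting point for the kind of sample code asked about here, a minimal sketch follows, assuming `pdfminer.six` and `camelot-py` are installed. One caveat: both libraries read the text layer embedded in a PDF, so they would not pick up handwriting or scanned pages unless an OCR step has been run first. The field names and regular expressions in `FIELD_PATTERNS` are purely hypothetical and would need to be replaced with patterns that match your own documents.

```python
import re


def pdf_text(path):
    """Return the embedded text layer of a PDF using pdfminer.six."""
    # Imported lazily so parse_fields() can be used without pdfminer installed.
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    return extract_text(path)


def pdf_tables(path, pages="1"):
    """Return each table camelot detects on the given pages as a DataFrame."""
    import camelot  # pip install camelot-py
    return [table.df for table in camelot.read_pdf(path, pages=pages)]


# Hypothetical fields: these patterns are illustrative assumptions only.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\S+)", re.I),
    "date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
}


def parse_fields(text):
    """Pull the first match for each pattern; missing fields map to None."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        found[name] = match.group(1) if match else None
    return found
```

A typical call would be `parse_fields(pdf_text("some_file.pdf"))`, with the file name as a placeholder. Fields that come back as `None` are a quick way to spot documents that need a different pattern or an OCR pass.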

Yiqundu
5 - Atom

hi