PDF Text Extraction using OCR

How do you process large numbers of documents programmatically while keeping a human in the loop?

Issue:

Manually extracting data points from more than 30 unique PDF forms.

Solution:

A system that processes a package of PDF documents, pulls the required data points, and sends the results for human review. Once the review is complete, the results are stored in a database. Amazon SageMaker provides an interface for exactly this workflow: Augmented AI (A2I).

Here is a step-by-step walkthrough of the event-driven architecture deployed on AWS:

  1. Document packages land in an S3 bucket, triggering an AWS Lambda function.

  2. The core code, deployed as a Docker container on Lambda, downloads the file from the bucket.

  3. The document is sent to Amazon Textract, which extracts and returns the text and tables.

  4. The code then categorizes each page and applies the appropriate parsing logic. AWS maintains two excellent packages for location-based strategies to find data points: amazon-textract-geofinder and amazon-textract-response-parser.

  5. The results are submitted to an Amazon Augmented AI human review loop.

  6. Workers in the private work team receive an email notifying them that new documents are available for review.

  7. When a document review is submitted, another S3 event is triggered, creating a record in a PostgreSQL database hosted on AWS.
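Steps 3 and 4 hinge on walking the JSON that Textract returns: a flat list of `Blocks`, each with a `BlockType`, a `Page` number, and (for text blocks) a `Text` field. The sketch below, using that documented response shape, groups line text by page and applies a simple keyword rule to categorize each page; the `categorize_page` rules are purely illustrative, not the actual parsing logic used in the system.

```python
def lines_by_page(textract_response):
    """Group LINE block text by page number from a Textract-shaped response."""
    pages = {}
    for block in textract_response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # Single-page responses may omit "Page"; default to page 1.
            pages.setdefault(block.get("Page", 1), []).append(block["Text"])
    return pages

def categorize_page(lines, rules):
    """Return the first category whose keyword appears on the page.
    `rules` maps category name -> keyword (illustrative stand-in for the
    real per-form parsing dispatch)."""
    text = " ".join(lines).lower()
    for category, keyword in rules.items():
        if keyword.lower() in text:
            return category
    return "unknown"

# Minimal synthetic response in Textract's shape:
response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "Invoice Number: 1234"},
        {"BlockType": "LINE", "Page": 1, "Text": "Total Due: $56.00"},
    ]
}
pages = lines_by_page(response)
print(categorize_page(pages[1], {"invoice": "invoice number"}))  # → invoice
```

In practice the amazon-textract-response-parser and amazon-textract-geofinder packages mentioned in step 4 do this traversal (and much more) for you; the hand-rolled version above just makes the page-categorization idea concrete.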


^Augmented AI comes with out-of-the-box integration with Textract; however, custom templates like the one above can be built to suit any application.
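Step 5 can be sketched with boto3's `sagemaker-a2i-runtime` client: `start_human_loop` takes a unique loop name, a flow definition ARN, and an `InputContent` JSON string that your custom worker template renders. The payload keys below (`sourceDocument`, `fields`) are illustrative, assumed names, not a fixed A2I schema; the actual boto3 call is left commented because it needs AWS credentials and a pre-created flow definition.

```python
import json

def build_human_loop_input(document_s3_uri, extracted_fields):
    """Serialize extracted data points into the InputContent JSON that a
    custom A2I worker template can render. Key names are illustrative."""
    return json.dumps({
        "sourceDocument": document_s3_uri,
        "fields": extracted_fields,
    })

# Hedged sketch of the actual submission (not executed here):
# import boto3
# a2i = boto3.client("sagemaker-a2i-runtime")
# a2i.start_human_loop(
#     HumanLoopName="pkg-2024-001",           # must be unique per loop
#     FlowDefinitionArn=FLOW_DEFINITION_ARN,  # created when you set up A2I
#     HumanLoopInput={
#         "InputContent": build_human_loop_input(
#             "s3://bucket/doc.pdf", {"invoice_number": "1234"}
#         )
#     },
# )
```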