Author: Ankit Kumar Jha
Contents:
- About Amazon Textract
- About Lambda Function in AWS
- Steps to extract data from PDF (single page) using MuleSoft
Introduction:
The objective of this blog is to provide a detailed walkthrough of how to extract data from a PDF using MuleSoft. We will use the Amazon Textract service from AWS, which helps MuleSoft developers and architects gain the maximum benefit from automated document processing.
Prerequisites:
You should have access to an AWS account with permissions to use the S3, Lambda, and Amazon Textract services.
What is Amazon Textract?
Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Today, many companies extract data from scanned documents such as PDFs, images, tables, and forms manually, or through simple OCR software that requires manual configuration (which often must be updated when the form changes). Textract removes that manual effort, and in this blog we will see how it works.
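To give a feel for the API, here is a minimal boto3 sketch of calling Textract on a document already stored in S3. The bucket name and object key are placeholders; the full Lambda version appears later in this blog.

import boto3

# Placeholder bucket and key; replace with your own
BUCKET = "my-textract-demo-bucket"
KEY = "sample-invoice.pdf"

textract = boto3.client("textract")

# Synchronous analysis of a single-page document stored in S3
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": BUCKET, "Name": KEY}},
    FeatureTypes=["FORMS", "TABLES"],
)

# Textract returns everything as "Blocks": PAGE, LINE, WORD, TABLE, CELL, KEY_VALUE_SET, ...
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])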
What is Lambda Function in AWS?
AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you. These events may include changes in state or an update, such as a user placing an item in a shopping cart on an e-commerce website. You can use AWS Lambda to extend other AWS services with custom logic, or create your own backend services that operate at AWS scale, performance, and security. Lambda automatically runs code in response to many kinds of events, such as HTTP requests via Amazon API Gateway, modifications to objects in Amazon Simple Storage Service (Amazon S3) buckets, or table updates in Amazon DynamoDB. In this blog, an S3 upload event triggers a Lambda function that calls Amazon Textract to extract the data.
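As a quick illustration, the skeleton of an S3-triggered Lambda handler looks like the sketch below; it is a simplified version of the full handler shown in Step 4, and only reads the bucket and key out of the event.

import json
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    # Each S3 upload event carries the bucket name and the uploaded object's key
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])
    print(f"New object: s3://{bucket}/{key}")
    # ... call Textract here (see the full handler in Step 4) ...
    return {"statusCode": 200, "body": json.dumps("OK")}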
Steps to Extract Data from PDF (single page) Using MuleSoft:
Step 1: First, create an S3 bucket in AWS that will store the PDF files to be processed.
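If you prefer to script this step, a minimal boto3 sketch is shown below; the bucket name and region are placeholders.

import boto3

# Placeholder region and bucket name
s3 = boto3.client("s3", region_name="us-east-1")
# Outside us-east-1 you would also pass CreateBucketConfiguration={"LocationConstraint": region}
s3.create_bucket(Bucket="my-textract-demo-bucket")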

Step 2: Create the Lambda function. Provide the function name, Runtime (choose the language as per your requirement; I have chosen Python 3.9 from the dropdown), and Architecture (keep the default x86_64). Leave the rest of the settings unchanged and click Create function.

Step 3: Create a trigger by clicking the Add trigger button; you will land on the trigger configuration page. Select S3 from the source dropdown and provide the name of the bucket you created for storing the PDF file. Now click the OK button; the trigger will be created and displayed on the Function overview page.
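The console does this wiring for you, but for reference, the equivalent trigger can also be configured with boto3 as sketched below; the bucket name and Lambda ARN are placeholders.

import boto3

s3 = boto3.client("s3")

# Placeholder bucket name and Lambda ARN; this is what "Add trigger" configures in the console
s3.put_bucket_notification_configuration(
    Bucket="my-textract-demo-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:pdf-extractor",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)

Note that the console also adds the resource-based permission that allows S3 to invoke the function; when scripting, that is a separate Lambda add_permission call.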

Step 4: Now click on the Code tab on the same page, create a folder, and add two files to it: parser.py and lambda_function.py.

  Refer to the below script for the parser.py code file.
"""
-*- coding: utf-8 -*-
========================
AWS Lambda ========================
"""
import json
import uuid
def extract_text(response, extract_by="WORD"):
line_text = []
for block in response["Blocks"]:
if block["BlockType"] == extract_by:
line_text.append(block["Text"])
return line_text
def map_word_id(response):
word_map = {}
for block in response["Blocks"]:
if block["BlockType"] == "WORD":
word_map[block["Id"]] = block["Text"]
if block["BlockType"] == "SELECTION_ELEMENT":
word_map[block["Id"]] = block["SelectionStatus"]
return word_map
def extract_table_info(response, word_map):
row = []
table = {}
ri = 0
flag = False
for block in response["Blocks"]:
if block["BlockType"] == "TABLE":
key = f"table_{uuid.uuid4().hex}"
table_n = +1
temp_table = []
if block["BlockType"] == "CELL":
if block["RowIndex"] != ri:
flag = True
row = []
ri = block["RowIndex"]
if "Relationships" in block:
for relation in block["Relationships"]:
if relation["Type"] == "CHILD":
row.append(" ".join([word_map[i] for i in relation["Ids"]]))
else:
row.append(" ")
if flag:
temp_table.append(row)
table[key] = temp_table
flag = False
return table
def get_key_map(response, word_map):
key_map = {}
for block in response["Blocks"]:
if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block["EntityTypes"]:
for relation in block["Relationships"]:
if relation["Type"] == "VALUE":
value_id = relation["Ids"]
if relation["Type"] == "CHILD":
v = " ".join([word_map[i] for i in relation["Ids"]])
key_map[v] = value_id
return key_map
def get_value_map(response, word_map):
value_map = {}
for block in response["Blocks"]:
if block["BlockType"] == "KEY_VALUE_SET" and "VALUE" in block["EntityTypes"]:
if "Relationships" in block:
for relation in block["Relationships"]:
if relation["Type"] == "CHILD":
v = " ".join([word_map[i] for i in relation["Ids"]])
value_map[block["Id"]] = v
else:
value_map[block["Id"]] = "VALUE_NOT_FOUND"
return value_map
def get_kv_map(key_map, value_map):
final_map = {}
for i, j in key_map.items():
final_map[i] = "".join(["".join(value_map[k]) for k in j])
return final_map
Refer to the below script for the lambda_function.py file.
"""
-*- coding: utf-8 -*-
========================
AWS Lambda
======================== """
import time
import json
import boto3
from urllib.parse import unquote_plus
from pprint import pprint
from parser import (
extract_text,
map_word_id,
extract_table_info,
get_key_map,
get_value_map,
get_kv_map,
) def lambda_handler(event, context):
textract = boto3.client("textract")
if event:
file_obj = event["Records"][0]
bucketname = str(file_obj["s3"]["bucket"]["name"])
filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))
print(f"Bucket: {bucketname} ::: Key: {filename}")
print("New version")
response = textract.analyze_docunmet(
Document={
"S3Object": {
"Bucket": bucketname,
"Name": filename,
} },
FeatureTypes=["FORMS", "TABLES"], )
raw_text = extract_text(response, extract_by="LINE")
word_map = map_word_id(response)
table = extract_table_info(response, word_map)
key_map = get_key_map(response, word_map)
value_map = get_value_map(response, word_map)
final_map = get_kv_map(key_map, value_map)
print(json.dumps(table))
print(json.dumps(final_map))
print(raw_text)
return {"statusCode": 200, "body": json.dumps("Thanks from Srce Cde!"),
"table":json.dumps(table),
"text":json.dumps(raw_text),
"map": json.dumps(final_map)
}
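To sanity-check the handler locally before deploying, you can feed it a hand-built S3 event, as in the sketch below; the bucket and key are placeholders and must point to a real object your AWS credentials can read.

from lambda_function import lambda_handler

# Hand-built S3 event with placeholder bucket/key for local testing
test_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-textract-demo-bucket"},
                "object": {"key": "sample-invoice.pdf"},
            }
        }
    ]
}

result = lambda_handler(test_event, None)
print(result["statusCode"])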
Step 5: Now click on the Configuration tab and select Permissions in the left panel. Attach two policies to the function's execution role: AWSLambdaExecute and AmazonTextractFullAccess, then click OK.
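The same policies can also be attached from code if you prefer; in the sketch below the role name is a placeholder for your function's execution role.

import boto3

iam = boto3.client("iam")
ROLE_NAME = "pdf-extractor-role"  # placeholder: your function's execution role

for policy_arn in (
    "arn:aws:iam::aws:policy/AWSLambdaExecute",
    "arn:aws:iam::aws:policy/AmazonTextractFullAccess",
):
    iam.attach_role_policy(RoleName=ROLE_NAME, PolicyArn=policy_arn)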
Step 6: Now open Anypoint Studio and start the implementation as described below.
A. First, add the Amazon Lambda Connector and Amazon S3 Connector modules from Anypoint Exchange and configure them with your AWS credentials.
B. Drag and drop the On New or Updated File source and configure the path from which the PDF file will be polled.

C. Drag a Set Variable component and store the file name in that variable.

D. Drag a Put Object operation from the Amazon S3 module and provide the below details.

E. Drag a Transform Message component and add the required inputs to invoke the AWS Lambda function.

F. Drag the Invoke operation from the Amazon Lambda Connector module and add the below configuration.

G. Drag one more Transform Message component from the Mule Palette and add the below DataWeave code to remove the backslash (\) escape characters from the extracted JSON output.
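The backslashes appear because the Lambda handler returns each field as a json.dumps(...) string inside an already-JSON response, so the payload arrives double-encoded and the transform simply parses those string fields again. The same idea expressed in Python, with a hypothetical payload, looks like this:

import json

# Hypothetical Lambda response payload: "map" is a JSON string, not a JSON object
payload = {"statusCode": 200, "map": "{\"Name\": \"John\"}"}

# Parse the string field once more to get a real object (this is what removes the backslashes)
final_map = json.loads(payload["map"])
print(final_map["Name"])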

Now run the application with any single-page PDF file; after the data is extracted, the output will look like this:

Summary
In this blog, we walked through the process of extracting data from a PDF using MuleSoft and the Amazon Textract service. We also covered the configuration of the AWS Lambda connector and the Amazon S3 connector. With this approach, we can extract text, tables, and key-value data from a PDF.