Author: Uday Kumar
Software applications are mainly used to simplify manual work and to process data quickly and accurately. Today, processing data with agility is one of the most anticipated requirements for meeting end users’ needs. Knowing the end users and responding to them as early as possible is essential to running the business and keeping the software application alive.
Due to digitization and automation, integration and data processing services are scaling up rapidly to keep pace with business demands and end users’ requirements.
Many software applications and APIs fail to meet the requirements when it comes to real-time or near-real-time data processing capacity.
Processing data using streaming is one way to meet end users’ needs, but as a developer or architect it is good to know when and why to use streaming services to process large amounts of data. Sometimes a non-streaming approach works fine for larger payloads. “Large amount of data” is a very generic term, so it needs to be quantified with the business team or users: what does a “large amount of data” mean for them? For some, it could be 5 million records; for others, 1 million or half a million. Once the approximate amount of data to be processed is known, the next key step is to understand the data format, which also plays a major role in determining the data processing strategy and the complexities involved. Together, the approximate amount of data and the formats help developers and architects decide which data processing strategy is best suited to their requirements. That is where the Streaming and Non-streaming data processing strategies come into the picture.
Based on my research, Streaming should be the last option for data processing. Understanding the data formats and the amount of data helps in selecting the right data processing strategy and in making better use of system resources. In my observations, Streaming uses more memory and involves more complexity in dealing with system resources than Non-streaming. It is therefore prudent to avoid these complications at the outset by understanding the source and target data formats and the amount of data, and only then deciding which strategy to use.
XML – JSON:
A million-record XML document, for example, should be processed with three things in mind:
- The total number of records in the XML document (source data)
- The number of records in each collection in the XML document
- The target data format(s)
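The first two numbers in the list above can be gathered before choosing a strategy. The following is a minimal Python sketch, not part of the Mule application: the function name `profile_collections` and the sample tags (`library`, `fiction`, `science`, `book`) are hypothetical, and it assumes each collection is the direct parent of its records and that collections have distinct tag names.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def profile_collections(source, record_tag):
    """Count records per enclosing collection while reading incrementally."""
    counts = Counter()
    stack = []  # currently open elements, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
        else:
            stack.pop()  # elem has just been closed
            if elem.tag == record_tag and stack:
                parent = stack[-1]  # the enclosing collection element
                counts[parent.tag] += 1
                parent.remove(elem)  # drop the counted record from the tree
    return counts


# Hypothetical sample document with two collections.
sample = (b"<library><fiction><book/><book/></fiction>"
          b"<science><book/></science></library>")
counts = profile_collections(io.BytesIO(sample), "book")
print("records per collection:", dict(counts))
print("total records:", sum(counts.values()))
```

Because each record is removed once counted, the sketch can profile a document much larger than the sample without holding every record in memory.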
The bigger the collection size, the more processing time Streaming takes compared with Non-streaming.
For XML data processing, depending on the amount of data and the size of the nested collections, Non-streaming sometimes works better than Streaming.
The streaming app shown in the screen print above streams XML source data into target JSON. The other streaming app examples, such as CSV-JSON and JSON-CSV, use similar code. One difference is in the HTTP mimeType settings: “collectionPath” is specific to XML, and the other source data formats should not use it. The other difference is the Transform Message component, in which the input and target data formats need to be adjusted to match the source and target data being used. The Write component simply writes the payload to a destination, and SetPayload returns a success message to the client once the process completes.
For more information on this app, please refer to the video tutorial here. The other video in the series which explains Streaming and Non-streaming could be referred to here.
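Outside Mule, the difference between the two strategies can be illustrated with a short Python sketch. It is not the code of the app described above; the function names and the `books`/`book` tags are assumptions made for the example, and only the `bookName` field comes from the data used in this analysis (the second title is invented).

```python
import io
import json
import xml.etree.ElementTree as ET


def record_to_dict(elem):
    """Flatten one record element into {child tag: child text}."""
    return {child.tag: child.text for child in elem}


def convert_non_streaming(source, record_tag, out):
    """Parse the whole document into memory, then write one JSON array."""
    root = ET.parse(source).getroot()
    records = [record_to_dict(e) for e in root.iter(record_tag)]
    json.dump(records, out)
    return len(records)


def convert_streaming(source, record_tag, out):
    """Write each record as soon as its closing tag has been read."""
    count = 0
    out.write("[")
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            out.write(("," if count else "") + json.dumps(record_to_dict(elem)))
            count += 1
            elem.clear()  # release the record's children; the empty element remains
    out.write("]")
    return count


# Hypothetical two-record source document.
xml_bytes = (b"<books>"
             b"<book><bookName>Mulesoft for beginners</bookName></book>"
             b"<book><bookName>Another title</bookName></book>"
             b"</books>")

whole, streamed = io.StringIO(), io.StringIO()
convert_non_streaming(io.BytesIO(xml_bytes), "book", whole)
convert_streaming(io.BytesIO(xml_bytes), "book", streamed)
print(json.loads(whole.getvalue()) == json.loads(streamed.getvalue()))
```

Both functions produce the same JSON content. The non-streaming version holds every record at once, while the streaming version trades that for per-record bookkeeping, which is the kind of extra complexity discussed in this article.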
The source and target have the following data formats in my analysis:
“bookName”: “Mulesoft for beginners”,
Below is the comparison sheet for XML to JSON:
Streaming:
|Records|Result|
|1011631|4.30 s (7 collections: 4 with 1 lakh (100,000) records each and 3 with 2 lakh records each), with memory capacity (-XX:PermSize=2048M -XX:MaxPermSize=6144M)|
|994904|Did not work with (-XX:PermSize=2048M -XX:MaxPermSize=6144M); was getting a JVM connect error; the XML document has only one collection|
|806396|Studio threw java.lang.OutOfMemoryError: Java heap space with the default heap size; 1 m 1.80 s when it worked with (-XX:PermSize=1024 -XX:MaxPermSize=4096)|
Non-streaming:
|Records|Result|
|1011624|8.69 s (7 collections: 4 with 1 lakh records each and 3 with 2 lakh records each)|
|994904|12.94 s (the XML has one collection for this and the processed records below)|
Notice that 1011631 records, split into 7 collections, were processed through Streaming in just 4.30 s, whereas 994904 records in a single collection could not be processed at all, even with the larger memory settings (-XX:PermSize=2048M -XX:MaxPermSize=6144M). Conversely, the same 994904 records in one collection were processed through the Non-streaming strategy in 12.94 s. This says a lot about the data processing tactics a developer and an architect should weigh before designing and implementing their applications. Do we really need Streaming to process this amount of data?
If you compare the processing times for the different data sizes in the Streaming and Non-streaming sections above, you will not find much lag for the smaller record counts, while the resources used by the Non-streaming strategy are minimal. From the end users’ perspective, however, every second matters, even every millisecond. This is the critical point where a developer and an architect must be decisive, considering not only the processing time but also the resources required to process the data. The ultimate goal should be to process the maximum amount of data with the least resource utilization while keeping the SLA in mind, since the SLA is the more important factor and the processing time should be tuned to stay close to it.
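Weighing processing time against resources is easier with both numbers side by side. The sketch below is a hypothetical Python harness, not the measurement method used for the figures in this article: `measure`, the two toy workloads, and the `SLA_SECONDS` value are all assumptions for illustration. It records wall-clock time and peak memory for one run of a function.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once; return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - started
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def total_non_streaming(n):
    """Toy workload: build every record first, then process them."""
    rows = [{"id": i} for i in range(n)]
    return sum(row["id"] for row in rows)


def total_streaming(n):
    """Toy workload: process each record as it is produced."""
    return sum(row["id"] for row in ({"id": i} for i in range(n)))


SLA_SECONDS = 5.0  # hypothetical SLA, to be agreed with the business team

for name, func in [("non-streaming", total_non_streaming),
                   ("streaming", total_streaming)]:
    _, elapsed, peak = measure(func, 50_000)
    verdict = "within" if elapsed <= SLA_SECONDS else "over"
    print(f"{name}: {elapsed:.3f} s, peak {peak / 1024:.0f} KiB, {verdict} SLA")
```

Note that `tracemalloc` slows the code it traces, so the timings are only comparable with each other, and that it tracks Python allocations rather than JVM heap usage.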
JSON – CSV:
While processing JSON data into CSV, I used both the Streaming and Non-streaming approaches to observe which strategy processes the data faster, keeping in mind the number of records and the volume of data. In my findings, Non-streaming works relatively better than Streaming for larger payloads, as it uses fewer resources and avoids the complexities of Streaming, which are best avoided where possible. The source and target data have the following formats:
“Name”: “1001-Michael Jeffery”,
“Name”: “1002-Duncan Randall”,
Below is the comparison sheet for JSON to CSV:
CSV – JSON:
For CSV to JSON data processing, the Streaming strategy works better than Non-streaming for larger payloads. The comparison sheet is given below.
“Name”: “1001-Michael Jeffery”,
“Name”: “1002-Duncan Randall”,
Below is the comparison sheet for CSV to JSON:
If you look at the JSON – CSV and CSV – JSON comparison sheets above, JSON – CSV performs better than CSV – JSON for a similar data structure. The only difference is the source and target data formats. This shows that the source and target data formats also play a pivotal role in the processing of large amounts of data.
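To make the two directions concrete, here is a small Python sketch of a non-streaming JSON to CSV conversion and a streaming CSV to JSON conversion. It is an illustration only, not the Mule flows that were measured: the function names are hypothetical, only the `Name` field is taken from the data above, and the streaming function writes one JSON object per line rather than a single array.

```python
import csv
import io
import json


def json_to_csv_non_streaming(json_text, out):
    """Load the full JSON array into memory, then write all rows as CSV."""
    records = json.loads(json_text)
    if not records:
        return 0
    writer = csv.DictWriter(out, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return len(records)


def csv_to_json_streaming(lines, out):
    """Read CSV rows one at a time and write one JSON object per line."""
    count = 0
    for row in csv.DictReader(lines):
        out.write(json.dumps(row) + "\n")
        count += 1
    return count


source = json.dumps([{"Name": "1001-Michael Jeffery"},
                     {"Name": "1002-Duncan Randall"}])

csv_out = io.StringIO()
json_to_csv_non_streaming(source, csv_out)

json_out = io.StringIO()
csv_to_json_streaming(io.StringIO(csv_out.getvalue()), json_out)
print(json_out.getvalue())
```

The two directions do different work per record: writing CSV needs the field names up front, while reading CSV yields one row at a time, which is one reason the format pair can influence which strategy fits.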
The comparison sheets, examples, and conclusions above were tried and compiled on a local machine with the following settings and different heap sizes:
- Capacity: 16GB RAM
- Anypoint Studio version: 7.1.11
- Mule version: 4.3.0
- Memory settings: -XX:PermSize=1048M -XX:MaxPermSize=4096M, or -XX:PermSize=2048M -XX:MaxPermSize=6144M, or the default in Anypoint Studio