A way to parse huge JSON files when memory used to be a limitation
Negruti Andrei
Why do we have to process huge JSON files?
Our global scale and experience
Over 7,500 experts across 36 countries and a client base spanning Europe, North and South America and Asia Pacific.
Our unrivalled experience and deep understanding of language has been developed over more than 60 years.
We support 330+ language variants and translate 378+ billion words a year.
Books to BCMs (Bilingual Content Model)
The Great Gatsby
F. Scott Fitzgerald
• 48,922 words
• Original file: 0.7 MB
• BCM (JSON): 2.9 MB

1984
George Orwell
• 105,204 words
• Original file: 1.6 MB
• BCM (JSON): 4.8 MB

The Clean Coder
Robert C. Martin
• 67,495 words
• Original file: 3.2 MB
• BCM (JSON): 8.5 MB

War and Peace
Leo Tolstoy
• 572,298 words
• Original file: 3.6 MB
• BCM (JSON): 33.2 MB

The Lord of the Rings (entire trilogy)
J.R.R. Tolkien
• 561,317 words
• Original file: 6.4 MB
• BCM (JSON): 53.5 MB

Introduction to Algorithms
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein
• 449,467 words
• Original file: 14.1 MB
• BCM (JSON): 145.8 MB

This presentation: A way to parse huge JSON files when memory used to be a limitation
Negruti Andrei
• 2,565 words
• Original file: 10.2 MB
• BCM (JSON): 1.3 MB
Zoom into the Apply Machine Translation step
(Slides 14-25: diagrams only, walking through the Apply Machine Translation flow and where it runs out of memory; see the editor's notes for the narration.)
Processing a JSON in a streaming way
{
  "action": "SUM_NUMBERS",
  "requester": {
    "id": "307d3a82",
    "username": "admin"
  },
  "numbers": [123, 731, ..., 421]
}
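For contrast, the naive approach that we later benchmark as "In-memory" deserializes the whole document at once. A minimal sketch, assuming jackson-databind's ObjectMapper is also on the classpath and the request above is stored in numbersFile; the entire tree has to fit in memory before we can touch a single number:

// Naive in-memory approach: materialize the whole JSON tree, then sum
ObjectMapper mapper = new ObjectMapper();
JsonNode root = mapper.readTree(numbersFile); // the whole document is loaded here
long total = 0;
for (JsonNode number : root.get("numbers")) {
    total += number.asLong();
}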
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
</dependency>
JsonParser parser = new JsonFactory().createParser(input);
parser.nextToken()
{
  "action": "SUM_NUMBERS",
  "requester": {
    "id": "307d3a82",
    "username": "admin"
  },
  "numbers": [123, 731, ..., 421]
}
JsonToken.START_OBJECT
JsonToken.FIELD_NAME
JsonToken.VALUE_STRING
JsonToken.FIELD_NAME
JsonToken.START_OBJECT
JsonToken.FIELD_NAME
JsonToken.VALUE_STRING
JsonToken.FIELD_NAME
JsonToken.VALUE_STRING
JsonToken.END_OBJECT
JsonToken.FIELD_NAME
JsonToken.START_ARRAY
JsonToken.VALUE_NUMBER_INT
JsonToken.VALUE_NUMBER_INT
...
JsonToken.VALUE_NUMBER_INT
JsonToken.END_ARRAY
JsonToken.END_OBJECT
null
parser.nextToken()
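Putting this together, a minimal sketch (assuming the request above is stored in numbersFile) that simply keeps calling nextToken() and prints exactly the sequence listed above:

JsonParser parser = new JsonFactory().createParser(numbersFile);
JsonToken token = parser.nextToken();
while (token != null) {
    System.out.println(token); // START_OBJECT, FIELD_NAME, VALUE_STRING, ...
    token = parser.nextToken(); // returns null once the end of the input is reached
}
parser.close();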
JsonParser parser = new JsonFactory().createParser(numbersFile);
JsonToken token = parser.nextToken();
long total = 0;
while (token != null) {
token = parser.nextToken();
if (JsonToken.FIELD_NAME.equals(token) && parser.getCurrentName().equals("numbers")) {
parser.nextToken(); //Position cursor at START_ARRAY
while (parser.nextToken() != JsonToken.END_ARRAY) {
total += parser.getValueAsInt();
}
}
}
We built a new way to process JSONs
<dependency>
<groupId>com.sdl.lt.lc.json.streaming</groupId>
<artifactId>json-streaming-processor</artifactId>
<version>0.0.1</version>
</dependency>
ReadJsonProcessor processor = JsonProcessorBuilder.initProcessor(numbersFile);
PathMatcher pathMatcher = PathMatcherBuilder.builder()
.field("numbers").startArray()
.build();
Iterator<Integer> numbersIterator = processor.readValues(pathMatcher, Integer.class);
long total = 0;
while (numbersIterator.hasNext()) {
total += numbersIterator.next();
}
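The same readValues call is not limited to primitives. A hypothetical sketch (the bcmFile input, the "paragraphs" field and the Paragraph class are invented for illustration) of how a BCM-like document could be consumed one paragraph at a time:

// Hypothetical structure: { "paragraphs": [ { "id": "...", "text": "..." }, ... ] }
ReadJsonProcessor processor = JsonProcessorBuilder.initProcessor(bcmFile);
PathMatcher paragraphsMatcher = PathMatcherBuilder.builder()
    .field("paragraphs").startArray()
    .build();
Iterator<Paragraph> paragraphs = processor.readValues(paragraphsMatcher, Paragraph.class);
while (paragraphs.hasNext()) {
    Paragraph paragraph = paragraphs.next(); // only one paragraph is deserialized at a time
    // process this unit, then let it become garbage collectable
}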
Rewrite JSON and add +1 to each number
JsonFactory jsonFactory = new JsonFactory();
JsonParser parser = jsonFactory.createParser(numbersFile);
JsonGenerator generator = jsonFactory.createGenerator(outputStream);
JsonToken token = parser.nextToken();
generator.copyCurrentEvent(parser);
while (token != null) {
    token = parser.nextToken();
    if (token == null) {
        break;
    }
    generator.copyCurrentEvent(parser);
    if (JsonToken.FIELD_NAME.equals(token) && parser.getCurrentName().equals("numbers")) {
        parser.nextToken(); // Position cursor at START_ARRAY
        generator.copyCurrentEvent(parser);
        while (parser.nextToken() != JsonToken.END_ARRAY) {
            generator.writeNumber(parser.getValueAsInt() + 1);
        }
        generator.copyCurrentEvent(parser); // Copy the END_ARRAY
    }
}
generator.close(); // Flush the buffered output, otherwise the rewritten JSON may be truncated
JsonProcessorBuilder builder = JsonProcessorBuilder.initBuilder(numbersFile, outputStream);
PathMatcher pathMatcher = PathMatcherBuilder.builder()
.field("numbers").startArray()
.build();
JsonElementTransformer plusOneEachNumber = builder.mapEach(pathMatcher, Integer.class, nr -> nr + 1);
JsonVisitor visitor = JsonVisitor.withTransformer(plusOneEachNumber);
VisitJsonProcessor processor = builder.build();
processor.visit(visitor);
Rewrite JSON and add +1 to each number
Bonus: retrieve the username
{
  "action": "PLUS_ONE",
  "requester": {
    "id": "307d3a82",
    "username": "admin"
  },
  "numbers": [123, 731, ..., 421]
}
JsonProcessorBuilder builder = JsonProcessorBuilder.initBuilder(numbersFile, outputStream);
PathMatcher numbersPathMatcher = PathMatcherBuilder.builder()
.field("numbers").startArray()
.build();
PathMatcher usernamePathMatcher = PathMatcherBuilder.builder()
.field("requester").field("username")
.build();
AtomicReference<String> usernameValue = new AtomicReference<>();
JsonVisitor visitor = JsonVisitor.withTransformers(
List.of(
builder.mapEach(numbersPathMatcher, Integer.class, nr -> nr + 1),
builder.peek(usernamePathMatcher, String.class, e -> usernameValue.set(e.getElement()))
)
);
VisitJsonProcessor processor = builder.build();
processor.visit(visitor);
System.out.println(usernameValue.get());
Performance
Summing all numbers: processing time by JSON size

File size    Jackson    Library    In-memory
10 MB        54 ms      76 ms      81 ms
20 MB        104 ms     148 ms     155 ms
40 MB        352 ms     435 ms     589 ms
60 MB        366 ms     482 ms     1025 ms
100 MB       904 ms     1498 ms    4868 ms
+1 each number and rewrite JSON: processing time by JSON size

File size    Jackson    Library    In-memory
10 MB        92 ms      135 ms     150 ms
20 MB        208 ms     258 ms     295 ms
40 MB        699 ms     853 ms     1067 ms
60 MB        972 ms     1324 ms    3541 ms
100 MB       1789 ms    2013 ms    10747 ms
When should you use this library?
• You want to avoid building complex token-based logic.
• You can break your JSON into smaller processable units.
• You don't mind the deserialization penalty.
• You want to be up and running faster.
Our results from processing JSONs in a streaming way:
• No more out-of-memory errors.
• We can process JSONs of any size.
• We can process multiple files in parallel.
• Our services are cheaper to run.
https://github.com/RWS/json-streaming-processor
Editor's Notes
  • #2: Hello & Welcome A production issue fixed with the help of an open source library we built
  • #3: Let’s start from the top Why do we even have to process huge JSON files? By process I mean actually manipulating these files not only storing them for import/export flows or analytics
  • #4: We at RWS do translations, a lot of them, for a lot of clients. In our system, whatever the customer uploads turns into a JSON that we call a BCM (Bilingual Content Model).
  • #5: How does a simple translation flow work? Upload files. Convert them to BCMs (which are JSONs that hold both the original file details and the resulting translated document details). Apply Machine Translation. Maybe someone will then rewrite parts of the machine-translated text. Finally we convert this JSON back to the original file type with the translated text.
  • #6: To give you a sense of how big these files can get I will quickly show some books and tell you the BCM’s size for each of them Keep in mind this is before applying any translation which will of course make files even bigger
  • #7: About 50 thousand words Gets to 3Mb
  • #8: About 105 thousand words Gets you to 5 Mb
  • #9: Couldn’t skip Uncle Bob’s book About 68 thousand words Gets you to 8 and a half Mb JSON
  • #10: Let’s jump to one of the longest novels About 570 thousand words Gives us 33 Mb Json
  • #11: The entire Lord of The Rings Trilogy Slightly less words than the previous book 53 Mb Json
  • #12: Finally, Introduction to Algorithms About 450 thousand words Almost 150 Mb JSON
  • #13: Finally just to introduce an Inception moment This presentation has about 2.5k words including notes And the JSON we will process to translate is a little over 1Mb
  • #14: - Now that you have a sense of how big some of our BCMs can be, let's zoom into the place where we first noticed problems
  • #15: - First we receive a message that we have to apply translation to a file
  • #16: - Then we download the file in memory
  • #17: - Remove the already translated paragraphs
  • #18: - Send the file to translation
  • #19: The send to translation might give us some problems But when we were merging the translation back into the original file we had more problems This is triggered by receiving another message that the translation for a given file is done
  • #20: First we download the original file again
  • #21: Then we download the translation
  • #22: When we have both files in memory we merge them together
  • #23: Then we upload the BCM to be used by other services Where’s the problem?
  • #24: Go back to the merging Its here where we have 2 JSON’s in memory to be able to merge them together Let me show you how much memory this might use
  • #25: What happens if we try to translate Introduction to Algorithms? The file without any translation is 150 MB. Assume the translation doubles the file, just to make the math easier. In memory those same files can be a lot bigger; for one large JSON I tested, the in-memory representation was 3 times bigger. Applying the same multiplier to both files, (150 MB + 300 MB) × 3, results in almost 1.3 GB of memory being used. This is on top of everything else running on the service.
  • #26: What’s the immediate fix you can do? Pay more for more memory Reduce your throughput This is not a long term solution
  • #27: Here is where we started working on a different approach. Working in the Java ecosystem, we looked around for a solution that fits our needs. This is where I found out about Jackson's streaming capabilities.
  • #28: Say we have this JSON: we receive a request specifying an action to perform on a list of numbers from a user. The JSON is too big to deserialize in memory. How do we sum up all the numbers with limited memory use, using Jackson?
  • #29: First we add the Jackson dependency to our project
  • #30: Then we will be creating a JsonParser To do this we will need an InputSource from where the JSON will be read, let’s assume a file
  • #31: - Then we will be using the parser.nextToken() the most to process this JSON
  • #32: If we keep calling the .nextToken() method If you want to sum the numbers you will have a while block Find the numbers field After that iterate over all numbers and add them to a total Doing this will ensure that you have at most a few Kb of memory being used for the bytes that Jackson will buffer in memory for better performance
  • #33: This is how a version of the code might look. We managed to sum the numbers. The only issue is that even for our simple example this is the logic; imagine how your own logic would look when mapped to tokens. Performance-wise this is the best option, but our use cases are too complex to implement with raw tokens.
  • #34: We built a new library Backed by Jackson but without the token logic Let’s try again to sum all the numbers with our new library
  • #35: As of last night you can import the library in your own project
  • #36: We initialize a processor to read from our file. We define the path where we can find the list of numbers. After that, using the processor and the defined path, we say that at the given path we expect to find Integers and read them all. Using the iterator we can quickly sum up all the numbers.
  • #37: Most of our operations on BCMs require us to rewrite them and apply different logic while we do that. We made sure that the library we wrote permits us to do that as well. Let's look at another exercise and see how we can do it in a streamed way.
  • #38: I won't bother you with how you can do the rewrite and +1 using Jackson because it's a lot; I'm just showing you a snippet of how one might do it in over 20 lines of code.
  • #39: Using our library it's a lot easier to do this. We create a processor builder and initialize it with the input source pointing to the numbers file and an output source where we write our JSON. We define the numbers path again. We create what we call a transformer, telling it to look at the numbers array path, expect to find Integers, and apply the given function. Then we create a visitor using this transformer, build our visiting processor, and call the processor's visit method. It will iterate over the JSON and apply all the specified transformers while rewriting the entire JSON to the OutputStream.
  • #40: - If we want to complicate the problem even more and say we want to extract the username from the JSON
  • #41: Remember we had a requester field that has an object including the username
  • #42: Using our library we can quickly achieve this. Define our path matchers. Define an object where we will be storing our username value. Define our JSON visitor, specifying the 2 transformers. Then we visit the JSON. At the end we will have the username value stored and we can do whatever we want with it.
  • #43: How does our library compare with Jackson and processing files in memory?
  • #44: I did some benchmark tests just to give you a glimpse of how Jackson, our library and processing files in memory compare. As you can see, from 10 MB files to 100 MB JSONs, if we want to sum the numbers Jackson is the fastest, followed closely by our library. Starting from about 60 MB files, processing in memory becomes way slower. I don't have a graph to show you how much memory is being used, but it's KBs versus MBs. For my numbers JSON I only used 3-digit numbers: 3 digits on disk take 3 bytes, while the same value held as a boxed Integer in memory needs an object header plus the reference to it, so each number can easily end up close to ten times bigger in memory.
  • #45: Comparing performance for rewriting JSON’s shows us a similar story Doing anything in memory is always slower than doing it in a streamed way
  • #46: If you know your JSON’s are too big for you to hold in memory, when is it a good idea to use our library over Jackson?
  • #47: - This is a good tell-tale sign that you should be using our library
  • #48: I've only shown you deserialization to Integers and Strings, but using our library you can deserialize to an entire object. Split your logic into smaller processable units.
  • #49: Like I mentioned on the slide before, you can deserialize to any object and process in smaller units. Deserialization has its performance penalty; you have to check how many deserializations you end up doing.
  • #50: Even for the easiest of logics building that logic using our library will be faster
  • #51: For now we use the library in 4 distinct flows. What are our own results from processing files using the library we built? For our BCMs the processable units are paragraphs, and we were able to do everything we needed by deserializing only one paragraph at a time.
  • #53: - No matter the size of a BCM, a paragraph can't be so big that we can't hold it in memory
  • #54: - Back to processing multiple files in parallel since we use less memory
  • #55: - We simply don’t need as much memory added to these services because we don’t use as much now
  • #56: On GitHub you can find the source code, all the other transformers you can use, and tests that show how they all work. I've got a list of refactorings, new APIs and tests to add, so if you like the library please follow the progress there. I would love to receive any feedback on it. If you have a use case where you think this library might work for you, try using it or contact me. If you use a programming language other than Java and need the same functionality, also contact me, because we can make something work.