FormatConversion
A handy pattern for format conversions
2017-11-27
Overview
• At exactEarth, we deal with data in a lot of different formats.
• We had problems with a proliferation of converters.
• I’m going to discuss a pattern we developed for dealing with situations like
this that turned out well.
What does exactEarth do?
“We track all of the world’s ships using satellites.”
Automatic Identification System (AIS)
• Designed during the 1990s
• Adopted as a standard in 2002
• Very High Frequency (VHF) radio transmissions
• 27 different types of messages transmitted
Message Types 1, 2, 3:
• Maritime Mobile Service Identity (MMSI)
• Location
• Speed over ground
• Course over ground
• Heading
• Rate of Turn
Message Type 5:
• Maritime Mobile Service Identity (MMSI)
• Name
• IMO Number
• Callsign
• Dimensions of the ship
• Destination and ETA
Effective January 1, 2005, AIS transceivers are required on:
• All ships of 300 gross tonnage and upwards engaged on international voyages
• All cargo ships of 500 gross tonnage and upwards not engaged on international voyages
• All passenger ships, irrespective of size
AIS transceivers must be on at all times (with some limited exceptions).
[Photos: example vessels of 300 gross tonnage (top) and 500 gross tonnage (right)]
Many ways to store AIS messages
• NMEA v3, v4
• GNM v3.1
• Internal Binary formats (with several versions)
• “Adapted” formats (several variations):
• CSV
• XML
• JSON
• KML
• OTH-Gold
• Many third-party and “one-off” formats
Many representations of the same data
Conversions between formats
To ingest data from third parties, and to satisfy customer demands for data in a particular format, we need to be able to convert between all of the formats.
Lossy vs. Lossless Conversions
Some conversions are lossless:
• For example, both NMEAv4 and GNM v3.1 capture all the same data.
But some are lossy, meaning that data is lost in the conversion:
• For example, NMEAv4 to KML
• KML doesn’t have all of the fields that AIS-specific formats do
Lossless Conversion: GNM and NMEAv4
GNM:
$PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20
!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
NMEAv4:
s:104,c:1315505346300*0E!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
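Both carry the same timestamp (2011-09-08 18:09:06.300 UTC), the same value 104, and the same AIVDM sentence. The other fields are either format syntax, checksums, or some trivial additional fields.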
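As a rough check that the two headers carry the same instant, the sketch below converts the date fields of the GNM `$PGHP` line to epoch milliseconds, which can then be compared with the `c:` value in the NMEAv4 line. The field positions are inferred from this single example rather than from a GNM specification, and the function name `pghp_epoch_ms` is hypothetical.

```python
from datetime import datetime, timezone


def pghp_epoch_ms(header: str) -> int:
    """Convert the date fields of a $PGHP header to epoch milliseconds.

    Assumption (inferred from the example, not from a spec): fields 2..8 are
    year, month, day, hour, minute, second, millisecond, all in UTC.
    """
    body = header.split("*", 1)[0]  # drop the trailing checksum
    fields = body.split(",")
    year, month, day, hour, minute, second, millis = (int(f) for f in fields[2:9])
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp()) * 1000 + millis


gnm_header = "$PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20"
nmea4_timestamp = 1315505346300  # the "c:" value from the NMEAv4 line

print(pghp_epoch_ms(gnm_header) == nmea4_timestamp)
```

The checksums shown on the slide are not recomputed here, so this only demonstrates that the timestamps agree.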
Lossy Conversion NMEAv4 -> KML
Message Type 1:
• MMSI (identifier)
• Timestamp
• Longitude/Latitude
• Heading
• Navigation Status
• Rate of Turn
• Speed Over Ground
• Position Accuracy
• Course over Ground
• …
KML:
<Placemark>
<name>431300061</name>
<TimeStamp><when>2011-09-08T18:09:06Z</when></TimeStamp>
<Point><coordinates>140.08116666666666,35.55616666666667</coordinates></Point>
<Style><IconStyle>
<Icon>
<href>http://maps.google.com/mapfiles/kml/shapes/track.png</href>
<w>64</w><h>64</h>
</Icon><color>ffff0000</color>
<heading>344.0</heading>
</IconStyle>
</Style>
</Placemark>
Problem: Proliferation of Converters
• Code duplication
• Bug-prone, not performant
• Testing and optimization efforts were strained by so many implementations
• Not flexible
• If a component consumed GNM, it was hard to add the ability to consume NMEA
• Inadvertent use of lossy conversions
Step 1: One format to rule them all
We created a new format: EEA
Built for AIS, faithfully reflects the spec.
Extension fields for format-specific metadata
Side benefit: Multi-type fields
Some fields in the AIS spec are multi-typed
Example: Speed over Ground (10 bits, 0-1023)
From the spec:
“Speed over ground in 1/10 knot steps (0-102.2 knots)
1023 = not available, 1022 = 102.2 knots or higher”
Developers were often performing mathematical operations on the fields (!)
In EEA, we made the type of these fields:
Either[double, NOT_AVAILABLE, SPEED_102_POINT_2_KNOTS_OR_HIGHER]
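One way to picture such a field in Python is a union of `float` and an enum of the reserved values. This is only a sketch under assumptions: the names `SogSpecial`, `decode_sog`, and `mean_known_speed` are hypothetical and are not taken from EEA itself.

```python
from enum import Enum
from typing import Iterable, Optional, Union


class SogSpecial(Enum):
    """Reserved raw values that are not speeds."""
    NOT_AVAILABLE = 1023
    SPEED_102_POINT_2_KNOTS_OR_HIGHER = 1022


SpeedOverGround = Union[float, SogSpecial]


def decode_sog(raw: int) -> SpeedOverGround:
    """Decode the 10-bit speed-over-ground field (1/10 knot steps)."""
    if not 0 <= raw <= 1023:
        raise ValueError(f"speed over ground must fit in 10 bits, got {raw}")
    if raw in (1022, 1023):
        return SogSpecial(raw)
    return raw / 10.0


def mean_known_speed(values: Iterable[SpeedOverGround]) -> Optional[float]:
    """Average only real speeds; both special values are skipped here."""
    known = [v for v in values if isinstance(v, float)]
    return sum(known) / len(known) if known else None


speeds = [decode_sog(100), decode_sog(1023), decode_sog(200)]
print(mean_known_speed(speeds))
```

Because the special values are not numbers, code that tries to add or average the raw field fails loudly instead of silently mixing 102.3 into the result. Skipping the "102.2 knots or higher" value in the average is one possible policy, chosen here for illustration.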
Step 2: Standardized low-level API
• tokenize(input:file_like)
• deserialize_message(unparsed_message)
• serialize_message(parsed_message)
• merge(Iterable[unparsed_message], output:file_like)
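To make the four functions concrete, here is a minimal sketch for a made-up line-oriented format (one `key=value;key=value` message per line). The toy format and its parsing rules are assumptions for illustration only; real AIS formats need far more careful tokenizing and parsing.

```python
import io
from typing import BinaryIO, Dict, Iterable, Iterator


def tokenize(payload: BinaryIO) -> Iterator[bytes]:
    """Split a payload into unparsed messages (here: non-empty lines)."""
    for line in payload:
        line = line.rstrip(b"\r\n")
        if line:
            yield line


def deserialize_message(unparsed_message: bytes) -> Dict[str, str]:
    """Parse one message's bytes into an in-memory object (here: a dict)."""
    pairs = unparsed_message.decode("ascii").split(";")
    return dict(pair.split("=", 1) for pair in pairs)


def serialize_message(parsed_message: Dict[str, str]) -> bytes:
    """Turn an in-memory message back into its unparsed bytes."""
    text = ";".join(f"{key}={value}" for key, value in parsed_message.items())
    return text.encode("ascii")


def merge(unparsed_messages: Iterable[bytes], output: BinaryIO) -> None:
    """Write unparsed messages, one after another, into a payload."""
    for message in unparsed_messages:
        output.write(message + b"\n")


source = io.BytesIO(b"mmsi=431300061;heading=344\nmmsi=123456789;heading=90\n")
sink = io.BytesIO()
merge((serialize_message(deserialize_message(t)) for t in tokenize(source)), sink)
print(sink.getvalue() == source.getvalue())
```

The round trip through all four functions reproduces the original payload, which is a useful property to test for every format that implements the interface.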
Step 3: Conversion Graph
[Diagram, built up over several slides: the GNM, NMEA4, and DOF parse graphs are drawn side by side. The parsed node of each is EEA, so the three EEA nodes are really one node and can be joined with no-op edges. Following the edges then gives a path from GNM PAYLOAD to NMEA4 PAYLOAD.]
Generating the converter
Now that we have a graph, to make a converter just compose the functions
on the edges of the shortest path:
NMEA_v4_payload = merge(serialize(nop(deserialize(tokenize(GNM_input)))))
Function composition in Python:
https://mathieularose.com/function-composition-in-python/#solution
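A small sketch of left-to-right composition using only the standard library. The helper name `compose` and the toy string steps are illustrative assumptions, not the library's actual code.

```python
from functools import reduce
from typing import Any, Callable


def compose(*functions: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Return a function that applies `functions` left to right."""
    def composed(value: Any) -> Any:
        return reduce(lambda acc, fn: fn(acc), functions, value)
    return composed


# Stand-ins for the functions found on the edges of a path.
steps = [str.strip, str.upper, lambda s: s.split(",")]
convert = compose(*steps)
print(convert("  gnm,nmea "))
```

Applying the functions in path order means the list of edge functions can be passed to `compose` exactly as the path search returns them.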
NetworkX
https://networkx.github.io/
3-clause BSD license
I’ve used it heavily and have no complaints with it
Building the graph
import networkx as nx

# Assumed to be a DiGraph: forward and reverse edges carry different functions.
conversions = nx.DiGraph()
conversions.add_edge(("NMEA", "4", "PAYLOAD"), ("NMEA", "4", "UNPARSED"), function=tokenize)
conversions.add_edge(("NMEA", "4", "UNPARSED"), ("NMEA", "4", "EEA_OBJ"), function=deserialize)
conversions.add_edge(("NMEA", "4", "EEA_OBJ"), ("NMEA", "4", "UNPARSED"), function=serialize)
conversions.add_edge(("NMEA", "4", "UNPARSED"), ("NMEA", "4", "PAYLOAD"), function=merge)
…
conversions.add_edge(("NMEA", "4", "EEA_OBJ"), ("GNM", "3.1", "EEA_OBJ"), function=lambda x: x)  # nop
…
Example usage
NM4_to_GNM = get_converter(("NMEA", "4", "PAYLOAD"), ("GNM", "3.1", "PAYLOAD"))
with open("my_nmea_v4_file.nm4", 'rb') as fin:
    with open("my_converted_file.gnm", 'wb') as fout:
        fout.write(NM4_to_GNM(fin))
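The following is a dependency-free sketch of how a `get_converter` like this could work. It swaps NetworkX for a plain dictionary and a small breadth-first search, so `add_edge`, `shortest_path`, and this `get_converter` signature are stand-ins written for illustration, and the two toy formats "A" and "B" are made up.

```python
from collections import deque
from typing import Any, Callable, Dict, List, Optional, Tuple

Node = Tuple[str, str, str]
Graph = Dict[Node, Dict[Node, Callable[[Any], Any]]]


def add_edge(graph: Graph, source: Node, target: Node, function: Callable) -> None:
    """Record a one-way conversion step from source to target."""
    graph.setdefault(source, {})[target] = function
    graph.setdefault(target, {})


def shortest_path(graph: Graph, source: Node, target: Node) -> List[Node]:
    """Breadth-first search; returns the nodes on a fewest-edges path."""
    previous: Dict[Node, Optional[Node]] = {source: None}
    queue = deque([source])
    while queue:
        node: Optional[Node] = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, {}):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    raise LookupError(f"no path from {source} to {target}")


def get_converter(graph: Graph, source: Node, target: Node) -> Callable[[Any], Any]:
    """Compose the edge functions along the shortest path."""
    path = shortest_path(graph, source, target)
    functions = [graph[a][b] for a, b in zip(path, path[1:])]

    def convert(value: Any) -> Any:
        for function in functions:
            value = function(value)
        return value
    return convert


conversions: Graph = {}
add_edge(conversions, ("A", "1", "PAYLOAD"), ("A", "1", "UNPARSED"), lambda s: s.split(","))
add_edge(conversions, ("A", "1", "UNPARSED"), ("A", "1", "EEA_OBJ"), lambda toks: [int(t) for t in toks])
add_edge(conversions, ("A", "1", "EEA_OBJ"), ("B", "1", "EEA_OBJ"), lambda x: x)  # nop
add_edge(conversions, ("B", "1", "EEA_OBJ"), ("B", "1", "UNPARSED"), lambda objs: [str(o) for o in objs])
add_edge(conversions, ("B", "1", "UNPARSED"), ("B", "1", "PAYLOAD"), ";".join)

a_to_b = get_converter(conversions, ("A", "1", "PAYLOAD"), ("B", "1", "PAYLOAD"))
print(a_to_b("1,2,3"))
```

With NetworkX, the dictionary and the search would be replaced by a graph object and its shortest-path routine; the composition step stays the same.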
Prevention of lossy conversions
Create two conversion graphs:
1. Only lossless conversions: “FORWARD_FORMAT_CONVERSIONS”
2. Lossless plus lossy conversions: “ALL_FORMAT_CONVERSIONS”
Use the lossless graph by default, and make users explicitly ask for lossy conversions.
If the user asks for a lossy conversion without being explicit, there will be no path in the “FORWARD_FORMAT_CONVERSIONS” graph. The library can then check for a path in “ALL_FORMAT_CONVERSIONS” and give them a helpful error message:
“No lossless path from NMEAv4 to KML. If you want to perform a lossy conversion, you must explicitly allow lossy conversions.”
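The check described above might look roughly like this. To keep the sketch short, nodes are plain format names rather than (format, version, level) tuples, the graphs are adjacency dictionaries, and `has_path`, `check_conversion`, and `LossyConversionError` are hypothetical names.

```python
from collections import deque
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def has_path(graph: Graph, source: str, target: str) -> bool:
    """Breadth-first reachability test."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


# Lossless edges only.
FORWARD_FORMAT_CONVERSIONS: Graph = {
    "NMEAv4": {"EEA"},
    "GNM": {"EEA"},
    "EEA": {"NMEAv4", "GNM"},
}
# The same edges plus the lossy EEA -> KML edge.
ALL_FORMAT_CONVERSIONS: Graph = {
    **FORWARD_FORMAT_CONVERSIONS,
    "EEA": {"NMEAv4", "GNM", "KML"},
}


class LossyConversionError(Exception):
    """Raised when only a lossy path exists and lossy was not requested."""


def check_conversion(source: str, target: str,
                     graph: Graph = FORWARD_FORMAT_CONVERSIONS) -> None:
    if has_path(graph, source, target):
        return
    if has_path(ALL_FORMAT_CONVERSIONS, source, target):
        raise LossyConversionError(
            f"No lossless path from {source} to {target}. If you want to perform "
            "a lossy conversion, you must explicitly allow lossy conversions."
        )
    raise LookupError(f"No conversion path from {source} to {target}.")


check_conversion("NMEAv4", "GNM")                                # lossless: fine
check_conversion("NMEAv4", "KML", graph=ALL_FORMAT_CONVERSIONS)  # explicit lossy: fine
```

Passing the graph as a parameter is what makes the lossy choice explicit at the call site.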
Extra parameters
Sometimes conversions need additional information
Example:
Conversion from DOFv3 -> DOFv4 requires a timestamp
1. Mark edges as having required parameters:
conversions.add_edge(
    ("DOF", "3", "PAYLOAD"),
    ("DOF", "3", "UNPARSED"),
    function=tokenize,
    required_params={"timestamp"},
)
2. Allow the user to supply arbitrary keyword arguments to get_converter():
get_converter(
    ("DOF", "3", "PAYLOAD"),
    ("DOF", "4", "PAYLOAD"),
    timestamp=get_datetime_for_id(id),
)
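A sketch of how the required parameters could be checked and bound once a path has been found. The edge dictionaries, `build_converter`, and the toy `tokenize` are assumptions for illustration; how the real library passes parameters to edge functions is not shown in the slides.

```python
import functools
from typing import Any, Callable, Dict, List

Edge = Dict[str, Any]  # {"function": callable, "required_params": set of names}


def build_converter(edges: List[Edge], **kwargs: Any) -> Callable[[Any], Any]:
    """Validate required parameters for a path, then bind them to each step."""
    required = set()
    for edge in edges:
        required |= edge.get("required_params", set())
    missing = required - set(kwargs)
    if missing:
        raise TypeError(f"missing required conversion parameters: {sorted(missing)}")

    steps = []
    for edge in edges:
        wanted = {name: kwargs[name] for name in edge.get("required_params", set())}
        steps.append(functools.partial(edge["function"], **wanted))

    def convert(value: Any) -> Any:
        for step in steps:
            value = step(value)
        return value
    return convert


def tokenize(payload: str, timestamp: int) -> list:
    """Toy tokenizer that needs an externally supplied timestamp."""
    return [(timestamp, line) for line in payload.splitlines()]


path = [
    {"function": tokenize, "required_params": {"timestamp"}},
    {"function": lambda tokens: [f"{ts}:{line}" for ts, line in tokens]},
]
convert = build_converter(path, timestamp=1315505346)
print(convert("a\nb"))
```

Checking the whole path up front means a missing parameter is reported when the converter is built rather than partway through a conversion. Extra keyword arguments that no edge needs are ignored in this sketch.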
Final API
get_converter(
    source_schema,
    target_schema,
    graph=FORWARD_FORMAT_CONVERSIONS,
    **kwargs
)
Benefits
• Centralizes the conversion code
• Fewer bugs, more performant
• Simplifies the code + less duplication
• Don’t need to know all of the input formats a priori
• Dynamic generation of converters
• Reduces chance of accidental lossy conversions
Summary
• We had a problem with multiple formats and converters between them
• By modelling it as a graph problem, it was easy to dynamically generate
converters
• This allowed for greater flexibility, greater safety
• When you have a web of conversion steps, you can use graph traversal
libraries to generate the shortest path to get the answers you want.
Thanks
Questions?
I’m Michael Overmeyer:
@movermeyer on every platform


Editor's Notes

  • #5: Why? Environmental: reef protection, bilge water dumping, oil spills, and most importantly illegal fishing. Logistical: port authorities, logistics companies, scheduling. Security: surveillance, smuggling, piracy.
  • #6: As a ship captain, how do you prevent collisions with other vessels? People tend to jump immediately to SONAR/RADAR, but there are a few major problems with that: the equipment is expensive, it requires a lot of power, and these systems are actually quite difficult to read, so they require some skill to operate. The most common method was simply to visually observe the other ships and try to estimate their speed, course, heading, and acceleration. Then you would do the same for your vessel and figure out the calculus to determine if you are going to collide or not. Obviously, this is also tricky, and fails in situations like night time, stormy weather, or when you are going around a tight curve in a waterway and can’t see what’s coming at you. Ships don’t stop on a dime. In fact, some of these vessels take in the neighbourhood of 20 minutes to stop. So sometimes, you will have two ships that know they are going to collide well in advance of the collision, but they can’t turn or stop fast enough to do anything about it. So the problem of ship collisions is what triggered the creation of AIS.
  • #7: All vessels transmit a “Here I am! Please don’t hit me” message to the other vessels in the area.
  • #8: Here are some of the fields that are transmitted. In the Type 1,2,3 messages, we have position information. These are transmitted every few seconds while the vessel is moving. In other message types, we have more static information that doesn’t change very often, like registration and destination and ETA.
  • #22: The first thing you do when you have a lot of formats: Create a new format! We created a new internal format that we called EEA. Not only could it hold all of the AIS message fields, but it also had “extension fields” where we would shove all the fields that might be specific to a format. For example, GNM has some metadata fields that are specific to GNM. We put those into a GNMMetadata field within the EEA spec. So now we have a format that can capture all of the complexity of all the AIS formats. It was written to be faithful to the AIS spec, which means that it handles the full complexity of the AIS spec.
  • #23: A side benefit of redesigning our format (and our in-memory representation) with EEA is that we got to fix some of the issues developers were accidentally creating. AIS is a complicated spec. For example, look at the definition for speed over ground. The field is 10 bits wide, giving 1024 possible values, and while most of those values can be interpreted as doubles, there are two reserved values that cannot. There is 1023 = Not Available, for when the ship doesn’t know how fast it is going. There is also the 1022 value, which means 102.2 knots or greater. The problem with all the ad hoc parsers is that their developers would often get the parsing of these fields wrong. They would often just parse the field as a double, not realizing that these special values existed, and then perform mathematical operations on the fields. So they would do things like average the values of the speed over ground, and end up with massive values when a large number of vessels were reporting “Not available”. With EEA, we fixed that, changing the type of the field to be either a double, or one of two special values. This prevents mathematical operations being applied on the field, and forces the developer to stop and think about how they want to actually handle the math.
  • #24: The next step was to define a common interface of functions we apply when parsing all formats. We eventually settled on this interface. You start with a “PAYLOAD” which is a collection of bytes representing messages, which might be the contents of a file for example. You then call tokenize() which finds the boundaries of the messages within the payload and splits the payload on those boundaries. You still haven’t parsed the messages, so you don’t know what they say yet. You only know the bytes that make up each message. We called this “UNPARSED” in this diagram. You can then call deserialize, which actually parses the bytes of the message and gives you an in-memory representation of the message. Most commonly, this was the new EEA format. Then on the reverse direction, we take individual messages and call serialize() on them to return them to the UNPARSED tokens we had before. And finally we call merge(), which writes the messages, one after another, into a payload. So these four functions are pretty much universal to format parsing. They also form a graph, perhaps a sort of state-machine where the data parse levels are the nodes, and the functions are the edges.
  • #25: We went and implemented these four functions for all of our data formats. And when you put all the conversion graphs beside one another, you notice something.
  • #26: The PARSED node for most of the formats is EEA.
  • #27: It’s the same format. Really, it’s all the same node. Conceptually, you could add edges between them with a NO-OP function.
  • #29: So now if you wanted to convert between GNM and NM4, you can just follow the edges from GNM PAYLOAD to NMEA4 PAYLOAD
  • #35: And you suddenly have the conversion steps.
  • #37: In order to represent the graph in code, we need a graphing library. I came across NetworkX and have had no complaints. It allows you to create nodes and edges in a graph, and then gives you an easy way to do algorithms like shortest path across graphs.
  • #38: Here’s what it looks like in code: We have our conversions object, which is just a NetworkX graph object. Then on each line, we add an edge between each of our parse levels. On each edge, we also supply the function to perform. Notice that the last line is an example where we jump between formats that are already parsed into the EEA in-memory representation. Therefore the function is a simple NO-OP.