Validating Enterprise Data Lake Using Open Source Data Validator – An Airline Industry Case Study
Reach the community at users@collaborate.jumbune.org
Download Jumbune from http://www.jumbune.org
An Open Source initiative LGPLv3 licensed
Table of Contents
I. Overview
II. Business Challenge and Future Proofing
III. Finding Anomalies using Jumbune’s Data Validation
IV. Result: Analytical Anomalies report
Overview
A renowned trans-Pacific airline ranks among the top international airlines in terms of number of passengers carried. The company operates two of the world’s longest non-stop flights and, on average, 58 flights between major cities. By 2013, the airline had expanded its fleet to six Airlines A310s and eight Airlines A500s and built its network to cover 30 destinations across the world.
Business Challenge and Future Proofing
The airline maintained its information systems in relational data stores. To meet specific analytical and future business needs, the company consolidated data sources from departments such as Human Resources, Operations, Sales, Maintenance, Customer Relations, Safety, Logistics and Revenue Accounting. All data from passenger itineraries and boarding details, maintenance logs, cargo tracking, fuel load, ticket prices, concessions and seating, and crew details is added to the data hub. The organization wanted to mine all of its data (with its large volume, variety and velocity) efficiently and effectively, but traditional databases were inefficient at performing basic operations and analytics across organizational silos.
The VP of Information Technology recommended creating a data lake to consolidate and store the data in a single repository that would serve all current and future analytical needs. The airline’s raw data includes:
1. Operations department data: information about aircraft, such as aircraft details, flying details, crew details, catering details, etc.
2. Customer Relations department data: information about passengers, such as personal information, aircraft details, and cabin details (First class, Business class, Economy class), etc.
3. Sales department data: information about manual and online bookings, such as passenger information, booking information, ticket details, etc.
4. Cargo details (tracking, weight, etc.)
5. Flight maintenance records, spare part records, and safety checks
6. Ticket pricing (across cities, discounts, etc.)
7. Human Resources department data: information about employees, such as personal information, salary, attendance details, designation, etc.
Big Data is largely about processing large volumes of data. Hadoop, being scalable, reliable and economical, was the preferred choice for storing and analyzing batch data.
Hadoop provides the ability to store large-scale enterprise data on the Hadoop Distributed File System (HDFS) and to analyze it using execution engines such as MapReduce, Hive, Storm, etc. HDFS is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
Finding Anomalies using Jumbune’s Data Validation
The engineering team began installation and configuration. They created a test environment, a 5-node Apache YARN cluster, and the preliminary Hive-based analytics worked as expected.
Moving to production, they configured a 100-node cluster running a commercial Hadoop distribution. After a month, management noticed that the analytical reports generated by the jobs were erroneous. One of the erroneous use cases was an analytical report on passengers’ carry-on luggage with respect to the age and gender of the passenger, along with the ratio of carry-on to checked-in luggage.
The engineering team spent many man-hours writing a customized MapReduce program to look for the root cause in the analytical logic. They later found that the actual problem was inconsistent data ingested by a malfunctioning ETL instance operating from one of the airports on a new route. An updated policy of the Airport Authority restrained the airline from recording the actual weight of carry-on luggage. This introduced a number of anomalies into the data hub, which led to the erroneous analytics.
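To illustrate how missing carry-on weights can distort a ratio report of this kind, here is a minimal Python sketch. The records, field names (`carry_on_kg`, `checked_kg`) and both aggregation functions are hypothetical and are not taken from the airline's jobs; the sketch only shows that an aggregate which silently treats a missing weight as zero gives a very different ratio from one computed over complete records.

```python
# Hypothetical luggage records; None marks a carry-on weight that was never recorded.
records = [
    {"gender": "F", "carry_on_kg": 7.0, "checked_kg": 20.0},
    {"gender": "F", "carry_on_kg": None, "checked_kg": 20.0},
    {"gender": "M", "carry_on_kg": 8.0, "checked_kg": 16.0},
    {"gender": "M", "carry_on_kg": None, "checked_kg": 24.0},
]


def ratio_nulls_as_zero(rows):
    """Naive aggregate: a missing carry-on weight is silently counted as 0 kg."""
    carry = sum(r["carry_on_kg"] or 0.0 for r in rows)
    checked = sum(r["checked_kg"] for r in rows)
    return carry / checked


def ratio_valid_only(rows):
    """Aggregate only over rows whose carry-on weight was actually recorded."""
    valid = [r for r in rows if r["carry_on_kg"] is not None]
    carry = sum(r["carry_on_kg"] for r in valid)
    checked = sum(r["checked_kg"] for r in valid)
    return carry / checked


naive = ratio_nulls_as_zero(records)
clean = ratio_valid_only(records)
print(f"carry-on/checked ratio, nulls as zero: {naive:.4f}")
print(f"carry-on/checked ratio, valid rows only: {clean:.4f}")
```

Neither number is "right" on its own. The point is that the analytical logic can be correct while the report is still wrong, which is why checking the ingested data first can save debugging effort.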
Jumbune’s Data Validation MapReduce job analyzes batch and incremental data files kept on HDFS and provides generic categories of validation: Null, Regex and Data Type. Jumbune makes it feasible to analyze terabytes of data in comparatively little time and helps find anomalies. It has its own customized MapReduce data validation framework that generically validates data on HDFS. Jumbune is highly optimized, user friendly, and can be operated remotely. Only the HDFS path and the validations to apply to the fields need to be provided as input to the data validation module.
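The three validation categories can be pictured with a small Python sketch. This is not Jumbune's implementation or its configuration format; the function names, the `rules` dictionary layout and the sample field names are assumptions made purely for illustration of what a null, data type and regex check do to a single record.

```python
import re


def is_null(value):
    """Null check: the field is missing, empty, or whitespace only."""
    return value is None or value.strip() == ""


def matches_type(value, type_name):
    """Data type check: the value parses as the expected type."""
    parsers = {"int": int, "float": float}  # hypothetical set of supported types
    try:
        parsers[type_name](value)
        return True
    except ValueError:
        return False


def matches_regex(value, pattern):
    """Regex check: the whole value matches the given pattern."""
    return re.fullmatch(pattern, value) is not None


def validate_record(record, rules):
    """Return a list of (field, failed_check) pairs for one record."""
    violations = []
    for field, checks in rules.items():
        value = record.get(field)
        if is_null(value):
            if checks.get("not_null"):
                violations.append((field, "null"))
            continue  # nothing further to check on an empty value
        if "type" in checks and not matches_type(value, checks["type"]):
            violations.append((field, "data_type"))
        if "regex" in checks and not matches_regex(value, checks["regex"]):
            violations.append((field, "regex"))
    return violations


# Hypothetical per-field rules and one sample record.
rules = {
    "passenger_id": {"not_null": True, "type": "int"},
    "carry_on_kg": {"not_null": True, "type": "float"},
    "seat": {"regex": r"\d{1,2}[A-K]"},
}
record = {"passenger_id": "10x7", "carry_on_kg": "", "seat": "12C"}
print(validate_record(record, rules))
```

In a MapReduce setting, a function like `validate_record` would run in the map step over each line of the input files, with the violations counted in the reduce step.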
The engineering team ran Jumbune’s Data Validation module with a null check on all fields and found that the carry-on luggage field contained null values. Jumbune analyzed the HDFS data and reported the number of null values in the data. Furthermore, the team found three more use cases where Jumbune’s Data Validation module helped them find data anomalies. The use cases were:
1. To meet marketing needs, the team wanted to check how many passengers had not entered their mobile number.
Solution: The marketing team applied a null check on the mobile number field and ran Jumbune. Jumbune launched its MapReduce program, analyzed the HDFS data and presented an analytics report listing the number of passengers who had not entered their mobile numbers.
2. The management team needed to collect customer feedback on the flight experience. They found that the most feasible way was to send an SMS to all passengers, but they did not know whether data type validation had been applied to the phone number field.
Solution: With the help of the engineering team, they launched a Jumbune Data Validation job with a data type check on the phone number field. The analytics report listed 102 passengers, out of millions of entries, who had entered invalid phone numbers.
3. The airline’s sales manager observed a fall in sales. To find the reason, the marketing team created an email survey for passengers, which first required knowing how many passengers had given incorrect email ids.
Solution: With the help of the engineering team, they launched a Jumbune Data Validation job with a regular expression check on the email field. The analytics report listed 136 passengers who had entered invalid email ids.
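The three use cases above all reduce to counting violations per field across many records. The following Python sketch imitates that map-and-reduce pattern in memory on a few made-up comma-separated lines. The column order, the deliberately simple email pattern and the sample data are all assumptions for illustration; this is not Jumbune's code, and a real job would read its input from HDFS.

```python
import re
from collections import Counter

FIELDS = ["name", "mobile", "phone", "email"]  # hypothetical column order
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # deliberately simple pattern


def map_phase(line):
    """Yield one (field, check) key for each violation found in a delimited line."""
    row = dict(zip(FIELDS, (v.strip() for v in line.split(","))))
    if row["mobile"] == "":
        yield ("mobile", "null")  # use case 1: null check
    if row["phone"] and not row["phone"].isdigit():
        yield ("phone", "data_type")  # use case 2: data type check
    if row["email"] and EMAIL_RE.fullmatch(row["email"]) is None:
        yield ("email", "regex")  # use case 3: regular expression check


def reduce_phase(keys):
    """Sum the occurrences of each (field, check) key."""
    return Counter(keys)


# Made-up sample lines standing in for passenger records.
lines = [
    "A. Rao,5550101,5550101,rao@example.com",
    "B. Lee,,555-0102,lee@example",
    "C. Kim,5550103,5550103,kim(at)example.com",
    "D. Roy,,5550104,roy@example.org",
]
report = reduce_phase(key for line in lines for key in map_phase(line))
for (field, check), count in sorted(report.items()):
    print(f"{field:<8}{check:<12}{count}")
```

Each count in `report` corresponds to one line of an anomalies report, such as the number of passengers with a missing mobile number.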
Result: Analytical Anomalies report
Analyzing enterprise data that contains anomalies results in significant loss of revenue and time. Jumbune’s Data Validation helped this organisation obtain an analytical report of the anomalies in its data hub.
The airline’s engineering team downloaded Jumbune from http://jumbune.org/
