Aucfanlab Datalake
- Big Data Management Platform -
1. Abstract
This report introduces the “Aucfan Datalake”, a cloud-based big data management
platform developed at Aucfan. The Aucfan Datalake provides easy job orchestration
and management for data ingestion, storage, and the creation of datamarts. While it uses
cutting-edge big data technologies for these workloads, all technical details are
abstracted away from the user, who only has to know a simple high-level interface
for working with his data. The system already handles enormous daily workloads and
caters to a broad range of use cases at Aucfan. Individual scalability of its
components and the use of a cloud-based architecture make it adaptable for all
future workloads.
2. Introduction
The big data ecosystem has evolved at a fast pace with new technologies and
updates to existing technologies being released at an increasing rate in recent years.
The rise of Spark, the influx of new databases, the advent of fully managed cloud
services and similar developments have all contributed to creating a complex and
convoluted ecosystem that is hard to navigate. The user has to learn about different
technologies like Hadoop and Spark, about orchestrating his data workloads, and about
many other topics. Additionally, he has to manage all these resources and take care of
scaling them to his needs.
To make these technologies easily accessible to internal developers and to spread the
use of big data technology in the company, the Datalake project introduced here was
born.
3. Design goals
Based on the above-mentioned difficulties, the following design goals were
identified for the system:
- Enable the user to easily schedule data workloads
- Provide a scalable central repository for the company’s data
- Abstract away all big data technology-related details
- Let the user work with high-level concepts like “datasets”, “import”, “export” etc.
4. High-level system overview
Given the goals stated above, a system design was conceived that encapsulates the
actual implementation from the user and provides an accessible API interface. From
a bird’s eye view, the final system can be described as three parts: the data that the
user owns and wants to work with; the user, who uses the Datalake API to execute
data workflows; and finally the Datalake itself, which provides the API and
implements the respective actions.
The provided API features entry points for importing data from the user’s data
sources, managing metadata, and exporting datasets to internal and external datamarts.
Additional features include scheduling workflows, categorizing data and partitioning.
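To illustrate these entry points, the sketch below outlines how the high-level interface could look from a client’s perspective. It is a minimal illustration only; all type and method names are hypothetical and do not correspond to the actual Datalake API.

    import java.util.Map;

    /** Hypothetical client-side view of the Datalake's high-level operations. */
    public interface DatalakeClient {

        /** Register an external storage location (e.g. an Azure Blob Storage account). */
        String registerDatasource(Map<String, Object> definition);

        /** Create a dataset with optional metadata such as a description and tags. */
        String createDataset(Map<String, Object> metadata);

        /** Schedule an import job that loads data from a datasource into a dataset. */
        String createImporter(String datasetId, String datasourceId, Map<String, Object> options);

        /** Attach a parser (format, schema, custom parsing rules) to a dataset. */
        String createParser(String datasetId, Map<String, Object> parserDefinition);

        /** Export a parsed dataset to a datamart sink such as Azure SQL or an S3 bucket. */
        String createExporter(String datasetId, String sinkDatasourceId);

        /** Check whether a scheduled job is still running, succeeded or failed. */
        String getJobStatus(String jobId);
    }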
5. API workflow for user
In this section the use of the API is explained from the user’s point of view. The user
works exclusively with the API and its high-level concepts and thus does not need to
know anything about the internally used big data technologies. The high-level
concepts represent, among other things, data sources, datasets, import and export
jobs, as well as a dataset-specific parser.
A detailed description of a typical workflow from importing the data to exporting it to
a specific datamart follows:
1. The user registers a datasource by defining the type of storage, location and
necessary authentication information for his data. This might, for example, be
an Azure Blob Storage account.
2. He then creates a dataset, which might include metadata like a description
or tags. Information like the size of the data and the row count is added
automatically when the actual data is imported.
3. Next, the user creates an Importer object, which contains a reference to the
previously created datasource, to schedule the data import job. Additional
options for defining the temporal partitioning of the data may be added.
4. In the final step for the import workflow the user creates a Parser object
specific to his dataset. This parser contains information regarding the
format of the data, its schema and custom parsing instructions.
For example, for parsing CSV files the user would set CSV as the data
format, set CSV-related properties like the column separator or escape
character, and then define the schema of his data.
5. Once the data is parsed and stored as a dataset it can later be exported to
various kinds of datamarts. For this the user only needs to create an
Exporter object that references a datasource (for example an S3 bucket or
Azure SQL database) as the sink.
The user can monitor the execution of each of the above steps to check whether a
job is still running, has succeeded or has failed. In the case of failure, he can access the
error logs of the respective job.
Since the above actions are simple REST-API calls, shell tools and a
web-based GUI were easily created on top of them to allow users to interact with the system.
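To give a concrete impression of such calls, the following minimal sketch registers a datasource and then polls a job status using Java’s built-in HTTP client. The host name, endpoint paths and JSON fields are assumptions made for this example and do not reflect the actual Datalake API.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class DatalakeApiExample {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // Hypothetical payload: register an Azure Blob Storage account as a datasource.
            String datasourceJson = """
                {"type": "AZURE_BLOB",
                 "account": "examplestorageaccount",
                 "container": "raw-data",
                 "authentication": {"key": "placeholder-storage-key"}}""";

            HttpRequest register = HttpRequest
                    .newBuilder(URI.create("https://datalake.example.local/api/datasources"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(datasourceJson))
                    .build();
            HttpResponse<String> created = client.send(register, HttpResponse.BodyHandlers.ofString());
            System.out.println("Datasource registered: " + created.body());

            // Hypothetical status check for a previously scheduled import job.
            HttpRequest status = HttpRequest
                    .newBuilder(URI.create("https://datalake.example.local/api/jobs/1234/status"))
                    .GET()
                    .build();
            System.out.println("Job status: " + client.send(status, HttpResponse.BodyHandlers.ofString()).body());
        }
    }

The remaining steps (creating the dataset, importer, parser and exporter) follow the same request pattern with their respective payloads.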
6. System architecture
The technical architecture of the system encompasses a Java application for the API
and system logic, as well as an ensemble of cloud services for the orchestration and
execution of the data workflows. Here, the exclusive use of cloud services allows for
a cost-effective and scalable solution. Microsoft Azure was chosen as the primary
cloud provider for the Datalake, as it provides an extensive portfolio of data services
and integrates well with other systems in the company.
6.1 REST-API
The Datalake project and other projects in the company are designed according
to a microservice architecture with a REST-API server as interface between the
services and users. The API server is implemented as a Spring-based Java
application and deployed using Azure as a Platform as a Service (PaaS) provider. For
the persistence layer of the application and the main repository for metadata, a
NoSQL database was chosen. This allows for adaptability to changes and new
features as well as easy scaling.
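The fragment below sketches how such a Spring-based endpoint and its NoSQL-backed persistence layer could be wired together. It is a simplified illustration: the entity fields and route names are assumptions, and the concrete NoSQL store is hidden behind Spring Data’s generic CrudRepository abstraction.

    import org.springframework.data.annotation.Id;
    import org.springframework.data.repository.CrudRepository;
    import org.springframework.web.bind.annotation.*;

    // Illustrative metadata entity for a dataset; the field names are assumptions.
    class Dataset {
        @Id String id;
        String name;
        String description;
        long rowCount; // filled in automatically once the data has been imported
    }

    // Spring Data hides the concrete NoSQL store behind a common repository interface.
    interface DatasetRepository extends CrudRepository<Dataset, String> {}

    @RestController
    @RequestMapping("/api/datasets")
    class DatasetController {
        private final DatasetRepository repository;

        DatasetController(DatasetRepository repository) {
            this.repository = repository;
        }

        // Create a dataset and persist its metadata.
        @PostMapping
        Dataset create(@RequestBody Dataset dataset) {
            return repository.save(dataset);
        }

        // Look up the metadata of an existing dataset.
        @GetMapping("/{id}")
        Dataset get(@PathVariable String id) {
            return repository.findById(id).orElseThrow();
        }
    }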
6.2 Orchestration
The main job orchestration and monitoring is implemented using Azure’s Data
Factory service. It provides methods to manage and monitor various pipelines for
moving data or executing Hadoop / Spark jobs. Creation of pipelines and resources
in Azure Data Factory is done via REST-API calls from the Datalake’s Java backend
server to the Azure Resource Manager API.
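The sketch below illustrates this mechanism by sending a pipeline definition with a single Hive activity to the Azure management endpoint. The subscription, resource group, factory name, pipeline JSON and API version are placeholders loosely following the Data Factory v1 REST conventions; the exact resource paths should be taken from the Azure documentation rather than from this example.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class DataFactoryPipelineExample {
        public static void main(String[] args) throws Exception {
            // Placeholder identifiers for the ARM resource path of a Data Factory (v1) pipeline.
            String subscriptionId = "00000000-0000-0000-0000-000000000000";
            String resourceGroup  = "datalake-resources";
            String factoryName    = "datalake-factory";
            String url = "https://management.azure.com/subscriptions/" + subscriptionId
                    + "/resourceGroups/" + resourceGroup
                    + "/providers/Microsoft.DataFactory/datafactories/" + factoryName
                    + "/datapipelines/ImportPipeline?api-version=2015-10-01";

            // Minimal illustrative pipeline with a single HDInsight Hive activity.
            String pipelineJson = """
                {"name": "ImportPipeline",
                 "properties": {
                   "activities": [{
                     "name": "ParseDataset",
                     "type": "HDInsightHive",
                     "typeProperties": {"scriptPath": "scripts/parse_dataset.hql"}
                   }]
                 }}""";

            String token = System.getenv().getOrDefault("AZURE_AD_TOKEN", "placeholder-token");
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("Authorization", "Bearer " + token)
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(pipelineJson))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }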
6.3 Data Storage
All data is stored in Azure Blob Storage, which makes it possible to scale storage
capacity and the processing power of the Hadoop / Spark clusters separately.
The system’s main storage is divided into two logical locations:
1. The “Datalake”, which stores raw data that is imported from various sources as-is
2. The “DataPlatform”, which serves as a structured and optimized data store
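As an illustration, this split could map to blob storage locations along the following lines; the container names, account name and path layout are hypothetical, and HDInsight addresses Blob Storage through the wasb:// scheme:

    wasb://datalake@storageaccount.blob.core.windows.net/dataset_1234/year=2016/month=10/raw_upload.csv
    wasb://dataplatform@storageaccount.blob.core.windows.net/dataset_1234/year=2016/month=10/part-00000.avro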
6.4 Hive
Hive is used as the main big data warehouse and serves to parse and store
datasets according to the format and schema information provided by the user.
Parsed datasets are persisted as external Hive tables, with the actual data being
stored as gzipped Avro files. This allows for cost-effective storage and efficient
access to the data.
The schema of these Hive tables is defined by mapping the schema that the user
provided to the corresponding Hive datatypes. Here, the user-defined schema uses
common datatypes defined for the Datalake, which are then mapped to the
corresponding datatypes for each data store used (e.g. Hive, Azure SQL, etc.).
Additionally, custom parsing instructions in the form of HiveQL statements can be
defined on a per-column basis.
Data that is parsed and stored in this way allows for fast access and can be used
directly by the datamart export function.
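A rough sketch of this mapping step is shown below: a handful of hypothetical Datalake datatypes are translated into Hive datatypes and assembled into a CREATE EXTERNAL TABLE statement for an Avro-backed, temporally partitioned dataset. The type names, table layout and storage location are simplified assumptions; the actual Datalake mapping covers more types and options.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class HiveSchemaMapping {

        // Simplified mapping from hypothetical Datalake datatypes to Hive datatypes.
        static String toHiveType(String datalakeType) {
            switch (datalakeType) {
                case "STRING":    return "STRING";
                case "LONG":      return "BIGINT";
                case "DOUBLE":    return "DOUBLE";
                case "TIMESTAMP": return "TIMESTAMP";
                case "BOOLEAN":   return "BOOLEAN";
                default: throw new IllegalArgumentException("Unknown type: " + datalakeType);
            }
        }

        // Build a CREATE EXTERNAL TABLE statement for a parsed, Avro-backed dataset.
        static String createExternalTable(String table, Map<String, String> schema, String location) {
            String columns = schema.entrySet().stream()
                    .map(e -> "  `" + e.getKey() + "` " + toHiveType(e.getValue()))
                    .collect(Collectors.joining(",\n"));
            return "CREATE EXTERNAL TABLE IF NOT EXISTS " + table + " (\n" + columns + "\n)\n"
                    + "PARTITIONED BY (`year` INT, `month` INT)\n"
                    + "STORED AS AVRO\n"
                    + "LOCATION '" + location + "'";
        }

        public static void main(String[] args) {
            Map<String, String> schema = new LinkedHashMap<>();
            schema.put("item_id", "STRING");
            schema.put("price", "DOUBLE");
            schema.put("sold_at", "TIMESTAMP");
            System.out.println(createExternalTable("dataset_1234", schema,
                    "wasb://dataplatform@storageaccount.blob.core.windows.net/dataset_1234"));
        }
    }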
6.5 Hadoop / Spark cluster
All Hive and Hadoop jobs are run on a fully-managed and scalable Azure
HDInsight cluster. For this, a Linux cluster running Spark is deployed. Hive queries
are executed with vectorization enabled, and Tez is used as the execution engine for
better performance. To further improve performance, data is partitioned according to
its temporal characteristics.
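The settings referred to here are ordinary Hive session properties. The snippet below is a minimal sketch of how a client could enable them over the HiveServer2 JDBC interface; the connection URL and table name are placeholders, and in practice the same properties can also be set cluster-wide in the Hive configuration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveSessionSettings {
        public static void main(String[] args) throws Exception {
            // Placeholder HiveServer2 endpoint; requires the hive-jdbc driver on the classpath.
            String jdbcUrl = "jdbc:hive2://headnode.example.local:10001/default";

            try (Connection connection = DriverManager.getConnection(jdbcUrl, "hive", "");
                 Statement statement = connection.createStatement()) {
                // Use Tez as the execution engine and enable vectorized query execution.
                statement.execute("SET hive.execution.engine=tez");
                statement.execute("SET hive.vectorized.execution.enabled=true");
                statement.execute("SET hive.vectorized.execution.reduce.enabled=true");

                // Queries issued in this session now benefit from both settings and
                // can prune partitions thanks to the temporal partitioning of the data.
                statement.execute("SELECT COUNT(*) FROM dataset_1234 WHERE `year` = 2016 AND `month` = 10");
            }
        }
    }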
6.6 Internal datamarts
The system additionally manages internal Azure SQL Databases and an Azure
SQL Data Warehouse instance. These can be used for datamarts created from
datasets.
6.7 Cloud services
An advantage of using cloud services exclusively is that all components can be
scaled individually. This allows appropriate scaling according to the performance
needs of each use case. The cluster used is a scalable HDInsight cluster whose
number of nodes can easily be increased or decreased. Additionally, the data storage
grows with its usage. The same is true for the Azure SQL Data Warehouse, which
provides near-infinite scaling capability for SQL workloads. Overall, this allows the
system to react to changing workloads in all components.
7. Summary
The Datalake is currently used for several use cases in the company. The most
important is its role as the central storage component for the company’s EC market
data crawlers. The ingested data is exposed to BI tools through datamarts created
from it. Additionally, the Datalake functions as the central repository for gathering
the company’s data and as the backend for the “Aucfan Dataexchange”, a web service
that allows users to browse the contained data in a web browser.
The multitude of possible use cases and the diverse range of users who can easily
manage their data workloads with the system further ensure its use for many
additional use cases in the future. Finally, the individual scalability of all components
provides a cost-effective and efficient system that handles these use cases with good
performance.
