Etl techniques

A SEMINAR REPORT ON
“ABIG DATAPERSPECTIVE OF CURRENT ETL
TECHNIQUES”
Submitted in partial fulfilment of the requirements for the
Award of the degree of
MASTER OF TECHNOLOGY
IN
COMPUTER NETWORK ENGINEERING
Submitted By
MAHEZABEEN ILKAL

1 | P a g e
Table of Contents
I.INTRODUCTION...............................................................................................................................................2
II ETL--EXTRACT TRANSFORM AND LOAD..............................................................................................4
ETL TOOLS......................................................................................................................................................6
How ETL Tools Work....................................................................................................................................7
Providers of ETL Tools ..................................................................................................................................8
III. RELATED WORK........................................................................................................................................24
IV. AVAILABLE ETL TECHNIQUES TO HANDLE DATA STREAMS..........................................................25
V. Literature survey of ETL techniques : ............................................................................................................30
VI. CONCLUSION..............................................................................................................................................31
VII.REFERENCE...............................................................................................................................................32

2 | P a g e
ABIG DATA PERSPECTIVE OF CURRENT
ETL TECHNIQUES
Abstract— Dynamic data stream processing using real time ETL techniques is currently a high concern as the
amount of data generated is increasing day by day with the emergence of Internet of Things, Big Data and
Cloud. Data streams are characterized by huge volume that can arrive with a high velocity and in different
formats from multiple sources. Therefore, real time ETL techniques should be capable of processing the data
to extract value out of it by addressing the issues related to these characteristics that are associated with data
streams. In this work, we asses and analyze the capability of existing ETL techniques to handle dynamic data
streams and we present whether the existing techniques are relevant in the present situation.
I.INTRODUCTION
Data Management in the electronic and computing era dawned with data model and design using techniques
like normalization. The very idea was based on the fact that information in its original form is mostly
unstructured and contains redundant as well as information that can be derived from primitives. Of the several
reasons, storage cost, communication cost and computing cost were of key. Given the fact that each MB of
storage costed close to USD700 in early 80s, about USD1200/Mbps around 1998 and about USD10000/MIPS
in the 80s, these considerations made real economic sense. Yet another aspect related to data was about
quality related to missing data, outliers and inconsistent formats. All these issues were addressed by
developing techniques to extract, transform and load (ETL) data in to specific databases. However, time is not
static and technology has progressed. As of today, the cost of storage, computing or communication are
negligible. The amount of data generated has also increased tremendously making traditional data processing
and data warehousing techniques becoming infeasible. This has resulted in emergence of Big Data paradigm.

3 | P a g e
Increasing demands of Big Data processing has led to many challenges that arise due to its characteristics
such as volume, velocity, variety, value and veracity. Volume refers to the huge amount of data that needs to
be processed. Velocity refers to the speed at which the data arrives for processing. Variety refers to the data
that can arrive from different sources with different formats. Value refers to the valuable information that can
be extracted out of the raw data to make important business decisions. Veracity refers to the authenticity of
the data. The biggest challenge in processing big data that arrives in streams is to extract the valuable
information out of it by addressing the issues related to its volume, velocity, variety and veracity.
In order to ease the process of querying the useful information, the processed data are stored in the Data
warehouse. The data warehouses are often designed to support analytical data processing. Usually the ETL
process forms an inseparable part of data warehouse which is used for processing the data. The evolution of
ETL process can be divided into three following categories: • Traditional ETL that can handle processing of
highly structured data. Usually databases are considered as source of input and it operates usually in batch
processing mode. • Near-real time ETL that process the structured data more efficiently than traditional ETL
by reducing the refresh time and thus attempts to maintain the freshness of processed data in data warehouse.
• Real-time ETL which processes the data streams in real time that can arrive from different sources in
various formats and characterized by huge volume and high velocity.
Many of the existing ETL techniques process the static data obtained usually from databases that are highly
structured and populate the data warehouse in the required format for further decision making. Some of the
ETL techniques attempt to handle processing of data streams, but data streams are characterized by volume,
velocity [1][2] and variety [3] which needs to be addressed for processing data. From this perspective, the
currently available so called real time ETL approaches are inadequate to address all these characteristics of
data streams. Therefore there is a need to redefine the real time ETL process that can address the issues
related to processing of data streams.

4 | P a g e
II ETL--EXTRACT TRANSFORM AND LOAD
In Data Warehouse (DW) environment, Extraction-Transformation-Loading (ETL) processes constitute the
integration layer which aims to pull data from data sources to targets, via a set of transformations. ETL is
responsible for the extraction of data, their cleaning, conforming and loading into the target.
ETL is a critical layer in DW setting. It is widely recognized that building ETL processes is expensive
regarding time, money and effort. It consumes up to 70% of resources. By this work we intend to enrich the
field of ETL processes, the backstage of data warehouse, by presenting a survey on these processes. DW
Layers Figure 1 illustrates the architecture of DW system. In this figure the one can note that DW system has
four levels:
• Sources: They encompass all types of data sources. They are data provider. The two famous types are
databases and flat files. Finally let note that sources are autonomous or semi-autonomous.
• ETL: stands for Extraction Transformation and Loading. It is the integration layer in DW environment. ETL
tools pull data from several sources (databases tables, flat files, ERP, internet, and so on), apply complex
transformation to them. Finally in the end, data are loaded into the target which is data warehouse store in
DW environment.
 DW: is a central repository to save data produced by ETL layer. DW is a database including fact tables
besides dimension tables. Together these tables are combined in a specific schema which may be star schema
or snowflake schema.
• Reporting and Analysis layer has the mission to catch end-user request and translate it to DW. Collected data
are served to end-users in several formats. For example data is formatted into reports, histograms,
dashboard...etc

5 | P a g e
ETL is a critical component in DW environment. Indeed, it is widely recognized that building ETL
processes, during DW project, are expensive regarding time and money. ETL consume up to 70% of
resources [3], [5], [4], [2]. Interestingly [2] reports and analyses a set of studies proving this fact. In other
side, it is well known too, that the accuracy and the correctness of data, which are parts of ETL responsibility,
are key factors of the success or failure of DW projects.
Given the fact expressed above, about ETL importance, the next section presents ETL missions and its
responsibility.
ETL Mission: As its name indicates, ETL performs three operations (called also steps) which are Extraction,
Transformation and Loading. Upper part of figure 2 shows an example of ETL processes with Talend Open
source tool. The second part of figure 2 shows the palette of Talend tool that is a set of components to build
ETL processes. In what follows, we present each ETL step separately.
Extraction: step has the problem of acquiring data from a set of sources which may be local or distant.
Logically, data sources come from operational applications, but there is an option to use external data sources
for enrichment. External data source means data coming from external entities. Thus during extraction step,
ETL tries to access available sources, pull out the relevant data, and reformat such data in a specified format.
Finally, this step comes with the cost of sources instability besides their diversity. Finally, according to figure
2, this step is performed over source.
Loading: step conversely to previous step, has the problem of storing data to a set of targets. During this step,
ETL loads data into targets which are fact tables and dimension in DW context. However, intermediate results
will be written to temporary data stores. The main challenges of a loading stepare to access available targets
and to write the outcome data (transformed and integrated data) into the targets. This step can be a very time-
consuming task due to indexing and partitioning techniques used in data warehouses [6]. Finally, according to
figure 2, this step is performed over target.

6 | P a g e
Transformation: step is the most laborious one where ETL adds value [3]. Indeed during this step,
ETL carries out the logic of business process instanced as business rules (sometime called mapping rules) and
deals with all types of conflicts (syntax, semantic conflicts … etc). This step is associated with two words:
clean and conform. In one hand, cleaning data aims to fix erroneous data and to deliver clean data for end
users (decisions makers). Dealing with missing data, rejecting bad data are examples of data cleaning
operations. In other hand, conforming data aims to make data correct, in compatibility with other master data.
Checking business rules, checking keys and lookup of referential data are example of conforming operations.
At technical level and in order to perform this step, ETL should supplies a set of data transformations or
operators like filter, sort, inner join, outer joins…etc. Finally this step involves flow schema management
because the structure of processed data is changing and modified step by step, either by adding or removing
attributes. We refer interested readers to [3] and [5] for more details and explanation of each step.
EXTRACTING
Data brought up from external sources
TRANSFORMING
Data to fit demanded standard
LOADING
Converted data into a destination DW
ETL TOOLS
Many ETL tools were originally developed to make the task of the data warehouse developer easier and more
fun. Developers are spared the arduous task of handwriting SQL code, replacing it with easy drag and drop to
develop a data warehouse.

7 | P a g e
Today, the top ETL tools in the market have vastly expanded their functionality beyond data warehousing and
ETL. They now contain extended functionalities for data profiling, data cleansing, Enterprise Application
Integration (EAI), Big Data processing, data governance and master data management.
The need to become more data-dependent has forced many companies to invest in complicated data
warehousing systems. Differentiation and incompatibility between systems led to an uncontrolled growth of
costs and time needed to coordinate the processes. ETL tools were created to simplify the data management
and simultaneously reduce the absorbed effort.
Several types of ETL tools exist, according to different customer needs:
 Tools that perform and supervise only selectedstages of the ETL process, such as data migration tools
(EtL Tools or “small t tools”) or data transformation tools (eTl Tools, “capital T tools”).
 Tools that offer complete solutions (ETL Tools) and include many functions intended for processing large
amounts of data or more complicated ETL projects.
 Code base tools is a family of programming tools that support many operating systems and programming
languages.
 GUI-based tools remove the coding layer and allow users to work without little knowledge of coding
languages.
Some tools, such as server engine tools, execute many simulatneous ETL steps performed by more than one
developer, while other tools, such as client engine tools, are simpler and execute ETL routines on the same
machine as they are developed.
How ETL Tools Work
The first task includes data extraction from internal or external sources. After sending queries to the source
system, data may go indirectly to the database. However, there is usually a need to monitor or gather more
information and then go to the staging area. Some tools extract only new or changed information automatically,
so users don't need to update it independently.
The second task includes transformation, which is a broad category:

8 | P a g e
 Transforming data into a stucture required to continue the operation (extracted data usually includes a
sructure typical to the source)
 Sorting data
 Connecting or separating
 Cleansing
 Checking quality
The third task includes loading the data into a data warehouse.
ETL tools include many additional capabilities beyond the main three - extraction, transformation and
loading. For example, sorting, filtering, data profiling, quality control, cleansing, monitoring, synchronization
and consolidation.
Providers of ETL Tools
Here are the types of ETL tools which are currently being used in the market which makes the data
management task much easier and simultaneously improves data warehousing.
1) QuerySurge
QuerySurge is ETL testing solution developed by RTTS. It is built specifically to automate the testing of Data
Warehouses & Big Data. It ensures that the data extracted from data sources remains intact in the target
systems as well.
Features:
 Improve data quality & data governance
 Accelerate your data delivery cycles
 Helps to automate manual testing effort
 Provide testing across the different platform like Oracle, Teradata, IBM, Amazon, Cloudera, etc.

9 | P a g e
 It speeds up testing process up to 1,000 x and also providing up to 100% data coverage
 It integrates an out-of-the-box DevOps solution for most Build, ETL & QA management software
 Deliver shareable, automated email reports and data health dashboards
2) MarkLogic:
MarkLogic is a data warehousing solution that makes data integration easier and faster using an array of
enterprise features. This tool helps to perform very complex search operations. It can query data including
documents, relationships, and metadata.
Features:
 The Optic API can perform joins and aggregates over documents, triples, and rows.
 It allows specifying more complex security rules for all the elements within documents
 Writing, reading, patching, and deleting documents in JSON, XML, text, or binary formats
 Database Replication for Disaster Recovery
 Specify Output Options on the App Server Configuration
 Importing and Exporting Configuration Information

10 | P a g e
3) Oracle:
Oracle data warehouse software is a collection of data which is treated as a unit. The purpose of this database
is to store and retrieve related information. It helps the server to reliably manage huge amounts of data so that
multiple users can access the same data.
Features:
 Distributes data in the same way across disks to offer uniform performance
 Works for single-instance and real application clusters
 Offers real application testing
 Common architecture between any Private Cloud and Oracle's public cloud
 Hi-Speed Connection to move large data
 Works seamlessly with UNIX/Linux and Windows platforms
 It provides support for virtualization
 Allows connecting to the remote database, table, or view

11 | P a g e
4) AmazonRedShift:
Amazon Redshift is an easy to manage, simple, and cost-effective data warehouse tool. It can analyze almost
every type of data using standard SQL.
Features:
 No Up-Front Costs for its installation
 It allows automating most of the common administrative tasks to monitor, manage, and scale your data
warehouse
 Possible to change the number or type of nodes
 Helps to enhance the reliability of the data warehouse cluster
 Every data center is fully equipped with climate control
 Continuously monitors the health of the cluster. It automatically re-replicates data from failed drives and
replaces nodes when needed

12 | P a g e
5) Domo:
Domo is a cloud-based Data warehouse management tool that easily integrates various types of data sources,
including spreadsheets, databases, social media and almost all cloud-based or on-premise Data warehouse
solutions.
Features:
 Help you to build your dream dashboard
 Stay connected anywhere you go
 Integrates all existing business data
 Helps you to get true insights into your business data
 Connects all of your existing business data
 Easy Communication & messaging platform
 It provides support for ad-hoc queries using SQL
 It can handle most concurrent users for running complex and multiple queries
The Teradata Database is the only commercially available shared-nothing or Massively Parallel Processing
(MPP) data warehousing tool. It is one of the best data warehousing tool for viewing and managing large
amounts of data.

13 | P a g e
Features:
• Simple and Cost Effective solutions
• The tool is best suitable option for organization of any size
• Quick and most insightful analytics
• Get the same Database on multiple deployment options
• It allows multiple concurrent users to ask complex questions related to data
• It is entirely built on a parallel architecture
• Offers High performance, diverse queries, and sophisticated workload management
7) SAP:
SAP is an integrated data management platform, to maps all business processes of an organization. It is an
enterprise level application suite for open client/server systems. It has set new standards for providing the best
business information management solutions.
Features:
 It provides highly flexible and most transparent business solutions
 The application developed using SAP can integrate with any system
 It follows modular concept for the easy setup and space utilization
 You can create a Database system that combines analytics and transactions. These next next-generation
databases can be deployed on any device
 Provide support for On-premise or cloud deployment
 Simplified data warehouse architecture
 Integration with SAP and non-SAP applications

14 | P a g e
8) SAS:
SAS is a leading Datawarehousing tool that allows accessing data across multiple sources. It can perform
sophisticated analyses and deliver information across the organization.
Features:
 Activities managed from central locations. Hence, user can access applications remotely via the Internet
 Application delivery typically closer to a one-to-many model instead of one-to-one model
 Centralized feature updating, allows the users to download patches and upgrades.
 Allows viewing raw data files in external databases
 Manage data using tools for data entry, formatting, and conversion
 Display data using reports and statistical graphics
9) IBM – DataStage:
IBM data Stage is a business intelligence tool for integrating trusted data across various enterprise systems. It
leverages a high-performance parallel framework either in the cloud or on-premise. This data warehousing
tool supports extended metadata management and universal business connectivity.
Features:
 Support for Big Data and Hadoop
 Additional storage or services can be accessed without need to install new software and hardware
 Real time data integration

15 | P a g e
 Provide trusted ETL data anytime, anywhere
 Solve complex big data challenges
 Optimize hardware utilization and prioritize mission-critical tasks
 Deploy on-premises or in the cloud
10) Informatica:
Informatica PowerCenter is Data Integration tool developed by Informatica Corporation. The tool offers the
capability to connect & fetch data from different sources.
Features:
 It has a centralized error logging system which facilitates logging errors and rejecting data into relational
tables
 Build in Intelligence to improve performance
 Limit the Session Log
 Ability to Scale up Data Integration
 Foundation for Data Architecture Modernization
 Better designs with enforced best practices on code development
 Code integration with external Software Configuration tools
 Synchronization amongst geographically distributed team members

16 | P a g e
11) MS SSIS:
SQL Server Integration Services is a Data warehousing tool that used to perform ETL operations; i.e. extract,
transform and load data. SQL Server Integration also includes a rich set of built-in tasks.
Features:
 Tightly integrated with Microsoft Visual Studio and SQL Server
 Easier to maintain and package configuration
 Allows removing network as a bottleneck for insertion of data
 Data can be loaded in parallel and various locations
 It can handle data from different data sources in the same package
 SSIS consumes data which are difficult like FTP, HTTP, MSMQ, and Analysis services, etc.
 Data can be loaded in parallel to many varied destinations
12) TalendOpenStudio:
Open Studio is an open source data warehousing tool developed by Talend. It is designed to convert, combine
and update data in various locations. This tool provides an intuitive set of tools which make dealing with data
lot easier. It also allows big data integration, data quality, and master data management.
Features:
 It supports extensive data integration transformations and complex process workflows
 Offers seamless connectivity for more than 900 different databases, files, and applications
 It can manage the design, creation, testing, deployment, etc of integration processes

17 | P a g e
 Synchronize metadata across database platforms
 Managing and monitoring tools to deploy and supervise the jobs
13) The Ab Initiosoftware:
The Ab Initio is a data analysis, batch processing, and GUI based parallel processing data warehousing tool. It
is commonly used to extract, transform and load data.
Features:
 Meta data management
 Business and Process Metadata management
 Ability to run, debug Ab Initio jobs and trace execution logs
 Manage and run graphs and control the ETL processes
 Components can execute simultaneously on various branches of a graph
14) Dundas:
Dundas is an enterprise-ready Business Intelligence platform. It is used for building and viewing interactive
dashboards, reports, scorecards and more. It is possible to deploy Dundas BI as the central data portal for the
organization or integrate it into an existing website as a custom BI solution.
Features:

18 | P a g e
 Data warehousing tool for Business Users and IT Professionals
 Easy access through web browser
 Allows to use sample or Excel data
 Server application with full product functionality
 Integrate and access all kind of data sources
 Ad hoc reporting tools
 Customizable data visualizations
 Smart drag and drop tools
 Visualize data through maps
 Predictive and advanced data analytics
15) Sisense:
Sisense is a business intelligence tool which analyses and visualizes both big and disparate datasets, in real-
time. It is an ideal tool for preparing complex data for creating dashboards with a wide variety of
visualizations.
Features:
 Unify unrelated data into one centralized place
 Create a single version of truth with seamless data
 Allows to build interactive dashboards with no tech skills
 Query big data at very high speed
 Possible to access dashboards even in the mobile device
 Drag-and-drop user interface

19 | P a g e
 Eye-grabbing visualization
 Enables to deliver interactive terabyte-scale analytics
 Exports data to Excel, CSV, PDF Images and other formats
 Ad-hoc analysis of high-volume data
 Handles data at scale on a single commodity server
 Identifies critical metrics using filtering and calculations
16) TabLeau:
Tableau Server is an online Data warehousing with 3 versions Desktop, Server, and Online. It is secure,
shareable and mobile friendly data warehouse solution.
Features:
 Connect to any data source securely on-premise or in the cloud
 Ideal tool for flexible deployment
 Big data, live or in-memory
 Designed for mobile-first approach
 Securely Sharing and collaborating Data
 Centrally manage metadata and security rules
 Powerful management and monitoring
 Connect to any data anywhere
 Get maximum value from your data with this business analytics platform
 Share and collaborate in the cloud
 Tableau seamlessly integrates with existing security protocols


20 | P a g e
17) MicroStrategy:
MicroStrategy is an enterprise business intelligence application software. This platform supports interactive
dashboards, scorecards, highly formatted reports, ad hoc query and automated report distribution.
Features:
 Unmatched speed, performance, and scalability
 Maximize the value of investment made by enterprises
 Eliminating the need to rely on multiple tools
 Support for advanced analytics and big data
 Get insight into complex business processes for strengthening organizational security
 Powerful security and administration feature
18) Pentaho
Pentaho is a Data Warehousing and Business Analytics Platform. The tool has a simplified and interactive
approach which empowers business users to access, discover and merge all types and sizes of data.
Features:
 Enterprise platform to accelerate the data pipeline
 Community Dashboard Editor allows the fast and efficient development and deployment
 Big data integration without a need for coding
 Simplified embedded analytics
 Visualize data with custom dashboards

21 | P a g e
 Ease of use with the power to integrate all data
 Operational reporting for mongo dB
 Platform to accelerate the data pipeline
19) BigQuery:
Google's BigQuery is an enterprise-level data warehousing tool. It reduces the time for storing and querying
massive datasets by enabling super-fast SQL queries. It also controls access to both the project and also
offering the feature of view or query the data.
Features:
 Offers flexible Data Ingestion
 Read and write data in via Cloud Dataflow, Hadoop, and Spark.
 Automatic Data Transfer Service
 Full control over access to the data stored
 Easy to read and write data in BigQuery via Cloud Dataflow, Spark, and Hadoop
 BigQuery provides cost control mechanisms

22 | P a g e
20) Numetric:
Numetric is the fast and easy BI tool. It offers business intelligence solutions from data centralization and
cleaning, analyzing and publishing. It is powerful enough for anyone to use. This data warehousing tool helps
to measure and improve productivity.
Features:
 Data benchmarking
 Budgeting & forecasting
 Data chart visualizations
 Data analysis
 Data mapping & dictionary
 Key performance indicators
21) SolverBI360 Suite:
Solver BI360 is a most comprehensive business intelligence tool. It gives 360º insights into any data, using
reporting, data warehousing, and interactive dashboards. BI360 drives effective, data-based productivity.

23 | P a g e
Features:
 Excel-based reporting with predefined templates
 Currency conversion and inter-company transactions elimination can be automated
 User-friendly budgeting and forecasting feature
 It reduces the amount of time spent for the preparation of reports and planning
 Easy configuration with User-friendly interface
 Automated data loading
 Combine Financial and Operational Data
 Allows to view data in Data Explorer
 Easily add modules and dimensions
 Unlimited Trees on any dimension
 Support for Microsoft SQL Server/SQL Azure

24 | P a g e
III. RELATED WORK
A. Extracting useful data from data streams by applying a continuous query onto the arriving data The
different sources are assumed to produce data streams but not capable of storing the data by themselves.
Therefore there is a need to extract the data in real time and process it, so that the users can query the
processed data to make appropriate decisions. In case if the arriving data is not captured properly the data
would be lost permanently. As we know the data streams are characterized by volume and velocity which has
to be addressed to capture the data. Therefore the existing methodologies use continuous query techniques [4]
to extract only the data that are useful with respect to a particular context. The rest of the original data are
discarded.
B. Transforming the extracted data into the form required by the data warehouse Many of the existing
techniques address transforming of highly structured data into fact tables and dimension tables as required by
the traditional data warehouse techniques. But the data streams are characterized by heterogeneity as well.
Therefore there is a need to address the challenges that arise due to the heterogeneity of the data. Some of the
existing techniques attempt to address the heterogeneity of the data by creating a semantic model by
constructing ontologies and linked RDFs.
C. Integration of real time and traditional data warehouse architectures Many of the existing techniques
attempts to enhance the traditional data warehouse architecture to handle real time data by adding a real time
data storage component to the existing architecture. The data streams after being processed are loaded into the
real time data storage component from which the users can query for the fresh data and after some time
duration the data from the real time data storage area can be moved to the historical storage area for later use.
This approach can be observed in fig. 1 which shows the complete ETL framework.
D. Synchronization of data updates and querying the data warehouse Continuous loading and updating of
processed data into the data warehouse is an essential process when handling data streams. But this affects the
performance of fetching the query results from the data warehouse. This leads to the problem of query
contention. Therefore there is a need to synchronize the process of data updation and query operations. Many
of the existing techniques attempts to address this issue by introducing buffering techniques to capture the

25 | P a g e
processed real time data and updating the data warehouse once for all when the triggering conditions are
satisfied.
IV. AVAILABLE ETL TECHNIQUES TO HANDLE DATA STREAMS
Authors of [5] propose a methodology to handle data streams using real-time ETL. The proposed
methodology separates out real-time ETL from historical ETL and introduces a dynamic storage area to
address the problem of batch updates by synchronizing the data aggregation operation with real time data and
to improve the freshness of the query results. The architecture does not specifically address the problems of
handling unstructured data with huge volume and velocity.
Authors of [6], [7], [8] and [9] incorporate semantic technology into the ETL process to address the semantic
heterogeneity of the data. The methodology in [6] creates a semantic data model by mapping the data sets
onto the ontologies and loading the RDFs to the data warehouse. The methodology addresses the problem of
integrating heterogeneous source data into a standard format and thus addressing one of the v’s of big data
characteristics namely variety and provides a way to organize the data through the RDF links which actually
increases the accuracy of the results as well. This methodology does not consider volume and velocity of the
data, as it is difficult to construct ontologies manually.
The Authors of [9] propose a technique with semantic approach that makes use of linked RDFs to represent
the data sets. Semantic filtering is applied for integrating large volume of heterogeneous streams on the fly.
Semantic data streams are then filtered by applying continuous queries. The approach also makes use of
summarization techniques to discard some data when it crosses a limit that can be handled. Therefore the
semantics of the original data is being lost if it arrives with huge volume and high velocity.
Authors of paper [10] propose a technique to handle data streams which includes a data stream processor and
an operational data store (ODS). The proposed methodology applies continuous query on to the arriving data
streams so that the irrelevant data can be filtered out and the data stream reduces to manageable size. Even
after reducing the data stream, if it crosses a specified threshold level which can be stored on the memory,

26 | P a g e
data are divided into equal parts and small samples are collected using sampling techniques and rest of the
data are discarded. The proposed methodology tries to address velocity of the stream data up to some extent
but if volume of the data increases then the architecture fails to process the data without losing some of the
source data. The methodology also addresses the problem of synchronization of arriving data with data
processing up to some extent but at the cost of losing the semantics of the original data.
Authors of [11] address the problem of synchronizing data aggregation operation and querying the real time
data warehouse by proposing an algorithm called Integration Based Scheduling Approach (IBSA). IBSA can
be divided into two parts. The first part details about triggering the ETL process when the data sets are
arriving from different sources and the second part of the algorithm decides upon whether to invoke a thread
that performs update of the data warehouse or to invoke a thread that queries the real time Data Warehouse.
This technique addresses the problem of fair resource allocation for both updating and query operations. The
proposed methodology attempts to address the challenges related to real time data warehouse to handle the
data. The data streams can be characterized with volume, velocity and variety. The proposed technique fails
to explain how the scheduling policy scales in the situation when the input data streams for updating the data
warehouse is characterized by high volume and velocity and also when the query queue is full. It does not
consider the heterogeneity of the data during the integration process as well.
Authors of [12] propose architecture to support both real time data warehouse to handle data streams and
historical data warehouse to handle historical data. The methodology captures the continuous data by
implementing multilevel caching technique which actually differentiates the freshness of the arriving data.
The proposed methodology also introduces double mirror partitioning technique to synchronize the data
warehouse update process and querying operations. As the proposed methodology makes use of the caching
technique, the buffer should be of limited size. If the buffer size is increased, the efficiency reduces because
of the reduced hit rate. Therefore when the data stream arrives with huge volume and velocity, some amount

27 | P a g e
of data must be dropped as it contains limited buffer space. As a result of this, the semantics of the original
data is lost.
Authors of [13] and [14] provide detailed information about the evolution of ETL process starting from
traditional ETL to real time ETL to handle data streams. Different existing real time and near real time ETL
techniques are discussed. The data streams that are considered for real time ETL process can also be
characterized by volume and velocity and therefore there is a need to address these two characteristics in real
time data processing. But the different techniques consolidated in these two works do not address any of the
characteristics of data streams. The architecture provided for real time data processing makes use of a stream
analysis engine during extraction process which actually looks for some specified patterns from the arriving
data streams and rest of the data are discarded. As a result of this the semantics of the original data is lost.
The methodology proposed by [15], Provides a review on existing problems and available solutions of
processing historical data as well as data streams. It also provides some views in addressing the problem of
joining the historical data and real time data streams by using a buffering technique. While processing the
data streams, the transformation phase cannot cope up with processing the continuously arriving fast data
streams and therefore this paper discuss about a technique called ELT in which the transformation process on
data is being done after loading the data on to the data warehouse. But the paper considers highly structured
data and does not address the heterogeneity of the data. The methodology fails to address the problem of
scaling with the volume and velocity associated with the data streams. The architecture proposed in [16]
implements a real time ETL engine which is responsible for processing the data streams and loading it on to
the real time data warehouse. The real time ETL architecture consists of RBFs (Remote Buffer Framework)
which is responsible for receiving the data streams from different sources. These RBFs are connected to RIFs
(Remote Integrator Framework), whose function is to accumulate data from different RBFs and pass it on to
real time ETL. The architecture clearly specifies that different sources can be connected to a single RBF. The

28 | P a g e
data streams arriving from different sources can be characterized by volume and velocity. Therefore the RBFs
when receiving arriving data streams from different sources may run out of memory space if the arriving data
has huge volume and velocity in which case some amount of original data must be discarded, as a result of
which the semantics of the original data is lost before even processing the data.
The methodology in [17] and [18] attempts to enhance traditional data warehouse to support real time data
stream processing. The proposed methodology assumes that the same old traditional ETL architecture holds
good for processing data streams with very few modifications to it. As the data streams arrives in real time,
loading and refreshing time interval will be very less which reduces the data warehouse loading process
significantly. Therefore the proposed methodology incorporates a modification, to isolate dynamic data
warehouse component which supports data stream processing from traditional data warehouse. But, as data
streams are always characterized by volume, velocity and variety, the traditional ETL component cannot cope
up with these characteristics. The traditional ETL uses staging operation for processing the data which uses
small sized fixed memory which cannot store huge volume of arriving data streams. The proposed
architecture does not address the issues with velocity and variety of the data as well.
Authors of [19] propose a methodology to address real time data integration in the transformation phase of
ETL process. This approach implements a technique called Divide Join - Data Integration (DJ- DI) whose
behavior changes according to the size of arriving data. When a change in the operational data source tables
are identified they are partitioned in to manageable size and join operation is performed on each partition. The
approach considers big data as the source data and address only structured data but big data can be
characterized by huge volume of variety of data arriving with high velocity.
The methodology in [20] implements web services based real time data warehouse architecture. The proposed
methodology attempts to address real time data warehouse challenges by isolating real time data warehouse
component from traditional data warehouse. As the architecture is based on web services for transferring the
data from source to real time data warehouse component, it cannot support heterogeneous data types. As data
streams are characterized by volume, velocity and variety, the real time data warehouse architecture must

29 | P a g e
address the challenges that arise from these characteristics. The proposed architecture uses web services and
does not address the challenges that arise with respect to network related issues such as traffic, bandwidth and
so on because of volume and velocity of the arriving data streams.
The methodology in [21] attempts to address the challenges related to handling real time data warehouse by
considering three aspects. They are, real time data extraction methods, maintaining consistency during
integration and continuous loading of processed data. The real time data extraction method specified by the
Authors, uses log analysis which is not suitable for data streams as the data streams are characterized by
volume and velocity and because of which it becomes difficult to handle logs. The proposed methodology
details about integrating the data from different sources but does not take into account the heterogeneity of the
data during integration process. The methodology attempts to address the challenges that arise due to the
volume of data by proving filtering functionality to remove less important data with respect to the context and
as a result of this the semantics of the original data is lost.
The paper [22] attempts to address the feasibility of integrating the real time data and trade-offs between
quality and availability of the data in data warehouse for querying in real time. The proposed methodology
uses same old traditional ETL approach for handling the data streams which is not feasible as the traditional
ETL cannot cope up with the challenges posed by volume and velocity of the data streams. The methodology
does not address the challenges related to heterogeneity of the data streams as well and considers only the
structured data.

30 | P a g e
V. Literature survey of ETL techniques :
A lot of ETL tools are now capable of combining structured data with unstructured data in one mapping. In
addition they can handle very large amounts of data, that do not necessarily have to be stored in data
warehouses. Now Hadoop-connectors or similar interfaces to big data sources are provided by almost 40% of
the ETL tools nowadays. And the support for Big Data is growing continually.
The number one reason why you should implement a data warehouse is so that employees or end users can
access the data warehouse and use the data for reports, analysis and decision making. Using the data in a
warehouse can help you locate trends, focus on relationships and help you understand more about the
environment that your business operates in. Data warehouses also increase the consistency of the data and
allow it to be checked over and over to determine how relevant it is. Because most data warehouses are
integrated, you can pull data from many different areas of your business, for instance human resources,
finance, IT, accounting, etc.While there are plenty of reasons why you should have a data warehouse, it
should be noted that there are a few negatives of having a data warehouse including the fact that it is time
consuming to create and to keep operating. You might also have a problem with current systems being
incompatible with your data. It is also important to consider future equipment and software upgrades; these
may also need to be compatible with data.Finally security might be a huge concern, especially if your data is
accessible over an open network such as the internet. You do not want your data to be viewed by your
competitor or worse hacked and destroyed.
Limitations:There are avarietyof ETL toolsavailable inthe marketsufferingfromageneral problemof
interoperatabilityof the APIandthe proprietarymetadataformats.Thismakesthe functionalitiesof the ETLtools
difficulttocombine [32].A Meta model ARKTOScapable of modelingandexecutingpractical ETLscenariosand
capturingcommontasks of data cleaning,schedulingandtransformationof data[5].The commercial anddataquality
toolsare classifiedbythree perspectives.These three perspectivesare dataqualityproblems,genericfunctionalities
for extractingdatafromdata sourcesand foreverydata quality problemidentifiedadetail must be keptasto which
onesare addressesandwhichoneshave beenleft[33].The strengthsandweaknessesof the handcodedETL process

31 | P a g e
and Tool-basedETLprocesswere studiedbyZode andthe factorsbasedon whichthe choice can be made between
the two were discussed.Eachone has itspros andcons and the criteriafor selectingthe tool still remainsatopicof
research.Settingupof criteriaisnot an easytask andthisis generallybasedonthe classificationandcategorizationof
the operationsthatare specificallymarkedforthe organizations[34].Suchclassificationandcategorizationof the ETL
operationsinaccordance to some built-inoperationsof some commercial toolswasdone byVassiliadis.[35]Stressed
on the fact that the currentgeneration of ETL toolsprovideslittlesupportforsystematicallycapturingbusiness
requirementsandtranslatingtheseintooptimizeddesignsthatmeetthe correctnessandqualityrequirements.The
nextgenerationof BIsolutionswill impose evenmore challengingrequirements(nearreal-timeexecution,integration
of structuredandunstructureddata,and more flexibleflow of databetweenthe operational applicationsandanalytic
applications),resultinginevenmore complexityinintegrationflowdesign.Hence,it isimportanttocreate automated
or semi-automatedtechniquesthatwill helppractitionerstodeal withthiscomplexity.
VI. CONCLUSION
Real time ETL techniques that are capable of handling data streams should consider the characteristics of the
data streams such as volume, velocity and variety for efficient processing of the data. As the rate of data
generated is increasing day by day, the data that was considered as real time few years back is no more a real
time today with the expected response time decreasing. From this perspective, the existing ETL techniques
partially address the issues related to the characteristics of big data. Therefore there is a need to fill the gap
identified in this work and come up with a new s olution to address the issues that arise due to these
characteristics of the big data.

32 | P a g e
VII.REFERENCE
1] Gorawski, Marcin and Aleksander Chrószcz, "Query processing using negative and temporal tuples in
stream query engines", CEE-SET, Heidelberg, Vol. 7054, pp. 70-83, Springer, 2012.
[2] Gorawski, Marcin, and Aleksander Chrószcz, "Synchronization modeling in stream processing",
Advances in Databases and Information Systems, Heidalberg, Vol. 186, pp. 91-102, Springer, 2013.
[3] Gorawski M., “Advanced data warehouses”, Habilitation, Studia Informatica 30(3B), 386, 2009.
[4] Babu, Shivnath, and Jennifer Widom, "Continuous queries over data streams", ACM SIGMOD Record,
New York, Volume 30, Issue 3, pp. 109120, ACM, September 2001.
[5] Xiaofang Li and Yingchi Mao, “Real-Time Data ETL Framework for Big Real-Time Data Analysis”,
ICIA, Lijiang, pp. 1289-1294, IEEE International Conference, 2015.
[6] Srividya K. Bansal and Sebastian Kagemann, Integrating Big Data: A Semantic Extract-Transform-Load
Framework”, in Computer, vol. 48, no. 3, pp. 42-50, IEEE, Mar. 2015.
[7] J. Villanueva Chávez and X. Li, "Ontology based ETL process for creation of ontological data
warehouse", CCE, 8th International Conference, Merida City, pp. 1-6, IEEE, 2011.
[8] L. Jiang, H. Cai and B. Xu, "A Domain Ontology Approach in the ETL Process of Data Warehousing",
ICEBE, 7th International Conference, Shanghai, pp. 30-35, IEEE, 2010.
[9] Marie-Aude Aufaure, Raja Chiky, Olivier Cure, Houda Khrouf, Gabriel Kepeklian, “From Business
Intelligence to Semantic data stream management”, Future Generation Computer Systems, Vol. 63, pp.
100107, Elsevier, October 2016.
[10] F. Majeed, Muhammad Sohaib Mahmood and M. Iqbal, "Efficient data streams processing in the real
time data warehouse", ICCSIT, 3rd IEEE International Conference, Chengdu, pp. 57-60, IEEE, 2010.
[11] J. Song, Y. Bao and J. Shi, "A Triggering and Scheduling Approach for ETL in a Real-time Data
Warehouse”, CIT, 10th International Conference, Bradford, pp. 91-98, IEEE, 2010.

33 | P a g e
[12] Shao YiChuan, Xingjia Yao, “Research of Real-time Data warehouse Storage Strategy Based on Multi-
level Caches”, ICSSDMS, Macao, Vol. 25, pp. 2315-2321, ELSEVIER, April 2012.
[13] Kakish Kamal, and Theresa A. Kraft. "ETL evolution for real-time data warehousing", Proceedings of
the Conference on Information Systems Applied Research, Vol. 2167, p. 1508, 2012.
[14] Revathy S., Saravana Balaji B. and N. K. Karthikeyan, “From Data Warehouse to Streaming Warehouse:
A Survey on the Challenges for RealTime Data Warehousing and Available Solutions”, International Journal
of Computer Applications, Vol. 81-no2, 2013.
[15] A. Wibowo, "Problems and available solutions on the stage of Extract, Transform, and Loading in near
real-time data warehousing", ISITIA, Surabaya, pp. 345-350, IEEE, 2015.
[16] Marcin Gorawski and Anna Gorawska, “Research on the Stream ETL Process”, BDAS, 10th
International Conference, Poland, Vol. 424, pp. 61-71, Springer, 2014.
[17] Alfredo Cuzzocrea, Nickerson Ferreira and Pedro Furtado, “Enhancing Traditional Data Warehousing
Architectures with Real-Time Capabilities”, ISMIS, 21st International Symposium, Denmark, Vol. 8502, pp.
456- 465, Springer, 2014.
[18] Alfredo Cuzzocrea, Nickerson Ferreira and Pedro Furtado, “Real-Time Data Warehousing: A
Rewrite/Merge Approach”, LNCS, Germany, Vol. 8646, pp. 78-88, Springer, 2014.
[19] Imane Lebdaoui, Ghizlane Orhanou and Said Elhajji, “An Integration Adaptation for Real- Time Data
Warehousing”, IJSEIA, Vol. 8, pp. 115-128, 2014.
[20] M. ObalÕ, B. Dursun, Z. Erdem and A. K. Görür, "A real time data warehouse approach for data
processing", SIU, Haspolat, pp. 1-4, IEEE, 2013.
[21] Rui Jia, Shicheng Xu and Chengbao Peng, “Research on Real Time Data Warehouse Architecture”,
ICICA, Singapore, Vol. 392, pp. 333-342, Springer, August 2013.
[22] LEBDAOUI Imane, Ghizlane ORHANOU, and Said EL HAJJI, "Data Integrity in Real-time
Datawarehousing", Proceedings of the World Congress on Engineering, London, Vol. 3, 2013.

Etl techniques

More Related Content

What's hot (20)

Similar to Etl techniques (20)

Recently uploaded (20)

Etl techniques