MAN, MACHINE & MATHEMATICS
How In-Memory & Open Source Technologies
Help Solve Big Data Problems
Arup Ray
3rd International Conference on Business Analytics & Intelligence, IIM Bangalore, 17th to 19th December 2015
Abstract
The emergence of new technologies that help us capture, manage, and interpret the deluge of
data coming from multiple sources (Big Data) gives us the opportunity to run the business using
signals that are fast, real time, and hence more relevant. These signals can measure
performance, provide critical indicators about the business, identify customer issues and
complaints, help market more effectively and accurately, and support real-time decision
making. This paper explores emerging trends in big data technology such as in-memory
databases, integration of proprietary technologies with open source technologies like Hadoop
and R, and application of big data technology in the area of the Internet of Things (IoT).
The paper also introduces the concept of Analytics Maturity Quadrant (AMQ) to help businesses
evaluate and develop their analytics strategy.
INTRODUCTION
Solving business problems and generating disruptive business insights with petabytes of
data sounds great on paper, but can be an extremely challenging task in real life. For
example, an $18 billion-a-year CPG conglomerate with a global footprint must quickly
respond to the fluctuating costs of 4,000 raw materials that go into more than 20,000
products. What’s more, if it can make promotions for these products more timely by
using faster analysis, the company and its retailer customers can command higher prices
in a business known for razor-thin profit margins. The challenge is not only storing
petabytes of data (big data), but how fast we can run mathematical models on these
huge data sets to generate intelligent insights in real time.
This paper explores the following aspects, taking SAP HANA as a reference:
- the emerging trend in in-memory computing and its impact on analytics
- how the marriage between proprietary in-memory technology and open
source technology is helping mathematicians solve real-life problems
- how this technology evolution has made analytics the ‘brain’ behind the
Internet of Things (IoT) revolution
1. IN-MEMORY COMPUTING AND ANALYTICS
1.1. Arrival of In-Memory Analytics
As the cost of RAM declines, in-memory analytics is no longer a pipe dream for many
businesses. 64-bit operating systems with 2 terabytes (TB) of addressable
memory have made it possible to cache large volumes of data, potentially an
entire data warehouse or data mart, in a computer’s RAM. In addition to
incredibly fast query response times, in-memory analytics can reduce or
eliminate the need for data indexing and for storing pre-aggregated data in OLAP
(On-Line Analytical Processing) cubes or aggregate tables.
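The cube-elimination point can be illustrated with a toy sketch (the data and function name below are hypothetical): when the full fact table fits in RAM, a group-by aggregate is computed on demand rather than maintained as a pre-built OLAP cube or aggregate table.

```python
from collections import defaultdict

# Toy in-memory "fact table": one dict per sales record (illustrative data).
sales = [
    {"region": "EMEA", "product": "A", "revenue": 120.0},
    {"region": "EMEA", "product": "B", "revenue": 80.0},
    {"region": "APJ",  "product": "A", "revenue": 200.0},
]

def aggregate(facts, group_by, measure):
    """Compute a group-by aggregate on the fly -- no pre-built cube needed."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[group_by]] += row[measure]
    return dict(totals)

print(aggregate(sales, "region", "revenue"))   # {'EMEA': 200.0, 'APJ': 200.0}
```

At RAM speeds, recomputing such aggregates per query is cheap enough that maintaining separate aggregate tables can become unnecessary.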
1.2. Advent of column storage
Another evolution is the use of columnar data storage for analytics applications.
Unlike traditional data storage, where records are indexed and
stored in rows with each record containing all the fields, products like Sybase IQ
leverage columnar data storage for analytics, which permits faster data access
for OLAP systems. For example, in column storage, data is only partially blocked
during access, and individual columns can be processed at the same time by
different cores.
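A minimal sketch of the difference (illustrative data only): in a columnar layout, an OLAP-style scan touches a single contiguous vector, while a row layout must walk every full record to reach one field.

```python
# Row layout: each record keeps all of its fields together.
rows = [(1, "A", 10.0), (2, "B", 20.0), (3, "A", 30.0)]

# Column layout: each field is stored as its own contiguous vector, so
# different columns can be handed to different cores independently.
columns = {
    "id":     [1, 2, 3],
    "sku":    ["A", "B", "A"],
    "amount": [10.0, 20.0, 30.0],
}

# An OLAP-style scan (sum of one measure) reads only one vector in the
# column layout, while the row layout walks every full record.
row_total = sum(r[2] for r in rows)      # touches all fields of all rows
col_total = sum(columns["amount"])       # touches a single column

assert row_total == col_total == 60.0
```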
However, row storage continues to be the preferred option for an OLTP (On-Line
Transaction Processing) system, where the transaction system may require access
to all the fields of a record every time a user accesses the record (e.g., creation
of a sales order). Hence OLTP and OLAP continue to sit in two different boxes,
and data needs to move from OLTP to OLAP (to run analytics models, reports, and
dashboards) and from OLAP to OLTP (for analytics to trigger action in the transaction
system). A marriage of row-based and column-based technologies can eliminate
the need to maintain two different systems.
1.3. In-memory Database
An in-memory database means all the data is stored in memory (RAM), so no
time is wasted loading data from hard disk to RAM. Everything is in-memory,
which gives the CPUs quick access to data for processing. The speed advantages
offered by this RAM storage system are further accelerated by the use of multi-
core CPUs, multiple CPUs per board, and multiple boards per server appliance.
An in-memory database like SAP HANA combines the power of hardware and
software to process massive volumes of real-time data using in-memory
computing, e.g.,
 It combines row-based and column-based database technology.
 Data now resides in main memory (RAM) and no longer on a hard disk.
It is best suited for performing real-time analytics and for developing and
deploying real-time applications.
Fig 1. In Memory Computing: Combining power of Hardware & Software
As Forrester Research has pointed out, the outcome of this evolution in
technology is a distributed in-memory data platform like SAP HANA that
enterprises can use to support real-time analytics, predictive and text analytics,
and extreme transaction volumes. The next-generation data platform demands
looking at these new technologies to deliver the speed, agility, and new
insights critical to helping a business grow. For decades, organizations have
built transactional, operational, and analytical layers to support various
applications, operational reporting, and analytics. However, with the growing
need to support real-time data sharing driven by the mobile enterprise, separate
transactional, operational, and analytical layers have become an obstacle to
such initiatives. A distributed in-memory data platform offers a new approach:
collapsing the technology stack can eliminate redundant hardware, software,
and middleware components, saving money and reducing complexity through
automation and integrated systems that help developers and DBAs become more
productive.
1.4. OLTP & OLAP in a box
An appliance like SAP HANA blends column and row storage in the same
database, eliminating the need for data movement between two different boxes
and allowing OLTP and OLAP in a single box. Hence, the moment a sales order is
created in the ERP system, the data is accessible to the analytical application, and
business users can access operational reports in real time, helping them take
corrective action in real time instead of doing a post mortem. In section 4, the
paper describes how HANA PAL (Predictive Analytics Library) can access
transaction data in real time and deliver a predictive-maintenance decision just
in time to avoid costly breakdowns and prevent millions of dollars of loss due to
unplanned maintenance.
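The single-box idea can be sketched as follows; the `SingleStore` class below is purely illustrative (not a HANA API). A sales order written by the transactional path is visible to the analytical query the moment it exists, with no ETL hop in between.

```python
class SingleStore:
    """Toy stand-in for a combined OLTP/OLAP store (illustrative only)."""

    def __init__(self):
        self.orders = []                    # one shared in-memory table

    def create_sales_order(self, customer, amount):
        """OLTP-style write: the transactional path inserts a record."""
        self.orders.append({"customer": customer, "amount": amount})

    def revenue_report(self):
        """OLAP-style read over the same table -- no ETL, no second box."""
        return sum(o["amount"] for o in self.orders)

db = SingleStore()
db.create_sales_order("ACME", 500.0)
print(db.revenue_report())   # 500.0 -- visible the moment the order exists
```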
2. IN-MEMORY COMPUTING AND BIG DATA ANALYTICS
With the explosion of data, even with current high-capacity RAM and multi-core
processors, the cost of hardware can be prohibitive if a business needs to store the
huge volume of data generated every second (e.g., clickstreams from a website, or
real-time data generated by hundreds of sensors attached to a Formula One car). This
challenge of large data volumes can be addressed by low-cost open source
technologies like Hadoop. However, the advantage of in-memory computing
will remain unutilized unless in-memory analytics can be integrated with
data lakes seamlessly. Hence the built-in integration of Hadoop or Spark with
SAP HANA (or similar in-memory databases) can support an architecture
where data is stored based on its ‘data temperature’, or frequency of access.
Fig 2. A Big Data Integrated Architecture
2.1. In-Memory Database and Hadoop Integration
Taking the SAP Big Data architecture as an example, a typical big data integrated
architecture stores hot data (more frequently accessed data) in SAP HANA (the in-
memory database), warm data in Sybase IQ, and cold data (less frequently
accessed data) in Hadoop. This integrated architecture allows need-based data
access while keeping the infrastructure cost manageable.
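Data-temperature routing can be sketched as a simple rule; the thresholds and tier labels below are illustrative assumptions, not SAP's recommendations.

```python
def tier_for(access_count_per_day):
    """Route data by 'temperature' (thresholds are illustrative only)."""
    if access_count_per_day >= 100:
        return "hot (in-memory database)"
    if access_count_per_day >= 10:
        return "warm (disk-based columnar store)"
    return "cold (Hadoop data lake)"

# A clickstream table read constantly lands in memory; last year's raw
# sensor logs, touched once a month, land in the data lake.
print(tier_for(500))   # hot (in-memory database)
print(tier_for(1))     # cold (Hadoop data lake)
```

In a real deployment the routing would be driven by observed access statistics and storage cost per tier rather than fixed constants.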
SAP recently launched SAP HANA Vora, an in-memory query engine. It runs on
Apache Spark to analyze Big Data stored in Hadoop. The goal is to deliver a single
integrated platform that embraces the Hadoop ecosystem for
◾ All data: OLTP, OLAP & Big Data
◾ All operations: setup, admin, monitoring, operations
◾ One interface for applications
Key features of this integration are
◾ Building OLAP-style capabilities on Hadoop/HDFS and extending SAP HANA
& Hadoop integrations to provide optimized data processing and
movement between the two platforms
◾ Enabling massive scale-out scenarios for HANA
SAP HANA Vora also reveals another trend that software vendors are adopting:
it permits users of open source technologies to continue in the open source
environment to build analytical applications while leveraging the power of in-
memory computing. For example, SAP HANA Vora offers
 Extensive programming support for Scala, Python, C, C++, R, and
Java, allowing data scientists to use their tool of choice
 The ability for data scientists and developers who prefer SparkR or Spark ML
to mash up corporate data with Hadoop/Spark data easily
 Leverage of SAP HANA’s multiple data processing engines and in-
memory computing for developing new insights from business and
contextual data
2.2. In-Memory Database and R Integration
Another aspect of integrating an in-memory database with open source is
integration with R, which gives access to a practically infinite number of readily
available algorithms.
Extending the example of the SAP Big Data Architecture, SAP Predictive Analytics
offers advanced SAP HANA integration and provides in-database and in-memory
computing through dedicated SAP HANA native libraries and R:
• SAP PAL for HANA: the Predictive Analytics Library, with HANA-native and
optimized implementations of industry-standard predictive algorithms
• SAP APL for HANA: the Automated Predictive Library, with SAP proprietary
algorithms that automate many tasks for simplified, quicker, and higher
quality model definition
• The R server, which enables using any R open source algorithm in the HANA engine
The goal of the integration of the SAP HANA database with R is to enable the
embedding of R code in the SAP HANA database context. That is, the SAP HANA
database allows R code to be processed in-line as part of the overall query
execution plan. This scenario is suitable when an SAP HANA-based modeling and
consumption application wants to use the R environment for specific statistical
functions.
An efficient data exchange mechanism supports the transfer of intermediate
database tables directly into the vector-oriented data structures of R. This offers
a performance advantage compared to standard SQL interfaces, which are tuple
based and therefore require an additional data copy on the R side.
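The advantage can be sketched with toy data: a tuple-based interface materializes every row and then rebuilds the vectors on the R side (an extra full copy), while a vector-based interface hands the column vectors over directly.

```python
# Columnar intermediate result, roughly as a column-store engine holds it.
intermediate = {"temp": [21.5, 22.0, 23.1], "pressure": [1.0, 1.1, 0.9]}

# Tuple-based interface: materialize every row, then rebuild vectors on
# the receiving side -- one extra pass over all the data.
tuples = list(zip(intermediate["temp"], intermediate["pressure"]))
rebuilt = {
    "temp":     [t for t, _ in tuples],
    "pressure": [p for _, p in tuples],
}

# Vector-based interface: hand the column vectors over as-is, which is
# what the direct transfer into R's vector-oriented data frames avoids.
direct = intermediate

assert rebuilt == direct   # same content, but 'direct' skipped the copy
```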
Fig 3. SAP HANA – R Integration Architecture
To process R code in the context of the SAP HANA database, the R code is
embedded in SAP HANA SQL code in the form of an RLANG procedure. The SAP
HANA database uses the external R environment to execute this R code, similar
to native database operations like joins or aggregations. This allows the
application developer to elegantly embed R function definitions and calls within
SQLScript and submit the entire code as part of a query to the database.
Fig 3. shows three main components of the integrated solution: the SAP HANA-
based application, the SAP HANA database, and the R environment. When the
calculation model plan execution reaches an R-operator, the calculation engine’s
R-client issues a request through the Rserve mechanism to create a dedicated R
process on the R host. Then, the R-Client efficiently transfers the R function
code and its input tables to this R process, and triggers R execution. Once the R
process completes the function execution, the resulting R data frame is returned
to the calculation engine, which converts it back into an internal table. Since the
internal column-oriented data structure used within the SAP HANA database for
intermediate results is very similar to the vector-oriented R data frame, this
conversion is very efficient.
A key benefit of having the overall control flow situated on the database side is
that the database execution plans are inherently parallel and, therefore,
multiple R processes can be triggered to run in parallel without having to worry
about parallel execution within a single R process.
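The parallelism described above can be sketched with Python's `multiprocessing` standing in for multiple R processes; the partitioning scheme and `stat_function` are hypothetical stand-ins for whatever the execution plan and the embedded R function actually do.

```python
from multiprocessing import Pool

def stat_function(partition):
    """Stand-in for an R function applied to one data partition."""
    return sum(partition) / len(partition)   # e.g., a per-partition mean

if __name__ == "__main__":
    # The "execution plan" splits the table into partitions and fans them
    # out to independent worker processes, mirroring how the database can
    # trigger multiple parallel R processes without intra-process threading.
    partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
    with Pool(processes=3) as pool:
        results = pool.map(stat_function, partitions)
    print(results)   # [2.0, 4.5, 7.5]
```

The key point mirrors the text: parallelism lives in the plan (process-level fan-out), so each worker runs a plain single-threaded function.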
While the leading vendors of in-memory computing and analytics products are
integrating their products with open source technologies, the product upgrade
cycle for software vendors is relatively slow compared to open source
technologies. This may lead to occasional integration challenges due to version
incompatibility.
3. BIG DATA ANALYTICS AND IOT
All these developments have opened up new opportunities for the practical
application of analytics. While sensors, clicks, POS devices, etc. can generate
large volumes of valuable data, making sense of the data requires predictive
modelling and processing huge volumes of data within a reasonable time. Taking
the example of tracking a Formula One car, predictive models can identify the
potential failure of a component well in advance by processing zillions of data
points from sensors in real time, and hence can prevent a major accident, saving
lives and millions of dollars of investment.
Building an IoT solution involves three main steps, or phases:
Step 1. Data integration: This is the first and primary step. It brings a variety of
data into a coherent, complete set – from the edge to the core – to offer the
deepest and broadest insights possible.
Step 2. Data management: This step, which brings together IT and
infrastructures, requires special attention. Data management must address the
challenge of managing large volumes of data, as well as layering on contextual
information such as asset taxonomy and time and location data.
Step 3. Making sense of Data: Once foundational data integration and data
management are put in place, many types of business innovation are possible.
Enterprises can create meaningful insights and reimagine their business models
and customer experiences by leveraging predictive /prescriptive analytics.
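The three steps above can be sketched as a toy pipeline; all names, data, and the predictive rule are hypothetical illustrations of the shape of each phase, not a real IoT stack.

```python
def integrate(edge_readings, core_records):
    """Step 1: bring edge and core data into one coherent set."""
    return [dict(r, **core_records.get(r["asset"], {})) for r in edge_readings]

def contextualize(records, taxonomy):
    """Step 2: layer on contextual information such as asset taxonomy."""
    return [dict(r, asset_class=taxonomy[r["asset"]]) for r in records]

def predict(records, limit):
    """Step 3: a stand-in predictive rule flagging likely failures."""
    return [r["asset"] for r in records if r["temp"] > limit]

readings = [{"asset": "pump-1", "temp": 95}, {"asset": "pump-2", "temp": 60}]
core     = {"pump-1": {"site": "plant-A"}, "pump-2": {"site": "plant-B"}}
taxonomy = {"pump-1": "rotating", "pump-2": "rotating"}

flagged = predict(contextualize(integrate(readings, core), taxonomy), limit=90)
print(flagged)   # ['pump-1']
```

The ordering matters in the same way the text argues: the predictive step is only as good as the integrated, contextualized data beneath it.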
The next section describes a typical architecture and data flow for an IOT
scenario.
4. CASE STUDY: PREDICTIVE MAINTENANCE AND SERVICE FOR A EUROPEAN
MANUFACTURER OF AIR SYSTEMS
The customer, a leading manufacturer of air compressors, wanted to provide
differentiated value by supplying compressed gas as a business focus in addition
to selling compressors. For this reason, downtime and breakdowns become a
critical factor, as they would result in substantial loss for the company. Predictive
maintenance and service would help to understand the availability of the
machinery, avoid lost revenue, and lower maintenance costs.
4.1. The Process Innovation:
 Move from Preventative Maintenance to Predictive Maintenance in order to
improve product reliability, service revenue, and customer satisfaction.
 Application of monitoring and predictive analysis, coupling and analyzing
disparate historical data with actual equipment data to more accurately
predict future equipment failures.
 Combination of customer data and service level agreements/contracts to alert
and support the service team in preventing failures in an optimized way
through an analytics solution.
Predictive Maintenance provides the ability to plan demand for aftermarket
service and sales based on visibility into the customer base and the support
needed.
4.2. Architecture & Requirements
 Big Data volume from sensor data (temperature, pressure, machine
conditions) in combination with product data including failure codes, machine
master data and business data (OEM, dealer). Integration with Hadoop to
build a data lake is under consideration.
 Flexible predictive tools and algorithms combining technical and business
data
 HANA/IQ central store fed by Data Services and ESP using SAP BI Tools/Portal
for visualization
 Machine Data Insight combining stream & signal intelligence
Fig 4. The IOT Architecture for the compressor manufacturing company
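A toy stand-in for the predictive-maintenance idea (the window, threshold, and readings below are illustrative assumptions, not the customer's actual models): flag an asset when the rolling mean of a sensor reading drifts above a limit, well before an outright breakdown.

```python
def failure_risk(pressure_readings, window=3, threshold=8.0):
    """Flag an asset when its recent rolling-mean pressure drifts above a
    limit -- a toy stand-in for the predictive models described above."""
    recent = pressure_readings[-window:]
    return sum(recent) / len(recent) > threshold

healthy  = [6.9, 7.0, 7.1, 7.0, 6.8]    # stable around its normal level
drifting = [7.0, 7.4, 8.1, 8.6, 9.2]    # trending upward toward failure

print(failure_risk(healthy))    # False
print(failure_risk(drifting))   # True -- schedule service before breakdown
```

A production model would of course combine many signals (temperature, vibration, failure codes, machine master data) rather than a single threshold, but the flow is the same: score streaming sensor data, then trigger the service process on a predicted failure.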
4.3. Value Drivers/Benefits
 Customers can generate additional revenue by extending their service with
predictions of when a machine might break down, how long a production line
can run with an existing failure, and which spare parts to provide to shorten
a maintenance shutdown.
 Reduced service, warranty and maintenance costs
 Higher service profitability and customer satisfaction.
 Better alignment with spare parts planning and availability
 Improved service intervals / lower service costs for customer
5. ANALYTICS MATURITY QUADRANT
As part of the study, the concept of Analytics Maturity Quadrant (AMQ) was
introduced in the light of above evolution.
The AMQ maps a business with respect to two parameters:
- Maturity in application of analytics as a business tool
- Maturity in usage of data & technology
Fig 5 A. Analytics Maturity Quadrant
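The mapping can be sketched as a simple classification over the two maturity scores; the 0–1 scale, midpoint, and quadrant labels below are illustrative assumptions, since the paper defines only the two axes.

```python
def amq_quadrant(analytics_maturity, technology_maturity, midpoint=0.5):
    """Place a business on the AMQ from its two maturity scores
    (scale, midpoint, and labels are illustrative, not the paper's)."""
    high_analytics = analytics_maturity >= midpoint
    high_technology = technology_maturity >= midpoint
    if high_analytics and high_technology:
        return "leader (top right)"
    if high_technology:
        return "technology-ready, analytics-lagging"
    if high_analytics:
        return "analytics-savvy, technology-constrained"
    return "starter"

# A retail bank with a mature CoE vs. a CPG firm that bought the
# technology but still runs mostly traditional BI reporting:
print(amq_quadrant(0.8, 0.9))   # leader (top right)
print(amq_quadrant(0.2, 0.7))   # technology-ready, analytics-lagging
```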
As part of the ongoing study, companies from different industry sectors are being
analyzed in terms of their technology footprint and usage of analytics. The
general trend indicates that sectors like Retail, Telecom, and Financial Services are
way ahead of their counterparts from other sectors, and the majority of them have
invested in technology and in-house centers of excellence for advanced analytics.
In contrast, sectors like Manufacturing, CPG, Energy, etc. continue to depend
extensively on traditional BI reporting, with occasional small pockets of
predictive/prescriptive analytics usage (e.g., forecasting, optimization of
distribution). The latter segment has started realizing the potential of analytics
as a strategic tool for competitive advantage. An outcome of this
realization is investment in technology, although usage of advanced analytics is
still lagging.
Diagram 5 B is an indicative mapping of some of the key industry sectors.
The objective of this mapping is to assist interested companies in identifying their
relative position in the quadrant and defining a strategy to move towards the top
right corner.
Fig 5 B. Analytics Maturity Quadrant : Industry Perspective
References
1. Evelson, Boris, “I Forget: What’s In-Memory?”, blogs.forrester.com, March 2010
2. “Connect, Transform, and Reimagine Business in a Hyperconnected Future”, SAP thought leadership paper, 2014
3. Abadi, D. J., Madden, S. R., Hachem, N., “Column-stores vs. Row-stores: How Different Are They Really?”, SIGMOD 2008, pp. 967–980
4. Henschen, Doug, “In-Memory Databases”, InformationWeek, March 2014, pp. 9–16
5. McKinsey Global Institute, “Big Data: The Next Frontier for Innovation, Competition, and Productivity”, May 2011
6. Plattner, Hasso, Zeier, Alexander, “In-Memory Data Management: An Inflection Point for Enterprise Applications”, Springer, Germany, 2011
About Author
Arup Ray, a senior executive with 20+ years of industry experience, manages a global vertical
in Analytics, Big Data, EIM & HANA in SAP SDC. In his role, Arup is also a member of the SAP
services global leadership team for Analytics, Big Data & HANA. As a consultant and business
head, Arup has had the opportunity to assist customers across five continents in the areas of
Analytics, Big Data, and Supply Chain Management in multiple industries, e.g., CPG, Retail,
Telecom, and Manufacturing. He has also incubated the Big Data & HANA CoE in SDC global and
helps customers leverage Analytics & Big Data technology to achieve their strategic goals.
An alumnus of IIT Delhi & ISB Hyderabad, Arup is currently pursuing the Business Analytics &
Intelligence program at IIM Bangalore.
More Related Content

PDF
MapR Data Hub White Paper V2 2014
PPTX
SAP HANA Integrated with Microstrategy
PDF
CIO Guide to Using SAP HANA Platform For Big Data
PDF
SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)
PPTX
Implementing bi in proof of concept techniques
PDF
Big data/Hadoop/HANA Basics
PPTX
Big data and apache hadoop adoption
PDF
Massive sacalabilitty with InterSystems IRIS Data Platform
MapR Data Hub White Paper V2 2014
SAP HANA Integrated with Microstrategy
CIO Guide to Using SAP HANA Platform For Big Data
SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)
Implementing bi in proof of concept techniques
Big data/Hadoop/HANA Basics
Big data and apache hadoop adoption
Massive sacalabilitty with InterSystems IRIS Data Platform

What's hot (20)

PDF
SAP BW vs Teradat; A White Paper
PDF
Data warehousing
PPT
OLAP Cubes in Datawarehousing
PDF
Redefining Data Analytics Through Search
PDF
SAP Lambda Architecture Point of View
PPTX
Data ware house design
PDF
A treatise on SAP logistics information reporting
PPT
CS8091_BDA_Unit_I_Analytical_Architecture
PPT
Gulabs Ppt On Data Warehousing And Mining
PDF
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
PPT
Date warehousing concepts
PPTX
Data ware house architecture
PDF
Teradata - Presentation at Hortonworks Booth - Strata 2014
PDF
[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja
PDF
Traditional data word
PDF
Optimising Data Lakes for Financial Services
PPTX
DATA WAREHOUSING
PPTX
Comparison with Traditional databases
PPTX
Data Warehousing - in the real world
SAP BW vs Teradat; A White Paper
Data warehousing
OLAP Cubes in Datawarehousing
Redefining Data Analytics Through Search
SAP Lambda Architecture Point of View
Data ware house design
A treatise on SAP logistics information reporting
CS8091_BDA_Unit_I_Analytical_Architecture
Gulabs Ppt On Data Warehousing And Mining
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
Date warehousing concepts
Data ware house architecture
Teradata - Presentation at Hortonworks Booth - Strata 2014
[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja
Traditional data word
Optimising Data Lakes for Financial Services
DATA WAREHOUSING
Comparison with Traditional databases
Data Warehousing - in the real world
Ad

Viewers also liked (13)

PDF
Horizon 1
PDF
Indice c3 m2
PPTX
Simple smo
PPT
Redes sociales
ODP
Presentacion HOUSE.
PDF
PDF
WANLIMA_
PDF
La religión como enfermedad mental
DOCX
Selfie: el olvido del ser-para-otro
DOCX
El segundo sexo: alcances, logros y fracasos sobre la condición de la mujer e...
PPTX
Elaboración del estudio de impacto ambiental
PPTX
Impacto ambiental actualizado
PDF
Legitimación de la violencia como principio de equilibrio social
Horizon 1
Indice c3 m2
Simple smo
Redes sociales
Presentacion HOUSE.
WANLIMA_
La religión como enfermedad mental
Selfie: el olvido del ser-para-otro
El segundo sexo: alcances, logros y fracasos sobre la condición de la mujer e...
Elaboración del estudio de impacto ambiental
Impacto ambiental actualizado
Legitimación de la violencia como principio de equilibrio social
Ad

Similar to ManMachine&Mathematics_Arup_Ray_Ext (20)

PDF
Comparison among rdbms, hadoop and spark
PDF
Lecture about SAP HANA and Enterprise Comupting at University of Halle
PDF
Empowering SAP HANA Customers and Use Cases
PDF
Big Data, Big Thinking: Simplified Architecture Webinar Fact Sheet
PPTX
PDF
5507832a c074-4013-9d49-6e58befa9c3e-161121113026
PDF
What Is SAP HANA And Its Benefits?
PDF
Unstructured Datasets Analysis: Thesaurus Model
PPT
Hana Training Day 1
PDF
Top 10 Big Data Tools that you should know about.pdf
PDF
Real time data processing frameworks
PDF
IJSRED-V2I3P43
PDF
HANA Demystified by DataMagnum
PDF
Enabling SQL Access to Data Lakes
PDF
Big Data Tools: A Deep Dive into Essential Tools
PDF
SAP HORTONWORKS
PPT
Sap Interview Questions - Part 1
DOCX
PDF
Memory Management in BigData: A Perpective View
PPTX
Analysis of Major Trends in Big Data Analytics
Comparison among rdbms, hadoop and spark
Lecture about SAP HANA and Enterprise Comupting at University of Halle
Empowering SAP HANA Customers and Use Cases
Big Data, Big Thinking: Simplified Architecture Webinar Fact Sheet
5507832a c074-4013-9d49-6e58befa9c3e-161121113026
What Is SAP HANA And Its Benefits?
Unstructured Datasets Analysis: Thesaurus Model
Hana Training Day 1
Top 10 Big Data Tools that you should know about.pdf
Real time data processing frameworks
IJSRED-V2I3P43
HANA Demystified by DataMagnum
Enabling SQL Access to Data Lakes
Big Data Tools: A Deep Dive into Essential Tools
SAP HORTONWORKS
Sap Interview Questions - Part 1
Memory Management in BigData: A Perpective View
Analysis of Major Trends in Big Data Analytics

ManMachine&Mathematics_Arup_Ray_Ext

  • 1. MAN, MACHINE & MATHEMATICS How In-Memory & Open Source Technologies helping solve Big Data problems Arup Ray 3rd International Conference on Business Analytics & Intelligence, IIM Bangalore, 17th to 19th December 2015
  • 2. Man, Machine & Mathematics | How In Memory & Open Source Technologies helping solve Big Data problems Abstract Emergence of new technologies helping us capture, manage and interpret the deluge of data coming from multiple sources (Big Data) giving us the opportunity to run the business using signals that are fast, real time and hence more relevant. These signals can measure performance, provide critical indicators about the business, identify customer issues and complaints, help market more effectively and accurately and make decision real time. This paper explores these emerging trends in big data technology like in-memory database, integration of proprietary technologies with open source technologies like Hadoop, R and application of big data technology in the area of Internet of Things (IOT). The paper also introduces the concept of Analytics Maturity Quadrant (AMQ) to help businesses evaluate and develop their analytics strategy. INTRODUCTION Solving business problems and generating disruptive business insights with petabytes of data sounds great on paper, but can be extremely challenging task in real life. For example, an $18 billion-a-year CPG conglomerate with global footprint, must quickly respond to the fluctuating costs of 4,000 raw materials that go into more than 20,000 products. What’s more, if they can make promotions for these products more timely by using faster analysis, the company and its retailer customers can command higher prices in a business known for razor-thin profit margins. Challenge is not only about storing petabytes of data (big data), but how fast can we run mathematical models on these huge data to generate intelligent insights in real time. 
This paper explores following aspects taking SAP HANA as a reference - the emerging trend in in- memory computing and its impact on analytics - how marriage between proprietary in memory technology and open source technology helping mathematicians solve real life problems - how this technology evolution has made analytics the ‘brain’ behind the internet of things (IOT) revolution -
  • 3. 1. IN MEMORY COMPUTING AND ANALYTICS 1.1. Arrival of In memory Analytics As the cost of RAM declines, in-memory analytics is no more pipe dream for many businesses. The 64-bit operating systems with 2 terabyte (TB) addressable memory have made it possible to cache large volumes of data, potentially an entire data warehouse or data mart in a computer’s RAM. In addition to incredibly fast query response times, in-memory analytics can reduce or eliminate the need for data indexing and storing pre-aggregated data in OLAP (On Line Analytical Processing) cubes or aggregate tables. 1.2. Advent of column storage Another evolution is usage of columnar data storage for Analytics applications. Unlike the traditional data storage, where the data records are indexed and stored in rows with the record containing all the fields, products like Sybase IQ leverage columnar data storage for analytics which permits faster data access for OLAP system. For example, in column storage, data is only partially blocked during access & individual columns can be processed at the same time by different cores. However row storage continues to be a preferred option for an OLTP (On Line Transaction Processing) system where the transaction system may require access to all the fields of a record every time user accesses the record (e.g., creation of a sales order). Hence OLTP and OLAP continue to sit in two different boxes and data need to move from OLTP to OLAP( to run analytics models, reports & dashboards) & OLAP to OLTP ( for analytics to trigger action in transaction system). A marriage of row based and column-based technologies can eliminate the need of maintaining two different systems. 1.3. In-memory Database An in-memory database means all the data is stored in the memory (RAM) and no time is wasted in loading the data from hard disk to RAM. Everything is in-memory, which gives the CPUs quick access to data for processing. 
The speed advantages offered by this RAM storage system is further accelerated by the use of multi- core CPUs, multiple CPUs per board, and multiple boards per server appliance. In-memory database like SAP HANA combines the power of hardware and software to process massive volume of real time data using in-memory computing, e.g.,  It combines row-based and column-based database technology.  Data now resides in main-memory (RAM) and no longer on a hard disk.
  • 4. It is best suited for performing real-time analytics and developing and deploying real-time applications. Fig 1. In Memory Computing: Combining power of Hardware & Software As Forrester Research has pointed out, the outcome of this evolution in technology is a distributed in-memory data platform like SAP HANA that enterprises can use to support real-time analytics, predictive and text analytics, and extreme transaction volumes. The next-generation data platform demands looking at these new technologies to help deliver the speed, agility and new insights critical to helping your business grow. For decades, organizations have built the transactional, operational and analytical layers to support various applications, operational reporting, and analytics. However, with the growing need to support real-time data sharing driven by mobile enterprise, separate transactional, operational, and analytical layers are creating an obstacle in supporting such an initiative. Distributed in-memory data platform offers a new approach to collapse the technology stack that can eliminate redundant hardware, software and middleware components to save money and reduce complexity through automation and integrated systems that can help developers and DBAs become more productive. 1.4. OLTP & OLAP in a box An appliance like SAP HANA blends the column and row storage in the same database eliminating the need for data movement between two different boxes, which allows OLTP and OLAP in a single box. Hence the moment a sales order is created in ERP system, the data is accessible to analytical application and the business users can access operational reports real time helping them take corrective action real time instead of doing a post mortem. In section 4, the
paper describes how the HANA Predictive Analysis Library (PAL) can access transaction data in real time and deliver a predictive-maintenance decision just in time to avoid costly breakdowns and prevent millions of dollars of losses due to unplanned maintenance.

2. IN MEMORY COMPUTING AND BIG DATA ANALYTICS

With the explosion of data, even with today's high-capacity RAM and multi-core processors, the cost of hardware can be prohibitive if a business needs to store the huge volume of data generated every second (e.g., clickstreams from a website, or the real-time data generated by hundreds of sensors attached to a Formula One car). This challenge of sheer data volume can be addressed by low-cost open source technologies like Hadoop. However, the advantage of in-memory computing remains unutilized unless in-memory analytics can be integrated with such data lakes seamlessly. Hence the built-in integration of Hadoop or Spark with SAP HANA (or similar in-memory databases) can support an architecture where data is stored based on its 'data temperature', i.e., frequency of access.

Fig 2. A Big Data Integrated Architecture

2.1. In-Memory Database and Hadoop Integration

Taking the SAP Big Data architecture as an example, a typical big data integrated architecture stores hot data (more frequently accessed data) in SAP HANA (the in-memory database), warm data in Sybase IQ, and cold data (less frequently accessed data) in Hadoop. This integrated architecture allows need-based data access while keeping infrastructure costs manageable.

SAP recently launched SAP HANA Vora, an in-memory query engine that runs on Apache Spark to analyze Big Data stored in Hadoop. The goal is to deliver a single integrated platform that embraces the Hadoop ecosystem for:
- All data: OLTP, OLAP & Big Data
- All operations: setup, administration, monitoring
- One interface for applications

Key features of this integration are:
- Building OLAP-style capabilities on Hadoop/HDFS and extending the SAP HANA and Hadoop integration to provide optimized data processing and movement between the two platforms
- Enabling massive scale-out scenarios for HANA

SAP HANA Vora also reveals another trend that software vendors are adopting: it permits users of open source technologies to continue building analytical applications in the open source environment while leveraging the power of in-memory computing. For example, SAP HANA Vora offers:
- Extensive programming support for Scala, Python, C, C++, R and Java, allowing data scientists to use their tool of choice
- Support for data scientists and developers who prefer SparkR or Spark ML to mash up corporate data with Hadoop/Spark data easily
- Leverage of SAP HANA's multiple data processing engines and in-memory computing for developing new insights from business and contextual data

2.2. In-Memory Database and R Integration

Another aspect of integrating an in-memory database with open source is integration with R, which gives access to a practically unlimited number of readily available algorithms.
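Whichever library is ultimately used, the underlying pattern is the same: the algorithm is shipped to where the data lives instead of the data being copied out to a client. The following Python sketch illustrates this idea only; all class and function names are invented and this is not a SAP HANA API.

```python
# Illustrative sketch: in-database execution vs. client-side execution.
# All names are invented; this is NOT a real SAP HANA interface.

class InMemoryColumnStore:
    """A toy columnar store holding named columns as Python lists."""

    def __init__(self, **columns):
        self.columns = columns

    def run_in_database(self, func, *col_names):
        # The function is shipped to the data: nothing is copied out.
        return func(*(self.columns[name] for name in col_names))

    def fetch_rows(self, *col_names):
        # Tuple-based interface: every row is materialized and copied out,
        # which is the extra cost the in-database route avoids.
        return list(zip(*(self.columns[name] for name in col_names)))


db = InMemoryColumnStore(price=[10.0, 12.0, 9.5], qty=[3, 1, 4])

# In-database: the function runs next to the columnar data.
revenue = db.run_in_database(
    lambda price, qty: sum(p * q for p, q in zip(price, qty)), "price", "qty"
)

# Client-side: rows are copied out first, then processed.
rows = db.fetch_rows("price", "qty")
revenue_client = sum(p * q for p, q in rows)

print(revenue, revenue_client)  # both 80.0
```

The result is identical either way; the difference lies in where the computation happens and how much data crosses the interface, which is exactly the advantage the in-database libraries described below aim at.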
Extending the example of the SAP Big Data architecture, SAP Predictive Analytics offers advanced SAP HANA integration and provides in-database, in-memory computing through dedicated SAP HANA native libraries and R:
- SAP PAL for HANA: the Predictive Analysis Library, with HANA-native, optimized implementations of industry-standard predictive algorithms
- SAP APL for HANA: the Automated Predictive Library, with SAP proprietary algorithms that automate many tasks for simpler, quicker and higher-quality model definition
- The R server, which enables using any open source R algorithm in the HANA engine

The goal of integrating the SAP HANA database with R is to enable the embedding of R code in the SAP HANA database context. That is, the SAP HANA database allows R code to be processed inline as part of the overall query execution plan. This scenario is suitable when a SAP HANA-based modeling and consumption application wants to use the R environment for specific statistical functions. An efficient data-exchange mechanism supports the transfer of intermediate database tables directly into the vector-oriented data structures of R. This offers a performance advantage over standard SQL interfaces, which are tuple based and therefore require an additional data copy on the R side.

Fig 3. SAP HANA – R Integration Architecture

To process R code in the context of the SAP HANA database, the R code is embedded in SAP HANA SQL code in the form of an RLANG procedure. The SAP HANA database uses the external R environment to execute this R code, similar to native database operations like joins or aggregations. This allows the application developer to elegantly embed R function definitions and calls within SQLScript and submit the entire code as part of a query to the database.

Fig 3 shows the three main components of the integrated solution: the SAP HANA-based application, the SAP HANA database, and the R environment. When the calculation model's plan execution reaches an R-operator, the calculation engine's R-client issues a request through the Rserve mechanism to create a dedicated R process on the R host. The R-client then efficiently transfers the R function code and its input tables to this R process and triggers execution. Once the R process completes the function execution, the resulting R data frame is returned to the calculation engine, which converts it. Since the internal column-oriented data structure used within the SAP HANA database for intermediate results is very similar to the vector-oriented R data frame, this conversion is very efficient. A key benefit of having the overall control flow on the database side is that database execution plans are inherently parallel; therefore, multiple R processes can be triggered to run in parallel without having to worry about parallel execution within a single R process.

While the leading vendors of in-memory computing and analytics products are integrating their products with open source technologies, the product upgrade cycle of software vendors is relatively slow compared to that of open source projects. This may lead to occasional integration challenges due to version incompatibility.

3. BIG DATA ANALYTICS AND IOT

All these developments have opened up new opportunities for the practical application of analytics. While sensors, clicks, POS devices etc. can generate large volumes of valuable data, making sense of that data requires predictive modelling and processing huge volumes within a reasonable time. Taking the example of tracking a Formula One car, predictive models can identify the potential failure of a component well in advance by processing vast streams of sensor data in real time, and hence can prevent a major accident, saving lives and millions of dollars of investment.

Building an IoT solution involves three main steps, or phases:

Step 1. Data integration: This is the first and primary step. It brings a variety of data into a coherent, complete set – from the edge to the core – to offer the deepest and broadest insights possible.
Step 2. Data management: This step, which spans IT systems and infrastructure, requires special attention. Data management must address the challenge of managing large volumes of data, as well as layering on contextual information such as asset taxonomy and time and location data.

Step 3. Making sense of data: Once foundational data integration and data management are in place, many types of business innovation become possible. Enterprises can create meaningful insights and reimagine their business models and customer experiences by leveraging predictive and prescriptive analytics.

The next section describes a typical architecture and data flow for an IoT scenario.

4. CASE STUDY: PREDICTIVE MAINTENANCE AND SERVICE FOR A EUROPEAN MANUFACTURER OF AIR SYSTEMS

The customer, a leading manufacturer of air compressors, wanted to provide differentiated value by supplying compressed gas as a business focus in addition to selling compressors. For this reason, downtime and breakdowns become a critical factor, as they would result in substantial losses for the company. Predictive maintenance and service help the company understand the availability of its machinery, avoid lost revenue, and lower maintenance costs.

4.1. The Process Innovation
- Move from preventative maintenance to predictive maintenance in order to improve product reliability, service revenue, and customer satisfaction.
- Apply monitoring and predictive analysis by coupling and analyzing disparate historical data with actual equipment data to more accurately predict future equipment failures.
- Combine customer data and service level agreements/contracts in an analytics solution that alerts and supports the service team in preventing failures in an optimized way.

Predictive maintenance provides the ability to plan demand for aftermarket service and sales based on visibility into the customer base and the support needed.

4.2. Architecture & Requirements
- Big data volumes from sensor data (temperature, pressure, machine conditions) in combination with product data including failure codes, machine master data and business data (OEM, dealer). Integration with Hadoop to build a data lake is under consideration.
- Flexible predictive tools and algorithms combining technical and business data
- A HANA/IQ central store fed by Data Services and ESP, with SAP BI tools/portal used for visualization
- Machine data insight combining stream and signal intelligence

Fig 4. The IoT Architecture for the compressor manufacturing company

4.3. Value Drivers/Benefits
- Customers can generate additional revenue, since they can extend their service with a prediction of when a machine might break down, determine how long they can run a production line with an existing failure, and provide the right spare parts to shorten a maintenance shutdown.
- Reduced service, warranty and maintenance costs
- Higher service profitability and customer satisfaction
- Better alignment with spare parts planning and availability
- Improved service intervals / lower service costs for the customer

5. ANALYTICS MATURITY QUADRANT

As part of this study, the concept of the Analytics Maturity Quadrant (AMQ) is introduced in the light of the above evolution. The AMQ maps a business with respect to two parameters:
- Maturity in the application of analytics as a business tool
- Maturity in the usage of data & technology
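To make the mapping concrete, a company's position can be reduced to these two scores and a quadrant label. The sketch below is purely illustrative: the scores, threshold and quadrant labels are invented for this example and are not part of the AMQ definition.

```python
# Illustrative AMQ placement: two maturity scores (0-10) map to a quadrant.
# Threshold, labels and example scores are invented for illustration.

def amq_quadrant(analytics_maturity: float, data_tech_maturity: float,
                 threshold: float = 5.0) -> str:
    """Return an AMQ quadrant label for the two maturity scores."""
    high_analytics = analytics_maturity >= threshold
    high_tech = data_tech_maturity >= threshold
    if high_analytics and high_tech:
        return "Leader (top right)"
    if high_tech:
        return "Invested in technology, analytics usage lagging"
    if high_analytics:
        return "Analytics-driven, technology investment lagging"
    return "Traditional BI reporting"

# Hypothetical sector-level scores, loosely following the trend discussed below.
sectors = {"Retail": (8, 9), "Telecom": (7, 8),
           "Manufacturing": (3, 6), "CPG": (2, 4)}

for name, (analytics, tech) in sectors.items():
    print(f"{name}: {amq_quadrant(analytics, tech)}")
```

A company can substitute its own assessed scores to locate itself in the quadrant and track movement toward the top right over time.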
Fig 5 A. Analytics Maturity Quadrant

As part of the ongoing study, companies from different industry sectors are being analyzed in terms of their technology footprint and usage of analytics. The general trend indicates that sectors like retail, telecom and financial services are well ahead of their counterparts in other sectors, and a majority of them have invested in technology and in-house centers of excellence for advanced analytics. In contrast, sectors like manufacturing, CPG and energy continue to depend extensively on traditional BI reporting, with occasional small pockets of predictive/prescriptive analytics usage (e.g., forecasting, optimization of distribution). The latter segment has started realizing the potential of analytics as a strategic tool for competitive advantage; an outcome of this realization is investment in technology, although usage of advanced analytics is still lacking. Diagram 5 B is an indicative mapping of some of the key industry sectors. The objective of this mapping is to help interested companies identify their relative position in the quadrant and define a strategy to move towards the top right corner.
Fig 5 B. Analytics Maturity Quadrant: Industry Perspective

References
1. Evelson, Boris, "I Forget: What's In-Memory?", blogs.forrester.com, March 2010.
2. "Connect, Transform, and Reimagine Business in a Hyperconnected Future", SAP thought leadership paper, 2014.
3. Abadi, D. J., Madden, S. R., Hachem, N., "Column-stores vs. row-stores: how different are they really?", SIGMOD, 2008, pp. 967-980.
4. Henschen, Doug, "In-Memory Databases", InformationWeek, March 2014, pp. 9-16.
5. McKinsey Global Institute, "Big data: The next frontier for innovation, competition, and productivity", May 2011.
6. Plattner, Hasso, Zeier, Alexander, "In-Memory Data Management: An Inflection Point for Enterprise Applications", Springer, Germany, 2011.
About the Author

Arup Ray, a senior executive with 20+ years of industry experience, manages a global vertical for Analytics, Big Data, EIM & HANA in SAP SDC. In this role, Arup is also a member of the SAP Services global leadership team for Analytics, Big Data & HANA. As a consultant and business head, Arup has had the opportunity to assist customers across five continents in the areas of Analytics, Big Data & Supply Chain Management in multiple industries, e.g., CPG, retail, telecom and manufacturing. He has also incubated the Big Data & HANA CoE in SDC Global and helps customers leverage Analytics & Big Data technology to achieve their strategic goals. An alumnus of IIT Delhi & ISB Hyderabad, Arup is currently pursuing the Business Analytics & Intelligence program at IIM Bangalore.