Customer Value Analysis
of Big Data products
Vikas Sardana
Indian Institute of Management, Bangalore
Agenda
• Background – evolution of data, challenges,
products and vendors
• Top Big Data Use cases
• Case Analysis: Customer Value model for Big
Data analytics use case for a mobile advertising
network
• Conclusion
What is Big Data
• “Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”
Source: McKinsey Global Institute
• “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
Source: Gartner
Some Sources of Big data
• Web and social media
• Machine generated data – Radio Frequency
Identification, Global Positioning Systems,
Phone apps etc.
• Biometric data
• Human interactions (email, mobile phones,
voice mails, call centers)
Big Data Challenges
• Acquisition
• Storage
• Processing and Analysis
Big Data products
• Hadoop platform and tools
• NOSQL databases
Consumption Models
Two dimensions of choice:
• Sourcing: Open Source (Build), Open Source (Buy support), Proprietary (Buy)
• Deployment: On Premise, Externally hosted (Cloud)
Trade-offs
• Building requires in-house expertise
• On Premise leads to capital expenditure while cloud leads to
operational expenses
Prominent Vendors
• Cloudera
• MapR
• HortonWorks
• IBM
• Amazon
Top Big Data customer use cases
• Predictive analytics
Building classification and prediction systems e.g. predicting
the buying preferences of customers.
• Revenue optimization
Pricing in real time based on several factors such as demand, cost, and competition, e.g. dynamic pricing. This is popular in various verticals, especially the airline industry.
• Revenue generation
Activities to create revenue streams e.g. segmentation and
targeting.
Top Big Data customer use cases (Cont.)
• Maximizing human and physical resources
• Scientific research in new areas
• Fraud detection
Detect potential fraud patterns in transactions
• Security and crime prevention
Gartner’s hype cycle for Big Data - 2012
“Big data has gone into Peak of inflated expectations
and is likely to plateau in 2 – 5 years”
Is there value for customers … ? - Motivation for this study
Source: Gartner
Rogers’ ACCORD model for diffusion of innovation
• Relative Advantage – High (Favorable): Big Data products have solved many new problems and are far ahead of traditional data management products.
• Compatibility – High (Favorable): Most Big Data products use commodity hardware and popular programming languages, and hence are highly compatible with the current IT ecosystem.
• Complexity – High (Unfavorable): With a different paradigm of parallelism and a host of solutions, users need to understand new ways of processing and storing data. However, it requires simpler programming skills for engineers.
• Observability – Moderate: Although Big Data has been popularized, it is background IT infrastructure. Nevertheless, due to the power of the problems it has solved, it has been a topic of discussion in various forums.
• Risk – High (Unfavorable): It requires considerable investment of resources and energy, and is still in its initial years.
Case Analysis
Big Data Analytics use case for a Mobile
Advertising Network
Research Methodology
• Primary research with the buying center
• Interviews with business stakeholders and domain experts to
understand business requirements and business metrics
• Interviews with analytics technology experts to understand system
level requirements
• Interviews with hardware procurement and planning experts to
understand costs and sizing methodologies
• Secondary research
• Research and analyst reports on Big Data
• User manuals of the products for Big Data management
• Books, articles and blogs on Big Data technologies and products
• Blogs and websites of prominent mobile Ad networks
Advertising Network overview
A two-sided network, with advertisers buying ad space on one side and ad publishers selling the space on the other.
Image source: www.altitudedigital.com
Ad Serving and Click Flows
Image source: www.inmobi.com
Pricing Model
• Cost Per Click (CPC) – Outcome based pricing,
advertiser is charged only when the ad is clicked.
• Ad network revenue – 50% of the revenue generated from advertisers is appropriated by the ad network, and the remaining 50% is realized by the publisher.
Business Goals and Metrics for the Ad network
• Revenue optimization for publishers and self → Maximize Click Through Rate (CTR)
• Help advertisers with campaign planning → Accuracy of CTR prediction
• Help advertisers with campaign optimization through ongoing improvements → Accuracy and timeliness of real-time reports
• Help advertisers with later campaign analysis → Ability and accuracy of canned and ad hoc reporting
• Business continuity → Availability of reports on a sustained basis
Business Problem
The ad network has set up its data analytics systems to achieve its business goals but isn’t faring very well on its performance metrics.
Functions of data analytics systems
* This is a high-level functional view intended to highlight the hardware requirements; the actual technical steps for processing data in real time differ from those for batch reports.
The insights from the various analytics and reporting mechanisms help improve the placement and effectiveness of ads.
Challenges in data analytics
• Accessing the huge volume of data from the ad
servers
• Preparing huge data for analytics
• Analyzing the data at a large scale and providing
timely insights
Steps for analytics and suitable products

Step 1: Data collection of logs and feeds at a massive scale (8 billion collection events per day)
Challenges: burst bandwidth, latency, backlog, operability
Technical metrics: throughput, latency, data loss and reliability, linear scalability
Big Data offering: Distributed log collectors, e.g. Scribe (Facebook), Flume (Cloudera), Kafka (LinkedIn)
Other suitable alternatives: Log files transferred through network tools and protocols such as FTP and rsync

Step 2: Storing the collected data
Technical metrics: throughput, reliability, high availability, durability
Big Data offering: HDFS, S3, NOSQL stores
Other suitable alternatives: Files, databases

Step 3: Processing of data, ETL functions
Technical metrics: throughput, high availability
Big Data offering: HDFS, Hadoop MapReduce, Amazon EMR
Other suitable alternatives: Home-grown solutions using scripting languages such as Perl
Steps for analytics and suitable products (Cont.)

Step 4: BI reporting
Technical metrics: query latency, data freshness
Big Data offering: NOSQL columnar stores, warehouses
Other suitable alternatives: Traditional row-based data warehouses

Step 5: Ad hoc reporting based on historical data
Technical metrics: throughput, latency
Big Data offering: Hadoop MapReduce, Cloudera Impala, HortonWorks Stinger, Apache Drill, Greenplum, Netezza, Teradata
Other suitable alternatives: Relational databases

Step 6: Predictive analytics
Technical metrics: throughput, latency
Big Data offering: R, Hadoop MapReduce
Other suitable alternatives: Home-grown solutions on Massively Parallel Processing systems running on expensive, specialized hardware
IT Systems architecture using traditional data
management products
IT Systems architecture using Big Data products
Choice of Big Data product deployment
• Sourcing: Open Source (Build), Open Source (Buy support), Proprietary (Buy)
• Deployment: On Premise, Externally hosted (Cloud)
Decision criterion: Intellectual property
Strong technology and intellectual property are key success factors in the mobile ad network business and can help the company develop a competitive advantage.
Typical case facts about data generated by Ad
Network
• Monthly Ad impressions served: 100 billion
• Events received per day: 10 billion
(An event is triggered at various stages of serving an ad. Some example events: Ad Request and Ad Impression events, User Click events, User Ad Interaction events, Conversion/Acquisition events, and Monetization events)
• Average size of data received per event: 1 KB
• Data received per day: 10 terabytes
(10 billion events × 1 KB of data per event)
Source: https://hasgeek.tv/fifthelephant/2012-2/68-the-
elephant-that-flew-big-data-analytics-inmobi
Stage 1: Data Collection
• Traditional solution: rsync and FTP are the popular tools used to move these logs.
With Wide Area Network capacity of up to 10 gigabit/sec available, it is easily possible to send 10 terabytes of data per day from the machines that produce logs to those that consume them, but the challenges are:
o Weak WAN links lead to backlogs on the producer machines.
o Consumer systems being down leads to data choking and delays in event delivery.
o Duplicate data transfers consume unnecessary extra bandwidth.
• Big Data solution: Distributed Log Collectors – a few examples:
o Apache Flume (initially built at Cloudera)
o Scribe (Facebook)
o Kafka (LinkedIn)
Technical benefits of using distributed log
collectors
• Ability to work with distributed producers over
WAN, with consumers sitting in local or remote
datacenters.
• Producers are decoupled from consumers, so
consumers can process at their own pace.
• Efficient: no duplicate data transfers, uses
compression
• Reliable and linearly scalable
Apache Flume Hardware requirements
Image source: http://flume.apache.org
No. of agents required
Tier 1 agents
• Ratio of 1:16 for the outer tier
Number of tier 1 agents = 100 ad servers / 16 ≈ 7
Tier 2 agents
• Ratio of 1:4 for the inner tier, since more data will be pushed into Tier 2 from Tier 1
Number of tier 2 agents = 7 / 4 ≈ 2
Total agents required = 9
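The fan-in arithmetic above can be scripted as a sanity check; a minimal Python sketch, assuming the 100 ad servers from the case facts:

```python
import math

ad_servers = 100     # log-producing machines (case facts)
tier1_fan_in = 16    # ad servers handled per tier-1 agent (outer tier)
tier2_fan_in = 4     # tier-1 agents handled per tier-2 agent (inner tier)

tier1_agents = math.ceil(ad_servers / tier1_fan_in)    # 7
tier2_agents = math.ceil(tier1_agents / tier2_fan_in)  # 2
print(tier1_agents, tier2_agents, tier1_agents + tier2_agents)  # 7 2 9
```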
Physical storage requirements
Calculating the size of physical storage (hard drive) required
• Ad server data: 10 terabytes/day
• No. of ad servers = 100
• Data per sec. from each ad server = 10^12 / (24 × 60 × 60 × 100) ≈ 115 KB
• Data to be collected in two hours at this rate = 115 × 60 × 60 × 2 = 828 MB
(Assume the expected resolution time for downstream failures is two hours)
• Increase by a safety margin factor of, say, 1.5: 828 MB × 1.5 = 1,242 MB
• Required File Channel capacity ≈ 1.2 GB
The physical storage capacity requirement is around 1.2 GB.
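A sketch of the file-channel sizing, following the slide’s own working figures (10^12 bytes/day spread over 100 servers, a two-hour downstream-failure window, and a 1.5x safety margin):

```python
bytes_per_day = 10**12        # slide's working figure for daily log volume
ad_servers = 100
per_server_bps = bytes_per_day / (24 * 60 * 60 * ad_servers)  # ~115 KB/s

failure_window_s = 2 * 60 * 60   # two hours to resolve downstream failures
safety_margin = 1.5
channel_bytes = per_server_bps * failure_window_s * safety_margin
print(f"{channel_bytes / 1e9:.2f} GB")  # 1.25 GB; the slide rounds to ~1.2 GB
```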
CPU Requirements
Multiple sources and sinks can be defined on a given agent based on the event batch size. The larger the batch size, the greater the risk of duplication, hence batch size is limited to a maximum of 2,500 events.
Events per sec. per ad server = 10^12 / (1 KB × 24 × 60 × 60 × 100) ≈ 115
For Agent 1:
• Total exit batch size from 16 upstream servers = 16 × 115 = 1,840
• No. of sinks to accommodate 1,840 events = ⌈1840 / 2500⌉ = 1
For Agent 2:
• Receiving a batch of 1,840 events from each of four upstream agents
• No. of sinks = ⌈1840 × 4 / 2500⌉ = 3
Cores = (Sources + Sinks) / 2
For Agent 1, Cores = 1
For Agent 2, Cores = 2
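The sink and core counts follow the same arithmetic; a minimal sketch, assuming one source per agent (implied by the core counts above):

```python
import math

events_per_server_per_sec = 115  # from the storage slide's arithmetic
max_batch = 2500                 # cap chosen to limit duplication risk

def cores(sources: int, sinks: int) -> int:
    return math.ceil((sources + sinks) / 2)

# Tier-1 agent: fan-in of 16 ad servers
t1_batch = 16 * events_per_server_per_sec       # 1840 events
t1_sinks = math.ceil(t1_batch / max_batch)      # 1

# Tier-2 agent: batches of 1840 from each of 4 upstream tier-1 agents
t2_sinks = math.ceil(t1_batch * 4 / max_batch)  # 3

print(cores(1, t1_sinks), cores(1, t2_sinks))   # 1 2
```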
Apache Flume Total Hardware Requirements
• 7 single-core machines at $800 each
• 2 dual-core machines at $1,000 each
Total hardware cost = $5,600 + $2,000 = $7,600
Stage 2: Storing the collected data
Traditional solution: Network storage as part of High Performance Computing (HPC) clusters
• Ten times the overhead of commodity hard drives, due to communication requirements within the cluster
• Ten times costlier than commodity hardware, due to specialized features such as redundant storage, high availability etc.
Big Data solution: Hadoop Distributed File System (HDFS)
• Low storage cost per byte compared to alternatives such as Storage Area Networks
• Tuned to deliver fast data for MapReduce workloads, up to 2 gigabytes per second
• Data reliability is a primary design goal, and HDFS is used in production by various organizations
• Uses commodity hardware – lower initial and maintenance cost
• Shares cost with the compute layer, since it is built into the Hadoop kernel
• Linearly scalable in terms of performance and cost, even at very high volumes
Storage Requirements and costs
Traditional solution: HPC network storage
• Network storage used with HPC costs $100,000 per 100 TB of data
• For the ad network’s requirement of 14 petabytes, cost = $14M
• On moving away from this architecture, there would be a salvage value of 60% of this hardware.
Big Data solution: HDFS on commodity hardware
• 10 TB per day is 30 TB of physical space (3x replication factor); with a 30% overhead for MapReduce jobs’ local space (10 × 3 × 1.30), this is 39 TB of physical space per day
• 1.65 hosts per day’s worth of data (24 TB per host; see configuration below)
• For a 1-year retention, storage required = 39 terabytes × 365 ≈ 14 petabytes
• ~600 hosts
• 600 hosts × $5,000 per host = $3,000,000
Commodity hardware server configuration:
Chipset: 4 × 6-core Intel Xeon 3GHz
Memory: 32GB
Operating System: Red Hat Enterprise Linux 5
Network: 2 Gbps (bonded Network Interface Card)
Disk Space: 2TB × 12 JBOD (Just a Bunch of Disks)
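The HDFS sizing above as a short sketch; the 24 TB per host follows from the 2TB × 12 JBOD configuration:

```python
import math

raw_tb_per_day = 10
replication = 3
mr_overhead = 1.30   # extra local space for MapReduce jobs
physical_tb_per_day = raw_tb_per_day * replication * mr_overhead  # 39 TB

host_tb = 2 * 12     # 2 TB x 12 JBOD per host (configuration above)
hosts_per_day = physical_tb_per_day / host_tb   # ~1.625, slide says 1.65

retention_days = 365
total_pb = physical_tb_per_day * retention_days / 1000  # ~14.2 PB
hosts = math.ceil(hosts_per_day * retention_days)       # 594, slide rounds to ~600
print(total_pb, hosts, hosts * 5000)  # ~14 PB, ~600 hosts, ~$3M
```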
Stage 3: Data processing and preparation
Traditional solution: Scripts (e.g. in the Perl scripting language) on High Performance Compute hardware
Big Data solution: Hadoop MapReduce
Benefits of Hadoop MapReduce over Perl on HPC hardware
• Scalable to thousands of nodes, shared-nothing architecture
• Abstracts the complexity of distributed programming
• Reduces human resource cost to 0.5x
• High availability, fault tolerance
• Abstracts cluster functions
• High performance, especially for one-time processing of unstructured data
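To illustrate the paradigm, a minimal map/shuffle/reduce-style computation in plain Python on hypothetical log lines (counting clicks per publisher); a real deployment would express the same two functions as a Hadoop MapReduce job spread across many nodes:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical raw log lines: "event_type,publisher_id"
logs = ["click,pub1", "impression,pub1", "click,pub2", "click,pub1"]

# Map: emit a (publisher, 1) pair for every click event
mapped = [(line.split(",")[1], 1) for line in logs if line.startswith("click")]

# Shuffle: group intermediate pairs by key (Hadoop does this between phases)
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts for each publisher
counts = {pub: sum(n for _, n in grp)
          for pub, grp in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'pub1': 2, 'pub2': 1}
```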
Hardware costs for Data Preparation and Processing
Traditional solution:
• 10 TB/day ≈ 121 MB/sec
• Average throughput per node for analytics workload = 1 MB/s
• Desired aggregate throughput = 121 MB/s
• No. of nodes required ≈ 120
• Cost = 120 nodes × $5,000 per node = $600,000
Big Data solution:
• 10 TB/day ≈ 121 MB/sec
• Average throughput per node for analytics workload = 10 MB/s
• Desired aggregate throughput = 121 MB/s
• No. of nodes required ≈ 12
• Cost = 12 nodes × $5,000 per node = $60,000
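The throughput-based node sizing, sketched below; exact ceilings give 122 and 13 nodes, which the slide rounds to 120 and 12:

```python
import math

mb_per_sec = 10 * 1024**2 / (24 * 60 * 60)  # 10 TB/day ~ 121 MB/s aggregate

def nodes_and_cost(per_node_mb_s, node_price=5000):
    nodes = math.ceil(mb_per_sec / per_node_mb_s)
    return nodes, nodes * node_price

print(nodes_and_cost(1))   # traditional HPC + Perl: (122, 610000); slide: 120, $600K
print(nodes_and_cost(10))  # Hadoop MapReduce:       (13, 65000);   slide: 12, $60K
```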
Human Resource Cost for Data Preparation and Processing
Traditional solution:
• Complex skillset required to handle distributed computing complexity
• Estimate: 50-person team @ $35,000 per person per year
• Cost: $1,750,000
Big Data solution:
• Simpler skillset required, as complexities are abstracted from the programmers
• Estimate: 50% cost reduction
• Cost: $875,000
Stage 4: Analytics – Reporting, ad hoc and predictive analytics
Traditional solution: Row-based data warehouses with Structured Query Language
Big Data solution: NOSQL column stores
No additional hardware costs and similar human resource costs
• Big Data solutions benefit because schemas can be modified at a later stage, keeping the reports up to date with new types of data.
• Optimized for columnar storage and access, which are the main access patterns in analytics
Quantification of immediate business benefits
1. Increase in ad revenue due to better CTR
Improved ad matching algorithms will more accurately target the ads to the relevant users with the relevant publishers.
• Estimated CTR increase: 5%
• Corresponding increase in publisher’s ad revenue: 5%
• Corresponding increase in ad network’s revenue (50% of publisher’s ad revenue): 5%
• Ad network’s increase in revenue (current revenue $100M): $5M
2. Increase in ad revenue by enabling advertisers to better plan campaigns
Better accuracy in predicting CTR will help advertisers with campaign planning. This will improve CTR, in turn increasing the revenue for publishers and the ad network.
• Estimated CTR increase: 5%
• Corresponding increase in publisher’s ad revenue: 5%
• Corresponding increase in ad network’s revenue (50% of publisher’s ad revenue): 5%
• Ad network’s increase in revenue (current revenue $100M): $5M
Quantification of immediate business benefits (Cont.)
3. Increase in ad revenue due to better campaign optimization
Timely and accurate real-time reports will help advertisers do course correction, further improving CTR and hence ad revenue.
• Estimated CTR increase: 5%
• Corresponding increase in publisher’s ad revenue: 5%
• Corresponding increase in ad network’s revenue (50% of publisher’s ad revenue): 5%
• Ad network’s increase in revenue (current revenue $100M): $5M
4. Increase in ad revenue due to better availability of reports
If the ad network provides better continuity to advertisers, they will be willing to pay a premium.
• Estimated premium payment: 2%
• Corresponding increase in ad network’s revenue: 2%
• Ad network’s increase in revenue (current revenue $100M): $2M
Total increase in ad network’s revenue (1 + 2 + 3 + 4) = $17M
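The benefit quantification reduces to percentage arithmetic on the ad network’s current revenue; a sketch using the slide’s figures:

```python
current_revenue_m = 100  # ad network's current revenue, in $M

# Estimated uplift in the ad network's revenue per benefit (slide's figures)
benefits_pct = {
    "better CTR": 5,
    "better campaign planning": 5,
    "better campaign optimization": 5,
    "better report availability": 2,
}

uplift_m = {name: current_revenue_m * pct / 100 for name, pct in benefits_pct.items()}
print(uplift_m)                # each benefit in $M
print(sum(uplift_m.values()))  # 17.0 -> total benefit of $17M
```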
Value Element Mapping
Points of Parity
• Open source software is available, and the company can customize and enhance it the way it wants.
• Support for the Java programming language, for which it is easy to hire people and further enhance the software, due to an abundantly available talent pool.
Points of Difference
• Simpler skillset required for in-house IT experts in the case of Big Data products.
• Ability to handle all aspects of big data problems, unlike traditional data management products.
• Linear scalability – Big Data products can work with cheaper hardware and scale linearly, making them a future-proof investment.
Points of Contention
• Adoption uncertainty: Although developer community support to maintain and evolve the open source Big Data products is growing very fast due to the buzz, it is unclear whether it will become as strong as that for traditional software.
• Stability of Big Data vendors: The commercial vendors are mostly newly formed companies, though founded by very accomplished people. They are fast gaining traction, but it is unclear whether they can sustain for the long term. Moreover, since pure-play Big Data firms are privately held, their growth and revenues are not clearly known.
Customer Value Model

Big Data products:
• Benefits: $17M
• Cost other than price (Capex + annual) in the first year:
$7,600 (Data collection)
+ $3M (Storage)
+ $60K (Processing)
+ $875,000 (Salaries)
+ $1.5M (Implementation and training)
• Total cost: $5,442,600
• Value = Benefit – Cost = $11,557,400
• Price: Free and open source

Traditional products (Next Best Alternative – NBA):
• Benefits: Status quo with the existing systems
• Cost (already incurred in the existing systems): $14M (Storage) + $600K (Processing) + $1,750,000 (Salaries)
• Total cost: Sunk cost
• Value: No additional value in the existing systems
• Price: Free and open source

Delta(Price) = 0
Value in Use = Delta(Value) – Delta(Price) = $11,557,400
Effective value in use (for migration to Big Data products) = Value in Use + Salvage value of storage and processing (60% × $14.6M = $8.76M) + Salaries saved ($1,750,000) = $22,067,400
The time value of money is ignored, since the cash flows are considered over a short period, i.e. one year.
Framework reference: James C. Anderson, James A. Narus, DVR
Seshadri
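The whole value model can be reproduced in a few lines; a sketch using the cost and benefit figures above, with salvage derived as 60% of the traditional storage and processing hardware:

```python
benefits = 17_000_000  # total quantified benefit from the previous slides

big_data_costs = {
    "data collection": 7_600,
    "storage": 3_000_000,
    "processing": 60_000,
    "salaries": 875_000,
    "implementation and training": 1_500_000,
}
total_cost = sum(big_data_costs.values())  # $5,442,600
value = benefits - total_cost              # $11,557,400

delta_price = 0                            # both alternatives are free and open source
value_in_use = value - delta_price         # $11,557,400

salvage = 0.60 * (14_000_000 + 600_000)    # 60% of HPC storage + processing = $8.76M
salaries_saved = 1_750_000                 # traditional team's annual salary bill
effective_value_in_use = value_in_use + salvage + salaries_saved
print(total_cost, value, int(effective_value_in_use))  # 5442600 11557400 22067400
```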
Value placeholders (less tangible)
Positives
• The Big Data product architecture will be linearly scalable and hence future-proof; future data management requirements can be fulfilled with incremental cost towards buying commodity hardware.
• Customer satisfaction, and hence low customer churn, due to the increased control in advertisers’ hands for managing their advertisements.
• The skillset required for in-house IT experts is simpler in the case of Big Data products, and mostly based on the popular Java technology.
Negatives
• Although the above Big Data products are backed by strong companies and open source communities, these companies and communities are not as strong as the ones behind traditional products.
• The commercial vendors are mostly newly formed companies, albeit founded by very capable people; they are fast gaining traction, but it is unclear whether they can sustain for the long term.
Conclusion
• The above case study builds a clear case for the value proposition of Big Data products
• Big Data products are being used extensively across various industries, and this value model can help build a concrete case for them in other settings as well