Customer Value Analysis
of Big Data products
Vikas Sardana
Indian Institute of Management, Bangalore
Agenda
• Background – evolution of data, challenges,
products and vendors
• Top Big Data Use cases
• Case Analysis: Customer Value model for Big
Data analytics use case for a mobile advertising
network
• Conclusion
What is Big Data
• “Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”
Source: McKinsey Global Institute
• “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
Source: Gartner
Some Sources of Big data
• Web and social media
• Machine generated data – Radio Frequency
Identification, Global Positioning Systems,
Phone apps etc.
• Biometric data
• Human interactions (email, mobile phones,
voice mails, call centers)
Big Data Challenges
• Acquisition
• Storage
• Processing and Analysis
Big Data products
• Hadoop platform and tools
• NOSQL databases
Consumption Models
Two dimensions of choice:
• Sourcing: Open Source (Build), Open Source (Buy support), Proprietary (Buy)
• Deployment: On Premise, Externally hosted (Cloud)
Trade-offs
• Building requires in-house expertise
• On Premise leads to capital expenditure while cloud leads to
operational expenses
Prominent Vendors
• Cloudera
• MapR
• HortonWorks
• IBM
• Amazon
Top Big Data customer use cases
• Predictive analytics
Building classification and prediction systems e.g. predicting
the buying preferences of customers.
• Revenue optimization
Pricing in real time based on several factors such as demand, cost, and competition, e.g. dynamic pricing. This is popular in various verticals, especially the airline industry.
• Revenue generation
Activities to create revenue streams e.g. segmentation and
targeting.
Top Big Data customer use cases (Cont.)
• Maximizing human and physical resources
• Scientific research in new areas
• Fraud detection
Detect potential fraud patterns in transactions
• Security and crime prevention
Gartner’s hype cycle for Big Data - 2012
“Big data has gone into Peak of inflated expectations
and is likely to plateau in 2 – 5 years”
Is there value for customers … ? - Motivation for this study
Source: Gartner
Rogers’ ACCORD model for diffusion of innovation
• Relative Advantage – High (Favorable): Big Data products have solved many new problems and are far ahead of traditional data management products.
• Compatibility – High (Favorable): Most Big Data products use commodity hardware and popular programming languages, and hence are highly compatible with the current IT ecosystem.
• Complexity – High (Unfavorable): With a different paradigm of parallelism and a host of solutions, users need to understand new ways of processing and storing data. However, it requires simpler programming skills for engineers.
• Observability – Moderate: Although Big Data has been popularized, it is background IT infrastructure. Nevertheless, due to the power of the problems it has solved, it has been a topic of discussion in various forums.
• Risk – High (Unfavorable): It requires considerable investment of resources and energy, and is still in its initial years.
Case Analysis
Big Data Analytics use case for a Mobile
Advertising Network
Research Methodology
• Primary research with the buying center
• Interviews with business stakeholders and domain experts to
understand business requirements and business metrics
• Interviews with analytics technology experts to understand system
level requirements
• Interviews with hardware procurement and planning experts to
understand costs and sizing methodologies
• Secondary research
• Research and analyst reports on Big Data
• User manuals of the products for Big Data management
• Books, articles and blogs on Big Data technologies and products
• Blogs and websites of prominent mobile Ad networks
Advertising Network overview
A two-sided network, with advertisers buying ad space on one side and ad publishers selling the space on the other.
Image source: www.altitudedigital.com
Ad Serving and Click Flows
Image source: www.inmobi.com
Pricing Model
• Cost Per Click (CPC) – Outcome based pricing,
advertiser is charged only when the ad is clicked.
• Ad network revenue – 50% of the revenue generated from advertisers is appropriated by the ad network, and the remaining 50% is realized by the publisher.
Business Goals and Metrics for the Ad network
• Revenue optimization for publishers and self → Maximize Click Through Rate (CTR)
• Help advertisers with campaign planning → Accuracy of CTR prediction
• Help advertisers with campaign optimization through ongoing improvements → Accuracy and timeliness of real-time reports
• Help advertisers with later campaign analysis → Ability and accuracy of canned and ad hoc reporting
• Business continuity → Availability of reports on a sustained basis
Business Problem
The ad network has set up its data analytics systems to achieve its business goals but isn’t faring very well on its performance metrics.
Functions of data analytics systems
* This is a high-level functional view intended to highlight the hardware requirements; the actual technical steps for processing data in real time differ from those for batch reports.
The insights from the various analytics and reporting mechanisms help improve the placement and effectiveness of ads.
Challenges in data analytics
• Accessing the huge volume of data from the ad
servers
• Preparing huge data for analytics
• Analyzing the data at a large scale and providing
timely insights
Steps for analytics and suitable products

Step 1: Data collection of logs and feeds at a massive scale (8 billion collection events per day)
Challenges: burst bandwidth, latency, backlog, operability
Technical metrics: throughput, latency, data loss and reliability, linear scalability
Big Data offering: Distributed log collectors, e.g. Scribe (Facebook), Flume (Cloudera), Kafka (LinkedIn)
Other suitable alternatives: Log files transferred through network tools and protocols such as FTP and rsync

Step 2: Storing the collected data
Technical metrics: throughput, reliability, high availability, durability
Big Data offering: HDFS, S3, NOSQL stores
Other suitable alternatives: Files, databases

Step 3: Processing of data, ETL functions
Technical metrics: throughput, high availability
Big Data offering: HDFS, Hadoop MapReduce, Amazon EMR
Other suitable alternatives: Home-grown solutions using scripting languages such as Perl
Steps for analytics and suitable products (Cont.)

Step 4: BI reporting
Technical metrics: query latency, data freshness
Big Data offering: NOSQL columnar stores, warehouses
Other suitable alternatives: Traditional row-based data warehouses

Step 5: Ad hoc reporting based on historical data
Technical metrics: throughput, latency
Big Data offering: Hadoop MapReduce, Cloudera Impala, HortonWorks Stinger, Apache Drill, Greenplum, Netezza, Teradata
Other suitable alternatives: Relational databases

Step 6: Predictive analytics
Technical metrics: throughput, latency
Big Data offering: R, Hadoop MapReduce
Other suitable alternatives: Home-grown solutions on Massively Parallel Processing systems running on expensive, specialized hardware
IT Systems architecture using traditional data
management products
IT Systems architecture using Big Data products
Choice of Big Data product deployment
• Sourcing: Open Source (Build), Open Source (Buy support), Proprietary (Buy)
• Deployment: On Premise, Externally hosted (Cloud)
Decision criterion: Intellectual property
Strong technology and intellectual property are key success factors in the mobile ad network business and can help the company develop a competitive advantage.
Typical case facts about data generated by Ad
Network
• Monthly Ad impressions served: 100 billion
• Events received per day: 10 billion
(An event is triggered at various stages of serving an ad. Some example events: Ad Request and Ad Impression events, User Click events, User Ad Interaction events, Conversion/Acquisition events, and Monetization events)
• Average size of data received per event: 1 KB
• Data received per day: 10 terabytes
(10 billion events × 1 KB of data per event)
Source: https://hasgeek.tv/fifthelephant/2012-2/68-the-
elephant-that-flew-big-data-analytics-inmobi
Stage 1: Data Collection
• Traditional solution: rsync and FTP are the popular tools used to move these logs.
With Wide Area Network capacity of up to 10 gigabit/sec available, it is easily possible to send 10 terabytes of data per day from the machines that produce logs to those that consume them, but the challenges are:
o Weak WAN links lead to backlogs on the producer machines.
o Consumer systems being down leads to data choking and delays in event delivery.
o Duplicate data transfers consume unnecessary extra bandwidth.
• Big Data solution: Distributed Log Collectors – a few examples:
o Apache Flume (initially built at Cloudera)
o Scribe (Facebook)
o Kafka (LinkedIn)
Technical benefits of using distributed log
collectors
• Ability to work with distributed producers over
WAN, with consumers sitting in local or remote
datacenters.
• Producers are decoupled from consumers, so
consumers can process at their own pace.
• Efficient: no duplicate data transfers, uses
compression
• Reliable and linearly scalable
Apache Flume Hardware requirements
Image source: http://flume.apache.org
No. of agents required
Tier 1 agents
• Ratio of 1:16 for the outer tier
Number of tier 1 agents = 100 ad servers / 16 ≈ 7
Tier 2 agents
• Ratio of 1:4 for the inner tier, since more data will be pushed into Tier 2 from Tier 1
Number of tier 2 agents = 7 / 4 ≈ 2
Total agents required = 9
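The fan-in arithmetic above can be scripted as a sanity check; a minimal Python sketch, assuming the 100 ad servers from the case facts:

```python
import math

ad_servers = 100     # log-producing machines (case facts)
tier1_fan_in = 16    # ad servers handled per tier-1 agent (outer tier)
tier2_fan_in = 4     # tier-1 agents handled per tier-2 agent (inner tier)

tier1_agents = math.ceil(ad_servers / tier1_fan_in)    # 7
tier2_agents = math.ceil(tier1_agents / tier2_fan_in)  # 2
print(tier1_agents, tier2_agents, tier1_agents + tier2_agents)  # 7 2 9
```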
Physical storage requirements
Calculating the size of physical storage (hard drive) required
• Ad server data: 10 terabytes/day
• No. of ad servers = 100
• Data per sec. from each ad server = 10^12 / (24 × 60 × 60 × 100) ≈ 115 KB
• Data to be collected in two hours at this rate = 115 × 60 × 60 × 2 = 828 MB
(Assume the expected resolution time for downstream failures is two hours)
• Increase by a safety margin factor of, say, 1.5: 828 MB × 1.5 = 1,242 MB
• Required File Channel capacity ≈ 1.2 GB
The physical storage capacity requirement is around 1.2 GB.
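A sketch of the file-channel sizing, following the slide’s own working figures (10^12 bytes/day spread over 100 servers, a two-hour downstream-failure window, and a 1.5x safety margin):

```python
bytes_per_day = 10**12        # slide's working figure for daily log volume
ad_servers = 100
per_server_bps = bytes_per_day / (24 * 60 * 60 * ad_servers)  # ~115 KB/s

failure_window_s = 2 * 60 * 60   # two hours to resolve downstream failures
safety_margin = 1.5
channel_bytes = per_server_bps * failure_window_s * safety_margin
print(f"{channel_bytes / 1e9:.2f} GB")  # 1.25 GB; the slide rounds to ~1.2 GB
```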
CPU Requirements
Multiple sources and sinks can be defined on a given agent based on the event batch size. The larger the batch size, the greater the risk of duplication, hence batch size is limited to a maximum of 2,500 events.
Events per sec. per ad server = 10^12 / (1 KB × 24 × 60 × 60 × 100) ≈ 115
For Agent 1:
• Total exit batch size from 16 upstream servers = 16 × 115 = 1,840
• No. of sinks to accommodate 1,840 events = ⌈1840 / 2500⌉ = 1
For Agent 2:
• Receiving a batch of 1,840 events from each of four upstream agents
• No. of sinks = ⌈1840 × 4 / 2500⌉ = 3
Cores = (Sources + Sinks) / 2
For Agent 1, Cores = 1
For Agent 2, Cores = 2
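The sink and core counts follow the same arithmetic; a minimal sketch, assuming one source per agent (implied by the core counts above):

```python
import math

events_per_server_per_sec = 115  # from the storage slide's arithmetic
max_batch = 2500                 # cap chosen to limit duplication risk

def cores(sources: int, sinks: int) -> int:
    return math.ceil((sources + sinks) / 2)

# Tier-1 agent: fan-in of 16 ad servers
t1_batch = 16 * events_per_server_per_sec       # 1840 events
t1_sinks = math.ceil(t1_batch / max_batch)      # 1

# Tier-2 agent: batches of 1840 from each of 4 upstream tier-1 agents
t2_sinks = math.ceil(t1_batch * 4 / max_batch)  # 3

print(cores(1, t1_sinks), cores(1, t2_sinks))   # 1 2
```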
Apache Flume Total Hardware Requirements
• 7 single-core machines at $800 each
• 2 dual-core machines at $1,000 each
Total hardware cost = $5,600 + $2,000 = $7,600
Stage 2: Storing the collected data
Traditional solution: Network storage as part of High Performance Computing (HPC) clusters
• Ten times the overhead of commodity hard drives, due to communication requirements within the cluster
• Ten times costlier than commodity hardware, due to specialized features such as redundant storage, high availability etc.
Big Data solution: Hadoop Distributed File System (HDFS)
• Low storage cost per byte compared to alternatives such as Storage Area Networks
• Tuned to deliver fast data for MapReduce workloads, up to 2 gigabytes per second
• Data reliability is a primary design goal, and HDFS is used in production by various organizations
• Uses commodity hardware – lower initial and maintenance cost
• Shares cost with the compute layer, since it is built into the Hadoop kernel
• Linearly scalable in terms of performance and cost, even at very high volumes
Storage Requirements and costs
Traditional solution: HPC network storage
• Network storage used with HPC costs $100,000 per 100 TB of data
• For the ad network’s requirement of 14 petabytes, cost = $14M
• On moving away from this architecture, there would be a salvage value of 60% of this hardware.
Big Data solution: HDFS on commodity hardware
• 10 TB per day is 30 TB of physical space (3x replication factor); with a 30% overhead for MapReduce jobs’ local space (10 × 3 × 1.30), this is 39 TB of physical space per day
• 1.65 hosts per day’s worth of data (24 TB per host; see configuration below)
• For a 1-year retention, storage required = 39 terabytes × 365 ≈ 14 petabytes
• ~600 hosts
• 600 hosts × $5,000 per host = $3,000,000
Commodity hardware server configuration:
Chipset: 4 × 6-core Intel Xeon 3GHz
Memory: 32GB
Operating System: Red Hat Enterprise Linux 5
Network: 2 Gbps (bonded Network Interface Card)
Disk Space: 2TB × 12 JBOD (Just a Bunch of Disks)
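The HDFS sizing above as a short sketch; the 24 TB per host follows from the 2TB × 12 JBOD configuration:

```python
import math

raw_tb_per_day = 10
replication = 3
mr_overhead = 1.30   # extra local space for MapReduce jobs
physical_tb_per_day = raw_tb_per_day * replication * mr_overhead  # 39 TB

host_tb = 2 * 12     # 2 TB x 12 JBOD per host (configuration above)
hosts_per_day = physical_tb_per_day / host_tb   # ~1.625, slide says 1.65

retention_days = 365
total_pb = physical_tb_per_day * retention_days / 1000  # ~14.2 PB
hosts = math.ceil(hosts_per_day * retention_days)       # 594, slide rounds to ~600
print(total_pb, hosts, hosts * 5000)  # ~14 PB, ~600 hosts, ~$3M
```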
Stage 3: Data processing and preparation
Traditional solution: Scripts (e.g. in the Perl scripting language) on High Performance Compute hardware
Big Data solution: Hadoop MapReduce
Benefits of Hadoop MapReduce over Perl on HPC hardware
• Scalable to thousands of nodes, shared-nothing architecture
• Abstracts the complexity of distributed programming
• Reduces human resource cost to 0.5x
• High availability, fault tolerance
• Abstracts cluster functions
• High performance, especially for one-time processing of unstructured data
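To illustrate the paradigm, a minimal map/shuffle/reduce-style computation in plain Python on hypothetical log lines (counting clicks per publisher); a real deployment would express the same two functions as a Hadoop MapReduce job spread across many nodes:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical raw log lines: "event_type,publisher_id"
logs = ["click,pub1", "impression,pub1", "click,pub2", "click,pub1"]

# Map: emit a (publisher, 1) pair for every click event
mapped = [(line.split(",")[1], 1) for line in logs if line.startswith("click")]

# Shuffle: group intermediate pairs by key (Hadoop does this between phases)
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts for each publisher
counts = {pub: sum(n for _, n in grp)
          for pub, grp in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'pub1': 2, 'pub2': 1}
```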
Hardware costs for Data Preparation and Processing
Traditional solution:
• 10 TB/day ≈ 121 MB/sec
• Average throughput per node for analytics workload = 1 MB/s
• Desired aggregate throughput = 121 MB/s
• No. of nodes required ≈ 120
• Cost = 120 nodes × $5,000 per node = $600,000
Big Data solution:
• 10 TB/day ≈ 121 MB/sec
• Average throughput per node for analytics workload = 10 MB/s
• Desired aggregate throughput = 121 MB/s
• No. of nodes required ≈ 12
• Cost = 12 nodes × $5,000 per node = $60,000
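The throughput-based node sizing, sketched below; exact ceilings give 122 and 13 nodes, which the slide rounds to 120 and 12:

```python
import math

mb_per_sec = 10 * 1024**2 / (24 * 60 * 60)  # 10 TB/day ~ 121 MB/s aggregate

def nodes_and_cost(per_node_mb_s, node_price=5000):
    nodes = math.ceil(mb_per_sec / per_node_mb_s)
    return nodes, nodes * node_price

print(nodes_and_cost(1))   # traditional HPC + Perl: (122, 610000); slide: 120, $600K
print(nodes_and_cost(10))  # Hadoop MapReduce:       (13, 65000);   slide: 12, $60K
```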
Human Resource Cost for Data Preparation and Processing
Traditional solution:
• Complex skillset required to handle distributed computing complexity
• Estimate: 50-person team @ $35,000 per person per year
• Cost: $1,750,000
Big Data solution:
• Simpler skillset required, as complexities are abstracted from the programmers
• Estimate: 50% cost reduction
• Cost: $875,000
Stage 4: Analytics – Reporting, ad hoc and predictive analytics
Traditional solution: Row-based data warehouses with Structured Query Language
Big Data solution: NOSQL column stores
No additional hardware costs and similar human resource costs
• Big Data solutions benefit because schemas can be modified at a later stage, keeping the reports up to date with new types of data.
• Optimized for columnar storage and access, which are the main access patterns in analytics
Quantification of immediate business benefits
1. Increase in ad revenue due to better CTR
Improved ad matching algorithms will more accurately target the ads to the relevant users with the relevant publishers.
• Estimated CTR increase: 5%
• Corresponding increase in publisher’s ad revenue: 5%
• Corresponding increase in ad network’s revenue (50% of publisher’s ad revenue): 5%
• Ad network’s increase in revenue (current revenue $100M): $5M
2. Increase in ad revenue by enabling advertisers to better plan campaigns
Better accuracy in predicting CTR will help advertisers with campaign planning. This will improve CTR, in turn increasing the revenue for publishers and the ad network.
• Estimated CTR increase: 5%
• Corresponding increase in publisher’s ad revenue: 5%
• Corresponding increase in ad network’s revenue (50% of publisher’s ad revenue): 5%
• Ad network’s increase in revenue (current revenue $100M): $5M
Quantification of immediate business benefits (Cont.)
3. Increase in ad revenue due to better campaign optimization
Timely and accurate real-time reports will help advertisers do course correction, further improving CTR and hence ad revenue.
• Estimated CTR increase: 5%
• Corresponding increase in publisher’s ad revenue: 5%
• Corresponding increase in ad network’s revenue (50% of publisher’s ad revenue): 5%
• Ad network’s increase in revenue (current revenue $100M): $5M
4. Increase in ad revenue due to better availability of reports
If the ad network provides better continuity to advertisers, they will be willing to pay a premium.
• Estimated premium payment: 2%
• Corresponding increase in ad network’s revenue: 2%
• Ad network’s increase in revenue (current revenue $100M): $2M
Total increase in ad network’s revenue (1 + 2 + 3 + 4) = $17M
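The benefit quantification reduces to percentage arithmetic on the ad network’s current revenue; a sketch using the slide’s figures:

```python
current_revenue_m = 100  # ad network's current revenue, in $M

# Estimated uplift in the ad network's revenue per benefit (slide's figures)
benefits_pct = {
    "better CTR": 5,
    "better campaign planning": 5,
    "better campaign optimization": 5,
    "better report availability": 2,
}

uplift_m = {name: current_revenue_m * pct / 100 for name, pct in benefits_pct.items()}
print(uplift_m)                # each benefit in $M
print(sum(uplift_m.values()))  # 17.0 -> total benefit of $17M
```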
Value Element Mapping
Points of Parity
• Open source software is available, and the company can customize and enhance it the way it wants.
• Support for the Java programming language, for which it is easy to hire people and further enhance the software, due to an abundantly available talent pool.
Points of Difference
• Simpler skillset required for in-house IT experts in the case of Big Data products.
• Ability to handle all aspects of big data problems, unlike traditional data management products.
• Linear scalability – Big Data products can work with cheaper hardware and scale linearly, making them a future-proof investment.
Points of Contention
• Adoption uncertainty: Although developer community support to maintain and evolve the open source Big Data products is growing very fast due to the buzz, it is unclear whether it will become as strong as that for traditional software.
• Stability of Big Data vendors: The commercial vendors are mostly newly formed companies, though founded by very accomplished people. They are fast gaining traction, but it is unclear whether they can sustain for the long term. Moreover, since pure-play Big Data firms are privately held, their growth and revenues are not clearly known.
Customer Value Model

Big Data products:
• Benefits: $17M
• Cost other than price (Capex + annual) in the first year:
$7,600 (Data collection)
+ $3M (Storage)
+ $60K (Processing)
+ $875,000 (Salaries)
+ $1.5M (Implementation and training)
• Total cost: $5,442,600
• Value = Benefit – Cost = $11,557,400
• Price: Free and open source

Traditional products (Next Best Alternative – NBA):
• Benefits: Status quo with the existing systems
• Cost (already incurred in the existing systems): $14M (Storage) + $600K (Processing) + $1,750,000 (Salaries)
• Total cost: Sunk cost
• Value: No additional value in the existing systems
• Price: Free and open source

Delta(Price) = 0
Value in Use = Delta(Value) – Delta(Price) = $11,557,400
Effective value in use (for migration to Big Data products) = Value in Use + Salvage value of storage and processing (60% × $14.6M = $8.76M) + Salaries saved ($1,750,000) = $22,067,400
The time value of money is ignored, since the cash flows are considered over a short period, i.e. one year.
Framework reference: James C. Anderson, James A. Narus, DVR
Seshadri
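The whole value model can be reproduced in a few lines; a sketch using the cost and benefit figures above, with salvage derived as 60% of the traditional storage and processing hardware:

```python
benefits = 17_000_000  # total quantified benefit from the previous slides

big_data_costs = {
    "data collection": 7_600,
    "storage": 3_000_000,
    "processing": 60_000,
    "salaries": 875_000,
    "implementation and training": 1_500_000,
}
total_cost = sum(big_data_costs.values())  # $5,442,600
value = benefits - total_cost              # $11,557,400

delta_price = 0                            # both alternatives are free and open source
value_in_use = value - delta_price         # $11,557,400

salvage = 0.60 * (14_000_000 + 600_000)    # 60% of HPC storage + processing = $8.76M
salaries_saved = 1_750_000                 # traditional team's annual salary bill
effective_value_in_use = value_in_use + salvage + salaries_saved
print(total_cost, value, int(effective_value_in_use))  # 5442600 11557400 22067400
```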
Value placeholders (less tangible)
Positives
• The Big Data product architecture will be linearly scalable and hence future-proof; future data management requirements can be fulfilled with incremental cost towards buying commodity hardware.
• Customer satisfaction, and hence low customer churn, due to the increased control in advertisers’ hands for managing their advertisements.
• The skillset required for in-house IT experts is simpler in the case of Big Data products, and mostly based on the popular Java technology.
Negatives
• Although the above Big Data products are backed by strong companies and open source communities, these companies and communities are not as strong as the ones behind traditional products.
• The commercial vendors are mostly newly formed companies, albeit founded by very capable people; they are fast gaining traction, but it is unclear whether they can sustain for the long term.
Conclusion
• The above case study builds a clear case for the value proposition of Big Data products
• Big Data products are being used extensively across various industries, and this value model can help build a concrete case for them in other settings as well