Big Data Summit - Google Venice
October 20, 2015
Agenda
2:00 - 2:30  Registration & Welcome
2:30 - 3:30  GCP Big Data Overview by Rohit Khare, Google PM
3:30 - 4:00  Customer Stories - BlueCava & Pixalate
4:00 - 4:30  Panel Discussion, Q&A
4:30 - 5:00  Partner Story, Magnus Unum
5:00 - 6:00  Reception & Networking
Logistics
● Parking behind Chaya Restaurant on Navy Street
● Visitor badges
● Washrooms
● Beverage & food service
● Wireless access “GoogleGuest”
01 GCP Big Data Overview
Rohit Khare, Google PM
deck
Confidential & ProprietaryGoogle Cloud Platform 5
Build. Store. Analyze.
Google Cloud Platform for Big Data
Focus on insights, not infrastructure
Big Data Summit, Los Angeles — October 20, 2015
Rohit Khare, Google Cloud Product Manager
William Vambenepe, Lead Product Manager for Big Data
Build
Connect Visualize Find Access
Cloud Computing
IaaS: Infrastructure-as-a-Service
PaaS: Platform-as-a-Service
SaaS: Software-as-a-Service
Google Cloud Platform
Enterprise Cloud Platform market will exceed $43B globally by 2018.
2013
IT Trends
Affordable capacity: The decreasing cost of storage enables virtually unlimited storage in the cloud. $600 can buy enough storage for the world’s music. (Source: McKinsey Global Institute, May 2011)
On-demand computing: Computing as a utility is now available for easy purchase, provided from massively efficient data centers. (Source: Nicholas Carr, The Big Switch, 2008)
Instant access: The internet allows for a model of real-time access to new innovation, information and applications from a wide range of devices.
Cloud Computing Patterns
Growing Fast
• Successful services need to grow/scale
• Keeping up with growth is a big IT challenge
• Cannot provision hardware fast enough
On and Off
• On & off workloads (e.g. batch jobs)
• Over-provisioned capacity is wasted
• Time to market can be cumbersome
Cloud Computing Patterns
Predictable Bursting
• Services with micro seasonality trends
• Peaks due to periodic increased demand
• IT complexity and wasted capacity
Unpredictable Bursting
• Unexpected/unplanned peak in demand
• Sudden spike impacts performance
• Can’t over-provision for extreme cases
Cloud Economics
10x cost benefit for large scale deployments
[Chart: public cloud vs. private cloud cost, y-axis $0 to $8,000, x-axis 100 to 100,000 servers]
Google Cloud Platform
Google Ecosystem + APIs
• Take advantage of Google’s entire ecosystem of services:
Search
Web analytics
Monetization
App Distribution
Support
Bronze (Free): We provide all of our customers with Bronze support, giving you access to online documentation, community forums, and billing support.
Silver ($150/month): Direct access to our support team for questions related to service functionality, best practice architectures, and service errors.
Gold (starts at $400/month): 24 x 7 phone support, more rapid target initial response times, and consultation on application development and architecture for your specific use case.
Platinum (Contact Sales): The most comprehensive, personal and customized support we offer. Includes everything in Gold support as well as direct access to the Technical Account Management team.
Certifications
Product    SSAE-16 SOC 1  SSAE-16 SOC 2  SSAE-16 SOC 3  ISO 27001  HIPAA (BAA)  PCI DSS v3.0  FISMA             FedRAMP
GAE        Complete       Complete       Complete       Complete   H1 15        Complete      FISMA (Moderate)  H2 15
GCS        Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
GCE        Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
Datastore  Complete       Complete       Complete       Complete   H1 15        Complete      n/a               H2 15
BigQuery   Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
Cloud SQL  Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
Genomics   H1 15          H1 15          H1 15          Complete   H1 15        n/a           n/a               H2 15
Apps       Complete       Complete       Complete       Complete   Complete     n/a           GAFG only         H2 15
Pricing
Philosophy: Pricing should be flexible and easy to understand. You shouldn’t need a PhD to understand prices, and you should get the best price automatically.
Sustained Use Discounts: If you use a Compute Engine VM for more than 25% of a month, you receive discounts automatically.
Per Minute Billing: Compute Engine instances are charged in one-minute increments (with a 10 minute minimum), so you only pay for what you use.
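The per-minute billing rule can be sketched as a small calculation. This is a minimal illustration, assuming a hypothetical hourly rate (`HOURLY_RATE_USD` is not from the slides) and that a partial minute rounds up. Sustained use discounts are left out because the slide gives only the 25% threshold, not the discount schedule.

```python
import math

MINIMUM_BILLED_MINUTES = 10   # from the slide: 10 minute minimum
HOURLY_RATE_USD = 0.06        # hypothetical rate, for illustration only


def billed_minutes(runtime_seconds: float) -> int:
    """Minutes charged for one instance run: one-minute increments, 10 minute minimum.

    Assumption: a partial minute is rounded up to the next whole minute.
    """
    minutes = math.ceil(runtime_seconds / 60)
    return max(minutes, MINIMUM_BILLED_MINUTES)


def run_cost(runtime_seconds: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of one run at the given (hypothetical) hourly rate."""
    return billed_minutes(runtime_seconds) * hourly_rate / 60
```

Under these assumptions, a 2 minute batch job is billed as 10 minutes, while a job that runs 11 minutes and 1 second is billed as 12 minutes.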
For the past 15 years, Google has been building out one of the fastest, most powerful, highest-quality cloud infrastructures on the planet.
Cloud Platform is built on the same infrastructure that powers Google.
Google Innovations in Software
[Timeline, 2002 to 2014: GFS, MapReduce, Bigtable, Dremel, Colossus, Spanner, Dataflow, Kubernetes]
A look inside Google Cloud Platform
Google Cloud Platform: Networking, Compute, Big Data, Management, Storage, Mobile, Developer Tools
Compute: Compute Engine, Container Engine, App Engine
Storage: Cloud Storage, Cloud SQL, Cloud Datastore, Cloud Bigtable
NoSQL SQL Blob Block
Easy-to-use storage options
Cloud Storage
Cloud Storage: Value
• Safe: Redundant storage at multiple physical
locations. OAuth and granular access controls form
strong, configurable security
• Ease of Use: Same APIs as other CGS products
• High Performance: We provide a 99.95% SLA and 24x7
phone support
• Pricing: Pay only for what you use with some of the
lowest prices in the industry
Cloud Storage: Features
• 3 storage options
○ Standard: The highest level of durability,
availability and performance
○ DRA (Durable Reduced Availability): High level of durability,
availability and performance
○ Nearline: High performance data archiving, online
backup, and disaster recovery
Cloud Datastore
Cloud Datastore: Value
• Accessible Anywhere
• Secure Sharing
• Same High Replication Datastore Used By
App Engine Apps Today
• Equally Fast Queries For Any Sized Dataset
• Data is Replicated Across Several Data Centers
• Use From Any Application or Language
• Serving 4.5 Trillion Requests Per Month
Cloud Datastore: Features
• Auto-scale
• Schemaless Access
• SQL-like Capabilities
• Authentication That Just Works
• Fast and Easy Provisioning
• RESTful Endpoints
• ACID Transactions
• Local Development Tools
• Built-in Redundancy
Cloud SQL
• Fully managed
• Ease of Use
• Highly Reliable
• Flexible Charging
• Security, Availability, Durability
• EU and US Data Centers
• Easy Migration & Data Portability
• Control
Cloud Bigtable
Big Data: BigQuery, Cloud Pub/Sub, Cloud Dataflow
Manage the Entire Lifecycle of Big Data
Capture: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub
Store: BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files)
Process: Cloud Dataflow
Analyze: BigQuery Analytics (SQL)
[Diagram: Capture, Store, Process (Batch, Stream), and Analyze stages, with Cloud Datastore, Cloud Monitoring, Cloud Bigtable, Cloud Dataflow, Cloud Dataproc, and real-time analytics and alerts]
BigQuery
BigQuery: Value
● Performance: Ingest data at 100K rows/second
and process real-time queries
● Ease of use: No administration for performance
and scale
● Scale: No need to worry about growing data.
Unlimited storage with pay as you go pricing
model
● Non-technical analysts can drive queries on
massive datasets using BI tools
BigQuery: Features
● Interactive query performance: Query multi-
terabyte datasets in an ad hoc manner
● SQL: Familiar SQL-like query syntax and intuitive
user interface
● Data mashup: Query across diverse datasets
● Highly Available: Data replication in multiple
geographies. Data is available and durable even
in the case of extreme failure modes
● Secure: Access to data is controlled using
customer-owned ACLs
Cloud Pub/Sub
Cloud Pub/Sub: Value
● Brings scalable, flexible, and reliable enterprise
message-oriented middleware to the cloud
● Provides asynchronous messaging, allowing
secure and highly available communication
between independently written applications
● Delivers low-latency, durable messaging that
helps developers quickly integrate systems
hosted on the Google Cloud Platform and
externally
Cloud Pub/Sub: Features
• Unified messaging: Durability and low-latency
delivery in a single product
• Global presence: Connect services located
anywhere in the world
• Flexible delivery options: Both push- and pull-
style subscriptions supported
• Data reliability: Replicated storage and
guaranteed at-least-once message delivery
• Data security and protection: Encryption of data
on the wire and at rest
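Because delivery is at-least-once, a subscriber may see the same message more than once, so handlers are usually written to be idempotent. A minimal in-memory sketch of that idea follows. It does not use the Cloud Pub/Sub client library, and `Message`, `DedupingSubscriber`, and the `message_id` field are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass(frozen=True)
class Message:
    message_id: str   # hypothetical unique ID assigned at publish time
    data: str


@dataclass
class DedupingSubscriber:
    """Wraps a handler so redelivered messages are processed only once."""
    handler: Callable[[Message], None]
    seen_ids: Set[str] = field(default_factory=set)

    def receive(self, message: Message) -> bool:
        """Return True to acknowledge; duplicates are acknowledged without reprocessing."""
        if message.message_id in self.seen_ids:
            return True
        self.handler(message)
        # Record the ID only after the handler succeeds, so a failure leads to a retry.
        self.seen_ids.add(message.message_id)
        return True


processed: List[str] = []
subscriber = DedupingSubscriber(handler=lambda m: processed.append(m.data))
# Simulate at-least-once delivery: message "1" arrives twice.
for msg in [Message("1", "click"), Message("2", "view"), Message("1", "click")]:
    subscriber.receive(msg)
```

In a real deployment the set of seen IDs would need to live in a durable store shared by all subscriber instances; the in-memory set here only demonstrates the pattern.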
Cloud Dataflow
Cloud Dataflow: Value
• Reduce cost of processing large datasets
• Save time: Automatically optimizes data-centric
pipeline code by collapsing multiple logical passes
into a single execution pass
• Increase efficiencies: Fully manages the lifecycle of
required compute resources
• Simple: Dataflow makes it easy to write data-
processing pipelines that incorporate both batch
and stream-processing capabilities and is
language-agnostic
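The point about collapsing multiple logical passes into a single execution pass can be illustrated without the Dataflow SDK. The sketch below is plain Python, not Dataflow's API: `fuse` and the step functions are hypothetical, and it only shows why composing element-wise steps lets the data be traversed once instead of once per step.

```python
from functools import reduce
from typing import Callable, Iterable, List

Step = Callable[[int], int]


def run_unfused(data: Iterable[int], steps: List[Step]) -> List[int]:
    """One full pass over the data per logical step, with an intermediate list each time."""
    current = list(data)
    for step in steps:
        current = [step(x) for x in current]
    return current


def fuse(steps: List[Step]) -> Step:
    """Compose element-wise steps into one function applied in a single pass."""
    return lambda x: reduce(lambda acc, step: step(acc), steps, x)


def run_fused(data: Iterable[int], steps: List[Step]) -> List[int]:
    """A single pass over the data, applying the composed step to each element."""
    fused = fuse(steps)
    return [fused(x) for x in data]


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
```

Both functions produce the same result; the fused version avoids materializing an intermediate collection between steps.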
Cloud Dataflow: Features
• Unified programming model for both batch and
stream-based data analysis
• Managed scaling: Manages the lifecycle of
required compute resources
• Reliable & consistent processing: Built-in support
for fault-tolerant execution
• Monitoring: Provides lifecycle statistics including
in-flight information such as real-time pipeline
throughput, real-time step lag, and real-time
worker log inspection
Cloud Dataproc
Reduce Time to Understanding
Typical Big Data Processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling growing scale, Utilization improvements
Big Data with Google: Programming
Focus on insight, not infrastructure
Traditional Big Data = Big Problems
Hurdles to innovate and iterate with Big Data
• Continuously accommodating greater data volumes and new data sources
• Capture and store all data for all business functions
• Complexity of building and maintaining a Big Data system with consistent ease of use
• Reducing the time from data collection to action
• Managing the cost of the data platform
• Keep the system reliable and running
• Keep your data secure
• Collaboration within or across organizations
Google Cloud Platform
[Diagram: six-step pipeline: Data Collection, ETL, Raw Data Storage, Aggregation, Analytics Storage, Visualization; built on Google Compute and App Engine (scalable VMs), Google Cloud Storage, and Google BigQuery (TBs of data, process in seconds); outputs to interactive dashboards + apps, BI tools, and Google Spreadsheets]
[Diagram: six-step pipeline: Collection, Transformation, Data processing, Cleansing, Serve Analytics; Raw Data Storage, Staging, BigQuery Staging, BigQuery Aggregate; AdHoc Queries, REST API]
Google confidential │ Do not distribute
Overview:
Data to process: Data in the Consolidated Audit Trail (CAT), a data repository of all equities and options orders, quotes, and events
Challenges:
How to process the CAT and organize 100 billion market events into an “order lifecycle” in a 4 hour window
Store 6 years (~30 PB) of data
Cloud Bigtable to process and run queries and tolerate volume increases
6 billion market events written per hour; 1.7 gigabytes per second; 6 terabytes per hour
Bursts: 10 billion written per hour; 1.7 gigabytes per second; 10 terabytes per hour
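Some of the figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 1,000 GB) and a perfectly even rate across the window; both are assumptions made for illustration, not statements from the slide.

```python
SECONDS_PER_HOUR = 3600
GB_PER_TB = 1000  # assumption: decimal units


def tb_per_hour(gb_per_second: float) -> float:
    """Convert a sustained write rate in GB/s to TB/hour."""
    return gb_per_second * SECONDS_PER_HOUR / GB_PER_TB


def events_per_hour(total_events: float, window_hours: float) -> float:
    """Average hourly rate needed to process total_events within the window."""
    return total_events / window_hours


# 1.7 GB/s sustained is about 6.12 TB/hour, in line with the 6 TB per hour figure.
sustained_tb_per_hour = tb_per_hour(1.7)
# 100 billion events in a 4 hour window averages 25 billion events/hour.
required_events_per_hour = events_per_hour(100e9, 4)
```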
Overview:
Data to process: Standard game KPIs, marketing data, custom game insight
Several dozen gigabytes of raw logs per day
Challenges:
Struggled to process large volume of data
Long delays between triggering logs and querying data; problematic for games running live events
Issues controlling permissions
Long-running queries, clunky analysis
“BigQuery has helped us focus on actually using data instead of exhausting ourselves just trying to get to the data.”
Crunch 150 GB of data in 15 seconds
Instant log ingestion
Scale without clogging the system
Flexibility on permission controls
Snapchat sends 700 million photos and videos each day
“App Engine enabled us to focus on developing the application. We wouldn’t have gotten here without the ease of development that App Engine gave us.”
Bobby Murphy, CTO
Google App Engine scaled seamlessly during growth to millions of users
Small team is able to innovate quickly and expand globally
Big Data Partner Ecosystem
Chartio
cloud.google.com
02 Customer Story - BlueCava
Reza Qorbani, CTO
deck
BLUECAVA, INC. / 2015
CROSS SCREEN STARTS HERE
BLUECAVA
Business / Product / Challenges
INTRODUCTION
Reza Qorbani
CTO @ BlueCava
• Worked with the Google Big Data team over the past 1.5 years
• Moved from 100% private cloud to a hybrid environment
• Deep integration with BigQuery
Email: reza.qorbani@bluecava.com
Twitter: @qorbani
ABOUT – BlueCava
Open Network that Optimizes Cross-Screen Marketing
[Diagram: Association Graph and Real-time Intelligence at the center, connected to AdTech platforms (display, mobile, video, exchange, social), DataTech platforms (validation, demographic, location, exchange, coverage), and MarTech platforms & services]
ABOUT – Association Graph
[Diagram: a household linked to Consumer A, Consumer B, and Consumer C, with device identifiers IDFA, APN, and BCID]
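One way to picture the association graph is as a two-level mapping from household to consumers to device identifiers. The sketch below is only an illustration of that shape, not BlueCava's implementation; the class name, method names, and all ID values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


class AssociationGraph:
    """Toy household -> consumer -> device-ID graph (hypothetical structure)."""

    def __init__(self) -> None:
        self.consumers_by_household: Dict[str, Set[str]] = defaultdict(set)
        self.devices_by_consumer: Dict[str, Set[str]] = defaultdict(set)
        self.consumer_by_device: Dict[str, str] = {}
        self.household_by_consumer: Dict[str, str] = {}

    def link(self, household: str, consumer: str, device_id: str) -> None:
        """Associate a device ID with a consumer, and the consumer with a household."""
        self.consumers_by_household[household].add(consumer)
        self.devices_by_consumer[consumer].add(device_id)
        self.consumer_by_device[device_id] = consumer
        self.household_by_consumer[consumer] = household

    def household_devices(self, device_id: str) -> Set[str]:
        """All device IDs in the same household as device_id (empty if unknown)."""
        consumer: Optional[str] = self.consumer_by_device.get(device_id)
        if consumer is None:
            return set()
        household = self.household_by_consumer[consumer]
        devices: Set[str] = set()
        for member in self.consumers_by_household[household]:
            devices |= self.devices_by_consumer[member]
        return devices


graph = AssociationGraph()
graph.link("household-1", "consumer-a", "idfa:0001")   # made-up identifiers
graph.link("household-1", "consumer-a", "bcid:0002")
graph.link("household-1", "consumer-b", "apn:0003")
graph.link("household-2", "consumer-c", "idfa:0004")
```

Looking up any one device then returns every device in the same household, which is the kind of cross-screen lookup the slide's diagram suggests.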
ABOUT – Coverage
100M households
240M consumers
600M devices
ABOUT – Volume
5 TB daily: daily raw logs
250k req/sec: from partners and exchanges
1.3 petabytes: total storage
25 billion IDs: including our partner IDs
ABOUT – Challenge
Cost
− Bandwidth Cost
− Storage Cost
− Infrastructure Cost
− Operation Cost
Flexibility
− Easily run Ad-Hoc queries
− Handle lots of POCs
− Flexible to Change
− Unified Data Store
Delivery
− Generate data for customers
− Multiple extractions at a time
− Keep data for months
− Highly Available
ARCHITECTURE
BlueCava Platform Overview / Before / Now / Future!
ARCHITECTURE – BlueCava Platform Overview
[Diagram: CORE, INTERNAL, and CUSTOMER platforms; components EDGEX, BIDDER, OPERATIONS, QUALITY, API, PORTAL, METADATA, PREPARE, LOGGING, AGGREGATE, FILTER, DETECTOR; stages TRANSFER / PREPARE, PROCESS / ASSOCIATION, ANALYZE / REPORT; stores AG, AE, DB]
ARCHITECTURE – Before
[Diagram: two data centers, WEST (IRVINE) and EAST (ASHBURN), each with a CORE platform, plus CUSTOMER and INTERNAL platforms, BACKUP / DR, Geographic Load Balancing, and XDC NET]
ARCHITECTURE – Before / Challenges
Cost
Estimate of $1.5M upfront to scale up
High Monthly Bandwidth cost
Need to Extend Operation team
Scalability
Performance
Storage
Complexity
Resource Limitations
Datacenter Issue with Traffic spikes
Need to scale down after POC finishes
Some processes took more than a day
Customer delivery takes 5-10 hours
Ad-Hoc queries taking hours
Need more historical data to increase quality
Need to keep customer data for months
Deliver large amount of data to customers
Simple Tasks Require Data Engineering Expertise
Customizing Data Output was hard
Data Scientists need meaningful data set
QA/Dev Environment Separation
Ad-Hoc queries create issue for production
ARCHITECTURE – Before / Solution
BigQuery
▪ Big Data as a Service
▪ Extremely cost effective for our use-case
▪ Support Hierarchical Data Model
▪ Extremely fast
▪ Query using SQL
▪ Solve most of our Big Data challenges
▪ Fraction of cost (It was Unbelievable)
▪ Customer Delivery in Seconds!!!
▪ We dropped Delivery Spark Cluster (10 nodes)
▪ We dropped Ad-Hoc Hadoop Cluster (100x nodes)
▪ Offload ALL Customer Facing Jobs
▪ Only 2 Sprints Development (6 Weeks)
ARCHITECTURE – Before / Solution
Cloud Storage
▪ Nice integration with BigQuery
▪ No file size limit, unlike S3
▪ HDFS Integration using Hadoop Connector
▪ Seamless Cost Saving: DRA and Nearline
▪ Solved most of our Storage challenges
▪ Simplified our file delivery
▪ Extremely competitive pricing
▪ No need for Backup ☺
ARCHITECTURE – Before / Solution
Compute Engine
▪ Great Sustained Pricing
▪ No need for long-term contract
▪ Simple CLI for Automation
▪ BDUtil Library for Hadoop
▪ Elastic Environment which saved us on Cost
▪ 100+ node Hadoop cluster in under 6 minutes
▪ Use as On-Demand Resource as needed
▪ Stop purchasing more hardware!
ARCHITECTURE – Now
[Diagram: WEST (IRVINE) data center with CORE, CUSTOMER, and INTERNAL platforms, linked by Interconnect to Google Cloud Platform (Cloud Storage, BigQuery), with Simple DNS]
ARCHITECTURE – Future!
Cost: Move all in Cloud
Scalability: World-wide Coverage
Performance: Real-time Association
Simplify: Data Science Lab
Container Engine, Dataproc, Dataflow, Datalab
ARCHITECTURE – Future!
[Diagram: CORE, REALTIME PROCESS, BATCH PROCESS, ASSOCIATION GRAPH, STORAGE, QUERY, LAB, INTERNAL, CUSTOMER]
THANK YOU
02 Customer Story - Pixalate
Amin Bandeali, CoFounder & CTO
deck
Amin Bandeali, Founder & CTO
Pixalate, Inc.
Agenda
● What is Pixalate?
● My Role @ Pixalate
● Pixalate Breadth and Depth
● What is Ad Fraud and why is it important to solve?
● Challenges
● Ad Fraud
● Real World BigQuery Use Cases
● Conclusion
Our Mission
To Rate the Whole Internet…
...and YES we also see what Google doesn’t see!
What is Pixalate?
Pixalate is a de facto Ratings Standard for Programmatic Advertising.
SellerTrustIndex.com
My Role @ Pixalate
● Co-Founder, CTO and Solution Architect
● Real-Time Data Junkie - Contributed to Apache Hadoop Project
● Largest AWS DynamoDB user upon launch - not using it anymore!
● Largest AWS SQS user - not using it anymore!
● Pixalate backend runs Java, NodeJS, Redis, Solr, S3 and BigQuery
● Declined 25,000 free hours of AWS Redshift!
● 70% of Pixalate technology runs on AWS -- 30% on BigQuery
● We move 2TB of data from AWS to Google Storage just for BigQuery
Challenges
Process 1+ trillion ad transactions/month
Process up to 3 PB/month
Analyze Massive amounts of Data to detect fraud
Create customized reports with NO engineering support!
Close to 1 Trillion rows of data in BigQuery
What is Ad Fraud?
Ad Fraud against AdMob and McDonald's
Our Realtime Fraud Map
http://www.pixalate.com/map
What’s wrong with this data?
A day in the life of Data Science Team
● An account manager asks the data science team for a customized report for
a client that measures specific metrics over the last 6 months of their data.
● Solution 1: AWS EMR - Boring and takes Hours!
○ The big data engineers execute an EMR (Hive) job that extracts the data and creates the
report
● Solution 2: BigQuery - Fun and takes Seconds!
○ The data science team implements a (usually complex) query that calculates all the metrics in
SQL
○ BigQuery processes a couple of TB of data and creates the report in a few seconds.
Bypassing the Engineers!
● We need to expand a list of 500,000 network addresses in CIDR format (e.g.
128.0.0.1/24) to regular IP format and use them in client reports
● Solution 1: Java
○ provide the Java engineers with the requirements
○ wait for implementation completion
○ wait for UAT and Production push
○ store the data in a database
■ total time ~3 workdays (in Startup Timezone)
● Solution 2: BigQuery
○ the data science team writes a query with 25+ table JOINs and UNIONS that takes care of the
expansion in a clean, easy to test way, and runs it in BigQuery
■ total time ~3 hours
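For comparison, the CIDR expansion itself is small when written with Python's standard `ipaddress` module. This is a sketch of the expansion step only, not the BigQuery query the team wrote; `expand_cidr` is a hypothetical helper, and `strict=False` is used because the slide's example (128.0.0.1/24) has host bits set.

```python
import ipaddress
from typing import Iterator, List


def expand_cidr(cidr: str) -> Iterator[str]:
    """Yield every IPv4 address in the block, including network and broadcast addresses."""
    # strict=False accepts inputs such as 128.0.0.1/24 and masks off the host bits.
    network = ipaddress.ip_network(cidr, strict=False)
    for address in network:
        yield str(address)


addresses: List[str] = list(expand_cidr("128.0.0.1/24"))
```

A /24 block expands to 256 addresses, so a list of 500,000 blocks can grow very large; that growth is one reason to run the expansion where the data already lives.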
From Waste Picking to Innovation
● The amount of digital data in the universe is growing at
an exponential rate, doubling every two years, and
changing how we live in the world.
○ YET only 0.5% of that data is analyzed!
● If you can’t mine these data easily and extract semantics,
○ then how is data collection different than waste-picking???
● BigQuery enables Innovation
○ It breaks the dependency between data scientists and big-data
engineers
○ Now data scientists can write complex queries and analyse massive
amounts of data without the need of any backend coding (e.g. Java),
or some other big data framework
○ It enables the deep understanding of complex data and their
Cost reduction using BigQuery
● Complex data processing pipelines impose a new cost optimization
challenge
● Main questions to be answered:
○ Where do I store the data I collect?
○ Where/How do I aggregate the data I collect?
○ How do I enhance the data I collect with other metadata?
○ How do I process the data collected?
■ such that the overall cost is minimized??
● BigQuery can HELP!
BigQuery Health Monitoring Using BigQuery
But Wait!
Here’s the real benefit...
Zero Cost Queries Over Petabytes!
● How can you query PETABYTES of historical data and create time series to
detect traffic anomalies (e.g. network failures, etc)?
● BigQuery Zero Cost queries (a.k.a. table metadata)
○ can give you the big picture regarding table’s data health
■ within seconds
■ without having to run any costly queries
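A sketch of how table metadata could feed such a health check. It assumes the per-table row counts have already been fetched from BigQuery's table metadata (for example, the `__TABLES__` meta-table exposes `table_id` and `row_count`); the fetching step is omitted, and the sample rows, the `events_YYYYMMDD` naming, and the 50% threshold are hypothetical.

```python
from statistics import median
from typing import Dict, List


def flag_anomalies(row_counts: Dict[str, int], tolerance: float = 0.5) -> List[str]:
    """Return table IDs whose row count deviates from the median by more than tolerance."""
    if not row_counts:
        return []
    typical = median(row_counts.values())
    flagged = []
    for table_id, count in sorted(row_counts.items()):
        if typical and abs(count - typical) / typical > tolerance:
            flagged.append(table_id)
    return flagged


# Hypothetical metadata for daily tables: table_id -> row_count.
sample = {
    "events_20151016": 1_000_000,
    "events_20151017": 1_050_000,
    "events_20151018": 120_000,    # e.g. a partial load or network failure
    "events_20151019": 990_000,
    "events_20151020": 1_020_000,
}
```

Because only metadata is read, a check like this can cover a long history of tables without scanning their contents.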
suspicious activity
BigQuery Success is all about the Architecture
Spend a LOT of time on Table Schemas (hint: keep them flat)
Learnings
● BigQuery has its gotchas!
○ The wrong Sharding strategy can slow you down
○ Know your Quotas well -- they will haunt you!
○ Balance the table JOINs appropriately
○ Don’t use ORDER BY unless it’s mandatory
○ Avoid “SELECT *” queries on “fat” tables over long time ranges
● Secret recipe
○ push as much complexity as possible to BigQuery using advanced queries
■ usually > 100 lines of SQL code
○ use backend languages (e.g. Java) to simply orchestrate the data pipeline
○ don’t be scared of data duplication -- storage cost is much cheaper than analysis cost!
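Two of the gotchas above (sharding, and `SELECT *` over long time ranges) can be illustrated with a small query builder. This is a sketch under assumptions: the `events_YYYYMMDD` daily-shard naming, the column names, and the helper functions are all hypothetical, and the generated text is generic SQL-like syntax rather than a statement about BigQuery's exact dialect.

```python
from datetime import date, timedelta
from typing import List


def daily_shards(prefix: str, start: date, end: date) -> List[str]:
    """Names of the daily shard tables covering start..end inclusive."""
    days = (end - start).days
    return [f"{prefix}_{(start + timedelta(days=i)):%Y%m%d}" for i in range(days + 1)]


def build_query(columns: List[str], prefix: str, start: date, end: date) -> str:
    """Select only the named columns from only the shards in the date range."""
    if not columns:
        raise ValueError("name the columns explicitly instead of using SELECT *")
    selects = [
        f"SELECT {', '.join(columns)} FROM {table}"
        for table in daily_shards(prefix, start, end)
    ]
    return "\nUNION ALL\n".join(selects)


query = build_query(["ad_id", "impressions"], "events", date(2015, 10, 19), date(2015, 10, 20))
```

Keeping the column list and the date range explicit limits how much data each query touches, which is the intent behind both pieces of advice.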
Q&A
Amin Bandeali
p: 888.749.2528
m: 714.757.9544
e: mab@pixalate.com
t: http://twitter.com/aminbandeali
Panel Q&A
Rohit Khare, GCP Big Data PM
Reza Qorbani, BlueCava CTO
Amin Bandeali, Pixalate CoFounder & CTO
04 Partner Story - Magnus Unum
Rajesh Babu, BI, Big Data & Analytics solutions Architect
Subash D'Souza, Big Data Evangelist
deck
Magnus Unum
Raj Babu & Subash D’Souza
Modern BI & Big Data platform
with Google Cloud
Magnus Unum…what we do
We are an LA-based Big Data, Data Science & Analytics consulting services firm
specialized in advising our clients on strategy, road map/blueprint,
implementation, deployment, and maintenance/support/operations for their Big Data,
Magnus Unum…Leadership
• Raj Babu
  • Co-Founder, Magnus Unum
  • Founder, Agile iSS
  • 20 years of experience in the BI & Analytics field
  • Worked on numerous, very large BI migration and integration projects
• Subash D’Souza
  • Over 10 years of experience in building scalable solutions for various enterprise companies
  • Organizer for several LA User Groups including Big Data, Apache Spark & Apache HBase
Magnus Unum - Key Services
• Architect, Design & Build Big Data Solutions
• Cloud Migration services for Big Data,
Analytics & BI
• Big Data Engineering & Staffing
• Big Data managed & support services
• Data Science Solutions & Services
Magnus Unum – Expertise
• On-Prem
Cloudera, Hortonworks, IBM, Pivotal & MapR
• Cloud
Google Cloud, Amazon AWS & Microsoft Azure
• Analytics/ Reporting
Tableau, MicroStrategy, SAP BO, Qlik &
Pentaho
Why Google Cloud Platform?
Use Case 1 – Migrating your Data
Warehouse and BI to Google Cloud
• Capture / Migrate or Capture
• Storage / Data Management
• Data Processing
• Query/Analytics
• Data Integration
• Access Control
Use Case 2 – Google Analytics detailed
analysis
• Limitation in Google Analytics daily export
• More detailed analysis available as part of
Google Cloud Platform (must have Premium
access)
• Can analyze granular level details of User
Interaction on websites and aggregate the
Please reach out to us for a free
Consultation & Assessment of
your BI, Big Data & Analytics
needs
& additional $500 in GCP credits!
Q&A
Contact
Raj@MagnusUnum.com
Subash@MagnusUnum.com
@sawjd22
https://www.linkedin.com/in/sawjd
cloud.google.com/free-trial
Questions? google-cloud-sw-sales@google.com
Get $300 in credit to use for 60 days.

Introduction-to-Cloud-ComputingFinal.pptx
Galatica Smart Energy Infrastructure Startup Pitch Deck
Business Ppt On Nestle.pptx huunnnhhgfvu
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
Fluorescence-microscope_Botany_detailed content
Foundation of Data Science unit number two notes
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
.pdf is not working space design for the following data for the following dat...
1_Introduction to advance data techniques.pptx
Introduction to Knowledge Engineering Part 1
Computer network topology notes for revision
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Clinical guidelines as a resource for EBP(1).pdf
Chapter 3 METAL JOINING.pptnnnnnnnnnnnnn
Introduction to Business Data Analytics.

Google cloud big data summit master gcp big data summit la - 10-20-2015

  • 1. Big Data Summit - Google Venice October 20, 2015
  • 2. Agenda: 2:00 – 2:30 Registration & Welcome; 2:30 – 3:30 GCP Big Data Overview by Rohit Khare, Google PM; 3:30 – 4:00 Customer Stories - BlueCava & Pixalate; 4:00 – 4:30 Panel Discussion, Q&A; 4:30 – 5:00 Partner Story, Magnus Unum; 5:00 – 6:00 Reception & Networking
  • 3. Logistics ● Parking behind Chaya Restaurant on Navy Street ● Visitor badges ● Washrooms ● Beverage & food service ● Wireless access “GoogleGuest”
  • 4. 01 GCP Big Data Overview Rohit Khare, Google PM deck
  • 5. Confidential & ProprietaryGoogle Cloud Platform 5 Build. Store. Analyze. Google Cloud Platform for Big Data Focus on insights, not infrastructure Big Data Summit, Los Angeles — October 20, 2015 Rohit Khare, Google Cloud Product Manager William Vambenepe, Lead Product Manager for Big Data
  • 8. Build Connect Visualize Find Access
  • 9. Cloud Computing: IaaS (Infrastructure-as-a-Service), PaaS (Platform-as-a-Service), SaaS (Software-as-a-Service). Google Cloud Platform
  • 10. Enterprise Cloud Platform market will exceed $43B globally by 2018. 2013
  • 11. IT Trends. Affordable capacity: The decreasing cost of storage enables virtually unlimited storage in the cloud. $600 can buy enough storage for the world’s music. (Source: McKinsey Global Institute May 2011) On-demand computing: Computing as a utility is now available for easy purchase, provided from massively efficient data centers. (Source: Nicholas Carr, The Big Switch, 2008) Instant access: The internet allows for a model of real-time access to new innovation, information and applications from a wide range of devices.
  • 12. Cloud Computing Patterns. Growing Fast: • Successful services need to grow/scale • Keeping up w/ growth is a big IT challenge • Cannot provision hardware fast enough. On and Off: • On & off workloads (e.g. batch job) • Over provisioned capacity is wasted • Time to market can be cumbersome
  • 13. Cloud Computing Patterns. Predictable Bursting: • Services with micro seasonality trends • Peaks due to periodic increased demand • IT complexity and wasted capacity. Unpredictable Bursting: • Unexpected/unplanned peak in demand • Sudden spike impacts performance • Can’t over provision for extreme cases
  • 14. Cloud Economics: 10x cost benefit for large scale deployments [Chart: cost ($0 to $8,000) versus deployment size (100 to 100,000 servers), public cloud vs. private cloud]
  • 15. Google Ecosystem + APIs • Take advantage of Google’s entire ecosystem of services: Search, Web analytics, Monetization, App Distribution
  • 16. Support. Bronze (Free): We provide all of our customers with Bronze support giving you access to online documentation, community forums, and billing support. Silver ($150/month): If you want direct access to our support team for questions related to service functionality, best practice architectures, and service errors. Gold (starts at $400/month): If you want 24 x 7 phone support, more rapid target initial response times and consultation on application development, and architecture for your specific use case. Platinum (Contact Sales): If you want the most comprehensive, personal and customized support we offer. Includes everything in Gold support as well as direct access to the Technical Account Management team.
  • 17. Certifications
      Product | SSAE-16 SOC 1 | SSAE-16 SOC 2 | SSAE-16 SOC 3 | ISO 27001 | HIPAA (BAA) | PCI DSS v3.0 | FISMA | FedRamp
      GAE | Complete | Complete | Complete | Complete | H1 15 | Complete | FISMA (Moderate) | H2 15
      GCS | Complete | Complete | Complete | Complete | Complete | Complete | n/a | H2 15
      GCE | Complete | Complete | Complete | Complete | Complete | Complete | n/a | H2 15
      Datastore | Complete | Complete | Complete | Complete | H1 15 | Complete | n/a | H2 15
      Big Query | Complete | Complete | Complete | Complete | Complete | Complete | n/a | H2 15
      Cloud SQL | Complete | Complete | Complete | Complete | Complete | Complete | n/a | H2 15
      Genomics | H1 15 | H1 15 | H1 15 | Complete | H1 15 | n/a | n/a | H2 15
      Apps | Complete | Complete | Complete | Complete | Complete | n/a | GAFG only | H2 15
  • 18. Pricing. Philosophy: Pricing should be flexible and easy to understand. You shouldn’t need a PhD to understand prices, and you should get the best price automatically. Sustained Use Discounts: If you use a Compute Engine VM for more than 25% of a month, you receive discounts automatically. Per Minute Billing: Compute Engine instances are charged in one-minute increments (with a 10-minute minimum), so you only pay for what you use.
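The per-minute billing rule on slide 18 can be worked through with a little arithmetic. The sketch below is illustrative only: it models just the two rules stated on the slide (one-minute increments, 10-minute minimum), assumes partial minutes round up, and uses a made-up hourly rate. The sustained-use discount schedule is not given in the deck, so it is left out.

```python
import math

MIN_BILLED_MINUTES = 10  # 10-minute minimum, as stated on the slide


def billed_minutes(runtime_seconds: float) -> int:
    """Round runtime up to whole minutes (assumption), then apply the minimum."""
    if runtime_seconds <= 0:
        return 0
    return max(MIN_BILLED_MINUTES, math.ceil(runtime_seconds / 60))


def instance_cost(runtime_seconds: float, hourly_rate_usd: float) -> float:
    """Cost of one run at a hypothetical hourly rate (not a real GCP price)."""
    return billed_minutes(runtime_seconds) * hourly_rate_usd / 60


if __name__ == "__main__":
    for seconds in (90, 601, 3600):
        print(seconds, billed_minutes(seconds), round(instance_cost(seconds, 0.06), 4))
```

Under these assumptions a 90-second run is billed as 10 minutes, while a run of 10 minutes and 1 second is billed as 11.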
  • 19. For the past 15 years, Google has been building out one of the fastest, most powerful, highest-quality cloud infrastructures on the planet.
  • 20. Cloud Platform is built on the same infrastructure that powers Google.
  • 21. Google Innovations in Software (timeline, 2002 to 2014): GFS, MapReduce, Big Table, Dremel, Colossus, Spanner, Dataflow, Kubernetes
  • 22. A look inside Google Cloud Platform
  • 23. Google Cloud Platform: Compute, Storage, Networking, Big Data, Management, Mobile, Developer Tools
  • 24. Google Cloud Platform: Compute. Compute Engine, Container Engine, App Engine
  • 25. Google Cloud Platform: Storage. Cloud Storage, Cloud SQL, Cloud Datastore, Cloud BigTable
  • 26. Easy-to-use storage options: NoSQL, SQL, Blob, Block
  • 27. Cloud Storage
  • 28. Cloud Storage: Value • Safe: Redundant storage at multiple physical locations. OAuth and granular access controls form strong, configurable security • Ease of Use: Same APIs as other CGS products • High Performance: We provide a 99.95% SLA and 24x7 phone support • Pricing: Pay only for what you use with some of the lowest prices in the industry
  • 29. Cloud Storage: Features • 3 storage options ○ Standard: The highest level of durability, availability and performance ○ DRA: High level of durability, availability and performance ○ Nearline: High performance data archiving, online backup, and disaster recovery
  • 30. Cloud Datastore
  • 31. Cloud Datastore: Value • Accessible Anywhere • Secure Sharing • Same High Replication Datastore Used By App Engine Apps Today • Equally Fast Queries For Any Sized Dataset • Data is Replicated Across Several Data Centers • Use From Any Application or Language • Serving 4.5 Trillion Requests Per Month
  • 32. Cloud Datastore: Features • Auto-scale • Schemaless Access • SQL-like Capabilities • Authentication That Just Works • Fast and Easy Provisioning • RESTful Endpoints • ACID Transactions • Local Development Tools • Built-in Redundancy
  • 33. Cloud SQL
  • 34. Cloud SQL • Fully managed • Ease of Use • Highly Reliable • Flexible Charging • Security, Availability, Durability • EU and US Data Centers • Easy Migration & Data Portability • Control
  • 35. Cloud BigTable
  • 36. Google Cloud Platform: Big Data. Big Query, Cloud Pub/Sub, Cloud Dataflow
  • 37. Manage the Entire Lifecycle of Big Data: Store, Analyze, Process, Capture
  • 38. Manage the Entire Lifecycle of Big Data Cloud Logs Google App Engine Google Analytics Premium Cloud Pub/Sub BigQuery Storage (tables) Cloud Bigtable (NoSQL) Cloud Storage (files) Cloud Dataflow BigQuery Analytics (SQL) Capture Store Analyze Batch Cloud DataStore Process Stream Cloud Monitoring Cloud Bigtable Real time analytics and Alerts Cloud Dataflow Cloud Dataproc
  • 39. BigQuery
  • 40. BigQuery: Value ● Performance: Ingest data at 100K rows/second and process real-time queries ● Ease of use: No administration for performance and scale ● Scale: No need to worry about growing data. Unlimited storage with pay as you go pricing model ● Non-technical analysts can drive queries on massive datasets using BI tools
  • 41. BigQuery: Features ● Interactive query performance: Query multi-terabyte datasets in an ad hoc manner ● SQL: Familiar SQL-like query syntax and intuitive user interface ● Data mashup: Query across diverse datasets ● Highly Available: Data replication in multiple geographies. Data is available and durable even in the case of extreme failure modes ● Secure: Access to data is controlled using customer-owned ACLs
  • 42. Cloud Pub/Sub
  • 43. Cloud Pub/Sub: Value ● Scalable, flexible, and reliable enterprise message-oriented middleware to the cloud ● Provides asynchronous messaging, allowing secure and highly available communication between independently written applications ● Delivers low-latency, durable messaging that helps developers quickly integrate systems hosted on the Google Cloud Platform and externally
  • 44. Cloud Pub/Sub: Features • Unified messaging: Durability and low-latency delivery in a single product • Global presence: Connect services located anywhere in the world • Flexible delivery options: Both push- and pull-style subscriptions supported • Data reliability: Replicated storage and guaranteed at-least-once message delivery • Data security and protection: Encryption of data on the wire and at rest
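At-least-once delivery, as listed on slide 44, means a subscriber may receive the same message more than once, so handlers are usually written to tolerate redelivery. The following is a minimal sketch of that idea in plain Python. It does not use any Pub/Sub client library, and `Message`, `message_id`, and `DedupingSubscriber` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass(frozen=True)
class Message:
    message_id: str  # hypothetical unique ID assigned by the broker
    data: bytes


@dataclass
class DedupingSubscriber:
    handler: Callable[[Message], None]
    seen_ids: Set[str] = field(default_factory=set)

    def receive(self, message: Message) -> bool:
        """Run the handler once per message ID; return True to acknowledge."""
        if message.message_id in self.seen_ids:
            return True  # redelivery: acknowledge again, skip the side effect
        self.handler(message)
        self.seen_ids.add(message.message_id)
        return True


if __name__ == "__main__":
    processed: List[bytes] = []
    subscriber = DedupingSubscriber(handler=lambda m: processed.append(m.data))
    # The third delivery repeats message "1", as at-least-once delivery allows.
    for msg in (Message("1", b"a"), Message("2", b"b"), Message("1", b"a")):
        subscriber.receive(msg)
    print(processed)
```

The in-memory set is only for illustration; a long-running subscriber would likely need a bounded or persistent store for the IDs it has seen.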
  • 45. Cloud Dataflow
  • 46. Cloud Dataflow: Value • Reduce cost of processing large datasets • Save time: Automatically optimizes data-centric pipeline code by collapsing multiple logical passes into a single execution pass • Increase efficiencies: Fully manages the lifecycle of required compute resources • Simple: Dataflow makes it easy to write data-processing pipelines that incorporate both batch and stream-processing capabilities and is language-agnostic
  • 47. Cloud Dataflow: Features • Unified programming model for both batch and stream-based data analysis • Managed scaling: Manages the lifecycle of required compute resources • Reliable & consistent processing: Built-in support for fault-tolerant execution • Monitoring: Provides lifecycle statistics including in-flight information like real time pipeline throughput, real time step lag and real time worker log inspection
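Two ideas from slides 46 and 47 can be illustrated with a toy example: collapsing several logical passes into one execution pass, and running the same pipeline definition over a bounded (batch) or unbounded-style (stream) source. This sketch is plain Python and is not the Dataflow SDK. `fuse` and `run_pipeline` are hypothetical helpers, and only element-wise (map-like) steps are handled.

```python
from functools import reduce
from typing import Any, Callable, Iterable, Iterator

Step = Callable[[Any], Any]


def fuse(*steps: Step) -> Step:
    """Compose element-wise steps so each element is touched in one pass."""
    return lambda element: reduce(lambda acc, step: step(acc), steps, element)


def run_pipeline(source: Iterable[Any], *steps: Step) -> Iterator[Any]:
    """Apply the fused steps lazily; works for lists and generators alike."""
    fused = fuse(*steps)
    for element in source:
        yield fused(element)


def event_stream() -> Iterator[int]:
    """Stand-in for a streaming source; yields elements one at a time."""
    for value in (1, 2, 3):
        yield value


if __name__ == "__main__":
    steps = (lambda x: x + 1, lambda x: x * 2)
    print(list(run_pipeline([1, 2, 3], *steps)))       # batch-style source
    print(list(run_pipeline(event_stream(), *steps)))  # stream-style source
```

Because the steps are composed before execution, the data is traversed once regardless of how many logical steps the pipeline declares.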
  • 48. Cloud Dataproc
  • 49. Focus on Insight, Not Infrastructure. Typical Big Data Processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling growing scale, Utilization improvements. Big Data with Google: Programming. Reduce Time to Understanding
  • 50. Traditional Big Data = Big Problems. Hurdles to innovate and iterate with Big Data: Continuously accommodating greater data volumes and new data sources; Capture and store all data for all business functions; Complexity of building and maintaining a Big Data system with consistent ease of use; Reducing the time from data collection to action; Managing the cost of the data platform; Keep systems reliable/running; Keep your data secure; Collaboration within or across organizations
  • 51. [Architecture diagram] Google Cloud Platform pipeline in six numbered stages: Data Collection, ETL, Raw Data Storage, Aggregation, Analytics Storage, Visualization. Labels shown: Google Compute and App Engine (Scalable VMs); Google Cloud Storage; Google BigQuery (TBs of Data, Process in seconds); Collection, Transformation, Data processing, Cleansing; Raw Data Storage, Staging; BigQuery Aggregate; Serve Analytics; AdHoc Queries; REST API; Interactive Dashboards + apps; BI tools; Google Spreadsheets
  • 52. Google confidential │ Do not distribute Overview: Data to process: Data in the Consolidated Audit Trail (CAT). A data repository of all equities and options orders, quotes, and events Challenges: How to process the CAT and organize 100 billion market events into an “order lifecycle” in a 4 hour window Store 6 years (~30PB) of data Cloud Bigtable to process and run queries and tolerate volume increases 6 BILLION MARKET EVENTS WRITTEN PER HOUR 1.7 GIGs PER SECOND PER HOUR 6 TBs 10 BN WRITTEN PER HOUR BURSTS 1.7 GIGABYTES PER SECOND 10 TERABYTES PER HOUR
  • 53. Overview: Data to process: Standard game KPIs, marketing data, custom game insight. Several dozen gigabytes of raw logs per day. Challenges: Struggled to process large volume of data. Long delays between triggering logs and querying data; problematic for games running live events. Issues controlling permissions. Long-running queries, clunky analysis. “BigQuery has helped us focus on actually using data instead of exhausting ourselves just trying to get to the data.” CRUNCH 150 GIGS OF DATA IN 15 SECONDS. INSTANT LOG INGESTION. SCALE WITHOUT CLOGGING THE SYSTEM. FLEXIBILITY ON PERMISSION CONTROLS
  • 55. Snapchat sends 700 million photos and videos each day. “App Engine enabled us to focus on developing the application. We wouldn’t have gotten here without the ease of development that App Engine gave us.” Bobby Murphy, CTO. Google App Engine scaled seamlessly during growth to millions of users. Small team is able to innovate quickly and expand globally
  • 56. Big Data Partner Ecosystem Chartio
  • 58. 02 Customer Story - BlueCava Reza Qorbani, CTO deck
  • 59. BLUECAVA, INC. / 2015 PAGE 59 CROSS SCREEN STARTS HERE
  • 60. BLUECAVA Business / Product / Challenges
  • 61. INTRODUCTION Reza Qorbani, CTO @ BlueCava • Work with Google Big Data Team in past 1.5 years • Move from 100% Private Cloud to Hybrid Environment • Deep Integration with Big Query Email reza.qorbani@bluecava.com Twitter @qorbani
  • 62. ABOUT – BlueCava: Open Network that Optimizes Cross-Screen Marketing. DISPLAY, MOBILE, VIDEO, EXCHANGE, SOCIAL. Real-time Intelligence. VALIDATION, DEMOGRAPH, LOCATION, EXCHANGE, COVERAGE. Association Graph. DataTech Platforms, AdTech Platforms, MARTECH PLATFORMS & SERVICES
  • 63. ABOUT – Association Graph: Household; Consumer A, Consumer B, Consumer C; IDFA, APN, BCID
  • 64. ABOUT – Coverage: 100M Households, 240M Consumers, 600M Devices
  • 65. ABOUT – Volume: 5 TB Daily (Daily RAW Logs); 250k req/sec (From Partners and Exchanges); 1.3 Petabyte (Total Storage); 25 Billion IDs (Including our Partner IDs)
  • 66. ABOUT – Challenge. Delivery: − Generate data for customers − Multiple extraction at time − Keep data for months − Highly Available. Flexibility: − Easily run Ad-Hoc queries − Handle lots of POCs − Flexible to Change − Unified Data Store. Cost: − Bandwidth Cost − Storage Cost − Infrastructure Cost − Operation Cost
  • 67. ARCHITECTURE BlueCava Platform Overview / Before / Now / Future!
  • 68. ARCHITECTURE – BlueCava Platform Overview: CORE INTERNAL CUSTOMER PLATFORM EDGEX BIDDER OPERATIONS QUALITY API PORTAL METADATA PREPARE LOGGING AGGREGATE FILTER DETECTOR TRANSFER / PREPARE PROCESS / ASSOCIATION ANALYZE / REPORT AG AE DB
  • 69. ARCHITECTURE – Before: WEST (IRVINE), EAST (ASHBURN); CORE, CORE, CUSTOMER, INTERNAL PLATFORM; BACKUP / DR; Geographic Load Balancing; XDC NET
  • 70. ARCHITECTURE – Before / Challenges. Cost: Estimate of $1.5M upfront to scale up; High Monthly Bandwidth cost; Need to Extend Operation team. Scalability: Datacenter Issue with Traffic spikes; Need to scale down after POC finishes. Performance: Some processes took more than a day; Customer delivery takes 5-10 hours; Ad-Hoc queries taking hours. Storage: Need more historical data to increase quality; Need to keep customer data for months; Deliver large amount of data to customers. Complexity: Simple Tasks Require Data Engineering Expertise; Customizing Data Output was hard; Data Scientists need meaningful data set. Resource Limitations: QA/Dev Environment Separation; Ad-Hoc queries create issue for production
  • 71. ARCHITECTURE – Before / Solution: Big Query ▪ Big Data as a Service ▪ Extremely cost effective for our use-case ▪ Support Hierarchical Data Model ▪ Extremely fast ▪ Query using SQL ▪ Solve most of our Big Data challenges ▪ Fraction of cost (It was Unbelievable) ▪ Customer Delivery in Seconds!!! ▪ We dropped Delivery Spark Cluster (10 nodes) ▪ We dropped Ad-Hoc Hadoop Cluster (100x nodes) ▪ Offload ALL Customer Facing Jobs ▪ Only 2 Sprints Development (6 Weeks)
  • 72. ARCHITECTURE – Before / Solution: Cloud Storage ▪ Nice integration with Big Query ▪ No file size limit like S3 ▪ HDFS Integration using Hadoop Connector ▪ Seamless Cost Saving: DRA and Nearline ▪ Solved most of our Storage challenges ▪ Simplified our file delivery ▪ Extremely competitive pricing ▪ No need for Backup ☺
  • 73. ARCHITECTURE – Before / Solution: Compute Engine ▪ Great Sustained Pricing ▪ No need for long-term contract ▪ Simple CLI for Automation ▪ BDUtil Library for Hadoop ▪ Elastic Environment which saved us on Cost ▪ 100+ nodes Hadoop under 6 minutes ▪ Use as On-Demand Resource as needed ▪ Stop purchasing more hardware!
  • 74. ARCHITECTURE – Now: WEST (IRVINE), Google Cloud Platform; CORE, CUSTOMER, INTERNAL PLATFORM; Cloud Storage; Simple DNS; Interconnect; Big Query
  • 75. ARCHITECTURE – Future! Cost: Move all in Cloud. Scalability: World-wide Coverage. Performance: Real-time Association. Simplify: Data Science Lab. Container Engine, Dataproc, Dataflow, Datalab
  • 76. ARCHITECTURE – Future! CORE, REALTIME PROCESS, ASSOCIATION GRAPH, QUERY, LAB, STORAGE, INTERNAL, CUSTOMER, BATCH PROCESS
  • 77. THANK YOU
  • 78. 02 Customer Story - Pixalate Amin Bandeali, CoFounder & CTO deck
  • 79. Amin Bandeali, Founder & CTO, Pixalate, Inc.
  • 80. Agenda ● What is Pixalate? ● My Role @ Pixalate ● Pixalate Breadth and Depth ● What is Ad Fraud and why is it important to solve? ● Challenges ● Ad Fraud ● Real World BigQuery Use Cases ● Conclusion
  • 81. Our Mission To Rate the Whole Internet… ...and YES we also see what Google doesn’t see!
  • 82. What is Pixalate? Pixalate is a de facto Ratings Standard for Programmatic Advertising. SellerTrustIndex.com
  • 83. My Role @ Pixalate ● Co-Founder, CTO and Solution Architect ● Real-Time Data Junkie - Contributed to Apache Hadoop Project ● Largest AWS DynamoDB user upon launch - not using it anymore! ● Largest AWS SQS user - not using it anymore! ● Pixalate backend runs Java, NodeJS, Redis, Solr, S3 and BigQuery ● Denied using 25000 free hours of AWS Redshift! ● 70% of Pixalate technology runs on AWS -- 30% on BigQuery ● We move 2TB of data from AWS to Google Storage just for BigQuery
  • 84. Challenges • Process 1+ Trillion Ad Transactions/month • Data Processing: Up to 3 PB/month • Analyze Massive amounts of Data to detect fraud • Create customized reports with NO engineering support! • Close to 1 Trillion rows of data in BigQuery
  • 85. What is Ad Fraud?
  • 87. Ad Fraud against AdMob and McDonald’s
  • 89. Our Realtime Fraud Map http://www.pixalate.com/map
  • 90. What’s wrong with this data?
  • 91. A day in the life of the Data Science Team ● An Account Manager asks the data science team for a customized report for a client that measures some specific metrics over the last 6 months of their data. ● Solution 1: AWS EMR - Boring and takes Hours! ○ The Big data engineers will execute an EMR (Hive) job that extracts the data and creates the report ● Solution 2: BigQuery - Fun and takes Seconds! ○ The data science team implements a usually complex query that calculates all the metrics in SQL ○ BigQuery will process a couple of TB of data and create the report in a few seconds.
  • 92. Bypassing the Engineers! ● We need to expand a list of 500,000 network addresses in CIDR format (e.g. 128.0.0.1/24) to regular IP format and use them in client reports ● Solution 1: Java ○ provide the Java engineers with the requirements ○ wait for implementation completion ○ wait for UAT and Production push ○ store the data in a database ■ total time ~3 workdays (in Startup Timezone) ● Solution 2: BigQuery ○ the data science team writes a query with 25+ table JOINs and UNIONS that takes care of the expansion in a clean, easy to test way, and runs it in BigQuery ■ total time ~3 hours
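To make the task on slide 92 concrete, here is what "expanding CIDR to regular IP format" means for a single block. The deck describes doing this in BigQuery SQL with many JOINs and UNIONs; the sketch below is not that query. It swaps in Python's standard `ipaddress` module purely to show the expected output, and `expand_cidr` is a hypothetical helper name.

```python
import ipaddress
from typing import Iterator


def expand_cidr(cidr: str) -> Iterator[str]:
    """Yield every address in a CIDR block as a dotted-quad string."""
    # strict=False accepts inputs with host bits set, such as the slide's
    # example 128.0.0.1/24, and normalizes them to the enclosing network.
    network = ipaddress.ip_network(cidr, strict=False)
    for address in network:
        yield str(address)


if __name__ == "__main__":
    addresses = list(expand_cidr("128.0.0.1/24"))
    print(len(addresses), addresses[0], addresses[-1])
```

A /24 block expands to 256 addresses, so a list of 500,000 blocks can grow very large depending on the prefix lengths involved.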
  • 93. From Waste Picking to Innovation ● The amount of digital data in the universe is growing at an exponential rate, doubling every two years, and changing how we live in the world. ○ YET only .5% of that data is analyzed! ● If you can’t mine these data easily and extract semantics, ○ then how is data collection different than waste-picking??? ● BigQuery enables Innovation ○ It breaks the dependency between data scientists and big-data engineers ○ Now data scientists can write complex queries and analyse massive amounts of data without the need of any backend coding (e.g. Java), or some other big data framework ○ It enables the deep understanding of complex data and their
  • 94. Cost reduction using BigQuery ● Complex data processing pipelines impose a new cost optimization challenge ● Main questions to be answered: ○ Where do I store the data I collect? ○ Where/How do I aggregate the data I collect? ○ How do I enhance the data I collect with other metadata? ○ How do I process the data collected? ■ such that the overall cost is minimized?? ● BigQuery can HELP!
  • 95. BigQuery Health Monitoring Using BigQuery
  • 96. But Wait! Here’s the real benefit...
  • 97. Zero Cost Queries Over Petabytes! ● How can you query PETABYTES of historical data and create time series to detect traffic anomalies (e.g. network failures, etc)? ● BigQuery Zero Cost queries (a.k.a. table metadata) ○ can give you the big picture regarding table’s data health ■ within seconds ■ without having to run any costly queries suspicious activity
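The idea on slide 97 is that table metadata alone (for example, row counts per daily table) can form a time series for spotting traffic anomalies without scanning the data. The deck does not show how the metadata is fetched or how anomalies are scored, so this sketch assumes the row counts have already been collected into a dictionary keyed by day and uses a simple trailing-median rule. `flag_anomalies`, the window size, and the tolerance are all illustrative assumptions.

```python
from statistics import median
from typing import Dict, List


def flag_anomalies(
    daily_rows: Dict[str, int], window: int = 7, tolerance: float = 0.5
) -> List[str]:
    """Flag days whose row count deviates from the trailing median by > tolerance."""
    days = sorted(daily_rows)
    flagged = []
    for i, day in enumerate(days):
        history = [daily_rows[d] for d in days[max(0, i - window):i]]
        if len(history) < 3:
            continue  # not enough history to form a baseline
        baseline = median(history)
        if baseline and abs(daily_rows[day] - baseline) / baseline > tolerance:
            flagged.append(day)
    return flagged


if __name__ == "__main__":
    # Hypothetical row counts per daily table (YYYYMMDD), in millions.
    counts = {
        "20151001": 100, "20151002": 102, "20151003": 98,
        "20151004": 101, "20151005": 20, "20151006": 99,
    }
    print(flag_anomalies(counts))
```

Using the median rather than the mean keeps a single bad day from distorting the baseline for the days that follow it.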
  • 98. Big Query Success is all about the Architecture Spend a LOT of time on Table Schemas (hint: keep them flat)
  • 99. Learnings ● BigQuery has its gotchas! ○ The wrong Sharding strategy can slow you down ○ Know your Quotas well -- they will haunt you! ○ Balance the table JOINs appropriately ○ Don’t use ORDER BY unless it’s mandatory ○ Avoid “SELECT *” queries on “fat” tables over long time ranges ● Secret recipe ○ push as much complexity as possible to BigQuery using advanced queries ■ usually > 100 lines of SQL code ○ use backend languages (e.g. Java) to simply orchestrate the data pipeline ○ don’t be scared of data duplication -- storage cost is much cheaper than analysis cost!
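The advice on slide 99 to avoid “SELECT *” on “fat” tables over long time ranges follows from BigQuery reading only the columns a query references. The sketch below is a rough back-of-the-envelope model, not a BigQuery API: it assumes date-sharded tables and made-up per-column sizes, and `estimated_bytes_scanned` is a hypothetical helper.

```python
from typing import Dict, Iterable, Optional


def estimated_bytes_scanned(
    shards: Dict[str, Dict[str, int]],
    columns: Optional[Iterable[str]],
    first_day: str,
    last_day: str,
) -> int:
    """Sum the sizes of the selected columns over the shards in the date range.

    columns=None models "SELECT *" (every column is read).
    """
    wanted = None if columns is None else set(columns)
    total = 0
    for day, column_bytes in shards.items():
        if not (first_day <= day <= last_day):
            continue  # shard outside the requested time range
        for name, size in column_bytes.items():
            if wanted is None or name in wanted:
                total += size
    return total


if __name__ == "__main__":
    # Hypothetical per-column sizes (in GB) for two daily shards.
    shards = {
        "20151001": {"ip": 10, "user_agent": 90},
        "20151002": {"ip": 12, "user_agent": 88},
    }
    print(estimated_bytes_scanned(shards, None, "20151001", "20151002"))
    print(estimated_bytes_scanned(shards, ["ip"], "20151002", "20151002"))
```

In this toy data, selecting one narrow column for one day touches a small fraction of what a wildcard select over the full range would.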
  • 100. Q&A Amin Bandeali p: 888.749.2528 m: 714.757.9544 e: mab@pixalate.com t: http://twitter.com/aminbandeali
  • 101. Panel Q&A Rohit Khare, GCP Big Data PM Reza Qorbani, BlueCava CTO Amin Bandeali, Pixalate CoFounder & CTO
  • 102. 04 Partner Story - Magnus Unum Rajesh Babu, BI, Big Data & Analytics solutions Architect Subash D'Souza, Big Data Evangelist deck
  • 103. Magnus Unum Raj Babu & Subash D’Souza Modern BI & Big Data platform with Google Cloud
  • 104. Magnus Unum…what we do We are an LA-based Big Data, Data Science & Analytics Consulting Services firm specializing in advising our clients on Strategy, Road Map/Blueprint, Implementation, Deployment, Maintenance/Support/Operations for their Big Data,
  • 105. Magnus Unum…Leadership • Raj Babu: Co-Founder, Magnus Unum; Founder, Agile iSS; 20 years of experience in the BI & Analytics field; worked on numerous, very large BI migration and integration projects • Subash D’Souza: Over 10 years of experience in building scalable solutions for various enterprise companies; organizer for several LA User Groups including Big Data, Apache Spark & Apache HBase
  • 106. Magnus Unum - Key Services • Architect, Design & Build Big Data Solutions • Cloud Migration services for Big Data, Analytics & BI • Big Data Engineering & Staffing • Big Data managed & support services • Data Science Solutions & Services
  • 107. Magnus Unum – Expertise • On-Prem: Cloudera, Hortonworks, IBM, Pivotal & MapR • Cloud: Google Cloud, Amazon AWS & Microsoft Azure • Analytics/Reporting: Tableau, MicroStrategy, SAP BO, Qlik & Pentaho
  • 108. Why Google Cloud Platform?
  • 109. Use Case 1 – Migrating your Data Warehouse and BI to Google Cloud • Capture / Migrate or Capture • Storage / Data Management • Data Processing • Query/Analytics • Data Integration • Access Control
  • 110. Use Case 2 – Google Analytics detailed analysis • Limitation in Google Analytics daily export • More detailed analysis available as part of Google Cloud Platform (must have premium access) • Can analyze granular level details of User Interaction on websites and aggregate the
  • 111. Please reach out to us for a free Consultation & Assessment of your BI, Big Data & Analytics needs & additional $500 in GCP credits!