Introduction to Google BigQuery

Cloud
Csaba Toth
Presented By:
Introduction to Google BigQuery

Goal
• Being able to issue queries
• Preferably in an SQL dialect
• Over Big Data
• As small response time as possible
• Preferably interactive web interface
(thus no need to install anything)

Agenda
• Big Data
• Brief look at Hadoop, HIVE and
Spark
• Row based data store vs. Column
data store
• Google BigQuery
• Demo

Big Data
Wikipedia: “collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing
applications”
Examples: (Wikibon - A Comprehensive List of Big Data Statistics)
• 100 Terabytes of data is uploaded to Facebook every day
• Facebook Stores, Processes, and Analyzes more than 30 Petabytes of user
generated data
• Twitter generates 12 Terabytes of data every day
• LinkedIn processes and mines Petabytes of user data to power the "People You May
Know" feature
• YouTube users upload 48 hours of new video content every minute of the day
• Decoding of the human genome used to take 10 years. Now it can be done in 7 days

Little Hadoop history
“The Google File System” - October 2003
• http://labs.google.com/papers/gfs.html – describes a
scalable, distributed, fault-tolerant file system tailored for
data-intensive applications, running on inexpensive
commodity hardware, delivers high aggregate
performance
“MapReduce: Simplified Data Processing on Large
Clusters” - April 2004
• http://queue.acm.org/detail.cfm?id=988408 – describes a
programming model and an implementation for
processing large data sets.

Hadoop
• Hadoop is an open-source software
framework that supports data-
intensive distributed applications
• A Hadoop cluster is composed of a
single master node and multiple
worker nodes

Hadoop
Has two main services:
1. Storing large amounts of data: HDFS
– Hadoop Distributed File System
2. Processing large amounts of data:
implementing the MapReduce
programming model

HDFS
Name node
Metadata
Store
Data node Data node Data node
Node 1 Node 2
Block A Block B Block A Block B
Node 3
Block A Block B

Job / task management
Name node
Heart beat signals and
communication
Jobtracker
Data node Data node Data node
Task-
tracker
Task-
tracker
Map 1 Reduce 1 Map 2 Reduce 2
Task-
tracker
Map 3 Reduce 3
Map-Reduce

Hadoop vs. RDBMS
Hadoop / MapReduce RDBMS
Size of data Petabytes Gigabytes
Integrity of data Low High (referential, typed)
Data schema Dynamic Static
Access method Batch Interactive and Batch
Scaling Linear Nonlinear (worse than
linear)
Data structure Unstructured Structured
Normalization of data Not Required Required
Query Response Time Has latency (due to batch
processing)
Can be near immediate

Apache Hive
Log Data RDBMS
Data Integration LayerFlume Sqoop
Storage Layer (HDFS): row and columnar data, file data
Computing Layer (MapReduce)
Advanced Query Engines (Hive, Pig)
Data Mining
(Pegasus,
Mahout)
Index,
Searches
(Lucene)
DB drivers (Hive driver)
GUI (web interface, RESTful API, JavaScript)
System
management
Distribution
coordination
(Zookeeper)
JDBC ODBC JS

Beyond Apache Hive
Goals: decrease latency
• YARN: the “next generation Hadoop”,
improves performance in many respects
(resource management and allocation, …)
• Hadoop distribution specific solution: e.g.
Cloudera Impala, MPP SQL Query
engine, based on Hadoop

Apache Spark
• Cluster computing framework with multi-
stage in-memory primitives
• Open Source, originates from Berkeley
• In contrast to Hadoop’s two-stage disk-
based MapReduce paradigm, multi-stage
in-memory primitives can provide up to
100x performance increase
• Requires YARN and HDFS

Storing data: row stores
• Traditional RDBMS and often the
document stores are row oriented too
• The engine stores and retrieves rows
from disk (unless indexes help)
• Row is a collection of column cell
values together
• Rows are materialized on disk

Row stores
Row cells
are stored
together
on disk
id scientist death_by movie_name
1 Reinhardt Maximillian The Black Hole
2 Tyrell Roy Batty Blade Runner
3 Hammond Dinosaur Jurassic Park
4 Soong Lore Star Trek: TNG
5 Morbius His mind Forbidden Planet
6 Dyson Skynet Terminator 2: Judgment Day

Row stores
• Not so great for wide rows
• If only a small subset of columns
queried, reading the entire row
wastes IO
• (Indexing strategies can help but I
don’t have time to cover them)

Row stores
Bad case scenario:
• select sum(bigint_column) from table
• Million rows in table
• Average row length is 1 KiB
The select reads one bigint column (8 bytes)
• Entire row must be read
• Reads ~1 GiB data for ~8MiB of column
data

Column stores
• Data is organized by columns
instead of rows
• Non material world: often not
materialized during storage, exists
only in memory
• Each row still has some sort of “row
id”

Column stores
• A row is a collection of column values that
are associated with one another
• Associated: every row has some type of
“row id“
• Can still produce row output (assembling
a row maybe complex though – under the
hood)

Column stores
Stores each COLUMN on disk
id
1
2
3
4
5
6
title
Mrs. Doubtfire
The Big Lebowski
The Fly
Steel Magnolias
The Birdcage
Erin Brokovitch
actor
Robin Williams
Jeff Bridges
Jeff Goldblum
Dolly Parton
Nathan Lane
Julia Roberts
genre
Comedy
Comedy
Horror
Drama
Comedy
Drama
row id = 1
row id = 6
Natural order may be unusual Each column has a file or segment on disk

Column stores
• Column compression can be way more
efficient than row based compression
(sometimes 10:1 to 30:1 ratio)
• Compression: RLE, Integer packing,
dictionaries and lookup, other…
• Reduces both storage and IO (thus
response time)

Column stores
Best case scenario:
• select sum(bigint_column) from table
• Million rows in table
• Average row length is 1 KiB
The select reads one bigint column (8 bytes)
• Only single column read from disk
• Reads ~8 MiB of column data, even less
with compression

Column stores
Bad case scenario:
select *
from long_wide_table
where order_line_id = 34653875;
• Accessing all columns doesn’t save
anything, could be even more expensive
than row store
• Not ideal for tables with few columns

Column stores
Updating and deleting rows is expensive
• Some column stores are append only
• Others just strongly discourage writes
• Some split storage into row and column
areas

Column / Row stores
• RDBMS provide ACID capabilities
• Row stores mainly use tree style indexes
• B-tree derivative index structure provides
very fast binary search as long as it fits
into memory
• Very large datasets end up
unmanageably big indexes
• Column stores: bitmap indexing
Very expensive to update

BigQuery history
“Dremel: Interactive Analysis of Web-Scale
Datasets” – 2010, describes a column store
/ retrieval system
• https://static.googleusercontent.com/media/research.goo
gle.com/en//pubs/archive/36632.pdf
Presentation with illustration about principles
used in Dremel, from Google
• http://www.cs.berkeley.edu/~istoica/classes/cs294/11/not
es/12-sameer-dremel.pdf

BigQuery
• A service that enables interactive analysis
of massively large datasets
• Based on Dremel, a scalable, interactive
ad hoc query system for analysis of read-
only nested data
• Working in conjunction with Google
Storage
• Has a RESTful web service interface

BigQuery
• You can issue SQL queries over big
data
• Interactive web interface
• As small response time as possible
• Auto scales under the hood

BigQuery
SaaS (/ PaaS)
Interfacing:
• REST API
• Web console
• Command line tools
• Language libraries
Insert only

Demo!
Wikipedia public dataset
Natalities public dataset
Names (uploaded)
Google Genomics
https://cloud.google.com/genomics/
https://cloud.google.com/genomics/v1/public-data
https://cloud.google.com/bigquery/web-ui-quickstart
https://cloud.google.com/bigquery/query-reference

Future thoughts
How to visualize data
• Possibly using Google Charts
• BigQuery alongside Google Maps
Playing with genomics data – requires
some bio-informatics knowledge

Resources
• Slides: http://www.slideshare.net/tothc
• Contact: http://www.meetup.com/CCalJUG/
• Csaba Toth: Introduction to Hadoop and MapReduce -
http://www.slideshare.net/tothc/introduction-to-hadoop-
and-map-reduce
• Justin Swanhart: Introduction to column stores -
http://www.slideshare.net/MySQLGeek/intro-to-column-
stores
• Jan Steemann: Column-oriented databases -
http://www.slideshare.net/arangodb/introduction-to-
column-oriented-databases

Resources
• https://anonymousbi.wordpress.com/2012/11/02/hadoo
p-beginners-tutorial-on-ubuntu/
• https://www.capgemini.com/blog/capping-it-
off/2012/01/what-is-hadoop
• http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI
• https://www.cloudera.com/content/cloudera/en/docume
ntation/core/latest/PDF/cloudera-impala.pdf
• https://www.keithrozario.com/2012/07/google-bigquery-
wikipedia-dataset-malaysia-singapore.html
• https://cloud.google.com/bigquery/web-ui-quickstart
• https://cloud.google.com/bigquery/query-reference

Resources
• https://github.com/googlegenomics/getting-started-bigquery
• https://github.com/googlegenomics/bigquery-examples
• https://github.com/googlegenomics/readthedocs

Pricing
https://cloud.google.com/bigquery/pricing
Storage: $0.020 per GB/mo.
Queries: $5 per TB processed
Streaming inserts: $0.01 per 200 MiB (1 KiB
rows)
Columns are compressed but price is based
on the uncompressed size

Tips
• Storage dominates the costs
• Plus you need to restrict queries, use as
few columns as possible
• BigQuery scans the full columns which
are involved in the query
• Do not select *
• LIMIT the result set
• Result caching

Introduction to Google BigQuery

More Related Content

What's hot (20)

Similar to Introduction to Google BigQuery (20)

More from Csaba Toth (17)

Recently uploaded (20)

Introduction to Google BigQuery

Editor's Notes