Introduction to Google BigQuery
Presented by: Csaba Toth
Our sponsors
Disclaimer
Disclaimer – cont.
Goal
‱ Being able to issue queries
‱ Preferably in an SQL dialect
‱ Over Big Data
‱ With as short a response time as possible
‱ Preferably through an interactive web interface
(thus no need to install anything)
Agenda
‱ Big Data
‱ Brief look at Hadoop, Hive and Spark
‱ Row-based data stores vs. column data stores
‱ Google BigQuery
‱ Demo
Big Data
Wikipedia: “collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”
Examples: (Wikibon - A Comprehensive List of Big Data Statistics)
‱ 100 terabytes of data are uploaded to Facebook every day
‱ Facebook stores, processes, and analyzes more than 30 petabytes of user-generated data
‱ Twitter generates 12 terabytes of data every day
‱ LinkedIn processes and mines petabytes of user data to power the "People You May Know" feature
‱ YouTube users upload 48 hours of new video content every minute of the day
‱ Decoding the human genome used to take 10 years. Now it can be done in 7 days
Little Hadoop history
“The Google File System” - October 2003
‱ http://labs.google.com/papers/gfs.html – describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications, running on inexpensive commodity hardware and delivering high aggregate performance
“MapReduce: Simplified Data Processing on Large Clusters” - April 2004
‱ http://queue.acm.org/detail.cfm?id=988408 – describes a programming model and an implementation for processing large data sets
Hadoop
‱ Hadoop is an open-source software framework that supports data-intensive distributed applications
‱ A Hadoop cluster is composed of a single master node and multiple worker nodes
Hadoop
Has two main services:
1. Storing large amounts of data: HDFS – Hadoop Distributed File System
2. Processing large amounts of data: implementing the MapReduce programming model
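To make the MapReduce programming model concrete, here is a minimal single-process sketch of the classic word-count job in Python. It only illustrates the map, shuffle and reduce steps; it is not Hadoop code, and the function names are chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (the framework does this in Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values of one key into a single result."""
    return word, sum(counts)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Query over big data"]))
```

In a real cluster the map and reduce calls run in parallel on different worker nodes and the shuffle moves data over the network; the sketch keeps everything in one process to show only the data flow.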
HDFS
[Diagram: a name node holding the metadata store, and three data nodes (Node 1, Node 2, Node 3), each storing a copy of Block A and Block B]
Job / task management
[Diagram: Map-Reduce. The JobTracker, shown with the name node, exchanges heartbeat signals and other communication with a TaskTracker on each of the three data nodes; the TaskTrackers run Map 1 / Reduce 1, Map 2 / Reduce 2 and Map 3 / Reduce 3]
Hadoop vs. RDBMS
                        Hadoop / MapReduce                      RDBMS
Size of data            Petabytes                               Gigabytes
Integrity of data       Low                                     High (referential, typed)
Data schema             Dynamic                                 Static
Access method           Batch                                   Interactive and batch
Scaling                 Linear                                  Nonlinear (worse than linear)
Data structure          Unstructured                            Structured
Normalization of data   Not required                            Required
Query response time     Has latency (due to batch processing)   Can be near immediate
Apache Hive
[Diagram: the Hadoop stack, from data sources up to the user interface]
‱ Data sources: log data, RDBMS
‱ Data integration layer: Flume, Sqoop
‱ Storage layer (HDFS): row and columnar data, file data
‱ Computing layer (MapReduce)
‱ Advanced query engines (Hive, Pig); data mining (Pegasus, Mahout); index, searches (Lucene)
‱ DB drivers (Hive driver): JDBC, ODBC, JS
‱ GUI (web interface, RESTful API, JavaScript)
‱ Alongside the layers: system management; distribution coordination (ZooKeeper)
Apache Hive UI
Beyond Apache Hive
Goal: decrease latency
‱ YARN: the “next generation Hadoop”, improves performance in many respects (resource management and allocation, 
)
‱ Hadoop-distribution-specific solutions: e.g. Cloudera Impala, an MPP SQL query engine based on Hadoop
Apache Spark
‱ Cluster computing framework with multi-stage in-memory primitives
‱ Open source, originates from Berkeley
‱ In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, multi-stage in-memory primitives can provide up to a 100x performance increase for some workloads
‱ Commonly deployed on YARN and HDFS, but requires neither (it also has a standalone cluster mode)
Spark and Hadoop
Storing data: row stores
‱ Traditional RDBMSs, and often document stores too, are row oriented
‱ The engine stores and retrieves whole rows from disk (unless indexes help)
‱ A row is a collection of column cell values stored together
‱ Rows are materialized on disk
Row stores
Row cells are stored together on disk

id  scientist  death_by     movie_name
1   Reinhardt  Maximillian  The Black Hole
2   Tyrell     Roy Batty    Blade Runner
3   Hammond    Dinosaur     Jurassic Park
4   Soong      Lore         Star Trek: TNG
5   Morbius    His mind     Forbidden Planet
6   Dyson      Skynet       Terminator 2: Judgment Day
Row stores
‱ Not so great for wide rows
‱ If only a small subset of columns is queried, reading the entire row wastes IO
‱ (Indexing strategies can help but I don’t have time to cover them)
Row stores
Bad case scenario:
‱ select sum(bigint_column) from table
‱ Million rows in table
‱ Average row length is 1 KiB
The select reads one bigint column (8 bytes per row)
‱ Entire row must be read
‱ Reads ~1 GiB of data for ~8 MiB of column data
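The numbers on this slide can be checked with a few lines of Python. This is just the arithmetic under the slide's own assumptions (one million rows, 1 KiB average row length, one 8-byte bigint column); the function name is made up for this example.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def scan_sizes(row_count: int, avg_row_bytes: int, column_bytes: int):
    """Return (bytes read when whole rows are scanned, bytes the query needs)."""
    whole_rows = row_count * avg_row_bytes   # a row store reads every full row
    needed = row_count * column_bytes        # only the summed column is useful
    return whole_rows, needed


whole_rows, needed = scan_sizes(1_000_000, 1 * KIB, 8)
print(f"{whole_rows / GIB:.2f} GiB read")    # 0.95 GiB read
print(f"{needed / MIB:.2f} MiB needed")      # 7.63 MiB needed
print(f"{whole_rows // needed}x overhead")   # 128x overhead
```

So the "~1 GiB for ~8 MiB" on the slide is a rounded version of roughly 0.95 GiB read for 7.63 MiB of useful data, a 128-fold read amplification.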
Column stores
‱ Data is organized by columns instead of rows
‱ “Non-material world”: rows are often not materialized in storage; they exist only in memory
‱ Each row still has some sort of “row id”
Column stores
‱ A row is a collection of column values that are associated with one another
‱ Associated: every row has some type of “row id”
‱ Can still produce row output (assembling a row may be complex though, under the hood)
Column stores
Stores each COLUMN on disk; each column has a file or segment on disk

id:    1, 2, 3, 4, 5, 6
title: Mrs. Doubtfire, The Big Lebowski, The Fly, Steel Magnolias, The Birdcage, Erin Brockovich
actor: Robin Williams, Jeff Bridges, Jeff Goldblum, Dolly Parton, Nathan Lane, Julia Roberts
genre: Comedy, Comedy, Horror, Drama, Comedy, Drama

Values at the same position belong to the same row: the first value of every column has row id = 1, the last has row id = 6.
Natural order may be unusual.
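The column layout on this slide can be sketched in Python as one list per column, with the list position playing the role of the row id. This is only an in-memory illustration; real column stores use files or segments and more elaborate row-id schemes, and the helper names are invented for this example.

```python
from typing import Any, Dict, List


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of row dicts into one value list per column.

    Assumes every row has the same keys.
    """
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def assemble_row(columns: Dict[str, List[Any]], row_id: int) -> Dict[str, Any]:
    """Rebuild one row from the columns; row_id is the 1-based position."""
    return {name: values[row_id - 1] for name, values in columns.items()}


rows = [
    {"id": 1, "title": "Mrs. Doubtfire", "actor": "Robin Williams", "genre": "Comedy"},
    {"id": 2, "title": "The Big Lebowski", "actor": "Jeff Bridges", "genre": "Comedy"},
    {"id": 3, "title": "The Fly", "actor": "Jeff Goldblum", "genre": "Horror"},
]
columns = to_columns(rows)
print(columns["genre"])          # ['Comedy', 'Comedy', 'Horror']
print(assemble_row(columns, 3))  # the full row for "The Fly"
```

Reading a single column touches one list only, while producing a full row has to visit every column, which is the trade-off the following slides discuss.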
Column stores
‱ Column compression can be far more efficient than row-based compression (sometimes a 10:1 to 30:1 ratio)
‱ Compression: RLE, integer packing, dictionaries and lookup, other 

‱ Reduces both storage and IO (thus response time)
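Two of the compression schemes named above, run-length encoding (RLE) and dictionary encoding, are simple enough to sketch in Python. The example uses the genre column from the earlier slide; it is a toy illustration, not how any particular column store implements compression.

```python
from itertools import groupby
from typing import Any, List, Tuple


def rle_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Run-length encoding: store each run of equal values as (value, length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs: List[Tuple[Any, int]]) -> List[Any]:
    """Expand (value, length) pairs back into the original list."""
    return [value for value, length in runs for _ in range(length)]


def dictionary_encode(values: List[Any]) -> Tuple[List[Any], List[int]]:
    """Dictionary encoding: store each distinct value once, plus small integer codes."""
    dictionary: List[Any] = []
    code_of = {}
    codes: List[int] = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


genre = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]
print(rle_encode(sorted(genre)))  # [('Comedy', 3), ('Drama', 2), ('Horror', 1)]
print(dictionary_encode(genre))   # (['Comedy', 'Horror', 'Drama'], [0, 0, 1, 2, 0, 2])
```

Both schemes work well on columns because all values share one type and often repeat; RLE benefits further when the column is sorted, which is one reason the natural order of a column store may be unusual.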
Column stores
Best case scenario:
‱ select sum(bigint_column) from table
‱ Million rows in table
‱ Average row length is 1 KiB
The select reads one bigint column (8 bytes per row)
‱ Only a single column is read from disk
‱ Reads ~8 MiB of column data, even less with compression
Column stores
Bad case scenario:
select *
from long_wide_table
where order_line_id = 34653875;
‱ Accessing all columns doesn’t save anything; it could be even more expensive than a row store
‱ Not ideal for tables with few columns
Column stores
Updating and deleting rows is expensive
‱ Some column stores are append only
‱ Others just strongly discourage writes
‱ Some split storage into row and column areas
Column / Row stores
‱ RDBMSs provide ACID capabilities
‱ Row stores mainly use tree-style indexes
‱ A B-tree derivative index structure provides very fast binary-search-style lookups as long as it fits into memory
‱ Very large datasets end up with unmanageably big indexes
‱ Column stores: bitmap indexing, which is very expensive to update
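A bitmap index keeps one bit string per distinct column value, with bit i set when row i holds that value. The sketch below uses Python integers as bitmaps and the actor and genre columns from the earlier slide; it is an illustration of the idea only, and real systems use compressed bitmap formats.

```python
from typing import Dict, Hashable, List, Sequence


def build_bitmap_index(column: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Map each distinct value to an int whose bit i is set if row i has that value."""
    index: Dict[Hashable, int] = {}
    for position, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << position)
    return index


def matching_positions(bitmap: int) -> List[int]:
    """List the 0-based row positions whose bit is set."""
    return [pos for pos in range(bitmap.bit_length()) if (bitmap >> pos) & 1]


genre = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]
actor = ["Robin Williams", "Jeff Bridges", "Jeff Goldblum",
         "Dolly Parton", "Nathan Lane", "Julia Roberts"]
genre_index = build_bitmap_index(genre)
actor_index = build_bitmap_index(actor)

# WHERE genre = 'Comedy'
print(matching_positions(genre_index["Comedy"]))  # [0, 1, 4]
# WHERE genre = 'Drama' AND actor = 'Julia Roberts' is a bitwise AND of two bitmaps
print(matching_positions(genre_index["Drama"] & actor_index["Julia Roberts"]))  # [5]
```

Combining predicates is a cheap bitwise operation, but inserting or deleting a row in the middle shifts bit positions in every bitmap, which illustrates why such indexes are expensive to update.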
BigQuery history
“Dremel: Interactive Analysis of Web-Scale Datasets” – 2010, describes a column store / retrieval system
‱ https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
Presentation illustrating the principles used in Dremel, from Google
‱ http://www.cs.berkeley.edu/~istoica/classes/cs294/11/notes/12-sameer-dremel.pdf
BigQuery
‱ A service that enables interactive analysis of massively large datasets
‱ Based on Dremel, a scalable, interactive ad hoc query system for analysis of read-only nested data
‱ Works in conjunction with Google Cloud Storage
‱ Has a RESTful web service interface
BigQuery
‱ You can issue SQL queries over big data
‱ Interactive web interface
‱ Response times as short as possible
‱ Scales automatically under the hood
BigQuery
SaaS (/ PaaS)
Interfacing:
‱ REST API
‱ Web console
‱ Command line tools
‱ Language libraries
Insert only
Demo!
Wikipedia public dataset
Natalities public dataset
Names (uploaded)
Google Genomics
https://cloud.google.com/genomics/
https://cloud.google.com/genomics/v1/public-data
https://cloud.google.com/bigquery/web-ui-quickstart
https://cloud.google.com/bigquery/query-reference
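The demo itself is not captured in the slides, so the following is only a sketch of what a query against the uploaded names table might look like from Python. Assumptions: the table has the schema given in the editor's notes (name:string, gender:string, count:integer); the table id `my-project.demo.names` is hypothetical; the SQL uses BigQuery's standard SQL syntax rather than whatever dialect was shown live; and `run_query` needs the `google-cloud-bigquery` package plus configured credentials.

```python
import re

# project.dataset.table; validated because the id is pasted into the SQL text
_TABLE_ID = re.compile(r"^[A-Za-z0-9_\-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


def build_top_names_query(table: str, limit: int = 10) -> str:
    """Build a query for the most frequent names, reading only two columns."""
    if not _TABLE_ID.match(table):
        raise ValueError(f"unexpected table id: {table!r}")
    return (
        "SELECT name, SUM(`count`) AS total\n"
        f"FROM `{table}`\n"
        "GROUP BY name\n"
        "ORDER BY total DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str):
    """Send the query to BigQuery (requires google-cloud-bigquery and credentials)."""
    from google.cloud import bigquery  # imported lazily so the builder works without it

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]


if __name__ == "__main__":
    # Hypothetical table id; replace it with a table you have uploaded.
    print(build_top_names_query("my-project.demo.names", limit=5))
```

The query names only the `name` and `count` columns, so the `gender` column is never scanned.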
Future thoughts
How to visualize data
‱ Possibly using Google Charts
‱ BigQuery alongside Google Maps
Playing with genomics data – requires some bioinformatics knowledge
Thank you!
Questions?
Resources
‱ Slides: http://www.slideshare.net/tothc
‱ Contact: http://www.meetup.com/CCalJUG/
‱ Csaba Toth: Introduction to Hadoop and MapReduce - http://www.slideshare.net/tothc/introduction-to-hadoop-and-map-reduce
‱ Justin Swanhart: Introduction to column stores - http://www.slideshare.net/MySQLGeek/intro-to-column-stores
‱ Jan Steemann: Column-oriented databases - http://www.slideshare.net/arangodb/introduction-to-column-oriented-databases
Resources
‱ https://anonymousbi.wordpress.com/2012/11/02/hadoop-beginners-tutorial-on-ubuntu/
‱ https://www.capgemini.com/blog/capping-it-off/2012/01/what-is-hadoop
‱ http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI
‱ https://www.cloudera.com/content/cloudera/en/documentation/core/latest/PDF/cloudera-impala.pdf
‱ https://www.keithrozario.com/2012/07/google-bigquery-wikipedia-dataset-malaysia-singapore.html
‱ https://cloud.google.com/bigquery/web-ui-quickstart
‱ https://cloud.google.com/bigquery/query-reference
Resources
‱ https://github.com/googlegenomics/getting-started-bigquery
‱ https://github.com/googlegenomics/bigquery-examples
‱ https://github.com/googlegenomics/readthedocs
Pricing
https://cloud.google.com/bigquery/pricing
Storage: $0.020 per GB per month
Queries: $5 per TB processed
Streaming inserts: $0.01 per 200 MiB (1 KiB rows)
Columns are compressed, but the price is based on the uncompressed size
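As a rough aid, the prices on this slide can be turned into a small calculator. The sketch below hard-codes the slide's figures, which may no longer match the current price list, and it assumes binary units (1 GB = 2^30 bytes, 1 TB = 2^40 bytes); both the unit choice and the function names are assumptions made for this example.

```python
MIB = 1024 ** 2
GB = 1024 ** 3   # assumption: binary units
TB = 1024 ** 4

# Prices as quoted on the slide; check the pricing page for current values.
STORAGE_USD_PER_GB_MONTH = 0.020
QUERY_USD_PER_TB = 5.0
STREAMING_USD_PER_200_MIB = 0.01


def monthly_storage_cost(uncompressed_bytes: int) -> float:
    """Storage is billed on the uncompressed size of the data."""
    return uncompressed_bytes / GB * STORAGE_USD_PER_GB_MONTH


def query_cost(bytes_processed: int) -> float:
    """Queries are billed by the number of bytes they process."""
    return bytes_processed / TB * QUERY_USD_PER_TB


def streaming_cost(bytes_inserted: int) -> float:
    """Streaming inserts are billed per 200 MiB inserted."""
    return bytes_inserted / (200 * MIB) * STREAMING_USD_PER_200_MIB


print(f"${monthly_storage_cost(TB):.2f} per month to store 1 TB")  # $20.48
print(f"${query_cost(TB):.2f} per full scan of 1 TB")              # $5.00
```

Under these figures, storing 1 TB for a month costs about as much as four full scans of it, so how often and how widely a table is queried matters as much as its size.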
Tips
‱ Storage dominates the costs
‱ Plus you need to restrict queries: use as few columns as possible
‱ BigQuery scans the full columns that are involved in the query
‱ Do not select *
‱ LIMIT the result set
‱ Result caching
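Because the full columns involved in a query are scanned, the cost of a query can be estimated from the sizes of the columns it names. The sketch below assumes a hypothetical table with made-up per-column uncompressed sizes, the $5 per TB query price from the pricing slide, and 1 TB = 2^40 bytes.

```python
from typing import Dict, Iterable

TB = 1024 ** 4            # assumption: binary terabyte
QUERY_USD_PER_TB = 5.0    # price quoted on the pricing slide


def bytes_scanned(column_sizes: Dict[str, int], selected: Iterable[str]) -> int:
    """Sum the uncompressed sizes of every column the query touches."""
    return sum(column_sizes[name] for name in selected)


def query_cost(n_bytes: int) -> float:
    """Cost in USD of processing n_bytes."""
    return n_bytes / TB * QUERY_USD_PER_TB


# Hypothetical table: column name -> uncompressed size in bytes.
column_sizes = {
    "order_line_id": 8 * 10**9,
    "customer_name": 40 * 10**9,
    "comment": 952 * 10**9,
}

select_star = bytes_scanned(column_sizes, column_sizes)    # every column
one_column = bytes_scanned(column_sizes, ["order_line_id"])

print(f"select *:   ${query_cost(select_star):.2f}")  # $4.55
print(f"one column: ${query_cost(one_column):.4f}")   # $0.0364
```

In this made-up table, naming one narrow column instead of using select * cuts the scanned bytes, and therefore the cost, by a factor of 125.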


Editor's Notes

  • #6: SQL: Structured Query Language
  • #33: ACID: Atomicity, Consistency, Isolation, Durability
  • #38: Name schema: name:string,gender:string,count:integer