Gabriel PREDA
@eRadical
(Almost) Serverless Analytics System
with BigQuery & AppEngine
Agenda
Going Serverless with
AppEngine & Tasks
Pub/Sub, DataStore
BigQuery
Load
Batch
Streaming Inserts
Query
UDF
Export
...some BigQueries...
Some years ago...
~ 500,000 - 2,000,000 events / day
(on average)
Some time ago...
~2,000,000 - 22,000,000 events / day
Dec 2014: 57,430,000 events / day
1 day to recompute » 12 hours
NOW()
22,000,000 - 70,000,000 events / day
AVG » 40,000,000 events / day
Processing ~30GB-70GB / day
Recompute 1 day » 10-20 minutes
serverless?
Desired for: https://www.innertrends.com
other... (almost) serverless products
Cloud Functions (alpha - Node.JS)
Cloud DataFlow (Java, Python - beta)
BigQuery
https://cloud.google.com/bigquery/docs/
BigQuery - data types
● STRING - UTF-8 (2 bytes + encoded string size)
● BYTES - base64 encoded (except in Avro)
● INTEGER - 64-bit signed (8 bytes)
● FLOAT (8 bytes)
● BOOLEAN - true/false, 1/0 only in CSV (1 byte)
● TIMESTAMP ex: "2014-08-19 12:41:35.220 UTC" (8 bytes)
● DATE, TIME, DATETIME - limited support in Legacy SQL
● RECORD - a collection of fields (size of fields)
https://cloud.google.com/bigquery/data-types
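The per-type sizes listed above are enough to estimate how many bytes a single row contributes to storage and to the bytes a query processes. The following is a minimal sketch for Node.js, not an official calculator: the schema, the sample row, and the function name `estimateRowBytes` are hypothetical and only illustrate the arithmetic.

```javascript
// Fixed per-type sizes in bytes, taken from the list above.
const FIXED_SIZES = { INTEGER: 8, FLOAT: 8, BOOLEAN: 1, TIMESTAMP: 8 };

// Estimate the size of one row.
// `schema` maps column name -> type, `row` maps column name -> value.
// Both are hypothetical examples, not the author's real schema.
function estimateRowBytes(schema, row) {
  let total = 0;
  for (const [name, type] of Object.entries(schema)) {
    if (type === 'STRING') {
      // STRING: 2 bytes + size of the UTF-8 encoded value.
      total += 2 + Buffer.byteLength(String(row[name]), 'utf8');
    } else if (type in FIXED_SIZES) {
      total += FIXED_SIZES[type];
    } else {
      throw new Error(`No size rule in this sketch for type ${type}`);
    }
  }
  return total;
}

const schema = { user_id: 'INTEGER', event: 'STRING', ts: 'TIMESTAMP', is_paid: 'BOOLEAN' };
const sampleRow = { user_id: 42, event: 'page_view', ts: '2014-08-19 12:41:35.220 UTC', is_paid: true };

// 8 (INTEGER) + 2 + 9 (STRING 'page_view') + 8 (TIMESTAMP) + 1 (BOOLEAN)
const rowBytes = estimateRowBytes(schema, sampleRow);
console.log(rowBytes); // 28
```

Multiplying such an estimate by the daily event count gives a rough idea of daily data volume before loading anything.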
BigQuery -> loadData()
Formats: CSV, JSON (newline delimited), Avro, Parquet (experimental)
Tools: Web UI, bq, API
Source:
local files,
Cloud Storage, [demo]
Cloud Datastore (backup files),
POST requests,
SQL DML*
Google Sheets
- Federated Data Sources
- Streaming Inserts
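"JSON (newline delimited)" in the formats list means one complete JSON object per line, with no enclosing array and no commas between rows. A minimal sketch of producing such a file body, assuming hypothetical event rows (the field names and the helper `toNdjson` are illustrative, not from the talk):

```javascript
// Serialize row objects as newline-delimited JSON:
// one JSON object per line, each line terminated by '\n'.
function toNdjson(rows) {
  return rows.map((r) => JSON.stringify(r) + '\n').join('');
}

// Hypothetical event rows, for illustration only.
const events = [
  { user_id: 1, event: 'signup', ts: '2014-08-19 12:41:35.220 UTC' },
  { user_id: 2, event: 'login', ts: '2014-08-19 12:42:01.000 UTC' },
];

const ndjson = toNdjson(events);
console.log(ndjson);
```

Because every line is independently parseable, a file in this shape can be split or concatenated without re-serializing it.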
BigQuery -> loadData()
bq load ...
BigQuery -> loadData()
Got some rows?
BigQuery -> SELECT … FROM surprise…
query:
SELECT { * | field_path.* | expression } [ [ AS ] alias ] [ , ... ]
[ FROM from_body
[ WHERE bool_expression ]
[ OMIT RECORD IF bool_expression]
[ GROUP [ EACH ] BY [ ROLLUP ] { field_name_or_alias } [ , ... ] ]
[ HAVING bool_expression ]
[ ORDER BY field_name_or_alias [ { DESC | ASC } ] [, ... ] ]
[ LIMIT n ]
];
from_body:
from_item [, ...] | # Warning: Comma means UNION ALL here
from_item [ join_type ] JOIN [ EACH ] from_item [ ON join_predicate ] |
(FLATTEN({ table_name | (query) }, field_name_or_alias)) |
table_wildcard_function
from_item:
{ table_name | (query) } [ [ AS ] alias ]
join_type:
{ INNER | [ FULL ] [ OUTER ] | RIGHT [ OUTER ] | LEFT [ OUTER ] | CROSS }
BigQuery -> SELECT … FROM surprise…
Date-Partitioned Tables [demo]
Table Decorators - See the past w/ @
Table Wildcard Functions - TABLE_DATE_RANGE() & TABLE_QUERY()
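TABLE_DATE_RANGE() works on daily tables that share a prefix and end in a YYYYMMDD suffix. As a rough illustration of which tables a date range covers, the sketch below lists the names locally. The prefix `events_` and the helper `dailyTableNames` are assumptions made for this example; the code only imitates the naming convention and does not call BigQuery.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// List <prefix>YYYYMMDD names for every UTC day from startDate to endDate, inclusive.
function dailyTableNames(prefix, startDate, endDate) {
  const names = [];
  // Start at 00:00:00 UTC of the first day; UTC days are always 24 hours long.
  let t = Date.UTC(startDate.getUTCFullYear(), startDate.getUTCMonth(), startDate.getUTCDate());
  for (; t <= endDate.getTime(); t += MS_PER_DAY) {
    // '2014-12-30T00:00:00.000Z' -> '20141230'
    const suffix = new Date(t).toISOString().slice(0, 10).replace(/-/g, '');
    names.push(prefix + suffix);
  }
  return names;
}

const tables = dailyTableNames(
  'events_', // hypothetical table prefix
  new Date('2014-12-30T00:00:00Z'),
  new Date('2015-01-02T00:00:00Z')
);
console.log(tables);
// [ 'events_20141230', 'events_20141231', 'events_20150101', 'events_20150102' ]
```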
Interesting functions
- DateTime » UTC_USEC_TO_DAY/HOUR/MONTH/WEEK/YEAR()
» Shifts a UNIX timestamp in microseconds to the beginning of the period it occurs in.
- JSON_EXTRACT[_SCALAR]()
- URL functions » HOST(), DOMAIN(), TLD()
- REGEXP_MATCH(), REGEXP_EXTRACT()
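The "shift to the beginning of the period" behaviour of the UTC_USEC_TO_* functions is plain integer arithmetic on microsecond timestamps. The sketch below imitates the day and hour variants locally in JavaScript; `floorUsec` is a hypothetical helper written for this illustration, not BigQuery's implementation.

```javascript
const USEC_PER_HOUR = 60 * 60 * 1000 * 1000;
const USEC_PER_DAY = 24 * USEC_PER_HOUR;

// Round a UNIX timestamp in microseconds down to the start of its period.
// The double modulo keeps the result correct for negative timestamps too.
function floorUsec(usec, periodUsec) {
  const remainder = ((usec % periodUsec) + periodUsec) % periodUsec;
  return usec - remainder;
}

// 2014-08-19 12:41:35.220 UTC, expressed in microseconds.
const tsUsec = Date.UTC(2014, 7, 19, 12, 41, 35, 220) * 1000;

const dayStart = floorUsec(tsUsec, USEC_PER_DAY);
const hourStart = floorUsec(tsUsec, USEC_PER_HOUR);

console.log(new Date(dayStart / 1000).toISOString());  // 2014-08-19T00:00:00.000Z
console.log(new Date(hourStart / 1000).toISOString()); // 2014-08-19T12:00:00.000Z
```

Grouping by such a floored value is what turns raw event timestamps into per-day or per-hour buckets.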
BigQuery -> User Defined Functions
bigquery.defineFunction(
  'expandAssetLibrary', // Name of the function exported to SQL
  ['user_id', 'video_id', 'stage_settings'], // Names of input columns
  [ {'name': 'user_id', 'type': 'integer'}, // Output schema
    {'name': 'video_id', 'type': 'string'},
    {'name': 'asset', 'type': 'string'} ],
  expandAssetLibrary // Reference to JavaScript UDF
);
function expandAssetLibrary(row, emit) { …………………………
  emit({ user_id: row.user_id, video_id: row.video_id, asset: ss.url.replace('http://', '') });
}
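A UDF of this shape receives one input row and calls `emit` once per output row, so it can be exercised locally with a stand-in for `emit` before it is registered. The sketch below is a hypothetical example, not the elided body from the slide: the function `expandAssetsSketch`, the helper `runUdf`, and above all the assumed JSON layout of `stage_settings` are invented for illustration.

```javascript
// Hypothetical UDF with the same (row, emit) signature as the slide's function.
// ASSUMPTION: stage_settings is a JSON string shaped like
// {"assets":[{"url":"..."}]}; the real structure is not shown in the talk.
function expandAssetsSketch(row, emit) {
  const settings = JSON.parse(row.stage_settings);
  for (const ss of settings.assets) {
    // One output row per asset, with the 'http://' prefix stripped.
    emit({
      user_id: row.user_id,
      video_id: row.video_id,
      asset: ss.url.replace('http://', ''),
    });
  }
}

// Local stand-in for the emit callback: collect emitted rows in an array.
function runUdf(udf, inputRows) {
  const out = [];
  for (const row of inputRows) {
    udf(row, (r) => out.push(r));
  }
  return out;
}

const emitted = runUdf(expandAssetsSketch, [
  {
    user_id: 7,
    video_id: 'v1',
    stage_settings:
      '{"assets":[{"url":"http://cdn.example.com/a.png"},{"url":"http://cdn.example.com/b.png"}]}',
  },
]);
console.log(emitted);
```

One input row producing two output rows is the point of the `emit` callback: the output schema is declared separately from the input columns, so a UDF can fan a row out.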
BigQuery -> DML
Standard SQL only
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements per day per project: 500
Maximum INSERT statements per day per table: 1,000
Maximum INSERT statements per day per project: 10,000
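These per-table limits translate into a pacing budget. A short worked calculation, using only the limits listed above and the 40,000,000 events/day average quoted earlier in the talk (the helper names are hypothetical):

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

// Daily DML limits per table, as listed above.
const PER_TABLE_LIMITS = { updateDelete: 48, insert: 1000 };

// Average spacing in seconds if the daily budget is spread evenly over 24 hours.
function minSpacingSeconds(statementsPerDay) {
  return SECONDS_PER_DAY / statementsPerDay;
}

// Rows each INSERT statement would need to carry to fit a daily row volume
// into the per-table INSERT budget.
function rowsPerInsert(rowsPerDay, insertsPerDay) {
  return Math.ceil(rowsPerDay / insertsPerDay);
}

const updateSpacing = minSpacingSeconds(PER_TABLE_LIMITS.updateDelete);
const insertSpacing = minSpacingSeconds(PER_TABLE_LIMITS.insert);
const batchSize = rowsPerInsert(40000000, PER_TABLE_LIMITS.insert);

console.log(updateSpacing / 60); // 30   (minutes between UPDATE/DELETE statements)
console.log(insertSpacing);      // 86.4 (seconds between INSERT statements)
console.log(batchSize);          // 40000 (rows per INSERT at 40M events/day)
```

Under these limits, DML suits occasional corrections rather than bulk ingestion, which is where load jobs and streaming inserts come in.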
BigQuery -> export()
To: Google Cloud Storage
Format: CSV, JSON [.gz], Avro
…1G files
BigQuery -> some (Big)Queries
SELECT year, count(1)
FROM [bigquery-public-data:samples.natality]
WHERE father_age < 18
GROUP BY year
ORDER BY year
SELECT year, count(1)
FROM [bigquery-public-data:samples.natality]
WHERE mother_age < 18
GROUP BY year
ORDER BY year
SELECT table_id, row_count, CEIL(size_bytes/POW(1024, 3)) AS gb
FROM [bigquery-public-data:ghcn_m.__TABLES__] ORDER BY gb DESC
BigQuery -> some (Big)Queries
SELECT REGEXP_EXTRACT(path, r'.*\.(.*)$') AS file_extension,
COUNT(1) AS k
FROM [bigquery-public-data:github_repos.files]
GROUP BY file_extension
ORDER BY k DESC
LIMIT 20
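Whether the dot in front of the capture group is escaped decides what this extension query returns. The sketch below compares the two forms locally; it uses JavaScript's regex engine rather than BigQuery's, which is an assumption that holds for these simple patterns, and the sample paths are made up.

```javascript
// '.' matches any character; '\.' matches only a literal dot.
const unescaped = /.*.(.*)$/;
const escaped = /.*\.(.*)$/;

const samplePath = 'src/main/app.test.js'; // hypothetical file path

// Unescaped: the greedy '.*' takes everything but the last character,
// '.' consumes that character, and the capture group is left empty.
const withUnescaped = samplePath.match(unescaped)[1];

// Escaped: the greedy '.*' backs off to the last literal dot,
// so the capture group holds the extension.
const withEscaped = samplePath.match(escaped)[1];

// A path with no dot does not match the escaped pattern at all.
const noExtension = 'Makefile'.match(escaped);

console.log(JSON.stringify(withUnescaped)); // ""
console.log(withEscaped);                   // js
console.log(noExtension);                   // null
```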
SELECT table_id, row_count,
CEIL(size_bytes/POW(1024, 3)) AS gb
FROM [bigquery-public-data:github_repos.__TABLES__]
ORDER BY gb DESC
