Analytics metrics delivery and
ML Feature visualization
Evolution of Data Platform at GoPro
ABOUT SPEAKER: CHESTER CHEN
• Head of Data Science & Engineering (DSE) at GoPro
• Prev. Director of Engineering, Alpine Data Labs
• Founder and Organizer of SF Big Analytics meetup
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Platform at GoPro
SF BIG ANALYTICS: Recent Talks
SF BIG ANALYTICS: Upcoming Talks
AGENDA
• Business Use Cases
• Evolution of GoPro Data Platform
• Analytics Metrics Delivery via Slack
• ML Feature Visualization with Google Facets and Spark
GROWING DATA NEED FROM GOPRO ECOSYSTEM
Data sources: consumer devices, GoPro apps, e-commerce, social media/OTT, 3rd-party data
Data Analytics Platform
Outputs: product insight, user segmentation, CRM/marketing/personalization
EXAMPLES OF ANALYTICS USE CASES
• Product Analytics
• feature adoption, user engagement, user segmentation, churn analysis, funnel analysis,
conversion rate, etc.
• Web/E-Commerce Analytics
• Camera Analytics
• Scene change detection, feature usage, etc.
• Mobile Analytics
• Camera connections, story sharing etc.
• GoPro Plus Analytics
• CRM Analytics
• Digital Marketing Analytics
• Social Media Analytics
• Cloud Media Analysis
• Media classification, recommendations, storage analysis.
Evolution of Data Platform
EVOLUTION OF DATA PLATFORM
EVOLUTION OF DATA PLATFORM
DATA PLATFORM ARCHITECTURE TRANSFORMATION
Batch Ingestion Framework: batch ingestion, pre-processing
Streaming ingestion, batch ingestion
Cloud-Based Elastic Clusters
Plot.ly server, Tableau server, external services, notebooks
Inputs: REST API, FTP, S3 sync, etc.
Dynamic DDL, state sync, Parquet
STREAMING PIPELINES
Spark Cluster (long-running cluster)
BATCH JOBS
Job Gateway
Scheduled jobs (production job.conf): Spark cluster, new cluster per job
Dev jobs from dev machines (dev job.conf): Spark cluster, new or existing cluster
INTERACTIVE/NOTEBOOKS
Notebook scripts (SQL, Python, Scala) and scheduled notebook jobs
Long-running Spark clusters: auto-scale, mixed on-demand & spot instances
AIRFLOW SETUP
AIRFLOW SETUP
Web Server LB fronting Web Server A and Web Server B
Scheduler, Airflow Metastore, Message Queue
Multiple Workers
Airflow DAGs pushed to S3 and synced to the Airflow nodes
TAKEAWAYS
• Key Changes
• Centralized Hive metastore
• Separate compute and storage needs
• Leverage S3 as storage
• Horizontal scale with cluster elasticity
• Less time spent managing infrastructure
• Key Benefits
• Cost
• Reduced redundant storage and compute cost
• Use smaller instance types
• 60% AWS cost saving compared to 1 year ago
• Operation
• Reduced complexity of DevOps support
• Analytics tools
• SQL only => Notebooks with SQL, Python, Scala
CONFIGURABLE SPARK BATCH INGESTION FRAMEWORK
HIVE SQL → Spark
EVOLUTION OF DATA PLATFORM
BATCH INGESTION
Sources: GoPro product data and multiple 3rd-party data providers
Delivery methods: REST APIs, SFTP, S3 sync
Batch data downloads; input file formats: CSV, JSON
Spark cluster, new cluster per job
TABLE WRITER JOBS
• Jobs are identified by JobType, JobName, JobConfig
• The majority of the Spark ETL jobs are table writers:
• Load data into a DataFrame
• DataFrame-to-DataFrame transformation
• Output the DataFrame to a Hive table
• The majority of table writer jobs can be decomposed into one of the following sub-jobs
(a minimal sketch of the load/transform/write flow follows below)
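The three steps above map directly onto Spark's DataFrame API. Below is a minimal, hypothetical sketch of that load → transform → write flow; the names are illustrative only and are not the framework classes shown on the next slides.

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Hypothetical sketch of a table writer's load -> transform -> write flow.
object CsvToHiveSketch {
  def run(spark: SparkSession, inputPath: String, database: String, table: String): Unit = {
    // 1. Load data into a DataFrame
    val raw: DataFrame = spark.read.option("header", "true").csv(inputPath)
    // 2. DataFrame-to-DataFrame transformation (placeholder: drop all-null rows)
    val transformed = raw.na.drop("all")
    // 3. Output the DataFrame to a Hive table
    transformed.write.mode(SaveMode.Append).format("parquet").saveAsTable(s"$database.$table")
  }
}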
TABLE WRITER JOBS
SparkJob
HiveTableWriter
JDBCToHiveTableWriter
AbstractCSVHiveTableWriter AbstractJSONHiveTableWriter
CSVTableWriter JSONTableWriter
FileToHiveTableWriter
HBaseToHiveTableWriter TableToHiveTableWriter
HBaseSnapshotJob
TableSnapshotJob
CoreTableWriter
Customized JSON Job, Customized CSV Job (mixin)
All jobs share the same configuration loading, job state, and error reporting
All table writers have Dynamic DDL capabilities; once the data becomes a DataFrame,
they all behave the same
CSV and JSON have different loaders
A different loader is needed to load HBase records into a DataFrame
Aggregate Jobs
HIVE TABLE WRITER JOB
trait HiveTableWriter extends CoreHiveTableWriter with SparkJob {
  // Entry point: jobType and jobName select the configuration blocks to use
  def run(sc: SparkContext, jobType: String, jobName: String, config: Config)
  // Load the grouped inputs into DataFrames
  def load(sqlContext: SQLContext, ioInfos: Seq[(String, Seq[InputOutputInfo])]): Seq[(InputOutputInfo, DataFrame)]
  def initProcess(sqlContext: SQLContext, jobTypeConfig: Config, jobConfig: Config)
  def preProcess(hadoopConf: Configuration, ioInfos: Seq[InputOutputInfo]): Seq[InputOutputInfo]
  def process(jobName: String, sqlContext: SQLContext, ioInfos: Seq[InputOutputInfo], jobTypeConfig: Config, jobConfig: Config)
  def postProcess(….)
  // Discover the input/output locations for this job, then group them for loading
  def getInputOutputInfos(sc: SparkContext, jobName: String, jobTypeConfig: Config, jobConfig: Config): Seq[InputOutputInfo]
  def groupIoInfos(ioInfos: Seq[InputOutputInfo]): Seq[(String, Seq[InputOutputInfo])]
}
ETL JOB CONFIGURATION
gopro.dse.config.etl {
mobile-job {
conf {}
process {}
input {}
output {}
post.process {}
}
}
include classpath("conf/production/etl_mobile_quik.conf")
include classpath("conf/production/etl_mobile_capture.conf")
include classpath("conf/production/etl_mobile_product_events.conf")
Job-level conf overrides the JobType conf (a sketch of resolving the effective config follows below)
Job specifics include:
JobType
JobName
Input & output specifications
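The HOCON syntax and classpath includes above suggest the Typesafe Config library. Below is a minimal sketch of how a job-level block could be layered over its JobType defaults; the method names are illustrative, not the framework's actual API.

import com.typesafe.config.{Config, ConfigFactory}

// Illustrative only: layer the job-specific block over the JobType defaults,
// so job-level keys override JobType-level keys.
object JobConfigSketch {
  def effectiveConfig(root: Config, jobType: String, jobName: String): Config = {
    val jobTypeConf = root.getConfig(s"gopro.dse.config.$jobType")
    val jobConf     = jobTypeConf.getConfig(jobName)
    jobConf.withFallback(jobTypeConf)
  }

  def main(args: Array[String]): Unit = {
    val root = ConfigFactory.load()                       // loads application.conf and its includes
    val conf = effectiveConfig(root, "etl", "mobile-job")
    println(conf.getConfig("output"))
  }
}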
ETL JOB CONFIGURATION
xyz {
process {}
input {
delimiter = ","
inputDirPattern = "s3a://teambucket/xyz/raw/production"
file.ext = "csv"
file.format = "csv"
date.format = "yyyy-MM-dd hh:mm:ss"
table.name.extractor.method.name = "com.gopro.dse.batch.spark.job.FromFileName"
}
output {
database = "mobile",
file.format = "parquet"
date.format = "yyyy-MM-dd hh:mm:ss"
partitions = 2
file.compression.codec.key = "spark.sql.parquet.compression.codec"
file.compression.codec.value = "gzip"
save.mode = "append"
transformers = [com.gopro.dse.batch.spark.transformer.csv.xyz.XYZColumnTransformer]
}
post.process {
deleteSource = true
}
}
Save Mode
JobName
Input specification
output specification
Files need to go to the proper tables
TABLE NAME GENERATION
• Table Name Extractor
• From File Name
• From Directory Name
• Custom Plugin
EXTRACT TABLE NAMES
• From File Name
• /databucket/3rdparty/ABC/campaign-20180212.csv
• /databucket/3rdparty/ABC/campaign-20180213.csv
• /databucket/3rdparty/ABC/campaign-20180214.csv
• From Directory Name
• /databucket/3rdparty/ABC/campaign/file-20180212.csv
• /databucket/3rdparty/ABC/campaign/file-20180213.csv
• /databucket/3rdparty/ABC/campaign/file-20180214.csv
• From ID Mapping
• /databucket/ABC/2017/01/11/b2a932aeddbf0f11bae9573/10.log.gz
• /databucket/ABC/2017/01/11/b2a932aeddbf0f11bae9573/11.log.gz
• /databucket/ABC/2018/02/17/ae6905b068c7beb08d681a5/12.log.gz
• /databucket/ABC/2018/02/18/ae6905b068c7beb08d681a5/13.log.gz
• Table Name, File Date
• (campaign, 2018-02-12)
• (campaign, 2018-02-13)
• (campaign, 2018-02-14)
• Table Name, File Date
• (campaign, 2018-02-12)
• (campaign, 2018-02-13)
• (campaign, 2018-02-14)
• Table Name, File Date
Configuration
• b2a932aeddbf0f11bae9573 → mobile_ios
• ae6905b068c7beb08d681a → mobile_android
Table Extraction
• (mobile_ios, 2017-01-11)
• (mobile_android, 2018-02-17)
• (mobile_android, 2018-02-18)
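As a rough illustration of the file-name case (the FromFileName extractor configured on the earlier xyz slide), a table name / file date extraction could look like the sketch below. The regex and return type are assumptions, not the actual plugin interface.

// Assumed behavior of a file-name-based extractor:
// /databucket/3rdparty/ABC/campaign-20180212.csv -> ("campaign", "2018-02-12")
object FromFileNameSketch {
  private val Pattern = """([A-Za-z_]+)-(\d{4})(\d{2})(\d{2})\.csv$""".r.unanchored

  def extract(path: String): Option[(String, String)] = path match {
    case Pattern(table, y, m, d) => Some((table, s"$y-$m-$d"))
    case _                       => None
  }
}
// FromFileNameSketch.extract("/databucket/3rdparty/ABC/campaign-20180212.csv")
//   => Some(("campaign", "2018-02-12"))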
Data Transformation
ETL With SQL & Scala
DATA TRANSFORMATION
• Hive SQL over JDBC via Beeline
• Suitable for non-Java/Scala/Python programmers
• Spark Job
• Requires Spark and Scala knowledge; need to set up jobs, configurations, etc.
• Dynamic Scala Scripts
• Scala as a script, compiled at runtime, mixed with Spark SQL
SCALA SCRIPTS
• Define a special SparkJob : Spark Job Code Runner
• Load Scala script files from specified location (defined by config)
• Dynamically compile the Scala code into classes
• For the compiled classes: run the Spark jobs defined in the scripts
• Twitter EVAL Util: Dynamically evaluates Scala strings and files.
• <groupId>com.twitter</groupId>
<artifactId>util-eval_2.11</artifactId>
<version>6.24.0</version>
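For sbt-based builds, the same dependency would typically be declared as:

libraryDependencies += "com.twitter" %% "util-eval" % "6.24.0"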
SCALA SCRIPTS
object SparkJobCodeRunner extends SparkJob {
private val LOG = LoggerFactory.getLogger(getClass)
import collection.JavaConverters._
override def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
val jobFileNames: List[String] = //...
jobFileNames.foreach{ x =>
val clazzes : Option[Any] = evalFromFileName[Any](x)
clazzes.foreach{c =>
c match {
case job: SparkJob => job.run(sc, jobType, jobName, config)
case _ => LOG.info("not match")
}
}
}
}
}
SCALA SCRIPTS
import com.twitter.util.Eval
def evalFromFile[T](path: Path)(implicit header: String = ""): Option[T] = {
val fs = //get Hadoop File System …
eval(IOUtils.toString(fs.open(path), "UTF-8"))(header)
}
def eval[T](code: String)(implicit header: String = ""): Option[T] =
Try(Eval[T](header + "\n" + code)).toOption
SCALA SCRIPTS EXAMPLES -- ONE SCALA SCRIPT FILE
class CameraAggCaptureMainJob extends SparkJob {
  def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
    val sqlContext: SQLContext = HiveContextFactory.getOrCreate(sc)
    val cameraCleanDataSchema = … // define DataFrame schema
    val cameraCleanDataStageDF = sqlContext.read.schema(cameraCleanDataSchema)
      .json("s3a://databucket/camera/work/production/clean-events/final/*")
    cameraCleanDataStageDF.createOrReplaceTempView("camera_clean_data")
    sqlContext.sql("""set hive.exec.dynamic.partition.mode=nonstrict
      set hive.enforce.bucketing=false
      set hive.auto.convert.join=false
      set hive.merge.mapredfiles=true""")
    sqlContext.sql("""insert overwrite table work.camera_setting_shutter_dse_on
      select row_number() over (partition by metadata_file_name order by log_ts) , …. """)
    //rest of code
  }
}
new CameraAggCaptureMainJob
Data Democratization,
Visualization and Data
Management
EVOLUTION OF DATA PLATFORM
DATA DEMOCRATIZATION & MANAGEMENT FOCUS AREAS
• Data Metrics Delivery
• Delivery to Slack: make metrics more accessible to a broader audience
• Data Slice & Dice
• Leverage Real-Time OLAP database (Druid) (ongoing project)
• Analytics Visualization (ongoing project)
• Leverage Superset and Data Management Application
• BedRock: Self-Service & Data Management (ongoing project)
• Pipeline Monitoring
• Product Analytics Visualization
• Self-service Ingestion
• ML Feature Visualization
Spark cluster (new or existing cluster), Spark cluster (long-running cluster)
Metrics batch ingestion, streaming ingestion, output metrics
BedRock: data visualization & management (work in progress)
Delivering Metrics via Slack
SLACK METRICS DELIVERY
SLACK METRICS DELIVERY
• Why Slack ?
• Push vs. Pull -- Easy Access
• Avoid another login when viewing metrics
• When connected to Slack, you are already logged in
• Move key metrics away from the Tableau dashboard and put
metrics generation into the software engineering process
• SQL code is under source control
• The publishing job is scheduled and its performance is monitored
• Discussions/questions/comments on specific metrics can be
done directly in the channel with the people involved
SLACK DELIVERY FRAMEWORK
• Slack Metrics Delivery Framework
• Configuration Driven
• Multiple private Channels : Mobile/Cloud/Subscription/Web etc.
• Daily/Weekly/Monthly Delivery and comparison
• New metrics can be added easily with new SQL and configurations
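As a rough sketch of the delivery step only (not the production framework), posting a rendered metrics message to a configured incoming-webhook URL can be as simple as an HTTP POST of a JSON payload; the object and method names here are illustrative.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Minimal sketch: post a metrics summary text to a Slack incoming webhook.
// The webhook URL comes from configuration, as in the slack-*.conf examples below.
object SlackWebhookSketch {
  def post(webhookUrl: String, text: String): Int = {
    val conn = new URL(webhookUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val payload = s"""{"text": "$text"}"""           // real code should JSON-escape the text
    val out = conn.getOutputStream
    try out.write(payload.getBytes(StandardCharsets.UTF_8)) finally out.close()
    conn.getResponseCode                             // Slack returns 200 on success
  }
}
// SlackWebhookSketch.post(webhookUrlFromConfig, "weekly-capture-users-by-platform: <rendered metrics>")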
SLACK METRICS CONCEPTS
• Slack Job →
• Channels (private channels) →
• Metrics Groups →
• Metrics1
• …
• MetricsN
• Main Query
• Compare Query (Optional)
• Chart Query (Optional)
• Persistence (optional)
• Hive + S3
• Additional deliveries (Optional)
• Kafka
• Other Cache stores (Http Post)
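One way to picture the hierarchy above is as a simple data model. The case classes below are illustrative only and are not the framework's actual classes.

// Illustrative model of the Slack Job -> Channel -> Metrics Group -> Metric hierarchy.
case class Metric(
  name: String,
  mainQuery: String,
  compareQuery: Option[String] = None,   // optional period-over-period comparison
  chartQuery: Option[String] = None)     // optional query feeding the chart image

case class MetricsGroup(name: String, metrics: Seq[Metric])  // one group = one Slack message

case class Channel(name: String, webhookUrl: String, groups: Seq[MetricsGroup])

case class SlackJob(
  name: String,
  channels: Seq[Channel],
  persistMetrics: Boolean = false)       // optional Hive + S3 persistence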
SLACK KPI DELIVERY ARCHITECTURE
Slack Spark Job: generates metrics JSON and saves metrics to a Hive table
Plot.ly JSON is HTTP POSTed to a REST API server to generate the graph; the image is saved and its URL returned
The Slack message JSON (with the image URL) is HTTP POSTed to Slack via webhooks
CONFIGURATION-DRIVEN
slack-plus-push-weekly { //job name
persist-metrics="true"
channels {
dse-metrics {
post-urls {
plus-metrics = "https://hooks.slack.com/services/XXXX"
dse-metrics-test = "https://hooks.slack.com/services/XXXX"
}
plus-metrics { //metrics group
//metrics in the same group will be delivered together in one message
//metrics in different groups will be delivered as separate messages
//overwrite above template with specific name
}
}
}
} //slack-plus-push-weekly
SLACK METRICS CONFIGURATION
slack-mobile-push-weekly.channels.mobile-metrics.capture-metrics { //Job, Channel, KPI Group
//…
weekly-capture-users-by-platform { //metrics name
slack-display.attachment.title = "GoPro Mobile App -- Users by Platform"
metric-period = "weekly"
slack-display.chartstyle { … }
query = """ … """
compare.query = """ … """
chart.query = """ … """
}
//rest of configuration
}
SLACK DELIVERY BENEFITS
• Pros:
• Quick and easy access via Slack
• Can quickly deliver to engineering managers, executives, business owners, and product
managers
• 100+ members have subscribed to different channels since we launched the service
• Cons
• Limited by Slack UI real estate: can only display key metrics in a two-column format,
so it is only suitable for high-level summary metrics
Machine Learning Feature
Visualization with Facets + Spark
EVOLUTION OF DATA PLATFORM
FEATURE VISUALIZATION
• Explore Feature Visualization via Google Facets
• Part 1: Facets Overview
• Part 2: Facets Dive
• What is Facets Overview ?
FACETS OVERVIEW INTRODUCTION
• From Facets Home Page
• https://pair-code.github.io/facets/
• "Facets Overview "takes input feature data from any number of datasets, analyzes them feature by
feature and visualizes the analysis.
• Overview can help uncover issues with datasets, including the following:
• Unexpected feature values
• Missing feature values for a large number of examples
• Training/serving skew
• Training/test/validation set skew
• Key aspects of the visualization are outlier detection and distribution comparison across multiple
datasets.
• Interesting values (such as a high proportion of missing data, or very different distributions of a
feature across multiple datasets) are highlighted in red.
• Features can be sorted by values of interest such as the number of missing values or the skew
between the different datasets.
FACETS OVERVIEW SAMPLE
FACETS OVERVIEW IMPLEMENTATIONS
• The Facets Overview implementation consists of
• Feature Statistics Protocol Buffer definition
• Feature Statistics Generation
• Visualization
• Visualization
• The visualizations are implemented as Polymer web components, backed
by Typescript code
• It can be embedded into Jupyter notebooks or webpages.
• Feature Statistics Generation
• There are two implementations for the stats generation: Python and JavaScript
• Python: uses NumPy and pandas to generate stats
• JavaScript: uses JavaScript to generate stats
• Both implementations run stats generation in the browser
FACETS OVERVIEW
FEATURE OVERVIEW SPARK
• Initial exploration attempt
• Is it possible to generate stats for larger datasets while keeping the stats size small?
• Can we generate stats leveraging the distributed computing capability
of Spark instead of just using one node?
• Can we generate the stats in Spark, and then use them from Python
and/or JavaScript?
FACETS OVERVIEW + SPARK
ScalaPB
PREPARE SPARK DATA FRAME
case class NamedDataFrame(name:String, data: DataFrame)
val features = Array("Age", "Workclass", ….)
val trainData: DataFrame = loadCSVFile("./adult.data.csv")
val testData = loadCSVFile("./adult.test.txt")
val train = trainData.toDF(features: _*)
val test = testData.toDF(features: _*)
val dataframes = List(NamedDataFrame(name = "train", train),
NamedDataFrame(name = "test", test))
SPARK FACETS STATS GENERATOR
val generator = new FeatureStatsGenerator(DatasetFeatureStatisticsList())
val proto = generator.protoFromDataFrames(dataframes)
persistProto(proto)
SPARK FACETS STATS GENERATOR
def protoFromDataFrames(dataFrames: List[NamedDataFrame],
features : Set[String] = Set.empty[String],
histgmCatLevelsCount:Option[Int]=None): DatasetFeatureStatisticsList
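persistProto is not shown either. A sketch, under the assumption that DatasetFeatureStatisticsList is the ScalaPB-generated message (which exposes toByteArray), could write both the binary protobuf and the Base64 string discussed in the findings later:

import java.nio.file.{Files, Paths}
import java.util.Base64

// Sketch only: persist the generated stats proto in the two forms Facets can consume.
def persistProto(proto: DatasetFeatureStatisticsList, dir: String = "."): Unit = {
  val bytes = proto.toByteArray
  // Binary protobuf file (loadable from Python)
  Files.write(Paths.get(dir, "stats.pb"), bytes)
  // Base64-encoded string (embeddable in the Facets Overview HTML / notebooks)
  val b64 = Base64.getEncoder.encodeToString(bytes)
  Files.write(Paths.get(dir, "stats.pb.txt"), b64.getBytes("UTF-8"))
}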
FACET OVERVIEW SPARK
FACET OVERVIEW SPARK
DEMO
INITIAL FINDINGS
• Implementation
• The first-pass implementation is not efficient
• We go through each feature in multiple passes; as the number of features increases, performance
suffers, which limits the number of features that can be used
• The size of the dataset used to generate the stats also determines the size of the generated protobuf file
• I haven't dived deeper into what contributes to the change in size
• The combination of data size and feature count can produce a large file, which won't fit in the browser
• With Spark DataFrames, we can't support TensorFlow Records
• The Base64-encoded protobuf string can be loaded by Python or JavaScript
• The protobuf binary file can also be loaded by Python
• But somehow it cannot be loaded by JavaScript
WHAT’S NEXT?
• Improve implementation performance
• When we have a lot of data and features, what data size generates a stats size that
can be loaded into a browser or notebook?
• For example, one experiment: 300 features → 200 MB stats size
• How do we efficiently partition the features so that they remain viewable?
• Data is changing: how can we incrementally update the stats on a regular
basis?
• How do we integrate this into production?
FINAL THOUGHTS
FINAL THOUGHTS
• We are still in the early stage of the Data Platform evolution
• We will continue to share our experience with you along the way
• Questions?
Thank You
Chester Chen, Ph.D.
Data Science & Engineering
GoPro