Introduction to AWS Big Data
Omid Vahdaty, Big Data Ninja
When the data outgrows your ability to process
● Volume
● Velocity
● Variety
EMR
● Basic: the simplest of all Hadoop distributions.
● Not as feature-rich as Cloudera.
Collect
● Firehose
● Snowball
● SQS
● EC2
Store
● S3
● Kinesis
● RDS
● DynamoDB
● CloudSearch
● IoT
Process + Analyze
● Lambda
● EMR
● Redshift
● Machine learning
● Elasticsearch
● Data Pipeline
● Athena
Visualize
● QuickSight
● Elasticsearch Service
History of tools
HDFS - from Google (based on the GFS paper)
Cassandra - from Facebook; columnar NoSQL store, materialized views, secondary indexes
Kafka - from LinkedIn.
HBase - NoSQL, Hadoop ecosystem.
EMR Hadoop ecosystem
● Spark - the recommended option; it can do everything below…
● Hive
● Oozie
● Mahout - machine-learning library (MLlib is better; it runs on Spark)
● Presto - better than Hive? More generic than Hive.
● Pig - scripting for big data.
● Impala - Cloudera, and part of EMR.
ETL Tools
● Attunity
● Splunk
● Semarchy
● Informatica
● Tibco
● Clarit
Architecture
● Decouple :
○ Store
○ Process
○ Store
○ Process
○ insight...
● Rule of thumb: 3 technologies max in a data center, 7 max in the cloud
○ Don't use more, because of the maintenance burden
Architecture considerations
● unStructured? Structured? Semi structured?
● Latency?
● Throughput?
● Concurrency?
● Access patterns?
● PaaS? Max 7 technologies
● IaaS? Max 4 technologies
EcoSystem
● Redshift = analytics
● Aurora = OLTP
● DynamoDB = NoSQL, like MongoDB.
● Lambda
● SQS
● CloudSearch
● Elasticsearch
● Data Pipeline
● Beanstalk - “I have a JAR, install it for me…”
● AWS machine learning
Data ingestion architecture challenges
● Durability
● HA
● Ingestion types:
○ Batch
○ Stream
● Transaction: OLTP, noSQL.
Ingestion options
● Kinesis
● Flume
● Kafka
● S3DistCP copy from
○ S3 to HDFS
○ S3 to S3
○ Cross account
○ Supports compression.
Transfer
● VPN
● Direct Connect
● S3 multipart upload
● Snowball
● IoT
Streaming
● Streaming
● Batch
● Collect
○ Kinesis
○ DynamoDB streams
○ SQS (pull)
○ SNS (push)
○ Kafka (most recommended)
Comparison
Streaming
● Low latency
● Message delivery
● Lambda architecture implementation
● State management
● Time or count based windowing support
● Fault tolerant
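The count-based windowing listed above can be sketched in a few lines. This is an illustrative toy, not any specific framework's API; the function name `count_windows` is made up:

```python
def count_windows(events, size):
    """Group a stream of events into fixed-size (tumbling) count-based windows."""
    window = []
    for e in events:
        window.append(e)
        if len(window) == size:
            yield list(window)
            window.clear()
    if window:  # flush the final partial window
        yield window

# e.g. aggregate a click stream in windows of 3 events
clicks = [1, 2, 3, 4, 5, 6, 7]
sums = [sum(w) for w in count_windows(clicks, 3)]  # [6, 15, 7]
```

A time-based window works the same way, except the flush condition compares event timestamps against a window length instead of counting elements.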
Stream processor comparison
Stream collection options
Kinesis client library
AWS lambda
EMR
Third party
Spark Streaming (minimum latency = 1 sec), near real time, with lots of libraries.
Storm - the most real time (sub-millisecond), Java-code based.
Flink (similar to Spark)
Kinesis
● Stream - collect@source and near real time processing
○ Near real time
○ High throughput
○ Low cost
○ Easy administration - set the desired level of capacity
○ Delivery to: S3, Redshift, DynamoDB
○ Ingress 1 MB/s, egress 2 MB/s per shard. Up to 1,000 transactions per second.
● Analytics - in flight analytics.
● Firehose - parks your data @ the destination.
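Those per-shard limits imply a simple sizing rule: take the maximum shard count demanded by each dimension. A minimal sketch (the helper name `shards_needed` is made up, and the limits are the ones quoted above):

```python
import math

def shards_needed(in_mb_per_s, out_mb_per_s, records_per_s):
    """Shards required for a stream, given per-shard limits of
    1 MB/s ingress, 2 MB/s egress, and 1,000 records/s."""
    return max(
        math.ceil(in_mb_per_s / 1.0),
        math.ceil(out_mb_per_s / 2.0),
        math.ceil(records_per_s / 1000.0),
    )

# e.g. 5 MB/s in, 5 MB/s out, 4,000 records/s -> ingress-bound, 5 shards
shards = shards_needed(5, 5, 4000)  # 5
```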
Kinesis analytics example
CREATE STREAM "INTERMEDIATE_STREAM" (
    hostname VARCHAR(1024),
    logname VARCHAR(1024),
    username VARCHAR(1024),
    requesttime VARCHAR(1024),
    request VARCHAR(1024),
    status VARCHAR(32),
    responsesize VARCHAR(32)
);
-- Data Pump: Take incoming data from SOURCE_SQL_STREAM_001 and insert into INTERMEDIATE_STREAM
KCL
● Read from the stream using the get API
● Build application with the KCL
● Leverage kinesis spout for storm
● Leverage EMR connector
Firehose - for parking
● Not for the fast lane - no in-flight analytics
● Capture, transform, and load.
○ Kinesis
○ S3
○ Redshift
○ Elasticsearch
● Managed Service
● Producer - your input to the delivery stream
● Buffer size in MB
● Buffer interval in seconds.
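The buffering behavior above - flush when either the size hint or the time interval is hit, whichever comes first - can be sketched as a toy model (this is an illustration of the idea, not Firehose's actual implementation; the class name `Buffer` is made up):

```python
class Buffer:
    """Toy model of delivery-stream buffering: flush when either the
    size threshold (bytes) or the interval (seconds) is reached."""

    def __init__(self, size_bytes, interval_s):
        self.size_bytes, self.interval_s = size_bytes, interval_s
        self.records, self.nbytes, self.started = [], 0, None

    def put(self, record, now):
        """Add a record; return the flushed batch if a threshold was hit, else None."""
        if self.started is None:
            self.started = now
        self.records.append(record)
        self.nbytes += len(record)
        if self.nbytes >= self.size_bytes or now - self.started >= self.interval_s:
            return self.flush()
        return None

    def flush(self):
        batch = self.records
        self.records, self.nbytes, self.started = [], 0, None
        return batch
```

Usage: with `Buffer(10, 60)`, putting a 3-byte then a 7-byte record triggers a size-based flush; with a large size threshold, a record arriving 60 seconds after the first triggers a time-based flush.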
Comparison of Kinesis products
● Stream
○ Sub 1 sec processing latency
○ Choice of stream processor (generic)
○ For smaller events
● Firehose
○ Zero admin
○ 4 built-in targets (Redshift, S3, Elasticsearch, etc.)
○ Latency 60 sec minimum.
○ For larger “events”
DynamoDB
● Fully managed NoSQL document / key-value store
○ Tables have no fixed schema
● High performance
○ Single-digit millisecond latency
○ Runs on solid-state drives
● Durable
○ Multi-AZ
○ Fault tolerant, replicated across 3 AZs.
Durability
● Read
○ Eventually consistent
○ Strongly consistent
● Write
○ Quorum ack
○ 3 replicas - always - we can’t change it.
○ Persistence to disk.
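The quorum acknowledgment above can be sketched as: with 3 replicas, a write is acked once a majority confirm it. A minimal illustration (the function name `quorum_ack` is made up):

```python
def quorum_ack(acks, replicas=3):
    """A write is acknowledged once a majority of replicas confirm it.
    `acks` is one boolean per replica."""
    needed = replicas // 2 + 1  # majority: 2 of 3
    return sum(acks) >= needed

quorum_ack([True, True, False])   # acked: 2 of 3 replicas confirmed
quorum_ack([True, False, False])  # not acked yet: only 1 of 3
```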
Indexing and partitioning
● Indexing
○ LSI - local secondary index, a kind of alternate range key
○ GSI - global secondary index - “pivot charts” for your table, a kind of projection (as in Vertica)
● Partitioning
○ Automatic
○ The hash key spreads data across partitions
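The hash-based spreading above amounts to: hash the key, take it modulo the partition count, so equal keys always land on the same partition while distinct keys spread evenly. A sketch of the idea (not DynamoDB's internal hash function; `partition_for` is a made-up name):

```python
import hashlib

def partition_for(hash_key, num_partitions):
    """Deterministically map a hash key to one of N partitions."""
    digest = hashlib.md5(hash_key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Same key -> same partition, every time:
p = partition_for("user#42", 8)
```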
DynamoDB Items and Attributes
● Partition key
● Sort key (optional)
● LSI
● attributes
AWS Titan(on DynamoDB) - Graph DB
● Vertex - nodes
● Edge - relationship b/w nodes.
● Good when you need to investigate more than 2-layer relationships (a join of 4 tables)
● Based on TinkerPop (open source)
● Full-text search with Lucene, Solr, or Elasticsearch
● HA using multi-master replication (Cassandra-backed)
● Scales using the DynamoDB backend.
● Use cases: Cyber , Social networks, Risk management
Elasticache: MemCache / Redis
● Sub-millisecond cache.
● In memory
● No disk I/O when querying
● High throughput
● High availability
● Fully managed
Redshift
● Petabyte-scale database for analytics
○ Columnar
○ MPP
● Complex SQL
RDS
● Flavours
○ MySQL
○ Aurora
○ PostgreSQL
○ Oracle
○ MSSQL
● Multi AZ, HA
● Managed services
Data Processing
● Batch
● Ad hoc queries
● Message
● Stream
● Machine learning
● (By the way: use Parquet, not ORC - it is more commonly used in the ecosystem)
Athena
● Presto
● In memory
● Hive metastore for DDL functionality
○ Complex data types
○ Multiple formats
○ Partitions
EMR
● Pig and Hive can work on top of Spark, but not yet in EMR
● Tez is not working; Hortonworks gave up on it.
● You can run Presto on Hive
Best practice
● Bzip2 - splittable compression
○ You don't have to read the whole file; you can decompress a single bzip2 block and work in parallel
● Snappy - very fast encoding/decoding, but weaker compression.
● Partitions
● Ephemeral EMR
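For a feel of the bzip2 trade-off above, the Python standard library's `bz2` module round-trips data and shows the ratio (this demonstrates the codec, not splitting itself; Snappy is not in the stdlib):

```python
import bz2

# bzip2 compresses in independent blocks, which is what makes .bz2 files
# splittable for parallel processing; the cost is slow encode/decode.
raw = b"some repetitive log line\n" * 10_000
compressed = bz2.compress(raw)

assert bz2.decompress(compressed) == raw   # lossless round trip
assert len(compressed) < len(raw)          # compresses well on repetitive data
```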
EMR ecosystem
● Hive
● Pig
● Hue
● Spark
● Oozie
● Presto
● Ganglia
● Zookeeper
● zeppelin
● For research - public data sets exist on AWS...
EMR Architecture
● Master node
● Core nodes - like data nodes (with storage)
● Task nodes - (not like regular hadoop, extends compute)
● Default replication factor
○ Nodes 1-3 ⇒ 1 factor
○ Nodes 4-9 ⇒ 2 replication factor
○ Nodes 10+ ⇒ 3 replication factor
○ Not relevant for external tables
● Does Not have Standby Master node
● Best for transient cluster (goes up and down every night)
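The default replication rule above maps directly to a small lookup; a sketch of the slide's stated thresholds (the function name is made up):

```python
def default_replication(core_nodes):
    """EMR's default HDFS replication factor by core-node count,
    per the thresholds listed above."""
    if core_nodes <= 3:
        return 1
    if core_nodes <= 9:
        return 2
    return 3

default_replication(4)   # -> 2
default_replication(12)  # -> 3
```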
EMR
● Use Scala
● If you use Spark SQL, even in Python it is OK.
● For code - Python is full of bugs, but connects well to R
● Scala is better - but does not connect easily to R for data science
● Use CloudFormation when ready to deploy fast.
● Check instance I ,
● Dense-storage instances are a good architecture
● Use spot instances - for the task nodes.
● Use tags
● Constantly upgrade the AMI version
● Don't use Tez
● Make sure you choose network-optimized instances
● Resizing the cluster is not recommended
● Bootstrap actions automate the cluster upon provisioning
● Steps automate work on a running cluster
● Use RDS to share the Hive metastore (the metastore is MySQL-based)
● Use R, Kafka, and Impala via bootstrap actions, and many others
CPU features for instances
Sending work to EMR
● Steps -
○ can be added while cluster is running
○ Can be added from the UI / CLI
○ FIFO scheduler by default
● EMR API
● Ganglia: jvm.JvmMetrics.MemHeapUsedM
Landscape
Hive
● SQL over Hadoop.
● Engines: Spark, Tez, MR
● JDBC / ODBC
● Not good when you need to shuffle.
● Connects well with DynamoDB
● SerDes: JSON, Parquet, regex, etc.
Hbase
● NoSQL, like Cassandra
● Sits below YARN
● An HBase agent on each data node, like Impala.
● Writes on HDFS
● Multi-level index.
● (AWS avoids talking about it b/c of DynamoDB; used only when you want to save money)
● Driver from Hive (work from Hive on top of HBase)
Presto
● Like Hive, from Facebook.
● Not always good for joins of 2 large tables.
● Limited by memory
● Not fault tolerant, unlike Hive.
● Optimized for ad hoc queries
Pig
● Distributed Shell scripting
● Generates SQL-like operations.
● Engines: MR, Tez
● S3, DynamoDB access
● Use case: for data scientists who don't know SQL, for systems people, for those who want to avoid Java/Scala
● A fair fight compared to Hive in terms of performance only
● Good for unstructured-file ETL: file to file, and use Sqoop.
Spark
Mahout
● Machine-learning library for Hadoop - not used
● Spark MLlib is used instead.
R
● Open source package for statistical computing.
● Works with EMR
● “Matlab” equivalent
● Works with spark
● Not for developers :) for statisticians
● R is single-threaded - use SparkR to distribute. Not everything works perfectly.
Apache Zeppelin
● Notebook - visualizer
● Built in spark integration
● Interactive data analytics
● Easy collaboration.
● Uses SQL
● Works on top of Hive
● Inside EMR.
● Gives more feedback to let you know where you are
Hue
● Hadoop User Experience
● Logs in real time and failures.
● Multiple users
● Native access to S3.
● File browser to HDFS.
● Manipulate the metastore
● Job Browser
● Query editor
● Hbase browser
● Sqoop editor, Oozie editor, Pig editor
Spark
● In memory
● 10x to 100x faster
● Good optimizer for distribution
● Rich API
● Spark SQL
● Spark Streaming
● Spark ML (ML lib)
● Spark GraphX (DB graphs)
● SparkR
Spark
● RDD
○ An array (data set)
○ Read-only distributed objects cached in memory across the cluster
○ Allows apps to keep a working set in memory for reuse
○ Fault tolerant
○ Ops:
■ Transformations: map / filter / groupBy / join
■ Actions: count / reduce / collect / save / persist
○ Object-oriented ops
○ High-level expressions like lambdas, functions, map
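The transformation/action split above can be illustrated with a toy class: transformations are lazy and only build a plan, and nothing executes until an action runs. This is an illustration of the RDD idea, not Spark's actual API:

```python
class MiniRDD:
    """Toy RDD: transformations are lazy; actions evaluate the pipeline."""

    def __init__(self, data, ops=()):
        self.data, self.ops = data, tuple(ops)

    # Transformations: return a new MiniRDD; nothing executes yet.
    def map(self, f):
        return MiniRDD(self.data, self.ops + (("map", f),))

    def filter(self, f):
        return MiniRDD(self.data, self.ops + (("filter", f),))

    # Actions: run the whole recorded pipeline.
    def collect(self):
        out = list(self.data)
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

# Building the pipeline does no work; collect() triggers execution.
evens_squared = MiniRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
evens_squared.collect()  # -> [0, 4, 16, 36, 64]
```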
Data Frames
● Dataset containing named columns
● Access by ordinal position
● Backed by the Tungsten execution engine - a framework for distributed memory (open source?)
● An abstraction for selecting, filtering, aggregating, and plotting structured data.
● Run SQL on it.
Streaming
● Near real time (1 sec latency), like batches in 1-sec windows
● Same spark as before
● Streaming jobs with API
Spark ML
● Classification
● Regression
● Collaborative filtering
● Clustering
● Decomposition
● Code: java, scala, python, sparkR
Spark GraphX
● Works with graphs for parallel computations
● Access to a library of algorithms
○ Page rank
○ Connected components
○ Label propagation
○ SVD++
○ Strongly connected components
Hive on Spark
● Will replace Tez.
Spark flavours
● Own cluster
● With yarn
● With mesos
Downside
● Compute intensive
● Performance gain over mapreduce is not guaranteed.
● Streaming processing is actually batch with very small window.
Redshift
● OLAP, not OLTP → analytics, not transactions
● Fully SQL
● Fully ACID
● No indexing
● Fully managed
● Petabyte Scale
● MPP
● Can create a slow queue for long-running queries.
● DO NOT USE for transformations.
EMR vs Redshift
● How much data is loaded and unloaded?
● Which operations need to be performed?
● Recycling data? → EMR
● History to be analyzed again and again? → EMR
● Where does the data need to end up? BI?
● Use Spectrum in some use cases.
● Raw data? S3.
Hive VS. Redshift
● Amount of concurrency? Low → Hive, high → Redshift
● Access for customers? Redshift?
● Transformation, unstructured, batch, ETL → Hive.
● Peta scale? → Redshift
● Complex joins → Redshift
Presto VS redshift
● Not a true DW, but can be used as one.
● Requires S3 or HDFS
● Netflix uses Presto for analytics.
Redshift
● Leader node
○ Metadata
○ Execution plan
○ SQL endpoint
● Data node
● Distribution key
● Sort key
● Normalize…
● Don't be afraid to duplicate data with a different sort key
Redshift Spectrum
● External table to S3
● Additional compute resources for the Redshift cluster.
● Not good for all use cases
Cost consideration
● Region
● Data out
● Spot instances
○ 50% - 80% cost reduction.
○ Limit your bid
○ Works well with EMR; use spot instances mostly for task nodes. For dev - use spot.
○ May be killed in the middle :)
● reserved instances
Kinesis pricing
● Streams
○ Shard hour
○ PUT payload unit, 25 KB
● Firehose
○ Volume
○ 5 KB units
● Analytics
○ Pay per container: 1 CPU, 4 GB = 11 cents
Dynamo
● Most expensive: pay for throughput
○ 1 KB writes
○ 4 KB reads
○ Eventually consistent reads are cheaper
● Be careful… from $400 to $10,000 simply because you changed the block size.
● You pay for storage: the first 25 GB are free, the rest is 25 cents per GB
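The unit sizes above drive the cost directly: writes are billed per 1 KB, reads per 4 KB, and eventually consistent reads cost half. A sketch of that arithmetic (helper names are made up; this mirrors the slide's numbers, not a billing API):

```python
import math

def write_units(item_kb, writes_per_s):
    """1 write unit = one 1 KB write per second."""
    return math.ceil(item_kb) * writes_per_s

def read_units(item_kb, reads_per_s, eventually_consistent=False):
    """1 read unit = one strongly consistent 4 KB read per second;
    eventually consistent reads cost half."""
    units = math.ceil(item_kb / 4) * reads_per_s
    return math.ceil(units / 2) if eventually_consistent else units

read_units(4, 100)        # -> 100
read_units(4, 100, True)  # -> 50 (eventually consistent is cheaper)
write_units(2, 100)       # -> 200 (a 2 KB item costs 2 write units)
```

This is why item size matters so much: doubling the item size past a unit boundary doubles the provisioned throughput you pay for.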
Optimize cost
● Alerts for cost mistakes -
○ unused machines, etc.
○ at 95% of expected cost → something is wrong…
● No need to buy RIs for 3 years; 1 year is better.
Visualizing - QuickSight
● Cheap
● Connect to everything
○ S3
○ Dynamo
○ Redshift
● SPICE
● Use storyboard to create slides of graphs like PPT
Tableau
● Connect to Redshift
● Good look and feel
● ~1000$ per user
● Not the best in terms of performance. QuickSight is faster.
Other visualizer
● Tibco Spotfire
○ Has many connectors
○ Has a recommendations feature
● Jaspersoft
○ Limited
● ZoomData
● Hunk -
○ Uses EMR and S3
○ Schema on the fly
○ Available in the Marketplace
○ Expensive.
○ Non-SQL; has its own language.
Visualize - which DB to choose?
● Hive is not recommended
● Presto a bit faster.
● Athena is ok
● Impala is better (has agent per machine, caching)
● Redshift is best.
Orchestration
● Oozie
○ Open-source workflow engine
■ Workflow: a graph of actions
■ Coordinator: schedules jobs
○ Supports: Hive, Sqoop, Spark, etc.
● Data pipeline
○ Moves data from on-prem to the cloud
○ Distributed
○ Integrates well with S3, DynamoDB, RDS, EMR, EC2, Redshift
○ Like ETL: input, data manipulation, output
○ Not trivial, but nicer than Oozie
● Other: AirFlow, Knime, Luigi, Azkaban
Security
● Shared security model
○ Customer:
■ OS, platform, identity, access management, roles, permissions
■ Network
■ Firewall
○ AWS:
■ compute, storage, databases, networking, LB
■ Regions, AZ, edge location
■ compliance
EMR secured
● IAM
● VPC
● Private subnet
● VPC endpoint to S3
● MFA to login
● STS + SAML - token-based login. A complex solution
● Kerberos authentication of Hadoop nodes
● Encryption
○ SSH to master
○ TLS between nodes
IAM
● User
● Best practice :
○ MFA
○ Don't use ROOT account
● Group
● Role
● Policy best practice
○ Allow X
○ Disallow all the rest
● Identity federation via LDAP
Kinesis Security best practice
● Create IAM admin for admin
● Create an IAM entity for resharding the stream
● Create an IAM entity for producers to write
● Create an IAM entity for consumers to read.
● Allow specific source IP
● Enforce the aws:SecureTransport condition key for every API call.
● Use temp credentials : IAM role
Dynamo Security
● Access
● Policies
● Roles
● Database-level access - row-level (item) / column-level (attribute) access control
● STS for web identity federation.
● Limit the number of rows you can see
● Use SSL
● All requests can be signed via SHA-256
● PCI, SOC 3, HIPAA, 27001, etc.
● CloudTrail - for audit logs
Redshift Security
● SSL in transit
● Encryption
○ KMS AES-256
○ HSM encryption (hardware)
● VPC
● CloudTrail
● All the usual regulation certification.
● Security groups
Big Data patterns
● Interactive query → mostly EMR, Athena
● Batch Processing → reporting → redshift + EMR
● Stream processing → kinesis, kinesis client library, lambda, EMR
● Real-time prediction → mobile, DynamoDB → Lambda + ML.
● Batch prediction → ML and redshift
● Long running cluster → s3 → EMR
● Log aggregation → s3 → EMR → redshift
Stay in touch...
● Omid Vahdaty
● +972-54-2384178

More Related Content

PPTX
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
PPTX
Amazon aws big data demystified | Introduction to streaming and messaging flu...
PPTX
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
PPTX
Introduction to NoSql
PPTX
Kafka website activity architecture
PPTX
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
PPTX
Cloudera Impala + PostgreSQL
PPTX
Big Data in 200 km/h | AWS Big Data Demystified #1.3
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
Introduction to NoSql
Kafka website activity architecture
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Cloudera Impala + PostgreSQL
Big Data in 200 km/h | AWS Big Data Demystified #1.3

What's hot (20)

PDF
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
PPTX
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
PPTX
AWS Redshift Introduction - Big Data Analytics
PDF
Hd insight essentials quick view
PDF
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
PDF
Hadoop Networking at Datasift
PDF
HBaseCon 2015- HBase @ Flipboard
PPTX
PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
PDF
Argus Production Monitoring at Salesforce
PPTX
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
PDF
DIscover Spark and Spark streaming
PDF
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
PPTX
AWS Big Data Demystified #1: Big data architecture lessons learned
PDF
Cassandra as event sourced journal for big data analytics
PDF
Signal Digital: The Skinny on Wide Rows
PDF
How Adobe Does 2 Million Records Per Second Using Apache Spark!
PPSX
Hadoop-Quick introduction
PDF
Demystifying the Distributed Database Landscape
PDF
Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost...
PDF
Facebook keynote-nicolas-qcon
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
AWS Redshift Introduction - Big Data Analytics
Hd insight essentials quick view
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Hadoop Networking at Datasift
HBaseCon 2015- HBase @ Flipboard
PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
Argus Production Monitoring at Salesforce
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
DIscover Spark and Spark streaming
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
AWS Big Data Demystified #1: Big data architecture lessons learned
Cassandra as event sourced journal for big data analytics
Signal Digital: The Skinny on Wide Rows
How Adobe Does 2 Million Records Per Second Using Apache Spark!
Hadoop-Quick introduction
Demystifying the Distributed Database Landscape
Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost...
Facebook keynote-nicolas-qcon
Ad

Similar to Introduction to AWS Big Data (20)

PDF
Big data on aws
PDF
Technologies for Data Analytics Platform
PPTX
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
PPTX
Data Analysis on AWS
PDF
Big Data, Ingeniería de datos, y Data Lakes en AWS
PDF
AWS reinvent 2019 recap - Riyadh - Database and Analytics - Assif Abbasi
PDF
Architecting Data Lakes on AWS
PDF
JDD2014: Real Big Data - Scott MacGregor
PDF
Big Data on AWS
PDF
Building a Modern Data Platform in the Cloud. AWS Initiate Portugal
PPTX
Make your data fly - Building data platform in AWS
PPTX
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...
PPTX
Building Data Lakes & Analytics on AWS
PDF
Builders' Day - Building Data Lakes for Analytics On AWS LC
PPTX
From raw data to business insights. A modern data lake
PDF
Big data should be simple
PDF
Creating a scalable & cost efficient BI infrastructure for a startup in the A...
PDF
¿Quién es Amazon Web Services?
PDF
AWS Big Data Landscape
PPTX
AWS Summit 2018 Summary
Big data on aws
Technologies for Data Analytics Platform
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Data Analysis on AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
AWS reinvent 2019 recap - Riyadh - Database and Analytics - Assif Abbasi
Architecting Data Lakes on AWS
JDD2014: Real Big Data - Scott MacGregor
Big Data on AWS
Building a Modern Data Platform in the Cloud. AWS Initiate Portugal
Make your data fly - Building data platform in AWS
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...
Building Data Lakes & Analytics on AWS
Builders' Day - Building Data Lakes for Analytics On AWS LC
From raw data to business insights. A modern data lake
Big data should be simple
Creating a scalable & cost efficient BI infrastructure for a startup in the A...
¿Quién es Amazon Web Services?
AWS Big Data Landscape
AWS Summit 2018 Summary
Ad

More from Omid Vahdaty (20)

PDF
Data Pipline Observability meetup
PPTX
Couchbase Data Platform | Big Data Demystified
PPTX
Machine Learning Essentials Demystified part2 | Big Data Demystified
PPTX
Machine Learning Essentials Demystified part1 | Big Data Demystified
PPTX
The technology of fake news between a new front and a new frontier | Big Dat...
PDF
Making your analytics talk business | Big Data Demystified
PPTX
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
PPTX
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
PDF
Aerospike meetup july 2019 | Big Data Demystified
PPTX
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
PPTX
AWS Big Data Demystified #4 data governance demystified [security, networ...
PPTX
Emr spark tuning demystified
PPTX
Emr zeppelin & Livy demystified
PPTX
Zeppelin and spark sql demystified
PPTX
Aws s3 security
PPTX
Introduction to streaming and messaging flume,kafka,SQS,kinesis
PPTX
Introduction to aws dynamo db
PPTX
Hive vs. Impala
PPTX
Introduction to ETL process
PPTX
Cloud Architecture best practices
Data Pipline Observability meetup
Couchbase Data Platform | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
The technology of fake news between a new front and a new frontier | Big Dat...
Making your analytics talk business | Big Data Demystified
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
Aerospike meetup july 2019 | Big Data Demystified
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
AWS Big Data Demystified #4 data governance demystified [security, networ...
Emr spark tuning demystified
Emr zeppelin & Livy demystified
Zeppelin and spark sql demystified
Aws s3 security
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to aws dynamo db
Hive vs. Impala
Introduction to ETL process
Cloud Architecture best practices

Recently uploaded (20)

PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
Digital Logic Computer Design lecture notes
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PDF
Structs to JSON How Go Powers REST APIs.pdf
PPTX
Welding lecture in detail for understanding
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPTX
Lecture Notes Electrical Wiring System Components
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
Sustainable Sites - Green Building Construction
PPT
Mechanical Engineering MATERIALS Selection
PPTX
Construction Project Organization Group 2.pptx
PDF
composite construction of structures.pdf
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPTX
Strings in CPP - Strings in C++ are sequences of characters used to store and...
PDF
Arduino robotics embedded978-1-4302-3184-4.pdf
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Digital Logic Computer Design lecture notes
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Structs to JSON How Go Powers REST APIs.pdf
Welding lecture in detail for understanding
Foundation to blockchain - A guide to Blockchain Tech
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Lecture Notes Electrical Wiring System Components
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Sustainable Sites - Green Building Construction
Mechanical Engineering MATERIALS Selection
Construction Project Organization Group 2.pptx
composite construction of structures.pdf
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Embodied AI: Ushering in the Next Era of Intelligent Systems
UNIT-1 - COAL BASED THERMAL POWER PLANTS
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Strings in CPP - Strings in C++ are sequences of characters used to store and...
Arduino robotics embedded978-1-4302-3184-4.pdf

Introduction to AWS Big Data

  • 1. Introduction to AWS Big Data Omid Vahdaty, Big Data Ninja
  • 2. When the data outgrows your ability to process ● Volume ● Velocity ● Variety
  • 3. EMR ● Basic, the simplest distribution of all hadoop distributions. ● Not as good as cloudera.
  • 5. Store ● S3 ● Kinesis ● RDS ● DynmoDB ● Clloud Search ● Iot
  • 6. Process + Analyze ● Lambda ● EMR ● Redshift ● Machine learning ● Elastic search ● Data Pipeline ● Athena
  • 7. Visulize ● Quick sight ● ElasticSearch service
  • 8. History of tools HDFS - from google Cassandra - from facebook, Columnar store NoSQL , materialized view, secondary index Kafka - linkedin. Hbase, no sql , hadoop eco system.
  • 9. EMR Hadoop ecosystem ● Spark - recommend options. Can do all the below… ● Hive ● Oozie ● Mahout - Machine library (mllib is better, runs on spark) ● Presto, better than hive? More generic than hive. ● Pig - scripting for big data. ● Impala - Cloudera, and part of EMR.
  • 10. ETL Tools ● Attunity ● Splunk ● Semarchy ● Informatica ● Tibco ● Clarit
  • 11. Architecture ● Decouple : ○ Store ○ Process ○ Store ○ Process ○ insight... ● Rule of thumb: 3 tech in dc, 7 techs max in cloud ○ Do use more b/c: maintenance
  • 12. Architecture considerations ● unStructured? Structured? Semi structured? ● Latency? ● Throughput? ● Concurrency? ● Access patterns? ● Pass? Max 7 technologies ● Iaas? Max 4 technologies
  • 13. EcoSystem ● Redshift = analytics ● Aurora = OLTP ● DynamoDB = NoSQL like mongodb. ● Lambda ● SQS ● Cloud Search ● Elastic Search ● Data Pipeline ● Beanstalk - i have jar, install it for me… ● AWS machine learning
  • 14. Data ingestions Architecture challenges ● Durability ● HA ● Ingestions types: ○ Batch ○ Stream ● Transaction: OLTP, noSQL.
  • 15. Ingestion options ● Kinesis ● Flume ● Kafka ● S3DistCP copy from ○ S3 to HDFS ○ S3 to s3 ○ Cross account ○ Supports compression.
  • 16. Transfer ● VPN ● Direct Connect ● S3 multipart upload ● Snowball ● IoT
  • 17. Steaming ● Streaming ● Batch ● Collect ○ Kinesys ○ DynamoDB streams ○ SQS (pull) ○ SNS (push) ○ KAFKA (Most recommend)
  • 19. Streaming ● Low latency ● Message delivery ● Lambda architecture implementation ● State management ● Time or count based windowing support ● Fault tolerant
  • 21. Stream collection options Kinesis client library AWS lambda EMR Third party Spark streaming (latency min =1 sec) , near real time, with lot of libraries. Storm - Most real time (sub millisec), java code based. Flink (similar to spark)
  • 22. Kinesys ● Stream - collect@source and near real time processing ○ Near real time ○ High throughput ○ Low cost ○ Easy administration - set desired level of capacity ○ Delivery to : s3,redshift, Dynamo ○ Ingress 1mb, egress 2mbs. Upto 1000 Transaction per second. ● Analytics - in flight analytics. ● Firehose - Park you data @ destination.
  • 24. Kinesis analytics example CREATE STREAM "INTERMEDIATE_STREAM" ( hostname VARCHAR(1024), logname VARCHAR(1024), username VARCHAR(1024), requesttime VARCHAR(1024), request VARCHAR(1024), status VARCHAR(32), responsesize VARCHAR(32) ); -- Data Pump: Take incoming data from SOURCE_SQL_STREAM_001 and insert into INTERMEDIATE_STREAM
  • 25. KCL ● Read from stream using get api ● Build application with the KCL ● Leverage kinesis spout for storm ● Leverage EMR connector
  • 26. Firehose - for parking ● Not for fast lane - no in flight analytics ● Capture , transform and load. ○ Kinesis ○ S3 ○ Redshift ○ elastic search ● Managed Service ● Producer - you input to delivery stream ● Buffer size MB ● Buffer in seconds.
  • 27. Comparison of Kinesis product ● Stream ○ Sub 1 sec processing latency ○ Choice of stream processor (generic) ○ For smaller events ● Firehose ○ Zero admin ○ 4 targets built in (redshift, s3, search, etc) ○ Latency 60 sec minimum. ○ For larger “events”
  • 28. DynamoDB ● Fully managed NoSql document store key value ○ Tables have not fixed schema ● High performance ○ Single digit latency ○ Runs on solid drives ● Durable ○ Multi az ○ Fault toulerant, replicated 3 AZ.
  • 29. Durability ● Read ○ Eventually consistent ○ Strongly consistent ● Write ○ Quorum ack ○ 3 replication - always - we can’t change it. ○ Persistence to disk.
  • 30. Indexing and partitioning ● Idnexing ○ LSI - local secondary index, kind of alternate range key ○ GSI - global secondary index - “pivot charts” for your table, kind of projection (vertica) ● Partitioning ○ Automatic ○ Hash key speared data across partitions
  • 31. DynamoDB Items and Attributes ● Partitions key ● Sort key (optional) ● LSI ● attributes
  • 32. AWS Titan(on DynamoDB) - Graph DB ● Vertex - nodes ● Edge- relationship b/w nodes. ● Good when you need to investigate more than 2 layer relationship (join of 4 tables) ● Based on TinkerPop (open source) ● Full text search with lucene SOLR or elasticsearch ● HA using multi master replication (cassandra backed) ● Scale using DynamoDB backed. ● Use cases: Cyber , Social networks, Risk management
  • 33. Elasticache: MemCache / Redis ● Cache sub millisecond. ● In memory ● No disk I/O when querying ● High throughput ● High availability ● Fully managed
  • 34. Redshift ● Peta scale database for analytics ○ Columnar ○ MPP ● Complex SQL
  • 35. RDS ● Flavours ○ Mysql ○ Auraora ○ postgreSQL ○ Oracle ○ MSSQL ● Multi AZ, HA ● Managed services
  • 36. Data Processing ● Batch ● Ad hoc queries ● Message ● Stream ● Machine learning ● (by the use parquet, not ORC, more commonly used in the ecosystem)
  • 37. Athena ● Presto ● In memory ● Hive meta store for DDL functionality ○ Complex data types ○ Multiple formats ○ Partitions
  • 38. EMR ● Emr , pig and hive can work on top of spark , but not yet in EMR ● TEZ is not working. Horton gave up on it. ● Can ran presto on hive
  • 39. Best practice ● Bzip2 - splittable compression ○ You don't have to open all the file, you can bzip2 only one block in it and work in parallel ● Snappy - encoding /decoding time - VERY fast, not compress well. ● Partitions ● Ephemeral EMR
  • 40. EMR ecosystem ● Hive ● Pig ● Hue ● Spark ● Oozie ● Presto ● Ganglia ● Zookeeper ● zeppelin ● For research - Public data sets exist of AWS...
  • 41. EMR ARchitecture ● Master node ● Core nodes - like data nodes (with storage) ● Task nodes - (not like regular hadoop, extends compute) ● Default replication factor ○ Nodes 1-3 ⇒ 1 factor ○ Nodes 4-9 ⇒ 2 replication factor ○ Nodes 10+ ⇒ 3 replication factor ○ Not relevant in external tables ● Does Not have Standby Master node ● Best for transient cluster (goes up and down every night)
  • 42. EMR ● Use SCALA ● If you use Spark SQL , even in python it is ok. ● For code - python is full of bugs, connects well to R ● Scala is better - but not connects easily to data science R ● Use cloud formation when ready to deploy fast. ● Check instance I , ● Dense instance is good architecture ● User spot instances - for the tasks. ● Use TAGs ● Constant upgrade of the AMI version ● Don't use TEZ ● Make sure your choose instance with network optimized ● Resize cluster is not recommended ● Bootstrap to automate cluster upon provisioning ● Steps to automate steps on running cluster ● Use RDS to share Hive MetaStore (the metastore is mysql based) ● Use r, kafka and impala via bootstrap actions and many others
  • 43. CPU features for instances
  • 44. Sending work to emr ● Steps - ○ can be added while cluster is running ○ Can be added from the UI / CLI ○ FIFO scheduler by default ● Emr api ● Ganglia: jvm.JvmMetrics.MemHeapUsedM
  • 46. Hive ● SQL over Hadoop. ● Engines: Spark, Tez, MR ● JDBC / ODBC ● Not good when you need to shuffle. ● Connects well with DynamoDB ● SerDes: JSON, Parquet, regex, etc.
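The DynamoDB connection works through a storage handler that ships with Hive on EMR. A minimal sketch (the Hive table, DynamoDB table, and column names are hypothetical):

```sql
-- Maps a Hive table onto an existing DynamoDB table, so HiveQL
-- queries read/write DynamoDB items directly.
CREATE EXTERNAL TABLE orders_ddb (
  order_id string,
  total    double
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name"     = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,total:Total"
);
```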
  • 47. HBase ● NoSQL, like Cassandra ● Sits below YARN ● An HBase agent (region server) runs on each data node, like Impala. ● Writes to HDFS ● Multi-level index. ● (AWS avoids talking about it because of DynamoDB; use it only when you want to save money) ● Has a Hive driver (work from Hive on top of HBase)
  • 48. Presto ● Like Hive, from Facebook. ● Not always good for joins of two large tables. ● Limited by memory ● Not fault tolerant like Hive. ● Optimized for ad hoc queries
  • 49. Pig ● Distributed shell scripting ● Generates SQL-like operations. ● Engines: MR, Tez ● S3, DynamoDB access ● Use case: data scientists who don't know SQL, systems people, and those who want to avoid Java/Scala ● A fair fight compared to Hive in terms of performance only ● Good for unstructured-file ETL: file to file; use Sqoop alongside.
  • 50. Spark
  • 51. Mahout ● Machine-learning library for Hadoop - not used much ● Spark MLlib is used instead.
  • 52. R ● Open-source package for statistical computing. ● Works with EMR ● A “Matlab” equivalent ● Works with Spark ● Not for developers :) - for statisticians ● R is single-threaded - use SparkR to distribute; not everything works perfectly.
  • 53. Apache Zeppelin ● Notebook - visualizer ● Built-in Spark integration ● Interactive data analytics ● Easy collaboration. ● Uses SQL ● Works on top of Hive ● Included in EMR. ● Gives more feedback to let you know where you are
  • 54. Hue ● Hadoop User Experience ● Logs and failures in real time. ● Multiple users ● Native access to S3. ● File browser for HDFS. ● Manipulate the metastore ● Job browser ● Query editor ● HBase browser ● Sqoop, Oozie, and Pig editors
  • 55. Spark ● In memory ● 10x to 100x faster ● Good optimizer for distribution ● Rich API ● Spark SQL ● Spark Streaming ● Spark ML (MLlib) ● Spark GraphX (graph DBs) ● SparkR
  • 56. Spark ● RDD ○ An array (data set) ○ Read-only distributed objects cached in memory across the cluster ○ Allows apps to keep a working set in memory for reuse ○ Fault tolerant ○ Ops: ■ Transformations: map / filter / groupBy / join ■ Actions: count / reduce / collect / save / persist ○ Object-oriented ops ○ High-level expressions like lambdas, functions, map
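The transformation-vs-action split can be illustrated without Spark at all: Python generators are similarly lazy, so nothing below executes until the final "action". This is an analogy, not the Spark API:

```python
# "Transformations" compose a lazy pipeline; no work happens yet.
data = range(1, 6)
doubled = (x * 2 for x in data)          # like rdd.map(lambda x: x * 2)
big = (x for x in doubled if x > 4)      # like .filter(lambda x: x > 4)

# The "action" forces the whole pipeline to run, like collect().
result = list(big)  # → [6, 8, 10]
```

In real Spark the same laziness is what lets the optimizer fuse stages and recompute lost partitions for fault tolerance.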
  • 57. Data Frames ● Data sets containing named columns ● Access by ordinal position ● Tungsten execution backend - framework for distributed memory (open source?) ● Abstraction for selecting, filtering, aggregating, and plotting structured data. ● Can run SQL on them.
  • 58. Streaming ● Near real time (1-second latency); like batches of 1-second windows ● Same Spark as before ● Streaming jobs with an API
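The "batches of 1-second windows" idea boils down to bucketing a stream of timestamped events into fixed windows, roughly what DStreams do internally. A pure-Python sketch (not the Spark API):

```python
from collections import defaultdict

def window_events(events, width_sec=1.0):
    """Bucket (timestamp, value) pairs into fixed-width micro-batches."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // width_sec)].append(value)
    return dict(batches)

stream = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d")]
windows = window_events(stream)  # → {0: ["a", "b"], 1: ["c"], 2: ["d"]}
```

Each window is then processed as an ordinary batch job, which is why the latency floor equals the window width.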
  • 59. Spark ML ● Classification ● Regression ● Collaborative filtering ● Clustering ● Decomposition ● Code: Java, Scala, Python, SparkR
  • 60. Spark GraphX ● Works with graphs for parallel computation ● Access to a library of algorithms ○ PageRank ○ Connected components ○ Label propagation ○ SVD++ ○ Strongly connected components
  • 61. Hive on Spark ● Will replace Tez.
  • 62. Spark flavours ● Own (standalone) cluster ● With YARN ● With Mesos
  • 63. Downsides ● Compute intensive ● Performance gain over MapReduce is not guaranteed. ● Stream processing is actually batch with a very small window.
  • 64. Redshift ● OLAP, not OLTP → analytics, not transactions ● Fully SQL ● Fully ACID ● No indexing ● Fully managed ● Petabyte scale ● MPP ● Can create a slow queue for long-running queries. ● DO NOT USE FOR transformation.
  • 65. EMR vs Redshift ● How much data is loaded and unloaded? ● Which operations need to be performed? ● Recycling data? → EMR ● History to be analyzed again and again? → EMR ● Where does the data need to end up? BI? ● Use Spectrum in some use cases. ● Raw data? → S3.
  • 66. Hive vs. Redshift ● Amount of concurrency? Low → Hive, high → Redshift ● Access for customers? → Redshift ● Transformation, unstructured data, batch, ETL → Hive ● Petabyte scale? → Redshift ● Complex joins → Redshift
  • 67. Presto vs Redshift ● Presto is not a true DW, but can be used as one. ● Requires S3 or HDFS ● Netflix uses Presto for analytics.
  • 68. Redshift ● Leader node ○ Metadata ○ Execution plan ○ SQL endpoint ● Data nodes ● Distribution key ● Sort key ● Normalize… ● Don't be afraid to duplicate data with a different sort key
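Distribution and sort keys are declared at table creation. A hypothetical example (schema is made up) that distributes on the expected join key and sorts on the common filter column:

```sql
-- DISTKEY co-locates rows with the same user_id on one slice,
-- so joins on user_id avoid network shuffles; SORTKEY lets range
-- filters on viewed_at skip disk blocks.
CREATE TABLE page_views (
  user_id   bigint,
  viewed_at timestamp,
  url       varchar(2048)
)
DISTKEY (user_id)
SORTKEY (viewed_at);
```

Duplicating a table with a different sort key, as the slide suggests, trades storage for faster scans on a second access pattern.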
  • 69. Redshift Spectrum ● External tables on S3 ● Additional compute resources beyond the Redshift cluster. ● Not good for all use cases
  • 70. Cost considerations ● Region ● Data out ● Spot instances ○ 50%-80% cost reduction. ○ Limit your bid ○ Work well with EMR; use spot instances mostly for task nodes. For dev - use spot. ○ May be killed in the middle :) ● Reserved instances
  • 71. Kinesis pricing ● Streams ○ Per shard-hour ○ Per PUT payload unit, 25 KB ● Firehose ○ By volume ○ 5 KB ● Analytics ○ Pay per container: 1 CPU, 4 GB = 11 cents
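PUT payload units round up to 25 KB, so record size drives the streaming bill. A small sketch of that arithmetic (the per-unit price itself varies by region and over time):

```python
import math

def put_payload_units(payload_bytes: int) -> int:
    """Kinesis Streams bills PUTs in 25 KB payload units, rounded up."""
    return math.ceil(payload_bytes / (25 * 1024))

# A 60 KB record consumes 3 units; a tiny 1 KB record still costs 1.
assert put_payload_units(60 * 1024) == 3
assert put_payload_units(1024) == 1
```

Batching many small records into one PUT is therefore a direct cost optimization.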
  • 72. Dynamo ● Most expensive: pay for throughput ○ 1 KB writes ○ 4 KB reads ○ Eventually consistent reads are cheaper ● Be careful… from $400 to $10,000 simply because you changed the block size. ● You pay for storage; the first 25 GB are free, the rest is 25 cents per GB
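The 1 KB write / 4 KB read unit sizes explain why item-size changes move the bill so much. A sketch of the capacity-unit arithmetic (illustrative only; prices per unit vary by region):

```python
import math

def write_capacity_units(item_kb: float) -> int:
    """One WCU covers a write of up to 1 KB, rounded up."""
    return math.ceil(item_kb)

def read_capacity_units(item_kb: float, eventually_consistent: bool = False) -> int:
    """One RCU covers a strongly consistent read of up to 4 KB;
    eventually consistent reads cost half as much."""
    units = math.ceil(item_kb / 4)
    return math.ceil(units / 2) if eventually_consistent else units

# A 6 KB item: 6 WCU to write, 2 RCU to read, 1 RCU eventually consistent.
assert write_capacity_units(6) == 6
assert read_capacity_units(6) == 2
assert read_capacity_units(6, eventually_consistent=True) == 1
```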
  • 73. Optimize cost ● Alerts for cost mistakes ○ Unused machines, etc. ○ At 95% of expected cost → something is wrong… ● No need to buy RIs for 3 years; 1 year is better.
  • 74. Visualizing - QuickSight ● Cheap ● Connects to everything ○ S3 ○ Dynamo ○ Redshift ● SPICE ● Use storyboards to create slides of graphs, like PPT
  • 75. Tableau ● Connects to Redshift ● Good look and feel ● ~$1,000 per user ● Not the best in terms of performance; QuickSight is faster.
  • 76. Other visualizers ● Tibco Spotfire ○ Has many connectors ○ Has a recommendations feature ● Jaspersoft ○ Limited ● ZoomData ● Hunk ○ Uses EMR and S3 ○ Schema on the fly ○ Available in the marketplace ○ Expensive. ○ Non-SQL; has its own language.
  • 77. Visualize - which DB to choose? ● Hive is not recommended ● Presto is a bit faster. ● Athena is OK ● Impala is better (has an agent per machine, caching) ● Redshift is best.
  • 78. Orchestration ● Oozie ○ Open-source workflow engine ■ Workflow: graph of actions ■ Coordinator: schedules jobs ○ Supports Hive, Sqoop, Spark, etc. ● Data Pipeline ○ Moves data from on-prem to the cloud ○ Distributed ○ Integrates well with S3, DynamoDB, RDS, EMR, EC2, Redshift ○ Like ETL: input, data manipulation, output ○ Not trivial, but nicer than Oozie ● Others: Airflow, KNIME, Luigi, Azkaban
  • 79. Security ● Shared responsibility model ○ Customer: ■ OS, platform, identity and access management, roles, permissions ■ Network ■ Firewall ○ AWS: ■ Compute, storage, databases, networking, LB ■ Regions, AZs, edge locations ■ Compliance
  • 80. EMR secured ● IAM ● VPC ● Private subnet ● VPC endpoint to S3 ● MFA for login ● STS + SAML - token-based login; a complex solution ● Kerberos authentication of Hadoop nodes ● Encryption ○ SSH to master ○ TLS between nodes
  • 81. IAM ● Users ● Best practice: ○ MFA ○ Don't use the root account ● Groups ● Roles ● Policy best practice ○ Allow X ○ Deny all the rest ● Identity federation via LDAP
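The "allow X, deny all the rest" pattern falls out of IAM's default-deny model: a least-privilege policy lists only the explicit allows. A hypothetical policy granting read-only access to a single S3 bucket (the bucket name is made up):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-example-bucket",
        "arn:aws:s3:::my-example-bucket/*"
      ]
    }
  ]
}
```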
  • 82. Kinesis security best practice ● Create an IAM admin entity for administration ● Create an IAM entity for resharding streams ● Create an IAM entity for producers to write ● Create an IAM entity for consumers to read ● Allow specific source IPs ● Enforce the aws:SecureTransport condition key for every API call. ● Use temporary credentials: IAM roles
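Enforcing aws:SecureTransport is done with a deny statement that fires whenever a call arrives over plain HTTP. A sketch of such a policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "kinesis:*",
      "Resource": "*",
      "Condition": { "Bool": { "aws:SecureTransport": "false" } }
    }
  ]
}
```

Because an explicit deny overrides any allow, this blocks non-TLS access regardless of what other policies grant.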
  • 83. Dynamo security ● Access ● Policies ● Roles ● Fine-grained access control: row level (item) / column level (attribute) ● STS for web identity federation. ● Can limit the number of rows you can see ● Use SSL ● All requests can be signed via SHA-256 ● PCI, SOC 3, HIPAA, 27001, etc. ● CloudTrail - for audit logs
  • 84. Redshift security ● SSL in transit ● Encryption ○ KMS AES-256 ○ HSM encryption (hardware) ● VPC ● CloudTrail ● All the usual regulatory certifications. ● Security groups
  • 85. Big Data patterns ● Interactive query → mostly EMR, Athena ● Batch processing → reporting → Redshift + EMR ● Stream processing → Kinesis, Kinesis Client Library, Lambda, EMR ● Real-time prediction → mobile, DynamoDB → Lambda + ML ● Batch prediction → ML and Redshift ● Long-running cluster → S3 → EMR ● Log aggregation → S3 → EMR → Redshift
  • 86. Stay in touch... ● Omid Vahdaty ● +972-54-2384178