ianmas@amazon.com
@IanMmmm
LARGE SCALE DATA
ANALYSIS WITH AWS



Ian Massingham – Technical Evangelist
THE MORE DATA YOU COLLECT
THE MORE VALUE YOU CAN
DERIVE FROM IT!
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
THE COST OF DATA
GENERATION IS FALLING!
We are constantly producing more data
From all types of industries
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Lower cost,
higher throughput
Highly
constrained
+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ ONLY PAY FOR WHAT YOU USE
+ AVAILABLE ON-DEMAND
= REMOVE CONSTRAINTS
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
AWS Import / Export
AWS Direct Connect
Inbound data transfer is free
Multipart upload to S3
Physical media
AWS Direct Connect
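
As a rough illustration of what a multipart upload to S3 involves on the client side, the sketch below only plans the byte ranges of the parts; it does not call AWS. The function name `plan_parts` and the 8 MiB default part size are assumptions made for this example. The 5 MiB minimum part size (for all parts but the last) and the 10,000-part ceiling are S3's documented limits.

```python
from typing import List, Tuple

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last
MAX_PARTS = 10_000               # S3 maximum number of parts per upload


def plan_parts(total_size: int,
               part_size: int = 8 * 1024 * 1024) -> List[Tuple[int, int, int]]:
    """Return (part_number, start_byte, end_byte_inclusive) for each part.

    Hypothetical helper: it only computes ranges, it uploads nothing.
    """
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum of 5 MiB")

    # Grow the part size if the object would otherwise need too many parts.
    while (total_size + part_size - 1) // part_size > MAX_PARTS:
        part_size *= 2

    parts = []
    start = 0
    number = 1  # S3 part numbers start at 1
    while start < total_size:
        end = min(start + part_size, total_size) - 1
        parts.append((number, start, end))
        start = end + 1
        number += 1
    return parts


if __name__ == "__main__":
    # A 20 MiB object is planned as two full 8 MiB parts plus a 4 MiB tail.
    for part in plan_parts(20 * 1024 * 1024):
        print(part)
```

Each range can then be uploaded independently and in parallel, which is where the throughput benefit of multipart upload comes from.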
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon EC2
Amazon Elastic
MapReduce
AMAZON ELASTIC
MAPREDUCE

HADOOP AS A SERVICE!
•  SPLITS DATA INTO PIECES
•  LETS PROCESSING OCCUR
•  GATHERS THE RESULTS!
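
The three bullets above can be sketched in a few lines of single-process Python. This is only a toy model of the split / process / gather pattern, not how Hadoop or EMR is implemented; the function names and the sample "flightnumber,status" records are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split(records: List[str], pieces: int) -> List[List[str]]:
    """Split the input into roughly equal pieces, one per worker."""
    return [records[i::pieces] for i in range(pieces)]


def map_piece(piece: Iterable[str]) -> List[Tuple[str, int]]:
    """Process one piece: emit (status, 1) for every record in it."""
    return [(record.split(",")[-1], 1) for record in piece]


def reduce_pairs(mapped: Iterable[List[Tuple[str, int]]]) -> Dict[str, int]:
    """Gather the results: sum the emitted counts per key."""
    totals: Dict[str, int] = defaultdict(int)
    for pairs in mapped:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)


def run(records: List[str], pieces: int = 2) -> Dict[str, int]:
    """Chain the three steps; a real cluster runs map_piece in parallel."""
    return reduce_pairs(map_piece(piece) for piece in split(records, pieces))


if __name__ == "__main__":
    # Hypothetical sample data in "flightnumber,status" form.
    sample = ["BA123,LANDED", "LH900,DELAYED", "AF1680,LANDED"]
    print(run(sample))
```

On a cluster the pieces live on different nodes and the map step runs where the data is, which is what lets the same pattern scale to much larger inputs.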
HDFS
EMR
Kinesis
S3
DynamoDB
Data management
Pig
Analytics languages/engines
RDS
Redshift
AWS Data Pipeline
EMR + IMPALA DEMO
STARTING AN EMR CLUSTER
WITH HADOOP ECOSYSTEM
TOOLS PRE-INSTALLED
COPY & LOAD OUR DATASET
$ scp -i EMRKeyPair.pem ~/aws/hadoop/LHRarrivals*.csv hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com:

$ ssh -i EMRKeyPair.pem hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com

$ hadoop fs -mkdir /data/
$ hadoop fs -put <uploaded_files> /data/
$ hadoop fs -ls -h -R /data/

or at scale, Distributed Copy using S3DistCp to parallel load from S3

$ . /home/hadoop/impala/conf/impala.conf
$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -Dmapreduce.job.reduces=30 --src s3://s3bucketname/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/data/ --outputCodec 'none'

** Run on a cluster master node
CREATE EXTERNAL TABLE
$ # check the size of our data set
$ wc -l LHRarrivals*.csv
   850 LHRarrivals2.csv
  1526 LHRarrivals.csv
  2376 total

$ impala-shell
Welcome to the Impala shell.

> create EXTERNAL TABLE flights (
    input STRING, id BIGINT, widget STRING, source STRING,
    resultnum BIGINT, pageurl STRING, scheduled STRING,
    flightnumber STRING, airport STRING, status STRING,
    terminal STRING )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/';
> select count (*) from flights;

Should return count(*) 2376 reflecting the size of the data set
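
To make the table definition more concrete, here is a small Python sketch of how one comma-delimited line lines up with the eleven columns declared above. It imitates the behaviour of a plain delimited text table as I understand it (a bare split on ',', missing or non-numeric values becoming NULL) and is not Impala's actual reader. `SCHEMA` and `parse_line` are names invented for this example, and the sample row is made up.

```python
from typing import Dict, Optional, Union

# Column names and types copied from the CREATE EXTERNAL TABLE statement.
SCHEMA = [
    ("input", str), ("id", int), ("widget", str), ("source", str),
    ("resultnum", int), ("pageurl", str), ("scheduled", str),
    ("flightnumber", str), ("airport", str), ("status", str),
    ("terminal", str),
]

Value = Optional[Union[str, int]]


def parse_line(line: str) -> Dict[str, Value]:
    """Map one delimited line onto the flights columns (illustrative only)."""
    # A bare split: quotes get no special treatment, so a comma inside
    # a quoted value would shift every later column by one.
    fields = line.rstrip("\n").split(",")
    row: Dict[str, Value] = {}
    for index, (name, cast) in enumerate(SCHEMA):
        raw = fields[index] if index < len(fields) else None
        if raw is None:
            row[name] = None           # missing trailing field -> NULL
        elif cast is int:
            try:
                row[name] = int(raw)   # BIGINT column
            except ValueError:
                row[name] = None       # unparseable number -> NULL
        else:
            row[name] = raw            # STRING column, kept verbatim
    return row


if __name__ == "__main__":
    # Hypothetical row, not taken from the LHRarrivals files.
    sample = "http://example.com,1,w1,src,1,http://example.com/p,10:05,BA123,JFK,LANDED,5"
    print(parse_line(sample)["status"])
```

The practical consequence is that free-text fields containing commas need cleaning or escaping before loading, otherwise the row count still matches `wc -l` but the column values do not.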
  
DEMO OF ODBC ACCESS
This part of the demo runs on Amazon WorkSpaces, using the Simba Cloudera Impala ODBC Driver.
Set up an SSH tunnel to the master node so that the WorkSpaces desktop can reach the Impala ODBC port (21050).
A previously configured system DSN then lets us work with the data from our EMR/Impala cluster directly within Microsoft Excel.
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
BATCH
PROCESSING
GENERATE ➔ ➔ SHARE!
STREAM
PROCESSING
AMAZON KINESIS

REAL-TIME DATA STREAM PROCESSING!
Real-time response to content
in semi-structured data streams



Relatively simple computations
on data (aggregates, filters,
sliding window, etc.)
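
The kind of computation described here (a filter followed by a sliding-window aggregate) can be sketched in plain Python as below. This is not the Kinesis API; `sliding_sum` and the `(timestamp, kind, value)` event shape are assumptions for the example, and events are assumed to arrive in timestamp order.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Event = Tuple[float, str, float]  # (timestamp_seconds, kind, value)


def sliding_sum(events: Iterable[Event], kind: str,
                window_seconds: float) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, sum of matching values in the trailing window)."""
    window: Deque[Tuple[float, float]] = deque()
    total = 0.0
    for timestamp, event_kind, value in events:
        if event_kind != kind:  # filter: ignore other kinds of event
            continue
        window.append((timestamp, value))
        total += value
        # Evict records that have fallen out of the trailing window.
        while window and window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield timestamp, total


if __name__ == "__main__":
    # Hypothetical stream: count errors over the trailing 60 seconds.
    stream = [(0, "error", 1), (10, "info", 5), (20, "error", 2), (70, "error", 4)]
    for point in sliding_sum(stream, "error", 60):
        print(point)
```

Because each event is handled once and old records are evicted as new ones arrive, memory use depends on the window length rather than on the total size of the stream.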
Hourly server logs: how your systems went wrong an hour ago
Weekly / Monthly Bill: What you spent this past billing cycle
Daily customer report from your website: tells you what deal or ad to try next time
Daily fraud reports: tells you if there was fraud yesterday
Daily business reports: tells me how customers used AWS services yesterday
Real-time metrics: what just went wrong now
Real-time spending alerts/caps: guaranteeing you can’t overspend
Real-time analysis: what to offer the current customer now
Real-time detection: blocks fraudulent use now
Fast ETL into Amazon Redshift: how are customers using services now
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2
Amazon EC2
Amazon Elastic
MapReduce
Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2
AWS Import / Export
AWS Direct Connect
GENERATE ➔ ➔ SHARE!
STREAM
PROCESSING
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2
Amazon Kinesis
Stream Processing on
Amazon EC2
WANT TO KNOW MORE?
aws.amazon.com/solutions/case-studies/big-data/
ianmas@amazon.com
@IanMmmm
LARGE SCALE DATA
ANALYSIS WITH AWS



Ian Massingham – Technical Evangelist
