Architecting Data Lakes on AWS with HiFX
Established in 2001, HiFX is an Amazon Web Services Consulting Partner that has been designing and migrating applications and workloads in the cloud since 2010. We have been helping organisations become truly data driven by building data lakes on AWS since 2015.
The Challenges
01 Lack of agility and accessibility for data analysis that would help the product team make smart business decisions and improve strategies.
02 Increasing volume and velocity of data. With new digital properties being added, the collection and storage layers needed to be designed to scale well.
03 Dozens of independently managed collections of data, leading to data silos. With no single source of truth, it was difficult to identify what data was available, get access to it, and integrate it.
04 Poorly recorded data. The meaning and granularity of the data was often lost in processing.
Our Journey from Data to Decisions with an AWS-powered Data Lake

COLLECT: Connecting dozens of data streams and repositories to a unified data pipeline, enabling near-realtime access to any data source.

STORE: Architecting a secure, well-governed data lake to store all data in raw format. S3 is the fabric with which we have woven the solution.

PROCESS: Processing data in streams or batches to aid analytics and machine learning, supplemented by smart workflow management to orchestrate the tasks.

CONSUME: Dynamic dashboards and visualisations that make data tell stories and help drive insights; well-designed big data stores for reporting and exploratory analysis; recommendations and predictive analytics off the data in the data lake.
Scribe (Collector)
Scribe collects data from the trackers and writes it to Kinesis Streams. It is written in Go and engineered for high concurrency, low latency and horizontal scalability. Currently running on two c4.large instances, our API latency is 12.6 ms at the 50th percentile and 36 ms at the 75th percentile. This is made possible by the consistent and predictable performance of Kinesis.

Accumulo (Storage)
Accumulo is the data consumer component responsible for reading data from the event streams (Kinesis Streams), performing rudimentary data quality checks and converting data to Avro format before loading it into the Data Lake. Our Data Lake in S3 captures and stores raw data at scale for a low cost. It allows us to store many types of data in the same repository while letting us define the structure of the data at the time it is used.
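The collect step above can be sketched as building a Kinesis record per tracker event. The real Scribe collector is written in Go; this is a minimal Python illustration, and the stream name and event fields are hypothetical. With boto3, the record would be sent via `kinesis_client.put_record(**build_record(event, user_id))`.

```python
# Minimal sketch of the collect step: shape one tracker event as a Kinesis
# PutRecord request. Stream name and event fields are illustrative.
import json

def build_record(event: dict, partition_key: str) -> dict:
    """Build a Kinesis PutRecord request for a tracker event.

    Records with the same partition key land on the same shard,
    which preserves per-user ordering.
    """
    return {
        "StreamName": "tracker-events",  # hypothetical stream name
        "Data": json.dumps(event, separators=(",", ":")).encode("utf-8"),
        "PartitionKey": partition_key,   # e.g. an anonymous user id
    }
```

Partitioning by user id is what lets downstream consumers such as Accumulo see each user's events in order while still scaling out across shards.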
Why Amazon S3 for a Data Lake?
01 Performance is relatively lower than an HDFS cluster, but this doesn't affect our workloads significantly. EMRFS with consistent view (backed by DynamoDB) works really well.
02 Native support for versioning, tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies, and security: SSL in transit plus client/server-side encryption.
03 Unlimited number of objects and volume of data, along with 99.99% availability and 99.999999999% durability. Lower TCO and easier to scale than HDFS.
04 Decoupled storage and compute, allowing multiple heterogeneous analysis clusters to use the same data.
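The tiered-storage point can be made concrete with a lifecycle rule like the sketch below. The prefix, rule ID and transition days are illustrative assumptions; with boto3 this dict would be applied via `s3.put_bucket_lifecycle_configuration(Bucket="my-data-lake", LifecycleConfiguration=lifecycle)`.

```python
# Sketch of an S3 lifecycle rule implementing tiered storage for raw data.
# Prefix and day counts are illustrative, not the production policy.
lifecycle = {
    "Rules": [
        {
            "ID": "tier-raw-events",
            "Filter": {"Prefix": "raw/"},  # hypothetical raw-data prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # cool after 30 days
                {"Days": 365, "StorageClass": "GLACIER"},     # archive after a year
            ],
        }
    ]
}
```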
Prism (Processor)
Unified processing engine using Apache Spark running on EMR, written in Scala. Airflow is used to programmatically author, schedule and monitor workflows. Prism generates data for tracking KPIs and performs funnel, pathflow, retention and affinity analysis. It also includes machine learning workloads that generate recommendations and predictions.

Lens (Consumer)
Custom-built reporting and visualisation app that helps business owners easily interpret, visualise and record data and derive insights. Detailed analysis of KPIs, event segmentation, funnels, search insights, path finder, retention/addiction analysis etc., powered by Redshift and Druid. Pgpool is used to cache Redshift queries.
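The funnel analysis Prism performs can be illustrated with a small pure-Python sketch; the production jobs run as Spark on EMR, and the funnel steps and events below are made up.

```python
# Illustrative funnel computation over (user_id, event) tuples:
# count how many users reach each funnel step, in order.
from collections import defaultdict

FUNNEL = ["view_product", "add_to_cart", "purchase"]  # example funnel steps

def funnel_counts(events):
    """Return, per funnel step, how many users reached it in sequence."""
    progress = defaultdict(int)  # user -> index of next expected step
    for user, event in events:
        step = progress[user]
        if step < len(FUNNEL) and event == FUNNEL[step]:
            progress[user] = step + 1
    counts = [0] * len(FUNNEL)
    for reached in progress.values():
        for i in range(reached):
            counts[i] += 1
    return counts
```

For example, if one user views, adds to cart and purchases while another only views, the counts are `[2, 1, 1]`, from which step-to-step conversion rates follow directly.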
KPIs

Product relationship: understand which products are viewed consecutively.
Product affinity: understand which products are purchased together.
Sales: hourly, daily, weekly, monthly, quarterly, and annual.
Average market basket: the average order size.
Cart abandonment rate: the percentage of shopping carts abandoned before purchase.
Days/visits to purchase: the average number of days and sessions from the first website interaction to purchase.
Cost per acquisition: (total cost of marketing activities) / (# of conversions).
Repeat purchase rate: what percentage of our customers are repeat customers.
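Two of the formula-based KPIs above, worked through with made-up numbers:

```python
# Worked sketch of two KPIs from the list; input numbers are illustrative.
def cost_per_acquisition(marketing_cost: float, conversions: int) -> float:
    """(Total cost of marketing activities) / (# of conversions)."""
    return marketing_cost / conversions

def cart_abandonment_rate(carts_created: int, purchases: int) -> float:
    """Fraction of created carts that never reach purchase."""
    return 1 - purchases / carts_created

print(cost_per_acquisition(5000.0, 250))  # 20.0 (cost per conversion)
print(cart_abandonment_rate(1200, 300))   # 0.75 (75% of carts abandoned)
```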
Product page performance
Measuring product performance
The scatter plot compares the number of unique users that view each product with the number of unique users that add the product to basket, with the size of each dot representing the number of unique users that buy the product.

Any products located in the lower right corner are highly trafficked but low converting: any effort spent fixing those product pages (e.g. by checking the copy, updating the product images or lowering the price) should be rewarded with a significant sales uplift, given the number of people visiting those pages.
In contrast, products located in the top left of the plot are very highly converting but low-trafficked pages. We should drive more traffic to these pages, either by positioning those products more prominently on catalog pages, for example, or by spending marketing dollars driving more traffic to those pages specifically. Again, that investment should result in a significant uplift in sales, given how highly converting those products are.

Similarly, products in the lower left corner are performing poorly, but it is not clear whether this is because they have low traffic levels and/or are poor at driving conversions. We should invest in improving the performance of these pages, but the return on that investment is likely to be smaller (or harder to achieve) than the other two opportunities.
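The quadrant reasoning above can be sketched as a small classifier over unique viewers and conversion rate (unique adders / unique viewers). The cut-off values are illustrative assumptions; in practice they could be medians across the catalog.

```python
# Sketch of the scatter-plot quadrants: bucket each product by traffic
# (unique viewers) and conversion rate. Cut-offs are illustrative.
def classify(viewers: int, adders: int, view_cut: int = 1000,
             rate_cut: float = 0.10) -> str:
    rate = adders / viewers if viewers else 0.0
    if viewers >= view_cut and rate < rate_cut:
        return "high traffic, low conversion"  # fix the product page
    if viewers < view_cut and rate >= rate_cut:
        return "low traffic, high conversion"  # drive more traffic here
    if viewers < view_cut and rate < rate_cut:
        return "underperforming"               # lower-return investment
    return "star"                              # high traffic and conversion
```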
Identifying products / content that go well together
Market basket analysis is an association rule learning technique aimed at uncovering the associations and connections between specific products in our store. In a market basket analysis, we look to see if there are combinations of products that frequently co-occur in transactions.
We can use this type of analysis to:
• Inform the placement of content items on sites, or products in the catalogue
• Drive recommendation engines (like Amazon’s “customers who bought this product also bought these products…”)
• Deliver targeted marketing (e.g. emailing customers who bought specific products with offers on other products that are likely to be interesting to them)
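A minimal sketch of the counting behind market basket analysis, using toy baskets; production rule mining would also compute support and lift over the full transaction history.

```python
# Count pair co-occurrence across baskets and estimate rule confidence.
# The baskets below are toy data for illustration.
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count how often each unordered product pair co-occurs in a basket."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

def confidence(transactions, a, b):
    """Estimate P(b in basket | a in basket) for the rule a -> b."""
    with_a = [t for t in transactions if a in t]
    return sum(1 for t in with_a if b in t) / len(with_a)

baskets = [{"bread", "butter"}, {"bread", "butter", "jam"}, {"bread"}]
```

Here `confidence(baskets, "bread", "butter")` is 2/3: of the three baskets containing bread, two also contain butter, which is the kind of signal a recommendation engine would rank on.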


Editor's Notes

  • #6: Trackers allow collecting data from any type of application (web, mobile), service or device. All trackers adhere to the predefined Tracker Protocol and send data asynchronously, so they do not affect application performance. Collectors are stateless and horizontally scalable. Each shard in a Kinesis stream supports reads of up to 2 MB per second and writes of up to 1,000 records / 1 MB per second; Scribe and Accumulo automatically detect new shards and scale. Accumulo is a KCL Java app that buffers the events and uploads the batches as Avro files to the Data Lake. DAGs in Airflow pull dimension and offline data and load them into the Data Lake.
  • #9: Streaming workloads serve near-realtime reports (news) and batch workloads serve daily reports (classifieds); EMR with instance fleets provides a cost-effective way to process data. Data processing involves quality checks, cleansing, reconciliation and enrichment. A subset of the data (excluding page views and data older than the last two years) is sent to Druid and Redshift, while all historical data is stored as Parquet in S3 with a lifecycle policy; Athena can point to this data for ad hoc analysis. Druid handles realtime data and aggregate queries that do not require joins; Redshift handles everything else. Lens is built with React using the nvd3 charting library, designed for multi-tenancy with fine-grained ACLs, with APIs powered by Go. Recommendations are powered by DynamoDB (predictable performance and no need to sort on multiple fields).