Confidential
Marketing Data Lake in the Cloud
Agenda
● Marketing things are boring, aren’t they?
● Starting points
● Challenges on a project
● Next steps and evolution
● Conclusions
Marketing things are boring, aren’t they?
What is marketing about?
● Research what we buy
● Figure out purchase behavior
● Target the audience for ads better
● Help adjust ad campaigns
How the advertiser business works
[Diagram. Buyers side: Brand, Agency, DSP, Ad Network. Sellers side: Ad Network, SSP, Publisher, Audience. Center: Ad Exchange with RTB; DMP/Data Supply.]
Business scenario
Figure out how advertising (online and offline impressions) leads us to the purchase.
Example: I saw an ad on my cell phone, then on my laptop, then on a printed coupon, and I bought the promoted item.
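A minimal sketch of this scenario in plain Scala, assuming an identity map that resolves each device or coupon id to one person. The types `Impression` and `Order`, the object `Attribution` and all ids below are hypothetical and are not taken from the project; the real platform does this matching at scale in Spark.

```scala
// Hypothetical types for illustration; the slides do not show the real schema.
final case class Impression(deviceId: String, channel: String, timestamp: Long)
final case class Order(personId: String, item: String, timestamp: Long)

object Attribution {
  // Returns the channels of all impressions the buyer saw before the order,
  // in chronological order. identityMap resolves a device id to a person id.
  def pathToPurchase(
      order: Order,
      impressions: Seq[Impression],
      identityMap: Map[String, String]
  ): Seq[String] =
    impressions
      .filter(i => identityMap.get(i.deviceId).contains(order.personId))
      .filter(_.timestamp < order.timestamp)
      .sortBy(_.timestamp)
      .map(_.channel)
}

object AttributionDemo {
  // Made-up sample data: three ids belong to person-A, one to person-B.
  val identityMap: Map[String, String] = Map(
    "phone-1"   -> "person-A",
    "laptop-7"  -> "person-A",
    "coupon-42" -> "person-A",
    "phone-9"   -> "person-B"
  )

  val impressions: Seq[Impression] = Seq(
    Impression("laptop-7", "laptop", 200L),
    Impression("phone-1", "cell phone", 100L),
    Impression("phone-9", "cell phone", 150L),      // another person
    Impression("coupon-42", "printed coupon", 300L),
    Impression("phone-1", "cell phone", 500L)       // seen after the purchase
  )

  val order: Order = Order("person-A", "promoted item", 400L)

  def main(args: Array[String]): Unit =
    println(Attribution.pathToPurchase(order, impressions, identityMap).mkString(" -> "))
}
```

Impressions of other people and impressions seen after the purchase are filtered out, so only the buyer's own path to the order remains.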
Privacy concerns
Q: Do such companies spy on me?
A: In a way, yes, but you agreed to share this information about yourself. Read the EULA.
Q: Do they want to steal my private data?
A: No, they don’t.
Q: Do they have my bank account numbers or credit card numbers?
A: No, they don’t.
Q: Do they buy information about me from other companies?
A: Yes, they do.
Starting points
Big Data platform on Microsoft Azure IaaS
- Cloudbreak-based setup
- HDP services managed by Ambari
- No unit tests
- Extensive use of NiFi
- Scala-based Spark jobs
- No CI/CD
Sources by technology
- Oracle
- Netezza
- MongoDB
- Files in an S3 bucket
- Hive on a third-party server
Scheduling execution chain
Oozie
Shell script
Java application
NiFi
Spark job / Sqoop job / distcp job
Shell script
PostgreSQL
Solutions map
[Diagram. Components: our data platform, old data platform, AdExchange1 to AdExchange4, MDM, Kafka, Analytics, UI. Labelled flows: store orders and impressions, impressions, identities, matched impressions, setup campaign.]
[Diagram. DataLake zones: Raw, Refined, Target, Presentation, Indexing. Labelled areas: data sources, ingestion, orchestration, transformation, business scenario 1, business scenario 2.]
Challenges on a project
No unit tests
Let’s write them!

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SaveMode
    import org.scalatest.{BeforeAndAfter, FunSuite, Matchers}

    // MyMap, Db and MainClass are project classes that the slides do not show.
    // The suite name MyMapLoadTest is a placeholder.
    class MyMapLoadTest extends FunSuite with DataFrameSuiteBase with Matchers {

      test("load id map table data") {
        // given: a partitioned Hive table populated with known rows
        val expectedData = List(
          MyMap(id1 = "GUID1", id2 = "GUID2", id_type = "aaid") // ....... (further rows elided on the slide)
        )
        val expected = spark.createDataFrame(sc.parallelize(expectedData))
        expected.write
          .mode(SaveMode.Append)
          .format("hive")
          .partitionBy("id_type")
          .saveAsTable(Db.MySchema.name + "." + Db.MySchema.Table.MyTable)

        // when: the function under test reads the table back
        val actual = MainClass.functionToTest(spark, Db.MySchema.name, Db.MySchema.Table.MyTable)

        // then: the schema and the data match what was written
        val actualFieldsQueried = actual.schema.fields.map(f => f.name)
        withClue("Actual fields queried:\n" + actual.schema.treeString) {
          actualFieldsQueried shouldEqual Array("id1", "id2", "id_type")
        }
        val actualData = actual.collect()
        withClue(actualData.mkString("\n", "\n", "\n")) {
          actualData.length should equal(expectedData.size)
          withClue("Actual id2 field differs from expected") {
            // compare against the source list, not the DataFrame
            actualData.map(r => r.getAs[String]("id2")) should contain theSameElementsAs expectedData.map(id => id.id2)
          }
        }
      }
    }
Scheduling invocation chain is too long
- Get rid of NiFi
- Get rid of shell scripts
- Get rid of Oozie
One job runs for 4 hours and takes all the resources of the cluster
The job has to analyze the last 52 weeks of order history
We can make it incremental!
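One way to make such a job incremental, sketched in plain Scala under stated assumptions: keep one small aggregate per week, aggregate only the newest week on each run, and combine the stored aggregates for the 52-week view instead of rescanning the raw orders. The names `WeeklyTotals`, `IncrementalHistory`, the integer week index and the per-customer order count are hypothetical; the slides do not show what the real job computes.

```scala
// Hypothetical types; the real job and its schema are not shown in the slides.
// week is an increasing integer index, e.g. weeks since some fixed start date.
final case class WeeklyTotals(week: Int, ordersByCustomer: Map[String, Long])

object IncrementalHistory {
  val WindowWeeks = 52

  // Adds the newly aggregated week (replacing it if the run is repeated)
  // and drops the weeks that fell out of the 52-week window.
  def advance(stored: Vector[WeeklyTotals], newWeek: WeeklyTotals): Vector[WeeklyTotals] =
    (stored.filterNot(_.week == newWeek.week) :+ newWeek)
      .filter(_.week > newWeek.week - WindowWeeks)
      .sortBy(_.week)

  // Combines the stored weekly aggregates; raw orders are not touched here.
  def windowTotals(stored: Vector[WeeklyTotals]): Map[String, Long] =
    stored.foldLeft(Map.empty[String, Long]) { (acc, w) =>
      w.ordersByCustomer.foldLeft(acc) { case (m, (customer, n)) =>
        m.updated(customer, m.getOrElse(customer, 0L) + n)
      }
    }
}

object IncrementalHistoryDemo {
  // Made-up data: 52 stored weeks with one order each, then week 53 arrives.
  val initial: Vector[WeeklyTotals] =
    (1 to 52).map(w => WeeklyTotals(w, Map("alice" -> 1L))).toVector

  val updated: Vector[WeeklyTotals] =
    IncrementalHistory.advance(initial, WeeklyTotals(53, Map("alice" -> 2L, "bob" -> 1L)))

  def main(args: Array[String]): Unit = {
    println(updated.size)
    println(IncrementalHistory.windowTotals(updated))
  }
}
```

In the demo, week 1 drops out when week 53 arrives, so the window still holds 52 weeks and each run only has to aggregate one new week of orders.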
No automated rollout of the cluster
- Setting up a new cluster takes about 7-10 days
- Cloudbreak blueprints do not help much
No CI/CD
- Automate at least the build, unit tests and deployment of the jars
- CI/CD of the Oozie scripts is partially covered
- CD for the shell scripts
Low level of security
- Every developer uses the sa account to get to the edge node
- Single admin user for Ambari
- Developers have access to PROD
Stuck with Kafka 0.7
The customer has locked itself into this old version of Kafka
The only option for consuming messages is the old Java library
Next steps and evolution
Tactical:
- NiFi for stream ingestion only
- Move to Airflow
- Move to Databricks
Strategic:
- Get rid of Kafka 0.7
- Switch to the Next Generation platform
What is the Next Generation platform?
A fully managed data platform based on Azure PaaS
- Azure Data Factory
- Azure Databricks
- Azure Data Lake
- Azure Event Hubs
Conclusions
- DevOps, QA, L2 and InfoSec are your best friends
- Companies show strong demand for moving to PaaS
- Don’t leave security for last
- General programming practices are for everyone, not just for software engineers
- Automate everything that you can
- Documentation first
Q&A session