1 /
A Study Review of Common Big Data
Architecture for
Small-medium Enterprise
Ridwan Fadjar Septian, ridwanbejo@gmail.com, Master of information System, Faculty of Postgraduate,
Universitas Komputer Indonesia.
Fajri Abdillah, clasense04@gmail.com, Senior Software Engineer, Horangi Cyber Security.
Tajhul Faijin Aliyudin, tajhulfaijin@gmail.com, Senior Software Engineer, Tado.live.
MSCEIS 2019
2 /
1. Introduction
● Big data is a set of facilities that processes large-scale
datasets in complex ways, beyond the traditional infrastructure
that only handles small amounts of data [1].
● The growing number of users, mobile devices and Internet of
Things devices has become a factor for enterprises, which can
gather more data from their users [1].
● Those large datasets could be analyzed and processed with
distributed processing and large-scale storage to support
their businesses [1].
3 /
1. Introduction (2)
Characteristics of Big Data [1]:
● Volume, the quantity of the dataset, from gigabytes to petabytes or more.
● Variety, structured, semi-structured and unstructured datasets
● Veracity, quality of the dataset before and after data preprocessing
● Velocity, frequency of collecting the dataset from a number of event
sources
● Value, useful information and patterns that might be converted into
knowledge
● Variability, the dataset might be in a sparse format that needs
preprocessing first
4 /
1. Introduction (3)
Big Data methodology [2]:
1. Acquisition of the dataset
2. Organization of the dataset
3. Analysis of the dataset
4. Decision-making based on the analysis results
5 /
1. Introduction (4)
Big Data methodology [4]:
1. Data acquisition
2. Data storing
3. Data management
4. Data analysis
5. Data visualization
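The five stages above can be read as a linear pipeline. Below is a toy sketch in Python; every function name and the tiny in-memory dataset are our own illustration, not something taken from [4]:

```python
# Toy sketch of the five-stage big data methodology as a linear pipeline.
# All stage functions are hypothetical placeholders.

def acquire():
    # 1. Data acquisition: pull raw events from a source
    return [{"user": "a", "amount": 10}, {"user": "b", "amount": 5}]

def store(records):
    # 2. Data storing: persist raw records (here: an in-memory "lake")
    return list(records)

def manage(lake):
    # 3. Data management: keep only well-formed records
    return [r for r in lake if "user" in r and "amount" in r]

def analyze(records):
    # 4. Data analysis: aggregate amount per user
    totals = {}
    for r in records:
        totals[r["user"]] = totals.get(r["user"], 0) + r["amount"]
    return totals

def visualize(totals):
    # 5. Data visualization: render a crude text bar chart
    return {user: "#" * total for user, total in totals.items()}

report = visualize(analyze(manage(store(acquire()))))
```

In a real deployment each stage would be a separate system (queue, lake, ETL, warehouse, dashboard), as the later sections describe.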
6 /
1. Introduction (5)
Advanced data analysis in Big Data [2,3,5,7]:
1. Regression learning
2. Classification
3. Association rule mining
4. Clustering
5. Forecasting
6. Deep learning
7. Natural language processing
7 /
1. Introduction (6)
Big Data adoption (example):
1. Education: analytics for academics, learning, information
technology and institutional information [6]
2. Environmental technology: energy efficiency, sustainable
farming and agriculture, smart cities, national strategies that do
not harm the environment [5]
3. Healthcare industry: improved pharmaceuticals, personalized
patient care, medical device design, fraud detection on medical
claims, preventive action against certain diseases based on
genomic analysis [4]
8 /
1. Introduction (7)
Big Data risks (examples) [1,2,3,6,7]:
1. Security and privacy protection
2. Data ownership and transparency
3. Data quality
4. Cost of implementation
5. Infrastructure maintenance
6. Developer failure
7. Security system and compliance with the regulation
9 /
2. Methods
1. Research planning
– Studying common big data architectures
– Finding an architecture that could be implemented by a small-medium
enterprise
2. Scoping the research
– Studying the big data architecture and its components only
– This research does not cover the cost calculation for implementing a big
data architecture
3. Data acquisition by reviewing a number of papers
– ~42 papers were obtained as references for this study review
4. Conclusion and recommendations
10 /
3. Result and Discussion
1. Enterprise Architecture for Big Data Project
2. Event Sources
3. Message Queue
4. Data Lake
5. Extract Transform Load (ETL)
6. Data Warehouse
7. Data Mining Methodology
8. Data Visualization
9. Security and Compliance
10. Small Medium Enterprise in Big Data Era
11 /
3.1 Enterprise Architecture for
Big Data Project
TOGAF phases [8]:
1. Preliminary Phase
2. Phase A: Architecture Vision
3. Phase B: Business Architecture
4. Phase C: Information System Architectures
5. Phase D: Technology Architecture
6. Phase E: Opportunities and Solutions
7. Phase F: Migration Planning
8. Phase G: Implementation Governance
9. Phase H: Architecture Change Management
12 /
3.1 Enterprise Architecture for
Big Data Project (1)
TOGAF for Big Data [9]:
1. Has the potential to stabilize the implementation of a big data architecture
2. Clear business goals
3. Alignment with business requirements
4. Clear planning
5. Better recognition of the project scope
6. Better communication between stakeholders
7. Better change management
8. Improved focus on business opportunities
TOGAF also has a Target Capability Model that could guide an oil and gas
enterprise in building big data capabilities and migrating from its former data
infrastructure to a big data infrastructure [10].
13 /
3.2 Event sources
Event sources for Big Data:
1. Web applications [10]: clickstream behaviour, liked posts, shared posts,
recommendations to friends
2. Internet of Things (IoT) [11, 12, 13]: parking lot occupancy detection and
heat regulation at a university (390 GB), 10,000 sensors around an industrial
area sending data every 15 minutes, collecting Return Air
Temperature (RAT) and Set Point Temperature (SPT) from sensors in the
manufacturing sector.
3. External data sources [14]: Transaction Processing Performance Council
dataset (TPC-DI)
4. Mobile applications [15, 16]: geolocation, geo-tagged tweets, tweet
analysis, time spent at tourism sites.
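As an illustration of the web and mobile sources above, a clickstream event is typically captured as a small JSON record before being published downstream. A sketch with hypothetical field names, not a schema from any of the cited papers:

```python
import json
import time

# Hypothetical clickstream event from a web application; field names are
# illustrative only.
def make_click_event(user_id, page, action):
    return {
        "user_id": user_id,
        "page": page,
        "action": action,        # e.g. "click", "like", "share"
        "ts": int(time.time()),  # event time in epoch seconds
    }

event = make_click_event("u42", "/products/7", "like")
payload = json.dumps(event)  # what would be published to a queue
```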
14 /
3.3 Message Queue
Some facts about message queues [17]:
● software that retains messages from producers for a certain period,
until they are processed by consumers.
● organized by topic, and each topic has a partition key.
● consumers can run as more than one instance and process the
messages reliably.
● sits between the web application and data storage to prevent data loss
while processing requests from clients
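The topic/partition/consumer behaviour described above can be sketched in-process. Real systems such as Apache Kafka do this durably across machines; this is a toy in-memory model only:

```python
from collections import defaultdict

# Toy in-memory sketch of the topic/partition model; not a real broker.
class MessageQueue:
    def __init__(self, num_partitions=3):
        self.num_partitions = num_partitions
        # topic -> partition index -> retained messages
        self.topics = defaultdict(lambda: defaultdict(list))

    def produce(self, topic, key, message):
        # The partition key decides which partition retains the message.
        partition = hash(key) % self.num_partitions
        self.topics[topic][partition].append(message)
        return partition

    def consume(self, topic, partition):
        # A consumer instance drains one partition; several instances can
        # each take different partitions to process messages in parallel.
        messages = self.topics[topic][partition]
        self.topics[topic][partition] = []
        return messages

mq = MessageQueue()
p = mq.produce("orders", key="user-1", message={"item": "book"})
consumed = mq.consume("orders", p)
```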
15 /
3.3 Message Queue (1)
16 /
3.4 Data Lake
What is a Data Lake [24]:
• A form of massive data storage that collects raw datasets before they
get further processing
• Initially introduced by James Dixon, originally on top of the Hadoop
File System.
• Could be built using open-source solutions such as the Hadoop File
System from Apache Hadoop.
• Enterprises can build their data lake with the mentioned open-source
products in their own environment, more cost-effectively than
processing the raw data in a database.
17 /
3.4 Data Lake (1)
What is a Data Lake [24] (continued):
• The dataset could be stored in the data lake in various formats such as
CSV, Apache Avro, Apache ORC, text files, JSON, XML, etc
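Landing raw files in a lake can be sketched as below; the `events/dt=...` directory partitioning convention and file names are our assumption, not something prescribed by [24]:

```python
import json
import tempfile
from pathlib import Path

# Sketch of landing raw JSON-lines files in a data-lake directory layout.
# The dataset/dt=YYYY-MM-DD partitioning scheme is illustrative.
lake_root = Path(tempfile.mkdtemp())

def land_json(records, dataset, dt):
    path = lake_root / dataset / f"dt={dt}"
    path.mkdir(parents=True, exist_ok=True)
    target = path / "part-0000.json"
    # Store records as-is: a data lake keeps the raw form (schema on read).
    target.write_text("\n".join(json.dumps(r) for r in records))
    return target

out = land_json([{"id": 1}, {"id": 2}], dataset="events", dt="2019-10-12")
rows = [json.loads(line) for line in out.read_text().splitlines()]
```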
18 /
3.4 Data Lake (2)
Data lakes have several key features, such as [25]:
• large-scale batch processing, schema on read, the ability to store large
data volumes at low cost,
• access through SQL-like systems even when the dataset is not in SQL
format,
• complex processing that can even apply machine learning operations,
• storage of raw datasets instead of a compact format such as SQL,
• low cost for distributed processing
19 /
3.4 Data Lake (3)
A data lake has core components such as [26]:
• on the backend side:
• catalog storage,
• batch job performance and scheduling
• fault tolerance
• garbage collection of metadata.
• on the frontend side:
• dataset profile pages
• dataset search
• team dashboards
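Two of the components above, catalog storage (backend) and dataset search (frontend), can be sketched minimally; the class, its fields and the sample datasets are entirely illustrative:

```python
# Minimal sketch of a data-lake catalog with search. Illustrative only.
class DataCatalog:
    def __init__(self):
        self._entries = {}  # dataset name -> metadata

    def register(self, name, owner, description):
        # Backend: catalog storage keeps metadata about each dataset.
        self._entries[name] = {"owner": owner, "description": description}

    def search(self, term):
        # Frontend: dataset search over names and descriptions.
        term = term.lower()
        return sorted(
            name for name, meta in self._entries.items()
            if term in name.lower() or term in meta["description"].lower()
        )

catalog = DataCatalog()
catalog.register("clickstream_raw", "web-team", "raw click events from the site")
catalog.register("sales_daily", "bi-team", "daily aggregated sales")
hits = catalog.search("click")
```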
20 /
3.5 Extract Transform Load (ETL)
What is ETL [14]:
• ETL (Extract, Transform, Load) is the part of big data whose role is to
convert raw datasets into cleansed datasets.
• ETL is a set of tools that could help the enterprise perform real-time
analytics and make decisions.
• ETL could be divided into three approaches: micro-batch,
near real-time and streaming.
• An ETL pipeline is a combination of a scheduler and ETL scripts.
21 /
3.5 Extract Transform Load (ETL)
(1)
What is ETL [14] (continued):
• The scheduler could use software such as cron jobs, while Apache Kafka
could be used for the streaming approach.
• ETL scripts could be executed as distributed processes on a cluster of
workers, for example with Apache Spark, or on a single node, for
example with plain SQL or a programming language.
• ETL could transform datasets from SQL format into other text formats or
vice versa.
• Data formats supported by ETL technology include XML, CSV, JSON,
Apache Avro, Apache ORC, etc.
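A single-node ETL script of the kind a cron job could schedule might look like the sketch below; the cleansing rule (drop rows with a missing price) and the sample data are our example, not prescribed by [14]:

```python
import csv
import io
import json

# Sketch of a single-node ETL script: CSV in, cleansed JSON lines out.
RAW_CSV = """sku,price
A1,10.5
B2,
C3,7.0
"""

def extract(text):
    # Extract: parse the raw CSV into dict rows.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform (cleanse): drop incomplete rows, cast price to float.
    return [
        {"sku": r["sku"], "price": float(r["price"])}
        for r in rows if r["price"]
    ]

def load(rows):
    # Load: emit JSON lines for the next stage (e.g. a warehouse loader).
    return "\n".join(json.dumps(r) for r in rows)

output = load(transform(extract(RAW_CSV)))
```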
22 /
3.5 Extract Transform Load (ETL)
(2)
Data preprocessing is also part of the ETL phase and consists of [29]:
• imperfect data handling
• dimensionality reduction
• instance reduction
• discretization
• imbalanced data handling
• incomplete data handling
Data preprocessing plays a part in improving the results of machine
learning models
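Two of the steps above, imperfect-data handling and discretization, can be sketched as follows; the mean-fill strategy, bin edges and sample values are illustrative choices, not from [29]:

```python
# Sketch of two preprocessing steps: filling missing values with the mean
# (imperfect-data handling) and binning numbers into ordinal labels
# (discretization). Thresholds and data are illustrative.
def fill_missing(values):
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def discretize(values, edges=(10, 20)):
    # Map a numeric value to an ordinal bin: low / mid / high.
    labels = ["low", "mid", "high"]
    def bin_of(v):
        for i, edge in enumerate(edges):
            if v < edge:
                return labels[i]
        return labels[len(edges)]
    return [bin_of(v) for v in values]

raw = [5.0, None, 25.0, 15.0]
filled = fill_missing(raw)  # the None is replaced by mean(5, 25, 15)
bins = discretize(filled)
```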
23 /
3.6 Data Warehouse
A data warehouse is [30]:
• a denormalized database that holds generic information to answer
management-level questions
• a centralized data source that could supply data for developing the
enterprise's strategic plan, supporting better decisions based on the
historical data it stores.
• a facility for performing online analytical processing (OLAP) over the
data source to answer business needs.
• a facility that receives the cleansed dataset processed by the ETL part
of the big data pipeline.
• a source for data mining tasks such as forecasting, classification,
pattern recognition, clustering, etc
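The OLAP role above can be sketched with an in-memory SQLite database standing in for the warehouse; the denormalized sales schema and the figures are illustrative:

```python
import sqlite3

# Sketch of an OLAP-style aggregation over a denormalized sales table,
# the kind of management-level question a warehouse answers.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        region TEXT, product TEXT, year INTEGER, revenue REAL
    )
""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("west", "book", 2018, 100.0),
        ("west", "book", 2019, 150.0),
        ("east", "pen", 2019, 30.0),
    ],
)

# "Which region earned the most revenue over all recorded history?"
rows = conn.execute("""
    SELECT region, SUM(revenue) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").fetchall()
```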
24 /
3.6 Data Warehouse (1)
In some cases, such as in the educational sector, the data warehouse has
key roles [31]:
• feasibility assessment and data analysis of an educational system from
different viewpoints.
• quick decision-making to evaluate the educational system.
• collecting huge amounts of data from several kinds of existing
databases and unifying those sources into a single data source.
• serving as the basis for decision support system (DSS) techniques.
25 /
3.7 Data Mining Methodology
Data mining is one of the key activities on top of big data. There are some
known methodologies for performing better data mining processes, such as
[32]:
● KDD process
● CRISP-DM
● RAMSYS
● DMIE
● DMEE
● ASUM-DM
● AABA
● etc
26 /
3.7 Data Mining Methodology (1)
KDD consists of several steps [32]:
● selection
● preprocessing
● transformation
● data mining
● knowledge gain
27 /
3.7 Data Mining Methodology (2)
CRISP-DM, a popular methodology, performs a complete cycle of steps
[32]:
● Business understanding
● Data understanding
● Data preparation
● Modeling
● Evaluation
● Deployment
28 /
3.7 Data Mining Methodology (3)
CRISP-DM example use cases [32, 35, 36]:
● In the retail sector, CRISP-DM is applied to perform a data mining
process that uses association rule mining to predict sales patterns
● Research in climatology has also been conducted by applying CRISP-
DM and KDD in combination to improve the data mining results
● Other research, on social media analysis, showed that CRISP-DM
helped the data mining process produce better results in classifying
favorite TV series with the Decision Tree algorithm
29 /
3.8 Data Visualization
Visualization could deliver more engagement to management-level and
other users, so they can understand the insights gained from big data [37].
● Visualizations could take the form of a web dashboard or a document
that contains an explanation and a story.
● Data could be visualized with various graphs such as bar charts, line
charts, pie charts, scatter charts, map charts, etc.
● Data visualization itself has several types: linear, planar, volumetric,
temporal, multidimensional, and network data visualization
30 /
3.8 Data Visualization (1)
There are also various open-source products that can help enterprises
visualize their insights from big data, such as:
● Pentaho
● Apache Zeppelin
● Apache Superset
● Metabase
● etc.
These products can connect to the Hadoop File System as well as various
database products commonly used in the market.
31 /
3.9 Security and Compliance
Some security approaches that might be applied to big data infrastructure
[1]:
● encryption
● security as a service
● real-time monitoring
● privacy by design
● data protection and authorization
● log management
● authentication
● data anonymization
● secure communication line
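One of the approaches above, data anonymization, can be sketched as pseudonymization with a salted hash; the inline salt and token length are illustrative only, and real deployments need proper key management:

```python
import hashlib

# Sketch of pseudonymization: replace a direct identifier with a salted
# hash so records can still be joined without exposing the identity.
def pseudonymize(email, salt):
    digest = hashlib.sha256((salt + email).encode("utf-8")).hexdigest()
    return digest[:16]  # shortened token standing in for the identity

record = {"email": "alice@example.com", "amount": 12}
safe = {
    "user_token": pseudonymize(record["email"], salt="s3cret"),
    "amount": record["amount"],
}
```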
32 /
3.9 Security and Compliance (1)
Enterprise security for big data [42]:
● NIST for Big Data
● ISO/IEC 20546:2019 for Big Data
● ISO/IEC 27001:2017
33 /
3.10 Small Medium Enterprise in
Big Data Era
● Based on the TOGAF standard, an enterprise could be: 1) a whole
corporation or a division of that corporation, 2) a government agency
or a single government department, 3) distributed organizations
linked by common ownership and separated geographically, 4)
groups of countries or governments working together to create
common or shareable deliverables and partnerships, or 5) alliances of
businesses working together [8].
● A small-medium enterprise, meanwhile, is an enterprise defined by
annual work units, annual turnover and annual balance sheet
according to the European Commission [43, 44]. A small-medium
enterprise is thus differentiated from the enterprise itself by the size
of these three indicators.
34 /
3.10 Small Medium Enterprise in
Big Data Era (1)
Based on prior research, according to the European Commission and
European Union standards [43, 44]:
● Annual work units >= 10 and < 50, with annual turnover and
annual balance sheet < 10 million euro -> small enterprise
● Annual work units >= 50 and < 250, with annual turnover and
annual balance sheet < 50 million euro -> medium enterprise
In that case, we could state that a small-medium enterprise has between
10 and 250 annual work units under the European Commission and
European Union standards
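The thresholds above can be sketched as a classification function; for brevity this uses only head count and turnover (omitting the balance sheet), with the cut-offs exactly as stated on this slide:

```python
# Sketch of the SME classification on this slide ([43, 44]).
# work_units: annual work units; turnover_m_eur: annual turnover in
# million euro. The balance-sheet criterion is omitted for brevity.
def classify_enterprise(work_units, turnover_m_eur):
    if 10 <= work_units < 50 and turnover_m_eur < 10:
        return "small"
    if 50 <= work_units < 250 and turnover_m_eur < 50:
        return "medium"
    return "other"

label = classify_enterprise(work_units=30, turnover_m_eur=4)
```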
35 /
3.10 Small Medium Enterprise in
Big Data Era (2)
Advantages of leveraging big data for small-medium enterprises [45]:
● learning patterns from past transactions
● combining them with external data to understand market behaviour in
order to gain competitive advantage and growth
● increasing product improvement and innovation against
competitors
36 /
4. Conclusion
Common big data architecture for small-medium enterprises could be
categorized into three components that consist of:
● design and architecture component: a small-medium enterprise could
leverage enterprise architecture framework such as TOGAF based on the
study review
● infrastructure component: it could consist of event sources, message
queue, data lake, extract-transform-load (ETL), data warehouse and data
visualization
● operational component: the study found that the data mining
methodology runs on top of the infrastructure.
37 /
4. Conclusion (1)
● Using a methodology such as CRISP-DM or one of the other
methodologies could produce better data mining results.
● Security and compliance could be included in the operational
component, since security and compliance form a life cycle for the
sustainability of the big data infrastructure and architecture against
threats, vulnerabilities and risks.
● SMEs could utilize open-source products at minimum cost that could
be established as part of the big data infrastructure, such as:
– Apache NiFi, Apache Hadoop, Apache Kafka, Apache Spark, Apache
Storm, Scribe, RabbitMQ, Apache Zeppelin, Metabase, PostgreSQL,
etc.
38 /
References

More Related Content

PDF
Data Gloveboxes: A Philosophy of Data Science Data Security
PPTX
Data lake-itweekend-sharif university-vahid amiry
PDF
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
PPTX
Big Data in the Real World
DOCX
Hotel inspection data set analysis copy
PPTX
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
PDF
Big Data Computing Architecture
PDF
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Data Gloveboxes: A Philosophy of Data Science Data Security
Data lake-itweekend-sharif university-vahid amiry
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Big Data in the Real World
Hotel inspection data set analysis copy
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Computing Architecture
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)

What's hot (20)

PPTX
Shaping a Digital Vision
PPTX
The key to unlocking the Value in the IoT? Managing the Data!
PDF
Lecture6 introduction to data streams
PPTX
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
PDF
What's new in SQL on Hadoop and Beyond
PPTX
Active Learning for Fraud Prevention
PDF
Introduction to Apache Apex by Thomas Weise
PPTX
Managed Cluster Services
PPTX
Building a Scalable Data Science Platform with R
PPTX
Hadoop project design and a usecase
PPTX
PDF
Lambda architecture for real time big data
PDF
Big Data Ready Enterprise
PPTX
Lambda-less Stream Processing @Scale in LinkedIn
PDF
Strata San Jose 2017 - Ben Sharma Presentation
PPTX
Data & analytics challenges in a microservice architecture
PDF
Big Telco - Yousun Jeong
PPTX
Querying Druid in SQL with Superset
PPTX
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
PDF
Bigdata Hadoop project payment gateway domain
Shaping a Digital Vision
The key to unlocking the Value in the IoT? Managing the Data!
Lecture6 introduction to data streams
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
What's new in SQL on Hadoop and Beyond
Active Learning for Fraud Prevention
Introduction to Apache Apex by Thomas Weise
Managed Cluster Services
Building a Scalable Data Science Platform with R
Hadoop project design and a usecase
Lambda architecture for real time big data
Big Data Ready Enterprise
Lambda-less Stream Processing @Scale in LinkedIn
Strata San Jose 2017 - Ben Sharma Presentation
Data & analytics challenges in a microservice architecture
Big Telco - Yousun Jeong
Querying Druid in SQL with Superset
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Bigdata Hadoop project payment gateway domain
Ad

Similar to A Study Review of Common Big Data Architecture for Small-Medium Enterprise (20)

PDF
J0212065068
PPTX
BDA-Module-1.pptx
PDF
Big data service architecture: a survey
PDF
E018142329
PPTX
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
PPTX
Lecture 3.31 3.32.pptx
PDF
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
PDF
Big data Question bank.pdf
PDF
Big Data Analytics Unit I CCS334 Syllabus
PPTX
unit 1 big data.pptx
DOC
GouriShankar_Informatica
PDF
A Logical Architecture is Always a Flexible Architecture (ASEAN)
PDF
pole2016-A-Recent-Study-of-Emerging-Tools.pdf
PDF
An Overview of Data Lake
PDF
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
PDF
Big Data Processing with Hadoop : A Review
DOC
Dwdm unit 1-2016-Data ingarehousing
PDF
Big data analysis concepts and references
PDF
IRJET- A Scenario on Big Data
PDF
Data Engineer's Lunch #85: Designing a Modern Data Stack
J0212065068
BDA-Module-1.pptx
Big data service architecture: a survey
E018142329
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
Lecture 3.31 3.32.pptx
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
Big data Question bank.pdf
Big Data Analytics Unit I CCS334 Syllabus
unit 1 big data.pptx
GouriShankar_Informatica
A Logical Architecture is Always a Flexible Architecture (ASEAN)
pole2016-A-Recent-Study-of-Emerging-Tools.pdf
An Overview of Data Lake
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
Big Data Processing with Hadoop : A Review
Dwdm unit 1-2016-Data ingarehousing
Big data analysis concepts and references
IRJET- A Scenario on Big Data
Data Engineer's Lunch #85: Designing a Modern Data Stack
Ad

More from Ridwan Fadjar (20)

PDF
Google Cloud Platform for Python Developer - Beginner Guide.pdf
PDF
My Hashitalk Indonesia April 2024 Presentation
PDF
PyCon ID 2023 - Ridwan Fadjar Septian.pdf
PDF
Cloud Infrastructure automation with Python-3.pdf
PDF
GraphQL- Presentation
PDF
Bugs and Where to Find Them (Study Case_ Backend).pdf
PDF
Introduction to Elixir and Phoenix.pdf
PDF
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
PDF
CS meetup 2020 - Introduction to DevOps
PDF
Why Serverless?
PDF
SenseHealth Indonesia Sharing Session - Do we really need growth mindset (1)
PDF
Risk Analysis of Dutch Healthcare Company Information System using ISO 27001:...
PDF
Mongodb intro-2-asbasdat-2018-v2
PDF
Mongodb intro-2-asbasdat-2018
PDF
Mongodb intro-1-asbasdat-2018
PDF
Resftul API Web Development with Django Rest Framework & Celery
PDF
Memulai Data Processing dengan Spark dan Python
PDF
Kisah Dua Sejoli: Arduino & Python
PDF
Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014
PDF
Modul pelatihan-django-dasar-possupi-v1
Google Cloud Platform for Python Developer - Beginner Guide.pdf
My Hashitalk Indonesia April 2024 Presentation
PyCon ID 2023 - Ridwan Fadjar Septian.pdf
Cloud Infrastructure automation with Python-3.pdf
GraphQL- Presentation
Bugs and Where to Find Them (Study Case_ Backend).pdf
Introduction to Elixir and Phoenix.pdf
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
CS meetup 2020 - Introduction to DevOps
Why Serverless?
SenseHealth Indonesia Sharing Session - Do we really need growth mindset (1)
Risk Analysis of Dutch Healthcare Company Information System using ISO 27001:...
Mongodb intro-2-asbasdat-2018-v2
Mongodb intro-2-asbasdat-2018
Mongodb intro-1-asbasdat-2018
Resftul API Web Development with Django Rest Framework & Celery
Memulai Data Processing dengan Spark dan Python
Kisah Dua Sejoli: Arduino & Python
Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014
Modul pelatihan-django-dasar-possupi-v1

Recently uploaded (20)

PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Review of recent advances in non-invasive hemoglobin estimation
PPT
Teaching material agriculture food technology
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PPTX
Big Data Technologies - Introduction.pptx
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
[발표본] 너의 과제는 클라우드에 있어_KTDS_김동현_20250524.pdf
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Advanced IT Governance
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Approach and Philosophy of On baking technology
PDF
Advanced Soft Computing BINUS July 2025.pdf
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Empathic Computing: Creating Shared Understanding
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Review of recent advances in non-invasive hemoglobin estimation
Teaching material agriculture food technology
Chapter 3 Spatial Domain Image Processing.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
“AI and Expert System Decision Support & Business Intelligence Systems”
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Big Data Technologies - Introduction.pptx
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Dropbox Q2 2025 Financial Results & Investor Presentation
[발표본] 너의 과제는 클라우드에 있어_KTDS_김동현_20250524.pdf
Understanding_Digital_Forensics_Presentation.pptx
Advanced IT Governance
NewMind AI Weekly Chronicles - August'25 Week I
Approach and Philosophy of On baking technology
Advanced Soft Computing BINUS July 2025.pdf
Diabetes mellitus diagnosis method based random forest with bat algorithm
Empathic Computing: Creating Shared Understanding

A Study Review of Common Big Data Architecture for Small-Medium Enterprise

  • 1. 1 / A Study Review of Common Big Data Architecture for Small-medium Enterprise Ridwan Fadjar Septian, ridwanbejo@gmail.com, Master of information System, Faculty of Postgraduate, Universitas Komputer Indonesia. Fajri Abdillah, clasense04@gmail.com, Senior Software Engineer, Horangi Cyber Security. Tajhul Faijin Aliyudin, tajhulfaijin@gmail.com, Senior Software Engineer, Tado.live. MSCEIS 2019
  • 2. 2 / 1. Introduction ● Big data is a set of facility that processes those large-scale dataset in a complex way beside the traditional infrastructure that only applied with small amount of dataset [1]. ● Amount of users, mobile device and internet of thing become a factor for the enterprise which could gain more dataset from their user [1]. ● That large dataset could be analyzed and processed to support their businesses with distributed processing and large-scale storage [1].
  • 3. 3 / 1. Introduction (2) Characteristic of Big Data [1]: ● Volume, the quantity of the dataset. From Gigabyte to Petabyte or more. ● Variety, structured dataset, semi structured dataset, unstructured dataset ● Veracity, quality of dataset before and after data preprocessing ● Velocity, frequency of collecting the dataset from amount of event sources ● Value , useful information and pattern that might be converted into knowledge ● Variability, dataset might be in sparse format that need a preprocessing first
  • 4. 4 / 1. Introduction (3) Big Data methodology [2]: 1. Acquisistion of the dataset 2. Organizing the dataset 3. Analyze the dataset 4. Take a decision from the analysis result
  • 5. 5 / 1. Introduction (4) Big Data methodology [4]: 1. Data acquisition 2. Data storing 3. Data management 4. Data analysis 5. Data visualization
  • 6. 6 / 1. Introduction (5) Advance data analysis in Big Data [2,3,5,7]: 1. Regression learning 2. Classification 3. Association rule mining 4. Clustering 5. Forecasting 6. Deep learning 7. Natural language processing
  • 7. 7 / 1. Introduction (6) Big Data adoption (example): 1. Education: Analytics for academic, learning, information technology and institutional information [6] 2. Environmental Technology: Energy efficiency, sustainable farming and agriculture, smart city, national strategy without harming the environment [5] 3. Healthcare Industry: improved pharmaceutical, personalized patient care healthcare, medical device design, fraud detection on medical claim, preventive action for certain diseases based on genomic analysis [4]
  • 8. 8 / 1. Introduction (7) Big Data risk (example) [1,2,3,6,7]: 1. Security and privacy protection 2. Data ownership and transparency 3. Data quality 4. Cost of implementation 5. Infrastructure maintenance 6. Developer failure 7. Security system and compliance with the regulation
  • 9. 9 / 2. Methods 1. Research planning – Studying a common architecture of big data – Find architecture that could be implemented by small-medium enterprise 2. Scoping the research – Studying the big data architecture and its component only – This research isn’t covering cost calculation for implementing the big data architecture 3. Data acquisition by reviewing amount of papers – ~42 papers are obtained to become references for this study review 4. Conclusion and recommendations
  • 10. 10 / 3. Result and Discussion 1. Enterprise Architecture for Big Data Project 2. Event Sources 3. Message Queue 4. Data Lake 5. Extract Transform Load (ETL) 6. Data Warehouse 7. Data Mining Methodology 8. Data Visualization 9. Security and Compliance 10. Small Medium Enterprise in Big Data Era
  • 11. 11 / 3.1 Enterprise Architecture for Big Data Project TOGAF phase [8]: 1. Preliminary Phase 2. Phase A: Architecture Vision 3. Phase B: Business Architecture 4. Phase C: Information System Architectures 5. Phase D: Technology Architecture 6. Phase E: Opportunities and Solutions 7. Phase F: Migration Planning 8. Phase G: Implementation Governance and Phase 9. Phase H: Architecture Change Management
  • 12. 12 / 3.1 Enterprise Architecture for Big Data Project (1) TOGAF for Big Data [9]: 1. Has a potential for stabilize the implementation of big data architecture 2. Clear business goal 3. Aligned with business requirements 4. Clear planning 5. Better in recognize project scope 6. Better communication between stakeholders 7. Better change management 8. Improve focus on business opportunities TOGAF also has Target Capability Model that could drive the oil and gas enterprise to build big data to manage and migrate from their former data infrastructure to big data infrastructure [10].
  • 13. 13 / 3.2 Event sources Event sources for Big Data: 1. Web application [10]: clickstream behaviour, liked post, sharing post, recommendation to friends 2. Internet of Thing (IoT) [11, 12, 13]: parking lot occupation detection and heat regulation at university (390 Gb), 10.000 sensors around industrial area that sent dataset for every 15 minutes, collecting Return Air Temperature (RAT) and Set Point Temperature (SPT) from sensors at manufacture sector. 3. External data sources [14]: Transaction Processing Performance Council dataset (TPC-DI) 4. Mobile application [15, 16]: geolocation, geo-tagged tweet, tweet analysis, spending-time at tourism objects.
  • 14. 14 / 3.3 Message Queue Some facts about message queue [17]:  software that retain some message for certain period from producers and it will be processed by the consumers.  organized by certain of topic and each topic has partition key.  consumer could be more than one instance and perform a reliable process to process the messages.  conjunction for web application and data storage to prevent lost data during processing requests from the clients
  • 15. 15 / 3.3 Message Queue (1)
  • 16. 16 / 3.4 Data Lake What is Data Lake [24]: • Kind form of massive data storage that collects raw dataset before the dataset gets further processing • On top of Hadoop File System that initially introduced by James Dixon. • Could be utilized by using the open-source solution such as Hadoop File System from Apache Hadoop. • Enterprise can build their data lake with mentioned open source products in their environment and more cost-effective than process it on the database.
  • 17. 17 / 3.4 Data Lake (1) What is Data Lake [24]: • Kind form of massive data storage that collects raw dataset before the dataset gets further processing • On top of Hadoop File System that initially introduced by James Dixon. • Could be utilized by using the open-source solution such as Hadoop File System from Apache Hadoop. • Enterprise can build their data lake with mentioned open source products in their environment and more cost-effective than process it on the database. • The dataset could be stored on data lake with various extensions such as CSV, Apache Avro, Apache Orc, Textfile, JSON, XML, etc
  • 18. 18 / 3.4 Data Lake (2) Data lake have several key features such as [25]: • Large scale batch processing, schema on read, able to store large scale data volume with low cost, • Could be accessed using SQL-like systems even they are not in SQL format dataset, • Complex processing that’s even apply machine learning operation, • Store raw dataset instead compact format such as in SQL format, • Low cost for distributed processing
  • 19. 19 / 3.4 Data Lake (3) Data lake has core components such as [26]: • on the backend side such as • catalog storage, • batch job performance and scheduling • fault tolerance • garbage collection of metadata. • On the frontend side such as • dataset profile pages • dataset search • team dashboards
20 /
3.5 Extract Transform Load (ETL)
What is ETL [14]:
● ETL (Extract, Transform, Load) is the part of big data that has the role of converting a raw dataset into a cleansed dataset.
● ETL is a set of tools that could help the enterprise to perform real-time analytics and decisions.
● ETL could be divided into three approaches: micro-batch, near real-time, and streaming.
● ETL is typically a combination of a scheduler and an ETL script.
21 /
3.5 Extract Transform Load (ETL) (1)
What is ETL [14]:
● The scheduler could use software such as cron jobs, while Apache Kafka could be used for the streaming approach.
● The ETL script could be executed as distributed processes across a cluster of workers, such as with Apache Spark, or executed on a single node using plain SQL or a programming language.
● ETL could transform the dataset from SQL format into other text formats or vice versa.
● Data formats supported by ETL technology include XML, CSV, JSON, Apache Avro, Apache ORC, etc.
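A minimal single-node micro-batch ETL sketch, using sqlite3 as the source database and a CSV buffer as the load target. The `orders` table schema is an assumption for illustration; a scheduler such as cron would invoke a script like this periodically.

```python
import csv
import io
import sqlite3

# Extract source: a toy operational database (assumed schema: orders(id, amount, currency)).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, currency TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10.0, "usd"), (2, 5.5, "usd")])

def extract(conn):
    return conn.execute("SELECT id, amount, currency FROM orders").fetchall()

def transform(rows):
    # Cleansing step: normalize currency codes to upper case.
    return [(i, amount, cur.upper()) for i, amount, cur in rows]

def load(rows):
    # Load into a CSV buffer standing in for a file the warehouse would pick up.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "amount", "currency"])
    writer.writerows(rows)
    return buf.getvalue()

out = load(transform(extract(src)))
assert "USD" in out
```

The same extract/transform/load split scales up directly: Apache Spark distributes the transform across workers, while the extract and load become reads from the lake and writes to the warehouse.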
22 /
3.5 Extract Transform Load (ETL) (2)
Data preprocessing is also part of the ETL phase and consists of [29]:
● imperfect data handling
● dimensionality reduction
● instance reduction
● discretization
● imbalanced data
● incomplete data
Data preprocessing could play a part in improving the result of a machine learning model.
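Two of the preprocessing steps listed above, incomplete-data handling and discretization, can be sketched in plain Python. The mean-imputation strategy and bin boundaries are illustrative choices, not prescriptions from the source.

```python
def fill_missing(values):
    """Incomplete-data handling: replace None with the mean of observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def discretize(value, bins):
    """Discretization: map a numeric value to the label of the first matching bin.
    `bins` is a list of (upper_bound, label) pairs in ascending order."""
    for upper, label in bins:
        if value < upper:
            return label
    return bins[-1][1]

ages = fill_missing([20, None, 40])
assert ages == [20, 30.0, 40]

age_bins = [(18, "minor"), (65, "adult"), (float("inf"), "senior")]
assert discretize(35, age_bins) == "adult"
```

Real pipelines would apply such steps column by column over the cleansed dataset before it reaches the model-training stage.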
23 /
3.6 Data Warehouse
A data warehouse is a kind of [30]:
● denormalized database that holds generic information to cover management-level questions
● centralized data source that could supply the data to develop the strategic plan of the enterprise, enabling better decisions based on the historical data it stores
● facility to perform online analytical processing (OLAP) over the data source to answer business needs
● facility that receives the cleansed dataset processed by the ETL part of the big data pipeline
● source for data mining tasks such as forecasting, classification, pattern recognition, clustering, etc.
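The denormalized, OLAP-oriented shape described above can be sketched with sqlite3: one wide fact table with the dimensions flattened into each row, queried with aggregations. The table and column names are hypothetical.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
# A denormalized sales fact table: dimensions (region, month) flattened into each row,
# so management-level questions need no joins.
dw.execute("CREATE TABLE sales_fact (region TEXT, month TEXT, revenue REAL)")
dw.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    ("west", "2019-01", 100.0),
    ("west", "2019-02", 150.0),
    ("east", "2019-01", 80.0),
])

# An OLAP-style rollup: total revenue per region.
rollup = dw.execute(
    "SELECT region, SUM(revenue) FROM sales_fact GROUP BY region ORDER BY region"
).fetchall()
assert rollup == [("east", 80.0), ("west", 250.0)]
```

Dedicated warehouse engines add columnar storage and distributed execution, but the query pattern is the same GROUP BY rollup.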
24 /
3.6 Data Warehouse (1)
In some cases, such as in the educational sector, the data warehouse has key roles [31]:
● feasibility assessment and data analysis for an educational system from different viewpoints
● making quick decisions to evaluate the educational system
● collecting a huge amount of data from several kinds of existing databases and unifying those data sources into a single one
● supporting decision support system (DSS) techniques over the data warehouse
25 /
3.7 Data Mining Methodology
Data mining is one of the methodologies over big data. There are some known methodologies for performing better data mining processes, such as [32]:
● KDD process
● CRISP-DM
● RAMSYS
● DMIE
● DMEE
● ASUM-DM
● AABA
● etc.
26 /
3.7 Data Mining Methodology (1)
KDD has several steps that consist of [32]:
● selection
● preprocessing
● transformation
● data mining
● knowledge gain
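The KDD steps above compose naturally as a function pipeline. This sketch uses a made-up transaction dataset, and a trivial summary statistic stands in for the actual mining step.

```python
def selection(raw):
    # Selection: keep only the records relevant to the question at hand.
    return [r for r in raw if r["country"] == "ID"]

def preprocessing(rows):
    # Preprocessing: drop records with a missing amount.
    return [r for r in rows if r["amount"] is not None]

def transformation(rows):
    # Transformation: project to the single feature the mining step needs.
    return [r["amount"] for r in rows]

def data_mining(amounts):
    # Data mining: a mean stands in here for a real model or pattern search.
    return sum(amounts) / len(amounts)

raw = [
    {"country": "ID", "amount": 10.0},
    {"country": "SG", "amount": 99.0},
    {"country": "ID", "amount": None},
    {"country": "ID", "amount": 20.0},
]
# The final "knowledge gain" step is interpreting this number in business terms.
knowledge = data_mining(transformation(preprocessing(selection(raw))))
assert knowledge == 15.0
```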
27 /
3.7 Data Mining Methodology (2)
The popular CRISP-DM has several steps that perform a complete operation, consisting of [32]:
● business understanding
● data understanding
● data preprocessing
● modeling
● evaluation
● deployment
28 /
3.7 Data Mining Methodology (3)
CRISP-DM example use cases [32, 35, 36]:
● In the retail sector, CRISP-DM is applied in a data mining process that uses association rule mining to predict sales patterns.
● Research on climatology has also been conducted by applying CRISP-DM and KDD in combination to improve the data mining result.
● Other research on social media analysis has shown that CRISP-DM made the data mining process give a better result in classifying favorite TV series by applying the Decision Tree algorithm.
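The retail use case mentions association rule mining; its two core metrics, support and confidence, can be computed in a few lines. The basket contents here are invented for illustration.

```python
# Toy market baskets: each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence for the rule antecedent -> consequent.
    support    = P(antecedent and consequent together)
    confidence = P(consequent | antecedent)"""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / n, both / ante

support, confidence = rule_metrics(transactions, {"bread"}, {"milk"})
assert support == 0.5       # bread and milk appear together in 2 of 4 baskets
assert confidence == 2 / 3  # of the 3 bread baskets, 2 also contain milk
```

Algorithms such as Apriori automate the search for all rules whose support and confidence exceed chosen thresholds.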
29 /
3.8 Data Visualization
Visualization could deliver more engagement to the management level or other users so they can understand the insights gained from big data [37].
● Visualization could take the form of a web dashboard or a document that contains an explanation and a story.
● Data could be visualized with various graphs such as bar charts, line charts, pie charts, scatter charts, map charts, etc.
● Data visualization itself has several different types: linear, planar, volumetric, temporal, multidimensional, and network data visualization.
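The core idea of a bar chart, mapping each value to a proportionally sized mark, can be sketched even without a charting library; real dashboards render this with graphics instead of text. The data values are invented.

```python
def ascii_bar_chart(data, width=20):
    """Render label->value pairs as a text bar chart, scaled so the
    largest value spans `width` characters."""
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(value / peak * width)
        lines.append(f"{label:>8} | {bar} {value}")
    return "\n".join(lines)

chart = ascii_bar_chart({"Jan": 120, "Feb": 80, "Mar": 160})
assert chart.count("|") == 3  # one bar row per label
```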
30 /
3.8 Data Visualization (1)
There are also various open-source products that can help the enterprise visualize insights from big data, such as:
● Pentaho
● Apache Zeppelin
● Apache Superset
● Metabase
● etc.
Those products have the capability to connect with the Hadoop File System as well as with various database products commonly used in the market.
31 /
3.9 Security and Compliance
Some security approaches that might be applied to big data infrastructure [1]:
● encryption
● security as a service
● real-time monitoring
● privacy by design
● data protection and authorization
● log management
● authentication
● data anonymization
● secure communication line
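One of the listed approaches, data anonymization, can be sketched as keyed pseudonymization: PII is replaced by a token that still allows joins across datasets but cannot be reversed without the key. The key value and field choice are illustrative assumptions.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-regularly"  # assumption: a key stored outside the dataset

def pseudonymize(value):
    """Replace a PII value with a keyed hash (pseudonym).
    The same input always yields the same token, so datasets can still be
    joined on it, but the original value cannot be recovered without the key."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

a = pseudonymize("alice@example.com")
b = pseudonymize("alice@example.com")
assert a == b and a != "alice@example.com"
```

Full anonymization for compliance purposes usually also requires generalization or suppression of quasi-identifiers, which this sketch does not cover.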
32 /
3.9 Security and Compliance (1)
Enterprise security standards for big data [42]:
● NIST for Big Data
● ISO/IEC 20546:2019 for Big Data
● ISO/IEC 27001:2017
33 /
3.10 Small Medium Enterprise in Big Data Era
● Based on the TOGAF standard, an enterprise could be considered as: 1) a whole corporation or a division of that corporation, 2) a government agency or a single government department, 3) distributed organizations linked together by common ownership and separated geographically, 4) groups of countries or governments working together to create common or shareable deliverables and partnerships, or 5) alliances of businesses working together [8].
● A small-medium enterprise, meanwhile, is an enterprise defined by annual work units, annual turnover, and annual balance sheet according to the European Commission [43, 44]. A small-medium enterprise is a differentiation of the enterprise itself based on the size of those three indicators.
34 /
3.10 Small Medium Enterprise in Big Data Era (1)
Based on prior research according to the European Commission with European Union standards [43, 44]:
● annual work units per year >= 10 and < 50, with annual turnover and annual balance sheet < 10 million euro -> small enterprise
● annual work units per year >= 50 and < 250, with annual turnover and annual balance sheet < 50 million euro -> medium enterprise
In that case, we could state that a small-medium enterprise has between 10 and 250 annual work units based on the European Commission with European Union standards.
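The thresholds above can be written as a small classifier. Note this is a simplified sketch: the actual EU definition lets an enterprise satisfy either the turnover or the balance-sheet ceiling, whereas here both are checked against the same limit for brevity.

```python
def classify_enterprise(work_units, turnover_meur, balance_meur):
    """Classify an enterprise using the EU-derived thresholds cited above.
    Inputs: annual work units, turnover and balance sheet in million euro.
    Simplification: both financial indicators must be under the ceiling."""
    if 10 <= work_units < 50 and turnover_meur < 10 and balance_meur < 10:
        return "small"
    if 50 <= work_units < 250 and turnover_meur < 50 and balance_meur < 50:
        return "medium"
    return "other"

assert classify_enterprise(25, 4, 3) == "small"
assert classify_enterprise(120, 30, 20) == "medium"
assert classify_enterprise(500, 400, 300) == "other"
```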
35 /
3.10 Small Medium Enterprise in Big Data Era (2)
Advantages of leveraging big data for a small-medium enterprise [45]:
● learn its patterns from past transactions
● combine with external data to understand market behaviour in order to gain competitive advantage and growth
● increase product improvement and innovation against competitors
36 /
4. Conclusion
A common big data architecture for small-medium enterprises could be categorized into three components:
● design and architecture component: a small-medium enterprise could leverage an enterprise architecture framework such as TOGAF, based on the study review
● infrastructure component: it could consist of event sources, message queue, data lake, extract-transform-load (ETL), data warehouse, and data visualization
● operational component: the study found that the data mining methodology runs on top of the infrastructure
37 /
4. Conclusion (1)
● Using a methodology such as CRISP-DM or one of the other methodologies could produce a better data mining result.
● Security and compliance could be included in the operational component, since security and compliance form a life cycle for the sustainability of the big data infrastructure and architecture against threats, vulnerabilities, and risks.
● An SME could utilize open-source products at minimum cost that could be established as part of the big data infrastructure, such as:
  – Apache Nifi, Apache Hadoop, Apache Kafka, Apache Spark, Apache Storm, Scribe, RabbitMQ, Apache Zeppelin, Metabase, PostgreSQL, etc.