How to get started in Big Data for Master’s Students
Mohamed Nadjib Mami
mami@cs.uni-bonn.de
24 March 2018
1. Big Data is a “way of thinking” not a “Domain”
- It is a Situation
- It is a Way of thinking
- It is an Adaptation
- It is not a Domain
- It is not a Specialty
- It is not only Big in size
Limitation of traditional systems
- Size of computational data
- Speed of flowing data
- Formats of data
… Quality/trustworthiness of data
… Importance of data
Dimensions
- Volume
- Velocity
- Variety
- Veracity
- Value
2. Big Data is Data Management in the back
Source: DAMA-DMBOK2 Framework 2014
● It is all about interacting with data
○ Collect
○ Store
○ Maintain & control
○ Retrieve
○ Analyse
● Take a Data Management class, most importantly:
○ Relational algebra and database, ACID properties
○ SQL query language (focus on join and aggregation queries)
○ NOSQL, CAP theorem, BASE properties
○ Batch vs. stream vs. interactive processing
○ Lambda vs. Kappa architectures
○ Data Lake vs. Data Warehouse concepts
● Relational model
○ The basics of basics ... the past, present (& future?)
○ Data modeled in form of relations
■ Algebra: project, select, join, aggregate, union, intersect...
○ Data stored in an RDBMS as tables, tuples, attributes...
● ACID Properties => guarantees DB integrity
○ Atomicity … apply all ops or nothing
○ Consistency … changes respect constraints
○ Isolation … parallel changes do not interfere
○ Durability … no committed change is lost
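The atomicity and consistency properties above can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `account` table, its `CHECK` constraint, and the `transfer` helper are hypothetical names chosen for illustration; they are not part of the original slides.

```python
import sqlite3

# Hypothetical schema: two accounts whose balance may never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit, so the credit is undone too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))
```

Calling `transfer(conn, 1, 2, 500)` violates the constraint on the second update, so the first update is rolled back as well and `balances(conn)` is unchanged.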
● SQL: Structured Query Language
○ Declarative Query Language for Structured data (tables)
○ Aka. relational query language
■ Implements the relational algebra functions
○ (You should) Focus on JOIN and AGGREGATION
■ JOIN is the basis of querying
■ AGGREGATE is the basis of data analytics
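As a small illustration of the two operations to focus on, the sketch below runs one query that joins two tables and aggregates the result, again with the built-in `sqlite3` module. The `customer` and `purchase` tables and their contents are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchase (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO purchase VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# JOIN links each purchase to its customer; GROUP BY + COUNT/SUM aggregates per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS n_purchases, SUM(p.amount) AS total
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

The query is declarative: it states which rows are wanted, and the engine decides how to execute the join and the grouping.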
● NOSQL (aka. non-relational) = Not Only SQL
○ New application needs => new DB management systems
■ Scalable and scale-out solutions (distributed)
■ Representations other than relational/SQL
■ Flexible schema
○ Not only SQL?
■ Similar syntaxes to SQL are used
● CQL (Cassandra Query Language)
● NOSQL (aka. non-relational) = Not Only SQL
○ Key-value: quick lookups (hash, dictionary)
○ Document: query semi-structured data
○ Columnar: query flexible-schema tables
○ Graph: query highly interconnected data
○ Multi-model: a mix of the above
● SQL & NOSQL = friends not foes (complementary)
● NOSQL (aka. non-relational) = Not Only SQL
○ Key-value (Simplest NOSQL model)
■ Encode all data in form of (key : value) pairs
■ Long distributed dictionaries/hash
■ Access: HTTP requests, API, etc.
■ Examples:
● Riak, Redis, Voldemort, Dynamo
■ Sample data:
● 105 : abd
● 106 : azb
● 107 : tvu
● 108 : lol
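The idea of a long distributed dictionary can be sketched as a toy in-memory store that hashes each key to one of several "nodes". `TinyKVStore` is a hypothetical class written for this illustration and does not mirror the API of Riak, Redis, Voldemort, or Dynamo.

```python
import hashlib


class TinyKVStore:
    """Toy key-value store that spreads keys over several in-memory 'nodes'."""

    def __init__(self, n_nodes=3):
        self.nodes = [dict() for _ in range(n_nodes)]

    def _node_for(self, key):
        # Hash the key to pick the node responsible for it.
        digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


store = TinyKVStore()
for key, value in [(105, "abd"), (106, "azb"), (107, "tvu"), (108, "lol")]:
    store.put(key, value)
```

A lookup such as `store.get(106)` only touches the single node that owns the key, which is what makes this model easy to scale out.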
● NOSQL (aka. non-relational) = Not Only SQL
○ Document-oriented
■ Encode data in form of semi-structured “documents”
● Commonly in a JSON-like format
■ Access: HTTP requests, API, etc.
■ Examples:
● MongoDB, CouchDB, Couchbase
{
"FirstName": "AAA",
"LastName": "BBB",
"Hobbies": ["painting", "swimming"]
}
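To show what querying semi-structured documents can look like, here is a minimal sketch over a list of JSON documents held in memory. The `find` helper and the second document are hypothetical; real document stores such as MongoDB expose their own, richer query APIs.

```python
import json

raw_docs = [
    '{"FirstName": "AAA", "LastName": "BBB", "Hobbies": ["painting", "swimming"]}',
    '{"FirstName": "CCC", "Hobbies": ["chess"], "City": "Bonn"}',  # hypothetical second document
]
people = [json.loads(doc) for doc in raw_docs]


def find(collection, **criteria):
    """Return documents whose fields equal (or, for lists, contain) the given values."""
    def matches(doc):
        for field, wanted in criteria.items():
            value = doc.get(field)  # a missing field simply does not match
            if isinstance(value, list):
                if wanted not in value:
                    return False
            elif value != wanted:
                return False
        return True
    return [doc for doc in collection if matches(doc)]


swimmers = find(people, Hobbies="swimming")
```

Note that the two documents do not share the same fields: one has `LastName`, the other `City`. The query still works, which is the practical meaning of a flexible schema.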
● NOSQL (aka. non-relational) = Not Only SQL
○ Columnar
■ Store data in columns (vs. rows in RDBMS)
● Optimized for analytical queries (OLAP)
■ Based on column families
● Like RDBMS tables but with unfixed schema
■ Examples:
● Cassandra, HBase, Accumulo, Bigtable
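The difference between row and column layouts can be sketched with plain Python structures. This only illustrates the general idea of column orientation under made-up data; the column-family model of the systems listed above has further details that this sketch does not capture.

```python
# Row-oriented layout: one record per row, as in a typical RDBMS table.
rows = [
    {"id": 1, "city": "Bonn", "temp": 12.5},
    {"id": 2, "city": "Köln", "temp": 14.0},
    {"id": 3, "city": "Bonn", "temp": 11.5},
]

# Column-oriented layout: one contiguous list per attribute.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytical query (average temperature) touches only the "temp" column.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
```

With the column layout, an aggregate over one attribute never has to read the other attributes, which is why the layout suits analytical workloads.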
● NOSQL (aka. non-relational) = Not Only SQL
○ Graph-oriented
■ Model data in form of graphs (edges and vertices)
■ Optimal for storing highly interconnected, graph-shaped data
● Query data by traversal
■ Examples:
● Neo4j, InfiniteGraph, Neptune
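Querying by traversal can be sketched as a breadth-first walk over an adjacency list. The `follows` graph and the `reachable_within` function are hypothetical names for this illustration; graph databases provide dedicated query languages for the same kind of question.

```python
from collections import deque

# Hypothetical "follows" graph stored as an adjacency list.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}


def reachable_within(graph, start, max_hops):
    """Breadth-first traversal: vertices reachable from `start` in at most `max_hops` edges."""
    seen = {start: 0}  # vertex -> number of hops from start
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if seen[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(vertex, []):
            if neighbour not in seen:
                seen[neighbour] = seen[vertex] + 1
                queue.append(neighbour)
    del seen[start]
    return seen
```

For example, `reachable_within(follows, "alice", 2)` answers "who is within two hops of alice?" by following edges, without any join.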
● NOSQL and distributed systems (network, shared-data)
○ CAP theorem for designing distributed systems
■ Consistency: every read returns the latest written result
■ Availability: every request gets a response, even if stale
■ Partition tolerance: the system keeps working despite lost messages between nodes
○ In the presence of P, choose between C and A (tradeoff)
■ C: query errors or times out as requested data is n/a
■ A: query returns out-of-date results
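The C-versus-A choice during a partition can be made concrete with a toy simulation of one primary and one replica. Everything here (`Replica`, `PartitionError`, the `mode` flag) is a hypothetical simplification; real systems make this choice through replication and quorum settings rather than a single flag.

```python
class Replica:
    def __init__(self):
        self.value = "v1"
        self.connected = True  # False simulates a network partition


class PartitionError(Exception):
    pass


def read(replica, primary, mode):
    """Read from `replica`; `mode` is 'CP' (prefer consistency) or 'AP' (prefer availability)."""
    if replica.connected:
        replica.value = primary.value  # sync with the primary before answering
        return replica.value
    if mode == "CP":
        raise PartitionError("cannot confirm latest value during partition")
    return replica.value  # AP: answer anyway, possibly stale


primary, replica = Replica(), Replica()
replica.connected = False   # partition starts
primary.value = "v2"        # write reaches only the primary
```

During the partition, `read(replica, primary, "AP")` returns the stale `"v1"`, while `read(replica, primary, "CP")` raises `PartitionError` instead of returning possibly outdated data.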
● NOSQL and distributed systems (network, shared-data)
○ CAP theorem for designing distributed systems
■ Too simplistic, but good for learning the basics
○ PACELC extends CAP
■ P(A|C)E(L|C) = if Partition, choose Availability or Consistency; Else, choose Latency or Consistency
[Figure: PACELC decision tree. Partition? then: Availability vs. Consistency; else: Latency vs. Consistency]
● NOSQL and distributed systems (network, shared-data)
○ BASE of NOSQL (contrasting ACID of RDBMS)
○ Suggested by the same person as the CAP theorem (Eric Brewer)
○ Basically available: guarantees CAP Availability
○ Soft state: system state may change over time, even without new input
○ Eventual consistency: system will become consistent over time
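Eventual consistency can be sketched with two replicas that accept writes independently and later exchange state. The last-write-wins `merge` function and the integer timestamps are assumptions made for this illustration; production systems use more careful conflict resolution.

```python
def merge(state_a, state_b):
    """Last-write-wins merge of two replicas: {key: (timestamp, value)}."""
    merged = dict(state_a)
    for key, (ts, value) in state_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged


# Two replicas accept writes independently (soft state: they temporarily disagree).
replica_a = {"cart": (1, ["book"])}
replica_b = {"cart": (2, ["book", "pen"]), "theme": (1, "dark")}

# One synchronisation round: each replica merges the other's state.
replica_a, replica_b = merge(replica_a, replica_b), merge(replica_b, replica_a)
```

Before the synchronisation round the replicas return different answers for `"cart"`; afterwards both hold the same state, which is the "eventual" part of the guarantee.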
● Batch vs. stream vs. interactive processing
○ Batch: actions applied periodically to data accumulated in bulk
■ Example: Extract-Transform-Load (ETL)
○ Stream (real-time): computation applied to data as soon as it arrives
■ Example: analyse weather sensor data
○ Interactive/iterative:
■ Example: Machine Learning algorithms
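The contrast between batch and stream processing can be sketched with the same computation, a mean, written both ways. The sensor readings are made up, and a Python generator stands in for a real stream source.

```python
def batch_mean(readings):
    """Batch: wait for the full data set, then compute once."""
    return sum(readings) / len(readings)


def streaming_mean(readings):
    """Stream: update the result incrementally as each reading arrives."""
    count, mean = 0, 0.0
    for value in readings:
        count += 1
        mean += (value - mean) / count  # incremental update, no history kept
        yield mean


temperatures = [10.0, 12.0, 14.0, 16.0]  # hypothetical sensor readings
running = list(streaming_mean(temperatures))
```

The streaming version produces an up-to-date answer after every reading and keeps only two numbers in memory, while the batch version needs all the data before it can answer.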
● Lambda vs. Kappa architectures
○ Lambda architecture
■ Three layers:
● Batch
● Speed
● Serving
■ Fault-tolerant
■ Scalable
Source: MapR - Lambda Architecture
● Lambda vs. Kappa architectures
○ Kappa architecture
■ Batch layer omitted => batch is treated as a special case of stream
Source: O’Reilly: Applying the Kappa architecture in the telco industry
● Data Warehouse can be implemented on top of Data Lake
Data Lake | Data Warehouse
Repository of raw data in its original form | A well-structured data repository
Append-only, read-only | Read and write
Schema-on-read (no predefined schema) | Schema-on-write (well-predefined schema)
ELT (Extract, Load, Transform) | ETL (Extract, Transform, Load)
Open to any access tools incl. DWH tools | BI and OLAP tools and standards
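The schema-on-read versus schema-on-write distinction can be sketched as two small functions. The `SCHEMA` dictionary, the sample records, and both helper names are hypothetical and only illustrate where the schema gets enforced.

```python
import json

SCHEMA = {"sensor": str, "temp": float}  # hypothetical warehouse schema


def write_to_warehouse(table, record):
    """Schema-on-write: validate before storing; reject non-conforming records."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    table.append(record)


def read_from_lake(raw_lines):
    """Schema-on-read: raw lines were stored untouched; interpret them at query time."""
    for line in raw_lines:
        doc = json.loads(line)
        if "temp" in doc:  # skip records that carry no temperature
            yield {"sensor": str(doc.get("sensor", "unknown")), "temp": float(doc["temp"])}


lake = ['{"sensor": "s1", "temp": "12.5"}', '{"note": "maintenance"}', '{"temp": 14}']
warehouse = []
write_to_warehouse(warehouse, {"sensor": "s1", "temp": 12.5})
```

The lake accepts anything and pushes the interpretation work to each reader; the warehouse pays the validation cost once at write time so that every reader sees clean, uniform rows.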
3. Think big, think distributed
● Adaptation: now we deal with cluster-wide large scale data
● New essential factors come into play
○ Movement (aka shuffling) of large data
○ Reading and writing of large data
● MUST-know: fault-tolerance, replication, high-availability, distributed file system, in addition to previous concepts
○ Advice: learn them from Hadoop (HDFS), Apache Spark
4. Adopt an “Optimizer” way of thinking
● History: my code works!
● Now: my code works fast
⇒ slow code ~= code that does not work
○ How fast does my app get the job done? (performance)
○ How much output does my app generate? (throughput)
● Tuning and optimization are your new concerns e.g.
○ Reduce shuffled data (moved)
○ Reduce data written to/read from disk
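Reducing shuffled data can be illustrated with a word count in which each worker pre-aggregates its own partition before anything is sent over the network. This is a plain-Python sketch with made-up partitions, not Spark or Hadoop code; it mimics the effect of a map-side combiner.

```python
from collections import Counter

# Hypothetical input: words already split across two worker partitions.
partitions = [
    ["spark", "hadoop", "spark", "spark"],
    ["hadoop", "spark", "hdfs"],
]

# Naive plan: every (word, 1) pair is shuffled to the reducers.
naive_shuffled = [(word, 1) for part in partitions for word in part]

# Combiner plan: each worker pre-aggregates locally, then shuffles one pair per distinct word.
combined_shuffled = [pair for part in partitions for pair in Counter(part).items()]


def reduce_counts(pairs):
    """Final aggregation on the reducer side."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)
```

Both plans give the same final counts, but the combiner plan moves 5 records instead of 7. On real data with many repeated keys the gap is far larger, which is the reasoning behind preferring operations that combine locally before the shuffle.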
General advice and comments
● Don’t move to big data settings if you don’t have to
● Don’t hesitate to start it if you feel like … it’s a lot of fun! :)
● For people who intend to do research in relation to big data
○ “I have an idea, I just need to implement it” becomes
○ “I just have an idea, I need to implement it”
○ Two phases instead of one:
■ 1. Make it work on a single machine
■ 2. Make it work on your cluster >> and optimize
○ But it’s a lot of fun … still!
● Can all that fade away? Yes, as anything can, but it is unlikely anytime soon
Wrap-up
1. Big Data is a Way of thinking not a Domain
2. Big Data is Data Management in the back
3. Think big, think distributed
4. Adopt an “Optimizer” way of thinking
questions
