Data Science Tech
“Data Science is the extraction of knowledge
from data using mathematics, statistics,
computer science, machine learning, pattern
recognition, predictive analysis, etc.”
- Wikipedia
“Information is not knowledge”
- attributed to Albert Einstein
- DIKW
- Apache Spark
- YARN
- RDDs
- Apache Hive
- HDFS
- Parquet
- Niometrics
- Columnar DB
- DBMS
- HBase
- OLAP
- OLTP
DIKW Hierarchy
Data
- signals, symbols, raw facts
- first-line products of observation
- unorganized
0101 1101 1011 1001 0101 1101
1011 1001 0101 1101 1011 1001
0110 0001 1010 1111 0101 1101
1011 1001 0101 1101 1011 1001
Information
- inferred from data
- answers interrogative questions
- data that is now useful through organization
and structuring
Knowledge
- knowledge is subjective
- often described as applied information
- synthesis of multiple pieces of information
- contextualized consolidated information
Wisdom
- an appreciation of the why
- knowing the right things to do
- intangible and hard to formalize
Data Science Toolchain 101
Apache Spark
- fast engine for large-scale data processing
- support for Python, Scala, Java
- SparkSQL, MLlib, GraphX, Streaming
YARN
- Yet Another Resource Negotiator
- resource management for computing
resources in a cluster
- can be seen as a distributed operating
system
- separates resource management from
Hadoop's data-processing layer
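The idea of a cluster-wide resource negotiator can be sketched in a few lines. This is a toy model only, not YARN's API: `ToyResourceManager`, the node names, and the first-fit policy are all assumptions made for illustration. Real YARN schedulers (capacity, fair) are considerably more involved.

```python
class ToyResourceManager:
    """Hypothetical scheduler: tracks free memory per node and grants containers."""

    def __init__(self, node_memory_mb):
        # Copy so the caller's dict is not mutated.
        self.free_mb = dict(node_memory_mb)

    def request_container(self, app_id, memory_mb):
        # First-fit (an assumption): grant on the first node with enough free memory.
        for node, free in self.free_mb.items():
            if free >= memory_mb:
                self.free_mb[node] = free - memory_mb
                return {"app": app_id, "node": node, "memory_mb": memory_mb}
        return None  # nothing fits; a real scheduler would queue the request

    def release_container(self, container):
        # Give the memory back to the node that hosted the container.
        self.free_mb[container["node"]] += container["memory_mb"]


rm = ToyResourceManager({"node-a": 4096, "node-b": 2048})
spark_job = rm.request_container("spark-job", 3072)      # fits on node-a
hive_query = rm.request_container("hive-query", 2048)    # node-a is too full, goes to node-b
rejected = rm.request_container("mapreduce-job", 2048)   # no node has 2048 MB free
rm.release_container(spark_job)                          # node-a is fully free again
```

The point of the sketch is the separation of concerns: the applications only ask for resources, and one component decides where they run.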
Apache Hive
- data warehouse infrastructure
- provides data querying, analysis, and
aggregation
- initially developed at Facebook
Resilient Distributed Datasets (RDDs)
- Spark's core data abstraction: an immutable,
partitioned collection of records (not a database)
- partitions are spread across the nodes of a
cluster and processed in parallel
- fault tolerance comes from lineage: a lost partition
is recomputed from the transformations that built it
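Lineage-based recovery, which is how Spark's RDDs stay fault-tolerant, can be illustrated without a cluster. The sketch below is plain Python and not Spark's API; `ToyRDD` and its methods are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, partitioned, lineage-aware."""
    source: Optional[List[List[int]]] = None    # base partitions (root only)
    parent: Optional["ToyRDD"] = None           # lineage pointer
    fn: Optional[Callable[[int], int]] = None   # transformation applied to the parent

    def map(self, fn):
        # Transformations are lazy: they only record lineage, nothing is computed yet.
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute_partition(self, index):
        # Rebuild a single partition by replaying the lineage from the root.
        if self.parent is None:
            return list(self.source[index])
        return [self.fn(x) for x in self.parent.compute_partition(index)]

    def collect(self):
        return [x for i in range(self.num_partitions())
                for x in self.compute_partition(i)]


base = ToyRDD(source=[[1, 2], [3, 4], [5, 6]])
result = base.map(lambda x: x * x).map(lambda x: x + 1)

# Simulate a node failure that loses partition 1: only that partition is recomputed.
recovered = result.compute_partition(1)   # [10, 17]
everything = result.collect()             # [2, 5, 10, 17, 26, 37]
```

No copy of the derived data is kept anywhere; the recorded transformations are enough to rebuild whatever is lost.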
HDFS
- Hadoop Distributed File System
- fault tolerance comes from splitting files into
blocks and replicating each block across
nodes and racks
- Java-based; spans clusters of
commodity servers
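Block splitting and replica placement can be sketched as follows. The 128 MiB block size and replication factor of 3 are the usual HDFS defaults (both configurable). The placement function is a deterministic simplification of the default rack-aware policy (one replica on one rack, two on different nodes of another rack), and the rack and node names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MiB, the usual default
REPLICATION = 3                  # usual default replication factor


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    # A file is stored as a sequence of fixed-size blocks; the last may be partial.
    return math.ceil(file_size_bytes / block_size)


def place_replicas(block_id, racks):
    """Pick 3 nodes for a block. Assumes at least 2 racks with at least 2 nodes each."""
    rack_names = sorted(racks)
    local_rack = rack_names[block_id % len(rack_names)]
    remote_rack = rack_names[(block_id + 1) % len(rack_names)]
    local_nodes, remote_nodes = racks[local_rack], racks[remote_rack]
    first = local_nodes[block_id % len(local_nodes)]          # replica 1: "local" rack
    second = remote_nodes[block_id % len(remote_nodes)]       # replica 2: another rack
    third = remote_nodes[(block_id + 1) % len(remote_nodes)]  # replica 3: same remote rack, other node
    return [first, second, third]


racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
n_blocks = block_count(300 * 1024 * 1024)   # a 300 MiB file needs 3 blocks
layout = {b: place_replicas(b, racks) for b in range(n_blocks)}
# layout[0] == ["n1", "n3", "n4"]: losing any one node or one whole rack
# still leaves at least one copy of the block.
```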
HBase
- Distributed non-relational database
- modelled after Google’s BigTable
- runs on top of HDFS
- fault-tolerant storage for large, sparse data sets
- supports compression, in-memory operation,
and Bloom filters
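The BigTable-style data model behind HBase is often described as a sparse, sorted, multidimensional map. A rough in-memory analogy, not the HBase client API, is shown below; the `put`/`get` helpers, row keys, and column names are made up for the example.

```python
from collections import defaultdict

# row key -> "family:qualifier" -> {timestamp: value}
table = defaultdict(lambda: defaultdict(dict))


def put(row, column, value, ts):
    # Each write adds a new timestamped version of the cell.
    table[row][column][ts] = value


def get(row, column):
    """Return the newest version of a cell, or None if it was never written."""
    versions = table.get(row, {}).get(column)
    if not versions:
        return None
    return versions[max(versions)]


put("user#1", "info:name", "Ada", ts=1)
put("user#1", "info:name", "Ada L.", ts=2)   # newer version of the same cell
put("user#2", "stats:logins", 7, ts=1)       # different columns per row are fine

latest_name = get("user#1", "info:name")     # "Ada L."
missing = get("user#2", "info:name")         # None: absent cells cost no storage
```

Sparseness here means that a row only stores the cells it actually has, which is why wide tables with mostly empty columns remain cheap.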
Parquet
- Columnar file format
- stores data in columns instead of rows as in
traditional relational databases
- enables efficient compression and encoding
- well suited to OLAP workloads
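Why a columnar layout compresses well can be shown with a toy run-length encoding. This is not the Parquet file format or any Parquet library; the sample rows and the `run_length_encode` helper are assumptions for illustration. Parquet does use run-length and dictionary encodings among others, but its on-disk details differ.

```python
from itertools import groupby

# Row-oriented records: (date, country, amount)
rows = [
    ("2024-01-01", "SG", 10),
    ("2024-01-01", "SG", 12),
    ("2024-01-01", "PH", 7),
    ("2024-01-02", "PH", 9),
]

# Pivot into one list per column, which is how a columnar format groups values.
names = ("date", "country", "amount")
columns = {name: list(values) for name, values in zip(names, zip(*rows))}


def run_length_encode(values):
    # Collapse consecutive repeats into (value, run_length) pairs.
    return [(value, len(list(group))) for value, group in groupby(values)]


encoded_date = run_length_encode(columns["date"])
encoded_country = run_length_encode(columns["country"])
# encoded_date    == [("2024-01-01", 3), ("2024-01-02", 1)]
# encoded_country == [("SG", 2), ("PH", 2)]
```

Values in the same column share a type and tend to repeat, so they encode far more compactly side by side than when interleaved with the other fields of each row.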
Columnar Database
- stores each table column by column rather
than row by row
- advantageous for data warehousing
- more efficient for aggregates computed over
many rows but only a few columns
- well suited to OLAP workloads
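A small comparison makes the efficiency argument concrete. The record fields below are invented for the example, and "fields read" is only a rough proxy for I/O, so treat this as a sketch under those assumptions.

```python
# Five hypothetical records with four fields each.
records = [
    {"id": i, "region": "north" if i % 2 else "south", "amount": i * 10, "note": "n/a"}
    for i in range(1, 6)
]

# Row store: summing one field still walks every whole record.
row_fields_read = sum(len(record) for record in records)   # 5 records x 4 fields
row_total = sum(record["amount"] for record in records)

# Column store: the same data kept as one array per column.
column_store = {key: [record[key] for record in records] for key in records[0]}
col_fields_read = len(column_store["amount"])               # only the 5 amounts
col_total = sum(column_store["amount"])
```

Both layouts give the same total (150), but the columnar scan touches 5 values instead of 20, and the gap widens as tables get wider.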
Online Transaction Processing (OLTP)
- information systems that facilitate and manage
transactions
- data entry and retrieval
- provides source data for data warehousing
- emphasis on fast, simple queries that touch
few rows, ACID guarantees, and concurrent
multi-user access
- supports operational business processes
Online Analytical Processing (OLAP)
- low transactional volume
- complex queries with aggregation
- OLAP uses data from OLTP systems
- queries involve traversing massive quantities
of data
- involves business intelligence/data science
activities
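The contrast between the two workloads can be sketched with SQLite from the Python standard library. SQLite is only a convenient stand-in here (it is a row-oriented, OLTP-style engine), and the `orders` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL)"
)

# OLTP style: many small writes, each in its own transaction.
new_orders = [("ana", "north", 20.0), ("ben", "south", 35.0), ("cy", "north", 15.0)]
for customer, region, amount in new_orders:
    with conn:  # commits on success, rolls back if the block raises
        conn.execute(
            "INSERT INTO orders (customer, region, amount) VALUES (?, ?, ?)",
            (customer, region, amount),
        )

# OLTP style: a point lookup by primary key touches a single row.
one_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# OLAP style: an aggregate that scans the whole table and groups the result.
totals = conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM orders GROUP BY region ORDER BY region"
).fetchall()
# one_order == ("ben", 35.0)
# totals    == [("north", 35.0, 2), ("south", 35.0, 1)]
```

In practice the analytical query would run against a warehouse loaded from the transactional system rather than against the live OLTP database.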
Thank You
