Chapter Two
Data science
• A multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
What is Data?
• Data is raw, unprocessed facts or figures that, on their own, seem random and of little use to a human.
• Data can be represented with the help of:
■ Alphabets (A-Z, a-z)
■ Digits (0-9)
■ Special characters (+, -, *, /, <, >, =, etc.)
[Figure: Data vs. Information. Source: internet]
Data Vs. Information
• Data is raw, unprocessed facts or figures; information is the processed form of data.
Data:
■ Unprocessed
■ Not suited to a specific need
■ Does not depend on information
■ Measured in bits and bytes
Information:
■ Processed data
■ Suited to a specific need, because unwanted or irrelevant data has been removed
■ Depends on data
■ Measured in meaningful units such as time and quality
Data Processing Life cycle
• Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• Data processing consists of the following basic steps:
■ Input (preparing the data for processing)
■ Processing (changing the input data into a more useful form)
■ Output (collecting the result of processing)
Data Processing Cycle
Input:
• The data is prepared in a convenient form for processing. The form depends on the processing machine.
• For example, when electronic computers are used for data processing, the input data can be recorded on a hard disk, CD, flash disk, and so on.
Source: Introduction to emerging technology module, page 23
Processing:
• The input data is changed to produce data in a more useful form.
• For example, interest can be calculated on a deposit to a bank, or a summary of sales for the month can be calculated from the sales orders.
Output:
• The result of the processing step is collected. The particular form of the output data depends on the use of the data.
• For example, the output data may be a payroll for employees.
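The three steps can be sketched in a few lines of Python, using the bank-interest example from the slide. This is only an illustrative sketch: the function names, the account labels, and the 5% rate are assumptions made up for the example, not part of the module.

```python
def read_input(raw_rows):
    """Input: prepare raw text rows as (account, balance) pairs."""
    prepared = []
    for row in raw_rows:
        account, balance = row.split(",")
        prepared.append((account.strip(), float(balance)))
    return prepared


def process(deposits, annual_rate):
    """Processing: compute one year of simple interest for each deposit."""
    return {account: balance * annual_rate for account, balance in deposits}


def write_output(interest_by_account):
    """Output: collect the results as report lines, sorted by account."""
    return [
        f"{account}: {interest:.2f}"
        for account, interest in sorted(interest_by_account.items())
    ]


# Hypothetical raw input, as it might arrive from a file on disk.
raw_rows = ["ACC-1, 1000.00", "ACC-2, 250.00"]
report = write_output(process(read_input(raw_rows), annual_rate=0.05))
print(report)
```

Each function maps to one step, so the form of the input or the output can change without touching the processing logic.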
Data types and their representation
• A data type describes the kind of data that is going to be processed.
• It can be defined from diverse perspectives:
1. Programming perspective
2. Data analytics perspective
Data types and their representation
• Programming perspective
A data type is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
Common data types from a programming perspective:
■ Integers (int): used to store whole numbers, mathematically known as integers
■ Booleans (bool): used to represent values restricted to one of two values, true or false
■ Characters (char): used to store a single character
■ Floating-point numbers (float): used to store real numbers
■ Alphanumeric strings (string): used to store a combination of characters and numbers
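As a rough illustration, the common types look like this in Python. The variable names and values are invented for the example. Note one difference from the list above: Python has no separate char type, so a single character is simply a string of length one.

```python
count = 42            # integer: a whole number
is_valid = True       # boolean: one of two values, True or False
grade = "A"           # single character (a str of length 1 in Python)
price = 19.99         # floating-point number: a real number
label = "Room 101"    # alphanumeric string: characters and digits together

values = {
    "count": count,
    "is_valid": is_valid,
    "grade": grade,
    "price": price,
    "label": label,
}

# Ask the interpreter which type it associates with each value.
type_names = {name: type(value).__name__ for name, value in values.items()}
print(type_names)
```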
Data types and their representation
• Data analytics perspective
Defines the operations that can be done on the data, the meaning of the data, and the way values of that type can be stored.
Common data types from a data analytics perspective:
■ Structured: data in a tabular format, with relationships between rows, columns, and tables. Ex. SQL databases, Excel files
■ Semi-structured: uses tags and other markers; also known as a self-describing structure. Ex. JSON and XML
■ Unstructured: not organized in a predefined manner. Ex. audio, video, images, textual data, NoSQL databases
■ Metadata: data about data. Ex. when and where a photo was taken
Structured Data
• It conforms to a tabular format with a relationship between the different rows and columns.
• Examples of structured data are Excel files and SQL databases. Each of these has structured rows and columns that can be sorted.
Source: internet
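A minimal sketch of structured data, using Python's built-in sqlite3 module with an in-memory database. The table name, column names, and rows are hypothetical; the point is only that every row follows the same columns, so the data can be sorted directly.

```python
import sqlite3

# An in-memory SQL database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO students (id, name, score) VALUES (?, ?, ?)",
    [(1, "Abebe", 78.5), (2, "Sara", 91.0), (3, "Lensa", 84.0)],
)

# Because every row has the same columns, sorting is a one-line query.
rows = conn.execute(
    "SELECT name, score FROM students ORDER BY score DESC"
).fetchall()
conn.close()
print(rows)
```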
Semi-structured data
• It is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
• Examples of semi-structured data include JSON and XML.
Source: internet
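A small sketch of semi-structured data, parsed with Python's built-in json module. The document content is made up for illustration. The keys act as the markers that describe each value, records nest inside one another, and the two course records do not need to share the same fields.

```python
import json

# A hypothetical JSON document: keys label the values, and records nest.
document = """
{
  "student": {
    "name": "Sara",
    "courses": [
      {"title": "Emerging Technologies", "credits": 3},
      {"title": "Data Structures", "credits": 5, "lab": true}
    ]
  }
}
"""

record = json.loads(document)
courses = record["student"]["courses"]

titles = [course["title"] for course in courses]
# The "lab" field is optional, so fall back to False when it is absent.
has_lab = [course.get("lab", False) for course in courses]
print(titles, has_lab)
```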
Unstructured Data
• It is information that either does not have a predefined data model or is not organized in a predefined manner. Unstructured information is typically text-heavy but may also contain data such as dates, numbers, and facts. This results in irregularities and ambiguities that make it difficult to understand using traditional programs, compared to data stored in structured databases.
• Examples of unstructured data include audio files, video files, and NoSQL databases.
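The ambiguity mentioned above can be seen with a short Python sketch that pulls dates and numbers out of free text using regular expressions. The note text and the patterns are assumptions chosen for the example. A naive "find every number" pattern also picks up the pieces of each date, which is exactly the kind of irregularity a fixed table schema avoids.

```python
import re

# A hypothetical free-text note with dates and a quantity mixed into prose.
note = "Met the supplier on 2024-03-15; they quoted 1200 units, delivery by 2024-04-01."

# Dates written as YYYY-MM-DD.
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", note)

# A naive pattern for "numbers" cannot tell a quantity from part of a date.
naive_numbers = re.findall(r"\b\d+\b", note)

print(dates)
print(naive_numbers)
```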
Emerging Technology Chapter 2 Data Science
Data value Chain
• Describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
• The steps of the data value chain are:
■ Data Acquisition: gathering, filtering, and cleaning data
■ Data Analysis: making raw data amenable to use in decision making by discovering, transforming, and modelling data
■ Data Curation: active management of data, including content creation, selection, classification, transformation, validation, and preservation, to improve the accessibility and quality of data
■ Data Storage: storing the processed data
■ Data Usage: using the processed data to make decisions
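The chain can be pictured as a sequence of small functions, each handing its result to the next. The sketch below is a toy under stated assumptions: the temperature readings, the plausibility range, the 30.0 threshold, and the dictionary standing in for storage are all hypothetical, and a real big data system would use dedicated tools at every step.

```python
from statistics import mean


def acquire(raw):
    """Acquisition: gather the readings, filter out blanks, clean whitespace."""
    return [reading.strip() for reading in raw if reading.strip()]


def analyze(cleaned):
    """Analysis: transform text into numbers and summarize them."""
    values = [float(reading) for reading in cleaned]
    return {"count": len(values), "mean": mean(values)}


def curate(summary):
    """Curation: validate the summary and classify it (thresholds are made up)."""
    if not -50.0 <= summary["mean"] <= 60.0:
        raise ValueError("mean temperature outside the plausible range")
    level = "high" if summary["mean"] >= 30.0 else "normal"
    return {**summary, "level": level}


def store(summary, storage):
    """Storage: keep the curated result (a dict stands in for a database)."""
    storage["latest"] = summary
    return storage


def use(storage):
    """Usage: make a decision from the stored result."""
    return "turn on cooling" if storage["latest"]["level"] == "high" else "no action"


raw_readings = [" 31.5", "", "29.5 ", "32.0"]
storage = store(curate(analyze(acquire(raw_readings))), {})
decision = use(storage)
print(storage, decision)
```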
Basic concepts of big data
• Big data is a term for a collection of data sets so large and complex that they are difficult to handle with traditional tooling.
• A large dataset here means a dataset too large to reasonably process or store with traditional tooling or on a single computer.
Big data can be characterized by:
• Volume: large amounts of data; massive datasets.
• Velocity: data is live streaming or in motion (rapidity).
• Variety: data comes in many different forms from diverse sources.
• Veracity: uncertainty about the data. Can we trust the data? How accurate is it?
THE 4 Vs OF BIG DATA
• VOLUME: the size of the data.
• VARIETY: different types of data.
• VELOCITY: the speed with which data is generated.
• VERACITY: the trustworthiness and accuracy of data.
Clustered Computing
• Cluster computing: a computer made up of computers.
• A cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system.
Combining the resources of many smaller machines provides benefits such as:
1. Resource pooling (storage space, CPU, memory)
2. High availability (fault tolerance)
3. Easy scalability (scale by adding additional machines)
A cluster can be managed by software such as YARN (Yet Another Resource Negotiator).
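The three benefits can be illustrated with a toy model in plain Python. This is not YARN and does not use any cluster software: the `Node` class, the node names, and the hardware figures are all hypothetical, and the sketch only adds up what each machine contributes to the pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One machine in a hypothetical cluster."""
    name: str
    cpu_cores: int
    memory_gb: int
    storage_tb: int


def pooled(nodes):
    """Resource pooling: the cluster offers the sum of its nodes' resources."""
    return {
        "cpu_cores": sum(node.cpu_cores for node in nodes),
        "memory_gb": sum(node.memory_gb for node in nodes),
        "storage_tb": sum(node.storage_tb for node in nodes),
    }


cluster = [
    Node("node-1", 8, 32, 4),
    Node("node-2", 8, 32, 4),
    Node("node-3", 16, 64, 8),
]
total = pooled(cluster)

# High availability: if node-2 fails, the remaining nodes still offer capacity.
after_failure = pooled([node for node in cluster if node.name != "node-2"])

# Easy scalability: adding a machine grows the pool.
after_scaling = pooled(cluster + [Node("node-4", 16, 64, 8)])

print(total, after_failure, after_scaling)
```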
● Hadoop is an open-source framework intended to make interaction with big data easier.
● It allows for the distributed processing of large datasets across clusters of computers.
● Four key characteristics of Hadoop are:
1. Economical (ordinary computers can be used for data processing)
2. Reliable (resistant to hardware failure)
3. Scalable (easily scalable both horizontally and vertically)
4. Flexible (can store as much data as you need)
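The idea of distributed processing can be sketched in plain Python as a word count done in a map-then-reduce style. This is only an analogy and does not call any Hadoop API: each inner list stands in for the part of a dataset held by one machine, and the sample sentences are invented.

```python
from collections import Counter
from functools import reduce

# Each partition stands in for the slice of data held by one machine.
partitions = [
    ["big data needs clusters", "clusters pool resources"],
    ["data is stored in blocks", "big clusters scale out"],
]


def map_partition(lines):
    """Map step: each machine counts the words in its own partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# In a real cluster the partitions would be processed in parallel.
partial_counts = [map_partition(partition) for partition in partitions]

# Reduce step: merge the per-partition counts into one result.
total_counts = reduce(lambda left, right: left + right, partial_counts, Counter())
print(total_counts.most_common(3))
```

Because each partition is counted independently, adding machines lets the map step handle more data without changing the reduce step.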
Hadoop and its
● The core components of Hadoop:
1. Data management
2. Access
3. Processing
4. Storage
Big Data Life
● Ingesting data into the system: transferring data from different sources to Hadoop
● Processing the data in storage: the data is stored and processed
● Computing and analyzing data: the data is analyzed by processing frameworks such as Pig, Hive, and Impala
● Visualizing the results: accessing the analyzed data
Thanks!

Editor's Notes

  • #4: Here is a good example to compare the two: as you can see from the picture, data is simply represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.), whereas information is interpreted data, created from structured data in a particular context.
  • #7: Data processing is the re-structuring or re-ordering of data by people or machines to increase their usefulness and add values for a particular purpose. Data processing consists of the three basic steps - input, processing, and output. In the Input step: the input data is prepared in some convenient form for processing. The form will depend on the processing machine. For example, when electronic computers are used, the input data can be recorded on any one of the several types of storage medium, such as hard disk, CD, flash disk and so on.
  • #8: In the Processing step of the data processing cycle, the input data is changed to produce data in a more useful form. For example, interest can be calculated on a deposit to a bank, or a summary of sales for the month can be calculated from the sales orders. In the Output step of the data processing cycle, the result of the preceding processing step is collected. The particular form of the output data depends on the use of the data. For example, the output data may be a payroll for employees.
  • #12: Structured data types are straightforward to analyze and conform to a tabular format with a relationship between the different rows and columns. Common examples of structured data are Excel files and SQL databases. Each of these has structured rows and columns that can be sorted.
  • #13: Semi-structured data types are a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless, as you see from the picture, contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. JSON and XML are forms of semi-structured data.
  • #14: Unstructured data types are information that either does not have a predefined data model or are not organized in a pre-defined manner. Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well which results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in structured databases. Common examples of unstructured data include audio, video files or No-SQL databases.