Big Data Analytics
(CS443)
IV B.Tech (IT)
2018-19 I semester
Big Data and Analytics
Seema Acharya and Subhashini Chellappan
Chapter 1
Types of Digital Data
Learning Objectives and Learning Outcomes

Learning Objectives: Introduction to digital data and its types
1. Structured data: sources of structured data, ease with structured data, etc.
2. Semi-structured data: sources of semi-structured data, characteristics of semi-structured data.
3. Unstructured data: sources of unstructured data, issues with terminology, dealing with unstructured data.

Learning Outcomes:
a) To differentiate between structured, semi-structured and unstructured data.
b) To understand the need to integrate structured, semi-structured and unstructured data.
Agenda
Types of Digital Data
Structured
• Sources of structured data
• Ease with structured data
Semi-Structured
• Sources of semi-structured data
Unstructured
• Sources of unstructured data
• Issues with terminology
• Dealing with unstructured data
Classification of Digital Data
Digital data is classified into the following categories:
Structured data
Semi-structured data
Unstructured data
Approximate Percentage Distribution of Digital Data
(Figure: approximate percentage distribution of digital data across the three categories; unstructured data accounts for the overwhelming share — as noted later, about 80–90% of an organization's data.)
Structured Data
Structured Data
This is data that is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.
Relationships exist between entities of data, such as classes and their objects.
Data stored in databases is an example of structured data.
• Think structured data, and think data model: how the data is stored, processed and accessed, typically in an RDBMS.
• A relation has a schema of columns and rows; the number of rows is the cardinality of the relation and the number of columns is its degree.
• Constraints are defined on the schema (e.g., an Employee schema).
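As a minimal sketch of these ideas in Python (using the standard-library sqlite3 module; the Employee schema below is a hypothetical example), the table's rows give its cardinality and its columns give its degree:

```python
import sqlite3

# In-memory database; the Employee schema here is purely illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,        -- constraint: unique identifier
        name    TEXT NOT NULL,              -- constraint: value required
        salary  REAL CHECK (salary > 0)     -- constraint: domain rule
    )
""")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Asha", 50000.0), (2, "Ravi", 62000.0)],
)

rows = conn.execute("SELECT * FROM employee").fetchall()
cardinality = len(rows)      # number of rows
degree = len(rows[0])        # number of columns
print(f"cardinality = {cardinality}, degree = {degree}")
# -> cardinality = 2, degree = 3
conn.close()
```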
Sources of Structured Data
• Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
• Spreadsheets
• OLTP systems
Ease with Structured Data
• Input / Update / Delete (DML)
• Security
• Indexing / Searching (speeds up data retrieval operations)
• Scalability (storage and processing capability)
• Transaction processing (ACID)
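To make the ACID point concrete, here is a small illustrative transaction in Python's sqlite3; the account table and the transfer rule are invented for the sketch. Atomicity means the transfer either fully commits or fully rolls back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 100.0)])
conn.commit()

try:
    with conn:  # one atomic transaction: commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 150 WHERE id = 1")
        # Enforce a business rule; violating it aborts the whole transfer.
        (bal,) = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE account SET balance = balance + 150 WHERE id = 2")
except ValueError:
    pass  # the partial debit was rolled back automatically

print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
# -> [(1, 100.0), (2, 100.0)]  (balances unchanged)
```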
Semi-structured Data
Semi-structured Data
This is the data which does not conform to a data model
but has some structure. However, it is not in a form
which can be used easily by a computer program.
Examples: emails, XML, markup languages like HTML, etc.
Metadata for this data is available but is not sufficient.
Sources of Semi-structured Data
• XML (eXtensible Markup Language): popularized by web services developed utilizing the SOAP (Simple Object Access Protocol) principles
• Other markup languages
• JSON (JavaScript Object Notation)
JSON is used to transmit data between a server and a web application. JSON was popularized by web services developed utilizing REST (Representational State Transfer), an architectural style for creating scalable web services.
MongoDB and Couchbase store data natively in JSON format.
MongoDB: an open-source, distributed, NoSQL, document-oriented database.
Couchbase (originally known as Membase): an open-source, distributed, NoSQL, document-oriented database.
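A minimal sketch of JSON as self-describing, label/value data, using only Python's standard json module (the customer records are invented; note that the two records carry different attributes):

```python
import json

# Two records with label/value pairs; the second has attributes the first lacks.
customers = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ravi", "phone": "555-0100",
     "orders": [{"item": "laptop", "qty": 1}]},
]

# A server would serialize this to text and send it to a web application...
payload = json.dumps(customers)

# ...and the application parses it back without any pre-agreed rigid schema.
for record in json.loads(payload):
    print(record.get("name"), "->", sorted(record.keys()))
# Asha -> ['email', 'id', 'name']
# Ravi -> ['id', 'name', 'orders', 'phone']
```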
Characteristics of Semi-structured Data
• Inconsistent structure
• Self-describing (label/value pairs)
• Schema information is often blended with data values
• Data objects may have different attributes that are not known beforehand
Unstructured Data
Unstructured Data
This is the data which does not conform to a data model or
is not in a form which can be used easily by a computer
program.
About 80–90% of an organization's data is in this format.
Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an email, etc.
Sources of Unstructured Data
• Web pages
• Images
• Free-form text
• Audio
• Video
• Body of email
• Text messages
• Chats
• Social media data
• Word documents
Issues with Terminology – Unstructured Data
• Structure can be implied despite not being formally defined.
• Data with some structure may still be labeled unstructured if the structure doesn't help with the processing task at hand.
• Data may have some structure or may even be highly structured in ways that are unanticipated or unannounced.
Dealing with Unstructured Data
• Data mining
• Natural Language Processing (NLP)
• Text analytics
• Noisy text analytics
• Manual tagging with metadata
• Part-of-speech tagging
• Unstructured Information Management Architecture (UIMA)
Data mining: knowledge discovery in databases. Popular mining algorithms are association rule mining, regression analysis, and collaborative filtering.
NLP: related to human–computer interaction (HCI); it is about enabling computers to understand human, or natural language, input.
Text analytics: text mining is the process of gleaning high-quality and meaningful information from text. It includes tasks such as text categorization, text clustering, sentiment analysis and concept/entity extraction.
Noisy text analytics: the process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis and emails, which are riddled with spelling mistakes, abbreviations, fillers (uh, hm) and non-standard words.
Manual tagging with metadata: tagging the data manually with adequate metadata to provide the requisite semantics to understand unstructured data.
Part-of-speech tagging (POST): the process of reading text and tagging each word in a sentence as belonging to a particular part of speech, such as noun, verb or adjective.
UIMA (Unstructured Information Management Architecture): an open-source platform from IBM used for real-time content analytics.
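As a toy illustration of (noisy) text analytics in Python — word normalization plus frequency counting over invented chat snippets; the stop-word list is illustrative only:

```python
import re
from collections import Counter

# Invented, deliberately noisy chat messages.
chats = [
    "uh the delivry was late AGAIN :(",
    "gr8 product, hm, but packaging was damaged",
    "late delivery... not happy",
]

STOP_WORDS = {"the", "was", "but", "not", "uh", "hm"}  # tiny illustrative list

def tokens(text: str):
    # Lowercase and keep alphabetic runs only, stripping punctuation/emoticons.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

counts = Counter(w for msg in chats for w in tokens(msg))
print(counts.most_common(3))
# e.g. [('late', 2), ('delivry', 1), ('again', 1)] -- misspellings survive,
# which is exactly why noisy text analytics also needs spelling normalization.
```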
Answer a few quick questions …
Place the following in the suitable basket (Structured / Semi-structured / Unstructured):
i. Email ii. MS Access iii. Images iv. Database v. Chat conversations vi. Relations / Tables vii. Facebook viii. Videos ix. MS Excel x. XML
Match the following

Column A: NLP; Text analytics; UIMA; Noisy unstructured data; Data mining; Noisy unstructured data
Column B: Content analytics; Text messages; Chats; Text mining; Comprehend human or natural language input; Uses methods at the intersection of statistics, Artificial Intelligence, machine learning & DBs; IBM UIMA

(Answer key as given: 5, 4, 1, 2, 6, 3, 7)
Answer Me
Which category (structured, semi-structured, or unstructured) will you place a web page in?
Which category (structured, semi-structured, or unstructured) will you place a Word document in?
State a few examples of human-generated and machine-generated data.
List the various types of digital data.
Structured, semi-structured and unstructured.
Why is an email placed in the unstructured category?
Because it contains hyperlinks, attachments, videos, images, free-flowing text, etc.
What category will you place CCTV footage into? Unstructured.
You have just got a book issued from the library. What are the details about the book that can be placed in an RDBMS table?
Ans: Title, author, publisher, year, number of pages, type of book, price, ISBN, with CD or not.
Which category would you place consumer complaints and feedback in? Unstructured.
Which category (structured, semi-structured or unstructured) will you place a web page in? Unstructured.
Which category (structured, semi-structured or unstructured) will you place a PowerPoint presentation in? Unstructured.
Which category (structured, semi-structured or unstructured) will you place a Word document in? Unstructured.
Data generation: origin, definition, information management proficiency, and examples

Humans – data representing the digitization of human interactions:
• Structured: business process data, e.g., payment transactions, sales orders, call records, ERP, CRM
• Semi-structured: weblogs
• Unstructured: content such as web pages, e-mail, blogs, wikis, reviews, comments
• Binary: content such as video, audio, photos

Machines – data representing machine-to-machine interactions, or simply not human-generated (Internet of Things):
• Structured: some devices
• Semi-structured: computer logs, device logs, network logs, sensor/meter logs
• Binary: video, audio, photos
Introduction to Big Data
Characteristics of Data:
1. Composition: deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of data (whether it is static or real-time streaming).
2. Condition: deals with the state of data: can one use this data as is, or is cleansing required for further enhancement and enrichment?
3. Context: deals with questions such as "Where has this data been generated?", "Why was this data generated?" and "How sensitive is this data?"
Evolution of Big Data:
1. 1970s and before: mainframes (the data is primitive and structured).
2. 1980s and 1990s: relational databases (data-intensive applications).
3. 2000s and beyond: the WWW and the IoT have led to an onslaught of structured, unstructured and multimedia data.
The Evolution of Big Data
• 1970s and before (data generation and storage): primitive and structured data; mainframes for basic data storage.
• 1980s and 1990s (data utilization): complex and relational data; relational databases for data-intensive applications.
• 2000s and beyond (data driven): complex and unstructured data; structured data, unstructured data and multimedia data.
What’s Big Data?
No single definition; here is one from Wikipedia:
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
It is about the three Vs:
"Big data" is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
What’s Big Data?
“Big data is high-volume, high-velocity and high-variety information assets” talks about voluminous data that may have great variety (structured, semi-structured and unstructured) and will require a good speed/pace for storage, preparation, processing and analysis.
“Cost-effective, innovative forms of information processing” talks about embracing new techniques and technologies to capture, store, process, persist, integrate and visualize high-volume, high-variety and high-velocity data.
“Enhanced insight and decision making” talks about deriving deeper, richer and meaningful insights and then using these insights to make faster and better decisions to gain business value and thus a competitive edge.
Data -> Information -> Actionable intelligence -> Better decisions -> Enhanced business value
Challenges with Big Data
1. Data today is growing at an exponential rate. The key questions are: will all this data be useful for analysis, and how will we separate knowledge from noise?
2. How to host big data solutions outside the enterprise (e.g., in the cloud).
3. The period of retention of big data.
4. Dearth of skilled professionals.
5. Challenges with respect to capture, curation, storage, search, sharing, transfer, analysis, privacy violations and visualization.
6. Shortage of data visualization experts.
What is Big Data
Big data is data that is big in volume, velocity and variety.
Volume: Bits -> Bytes -> KBs -> MBs -> GBs -> TBs -> PBs -> Exabytes -> Zettabytes -> Yottabytes
Where does this data get generated?
1. Typical internal data sources: data storage, archives.
2. External data sources: the public web: Wikipedia, weather, regulatory, compliance data, etc.
3. Both (internal and external sources).
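A quick back-of-the-envelope look at this ladder in Python (decimal convention, 1 KB = 1000 bytes; the 10-TB drive size is an assumption for illustration):

```python
# Each rung of the ladder is a factor of 1000 (decimal convention).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def bytes_in(unit: str) -> int:
    return 1000 ** UNITS.index(unit)

# How many 10-TB drives would one zettabyte need? (10 TB is an assumed size.)
drives = bytes_in("ZB") // (10 * bytes_in("TB"))
print(f"1 ZB = {bytes_in('ZB'):.3e} bytes ~ {drives:,} ten-terabyte drives")
# -> 1 ZB = 1.000e+21 bytes ~ 100,000,000 ten-terabyte drives
```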
What is Big Data
Velocity: Batch -> Periodic -> Near real time -> Real-time processing.
Variety: structured, semi-structured and unstructured.
Other Vs:
1. Veracity and validity
2. Volatility
3. Variability
TRADITIONAL BUSINESS INTELLIGENCE VS BIG DATA
1. In a traditional BI environment, all the enterprise’s data is housed in a central server, whereas in a big data environment data resides in a distributed file system. The distributed file system scales by scaling in or out horizontally, as compared to a typical database server that scales vertically.
2. In traditional BI, data is generally analyzed in an offline mode, whereas in big data it is analyzed in both real-time and offline modes.
3. Traditional BI is about structured data, and the data is taken to the processing functions (move data to code), whereas big data is about variety (structured, semi-structured and unstructured data), and here the processing functions are taken to the data (move code to data).
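A toy sketch of the contrast in Python: counting errors by shipping every record to a central node ("move data to code") versus shipping the counting function to each node and returning only tiny partial results ("move code to data"). The blocks and byte accounting are purely illustrative:

```python
import sys

# Pretend these blocks live on three different nodes.
blocks = [["ok", "error", "ok"] * 1000, ["error"] * 2000, ["ok"] * 1500]

# Move data to code: ship every record to a central server, then count.
shipped = sum(sys.getsizeof(rec) for block in blocks for rec in block)

# Move code to data: count on each node, ship only the small counts back.
partials = [sum(1 for rec in block if rec == "error") for block in blocks]
returned = sum(sys.getsizeof(p) for p in partials)

print(f"errors = {sum(partials)}")
print(f"bytes moved: data-to-code ~ {shipped:,}, code-to-data ~ {returned:,}")
```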
Big Data Analytics
Big data is more real-time in nature than traditional DW applications.
Traditional DW architectures (e.g., Exadata, Teradata) are not well suited for big data apps.
Shared-nothing, massively parallel processing, scale-out architectures are well suited for big data apps.
Typical data warehouse environment
Data sources: CRM, ERP, legacy and third-party apps. The data is integrated, cleaned, transformed and standardized through the process of Extraction, Transformation and Loading (ETL) before landing in the warehouse.
The warehouse is used to enable decision making through the use of ad hoc queries: reporting/dashboarding, OLAP, ad hoc querying and modeling.
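A minimal ETL sketch in Python; the CSV snippet, field names and cleaning rules are all invented, and an in-memory SQLite database stands in for the warehouse:

```python
import csv, io, sqlite3

# Extract: raw export from a hypothetical CRM system.
raw = io.StringIO("name,amount,currency\n asha ,100,usd\nRAVI,85,USD\n")
records = list(csv.DictReader(raw))

# Transform: standardize names and currencies, cast types.
cleaned = [
    (r["name"].strip().title(), float(r["amount"]), r["currency"].upper())
    for r in records
]

# Load: into a warehouse table (an in-memory SQLite stand-in).
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (customer TEXT, amount REAL, currency TEXT)")
dw.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)

# An ad hoc query for reporting/dashboarding.
print(dw.execute("SELECT currency, SUM(amount) FROM sales GROUP BY currency").fetchall())
# -> [('USD', 185.0)]
```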
Typical Hadoop environment
The data (for example, from an operational data store) is placed in the Hadoop Distributed File System (HDFS).
What is Big Data Analytics
Big data analytics is the process of examining big data to uncover patterns, unearth trends, and find unknown correlations and other useful information to make faster and better decisions.
What is Big Data Analytics
Big data analytics is:
• Technology-enabled analytics: quite a few data analytics and visualization tools
• Richer, deeper insights into customers, partners and the business
• Competitive advantage
Big Data Analytics is:
• A collaboration of three communities: IT, business users and data scientists.
• Working with data sets whose volume and variety exceed the current storage and processing capabilities and infrastructure of the enterprise.
• Moving code to data for greater speed and efficiency.
• Better, faster decisions in real time.
What is Big Data Analytics
A few top analytics tools are: MS Excel, SAS, IBM SPSS Modeler, R analytics, Statistica, World Programming System (WPS) and Weka.
The open-source analytics tools among these are R analytics and Weka.
Classification of Analytics
There are basically two schools of thought:
1. Those that classify analytics into basic, operationalized, advanced and monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0 and analytics 3.0.
First school of thought:
Basic analytics: primarily slicing and dicing of data to help with basic business insights. This is about reporting on historical data, basic visualization, etc.
Operationalized analytics: analytics is operationalized if it gets woven into the enterprise’s business processes.
Advanced analytics: largely about forecasting the future by way of predictive and prescriptive modeling.
Monetized analytics: analytics in use to derive direct business revenue.
Second school of thought: Analytics 1.0, 2.0 and 3.0
• Analytics 1.0 (era: 1950s to 2009): descriptive statistics (report events, occurrences, etc. of the past).
• Analytics 2.0 (era: 2005 to 2012): descriptive statistics + predictive statistics (use data from the past to make predictions for the future).
• Analytics 3.0 (era: 2012 to present): descriptive + predictive + prescriptive statistics (use data from the past to make predictions for the future and at the same time make recommendations to leverage the situation to one’s advantage).
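A toy contrast of the three modes in Python, on an invented daily-sales series; the naive trend model and the stocking rule are purely illustrative:

```python
# Invented daily sales for one week.
sales = [100, 104, 108, 111, 115, 120, 123]

# Descriptive (Analytics 1.0): what happened?
mean = sum(sales) / len(sales)
print(f"average daily sales = {mean:.1f}")

# Predictive (Analytics 2.0): what will happen? Naive linear trend.
daily_growth = (sales[-1] - sales[0]) / (len(sales) - 1)
forecast = sales[-1] + daily_growth
print(f"forecast for tomorrow = {forecast:.1f}")

# Prescriptive (Analytics 3.0): what should we do about it?
# Illustrative rule: stock 10% above the forecast to avoid stock-outs.
print(f"recommended stock level = {forecast * 1.10:.0f} units")
```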
• Analytics 1.0 – key questions asked: What happened? Why did it happen?
• Analytics 2.0 – key questions: What will happen? Why will it happen?
• Analytics 3.0 – key questions: What will happen? When will it happen? Why will it happen? What action should be taken to take advantage of what will happen?
Data sources across the three eras:
• Analytics 1.0: data from legacy systems, ERP, CRM and third-party applications; small and structured data sources, stored in enterprise data warehouses or data marts.
• Analytics 2.0: big data; big data is taken up seriously, and the data is mainly unstructured, arriving at a higher pace. This fast flow of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop.
• Analytics 3.0: a blend of big data and data from legacy systems, ERP, CRM and third-party applications; big data and traditional analytics blended to yield insights and offerings with speed and impact.
Sourcing and technology across the three eras:
• Analytics 1.0: data was internally sourced; relational databases.
• Analytics 2.0: data was often externally sourced; database appliances, Hadoop clusters, SQL-to-Hadoop environments, etc.
• Analytics 3.0: data is being both internally and externally sourced; in-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.
Top Challenges Facing Big Data
• Scale: storage (RDBMS or NoSQL) is the major concern that needs to be addressed.
• Security: poor security mechanisms.
• Schema: no rigid schema; a dynamic schema is required.
• Continuous availability: how to provide 24×7 support.
• Consistency
• Partition tolerance
• Data quality
Techniques used in big data environments:
• In-memory analytics
• In-database processing
• Symmetric multiprocessor system (SMP)
• Massively parallel processing (MPP)
• Distributed systems
• Shared-nothing architecture
• CAP theorem
In-memory analytics: data access from non-volatile storage such as hard disk is a slow process. This problem has been addressed using in-memory analytics: all the relevant data is stored in random access memory (RAM), or primary storage, eliminating the need to access the data from hard disk. The advantages are faster access, rapid deployment, better insights and minimal IT involvement.
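An illustrative contrast in Python: answering the same query repeatedly from the on-disk database versus from a copy loaded once into RAM. The table is invented, timings vary by machine, and SQLite's own page cache will narrow the gap; the point is only where the data lives:

```python
import os, sqlite3, tempfile, time

fd, path = tempfile.mkstemp(suffix=".db")
os.close(fd)
disk = sqlite3.connect(path)
disk.execute("CREATE TABLE t (region TEXT, amount REAL)")
disk.executemany("INSERT INTO t VALUES (?, ?)",
                 [("east" if i % 2 else "west", float(i)) for i in range(100_000)])
disk.commit()

QUERY = "SELECT region, SUM(amount) FROM t GROUP BY region"

t0 = time.perf_counter()
for _ in range(50):
    disk.execute(QUERY).fetchall()   # every query goes through the disk file
disk_s = time.perf_counter() - t0

mem = sqlite3.connect(":memory:")
disk.backup(mem)                     # load the whole database into RAM once
t0 = time.perf_counter()
for _ in range(50):
    mem.execute(QUERY).fetchall()    # served entirely from primary storage
mem_s = time.perf_counter() - t0

print(f"disk-backed: {disk_s:.3f}s   in-memory: {mem_s:.3f}s")
disk.close()
os.remove(path)
```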
In-database processing: also called in-database analytics. It works by fusing data warehouses with analytical systems. Typically the data from various enterprise OLTP systems, after cleaning up through the process of ETL, is stored in the enterprise data warehouse or data marts, and the huge data sets are then exported to analytical programs for complex and extensive computations. With in-database processing, the database program itself can run the computations, eliminating the need for export and thereby saving time. Leading database vendors offer this feature to large businesses.
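A minimal sketch of the difference, again in Python with SQLite (the orders table is invented): exporting all rows to compute an average in the application versus letting the database engine compute it in place, so only one result row crosses over:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (amount REAL)")
db.executemany("INSERT INTO orders VALUES (?)", [(float(i),) for i in range(1, 1001)])

# Export-then-analyze: all 1000 rows cross into the application first.
rows = db.execute("SELECT amount FROM orders").fetchall()
avg_app = sum(a for (a,) in rows) / len(rows)

# In-database processing: the aggregation runs inside the database engine,
# and only the single result row crosses over.
(avg_db,) = db.execute("SELECT AVG(amount) FROM orders").fetchone()

print(avg_app, avg_db)  # -> 500.5 500.5
```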
Symmetric multiprocessor (SMP) system: a single common main memory is shared by two or more identical processors. The processors have full access to all I/O devices and are controlled by a single operating system instance. SMP systems are tightly coupled multiprocessor systems: each processor has its own high-speed cache memory, and the processors are connected using a system bus.
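As a loose software analogy of the shared-memory model in Python, with threads standing in for processors and one shared structure standing in for the common main memory (CPython's GIL means this illustrates sharing and coordination, not parallel speed-up):

```python
import threading

shared_counts = {"events": 0}   # one common "main memory"
lock = threading.Lock()         # coordinates access to the shared state

def worker(n: int):
    for _ in range(n):
        with lock:              # all workers read/write the same memory
            shared_counts["events"] += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_counts)  # -> {'events': 40000}
```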
Massively parallel processing (MPP): the coordinated processing of programs by a number of processors working in parallel. The processors each have their own operating system and dedicated memory. They work on different parts of the same program and communicate using some form of messaging interface.
MPP differs from symmetric multiprocessing in that SMP works with processors sharing the same OS and the same memory; SMP is also referred to as tightly coupled multiprocessing.
Shared-nothing architecture:
The three most common types of architecture for multiprocessor systems are:
1. Shared memory: a common central memory is shared by multiple processors.
2. Shared disk: multiple processors share a common collection of disks while having their own private memory.
3. Shared nothing: neither memory nor disk is shared among the processors.
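A small sketch of shared-nothing parallelism with Python's multiprocessing module: each worker process has its own private memory and works on its own data partition, and results travel back only as messages. The log lines are invented:

```python
from collections import Counter
from multiprocessing import Pool

def count_levels(partition):
    # Runs in a separate process: private memory, nothing shared.
    return Counter(line.split()[0] for line in partition)

if __name__ == "__main__":
    # Each partition plays the role of data local to one node.
    partitions = [
        ["ERROR disk full", "INFO boot ok"],
        ["INFO ping", "ERROR timeout", "ERROR retry"],
        ["INFO shutdown"],
    ]
    with Pool(processes=3) as pool:
        partials = pool.map(count_levels, partitions)  # message passing
    total = sum(partials, Counter())
    print(total)  # -> Counter({'ERROR': 3, 'INFO': 3})
```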
Advantages of shared-nothing architecture:
• Fault isolation: a fault in one node stays local to that node; there is no shared memory or disk through which it can affect the others.
• Scalability: with no shared resource to contend for, capacity grows by simply adding nodes.
CAP Theorem:
The CAP theorem is also called Brewer’s theorem. It states that in a distributed computing environment, it is impossible to simultaneously provide all three of the following guarantees; at most two of them can be ensured:
1. Consistency
2. Availability
3. Partition tolerance
Consistency implies that every read fetches the last write.
Availability implies that reads and writes always succeed; in other words, each non-failing node will return a response in a reasonable amount of time.
Partition tolerance implies that the system will continue to function when a network partition occurs.
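A toy illustration of the trade-off in Python: two replicas with a simulated network partition between them. During the partition the store must either refuse operations (choosing consistency) or serve possibly stale data (choosing availability); all class and method names are invented:

```python
class Replica:
    def __init__(self):
        self.data = {}

class ToyStore:
    """Two replicas; `partitioned` simulates a broken replication link."""
    def __init__(self, prefer_consistency: bool):
        self.a, self.b = Replica(), Replica()
        self.partitioned = False
        self.prefer_consistency = prefer_consistency

    def write(self, key, value):
        if self.partitioned and self.prefer_consistency:
            raise RuntimeError("unavailable: cannot replicate during partition")
        self.a.data[key] = value
        if not self.partitioned:
            self.b.data[key] = value  # replication succeeds only when connected

    def read_from_b(self, key):
        if self.partitioned and self.prefer_consistency:
            raise RuntimeError("unavailable: b may be stale during partition")
        return self.b.data.get(key)   # may be stale if we chose availability

ap = ToyStore(prefer_consistency=False)  # availability-first (BASE-like)
ap.write("x", 1)
ap.partitioned = True
ap.write("x", 2)                         # accepted; replica b is now stale
print(ap.read_from_b("x"))               # -> 1  (available but inconsistent)

cp = ToyStore(prefer_consistency=True)   # consistency-first
cp.partitioned = True
try:
    cp.write("x", 2)
except RuntimeError as e:
    print(e)                             # consistent but not available
```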
BASE
Basically Available, Soft State, Eventual Consistency (BASE) is a data-system design philosophy that prizes availability over consistency of operations. BASE was developed as an alternative for producing more scalable and affordable data architectures, giving expanding enterprises more options than simply acquiring more hardware to expand data operations.
BASE may be explained in contrast to another design philosophy: Atomicity, Consistency, Isolation, Durability (ACID). The ACID model promotes consistency over availability, whereas BASE promotes availability over consistency.
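To close, a toy sketch of eventual consistency in Python: replicas accept writes independently (soft state) and later reconcile with a last-write-wins merge, so they converge once updates stop. The timestamps and merge rule are illustrative:

```python
class EventualReplica:
    def __init__(self):
        self.store = {}  # key -> (timestamp, value): soft state

    def write(self, key, value, ts):
        self.store[key] = (ts, value)  # always available locally

    def merge(self, other):
        # Anti-entropy: adopt the newer write for every key (last write wins).
        for key, (ts, val) in other.store.items():
            if key not in self.store or self.store[key][0] < ts:
                self.store[key] = (ts, val)

r1, r2 = EventualReplica(), EventualReplica()
r1.write("cart", ["book"], ts=1)         # one client talks to replica 1
r2.write("cart", ["book", "pen"], ts=2)  # another client hits replica 2
print(r1.store != r2.store)              # True: temporarily inconsistent

r1.merge(r2); r2.merge(r1)               # gossip/anti-entropy round
print(r1.store == r2.store)              # True: eventually consistent
# -> both now hold {'cart': (2, ['book', 'pen'])}
```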