Digital Data
Unit -3
Dr. T. NIKIL PRAKASH
ASSISTANT PROFESSOR
DEPARTMENT OF INFORMATION TECHNOLOGY,
ST. JOSEPH’S COLLEGE (AUTONOMOUS),
TIRUCHIRAPPALLI-02
Digital Data
• Data is present in homogeneous sources as well as in
heterogeneous sources.
• The need of the hour is to understand, manage,
process, and take the data for analysis to draw
valuable insights
• Types of Digital Data
• Digital data can be structured, semi-structured or unstructured
data.
• Structured data
• When data follows a pre-defined schema/structure we say
it is structured data.
• This is the data that is in an organized form (e.g., in rows
and columns) and can be easily used by a computer program.
• Relationships exist between entities of data, such as classes
and their objects.
• About 10% of an organization's data is in this format.
• Data stored in databases is an example of structured data
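As a minimal sketch of why rows and columns are easy for a program to use, the example below builds a small in-memory SQL table and asks it a question. The table name, columns, and rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical table and rows, used only to illustrate a pre-defined schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "dept TEXT NOT NULL, salary REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO employee (id, name, dept, salary) VALUES (?, ?, ?, ?)",
    [
        (1, "Asha", "IT", 50000.0),
        (2, "Ravi", "IT", 70000.0),
        (3, "Meena", "HR", 40000.0),
    ],
)
conn.commit()


def avg_salary_by_dept(connection):
    """Return {department: average salary} using a single SQL query."""
    rows = connection.execute(
        "SELECT dept, AVG(salary) FROM employee GROUP BY dept ORDER BY dept"
    ).fetchall()
    return dict(rows)


result = avg_salary_by_dept(conn)
print(result)
```

Because every row has the same columns and types, the query needs no special handling for missing or differently shaped records.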
Fundamentals of data science: digital data
Sources of Structured Data
• SQL databases such as Oracle DB
• Spreadsheets such as Excel
• OLTP Systems
• Online forms
• Sensors such as GPS or RFID tags
• Network and Web server logs
• Medical devices
Ease of working with structured data
• Structured data is easier to work with than unstructured data because it's
already formatted and has a clear structure:
• Easy to analyze and manipulate
• Structured data is easy for both humans and machines to work with because it's
already formatted.
• Easy to search and query
• Structured data's organized nature makes it easy to manipulate and query.
• Easy to use
• Structured data can be used by average business users who understand the topic
the data relates to.
• Easy to store
• Structured data can be stored in tabular formats like Excel sheets or SQL
databases, which require less storage space.
• Easy to scale
• Structured data can be stored in data warehouses, which makes it highly scalable
• Semi-structured data:
• Semi-structured data is also referred to as self-describing
structure.
• This is the data which does not conform to a data model
but has some structure.
• However, it is not in a form which can be used easily by a
computer program.
• About 10% of an organization's data is in this format; for
example, HTML, XML, JSON, email data, etc.
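To illustrate what "self-describing" means, here is a small sketch that parses JSON records whose fields differ from one record to the next. The records and field names are made up for illustration; the point is that each record carries its own field names instead of relying on a fixed schema.

```python
import json

# Hypothetical records: each one names its own fields, and the fields differ.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phones": ["555-0101", "555-0102"]}
]
"""
records = json.loads(raw)


def field_names(items):
    """Collect every field name that appears in any record."""
    names = set()
    for item in items:
        names.update(item.keys())
    return sorted(names)


fields = field_names(records)
print(fields)

# A program must handle fields that are present in some records only.
for record in records:
    print(record["name"], record.get("email", "(no email given)"))
```

The structure is discoverable from the data itself, but the program still has to cope with missing or extra fields, which is what separates this from a fixed table.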
Sources of Semi-structured Data
• Semi-structured data is data that is not captured or formatted
in conventional ways, but it does have some structural
elements. It can come from many sources, including:
• Emails
• Markup languages
• Binary executables
• TCP/IP packets
• Zipped files
• Data integrated from different sources
• Web pages
• Log files
• NoSQL databases
• Electronic data interchange (EDI)
• Example (a minimal HTML page):
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
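The tags in the page above are the "structural elements" of semi-structured data. As a rough sketch, the standard-library HTML parser can use those tags to pull out the text together with the element it belongs to. The class name `TagTextCollector` is a hypothetical helper, not part of any library.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""


class TagTextCollector(HTMLParser):
    """Hypothetical helper: records (tag, text) for each non-empty text node."""

    def __init__(self):
        super().__init__()
        self._open_tags = []  # stack of currently open tags
        self.found = []       # collected (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if self._open_tags and self._open_tags[-1] == tag:
            self._open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._open_tags:
            self.found.append((self._open_tags[-1], text))


collector = TagTextCollector()
collector.feed(PAGE)
collector.close()
print(collector.found)
```

The markup tells the program which text is a heading and which is a paragraph, even though the page follows no database schema.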
• Unstructured data:
• This is the data which does not conform to a data model or is not
in a form which can be used easily by a computer program.
• About 80% of an organization's data is in this format; for example,
memos, chat rooms, PowerPoint presentations, images, videos,
letters, research, white papers, the body of an email, etc.
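As a small sketch of what working with unstructured text involves, the example below takes a free-text memo and imposes a structure on it by counting words. The memo text is invented for illustration, and word counting is only one simple way to derive structure; it is not a method prescribed by these slides.

```python
import re
from collections import Counter

# Hypothetical memo: free text with no fields, rows, or tags.
MEMO = (
    "The server failed on Monday. The backup server also failed, "
    "so the team restored the data by hand."
)


def word_counts(text):
    """Lower-case the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


counts = word_counts(MEMO)
print(counts.most_common(3))
```

Nothing in the memo says which words matter, so the program has to decide how to tokenize and what to count before any analysis can start.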
Issues of Unstructured Data
• Storage and management
• Unstructured data is difficult to store and manage because it comes in many
formats, such as text, video, audio, and social media content. It can also be
difficult to navigate through the large volume of unstructured data.
• Processing
• Unstructured data can be time-consuming and resource-intensive to
process. Traditional data storage options may also be inflexible and unable
to adapt to unstructured data.
• Analysis
• Unstructured data is not organized in a predefined manner, making it
difficult to process and analyze using traditional methods.
• Cyber-attacks
• Unstructured data can make systems more vulnerable to cyber-attacks.
Dealing with Unstructured Data
Introduction to Big Data
• The "Internet of Things" and its widely ultra-connected
nature are leading to a burgeoning
rise in big data.
• There is no dearth of data for today's
enterprise.
• On the contrary, enterprises are mired in data, and
quite deep at that.
• Data is widely available.
Some examples of Big Data:
• There are some examples of Big Data Analytics in different
areas such as retail, IT infrastructure, and social media.
• Retail: As mentioned earlier, Big Data presents many
opportunities to improve sales and marketing analytics.
• An example of this is the U.S. retailer Target.
• After analyzing consumer purchasing behavior, Target's
statisticians determined that the retailer made a great deal
of money from three main life-event situations.
• Marriage, when people tend to buy many new products
• Divorce, when people buy new products and change their
spending habits
Characteristics of Data:
1. Composition: The composition of data deals with the structure of
data, that is, the sources of data, the granularity, the types, and the
nature of data as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is,
"Can one use this data as is for analysis?" or "Does it require cleansing
for further enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been
generated?" "Why was this data generated?" "How sensitive is this
data?" "What are the events associated with this data?" and so on.
• Small data (data as it existed prior to the big data revolution) is about
certainty.
• It is about known data sources; it is about no major changes to the
composition or context of data.
Definition of Big Data:
• Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective,
innovative forms of information processing for
enhanced insight and decision making.
• Big data refers to datasets whose size is
typically beyond the storage capacity of, and
too complex for, traditional database software
tools.
• Big data is anything beyond the human &
technical infrastructure needed to support
storage, processing and analysis
• Variety: Data can be structured data, semi-structured data
and unstructured data.
• Data stored in a database is an example of structured data.
• HTML data, XML data, email data, and CSV files are
examples of semi-structured data.
• PowerPoint presentations, images, videos, research, white
papers, the body of an email, etc. are examples of unstructured
data.
• Velocity: Velocity essentially refers to the speed at which
data is being created in real time.
• We have moved from simple desktop applications like
payroll applications to real-time processing applications.
• Volume: Volume can be in Terabytes or Petabytes or
Zettabytes.
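To give a feel for the scale of these units, here is a short sketch that converts between them. It assumes the decimal (SI) definitions, where each step up is a factor of 1,000; the exabyte (EB) is added only to fill the gap between petabytes and zettabytes.

```python
# Decimal (SI) byte units; an assumption, since binary units (TiB, PiB) also exist.
UNITS = {
    "TB": 10**12,  # terabyte
    "PB": 10**15,  # petabyte
    "EB": 10**18,  # exabyte
    "ZB": 10**21,  # zettabyte
}


def convert(value, from_unit, to_unit):
    """Convert a quantity of data from one unit to another."""
    return value * UNITS[from_unit] / UNITS[to_unit]


tb_per_pb = convert(1, "PB", "TB")
tb_per_zb = convert(1, "ZB", "TB")
print(f"1 PB = {tb_per_pb:,.0f} TB")
print(f"1 ZB = {tb_per_zb:,.0f} TB")
```

Under these definitions one zettabyte is a billion terabytes, which is why volume at this scale cannot be handled by a single machine.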
Introduction to big data analytics
• Big Data Analytics is...
• Technology-enabled analytics: Many data analytics and
visualization tools are available in the market today from
leading vendors such as IBM, Tableau, SAS, R
Analytics, Statistica, World Programming Systems
(WPS), etc. to help process and analyze big data.
• About gaining a meaningful, deeper, and richer insight
into your business to steer it in the right direction.
• Understanding the customer's demographics to cross-sell
and up-sell to them, better leveraging the services of
your vendors and suppliers, etc.
• About a competitive edge over your competitors
by enabling you with findings that allow quicker
and better decision-making.
• A tight handshake between three communities:
IT, business users, and data scientists.
• Working with datasets whose volume and variety
exceed the current storage and processing
capabilities and infrastructure of your enterprise
Big Data Technologies
• Following are the requirements of technologies to
meet challenges of big data:
• The first requirement is cheap and ample storage.
• We need faster processors to help with quicker processing of big
data.
• Affordable open-source distributed big data platforms, such as
Hadoop.
• Parallel processing, clustering, virtualization, large grid
environments (to distribute processing to a number of machines),
high connectivity, and high throughput (the rate at which something
is processed).
• Cloud computing and other flexible resource allocation
arrangements.
• Big Data Technologies Include:
• Apache Kafka
• A big data technology that enables users to process data in motion and
quickly determine what is working and what is not.
• Tableau
• A popular data engineering tool that gathers data from multiple sources
using a drag-and-drop interface and allows data engineers to build
dashboards for visualization.
• Predictive analytics
• A key component of big data that involves statistical models, machine
learning algorithms, and other techniques to analyze large and complex
datasets.
• Splunk
• A big data platform that simplifies collecting and managing massive
volumes of machine-generated data.
• Data visualization
• An integral part of any big data analytics project that
allows users to create charts, graphs, and other visual
representations of their data.
• TensorFlow
• A predictive and generic deep learning library that uses
big data to offer its extensive capabilities to computer
systems.
• KNIME
• A big data tool that gives users the ability to report and
integrate data across different sources.
• MapReduce
• A programming model that is commonly used in Big
Data Analytics to process and analyze large datasets.
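The MapReduce programming model can be sketched on a single machine as three steps: map each input to key-value pairs, group the pairs by key, and reduce each group to one result. The word-count example below is a plain-Python illustration of that idea under those assumptions; it does not use Hadoop or any distributed framework, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    mapped = []
    for doc in documents:
        mapped.extend(map_phase(doc))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different machines, since each document and each key can be processed independently.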