ET424.1 Data Analytics
Unit 1: Introduction to Data Analytics
Mr. V. H. Kondekar
E&TC Dept., WIT, Solapur
Course Outcomes
 ET424.1 Discuss challenges in big data analytics and describe fundamental techniques and principles for data analytics.
 ET424.2 Identify, organize and operate on the datasets to
compute statistics for data analysis
 ET424.3 Select and implement appropriate data
visualizations to clearly communicate analytic insights.
 ET424.4 Apply different preprocessing techniques for data quality enhancement
 ET424.5 Use the tools and techniques to apply different
algorithms and methodologies
What Can We Do With Data?
 Until recently, researchers working with data analysis were
struggling to obtain data for their experiments.
 Recent advances in the technology of data processing, data
storage and data transmission, associated with advanced and
intelligent computer software, reducing costs and increasing
capacity, have changed this scenario.
 Each day, a larger quantity of data is generated and consumed.
 Whenever you place a comment in your social network,
upload a photograph, some music or a video, navigate
through the Internet, or add a comment to an e-commerce
web site, you are contributing to the data increase.
 These data provide a rich source of information that can be
transformed into new, useful, valid and human-
understandable knowledge.
 Thus, there is a growing interest in exploring these data to
extract this knowledge, using it to support decision making
in a wide variety of fields: agriculture, commerce,
education, environment, finance, government, industry,
medicine, transport and social care.
 The analysis of these data to extract such knowledge is the
subject of a vibrant area known as data analytics, or simply
“analytics”.
 Analytics: the science that analyzes raw data to extract useful knowledge (patterns) from them.
 This process can also include data collection,
organization, pre-processing, transformation, modeling
and interpretation.
 The idea of generalizing knowledge from a data
sample comes from a branch of statistics known as
inductive learning, an area of research with a long
history.
 With the advances of personal computers, this growing computational capacity has been used to develop new methods, able to perform a given task with greater computational efficiency.
 In parallel, several researchers have dreamed of being
able to reproduce human behavior using computers.
 Reproducing how the human brain works with artificial neural networks has been studied since the 1940s; reproducing how ant colonies work with ant colony optimization algorithms has been studied since the 1990s.
 The term machine learning (ML) appeared in this
context as the “field of study that gives computers the
ability to learn without being explicitly programmed,”
according to Arthur Samuel in 1959 [4].
 In the 1990s, a new term appeared with a slightly different meaning: data mining (DM).
 Companies started to collect more and more data, aiming to solve or improve business operations, for example by detecting credit card fraud, by advising the public of road network constraints in cities, or by improving relations with clients using more efficient techniques of relational marketing.
 The question was of being able to mine the data in
order to extract the knowledge necessary for a given
task. This is the goal of data mining.
Big Data and Data Science
 In the first years of the 21st century, the term big data appeared. Big data, a technology for data processing, was initially defined by the “three Vs”, although some more Vs have been proposed since.
 The first three Vs allow us to define a taxonomy of big data.
 They are: volume, variety and velocity.
 Volume is concerned with how to store big data: data repositories for
large amounts of data.
 Variety is concerned with how to put together data from different
sources.
 Velocity concerns the ability to deal with data arriving very fast, in
streams known as data streams.
 Analytics is also about discovering knowledge from data streams,
going beyond the velocity component of big data.
 Another term that has appeared and is sometimes used as a
synonym for big data is data science.
 According to Provost and Fawcett [5], big data are data sets
that are too large to be managed by conventional data-
processing technologies, requiring the development of new
techniques and tools for data storage, processing and
transmission.
 These tools include, for example, MapReduce, Hadoop,
Spark and Storm. But data volume is not the only
characterization of big data.
 The word “big” can refer to the number of data sources, to the importance of the data, to the need for new processing techniques, to how fast data arrive, to the combination of different sets of data so they can be analyzed in real time, and to its ubiquity, since any company, nonprofit organization or individual now has access to data.
 Big data is more concerned with technology. It provides a
computing environment, not only for analytics, but also for other
data processing tasks. These tasks include finance transaction
processing, web data processing and georeferenced data
processing.
 Data science is concerned with the creation of models able to
extract patterns from complex data and the use of these models
in real-life problems. Data science extracts meaningful and useful
knowledge from data, with the support of suitable technologies. It
has a close relationship to analytics and data mining.
 Data science goes beyond data mining by providing a knowledge
extraction framework, including statistics and visualization.
 Therefore, while big data gives support to data collection and
management, data science applies techniques to these data to
discover new and useful knowledge.
 Big data collects and data science discovers
Big Data Architectures
 As data increase in size, velocity and variety, new
computer technologies become necessary. These new
technologies, which include hardware and software,
must be easily expanded as more data are processed.
 This property is known as scalability.
 One way to obtain scalability is by distributing the data
processing tasks into several computers, which can be
combined into clusters of computers.
 Even if processing power is expanded by combining
several computers in a cluster, creating a distributed
system, conventional software for distributed systems
usually cannot cope with big data.
 One of the limitations is the efficient distribution of data
among the different processing and storage units.
 To deal with these requirements, new software tools
and techniques have been developed.
 One of the first techniques developed for big data
processing using clusters was MapReduce.
MapReduce is a programming model that has two
steps: map and reduce.
 The most famous implementation of MapReduce is called
Hadoop.
 MapReduce divides the data set into parts – chunks – and
stores in the memory of each cluster computer the chunk
of the data set needed by this computer to accomplish its
processing task.
 As an example, suppose that you need to calculate the
average salary of 1 billion people and you have a cluster
with 1000 computers, each with a processing unit and a
storage memory.
 The people can be divided into 1000 chunks – subsets – with data from 1 million people each. Each chunk can be processed independently by one of the computers. The results produced by each of these computers (the average salary of its 1 million people) can then be combined – in this case simply averaged again, since every chunk has the same size – to obtain the overall average.
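To make the idea concrete, here is a minimal map/reduce-style sketch in Python; the salaries are randomly generated stand-ins for real data, local processes stand in for the 1000 cluster computers, and the chunk and worker counts are arbitrary.

```python
# Minimal map/reduce-style sketch of the average-salary example.
# Assumptions: salaries are random numbers standing in for real data,
# and local processes stand in for the cluster computers.
import random
from multiprocessing import Pool

def map_chunk(chunk):
    """Map step: each worker returns the sum and count of its chunk."""
    return sum(chunk), len(chunk)

def reduce_results(partials):
    """Reduce step: combine the partial sums and counts into one average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

if __name__ == "__main__":
    random.seed(0)
    salaries = [random.uniform(20_000, 120_000) for _ in range(1_000_000)]
    n_chunks = 100                         # stands in for the 1000 computers
    size = len(salaries) // n_chunks
    chunks = [salaries[i * size:(i + 1) * size] for i in range(n_chunks)]

    with Pool(processes=4) as pool:        # 4 local worker processes
        partials = pool.map(map_chunk, chunks)

    print("average salary:", reduce_results(partials))
```

Returning (sum, count) pairs instead of per-chunk averages keeps the reduce step correct even when the chunks have different sizes.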
To efficiently solve a big data problem, a distributed system must meet the following requirements (a small fault-tolerance sketch follows the list):
 Make sure that no chunk of data is lost and that the whole task is concluded. If one or more computers fail, their tasks and the corresponding data chunks must be taken over by another computer in the cluster.
 Repeat the same task, and the corresponding data chunk, on more than one cluster computer; this is called redundancy. Thus, if one or more computers fail, a redundant computer carries on with the task.
 Computers that have had faults can return to the cluster
again when they are fixed.
 Computers can be easily removed from the cluster, or extra ones included, as the processing demand changes.
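A toy sketch of the first two requirements (no chunk is lost; a failed task is taken over by another computer); the failure probability and worker function are invented purely for illustration.

```python
# Toy illustration of fault tolerance: every chunk is retried on another
# "worker" until it succeeds, so no chunk of data is lost.
import random

def unreliable_worker(worker_id, chunk):
    """Pretend worker: fails at random, otherwise returns a partial sum."""
    if random.random() < 0.2:                     # simulated hardware fault
        raise RuntimeError(f"worker {worker_id} failed")
    return sum(chunk)

def process_with_retries(chunks, n_workers=4, max_attempts=10):
    results = {}
    for chunk_id, chunk in enumerate(chunks):
        for attempt in range(max_attempts):
            worker_id = (chunk_id + attempt) % n_workers  # try another worker
            try:
                results[chunk_id] = unreliable_worker(worker_id, chunk)
                break
            except RuntimeError:
                continue                          # reassign chunk and retry
        else:
            raise RuntimeError(f"chunk {chunk_id} never completed")
    return results

random.seed(1)
data = list(range(1_000))
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]
partials = process_with_retries(chunks)
print("total:", sum(partials.values()))           # equals sum(data)
```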
Small Data
 In the opposite direction from big data technologies and methods,
there is a movement towards more personal, subjective analysis of
chunks of data, termed “small data”.
 Small data is a data set whose volume and format allow its processing and analysis by a person or a small organization.
 Thus, instead of collecting data from several sources, with different
formats, and generated at increasing velocities, creating large data
repositories and processing facilities, small data favors the partition
of a problem into small packages, which can be analyzed by
different people or small groups in a distributed and integrated way.
 People are continuously producing small data as they perform their daily activities, be it navigating the web, buying a product in a shop, undergoing medical examinations or using apps on their mobile phones.
 When these data are collected to be stored and processed
in large data servers they become big data.
 To be characterized as small data, a data set must have a size that allows its full understanding by a user.
 The type of knowledge sought in big and small data is also different: the first looks for correlations, the second for causal relations.
 While big data provide tools that allow companies to
understand their customers, small data tools try to help
customers to understand themselves. Thus, big data is
concerned with customers, products and services, and
small data is concerned with the individuals that produced
the data.
What is Data?
 Data, in the information age, are a large set of bits
encoding numbers, texts, images, sounds, videos, and
so on. Unless we add information to data, they are
meaningless.
 When we add information, giving a meaning to them,
these data become knowledge.
 But before data become knowledge, typically, they
pass through several steps where they are still referred
to as data, despite being a bit more organized; that is,
they have some information associated with them.
Let us see the example of data collected
from a private list of acquaintances or
contacts.
 Information presented as in Table 1.1, usually referred to as tabular data, is characterized by the way the data are organized.
 In tabular data, data are organized in
rows and columns, where each
column represents a characteristic of
the data and each row represents an
occurrence of the data.
 A column is referred to as an attribute
or, with the same meaning, a feature,
while a row is referred to as an
instance, or with the same meaning,
an object.
Instance or object: examples of the concept we want to characterize.
Example 1.1 In the example in Table 1.1, we intend to characterize
people in our private contact list. Each member is, in this case, an
instance or object. It corresponds to a row of the table.
Attribute or feature: attributes, also called features, are characteristics of the instances.
Example 1.2 In Table 1.1, contact, age, education level and company
are four different attributes.
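As an illustration, the contact list of Table 1.1 could be represented as a small data frame; since the table itself is not reproduced here, the rows below are hypothetical, and pandas is assumed to be available.

```python
# Hypothetical version of a contact list in tabular form (rows = instances,
# columns = attributes). The names and values are made up for illustration.
import pandas as pd

contacts = pd.DataFrame(
    {
        "contact": ["Andrew", "Bernhard", "Carolina", "Dennis"],
        "age": [55, 43, 37, 82],
        "education_level": [1.0, 2.0, 5.0, 3.0],
        "company": ["good", "good", "bad", "good"],
    }
)

print(contacts)                 # each row is an instance, each column an attribute
print(contacts["age"].mean())   # a simple descriptive statistic
```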
 There are, however, data that are not possible to represent
in a single table.
Example 1.3 As an example, if some of the contacts are
relatives of other contacts, a second table, as shown in Table
1.2, representing the family relationships, would be
necessary.
You should note that each person referred to in Table 1.2
also exists in Table 1.1, i.e., there are relations between
attributes of different tables.
 Data sets represented by several tables, making clear the relations between these tables, are called relational data sets. This information is easily handled using relational databases.
 Example 1.4 In our example, data is split into two
tables, one with the individual data of each contact
(Table 1.1) and the other with the data about the family
relations between them (Table 1.2).
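A sketch of such a relational data set using the same hypothetical contacts: a second table of family relations is joined back to the contacts table through the shared contact attribute (pandas assumed; all values invented).

```python
# Hypothetical relational data set: two tables linked by the contact name.
import pandas as pd

contacts = pd.DataFrame(  # stands in for Table 1.1
    {
        "contact": ["Andrew", "Bernhard", "Carolina", "Dennis"],
        "age": [55, 43, 37, 82],
    }
)

family = pd.DataFrame(    # stands in for Table 1.2
    {
        "contact": ["Andrew", "Carolina"],
        "relative": ["Dennis", "Bernhard"],
        "relation": ["son of", "married to"],
    }
)

# Every person in the family table must also exist in the contacts table,
# so a join recovers their attributes from the first table.
joined = family.merge(contacts, on="contact", how="left")
print(joined)
```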
A Short Taxonomy of Data Analytics
 Now that we know what data are, we will look at what we
can do with them. A natural taxonomy that exists in data
analytics is:
 Descriptive analytics: summarize or condense data to
extract patterns
 Predictive analytics: extract models from data to be used
for future predictions.
 In descriptive analytics tasks, the result of a given method or technique is obtained directly by applying an algorithm to the data. The result can be a statistic, such as an average, a plot, or a set of groups with similar instances, among other things, as we will see.
 Method or technique: a systematic procedure that allows us to achieve an intended goal. A method shows how to perform a given task. But in order to use a language closer to the language computers can understand, it is necessary to describe the method/technique through an algorithm.
 Algorithm: a self-contained, step-by-step set of instructions easily understandable by humans, allowing the implementation of a given method. Algorithms are self-contained in order to be easily translated to an arbitrary programming language.
 Example 1.5 The method to obtain the average age of my contacts uses the age of each contact (we could use other methods, such as using the number of contacts at each different age).
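A sketch of both methods as short Python algorithms, using hypothetical ages.

```python
# Two algorithms for the same descriptive method: the average age of my contacts.
from collections import Counter

ages = [55, 43, 37, 82, 23, 46, 38, 50, 29, 42]   # hypothetical contact ages

# Method 1: sum all ages and divide by the number of contacts.
average_1 = sum(ages) / len(ages)

# Method 2: count how many contacts have each distinct age,
# then take a weighted average over the distinct ages.
counts = Counter(ages)
average_2 = sum(age * n for age, n in counts.items()) / sum(counts.values())

print(average_1, average_2)   # both print the same value
```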
 We have seen an algorithm that describes a descriptive
method. An algorithm can also describe predictive
methods. In this last case it describes how to generate a
model. Let us see what a model is.
 Model: in data analytics, a model is a generalization obtained from data that can be used afterwards to generate predictions for new instances. It can be seen as a prototype that can be used to make predictions. Thus, model induction is a predictive task.
 Example 1.7 If we apply an algorithm for induction of
decision trees to provide an explanation of who, among our
contacts, is a good company, we obtain a model, called a
decision tree, like the one presented in Figure 1.1.
• It can be seen that people older than 38 years are typically better company than those whose age is equal to or less than 38: more than 80% of people aged 38 or less are bad company, while more than 80% of people older than 38 are good company.
• This model could be used to predict whether or not a new contact is good company. It would be enough to know the age of that new contact.
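A minimal sketch of this model as a one-rule predictor in Python; the 38-year threshold is the one read off Figure 1.1, everything else is illustrative.

```python
# One-rule version of the decision tree in Figure 1.1:
# contacts older than 38 are predicted to be good company.
def predict_good_company(age: int) -> str:
    """Predict 'good' or 'bad' company from age alone."""
    return "good" if age > 38 else "bad"

# Using the model on new, unseen contacts.
for new_contact_age in (25, 38, 39, 61):
    print(new_contact_age, "->", predict_good_company(new_contact_age))
```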
Examples of Data Use
We will describe two real-world problems from different areas as an
introduction to the different subjects. One of the problems is from
medicine and the other is from economics.
Breast Cancer in Wisconsin
 Breast cancer is a well-known problem that affects mainly women.
 The detection of breast tumors can be performed through a biopsy
technique known as fine-needle aspiration.
 This uses a fine needle to sample cells from the mass under study.
Samples of breast mass obtained using fine-needle aspiration were
recorded in a set of images.
 Then, a dataset was collected by extracting features from these
images. The objective of the first problem is to detect different patterns
of breast tumors in this dataset, to enable it to be used for diagnostic
purposes.
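For readers who want to experiment, a processed version of this Wisconsin breast cancer data set ships with scikit-learn; a minimal loading sketch, assuming scikit-learn is installed.

```python
# Load the processed Wisconsin breast cancer data set bundled with scikit-learn.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target          # features extracted from the images, labels

print(X.shape)                         # (569, 30): 569 instances, 30 attributes
print(data.feature_names[:5])          # a few of the attribute names
print(data.target_names)               # ['malignant' 'benign']
```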
Polish Company Insolvency Data
 The second problem concerns the prediction of the
economic wealth of Polish companies.
 Can we predict which companies will become
insolvent in the next five years? The answer to this
question is obviously relevant to institutions and
shareholders.
A Project on Data Analytics
Every project needs a plan. Or, to be precise, a methodology to
prepare the plan. A project on data analytics does not imply only the
use of one or more specific methods. It implies:
• understanding the problem to be solved
• defining the objectives of the project
• looking for the necessary data
• preparing these data so that they can be used
• identifying suitable methods and choosing between them
• tuning the hyper-parameters of each method (see below)
• analyzing and evaluating the results
• redoing the pre-processing tasks and repeating the experiments
• and so on.
 We assume that in the induction of a model, there are both hyper-parameters and parameters whose values are set.
 The values of the hyper-parameters are set by the user, or
some external optimization method. The parameter values,
on the other hand, are model parameters whose values
are set by a modeling or learning algorithm in its internal
procedure.
 When the distinction is not clear, we use the term
parameter.
 Thus, hyper-parameters might be, for example, the number
of layers and the activation function in a multi-layer
perceptron neural network and the number of clusters for
the k-means algorithm.
 Examples of parameters are the weights found by the backpropagation algorithm when training a multi-layer perceptron neural network, and the distribution of objects among the clusters found by the k-means algorithm.
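A small illustration of the distinction using k-means from scikit-learn (assumed to be installed, together with NumPy): the number of clusters is a hyper-parameter set by the user, while the cluster centres are parameters learned from the data.

```python
# Hyper-parameters are chosen by the user; parameters are learned from the data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # synthetic data with two groups
               rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)   # hyper-parameters
kmeans.fit(X)

print(kmeans.cluster_centers_)   # parameters: learned by the algorithm
print(kmeans.labels_[:5])        # distribution of objects among the clusters
```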
 How can we perform all these operations in an organized
way?
 This section is all about methodologies for planning and
developing projects in data analytics.
A Little History on Methodologies for
Data Analytics
 Machine learning, knowledge discovery from data and
related areas experienced strong development in the
1990s. Both in academia and industry, the research on
these topics was advancing quickly. Naturally, methodologies for projects in these areas, now referred to as data analytics, became a necessity.
 In the mid-1990s, both in academia and industry, different
methodologies were presented. The most successful
methodology from academia came from the USA. This was
the KDD process of Usama Fayyad, Gregory Piatetsky-
Shapiro and Padhraic Smyth.
 The most successful methodology from industry was, and still is, the CRoss-Industry Standard Process for Data Mining (CRISP-DM). Conceived in 1996, it later got underway as a European Union project under the ESPRIT funding initiative.
 Other methodologies exist. Some of them are tool-specific: they assume the use of a given tool for data analytics. This is not the case for SEMMA, which, despite having been created by SAS, is tool independent. Each letter of its name, SEMMA, refers to one of its five steps: Sample, Explore, Modify, Model and Assess.
The KDD Process
Intended to be a methodology that could cope with all the processes
necessary to extract knowledge from data, the KDD process
proposes a sequence of nine steps. In spite of the sequence, the
KDD process considers the possibility of going back to any previous
step in order to redo some part of the process. The nine steps are:
1) Learning the application domain: What is expected in terms of
the application domain? What are the characteristics of the
problem; its specificities? A good understanding of the application
domain is required.
2) Creating a target dataset: What data are needed for the
problem? Which attributes? How will they be collected and put in
the desired format (say, a tabular data set)? Once the application
domain is known, the data analyst team should be able to identify
the data necessary to accomplish the project.
3) Data cleaning and pre-processing: How should missing
values and/or outliers such as extreme values be handled?
What data type should we choose for each attribute? It is
necessary to put the data in a specific format, such as a
tabular format.
4) Data reduction and projection: Which features should we
include to represent the data? From the available features,
which ones should be discarded? Should further information
be added, such as adding the day of the week to a timestamp?
This can be useful in some tasks. Irrelevant attributes should
be removed.
5) Choosing the data mining function: Which type of methods should be used? Four types of method are: summarization, clustering, classification and regression. The first two belong to the branch of descriptive analytics, while the last two belong to predictive analytics.
6) Choosing the data mining algorithm(s): Given the
characteristics of the problem and the characteristics of the data,
which methods should be used? It is expected that specific
algorithms will be selected.
7) Data mining: Given the characteristics of the problem, the
characteristics of the data, and the applicable method type, which
specific methods should be used? Which values should be
assigned to the hyper-parameters? The choice of method depends
on many different factors: interpretability, ability to handle missing
values, capacity to deal with outliers, computational efficiency,
among others.
8) Interpretation: What is the meaning of the results? What is the
utility for the final user? To select the useful results and to evaluate
them in terms of the application domain is the goal of this step. It is
common to go back to a previous step when the results are not as
good as expected.
9) Using discovered knowledge: How can we apply the new knowledge
in practice? How is it integrated in everyday life? This implies the
integration of the new knowledge into the operational system or in the
reporting system.
 For simplicity's sake, the nine steps were described sequentially, which is typical. However, in practice, some jumps are often necessary.
 As an example, steps 3 and 4 can be grouped together with steps 5
and 6.
 The way we pre-process the data depends on the methods we will use. For instance, some methods are able to deal with missing values, others are not. When a method is not able to deal with missing values, those values should be filled in somehow, or some attributes or instances should be removed.
 Also, there are methods that are too sensitive to outliers or extreme
values. When this happens, outliers should be removed. Otherwise, it
is not necessary to remove them.
 These are just examples on how data cleaning and pre-processing
tasks depend on the chosen method(s) (steps 5 and 6).
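A brief sketch of the pre-processing choices just discussed, on a made-up table: filling in a missing value and dropping an extreme value (step 3), and deriving a day-of-week attribute from a timestamp (step 4). pandas is assumed, and all values and thresholds are illustrative.

```python
# Illustration of steps 3-4: cleaning and projecting a small, made-up data set.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2023-01-02 08:15", "2023-01-07 19:30",
             "2023-01-08 10:00", "2023-01-09 12:45"]
        ),
        "age": [25, np.nan, 44, 200],   # one missing value and one extreme value
    }
)

# Step 3: fill in the missing age (for methods that cannot handle missing
# values) and drop the implausible value (for methods sensitive to outliers).
df["age"] = df["age"].fillna(df["age"].median())
df = df[df["age"] <= 120]               # the 120-year cut-off is arbitrary

# Step 4: project the timestamp onto a potentially more useful attribute.
df["day_of_week"] = df["timestamp"].dt.day_name()

print(df)
```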
The CRISP-DM Methodology
 CRoss-Industry
Standard Process for
Data Mining (CRISP-
DM) is a six-step
method, which, like the
KDD process, uses a
non-rigid sequential
framework.
 Despite the six phases,
CRISP-DM is seen as a
perpetual process, used
throughout the life of a
company in successive
iterations (Figure 1.3).
The six phases are:
1) Business understanding: This involves understanding
the business domain, being able to define the problem from
the business domain perspective, and finally being able to
translate such business problems into a data analytics
problem.
2) Data understanding: This involves collection of the
necessary data and their initial visualization/summarization in
order to obtain the first insights, particularly but not
exclusively, about data quality problems such as missing
data or outliers.
3) Data preparation: This involves preparing the data set for
the modeling tool, and includes data transformation, feature
construction, outlier removal, missing data fulfillment and
incomplete instances removal.
4) Modeling: Typically there are several methods that can be used
to solve the same problem in analytics, often with specific data
requirements. This implies that there may be a need for additional
data preparation tasks that are method specific. In such cases it is necessary to go back to the previous step. The modeling phase also includes tuning the hyper-parameters for each of the chosen method(s); a small tuning sketch is given after the phase list.
5) Evaluation: Solving the problem from the data analytics point of view is not the end of the process. It is now necessary to understand how its use is meaningful from the business perspective; in other words, whether the obtained solution answers the business requirements.
6) Deployment: The integration of the data analytics solution in the
business process is the main purpose of this phase. Typically, it
implies the integration of the obtained solution into a decision-
support tool, website maintenance process, reporting process or
elsewhere.
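As a sketch of the modeling and evaluation phases, the following tunes one hyper-parameter of a decision tree by cross-validation and then checks the chosen model on held-out data; scikit-learn is assumed, and the data set and parameter grid are purely illustrative choices.

```python
# Phase 4 in practice: tuning a hyper-parameter (the tree depth) by
# cross-validation, then evaluating the chosen model (phase 5).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, None]},   # candidate hyper-parameter values
    cv=5,
)
search.fit(X_train, y_train)

print("best hyper-parameter:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```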
Credit Card Fraud Data