Syllabus and Introduction
Big Data Analytics
2
3
Textbooks:
1. Seema Acharya, Subhashini Chellappan, “Big Data and Analytics”, Wiley, 2017.
2. Alex Holmes, “Big Data Black Book”, Dreamtech, 2015.
4
Course Outcomes:
Upon completion of this course, students will be able to:
1. Identify the issues and challenges related to Big Data.
2. Choose and apply Big Data technologies and tools to solve real-life Big Data problems.
3. Design a MapReduce architecture for a Big Data problem.
4. Write scripts using Pig and Hive to solve Big Data problems.
5. Derive different analytics from a Big Data problem.
5
In today’s discussion…
 Introduction to data
 Data and Big data
 Big Data Analytics
 Big data- Definition and Meaning.
 Types of data
 Characteristics of Big data
 Big data vs. small data
 Tools and techniques
6
Introduction to data
 Example:
10, 25, …, Nitte, CC3201-1
Anything else?
 Data vs. Information
100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0
Is there any information?
7
Big Data Definition:
8
Big Data-Definition:
9
Example of Big Data:
10
Big Data-Meaning:
11
Big Data Analytics- Definition:
12
Types of Data:
 Structured
 Unstructured
 Semi-structured
13
Structured data
 Any data that can be stored, accessed and processed in a fixed format is termed ‘structured’ data.
14
Unstructured
 Any data whose form or structure is unknown is classified as unstructured data.
 Example:
15
Semi-structured
 Semi-structured data contains elements of both structured and unstructured data.
 Examples Of Semi-structured Data
 Personal data stored in an XML file-
<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>
<rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec>
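A minimal sketch (not from the slides) of how such semi-structured records can be parsed: the tags describe each field, but no fixed schema is enforced up front. The snippet assumes the records above are wrapped in a single root element.

import xml.etree.ElementTree as ET

# Hypothetical wrapper: the <rec> records above, enclosed in one root element.
xml_text = """<people>
<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
</people>"""

root = ET.fromstring(xml_text)
for rec in root.findall("rec"):
    # Each record is self-describing; fields are read by tag name, not by position.
    print(rec.findtext("name"), rec.findtext("gender"), rec.findtext("age"))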
16
17
How large is your data?
 What is the maximum file size you have dealt with so far?
 Movies/files/streaming video that you have used?
 What is the maximum download speed you get?
 How long does it take to retrieve data stored in distant locations?
 How fast is your computation?
 How much time does it take to transfer the data, process it and get the result?
18
Growth of data
19
Sources of data
 “Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.” This explosion of information is known as “Big Data”.
 The data come from several sources:
etc. …… to name a few!
20
Social Media:
21
Now data is Big data!
 No single standard definition!
 ‘Big data’ is similar to ‘small data’, but bigger
…but having bigger data consequently requires different approaches:
 techniques, tools and architectures
…to solve new problems
…and, of course, to solve them in a better way
Big data is data whose scale, diversity, and complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract value and hidden knowledge from it…
22
Characteristics of Big data: 5V’s
 5 V's of Big Data:
 Volume
 Velocity
 Variety
 Veracity
 Value
23
Volume(The scale):
 Volume in Big Data represents the amount of data. In today’s world, data is produced in various formats such as Word, Excel and PDF documents, and sometimes as audio and video. This data can be in structured, unstructured or semi-structured format. Today’s social media platforms produce a tremendous amount of data, which is difficult for an organization to handle. To handle this huge amount of data, organizations should implement modern business intelligence tools that capture it in an effective and cost-efficient form.
24
Velocity(The speed):
 Velocity refers to the rate/speed at which data is getting generated.
 This is primarily due to the Internet of Things (IoT), mobile data,
social media, and other factors. At least 2 trillion searches each
year, 3.8 million searches per minute, 228 million searches per
hour, and 5.6 billion searches per day are now being conducted.
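As a quick consistency check on these figures: 3.8 million searches per minute is about 228 million per hour (× 60), roughly 5.5 billion per day (× 24) and about 2 trillion per year (× 365), so the quoted numbers agree with one another.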
25
Variety(Data type):
 Big Data can be structured, unstructured or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Quasi-structured data: textual data with inconsistent formats that can be structured only with effort, time and the help of some tools.
26
Veracity:
 Degree of trustworthiness of the data is the veracity of the data.
 Veracity means how reliable the data is. Because sources vary in quality, the data often has to be filtered or translated before it can be trusted; veracity is about being able to handle and manage such data efficiently, which is also essential for business development.
 For example, Facebook posts with hashtags.
27
Value:
 Value is an essential characteristic of Big Data. What matters is not simply the data that we process or store, but the valuable and reliable data that we store, process and analyze.
28
29
Difference between Small data and Big Data
Small data:
 Mostly structured data
 Stored in KB, MB, GB, TB
 Increases gradually
 Locally present, centralized
 SQL Server, Oracle
 Single node
Big Data:
 Mostly unstructured data
 Stored in PB, EB, ZB, YB
 Increases exponentially
 Globally present, distributed
 Hadoop, Spark
 Multi-node cluster
30
Big data vs. small data
 Big data is more real-time in nature than traditional applications
 Big data architecture
 Traditional architectures (e.g. Exadata, Teradata) are not well suited for big data applications
 Massively parallel, scale-out architectures are well suited for big data applications
31
Major players…
 Google
 Hadoop
 MapReduce
 Mahout
 Apache Hbase
 Cassandra
32
Tools available
 NoSQL databases
 MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
 MapReduce
 Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
 Storage
 S3, HDFS, GDFS
 Servers
 EC2, Google App Engine, Elastic Beanstalk, Heroku
 Processing
 R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
33
34
Job Opportunities in Big Data
 Data Analysts
 analyze and interpret data, visualize it, and build reports to help make better business decisions.
 Data Scientists
 mine data by assessing data sources and use algorithms and Machine Learning techniques.
 Data Architects
 design database systems and tools.
 Database Managers
 control database system performance, perform troubleshooting, and upgrade hardware and software.
 Big Data Engineers
 design, maintain, and support Big Data solutions.
35
END OF INTRODUCTION
36
Questions of the day…
1. What is the smallest and largest units of measuring size of data?
2. How big a Quintillion measure is?
3. Give examples of the smallest and the largest entities of data.
4. Give FIVE parameters with which data can be categorized as i) simple, ii) moderately complex and iii) complex.
Questions of the day…
5. What type of data are involved in the following applications?
1. Weather forecasting
2. Mobile usage of all customers of a service provider
3. Anomaly (e.g. fraud) detection in a bank organization
4. Person categorization, that is, identifying a human
5. Air traffic control in an airport
37
38
Big data types (by IBM)
 Social Networks and web data
 Transactions data and Business Process data
 Customer master data
 Machine generated data
 Human generated data
39
Big Data Classification : based on characteristics for
designing data architecture for processing and analytics
40
41
Scalability and Parallel Processing
 Scalability is the capability to handle growing amounts of data and a growing number of database clients, either by adding more hardware resources or by optimizing and using the existing resources more efficiently.
 Scalability enables an increase or decrease in the capacity of data storage, processing and analytics.
 In short, you need to build scalability into the hardware architecture and database selection; performance can (for the most part) be maximized later, during the database design and deployment phase.
 System capability needs to grow with increasing workloads. When the workload and complexity exceed the system’s capacity, scale it up or scale it out.
42
Scalability Options
 Assuming you need to scale your system, there are two options:
 scaling up
 scaling out
43
Scale UP
 Resources such as CPU, network, and storage are
common targets for scaling up.
 The goal is to increase the resources supporting
your application to reach or maintain adequate
performance.
 In a hardware-centric world, this might mean
adding a larger hard drive to a computer for
increased storage capacity.
 It might mean replacing the entire computer with
a machine that has more CPU and a more
performant network interface.
44
Scale OUT
 The scale-out option implies a
distributed system whereby
additional machines are added to a
cluster to provide additional
capacity. It's often more likely to
yield a linear increase in scalability,
although not necessarily increased
performance.
45
Analytics Scalability to Big Data
 Vertical scalability means scaling up the given system resources and increasing the
system’s analytics, reporting and visualization capabilities.
 Ex: designing the algorithm according to the architecture that uses resources
efficiently.
 Horizontal scalability means increasing the number of systems working in
coherence and scaling out the workload.
 Ex: using more resources and distributing the storage and processing task in parallel.
 Note: Alternative ways for scaling up and out processing of analytics software and big
data analytics deploy the Massively Parallel Processing Platforms(MPPs), cloud, grid,
clusters and distributed computing software.
46
Massively Parallel Processing Platforms(MPPs)
 Massively parallel processing (MPP) is a
collaborative processing of the same
program using two or more processors.
 By using different processors, speed can
be dramatically increased.
 For example, imagine a popular insurance company with millions of customers. As the number of customers increases, so does the customer data. Even if the firm uses parallel processing, it may experience delays in processing customer data. Assume a data analyst is running a query against 100 million rows of a database. If the organization uses a massively parallel processing system with 1000 nodes, each node has to bear only 1/1000th of the computational load.
47
Parallelization of tasks can be done at several levels (a minimal sketch follows this list):
• Distributing separate tasks onto separate threads on the same CPU.
• Distributing separate tasks onto separate CPUs on the same computer.
• Distributing separate tasks onto separate computers.
There are several types of MPP database architectures:
• Distributed Computing Model
• Cloud Computing
• Grid and Cluster Computing
• Volunteer Computing
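A minimal, illustrative sketch (not from the slides) of distributing the same task over several worker processes, so each worker bears roughly 1/N of the load, in the spirit of the MPP example above. The data and numbers are invented for illustration.

from multiprocessing import Pool
import random

def count_matches(rows):
    # Stand-in "query": count rows whose value exceeds a threshold.
    return sum(1 for r in rows if r > 0.5)

if __name__ == "__main__":
    data = [random.random() for _ in range(1_000_000)]           # stand-in for the table being queried
    n_workers = 4                                                # stand-in for the nodes of an MPP system
    partitions = [data[i::n_workers] for i in range(n_workers)]  # split the rows across workers
    with Pool(n_workers) as pool:
        partial_counts = pool.map(count_matches, partitions)     # each worker handles ~1/n of the rows
    print(sum(partial_counts))                                   # combine the partial results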
48
Distributed Computing Model
 It uses cloud, grid or clusters to process and analyze big and large datasets on distributed computing nodes connected by a high-speed network.
49
Cloud computing
 Type of internet-based computing that provides shared processing
resources and data to the computers and other devices on demand.
 One of the best approaches for performing parallel and distributed data processing
 Offers high data security compared to other distributed technologies
50
Cloud resources
 Amazon Web Service(AWS)
 Elastic Compute Cloud(EC2)
 Microsoft Azure or Apache CloudStack
 Amazon Simple Storage Service(S3)
51
Cloud computing features
1. On-demand service
2. Resource pooling
3. Scalability
4. Accountability
5. Broad network Access
 Cloud services can be accessed from anywhere and at any time through
the internet
52
Cloud services types
1. Infrastructure as a Service(IaaS):
 Providing access to resources such as hard disks, network connections, database storage, data centres and virtual server space.
 Ex: AWS EC2, Rackspace, Google Compute Engine
2. Platform as a Service(PaaS):
 Providing runtime environment to allow developers to build applications and services.
 Ex: Windows Azure (mostly used as PaaS), Force.com
3. Software as a Service(SaaS):
 Providing software applications as a service to end-users
 Ex: BigCommerce, Google Apps, Salesforce, Dropbox
53
Grid and Cluster computing
 Grid Computing:
 Distributed computing, in which a group of computers from several locations are connected with
each other to achieve a common task.
 Grid: a group of computers that may be spread across remote locations
 This type of computing provides large-scale resource sharing that is flexible, coordinated and secure among its users.
 For example, a research team might analyze weather patterns in the North Atlantic region while another team analyzes the South Atlantic region, and both results can be combined to deliver a complete picture of Atlantic weather patterns.
54
Features of Grid computing
 Similar to cloud computing
 Scalable
 Distributed network for resource integration
55
Drawbacks of Grid Computing
 Single point of failure
 Storage capacity varies with the number of users, instances
and the amount of data transferred at a given time
56
Cluster computing
 Group of computers connected by a network to accomplish the
same task.
 Used mainly for load balancing
57
Difference between Cluster and Grid Computing:
58
Volunteer Computing
 Volunteers are organizations or members who own personal
computers.
 They provide computing resources to important projects that use
resources to do distributed computing and/or storage
 Volunteer Computing: uses computing resources of the volunteers
59
Issues of volunteer computing systems
 Heterogeneity of the volunteered computers
 Drop-outs from the network over time
 Their sporadic availability
 Incorrect results from volunteers are unaccountable, as the volunteers are anonymous
60
Designing Data Architecture
“Big data architecture is the logical and/or physical layout/structure of how big data will be
stored, accessed and managed within a big data or IT environment.”
 Architecture logically defines how the big data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security, and more.
 Data processing architecture consists of 5 layers:
 (i) identification of data sources
 (ii) acquisition, ingestion, extraction, pre-processing, transformation of data
 (iii) data storage at files, servers, cluster or cloud
 (iv) data processing
 (v) data consumption
61
62
63
64
Managing data for Analysis
 Data managing means enabling, controlling, protecting, delivering and enhancing the value of data and information assets.
 Data Management functions include:
65
66
Data Sources
 Applications, programs and tools use data.
 Sources can be external, such as sensors, trackers, web logs,
computer system logs and feeds.
 Sources can be machines, which source data from data-creating
programs.
 Data sources can be i) structured, ii) semi-structured, iii) multi-structured or unstructured.
67
Structured data sources
 The source may be on the same computer running a program or a networked
computer.
 Examples of structured data sources are SQL server, MySQL, Oracle DBMS, file
collection directory at a server.
 The name implies a defined name, which a process uses to identify the source.
 Ex: a name which identifies data on student grades stored for processing; the name could be studentname_data_grades.
Then, what could be the name of the data source?
68
Unstructured data sources
 Distributed over high-speed networks.
 The data needs high-velocity processing, as the sources are distributed file systems.
 The sources are of file types such as .txt and .csv (comma-separated values).
 Data may be in key-value pairs, such as hash key-value pairs.
 Data may have internal structures, such as in e-mails, Facebook pages, Twitter pages, etc.
 Data sources can be sensors, sensor networks, signals from machines, devices and controllers of different types in industry, M2M communication and GPS systems.
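A minimal sketch (an assumption, not from the slides) of turning one such source, a semicolon-separated sensor/log line, into hash-style key-value pairs; the line format here is invented for illustration.

# Hypothetical log line from a sensor or web server (format invented for illustration).
line = "ts=2017-01-11T10:15:00;device=sensor42;temp=27.5;status=OK"

# Split the record into key-value pairs.
record = dict(field.split("=", 1) for field in line.split(";"))
print(record["device"], record["temp"])   # -> sensor42 27.5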
69
Data Quality
 Data quality is the measure of how well suited a data set is to serve its
specific purpose.
 High-quality data is data with the five R’s.
70
Data Integrity
 Data integrity refers to the fact that data must be reliable and accurate over
its entire lifecycle.
WHY IS DATA INTEGRITY IMPORTANT?
 You need to have constant access to quality data. Data integrity is important as it guarantees and secures the searchability and traceability of your data to its original source.
 Organizations collect more and more data and it has become a priority to
secure and maintain the integrity of this data. Without integrity and
accuracy, your data is worthless.
71
Examples of data quality problems
 Noise
 Outliers
 Missing values
 Duplicate data
72
Noisy data
 For objects, noise is considered an extraneous object.
 For attributes, noise refers to the modification of original values.
Here, noise refers to measurement error in data values; it could be random error or systematic error.
73
Outliers
 Outliers are data objects with characteristics that are considerably
different than most of the other data objects in the data set.
 Could indicate “interesting” cases, or could indicate errors in the
data
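A minimal sketch (not from the slides) of one common way to flag such objects, using the 1.5 × interquartile-range rule; the data values, including the suspect 5000.0, are illustrative.

import statistics

values = [100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0, 5000.0]   # 5000.0 is the suspect value

q1, _, q3 = statistics.quantiles(values, n=4)     # quartiles of the data
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr        # the usual 1.5*IQR fences
outliers = [v for v in values if v < low or v > high]
print(outliers)   # values far outside the bulk of the data, here [5000.0]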
74
Missing values
 Reasons for missing values
 Information is not collected (e.g., people decline to give their age)
 Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
 Ways to handle missing values
 Eliminate entities with missing values
 Estimate attributes with missing values
 Ignore the missing values during analysis
 Replace with all possible values (weighted by their probabilities)
 Impute missing values
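A minimal sketch (assuming the pandas library; data invented for illustration) of two of the options listed above: eliminating rows with missing values versus imputing them.

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40], "income": [50000, 62000, None]})

dropped = df.dropna()                             # eliminate entities (rows) with missing values
imputed = df.fillna(df.mean(numeric_only=True))   # impute missing values with the column mean
print(dropped)
print(imputed)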
75
Duplicate data
 Data set may include data entities that are duplicates, or almost
duplicates of one another
 Major issue when merging data from heterogeneous sources
 Example: same person with multiple email addresses.
 Data cleaning
 Finding and dealing with duplicate entities
 Finding and correcting measurement error
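A minimal sketch (assuming pandas; data invented) of finding and dealing with near-duplicate entities, such as the same person under two email addresses, by deduplicating on the name column.

import pandas as pd

people = pd.DataFrame({
    "name":  ["Seema R.", "Seema R.", "Satish Mane"],
    "email": ["seema@x.com", "seema@y.com", "satish@x.com"],
})

dupes   = people[people.duplicated(subset="name", keep=False)]   # flag all records sharing a name
cleaned = people.drop_duplicates(subset="name", keep="first")    # keep one record per person
print(dupes)
print(cleaned)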
76
Data Preprocessing
 It is an important step at the ingestion layer.
 It is a must before data mining, analytics or before running
machine learning algorithms.
 Pre-processing needs are:
77
Data cleaning
 Process of removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data after detecting them.
 Example:
 Correcting the grade outliers or mistakenly entered values
78
Important terminologies in data cleaning
• Data cleaning tools
• Data enrichment
• Data editing
• Data reduction
• Data wrangling
• Data formats used during pre-processing
79
Data cleaning tools
 Data cleaning is done before data mining.
 Data cleaning tools help in refining and structuring data into usable
data.
 Example:
 OpenRefine
 DataCleaner
80
Data Enrichment:
Refers to operations or processes which refine, enhance or improve raw data.
Data Editing:
Process of reviewing and adjusting the acquired datasets; it controls data quality.
Editing methods are:
• Interactive
• Selective
• Automatic
• Aggregating
• Distribution
81
Data Reduction:
Enables the transformation of acquired information into an ordered, correct and simplified form.
Enables ingestion of meaningful data in the datasets.
Basic concept: reduction of the multitudinous amount of data and use of the meaningful parts.
Uses editing, scaling, coding, sorting, collating, smoothening, interpolating and preparing tabular summaries.
82
Data Wrangling:
Process of transforming and mapping the data so that the results of analysis are appropriate and valuable.
Example: mapping transforms data into another format, which makes it valuable for analytics and data visualizations.
83
Data formats used during Pre-processing
84
CSV format
Refers to a plain text file which stores table data of numbers and text.
Each CSV file line is a data record.
Each record consists of one or more fields, separated by commas.
CSV files are most often encountered in spreadsheets and databases.
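A minimal sketch (file name and contents invented) of reading such a file with Python's csv module: each line is one record, and each record's fields are keyed by the header row.

import csv

# Hypothetical file contents of grades.csv:
#   name,grade,marks
#   Prashant Rao,A,85
#   Seema R.,B,72
with open("grades.csv", newline="") as f:
    for record in csv.DictReader(f):      # one record per CSV line
        print(record["name"], record["marks"])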
85
Example: CSV
86
Activity: Find out the differences between CSV and Excel file formats.
87
Data format conversions
 Preprocessing is needed for data-format conversions.
 A number of different applications, services and tools need data in one specific format only.
 Preprocessing before their usage, or before storage on cloud services, is a must.
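A minimal sketch (not from the slides; file names illustrative) of one such format conversion, CSV to JSON, before the data is pushed to a cloud service.

import csv, json

# Convert a hypothetical grades.csv into grades.json before uploading it.
with open("grades.csv", newline="") as src:
    rows = list(csv.DictReader(src))      # list of {column: value} records

with open("grades.json", "w") as dst:
    json.dump(rows, dst, indent=2)        # same records, now in JSON format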
88
Data store export to cloud
89
From the diagram,
 Shows data pre-processing, data mining, analysis, visualization
and data store.
 The data exports to cloud services.
 The results integrate at the enterprise server or data warehouse.
90
Cloud services
The services can be accessed through a cloud client, such as a web browser, SQL or other client.
91
Data store export from machines, files, computers, web servers and web services
92
Export of data to AWS and Rackspace Clouds: Example
93
94
Example 2: BigQuery cloud service at google cloud platform
95
Data Storage and Management: Traditional Systems
 Data Store with structured or semi-structured data.
 SQL
 RDBMS uses SQL.
 It is a language for viewing or changing databases.
 SQL does the following
96
Data Storage and Analysis
97
DDBMS, Enterprise data-store server and data warehouse
98
Distributed database management system(DDBMS)
 Collection of logically interrelated databases at multiple systems over a computer network.
 Features of DDBMS are:
99
In-memory column format data
 Allows faster data retrieval when only a few columns in a table need to be selected for querying
 Data in a column are kept together in memory in columnar format
 A single memory access therefore loads many values of the column
 Used in OLAP
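A minimal, illustrative sketch (not from the slides) of the difference: the same table kept row-wise versus column-wise in memory, where an aggregate over one column touches only that column's values in the columnar layout.

# Row format: one dict per row; reading a single column still walks every row object.
rows = [
    {"id": 1, "price": 10.0, "qty": 3},
    {"id": 2, "price": 12.5, "qty": 1},
    {"id": 3, "price":  7.0, "qty": 5},
]
total_row_format = sum(r["price"] for r in rows)

# Columnar format: each column's values are stored contiguously, so an aggregate
# over one column touches only that list (the idea behind in-memory column stores).
columns = {"id": [1, 2, 3], "price": [10.0, 12.5, 7.0], "qty": [3, 1, 5]}
total_col_format = sum(columns["price"])

print(total_row_format, total_col_format)   # same result, different access pattern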
100
Use of in-memory column formats in OLAP
OLAP (Online Analytical Processing) in real time: processing is fast when using in-memory column-format tables.
Enables real-time analytics.
The CPU accesses all the required columns in a single instance of access to the memory in columnar-format in-memory data storage.
101
In-memory row format data
 A row format in-memory allows much faster data processing during OLTP (Online Transaction Processing)
102
103
Enterprise data-store server and data warehouse
Enterprise data, after the data cleaning process, integrates with the server data at the warehouse.
An enterprise data server uses data from several distributed sources which store data using various technologies.
All data merge using an integration tool.
104
Enterprise data integration and management with big data
105
Big Data storage
106
Big Data NoSQL or Not Only SQL
 NoSQL DBs are semi-structured
 Big data stores use NoSQL
 NoSQL stands for No SQL or Not Only SQL
 They do not integrate with applications using SQL
 NoSQL is also used in cloud data stores
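A minimal, illustrative sketch (not any particular NoSQL product's API) of the schema-less, key-value style of storage that NoSQL data stores use: records with different fields live side by side under their keys.

# Toy in-memory "document store": keys map to semi-structured records
# whose fields need not follow a fixed schema.
store = {}
store["user:1"] = {"name": "Prashant Rao", "age": 35}
store["user:2"] = {"name": "Seema R.", "email": "seema@x.com"}   # different fields, no schema change

for key, doc in store.items():
    print(key, doc.get("name"), doc.get("email", "-"))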
107
Features of NoSQL
108
109
Terminologies
Consistency: all copies have the same value, as in traditional DBs.
Availability: at least one copy is available in case a partition becomes inactive or fails.
Partition: parts which are active but may not cooperate, as in distributed DBs.
110
Coexistence of big data, NoSQL and traditional data stores
111
Various data sources and examples of usages and tools
112
113
BIG DATA PLATFORM
Supports large datasets and volumes of data.
The data generate at a higher velocity, in more varieties or with higher veracity.
Managing Big Data requires large resources of MPPs, cloud, parallel processing and specialized tools.
114
A big data platform should provide tools and services for:
115
Hadoop
 A Big Data platform consists of Big Data storage, servers, and data management and BI software
 Storage can deploy HDFS or NoSQL data stores, such as HBase, MongoDB, Cassandra
 HDFS is an open-source storage system
 A scaling, self-managing and self-healing file system
116
Hadoop
A scalable and reliable parallel computing platform.
Manages Big Data distributed databases.
117
Hadoop based Big data environment
118
Mesos
 Mesos v0.9 is a resource management platform which enables sharing of
cluster nodes by multiple frameworks and which has compatibility with an
open analytics stack
119
Big Data Stack
 A stack consists of a set of software components and data store units.
 Applications, ML algorithms, analytics and visualization tools use the Big Data Stack (BDS) at a cloud service, such as Amazon EC2, Azure or a private cloud
 Uses a cluster of high-performance machines
120
Tools for Big Data environment
121
Big Data Analytics
Data analysis is a process of inspecting, cleaning, transforming and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
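A minimal sketch (assuming pandas; data invented) of that definition in miniature: inspect, clean and summarize a small dataset to support a decision.

import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "amount": [120.0, None, 95.0, 130.0],
})

print(sales.describe())                              # inspect
clean = sales.dropna()                               # clean: drop incomplete records
summary = clean.groupby("region")["amount"].mean()   # model/summarize: average sale per region
print(summary)                                       # information that can inform a decision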
122
Phases in Analytics
123
Traditional and Big Data analytics architecture reference model
124
Berkeley Data Analytics Stack (BDAS)
Layers: Infrastructure, Storage, Data Processing, Application.
Resource Management: share infrastructure across frameworks (multi-programming for datacenters).
Data Management: efficient data sharing across frameworks.
Data Processing: in-memory processing; trade between time, quality, and cost.
Application: new apps such as AMP-Genomics, Carat, …
125
Why BDAS..!!?
 Easy to combine batch, streaming, and interactive computations
 Single execution model that supports all computation models
 Easy to develop sophisticated algorithms
 High level abstractions for graph based, and ML algorithms
 Compatible with existing open source ecosystem (Hadoop/HDFS)
 Interoperate with existing storage and input formats (e.g., HDFS, Hive, Flume, ..)
 Support existing execution models (e.g., Hive, GraphLab)
126
Big Data in Marketing and Sales
127
128
Big Data Analytics in Detection of Marketing Fraud
 Fraud means deliberately deceiving someone
 Ex: mortgaging the same assets to multiple financial institutions, compromising customer data and transferring customer information to a third party, marketing products of compromised quality, …
 Banks and financial services firms use analytics to differentiate fraudulent interactions from legitimate business transactions.
 The analytics systems suggest immediate actions, such as blocking irregular transactions, which stops fraud before it occurs and improves profitability.
129
Big Data and Healthcare
130
131
132
Healthcare analytics using big data can facilitate the following
 Provision of value-based and customer centric healthcare.
 Utilizing the ‘Internet of Things’ for health care.
 Preventing fraud, waste and abuse in the healthcare industry and reducing healthcare costs.
 Improving outcomes.
 Monitoring patients in real time.
133
134
Findings of Big Data in Medicine
 Big data analytics deploys large volumes of data to identify and derive intelligent predictive models about individuals.
 Big data creates patterns and models by data mining and helps in better understanding and research.
 Wearable devices record data during active as well as inactive periods, and this data can be deployed for analysis.
135
136
137
Key factors to take into account while using big data to improve the results of a digital marketing campaign:
Data visualization tools
Use of historical data
Target consumers
Crowdsourcing
Web mining
The real power of big data is the ability to forecast clients’ needs and hence offer them genuine value.
138
End of Module 1

More Related Content

PPTX
BDA_Module1.pptx
PDF
BDA Mod1@AzDOCUMENTS.in.pdf
PPTX
Big data Analytics Fundamentals Chapter 1
PPTX
Bigdata Hadoop introduction
PPTX
Big data Analytics
PDF
BDA-UNIT_1-(Intro & Sources of data & Data Preprocessing).pdf
PPTX
Big Data
PPTX
ch1vsat2k_BDA_Introduction11Jan17-converted.pptx
BDA_Module1.pptx
BDA Mod1@AzDOCUMENTS.in.pdf
Big data Analytics Fundamentals Chapter 1
Bigdata Hadoop introduction
Big data Analytics
BDA-UNIT_1-(Intro & Sources of data & Data Preprocessing).pdf
Big Data
ch1vsat2k_BDA_Introduction11Jan17-converted.pptx

Similar to BDA: Big Data Analytics for Unit-1 Vtu syllabus (20)

PPTX
Big Data ppt
PPTX
big-data-8722-m8RQ3h1.pptx
PPTX
Big data
PPTX
Big data
PPT
Big data
PPTX
Big data analytics
PPTX
Big data road map
PPT
big data
PPTX
Presentation on Big Data
PPTX
Special issues on big data
PPTX
Unit – 1 introduction to big datannj.pptx
PDF
BIG DATA AND HADOOP.pdf
PPTX
Big data Presentation
PPTX
sybca-bigdata-ppt.pptx
PPTX
Kartikey tripathi
PPTX
Foundations of Big Data: Concepts, Techniques, and Applications
PPTX
Unit 1 - Introduction to Big Data and Big Data Analytics.pptx
PPTX
Introduction of big data and analytics
PPTX
Introduction to Big Data
Big Data ppt
big-data-8722-m8RQ3h1.pptx
Big data
Big data
Big data
Big data analytics
Big data road map
big data
Presentation on Big Data
Special issues on big data
Unit – 1 introduction to big datannj.pptx
BIG DATA AND HADOOP.pdf
Big data Presentation
sybca-bigdata-ppt.pptx
Kartikey tripathi
Foundations of Big Data: Concepts, Techniques, and Applications
Unit 1 - Introduction to Big Data and Big Data Analytics.pptx
Introduction of big data and analytics
Introduction to Big Data
Ad

More from Shrinivasa6 (11)

PPT
shortest path algorithms with different examplesppt
PPT
dynamic-programming unit 3 power point presentation
PPTX
Module 2_Chapter 3_HDFS DATA STORAGE.pptx
PPTX
Module 2 Chapter 6 Yet another resource locater.pptx
PPTX
hadoop_Introduction module 2 and chapter 3pptx.pptx
PPTX
Big data analytics Module1 contents pptx
PPTX
Module 2 C2_HadoopEcosystemComponents.pptx
PPTX
Hadoop_Introduction unit-2 for vtu syllabus
PPTX
M4,C5 APACHE PIG.pptx
PPTX
Module-1.pptx63.pptx
PPTX
Hadoop_Introduction_pptx.pptx
shortest path algorithms with different examplesppt
dynamic-programming unit 3 power point presentation
Module 2_Chapter 3_HDFS DATA STORAGE.pptx
Module 2 Chapter 6 Yet another resource locater.pptx
hadoop_Introduction module 2 and chapter 3pptx.pptx
Big data analytics Module1 contents pptx
Module 2 C2_HadoopEcosystemComponents.pptx
Hadoop_Introduction unit-2 for vtu syllabus
M4,C5 APACHE PIG.pptx
Module-1.pptx63.pptx
Hadoop_Introduction_pptx.pptx
Ad

Recently uploaded (20)

PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
web development for engineering and engineering
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
Welding lecture in detail for understanding
PDF
composite construction of structures.pdf
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
DOCX
573137875-Attendance-Management-System-original
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
web development for engineering and engineering
R24 SURVEYING LAB MANUAL for civil enggi
Automation-in-Manufacturing-Chapter-Introduction.pdf
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
UNIT 4 Total Quality Management .pptx
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Welding lecture in detail for understanding
composite construction of structures.pdf
Lecture Notes Electrical Wiring System Components
Internet of Things (IOT) - A guide to understanding
Foundation to blockchain - A guide to Blockchain Tech
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
573137875-Attendance-Management-System-original
UNIT-1 - COAL BASED THERMAL POWER PLANTS

BDA: Big Data Analytics for Unit-1 Vtu syllabus

  • 2. 2 SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 3. 3 SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte. Textbooks: 1. Seema Acharya, Subhashini Chellappan ,“Big Data and Analytics”, Wiley,2017. 2.Alex Holmes,“Big Data Black Book”, Dreamtech,2015.
  • 4. 4 Course Outcomes: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte. Upon completion of this course, students will be able to: 1. Identify the issues and challenges related to Big Data. 2. Choose and apply Big Data technologies and tools in solving real life Big Data problem. 3. Design MapReduce architecture for Big Data problem. 4. Write scripts using Pig and Hive to implement Big Data problem. 5. Derive different Analytics from the Big Data problem.
  • 5. 5 In today’s discussion…  Introduction to data  Data and Big data  Big Data Analytics  Big data- Definition and Meaning.  Types of data  Characteristics of Big data  Big data vs. small data  Tools and techniques SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 6. 6 Introduction to data  Example: 10, 25, …, Nitte, CC3201-1 Anything else?  Data vs. Information 100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0 Is there any information? SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 7. 7 Big Data Definition: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 8. 8 Big Data-Definition: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 9. 9 Example of Big Data: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 10. 10 Big Data-Meaning: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 11. 11 Big Data Analytics- Definition: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 12. 12 Types of Data:  Structured  Unstructured  Semi-structured SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 13. 13 Structured data  Any data that can be stored, accessed and processed in the form of fixed format is termed as a 'structured' data. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 14. 14 Unstructured  Any data with unknown form or the structure is classified as unstructured data.  Example: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 15. 15 Semi-structured  Semi-structured data can contain both the forms of data.  Examples Of Semi-structured Data  Personal data stored in an XML file- <rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec> <rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec> <rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec> <rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec> SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 16. 16 SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 17. 17 How large your data is?  What is the maximum file size you have dealt so far?  Movies/files/streaming video that you have used?  What is the maximum download speed you get?  To retrieve data stored in distant locations?  How fast your computation is?  How much time to just transfer from you, process and get result? SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 18. 18 Growth of data SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 19. 19 Sources of data  “Every day, we create 2.5 quintillion bytes of data  So much that 90% of the data in the world today has been created in the last two years alone. This explosion of information is known as “Big Data,”  The data come from several sources : etc. …… to name a few! SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 20. 20 Social Media: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 21. 21 Now data is Big data!  No single standard definition!  ‘Big-data’ is similar to ‘Small-data’, but bigger …but having data bigger consequently requires different approaches  techniques, tools and architectures …to solve: new problems …and, of course, in a better way Big data is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 22. 22 Characteristics of Big data: 5V’s  5 V's of Big Data:  Volume  Velocity  Variety  Veracity  Value SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 23. 23 Volume(The scale):  Volume in Big data represents the amount of data. In today’s world data is being processed in various formats like word, excel, pdf format, and sometimes in audio and video. These data can be structured , unstructured or semi-structured format. The recent social media platforms produce a tremendous amount of data which is difficult to handle by the organization. To handle this huge amount of data organizations should implement modern business intelligence tools which will capture this data in an effective form and which will be cost-efficient for the organization. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 24. 24 Velocity(The speed):  Velocity refers to the rate/speed at which data is getting generated.  This is primarily due to the Internet of Things (IoT), mobile data, social media, and other factors. At least 2 trillion searches each year, 3.8 million searches per minute, 228 million searches per hour, and 5.6 billion searches per day are now being conducted. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 25. 25 Variety(Data type):  Big Data can be structured, unstructured, and semi- structured that are being collected from different sources. Data will only be collected from databases and sheets in the past, But these days the data will comes in array forms, that are PDFs, Emails, audios, SM posts, photos, videos, etc. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte. Quasi-structured data:The data format contains textual data with inconsistent data formats that are formatted with effort and time with some tools.
  • 26. 26 Veracity:  Degree of trustworthiness of data is the veracity of data.  Veracity means how much the data is reliable. It has many ways to filter or translate the data. Veracity is the process of being able to handle and manage data efficiently. Big Data is also essential in business development.  For example, Facebook posts with hashtags. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 27. 27 Value:  Value is an essential characteristic of big data. It is not the data that we process or store. It is valuable and reliable data that we store, process, and also analyze. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 28. 28 SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 29. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 29 Difference between Small data and Big Data  Mostly structured data  Stored in KB,MB,GB,TB  Increases gradually  Locally present, centralized.  Sql server, oracle.  Single node  Mostly unstructured data  Stored in PB,EB,ZB,YB  Increases exponentially  Globally present, distributed.  Hadoop, spark.  Multi-node cluster.
  • 30. 30 Big data vs. small data  Big data is more real-time in nature than traditional applications  Big data architecture  Traditional architectures are not well-suited for big data applications (e.g. Exa-data, Tera- data)  Massively parallel processing, scale out architectures are well-suited for big data applications SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 31. 31 Major players…  Google  Hadoop  MapReduce  Mahout  Apache Hbase  Cassandra SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 32. 32 Tools available  NoSQL  Databases MongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase, Hypertable, Voldemort, Riak, ZooKeeper  MapReduce  Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum  Storage  S3, HDFS, GDFS  Servers  EC2, Google App Engine, Elastic, Beanstalk, Heroku  Processing  R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 33. 33 SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 34. 34 Job Opportunities in Big Data  Data Analysts  analyze and interpret data, visualize it, and build reports to help make better business decisions.  Data Scientists  mine data by assessing data sources and use algorithms and Machine Learning techniques.  Data Architects  design database systems and tools.  Database Managers  control database system performance, perform troubleshooting, and upgrade hardware and software.  Big Data Engineers  design, maintain, and support Big Data solutions. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 35. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 35 END OF INTRODUCTION
  • 36. 36 Questions of the day… 1. What is the smallest and largest units of measuring size of data? 2. How big a Quintillion measure is? 3. Give the examples of a smallest the largest entities of data. 4. Give FIVE parameters with which data can be categorized as i) simple, ii) Moderately complex and iii) complex? SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 37. Questions of the day… 5. What type of data are involved in the following applications? 1. Weather forecasting 2. Mobile usage of all customers of a service provider 3. Anomaly (e.g. fraud) detection in a bank organization 4. Person categorization, that is, identifying a human 5. Air traffic control in an airport 37 SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 38. 38 Big data types (by IBM)  Social Networks and web data  Transactions data and Business Process data  Customer master data  Machine generated data  Human generated data SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 39. 39 Big Data Classification : based on characteristics for designing data architecture for processing and analytics SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 40. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 40
  • 41. 41 Scalability and Parallel Processing  Scalability is the capability to handle growing amounts of data and growing number of database clients either by adding more hardware resources or by optimization and more efficient usage of the existing resources.  Scalability enables increase or decrease in the capacity of data storage, processing and analytics.  In short, you need to build scalability into the hardware architecture and database selection, and can (for the most part) maximize performance later — during the database design and deployment phase.  System capability needs increment with the increased workloads. When the workload and complexity exceed the system capacity, scale it up and scale it down. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 42. 42 Scalability Options  Assuming you need to scale your system, there are two options:  scaling up  scaling out SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 43. 43 Scale UP  Resources such as CPU, network, and storage are common targets for scaling up.  The goal is to increase the resources supporting your application to reach or maintain adequate performance.  In a hardware-centric world, this might mean adding a larger hard drive to a computer for increased storage capacity.  It might mean replacing the entire computer with a machine that has more CPU and a more performant network interface. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 44. 44 Scale OUT  The scale-out option implies a distributed system whereby additional machines are added to a cluster to provide additional capacity. It's often more likely to yield a linear increase in scalability, although not necessarily increased performance. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 45. 45 Analytics Scalability to Big Data  Vertical scalability means scaling up the given system resources and increasing the system’s analytics, reporting and visualization capabilities.  Ex: designing the algorithm according to the architecture that uses resources efficiently.  Horizontal scalability means increasing the number of systems working in coherence and scaling out the workload.  Ex: using more resources and distributing the storage and processing task in parallel.  Note: Alternative ways for scaling up and out processing of analytics software and big data analytics deploy the Massively Parallel Processing Platforms(MPPs), cloud, grid, clusters and distributed computing software. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 46. 46 Massively Parallel Processing Platforms(MPPs)  Massively parallel processing (MPP) is a collaborative processing of the same program using two or more processors.  By using different processors, speed can be dramatically increased.  For example, imagine a popular insurance company with millions of customers. As the number of customers increases, so does the customer data. Even if the firm uses parallel processing, they may experience a delay in processing customer data. Assume a data analyst is running a query against 100 million rows of a database. If the organization uses a massively parallel processing system with 1000 nodes, each node has to bear only 1/1000 computational load. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 47. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 47 Parallelization of tasks can be done at several levels: • Distributing separate tasks on to separate threads on same CPU. • Distributing separate tasks onto separate CPUs on the same computer. • Distributing separate tasks onto separate computer. There are several types of MPP database architectures • Distributed Computing Model • Cloud Computing • Grid and Cluster Computing • Volunteer Computing
  • 48. 48 Distributed Computing Model  It uses cloud, grid or clusters, which process and analyze big and large datasets on distributed computing nodes connected by high speed network. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 49. 49 Cloud computing  A type of internet-based computing that provides shared processing resources and data to computers and other devices on demand.  One of the best approaches for data processing, as it performs parallel and distributed computing.  Offers high data security compared to other distributed technologies. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 50. 50 Cloud resources  Amazon Web Services (AWS)  Elastic Compute Cloud (EC2)  Microsoft Azure or Apache CloudStack  Amazon Simple Storage Service (S3) SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
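As a hedged illustration of using one of these cloud resources from code, the sketch below uploads a file to Amazon S3 with the boto3 SDK; the bucket name and file paths are hypothetical, and AWS credentials are assumed to be configured in the environment.

import boto3

s3 = boto3.client("s3")

# Upload a local dataset to an S3 bucket (object storage).
s3.upload_file("sales_2017.csv", "my-analytics-bucket", "raw/sales_2017.csv")

# List the objects stored under the same prefix.
response = s3.list_objects_v2(Bucket="my-analytics-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])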
  • 51. 51 Cloud computing features 1. On-demand service 2. Resource pooling 3. Scalability 4. Accountability 5. Broad network Access  Cloud services can be accessed from anywhere and at any time through the internet SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 52. 52 Cloud services types 1. Infrastructure as a Service (IaaS):  Provides access to resources such as hard disks, network connections, database storage, data centres and virtual server space.  Ex: AWS EC2, Rackspace, Google Compute Engine 2. Platform as a Service (PaaS):  Provides a runtime environment to allow developers to build applications and services.  Ex: Windows Azure (mostly used as PaaS), Force.com 3. Software as a Service (SaaS):  Provides software applications as a service to end-users.  Ex: BigCommerce, Google Apps, Salesforce, Dropbox SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 53. 53 Grid and Cluster computing  Grid Computing:  Distributed computing in which a group of computers from several locations are connected with each other to achieve a common task.  Grid: a group of computers that may be spread across remote locations.  This type of computing provides large-scale resource sharing which is flexible, coordinated and secure among its users.  For example, one research team might analyze weather patterns in the North Atlantic region while another team analyzes the South Atlantic region, and the results can be combined to deliver a complete picture of Atlantic weather patterns. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 54. 54 Features of Grid computing  Similar to cloud computing  Scalable  Distributed network for resource integration SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 55. 55 Drawbacks of Grid Computing  Single point of failure  Storage capacity varies with the number of users, instances and the amount of data transferred at a given time SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 56. 56 Cluster computing  Group of computers connected by a network to accomplish the same task.  Used mainly for load balancing SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 57. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 57 Difference between Cluster and Grid Computing:
  • 58. 58 Volunteer Computing  Volunteers are organizations or individuals who own personal computers.  They provide computing resources to important projects that use those resources for distributed computing and/or storage.  Volunteer computing uses the computing resources of the volunteers. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 59. 59 Issues of volunteer computing systems  Heterogeneity of the volunteered computers  Dropouts from the network over time  Sporadic availability  Incorrect results from volunteers are unaccountable, as the volunteers are anonymous SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 60. 60 Designing Data Architecture “Big data architecture is the logical and/or physical layout/structure of how big data will be stored, accessed and managed within a big data or IT environment.”  The architecture logically defines how the big data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security and more.  A data processing architecture consists of five layers:  (i) identification of data sources  (ii) acquisition, ingestion, extraction, pre-processing and transformation of data  (iii) data storage on files, servers, clusters or cloud  (iv) data processing  (v) data consumption SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
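A conceptual Python sketch of the five layers as a simple pipeline follows; every function body is an illustrative placeholder rather than a real implementation.

def identify_sources():                 # Layer 1: identification of data sources
    return ["web_logs", "sensors", "sales_db"]

def ingest(sources):                    # Layer 2: acquisition, ingestion, pre-processing
    return [{"source": s, "records": []} for s in sources]

def store(batches):                     # Layer 3: storage on files, servers, cluster or cloud
    return {"warehouse": batches}

def process(storage):                   # Layer 4: data processing (e.g., aggregation)
    return {"record_count": sum(len(b["records"]) for b in storage["warehouse"])}

def consume(results):                   # Layer 5: data consumption by reports, dashboards, ML
    print("Report:", results)

consume(process(store(ingest(identify_sources()))))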
  • 64. 64 Managing data for Analysis  Data management means enabling, controlling, protecting, delivering and enhancing the value of data and information assets.  Data management covers functions such as data governance, data architecture, storage and operations, data security and data quality management. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 66. 66 Data Sources  Applications, programs and tools use data.  Sources can be external, such as sensors, trackers, web logs, computer-system logs and feeds.  Sources can also be machines, which generate data from data-creating programs.  Data sources can be (i) structured, (ii) semi-structured, (iii) multi-structured or unstructured. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 67. 67 Structured data sources  The source may be on the same computer running the program or on a networked computer.  Examples of structured data sources are SQL Server, MySQL, Oracle DBMS and a file-collection directory at a server.  A structured source has a defined name, which a process uses to identify it.  Ex: a name that identifies stored student grade data during processing could be studentname_data_grades. Then, what could be the name of the data source? SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
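A minimal sketch of querying a structured, named data source with SQL, using Python's built-in sqlite3 module; the table name follows the slide's studentname_data_grades example and the rows are made up.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE studentname_data_grades (name TEXT, subject TEXT, grade REAL)")
conn.executemany(
    "INSERT INTO studentname_data_grades VALUES (?, ?, ?)",
    [("Prashant", "BDA", 82.0), ("Seema", "BDA", 91.5), ("Satish", "BDA", 74.0)],
)

# Fixed-format (structured) data can be queried by column name.
for row in conn.execute("SELECT name, grade FROM studentname_data_grades WHERE grade >= 80"):
    print(row)
conn.close()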
  • 68. 68 Unstructured data sources  Distributed over high-speed networks.  The data need high-velocity processing, as the sources are distributed file systems.  The sources are of file types such as .txt and .csv (comma-separated values).  Data may be key-value pairs, such as hash key-value pairs.  Data may have internal structure, such as in e-mail, Facebook pages, Twitter pages, etc.  Data sources can be sensors, sensor networks, signals from machines, devices and controllers of various types in industry, M2M communication and GPS systems. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
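A small sketch of handling such loosely structured source data in Python: a raw log line is parsed into key-value pairs and a CSV-style sensor record is split into fields; the log format and field names are hypothetical.

raw_log = "ts=2017-08-01T10:15:00 device=pump-07 temp=78.4 status=OK"
record = dict(pair.split("=", 1) for pair in raw_log.split())   # key-value pairs
print(record["device"], record["temp"])

sensor_csv_line = "pump-07,2017-08-01T10:15:00,78.4"            # machine/sensor record
device_id, timestamp, temperature = sensor_csv_line.split(",")
print(device_id, float(temperature))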
  • 69. 69 Data Quality  Data quality is the measure of how well suited a data set is to serve its specific purpose.  High-quality data is data with the five R's. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 70. 70 Data Integrity  Data integrity refers to the fact that data must be reliable and accurate over its entire lifecycle. WHY IS DATA INTEGRITY IMPORTANT?  You need constant access to quality data. Data integrity is important as it guarantees and secures the searchability and traceability of your data to its original source.  Organizations collect more and more data, and it has become a priority to secure and maintain the integrity of this data. Without integrity and accuracy, your data is worthless. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 71. 71 Examples of data quality problems  Noise  Outliers  Missing values  Duplicate data SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 72. 72 Noisy data  For objects, noise is an extraneous object.  For attributes, noise refers to the modification of original values. Here, noise refers to measurement error in data values; it could be random error or systematic error. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 73. 73 Outliers  Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.  They could indicate “interesting” cases, or could indicate errors in the data. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 74. 74 Missing values  Reasons for missing values  Information is not collected (e.g., people decline to give their age)  Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)  Ways to handle missing values  Eliminate entities with missing values  Estimate attributes with missing values  Ignore the missing values during analysis  Replace with all possible values (weighted by their probabilities)  Impute missing values SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
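A minimal pandas sketch of two of the strategies above, dropping rows with missing values and imputing them; the small DataFrame is made up for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name":   ["Prashant", "Seema", "Satish", "Subrato"],
    "age":    [35, np.nan, 29, 26],
    "income": [52000.0, 61000.0, np.nan, 48000.0],
})

dropped = df.dropna()                           # eliminate entities with missing values
imputed = df.fillna({"age": df["age"].mean(),   # estimate/impute missing attribute values
                     "income": df["income"].median()})
print(dropped)
print(imputed)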
  • 75. 75 Duplicate data  Data set may include data entities that are duplicates, or almost duplicates of one another  Major issue when merging data from heterogeneous sources  Example: same person with multiple email addresses.  Data cleaning  Finding and dealing with duplicate entities  Finding and correcting measurement error SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
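A minimal pandas sketch of de-duplication during data cleaning: the same person appears with two e-mail addresses and near-duplicates are collapsed on a chosen key column; the data is made up for illustration.

import pandas as pd

df = pd.DataFrame({
    "name":  ["Seema R.", "Seema R.", "Satish Mane"],
    "email": ["seema@example.com", "seema.r@example.com", "satish@example.com"],
})

exact_duplicates = df[df.duplicated()]                 # rows identical in every column
deduped_by_name  = df.drop_duplicates(subset=["name"]) # keep one record per person
print(deduped_by_name)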
  • 76. 76 Data Preprocessing  It is an important step at the ingestion layer.  It is a must before data mining, analytics or running machine learning algorithms.  Pre-processing needs include data cleaning, enrichment, editing, reduction, wrangling and format conversion (covered in the following slides). SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 77. 77 Data cleaning  Process of removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data after detecting them.  Example:  Correcting grade outliers or mistakenly entered values SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
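A toy Python sketch of the slide's example, correcting grade values that were mistakenly entered outside the valid 0-100 range; the grades list and the trailing-zero-typo assumption are made up for illustration.

grades = [78, 850, 92, -5, 64]          # 850 and -5 are entry mistakes

cleaned = []
for g in grades:
    if g > 100 and g % 10 == 0:         # assume a trailing-zero typo, e.g. 850 -> 85
        g = g // 10
    cleaned.append(min(max(g, 0), 100)) # clip anything still out of range

print(cleaned)                          # [78, 85, 92, 0, 64]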
  • 78. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 78 Important terminologies in data cleaning: data cleaning tools, data enrichment, data editing, data reduction, data wrangling, and data formats used during pre-processing
  • 79. 79 Data cleaning tools  Data cleaning is done before data mining.  Data cleaning tools help in refining and structuring data into usable data.  Example:  OpenRefine  DataCleaner SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 80. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 80 Data Enrichment: refers to operations or processes which refine, enhance or improve raw data. Data Editing: the process of reviewing and adjusting the acquired datasets; it controls data quality. Editing methods are: interactive, selective, automatic, aggregating and distribution.
  • 81. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 81 Data Reduction: enables the transformation of acquired information into an ordered, correct and simplified form, and enables ingestion of meaningful data into the datasets. Basic concept: reduce the multitudinous amount of data and use only its meaningful parts. Uses editing, scaling, coding, sorting, collating, smoothing, interpolating and preparing tabular summaries.
  • 82. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 82 Data wrangling: the process of transforming and mapping the data so that the results from analysis are appropriate and valuable. Example: mapping converts data into another format, which makes it valuable for analytics and data visualizations.
  • 83. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 83 Data formats used during Pre-processing
  • 84. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 84 CSV format Refers to a plain-text file which stores tabular data of numbers and text. Each CSV file line is a data record. Each record consists of one or more fields, separated by commas. CSV files are most often encountered in spreadsheets and databases.
  • 85. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 85 Example: CSV
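A minimal sketch of reading such a file with Python's standard csv module; the file name and column headers are hypothetical.

import csv

with open("students.csv", newline="") as f:
    reader = csv.DictReader(f)          # the first line is treated as the header
    for record in reader:               # each subsequent line is one data record
        print(record["name"], record["grade"])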
  • 86. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 86 Activity : Find out the differences between CSV and Excel file formats
  • 87. 87 Data format conversions  Pre-processing is needed for data-format conversions.  A number of different applications, services and tools need data in a specific format only.  Pre-processing before their use, or before storage on cloud services, is a must. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 88. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 88 Data store export to cloud
  • 89. 89 From the diagram:  It shows data pre-processing, data mining, analysis, visualization and the data store.  The data exports to cloud services.  The results integrate at the enterprise server or data warehouse. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 90. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 90 Cloud services The services can be accessed through a cloud client, such as a web browser, SQL or other client.
  • 91. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 91 Data store export from machines, files, computers, web servers and web services
  • 92. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 92 Export of data to AWS and Rackspace Clouds: Example
  • 94. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 94 Example 2: BigQuery cloud service at Google Cloud Platform
  • 95. 95 Data Storage and Management: Traditional Systems  Data store with structured or semi-structured data.  SQL  RDBMSs use SQL.  It is a language for viewing or changing databases.  SQL supports defining, querying, inserting, updating and deleting data in relational tables. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 96. 96 Data Storage and Analysis SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 97. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 97 DDBMS, enterprise data-store server and data warehouse
  • 98. 98 Distributed database management system (DDBMS)  A collection of logically interrelated databases at multiple systems over a computer network.  Features of a DDBMS are: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 99. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 99 In-memory column format data  Allows faster data retrieval when only a few columns in a table need to be selected for querying  Data in a column are kept together in memory in columnar format  A single memory access therefore loads many values of the column  Used in OLAP
  • 100. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 100 Use of in-memory column format in OLAP OLAP (Online Analytical Processing) in real-time transaction processing is fast when using in-memory column-format tables. Enables real-time analytics. The CPU accesses all values of a column in a single access to the columnar in-memory data storage.
  • 101. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 101 In-memory row format data  A row format in memory allows much faster data processing during OLTP (Online Transaction Processing)
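A toy Python sketch contrasting the two layouts: row format suits OLTP (read or write a whole record), while columnar format suits OLAP (scan one column across all rows); the data is made up for illustration.

row_store = [                                   # one dict per record (row format)
    {"id": 1, "region": "South", "sales": 120.0},
    {"id": 2, "region": "North", "sales": 340.0},
    {"id": 3, "region": "South", "sales": 210.0},
]

column_store = {                                # one list per column (columnar format)
    "id":     [1, 2, 3],
    "region": ["South", "North", "South"],
    "sales":  [120.0, 340.0, 210.0],
}

# OLTP-style access: fetch one full record by key (convenient in row format).
print(next(r for r in row_store if r["id"] == 2))

# OLAP-style access: aggregate a single column over all rows (contiguous in columnar format).
print(sum(column_store["sales"]))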
  • 103. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 103 Enterprise data-store server and data warehouse Enterprise data, after the data cleaning process, integrates with the server data at the warehouse. The enterprise data server uses data from several distributed sources, which store data using various technologies. All the data merges using an integration tool.
  • 104. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 104 Enterprise data integration and management with big data
  • 105. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 105 Big Data storage
  • 106. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 106 Big Data NoSQL or Not Only SQL  NoSQL DBs store semi-structured data  Big data stores use NoSQL  NoSQL stands for No SQL or Not Only SQL  They do not integrate with applications using SQL  NoSQL is also used for cloud data stores
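A toy sketch of the NoSQL document/key-value idea: records are stored as schema-free JSON documents keyed by an id rather than as rows of a fixed-format table queried with SQL; the dict-based store and keys are purely illustrative.

import json

doc_store = {}                                   # key -> JSON document

doc_store["user:101"] = json.dumps({"name": "Seema R.", "age": 41,
                                    "interests": ["analytics", "cloud"]})
doc_store["user:102"] = json.dumps({"name": "Satish Mane",
                                    "city": "Mangalore"})   # a different set of fields is fine

user = json.loads(doc_store["user:101"])
print(user["name"], user.get("interests", []))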
  • 107. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 107 Features of NoSQL
  • 109. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 109 Terminologies  Consistency: all copies have the same value, as in traditional DBs.  Availability: at least one copy is available in case a partition becomes inactive or fails.  Partition: parts which are active but may not cooperate, as in distributed DBs.
  • 110. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 110 Coexistence of Big Data, NoSQL and traditional data stores
  • 111. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 111 Various data sources and examples of usages and tools
  • 113. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 113 BIG DATA PLATFORM Supports large datasets and high volumes of data. The data generates at a higher velocity, in more varieties or with higher veracity. Managing Big Data requires large resources of MPPs, cloud, parallel processing and specialized tools.
  • 114. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 114 A Big Data platform should provide tools and services for storage, processing, analytics, visualization and data management
  • 115. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 115 Hadoop  A Big Data platform consists of Big Data storage, servers, data management and BI software  Storage can deploy HDFS or NoSQL data stores, such as HBase, MongoDB, Cassandra  HDFS is an open-source storage system  A scaling, self-managing and self-healing file system
  • 116. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 116 Hadoop  A scalable and reliable parallel computing platform  Manages Big Data distributed databases
  • 117. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 117 Hadoop based Big data environment
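A plain-Python sketch of the MapReduce programming model that Hadoop runs at scale: map emits (word, 1) pairs, shuffle groups them by key, and reduce sums the counts; this only illustrates the model, not Hadoop itself.

from collections import defaultdict

documents = ["big data needs big storage", "hadoop stores big data"]

# Map phase: emit (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'hadoop': 1, 'stores': 1}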
  • 118. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 118 Mesos  Mesos v0.9 is a resource management platform which enables sharing of cluster nodes by multiple frameworks and which has compatibility with an open analytics stack
  • 119. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 119 Big Data Stack  A stack consists of a set of software components and data store units.  Applications, ML algorithms, analytics and visualization tools use the Big Data Stack (BDS) at a cloud service, such as Amazon EC2, Azure or a private cloud  Uses a cluster of high-performance machines
  • 120. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 120 Tools for Big Data environment
  • 121. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 121 Big Data Analytics  Data analysis is a process of inspecting, cleaning, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
  • 122. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 122 Phases in Analytics
  • 123. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 123 Traditional and Big Data analytics architecture reference model
  • 124. 124 Berkeley Data Analytics Stack (BDAS)  Layers: Infrastructure, Storage, Data Processing, Application, with Resource Management and Data Management across them.  Resource management: share infrastructure across frameworks (multi-programming for datacenters).  Data management: efficient data sharing across frameworks.  Data processing: in-memory processing; trade between time, quality, and cost.  Application: new apps such as AMP-Genomics, Carat, … SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 125. 125 Why BDAS?  Easy to combine batch, streaming, and interactive computations  Single execution model that supports all computation models  Easy to develop sophisticated algorithms  High-level abstractions for graph-based and ML algorithms  Compatible with the existing open source ecosystem (Hadoop/HDFS)  Interoperates with existing storage and input formats (e.g., HDFS, Hive, Flume, …)  Supports existing execution models (e.g., Hive, GraphLab) SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
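A minimal PySpark sketch (Spark is the processing core of BDAS) showing how one in-memory DataFrame can serve batch and interactive queries; the CSV path and column names are hypothetical and a local Spark installation is assumed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bdas-sketch").getOrCreate()

# Load data once; the same DataFrame can feed batch, interactive and ML workloads.
sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

sales.groupBy("region").sum("amount").show()   # interactive aggregation

spark.stop()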
  • 126. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 126 Big Data in Marketing and Sales
  • 128. 128 Big Data Analytics in Detection of Marketing Fraud  Fraud means someone deliberately deceiving another.  Ex: mortgaging the same assets to multiple financial institutions, compromising customer data and transferring customer information to a third party, marketing a product of compromised quality, …  Banks and financial services firms use analytics to differentiate fraudulent interactions from legitimate business transactions.  The analytics systems suggest immediate actions, such as blocking irregular transactions, which stops fraud before it occurs and improves profitability. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
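A toy Python sketch of rule-based flagging of irregular transactions in the spirit of this use case; the thresholds, fields and transactions are all made up, and real systems combine many more signals and statistical or ML models.

transactions = [
    {"id": "T1", "amount": 950.0,   "country": "IN", "home_country": "IN"},
    {"id": "T2", "amount": 98000.0, "country": "RU", "home_country": "IN"},
    {"id": "T3", "amount": 120.0,   "country": "IN", "home_country": "IN"},
]

def is_suspicious(txn, amount_limit=50000.0):
    # Flag unusually large amounts or transactions outside the customer's home country.
    return txn["amount"] > amount_limit or txn["country"] != txn["home_country"]

flagged = [t["id"] for t in transactions if is_suspicious(t)]
print("Blocked for review:", flagged)     # ['T2']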
  • 129. 129 Big Data and Healthcare SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 132. 132 Healthcare analytics using big data can facilitate the following  Provision of value-based and customer-centric healthcare.  Utilizing the ‘Internet of Things’ for healthcare.  Preventing fraud, waste and abuse in the healthcare industry and reducing healthcare costs.  Improving outcomes.  Monitoring patients in real time. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 134. 134 Findings of Big Data in Medicine  Big data analytics deploys large volumes of data to identify and derive intelligent predictive models about individuals.  Big data creates patterns and models by data mining and helps in better understanding and research.  Deploying wearable-device data, which the devices record during active as well as inactive periods. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 137. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 137 Key factors to take into account while using big data to improve the results of a digital marketing campaign: data visualization tools, use of historical data, targeting consumers, crowdsourcing and web mining. The real power of big data is the ability to forecast clients' needs and hence offer real value.
  • 138. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 138 End of Module 1

Editor's Notes

  • #124: … in two fundamental aspects… At the application layer we build new, real applications such as AMP-Genomics, a genomics pipeline, and Carat, an application I'm going to talk about soon. Building real applications allows us to drive the features and design of the lower layers.