Syllabus and Introduction
Big Data Analytics
2
3
Textbooks:
1. Seema Acharya, Subhashini Chellappan, “Big Data and Analytics”, Wiley, 2017.
2. Alex Holmes, “Big Data Black Book”, Dreamtech, 2015.
4
Course Outcomes:
Upon completion of this course, students will be able to:
1. Identify the issues and challenges related to Big Data.
2. Choose and apply Big Data technologies and tools to solve real-life Big Data problems.
3. Design a MapReduce architecture for a Big Data problem.
4. Write scripts using Pig and Hive to solve Big Data problems.
5. Derive different analytics from a Big Data problem.
5
In today’s discussion…
 Introduction to data
 Data and Big data
 Big Data Analytics
 Big data- Definition and Meaning.
 Types of data
 Characteristics of Big data
 Big data vs. small data
 Tools and techniques
6
Introduction to data
 Example:
10, 25, …, Nitte, CC3201-1
Anything else?
 Data vs. Information
100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0
Is there any information?
7
Big Data Definition:
8
Big Data-Definition:
9
Example of Big Data:
10
Big Data-Meaning:
11
Big Data Analytics- Definition:
12
Types of Data:
 Structured
 Unstructured
 Semi-structured
13
Structured data
 Any data that can be stored, accessed and processed in a fixed format is termed ‘structured’ data.
14
Unstructured
 Any data whose form or structure is unknown is classified as unstructured data.
 Example:
15
Semi-structured
 Semi-structured data contains elements of both structured and unstructured data.
 Examples Of Semi-structured Data
 Personal data stored in an XML file-
<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>
<rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec>
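A minimal sketch (not from the slides) of how such semi-structured records can be parsed: the tags describe each field, but no fixed schema is enforced up front. The snippet assumes the records above are wrapped in a single root element.

import xml.etree.ElementTree as ET

# Hypothetical wrapper: the <rec> records above, enclosed in one root element.
xml_text = """<people>
<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
</people>"""

root = ET.fromstring(xml_text)
for rec in root.findall("rec"):
    # Each record is self-describing; fields are read by tag name, not by position.
    print(rec.findtext("name"), rec.findtext("gender"), rec.findtext("age"))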
16
17
How large is your data?
 What is the maximum file size you have dealt with so far?
 Movies/files/streaming video that you have used?
 What is the maximum download speed you get?
 How long does it take to retrieve data stored in distant locations?
 How fast is your computation?
 How much time does it take to transfer the data, process it and get the result?
18
Growth of data
19
Sources of data
 “Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.” This explosion of information is known as “Big Data”.
 The data come from several sources:
etc. …… to name a few!
20
Social Media:
21
Now data is Big data!
 No single standard definition!
 ‘Big data’ is similar to ‘small data’, but bigger
…but having bigger data consequently requires different approaches:
 techniques, tools and architectures
…to solve new problems
…and, of course, to solve them in a better way
Big data is data whose scale, diversity, and complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract value and hidden knowledge from it…
22
Characteristics of Big data: 5V’s
 5 V's of Big Data:
 Volume
 Velocity
 Variety
 Veracity
 Value
23
Volume(The scale):
 Volume in Big Data represents the amount of data. In today’s world, data is produced in various formats such as Word, Excel and PDF documents, and sometimes as audio and video. This data can be in structured, unstructured or semi-structured format. Today’s social media platforms produce a tremendous amount of data, which is difficult for an organization to handle. To handle this huge amount of data, organizations should implement modern business intelligence tools that capture it in an effective and cost-efficient form.
24
Velocity(The speed):
 Velocity refers to the rate/speed at which data is getting generated.
 This is primarily due to the Internet of Things (IoT), mobile data,
social media, and other factors. At least 2 trillion searches each
year, 3.8 million searches per minute, 228 million searches per
hour, and 5.6 billion searches per day are now being conducted.
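As a quick consistency check on these figures: 3.8 million searches per minute is about 228 million per hour (× 60), roughly 5.5 billion per day (× 24) and about 2 trillion per year (× 365), so the quoted numbers agree with one another.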
25
Variety(Data type):
 Big Data can be structured, unstructured or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Quasi-structured data: textual data with inconsistent formats that can be structured only with effort, time and the help of some tools.
26
Veracity:
 Degree of trustworthiness of the data is the veracity of the data.
 Veracity means how reliable the data is. Because sources vary in quality, the data often has to be filtered or translated before it can be trusted; veracity is about being able to handle and manage such data efficiently, which is also essential for business development.
 For example, Facebook posts with hashtags.
27
Value:
 Value is an essential characteristic of Big Data. What matters is not simply the data that we process or store, but the valuable and reliable data that we store, process and analyze.
28
29
Difference between Small data and Big Data
Small data:
 Mostly structured data
 Stored in KB, MB, GB, TB
 Increases gradually
 Locally present, centralized
 SQL Server, Oracle
 Single node
Big Data:
 Mostly unstructured data
 Stored in PB, EB, ZB, YB
 Increases exponentially
 Globally present, distributed
 Hadoop, Spark
 Multi-node cluster
30
Big data vs. small data
 Big data is more real-time in nature than traditional applications
 Big data architecture
 Traditional architectures (e.g. Exadata, Teradata) are not well suited for big data applications
 Massively parallel, scale-out architectures are well suited for big data applications
31
Major players…
 Google
 Hadoop
 MapReduce
 Mahout
 Apache Hbase
 Cassandra
32
Tools available
 NoSQL databases
 MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
 MapReduce
 Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
 Storage
 S3, HDFS, GDFS
 Servers
 EC2, Google App Engine, Elastic Beanstalk, Heroku
 Processing
 R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
33
34
Job Opportunities in Big Data
 Data Analysts
 analyze and interpret data, visualize it, and build reports to help make better business decisions.
 Data Scientists
 mine data by assessing data sources and use algorithms and Machine Learning techniques.
 Data Architects
 design database systems and tools.
 Database Managers
 control database system performance, perform troubleshooting, and upgrade hardware and software.
 Big Data Engineers
 design, maintain, and support Big Data solutions.
35
END OF INTRODUCTION
36
Questions of the day…
1. What is the smallest and largest units of measuring size of data?
2. How big a Quintillion measure is?
3. Give examples of the smallest and the largest entities of data.
4. Give FIVE parameters with which data can be categorized as i) simple, ii) moderately complex and iii) complex.
Questions of the day…
5. What type of data are involved in the following applications?
1. Weather forecasting
2. Mobile usage of all customers of a service provider
3. Anomaly (e.g. fraud) detection in a bank organization
4. Person categorization, that is, identifying a human
5. Air traffic control in an airport
37
38
Big data types (by IBM)
 Social Networks and web data
 Transactions data and Business Process data
 Customer master data
 Machine generated data
 Human generated data
39
Big Data Classification : based on characteristics for
designing data architecture for processing and analytics
40
41
Scalability and Parallel Processing
 Scalability is the capability to handle growing amounts of data and a growing number of database clients, either by adding more hardware resources or by optimizing and using the existing resources more efficiently.
 Scalability enables an increase or decrease in the capacity of data storage, processing and analytics.
 In short, you need to build scalability into the hardware architecture and database selection; performance can (for the most part) be maximized later, during the database design and deployment phase.
 System capability needs to grow with increasing workloads. When the workload and complexity exceed the system’s capacity, scale it up or scale it out.
42
Scalability Options
 Assuming you need to scale your system, there are two options:
 scaling up
 scaling out
43
Scale UP
 Resources such as CPU, network, and storage are
common targets for scaling up.
 The goal is to increase the resources supporting
your application to reach or maintain adequate
performance.
 In a hardware-centric world, this might mean
adding a larger hard drive to a computer for
increased storage capacity.
 It might mean replacing the entire computer with
a machine that has more CPU and a more
performant network interface.
44
Scale OUT
 The scale-out option implies a
distributed system whereby
additional machines are added to a
cluster to provide additional
capacity. It's often more likely to
yield a linear increase in scalability,
although not necessarily increased
performance.
45
Analytics Scalability to Big Data
 Vertical scalability means scaling up the given system resources and increasing the
system’s analytics, reporting and visualization capabilities.
 Ex: designing the algorithm according to the architecture that uses resources
efficiently.
 Horizontal scalability means increasing the number of systems working in
coherence and scaling out the workload.
 Ex: using more resources and distributing the storage and processing task in parallel.
 Note: Alternative ways for scaling up and out processing of analytics software and big
data analytics deploy the Massively Parallel Processing Platforms(MPPs), cloud, grid,
clusters and distributed computing software.
46
Massively Parallel Processing Platforms(MPPs)
 Massively parallel processing (MPP) is a
collaborative processing of the same
program using two or more processors.
 By using different processors, speed can
be dramatically increased.
 For example, imagine a popular insurance company with millions of customers. As the number of customers increases, so does the customer data. Even if the firm uses parallel processing, it may experience delays in processing customer data. Assume a data analyst is running a query against 100 million rows of a database. If the organization uses a massively parallel processing system with 1000 nodes, each node has to bear only 1/1000th of the computational load.
47
Parallelization of tasks can be done at several levels (a minimal sketch follows this list):
• Distributing separate tasks onto separate threads on the same CPU.
• Distributing separate tasks onto separate CPUs on the same computer.
• Distributing separate tasks onto separate computers.
There are several types of MPP database architectures:
• Distributed Computing Model
• Cloud Computing
• Grid and Cluster Computing
• Volunteer Computing
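A minimal, illustrative sketch (not from the slides) of distributing the same task over several worker processes, so each worker bears roughly 1/N of the load, in the spirit of the MPP example above. The data and numbers are invented for illustration.

from multiprocessing import Pool
import random

def count_matches(rows):
    # Stand-in "query": count rows whose value exceeds a threshold.
    return sum(1 for r in rows if r > 0.5)

if __name__ == "__main__":
    data = [random.random() for _ in range(1_000_000)]           # stand-in for the table being queried
    n_workers = 4                                                # stand-in for the nodes of an MPP system
    partitions = [data[i::n_workers] for i in range(n_workers)]  # split the rows across workers
    with Pool(n_workers) as pool:
        partial_counts = pool.map(count_matches, partitions)     # each worker handles ~1/n of the rows
    print(sum(partial_counts))                                   # combine the partial results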
48
Distributed Computing Model
 It uses cloud, grid or clusters to process and analyze big and large datasets on distributed computing nodes connected by a high-speed network.
49
Cloud computing
 Type of internet-based computing that provides shared processing
resources and data to the computers and other devices on demand.
 One of the best approaches for performing parallel and distributed data processing
 Offers high data security compared to other distributed technologies
50
Cloud resources
 Amazon Web Service(AWS)
 Elastic Compute Cloud(EC2)
 Microsoft Azure or Apache CloudStack
 Amazon Simple Storage Service(S3)
51
Cloud computing features
1. On-demand service
2. Resource pooling
3. Scalability
4. Accountability
5. Broad network Access
 Cloud services can be accessed from anywhere and at any time through
the internet
52
Cloud services types
1. Infrastructure as a Service(IaaS):
 Providing access to resources such as hard disks, network connections, database storage, data centres and virtual server space.
 Ex: AWS EC2, Rackspace, Google Compute Engine
2. Platform as a Service(PaaS):
 Providing runtime environment to allow developers to build applications and services.
 Ex: Windows Azure (mostly used as PaaS), Force.com
3. Software as a Service(SaaS):
 Providing software applications as a service to end-users
 Ex: BigCommerce, Google Apps, Salesforce, Dropbox
53
Grid and Cluster computing
 Grid Computing:
 Distributed computing, in which a group of computers from several locations are connected with
each other to achieve a common task.
 Grid: a group of computers that may be spread across remote locations
 This type of computing provides large-scale resource sharing that is flexible, coordinated and secure among its users.
 For example, a research team might analyze weather patterns in the North Atlantic region while another team analyzes the South Atlantic region, and both results can be combined to deliver a complete picture of Atlantic weather patterns.
54
Features of Grid computing
 Similar to cloud computing
 Scalable
 Distributed network for resource integration
55
Drawbacks of Grid Computing
 Single point of failure
 Storage capacity varies with the number of users, instances
and the amount of data transferred at a given time
56
Cluster computing
 Group of computers connected by a network to accomplish the
same task.
 Used mainly for load balancing
57
Difference between Cluster and Grid Computing:
58
Volunteer Computing
 Volunteers are organizations or members who own personal
computers.
 They provide computing resources to important projects that use
resources to do distributed computing and/or storage
 Volunteer Computing: uses computing resources of the volunteers
59
Issues of volunteer computing systems
 Heterogeneity of the volunteered computers
 Drop-outs from the network over time
 Their sporadic availability
 Incorrect results from volunteers are unaccountable, as the volunteers are anonymous
60
Designing Data Architecture
“Big data architecture is the logical and/or physical layout/structure of how big data will be
stored, accessed and managed within a big data or IT environment.”
 Architecture logically defines how the big data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security, and more.
 Data processing architecture consists of 5 layers:
 (i) identification of data sources
 (ii) acquisition, ingestion, extraction, pre-processing, transformation of data
 (iii) data storage at files, servers, cluster or cloud
 (iv) data processing
 (v) data consumption
61
62
63
64
Managing data for Analysis
 Data managing means enabling, controlling, protecting, delivering and enhancing the value of data and information assets.
 Data Management functions include:
65
66
Data Sources
 Applications, programs and tools use data.
 Sources can be external, such as sensors, trackers, web logs,
computer system logs and feeds.
 Sources can be machines, which source data from data-creating
programs.
 Data sources can be i) structured, ii) semi-structured, iii) multi-structured or unstructured.
67
Structured data sources
 The source may be on the same computer running a program or a networked
computer.
 Examples of structured data sources are SQL server, MySQL, Oracle DBMS, file
collection directory at a server.
 The name implies a defined name, which a process uses to identify the source.
 Ex: a name which identifies data on student grades stored for processing; the name could be studentname_data_grades.
Then, what could be the name of the data source?
68
Unstructured data sources
 Distributed over high-speed networks.
 The data needs high-velocity processing, as the sources are distributed file systems.
 The sources are of file types such as .txt and .csv (comma-separated values).
 Data may be in key-value pairs, such as hash key-value pairs.
 Data may have internal structures, such as in e-mails, Facebook pages, Twitter pages, etc.
 Data sources can be sensors, sensor networks, signals from machines, devices and controllers of different types in industry, M2M communication and GPS systems.
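A minimal sketch (an assumption, not from the slides) of turning one such source, a semicolon-separated sensor/log line, into hash-style key-value pairs; the line format here is invented for illustration.

# Hypothetical log line from a sensor or web server (format invented for illustration).
line = "ts=2017-01-11T10:15:00;device=sensor42;temp=27.5;status=OK"

# Split the record into key-value pairs.
record = dict(field.split("=", 1) for field in line.split(";"))
print(record["device"], record["temp"])   # -> sensor42 27.5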
69
Data Quality
 Data quality is the measure of how well suited a data set is to serve its
specific purpose.
 High-quality data is data with the five R’s.
70
Data Integrity
 Data integrity refers to the fact that data must be reliable and accurate over
its entire lifecycle.
WHY IS DATA INTEGRITY IMPORTANT?
 You need to have constant access to quality data. Data integrity is important as it guarantees and secures the searchability and traceability of your data to its original source.
 Organizations collect more and more data and it has become a priority to
secure and maintain the integrity of this data. Without integrity and
accuracy, your data is worthless.
71
Examples of data quality problems
 Noise
 Outliers
 Missing values
 Duplicate data
72
Noisy data
 For objects, noise is considered an extraneous object.
 For attributes, noise refers to the modification of original values.
Here, noise refers to measurement error in data values; it could be random error or systematic error.
73
Outliers
 Outliers are data objects with characteristics that are considerably
different than most of the other data objects in the data set.
 Could indicate “interesting” cases, or could indicate errors in the
data
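A minimal sketch (not from the slides) of one common way to flag such objects, using the 1.5 × interquartile-range rule; the data values, including the suspect 5000.0, are illustrative.

import statistics

values = [100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0, 5000.0]   # 5000.0 is the suspect value

q1, _, q3 = statistics.quantiles(values, n=4)     # quartiles of the data
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr        # the usual 1.5*IQR fences
outliers = [v for v in values if v < low or v > high]
print(outliers)   # values far outside the bulk of the data, here [5000.0]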
74
Missing values
 Reasons for missing values
 Information is not collected (e.g., people decline to give their age)
 Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
 Ways to handle missing values
 Eliminate entities with missing values
 Estimate attributes with missing values
 Ignore the missing values during analysis
 Replace with all possible values (weighted by their probabilities)
 Impute missing values
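A minimal sketch (assuming the pandas library; data invented for illustration) of two of the options listed above: eliminating rows with missing values versus imputing them.

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40], "income": [50000, 62000, None]})

dropped = df.dropna()                             # eliminate entities (rows) with missing values
imputed = df.fillna(df.mean(numeric_only=True))   # impute missing values with the column mean
print(dropped)
print(imputed)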
75
Duplicate data
 Data set may include data entities that are duplicates, or almost
duplicates of one another
 Major issue when merging data from heterogeneous sources
 Example: same person with multiple email addresses.
 Data cleaning
 Finding and dealing with duplicate entities
 Finding and correcting measurement error
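A minimal sketch (assuming pandas; data invented) of finding and dealing with near-duplicate entities, such as the same person under two email addresses, by deduplicating on the name column.

import pandas as pd

people = pd.DataFrame({
    "name":  ["Seema R.", "Seema R.", "Satish Mane"],
    "email": ["seema@x.com", "seema@y.com", "satish@x.com"],
})

dupes   = people[people.duplicated(subset="name", keep=False)]   # flag all records sharing a name
cleaned = people.drop_duplicates(subset="name", keep="first")    # keep one record per person
print(dupes)
print(cleaned)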
76
Data Preprocessing
 It is an important step at the ingestion layer.
 It is a must before data mining, analytics or before running
machine learning algorithms.
 Pre-processing needs are:
77
Data cleaning
 Process of removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data after detecting them.
 Example:
 Correcting the grade outliers or mistakenly entered values
78
Important terminologies in data cleaning
• Data cleaning tools
• Data enrichment
• Data editing
• Data reduction
• Data wrangling
• Data formats used during pre-processing
79
Data cleaning tools
 Data cleaning is done before data mining.
 Data cleaning tools help in refining and structuring data into usable
data.
 Example:
 OpenRefine
 DataCleaner
80
Data Enrichment:
Refers to operations or processes which refine, enhance or improve raw data.
Data Editing:
Process of reviewing and adjusting the acquired datasets; it controls data quality.
Editing methods are:
• Interactive
• Selective
• Automatic
• Aggregating
• Distribution
81
Data Reduction:
Enables the transformation of acquired information into an ordered, correct and simplified form.
Enables ingestion of meaningful data in the datasets.
Basic concept: reduction of the multitudinous amount of data and use of the meaningful parts.
Uses editing, scaling, coding, sorting, collating, smoothening, interpolating and preparing tabular summaries.
82
Data Wrangling:
Process of transforming and mapping the data so that the results of analysis are appropriate and valuable.
Example: mapping transforms data into another format, which makes it valuable for analytics and data visualizations.
83
Data formats used during Pre-processing
84
CSV format
Refers to a plain text file which stores table data of numbers and text.
Each CSV file line is a data record.
Each record consists of one or more fields, separated by commas.
CSV files are most often encountered in spreadsheets and databases.
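A minimal sketch (file name and contents invented) of reading such a file with Python's csv module: each line is one record, and each record's fields are keyed by the header row.

import csv

# Hypothetical file contents of grades.csv:
#   name,grade,marks
#   Prashant Rao,A,85
#   Seema R.,B,72
with open("grades.csv", newline="") as f:
    for record in csv.DictReader(f):      # one record per CSV line
        print(record["name"], record["marks"])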
85
Example: CSV
86
Activity: Find out the differences between CSV and Excel file formats.
87
Data format conversions
 Preprocessing is needed for data-format conversions.
 A number of different applications, services and tools need data in one specific format only.
 Preprocessing before their usage, or before storage on cloud services, is a must.
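A minimal sketch (not from the slides; file names illustrative) of one such format conversion, CSV to JSON, before the data is pushed to a cloud service.

import csv, json

# Convert a hypothetical grades.csv into grades.json before uploading it.
with open("grades.csv", newline="") as src:
    rows = list(csv.DictReader(src))      # list of {column: value} records

with open("grades.json", "w") as dst:
    json.dump(rows, dst, indent=2)        # same records, now in JSON format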
88
Data store export to cloud
89
From the diagram,
 Shows data pre-processing, data mining, analysis, visualization
and data store.
 The data exports to cloud services.
 The results integrate at the enterprise server or data warehouse.
90
Cloud services
The services can be accessed through a cloud client, such as a web browser, SQL or other client.
91
Data store export from machines, files, computers, web servers and web services
92
Export of data to AWS and Rackspace Clouds: Example
93
94
Example 2: BigQuery cloud service at google cloud platform
95
Data Storage and Management: Traditional Systems
 Data Store with structured or semi-structured data.
 SQL
 RDBMS uses SQL.
 It is a language for viewing or changing databases.
 SQL does the following
96
Data Storage and Analysis
97
DDBMS, Enterprise data-store server and data warehouse
98
Distributed database management system(DDBMS)
 Collection of logically interrelated databases at multiple systems over a computer network.
 Features of DDBMS are:
99
In-memory column format data
 Allows faster data retrieval when only a few columns in a table need to be selected for querying
 Data in a column are kept together in memory in columnar format
 A single memory access therefore loads many values of the column
 Used in OLAP
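A minimal, illustrative sketch (not from the slides) of the difference: the same table kept row-wise versus column-wise in memory, where an aggregate over one column touches only that column's values in the columnar layout.

# Row format: one dict per row; reading a single column still walks every row object.
rows = [
    {"id": 1, "price": 10.0, "qty": 3},
    {"id": 2, "price": 12.5, "qty": 1},
    {"id": 3, "price":  7.0, "qty": 5},
]
total_row_format = sum(r["price"] for r in rows)

# Columnar format: each column's values are stored contiguously, so an aggregate
# over one column touches only that list (the idea behind in-memory column stores).
columns = {"id": [1, 2, 3], "price": [10.0, 12.5, 7.0], "qty": [3, 1, 5]}
total_col_format = sum(columns["price"])

print(total_row_format, total_col_format)   # same result, different access pattern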
100
Use of in-memory column formats in OLAP
OLAP (Online Analytical Processing) in real time: processing is fast when using in-memory column-format tables.
Enables real-time analytics.
The CPU accesses all the required columns in a single instance of access to the memory in columnar-format in-memory data storage.
101
In-memory row format data
 A row format in-memory allows much faster data processing during OLTP (Online Transaction Processing)
102
103
Enterprise data-store server and data warehouse
Enterprise data, after the data cleaning process, integrates with the server data at the warehouse.
An enterprise data server uses data from several distributed sources which store data using various technologies.
All data merge using an integration tool.
104
Enterprise data integration and management with big data
105
Big Data storage
106
Big Data NoSQL or Not Only SQL
 NoSQL DBs are semi-structured
 Big data stores use NoSQL
 NoSQL stands for No SQL or Not Only SQL
 They do not integrate with applications using SQL
 NoSQL is also used in cloud data stores
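A minimal, illustrative sketch (not any particular NoSQL product's API) of the schema-less, key-value style of storage that NoSQL data stores use: records with different fields live side by side under their keys.

# Toy in-memory "document store": keys map to semi-structured records
# whose fields need not follow a fixed schema.
store = {}
store["user:1"] = {"name": "Prashant Rao", "age": 35}
store["user:2"] = {"name": "Seema R.", "email": "seema@x.com"}   # different fields, no schema change

for key, doc in store.items():
    print(key, doc.get("name"), doc.get("email", "-"))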
107
Features of NoSQL
108
109
Terminologies
Consistency: all copies have the same value, as in traditional DBs.
Availability: at least one copy is available in case a partition becomes inactive or fails.
Partition: parts which are active but may not cooperate, as in distributed DBs.
110
Coexistence of big data, NoSQL and traditional data stores
111
Various data sources and examples of usages and tools
112
113
BIG DATA PLATFORM
Supports large datasets and volumes of data.
The data generate at a higher velocity, in more varieties or with higher veracity.
Managing Big Data requires large resources of MPPs, cloud, parallel processing and specialized tools.
114
A big data platform should provide tools and services for:
115
Hadoop
 A Big Data platform consists of Big Data storage, servers, and data management and BI software
 Storage can deploy HDFS or NoSQL data stores, such as HBase, MongoDB, Cassandra
 HDFS is an open-source storage system
 A scaling, self-managing and self-healing file system
116
Hadoop
A scalable and reliable parallel computing platform.
Manages Big Data distributed databases.
117
Hadoop based Big data environment
118
Mesos
 Mesos v0.9 is a resource management platform which enables sharing of
cluster nodes by multiple frameworks and which has compatibility with an
open analytics stack
119
Big Data Stack
 A stack consists of a set of software components and data store units.
 Applications, ML algorithms, analytics and visualization tools use the Big Data Stack (BDS) at a cloud service, such as Amazon EC2, Azure or a private cloud
 Uses a cluster of high-performance machines
120
Tools for Big Data environment
121
Big Data Analytics
Data analysis is a process of inspecting, cleaning, transforming and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
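A minimal sketch (assuming pandas; data invented) of that definition in miniature: inspect, clean and summarize a small dataset to support a decision.

import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "amount": [120.0, None, 95.0, 130.0],
})

print(sales.describe())                              # inspect
clean = sales.dropna()                               # clean: drop incomplete records
summary = clean.groupby("region")["amount"].mean()   # model/summarize: average sale per region
print(summary)                                       # information that can inform a decision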
122
Phases in Analytics
123
Traditional and Big Data analytics architecture reference model
124
Berkeley Data Analytics Stack (BDAS)
Layers: Infrastructure, Storage, Data Processing, Application.
Resource Management: share infrastructure across frameworks (multi-programming for datacenters).
Data Management: efficient data sharing across frameworks.
Data Processing: in-memory processing; trade between time, quality, and cost.
Application: new apps such as AMP-Genomics, Carat, …
125
Why BDAS..!!?
 Easy to combine batch, streaming, and interactive computations
 Single execution model that supports all computation models
 Easy to develop sophisticated algorithms
 High level abstractions for graph based, and ML algorithms
 Compatible with existing open source ecosystem (Hadoop/HDFS)
 Interoperate with existing storage and input formats (e.g., HDFS, Hive, Flume, ..)
 Support existing execution models (e.g., Hive, GraphLab)
126
Big Data in Marketing and Sales
127
128
Big Data Analytics in Detection of Marketing Fraud
 Fraud means deliberately deceiving someone
 Ex: mortgaging the same assets to multiple financial institutions, compromising customer data and transferring customer information to a third party, marketing products of compromised quality, …
 Banks and financial services firms use analytics to differentiate fraudulent interactions from legitimate business transactions.
 The analytics systems suggest immediate actions, such as blocking irregular transactions, which stops fraud before it occurs and improves profitability.
129
Big Data and Healthcare
130
131
132
Healthcare analytics using big data can facilitate the following
 Provision of value-based and customer centric healthcare.
 Utilizing the ‘Internet of Things’ for health care.
 Preventing fraud, waste and abuse in the healthcare industry and reducing healthcare costs.
 Improving outcomes.
 Monitoring patients in real time.
133
134
Findings of Big Data in Medicine
 Big data analytics deploys large volumes of data to identify and derive intelligent predictive models about individuals.
 Big data creates patterns and models by data mining and helps in better understanding and research.
 Wearable devices record data during active as well as inactive periods, and this data can be deployed for analysis.
135
136
137
Key factors to take into account while using big data to improve the results of a digital marketing campaign:
Data visualization tools
Use of historical data
Target consumers
Crowdsourcing
Web mining
The real power of big data is the ability to forecast clients’ needs and hence offer them genuine value.
138
End of Module 1

More Related Content

PPTX
BDA_Module1.pptx
PDF
BDA Mod1@AzDOCUMENTS.in.pdf
PPTX
Big data Analytics Fundamentals Chapter 1
PPTX
Bigdata Hadoop introduction
PPTX
Big data Analytics
PDF
BDA-UNIT_1-(Intro & Sources of data & Data Preprocessing).pdf
PPTX
Big Data
PPTX
ch1vsat2k_BDA_Introduction11Jan17-converted.pptx
BDA_Module1.pptx
BDA Mod1@AzDOCUMENTS.in.pdf
Big data Analytics Fundamentals Chapter 1
Bigdata Hadoop introduction
Big data Analytics
BDA-UNIT_1-(Intro & Sources of data & Data Preprocessing).pdf
Big Data
ch1vsat2k_BDA_Introduction11Jan17-converted.pptx

Similar to BDA: Big Data Analytics for Unit-1 Vtu syllabus (20)

PPTX
Big Data ppt
PPTX
big-data-8722-m8RQ3h1.pptx
PPTX
Big data
PPTX
Big data
PPT
Big data
PPTX
Big data analytics
PPTX
Big data road map
PPT
big data
PPTX
Presentation on Big Data
PPTX
Special issues on big data
PPTX
Unit – 1 introduction to big datannj.pptx
PDF
BIG DATA AND HADOOP.pdf
PPTX
Big data Presentation
PPTX
sybca-bigdata-ppt.pptx
PPTX
Kartikey tripathi
PPTX
Foundations of Big Data: Concepts, Techniques, and Applications
PPTX
Unit 1 - Introduction to Big Data and Big Data Analytics.pptx
PPTX
Introduction of big data and analytics
PPTX
Introduction to Big Data
Big Data ppt
big-data-8722-m8RQ3h1.pptx
Big data
Big data
Big data
Big data analytics
Big data road map
big data
Presentation on Big Data
Special issues on big data
Unit – 1 introduction to big datannj.pptx
BIG DATA AND HADOOP.pdf
Big data Presentation
sybca-bigdata-ppt.pptx
Kartikey tripathi
Foundations of Big Data: Concepts, Techniques, and Applications
Unit 1 - Introduction to Big Data and Big Data Analytics.pptx
Introduction of big data and analytics
Introduction to Big Data
Ad

More from Shrinivasa6 (11)

PPT
shortest path algorithms with different examplesppt
PPT
dynamic-programming unit 3 power point presentation
PPTX
Module 2_Chapter 3_HDFS DATA STORAGE.pptx
PPTX
Module 2 Chapter 6 Yet another resource locater.pptx
PPTX
hadoop_Introduction module 2 and chapter 3pptx.pptx
PPTX
Big data analytics Module1 contents pptx
PPTX
Module 2 C2_HadoopEcosystemComponents.pptx
PPTX
Hadoop_Introduction unit-2 for vtu syllabus
PPTX
M4,C5 APACHE PIG.pptx
PPTX
Module-1.pptx63.pptx
PPTX
Hadoop_Introduction_pptx.pptx
shortest path algorithms with different examplesppt
dynamic-programming unit 3 power point presentation
Module 2_Chapter 3_HDFS DATA STORAGE.pptx
Module 2 Chapter 6 Yet another resource locater.pptx
hadoop_Introduction module 2 and chapter 3pptx.pptx
Big data analytics Module1 contents pptx
Module 2 C2_HadoopEcosystemComponents.pptx
Hadoop_Introduction unit-2 for vtu syllabus
M4,C5 APACHE PIG.pptx
Module-1.pptx63.pptx
Hadoop_Introduction_pptx.pptx
Ad

Recently uploaded (20)

PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
web development for engineering and engineering
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
Welding lecture in detail for understanding
PDF
composite construction of structures.pdf
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
DOCX
573137875-Attendance-Management-System-original
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
web development for engineering and engineering
R24 SURVEYING LAB MANUAL for civil enggi
Automation-in-Manufacturing-Chapter-Introduction.pdf
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
UNIT 4 Total Quality Management .pptx
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Welding lecture in detail for understanding
composite construction of structures.pdf
Lecture Notes Electrical Wiring System Components
Internet of Things (IOT) - A guide to understanding
Foundation to blockchain - A guide to Blockchain Tech
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
573137875-Attendance-Management-System-original
UNIT-1 - COAL BASED THERMAL POWER PLANTS

BDA: Big Data Analytics for Unit-1 Vtu syllabus

  • 2. 2 SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 3. 3 SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte. Textbooks: 1. Seema Acharya, Subhashini Chellappan ,“Big Data and Analytics”, Wiley,2017. 2.Alex Holmes,“Big Data Black Book”, Dreamtech,2015.
  • 4. 4 Course Outcomes: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte. Upon completion of this course, students will be able to: 1. Identify the issues and challenges related to Big Data. 2. Choose and apply Big Data technologies and tools in solving real life Big Data problem. 3. Design MapReduce architecture for Big Data problem. 4. Write scripts using Pig and Hive to implement Big Data problem. 5. Derive different Analytics from the Big Data problem.
  • 5. 5 In today’s discussion…  Introduction to data  Data and Big data  Big Data Analytics  Big data- Definition and Meaning.  Types of data  Characteristics of Big data  Big data vs. small data  Tools and techniques SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 6. 6 Introduction to data  Example: 10, 25, …, Nitte, CC3201-1 Anything else?  Data vs. Information 100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0 Is there any information? SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 7. 7 Big Data Definition: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 8. 8 Big Data-Definition: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 9. 9 Example of Big Data: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 10. 10 Big Data-Meaning: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 11. 11 Big Data Analytics- Definition: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 12. 12 Types of Data:  Structured  Unstructured  Semi-structured SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 13. 13 Structured data  Any data that can be stored, accessed and processed in the form of fixed format is termed as a 'structured' data. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 14. 14 Unstructured  Any data with unknown form or the structure is classified as unstructured data.  Example: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 15. 15 Semi-structured  Semi-structured data can contain both the forms of data.  Examples Of Semi-structured Data  Personal data stored in an XML file- <rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec> <rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec> <rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec> <rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec> SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 16. 16 SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 17. 17 How large your data is?  What is the maximum file size you have dealt so far?  Movies/files/streaming video that you have used?  What is the maximum download speed you get?  To retrieve data stored in distant locations?  How fast your computation is?  How much time to just transfer from you, process and get result? SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 18. 18 Growth of data SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 19. 19 Sources of data  “Every day, we create 2.5 quintillion bytes of data  So much that 90% of the data in the world today has been created in the last two years alone. This explosion of information is known as “Big Data,”  The data come from several sources : etc. …… to name a few! SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 20. 20 Social Media: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 21. 21 Now data is Big data!  No single standard definition!  ‘Big-data’ is similar to ‘Small-data’, but bigger …but having data bigger consequently requires different approaches  techniques, tools and architectures …to solve: new problems …and, of course, in a better way Big data is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 22. 22 Characteristics of Big data: 5V’s  5 V's of Big Data:  Volume  Velocity  Variety  Veracity  Value SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 23. 23 Volume(The scale):  Volume in Big data represents the amount of data. In today’s world data is being processed in various formats like word, excel, pdf format, and sometimes in audio and video. These data can be structured , unstructured or semi-structured format. The recent social media platforms produce a tremendous amount of data which is difficult to handle by the organization. To handle this huge amount of data organizations should implement modern business intelligence tools which will capture this data in an effective form and which will be cost-efficient for the organization. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 24. 24 Velocity(The speed):  Velocity refers to the rate/speed at which data is getting generated.  This is primarily due to the Internet of Things (IoT), mobile data, social media, and other factors. At least 2 trillion searches each year, 3.8 million searches per minute, 228 million searches per hour, and 5.6 billion searches per day are now being conducted. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 25. 25 Variety(Data type):  Big Data can be structured, unstructured, and semi- structured that are being collected from different sources. Data will only be collected from databases and sheets in the past, But these days the data will comes in array forms, that are PDFs, Emails, audios, SM posts, photos, videos, etc. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte. Quasi-structured data:The data format contains textual data with inconsistent data formats that are formatted with effort and time with some tools.
  • 26. 26 Veracity:  Degree of trustworthiness of data is the veracity of data.  Veracity means how much the data is reliable. It has many ways to filter or translate the data. Veracity is the process of being able to handle and manage data efficiently. Big Data is also essential in business development.  For example, Facebook posts with hashtags. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 27. 27 Value:  Value is an essential characteristic of big data. It is not the data that we process or store. It is valuable and reliable data that we store, process, and also analyze. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 28. 28 SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 29. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 29 Difference between Small data and Big Data  Mostly structured data  Stored in KB,MB,GB,TB  Increases gradually  Locally present, centralized.  Sql server, oracle.  Single node  Mostly unstructured data  Stored in PB,EB,ZB,YB  Increases exponentially  Globally present, distributed.  Hadoop, spark.  Multi-node cluster.
  • 30. 30 Big data vs. small data  Big data is more real-time in nature than traditional applications  Big data architecture  Traditional architectures are not well-suited for big data applications (e.g. Exa-data, Tera- data)  Massively parallel processing, scale out architectures are well-suited for big data applications SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 31. 31 Major players…  Google  Hadoop  MapReduce  Mahout  Apache Hbase  Cassandra SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 32. 32 Tools available  NoSQL  Databases MongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase, Hypertable, Voldemort, Riak, ZooKeeper  MapReduce  Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum  Storage  S3, HDFS, GDFS  Servers  EC2, Google App Engine, Elastic, Beanstalk, Heroku  Processing  R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 33. 33 SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 34. 34 Job Opportunities in Big Data  Data Analysts  analyze and interpret data, visualize it, and build reports to help make better business decisions.  Data Scientists  mine data by assessing data sources and use algorithms and Machine Learning techniques.  Data Architects  design database systems and tools.  Database Managers  control database system performance, perform troubleshooting, and upgrade hardware and software.  Big Data Engineers  design, maintain, and support Big Data solutions. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 35. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 35 END OF INTRODUCTION
  • 36. 36 Questions of the day… 1. What is the smallest and largest units of measuring size of data? 2. How big a Quintillion measure is? 3. Give the examples of a smallest the largest entities of data. 4. Give FIVE parameters with which data can be categorized as i) simple, ii) Moderately complex and iii) complex? SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 37. Questions of the day… 5. What type of data are involved in the following applications? 1. Weather forecasting 2. Mobile usage of all customers of a service provider 3. Anomaly (e.g. fraud) detection in a bank organization 4. Person categorization, that is, identifying a human 5. Air traffic control in an airport 37 SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 38. 38 Big data types (by IBM)  Social Networks and web data  Transactions data and Business Process data  Customer master data  Machine generated data  Human generated data SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 39. 39 Big Data Classification : based on characteristics for designing data architecture for processing and analytics SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 40. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 40
  • 41. 41 Scalability and Parallel Processing  Scalability is the capability to handle growing amounts of data and growing number of database clients either by adding more hardware resources or by optimization and more efficient usage of the existing resources.  Scalability enables increase or decrease in the capacity of data storage, processing and analytics.  In short, you need to build scalability into the hardware architecture and database selection, and can (for the most part) maximize performance later — during the database design and deployment phase.  System capability needs increment with the increased workloads. When the workload and complexity exceed the system capacity, scale it up and scale it down. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 42. 42 Scalability Options  Assuming you need to scale your system, there are two options:  scaling up  scaling out SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 43. 43 Scale UP  Resources such as CPU, network, and storage are common targets for scaling up.  The goal is to increase the resources supporting your application to reach or maintain adequate performance.  In a hardware-centric world, this might mean adding a larger hard drive to a computer for increased storage capacity.  It might mean replacing the entire computer with a machine that has more CPU and a more performant network interface. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 44. 44 Scale OUT  The scale-out option implies a distributed system whereby additional machines are added to a cluster to provide additional capacity. It's often more likely to yield a linear increase in scalability, although not necessarily increased performance. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 45. 45 Analytics Scalability to Big Data  Vertical scalability means scaling up the given system resources and increasing the system’s analytics, reporting and visualization capabilities.  Ex: designing the algorithm according to the architecture that uses resources efficiently.  Horizontal scalability means increasing the number of systems working in coherence and scaling out the workload.  Ex: using more resources and distributing the storage and processing task in parallel.  Note: Alternative ways for scaling up and out processing of analytics software and big data analytics deploy the Massively Parallel Processing Platforms(MPPs), cloud, grid, clusters and distributed computing software. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 46. 46 Massively Parallel Processing Platforms(MPPs)  Massively parallel processing (MPP) is a collaborative processing of the same program using two or more processors.  By using different processors, speed can be dramatically increased.  For example, imagine a popular insurance company with millions of customers. As the number of customers increases, so does the customer data. Even if the firm uses parallel processing, they may experience a delay in processing customer data. Assume a data analyst is running a query against 100 million rows of a database. If the organization uses a massively parallel processing system with 1000 nodes, each node has to bear only 1/1000 computational load. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 47. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 47 Parallelization of tasks can be done at several levels: • Distributing separate tasks on to separate threads on same CPU. • Distributing separate tasks onto separate CPUs on the same computer. • Distributing separate tasks onto separate computer. There are several types of MPP database architectures • Distributed Computing Model • Cloud Computing • Grid and Cluster Computing • Volunteer Computing
  • 48. 48 Distributed Computing Model  It uses cloud, grid or clusters, which process and analyze big and large datasets on distributed computing nodes connected by high speed network. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 49. 49 Cloud computing  A type of internet-based computing that provides shared processing resources and data to computers and other devices on demand.  One of the best approaches for data processing, as it performs parallel and distributed computing.  Offers high data security compared to other distributed technologies. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 50. 50 Cloud resources  Amazon Web Services (AWS)  Elastic Compute Cloud (EC2)  Microsoft Azure or Apache CloudStack  Amazon Simple Storage Service (S3) SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
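As a hedged illustration of using one of these cloud resources from code, the sketch below uploads a file to Amazon S3 with the boto3 SDK; the bucket name and file paths are hypothetical, and AWS credentials are assumed to be configured in the environment.

import boto3

s3 = boto3.client("s3")

# Upload a local dataset to an S3 bucket (object storage).
s3.upload_file("sales_2017.csv", "my-analytics-bucket", "raw/sales_2017.csv")

# List the objects stored under the same prefix.
response = s3.list_objects_v2(Bucket="my-analytics-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])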
  • 51. 51 Cloud computing features 1. On-demand service 2. Resource pooling 3. Scalability 4. Accountability 5. Broad network Access  Cloud services can be accessed from anywhere and at any time through the internet SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 52. 52 Cloud services types 1. Infrastructure as a Service (IaaS):  Provides access to resources such as hard disks, network connections, database storage, data centres and virtual server space.  Ex: AWS EC2, Rackspace, Google Compute Engine 2. Platform as a Service (PaaS):  Provides a runtime environment to allow developers to build applications and services.  Ex: Windows Azure (mostly used as PaaS), Force.com 3. Software as a Service (SaaS):  Provides software applications as a service to end-users.  Ex: BigCommerce, Google Apps, Salesforce, Dropbox SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 53. 53 Grid and Cluster computing  Grid Computing:  Distributed computing in which a group of computers from several locations are connected with each other to achieve a common task.  Grid: a group of computers that may be spread across remote locations.  This type of computing provides large-scale resource sharing which is flexible, coordinated and secure among its users.  For example, one research team might analyze weather patterns in the North Atlantic region while another team analyzes the South Atlantic region, and the results can be combined to deliver a complete picture of Atlantic weather patterns. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 54. 54 Features of Grid computing  Similar to cloud computing  Scalable  Distributed network for resource integration SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 55. 55 Drawbacks of Grid Computing  Single point of failure  Storage capacity varies with the number of users, instances and the amount of data transferred at a given time SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 56. 56 Cluster computing  Group of computers connected by a network to accomplish the same task.  Used mainly for load balancing SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 57. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 57 Difference between Cluster and Grid Computing:
  • 58. 58 Volunteer Computing  Volunteers are organizations or individuals who own personal computers.  They provide computing resources to important projects that use those resources for distributed computing and/or storage.  Volunteer computing uses the computing resources of the volunteers. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 59. 59 Issues of volunteer computing systems  Heterogeneity of the volunteered computers  Dropouts from the network over time  Sporadic availability  Incorrect results from volunteers are unaccountable, as the volunteers are anonymous SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 60. 60 Designing Data Architecture “Big data architecture is the logical and/or physical layout/structure of how big data will be stored, accessed and managed within a big data or IT environment.”  The architecture logically defines how the big data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security and more.  A data processing architecture consists of five layers:  (i) identification of data sources  (ii) acquisition, ingestion, extraction, pre-processing and transformation of data  (iii) data storage on files, servers, clusters or cloud  (iv) data processing  (v) data consumption SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
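A conceptual Python sketch of the five layers as a simple pipeline follows; every function body is an illustrative placeholder rather than a real implementation.

def identify_sources():                 # Layer 1: identification of data sources
    return ["web_logs", "sensors", "sales_db"]

def ingest(sources):                    # Layer 2: acquisition, ingestion, pre-processing
    return [{"source": s, "records": []} for s in sources]

def store(batches):                     # Layer 3: storage on files, servers, cluster or cloud
    return {"warehouse": batches}

def process(storage):                   # Layer 4: data processing (e.g., aggregation)
    return {"record_count": sum(len(b["records"]) for b in storage["warehouse"])}

def consume(results):                   # Layer 5: data consumption by reports, dashboards, ML
    print("Report:", results)

consume(process(store(ingest(identify_sources()))))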
  • 64. 64 Managing data for Analysis  Data management means enabling, controlling, protecting, delivering and enhancing the value of data and information assets.  Data management covers functions such as data governance, data architecture, storage and operations, data security and data quality management. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 66. 66 Data Sources  Applications, programs and tools use data.  Sources can be external, such as sensors, trackers, web logs, computer-system logs and feeds.  Sources can also be machines, which generate data from data-creating programs.  Data sources can be (i) structured, (ii) semi-structured, (iii) multi-structured or unstructured. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 67. 67 Structured data sources  The source may be on the same computer running the program or on a networked computer.  Examples of structured data sources are SQL Server, MySQL, Oracle DBMS and a file-collection directory at a server.  A structured source has a defined name, which a process uses to identify it.  Ex: a name that identifies stored student grade data during processing could be studentname_data_grades. Then, what could be the name of the data source? SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
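A minimal sketch of querying a structured, named data source with SQL, using Python's built-in sqlite3 module; the table name follows the slide's studentname_data_grades example and the rows are made up.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE studentname_data_grades (name TEXT, subject TEXT, grade REAL)")
conn.executemany(
    "INSERT INTO studentname_data_grades VALUES (?, ?, ?)",
    [("Prashant", "BDA", 82.0), ("Seema", "BDA", 91.5), ("Satish", "BDA", 74.0)],
)

# Fixed-format (structured) data can be queried by column name.
for row in conn.execute("SELECT name, grade FROM studentname_data_grades WHERE grade >= 80"):
    print(row)
conn.close()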
  • 68. 68 Unstructured data sources  Distributed over high-speed networks.  The data need high-velocity processing, as the sources are distributed file systems.  The sources are of file types such as .txt and .csv (comma-separated values).  Data may be key-value pairs, such as hash key-value pairs.  Data may have internal structure, such as in e-mail, Facebook pages, Twitter pages, etc.  Data sources can be sensors, sensor networks, signals from machines, devices and controllers of various types in industry, M2M communication and GPS systems. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
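A small sketch of handling such loosely structured source data in Python: a raw log line is parsed into key-value pairs and a CSV-style sensor record is split into fields; the log format and field names are hypothetical.

raw_log = "ts=2017-08-01T10:15:00 device=pump-07 temp=78.4 status=OK"
record = dict(pair.split("=", 1) for pair in raw_log.split())   # key-value pairs
print(record["device"], record["temp"])

sensor_csv_line = "pump-07,2017-08-01T10:15:00,78.4"            # machine/sensor record
device_id, timestamp, temperature = sensor_csv_line.split(",")
print(device_id, float(temperature))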
  • 69. 69 Data Quality  Data quality is the measure of how well suited a data set is to serve its specific purpose.  High-quality data is data with the five R's. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 70. 70 Data Integrity  Data integrity refers to the fact that data must be reliable and accurate over its entire lifecycle. WHY IS DATA INTEGRITY IMPORTANT?  You need constant access to quality data. Data integrity is important as it guarantees and secures the searchability and traceability of your data to its original source.  Organizations collect more and more data, and it has become a priority to secure and maintain the integrity of this data. Without integrity and accuracy, your data is worthless. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 71. 71 Examples of data quality problems  Noise  Outliers  Missing values  Duplicate data SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 72. 72 Noisy data  For objects, noise is an extraneous object.  For attributes, noise refers to the modification of original values. Here, noise refers to measurement error in data values; it could be random error or systematic error. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 73. 73 Outliers  Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.  They could indicate “interesting” cases, or could indicate errors in the data. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 74. 74 Missing values  Reasons for missing values  Information is not collected (e.g., people decline to give their age)  Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)  Ways to handle missing values  Eliminate entities with missing values  Estimate attributes with missing values  Ignore the missing values during analysis  Replace with all possible values (weighted by their probabilities)  Impute missing values SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
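A minimal pandas sketch of two of the strategies above, dropping rows with missing values and imputing them; the small DataFrame is made up for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name":   ["Prashant", "Seema", "Satish", "Subrato"],
    "age":    [35, np.nan, 29, 26],
    "income": [52000.0, 61000.0, np.nan, 48000.0],
})

dropped = df.dropna()                           # eliminate entities with missing values
imputed = df.fillna({"age": df["age"].mean(),   # estimate/impute missing attribute values
                     "income": df["income"].median()})
print(dropped)
print(imputed)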
  • 75. 75 Duplicate data  Data set may include data entities that are duplicates, or almost duplicates of one another  Major issue when merging data from heterogeneous sources  Example: same person with multiple email addresses.  Data cleaning  Finding and dealing with duplicate entities  Finding and correcting measurement error SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
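A minimal pandas sketch of de-duplication during data cleaning: the same person appears with two e-mail addresses and near-duplicates are collapsed on a chosen key column; the data is made up for illustration.

import pandas as pd

df = pd.DataFrame({
    "name":  ["Seema R.", "Seema R.", "Satish Mane"],
    "email": ["seema@example.com", "seema.r@example.com", "satish@example.com"],
})

exact_duplicates = df[df.duplicated()]                 # rows identical in every column
deduped_by_name  = df.drop_duplicates(subset=["name"]) # keep one record per person
print(deduped_by_name)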
  • 76. 76 Data Preprocessing  It is an important step at the ingestion layer.  It is a must before data mining, analytics or running machine learning algorithms.  Pre-processing needs include data cleaning, enrichment, editing, reduction, wrangling and format conversion (covered in the following slides). SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 77. 77 Data cleaning  Process of removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data after detecting them.  Example:  Correcting grade outliers or mistakenly entered values SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
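A toy Python sketch of the slide's example, correcting grade values that were mistakenly entered outside the valid 0-100 range; the grades list and the trailing-zero-typo assumption are made up for illustration.

grades = [78, 850, 92, -5, 64]          # 850 and -5 are entry mistakes

cleaned = []
for g in grades:
    if g > 100 and g % 10 == 0:         # assume a trailing-zero typo, e.g. 850 -> 85
        g = g // 10
    cleaned.append(min(max(g, 0), 100)) # clip anything still out of range

print(cleaned)                          # [78, 85, 92, 0, 64]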
  • 78. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 78 Important terminologies in data cleaning: data cleaning tools, data enrichment, data editing, data reduction, data wrangling, and data formats used during pre-processing
  • 79. 79 Data cleaning tools  Data cleaning is done before data mining.  Data cleaning tools help in refining and structuring data into usable data.  Example:  OpenRefine  DataCleaner SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 80. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 80 Data Enrichment: refers to operations or processes which refine, enhance or improve raw data. Data Editing: the process of reviewing and adjusting the acquired datasets; it controls data quality. Editing methods are: interactive, selective, automatic, aggregating and distribution.
  • 81. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 81 Data Reduction: enables the transformation of acquired information into an ordered, correct and simplified form, and enables ingestion of meaningful data into the datasets. Basic concept: reduce the multitudinous amount of data and use only its meaningful parts. Uses editing, scaling, coding, sorting, collating, smoothing, interpolating and preparing tabular summaries.
  • 82. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 82 Data wrangling: the process of transforming and mapping the data so that the results from analysis are appropriate and valuable. Example: mapping converts data into another format, which makes it valuable for analytics and data visualizations.
  • 83. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 83 Data formats used during Pre-processing
  • 84. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 84 CSV format Refers to a plain-text file which stores tabular data of numbers and text. Each CSV file line is a data record. Each record consists of one or more fields, separated by commas. CSV files are most often encountered in spreadsheets and databases.
  • 85. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 85 Example: CSV
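A minimal sketch of reading such a file with Python's standard csv module; the file name and column headers are hypothetical.

import csv

with open("students.csv", newline="") as f:
    reader = csv.DictReader(f)          # the first line is treated as the header
    for record in reader:               # each subsequent line is one data record
        print(record["name"], record["grade"])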
  • 86. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 86 Activity : Find out the differences between CSV and Excel file formats
  • 87. 87 Data format conversions  Pre-processing is needed for data-format conversions.  A number of different applications, services and tools need data in a specific format only.  Pre-processing before their use, or before storage on cloud services, is a must. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 88. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 88 Data store export to cloud
  • 89. 89 From the diagram:  It shows data pre-processing, data mining, analysis, visualization and the data store.  The data exports to cloud services.  The results integrate at the enterprise server or data warehouse. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 90. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 90 Cloud services The services can be accessed through a cloud client, such as a web browser, SQL or other client.
  • 91. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 91 Data store export from machines, files, computers, web servers and web services
  • 92. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 92 Export of data to AWS and Rackspace Clouds: Example
  • 94. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 94 Example 2: BigQuery cloud service at Google Cloud Platform
  • 95. 95 Data Storage and Management: Traditional Systems  Data store with structured or semi-structured data.  SQL  RDBMSs use SQL.  It is a language for viewing or changing databases.  SQL supports defining, querying, inserting, updating and deleting data in relational tables. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 96. 96 Data Storage and Analysis SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 97. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 97 DDBMS, enterprise data-store server and data warehouse
  • 98. 98 Distributed database management system (DDBMS)  A collection of logically interrelated databases at multiple systems over a computer network.  Features of a DDBMS are: SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 99. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 99 In-memory column format data  Allows faster data retrieval when only a few columns in a table need to be selected for querying  Data in a column are kept together in memory in columnar format  A single memory access therefore loads many values of the column  Used in OLAP
  • 100. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 100 Use of in-memory column format in OLAP OLAP (Online Analytical Processing) in real-time transaction processing is fast when using in-memory column-format tables. Enables real-time analytics. The CPU accesses all values of a column in a single access to the columnar in-memory data storage.
  • 101. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 101 In-memory row format data  A row format in memory allows much faster data processing during OLTP (Online Transaction Processing)
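A toy Python sketch contrasting the two layouts: row format suits OLTP (read or write a whole record), while columnar format suits OLAP (scan one column across all rows); the data is made up for illustration.

row_store = [                                   # one dict per record (row format)
    {"id": 1, "region": "South", "sales": 120.0},
    {"id": 2, "region": "North", "sales": 340.0},
    {"id": 3, "region": "South", "sales": 210.0},
]

column_store = {                                # one list per column (columnar format)
    "id":     [1, 2, 3],
    "region": ["South", "North", "South"],
    "sales":  [120.0, 340.0, 210.0],
}

# OLTP-style access: fetch one full record by key (convenient in row format).
print(next(r for r in row_store if r["id"] == 2))

# OLAP-style access: aggregate a single column over all rows (contiguous in columnar format).
print(sum(column_store["sales"]))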
  • 103. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 103 Enterprise data-store server and data warehouse Enterprise data, after the data cleaning process, integrates with the server data at the warehouse. The enterprise data server uses data from several distributed sources, which store data using various technologies. All the data merges using an integration tool.
  • 104. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 104 Enterprise data integration and management with big data
  • 105. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 105 Big Data storage
  • 106. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 106 Big Data NoSQL or Not Only SQL  NoSQL DBs store semi-structured data  Big data stores use NoSQL  NoSQL stands for No SQL or Not Only SQL  They do not integrate with applications using SQL  NoSQL is also used for cloud data stores
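A toy sketch of the NoSQL document/key-value idea: records are stored as schema-free JSON documents keyed by an id rather than as rows of a fixed-format table queried with SQL; the dict-based store and keys are purely illustrative.

import json

doc_store = {}                                   # key -> JSON document

doc_store["user:101"] = json.dumps({"name": "Seema R.", "age": 41,
                                    "interests": ["analytics", "cloud"]})
doc_store["user:102"] = json.dumps({"name": "Satish Mane",
                                    "city": "Mangalore"})   # a different set of fields is fine

user = json.loads(doc_store["user:101"])
print(user["name"], user.get("interests", []))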
  • 107. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 107 Features of NoSQL
  • 109. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 109 Terminologies  Consistency: all copies have the same value, as in traditional DBs.  Availability: at least one copy is available in case a partition becomes inactive or fails.  Partition: parts which are active but may not cooperate, as in distributed DBs.
  • 110. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 110 Coexistence of Big Data, NoSQL and traditional data stores
  • 111. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 111 Various data sources and examples of usages and tools
  • 113. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 113 BIG DATA PLATFORM Supports large datasets and high volumes of data. The data generates at a higher velocity, in more varieties or with higher veracity. Managing Big Data requires large resources of MPPs, cloud, parallel processing and specialized tools.
  • 114. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 114 A Big Data platform should provide tools and services for storage, processing, analytics, visualization and data management
  • 115. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 115 Hadoop  A Big Data platform consists of Big Data storage, servers, data management and BI software  Storage can deploy HDFS or NoSQL data stores, such as HBase, MongoDB, Cassandra  HDFS is an open-source storage system  A scaling, self-managing and self-healing file system
  • 116. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 116 Hadoop  A scalable and reliable parallel computing platform  Manages Big Data distributed databases
  • 117. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 117 Hadoop based Big data environment
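A plain-Python sketch of the MapReduce programming model that Hadoop runs at scale: map emits (word, 1) pairs, shuffle groups them by key, and reduce sums the counts; this only illustrates the model, not Hadoop itself.

from collections import defaultdict

documents = ["big data needs big storage", "hadoop stores big data"]

# Map phase: emit (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'hadoop': 1, 'stores': 1}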
  • 118. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 118 Mesos  Mesos v0.9 is a resource management platform which enables sharing of cluster nodes by multiple frameworks and which has compatibility with an open analytics stack
  • 119. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 119 Big Data Stack  A stack consists of a set of software components and data store units.  Applications, ML algorithms, analytics and visualization tools use the Big Data Stack (BDS) at a cloud service, such as Amazon EC2, Azure or a private cloud  Uses a cluster of high-performance machines
  • 120. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 120 Tools for Big Data environment
  • 121. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 121 Big Data Analytics  Data analysis is a process of inspecting, cleaning, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
  • 122. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 122 Phases in Analytics
  • 123. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 123 Traditional and Big Data analytics architecture reference model
  • 124. 124 Berkeley Data Analytics Stack (BDAS)  Layers: Infrastructure, Storage, Data Processing, Application, with Resource Management and Data Management across them.  Resource management: share infrastructure across frameworks (multi-programming for datacenters).  Data management: efficient data sharing across frameworks.  Data processing: in-memory processing; trade between time, quality, and cost.  Application: new apps such as AMP-Genomics, Carat, … SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 125. 125 Why BDAS?  Easy to combine batch, streaming, and interactive computations  Single execution model that supports all computation models  Easy to develop sophisticated algorithms  High-level abstractions for graph-based and ML algorithms  Compatible with the existing open source ecosystem (Hadoop/HDFS)  Interoperates with existing storage and input formats (e.g., HDFS, Hive, Flume, …)  Supports existing execution models (e.g., Hive, GraphLab) SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
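A minimal PySpark sketch (Spark is the processing core of BDAS) showing how one in-memory DataFrame can serve batch and interactive queries; the CSV path and column names are hypothetical and a local Spark installation is assumed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bdas-sketch").getOrCreate()

# Load data once; the same DataFrame can feed batch, interactive and ML workloads.
sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

sales.groupBy("region").sum("amount").show()   # interactive aggregation

spark.stop()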
  • 126. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 126 Big Data in Marketing and Sales
  • 128. 128 Big Data Analytics in Detection of Marketing Fraud  Fraud means someone deliberately deceiving another.  Ex: mortgaging the same assets to multiple financial institutions, compromising customer data and transferring customer information to a third party, marketing a product of compromised quality, …  Banks and financial services firms use analytics to differentiate fraudulent interactions from legitimate business transactions.  The analytics systems suggest immediate actions, such as blocking irregular transactions, which stops fraud before it occurs and improves profitability. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
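A toy Python sketch of rule-based flagging of irregular transactions in the spirit of this use case; the thresholds, fields and transactions are all made up, and real systems combine many more signals and statistical or ML models.

transactions = [
    {"id": "T1", "amount": 950.0,   "country": "IN", "home_country": "IN"},
    {"id": "T2", "amount": 98000.0, "country": "RU", "home_country": "IN"},
    {"id": "T3", "amount": 120.0,   "country": "IN", "home_country": "IN"},
]

def is_suspicious(txn, amount_limit=50000.0):
    # Flag unusually large amounts or transactions outside the customer's home country.
    return txn["amount"] > amount_limit or txn["country"] != txn["home_country"]

flagged = [t["id"] for t in transactions if is_suspicious(t)]
print("Blocked for review:", flagged)     # ['T2']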
  • 129. 129 Big Data and Healthcare SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 132. 132 Healthcare analytics using big data can facilitate the following  Provision of value-based and customer-centric healthcare.  Utilizing the ‘Internet of Things’ for healthcare.  Preventing fraud, waste and abuse in the healthcare industry and reducing healthcare costs.  Improving outcomes.  Monitoring patients in real time. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 134. 134 Findings of Big Data in Medicine  Big data analytics deploys large volumes of data to identify and derive intelligent predictive models about individuals.  Big data creates patterns and models by data mining and helps in better understanding and research.  Deploying wearable-device data, which the devices record during active as well as inactive periods. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMAMIT, Nitte.
  • 137. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 137 Key factors to take into account while using big data to improve the results of a digital marketing campaign: data visualization tools, use of historical data, targeting consumers, crowdsourcing and web mining. The real power of big data is the ability to forecast clients' needs and hence offer real value.
  • 138. SHRINIVASA, Assistant Professor Gd.-III , Dept. of CCE, NMA MIT, Nitte. 138 End of Module 1

Editor's Notes

  • #124: … in two fundamental aspects… At the application layer we build new, real applications such as AMP-Genomics, a genomics pipeline, and Carat, an application I'm going to talk about soon. Building real applications allows us to drive the features and design of the lower layers.