Big data presentation (2014)

© 2014 IBM Corporation
Big Data
Xavier Constant
xavier.constant@es.ibm.com
Lecture at EADA
International Master in Marketing (2014)

 Big Data Concepts
 Big Data Technology
 Data Scientists

Traditional DW
[Diagram labels: operational systems, ERP, CRM, ETL (two stages), data marts, BI server, reports / dashboards]
BENEFITS:
 Mature Technology
 SQL Language (declarative, non technical)
 Skills & resources availability (programmers, DBAs,…)
LIMITATIONS:
 Big operational data volumes
– Queries take too long or don’t even finish
– Admin complexity (partitions, archiving,…)
 New data types
– Free text, images, video, audio,…
– Data in real time (sensors, logs, geospatial data, etc.)
 New analysis types
– Exploratory
– Predictive
[Diagram labels, continued: flat files, spreadsheets, data warehouse(s)]

1 in 2 business leaders don’t have access to data they need
83% of CIOs cited BI and analytics as part of their visionary plan
5.4X more likely that top performers use business analytics
80% of the world’s data today is unstructured
90% of the world’s data was created in the last two years
20% of available data can be processed by traditional systems
Source: GigaOM, Software Group, IBM Institute for Business Value
Intrinsic Property of Data … it grows

Characteristics of Big Data
Velocity is the game changer: it’s NOT just how fast data is produced or changed, BUT the speed at which it must be received, understood, and processed.

Paradigm shifts enabled by big data I
Leverage more of the data being captured

Bank X

Paradigm shifts enabled by big data II
Reduce effort required to leverage data

Paradigm shifts enabled by big data III
Data leads the way – and sometimes correlations are good enough

Hypothesis-based correlation vs. weird correlation

Paradigm shifts enabled by big data IV
Leverage data as it is captured

Complementary Analytics
Traditional Approach: structured, analytical, logical
– Characteristics: structured, repeatable, linear
– Platforms: data warehouse, traditional databases
– Sources: transaction data, internal app data, mainframe data, OLTP system data, ERP data
New Approach: creative, holistic thought, intuition
– Characteristics: unstructured, exploratory, dynamic
– Platforms: Hadoop and Streams
– New sources: multimedia, web logs, social data, sensor data (images), RFID, text data (emails)

Types of Analytic Tools

Organisations are prioritising internal data sources
Untapped stores of internal data
 Size and scope of some internal data, such as
detailed transactions and operational log data,
have become too large and varied to manage
within traditional systems
 New infrastructure components make them
accessible for analysis
 Some data has been collected, but not
analyzed, for years
Focus on customer insights
 Customers – influenced by digital experiences
– often expect information provided to an
organization will then be “known” during future
interactions
 Combining disparate internal sources with
advanced analytics creates insights into
customer behavior and preferences
(Transactions, Emails, Call center interaction records)
Big data sources
Respondents were asked which data sources are currently being collected and analyzed as part of active big data efforts within their organization.

Stages of Big Data adoption
Big data adoption
When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors.
Total respondents n = 1061. Totals do not equal 100% due to rounding.

Hadoop workloads
[Bar chart: share of respondents using Hadoop for each workload, today vs. in 18 months]
Workloads: staging area, online archive, transformation engine, ad hoc queries, scheduled reports, visual exploration, data mining
Values in extraction order: 92%, 92%, 83%, 58%, 42%, 25%, 58%, 92%, 92%, 92%, 67%, 67%, 67%, 83%
Based on respondents that have implemented Hadoop. BI Leadership Forum, April 2012

Key Big Data Use Cases
 Big Data Exploration: find, visualize, understand all big data to improve decision making
 Enhanced 360° View of the Customer: extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
 Operations Analysis: analyze a variety of machine data for improved business results
 Data Warehouse Modernization: integrate big data and data warehouse capabilities to increase operational efficiency
 Security/Intelligence Extension: lower risk, detect fraud and monitor cyber security in real time

 Data Scientists

Solution for Big Data
Data at rest:
– Data to analyze are already stored (structured and unstructured)
– Examples: logs, Facebook, Twitter, etc.
– Solution: Hadoop (open source)
Data in motion:
– Data are analyzed in real time, at the moment they are generated, without any previous storage
– Examples: sensors, RFID, etc.
– Solution: Streams / CEP solutions
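The difference between the two styles can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, and neither Hadoop nor a streaming product is involved. The batch method scans readings that are already stored, while the running average updates its result as each reading arrives and keeps no history.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: the same average computed in two ways.
class RestVersusMotion {

    // Data at rest: all readings are already stored, so a batch job scans them.
    static double batchAverage(List<Double> storedReadings) {
        double sum = 0;
        for (double r : storedReadings) {
            sum += r;
        }
        return storedReadings.isEmpty() ? 0 : sum / storedReadings.size();
    }

    // Data in motion: each reading updates a running result and is then discarded.
    static class RunningAverage {
        private long count = 0;
        private double mean = 0;

        void accept(double reading) {
            count++;
            mean += (reading - mean) / count; // incremental mean, no history kept
        }

        double current() {
            return mean;
        }
    }

    public static void main(String[] args) {
        List<Double> readings = Arrays.asList(20.0, 22.0, 24.0);
        System.out.println("batch: " + batchAverage(readings));

        RunningAverage live = new RunningAverage();
        for (double r : readings) {
            live.accept(r);
            System.out.println("after " + r + ": " + live.current());
        }
    }
}
```

Both approaches end at the same value; the streaming version simply has an answer available after every reading.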

Hardware improvements through the years...
 CPU Speeds:
– 1990 - 44 MIPS at 40 MHz
– 2000 - 3,561 MIPS at 1.2 GHz
– 2010 - 147,600 MIPS at 3.3 GHz
 RAM Memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2000 – 64MB memory
– 2010 - 8-32GB (and more)
 Disk Capacity
– 1990 – 20MB
– 2000 - 1GB
– 2010 – 1TB
 Disk transfer rate (speed of reads and writes): not much improvement in the last 7-10 years, currently around 70-80 MB/s

How long will it take to read 1 TB of data?
 1 TB (at 80 MB/s):
– 1 disk - 3.4 hours
– 10 disks - 20 min
– 100 disks - 2 min
– 1000 disks - 12 sec
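The figures above can be checked with a back-of-the-envelope calculation. The sketch below assumes 1 TB = 1,000,000 MB, a sustained 80 MB/s per disk, and data spread evenly so that all disks read in parallel; the class and method names are made up for the example.

```java
// Rough estimate of the time needed to read a data set from N disks in parallel.
// Assumptions: 1 TB = 1,000,000 MB, each disk sustains the same rate,
// and the data is spread evenly across the disks.
class ReadTimeEstimate {

    static double secondsToRead(double totalMegabytes, double mbPerSecondPerDisk, int disks) {
        return totalMegabytes / (mbPerSecondPerDisk * disks);
    }

    public static void main(String[] args) {
        double oneTerabyteInMb = 1_000_000.0;
        for (int disks : new int[] {1, 10, 100, 1000}) {
            double seconds = secondsToRead(oneTerabyteInMb, 80.0, disks);
            System.out.println(disks + " disk(s): " + seconds + " s, " + (seconds / 3600.0) + " h");
        }
    }
}
```

With one disk this gives 12,500 seconds, roughly three and a half hours; with 1000 disks it gives 12.5 seconds, which is the motivation for spreading data over many disks.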

Parallel Data Processing is the answer!
 It has been with us for a while:
– Grid computing: spreads the processing load
– Distributed workload: applications are hard to manage, with overhead for the developer
– Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)

What is Apache Hadoop?
 Apache Open source software framework.
 Flexible, enterprise-class support for processing large volumes of
data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
• Originally built to address scalability problems of Nutch, an open source Web search
technology
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data
 Enables applications to work with thousands of nodes and petabytes
of data in a highly parallel, cost effective manner
– CPU + local disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written

Design principles of Hadoop
 New way of storing and processing the data:
– Let system handle most of the issues automatically:
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of operating system
• Meant for heterogeneous commodity hardware
 Bring processing to Data!
 Hadoop = HDFS + MapReduce infrastructure
 Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
 Reliability provided through replication

What is the Hadoop Distributed File System?
 Driving principles
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the Divide and Conquer paradigm.
 Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well for resiliency
[Figure: a logical file is split into blocks 1 to 4; each block appears three times across the cluster]
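The idea of splitting a file into blocks and replicating each block can be sketched as a toy model. Everything here is an assumption made for illustration: the block size, the file size, the node count and the round-robin placement. Real HDFS placement is decided by the NameNode and takes racks into account, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: not the HDFS API and not its real placement policy.
class BlockPlacementSketch {

    // Number of fixed-size blocks needed to hold a file (the last block may be partial).
    static int blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (int) ((fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes);
    }

    // Assign each block to `replicas` distinct nodes, round-robin.
    static List<List<Integer>> place(int blocks, int nodes, int replicas) {
        if (replicas > nodes) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add((b + r) % nodes);
            }
            placement.add(holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed block size
        long fileSize = 500L * 1024 * 1024;  // hypothetical 500 MB file
        int blocks = blockCount(fileSize, blockSize);
        List<List<Integer>> placement = place(blocks, 5, 3);
        for (int b = 0; b < blocks; b++) {
            System.out.println("block " + (b + 1) + " -> nodes " + placement.get(b));
        }
    }
}
```

Because every block lives on three different nodes in this model, losing any single node still leaves two copies of each block it held.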

Introduction to MapReduce
 Scalable to thousands of nodes and petabytes of data
MapReduce Application
1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
Return a single result set
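    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Map phase: emit <word, 1> for every token in the input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text val, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(val.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }
      // Reduce phase: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> val, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : val) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
      // . . . (job setup omitted, as on the slide)
    }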
Distribute map tasks to cluster (Hadoop Data Nodes)

MapReduce Example
 Count the number of occurrences of each word
Entry data:
Hello World Bye World
Hello IBM
Map process:
Map 1: < Hello, 1> < World, 1> < Bye, 1> < World, 1>
Map 2: < Hello, 1> < IBM, 1>
Shuffle process (transfer interim output for final processing)
Reduce process (final output):
< Bye, 1>
< IBM, 1>
< Hello, 2>
< World, 2>
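The three phases of this example can be imitated in memory with plain Java collections, which may help when reading the Hadoop version. This is a sketch, not Hadoop code: the class and method names are hypothetical, and everything runs in one process instead of being distributed over a cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory imitation of map, shuffle and reduce; no Hadoop classes are used.
class WordCountSketch {

    // Map: emit <word, 1> for every token of one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the values collected for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    // Run the three phases over a list of input splits.
    static Map<String, Integer> wordCount(List<String> splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(map(split));
        }
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("Hello World Bye World", "Hello IBM")));
    }
}
```

Each input line plays the role of one map task. In a real cluster the map calls run on different nodes and the shuffle moves data over the network; here both are simple loops.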

How to Analyze Large Data Sets in Hadoop
 It's not just runtime. Development phase has to be taken into
account.
 Although the Hadoop framework is implemented in Java,
MapReduce applications do not need to be written in Java
 To abstract complexities of Hadoop programming model, a few
application development languages have emerged that build on top
of Hadoop:
– Pig
– Hive
– Jaql
– ...

Pig, Hive, Jaql – Similarities
 Reduced program size over Java
 Applications are translated to map and reduce jobs behind the scenes
 Extension points for extending
existing functionality
 Interoperability with other
languages
 Not designed for random
reads/writes or low-latency queries

Pig, Hive, Jaql – Differences
Characteristic             Pig        Hive                               Jaql
Developed by               Yahoo!     Facebook                           IBM
Language                   Pig Latin  HiveQL                             Jaql
Type of language           Data flow  Declarative (SQL dialect)          Data flow
Data structures supported  Complex    Better suited for structured data  JSON, semi-structured
Schema                     Optional   Not optional                       Optional

Example of Hadoop Ecosystem
[Architecture diagram: IBM InfoSphere BigInsights, combining IBM and open source components]
Layers shown: Visualization & Discovery, Applications & Development, Administration, Advanced Analytic Engines, Workload Optimization, Runtime, Data Store, File System, Integration, Management
Components shown: BigSheets, Dashboard & Visualization, Apps, Workflow, Monitoring, Text Analytics, MapReduce, Pig & Jaql, Hive, Big SQL, JDBC, Admin Console, Integrated Installer, Text Processing Engine & Extractor Library, Adaptive Algorithms, R, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Adaptive MapReduce, Jaql, Pig, ZooKeeper, Lucene, Oozie, HCatalog, Avro, NameNode High Avail, HDFS, GPFS-FPO, HBase, Security, Audit & History, Lineage, Streams, Netezza, Flume, DB2, DataStage, Sqoop, Guardium, Platform Computing, Cognos

Open Source frameworks I
 Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are
contained within a file, and is validated as the data is written to the file using the Avro APIs. Users can include primary data
types and complex type definitions within a schema.
 Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
 HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications
are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together. This approach is different from a row-oriented
relational database, where all columns of a row are stored together
 HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored. You can change how you write data, while still supporting existing data in
older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
 Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements,
which are broken down by the Hive service into MapReduce jobs, and then run across a Hadoop cluster. InfoSphere
BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software.
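The column-family layout described for HBase can be pictured with nested maps. The sketch below is a conceptual model only: it is not the HBase client API, and the class, method and sample names are hypothetical. Values are grouped first by column family, so a scan of one family never touches the cells of another, and rows may simply have no cells in a family (sparse data).

```java
import java.util.Map;
import java.util.TreeMap;

// Conceptual model of column-family storage; this is not the HBase client API.
class ColumnFamilySketch {

    // family -> row key -> column qualifier -> value
    private final Map<String, Map<String, Map<String, String>>> families = new TreeMap<>();

    void put(String rowKey, String family, String qualifier, String value) {
        families.computeIfAbsent(family, f -> new TreeMap<>())
                .computeIfAbsent(rowKey, r -> new TreeMap<>())
                .put(qualifier, value);
    }

    // Reading one family touches only that family's storage.
    Map<String, Map<String, String>> scanFamily(String family) {
        return families.getOrDefault(family, new TreeMap<>());
    }

    public static void main(String[] args) {
        ColumnFamilySketch table = new ColumnFamilySketch();
        table.put("cust-1", "profile", "name", "Ana");
        table.put("cust-1", "orders", "2014-03", "3 items");
        table.put("cust-2", "profile", "name", "Luis"); // sparse: no "orders" cells for this row
        System.out.println(table.scanFamily("profile"));
    }
}
```

A row-oriented store would instead key the outer map by row, keeping all columns of a row together.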

Open Source frameworks II
 Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a
collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key
component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene
libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build,
scan, and query Lucene indexes
 Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides
users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the
required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system.
 R: A Project for Statistical Computing
 Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
 ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.
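The Oozie idea of running an action only once its dependencies are met can be sketched as follows. This is a conceptual illustration with hypothetical class, method and action names; real Oozie workflows are defined in XML and can also be triggered by time or data arrival, which is not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual sketch of dependency-driven scheduling; this is not the Oozie API.
class WorkflowSketch {

    private final Map<String, Set<String>> dependencies = new LinkedHashMap<>();

    // Declare an action and the actions it has to wait for.
    void action(String name, String... dependsOn) {
        dependencies.put(name, new LinkedHashSet<>(Arrays.asList(dependsOn)));
    }

    // Repeatedly run every action whose dependencies have all completed.
    List<String> runOrder() {
        List<String> done = new ArrayList<>();
        while (done.size() < dependencies.size()) {
            boolean progressed = false;
            for (Map.Entry<String, Set<String>> e : dependencies.entrySet()) {
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                    done.add(e.getKey());
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic or missing dependency");
            }
        }
        return done;
    }

    public static void main(String[] args) {
        WorkflowSketch wf = new WorkflowSketch();
        wf.action("report", "aggregate");
        wf.action("aggregate", "import-logs", "import-sales");
        wf.action("import-logs");
        wf.action("import-sales");
        System.out.println(wf.runOrder());
    }
}
```

The two import actions have no dependencies and run first, then the aggregation, then the report, regardless of the order in which the actions were declared.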


BigSheets
 Browser-based analytical tool that generates MapReduce jobs over big data stored in Hadoop.
 Helps non-programmers work with a Hadoop cluster.
 Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Analysts can also combine data residing in different collections and generate charts and new “sheets” (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
 Much of the technology included in Sheets was derived from the
BigSheets project of IBM’s Emerging Technologies team.

BigSheets: Collection Sample
 Spreadsheet-like structures defined by user
 Based on data accessible through BigInsights Web console – e.g., file
system data, output from Web crawl, etc.

BigSheets: Collection Operations
 Work with built-in “sheets” editor
 Add / delete columns
 Filter data
 Specify formulas to compute new
values using spreadsheet-style
syntax
 Apply built-in or custom macro
functions
 …………..

BigSheets: Collection Graphic Visualization
 Built-in charting facility aids analysis
 Pie charts, bar charts, tag clouds, maps, etc.
 Hover over sections to reveal details


What is Big SQL?
 Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive ANSI SQL-92 support
– Standards compliant client drivers (JDBC & ODBC)
– Efficient handling of "point queries"
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
 Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC and ODBC compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage
mechanisms allow

Architecture
 Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other’s tables
 SQL engine analyzes incoming
queries
– Separates portion(s) to execute at
the server vs. portion(s) to execute
on the cluster
– Re-writes query if necessary for
improved performance
– Determines appropriate storage
handler for data
– Produces execution plan
– Executes and coordinates query
 Server layout and relative sizes
for illustrative purposes only!
[Architecture diagram, top to bottom]
– Application (SQL language) with JDBC / ODBC driver, connected over a network protocol
– BigInsights cluster head nodes: Big SQL Server (SQL engine; storage handlers for delimited files, sequence files, HBase, RDBMS, ...), Name Node, Job Tracker, Hive Metastore
– Compute nodes: each runs a Task Tracker, a Data Node and a Region Server


What is Text Analytics?
 High Performance and Scalable rule based Information Extraction Engine.
 Distill structured information from unstructured data
- Rich annotator library supports multiple languages
 Provides sophisticated tooling to help build, test, and refine rules.
– Developer tools, an easy to use text analytics language, and a set of
extractors for fast adoption.
– Multilingual support, including support for DBCS languages.
 Developed at IBM Research since 2004: System T
 BigInsights is the first time IBM opens up the Text Analytics Engine
technology for customization and development

Annotator Query Language (AQL)
 Language to create rules for Text Analytics.
 SQL Like Language.
 Fully declarative text analytics language.
 Once compiled, AQL produces an AOG plan that runs against the data.
 No “black boxes” or modules that can’t be customized.
 Tooling for easy customization because you are abstracted from the
programmatic details.
 Competing solutions rely on locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance
create view AmountWithUnit as
extract pattern <N.match> <U.match>
as match
from Number N, Unit U;
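For readers who do not know AQL, the intent of the rule above (a number immediately followed by a unit) can be approximated with a regular expression in plain Java. This is an analogue, not AQL and not the Text Analytics engine; the class name, the unit list and the sample sentence are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java analogue of the AmountWithUnit rule; not AQL.
class AmountWithUnitSketch {

    // A number followed by a unit; the unit list is a made-up example dictionary.
    private static final Pattern AMOUNT_WITH_UNIT =
            Pattern.compile("\\b\\d+(?:\\.\\d+)?\\s+(?:kg|km|euros|dollars)\\b");

    // Return every "number unit" span found in the text.
    static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = AMOUNT_WITH_UNIT.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(extract("The parcel weighs 12.5 kg and costs 30 euros to ship 400 km."));
    }
}
```

In AQL the Number and Unit views would be defined separately and reused by other rules; the regular expression bundles both into one pattern, which is simpler but less composable.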

Text Analytic: Simple Example
World Cup 2010 Highlights
Football World Cup 2010, one team distinguished well from the rest winning the final. Early in the second half, Netherlands’ striker, Arjen Robben, had a chance to score, but the awesome keeper for Spain, Iker Casillas made the save. Winner superiority was reflected when Winger Andres Iniesta scored for Spain for the win.
Extracted entities:
Player          Position  Country
Arjen Robben    Striker   Netherlands
Iker Casillas   Keeper    Spain
Andres Iniesta  Winger    Spain

Text Analytic: Real Example

One step beyond: Watson


Big R
• Explore, visualize, transform,
and model big data using
familiar R syntax and
paradigm
• Scale out R with MR
programming
– Partitioning of large data
– Parallel cluster execution of R
code
• Distributed Machine
Learning
– A scalable statistics engine that
provides canned algorithms, and
an ability to author new ones, all
via R
[Diagram labels: R clients, IBM R packages, data sources, embedded R execution, scalable machine learning; arrows: “pull data (summaries) to R client” or “push R functions right on the data”]

Where Does BigData Fit?
[Diagram: source systems, analytical database (DW), analytical tools]
– “Capture only what’s needed”: 1. Extract, transform, load
– “Capture in case it’s needed”: 5. Explore data; 6. Parse, aggregate
– 9. Report and mine data

 Data Scientists

Data scientist – The new cool guy in town
Article in Fortune: “The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist”
McKinsey Global Institute, “Big data Report”: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions

Data Science is Multidisciplinary

Successful Data Scientist Characteristics

Data Scientist Qualities

How Long Does It Take For a Beginner to Become
a Good Data Scientist?

www.kaggle.com

Kaggle ranking

Learn Big Data
 Reading Materials - Online
– Understanding Big Data – Free PDF Book
• http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing, and deploying your first big data application with InfoSphere
BigInsights
• www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
• http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
 Resources
– Big Data Information Center
• www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
• www-01.ibm.com/software/data/infosphere/biginsights/
– Stream Computing
• www-01.ibm.com/software/data/infosphere/stream-computing/
– DeveloperWorks, forums, demos ...
• http://www.ibm.com/developerworks/wiki/biginsights/

Learn Big Data Technologies
BigDataUniversity.com
Flexible on-line delivery
allows learning @your place
and @your pace
 Free courses, free study
materials.
 Cloud-based sandbox for
exercises – zero setup
 Robust Course
Management System and
Content Distribution
infrastructure
