2012.04.26 big insights streams im forum2

IBM's Big Data Platform
InfoSphere BigInsights und InfoSphere Streams


Wilfried Hoge – Leading Technical Sales Professional
hoge@de.ibm.com
twitter.com/wilfriedhoge

IBM Big Data Strategy: Move the Analytics Closer to the Data

New analytic applications drive the requirements for a big data platform
•  Integrate and manage the full variety, velocity and volume of data
•  Apply advanced analytics to information in its native form
•  Visualize all available data for ad-hoc analysis
•  Development environment for building new analytic applications
•  Workload optimization and scheduling
•  Security and Governance

[Diagram: Analytic Applications (BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics) on top of the IBM Big Data Platform (Visualization & Discovery, Application Development, Systems Management; Accelerators; Hadoop System, Stream Computing, Data Warehouse; Information Integration & Governance)]

Volume and Velocity – two dimensions for Big Data
[Chart: Data Scale (Kilo, Mega, Giga, Tera, Peta, Exa) versus Decision Frequency (yr, mo, wk, day, hr, min, sec, ms, µs; Occasional, Frequent, Real-time). Relative to the Traditional Data Warehouse and Business Intelligence, Data at Rest is up to 10,000 times larger and Data in Motion is up to 10,000 times faster.]
Examples plotted on the chart:
•  Wind Turbine Placement & Operation: PBs of data, analysis time to 3 days from 3 weeks, 1220 IBM iDataPlex nodes
•  DeepQA: 100s GB for Deep Analytics, 3 sec/decision, Power7, 15TB memory
•  Telco Promotions: 100,000 records/sec, 6B/day, 10 ms/decision, 270TB for Deep Analytics
•  Security: 600,000 records/sec, 50B/day, 1-2 ms/decision, 320TB for Deep Analytics
26.04.2012 © Copyright IBM Corporation 2012 4

BigInsights – analytical platform for persistent “Big Data”
Based on open source & IBM technologies
Distinguishing characteristics
•  Built-in analytics . . . enhances business knowledge
•  Enterprise software integration . . . complements and extends existing capabilities
•  Production-ready platform with tooling for analysts, developers, and administrators . . . speeds time-to-value and simplifies development/maintenance

IBM advantage
•  Combination of software, hardware, services and advanced research

About the BigInsights Platform
Flexible, enterprise-class support for processing large volumes of data
•  Based on Google’s MapReduce technology
•  Inspired by Apache Hadoop; compatible with its ecosystem and distribution
•  Well-suited to batch-oriented, read-intensive applications
•  Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes of
data in a highly parallel, cost effective manner
•  CPU + disks = “node”
•  Nodes can be combined into clusters
•  New nodes can be added as needed without changing data formats, how data is loaded, or how jobs are written

Hadoop Explained – Map Reduce
Hadoop computation model
•  Data stored in a distributed file system spanning many inexpensive computers
•  Bring function to the data
•  Distribute application to the compute resources where the data is stored
Scalable to thousands of nodes and petabytes of data

MapReduce application (WordCount excerpt shown on the slide):

    public static class TokenizerMapper
        extends Mapper<Object,Text,Text,IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // Map: emit (word, 1) for every token of the input line
      public void map(Object key, Text val, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(val.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

    public static class IntSumReducer
        extends Reducer<Text,IntWritable,Text,IntWritable> {

      private IntWritable result = new IntWritable();

      // Reduce: sum the counts collected for one word
      public void reduce(Text key, Iterable<IntWritable> val, Context context) {
        int sum = 0;
        for (IntWritable v : val) {
          sum += v.get();
        . . .

Processing steps on the Hadoop data nodes:
1.  Map Phase (break job into small parts); distribute map tasks to cluster
2.  Shuffle (transfer interim output for final processing)
3.  Reduce Phase (boil all output down to a single result set)
Return a single result set
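
The three phases named on this slide (map, shuffle, reduce) can be tried out without a cluster. The following is a minimal in-memory sketch in plain Java, not Hadoop code: the class name WordCountSketch and its methods are assumptions made for illustration, and no Hadoop classes are involved.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory illustration of map -> shuffle -> reduce (no Hadoop involved).
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every token of one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle: group all emitted values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence and returns one result set.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("big data big insights", "data in motion")));
    }
}
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and the shuffle moves the grouped pairs across the network; here everything runs in one process only to show the data flow.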

BigInsights – Value Beyond Open Source
Technical differentiators
•  Built-in analytics
•  Text processing engine, annotators, Eclipse tooling
•  Statistical and predictive analysis
•  Interface to project R (statistical platform)
•  Enterprise software integration (DBMS, warehouse)
•  Spreadsheet-style analytical tool for analysts
•  Ready-made business process accelerators
•  Integrated installation of supported open source and IBM components
•  Web Console for administration and application access
•  Platform enrichment: additional security, performance features, . . .
•  Standard IBM licensing agreement and world-class support
Business benefits
•  Quicker time-to-value due to IBM technology and support
•  Reduced operational risk
•  Enhanced business knowledge with flexible analytical platform
•  Leverages and complements existing software assets

Web Installation Tool
Seamless process for single
node and cluster environments

Integrated installation of all
selected components

Post-install validation of IBM and
open source components

No need to iteratively download, configure, and test multiple open source
projects and their pre-requisite software.

Web Console
Manage BigInsights
•  Inspect system health
•  Add / drop nodes
•  Start / stop services
•  Run / monitor jobs (applications)
•  Explore / modify file system

Launch applications
•  Spreadsheet-like analysis tool
•  Pre-built applications (IBM supplied
or user developed)

Publish applications
Leverage community resources

BigSheets
BigSheets is a visual tool for data manipulation and prototyping
•  Allows more users to do more work, more quickly
•  Simply stated, growing an army of MapReduce developers is not cost effective
•  In your BI environments you have a ratio of 30+ report users for every complex SQL
developer. We need to support the same ratios with BigInsights

Sample Uses
•  Data exploration and visualization
•  Visual job creation

BigSheets – Spreadsheet-style Data Analysis and Discovery

Quick start applications or “apps”
Reusable software assets based on customer engagements
•  Useful as a starting point for various applications
•  Can be customized by BigInsights application developers as needed
•  Accessible through Web console

Available assets
•  Data export (to relational DBMS, files, HBase)
•  Data import (from relational DBMS, files)
•  Web crawler, Twitter crawler
•  Boardreader.com support (Web forum search engine)
•  Ad hoc queries for Jaql, Hive, Pig
•  TeraGen-TeraSort, WordCount sample applications

Running Applications from the Web Console

Develop Hive with the SQL Editor and view results

Build a Big Data Program – Map Reduce example

Eclipse based development tools
For JAQL, Hive, Java MapReduce, Text Analytics

Text Analytics in BigInsights
Text analytics – Distill structured information from unstructured data
•  Rich annotator library supports multiple languages
•  Declarative Information Extraction (IE) system based on an algebraic framework
•  Richer, cleaner rule semantics
•  Better performance through optimization
•  Compose operators to build complex annotators

Developed at IBM Research since 2004

Embedded in several IBM products
•  Lotus Notes
•  Cognos Consumer Insight
•  InfoSphere Streams

Turns disparate words into measurable insights
Pre-configured text annotators ready for distributed processing on Big Data
•  City, County, Zipcode, Address, Maplocation, StateOrProvince, Country, Continent,
EmailAddress, Person, Organization, DateTime, URL, Company Names, Merger,
Acquisition, Alliance, etc.
Support for native languages including double-byte
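
To make the idea of an annotator concrete, here is a small sketch in plain Java of two of the annotator types listed above (EmailAddress and URL). It is only an illustration under stated assumptions: BigInsights annotators are written in AQL, not Java, and the class AnnotatorSketch, its methods and the simplified regular expressions are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based annotators; real BigInsights annotators are AQL rules.
class AnnotatorSketch {

    // One extracted span: the annotation type plus the matched text.
    static final class Annotation {
        final String type;
        final String text;

        Annotation(String type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + ":" + text;
        }
    }

    // Deliberately simplified patterns, assumed for illustration only.
    static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    static final Pattern URL = Pattern.compile("https?://[^\\s,]+");

    // Runs every annotator over the text and collects structured results.
    static List<Annotation> annotate(String text) {
        List<Annotation> out = new ArrayList<>();
        collect("EmailAddress", EMAIL, text, out);
        collect("URL", URL, text, out);
        return out;
    }

    private static void collect(String type, Pattern p, String text, List<Annotation> out) {
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(new Annotation(type, m.group()));
        }
    }

    public static void main(String[] args) {
        System.out.println(annotate(
            "Contact hoge@de.ibm.com or visit http://www.bigdatauniversity.com for details"));
    }
}
```

The point of the sketch is the shape of the result: unstructured text goes in, typed spans come out, and those spans can then be counted or joined like any other structured data.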

Text analytics processing stages (column headings not recoverable from the slide):
•  Physically assemble data, standardize formats, address auto-identify language, process punctuation and non-grammatical characters, standardize spelling.
•  Part-of-speech identification, standard and customized extraction dictionaries, proper noun identification, concept categorization, synonyms, exclusions, multi-terms, regular expressions, fuzzy-matching
•  Identify positive or negative sentiment, NLP-based analytics, define variables, macros and rules.
•  Iterative classification using automated and manual techniques. Concept derivation & inclusion, semantic networks and co-occurrence rules
•  Reporting/Monitoring social commentary, combination w/ structured data, clustering, associated concepts, correlated concepts, auto-classification of documents, sites, posts.

Text Analytics – highly accurate analysis of textual content
How it works
•  Parses text and detects meaning with annotators
•  Understands the context in which the text is analyzed
•  Hundreds of pre-built annotators for names, addresses, phone numbers, among others
Accuracy
•  Highly accurate in deriving meaning from complex text
Performance
•  AQL language optimized for MapReduce

Example: Unstructured text (document, email, etc) is turned into Classification and Insight
“Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.”

BigInsights Text Analytics Development – AQL

Text Analytics Tooling
AQL Editor Result Viewer

Runtime Explain

Statistical and Predictive Analysis
Framework for machine learning (ML) implementations on Big Data
•  Large, sparse data sets, e.g. 5B non-zero values
•  Runs on large BigInsights clusters with 1000s of nodes
Productivity
•  Build and enhance predictive models directly on Big Data
•  High-level language – Declarative Machine Learning Language (DML)
•  E.g. 1500 lines of Java code boils down to 15 lines of DML code
•  Parallel SPSS data mining algorithms implementable in DML
Optimization
•  Compile algorithms into optimized parallel code
•  For different clusters and different data characteristics
•  E.g. 1 hr. execution (hand-coded) down to 10 mins

[Chart: Execution Time (sec) versus # non zeros (million) for Java Map-Reduce, SystemML and Single node R]

Workload Optimization
Optimized performance for big data analytic workloads

Adaptive MapReduce
•  Algorithm to optimize execution time of multiple small jobs
•  Performance gains of 30% by reducing the overhead of task startup

Hadoop System Scheduler
•  Identifies small and large jobs from prior experience
•  Sequences work to reduce overhead

[Diagram: Task, then Map (break task into small parts), then Adaptive Map (optimization: order small units of work), then Reduce (many results to a single result set)]

InfoSphere BigInsights – Embrace and Extend Hadoop
[Architecture diagram; each component is marked as either IBM or Open Source]
Analytics: ML Analytics, Text Analytics, BigSheets
Application: Pig, Hive, Jaql, MapReduce, AdaptiveMR, FLEX, BigIndex, Oozie, Lucene
Also shown: Avro, Zookeeper, IBM LZO Compression
Storage: HBase, HDFS, GPFS-SNC
Data Sources/Connectors: Streams, Netezza, BoardReader, R, Data Stage, DB2, CSV / XML / JSON, SPSS, Flume, JDBC, Web Crawler

Interface
Web console
•  Monitor cluster health
•  Add / remove nodes
•  Start / stop services
•  Inspect job status
•  Inspect workflow status
•  Deploy apps
•  Launch apps / jobs
•  Work with distrib. file system
•  Work with spreadsheet interface
•  Support REST-based API
•  . . .
Eclipse plug-ins
•  Text analytics
•  MapReduce programming
•  Jaql development
•  Hive query development

Ways to get started with BigInsights
In the Cloud
•  Via RightScale, or directly on Amazon, Rackspace, IBM
Smart Enterprise Cloud, or on private clouds.
•  Pay only for the resources used.

In the Virtual Classroom
•  Free Hadoop Fundamentals training course
www.bigdatauniversity.com
•  e.g. BD105EN - Text Analytics Essentials

On Your Cluster
•  Download Basic Edition from ibm.com.
In the Classroom
•  Enroll in the InfoSphere BigInsights Essentials course.

Visit the BigInsights technical portal . . . .
Free links to papers, demos, discussion forum, and more
http://www.ibm.com/developerworks/wiki/biginsights/

Streams – analytical platform for in-motion “Big Data”
Built to analyze data in motion
•  Multiple concurrent input streams
•  Massive scalability

Process and analyze a variety of data
•  Structured, unstructured content, video, audio
•  Advanced analytic operators

Stream Computing – Analyze Data in Motion

Traditional Computing
•  Historical fact finding
•  Find and analyze information stored on disk
•  Batch paradigm, pull model
•  Query-driven: submits queries to static data
•  Flow: Query, Data, Results

Stream Computing
•  Current fact finding
•  Analyze data in motion, before it is stored
•  Low latency paradigm, push model
•  Data driven: bring the data to the query
•  Flow: Data, Query, Results
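
The difference between the pull model and the push model can be sketched in a few lines of plain Java. This is an illustration only and not Streams or SPL code; the class PushModelSketch and its members are assumed names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical contrast of the two models; not InfoSphere Streams code.
class PushModelSketch {

    // Pull model: the data is stored first, then a query is run over the stored data.
    static List<Integer> pullQuery(List<Integer> stored, Predicate<Integer> query) {
        List<Integer> results = new ArrayList<>();
        for (Integer v : stored) {
            if (query.test(v)) {
                results.add(v);
            }
        }
        return results;
    }

    // Push model: the query is registered once and every arriving item is
    // evaluated immediately, before anything is stored.
    static final class StandingQuery {
        private final Predicate<Integer> query;
        private final List<Integer> results = new ArrayList<>();

        StandingQuery(Predicate<Integer> query) {
            this.query = query;
        }

        void onData(int value) {
            if (query.test(value)) {
                results.add(value);
            }
        }

        List<Integer> results() {
            return results;
        }
    }

    public static void main(String[] args) {
        StandingQuery alarms = new StandingQuery(v -> v > 100);
        for (int reading : new int[] {50, 150, 200}) {
            alarms.onData(reading); // data is brought to the query
        }
        System.out.println(alarms.results());
    }
}
```

Both variants compute the same answer; what changes is when the work happens. The standing query has a result available as soon as each reading arrives, which is what the low latency claim rests on.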

Why InfoSphere Streams?
Applications that require on-the-fly processing, filtering and analysis of
streaming data
•  Sensors: environmental, industrial, surveillance video, GPS, …
•  “Data exhaust”: network/system/web server/app server log files
•  High-rate transaction data: financial transactions, call detail records

Criteria: two or more of the following
•  Messages are processed in isolation or in limited data windows
•  Sources include non-traditional data (spatial, imagery, text, …)
•  Sources vary in connection methods, data rates, and processing requirements,
presenting integration challenges
•  Data rates/volumes require the resources of multiple processing nodes
•  Analysis and response are needed with sub-millisecond latency
•  Data rates and volumes are too great for store-and-mine approaches
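
The first criterion, processing messages in limited data windows, is the core of most streaming analytics. A minimal sketch in plain Java of a count-based sliding window that keeps a running average follows; the class SlidingWindowSketch is a hypothetical name and is not a Streams operator.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical count-based sliding window; keeps only the last `size` readings.
class SlidingWindowSketch {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingWindowSketch(int size) {
        this.size = size;
    }

    // Adds one reading, evicts the oldest one if the window is full,
    // and returns the average over the current window.
    double add(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingWindowSketch avg = new SlidingWindowSketch(3);
        for (double reading : new double[] {1, 2, 3, 4}) {
            System.out.println(avg.add(reading));
        }
    }
}
```

Memory use is bounded by the window size regardless of how many readings arrive, which is why this style of analysis still works when the full stream is too large to store and mine.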

Massively Scalable Stream Analytics
Linear Scalability
•  Clustered deployments – unlimited scalability
Automated Deployment
•  Automatically optimize operator deployment across clusters
Performance Optimization
•  JVM Sharing – minimize memory use
•  Fuse operators on same cluster
•  Telco client – 25 Million messages per second
Analytics on Streaming Data
•  Analytic accelerators for a variety of data types
•  Optimized for real-time performance

[Diagram: Streaming Data Sources feed Source Adapters, Analytic Operators and Sink Adapters on the Streams Runtime, with Visualization on the output side; applications are built in the Streams Studio IDE and placed by Automated and Optimized Deployment]

Streams approach illustrated
[Figure: a stream of tuples, each with the attributes directory, filename, height, width and data. Example values shown: directory "/img", "/img", "/opt", "/img"; filename "farm", "bird", "java", "cat"; height 640, 1280, 640; width 480, 1024, 480]
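
A tuple as pictured on this slide is simply a record with a fixed set of named, typed attributes, and an operator consumes tuples and emits tuples. The sketch below shows this in plain Java rather than SPL. The class names are assumptions, the pairing of height and width values with individual files is guessed from the figure, and the 0 values for the "/opt" tuple are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical tuple type and filter operator; not SPL.
class TupleSketch {

    // One tuple of the stream (the binary data attribute is omitted here).
    static final class ImageTuple {
        final String directory;
        final String filename;
        final int height;
        final int width;

        ImageTuple(String directory, String filename, int height, int width) {
            this.directory = directory;
            this.filename = filename;
            this.height = height;
            this.width = width;
        }
    }

    // A filter operator: passes on only the tuples whose directory matches.
    static List<ImageTuple> filterByDirectory(List<ImageTuple> stream, String directory) {
        List<ImageTuple> out = new ArrayList<>();
        for (ImageTuple t : stream) {
            if (t.directory.equals(directory)) {
                out.add(t);
            }
        }
        return out;
    }

    // Sample tuples; the attribute pairing is an assumption made for illustration.
    static List<ImageTuple> sampleStream() {
        return Arrays.asList(
            new ImageTuple("/img", "farm", 640, 480),
            new ImageTuple("/img", "bird", 1280, 1024),
            new ImageTuple("/opt", "java", 0, 0),
            new ImageTuple("/img", "cat", 640, 480));
    }

    public static void main(String[] args) {
        for (ImageTuple t : filterByDirectory(sampleStream(), "/img")) {
            System.out.println(t.filename + " " + t.height + "x" + t.width);
        }
    }
}
```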

InfoSphere Streams for superior real time analytic processing
Streams Processing Language (SPL) built for Streaming applications:
•  Reusable operators
•  Rapid application development
•  Continuous “pipeline” processing

Compile groups of operators into single processes:
•  Efficient use of cores
•  Distributed execution
•  Very fast data exchange
•  Can be automatic or tuned
•  Scaled with push of a button

Use the data that gives you a competitive advantage:
•  Can handle virtually any data type
•  Use data that is too expensive and time sensitive for traditional approaches

Easy to extend:
•  Built in adaptors
•  Users add capability with familiar C++ and Java

Dynamic analysis:
•  Programmatically change topology at runtime
•  Create new subscriptions
•  Create new port properties

Easy to manage:
•  Automatic placement
•  Extend applications incrementally without downtime
•  Multi-user / multiple applications

Flexible and high performance transport:
•  Very low latency
•  High data rates

Streams Studio Integrated Development Environment


Compiler Framework
Operator Fusion
•  Fine-grained operators
•  From small parts, make larger ones that fit
Code generation
•  Generates code to match the underlying runtime environment
•  Number of cores
•  Interconnect characteristics
•  Architecture-specific instructions
•  Driven by automatic profiling
•  Compiler-based optimization
•  Driven by incremental learning of application characteristics

[Diagram: logical app view compiled into physical app view]
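
Operator fusion, making larger parts out of small ones, can be illustrated with ordinary function composition. The sketch below is plain Java and only an analogy: the Streams compiler fuses SPL operators into processes, whereas here three hypothetical fine-grained operators are composed into one function so that a value passes through all of them in a single call chain, with no queue or network hop in between.

```java
import java.util.function.Function;

// Hypothetical analogy for operator fusion using function composition.
class FusionSketch {

    // Three fine-grained operators, each doing one small step.
    static final Function<String, String> trim = String::trim;
    static final Function<String, String> upper = String::toUpperCase;
    static final Function<String, Integer> length = String::length;

    // The fused operator: one unit that runs all three steps back to back.
    static final Function<String, Integer> fused = trim.andThen(upper).andThen(length);

    public static void main(String[] args) {
        System.out.println(fused.apply("  streams "));
    }
}
```

Keeping the operators small preserves reuse at the logical level, while fusing them at build time removes the per-tuple hand-off cost at the physical level.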

Streams Data Mining Toolkit
Enables scoring of real-time data in a Streams application
•  Scoring is performed against a predefined model
•  Supports a variety of model types and scoring algorithms

Models represented in Predictive Model Markup Language (PMML)
•  Standard for statistical and data mining models
•  XML Representation

Toolkit provides four Streams operators to enable scoring
•  Classification
•  Clustering
•  Regression
•  Associations
The toolkit supports dynamic replacement of the PMML model used by an
operator.
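
Dynamic replacement of the model means that scoring continues while a newer model is swapped in. A minimal sketch of that idea in plain Java follows. It does not parse PMML and is not the toolkit's API: the class ScoringOperatorSketch is hypothetical, and the model is reduced to a single-input function for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.DoubleUnaryOperator;

// Hypothetical scoring operator whose model can be replaced at runtime.
class ScoringOperatorSketch {

    // The current model; AtomicReference makes the swap safe across threads.
    private final AtomicReference<DoubleUnaryOperator> model;

    ScoringOperatorSketch(DoubleUnaryOperator initialModel) {
        this.model = new AtomicReference<>(initialModel);
    }

    // Scores one incoming value against whichever model is current.
    double score(double input) {
        return model.get().applyAsDouble(input);
    }

    // Replaces the model without stopping the operator.
    void replaceModel(DoubleUnaryOperator newModel) {
        model.set(newModel);
    }

    public static void main(String[] args) {
        ScoringOperatorSketch op = new ScoringOperatorSketch(x -> 2 * x + 1);
        System.out.println(op.score(3.0));
        op.replaceModel(x -> x * x); // stands in for loading a retrained model
        System.out.println(op.score(3.0));
    }
}
```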

Without a Big Data Platform You Code…
•  Custom SQL and Scripts
•  Event Handling
•  Multithreading
•  Check Pointing
•  Application Management
•  HA
•  Performance Optimization
•  Debug
•  Security

IBM Big Data Platform
Over 100 sample applications and industry-focused toolkits with 300+ functions and operators
•  Accelerators
•  Toolkits
•  Connectors
Streams provides development, deployment, runtime, and infrastructure services

“TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language…”
– Alex Philip, CEO and President, TerraEchos

Streams Redbook
redbooks.ibm.com/abstracts/sg247970.html

This book is intended for professionals that
require an understanding of how to process high
volumes of streaming data or need information
about how to implement systems to satisfy
those requirements.

Right-time actions are taken in the new BI/BA ecosystem
• Three routes to analytics
• Application and workload optimized appliances and systems
• Fast data movement and integration

[Diagram: three routes from data sources to results]
• Traditional / Relational Data Sources: Traditional Warehouse (Database & Warehouse, At-Rest Data Analytics), delivering Results
• Non-Traditional / Non-Relational Data Sources: Streams (In-Motion Analytics), delivering Ultra Low Latency Results
• Non-Traditional / Non-Relational and Traditional / Relational Data Sources: InfoSphere BigInsights (Internet Scale Data Analytics, Data Operations & Model Building), delivering Results


Example of 360° customer view

[Diagram: Business Processes (Events and Alerts, Master Data Management, Campaign Management, Cognos Consumer Insight) sit on top of the Big Data Platform. Website Logs and Social Media feed Internet Scale Analytics, which yields Web Traffic and Social Media Insight. Call Detail Records feed Streaming Analytics, which yields Call Behavior and Experience Insight. Information Integration and the Data Warehouse connect the two.]
