The IBM Big Data Platform
InfoSphere BigInsights and InfoSphere Streams
Wilfried Hoge – Leading Technical Sales Professional
hoge@de.ibm.com
twitter.com/wilfriedhoge
IBM Big Data Strategy: Move the Analytics Closer to the Data

New analytic applications drive the requirements for a big data platform

•  Integrate and manage the full variety, velocity and volume of data
•  Apply advanced analytics to information in its native form
•  Visualize all available data for ad-hoc analysis
•  Development environment for building new analytic applications
•  Workload optimization and scheduling
•  Security and Governance

[Diagram: IBM Big Data Platform stack]
Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management
Accelerators
Hadoop System, Stream Computing, Data Warehouse
Information Integration & Governance
Volume and Velocity – two dimensions for Big Data
[Chart: Data Scale (Kilo, Mega, Giga, Tera, Peta, Exa) versus Decision Frequency (yr, mo, wk, day, hr, min, sec, …, ms, µs; Occasional, Frequent, Real-time)]

Regions shown
•  Traditional Data Warehouse and Business Intelligence
•  Data at Rest: up to 10,000 times larger
•  Data in Motion: up to 10,000 times faster

Examples
•  Wind Turbine Placement & Operation: PBs of data; analysis time to 3 days from 3 weeks; 1220 IBM iDataPlex nodes
•  DeepQA: 100s GB for Deep Analytics; 3 sec/decision; Power7, 15TB memory
•  Telco Promotions: 100,000 records/sec, 6B/day; 10 ms/decision; 270TB for Deep Analytics
•  Security: 600,000 records/sec, 50B/day; 1-2 ms/decision; 320TB for Deep Analytics
26.04.2012   © Copyright IBM Corporation 2012
BigInsights – analytical platform for persistent “Big Data”
Based on open source & IBM technologies

Distinguishing characteristics
•  Built-in analytics . . . enhances business knowledge
•  Enterprise software integration . . . complements and extends existing capabilities
•  Production-ready platform with tooling for analysts, developers, and administrators . . . speeds time-to-value and simplifies development/maintenance

IBM advantage
•  Combination of software, hardware, services and advanced research

[Diagram: IBM Big Data Platform stack: Analytic Applications; Visualization & Discovery, Application Development, Systems Management; Accelerators; Hadoop System, Stream Computing, Data Warehouse; Information Integration & Governance]
About the BigInsights Platform
Flexible, enterprise-class support for processing large volumes of data
•  Based on Google’s MapReduce technology
•  Inspired by Apache Hadoop; compatible with its ecosystem and distribution
•  Well-suited to batch-oriented, read-intensive applications
•  Supports a wide variety of data


Enables applications to work with thousands of nodes and petabytes of
data in a highly parallel, cost-effective manner
•  CPU + disks = “node”
•  Nodes can be combined into clusters
•  New nodes can be added as needed without changing
  •  Data formats
  •  How data is loaded
  •  How jobs are written
Hadoop Explained – Map Reduce
         Hadoop computation model
             •  Data stored in a distributed file system spanning many inexpensive computers
             •  Bring function to the data
             •  Distribute application to the compute resources where the data is stored
         Scalable to thousands of nodes and petabytes of data
MapReduce Application (word count, excerpt):

    public static class TokenizerMapper
        extends Mapper<Object,Text,Text,IntWritable> {
      // Constant count of 1 emitted for every token
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // Map: split the input line into tokens and emit (word, 1)
      public void map(Object key, Text val, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr =
            new StringTokenizer(val.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

    public static class IntSumReducer
        extends Reducer<Text,IntWritable,Text,IntWritable> {
      private IntWritable result = new IntWritable();

      // Reduce: sum all counts emitted for one word
      public void reduce(Text key,
          Iterable<IntWritable> val, Context context){
        int sum = 0;
        for (IntWritable v : val) {
          sum += v.get();
      . . .

[Diagram: MapReduce Application running on Hadoop Data Nodes]
1.  Map Phase (break job into small parts): distribute map tasks to cluster
2.  Shuffle (transfer interim output for final processing)
3.  Reduce Phase (boil all output down to a single result set): return a single result set
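The three phases can be tried out without a cluster. The following is a minimal single-process sketch, assuming plain Java collections in place of the Hadoop API; the class name WordCountSketch and its method names are hypothetical and chosen only for illustration. It shows what each phase does to the data, not how Hadoop distributes the work.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory illustration of map, shuffle and reduce (no Hadoop involved).
final class WordCountSketch {

    // 1. Map phase: emit one (word, 1) pair per token of an input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    // 2. Shuffle: group the interim pairs by key so each word's counts end up together.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // 3. Reduce phase: sum the counts collected for one word.
    static int reduce(Iterable<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Runs all three phases in sequence over the given input splits.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> interim = new ArrayList<>();
        for (String split : splits) {
            interim.addAll(map(split));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(interim).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or", "not to be")));
    }
}
```

On a real cluster each call to map would run on the node that stores the split, and the shuffle would move interim pairs across the network; here everything happens in one JVM.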
BigInsights – Value Beyond Open Source
Technical differentiators
•  Built-in analytics
  •  Text processing engine, annotators, Eclipse tooling
  •  Statistical and predictive analysis
  •  Interface to project R (statistical platform)
•  Enterprise software integration (DBMS, warehouse)
•  Spreadsheet-style analytical tool for analysts
•  Ready-made business process accelerators
•  Integrated installation of supported open source and IBM components
•  Web Console for administration and application access
•  Platform enrichment: additional security, performance features, . . .
•  Standard IBM licensing agreement and world-class support
Business benefits
•  Quicker time-to-value due to IBM technology and support
•  Reduced operational risk
•  Enhanced business knowledge with flexible analytical platform
•  Leverages and complements existing software assets
Web Installation Tool
Seamless process for single node and cluster environments

Integrated installation of all selected components

Post-install validation of IBM and open source components

No need to iteratively download, configure, and test multiple open source projects and their pre-requisite software.
Web Console
Manage BigInsights
•  Inspect system health
•  Add / drop nodes
•  Start / stop services
•  Run / monitor jobs (applications)
•  Explore / modify file system


Launch applications
•  Spreadsheet-like analysis tool
•  Pre-built applications (IBM supplied
   or user developed)


Publish applications
Leverage community resources
BigSheets
BigSheets is a visual tool for data manipulation and prototyping
•  Allows more users to do more work, more quickly
•  Simply stated, growing an army of MapReduce developers is not cost effective
•  In your BI environments you have a ratio of 30+ report users for every complex SQL
   developer. We need to support the same ratios with BigInsights

Sample Uses
•  Data exploration and visualization
•  Visual job creation
BigSheets – Spreadsheet-style Data Analysis and Discovery
BigSheets – Visualization
Quick start applications or “apps”
Reusable software assets based on customer engagements
•  Useful as a starting point for various applications
•  Can be customized by BigInsights application developers as needed
•  Accessible through Web console



Available assets
•  Data export (to relational DBMS, files, HBase)
•  Data import (from relational DBMS, files)
•  Web crawler, Twitter crawler
•  Boardreader.com support (Web forum search engine)
•  Ad hoc queries for Jaql, Hive, Pig
•  TeraGen-TeraSort, WordCount sample applications
Running Applications from the Web Console
Develop Hive with the SQL Editor and view results
Build a Big Data Program – Map Reduce example

                              Eclipse based development tools
                                  For JAQL, Hive, Java MapReduce, Text Analytics
Text Analytics in BigInsights
Text analytics – Distill structured information from unstructured data
•  Rich annotator library supports multiple languages
•  Declarative Information Extraction (IE) system based on an algebraic framework
•  Compose operators to build complex annotators
•  Richer, cleaner rule semantics
•  Better performance through optimization

Developed at IBM Research since 2004

Embedded in several IBM products
•  Lotus Notes
•  Cognos Consumer Insights
•  InfoSphere Streams
Turns disparate words into measurable insights
Pre-configured text annotators ready for distributed processing on Big Data
•  City, County, Zipcode, Address, Maplocation, StateOrProvince, Country, Continent,
   EmailAddress, Person, Organization, DateTime, URL, Company Names, Merger,
   Acquisition, Alliance, etc.
Support for native languages including double-byte

Processing stages (left to right in the diagram)
•  Physically assemble data, standardize formats, address auto-identify language, process punctuation and non-grammatical characters, standardize spelling.
•  Part-of-speech identification, standard and customized extraction dictionaries, proper noun identification, concept categorization, synonyms, exclusions, multi-terms, regular expressions, fuzzy-matching
•  Identify positive or negative sentiment, NLP-based analytics, define variables, macros and rules.
•  Iterative classification using automated and manual techniques. Concept derivation & inclusion, semantic networks and co-occurrence rules
•  Reporting/Monitoring social commentary, combination w/ structured data, clustering, associated concepts, correlated concepts, auto-classification of documents, sites, posts.
Text Analytics – highly accurate analysis of textual content
How it works
•  Parses text and detects meaning with annotators
•  Understands the context in which the text is analyzed
•  Hundreds of pre-built annotators for names, addresses, phone numbers, among others
Accuracy
•  Highly accurate in deriving meaning from complex text
Performance
•  AQL language optimized for MapReduce

Unstructured text (document, email, etc)
   Football World Cup 2010, one team distinguished themselves well, losing to
   the eventual champions 1-0 in the Final. Early in the second half, Netherlands’
   striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas
   made the save. Winger Andres Iniesta scored for Spain for the win.

[Diagram: the unstructured text is turned into Classification and Insight]
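To make the idea of an annotator concrete, here is a minimal sketch in plain Java, assuming two very simple techniques mentioned in this deck: a dictionary and regular expressions. It is not AQL and does not use the BigInsights text analytics engine; the class AnnotatorSketch, the annotation types (Country, Score, Year) and the patterns are all hypothetical choices for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy annotators: a dictionary match and two regular expressions.
final class AnnotatorSketch {

    // One extracted span: its type, the matched text, and character offsets.
    static final class Annotation {
        final String type;
        final String text;
        final int begin;
        final int end;

        Annotation(String type, String text, int begin, int end) {
            this.type = type;
            this.text = text;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + begin + "," + end + "]=" + text;
        }
    }

    // Builds a whole-word pattern that matches any of the dictionary entries.
    static Pattern dictionary(String... entries) {
        StringBuilder sb = new StringBuilder("\\b(?:");
        for (int i = 0; i < entries.length; i++) {
            if (i > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(entries[i]));
        }
        return Pattern.compile(sb.append(")\\b").toString());
    }

    // Applies one pattern to the document and labels every match with the given type.
    static List<Annotation> annotate(String type, Pattern pattern, String doc) {
        List<Annotation> out = new ArrayList<>();
        Matcher m = pattern.matcher(doc);
        while (m.find()) {
            out.add(new Annotation(type, m.group(), m.start(), m.end()));
        }
        return out;
    }

    // Runs all toy annotators; the dictionary content and regexes are assumptions.
    static List<Annotation> annotateAll(String doc) {
        List<Annotation> all = new ArrayList<>();
        all.addAll(annotate("Country", dictionary("Netherlands", "Spain"), doc));
        all.addAll(annotate("Score", Pattern.compile("\\b\\d{1,2}-\\d{1,2}\\b"), doc));
        all.addAll(annotate("Year", Pattern.compile("\\b(?:19|20)\\d{2}\\b"), doc));
        return all;
    }
}
```

Applied to the World Cup text above, annotateAll would label the team names, the 1-0 score and the year 2010 as structured records with offsets. A declarative system such as the one described here would additionally combine and optimize such extractors, which this sketch does not attempt.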
BigInsights Text Analytics Development – AQL
Text Analytics Tooling
          AQL Editor     Result Viewer




Runtime Explain
Statistical and Predictive Analysis
Framework for machine learning (ML) implementations on Big Data
•  Large, sparse data sets, e.g. 5B non-zero values
•  Runs on large BigInsights clusters with 1000s of nodes
Productivity
•  Build and enhance predictive models directly on Big Data
•  High-level language – Declarative Machine Learning Language (DML)
  •  E.g. 1500 lines of Java code boil down to 15 lines of DML code
•  Parallel SPSS data mining algorithms implementable in DML
Optimization
•  Compile algorithms into optimized parallel code
•  For different clusters and different data characteristics
•  E.g. 1 hr. execution (hand-coded) down to 10 mins

[Chart: Execution Time (sec) versus # non zeros (million); series: Java Map-Reduce, SystemML, Single node R]
Workload Optimization
  Optimized performance for big data analytic workloads

Adaptive MapReduce
  §  Algorithm to optimize execution time of multiple small jobs
  §  Performance gains of 30%; reduces overhead of task startup

Hadoop System Scheduler
  §  Identifies small and large jobs from prior experience
  §  Sequences work to reduce overhead

[Diagram: Task flow]
•  Map (break task into small parts)
•  Adaptive Map (optimization: order small units of work)
•  Reduce (many results to a single result set)
InfoSphere BigInsights – Embrace and Extend Hadoop
[Architecture diagram; a legend marks each component as IBM or Open Source]
Analytics: ML Analytics, Text Analytics, BigSheets
Application: Pig, Hive, Jaql, MapReduce, AdaptiveMR, FLEX, BigIndex, Oozie, Lucene
Also shown: Zookeeper, Avro, IBM LZO Compression
Storage: HBase, HDFS, GPFS-SNC
Data Sources/Connectors: Streams, Data Stage, Flume, Netezza, DB2, JDBC, BoardReader, CSV / XML / JSON, Web Crawler, R, SPSS

Interface
Web console
•  Monitor cluster health
•  Add / remove nodes
•  Start / stop services
•  Inspect job status
•  Inspect workflow status
•  Deploy apps
•  Launch apps / jobs
•  Work with distrib. file system
•  Work with spreadsheet interface
•  Support REST-based API
•  . . .
Eclipse plug-ins
•  Text analytics
•  MapReduce programming
•  Jaql development
•  Hive query development
Ways to get started with BigInsights
In the Cloud
•  Via RightScale, or directly on Amazon, Rackspace, IBM
   Smart Enterprise Cloud, or on private clouds.
•  Pay only for the resources used.

In the Virtual Classroom
•  Free Hadoop Fundamentals training course
   www.bigdatauniversity.com
  •  e.g. BD105EN - Text Analytics Essentials

On Your Cluster
•  Download Basic Edition from ibm.com.
In the Classroom
•  Enroll in the InfoSphere BigInsights Essentials course.
Visit the BigInsights technical portal . . . .
Free links to papers, demos, discussion forum, and more
http://www.ibm.com/developerworks/wiki/biginsights/
Streams – analytical platform for in-motion “Big Data”
Built to analyze data in motion
•  Multiple concurrent input streams
•  Massive scalability

Process and analyze a variety of data
•  Structured, unstructured content, video, audio
•  Advanced analytic operators

[Diagram: IBM Big Data Platform stack: Analytic Applications; Visualization & Discovery, Application Development, Systems Management; Accelerators; Hadoop System, Stream Computing, Data Warehouse; Information Integration & Governance]
Stream Computing – Analyze Data in Motion

Traditional Computing
•  Historical fact finding
•  Find and analyze information stored on disk
•  Batch paradigm, pull model
•  Query-driven: submits queries to static data
•  Flow: Query, Data, Results

Stream Computing
•  Current fact finding
•  Analyze data in motion – before it is stored
•  Low latency paradigm, push model
•  Data driven – bring the data to the query
•  Flow: Data, Query, Results
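The difference between the pull and push models can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not InfoSphere Streams code: the class PushVsPullSketch and the names queryStored and StandingQuery are hypothetical. In the pull case the query visits data that is already stored; in the push case the query is registered first and each arriving item is brought to it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical contrast of query-driven (pull) and data-driven (push) processing.
final class PushVsPullSketch {

    // Pull model: the data is at rest and a query is submitted against it on demand.
    static <T> List<T> queryStored(List<T> stored, Predicate<T> query) {
        List<T> results = new ArrayList<>();
        for (T item : stored) {
            if (query.test(item)) {
                results.add(item);
            }
        }
        return results;
    }

    // Push model: the query stands still and every new item is pushed through it.
    static final class StandingQuery<T> {
        private final Predicate<T> condition;
        private final Consumer<T> onResult;

        StandingQuery(Predicate<T> condition, Consumer<T> onResult) {
            this.condition = condition;
            this.onResult = onResult;
        }

        // Called once per arriving item; nothing is stored before evaluation.
        void push(T item) {
            if (condition.test(item)) {
                onResult.accept(item);
            }
        }
    }
}
```

A caller would create one StandingQuery per condition of interest and invoke push from whatever receives the data, for example a socket reader. Results are then produced as soon as a matching item arrives instead of when someone next runs the query.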
Why InfoSphere Streams?
Applications that require on-the-fly processing, filtering and analysis of
streaming data
•  Sensors: environmental, industrial, surveillance video, GPS, …
•  “Data exhaust”: network/system/web server/app server log files
•  High-rate transaction data: financial transactions, call detail records


Criteria: two or more of the following
•  Messages are processed in isolation or in limited data windows
•  Sources include non-traditional data (spatial, imagery, text, …)
•  Sources vary in connection methods, data rates, and processing requirements,
   presenting integration challenges
•  Data rates/volumes require the resources of multiple processing nodes
•  Analysis and response are needed with sub-millisecond latency
•  Data rates and volumes are too great for store-and-mine approaches
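The first criterion, processing within limited data windows, can be illustrated with a count-based sliding window. The sketch below is an assumption-laden example in plain Java, not a Streams operator: the class SlidingWindowAverage is hypothetical, and it keeps only the last few values so that memory use stays constant no matter how much data flows through.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical count-based sliding window that keeps a running average.
final class SlidingWindowAverage {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    SlidingWindowAverage(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        this.size = size;
    }

    // Adds a value, evicts the oldest one if the window is full,
    // and returns the average over the values currently in the window.
    double add(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }
}
```

With a window of three, feeding 1, 2, 3 and then 10 yields averages over (1), (1, 2), (1, 2, 3) and (2, 3, 10): the oldest value drops out as soon as a fourth arrives. Time-based windows follow the same pattern but evict by timestamp rather than by count.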
Massively Scalable Stream Analytics
Linear Scalability
§  Clustered deployments – unlimited scalability
Automated Deployment
§  Automatically optimize operator deployment across clusters
Performance Optimization
§  JVM Sharing – minimize memory use
§  Fuse operators on same cluster
§  Telco client – 25 Million messages per second
Analytics on Streaming Data
§  Analytic accelerators for a variety of data types
§  Optimized for real-time performance

[Diagram: Streaming Data Sources feed the Streams Runtime, where deployments consist of Source Adapters, Analytic Operators and Sink Adapters; results go to Visualization. Streams Studio IDE provides Automated and Optimized Deployment.]
Streams approach illustrated

[Diagram: a stream of tuples flowing from operator to operator]
Input tuples (directory, filename): ("/img", "farm"), ("/img", "bird"), ("/opt", "java"), ("/img", "cat")
Output tuples (height, width, data): (640, 480, data), (1280, 1024, data), (640, 480, data)
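The illustration can be read as a small operator pipeline: tuples carrying a directory and a filename go in, and tuples carrying image dimensions come out. The sketch below assumes that reading, uses plain Java rather than SPL, and makes two labeled assumptions: that the pipeline keeps only tuples from "/img", and that each filename maps to the dimensions shown in the order drawn. The class names and the DIMENSIONS lookup table are hypothetical stand-ins for real image decoding.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical two-operator pipeline over tuples: filter by directory, then attach dimensions.
final class TuplePipelineSketch {

    // Input tuple with the two attributes shown in the illustration.
    static final class FileTuple {
        final String directory;
        final String filename;

        FileTuple(String directory, String filename) {
            this.directory = directory;
            this.filename = filename;
        }
    }

    // Output tuple with height and width (the data attribute is left out of this sketch).
    static final class ImageTuple {
        final String filename;
        final int height;
        final int width;

        ImageTuple(String filename, int height, int width) {
            this.filename = filename;
            this.height = height;
            this.width = width;
        }
    }

    // Assumed lookup standing in for actually opening the image file.
    private static final Map<String, int[]> DIMENSIONS = new HashMap<>();
    static {
        DIMENSIONS.put("farm", new int[] {640, 480});
        DIMENSIONS.put("bird", new int[] {1280, 1024});
        DIMENSIONS.put("cat", new int[] {640, 480});
    }

    // Operator 1 filters on the directory attribute; operator 2 maps to image tuples.
    static List<ImageTuple> process(List<FileTuple> input) {
        return input.stream()
                .filter(t -> "/img".equals(t.directory))
                .filter(t -> DIMENSIONS.containsKey(t.filename))
                .map(t -> {
                    int[] d = DIMENSIONS.get(t.filename);
                    return new ImageTuple(t.filename, d[0], d[1]);
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FileTuple> input = Arrays.asList(
                new FileTuple("/img", "farm"), new FileTuple("/img", "bird"),
                new FileTuple("/opt", "java"), new FileTuple("/img", "cat"));
        for (ImageTuple t : process(input)) {
            System.out.println(t.filename + " " + t.height + "x" + t.width);
        }
    }
}
```

A streaming runtime would apply the same two operators to an unbounded sequence of tuples rather than to a finite list; the list is used here only to keep the example self-contained.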
InfoSphere Streams for superior real time analytic processing
Streams Processing Language (SPL) built for Streaming applications:
•  Reusable operators
•  Rapid application development
•  Continuous “pipeline” processing

Compile groups of operators into single processes:
•  Efficient use of cores
•  Distributed execution
•  Very fast data exchange
•  Can be automatic or tuned
•  Scaled with push of a button

Use the data that gives you a competitive advantage:
•  Can handle virtually any data type
•  Use data that is too expensive and time sensitive for traditional approaches

Easy to extend:
•  Built in adaptors
•  Users add capability with familiar C++ and Java

Dynamic analysis:
•  Programmatically change topology at runtime
•  Create new subscriptions
•  Create new port properties

Easy to manage:
•  Automatic placement
•  Extend applications incrementally without downtime
•  Multi-user / multiple applications

Flexible and high performance transport:
•  Very low latency
•  High data rates
Streams Studio Integrated Development Environment




Compiler Framework
Operator Fusion
•  Fine-grained operators
•  From small parts, make larger ones that fit
Code generation
•  Generates code to match the underlying runtime environment
  •  Number of cores
  •  Interconnect characteristics
  •  Architecture-specific instructions
•  Driven by automatic profiling
•  Compiler-based optimization
•  Driven by incremental learning of application characteristics

[Diagram: Logical app view (many fine-grained operators) and Physical app view (operators fused into larger units)]
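Operator fusion can be pictured as function composition: several fine-grained operators are combined into one larger unit, so that a tuple passes between them as a direct call rather than through a queue or network hop. The following is a minimal sketch of that idea in plain Java, assuming same-typed operators for simplicity; OperatorFusionSketch and fuse are hypothetical names, and the real compiler decides what to fuse based on profiling, which this example does not model.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of fusing a chain of operators into a single one.
final class OperatorFusionSketch {

    // Composes the operators in order; the result runs them back to back in one call.
    static <T> Function<T, T> fuse(List<Function<T, T>> operators) {
        Function<T, T> fused = Function.identity();
        for (Function<T, T> op : operators) {
            fused = fused.andThen(op);
        }
        return fused;
    }
}
```

For example, fusing a trim operator, an upper-casing operator and an operator that appends "!" gives a single function that turns "  hi " into "HI!". The logical view still has three operators, while the physical view has one unit of execution.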
Streams Data Mining Toolkit
Enables scoring of real-time data in a Streams application
•  Scoring is performed against a predefined model
•  Supports a variety of model types and scoring algorithms

Models represented in Predictive Model Markup Language (PMML)
  •  Standard for statistical and data mining models
  •  XML Representation

Toolkit provides four Streams operators to enable scoring
•  Classification
•  Clustering
•  Regression
•  Associations
The toolkit supports dynamic replacement of the PMML model used by an
operator.
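Dynamic model replacement means the scoring operator keeps running while the model it scores against is swapped. One way to sketch this in plain Java, assuming the model can be represented as a simple numeric function, is to hold the current model in an atomic reference. Nothing here parses PMML or uses the toolkit's API; ScoringOperatorSketch, score and replaceModel are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.DoubleUnaryOperator;

// Hypothetical scoring operator whose model can be replaced while tuples keep arriving.
final class ScoringOperatorSketch {
    // The current model; an atomic reference lets another thread swap it safely.
    private final AtomicReference<DoubleUnaryOperator> model;

    ScoringOperatorSketch(DoubleUnaryOperator initialModel) {
        this.model = new AtomicReference<>(initialModel);
    }

    // Scores one incoming value against whichever model is current at that moment.
    double score(double input) {
        return model.get().applyAsDouble(input);
    }

    // Installs a new model; subsequent calls to score use it without a restart.
    void replaceModel(DoubleUnaryOperator newModel) {
        model.set(newModel);
    }
}
```

In this sketch each tuple is scored entirely by one model, either the old or the new, because the reference is read once per call. A production operator would also have to load and validate the new model before installing it.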
Without a Big Data Platform You Code…
•  Event Handling
•  Custom SQL and Scripts
•  Multithreading
•  Check Pointing
•  Application Management
•  HA
•  Performance Optimization
•  Debug
•  Connectors
•  Security

IBM Big Data Platform
•  Accelerators and Toolkits: over 100 sample applications and toolkits with industry focused toolkits with 300+ functions and operators
•  Streams provides development, deployment, runtime, and infrastructure services

“TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language…”
– Alex Philip, CEO and President, TerraEchos
Streams Redbook
redbooks.ibm.com/abstracts/sg247970.html


This book is intended for professionals who need to understand how to process
high volumes of streaming data, or who need information about how to implement
systems that satisfy those requirements.
Right-time actions are taken in the new BI/BA ecosystem
 • Three routes to analytics
 • Application and workload optimized appliances and systems
 • Fast data movement and integration

Diagram: three routes from data sources to results
•  Traditional warehouse: Traditional / Relational Data Sources → Database & Warehouse → At-Rest Data Analytics → Results
•  Streams: Non-Traditional / Non-Relational Data Sources → In-Motion Analytics → Ultra Low Latency Results
•  Internet scale: Non-Traditional / Non-Relational and Traditional / Relational Data Sources → InfoSphere BigInsights (Internet Scale) → Data Analytics, Data Operations & Model Building → Results


Example of 360° customer view

Diagram components:
•  Business Processes
•  Events and Alerts, Master Data Management, Campaign Management, Cognos Consumer Insight
•  Big Data Platform
   •  Website Logs, Social Media → Internet Scale Analytics → Web Traffic and Social Media Insight
   •  Call Detail Records → Streaming Analytics → Call Behavior and Experience Insight
•  Information Integration, Data Warehouse
IBM's Big Data Platform

InfoSphere BigInsights and InfoSphere Streams
