Lily
A SMART DATA PLATFORM
MAKING BIG DATA APPS EASY



     IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
[Figure: plot of volume against time with curves "data" and "moore", annotated "need for distributed processing"]




need for
distributed
processing




distributed
systems are
hard.
data shuffling, data duplication



  database                     data warehouse                          analytics

“Top-performing organizations are twice as likely to apply analytics to activities.”
(MIT Sloan Management Review, Winter 2011)



Heavy Math?
lots of
data?
what drives insights?




your audience does.




[Diagram: data, audience data, insights, recommendations]
LILY combines
scalable storage,
indexing and search
with real-time usage
metrics, insights and
recommendations.
...on top of CDH

       Cloudera’s Distribution including Apache Hadoop




lily features

schema
lily data model
» adds a high-level data model on top of HBase's raw byte[] values




value types
» basic value types
 » string
 » integer
 » long
 » double
 » decimal
 » boolean
 » date
 » datetime
 » blob
 » uri
» parametrized value types
 » list
 » path
 » link
 » record
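
The idea of typed values layered over raw byte[] cells can be sketched roughly as below. `ValueCodec` and its method names are hypothetical, chosen for illustration only; this is not Lily's actual serialization format, just one plausible way a few of the basic value types could round-trip through bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical codec: illustrates typed values over raw byte[] cells.
// This is an assumption for illustration, not Lily's real encoding.
final class ValueCodec {
    private ValueCodec() {}

    // string value type: stored as UTF-8 bytes
    static byte[] encodeString(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    static String decodeString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // long value type: stored as 8 big-endian bytes
    static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    static long decodeLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    // boolean value type: stored as a single byte, 1 for true and 0 for false
    static byte[] encodeBoolean(boolean value) {
        return new byte[] { (byte) (value ? 1 : 0) };
    }

    static boolean decodeBoolean(byte[] bytes) {
        return bytes[0] != 0;
    }

    public static void main(String[] args) {
        byte[] cell = encodeLong(42L);
        System.out.println(cell.length + " bytes, decoded: " + decodeLong(cell));
    }
}
```

Client code then works with `String`, `long` or `boolean` and never touches the byte layout directly, which is the point of putting a data model on top of the store.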

why a schema?

» contract for client apps
» validation
» application lifecycle management
  » schema versioning
  » cf. Avro
» content-based indexing
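
As a rough illustration of the "contract" and "validation" points above, here is a minimal sketch of a versioned record type that checks incoming records. `RecordTypeSketch`, its fields and its error messages are all hypothetical and do not reflect Lily's schema API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type: field names mapped to expected Java types.
// Names and structure are illustrative assumptions, not Lily's schema API.
final class RecordTypeSketch {
    final String name;
    final int version; // schema version, bumped when the field set changes
    private final Map<String, Class<?>> fields = new LinkedHashMap<>();

    RecordTypeSketch(String name, int version) {
        this.name = name;
        this.version = version;
    }

    RecordTypeSketch field(String fieldName, Class<?> type) {
        fields.put(fieldName, type);
        return this;
    }

    // Returns a list of problems; an empty list means the record is valid.
    List<String> validate(Map<String, Object> record) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, Object> entry : record.entrySet()) {
            Class<?> expected = fields.get(entry.getKey());
            if (expected == null) {
                errors.add("unknown field: " + entry.getKey());
            } else if (!expected.isInstance(entry.getValue())) {
                errors.add("wrong type for field: " + entry.getKey());
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        RecordTypeSketch book = new RecordTypeSketch("Book", 1)
                .field("title", String.class)
                .field("pages", Long.class);
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("title", "Example title");
        record.put("pages", "three hundred"); // wrong type on purpose
        System.out.println(book.validate(record));
    }
}
```

The same type information that drives validation is what would let an indexer decide how to treat each field, which is presumably what "content-based indexing" refers to.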



versioning
record version and schema version




Versioning scopes




» data gets overwritten
» versioned data
» version status data
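
One way to read the three scopes is sketched below. The enum constants are named after the slide labels, and the semantics (overwritten data has one current value, versioned data creates a new record version on every write, version status data is attached to an existing version and can change without creating a new one) are an interpretation, not a statement of how Lily implements them. `ScopedRecordSketch` is a hypothetical in-memory class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory record illustrating three versioning scopes.
// The semantics are an assumed reading of the slide, not Lily's implementation.
final class ScopedRecordSketch {
    enum Scope { OVERWRITTEN, VERSIONED, VERSION_STATUS }

    // Data that is simply overwritten: one current value per field.
    private final Map<String, Object> current = new HashMap<>();
    // One snapshot per record version (list index 0 holds version 1).
    private final List<Map<String, Object>> versions = new ArrayList<>();
    // Per-version data that may still change after the version exists.
    private final List<Map<String, Object>> versionStatus = new ArrayList<>();

    void write(Scope scope, String field, Object value) {
        switch (scope) {
            case OVERWRITTEN:
                current.put(field, value);
                break;
            case VERSIONED: {
                // Copy the latest snapshot, apply the change, store it as a new version.
                Map<String, Object> next = new HashMap<>();
                if (!versions.isEmpty()) {
                    next.putAll(versions.get(versions.size() - 1));
                }
                next.put(field, value);
                versions.add(next);
                versionStatus.add(new HashMap<String, Object>());
                break;
            }
            case VERSION_STATUS:
                if (versions.isEmpty()) {
                    throw new IllegalStateException("no version to attach status to");
                }
                // Changes the latest version's status in place; no new version.
                versionStatus.get(versions.size() - 1).put(field, value);
                break;
        }
    }

    int versionCount() { return versions.size(); }

    Object readCurrent(String field) { return current.get(field); }

    Object readVersioned(int version, String field) {
        return versions.get(version - 1).get(field);
    }

    Object readStatus(int version, String field) {
        return versionStatus.get(version - 1).get(field);
    }
}
```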



API
builder API
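
The code shown on the builder API slides did not survive extraction. As a stand-in, the sketch below shows the general shape of a fluent record builder. `RecordSketch`, `recordType`, `field` and `build` are hypothetical names for illustration and should not be taken as Lily's actual builder API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical fluent builder; an illustration of the pattern only,
// not a reproduction of the Lily builder API from the slides.
final class RecordSketch {
    final String recordType;
    final Map<String, Object> fields;

    private RecordSketch(String recordType, Map<String, Object> fields) {
        this.recordType = recordType;
        this.fields = Collections.unmodifiableMap(fields);
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private String recordType;
        private final Map<String, Object> fields = new LinkedHashMap<>();

        Builder recordType(String name) {
            this.recordType = name;
            return this;
        }

        Builder field(String name, Object value) {
            fields.put(name, value);
            return this;
        }

        // Copies the collected fields so later builder calls cannot alter the record.
        RecordSketch build() {
            if (recordType == null) {
                throw new IllegalStateException("record type is required");
            }
            return new RecordSketch(recordType, new LinkedHashMap<>(fields));
        }
    }

    public static void main(String[] args) {
        RecordSketch book = RecordSketch.builder()
                .recordType("Book")
                .field("title", "Example title")
                .field("pages", 300L)
                .build();
        System.out.println(book.recordType + " " + book.fields);
    }
}
```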
other APIs
» REST (HTTP + JSON)
» Java (Avro)
[Diagram: Java Builder API, REST API, Java API]




indexing
features

» indexing configuration
» denormalization + link dereferencing
» indexing of multiple versions
» incremental and batch (MapReduce) index updates
» blob content extraction (Apache Tika)
» Solr index sharding (!)



indexer configuration




search



simple:
consistency


consistency

» many things can go wrong
 » service failure: HBase, HDFS, Solr, Lily
 » network failure, time skew
 » node failure
» no Lily master node
 » services can pick up where failed ones left off
 » ZooKeeper as service coordinator + lookup
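
The "no master node" idea, where a surviving service takes over work from a failed one, can be sketched with an in-memory stand-in for the coordinator. `CoordinatorSketch` below is hypothetical and only imitates the role a coordination service plays (tracking which services are alive and releasing their claims when they disappear); it does not use the ZooKeeper client API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory coordinator. A real deployment would rely on a
// coordination service for liveness tracking; this class only imitates that role.
final class CoordinatorSketch {
    private final Set<String> liveServices = new HashSet<>();
    private final Map<String, String> taskOwner = new HashMap<>();

    synchronized void register(String serviceId) {
        liveServices.add(serviceId);
    }

    // A claim succeeds if the service is live and the task is unowned
    // (or already owned by the same service).
    synchronized boolean tryClaim(String task, String serviceId) {
        if (!liveServices.contains(serviceId)) {
            return false;
        }
        String existing = taskOwner.putIfAbsent(task, serviceId);
        return existing == null || existing.equals(serviceId);
    }

    // When a service fails, all of its claims are released so that
    // any remaining service can pick the work up.
    synchronized void markFailed(String serviceId) {
        liveServices.remove(serviceId);
        taskOwner.values().removeIf(serviceId::equals);
    }

    synchronized String ownerOf(String task) {
        return taskOwner.get(task);
    }

    public static void main(String[] args) {
        CoordinatorSketch coordinator = new CoordinatorSketch();
        coordinator.register("node-a");
        coordinator.register("node-b");
        coordinator.tryClaim("indexer-queue", "node-a");
        coordinator.markFailed("node-a");
        coordinator.tryClaim("indexer-queue", "node-b");
        System.out.println("owner: " + coordinator.ownerOf("indexer-queue"));
    }
}
```

Because every node runs the same claim logic, no node is special, which is the property the slide contrasts with a master-based design.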



architecture

[Diagram: Lily and the CDH stack, with these boxes:]
» High-level data model / easy API
» indexes
» Search
» usage metrics, analytics & recommendations
» UI Framework (HUE)
» SDK (HUE SDK)
» Workflow (OOZIE)
» Scheduling (OOZIE)
» Metadata (HIVE)
» Languages / Compilers (PIG, HIVE)
» Data Integration (FLUME, SQOOP)
» Fast Read/Write Access (HBASE)
» Coordination (ZOOKEEPER)
» Dev2Dev tutoring, integrated deployment and enterprise support
» CDH
falling in love with HBase
  HBase = Google Bigtable, open source
» datamodel with column families and cell versioning: flexible, for sparse data
» ordered tables with range scans
» HDFS for blob storage
» Apache
» consistent!
» atomic single-row updates
» automatic scaling to large data sets
» fault-tolerance
» commodity hardware
» efficient random access
» M/R for index regeneration
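
The "ordered tables with range scans" property can be illustrated with a sorted in-memory map. `SortedTableSketch` is a hypothetical stand-in, not the HBase client API; it only mirrors the idea that rows are kept sorted by row key, so a scan from a start key (inclusive) to a stop key (exclusive) returns a contiguous slice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for a table sorted by row key; not the HBase client API.
final class SortedTableSketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Random access by exact row key.
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    // Returns the row keys in [startKey, stopKey), in sorted order.
    List<String> scan(String startKey, String stopKey) {
        return new ArrayList<>(rows.subMap(startKey, true, stopKey, false).keySet());
    }

    public static void main(String[] args) {
        SortedTableSketch table = new SortedTableSketch();
        table.put("user#002", "b");
        table.put("video#001", "c");
        table.put("user#001", "a");
        // '~' sorts after digits and letters, so this bounds all "user#" keys.
        System.out.println(table.scan("user#", "user#~"));
    }
}
```

Choosing row keys so that related rows share a prefix is what makes such scans useful in practice.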

Lily Architecture (deployment)




Lily Architecture (components)




HBase indexing & RowLog Library

HBase indexing:
» building and querying indexes, GAE-style

  content table
  rowkey    col     col
  A         val3    foo6
  B         val2    foo7

  index table A (rowkeys, in order)
  rowkey    col
  val2-B
  val3-A

RowLog:
» need for sync/async operations
  » updating of secondary indexes (e.g. link tables)
  » feeding of Indexer (= indexes Lily-content into Solr)
» not: transactions
» need for distribution and durability
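
The content table and index table from the slide can be reproduced with a small sketch. `SecondaryIndexSketch` is hypothetical and in-memory: it assumes index row keys of the form value-rowkey (as in val2-B and val3-A) and that indexed values contain no "-" themselves. Lookups then become prefix range scans over the sorted index keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory secondary index. Index keys have the form
// "value-rowkey"; values are assumed not to contain '-'.
final class SecondaryIndexSketch {
    // content table: row key -> indexed column value
    private final TreeMap<String, String> content = new TreeMap<>();
    // index table: sorted composite keys, no separate cell values needed
    private final TreeSet<String> index = new TreeSet<>();

    void put(String rowKey, String value) {
        String old = content.put(rowKey, value);
        if (old != null) {
            index.remove(old + "-" + rowKey); // drop the stale index entry
        }
        index.add(value + "-" + rowKey);
    }

    // Prefix scan over the index: all row keys whose column equals the value.
    List<String> findByValue(String value) {
        String prefix = value + "-";
        List<String> rowKeys = new ArrayList<>();
        for (String key : index.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // past the end of this value's range
            }
            rowKeys.add(key.substring(prefix.length()));
        }
        return rowKeys;
    }

    List<String> indexKeys() {
        return new ArrayList<>(index);
    }

    public static void main(String[] args) {
        SecondaryIndexSketch table = new SecondaryIndexSketch();
        table.put("A", "val3");
        table.put("B", "val2");
        System.out.println(table.indexKeys());
        System.out.println(table.findByValue("val2"));
    }
}
```

In this sketch the content write and the index write happen together in one method. In a distributed store they are separate writes that can fail independently, which is presumably why the slide pairs indexing with a durable log of sync/async follow-up operations rather than transactions.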

WHERE?



                  www.lilyproject.org




lily enterprise


» adds tools:
 » Red Hat/Debian package repo
 » cluster deploy tools (based on Whirr)
 » Administration UI
» + enterprise support



food for
(outer)thought


from analytics to recommendations
» shorten distance between transactional and analytical aspects
» (near) real-time analytics > Insights
  » "people are buying this now"
» algorithmic feedback > data-backed feedback
  » based on Insights > recommendations
  » shorten feedback cycle
» store data + metadata + attention data in one data system
  » basis for analytical queries
  » single point of growth
  » incremental insights

challenges ahead
» solving real-time aspects
 » in a distributed environment
» terabyte-sized snapshots & backups?
 » build resilience into data architecture
   (e.g. against operator error)
» presenting big data insights: new UIs
 » less static / report-driven
 » more 'Minority Report'
 » from near-real-time data to real-time decision systems

conclusion: 4 trends


» real-time is rapidly becoming really important
» agility with complex/dynamic information
» predictive analysis
» one store for operational data + analytics




Thank you!
for your attention
for your questions

» stevenn@outerthought.org
» @stevenn



Hadoop World 2011: Lily: Smart Data at Scale, Made Easy
