Learning
Lessons
Building a content repository
on top of NoSQL Technologies


      IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
hello,
I’m @stevenn from @outerthought




This story is about




Complexity

[Chart: complexity (y-axis) against age (x-axis) of a software architecture, with versions 1.0, 2.0 and 3.0 marked]

Complexity

[Chart: complexity and user interest (y-axis) against age (x-axis), with versions 1.0, 2.0 and 3.0 marked]

We Prefer Sophistication




» the challenge for us was to scale ... without dropping features

The typical CMS ‘architecture’

  client (+cache)
  even more cache
  more cache
  application | cache
  database (+ opt. filesystem) (+ opt. full-text indexes)

» built up one layer per slide, bottom to top: database, application + cache, more cache, even more cache, client, client-side cache

What we found hard to scale
» access control
» facet browsing
» all the nifty stuff people were using our software for
» ... anything that required random access to in-memory-cache data for computations

Beyond the ‘scaling’ problem
» three-prong data layer: database, full-text index, fs
 » result set merging (between MySQL & Lucene)
   » happened in appcode/memory
 » ‘transactions’, set operations = hard

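To make the merging pain concrete, here is a minimal sketch of what joining a structured result set with a full-text result set in application memory can look like. The function name, the document ids and the scores are all made up for illustration; the slides do not show the actual code.

```python
def merge_results(sql_ids, fulltext_hits):
    """Keep the full-text ranking, but only for ids the structured query matched.

    sql_ids: set of document ids returned by the structured (SQL) query.
    fulltext_hits: list of (doc_id, score) pairs, best score first.
    Both inputs must be fully materialised in memory before merging.
    """
    return [doc_id for doc_id, _score in fulltext_hits if doc_id in sql_ids]


# Hypothetical data: ids matched by SQL, and ranked hits from the text index.
sql_ids = {1, 2, 3, 5}
fulltext_hits = [(5, 0.9), (4, 0.8), (2, 0.4)]
merged = merge_results(sql_ids, fulltext_hits)
print(merged)
```

The cost grows with the size of both result sets, not with the size of the final answer, which is one way to read "happened in appcode/memory".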
Beyond the three-prong problem




» errrr..... “Failover” ..... ?




If we were able to add more nodes ...

» True Distribution: scalability, availability, performance

... in the line of fire

Solution I

» do MORE inside the database

Functional




Infrastructural

» more database!
» even more database!
» let’s add message busses!
» RMI! JMS over JDBC! w00t! stuff!

http://bigdatamatters.com/bigdatamatters/2010/04/high-availability-with-oracle.html

Business Development 101

[Chart: user interest (y-axis) against budget (x-axis)]

Solution II

[Chart: sophistication and ability to cope (y-axis) for versions 1.0, 2.0 and 3.0, with the labels "mysql" and "nosql?"]

Enter The Cambrian Explosion

[Figure: a cloud of NoSQL project logos; the surviving labels are Cassandra and neo4j]

Requirements, phase I
» automatic scaling to large data sets

» fault-tolerance: replication, automatic handling of failing nodes

» a flexible data model supporting sparse data

» runs on commodity hardware

» efficient random access to data

» open source, with the ability to participate in development and so help steer the project

» some preference for a Java-based solution

Requirements, phase II

» After careful consideration, we realized these choices also mattered:
 » consistency: no chance of having two conflicting versions of a row
 » atomic updates of a single row, single-row transactions
 » bonus points for MapReduce integration
   » e.g. full-text index rebuilding

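A small sketch of what "atomic updates of a single row" buys you: a compare-and-set on one row lets two writers race without ever producing conflicting versions. The Row class and its check_and_put method are hypothetical, in-memory stand-ins written for this illustration; HBase's client API offers a checkAndPut call in the same spirit, but the slides do not say how Lily uses it.

```python
import threading


class Row:
    """Toy single row with an atomic check-and-put (illustration only)."""

    def __init__(self, fields=None):
        self._lock = threading.Lock()
        self._fields = dict(fields or {})

    def get(self, column):
        with self._lock:
            return self._fields.get(column)

    def check_and_put(self, column, expected, updates):
        """Apply updates only if column still holds expected; all or nothing."""
        with self._lock:
            if self._fields.get(column) != expected:
                return False  # someone else changed the row first
            self._fields.update(updates)
            return True


row = Row({"version": 1, "title": "draft"})

# Writer 1 read version 1 and wins the race.
first = row.check_and_put("version", 1, {"version": 2, "title": "final"})

# Writer 2 also read version 1, but its view is now stale, so it is rejected.
second = row.check_and_put("version", 1, {"version": 2, "title": "other"})
```

The rejected writer has to re-read the row and retry, which is the usual price of optimistic concurrency on a single row.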
That brought us to HBase, which bought us:
» a data model where some column families keep all versions and others do not, which maps well onto our CMS document model
» ordered tables with range scans, which let us build scalable indexes on top of them
» HDFS, a convenient place to store large blobs
» Apache license and community, a familiar environment for us

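The versioning point can be sketched in a few lines. This is a toy in-memory model, not the HBase API: the Family class, its max_versions parameter and the integer timestamps are assumptions made for this illustration.

```python
class Family:
    """Toy column family: keeps all versions, or only the newest N."""

    def __init__(self, max_versions=None):
        self.max_versions = max_versions  # None means keep everything
        self.cells = {}  # qualifier -> list of (timestamp, value), newest first

    def put(self, qualifier, timestamp, value):
        versions = self.cells.setdefault(qualifier, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        if self.max_versions is not None:
            del versions[self.max_versions:]  # drop the oldest versions

    def get(self, qualifier, timestamp=None):
        """Newest value, or the newest value at or before timestamp."""
        for ts, value in self.cells.get(qualifier, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None


versioned = Family()                  # e.g. document fields with history
latest_only = Family(max_versions=1)  # e.g. fields where history is not needed

versioned.put("title", 1, "First draft")
versioned.put("title", 2, "Final title")
latest_only.put("state", 1, "draft")
latest_only.put("state", 2, "published")
```

Reading versioned at timestamp 1 still returns the old title, while latest_only has nothing left from timestamp 1.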
» OK, so now we had a data store !




» However, content repository = store + search ... ouch!

» That was easy! (however ...)

Search ponderings

» CMS = two types of search
 » structured search
  » numbers, strings
  » based on logic (SQL, anyone?)
 » information retrieval (or: full-text search)
  » text
  » based on statistics

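The difference between the two kinds of search shows up clearly in a toy example. Structured search is a yes/no predicate over fields; full-text search scores documents with term statistics and returns a ranking. The documents, the field names and the simple TF-IDF weighting below are all assumptions chosen for this sketch; it is not how Lucene or SOLR score.

```python
import math
from collections import Counter

docs = {
    "A": {"year": 2010, "text": "hbase stores sparse data"},
    "B": {"year": 2009, "text": "lucene indexes text and more text"},
    "C": {"year": 2010, "text": "solr adds facets on top of lucene"},
}


def structured(predicate):
    """Logic: a document either matches the predicate or it does not."""
    return sorted(key for key, doc in docs.items() if predicate(doc))


def full_text(query):
    """Statistics: rank documents by a simple TF-IDF score, best first."""
    counts = {key: Counter(doc["text"].split()) for key, doc in docs.items()}
    scores = {}
    for term in query.split():
        df = sum(1 for c in counts.values() if term in c)
        if df == 0:
            continue  # term occurs nowhere, contributes nothing
        idf = math.log(len(docs) / df) + 1.0
        for key, c in counts.items():
            if term in c:
                scores[key] = scores.get(key, 0.0) + c[term] * idf
    return sorted(scores, key=lambda key: (-scores[key], key))


from_2010 = structured(lambda doc: doc["year"] == 2010)
ranked = full_text("lucene text")
```

Here from_2010 is an unordered match set, while ranked puts B ahead of C because "text" is both rarer and repeated in B.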
Search ponderings




» All of that, at scale




Structured Search
» HBase Indexing Library
 » idea from Google App Engine datastore indexes
 » http://code.google.com/appengine/articles/index_building.html

 content table                    index table A (ordered by rowkey)
 rowkey   col    col              rowkey   col
 A        val3   foo6             val2-B
 B        val2   foo7             val3-A

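The index-table idea can be sketched with a sorted list standing in for an ordered table. Everything here is a hypothetical in-memory model: OrderedTable, put_record and find_by_a are names invented for this illustration, and the sketch joins value and row key with a NUL byte instead of the "-" shown on the slide so that a prefix range scan stays unambiguous. It also ignores removing stale index entries when a value changes.

```python
import bisect


class OrderedTable:
    """Toy stand-in for a table whose row keys are kept in sorted order."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


SEP = "\x00"  # assumed separator between indexed value and content row key

content = OrderedTable()
index_a = OrderedTable()


def put_record(rowkey, fields):
    """Write the record, then an index row keyed on value + content row key."""
    content.put(rowkey, fields)
    index_a.put(fields["a"] + SEP + rowkey, rowkey)


def find_by_a(value):
    """Equality lookup on column a, done as a range scan over the index."""
    return [rowkey for _key, rowkey in index_a.scan(value + SEP, value + "\x01")]


put_record("A", {"a": "val3", "b": "foo6"})
put_record("B", {"a": "val2", "b": "foo7"})
matches = find_by_a("val2")
```

Because the index table is ordered, the same scan also serves range queries (for example every value between val2 and val3) without touching the content table.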
Full-text / IR search


» Lucene?
 » no sharding (for scale)
 » no replication (for availability)
 » batched index updates (not real-time)




Beyond Lucene
» Katta
  » scalable architecture, however only search, no indexing

» Elastic Search
  » very young (sorry)

» hbasene et al.
  » stores inverted index in HBase, might not scale all features

» SOLR
  » widely used, schema, facets, query syntax, cloud branch



More info: http://lilycms.org/lily/prerelease/technology.html

[Figure: an equation made of images; only the symbols "+", "=" and "?" survive]

» Easy! Or ?

Remember distribution ?
Remember secondary indexes ?




 ➙ Need for reliable queuing

Connecting things
» we needed a reliable bridge between our main storage (HBase) and our index/search server(s) (SOLR)
 » indexing, reindexing, mass reindexing (M/R)
» we needed a reliable method of updating HBase secondary indexes
» all of that eventually to run distributed
» distribution means coping with failure

Solution


» ACMEMessageQueue ? Bzzzzzt.
 We wanted fault-safe HBase persistence for the queues.
 Also for ease of administration.
» ➙ WAL & Queue implemented on top of HBase tables

WAL / Queue
» WAL
 » guaranteed execution of synchronous actions
 » call doesn’t return before secondary action finishes
 » e.g. update secondary indexes
 » if all goes well, size = #concurrent ops
 » will be useful/made available outside of Lily context as well!

» Queue
 » triggering of async actions
 » e.g. (re)index (updated) record with SOLR back-end
 » size depends on speed of back-end process

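The WAL/Queue split can be sketched as follows. This is an in-memory toy, whereas the talk describes both structures as persisted in HBase tables; the RowLog class and its method names are hypothetical and chosen only for this illustration.

```python
from collections import deque


class RowLog:
    """Toy WAL plus queue. Real persistence is left out of this sketch."""

    def __init__(self):
        self.wal = {}         # op id -> payload of not-yet-completed sync actions
        self.queue = deque()  # messages waiting for asynchronous processing
        self._next_id = 0

    def run_sync(self, payload, action):
        """WAL path: log the intent, run the action, clear the entry on success."""
        op_id = self._next_id
        self._next_id += 1
        self.wal[op_id] = payload
        action(payload)        # the caller blocks until this returns
        del self.wal[op_id]

    def recover(self, action):
        """Re-run whatever is still in the WAL, e.g. after a crash."""
        for op_id in sorted(self.wal):
            action(self.wal[op_id])
            del self.wal[op_id]

    def enqueue(self, message):
        """Queue path: return at once, work happens later."""
        self.queue.append(message)

    def drain(self, handler):
        """Process queued messages; a message is removed only after success."""
        done = 0
        while self.queue:
            handler(self.queue[0])
            self.queue.popleft()
            done += 1
        return done


log = RowLog()
secondary_index = {}


def update_index(payload):
    secondary_index[payload["value"]] = payload["row"]


def failing_action(payload):
    raise RuntimeError("index server unavailable")


log.run_sync({"row": "A", "value": "val3"}, update_index)
try:
    log.run_sync({"row": "B", "value": "val2"}, failing_action)
except RuntimeError:
    pass  # the entry for B stays in the WAL
pending_after_failure = len(log.wal)
log.recover(update_index)

log.enqueue("A")
log.enqueue("B")
reindexed = []
processed = log.drain(reindexed.append)
```

In this sketch an action may run more than once (for example if it fails halfway and is replayed), so secondary actions are best written to be idempotent. The WAL stays small when actions succeed, while the queue grows whenever the back-end is slower than the writers, which matches the sizing remarks on the slide.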
The Sum
» Lily model (records & fields)
» mapped onto HBase (=storage)
» indexed and searchable through SOLR
» using a WAL/Queue mechanism implemented in HBase
» runtime based on Kauri
» with client/server comms via Avro

Architecture

[Figure: architecture diagram, shown across two slides]

Roadmap

» Today = release of learning material (architecture, model, API, Javadoc)
 ➥ www.lilycms.org
 ➥ bit.ly/lilyprerelease

» Mid July = ‘proof of architecture’ release (Nearly there!)

» from there on, ca. 3-monthly releases leading up to Lily 1.0

bit.ly/lilyprerelease




License




» Apache




Business model
» Consulting, mentoring, turn-key projects

» Strong focus on partner relations
 » targeting vertical markets
 » geographic coverage
 » SaaS offerings

» Markets: media, finance, insurance, govt, heritage ... LOTS of semi-structured data
» Not: OLAP

More ?



» @outerthought

» www.lilycms.org/lily/prerelease.html




