Repository performance tuning
Jukka Zitting | Senior Developer

Agenda
- Performance tuning steps
- Repository internals
- Basic content access
- Batch processing
- Clustering
- Query performance
- Full text indexing
- Questions and answers

Performance tuning steps
- Step 1: Identify the symptom
  - Create a test case that consistently measures current performance
  - Define the performance target if the current level is unacceptable
  - Make sure that the test case and the target performance are really relevant
- Step 2: Identify the cause
  - Main suspects: hardware, repository, application, client
  - Revise the test case until the problem no longer occurs; for example with Selenium, JMeter, JUnit, Iometer
- Step 3: Identify/implement possible solutions
  - Change content, configuration, code or upgrade hardware
- Step 4: Verify results
  - If the target is not reached, iterate the process or revise the goal
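
Steps 1 and 4 both depend on a test case that produces comparable numbers from run to run. A minimal sketch of such a harness in Python, assuming the operation under test can be wrapped in a callable; the names `measure` and `meets_target`, the warm-up count and the target value are illustrative and not part of the talk:

```python
import statistics
import time


def measure(operation, warmup=3, runs=20):
    """Time `operation` repeatedly and summarise the samples in seconds."""
    for _ in range(warmup):
        operation()  # warm caches so the timed runs are comparable
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return {
        "runs": len(samples),
        "median": statistics.median(samples),
        "worst": max(samples),
    }


def meets_target(summary, target_seconds):
    """Step 4: compare the measured median against the agreed target."""
    return summary["median"] <= target_seconds


if __name__ == "__main__":
    # Hypothetical stand-in for the real test case (e.g. a page request).
    result = measure(lambda: sum(range(10_000)))
    print(result, meets_target(result, target_seconds=0.5))
```

Reporting the median next to the worst sample keeps a single slow outlier from hiding or exaggerating the symptom.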

Repository internals
- Data Store
- Persistence Manager
- Query Index
- Cluster Journal

Data Store
- Content-addressed storage for large binary properties
  - Arbitrarily sized binary streams
  - Addressed by MD5 hash
  - String properties not included, use UTF-8 to map to binary
- Fast delivery of binary content
  - Read directly from disk
  - Can also be read in ranges
- Improved write throughput
  - Multiple uploads can proceed concurrently (within hardware limits)
- Cheap copies
- Garbage collection used to reclaim disk space
- Logically shared by the entire cluster
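
Content addressing is what makes copies cheap: the identifier is derived from the bytes, so identical content is stored once. The following Python sketch illustrates the idea with an in-memory dictionary; it is not Jackrabbit's implementation, and the class name, method names and the configurable `algorithm` parameter are assumptions made for the example:

```python
import hashlib


class ContentAddressedStore:
    """In-memory stand-in for a content-addressed binary store."""

    def __init__(self, algorithm="md5"):
        self.algorithm = algorithm
        self.records = {}  # identifier (hex digest) -> bytes

    def add(self, data: bytes) -> str:
        identifier = hashlib.new(self.algorithm, data).hexdigest()
        # Identical content maps to the same identifier, so a second
        # upload or a copy adds no new record.
        self.records.setdefault(identifier, data)
        return identifier

    def read(self, identifier: str, offset: int = 0, length=None) -> bytes:
        """Return the whole record, or only the requested byte range."""
        data = self.records[identifier]
        end = len(data) if length is None else offset + length
        return data[offset:end]

    def collect_garbage(self, referenced):
        """Drop records whose identifier is no longer referenced."""
        unreferenced = set(self.records) - set(referenced)
        for identifier in unreferenced:
            del self.records[identifier]
        return len(unreferenced)
```

Because records are immutable and keyed by content, removing a property cannot delete the record directly; another node may still point at it. That is why reclaiming space is left to a separate garbage collection pass.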

Cluster Journal
- Journal of all persisted changes in the repository
  - Content changes
  - Namespace, nodetype registrations, etc.
- Used to keep all cluster nodes in sync
  - Observation events to all cluster nodes (see JackrabbitEvent.isExternal)
  - Search index updates
  - Internal cache invalidation
- Old events need to be discarded eventually
  - No notable performance impact, just extra disk space
  - Keep events for the longest possible time a node can be offline without getting completely recreated
- Logically shared by the entire cluster
- Writes synchronized over the entire cluster
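
The journal mechanism can be pictured as an append-only log plus a per-node "last seen revision" counter. This Python sketch is a simplified model under that assumption, not Jackrabbit code; `Journal`, `ClusterNode` and the event tuples are hypothetical names, and the boolean in each event plays the role that JackrabbitEvent.isExternal plays in the real API:

```python
class Journal:
    """Append-only list of change records shared by all cluster nodes."""

    def __init__(self):
        self.records = []  # list of (origin node id, change description)

    def append(self, origin, change):
        self.records.append((origin, change))
        return len(self.records)  # new journal revision


class ClusterNode:
    def __init__(self, node_id, journal):
        self.node_id = node_id
        self.journal = journal
        self.revision = 0  # last journal revision this node has seen
        self.events = []  # (change, is_external) pairs delivered locally

    def write(self, change):
        self.sync()  # catch up before appending
        self.revision = self.journal.append(self.node_id, change)
        self.events.append((change, False))

    def sync(self):
        """Replay records written by other nodes since the last sync."""
        for origin, change in self.journal.records[self.revision:]:
            if origin != self.node_id:
                self.events.append((change, True))  # external event
        self.revision = len(self.journal.records)
```

In this model a record can only be discarded safely once every node's `revision` has moved past it, which is the reason for keeping events as long as a node may stay offline.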

Persistence Manager
- Identifier-addressed storage for nodes and properties
  - Each node has a UUID, even if not mix:referenceable
  - Essentially a key-value store, even when backed by an RDBMS
  - Also keeps track of node references
- Bundles as units of content
  - Bundle = UUID, type, properties, child node references, etc.
  - Only large binaries stored elsewhere in the data store
  - Designed for balanced content hierarchies, avoid too many child nodes
- Atomic updates
  - A save() call persists the entire transient space as a single atomic operation
- One PM per workspace (and one for the shared version store)
- Logically (often also physically) shared across a cluster

Query Index
- Inverted index based on Apache Lucene
  - Flexible mapping from terms to node identifiers
  - Special handling for the path structure
- Mostly synchronous index updates
  - Long full text extraction tasks handled in background
  - Other cluster nodes will update their indexes at next cluster sync
- Everything indexed by default
  - Indexing configuration for tweaking functionality, performance and disk usage
- One index per workspace (and one for the shared version store)
- Not shared across a cluster, indexes are local to each cluster node
- See http://wiki.apache.org/jackrabbit/Search#Search_Configuration
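
A mapping from terms to node identifiers is what lets a property or full text constraint run in time proportional to the number of matches rather than the number of nodes. A toy Python version of that mapping follows; it stands in for Lucene only conceptually, and the class, its whitespace tokenisation and the sample node identifiers are assumptions for illustration:

```python
from collections import defaultdict


class InvertedIndex:
    """Maps (property, term) pairs to the identifiers of matching nodes."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, node_id, properties):
        for name, value in properties.items():
            for term in str(value).lower().split():
                self.postings[(name, term)].add(node_id)

    def remove(self, node_id):
        for ids in self.postings.values():
            ids.discard(node_id)

    def search(self, name, term):
        # A lookup touches only the posting list for this term, not
        # every node in the workspace.
        return sorted(self.postings.get((name, term.lower()), set()))


if __name__ == "__main__":
    index = InvertedIndex()
    index.add("n1", {"jcr:title": "Performance tuning"})
    index.add("n2", {"jcr:title": "Tuning the index"})
    print(index.search("jcr:title", "tuning"))
```

Indexing every property by default means every property adds posting entries, which is where the index size and the motivation for an indexing configuration come from.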

Agenda
- Performance tuning steps
- Repository internals
- Basic content access
- Batch processing
- Clustering
- Query performance
- Indexing configuration
- Questions and answers

Basic content access
- Very fast access by path and ID
  - Underlying storage addressed by ID, but path traversal is in any case needed for ACL checks
- Relevant caches:
  - Path to ID map (internal structure, not configurable)
  - Item state caches (automatically balanced, configurable for special cases)
  - Bundle cache (default fairly low, increase for large deployments)
  - Also some PM-specific options (TarPM index, etc.)
- Caches optimized for a reasonably sized active working set
  - Typical web access pattern: handful of key resources and a long tail of less frequently accessed content, few writes
- Performance hit especially when updating nodes with lots of child nodes
- FineGrainedISMLocking for concurrent, non-overlapping writes

Example: Bundle cache configuration
    <!-- In …/repository/workspaces/${wsp.name}/workspace.xml -->
    <Workspace …>
      <PersistenceManager class="…">
        <param name="bundleCacheSize" value="8"/>
      </PersistenceManager>
    </Workspace>
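
Why the cache size matters can be seen with a small simulation: once the active working set no longer fits, the hit ratio can collapse rather than degrade gradually. The sketch below uses a plain least-recently-used cache that counts entries; it is an assumption-laden model, not the bundle cache itself, and the names `LruCache` and `hit_ratio` are invented for the example:

```python
from collections import OrderedDict


class LruCache:
    """Least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader  # called on a miss, e.g. a storage read
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.loader(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        return value


def hit_ratio(capacity, accesses):
    """Replay an access sequence and return the fraction served from cache."""
    cache = LruCache(capacity, loader=lambda key: key)
    for key in accesses:
        cache.get(key)
    return cache.hits / len(accesses)


if __name__ == "__main__":
    working_set = [0, 1, 2, 3] * 10  # four items requested in rotation
    print(hit_ratio(4, working_set), hit_ratio(3, working_set))
```

With four items requested in rotation, a capacity of four serves every request after the first pass from the cache, while a capacity of three evicts each item just before it is needed again. Measuring the real hit ratio before and after changing the setting is the safer guide than any fixed rule.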

Batch processing
- Two issues: read and write
- Reading lots of content
  - Tree traversal is the best approach, but will flood caches
  - Schedule for off-peak times
  - Add explicit delay (used by the garbage collectors)
  - Use a dedicated cluster node for batch processing
- Writing lots of content (including deleting large subtrees)
  - The entire transient space is kept in memory and committed atomically
  - Split the operation into smaller pieces
  - Save after every ~1k nodes
  - Leverage the data store if possible
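
The "save after every ~1k nodes" advice amounts to a simple loop with a counter. A minimal Python sketch of the pattern, where `write_item` and `save` are hypothetical callables standing in for the per-node change and the session save of whatever API is in use:

```python
def save_in_batches(items, write_item, save, batch_size=1000):
    """Apply write_item to each item, calling save() every batch_size items.

    Returns the number of save() calls made.
    """
    saves = 0
    pending = 0
    for item in items:
        write_item(item)
        pending += 1
        if pending >= batch_size:
            save()
            saves += 1
            pending = 0
    if pending:
        save()  # flush the final partial batch
        saves += 1
    return saves
```

The trade-off is that each batch is atomic but the operation as a whole no longer is, so a failure part-way through leaves earlier batches committed. A batch job written this way should be safe to re-run.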

Clustering
- Good for horizontally scaling reads
  - Practically zero overhead on read access
- Not so good for heavy concurrent writes
  - Exclusive lock over the whole cluster
  - Direct all writes to a single master node
  - Leverage the data store
- Note the cluster sync interval for query consistency, etc.
  - Session.refresh() can be used to force a cluster sync

Query performance
- What’s really fast?
  - Constraints on properties, node types, full text
  - Typically O(n) where n is the number of results, vs. the total number of nodes
- What’s pretty fast?
  - Path constraints
- What needs some planning?
  - Constraints on the child axis
  - Sorting, limit/offset
  - Joins
- What’s not yet available?
  - Aggregate queries (COUNT, SUM, DISTINCT, etc.)
  - Faceting

Join engine
    SELECT a.* FROM [nt:unstructured] AS a JOIN [nt:unstructured] AS b

Indexing configuration
- Default configuration
  - Index all non-binary properties
  - Index binary jcr:data properties (think nt:file/nt:resource)
  - Full text extraction support for all major document formats
  - Full text extraction from images, packages, etc. is explicitly disabled
  - CQ5 / WEM comes with default aggregate indexing rules for cq:Pages, etc.
- Why change the configuration?
  - Reduce the index size (by default almost as large as the PM)
  - Enable features like aggregate indexes
  - Assign boost values for selected properties to improve search result relevance

Indexing configuration
- How to change the configuration?
  - indexing_configuration.xml file in the workspace directory
  - Referenced by the indexingConfiguration option in the workspace.xml file
  - See http://wiki.apache.org/jackrabbit/IndexingConfiguration
- Example:
    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
                   xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <aggregate primaryType="nt:file">
        <include>jcr:content</include>
      </aggregate>
    </configuration>
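
Before deploying a changed indexing configuration it can help to confirm that the file is well-formed and that it declares the aggregate rules you expect. A small Python check along those lines, assuming the file content is available as a string; the helper `aggregate_rules` is hypothetical and the DOCTYPE line is left out of the sample so that nothing depends on fetching the DTD:

```python
import xml.etree.ElementTree as ET

CONFIG = """<?xml version="1.0"?>
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include>jcr:content</include>
  </aggregate>
</configuration>
"""


def aggregate_rules(xml_text):
    """Return {primary type: [included relative paths]} for each aggregate."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    return {
        rule.get("primaryType"): [inc.text for inc in rule.findall("include")]
        for rule in root.findall("aggregate")
    }


if __name__ == "__main__":
    print(aggregate_rules(CONFIG))
```

This only checks structure. Whether a rule has the intended effect on search results still has to be verified against a re-indexed workspace.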

Questions and answers