Oracle Open World
Data-and-Compute-Intensive Processing
Use Case: Lucene Domain Index
Marcelo F. Ochoa
Fac. Cs. Exactas - UNICEN - Tandil - Argentina
Agenda
Data-and-Compute Intensive Search
What is Lucene?
What is Lucene Domain Index?
Performance
Application integration
Demo
Future plans
Data-and-Compute Intensive Search
Strategies
Middle-tier-based search engines
Google Appliance
SES
Nutch
Solr
Database-embedded search engines
Oracle Text
Lucene Domain Index (Lucene OJVM)
Middle-tier-based Search Engines
Benefits
Simple (crawler mode)
Medium complexity (Solr WS)
Out-of-the-box solution (crawler)
No application code is necessary to integrate (crawler)
Partly out-of-the-box solution (Solr WS)
Drawbacks
Updates are slow (usually monthly or weekly)
A lot of wasted traffic
You cannot index pages built from database information that requires a login (crawler mode)
Indexing tables requires triggers, batch processes or a persistence layer to transfer modifications
Database-embedded Search Engines
Benefits
Fastest updates
No extra coding is necessary; access is through SQL
Ready to use from any language: PHP, Python, .Net
You can index tables
Changes are notified automatically
No network traffic
No network marshalling
Drawbacks
Slightly slower Java execution compared to a Sun JDK JVM
What is Lucene?
Open source information retrieval (IR) library with extensible APIs
Top-level Apache project
Core component of the Apache Solr and Nutch projects
100% Java
Around 800 classes
47,000 lines of code
33,000 lines of tests
78,000 lines in the contrib area
Can search and index any textual data
Scales to millions of pages or records
Provides fuzzy search, proximity search, range queries, ...
Wildcards: single and multiple characters, anywhere in the search words
What is Lucene Domain Index?
An embedded version of the Lucene IR library running inside the Oracle OJVM
37 new Java classes and a new PL/SQL object type
A new domain index for Oracle databases using the Data Cartridge API (ODCI)
A new Store implementation for Lucene (OJVMDirectory) that replaces the traditional filesystem with Secure BLOB storage
Two new SQL operators: lcontains() and lscore()
An orthogonal, up-to-date Lucene solution for any programming language, especially Java, Ruby, Python, PHP and .Net; currently based on the latest production version, 2.3.2
Benefits
Benefits added to Oracle applications
No network round trip for indexing Oracle tables
Fault-tolerant, transactional and scalable storage for the Lucene inverted index
Small Lucene index structure
Support for index-organized tables (IOT)
Support for indexing joined tables using the default User Data Store
Support for indexing virtual columns
Support for order by, filter by and in-line pagination operations at the index layer
Support for padding/formatting of Text/Date/Time/Number values
But more important than all of the above:
Easy to adapt for new functionality
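The in-line pagination listed above shows up later in the deck as a `rownum:[1 TO 20]` term placed in front of the Lucene query. A minimal Java sketch of how a client might compute that window from a page number follows. The class and method names are hypothetical, and it assumes the range is 1-based and inclusive, which is what the `[1 TO 20]` example suggests.

```java
// Hypothetical helper, not part of Lucene Domain Index: builds the
// "rownum:[from TO to]" paging term used in the deck's lcontains() example.
final class LucenePaging {
    private LucenePaging() {}

    /** Returns the paging term for a 1-based page, e.g. page 1 of size 20. */
    static String window(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        // Use long arithmetic so large page numbers do not overflow int.
        long from = (long) (page - 1) * pageSize + 1;
        long to = (long) page * pageSize;
        return "rownum:[" + from + " TO " + to + "]";
    }

    /** Prefixes a Lucene query with the paging term, joined with AND as in the slides. */
    static String paged(String luceneQuery, int page, int pageSize) {
        return window(page, pageSize) + " AND " + luceneQuery;
    }
}
```

For example, `LucenePaging.paged("John~ Doe~", 1, 20)` reproduces the query text used in the fuzzy-search slide.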
Performance Test Suite
Corpus: Spanish Wikipedia XML dump
Total documents: 1,056,163 (2.67 GB)
Average size per document: 2,533 bytes
Lucene index size:
10 BLOBs/files
808 MB total
5 fields (title, revisionDate, comment, text)
Table structure (XMLDB):
pages (title: VARCHAR2, id: NUMBER)
pages_revisions (id: NUMBER, revisionDate: TIMESTAMP, comment: VARCHAR2, text: CLOB)
Relationship: pages 1 .. n pages_revisions
Middle-tier-based approach
Requires transferring all database table data to the middle tier
A middle tier application performs this query:
SELECT /*+ DYNAMIC_SAMPLING(0) RULE NOCACHE(PAGES) */ PAGES.rowid,
  extractValue(object_value,'/page/title') "title",
  extractValue(object_value,'/page/revision/comment') "comment",
  extract(object_value,'/page/revision/text/text()') "text",
  extractValue(object_value,'/page/revision/timestamp') "revisionDate"
FROM ESWIKI.PAGES
WHERE PAGES.rowid IN
  (SELECT rid FROM (SELECT rowid rid, rownum rn FROM ESWIKI.PAGES) WHERE rn>=1 AND rn<=300)
For 300 rows, SQL Trace reports:
bytes sent via SQL*Net to client: 3,245,358
bytes received via SQL*Net from client: 1,785,912
SQL*Net roundtrips to/from client: 2,383
Total indexing time for 33,912 rows: 824 seconds
[Diagram: the application and an external indexer access the database through SQL]
Database-embedded approach
Index Definition, Lucene Domain Index syntax:
SQL> ALTER SESSION SET sql_trace=true;
SQL> ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';
SQL> create index pages_lidx_all on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('PopulateIndex:false;DefaultColumn:text;
SyncMode:Deferred;LogLevel:WARNING;
Analyzer:org.apache.lucene.analysis.SpanishWikipediaAnalyzer;
ExtraCols:extractValue(object_value,''/page/title'') "title",
extractValue(object_value,''/page/revision/comment'') "comment",
extract(object_value,''/page/revision/text/text()'') "text",
extractValue(object_value,''/page/revision/timestamp'') "revisionDate";
FormatCols:revisionDate(day);IncludeMasterColumn:false;
LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW
CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');
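The parameters string above is a list of `Key:value` pairs separated by semicolons, and every single quote inside it is doubled because the whole list sits in a SQL string literal. The Java sketch below assembles such a string. It only mirrors the shape visible on the slide; the class name is hypothetical and the real parser's grammar is not documented here, so treat it as an illustration rather than the project's own tooling.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical builder for the "Key:value;Key:value" parameters text
// shown in the CREATE INDEX statement of the slides.
final class IndexParameters {
    // LinkedHashMap keeps the parameters in insertion order.
    private final Map<String, String> values = new LinkedHashMap<>();

    IndexParameters set(String key, String value) {
        values.put(key, value);
        return this;
    }

    /** Joins the entries as Key:value pairs separated by ';'. */
    String render() {
        StringJoiner joiner = new StringJoiner(";");
        for (Map.Entry<String, String> e : values.entrySet()) {
            joiner.add(e.getKey() + ":" + e.getValue());
        }
        return joiner.toString();
    }

    /** Wraps the rendered text in a SQL string literal, doubling embedded single quotes. */
    String toSqlLiteral() {
        return "'" + render().replace("'", "''") + "'";
    }
}
```

Writing the XPath expressions with ordinary single quotes and letting `toSqlLiteral()` double them avoids the hand-written `''` pairs seen in the slide.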
After creating the index (here with PopulateIndex:false), changes must be submitted for indexing.
This can be done using:
DECLARE
ridlist sys.ODCIRidList;
BEGIN
select rid BULK COLLECT INTO ridlist
from (select rowid rid,rownum rn from pages) where rn>=1 and rn<=300;
LuceneDomainIndex.enqueueChange(USER||'.PAGES_LIDX_ALL',ridlist,'insert');
END;
For 300 rows, SQL Trace reports:
bytes sent via SQL*Net to client: 1,301
bytes received via SQL*Net from client: 1,354
SQL*Net roundtrips to/from client: 4
Total indexing time for 33,912 rows: 346 seconds
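To put the two measurements side by side, the short Java sketch below derives throughput and traffic ratios from the figures reported in the slides (33,912 rows in 824 s versus 346 s, and the SQL*Net counters for 300 rows). The class name is hypothetical and nothing here is measured again; it is plain arithmetic on the published numbers, which works out to roughly 41 versus 98 rows per second, or about 2.4 times faster for the embedded approach.

```java
import java.util.Locale;

// Arithmetic on the numbers reported in the slides; no new measurements.
final class IndexingComparison {
    private IndexingComparison() {}

    static final long ROWS = 33_912;
    static final double MIDDLE_TIER_SECONDS = 824;
    static final double EMBEDDED_SECONDS = 346;
    // SQL*Net bytes sent plus received while indexing 300 rows.
    static final long MIDDLE_TIER_BYTES = 3_245_358L + 1_785_912L;
    static final long EMBEDDED_BYTES = 1_301L + 1_354L;

    static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    static double ratio(double larger, double smaller) {
        return larger / smaller;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "middle tier: %.1f rows/s%n",
                rowsPerSecond(ROWS, MIDDLE_TIER_SECONDS));
        System.out.printf(Locale.ROOT, "embedded:    %.1f rows/s%n",
                rowsPerSecond(ROWS, EMBEDDED_SECONDS));
        System.out.printf(Locale.ROOT, "speed-up:    %.2fx%n",
                ratio(MIDDLE_TIER_SECONDS, EMBEDDED_SECONDS));
        System.out.printf(Locale.ROOT, "traffic:     %.0fx fewer bytes per 300 rows%n",
                ratio(MIDDLE_TIER_BYTES, EMBEDDED_BYTES));
    }
}
```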
[Diagram: Database-embedded approach. The client application uses Java JDBC calls to issue a single stored procedure call (SQL) to the database]
CPU / Network usage during indexing
[Charts: CPU load (User, Sys, Nice) and network traffic (Receiver, Transmitter), shown for the client side and the server side, comparing external indexing (824 s) with integrated indexing (346 s)]
CPU/IO for Database-embedded indexing
This information was taken with SQL_TRACE=true while indexing 3,000 rows inside the OJVM.
Most of the time is spent in the full table scan.
In addition, middle-tier indexing of 3,000 rows would require sending 34 MB of data over the network.
[Chart: 58%, 43.2%]
Application integration: Fuzzy searches
Old implementation (without Lucene):
select p.id,p.first_Name,p.last_Name,p.nationality,p.sex,p.type_Document,
p.number_Document,p.civil_State,p.date_Birth,p.mail,g.organization_id ,
fnmatchperson(p.first_name, p.last_name, 'John Doe') as suma
from person p left join (select * from guest where organization_id = 67) g on g.person_id = p.id
where p.state = 1 and fnmatchperson(p.first_name, p.last_name, 'John Doe') >= 50 order by suma desc
Lucene implementation:
select /*+ DOMAIN_INDEX_SORT */
p.id,p.first_Name,p.last_Name,p.nationality,p.sex,p.type_Document,
p.number_Document,p.civil_State,p.date_Birth,p.mail,g.organization_id ,
lscore(1) as suma from person p left join
(select * from guest where organization_id = 67) g on g.person_id = p.id
where p.state = 1 and lcontains(p.first_name, 'rownum:[1 TO 20] AND John~ Doe~',1) > 0
Here "John Doe" is searched as "John~ Doe~" to allow partial matches.
The Lucene ~ operator uses the Levenshtein distance (edit distance) algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance
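For readers unfamiliar with the metric, here is a minimal Java sketch of the classic dynamic-programming Levenshtein distance. It illustrates what "edit distance" means for terms such as "john" and "jon"; it is a textbook version with a hypothetical class name, not Lucene's internal fuzzy-matching code, which also turns the distance into a similarity score.

```java
// Textbook Levenshtein distance using two rolling rows of the DP table.
final class EditDistance {
    private EditDistance() {}

    /** Minimum number of single-character inserts, deletes and substitutions turning a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Swap rows so prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A small distance, such as 1 between "john" and "jon", is what lets a fuzzy term match a misspelled name.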
Execution plan for both queries
[Figures: execution plan of the fnMatch solution and execution plan of the Lucene solution]
Key points
Only one extra index is required:
create index person_lidx on person(first_name)
indextype is lucene.LuceneIndex
parameters('SyncMode:OnLine;LogLevel:ALL;
AutoTuneMemory:true;IncludeMasterColumn:false;DefaultOperator:OR;
DefaultColumn:name_str;Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;
ExtraCols:first_name||'' ''||last_name "name_str"');
Simple to adapt: only one class was modified to provide partial matching.
// Turn "John Doe" into the fuzzy Lucene query "john~ doe~"
String[] split = firstLastName.split(" ");
sql3 = "";
for (int i = 0; i < split.length; i++) {
    sql3 += split[i].toLowerCase().trim() + "~ ";
}
sql3 = sql3.substring(0, sql3.length() - 1); // strip the trailing space
// Previous fnmatchperson-based code, kept for reference:
/*sql += ", fnmatchperson(p.first_name, p.last_name, " + sql3 + ") as suma ";*/
/*sql3 = " and fnmatchperson(p.first_name, p.last_name, " + sql3 + ") >= 50 " ;*/
sql3 = "and lcontains(p.first_name, 'rownum:[1 TO 20] AND " + sql3 + "',1) > 0 ";
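The snippet above concatenates user input straight into both a Lucene query and a SQL literal, so a name containing a quote or a Lucene operator character could break the statement. The Java sketch below is one possible hardening, under stated assumptions: the class and method names are hypothetical, the escaped character list follows the Lucene classic query syntax, and whether escaped characters combine cleanly with `~` in version 2.3.2 has not been checked here.

```java
import java.util.Locale;
import java.util.StringJoiner;

// Hypothetical hardened variant of the query-building step in the slide.
final class FuzzyNameQuery {
    private FuzzyNameQuery() {}

    // Characters with special meaning in Lucene query syntax. '&' and '|' are
    // only special when doubled, but escaping single ones is harmless.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    /** Backslash-escapes Lucene query syntax characters in one term. */
    static String escape(String term) {
        StringBuilder sb = new StringBuilder(term.length() + 4);
        for (char c : term.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Turns "John Doe" into "john~ doe~", skipping empty words. */
    static String build(String fullName) {
        StringJoiner joiner = new StringJoiner(" ");
        for (String word : fullName.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                joiner.add(escape(word) + "~");
            }
        }
        return joiner.toString();
    }

    /** Quotes text as a SQL string literal, doubling embedded single quotes. */
    static String sqlLiteral(String text) {
        return "'" + text.replace("'", "''") + "'";
    }
}
```

Where the driver and operator allow it, passing the query text as a bind variable instead of calling `sqlLiteral` would remove the SQL quoting concern entirely; that option is an assumption and is not shown in the slides.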
Key points, cont.
Less network traffic
In the above example, around 20% of the rows are discarded by the filter operation:
"GUEST"."PERSON_ID"(+)="P"."ID" AND "ORGANIZATION_ID"(+)=67
"P"."STATE"=1
In the Solr implementation, a new row in the person table implies:
N bytes of SQL*Net + 283 bytes for the HTTP POST method
Faster updates
Compared to the Solr approach, we send 283 fewer bytes, which means faster operations.
Compared to the middle-tier approach, once a new row is added to the table it is ready to be included in the next query; in the example shown this is a critical constraint.
Minimal application code impact
Only a new index
Only a rewrite of the WHERE condition is needed to replace fnMatch
Future plans
Add faceted search, possibly using ODCI aggregate functions or pipelined table functions
Strong commitment to the latest Lucene production release: once version 2.4 is released, we will test it inside the OJVM
Add the ODCI Extensible Optimizer Interface to communicate better with the Oracle SQL engine
A slave session that collects queries from different parallel sessions to reduce the memory footprint and provide higher hit ratios
A JMX interface to monitor Lucene Domain Index using Sun's JMX console
Useful links
Lucene Project:
http://lucene.apache.org/java/docs/index.html
Lucene Oracle Integration
http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
Forum, Peer to Peer Support
http://sourceforge.net/forum/forum.php?forum_id=187896
Download Binary Distribution (10g/11g)
http://sourceforge.net/project/showfiles.php?group_id=56183&package_id=255524
CVS Access
cvs -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism login
cvs -z3 -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism co -P ojvm
http://dbprism.cvs.sourceforge.net/dbprism/ojvm/
Q & A
Thank you
