Apache: Big Data Europe 2015
Search-based business intelligence and
reverse data engineering with Apache Solr
Mario-Leander Reimer
Chief Technologist
This talk will …
Mario-Leander Reimer, 28 September 2015
o Give a brief overview of the AIR system’s architecture
o Show reverse data engineering using Solr and MIR
o Talk about the fight for our right to Solr
o Describe solutions for the problem of combinatorial explosion
o Outline a flexible and lightweight ETL approach for Solr
[Figure: AIR system architecture. Users: mechanic, service technician, independent workshop. Clients: AIR Client, browser, 3rd party application (AIR Fork DLL, AIR Call DLL), 3rd party iOS app (AIR iOS Lib). Systems: AIR Central (JSF Web UI, REST API, Solr Access, WS Clients, File Storage), AIR Repository (Apache Solr, Solr Extensions), AIR Loader, AIR Control, AIR Bus, AIR DB, document storage, backend databases and systems. Infrastructure shown: Apache Solr, .NET WPF, Spring Framework, JEE 5, Jenkins. Flow labels: Launch, Search and Display, Query, Execute, Load.]
Solr index: 800 GB, 20 languages.
[Figure: AIR Loader and Solr replication. Components shown: AIR Loader, AIR Control (Jenkins), AIR Repository with an Apache Solr master and an Apache Solr slave (each with Solr Extensions), MIR, backend databases and systems. Flow labels: Execute, Load, Replicate, Search.]
Solr index on master and slave: 800 GB, 20 languages.
Let's go back to when it all began …
Source: http://www.october212015.com/images/timecircuits.jpg
The project vision: find the right information
in fewer than 3 clicks.
The situation:
o Users had to use up to 7 different applications for their daily work.
o The systems were poorly integrated.
o Finding the correct information was laborious and error-prone.
The idea:
o Combine the data into a consistent information network.
o Make the information network and its data searchable and navigable.
o Replace the existing applications with one easy-to-use application.
But how do we find the originating system
for the desired data?
Where to find the vehicle data?
60 potential systems and 5,000 entities.
[Figure: vehicle data and other data spread across System A, System B, System C, and System D]
And how do we find the hidden relations
between the systems and their data?
How is the data linked to each other?
400,000 potential relations.
[Figure: vehicle, customer, documents, and other data spread across System A, System B, System C, and System D]
Meta Information Research (MIR)
Source: http://www.thewallpapers.org/photo/31865/Mir_space_station_12_June_1998.jpg
MIR is a simple and lightweight data reverse
engineering and analysis tool based on Solr.
o MIR manages meta information about the source systems (the data
models and record descriptions).
o MIR lets you navigate and search the metadata; you can drill into
the metadata using facets.
o MIR also manages the target data model and the Solr schema description.
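The kind of metadata search described here can be sketched as a Solr select request that combines a wildcard query with facet counts. This is only an illustration: the core name `mir` and the field names `ATTRIBUTE_NAME`, `SYSTEM`, and `TABLE` are assumptions and do not come from the talk; `facet`, `facet.field`, and `facet.mincount` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_metadata_search_url(base_url, term, facet_fields):
    """Build a Solr /select URL with a wildcard query and facet counts."""
    params = [
        ("q", f"ATTRIBUTE_NAME:*{term}*"),  # hypothetical field holding attribute names
        ("rows", "20"),
        ("wt", "json"),
        ("facet", "true"),                  # switch on faceting
        ("facet.mincount", "1"),            # hide facet values without hits
    ]
    # One facet.field parameter per field we want to drill down on.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


url = build_metadata_search_url(
    "http://localhost:8983/solr/mir",       # hypothetical core location
    "VIN",
    ["SYSTEM", "TABLE"],
)
print(url)
```

Each facet value returned for `SYSTEM` or `TABLE` could then be added as a filter query to narrow the result, which is the drill-down step.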
[Figure: Meta Information Research (MIR) architecture. Subsystems shown: MIR Loader, MIR Generators, MIR User Interface, Apache Solr with a 25 MB metadata index. Also shown: backend databases and systems, Magic Draw, sources (Java, XML). Flow label: Read.]
[Screenshot: MIR user interface with wildcard queries, faceted drill-down, a tree view of systems, tables and attributes, and search results. The example search found potential synonyms for the chassis number.]
EAT YOUR OWN DOG FOOD.
The AIR Solr schema definition is
modelled and defined within MIR.
[Screenshot: Solr schema attributes and Solr entities for each release]
def sourceGenerator = MIR + Solr + Maven;
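A minimal sketch of the idea behind generating sources from a metadata model, assuming the model is a plain dictionary. What the MIR generators actually read and emit is not shown in the talk; the model layout and the `generate_schema_fields` helper are hypothetical. The `<field>` attributes used (`name`, `type`, `indexed`, `stored`, `multiValued`) are standard Solr schema attributes.

```python
from xml.sax.saxutils import quoteattr


def generate_schema_fields(model):
    """Render Solr schema <field> declarations from a small metadata model."""
    lines = []
    for name, spec in sorted(model.items()):
        multi = "true" if spec.get("multi_valued", False) else "false"
        lines.append(
            f"<field name={quoteattr(name)} type={quoteattr(spec['type'])} "
            f'indexed="true" stored="true" multiValued="{multi}"/>'
        )
    return "\n".join(lines)


# Hypothetical target model; the field names are borrowed from examples in this talk.
model = {
    "INFO_TYPE": {"type": "string"},
    "VIN": {"type": "string"},
    "E_SERIES": {"type": "string", "multi_valued": True},
}
print(generate_schema_fields(model))
```

Running such a generator as part of the Maven build would keep the schema and the model in step, which appears to be the point of the formula above.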
But Solr is a full text
search engine. You have
to use an Oracle DB for
your application data!
NO!
Some of the AIR requirements were ...
o Focus is on search. Transactions are not required.
o High demands on request volume and performance.
o Free navigation on data model and content.
o Support for full text search and faceted search.
o Offline capabilities.
o Scalability from low-end device to server to cloud.
Apache Solr outperformed Oracle significantly
in query time as well as index size.
Query 1
  Oracle: SELECT * FROM VEHICLE WHERE VIN LIKE 'V%'
          | 383 ms | 384 ms | 383 ms
  Solr:   INFO_TYPE:VEHICLE AND VIN:V*
          | 038 ms | 000 ms | 000 ms
Query 2
  Oracle: SELECT * FROM MEASURE WHERE TEXT='engine'
          | 389 ms | 387 ms | 386 ms
  Solr:   INFO_TYPE:MEASURE AND TEXT:engine
          | 092 ms | 000 ms | 000 ms
Query 3
  Oracle: SELECT * FROM VEHICLE WHERE VIN LIKE '%X%'
          | 859 ms | 379 ms | 383 ms
  Solr:   INFO_TYPE:VEHICLE AND VIN:*X*
          | 039 ms | 000 ms | 000 ms
Test data set: 150,000 records
Disk space: 132 MB Solr vs. 385 MB Oracle
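The query pairs above follow a simple pattern: the table name becomes an `INFO_TYPE` clause and the SQL wildcard becomes a Solr wildcard. A minimal sketch of that mapping follows; the helper names are hypothetical, and escaping of Solr special characters is deliberately left out.

```python
def like_to_solr_wildcard(field, like_pattern):
    """Translate a simple SQL LIKE pattern into a Solr wildcard clause.

    Only % (any run of characters) and _ (exactly one character) are handled.
    """
    translated = like_pattern.replace("%", "*").replace("_", "?")
    return f"{field}:{translated}"


def solr_query(info_type, field, like_pattern):
    # INFO_TYPE plays the role of the table name, as in the queries above.
    return f"INFO_TYPE:{info_type} AND {like_to_solr_wildcard(field, like_pattern)}"


print(solr_query("VEHICLE", "VIN", "V%"))   # INFO_TYPE:VEHICLE AND VIN:V*
print(solr_query("VEHICLE", "VIN", "%X%"))  # INFO_TYPE:VEHICLE AND VIN:*X*
```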
Source: http://www.dirtbikerider.com/news/images/anotherimpressivegpweekendforhusqvarna_553db21addaaa.jpg
Dirt Race Use Case:
o Low-end devices
o No Internet
Running Solr and AIR-2-Go on Raspberry Pi
Model B worked like a charm.
Running Debian Linux + JDK8
Jetty Container with the Solr and
AIR WARs deployed
Reduced Solr data set with approx.
1.5 million documents
Model B Hardware Specs:
o ARMv6 CPU at 700 MHz
o 512 MB RAM
o 32 GB SD card
YOU GOTTA FIGHT FOR
YOUR RIGHT TO SOLR!
No silver bullet. A careful schema design is
crucial for your Solr performance.
Naive data denormalization can quickly lead
to combinatorial explosion.
[Figure: entity overview; the entities are connected by m:n and 1:n relations]
Vehicles              33,071,137
Technical Documents      648,129
Flat Rate Units       14,830,197
FRU Groups             5,078,411
Parts                     55,000
Measures                 648,129
Repair Instructions       18,573
Fault Indications          6,180
Packages               1,678,667
Types                     41,385
Num Docs: 55,777,706 Relationship Navigation
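A rough back-of-the-envelope sketch of why naive flattening explodes. The vehicle count and the total document count are the figures given above; the fan-out values are invented purely for illustration and are not from the talk.

```python
from math import prod


def denormalized_rows(base_count, fan_outs):
    """Rows produced when every related entity is joined into one flat record."""
    return base_count * prod(fan_outs)


vehicles = 33_071_137   # vehicle count given above
indexed_docs = 55_777_706  # total document count given above

# Hypothetical average number of related records per vehicle.
fan_outs = {"packages": 10, "flat_rate_units": 20, "documents": 5}

rows = denormalized_rows(vehicles, fan_outs.values())
print(f"flattened rows: {rows:,}")
print(f"growth versus the actual index: about {rows // indexed_docs}x")
```

Even with these modest assumed fan-outs, one flat record per combination would need tens of billions of rows, which is why the relations are kept navigable instead of being joined up front.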
Multi-valued fields can efficiently store 1..n
relations but may result in false positives.
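{
"INFO_TYPE":"AWPOS_GROUP",
"NUMMER" :[ "1134190" , "1235590" ],
"BAUSTAND" :["1969-12-31T23:00:00Z","1975-12-31T23:00:00Z"],
"E_SERIES" :[ "F10" , "E30" ]
}
The values at index 0 of each array belong together, as do the values at index 1.
Both of the following queries match this document. The first is a correct hit, because both values sit at index 0. The second is a false positive, because NUMMER 1134190 sits at index 0 while E_SERIES E30 sits at index 1.
q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:F10
q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:E30
If you can tolerate false positives in the Solr result, filter them out afterwards in your application.
Note: recent Solr versions support nested child documents. Use those instead.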
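A minimal sketch of such a post filter in the application, assuming the multi-valued fields are parallel arrays in which equal positions belong to the same entry. The helper name `matches_same_entry` is hypothetical.

```python
def matches_same_entry(doc, criteria):
    """Return True if a single array position satisfies all criteria at once."""
    fields = list(criteria)
    # Only compare positions that exist in every field involved.
    length = min(len(doc[field]) for field in fields)
    return any(
        all(doc[field][i] == criteria[field] for field in fields)
        for i in range(length)
    )


doc = {
    "INFO_TYPE": "AWPOS_GROUP",
    "NUMMER": ["1134190", "1235590"],
    "E_SERIES": ["F10", "E30"],
}

# Correct hit: both values sit at index 0.
print(matches_same_entry(doc, {"NUMMER": "1134190", "E_SERIES": "F10"}))  # True
# False positive from Solr: the values sit at different indexes.
print(matches_same_entry(doc, {"NUMMER": "1134190", "E_SERIES": "E30"}))  # False
```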
Technical documents and their validity were
expressed in a binary representation.
o Validity expressions may have up to 46 characteristics.
o Validity expressions use 5 different boolean operators (AND, NOT, …)
o Validity expressions can be nested and complex.
o Some characteristics are dynamic and not even known at index time.
Solution: transform the validity expressions into the
equivalent JavaScript terms and evaluate these terms
at query time using a custom function query filter.
Binary validity expression example.
Type(53078923) = 'Brand', Value(53086475) = 'BMW PKW'
Type(53088651) = 'E-Series', Value(53161483) = 'F10'
Type(64555275) = 'Transmission', Value(53161483) = 'MECH'
Transformation of binary validity terms into
their JavaScript equivalent at index time.
Validity term:
AND(Brand='BMW PKW', E-Series='F10', Transmission='MECH')
JavaScript equivalent:
((BRAND=='BMW PKW')&&(E_SERIES=='F10')&&(TRANSMISSION=='MECH'))
Indexed document:
{
"INFO_TYPE": "TECHNISCHES_DOKUMENT",
"DOKUMENT_TITEL": "Getriebe aus- und einbauen",
"DOKUMENT_ART": "reparaturanleitung",
"VALIDITY": "((BRAND=='BMW PKW')&&((E_SERIES=='F10')&&(...))",
"BRAND": ["BMW PKW"],
...
}
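The index-time transformation can be sketched as a small recursive converter. The nested-tuple expression tree used here is a hypothetical stand-in for the binary representation, which the talk does not spell out, and `OR` is an assumed operator (only AND and NOT are named above).

```python
JS_OPERATORS = {"AND": "&&", "OR": "||"}  # OR is an assumption


def js_string(value):
    """Quote a value as a single-quoted JavaScript string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def to_js(expr):
    """Convert a nested validity expression into a JavaScript term."""
    op, *args = expr
    if op == "EQ":
        name, value = args
        return f"({name}=={js_string(value)})"
    if op == "NOT":
        (inner,) = args
        return f"(!{to_js(inner)})"
    if op in JS_OPERATORS:
        return "(" + JS_OPERATORS[op].join(to_js(arg) for arg in args) + ")"
    raise ValueError(f"unsupported operator: {op}")


validity = (
    "AND",
    ("EQ", "BRAND", "BMW PKW"),
    ("EQ", "E_SERIES", "F10"),
    ("EQ", "TRANSMISSION", "MECH"),
)
print(to_js(validity))
# ((BRAND=='BMW PKW')&&(E_SERIES=='F10')&&(TRANSMISSION=='MECH'))
```

The resulting string is what would be stored in the `VALIDITY` field of each document.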
The JavaScript validity term is evaluated at
query time using a custom function query.
&fq=INFO_TYPE:TECHNISCHES_DOKUMENT
&fq=DOKUMENT_ART:reparaturanleitung
&fq={!frange l=1 u=1 incl=true incu=true cache=false cost=500}
jsTerm(VALIDITY,eyJNT1RPUl9LUkFGVFNUT0ZGQVJUX01PVE9SQVJCRUlUU
1ZFUkZBSFJFTiI6IkIiLCJFX01BU0NISU5FX0tSQUZUU1RPRkZBUlQiOm51bG
wsIlNJQ0hFUkhFSVRTRkFIUlpFVUciOiIwIiwiQU5UUklFQiI6IkFXRCIsIkV
kJBVVJFSUhFIjoiWCcifQ==)
http://qaware.blogspot.de/2014/11/how-to-write-postfilter-for-solr-49.html
Base64 decode:
{
"BRAND":"BMW PKW",
"E_SERIES":"F10",
"TRANSMISSION":"MECH"
}
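A minimal client-side sketch of how such a filter query could be assembled: the vehicle characteristics are serialized to JSON, Base64-encoded, and passed as the second argument of the custom `jsTerm` function inside an `frange` filter. The helper names are hypothetical, and the exact encoding the AIR client uses is an assumption based on the example above.

```python
import base64
import json

PREFIX = "jsTerm(VALIDITY,"


def build_validity_filter(characteristics, cost=500):
    """Build the fq value that hands vehicle characteristics to jsTerm."""
    payload = json.dumps(characteristics, separators=(",", ":"))
    encoded = base64.b64encode(payload.encode("utf-8")).decode("ascii")
    local_params = f"{{!frange l=1 u=1 incl=true incu=true cache=false cost={cost}}}"
    return f"{local_params}{PREFIX}{encoded})"


def decode_characteristics(fq):
    """Recover the characteristics from an fq value built by the function above."""
    start = fq.index(PREFIX) + len(PREFIX)
    return json.loads(base64.b64decode(fq[start:-1]))


characteristics = {"BRAND": "BMW PKW", "E_SERIES": "F10", "TRANSMISSION": "MECH"}
fq = build_validity_filter(characteristics)
print(fq)
print(decode_characteristics(fq))
```

Standard Base64 output can contain `+`, `/`, and `=`, so the value still needs URL encoding before it is sent. The `frange` bounds `l=1 u=1` keep only documents for which the function returns 1, and `cache=false` with a cost of 100 or more asks Solr to run the filter as a post filter, after the cheaper filters.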
How often do we load
data? How do we ensure
data consistency?
A traditional approach using a DWH and ETL:
too inflexible, heavyweight and expensive.
[Figure: System A, System B, System C, files, and databases connected by ETL jobs to a data warehouse and to AIR Solr]
o ETL jobs would usually be implemented with Informatica.
o Significant business logic is required, depending on the source database.
Flexible and lightweight ETL combined with
Continuous Delivery and DevOps.
[Figure: lightweight ETL pipeline. Systems shown: AIR Loader, AIR Loader Slave, AIR Search, Jenkins Master, Jenkins Slave, Apache Maven, Nexus Repository, Apache Solr with its Solr index (twice), Data Source A to Data Source n. Roles: Developer, Operations. Flow labels: Build, Build&Deploy, Start, Run, Extract, Load, Replicate.]
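The extract and load steps of such a lightweight loader can be sketched as follows. An in-memory SQLite table stands in for a data source, and the table, column, and `id` naming are hypothetical. The payload is a JSON array of documents, a format Solr's update handler accepts; actually sending it to a server is left out so the sketch runs on its own.

```python
import json
import sqlite3


def extract(connection):
    """Read the raw rows from the source system."""
    cursor = connection.execute("SELECT vin, e_series FROM vehicle ORDER BY vin")
    return cursor.fetchall()


def transform(rows):
    """Map source rows to Solr documents with an INFO_TYPE discriminator."""
    return [
        {"id": f"VEHICLE-{vin}", "INFO_TYPE": "VEHICLE", "VIN": vin, "E_SERIES": e_series}
        for vin, e_series in rows
    ]


def load_payload(docs):
    """Serialize the documents as a JSON array for the Solr update handler."""
    return json.dumps(docs)


# Hypothetical data source with two vehicles.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE vehicle (vin TEXT, e_series TEXT)")
connection.executemany(
    "INSERT INTO vehicle VALUES (?, ?)", [("V123", "F10"), ("V456", "E30")]
)

docs = transform(extract(connection))
payload = load_payload(docs)
print(payload)
```

Packaged as a versioned build artifact, a loader like this could be started by a Jenkins job for each data source, which is one way to read the pipeline in the figure.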
Apache Solr has become a powerful tool for
data analytics applications. Be creative.
Our next big project using Apache Solr is already on its way:
a high-performance application to predict and calculate the bill
of materials for all required parts and orders.
Related talks:
o "Apache Solr as a compressed, scalable and high performance time series database", FOSDEM'15, Florian Lautenschlager, QAware GmbH
o "Leveraging the Power of SOLR and SPARK", Apache: Big Data 2015, Johannes Weigend, QAware GmbH
Business intelligence
is about asking the
right questions about
your data.
And with Apache Solr
you can search and
find the answers you
are looking for.
https://twitter.com/leanderreimer/
https://slideshare.net/MarioLeanderReimer/
https://speakerdeck.com/lreimer/
Mario-Leander Reimer
Chief Technologist, QAware GmbH
