ApacheCon NA 2011 Report 2011/12/19 @ijokarumawak
About myself
  Nutch
  Cloudera Certified Hadoop Developer, Hadoop Administrator
  CouchDB JP
ApacheCon
  http://na11.apachecon.com/
  2 days of training, 3 days of sessions
  Keynotes, 5 tracks, over 80 sessions
  Slides and audio files: http://lanyrd.com/2011/apachecon-north-america/
Why did I go there? Because I wanted to!
  Nov 3: Left Japan
  Nov 5-6: CouchHack
  Nov 7: CouchConf Berlin
  Nov 9-11: ApacheCon
  Nov 12: Apache BarCamp
  Nov 14: Came back
  Image from: http://en.wikipedia.org/wiki/File:World_map_blank_gmt.svg
Keynote | Building in Security and Innovation
  David A. Wheeler, a specialist in developing secure open source software
  The importance of developing secure software
  Do not repeat the same mistakes
  Learn how to make software secure before you start developing it
Keynote | The Apache Way Done Right: The Success of Hadoop
  Eric Baldeschwieler, co-founder and CEO of Hortonworks
  History of Hadoop
  Difficulty of leading a huge community
  “Be optimistic and good things will happen.”
Keynote | Watson, a Reasoning System: based on Apache Inside!
  David Boloker, CTO of IBM's Emerging Internet Technology group
  IBM's Watson won Jeopardy!
  Commercialization of Watson
  Its target is the medical field
Lucene/Solr Meetup
  Discussion with core committers of Lucene/Solr
    Erik Hatcher
    Chris Hostetter
    Simon Willnauer
  We are supposed to drink beer, aren't we?
Sessions I attended
  Lucene 4.0 - next generation open source search (Simon Willnauer)
  Solr Flair (Erik Hatcher)
  And more… 20 sessions!
  http://www.atware.co.jp/category/column/apachecon-na-2011/
Lucene 4.0 - next generation open source search - by Simon Willnauer
About the author
  Lucene core committer
  Project Management Committee (PMC) chair
  Berlin Buzzwords co-founder: http://berlinbuzzwords.de/
  Community portal targeting open source search: http://www.searchworkings.org/
Lucene 4.0
  The latest release is currently Lucene 3.5.0
  When will Lucene 4.0 come out? Any time. He doesn't know.
IndexWriter & IndexReader
  Talk to a Directory (file system)
    Just a factory for input and output streams
  From Lucene 4: Flex API on the Codec layer
  Codec
    Defines the file format and data structures (fields, term dictionaries)
    You can use MySQL as a backup (it's not a good idea though)
    90% of users won't touch it; the other 10% might be researchers
    Backward compatibility
  Layers, bottom to top: File System, Directory, Codec, Flex API, IndexWriter & Reader
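The layering on this slide can be illustrated with a small Python sketch. It is not Lucene's API: `RamDirectory`, `JsonCodec`, `BinaryCodec`, `write_index` and `read_index` are hypothetical names made up for illustration. The point is only that the directory hands out byte streams, while the codec alone decides what the bytes look like, so the codec can be swapped without touching the directory.

```python
import io
import json
import struct


class RamDirectory:
    """Hypothetical stand-in for a Directory: it only hands out byte streams."""

    def __init__(self):
        self._files = {}

    def create_output(self, name):
        stream = io.BytesIO()
        self._files[name] = stream
        return stream

    def open_input(self, name):
        # Return a fresh stream positioned at the start of the stored bytes.
        return io.BytesIO(self._files[name].getvalue())


class JsonCodec:
    """One possible file format: the whole term dictionary as JSON text."""

    def write(self, out, postings):
        out.write(json.dumps(postings, sort_keys=True).encode("utf-8"))

    def read(self, inp):
        return json.loads(inp.read().decode("utf-8"))


class BinaryCodec:
    """Another format: length-prefixed terms followed by packed doc ids."""

    def write(self, out, postings):
        for term in sorted(postings):
            raw = term.encode("utf-8")
            docs = postings[term]
            # Header: 2-byte term length, 4-byte doc count (big-endian).
            out.write(struct.pack(">HI", len(raw), len(docs)))
            out.write(raw)
            out.write(struct.pack(f">{len(docs)}I", *docs))

    def read(self, inp):
        postings = {}
        while True:
            header = inp.read(6)
            if not header:
                return postings
            term_len, doc_count = struct.unpack(">HI", header)
            term = inp.read(term_len).decode("utf-8")
            doc_bytes = inp.read(4 * doc_count)
            postings[term] = list(struct.unpack(f">{doc_count}I", doc_bytes))


def write_index(directory, codec, postings, name="segment_1"):
    codec.write(directory.create_output(name), postings)


def read_index(directory, codec, name="segment_1"):
    return codec.read(directory.open_input(name))
```

Both codecs round-trip the same `{term: [doc ids]}` mapping through the same directory; only the bytes on "disk" differ.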
Storing Strings in UTF-8
  Lucene 3 uses UTF-16
  From Lucene 4, UTF-8
  Performance will improve when you switch to Lucene 4
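As a rough illustration of one effect of the encoding change, the sketch below compares how many bytes the same term needs in each encoding. The function name and the sample terms are made up for this example, and it says nothing about Lucene's actual measured speedup. ASCII-heavy terms shrink to half the size in UTF-8, while CJK terms can grow.

```python
def encoded_sizes(term):
    """Return (utf8_bytes, utf16_bytes) needed to store one term."""
    return len(term.encode("utf-8")), len(term.encode("utf-16-be"))


# Sample terms chosen for illustration only.
for term in ["lucene", "solr", "検索"]:
    utf8, utf16 = encoded_sizes(term)
    print(f"{term!r}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```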
PostingsFormat
  PostingsFormat can be defined per field
  field:uid = Pulsing PostingsFormat
    Usually 1 doc per uid
    Inlines postings into the term dictionary
    Saves an additional disk lookup
  field:spell = Memory PostingsFormat
    Spelling correction doesn't need posting list traversal
    Large number of key lookups
    Loads terms into RAM
  field:body = Default PostingsFormat
  Primary key lookup: 170K qps -> 550K qps with Memory PostingsFormat
  (Diagram: term dictionary pointing to posting lists; terms held in RAM)
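The "pulsing" idea for the uid field can be sketched in a few lines of Python. This is a toy model, not Lucene's implementation: `build_term_dictionary`, `lookup` and `pulse_cutoff` are hypothetical names. Terms that occur in at most `pulse_cutoff` documents keep their doc ids inside the term dictionary entry, so a lookup never has to touch the separate postings file.

```python
def build_term_dictionary(docs, pulse_cutoff=1):
    """docs maps doc_id -> list of terms. Returns (term_dict, postings_file)."""
    inverted = {}
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            ids = inverted.setdefault(term, [])
            if not ids or ids[-1] != doc_id:
                ids.append(doc_id)

    term_dict, postings_file = {}, []
    for term in sorted(inverted):
        ids = inverted[term]
        if len(ids) <= pulse_cutoff:
            # Rare term: inline the postings into the dictionary entry.
            term_dict[term] = ("inline", ids)
        else:
            # Frequent term: store a pointer into the postings file.
            term_dict[term] = ("pointer", len(postings_file))
            postings_file.append(ids)
    return term_dict, postings_file


def lookup(term, term_dict, postings_file):
    """Return (doc_ids, number_of_postings_file_reads)."""
    kind, value = term_dict[term]
    if kind == "inline":
        return value, 0
    return postings_file[value], 1
```

With one document per uid, every uid lookup takes the inline branch and reports zero postings-file reads, which is the saved lookup the slide mentions.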
IndexDocValues
  Lucene uses an inverted index (Term to Doc)
    It is not good at getting the value of a certain field from a document
  Goal: fast access to a certain field's value for every document
    To sort documents, or to display a doc's values and not only its ID
  Stored Fields
    It works, but it's not efficient
    Designed for bulk reads
  FieldCache (in RAM)
    Undoes the entire work done at indexing time to build an array (un-inverting)
    Works well up to a certain index size
    Can be a problem in real-time or near-real-time use cases
  IndexDocValues
    1 value per field, type safe
    Can reside on disk
  Reading 10M docs from disk
    FieldCache: 3161 ms
    DocValues: 90 ms
  (Diagram: Term -> Doc, Doc, Doc. How to sort docs?)
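A minimal Python sketch of why sorting is awkward with only an inverted index. The names `uninvert` and `sort_docs` and the sample price field are invented for illustration. The FieldCache-style path has to walk every term to rebuild a doc-indexed array, whereas a DocValues-style column is that same array written once at indexing time.

```python
def uninvert(inverted_index, max_doc):
    """Rebuild a doc-indexed array from a term -> doc ids index (un-inverting)."""
    values = [None] * max_doc
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            values[doc_id] = term
    return values


def sort_docs(doc_ids, column):
    """Sort doc ids by their field value, read directly from the column."""
    return sorted(doc_ids, key=lambda doc_id: column[doc_id])


# Hypothetical zero-padded price field, indexed as terms.
price_index = {"0300": [0, 2], "0100": [1], "0200": [3]}

# FieldCache-style: pay the un-inverting cost at search time.
field_cache = uninvert(price_index, max_doc=4)

# DocValues-style: the same column, assumed to be written at indexing time.
doc_values = ["0300", "0100", "0300", "0200"]

print(sort_docs([0, 1, 2, 3], field_cache))
print(sort_docs([0, 1, 2, 3], doc_values))
```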
DWPT (DocumentsWriterPerThread)
  In Lucene 3
    IndexWriter merges segments and flushes them to disk
    While data is being flushed, the multi-threaded IndexWriter takes a break
  From Lucene 4
    IndexWriter doesn't merge data anymore
    Each writer thread flushes its own segment to disk concurrently
    Less RAM, more concurrency
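The Lucene 4 behaviour described here can be approximated with a toy Python model. `PerThreadWriter`, `index_concurrently` and `max_buffered_docs` are hypothetical names, and a Python list stands in for the disk. Each thread buffers its own documents and flushes its own segment; the only shared state is the list of finished segments, so no thread waits for another thread's buffer.

```python
import threading


class PerThreadWriter:
    """Toy per-thread writer: private buffer, independent flushes."""

    def __init__(self, flushed_segments, lock, max_buffered_docs=2):
        self.buffer = []
        self.flushed_segments = flushed_segments
        self.lock = lock
        self.max_buffered_docs = max_buffered_docs

    def add_document(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        segment = tuple(self.buffer)
        self.buffer = []
        # Only the shared list of finished segments is guarded by the lock.
        with self.lock:
            self.flushed_segments.append(segment)


def index_concurrently(batches, max_buffered_docs=2):
    """Index each batch on its own thread and return all flushed segments."""
    segments, lock = [], threading.Lock()

    def worker(batch):
        writer = PerThreadWriter(segments, lock, max_buffered_docs)
        for doc in batch:
            writer.add_document(doc)
        writer.flush()  # flush whatever is left in this thread's buffer

    threads = [threading.Thread(target=worker, args=(b,)) for b in batches]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return segments
```

Every resulting segment contains documents from exactly one thread, which is the property that removes the stop-the-world flush.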
Automaton Query
  Examples
    RegExp: (ftp|http).*
    Fuzzy: dogs~1
    Fuzzy-Prefix: (dogs~1).*
  Fuzzy query was too slow to use in production
    Prior to 4.0, fuzzy query took the simple yet horribly costly brute-force approach
    In Lucene 3 this is about 0.1 - 0.2 QPS
    Now it's 50 QPS, a 20k% improvement!
  http://java.dzone.com/news/lucenes-fuzzyquery-100-times
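To make the cost of the old approach concrete, here is a small Python sketch of brute-force matching: every term in the dictionary is compared against the query. It deliberately shows the slow pre-4.0 style described on the slide, not the automaton-based technique that replaced it. The function names and the sample term list are invented, and plain Levenshtein distance is assumed as the similarity measure.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_terms(terms, query, max_edits):
    """Brute force: compute the edit distance against every single term."""
    return [t for t in terms if edit_distance(t, query) <= max_edits]


def regexp_terms(terms, pattern):
    """Brute force: test the whole pattern against every single term."""
    compiled = re.compile(pattern)
    return [t for t in terms if compiled.fullmatch(t)]


terms = ["cats", "dog", "dogs", "dots", "ftp://a", "http://b"]
print(fuzzy_terms(terms, "dogs", 1))
print(regexp_terms(terms, "(ftp|http).*"))
```

The work grows with the size of the term dictionary times the term length, which is why this style of fuzzy query does not scale to a large index.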
Solr Flair by Erik Hatcher
Solr Flair
  User interfaces
  User interactions
    Ajax suggestions
    Did you mean? (spell checking)
    Facets
    Clustering
    And so on
wt=velocity
  http://wiki.apache.org/solr/VelocityResponseWriter
  Solritas
  /browse
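For readers who have not used the response writer, the request boils down to adding `wt=velocity` to an ordinary Solr query. The sketch below only builds such a URL and sends nothing. The host, port and `/solr/select` path are assumptions about a default local install, `velocity_url` is a hypothetical helper, and `v.template` is the parameter that names the template to render.

```python
from urllib.parse import urlencode


def velocity_url(base_url, query, template="browse"):
    """Build a Solr query URL that asks for Velocity-rendered output."""
    params = {"q": query, "wt": "velocity", "v.template": template}
    return f"{base_url}?{urlencode(params)}"


# Assumed local Solr address; adjust to your own installation.
print(velocity_url("http://localhost:8983/solr/select", "lucene"))
```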
Prism
  https://github.com/lucidimagination/Prism
  Requires LucidWorks Enterprise and JRuby with the Sinatra gem installed
  Production use of LucidWorks Enterprise requires an annual subscription
  It's free to play with :')
Blacklight
  http://projectblacklight.org/
  Ruby on Rails
  Demo: http://demo.projectblacklight.org/
  Used by universities
    University of Virginia: http://search.lib.virginia.edu/catalog?portal=all&q=lucene
    Stanford University: http://searchworks.stanford.edu/?q=lucene+in+action&search_field=search
VUFind
  http://vufind.org/
  Blacklight competitor
  Library resource portal
  PHP
  Demo: http://vufind.org/demo/
TwigKit
  http://twigkit.com/
  JSP tag library
  Search UI components
  Samples: http://twigkit.com/components.html
Ajax Solr
  https://github.com/evolvingweb/ajax-solr
  JavaScript library that works with jQuery
  Demo: http://evolvingweb.github.com/ajax-solr/examples/reuters/index.html
ApacheCon 2012
  ApacheCon Europe, November 2012, Germany!!?
Thank you!
