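23. Why I chose Sphinx: terabyte-scale indexes; good documentation; tight integration with the LAMP software stack; the only available C/C++ search system (as of 2006); Lucene was not suited to complex retrieval (as of 2006); I hate Java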
24. Sphinx (Coreseek) features (1): high indexing and searching performance; advanced indexing and querying tools (flexible and feature-rich text tokenizer, querying language, several different ranking modes, etc.); advanced result set post-processing (SELECT with expressions, WHERE, ORDER BY, GROUP BY, etc. over text search results); proven scalability up to billions of documents, terabytes of data, and thousands of queries per second
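To make "result set post-processing" concrete, here is a minimal plain-Python sketch of what filtering, grouping, and ordering over full-text matches amounts to. It is an illustration only, not Sphinx's API or implementation; the match records, the `forum_id` and `year` attributes, and the `post_process` helper are all hypothetical.

```python
# Hypothetical full-text matches: each has a relevance weight plus attributes.
matches = [
    {"id": 1, "weight": 1500, "forum_id": 7, "year": 2009},
    {"id": 2, "weight": 1700, "forum_id": 7, "year": 2010},
    {"id": 3, "weight": 1600, "forum_id": 3, "year": 2010},
    {"id": 4, "weight": 1900, "forum_id": 3, "year": 2005},
]


def post_process(matches, min_year, limit):
    """Filter, group, and order text-search matches (illustrative only)."""
    # WHERE year >= min_year
    kept = [m for m in matches if m["year"] >= min_year]

    # GROUP BY forum_id: keep the best-weighted match per group and count members.
    groups = {}
    for m in kept:
        g = groups.setdefault(m["forum_id"], {"best": m, "count": 0})
        g["count"] += 1
        if m["weight"] > g["best"]["weight"]:
            g["best"] = m

    # ORDER BY count DESC, weight DESC, then LIMIT.
    rows = [dict(g["best"], count=g["count"]) for g in groups.values()]
    rows.sort(key=lambda r: (r["count"], r["weight"]), reverse=True)
    return rows[:limit]
```

Calling `post_process(matches, min_year=2008, limit=10)` drops the 2005 match, collapses the rest to one row per forum, and puts the forum with more matches first.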
25. Sphinx (Coreseek) features (2): easy integration with SQL and XML data sources, and SphinxAPI, SphinxQL, or SphinxSE search interfaces; easy scaling with distributed searches; Python data source adapter layer; built-in Chinese tokenizer
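The idea behind distributed search can be sketched as scatter-gather: each shard returns its own best hits, and an aggregator merges them into one global top-N. The snippet below is a minimal sketch of the merge step only, assuming every shard returns `(weight, doc_id)` tuples already sorted best first; the function name and the tuple layout are hypothetical, and this is not how searchd is implemented internally.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, limit):
    """Merge per-shard hit lists (each sorted by weight, best first) into a global top `limit`."""
    # heapq.merge streams the inputs lazily; reverse=True expects descending inputs.
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


# Hypothetical hits from two shards.
shard_a = [(9.1, 101), (4.0, 102)]
shard_b = [(7.5, 201), (6.2, 202), (1.0, 203)]
top3 = merge_shard_results([shard_a, shard_b], limit=3)
```

Because each shard only needs to return its local top `limit` hits, the aggregator never has to see the full result set of any shard.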
27. Sphinx vs. Lucene: faster indexing; faster, more relevant searching; SQL-style queries; we can do Java, but don't require a Java stack; RT index vs. in-memory index
28. Sphinx strengths: BM25 ranker; phrase-based ranking; boosts (sub)phrase matches; a perfect match is guaranteed to be ranked #1; built-in grouping and distributed search support
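The ranking idea on this slide can be illustrated with a small sketch: sort first by the longest run of query words that appears verbatim in the document, and use a BM25 score only as the tie-breaker, so an exact phrase match always comes first. This uses textbook Okapi BM25 with commonly quoted defaults (`k1=1.2`, `b=0.75`) and is an assumption-laden illustration, not Sphinx's actual ranker code; all function names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def longest_phrase_match(query, doc):
    """Length of the longest run of consecutive query words found verbatim in doc."""
    best = 0
    for i in range(len(query)):
        for j in range(len(doc)):
            k = 0
            while (i + k < len(query) and j + k < len(doc)
                   and query[i + k] == doc[j + k]):
                k += 1
            best = max(best, k)
    return best


def rank(query, docs):
    """Return document indexes ordered by (phrase match length, BM25), best first."""
    bm25 = bm25_scores(query, docs)
    keys = [(longest_phrase_match(query, d), bm25[i]) for i, d in enumerate(docs)]
    return sorted(range(len(docs)), key=lambda i: keys[i], reverse=True)


query = ["open", "source", "search"]
docs = [
    ["search", "engines", "are", "open", "to", "many", "source", "formats"],  # words scattered
    ["an", "open", "source", "search", "server"],                             # exact phrase
    ["open", "source", "software"],                                           # sub-phrase
]
order = rank(query, docs)
```

Here the document containing the exact phrase ranks first, the sub-phrase match second, and the document with all three words scattered comes last, regardless of its BM25 score.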
29. Sphinx limitations: roughly 20 GB per single index; CRC64 word IDs; field mask, only 24 fields supported; all attributes are held in memory; poor Windows support; no internal cache support; hard to handle more than 3 TB of data
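The "all attributes in memory" limit is easy to size with back-of-the-envelope arithmetic. The sketch below assumes each attribute, plus the document ID, occupies one 4-byte slot per document, which is a rule of thumb for plain numeric attributes; 64-bit IDs, strings, and multi-value attributes are ignored, and the function name is hypothetical.

```python
def attribute_ram_bytes(num_docs, num_attrs, slot_bytes=4):
    """Rough RAM for in-memory attributes: one slot per attribute plus one for the doc ID."""
    return (1 + num_attrs) * num_docs * slot_bytes


# Example: 10 million documents with 3 numeric attributes each.
ram = attribute_ram_bytes(10_000_000, 3)
ram_mb = ram / 1_000_000  # decimal megabytes
```

Under these assumptions the cost grows linearly with both document count and attribute count, so a billion-document collection with the same three attributes would need about a hundred times as much RAM across the cluster.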
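30. Bragging time (battle-tested Sphinx): Boardreader.com: 30 million documents, 1M+ queries/day; craigslist.com: 20~30 GB of documents, 50M+ queries/day; deployments in China: ChinaUnix, Blogbus, 51CTO, 金融街 BBS, ...; many sites I have never seen, since it is open source ;-); search for a certain archive (TB scale)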