Engineering Web Search Applications
Alessandro Bozzon, Marco Brambilla
Vienna, July 5, 2010
About the speakers
- Alessandro Bozzon, Post-doc @ Politecnico di Milano, http://home.dei.polimi.it/bozzon
- Marco Brambilla, Assistant Professor @ Politecnico di Milano, http://home.dei.polimi.it/mbrambil
- Research background and interests
  - Web engineering and model-driven development: WebML and WebRatio
  - Complex enterprise application design
  - BPM, SOA and integration with Web application development
  - Search engine and complex search application development
    - Search Computing: multi-domain search
    - Pharos: multimedia search framework
© 2010 Alessandro Bozzon, Marco Brambilla
About the tutorial
- Information Retrieval is a more than 40-year-old discipline, tackled from a myriad of viewpoints
- This tutorial is:
  - Breadth-oriented
  - Development-process driven
  - Based on real-world case studies as examples
- The tutorial is necessarily shallow, but we provide references and links
Agenda
- Introduction: What are Web search applications?
- Requirements: What are their requirements?
- Design: How to design them?
- Implementation: How to implement them?
- Validation: How to measure their success?
Introduction
Search prevails
- Search is an integral part of people's online life
- Web search has become a standard (and often preferred) way of finding information
  - "... 92% of Internet users say the Internet is a good place to go for getting everyday information ..." (2004 Pew Internet Survey)
- Web search engines are now the second most frequently used online computer application, after email
- Search is fully integrated into operating systems and is viewed as an essential part of most information systems
Some numbers …
- Web
  - Estimated size: ~60 billion pages (22/06/2010), http://www.worldwidewebsize.com/
  - More than 9.3 billion queries in the U.S. alone in May 2010, and growing, http://blog.nielsen.com/nielsenwire/online_mobile/top-u-s-search-sites-for-may-2010/
- Twitter
  - New tweets per day: 55 million
  - Search queries per day: 600 million
- Facebook
  - 400 million global users (and growing)
  - The average Facebook user spends 55 minutes per day
… more numbers …
- IDC Digital Universe report estimates:
  - Digital data grew by 62% between 2008 and 2009, to ~800,000 petabytes (PB)
  - More than 1.2 million PB in 2010
  - Expected to reach 35 zettabytes (ZB) by 2020
[Ramakrishnan and Tomkins 2007]
Information Retrieval
- Information retrieval (IR) deals with the representation, storage, organization of, and access to information items
- An "old" discipline
- As an academic field of study:
  - IR is devoted to finding relevant documents, not to finding simple matches to patterns
  - IR is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers) [Manning et al., 2007]
Information Retrieval Applications
- Search ("ad hoc" retrieval)
  - Static document collection
  - Dynamic queries
  - [Figure: ad-hoc query, static document collection, ranked result]
- Filtering
  - Queries are static
  - Document collection is constantly changing
  - Example: corporate mail routed by predefined queries to different parts of the organization
  - [Figure: incoming documents, document routing system, predetermined queries or user profiles]
The nature of information retrieval
- ... retrieving all objects which might be useful or relevant to the user's information need
- Usually unstructured queries (no formal semantics)
  - The IR system "interprets" the contents of the information items
  - Examples: keyword-based queries, context queries, proximity, phrases, natural language queries ...
  - Structural queries and, in recent systems, structured query languages are also supported (but with different semantics)
- Errors in the results are tolerated
- Core concept: relevance
  - Relevance ranking (according to the user need)
  - It is not clear what "degree of relevance" the user is happy with
  - The user starts from the top of the ranked list and explores down until satisfied
Information Retrieval is NOT Data Retrieval
- Data Retrieval (RDBMS, XML DB)
  - ... retrieving all objects which satisfy clearly defined conditions expressed through a query language
  - Data has a well-defined structure and semantics
  - Formal query languages: regular expressions, relational algebra expressions, etc.
  - Results are EXACT matches: errors are not tolerated
  - No ranking w.r.t. the user's information need
  - Binary retrieval: does not allow the user to control the magnitude of the output
  - For a given query, the system may return:
    - Under-dimensioned output
    - Over-dimensioned output
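The contrast between exact-match data retrieval and ranked information retrieval can be sketched in a few lines. This is only an illustration under stated assumptions: the three-document collection, the function names, and the term-count scoring are hypothetical and are not taken from the tutorial.

```python
from collections import Counter

# Hypothetical three-document collection, used only for illustration.
DOCS = {
    "d1": "cheap flights to vienna",
    "d2": "vienna restaurants serving wiener schnitzel",
    "d3": "schnitzel recipe",
}


def exact_retrieval(docs, terms):
    """Data-retrieval style: keep only documents that satisfy ALL conditions."""
    return sorted(
        doc_id for doc_id, text in docs.items()
        if all(term in text.split() for term in terms)
    )


def ranked_retrieval(docs, terms):
    """IR style: score documents by matching term occurrences and rank them."""
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


query = ["cheap", "schnitzel", "vienna"]
print(exact_retrieval(DOCS, query))   # no document satisfies every condition
print(ranked_retrieval(DOCS, query))  # partial matches are returned, best first
```

With this query the exact-match function returns an empty (under-dimensioned) result, while the ranked function still returns every partially matching document in score order.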
The Information Retrieval Process
[Figure: generic search-oriented application, split into BACKEND and FRONTEND. Components: Content Management, Query Analysis, Query Interaction, Search, Result Composition, Result Manipulation. Data flows are labeled q, q', r, r'.]
Search Engine vs. Search Application
- Search Engine: a data management system which uses information retrieval algorithms to retrieve information items from one or more sources upon the submission of a query
- Web Search Application: a data management system where search engines are one piece of a more complex puzzle that includes:
  - data source integration (e.g., databases, legacy systems, the Web)
  - content analysis technologies
  - orchestration
  - user interfaces
  - Web-mediated social interactions, etc.
Characterization of the user information need
- It is not a simple problem:
  - "Blurred" goals
  - Sensory gap: the gap between the object in the world and the information in a (computational) description
  - Semantic gap: the lack of coincidence between the (computational) description of the information and its interpretation
Evaluating an IR System
- Precision: fraction of retrieved docs that are relevant, P(relevant|retrieved)
  - "Degree of soundness" of the system
  - Does not consider the total number of documents
- Recall: fraction of relevant docs that are retrieved, P(retrieved|relevant)
  - "Degree of completeness" of the system
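The two measures follow directly from set intersection. The sketch below assumes results and relevance judgments are given as collections of document identifiers; the function name and the sample identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents retrieved, 5 documents judged relevant.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7"],
)
print(p, r)  # 2 of 4 retrieved are relevant; 2 of 5 relevant are retrieved
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen here to avoid division by zero; other conventions exist.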
Enterprise search
- Public Web search engines are the ones known to the general public
- But there is also a huge need (and market!) for professional search over enterprise repositories
- Enterprise search is covered by:
  - Packaged suites: Microsoft FAST, Autonomy IDOL, IBM OmniFind, Exalead
  - Frameworks: Apache UIMA (ex IBM), SMILA, Solr
Case Studies
- Textual search: YaGoBi
- Multimedia search: the PHAROS project
- Multi-domain search: the Search Computing project
- Example of a Web search application: Chansonnier
YaGoBi
- THE Web search: 92% of market share in the U.S.
- Searching on:
  - Web pages, blogs, news, books, scientific publications, emails
  - Images and videos (but only through textual descriptions)
  - Tweets
  - ...
The PHAROS Project
- FP6 IP, 3 years, 12 partners, ~15 M€ budget
- Mission: develop an SOA-compliant, open and distributed technology platform for the development of information access solutions for audiovisual content
- www.pharos-audiovisual-search.eu
The Search Computing Project
- European Research Council (ERC), 2008 call for "IDEAS Advanced Grants", 5 years (started in 2009)
- Mission: provide the abstractions, foundations, methods, and tools required to answer multi-domain queries by interacting with a constellation of cooperating search services, using ranking and joining of results as the dominant factors for service composition
- www.search-computing.org
Chansonnier
- BSc thesis project
- Mission: graduate
- Open source video analysis application based on open frameworks (SMILA / Solr)
  - Crawling of Web video
  - Download of song lyrics
  - Analysis of the lyrics text: language, emotion
  - Keyframe extraction for video snippets
- http://github.com/giorgiosironi/Chansonnier
Requirements
Key Requirements and Design Dimensions for Web Search
[Figure: requirement dimensions: Data Source, Data Format, Data Analysis, Search Engine, Query Format, User Behavior, User Interface, Performance, Security, Social Interactions]
Data Sources
- Web
- Databases
- File systems
- Intranets / Extranets
- Legacy systems
- Users
- Sensors (in a wide sense) and streams
Data Type
- Unstructured data: textual documents, blog posts
- (Semi-)structured data: software code, models, XML files
- Media: pictures, video, music
Data Analysis
- An activity performed to provide a representation of a content item suited to the application
- Textual analysis: deals with basic language units (morphemes, roots, stems, words, phrases, sentences, etc.)
- Media analysis: deals with media contents
  - Transcoding
  - Classification
  - Feature extraction
Search Engine (1)
- Textual: textual contents represented as a collection of unstructured text terms
- Fielded: textual contents structured in fields (e.g., metadata)
- Semi-structured: textual contents organized in complex (possibly heterogeneous) structures (e.g., XML, HTML)
Search Engine (2)
- Content-based: media contents described by low-level features
- Geographic and other special dimensions
  - Content featuring geo-spatial features
  - Streaming content searched by temporal features (e.g., recency)
Query Format
- Representation of the user information need
- Natural language: for instance, through vocal interfaces
- Keyword: set of text items, plus Boolean (AND/OR/NOT), proximity (lexical nearness) and/or wildcard conditions
- Fielded keyword
  - Text items defined on one or more fields
  - Queries to semi-structured search engines and faceted queries
- Content-based: query by example (text, image, video, audio, etc.)
- Geographic and other special dimensions
  - Geographic coordinates plus spatial operator terms (near, north of, within X kilometers from, etc.)
  - Timestamps plus temporal operator terms (recent, near, interval, etc.)
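Keyword queries with Boolean conditions are typically answered from an inverted index that maps each term to the set of documents containing it. The sketch below is a minimal illustration, assuming whitespace tokenization and a hypothetical three-document collection; the function names are not taken from any specific engine.

```python
from collections import defaultdict

# Hypothetical collection: document id -> text.
DOCS = {
    1: "amsterdam canal houses",
    2: "amsterdam modern building",
    3: "vienna opera building",
}


def build_inverted_index(docs):
    """Map every term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def postings(index, term):
    """Return the posting set of a term (empty if the term is unknown)."""
    return index.get(term.lower(), set())


def query_and(index, a, b):
    return postings(index, a) & postings(index, b)


def query_or(index, a, b):
    return postings(index, a) | postings(index, b)


def query_and_not(index, a, b):
    return postings(index, a) - postings(index, b)


INDEX = build_inverted_index(DOCS)
print(query_and(INDEX, "amsterdam", "building"))      # documents with both terms
print(query_or(INDEX, "amsterdam", "building"))       # documents with either term
print(query_and_not(INDEX, "amsterdam", "building"))  # amsterdam but not building
```

Proximity and wildcard conditions would need positional postings and a term dictionary lookup, which this sketch deliberately leaves out.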
YaGoBi
- Data sources
  - Web: crawling of Web resources
  - Users: comments, preferences, relationships
- Data types
  - Unstructured data: Web pages
  - Documents: PDF, PPT, DOC, etc.
- Data analysis
  - Textual: for content, documents, and user-generated comments
  - Media: some basic image analysis for color, faces, size
- Search engine
  - Fielded: filetype, page title, site, page content
  - Content-based: image similarity in Google
- Query format
  - Fielded keyword
  - Geographic
PHAROS
- Data sources
  - Web: crawling of audio/video files
  - File system: NAS and content provider media archives
  - Users: comments, preferences, relationships
- Data types
  - Structured data: content provider description metadata
  - Media: high-quality video and audio files
  - Semi-structured data: MPEG-7 description of processed media files and user annotations
- Data analysis
  - Textual: for content metadata and user-generated comments
  - Media: for audio and video
    - Audio/video mood classification, image concept classification, music genre and danceability classification, face recognition and identification, speech to text
PHAROS
- Search engine
  - Semi-structured: XML search engine for MPEG-7 content descriptions, plus geographic annotations and geo-based ranking
  - 3 content-based engines: one for music, one for images (shots of the video), one for face similarity
- Query format
  - Fielded keyword: XQuery for the XML search engine
  - Query by example: for images, music and faces
  - MPQF: high-level query language with AND / OR / AND THEN for fielded-keyword and by-example queries
Query Federation in PHAROS
[Figure: a federated query is split into typed parts, each served by a dedicated engine and index:
- Keywords ("amsterdam"): text search, inverted index
- XPath (\\where[contains("amsterdam")] and \\topic[contains("building")]): XML search, semantic index
- Long/Lat (52.37 N, 4.89 E): geo search, R-tree index
- JPG: image search, similarity index
Components: Query analysis, Federation]
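The federation idea can be sketched as a dispatcher that routes each typed sub-query to its engine and merges the scored answers. Everything below is hypothetical: the stub engines, their scores, and the merge policy (keep items returned by every engine, average their scores) are assumptions made for illustration, not the actual PHAROS federation logic.

```python
def text_engine(keywords):
    # Stub standing in for an inverted-index text search (scores are made up).
    return {"video1": 1.0, "video2": 0.5, "video3": 0.5}


def geo_engine(point):
    # Stub standing in for an R-tree geo search (scores are made up).
    return {"video1": 0.5, "video3": 0.25}


# Hypothetical routing table: sub-query type -> engine.
ENGINES = {"keywords": text_engine, "long_lat": geo_engine}


def federate(sub_queries, engines):
    """Dispatch each typed sub-query to its engine and merge the results."""
    partial = [engines[kind](payload) for kind, payload in sub_queries.items()]
    if not partial:
        return []
    # Assumed AND semantics: keep only items returned by every engine.
    common = set.intersection(*(set(result) for result in partial))
    merged = {
        item: sum(result[item] for result in partial) / len(partial)
        for item in common
    }
    # Best average score first; ties broken by item name.
    return sorted(merged.items(), key=lambda pair: (-pair[1], pair[0]))


results = federate(
    {"keywords": "amsterdam", "long_lat": (52.37, 4.89)},
    ENGINES,
)
print(results)
```

A real federation layer would also have to normalize scores across engines, since similarity, text, and distance scores are not directly comparable.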
User Behavior
- Search is evolving: content vs. intent
  - People don't want to search
  - People want to get tasks done and get answers
- Moving towards identifying a user's task
  - Enabling means for task completion
- Search as a process
  - Example: "I am craving a good Wiener Schnitzel and a Sachertorte in Vienna"
  - [Figure: process from Start to End through Search, Menu, Reviews, Map]
- Search applications must:
  - Support the user in the search process
  - (Try to) infer the user intent to help them accomplish their task
- Ricardo Baeza-Yates, Next Generation Search, 2nd SeCo Workshop, Milan, 24/06/2010
Information Seeking [Bates, 2002]
- Bates, Marcia J. 2002. Toward an integrated model for information seeking and searching. In: The Fourth International Conference on Information Needs, Seeking and Use in Different Contexts.
Information Foraging
- Information foraging applies ideas from optimal foraging theory to understand how human users search for information
- Assumption: humans use "built-in" foraging mechanisms that evolved to help our animal ancestors find food
- Some references
  - Fu, Wai-Tat; Pirolli, Peter (2007), "SNIF-ACT: a cognitive model of user navigation on the world wide web", Human-Computer Interaction: 335–412
  - Jason Withrow, "Do your links stink?", American Society for Information Science Bulletin, June 1, 2002
  - Pirolli, Peter (2009), "An elementary social information foraging model", Proceedings of the 27th international conference on Human factors in computing systems: 605–614
Moving between patches
- Patches of information = websites
- Problem: should I continue foraging in the current patch or look for another patch?
- Expected gain from continuing in the current patch vs. moving to another
Information seeking funnel [D. Rose, 2008]
- Wandering: the user does not have an information-seeking goal in mind
- Exploring: the user has a general goal but not a plan for how to achieve it
- Seeking: the user has started to identify information needs that must be satisfied, but the needs are open-ended
- Asking: the user has a very specific information need that corresponds to a closed-class question
Berrypicking vs. Orienteering vs. Teleporting ...
- Berrypicking: information needs change during interactions
  - M.J. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407–431, 1989.
- Orienteering [Teevan et al., CHI 2004]: the searcher issues a quick, imprecise query to get to approximately the right region of the information space, and then follows known paths in small steps that move them closer to their goal. Easy! ("perfect" query not needed)
- Teleporting: expert searchers issue longer queries to jump directly to the target. Requires more effort and experience.
… vs. exploratory search
- Exploratory search: the user's intent is primarily to learn more about a topic of interest, by exploring various directions and sources
- "... exploratory search blends querying and browsing strategies" and is different "from retrieval that is best served by analytical strategies ..."
  - Marchionini, G. Exploratory search: from finding to understanding. Communications of the ACM 49(4): 41-46 (2006)
- Some references
  - Definition and analysis of the problem: White, R. W., and Drucker, S. M. Investigating behavioral variability in web search. 16th WWW Conf. (Banff, Canada, 2007)
  - Complex search and exploratory search: Aula, A., and Russell, D.M. Complex and Exploratory Web Search. ISSS: Information Seeking Support Systems Workshop (Chapel Hill, June 2008)
Multi-domain Exploratory Search
- "... search for upcoming concerts close to an attractive location (like a beach, lake, mountain, natural park, and so on), considering also availability of good, close-by hotels"
- Current approach the user can adopt:
  - Independently explore search services
  - Manually combine findings
Multi-domain Exploratory Search
- "... expand the search to get information about available restaurants near the candidate concert locations, news associated to the event and possible options to combine further events scheduled in the same days and located in a close-by place with respect to the first one ..."
Existing Approaches (1)
- Topic-based search: an instance of exploratory search centered on the goal of collecting information on a subject matter of interest from multiple sources
  - Kosmix: topic discovery engine; keyword search; a topic page summarizes the most relevant information on the subject
  - Hakia: resume pages for topics associated with the user's queries; natural language processing techniques
Existing Approaches (2)
- Structured object search: process queries and present results that address entities or real-world objects described in Web pages
  - Google Squared: keyword search; results collected in a table (called a square) featuring all the attributes relevant to the result items as column headers
  - Google Fusion Tables: upload data tables (e.g., spreadsheet files) and join (or "fuse") the data in some column with other tables
The note-taking limit
- There is a limit after which the options found need to be written down [Aula and Russell, 2008]
Liquid Queries
- "A new paradigm allowing users to formulate and get responses to multi-domain queries through an exploratory information seeking approach, based upon structured information sources exposed as software services ..."
- Composite answers obtained by aggregating search results from various domains
- Highlight the contribution of each search service
- Join of results based on the structural information afforded by the search service interfaces
- Refine the user query
- Re-shape the result list
- Alessandro Bozzon, Marco Brambilla, Piero Fraternali, Stefano Ceri. Liquid Query: multi-domain exploratory search on the Web. WWW 2010, Raleigh, USA
Liquid Queries Definition (1)
- Template-based approach
- It consists of subsetting and parametrizing the resource graph ...
[Figure: resource graph with nodes Concert, Artist, Exhibition, Restaurant, Hotel, Movie, Metro Station, Theatre, Photo, Landmark, News; the selected subset is Photo, Concert, Metro Station, Restaurant, News, Exhibition, Artist, Hotel, annotated with inputs, outputs, and GR = global ranking]
Liquid Queries Definition (2)
- And then characterizing the user interaction
- Plus:
  - Parametrization of global ranking
  - Data visualization options
  - ... and so on
[Figure: the selected resource graph (Photo, Concert, Metro Station, Restaurant, News, Exhibition, Artist, Hotel) with an Expand interaction]
Result Exploration Support
- If the current set of combinations is not satisfactory, the user may ask for more values for a service (more one) or for all services (more all)
  - More concerts, more hotels, or more combinations
- Add new information about further domains for selected combinations (expand)
  - Find close-by restaurants or co-located events
- Aggregate information to ease analysis and readability (clustering, grouping)
  - Group events by venue
- Reduce the number of shown items through filtering
  - Total walked distance for the night
- Re-order (ranking or sorting)
- Calculate derived values from existing ones
  - Total walked distance for the night
- Alternative data visualization
  - Map, parallel coordinates, ...
- DEMO: http://demo.search-computing.org
User Intent
- Understand the user information need
- User intent taxonomy [Broder 2002]
  - Informational: want to learn about something (~40% / 65%)
  - Navigational: want to go to a given page (~25% / 15%)
  - Transactional: want to do something (Web-mediated) (~35% / 20%)
- Grey areas
  - Find a good hub
  - Exploratory search
- Example queries: History nyonya food, Singapore Airlines, Jakarta Weather, Nikon Finepix, Car Rental Kuala Lumpur
- [from SIGIR 2008 Tutorial, Baeza-Yates and Jones]
Contextual Content Delivery
- Context vs. personalization
- Trigger the right search depending on the context
  - Task
  - Location
  - User engagement
- Not interested in your personal profile
  - Your favorite restaurant? It depends on where you are!
- From Ricardo Baeza-Yates, Next Generation Search, 2nd Search Computing Workshop, Milan, 24/06/2010
- Demo: http://sandbox.yahoo.com/Motif
Relevance: the Top-k problem
- Relevance of the results with respect to the request is the main expectation of search engine users
- Top-k relevant items: quickly retrieve a number (k) of highest-ranking tuples in the presence of monotone ranking functions defined on the attributes of the underlying relations
- Some references
  - R. Fagin. Combining fuzzy information from multiple systems. J. Comput. Syst. Sci., 58(1):83–99, 1999.
  - I. F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid. Rank-aware query optimization. In SIGMOD Conference, pages 203–214, 2004.
  - D. Martinenghi and M. Tagliasacchi. Proximity Rank Join. To appear in PVLDB.
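One well-known way to answer top-k queries under a monotone ranking function is the Threshold Algorithm (TA) of Fagin, Lotem and Naor, a refinement of the approach in the Fagin reference above; it is used here as an illustration and is not claimed to be the method presented in the tutorial. The sketch assumes every item appears in every list, that each list is sorted by descending score, and that the aggregation is a plain sum; the hotel data is hypothetical.

```python
import heapq


def threshold_algorithm(lists, k, aggregate=sum):
    """Return the k best (item, score) pairs under a monotone aggregate.

    lists: one list per attribute, each made of (item, score) pairs
           sorted by descending score (sorted access).
    """
    lookups = [dict(lst) for lst in lists]  # random access by item
    seen = {}  # item -> aggregated score
    max_depth = max(len(lst) for lst in lists)
    for depth in range(max_depth):
        last_scores = []
        for lst in lists:
            if depth < len(lst):
                item, score = lst[depth]
                last_scores.append(score)
                if item not in seen:
                    # Random access: fetch the item's score in every list.
                    seen[item] = aggregate(
                        lookup.get(item, 0.0) for lookup in lookups
                    )
            else:
                last_scores.append(0.0)
        # No unseen item can score higher than this threshold.
        threshold = aggregate(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda pair: pair[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top  # early termination
    return heapq.nlargest(k, seen.items(), key=lambda pair: pair[1])


# Hypothetical hotel scores on two attributes.
RATING = [("h1", 0.9), ("h2", 0.8), ("h3", 0.3)]
CLOSENESS = [("h2", 0.9), ("h3", 0.7), ("h1", 0.1)]

best = threshold_algorithm([RATING, CLOSENESS], k=1)
print(best[0][0])  # the hotel with the highest combined score
```

In this example the algorithm stops after reading two entries per list, because the best item seen so far already scores at least as high as the threshold.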
Result Diversification
- Relevance is not the only success factor for a result set
- User satisfaction increases if the first items cover a good spectrum of options
  - If the user intent is ambiguous, diversification tries to cover the most likely intents
  - If several top-k items are very similar, they can be clustered together
- Thus: an optimization problem
  - Objective: find the set of k elements that contains the most relevant and diverse items
  - Maximal Marginal Relevance [Carbonell and Goldstein 1998]
[Figure: trade-off between Relevance and Diversity]
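Maximal Marginal Relevance is usually implemented as a greedy selection: at each step, pick the item that maximizes lambda * relevance minus (1 - lambda) * its maximum similarity to the items already selected. The sketch below follows that greedy form; the relevance scores, the pairwise similarities, and the function names are hypothetical values chosen for illustration.

```python
def mmr(candidates, relevance, similarity, k, lam=0.5):
    """Greedy Maximal Marginal Relevance selection of k items."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def marginal_score(item):
            # Redundancy is the similarity to the closest selected item.
            redundancy = max(
                (similarity(item, chosen) for chosen in selected),
                default=0.0,
            )
            return lam * relevance[item] - (1 - lam) * redundancy

        best = max(remaining, key=marginal_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical data: "a" and "b" are near-duplicates, "c" is different.
RELEVANCE = {"a": 0.9, "b": 0.85, "c": 0.5}
PAIR_SIMILARITY = {
    frozenset(("a", "b")): 0.95,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.1,
}


def pair_similarity(x, y):
    return PAIR_SIMILARITY[frozenset((x, y))]


print(mmr(["a", "b", "c"], RELEVANCE, pair_similarity, k=2, lam=0.5))
print(mmr(["a", "b", "c"], RELEVANCE, pair_similarity, k=2, lam=1.0))
```

With lam=0.5 the near-duplicate "b" is skipped in favor of the less relevant but different "c"; with lam=1.0 the selection degenerates to pure relevance ranking.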
User Interface
- More complete information in one search
  - Shortcuts
  - Deep links
  - Enhanced results
User Interface
- Optimization of the result set layout (and of page space)
Performance
- Users don't want to waste their time waiting for a search result
  - User satisfaction
- Performance is the leading factor in the evaluation of Web search applications
  - Queries per second (QPS)
  - Time to index
- Scalability
  - Content
  - Queries
- Distribution
  - Service-oriented computing
  - Content Delivery Networks
  - But intellectual property may be a concern
- More in the ARCHITECTURE section
Other Requirements
- Social interaction
  - Content evaluation
  - User relationships and actions as additional content description
- Security & privacy
  - Access policies: collection vs. item level
  - Anonymity
    - Who I am = What I like + What I do + Where I am?
    - A search process tells a lot about who is doing it
  - Alessandro Bozzon, Tereza Iofciu, Wolfgang Nejdl, Antonio V. Taddeo, Sascha Tönnies. Role Based Access Control for the interaction with Search Engines. COOPER 2007, Crete, Greece.
Design
Designing Web Search Applications
- Reference architecture
- Reference execution processes
- Set of design dimensions
- Development methodology
- Tools supporting the methodology
Search Applications from 1000 feet
Bird's-eye view of Search Applications
Search Application Processes
An example of Indexing Process
Pharos: the architecture
Search Computing: the architecture
[Figure: architecture diagram; legend: main query flow, <uses> relation]
Search Computing: the architecture
- High-level query: "Where can I attend a DB scientific conference close to a beautiful beach reachable with cheap flights?"
  - Sub-query 1: "Where can I attend a DB scientific conference?"
  - Sub-query 2: "Place close to a beautiful beach?"
  - Sub-query 3: "Place reachable with a cheap flight?"
Search Computing: the architecture
- Low-level query 1: ConfSearch("DB", PlaceX, DateY)
- Low-level query 2: TourSearch("Beach", PlaceX)
- Low-level query 3: Flight("cost<200", PlaceX, DateY)
Search Computing: the architecture
- Query plan: service invocations and operator execution
- Results presented:
  - ESWC - Crete - Olympic
  - CAISE - Hammamet - Alitalia
  - TOOLS - Malaga - EasyJet
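The final step of the example, combining the answers of the three services on their shared place and ranking the combinations, can be sketched as a simple join. This is a naive nested-loop join with a sum as global ranking, not the rank-join operators used by Search Computing; the scores are invented, the extra Oslo flight is added only to show a non-joining tuple, and the join on DateY is omitted for brevity.

```python
from itertools import product

# Hypothetical service results; all scores are invented for illustration.
CONFERENCES = [
    {"conf": "ESWC", "place": "Crete", "score": 0.9},
    {"conf": "CAISE", "place": "Hammamet", "score": 0.8},
    {"conf": "TOOLS", "place": "Malaga", "score": 0.7},
]
BEACHES = [
    {"place": "Crete", "score": 0.8},
    {"place": "Hammamet", "score": 0.7},
    {"place": "Malaga", "score": 0.6},
]
FLIGHTS = [
    {"airline": "Olympic", "place": "Crete", "score": 0.7},
    {"airline": "Alitalia", "place": "Hammamet", "score": 0.6},
    {"airline": "EasyJet", "place": "Malaga", "score": 0.5},
    {"airline": "EasyJet", "place": "Oslo", "score": 0.9},  # joins with nothing
]


def join_and_rank(conferences, beaches, flights):
    """Join the three result lists on place and rank by summed score."""
    combos = []
    for conf, beach, flight in product(conferences, beaches, flights):
        if conf["place"] == beach["place"] == flight["place"]:
            total = conf["score"] + beach["score"] + flight["score"]
            combos.append(
                (total, conf["conf"], conf["place"], flight["airline"])
            )
    combos.sort(key=lambda combo: -combo[0])  # best combination first
    return [combo[1:] for combo in combos]


for conf, place, airline in join_and_rank(CONFERENCES, BEACHES, FLIGHTS):
    print(conf, place, airline)
```

A nested-loop join reads every input completely; the point of rank-join methods is to produce the top combinations after fetching only a prefix of each ranked list.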
Design Dimensions
- Retrieval Policy (affected process: Indexing)
  - Values: Push, Pull
- Data Homogeneity (affected process: Indexing)
  - Values: Homogeneity, Heterogeneity
- Data Analysis (affected process: Indexing)
  - Values: Mono Annotation, Multi Annotation; Mono Modal, Multi Modal
- Search Technology (affected processes: Indexing, Query and Result Presentation)
  - Values: Search Engine(s) Type; Homogeneity, Heterogeneity
- Query Format (affected processes: Query and Result Presentation, User Interface)
  - Values: Query Type; Mono Modal, Multi Modal; Mono Domain, Multi Domain
- User Interaction (affected process: User Interface)
  - Values: Direct, Indirect; Active, Passive
Designing Web Search Applications - An MDD approach
- Clear separation of concerns among the involved actors
- Central role of models as key development artifacts
- Automatic code generation, etc.
- Alessandro Bozzon, Marco Brambilla, Piero Fraternali. Conceptual Modeling of Multimedia Search Applications using Rich Process Models. ICWE 2009, June 24-26, 2009, San Sebastian, Spain
Development Methodology
- Process models (e.g., BPMN)
- Domain data and process metadata (e.g., ER/UML)
- Model-to-model transformation (e.g., Java / XSLT / ATL)
- Application models (DSL, e.g., WebML)
- Model-to-code transformation
- Running application
An example domain model: Content Analysis / ER
- Content: the objects that relate to the Content Items indexed by a search application
- Annotation: structure of the annotations associated with searchable Content Items during the indexing process
- Usage: usage groups of the application (RBAC model)
- Index: abstraction of the actual physical implementation of search engine indexes
An example process model: Content Analysis / BPMN - WebML
- Coarse indexing process model: Content Registration, Content Analysis, Content Indexation
- Refinement
- Fine-grained process model: analysis of audiovisual content through face recognition and identification technologies
- M2M transformation
- Application model: Face Recognition and Segmentation activity
- M2T transformation
- Running CPA process
  - Console trace of the working annotation technology
  - Process advancement control UI
An Example of Complex Process
- Analysis of audiovisual content
- Incremental analysis of audiovisual content with textual annotations
Modeling User Interface
- The information seeking interaction modes (Searching, Browsing, Monitoring, Being aware, Social interactions)
- Distilled 30+ information seeking user interaction patterns
  - Query execution and result presentation
  - Keyword (faceted, similarity, geo) search specification and refinement ...
  - Browsing, content organization, content-based awareness, etc.
  - Relationship setting, recommendation, etc.
- UI designed as an assembly of standard interaction patterns expressed in WebML
- Alessandro Bozzon. Model-driven development of Search Based Web Applications. Ph.D. Thesis, Politecnico di Milano, April 2009.
Pattern Example: Faceted Search
Pharos: Modeling User Interface
- http://www.youtube.com/watch?v=ZpxyNi6Ht50
- Annotated UI areas: keyword refinement, faceted refinement, content-based refinement, result presentation
An Example of M2M Transformation: BPMN* -> WebML
MDD in Search Computing
- 4 artifact models: Search Service, Query, Query Parameters, Result
- A query plan model
  - For the runtime query transformation
Search Computing Model Example: Search Service Model
- ServiceMart: abstraction (e.g., Hotel) of one or more Web service implementations (e.g., Bookings and Expedia), possibly ranked and chunked into pages
- Attribute: atomic or composite
- AccessPattern: specifies RankingType and AttributeDirection (I/O)
- ConnectionPattern: defined as an input-output relationship between pairs of service marts (for joining them)
  - E.g., the output city of Concert used as input for Hotel
- ServiceInterface: physical interface of the service
  - Exact or Search (ranked)
  - Details about chunk size, cost
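The concepts of the service model can be mirrored in a few data classes. This is a hypothetical rendering for illustration only: the class fields, the `is_valid` check, and the sample chunk size and cost are assumptions and do not reproduce the actual Search Computing meta-model or its API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class Direction(Enum):
    INPUT = "I"
    OUTPUT = "O"


class InterfaceKind(Enum):
    EXACT = "exact"
    SEARCH = "search"  # ranked results


@dataclass
class AccessPattern:
    name: str
    directions: Dict[str, Direction]  # attribute name -> I/O direction
    ranked: bool = False


@dataclass
class ServiceInterface:
    name: str
    kind: InterfaceKind
    chunk_size: int  # results returned per call (assumed unit)
    cost: float      # abstract invocation cost (assumed unit)


@dataclass
class ServiceMart:
    name: str
    attributes: List[str]
    access_patterns: List[AccessPattern] = field(default_factory=list)
    interfaces: List[ServiceInterface] = field(default_factory=list)


@dataclass
class ConnectionPattern:
    source: ServiceMart
    target: ServiceMart
    attribute_pairs: List[Tuple[str, str]]  # (source output, target input)

    def is_valid(self, source_ap, target_ap):
        """Check that each pair links a source output to a target input."""
        return all(
            source_ap.directions.get(out) is Direction.OUTPUT
            and target_ap.directions.get(inp) is Direction.INPUT
            for out, inp in self.attribute_pairs
        )


concert_ap = AccessPattern(
    "ConcertByArtist",
    {"artist": Direction.INPUT, "city": Direction.OUTPUT},
    ranked=True,
)
hotel_ap = AccessPattern(
    "HotelByCity",
    {"city": Direction.INPUT, "name": Direction.OUTPUT},
    ranked=True,
)
concert = ServiceMart("Concert", ["artist", "city"], [concert_ap])
hotel = ServiceMart(
    "Hotel",
    ["city", "name"],
    [hotel_ap],
    [ServiceInterface("Bookings", InterfaceKind.SEARCH, chunk_size=10, cost=1.0)],
)

# The output city of Concert is used as input for Hotel.
concert_to_hotel = ConnectionPattern(concert, hotel, [("city", "city")])
print(concert_to_hotel.is_valid(concert_ap, hotel_ap))
```

Reversing the connection (Hotel to Concert on city) fails the check, because city is an input of the Hotel access pattern rather than an output.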
Search Computing Query Meta-model
- LogicalQuery
  - A conjunctive query over services
  - Can be defined at an abstract level (AccessPatternLevelQuery) or at the physical level (InterfaceLevelQuery)
- QueryClause
  - A LogicalQuery is composed of a set of QueryClauses
  - A QueryClause can refer to the service mart level or to the service interface level
  - Several types: InvocationClauses, PredicateClauses, JoinClauses, RankingClauses
Search Computing Model Transformations
- Vertical transformations for Queries and ServiceMarts
- QueryToPlan transformation
- Query Execution transformation (at runtime)
- Result transformation (at runtime)
Prototype: http://dbgroup.como.polimi.it/brambilla/SeCoMDA
Search Computing DSLs (& Transformations): Panta Rhei
- Describes both the execution flow and the data flow between nodes of a query plan
- Several types of nodes exist: service invocators; sorting, join, and chunk operators; clocks (defining the frequency of invocations); caches; and others
- The query result model is constructed stepwise, following the execution flow
Reference: D. Braga, S. Ceri, F. Corcoglioniti, M. Grossniklaus, and S. Vadacca: Panta Rhei: An Execution Model for Queries over Web Information Sources, http://www.search-computing.it/sites/cms.web.seco/files/pantarhei2010.pdf
Implementation
From the models to implementation
- Once the design phase is completed: IMPLEMENTATION TIME
- Never implement a search engine/app from scratch!
- Start from your requirements and design, and:
  - Identify possible existing solutions (REUSE)
  - Select the one that best fits your needs (SHOPPING)
  - Implement what you need (DEPLOY vs. CONFIGURE)
- We will see: Open Source (products) vs. Open Search (services)
- A full-fledged model-driven approach can be devised, with model-to-code transformations that generate:
  - The code for the pieces of the Web search application that you need
  - The configuration for the tools of choice
Search Framework vs. Search Engine
- Search Engines: “provide an interface to a group of items that enables users to specify criteria about an item of interest and have the engine find the matching items. The criteria are referred to as a search query” (Wikipedia)
- Search Frameworks: software components that target a set (possibly exhaustive) of the architectural layers of a search application
  - E.g., crawling + analysis + indexing/querying
Open Source Search vs. Open Search
- Open Source -> build your own engine
- Open Search -> exploit commercial engines
Source: WWW2010 Tutorial, Open Source Tools, Drake & Jones, Yahoo!
Open Source Search: high-level comparison
Product  License   Lang.     Docs             Ranking        Users       Parallel  Scale  Support
Lucene   Apache    Java/C++  Several          Flexible       Amazon      Yes       TB     5/5
Zettair  BSD-like  C         HTML, TREC, TXT  Flexible       Research    No        TB     1/5
Indri    BSD-like  C++       Many             Very flexible  Research    Yes       TB     1.5/5
Sphinx   GPL       C++       Many             Flexible       Craigslist  Yes       YB     4/5
Xapian   GPL       C++       Many             Flexible       GMane       Yes       TB     3/5
RDBMS    BSD, GPL  C         -                Limited        -           Maybe     GB     4/5
Source: extended version of WWW2010 Tutorial, Open Source Tools, Drake & Jones, Yahoo!
Open Source Search Benchmark (1)
- [Middleton+Baeza-Yates 07]: A comparison of open source search engines
- Vik Singh / Yahoo!, weekend project: index 1M tweets
  - http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
  - Source code available at http://github.com/zooie/opensearch
Open Source Search Benchmark (2)
- Relevancy tested on the TREC 9 Filtering Track collection
- Judgment data for 63 query-like tasks
Lucene
- High-performance, scalable information retrieval (IR) library in Java
  - There are also PyLucene and CLucene
- Apache License
- Broad industrial support with proven scalability: Amazon, Netflix, Wikipedia
- Core API for full-text indexing and searching
- Plus plug-in modules:
  - Text analysis: text analyzers, tokenizers, token filters, stemmers, N-gram filters, shingle filters
  - Spell-checkers, result highlighting, “more like this”
  - Fuzzy queries, regex queries
  - Geo ranking
Lucene Indexing Example
Additional Indexing Features
- Documents can be
  - Updated and deleted
  - Boosted -> doc.setBoost(1.5F);
- Fields can be
  - Indexed: to search in
  - Stored: to show the original content (e.g., abstract)
  - Coded in term vectors: to enable “more like this”
  - Multivalued (e.g., authors field)
  - Boosted -> subjectField.setBoost(1.2F);
- There are built-in field types for numbers, dates, and time, to better support sorting and range search
Lucene Querying Example
- Figure labels: Simple Term Query, Query Parser
Additional Querying Features
- Query types: Boolean, Prefix, Phrase, Wildcard, Fuzzy, Fielded
- Scoring function
  - TF-IDF, weighted by term occurrences
  - Term and document boost
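To make the scoring bullet concrete, here is a minimal sketch of a TF-IDF score with a multiplicative document boost. It is not Lucene's actual scoring formula: the function name `tf_idf_scores`, the `log(N/df) + 1` IDF variant, and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(docs, query_terms, boosts=None):
    """Score every document for a query with a plain TF-IDF sum.

    docs: dict mapping doc id -> list of tokens
    boosts: optional dict mapping doc id -> multiplicative document boost
    """
    boosts = boosts or {}
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                idf = math.log(n_docs / df[term]) + 1.0
                score += tf[term] * idf
        # The document boost simply rescales the final score.
        scores[doc_id] = score * boosts.get(doc_id, 1.0)
    return scores


# Hypothetical toy collection.
docs = {
    "d1": ["open", "source", "search", "engine"],
    "d2": ["search", "search", "api"],
    "d3": ["music", "genre"],
}
scores = tf_idf_scores(docs, ["search", "api"], boosts={"d1": 1.5})
```

In this sketch d2 outranks d1 even though d1 is boosted, because d2 matches both query terms and the rarer term "api" carries a higher IDF.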
More Features
- Thread and multi-JVM safety
  - Any number of read-only IndexReaders may be open at once on a single index
  - Only a single writer may be open on an index at once
  - IndexReaders may be open even while an IndexWriter is making changes to the index
  - Any number of threads can share a single instance of IndexReader or IndexWriter -> both classes are thread safe, and they scale as threads are added
- Lucene implements the ACID transactional model
  - Only one transaction (writer) may be open at once
Why Open Search?
- Search as a software service: no need for in-house engine development
- Search as a commodity: internals are unknown, the features are taken off the shelf
- JavaScript: access to search features through client-side programming (no server needed at all)
- But ... you can search only for Web resources
Open Search APIs
- Google Ajax Search API: http://code.google.com/apis/ajaxsearch/
- Google Custom Search API: http://code.google.com/intl/en/apis/customsearch/
- Microsoft Bing API: http://www.bing.com/toolbox/developers/
- Yahoo! BOSS (Build your Own Search Service): http://developer.yahoo.com/search/boss/
Google Ajax Search API
- JavaScript widget
- REST API
- No limit on the number of queries
- 8 results per query
- No change in the result order
- Queries over Web, Local, Video, Images, Blog, Book, News
- Very limited customization of result presentation
Code snippets from the Google Ajax Search API documentation
Google Custom Search API
- Custom search engine for a Web site, blog, or a collection of Web sites
- Max 5000 sites
- On-demand, 24-hour Web indexing
- iFrame or Custom Search Element results for developers; XML for enterprise
- Few result personalization options
Microsoft Bing API
- REST APIs
- Queries over Ad, Image, News, Phonebook, Video, Web
- Unlimited traffic
- Results can be modified, but with some restrictions
  - You cannot re-rank results or merge them with non-Bing sources
Yahoo! BOSS (+ SearchMonkey)
- Unlimited queries
- Blend, re-order, discard
- Full presentation control
- Usage: http://boss.yahooapis.com/ysearch/{vert}/v1/{q}?appid={appid}&start=0&count=10&lang=en&format=xml&view=keyterms
- Verticals: Web, News, Images, Spelling
- In-query syntax: inurl, url, intitle, site, AND/OR, “-”, “+”
- Notable web view fields
  - Delicious bookmarks
  - SearchMonkey (microformats)
  - Larger abstracts
  - Extracted entities (keyterms)
Source: WWW 2010 Tutorial, Open Search Tools, Drake & Jones
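The usage template above can be filled in programmatically. The sketch below only builds the request URL from the template on the slide; the helper name `build_boss_url` and the application id `MY_APP_ID` are hypothetical, and no request is sent (the service may no longer be reachable).

```python
from urllib.parse import quote, urlencode

BOSS_BASE = "http://boss.yahooapis.com/ysearch"


def build_boss_url(vert, query, appid, start=0, count=10, lang="en",
                   fmt="xml", view="keyterms"):
    """Fill the {vert}, {q} and {appid} slots of the usage template."""
    params = urlencode({
        "appid": appid, "start": start, "count": count,
        "lang": lang, "format": fmt, "view": view,
    })
    # The query is part of the path, so it is percent-encoded as a whole.
    return f"{BOSS_BASE}/{vert}/v1/{quote(query, safe='')}?{params}"


# "web" is one of the verticals listed on the slide; the app id is a placeholder.
url = build_boss_url("web", "faceted search", "MY_APP_ID")
```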
Search Frameworks: State of the industry
Open Source Search Frameworks
SMILA: SeMantic Information Logistics Architecture
- http://www.eclipse.org/smila/
- Open source search framework based on SOA principles and standards (e.g., BPEL, SCA), dedicated to the access and integration of (unstructured) information
- Standard interfaces for the integration of the main components of a search application
- Set of out-of-the-box components included
  - Crawlers (Web, FS) and agents (e.g., RSS feeds)
  - Lucene/Solr indexer
- Interfaces for management, operation, and monitoring of the framework and its components
- Written in Java
- Based on OSGi (Eclipse Equinox)
- Cloud-ready
Data Model
- Record: representation of an information item, composed of a set of
  - Attributes: textual metadata (e.g., mime type)
  - Attachments: binary data (e.g., picture)
  - Annotations: can be associated with records, attributes, or attachments
- Attributes and attachments are usually produced during the discovery of data
- Annotations are usually produced during the indexing process
Chansonnier Data Model
- Record: a song
  - id: the download URI
- Attributes
  - Link, PageTitle, Description, Keywords, Title, Artists
  - Lyrics
  - Language, language confidence
  - Emotion, emotion confidence
- Attachments
  - Original videos
  - Extracted keyframes
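The record structure described above can be sketched as a small data class. This is not the SMILA API: the class `Record`, its field names, the `annotate` helper, and all sample values are hypothetical and only mirror the attributes/attachments/annotations split from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a record: id, attributes, attachments, annotations."""
    record_id: str
    attributes: dict = field(default_factory=dict)    # textual metadata
    attachments: dict = field(default_factory=dict)   # binary data
    annotations: dict = field(default_factory=dict)   # keyed by annotated target

    def annotate(self, target, name, value):
        """Attach an annotation to the record itself, an attribute, or an attachment."""
        self.annotations.setdefault(target, {})[name] = value


# A Chansonnier-style song record with made-up values.
song = Record(
    record_id="http://example.org/songs/42",  # the download URI acts as the id
    attributes={"Title": "Example Song", "Artists": ["Example Artist"],
                "Language": "en", "Emotion": "happy"},
    attachments={"keyframe-001": b"\x89PNG..."},
)
song.annotate("Lyrics", "source", "LyricWiki")
```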
SMILA Architecture
- 3 macro components: Connectivity, Search, Processing
  - Each one can run on a dedicated OSGi instance (distribution, replication)
  - Each one aggregates a set of OSGi bundles
- Set of data storages
  - Metadata
  - Binary data
  - Ontologies
  - Delta indexing
Processing Pipelines
- Orchestration performed through a BPEL engine (Apache ODE)
- Figure labels: Process invocation, Condition on a record attribute, Condition on an annotation value, Activity invocation
Chansonnier Activities
- Lyrics Wiki: decorates a song with its lyrics by querying the LyricWiki service (http://lyrics.wikia.com/)
- Google Translate: identifies the language of a song's lyrics (with a given confidence)
- Synesketch: analyzes the song's lyrics in order to infer their dominant emotion
- FFMPEG: extracts the keyframes from the song video
Distribution
Source: EclipseCon 2010, http://www.eclipsecon.org/2010/sessions/?page=sessions&id=1388
Content Analysis
- Figure labels: Media Item, Text Item, Transcoding, Media Analysis, Text Analysis, Media Artifact Generation, Media Annotation, Text Annotation
Text Processing
- Not all words are equally significant for representing the semantics of a document
  - Usually, nouns (or groups of nouns) are the most representative of a document's content
- Vocabulary: language used to describe documents and queries
- It is worthwhile to preprocess the text of the documents in the collection to determine the terms to be used as index terms
  - Index terms: subset of words selected to represent a document's content
Index Terms and Precision/Recall Trade-off
- Exhaustiveness: cover the whole document content -> assign a large number of terms to a document
- Specificity
  - Generic terms: low discriminative power, their frequency is high in all the documents (e.g., “and”, “or”, “of”)
  - Specific terms: higher discriminative power, variable document frequency -> their frequency indicates how representative they are of a document
- Recall is favored by
  - Terms with high frequency in the overall collection
  - Index expansion via associative techniques (thesauri, clustering)
- Precision is favored by terms with high frequency in just some documents
Text Analysis Process
- Document parsing
- Lexical analysis: manage digits, hyphens, punctuation marks, letter cases
- Elimination of stopwords (e.g., “and”, “or”, “of”)
- Thesaurus
- Phrases (noun groups)
- Stemming (reduction of a word to its grammatical root)
- Selection and weighting of index terms (nouns, adjectives, etc.)
- Pipeline in the figure: Document (structure, full text) -> Parsing -> Lexical analysis -> Stopword removal -> Phrases -> Stemming -> Weighting -> Indexing -> Index terms
Document Parsing
- What format: pdf/word/excel/html?
- What language?
- What character set?
- Problems:
  - The documents being indexed can come from many different languages
  - Sometimes a document or its components can contain multiple languages/formats (e.g., a French email with a Portuguese pdf attachment)
  - What is a unit document? (An email? With attachments? An email with a zip containing documents?)
Lexical Analysis
- Process that transforms an input character stream (the original document's text) into a flow of words (tokens)
- GOAL: identification of words in the text
- Example
  - Input: “Friends, Romans and Countrymen”
  - Output tokens: Friends, Romans, Countrymen
- Each such token is now a candidate for an index entry, after further processing
- But what are valid tokens to emit?
Tokenization
- Trivial case: recognition of blanks as word separators
- Other cases might need to be addressed:
  - Phrases
    - Finland’s capital -> Finland? Finlands? Finland’s?
    - Hewlett-Packard -> Hewlett and Packard as two tokens?
    - San Francisco: one token or two? How do you decide it is one token?
  - Language issues (normalization)
    - Accents: résumé vs. resume
    - L'ensemble -> one token or two? L? L’? Le?
    - How are your users likely to write their queries for these words? Use the locale?
  - Punctuation (e.g., U.S.A. vs. USA)
  - Numbers (100.45 vs. 100,45 vs. 1.0045 E+2)
  - Dates (e.g., March 1st 2009 vs. 03/01/09 vs. 1/03/2009)
  - Case folding
- It depends on the addressed language
  - E.g., in Chinese, spaces do not separate words (tokenization based on vocabulary)
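A minimal regex tokenizer makes some of these choices explicit. The sketch below is one possible set of decisions, not a reference implementation: it keeps hyphenated and apostrophized words as single tokens, keeps decimal numbers together, and splits on everything else; the pattern `TOKEN_RE` and the function `tokenize` are illustrative names.

```python
import re

# Letters (any script), optionally joined by hyphens or apostrophes,
# or digit groups joined by "." or ",".
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*|\d+(?:[.,]\d+)*")


def tokenize(text, lowercase=False):
    """Split text into tokens; optionally apply case folding."""
    tokens = TOKEN_RE.findall(text)
    return [t.lower() for t in tokens] if lowercase else tokens


tokens = tokenize("Friends, Romans and Countrymen")
```

Under these choices "Hewlett-Packard" stays one token, "San Francisco" becomes two, and "U.S.A." falls apart into three single-letter tokens, which shows why punctuation handling usually needs dedicated rules.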
Stopword Removal
- Removal of high-frequency words, which carry less information
- Strategies
  - Statistical analysis on the indexed collection
  - Functional terms (articles, conjunctions, auxiliary verbs)
  - A-priori knowledge, based on the IR system domain
- Creation of a “stop-list” with all the terms to remove
  - An English stop list is about 200-300 terms (e.g., “been”, “a”, “about”, “otherwise”, “the”)
  - http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
- Reduces the token count by about 30%-50% (smaller dictionary)
- It can decrease recall (e.g., “to be or not to be”, “let it be”)
- Most Web search engines do not remove stopwords [ManningIR]
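The recall problem is easy to reproduce with a toy stop list. In the sketch below the list `STOP_WORDS` is a tiny hypothetical sample, far smaller than a real 200-300 term list, and `remove_stopwords` is an illustrative helper.

```python
# Tiny illustrative stop list (an assumption, not a standard list).
STOP_WORDS = frozenset({
    "a", "about", "and", "be", "been", "it", "not", "of", "or",
    "otherwise", "the", "to",
})


def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


kept = remove_stopwords(["Friends", "Romans", "and", "Countrymen"])
hamlet = remove_stopwords("to be or not to be".split())
```

With this list the query "to be or not to be" is reduced to nothing at all, so no document can be matched on it.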
Phrases (noun groups)
- Phrases capture the meaning behind the bag of words and result in multi-term phrases
- Uses of phrases:
  - Added to the query: a query “New” “York” should be modified to search for “New York” -> gain of more than 10% in precision and recall
  - Replacing terms in the index: empirically considered not as good as query rewriting
Phrases (noun groups) - Strategies
- Simple phrases
  - Many systems identify phrases as any pair of terms not separated by a stop term, a punctuation mark, or a special character
  - Phrases occurring fewer than 25 times are removed (decrease in memory requirements)
- NLP
  - Part-of-speech and word sense tagging: statistical or rule-based methods to identify the part of speech (noun, verb, adjective) of each token
  - Syntactic parsing: identify the key syntactic components of a sentence, usually by tagging according to POS and then applying a grammar (FSA and NFSA)
- Thesauri
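The simple-phrase strategy can be sketched in a few lines: adjacent term pairs are counted unless a stop term or a punctuation mark separates them, and rare pairs are dropped. The function name `candidate_phrases`, the sample text, and the stop list are assumptions; the threshold is a parameter (the slide mentions 25 occurrences, the example uses 2 to fit a toy text).

```python
import re
from collections import Counter


def candidate_phrases(text, stop_words, min_count=25):
    """Count adjacent term pairs that are not separated by a stop term
    or by punctuation, then keep pairs seen at least min_count times."""
    counts = Counter()
    # Punctuation and special characters act as hard boundaries.
    for segment in re.split(r"[^\w\s]+", text.lower()):
        words = segment.split()
        for left, right in zip(words, words[1:]):
            if left not in stop_words and right not in stop_words:
                counts[(left, right)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


text = "New York is big. I love New York, and New York loves jazz."
phrases = candidate_phrases(text, stop_words={"is", "i", "and"}, min_count=2)
```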
Thesauri
- A thesaurus is a classification scheme composed of words and phrases whose organization aims at facilitating the expression of ideas in written text
  - E.g., synonyms and homonyms
- Example entry from Roget’s thesaurus:
  - cowardly (adjective): ignobly lacking in courage: cowardly turncoats. Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered
- A thesaurus can be
  - Thematic: specific to the IR system's domain of application (most frequent case), e.g., Thesaurus of Engineering and Scientific Terms
  - Generic
- A thesaurus can be used to
  - Help the user formulate queries
  - Let the system modify queries
  - Select index terms
Thesauri
- Many kinds of thesauri have been developed for IR systems
- Hierarchical: synonyms (RT -> related terms, UF -> use for), generalization (BT -> broader term), specialization (NT -> narrower term)
  - ISO and ANSI standards, almost always thematic
  - Manually built and updated by domain experts
- Clustered: clusters (or synsets) of words
  - Non-typed semantic relationships among clusters
  - Each cluster is a set of words having a strong semantic relationship (usually UF)
  - WordNet
  - Clustered thesauri can be automatically generated if no distinction is made among semantic relationships
- Associative: graph of words, where nodes represent words and edges represent semantic similarity among words
  - Edges can be oriented or not, according to the symmetry of the similarity relationship
  - Edges can be weighted (fuzzy pseudo-thesauri)
  - Can be automatically generated from a collection of documents using co-occurrence relationships
Stemming and Lemmatization
- Goals
  - Reduce terms to their “roots” before indexing
  - Reduce inflectional/variant forms to a base form (language dependent)
  - E.g., am, are, is -> be; car, cars, car's, cars' -> car
  - the boy's cars are different colors -> the boy car be different color
- Stemming: heuristic process that chops off the ends of words in the hope of achieving the goal correctly most of the time
  - Stemming collapses derivationally related words
- Lemmatization: NLP tool that uses dictionaries and morphological analysis of words in order to return the base or dictionary form of a word
  - Lemmatization collapses the different inflectional forms of a lemma
  - Not widely used because it hurts performance
Stemming
- Many different algorithms:
  - Porter’s algorithm: the most common algorithm for stemming English
    - Porter, Martin F. 1980. An algorithm for suffix stripping. Program 14:130–137.
    - http://www.tartarus.org/~martin/PorterStemmer/
  - One-pass Lovins stemmer
    - Lovins, Julie Beth. 1968. Development of a stemming algorithm. Translation and
  - Lancaster
    - http://www.comp.lancs.ac.uk/computing/research/stemming/
    - Paice, Chris D. 1990. Another stemmer. SIGIR Forum 24:56–61
  - Online demo: http://snowball.tartarus.org/demo.php
- Stemming increases recall while harming precision
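As a taste of how suffix stripping works, the sketch below implements only step 1a of Porter's algorithm (plural handling: SSES -> SS, IES -> I, SS -> SS, S -> nothing). It is a fragment for illustration, not the full stemmer, and the function name `porter_step_1a` is ours.

```python
def porter_step_1a(word):
    """Apply Porter's step 1a rules; the first matching suffix wins."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


stems = [porter_step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]]
```

Both "car" and "cars" map to "car", so a query for either form matches both (higher recall); a stem such as "poni" is not a dictionary word, which is acceptable because stems are only used as index keys.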
Stemming Example
Tools for text analysis (1)
- Lucene and Solr
  - Contain many text analyzers working on several languages
  - http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
  - CharFilters, Tokenizers, Token Analyzers
- Apache Tika: http://tika.apache.org/
  - Toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries
- GATE (General Architecture for Text Engineering): http://gate.ac.uk/
  - ANNIE (A Nearly-New Information Extraction System): tokenizer, gazetteer, sentence splitter, part-of-speech tagger, named entities transducer, coreference tagger
  - Support for English, Spanish, Chinese, Arabic, French, German, Hindi, Italian, Cebuano, Romanian, Russian
- MALLET (Machine Learning for Language Toolkit): http://mallet.cs.umass.edu/index.php
  - Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text
Tools for text analysis (2)
- OpenNLP: http://opennlp.sourceforge.net/projects.html
  - Open source projects related to natural language processing
- Cognitive Computation Group, University of Illinois: http://l2r.cs.uiuc.edu/~cogcomp/software.php
  - Chunker, part-of-speech tagger, string similarity, semantic role labeler, named entity extractor, etc.
- Supersense Tagger: http://medialab.di.unipi.it/wiki/SuperSense_Tagger
  - Tool for assigning to each noun, verb, adjective, and adverb of a sentence one of the 45 standard WordNet supersenses
- WordNet Domains: http://wndomains.fbk.eu/hierarchy.html
- Synesketch: http://www.synesketch.krcadinac.com/
  - Open source textual emotion recognition
Multimedia Content Analysis
- Computers are not able to grasp the underlying meaning of multimedia content: annotation is needed
- Manual annotation
  - Expensive
    - It can take up to 10x the duration of the video
    - Problems in scaling to millions of content items
  - Incomplete or inaccurate
    - People might not be able to holistically capture all the meanings associated with a multimedia object
  - Difficult
    - Some content is tedious to describe with words, e.g., a melody without lyrics
- Automatic annotation
  - Reasonably good quality: some technologies have ~90% precision
  - “Low” cost
Audio Segmentation
- GOAL: split an audio track according to the contained information
  - Music
  - Speech
  - Noise
  - ...
- Additional usage: identification and removal of ads
Video Segmentation
- Keyframe segmentation: segment a video track according to its keyframes
  - Fixed-length temporal segments
- Shot detection: automated detection of transitions between shots
  - A shot is a series of consecutive pictures taken contiguously by a single camera and representing a continuous action in time and space
Credits: Thorsten Hermes@SSMT2006
Speech Analysis
- Speaker identification: identify the people participating in a discussion
- Additional usage: vocal command execution
- Speech to text: automatically recognize spoken words belonging to an open dictionary
Classification of Music Genre
- GOAL: automatically classify the genre and mood of a song
  - Rock, Pop, Jazz, Blues, etc.
  - Happy, aggressive, sad, melancholic, etc.
- Additional usage: automatic selection of songs for playlist composition
- Tutorial from the PHAROS Summer School: http://www.pharos-audiovisual-search.eu/res/files/SummerSchool/Programme_Summer_School_file.zip
Images: Low-level features
- GOAL: extract implicit characteristics of a picture
  - Luminosity
  - Orientations
  - Textures
  - Color distribution
Face Identification and Recognition
- GOAL: recognize and identify faces in an image
- Usage examples:
  - People counting
  - Security applications
Credits: Thorsten Hermes@SSMT2006
Image Concept Detection
- GOAL: recognize the context/concepts of an image
  - E.g., playground, seaside, road, ...
- Extraction of low-level features from raw data
  - Color histograms, color correlograms, color moments, co-occurrence texture matrices, edge direction histograms, etc.
- Features can be used to build discrete classifiers, which may associate semantic concepts to images or regions thereof
  - The MediaMill semantic search engine defines 491 semantic concepts
  - http://www.science.uva.nl/research/mediamill/demo
- Concepts can also be detected from text (e.g., from manual or automatic metadata) using NLP techniques
Image Object Identification
- GOAL: identify objects appearing in a picture
  - Basketball, cars, planes, players, etc.
Tools for media analysis (1)
- OpenCV: http://opencv.willowgarage.com/wiki/
  - Framework for image analysis
- Octave: http://www.gnu.org/software/octave/
  - High-level language, primarily intended for numerical computations; largely compatible with Matlab
- Marsyas (Music Analysis, Retrieval and Synthesis for Audio Signals): http://marsyas.sness.net/
  - Framework for music analysis and retrieval
Tools for media analysis (2)
- TINA (TINA Is No Acronym): http://www.tina-vision.net/
  - Open source environment developed to accelerate the process of image analysis research
- Sphinx: http://cmusphinx.sourceforge.net/sphinx4/
  - Speech recognition system written entirely in Java
- WEKA: http://www.cs.waikato.ac.nz/ml/weka/
  - A collection of machine learning algorithms for data mining
Validation
Disclaimer
- This section is inspired by the WWW2010 tutorial by Dasdan, Tsioutsiouliklis, and Velipasaoglu: Web Search Engine Metrics for Measuring User Satisfaction
- http://analytics.ncsu.edu/reports/wsmt.pdf
Measures for IR Systems
- Measurable properties
  - How fast does it process (index) documents?
    - Number of documents/hour
    - Average document size
  - How fast does it search?
    - Latency as a function of index size
  - Expressiveness of the query language
    - Speed on complex queries
- The key measure: user happiness
  - What is this?
  - Speed of response and size of index are factors
  - But blindingly fast, useless answers won't make a user happy
- How do we quantify user happiness?
Measuring User Happiness
- Who is the user we are trying to make happy? It depends on the setting
- Web engine: users find what they want and return to the engine
  - Can measure the rate of returning users
- eCommerce site: users find what they want and make a purchase
  - Is it the end-user, or the eCommerce site, whose happiness we measure?
  - Measure time to purchase, or the fraction of searchers who become buyers?
- Enterprise (company/govt/academic): care about “user productivity”
  - How much time do my users save when looking for information?
  - Many other criteria having to do with breadth of access, secure access, ...
Evaluation measures
- Relevance: of search results
- Coverage: presence of content of interest in a catalog
- Diversity: of the result set
- Discovery and latency
  - How many new resources (in the collection) are in the catalog?
  - How long did it take to get the new resources into the catalog?
- Time to first click
- Freshness
Relevance as a measure of user happiness
- How do you measure relevance?
- In order to assess the performance of an IR system you need a test collection composed of:
  - A benchmark document collection
  - A benchmark suite of queries
  - A binary assessment of either Relevant or Irrelevant for each query-doc pair (gold standard, or ground truth)
- The test collection must be of a reasonable size
  - Performance needs to be averaged, since results vary widely over different documents and information needs
Evaluating Relevance
- Set-based evaluation
- Rank-based evaluation with explicit judgment
  - Absolute judgment
  - Preference judgment
- Rank-based evaluation with implicit judgment
  - Direct and indirect evaluation by clicks
- Model-based evaluation
  - Browsing models
  - User satisfaction
- Callout on the slide: NOT COVERED HERE
Information Need Translation
- Relevance is assessed relative to the information need, not to the query
- E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
  - Query: wine red white heart attack effective
- A document is relevant if it addresses the stated information need, not just because it contains all the words in the query
Set-based evaluation
- The two most frequent and basic measures of IR effectiveness are precision and recall
- Precision: fraction of retrieved docs that are relevant, P(relevant|retrieved)
  - Provides a measure of the “degree of soundness” of the system
  - It does not take into account the total number of relevant documents in the collection
- Recall: fraction of relevant docs that are retrieved, P(retrieved|relevant)
  - Provides a measure of the “degree of completeness” of the system
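Both measures reduce to set arithmetic over document ids. The following is a minimal sketch; the function name `precision_recall` and the document ids are illustrative, and empty sets are mapped to 0.0 by convention here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given sets of doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant docs that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 docs retrieved, 5 relevant docs exist, 2 docs in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d3", "d7", "d8", "d9"])
```

Here precision is 2/4 = 0.5 and recall is 2/5 = 0.4.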
Precision / Recall
- You can get high recall (but low precision) by retrieving all docs for all queries!
- Recall is a non-decreasing function of the number of docs retrieved
- In a good system, precision usually decreases as more docs are retrieved
- Precision can be computed at different levels of recall
  - Perhaps most appropriate for Web search: all people want are good matches on the first one or two results pages
- Precision-oriented users: Web surfers
- Recall-oriented users: professional searchers, paralegals, intelligence analysts
F-Measure
- Combined measure that assesses the tradeoff between precision and recall (weighted harmonic mean)
  - Values of β < 1 emphasize precision
  - Values of β > 1 emphasize recall
- People usually use the balanced F1 measure, i.e., with β = 1 or α = ½
- The harmonic mean is a conservative average [C. J. van Rijsbergen, Information Retrieval]
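The standard textbook definition of the weighted harmonic mean is F = 1 / (α/P + (1 − α)/R) = (β² + 1)·P·R / (β²·P + R), with β² = (1 − α)/α. A minimal sketch follows; the function name `f_measure` and the convention of returning 0.0 when the denominator vanishes are our assumptions.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F_beta)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        return 0.0  # convention used in this sketch
    return (b2 + 1.0) * precision * recall / denominator


f1 = f_measure(0.5, 0.4)                       # balanced F1
f_precision_heavy = f_measure(0.5, 0.4, beta=0.5)
f_recall_heavy = f_measure(0.5, 0.4, beta=2.0)
```

With P = 0.5 and R = 0.4, F1 = 0.4/0.9 ≈ 0.444, which is below the arithmetic mean 0.45: this is the sense in which the harmonic mean is conservative. Since precision is the larger value here, β = 0.5 yields a higher score than β = 2.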
Difficulties in using precision/recall
- Need to average over a large corpus and query set
- Need human relevance assessments
  - People aren't reliable assessors
- Assessments have to be binary
  - Nuanced assessments?
- Heavily skewed by corpus/authorship
  - Results may not translate from one domain to another
- The relevance of one document is treated as independent of the relevance of other documents
  - This is also an assumption in most retrieval systems
Rank-based evaluation. In ranked retrieval systems, P and R are values relative to a rank position. Evaluation is performed by computing precision as a function of recall. The function is computed at each rank position at which a relevant document has been retrieved. The resulting values are interpolated, yielding a precision/recall plot.
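The procedure described above can be sketched as follows. The slide does not say which interpolation is used; this example assumes the common 11-point scheme, where the interpolated precision at a recall level is the highest precision observed at any recall greater than or equal to that level. The names `pr_curve` and `interpolate` and the document ids are hypothetical.

```python
def pr_curve(ranking, relevant):
    """(recall, precision) at each rank where a relevant doc is retrieved."""
    relevant = set(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolate(points, levels=11):
    """Interpolated precision at evenly spaced recall levels 0.0 .. 1.0."""
    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        # Best precision reachable at this recall level or beyond.
        candidates = [p for r, p in points if r >= level]
        curve.append((level, max(candidates, default=0.0)))
    return curve


ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5"}
for level, precision in interpolate(pr_curve(ranking, relevant)):
    print(f"recall>={level:.1f}  precision={precision:.2f}")
```

Plotting the pairs returned by `interpolate` gives the precision/recall plot; the interpolation removes the saw-tooth shape of the raw points.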
Measures for rank-based evaluation. Mean average precision (MAP): a measure of quality at all recall levels. Precision at K (P@K): not all queries will have more than K relevant results, so even a perfect system may have a score less than 1.0 for some queries. R-Precision [Allan 2005]: uses a variable result-set cut-off for each query, based on the number of its relevant results. Mean Reciprocal Rank (MRR) [Voorhees 1999]: the reciprocal of the rank of the first relevant result, averaged over a population of queries.
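A minimal sketch of the four measures listed above, assuming binary relevance judgments given as sets of document ids. All function names and the two example queries are hypothetical; average precision is computed with the usual definition (mean of the precision values at the ranks of the relevant documents, with unretrieved relevant documents counting as zero).

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Mean of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranking, relevant):
    """Precision at cut-off R, where R is the number of relevant docs."""
    return precision_at_k(ranking, relevant, len(relevant)) if relevant else 0.0


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Two hypothetical queries: (ranked results, set of relevant docs).
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y", "z"], {"y"}),
]
print("MAP:", mean(average_precision(rk, rel) for rk, rel in runs))
print("MRR:", mean(reciprocal_rank(rk, rel) for rk, rel in runs))
print("P@2:", [precision_at_k(rk, rel, 2) for rk, rel in runs])
print("R-precision:", [r_precision(rk, rel) for rk, rel in runs])
```

The second query has a single relevant document, so its P@2 cannot exceed 0.5 even for a perfect ranking. R-precision avoids this by adapting the cut-off to each query.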
Discounted Cumulative Gain (DCG) [Järvelin and Kekäläinen 2002]. The gain is adjustable for the importance of different relevance grades for user satisfaction. Example gains: very useful = 3, somewhat useful = 1, not useful = 0. Discounting is desirable for web ranking: most users don't browse deep, and search engines truncate the list of results returned. DCG yields unbounded scores. For each query, divide the DCG by the best attainable DCG for that query to obtain the Normalized Discounted Cumulative Gain (nDCG).
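The slide's formula is not preserved in the transcript. The sketch below assumes one widely used variant, DCG = Σ gain_i / log2(i + 1) over rank positions i starting at 1; the original paper by Järvelin and Kekäläinen defines the discount slightly differently, so treat the exact discount as an assumption. The function names and the optional `judged_gains` parameter are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a ranked list of gain values."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains, judged_gains=None):
    """DCG divided by the best attainable DCG for the query.

    judged_gains: gains of all judged docs for the query; if omitted,
    the ideal ranking is built from the returned list only.
    """
    pool = gains if judged_gains is None else judged_gains
    ideal = dcg(sorted(pool, reverse=True)[: len(gains)])
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Gains from the slide's example scale: very useful=3, somewhat useful=1, not useful=0.
ranked_gains = [3, 0, 1]
print(dcg(ranked_gains))   # 3/log2(2) + 0/log2(3) + 1/log2(4) = 3.5
print(ndcg(ranked_gains))  # below 1.0 because the useful doc at rank 3 should be at rank 2
```

Normalizing by the ideal ordering maps scores into [0, 1], which makes them comparable across queries with different numbers of useful results.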
Preference Judgment. Kendall tau coefficient: based on counts of preferences (A: preferences in agreement; D: preferences in disagreement); range in [-1, 1]; robust for incomplete judgments. Binary Preference (bpref) [Buckley and Voorhees (2004)]: designed for incomplete judgments; R is the number of relevant results for the query, and N_r is the number of non-relevant docs ranked above relevant doc r, counted among the first R non-relevant docs. Generalized to graded judgments by [De Beer and Moens (2006)].
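The formulas for these two measures were images and are missing from the transcript. The sketch below assumes the usual forms: tau = (A − D) / (A + D), and bpref = (1/R) · Σ_r (1 − N_r / R) as in the original Buckley and Voorhees definition, using the symbols from the slide's legend. Function names and example data are hypothetical; unjudged documents are simply skipped.

```python
def kendall_tau(ranking, preferences):
    """(A - D) / (A + D) over judged preference pairs (better, worse)."""
    position = {doc: i for i, doc in enumerate(ranking)}
    agree = disagree = 0
    for better, worse in preferences:
        if better not in position or worse not in position:
            continue  # incomplete judgments: ignore pairs not fully ranked
        if position[better] < position[worse]:
            agree += 1
        else:
            disagree += 1
    total = agree + disagree
    return (agree - disagree) / total if total else 0.0


def bpref(ranking, relevant, nonrelevant):
    """Binary preference; documents in neither set are unjudged and ignored."""
    R = len(relevant)
    if R == 0:
        return 0.0
    nonrel_seen, score = 0, 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            # N_r is capped at R: only the first R non-relevant docs count.
            score += 1.0 - min(nonrel_seen, R) / R
    return score / R


print(kendall_tau(["a", "b", "c"], [("a", "b"), ("c", "b"), ("a", "c")]))
print(bpref(["n1", "r1", "u1", "r2", "n2", "r3"],
            relevant={"r1", "r2", "r3"}, nonrelevant={"n1", "n2"}))
```

In the bpref example, `u1` is unjudged and therefore does not penalize the relevant documents ranked below it, which is what makes the measure usable with incomplete judgments.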
Presentation Metrics. How to present information? Which information should be displayed, and where? Which presentation elements should be used (fonts, colors, design elements, interaction design)? Generalization. How to measure success? User studies (on-line, on-home, usability, eye tracking, focus groups, surveys); log analysis; editorial evaluation (comparative, perceived vs. actual).
Not all results are likely to be reviewed. (Source: iprospect.com, WhitePaper_2006_SearchEngineUserBehavior.pdf)
Clicks and views depend on rank. [Joachims et al., 2005]
Eye Tracking Studies
Heat Maps: the Golden Triangle. The first result is always considered more trusted and more relevant by default. The user spends less time reading the lower part of the page. [Marti A. Hearst, Search User Interfaces, Cambridge University Press, 2009]
Thank you for your attention! Questions?
Alessandro Bozzon, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy. [email_address] http://home.dei.polimi.it/bozzon
Marco Brambilla, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy. [email_address] http://home.dei.polimi.it/mbrambil
http://www.search-computing.org/book
References - Books
Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman Publishing Co. Inc., 2010.
[ManningIR] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
D.A. Grossman, O. Frieder. Information Retrieval: Algorithms and Heuristics. Springer, 2004.
I.H. Witten, A. Moffat, T.C. Bell. Managing Gigabytes. Morgan Kaufmann, 1999.
S. Chakrabarti. Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann, 2002.
Marti A. Hearst. Search User Interfaces. Cambridge University Press, 2009.
Stefano Ceri, Marco Brambilla (eds.). Search Computing: Challenges and Directions. Springer LNCS, vol. 5950, 2010.
References - Tutorial
Ali Dasdan, Kostas Tsioutsiouliklis, Emre Velipasaoglu (Yahoo!). Web Search Engine Metrics: Direct Metrics to Measure User Satisfaction. WWW 2010.
Eugene Agichtein (Emory University). Recent Progress on Inferring Web Searcher Intent. WWW 2010.
Rosie Jones, Ted Drake (Yahoo!). Applications of Open Search Tools. WWW 2010.
[BAEZASeco2010] Ricardo Baeza-Yates. New Frontiers for Search. WWW 2010.
Ricardo Baeza-Yates, Rosie Jones (Yahoo!). Web Mining for Search. SIGIR 2008.
References - Papers
[Ramakrishnan and Tomkins 2007] Raghu Ramakrishnan, Andrew Tomkins. Toward a PeopleWeb. IEEE Computer 40(8): 63-72, 2007.
[Broder2002] A. Broder. A taxonomy of web search. SIGIR Forum, 36(2): 3-10, 2002.
[BATES2002] Bates, Marcia J. Toward an integrated model for information seeking and searching. In: The Fourth International Conference on Information Needs, Seeking and Use in Different Contexts, 2002.
[FU2007] Fu, Wai-Tat; Pirolli, Peter. SNIF-ACT: a cognitive model of user navigation on the world wide web. Human-Computer Interaction: 335-412, 2007.
[Withrow2002] Jason Withrow. Do your links stink? American Society for Information Science Bulletin, June 1, 2002.
[Pirolli2009] Pirolli, Peter. An elementary social information foraging model. Proceedings of the 27th international conference on Human factors in computing systems: 605-614, 2009.
[D. Rose, 2008]
[BATES1989] M.J. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5): 407-431, 1989.
[Teevan et al., CHI 2004] Teevan, J., Alvarado, C., Ackerman, M. and Karger, D. The Perfect Search Engine Is Not Enough: A Study of Orienteering Behavior in Directed Search. Proceedings of ACM CHI 2004, pp. 415-422.
[MARCHIONINI2006] Marchionini, G. Exploratory search: from finding to understanding. Communications of the ACM 49(4): 41-46, 2006.
[WHITE2007] White, R. W., and Drucker, S. M. Investigating behavioral variability in web search. 16th WWW Conf. (Banff, Canada, 2007).
[AULA2008] Aula, A., and Russell, D.M. Complex and Exploratory Web Search. ISSS: Information Seeking Support Systems Workshop (Chapel Hill, June 2008).
References - Papers
[BozzonEtAL2010] Alessandro Bozzon, Marco Brambilla, Piero Fraternali, Stefano Ceri. Liquid Query: multi-domain exploratory search on the Web. WWW 2010, Raleigh, USA.
[FAGIN1999] R. Fagin. Combining fuzzy information from multiple systems. J. Comput. Syst. Sci., 58(1): 83-99, 1999.
[ILYAS1999] F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid. Rank-aware query optimization. In SIGMOD Conference, pages 203-214, 2004.
[MARTINENGHI2010] D. Martinenghi and M. Tagliasacchi. Proximity Rank Join. To appear in PVLDB.
[Carbonell and Goldstein 1998] J. Goldstein and J. Carbonell (1998). Summarization: Using MMR for Diversity-based Reranking. SIGIR'98.
[BozzonEtAl2007] Alessandro Bozzon et al. Role Based Access Control for the interaction with Search Engines. International Workshop on Collaborative Open Environments for Project-Centered Learning (COOPER) 2007, Crete, Greece.
[BozzonEtAl2009] Alessandro Bozzon, Marco Brambilla, Piero Fraternali. Conceptual Modeling of Multimedia Search Applications using Rich Process Models. ICWE 2009, June 24-26, 2009, San Sebastian, Spain.
[BozzonThesis2009] Alessandro Bozzon. Model-driven development of Search Based Web Applications. Ph.D. Thesis, Politecnico di Milano, April 2009.
[BragaEtAl2010] D. Braga, S. Ceri, F. Corcoglioniti, M. Grossniklaus, and S. Vadacca. Panta Rhei: An Execution Model for Queries over Web Information Sources. http://www.search-computing.it/sites/cms.web.seco/files/pantarhei2010.pdf
References - Papers
[Allan 2005] J. Allan (2005). HARD track overview in TREC 2005: High accuracy retrieval from documents.
[Voorhees 1999] E.M. Voorhees (1999). TREC-8 question answering track report.
[Järvelin and Kekäläinen 2002] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. IS, 20(4): 422-446, 2002.
[Buckley and Voorhees (2004)] C. Buckley and E.M. Voorhees. Retrieval evaluation with incomplete information. SIGIR'04.
[De Beer and Moens (2006)] De Beer, Jan; Moens, Marie-Francine. Rpref: a generalization of Bpref towards graded relevance judgments. SIGIR 2006, Seattle, USA, 6-11 August 2006, pages 637-638, ACM.
References - Links
Search Computing Course Lecture Notes: http://www.search-computing.it/course
Fabio Aiolli, Università di Padova: http://www.math.unipd.it/~aiolli/corsi/0809/IR/IR.html
http://www.ir.disco.unimib.it/

  • 8. Some numbers … Web Estimated size: ~ 60 billion pages – 22/06/2010 http://www.worldwidewebsize.com/ > 9.3 billion queries … just in the U.S. … in May 2010 http://blog.nielsen.com/nielsenwire/online_mobile/top-u-s-search-sites-for-may-2010/ … and growing Twitter # of new tweets per day: 55 million # of search queries per day: 600 million Facebook 400 Million Global Users (and growing) The average Facebook User Spends 55 Minutes Per Day © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 INTRODUCTION //
  • 9. … more numbers … IDC Digital Universe report estimates: digital data grew by 62% between 2008 and 2009 ~ 800,000 petabytes (PB) >1.2 million PB in 2010 reach 35 ZB (zetabytes) by 2020. © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 INTRODUCTION // [Ramakrishnan and Tomkins 2007]
  • 10. Information Retrieval Information retrieval (IR) deals with the representation, storage, organization of, and access to information items. “ Old” discipline As an academic field of study: Information retrieval (IR) is devoted to finding relevant documents , not finding simple match to patterns. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfy an information need from within large collections (usually stored on computers). [Manning et al., 2007] © 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION // July 5, 2010
  • 11. Information Retrieval Applications Search (‘ad hoc’ retrieval) Static document collection Dynamic queries July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION // Filtering Queries are static Document collection constantly changing Example: corporate mails routed by predefined queries to different parts of the organizations Static Document Collection Ranked Result Ad-Hoc query Document Routing System Predetermined queries or User profiles Incoming Documents
  • 12. The nature of information retrieval … retrieving all objects which might be useful or relevant to the user information need Usually unstructured queries (no formal semantics) The IR system ‘interpret’ the contents of the information items Examples: keyword-based queries, context queries, proximity, phrases, natural language queries… Also structural queries and, in recent systems, structured query languages are supported (but with a different semantics) Errors in the results are tolerated Core concept: relevance Relevance Ranking (according to the user need) It is not clear what “degree of relevance” the user is happy with The user starts from the top of the ranked list and explore down satisfied July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION //
  • 13. Information Retrieval is NOT Data Retrieval Data Retrieval (RDBMS, XML DB) … retrieving all objects which satisfy clearly defined conditions expressed trough a query language. Data has a well defined structure and semantics Formal query languages Regular expression, relation algebra expression, etc. Results are EXACT matches  errors are not tolerated No ranking w.r.t. the user information need Binary retrieval: does not allow the user to control the magnitude of the output For a given query, the system may return: Under-dimensioned output Over-dimensioned output July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION //
  • 14. The Information Retrieval Process July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION // Content Management Query analysis Query Interaction Generic search-oriented application B A C K E N D F R O N T E N D q’ q r r’ Search Result Composition Result Manipulation
  • 15. Search Engine vs. Search Application Search Engine data management system which uses information retrieval algorithms to retrieve information items from one or more sources upon the submission of a query Web Search Application data management system where search engines are a piece of a more complex puzzle, that includes: data source integration (e.g. databases, legacy systems, the Web) content analysis technologies orchestration user interfaces Web-mediated social interactions, etc. July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION //
  • 16. Characterization of the user information need It is not a simple problem: “ Blurred” goals Sensory Gap Gap between the object in the world and the information in a (computational) description Semantic Gap Lack of coincidence between the (computational) description of the information and their interpretation July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION //
  • 17. Evaluating an IR System Precision: fraction of retrieved docs that are relevant P(relevant|retrieved) “ degree of soundness” of the system not considering the total number of documents Recall: fraction of relevant docs that are retrieved P(retrieved|relevant) “ degree of completeness” of the system July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION //
  • 18. Enterprise search Public Web search engines are the ones known to the general public But there is also a huge need (and market share!) for professional search over enterprise repositories Enterprise search is covered by Packaged suites Microsoft FAST Autonomy IDOL IBM OmniFind Exalead Frameworks Apache UIMA (ex IBM) Smila Solr July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION //
  • 19. Case Studies Textual Search YaGoBi Multi-media Search The PHAROS Project Multi-domain Search The Search Computing project Example of Web Search Application Chansonnier © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 CASE STUDIES //
  • 20. YaGoBi THE Web Search 92% of market share in the U.S. Searching on Web pages, Blog, News, Books, Scientific Publications, Emails Images and Videos (but only trough textual descriptions ) Tweets … July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla CASE STUDIES //
  • 21. The PHAROS Project FP6 IP, 3Years, 12 Partners, ~15 M€ budget Mission : Develop SOA-compliant, open and distributed technology platform for development of information access solutions for audio visual content www.pharos-audiovisual-search.eu © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 CASE STUDIES //
  • 22. The Search Computing Project European Research Council (ERC), 2008 Call for &quot;IDEAS Advanced Grants”, 5y (started in 2009) Mission : provide the abstractions, foundations, methods, and tools required to answer multi-domain queries by interacting with a constellation of cooperating search services, using ranking and joining of results as the dominant factors for service composition www.search-computing.org © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 CASE STUDIES //
  • 23. Chansonnier BsC Thesis project Mission : graduate  Open source video analysis application based on open frameworks (SMILA / SOLR) Crawling of Web video Download of song lyrics Analysis on lyrics text Language, emotion Keyframe extraction for video snippets http://github.com/giorgiosironi/Chansonnier © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 CASE STUDIES //
  • 24. Requirements © 2010 Alessandro Bozzon, Marco Brambilla
  • 25. Key Requirements and Design Dimensions for Web Search © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS // Data Source User Behavior Query Format User Interface Security Data Analysis Performance Data Format Social Interactions Search Engine
  • 26. Data Sources Web Databases File systems Intranet / Extranets Legacy systems Users Sensors (in wide sense) and streams © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS //
  • 27. Data Type Unstructured data Textual Documents Blog Posts (Semi) Structured data Software Code Models XML Files Media Pictures Video Music © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS //
  • 28. Textual Analysis Deals with basic language units (morphemes, roots, stems, words, phrases, sentences, etc.) Media Analysis Deals with media contents Transcoding Classification Feature Extraction Data Analysis July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // An activity performed at the purpose of providing a representation of a content item suited for the application
  • 29. Search Engine _1 Textual Textual contents represented as collection of unstructured text terms Fielded Textual contents structured in fields (e.g., metadata) Semi-structured Textual contents organized in complex (possibly heterogeneous) structure (e.g., XML, HTML) © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS //
  • 30. Search Engine _2 Content-based Media contents described by low-level features Geographic and other special dimensions Content featuring geo-spatial features Streaming content searched by temporal features (e.g., recency) © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS //
  • 31. Query Format Representation of the user information need Natural Language For instance trough vocal interfaces Keyword Set of text items, plus Boolean (AND/OR/NOT), proximity ( lexical nearness) and/or wildcard conditions Fielded Keyword Text items defined on one or more fields Queries to semi-structured search-engines and Faceted queries Content-based Query by example (text, image, video, audio, etc.) Geographic and other special dimensions Geographic coordinates plus spatial operator terms ( near, north of, within X kilometers from, etc.) Timestamps plus temporal operator terms (recent, near, interval, etc.) © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS //
  • 32. YaGoBi Data Sources Web : crawling of Web resources Users : comments, preferences, relationships Data Types Unstructured data : Web pages Documents : PDF, PPT, DOC, etc. Data Analysis Textual : for content, document, and user generated comments Media : some basic image analysis for color, faces, size Search Engine Fielded: filetype, page title, site, page content Content-based: image similarity in Google Query Format: Fielded keyword Geographic July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //
  • 33. PHAROS Data Sources Web : crawling of audio/video files File System : NAS and content provider media archives Users : comments, preferences, relationships Data Types Structured data : content provider description metadata Media : hi-quality video and audio files Semi-structured data : MPEG-7 description of processed media files and user annotations Data Analysis Textual : for content metadata and user generated comments Media : for audio and video Audio/Video Mood classification, Image concept classification, Music Genre, Danceability classification, face recognition and identification, speech to text July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //
  • 34. PHAROS Search Engine Semi-structured : XML search engine for MPEG-7 content description Plus geographic annotations and geo-based ranking 3 content-based engines : one CB for music, one for images (shots of the video) one for face similarity Query Format Fielded-keyword : XQuery for XML search engine Query by example : for image, music and faces MPQF: high level query language AND/OR/AND THEN for fielded keyword and by-example queries July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //
  • 35. Query Federation in PHAROS. [Figure: a multimodal query (JPG image; Long/Lat 52.37N 4.89E; XPath \\where[contains(“amsterdam”)] and \\topic[contains(“building”)]; keywords “amsterdam”) goes through query analysis and federation, which dispatch it to four engines: Geo search (R-tree index), Text search (inverted index), XML search (semantic index), Image search (similarity index).]
  • 36. User Behavior. Search is evolving: Content vs. Intent. People don’t want to search; people want to get tasks done and get answers. Moving towards identifying a user’s task and enabling means for task completion: Search as a Process. Search applications must support the user in the search process and (try to) infer the user intent to help them accomplish their task. [Figure: search as a process from Start to End for the need “I am craving for a good Wiener Schnitzel and a Sachertorte in Vienna”, with steps Search, Menu, Reviews, Map.] Source: Ricardo Baeza-Yates, Next Generation Search, 2nd SeCo Workshop, Milan, 24/06/2010.
  • 37. Information Seeking [Bates, 2002] July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // Bates, Marcia J. 2002. Toward an integrated model for information seeking and searching. In: The Fourth International Conference on Information Needs, Seeking and Use in Different Contexts.
  • 38. Information Foraging. Information foraging applies ideas from optimal foraging theory to understand how human users search for information. Assumption: humans use "built-in" foraging mechanisms that evolved to help our animal ancestors find food. Some references: Fu, Wai-Tat; Pirolli, Peter (2007), "SNIF-ACT: a cognitive model of user navigation on the world wide web", Human-Computer Interaction: 335–412. Jason Withrow, "Do your links stink?", American Society for Information Science Bulletin, June 1, 2002. Pirolli, Peter (2009), "An elementary social information foraging model", Proceedings of the 27th international conference on Human factors in computing systems: 605–614.
  • 39. Moving between patches Patches of information = websites Problem: should I continue foraging in the current patch or look for another patch?  Expected gain from continuing in current patch vs. moving to another © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // July 5, 2010
  • 40. Information seeking funnel [D. Rose, 2008]. Wandering: the user does not have an information-seeking goal in mind. Exploring: the user has a general goal but not a plan for how to achieve it. Seeking: the user has started to identify information needs that must be satisfied, but the needs are open-ended. Asking: the user has a very specific information need that corresponds to a closed-class question.
  • 41. Berrypicking vs. Orienteering vs. Teleporting ... Berrypicking: information needs change during interactions. M.J. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407–431, 1989. Orienteering [Teevan et al., CHI 2004]: the searcher issues a quick, imprecise query to get to approximately the right region of the information space, and then follows known paths in small steps that move closer to the goal. Easy! (“perfect” query not needed). Teleporting: expert searchers issue longer queries to jump directly to the target. Requires more effort and experience.
  • 42. … vs. exploratory search Exploratory Search: user’s intent is primarily to learn more on a topic of interest, by exploring various directions and sources “… exploratory search blends querying and browsing strategies” and is different “from retrieval that is best served by analytical strategies…” Marchionini, G. Exploratory search: from finding to understanding. Communications ACM 49(4): 41-46 (2006) © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS // Some references Definition and analysis of the problem White, R. W., and Drucker, S. M. Investigating behavioral variability in web search. 16th WWW Conf. (Banff, Canada, 2007) Complex Search and Exploratory Search Aula, A., and Russell, D.M. Complex and Exploratory Web Search. ISSS: Information Seeking Support Systems Workshop (Chapel Hill, June 2008)
  • 43. Multi-domain Exploratory Search “… search for upcoming concerts close to an attractive location (like a beach, lake, mountain, natural park, and so on), considering also availability of good , close-by hotels ” Current approach the user can adopt: Independently explore search services Manually combine findings July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //
  • 44. Multi-domain Exploratory Search “… expand the search to get information about available restaurants near the candidate concert locations, news associated to the event and possible options to combine further events scheduled in the same days and located in a close-by place with respect to the first one…” July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //
  • 45. Existing Approaches _1 Topic based search : instance of exploratory search centered on the goal of collecting information on a subject matter of interest from multiple sources July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // Kosmix : topic discovery engine, keyword search, a topic page summarizes the most relevant information on the subject Hakia : resume pages for topics associated with user’s queries, natural language processing techniques
  • 46. Existing Approaches _2 Structured Object Search : process queries and present results that address entities or real world objects described in Web pages July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // Google Squared: keyword search, results collected in a table (called a square) featuring all the attributes relevant to the result items as columns headers Google Fusion Tables: upload data tables (e.g., spreadsheet files) and join (or “fuse”) the data in some column with other tables
  • 47. The note-taking limit. There is a limit after which the found options need to be marked down. [Aula and Russell, 2008]
  • 48. Liquid Queries “ A new paradigm allowing users to formulate and get responses to multi-domain queries through an exploratory information seeking approach, based upon structured information sources exposed as software services…” Composite answers obtained by aggregating search results from various domains Highlight the contribution of each search service Join of results based on the structural information afforded by the search service interfaces Refine the user query Re-shape the result list July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // Alessandro Bozzon, Marco Brambilla, Piero Fraternali, Stefano Ceri. Liquid Query: multi-domain exploratory search on the Web . WWW 2010, Raleigh, USA
  • 49. Liquid Queries Definition _1. Template-based approach: it consists of subsetting and parametrizing the resource graph... [Figure: a resource graph with nodes Concert, Artist, Exhibition, Restaurant, Hotel, Movie, Metro Station, Theatre, Photo, Landmark, News is reduced to a template over Photo, Concert, Metro Station, Restaurant, News, Exhibition, Artist, Hotel. Legend: inputs, outputs; GR = global ranking.]
  • 50. Liquid Queries Definition _2 And then characterizing the user interaction Plus: Parametrization of global ranking Data visualization options .. and so on July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // Photo Concert Metro Station Restaurant News Exhibition Artist Hotel Expand
  • 51. Result Exploration Support If the current set of combinations is not satisfactory, the user may ask for more values for a service (more one) or for all services (more all) More concerts, more hotels, or more combinations Add new information about further domains for selected combinations (expand) Find close-by restaurants or co-located events Aggregate information to ease analysis and readability (clustering, grouping) Group events by venue Reduce the number of shown items through filtering Total walked distance for the night Re-order (ranking or sorting) Calculate derived values from existing ones Total walked distance for the night Alternative data visualization Map, parallel coordinates, … July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // DEMO : http://demo.search-computing.org
  • 52. User Intent. Understand the user information need. User intent taxonomy (Broder, 2002): Informational: want to learn about something (~40% / 65%). Navigational: want to go to a given page (~25% / 15%). Transactional: want to do something (web-mediated) (~35% / 20%). Grey areas: find a good hub; exploratory search. Example queries shown on the slide: History nyonya food; Singapore Airlines; Jakarta Weather; Nikon Finepix; Car Rental Kuala Lumpur. [from SIGIR 2008 Tutorial, Baeza-Yates and Jones]
  • 53. Contextual Content Delivery Context Vs. Personalization Trigger the right search depending on the context Task Location User Engagement Not interested in your personal profile Your favorite restaurant? It depends on where you are! July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // from Ricardo Baeza-Yates, Next Generation Search , 2 nd Search Computing Workshop, Milan, 24/06/2010 Demo: http://sandbox.yahoo.com/Motif
  • 54. Relevance: the Top-k problem. Relevance of the results with respect to the request is the main expectation of search engine users. Top-k relevant items: quickly retrieve a number (k) of highest-ranking tuples in the presence of monotone ranking functions defined on the attributes of the underlying relations. Some references: R. Fagin. Combining fuzzy information from multiple systems. J. Comput. Syst. Sci., 58(1):83–99, 1999. I. F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid. Rank-aware query optimization. In SIGMOD Conference, pages 203–214, 2004. D. Martinenghi and M. Tagliasacchi: Proximity Rank Join, to appear in PVLDB.
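A top-k computation over several ranked sources can be sketched with a threshold-style algorithm, in the spirit of the literature on combining ranked lists. This is only an illustrative sketch, not the method of any paper cited on the slide: it assumes the monotone ranking function is a plain sum of non-negative scores, that an object missing from a source scores 0 there, and that each source supports both sorted and random access. The function name `top_k_threshold` and the sample scores are hypothetical.

```python
import heapq


def top_k_threshold(sources, k):
    """Return the k (object, score) pairs with the highest summed score.

    sources: list of dicts {object_id: non-negative score}.
    """
    # Sorted access: view each source as a list ordered by descending score.
    ranked = [sorted(s.items(), key=lambda kv: -kv[1]) for s in sources]
    seen = {}  # object_id -> aggregated score (filled via random access)
    max_depth = max(len(r) for r in ranked)

    for depth in range(max_depth):
        last_scores = []
        for r in ranked:
            if depth >= len(r):
                last_scores.append(0.0)
                continue
            obj, score = r[depth]
            last_scores.append(score)
            if obj not in seen:
                # Random access: look the object up in every source.
                seen[obj] = sum(s.get(obj, 0.0) for s in sources)

        # Upper bound on the score of any object not seen so far.
        threshold = sum(last_scores)
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best  # nothing unseen can still enter the top k

    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Hypothetical scores from two ranked services.
hotel_scores = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}
restaurant_scores = {"a": 0.2, "b": 0.7, "c": 0.9, "d": 0.1}
result = top_k_threshold([hotel_scores, restaurant_scores], k=2)
print([obj for obj, _ in result])
```

In this run the loop stops after three rounds of sorted access, and object "d" is never looked up: the threshold guarantees it cannot beat the current top two.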
  • 55. Result Diversification Relevance is not the only success factor for a result set User satisfaction is increased if the first items cover a good spectrum of options If user intent is ambiguous , diversification tries to cover the most likely intents If several top-k items are very similar , they can be clustered together Thus: an optimization problem Objective: find the set of k elements that contains the most relevant and diverse items Maximal Marginal Relevance [Carbonell and Goldstein 1998] July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // Relevance Diversity
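The trade-off between relevance and diversity can be illustrated with a greedy selection in the style of Maximal Marginal Relevance: at each step, pick the candidate maximizing λ·relevance minus (1 − λ)·(maximum similarity to the items already selected). The sketch below is a minimal illustration; the relevance scores, the word-overlap (Jaccard) similarity, and the names `mmr` and `jaccard` are assumptions made for the example.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (illustrative choice)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


def mmr(relevance, similarity, k, lam=0.7):
    """Greedy MMR-style selection of k items.

    relevance: dict item -> relevance score
    similarity: function (item, item) -> value in [0, 1]
    lam: 1.0 means pure relevance, lower values favor diversity
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:

        def score(item):
            # Redundancy is the similarity to the closest selected item.
            redundancy = max(
                (similarity(item, s) for s in selected), default=0.0
            )
            return lam * relevance[item] - (1 - lam) * redundancy

        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical results for the ambiguous query "jaguar".
relevance = {
    "jaguar car review": 0.9,
    "jaguar car price": 0.85,
    "jaguar animal facts": 0.6,
}
print(mmr(relevance, jaccard, k=2, lam=1.0))
print(mmr(relevance, jaccard, k=2, lam=0.5))
```

With `lam=1.0` the two car results are returned; with `lam=0.5` the second slot goes to the less relevant but more diverse "animal" result, covering another likely intent.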
  • 56. User Interface More Complete information on one search July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // Shortcuts Deep Links Enhanced Results
  • 57. User Interface July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //
  • 58. User Interface Optimization of the result set layout (and of page space) July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //
  • 59. User Interface Optimization of the result set layout (and of page space) July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //
  • 60. User Interface Optimization of the result set layout (and of page space) July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //
  • 61. User Interface Optimization of the result set layout (and of page space) July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //
  • 62. Performance. Users don’t want to waste time waiting for search results (user satisfaction). Performance is the leading factor in the evaluation of Web search applications: queries per second (QPS), time to index. Scalability: content, queries. Distribution: service-oriented computing, content delivery networks; but intellectual property may be a concern. More in section (ARCHITECTURE).
  • 63. Other Requirements. Social Interaction: content evaluation; user relationships and actions as additional content description. Security & Privacy: access policies (collection vs. item level); anonymity: Who I am = What I like + What I do + Where I am? A search process tells a lot about who is doing it. Alessandro Bozzon, Tereza Iofciu, Wolfgang Nejdl, Antonio V. Taddeo, Sascha Tönnies, Role Based Access Control for the interaction with Search Engines, (COOPER) 2007, Crete, Greece.
  • 64. Design © 2010 Alessandro Bozzon, Marco Brambilla
  • 65. Designing Web Search Applications Reference architecture Reference execution processes Set of design dimensions Development methodology Tools supporting the methodology July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN //
  • 66. Search Applications from 1000 feet © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 DESIGN //
  • 67. Bird's-eye view of Search Applications
  • 68. Search Application Processes July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN //
  • 69. An example of Indexing Process July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN //
  • 70. Pharos: the architecture July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN //
  • 71. Search Computing: the architecture July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN // Main Query flow <Uses> relation
  • 72. Search Computing: the architecture July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN // High level query “ Where can I attend a DB scientific conference close to a beautiful beach reachable with cheap flights?” Sub query 1 “ Where can I attend a DB scientific conference?” Sub query 2 “ place close to a beautiful beach?” Sub query 3 “ place reachable with cheap flight?”
  • 73. Search Computing: the architecture July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN // Low level query 1 ConfSearch(“DB”,placeX,dateY) Low level query 2 TourSearch(“Beach”,PlaceX) Low level query 3 Flight(“cost<200”,PlaceX,DateY)
  • 74. Search Computing: the architecture July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN // Services invocations and operators execution Presented results ESWC-Crete-Olympic CAISE- Hammamet – Alitalia TOOLS-Malaga-EasyJet Query plan Results
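The decomposition into low-level queries joined on PlaceX and DateY can be sketched as a small pipe join: each conference returned by the first service binds the place and date used to invoke the other two. This is only an illustrative sketch, not the Search Computing engine: the three functions are in-memory stubs standing in for the services, and the dates ("d1", "d2", ...), costs, and the extra non-matching rows are invented sample data.

```python
# Invented sample data standing in for the three services.
CONFERENCES = [
    {"conf": "ESWC", "place": "Crete", "date": "d1"},
    {"conf": "CAISE", "place": "Hammamet", "date": "d2"},
    {"conf": "TOOLS", "place": "Malaga", "date": "d3"},
    {"conf": "OTHER", "place": "Inland City", "date": "d4"},
]
BEACH_PLACES = {"Crete", "Hammamet", "Malaga"}
FLIGHTS = [
    {"carrier": "Olympic", "place": "Crete", "date": "d1", "cost": 150},
    {"carrier": "Alitalia", "place": "Hammamet", "date": "d2", "cost": 180},
    {"carrier": "EasyJet", "place": "Malaga", "date": "d3", "cost": 90},
    {"carrier": "Pricey Air", "place": "Malaga", "date": "d3", "cost": 400},
]


def conf_search(topic):
    """Stub for ConfSearch: outputs place and date for each conference."""
    return list(CONFERENCES)  # the topic is ignored by this stub


def tour_search(feature, place):
    """Stub for TourSearch: does the place offer the requested feature?"""
    return feature == "Beach" and place in BEACH_PLACES


def flight_search(max_cost, place, date):
    """Stub for Flight: flights to the place on the date under max_cost."""
    return [
        f for f in FLIGHTS
        if f["place"] == place and f["date"] == date and f["cost"] < max_cost
    ]


def run_plan(topic, feature, max_cost):
    """Pipe join: conference outputs feed the other two invocations."""
    results = []
    for conf in conf_search(topic):
        if not tour_search(feature, conf["place"]):
            continue
        for flight in flight_search(max_cost, conf["place"], conf["date"]):
            results.append((conf["conf"], conf["place"], flight["carrier"]))
    return results


print(run_plan("DB", "Beach", 200))
```

With the sample data, the plan yields the three combinations listed on the slide (ESWC, Crete, Olympic; CAISE, Hammamet, Alitalia; TOOLS, Malaga, EasyJet).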
  • 75. Design Dimensions
    Design Dimension | Affected Process | Values
    Retrieval Policy | Indexing | Push; Pull
    Data Homogeneity | Indexing | Homogeneity; Heterogeneity
    Data Analysis | Indexing | Mono Annotation, Multi Annotation; Mono Modal, Multi Modal
    Search Technology | Indexing, Query and Result Presentation | Search Engine(s) Type: Homogeneity, Heterogeneity
    Query Format | Query and Result Presentation, User Interface | Query Type: Mono Modal, Multi Modal; Mono Domain, Multi Domain
    User Interaction | User Interface | Direct, Indirect; Active, Passive
  • 76. Designing Web Search Applications - A MDD approach Alessandro Bozzon, Marco Brambilla, Piero Fraternali. Conceptual Modeling of Multimedia Search Applications using Rich Process Models . ICWE 2009, June 24-26, 2009, San Sebastian, Spain July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN // Clear separation of concerns among the involved actors Central roles of models as key development artifacts Automatic code generation, etc.
  • 77. Development Methodology Process Models E.g.: BPMN Domain data and process metadata E.g.: ER/UML July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN // Model To Model Transformation E.g.: Java / XSLT / ATL Application Models DSL, e.g. WebML Model To Code Transformation Running Application
  • 78. An example domain model Content Analysis / ER Content : the objects that relate to the Content Items indexed by a search application Annotation : structure of the annotations associated with searchable Content Items during the indexing process Usage : usage groups of the application (RBAC model) Index : abstraction for the actual physical implementation of search engine indexes July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN //
  • 79. An example process model. Content Analysis / BPMN - WebML. Coarse indexing process model: Content Registration, Content Analysis, Content Indexation. Fine-grained process model: analysis of audiovisual content through face recognition and identification technologies. Application model: Face Recognition and Segmentation activity. Running CPA process: console trace of the working annotation technology; process advancement control UI. Transitions: Refinement, M2M Transformation, M2T Transformation.
  • 80. An Example of Complex Process July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN // Analysis of audiovisual content Incremental analysis of audio-visual content with textual annotations
  • 81. Modeling User Interface The information seeking interaction modes (Searching, Browsing, Monitoring, Being aware, Social interactions) Distilled 30+ information seeking user interaction patterns Query execution and result presentation Keyword (Faceted, Similarity, Geo) search specification and refinement... Browsing, content organization, content-based awareness, etc. Relationship setting, recommendation, etc. UI designed as assembly of standard interaction patterns expressed in WebML July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN // Alessandro Bozzon, Model-driven development of Search Based Web Applications, Ph.D Thesis, Politecnico di Milano, April 2009.
  • 82. Pattern Example: Faceted Search July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN //
  • 83. Pattern Example: Faceted Search July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN //
  • 84. Pharos: Modeling User Interface July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN // http://www.youtube.com/watch?v=ZpxyNi6Ht50
  • 85. Pharos: Modeling User Interface July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN // http://www.youtube.com/watch?v=ZpxyNi6Ht50 KEYWORD REFINEMENT FACETED REFINEMENT CONTENT-BASED REFINEMENT RESULT PRESENTATION
  • 86. An Example of M2M Transformation: BPMN* → WebML
  • 87. MDD in Search Computing 4 artifact models Search Service, Query, Query Parameters, Result A query plan model For the runtime query transformation July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN //
  • 88. Search Computing Model Example Search Service Model ServiceMart abstraction (e.g., Hotel) of one or more Web service implementations (e.g., Bookings and Expedia) possibly ranked and chunked into page Attribute Atomic or Composite AccessPattern specifies RankingType and AttributeDirection (I/O) ConnectionPattern is defined as an input-output relationship between pairs of service marts (for joining them) the output city of Concert used as input for Hotel. ServiceInterface physical interface of the service Exact or Search (ranked) details about chunk size, cost July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN //
  • 89. Search Computing Query Meta-model. LogicalQuery: a conjunctive query over services; it can be defined at an abstract level (AccessPatternLevelQuery) or at physical level (InterfaceLevelQuery). QueryClause: a LogicalQuery is composed of a set of QueryClauses; a QueryClause can refer to the service mart level or to the Service Interface level. Several types: InvocationClauses, PredicateClauses, JoinClauses, RankingClauses.
  • 90. Search Computing Model Transformations Vertical transformations for Queries and ServiceMarts QueryToPlan transformation Query Execution transformation (at runtime) Result transformation (at runtime) July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN // 1 1 2 4 3 Prototype: http://dbgroup.como.polimi.it/brambilla/SeCoMDA
  • 91. Search Computing DSLs (& Transformations): Panta Rhei describes both the execution flow and the data flow between nodes of a query plan. Several types of nodes exist service invocators, sorting, join, and chunk operators, clocks (defining the frequency of invocations), caches, and others. The query result model is constructed stepwise, following the execution flow July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla DESIGN // D. Braga, S. Ceri, F. Corcoglioniti,M. Grossniklaus, and S. Vadacca: Panta Rhei: An Execution Model for Queries over Web Information Sources, http://www.search-computing.it/sites/cms.web.seco/files/pantarhei2010.pdf
  • 92. Implementation © 2010 Alessandro Bozzon, Marco Brambilla
  • 93. From the models to implementation. Once the design phase is completed: IMPLEMENTATION TIME. Never implement a search engine/app from scratch!! Start from your requirements and design and: identify possible existing solutions (REUSE); select the best fitting wrt your needs (SHOPPING); implement what you need (DEPLOY vs. CONFIGURE). We will see: open source (products) vs. open search (services). A full-fledged model-driven approach can be devised: model-to-code transformations that generate the code for the pieces of Web search applications that you need and the configuration for the tools of choice.
  • 94. Search Framework Vs. Search Engine. Search Engines “provide an interface to a group of items that enables users to specify criteria about an item of interest and have the engine find the matching items. The criteria are referred to as a search query” (Wikipedia). Search Frameworks: software components that target a set (possibly exhaustive) of the architectural layers of a Search Application, e.g., crawling + analysis + indexing/querying.
  • 95. Open Source Search Vs Open Search. Open Source → build your own engine. Open Search → exploit commercial engines. Source: www2010 Tutorial Open Source Tools, Drake & Jones, Yahoo!
  • 96. Open Source Search: high level comparison (extended version of www2010 Tutorial Open Source Tools, Drake & Jones, Yahoo!)
    Product | License | Lang. | Docs | Ranking | Users | Parallel | Scale | Support
    Lucene | Apache | Java/C++ | Several | Flexible | Amazon | Yes | TB | 5/5
    Zettair | BSD Like | C | HTML, TREC, TXT | Flexible | Research | No | TB | 1/5
    Indri | BSD Like | C++ | Many | Very Flexible | Research | Yes | TB | 1.5/5
    Sphinx | GPL | C++ | Many | Flexible | Craigslist | Yes | YB | 4/5
    Xapian | GPL | C++ | Many | Flexible | GMane | Yes | TB | 3/5
    RDBMS | BSD, GPL | C | remaining cells as extracted: Limited, Maybe, GB, 4/5
  • 97. Open Source Search Benchmark _1. [Middleton+Baeza-Yates 07]: A comparison of open source search engines. http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/ Vik Singh / Yahoo, weekend project: index 1M tweets. Source code available at http://github.com/zooie/opensearch
  • 98. Open Source Search Benchmark _2 Relevancy tested on TREC 9 – Filtering Track collection Judgment data for 63 query-like tasks July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 99. Lucene. High-performance, scalable information retrieval (IR) library in Java; there are also PyLucene and CLucene. Apache License. Broad industrial support with proven scalability: Amazon, Netflix, Wikipedia. Core API for full-text indexing and searching, plus plug-in modules. Text analysis: text analyzer, tokenizer, token filter, stemmer, N-gram filters, shingle filters; spell-checkers, result highlighting, “more like this”; fuzzy queries, regex queries; geo ranking.
  • 100. Lucene Indexing Example July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 101. Additional Indexing Features Documents can be Updated and Deleted Boosted  doc.setBoost(1.5F); Fields can be Indexed - to search in Stored - to show the original content (e.g., abstract ) coded in term vectors - to enable more like this Multivalued (e.g., authors field) Boosted  subjectField.setBoost(1.2F); There are built-in field types for numbers, dates , and time , to better support sorting or range search July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 102. Lucene Querying Example July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // Simple Term Query Query Parser
  • 103. Additional Querying Features Boolean Prefix Phrase Wildcard Fuzzy Scoring function Fielded TF-IDF, weighted by term occurrences Term and document boost July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
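The idea of TF-IDF scoring with a document boost can be sketched in a few lines. This is a generic textbook-style illustration under simplifying assumptions (whitespace tokenization, raw term frequency, idf = log(N/df), one multiplicative boost per document); it is not Lucene's actual scoring formula, and the function name `tf_idf_scores` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs, boosts=None):
    """Score each document against the query with a plain TF-IDF sum.

    docs: dict doc_id -> text
    boosts: optional dict doc_id -> multiplicative document boost
    """
    boosts = boosts or {}
    n_docs = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = sum(
            tf[t] * math.log(n_docs / df[t])
            for t in query_terms
            if df[t] > 0
        )
        scores[doc_id] = score * boosts.get(doc_id, 1.0)
    return scores


docs = {
    "d1": "lucene search library",
    "d2": "search engine search",
    "d3": "web crawler",
}
plain = tf_idf_scores(["search"], docs)
boosted = tf_idf_scores(["search"], docs, boosts={"d1": 3.0})
print(sorted(plain, key=plain.get, reverse=True))
print(sorted(boosted, key=boosted.get, reverse=True))
```

Without boosts, "d2" ranks first because the query term occurs twice in it; boosting "d1" moves it to the top, which is the effect a document boost is meant to have.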
  • 104. More Features. Thread and multi-JVM safety: any number of read-only IndexReaders may be open at once on a single index; only a single writer may be open on an index at once; IndexReaders may be open even while an IndexWriter is making changes to the index; any number of threads can share a single instance of IndexReader or IndexWriter → not only thread safe, but it also scales. Lucene implements the ACID transactional model: only one transaction (writer) may be open at once.
  • 105. Why Open Search? Search as a software service No need of in-house engine development Search as a commodity Internals are unknown, the features are taken off the shelf Javascript Access to search features through client-side programming (no server needed at all) But … you can search only for Web resources July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 106. Open Search APIs Google Ajax Search API http://code.google.com/apis/ajaxsearch/ Google Custom Search API http://code.google.com/intl/en/apis/customsearch/ Microsoft Bing API http://www.bing.com/toolbox/developers/ Yahoo Boss (Build your Own Search Service) http://developer.yahoo.com/search/boss/ July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // API v. 2
  • 107. Google Ajax Search API Javascript Widget REST API No limitations on the number of queries 8 results per query No change in the result order Query Web, Local, Video, Images, Blog, Book, News Very limited customization of result presentation July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // Code Snippets from Google Ajax Search API Documentation
  • 108. Google Custom Search API Custom search engine for a Web site, blog, or a collection of Web sites Max 5000 sites On-demand 24 hour Web Indexing iFrame or Custom Search Element results for developers; XML for enterprise Few result personalization options July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 109. Microsoft BING API REST APIs Query Ad, Image, News, Phonebook, Video, Web Unlimited traffic Results can be modified, but with some restrictions You cannot re-rank or merge non-Bing sources July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 110. Yahoo! Boss (+ Search Monkey). Unlimited queries. Blend, re-order, discard. Full presentation control. Usage: http://boss.yahooapis.com/ysearch/{vert}/v1/{q}?appid={appid}&start=0&count=10&lang=en&format=xml&view=keyterms Verticals: Web, News, Images, Spelling. In-query syntax: inurl, url, intitle, site, AND/OR, “-”, “+”. Notable web view fields: Delicious bookmarks, SearchMonkey (microformats), larger abstracts, extracted entities (keyterms). Source: WWW 2010 Tutorial Open Search Tools, Drake & Jones.
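Filling in the usage template above is mostly a matter of URL encoding. The helper below is a hypothetical sketch that only builds the request string and performs no network call; the function name `boss_url`, its parameter names, and the default `base` host are assumptions made for illustration, so adjust them to whatever endpoint is actually available.

```python
from urllib.parse import quote, urlencode


def boss_url(vertical, query, appid, start=0, count=10, lang="en",
             fmt="xml", view="keyterms",
             base="http://boss.yahooapis.com"):  # assumed host
    """Build a request URL following the {vert}/v1/{q}?appid=... template."""
    params = urlencode({
        "appid": appid,
        "start": start,
        "count": count,
        "lang": lang,
        "format": fmt,
        "view": view,
    })
    # The query is part of the path, so it is percent-encoded with quote().
    return f"{base}/ysearch/{vertical}/v1/{quote(query)}?{params}"


url = boss_url("web", "wiener schnitzel", "MY_APP_ID")
print(url)
```

Keeping the query in the path and the options in the query string mirrors the template on the slide; paging through results would only change `start`.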
  • 111. Search Frameworks – State of the industry © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 IMPLEMENTATION //
  • 112. Open Source Search Frameworks © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 IMPLEMENTATION //
  • 113. SMILA SeMantic Information Logistics Architecture http://www.eclipse.org/smila/ Open Source Search Framework based on SOA principles and standards (e.g. BPEL, SCA) dedicated to the access and integration of (unstructured) information Standard interfaces for the integration of the main components of a Search application Set of out-of the box components included Crawlers (Web, FS) and agents (e.g. RSS feeds) Lucene/Solr indexer interfaces for management, operation and monitoring of the framework and its components Written in Java Based on OSGi (Eclipse Equinox) Cloud-ready July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 114. Data Model Record Representation of an information item Composed of a set of Attributes : textual metadata (e.g., mime type) Attachments : binary data (e.g., picture) Annotations : associated both to records, attributes or attachments Attributes and attachments are usually produced during the discovery of data Annotations are usually produced during the indexing process July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 115. Chansonnier Data Model Record : a song id: the download URI Attributes Link, PageTitle, Description, Keywords, Title, Artists Lyrics Language, language confidence Emotion, emotion confidence Attachments Original videos Extracted keyframes July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
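The record structure described in the two slides above (attributes, attachments, annotations) can be mirrored in a small data class. This is a Python illustration of the data model only, not SMILA's Java API; the class layout, the `annotate` helper, the sample URI and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """An information item: textual attributes, binary attachments,
    and annotations attached to the record, an attribute or an attachment."""
    id: str
    attributes: dict = field(default_factory=dict)   # name -> textual value
    attachments: dict = field(default_factory=dict)  # name -> bytes
    annotations: dict = field(default_factory=dict)  # target -> list of dicts

    def annotate(self, target, name, value):
        """Attach an annotation to a target ('record' or a field name)."""
        self.annotations.setdefault(target, []).append({name: value})


# A Chansonnier-like song record; attributes and attachments would
# typically be filled during discovery, annotations during indexing.
song = Record(id="http://example.org/videos/song.mp4")
song.attributes["Title"] = "Example Song"
song.attributes["Artists"] = ["Example Artist"]
song.attachments["keyframe-001"] = b"\x89PNG..."
song.annotate("Lyrics", "source", "LyricWiki")

print(song.attributes["Title"], song.annotations)
```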
  • 116. SMILA Architecture 3 Macro components Each one can run on a dedicated OSGi instance Distribution, replication Each one aggregates a set of OSGi bundles Set of data storages Metadata Binary data Ontologies Delta Indexing July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // CONNECTIVITY SEARCH PROCESSING
  • 117. Processing Pipelines Orchestration performed through BPEL Engine (Apache ODE) July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // Process Invocation Condition on a record attribute Condition on an annotation value Activity Invocation
  • 118. Chansonnier Activities Lyrics Wiki To decorate a song with its lyrics by querying the LyricWiki service (http://lyrics.wikia.com/). Google Translate To identify the language of a song’s lyric (with a given confidence) Synesketch To analyze the song’s lyric in order to infer the dominant emotion in it FFMPEG To extract the keyframes from the song video July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 119. Distribution July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // EclipseCON 2010: http://www.eclipsecon.org/2010/sessions/?page=sessions&id=1388
  • 120. Content Analysis July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // Text Annotation Media Annotation Transcoding Media Artifact Generation Media Analysis Media Analysis Text Analysis Text Analysis Media Artifact Generation Media Item Text Item
  • 121. Text Processing Not all words are equally significant for representing the semantics of a document usually, noun words (or groups of noun words) are the most representative of a document content Vocabulary : language used to describe documents and queries Worthwhile to preprocess the text of the documents in the collection to determine the terms to be used as index terms Subset of words selected to represent a document’s content July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 122. Index Terms and Precision/Recall Trade-off. Exhaustiveness: cover the whole document content, i.e., assign a large number of terms to a document. Specificity: generic terms have low discriminative power, since their frequency is high in all the documents (e.g., “and”, “or”, “of”); specific terms have higher discriminative power and a variable document frequency, so their frequency denotes how representative they are of a document. Recall: terms with high frequency in the overall collection; index expansion via associative techniques (thesauri, clustering). Precision: terms with high frequency in just some documents.
  • 123. Text Analysis Process. Document parsing. Lexical analysis: manage digits, hyphens, punctuation marks, letter case. Elimination of stopwords (e.g., “and”, “or”, “of”). Thesaurus. Phrases (noun groups). Stemming (reduction of a word to its grammatical root). Selection and weighting of index terms (nouns, adjectives, etc.). [Figure: pipeline with stages Document Parsing, Lexical Analysis, Stopwords Removal, Phrases, Stemming, Weighting, Indexing; other labels: Structure, Full text, Index Terms]
  • 124. Document Parsing. What format: pdf/word/excel/html? What language? What character set? Problems: the documents being indexed can come from many different languages. Sometimes a document or its components can contain multiple languages/formats (e.g., a French email with a Portuguese pdf attachment). What is a unit document? (An email? With attachments? An email with a zip containing documents?)
  • 125. Lexical Analysis Process that transforms an input character stream (the original document’s text) into a flow of words ( tokens ) GOAL: identification of words in the text Example Input: “ Friends, Romans and Countrymen” Output: Tokens Friends Romans Countrymen Each such token is now a candidate for an index entry, after further processing But what are valid tokens to emit? July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
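The lexical analysis step above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's implementation: the function name `tokenize` and the choice of the `\w+` pattern are assumptions made for the example.

```python
import re

def tokenize(text):
    """Turn a character stream into a list of word tokens."""
    # \w+ matches runs of letters, digits and underscores;
    # blanks and punctuation act as separators and are dropped.
    return re.findall(r"\w+", text)

print(tokenize("Friends, Romans and Countrymen"))
# ['Friends', 'Romans', 'and', 'Countrymen']
```

Note that "and" is still emitted here; in the slide's output it has already been dropped, which is the job of the later stopword removal step.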
  • 126. Tokenization. Trivial case: recognition of blanks as word separators. Other cases that might need to be addressed: Phrases. Finland’s capital -> Finland? Finlands? Finland’s? Hewlett-Packard -> Hewlett and Packard as two tokens? San Francisco: one token or two? How do you decide it is one token? Language issues (normalization). Accents: résumé vs. resume. L'ensemble -> one token or two? L? L’? Le? How are your users likely to write their queries for these words? Use the locale? Punctuation (e.g., U.S.A. vs. USA). Numbers (100.45 vs. 100,45 vs. 1.0045E+2). Dates (e.g., March 1st 2009 vs. 03/01/09 vs. 1/03/2009). Case folding. The right choice depends on the language addressed: e.g., in Chinese, spaces do not separate words (tokenization is based on a vocabulary).
  • 127. Stopword Removal. Removal of high-frequency words, which carry less information. Strategies: statistical analysis of the indexed collection; functional terms (articles, conjunctions, auxiliary verbs); a-priori knowledge, based on the IR system domain. Creation of a “stop-list” with all the terms to remove. An English stop list has about 200-300 terms (e.g., “been”, “a”, “about”, “otherwise”, “the”): http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words Removal cuts 30% to 50% of the tokens (smaller dictionary). It can decrease recall (e.g., “to be or not to be”, “let it be”). Most Web search engines do not remove stopwords [ManningIR].
  • 128. Phrases (noun groups). Phrases capture the meaning behind the bag of words and result in multi-term index entries. Uses of phrases: Added to the query: a query “New” “York” should be modified to search for “New York”, with a gain of more than 10% in precision and recall. Replacing terms in the index: empirically considered not as good as query rewriting.
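Two of the normalization choices listed above, accent removal and case folding, can be sketched with Python's standard library. This is only one possible policy; the function name `normalize_token` is hypothetical, and whether to fold accents at all depends on how users are expected to type their queries.

```python
import unicodedata

def normalize_token(token):
    """Fold accents and letter case so that 'Résumé' and 'resume' match."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive, language-neutral lower().
    return stripped.casefold()

print(normalize_token("Résumé"))  # resume
```

Punctuation, numbers and dates are deliberately left untouched here, since stripping dots would turn 100.45 into 10045.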
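A stop-list filter is short enough to sketch directly. The tiny `STOP_WORDS` set below is an illustrative assumption, not a real 200-300 term English stop list.

```python
# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"a", "about", "and", "be", "been", "it", "not", "of", "or", "the", "to"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stopwords(["Friends", "Romans", "and", "Countrymen"]))
# ['Friends', 'Romans', 'Countrymen']
print(remove_stopwords("to be or not to be".split()))
# []
```

The second call shows the recall problem mentioned above: the whole query disappears, so no document can match it.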
  • 129. Phrases (noun groups) - Strategies Simple Phrases Many systems identify phrases as any pairs of terms not separated by: stop term punctuation mark special character Phrases occurring fewer than 25 times are removed (decrease in memory requirements) NLP Part Of Speech and Word Sense tagging statistical or rule-based methods to identify the part of speech (noun, verb, adjective) of each token Syntactic parsing Identify the key syntactic components of a sentence usually by tagging according to POS and then applying a grammar (FSA and NFSA) Thesauri July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
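The "simple phrases" strategy above can be sketched as follows: adjacent word pairs are counted unless a stop term, punctuation mark or special character separates them, and rare pairs are discarded. The function name, the stop-term set and the default threshold are assumptions for the example (the slide's threshold is 25 occurrences on a real collection).

```python
import re
from collections import Counter

def candidate_phrases(text, stop_terms, min_count=25):
    """Count adjacent term pairs not separated by stop terms or punctuation."""
    counts = Counter()
    # Punctuation and special characters split the text into segments,
    # so no pair can span them.
    for segment in re.split(r"[^\w\s]+", text.lower()):
        words = segment.split()
        for first, second in zip(words, words[1:]):
            # A stop term acts as a separator, so it never belongs to a pair.
            if first not in stop_terms and second not in stop_terms:
                counts[(first, second)] += 1
    # Drop rare pairs to reduce memory requirements.
    return {pair: n for pair, n in counts.items() if n >= min_count}

text = "New York is big. I love New York, and New York loves me."
print(candidate_phrases(text, {"is", "i", "and", "me"}, min_count=2))
# {('new', 'york'): 3}
```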
  • 130. Thesauri. A thesaurus is a classification scheme composed of words and phrases whose organization aims at facilitating the expression of ideas in written text. E.g., synonyms and homonyms. Example entry from Roget’s thesaurus: cowardly, adjective. Ignobly lacking in courage: cowardly turncoats. Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered. A thesaurus can be thematic, i.e., specific to the IR system’s domain of application (the most frequent case; e.g., the Thesaurus of Engineering and Scientific Terms), or generic. A thesaurus can be used to help users formulate queries, to let the system modify queries, and to select index terms.
  • 131. Thesauri. Many kinds of thesauri have been developed for IR systems. Hierarchical: synonyms (RT: related terms, UF: use for), generalization (BT: broader term), specialization (NT: narrower term). ISO and ANSI standards, almost always thematic. Manually built and updated by domain experts. Clustered: clusters (or synsets) of words. Non-typed semantic relationships among clusters. Each cluster is a set of words having a strong semantic relationship (usually UF). Example: WordNet. Clustered thesauri can be automatically generated if no distinction is made among semantic relationships. Associative: graph of words, where nodes represent words and edges represent semantic similarity among words. Edges can be oriented or not, according to the symmetry of the similarity relationship. Edges can be weighted (fuzzy pseudo-thesauri). Can be automatically generated from a collection of documents using co-occurrence relationships.
  • 132. Stemming and Lemmatization. Goals: reduce terms to their “roots” before indexing; reduce inflectional/variant forms to a base form (language dependent). E.g., am, are, is -> be; car, cars, car's, cars' -> car; the boy's cars are different colors -> the boy car be different color. Stemming: heuristic process that chops off the ends of words in the hope of achieving the goal correctly most of the time. Stemming collapses derivationally related words. Lemmatization: NLP tool. It uses dictionaries and morphological analysis of words in order to return the base or dictionary form of a word. Lemmatization collapses the different inflectional forms of a lemma. It is not widely used because it harms performance.
  • 133. Stemming. Many different algorithms. Porter’s algorithm, the most common algorithm for stemming English: Porter, Martin F. 1980. An algorithm for suffix stripping. Program 14:130–137. http://www.tartarus.org/~martin/PorterStemmer/ One-pass Lovins stemmer: Lovins, Julie Beth. 1968. Development of a stemming algorithm. Translation and ... Lancaster (Paice) stemmer: http://www.comp.lancs.ac.uk/computing/research/stemming/ Paice, Chris D. 1990. Another stemmer. SIGIR Forum 24:56–61. http://snowball.tartarus.org/demo.php Stemming increases recall while harming precision.
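As a rough sketch of how an associative thesaurus could be generated from a collection, the code below builds an undirected, weighted graph whose edge weights count the documents in which two words co-occur. The function name and the `min_weight` cut-off are assumptions; a real system would work on normalized index terms and usually on a similarity score rather than raw counts.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(documents, min_weight=2):
    """Map each word pair to the number of documents containing both words."""
    weights = Counter()
    for doc in documents:
        # Each distinct word counts once per document; sorting gives
        # every unordered pair a single canonical key.
        terms = sorted(set(doc.lower().split()))
        for a, b in combinations(terms, 2):
            weights[(a, b)] += 1
    # Keep only edges supported by enough documents.
    return {edge: w for edge, w in weights.items() if w >= min_weight}

docs = ["car engine repair", "car engine oil", "olive oil"]
print(cooccurrence_graph(docs))
# {('car', 'engine'): 2}
```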
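To make the "chop off the ends of words" idea concrete, here is a deliberately naive suffix stripper. It is not Porter's, Lovins' or Paice's algorithm; the suffix list and the minimum stem length of three characters are assumptions chosen only for illustration.

```python
# Hypothetical suffix list, checked in order; not taken from any real stemmer.
SUFFIXES = ("'s", "s'", "ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["car", "cars", "car's", "cars'"]])
# ['car', 'car', 'car', 'car']
print(naive_stem("news"))
# new
```

The last call illustrates the precision cost: "news" and "new" end up as the same index term, so a query for one also retrieves documents about the other.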
  • 134. Stemming Example July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 135. Tools for text analysis _1. Lucene and Solr contain many text analyzers working on several languages: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters (CharFilters, Tokenizers, Token Analyzers). Apache Tika, http://tika.apache.org/: toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. GATE (General Architecture for Text Engineering), http://gate.ac.uk/: includes ANNIE (A Nearly-New Information Extraction System), with tokenizer, gazetteer, sentence splitter, part-of-speech tagger, named entities transducer and coreference tagger. Support for English, Spanish, Chinese, Arabic, French, German, Hindi, Italian, Cebuano, Romanian, Russian. MALLET (Machine Learning for Language Toolkit), http://mallet.cs.umass.edu/index.php: Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
  • 136. Tools for text analysis _2. OpenNLP, http://opennlp.sourceforge.net/projects.html: open source projects related to natural language processing. Cognitive Computation Group, University of Illinois, http://l2r.cs.uiuc.edu/~cogcomp/software.php: Chunker, Part of Speech tagger, String similarity, Semantic Role Labeler, Named Entity Extractor, etc. Supersense Tagger, http://medialab.di.unipi.it/wiki/SuperSense_Tagger: tool for assigning to each noun, verb, adjective and adverb of a sentence one of the 45 standard WordNet supersenses. WordNet Domains, http://wndomains.fbk.eu/hierarchy.html Synesketch, http://www.synesketch.krcadinac.com/: open source textual emotion recognition.
  • 137. Multimedia Content Analysis. Computers are not able to grasp the underlying meaning of multimedia content: annotation is needed. Manual annotation. Expensive: it can take up to 10x the duration of the video, and it does not scale to millions of content items. Incomplete or inaccurate: people might not be able to holistically capture all the meanings associated with a multimedia object. Difficult: some content is tedious to describe with words, e.g., a melody without lyrics. Automatic annotation. Reasonably good quality: some technologies have ~90% precision. “Low” cost.
  • 138. Audio Segmentation GOAL: split an audio track according to contained information Music Speech Noise … Additional usage Identification and removal of ads © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 IMPLEMENTATION //
  • 139. Video Segmentation Keyframe segmentation: segment a video track according to its keyframes fixed-length temporal segments Shot detection: automated detection of transitions between shots a shot is a series of consecutive pictures taken contiguously by a single camera and representing a continuous action in time and space. July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // CREDITS: Thorsten Hermes@SSMT2006
  • 140. Speech Analysis. Speaker identification: identify the people participating in a discussion. Additional usage: vocal command execution. Speech to text: automatically recognize spoken words belonging to an open dictionary. [Figure: speaker labels ERIC, DAVID, JOHN]
  • 141. Classification of Music Genre. GOAL: automatically classify the genre and mood of a song. Genres: rock, pop, jazz, blues, etc. Moods: happy, aggressive, sad, melancholic, etc. Additional usage: automatic selection of songs for playlist composition. Tutorial from the PHAROS Summer School: http://www.pharos-audiovisual-search.eu/res/files/SummerSchool/Programme_Summer_School_file.zip [Figure labels: Rock, Dance!]
  • 142. Images: Low-level features GOAL: extract implicit characteristics of a picture luminosity orientations textures Color distribution July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 143. Face Identification and Recognition GOAL: recognize and identify faces in an image Usage examples: People counting Security applications July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // CREDITS: Thorsten Hermes@SSMT2006
  • 144. Image Concept Detection GOAL: recognize context/ concepts of an image E.g., playground, seaside, road, ... Extraction of low level features from raw data color histograms, color correlograms, color moments, co-occurrence texture matrices, edge direction histograms, etc.. Features can be used to build discrete classifiers , which may associate semantic concepts to images or regions thereof The MediaMill semantic search engine defines 491 semantic concepts http://www.science.uva.nl/research/mediamill/demo Concepts can be detected also from text (e.g., from manual or automatic metadata) using NLP techniques July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 145. Image Object Identification GOAL: identify objects appearing in a picture Basket ball, cars, planes, players, etc. July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 146. Tools for media analysis _1 OpenCV http://opencv.willowgarage.com/wiki/ Framework for image analysis Octave http://www.gnu.org/software/octave/ high-level language, primarily intended for numerical computations, it works well with Matlab Marsyas (Music Analysis, Retrieval and Synthesis for Audio Signals) http://marsyas.sness.net/ Framework for music analysis and retrieval July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION //
  • 147. Tools for media analysis _2. TINA (TINA Is No Acronym), http://www.tina-vision.net/: an open source environment developed to accelerate the process of image analysis research. Sphinx, http://cmusphinx.sourceforge.net/sphinx4/: speech recognition system written entirely in Java. WEKA, http://www.cs.waikato.ac.nz/ml/weka/: a collection of machine learning algorithms for data mining.
  • 148. Validation © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010
  • 149. Disclaimer. This section is inspired by the WWW2010 tutorial by Dasdan, Tsioutsiouliklis and Velipasaoglu, Web Search Engine Metrics for Measuring User Satisfaction: http://analytics.ncsu.edu/reports/wsmt.pdf
  • 150. Measures for IR Systems Measurable properties How fast does it process (index) documents? Number of documents/hour Average document size How fast does it search? Latency as a function of index size Expressiveness of query language Speed on complex queries The key measure: user happiness What is this? Speed of response/size of index are factors But blindingly fast, useless answers won’t make a user happy How do we quantify user happiness? July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla VALIDATION //
  • 151. Measuring User Happiness. Who is the user we are trying to make happy? It depends on the setting. Web engine: the user finds what they want and returns to the engine. Can measure the rate of return users. eCommerce site: the user finds what they want and makes a purchase. Is it the end user, or the eCommerce site, whose happiness we measure? Measure time to purchase, or the fraction of searchers who become buyers? Enterprise (company/govt/academic): care about “user productivity”. How much time do my users save when looking for information? Many other criteria have to do with breadth of access, secure access, etc.
  • 152. Evaluation measures Relevance Of search results Coverage Presence of content of interest in a catalog Diversity Of result set Discovery and Latency How many new resources (in the collection) are in the catalogue How long it took to get the new resources in the catalog? Time to first click Freshness July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla VALIDATION //
  • 153. Relevance as a measure of user happiness. How do you measure relevance? In order to assess the performance of an IR system you need a test collection composed of: a benchmark document collection; a benchmark suite of queries; a binary assessment of either Relevant or Irrelevant for each query-doc pair (gold standard, or ground truth). The test collection must be of a reasonable size: performance needs to be averaged, since results vary widely over different documents and information needs.
  • 154. Evaluating Relevance Set based evaluation Rank based evaluation with explicit judgment Absolute judgment Preference judgment Rank based evaluation with implicit judgment Direct and indirect evaluation by clicks Model based evaluation Browsing models User satisfaction July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // NOT COVERED HERE
  • 155. Information Need Translation. Relevance is assessed relative to the information need, not to the query. E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine. Query: wine red white heart attack effective. A document is relevant if it addresses the stated information need, not just because it contains all the words in the query.
  • 156. Set-based evaluation. The two most frequent and basic measures of IR effectiveness are precision and recall. Precision: fraction of retrieved docs that are relevant, P(relevant|retrieved). Provides a measure of the “degree of soundness” of the system. It does not consider the total number of documents. Recall: fraction of relevant docs that are retrieved, P(retrieved|relevant). Provides a measure of the “degree of completeness” of the system.
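The two set-based measures translate directly into code. The sketch below assumes document identifiers are hashable values; the function name and the convention of returning 0.0 for an empty set are choices made for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating both inputs as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 3 relevant overall, 2 in common: P = 2/4, R = 2/3.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, round(r, 3))
```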
  • 157. Precision / Recall Can get high recall (but low precision ) by retrieving all docs for all queries! Recall is a non-decreasing function of the number of docs retrieved Precision usually decreases (in a good system) Precision can be computed at different levels of recall Perhaps most appropriate for web search: all people want are good matches on the first one or two results pages Precision-oriented users Web surfers Recall-oriented users Professional searchers, paralegals, intelligence analysts July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla VALIDATION //
  • 158. F-Measure. Combined measure that assesses the trade-off between precision and recall (weighted harmonic mean). Values of β<1 emphasize precision; values of β>1 emphasize recall. People usually use the balanced F1 measure, i.e., with β = 1 or α = ½. The harmonic mean is a conservative average [C.J. van Rijsbergen, Information Retrieval].
  • 159. Difficulties in using precision/recall. Need to average over a large corpus and query set. Need human relevance assessments: people aren’t reliable assessors. Assessments have to be binary: what about nuanced assessments? Heavily skewed by corpus/authorship: results may not translate from one domain to another. The relevance of one document is treated as independent of the relevance of other documents: this is also an assumption in most retrieval systems.
  • 160. Rank-based evaluation. In ranked retrieval systems, P and R are values relative to a rank position. Evaluation is performed by computing precision as a function of recall. The function is computed at each rank position at which a relevant document has been retrieved. The resulting values are interpolated, yielding a precision/recall plot.
  • 161. Measures for Rank-based Evaluation. Mean average precision (MAP): measure of quality at all recall levels. Precision@K: not all queries will have more than K relevant results, so even a perfect system may have a score less than 1.0 for some queries. R-Precision [Allan 2005]: uses a variable result set cut-off for each query, based on the number of its relevant results. Mean Reciprocal Rank (MRR) [Voorhees 1999]: reciprocal of the rank of the first relevant result, averaged over a population of queries.
  • 162. Discounted Cumulative Gain (DCG) [Järvelin and Kekäläinen 2002]. Gain is adjustable for the importance of different relevance grades for user satisfaction (example: Very useful: 3, Somewhat useful: 1, Not useful: 0). Discounting is desirable for web ranking: most users don’t browse deep, and search engines truncate the list of results returned. DCG yields unbounded scores: for each query, divide the DCG by the best attainable DCG for that query to obtain the Normalized Discounted Cumulative Gain (nDCG).
  • 163. Preference Judgment. Kendall tau coefficient: based on counts of preferences (A: preferences in agreement; D: preferences in disagreement); range in [-1, 1]; robust for incomplete judgments. Binary Preference (bpref), Buckley and Voorhees (2004): designed for incomplete judgments (R: number of relevant results for the query; N_r: number of non-relevant docs ranked above relevant doc r, counted among the first R non-relevant docs). Generalized to graded judgments by De Beer and Moens (2006).
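The formula on this slide did not survive extraction. For reference, the standard definition of the weighted harmonic mean of precision $P$ and recall $R$, as given in [ManningIR], is:

```latex
F \;=\; \frac{1}{\alpha\,\frac{1}{P} + (1-\alpha)\,\frac{1}{R}}
  \;=\; \frac{(\beta^{2}+1)\,P\,R}{\beta^{2}P + R},
\qquad \beta^{2} = \frac{1-\alpha}{\alpha},
\qquad F_{1} = \frac{2PR}{P+R} \quad (\beta = 1,\ \alpha = \tfrac{1}{2}).
```

For example, with $P = 0.5$ and $R = 2/3$ the balanced measure is $F_1 = 2 \cdot 0.5 \cdot (2/3) / (0.5 + 2/3) = 4/7 \approx 0.571$, which is below the arithmetic mean of about 0.583; this is the sense in which the harmonic mean is conservative.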
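The rank-based measures listed above can be sketched per query as follows; MAP and MRR are then the means of `average_precision` and `reciprocal_rank` over a set of queries. Function names are illustrative, and `relevant` is assumed to be the set of all documents judged relevant for the query.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def r_precision(ranking, relevant):
    """Precision at a cut-off equal to the number of relevant documents."""
    return precision_at_k(ranking, relevant, len(relevant)) if relevant else 0.0

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

ranking, relevant = ["d3", "d1", "d7", "d2"], {"d1", "d2"}
print(precision_at_k(ranking, relevant, 2))   # 0.5
print(average_precision(ranking, relevant))   # 0.5  -> (1/2 + 2/4) / 2
print(reciprocal_rank(ranking, relevant))     # 0.5  -> first hit at rank 2
```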
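A minimal sketch of DCG and nDCG, using the slide's example grades (3, 1, 0). It assumes the widely used log2(rank + 1) discount, which is one common variant rather than necessarily the exact formulation of [Järvelin and Kekäläinen 2002], and it assumes that the list of gains covers every judged document for the query, so that sorting it gives the best attainable ordering.

```python
import math

def dcg(gains):
    """Sum of gains, each discounted by log2(rank + 1); rank starts at 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending) ordering of the gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Ranked results judged: very useful (3), not useful (0), somewhat useful (1).
print(dcg([3, 0, 1]))             # 3.5  -> 3/1 + 0 + 1/2
print(round(ndcg([3, 0, 1]), 3))  # 0.964
```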
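The two preference-based measures can be sketched from the quantities named on the slide. The formulas assumed here are tau = (A - D) / (A + D) and bpref = (1/R) * sum over relevant retrieved docs r of (1 - N_r / R); function names are illustrative, and the Kendall tau version assumes rankings without ties.

```python
from itertools import combinations

def kendall_tau(ranking_a, ranking_b):
    """(A - D) / (A + D) over all pairs of documents present in both rankings."""
    pos_a = {doc: i for i, doc in enumerate(ranking_a)}
    pos_b = {doc: i for i, doc in enumerate(ranking_b)}
    common = [doc for doc in ranking_a if doc in pos_b]
    agree = disagree = 0
    for x, y in combinations(common, 2):
        # Same sign of the position difference means the pair is ordered alike.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            agree += 1
        else:
            disagree += 1
    total = agree + disagree
    return (agree - disagree) / total if total else 0.0

def bpref(ranking, relevant, nonrelevant):
    """Binary preference; documents with no judgment are simply ignored."""
    R = len(relevant)
    if R == 0:
        return 0.0
    seen_nonrel, score = 0, 0.0
    for doc in ranking:
        if doc in nonrelevant:
            seen_nonrel += 1
        elif doc in relevant:
            # N_r is capped at R: only the first R non-relevant docs count.
            score += 1.0 - min(seen_nonrel, R) / R
    return score / R

print(round(kendall_tau(["d1", "d2", "d3"], ["d1", "d3", "d2"]), 3))  # 0.333
print(bpref(["n1", "r1", "u1", "r2", "n2"], {"r1", "r2"}, {"n1", "n2"}))  # 0.5
```

In the bpref call, `u1` has no judgment and does not affect the score, which is the property that makes the measure suitable for incomplete judgments.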
  • 164. Presentation Metrics How to present information? Which information Where they should be displayed Which presentation elements should be used? Font, colors, design elements, interaction design Generalization How to measure success? User studies On-line, on-home, usability, eye tracking, focus group, surveys Log analysis Editorial Comparative, Perceived vs. actual July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla VALIDATION //
  • 165. Not all results are likely to be reviewed July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf) ‏
  • 166. Clicks and views depend on rank July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // [Joachims et al, 2005]
  • 167. Eye Tracking Studies July 5, 2010 © 2010 Alessandro Bozzon, Marco Brambilla VALIDATION //
  • 168. Heat Maps. Golden Triangle: the first result is always considered more trusted and more relevant by default. Users spend less time reading the lower part of the page [Marti A. Hearst, Search User Interfaces, Cambridge University Press, 2009].
  • 169. Thank you for your attention! Questions? © 2010 Alessandro Bozzon, Marco Brambilla Alessandro Bozzon Dipartimento di Elettronica e Informazione Politecnico di Milano Milano, Italy [email_address] http://home.dei.polimi.it/bozzon Marco Brambilla Dipartimento di Elettronica e Informazione Politecnico di Milano Milano, Italy [email_address] http://home.dei.polimi.it/mbrambil http://www.search-computing.org/book July 5, 2010 REFERENCES //
  • 170. References – Books Modern Information Retrieval Ricardo Baeza-Yates, Berthier Ribeiro-Neto , Addison Wesley Longman Publishing Co. Inc., 2010 [ManningIR] Introduction to Information Retrieval Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008 Information Retrieval: Algorithms and Heuristics . D.A. Grossman, O. Frieder. Springer, 2004 Managing Gigabytes. I.H. Witten, A. Moffat, T.C. Bell. Morgan Kaufmann, 1999 Mining the Web: Analysis of Hypertext and Semi Structured Data . S. Chakrabarti. Morgan Kaufmann, 2002 Search User Interfaces Marti A. Hearst. Cambridge University Press, 2009 Search Computing – Challenges and directions Stefano Ceri, Marco Brambilla (eds.) . Springer LNCS, vol. 5950, 2010 © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REFERENCES //
  • 171. References - Tutorial Web Search Engine Metrics: Direct Metrics to Measure User Satisfaction Ali Dasdan, Kostas Tsioutsiouliklis, Emre Velipasaoglu (Yahoo!) www2010 Recent Progress on Inferring Web Searcher Intent Eugene Agichtein (Emory University) www2010 Applications of Open Search Tools Rosie Jones, Ted Drake (Yahoo!) www2010 [BAEZASeco2010] New Frontiers for Search Ricardo Baeza-Yates www2010 Web Mining for Search Ricardo Baeza-Yates and Rosie Jones (Yahoo!) SIGIR 2008 © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REFERENCES //
  • 172. References - Papers. [Ramakrishnan and Tomkins 2007] Raghu Ramakrishnan, Andrew Tomkins: Toward a PeopleWeb. IEEE Computer 40(8): 63-72 (2007). [Broder2002] A. Broder. A taxonomy of web search. SIGIR Forum, 36(2):3–10, 2002. [BATES2002] Bates, Marcia J. Toward an integrated model for information seeking and searching. In: The Fourth International Conference on Information Needs, Seeking and Use in Different Contexts, 2002. [FU2007] Fu, Wai-Tat; Pirolli, Peter. SNIF-ACT: a cognitive model of user navigation on the world wide web. Human-Computer Interaction: 335–412, 2007. [Withrow2002] Jason Withrow. Do your links stink? American Society for Information Science Bulletin, June 1, 2002. [Pirolli2009] Pirolli, Peter. An elementary social information foraging model. Proceedings of the 27th international conference on Human factors in computing systems: 605–614, 2009. [D. Rose, 2008] [BATES1989] M.J. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407–431, 1989. [Teevan et al., CHI 2004] Teevan, J., Alvarado, C., Ackerman, M. and Karger, D. The perfect Search Engine is not Enough: A Study of Orienteering Behavior in Directed Search. Proceedings of ACM CHI 2004, pp. 415-422. [MARCHIONINI2006] Marchionini, G. Exploratory search: from finding to understanding. Communications ACM 49(4): 41-46 (2006). [WHITE2007] White, R. W., and Drucker, S. M. Investigating behavioral variability in web search. 16th WWW Conf. (Banff, Canada, 2007). [AULA2008] Aula, A., and Russell, D.M. Complex and Exploratory Web Search. ISSS: Information Seeking Support Systems Workshop (Chapel Hill, June 2008).
  • 173. References - Papers. [BozzonEtAL2010] Alessandro Bozzon, Marco Brambilla, Piero Fraternali, Stefano Ceri. Liquid Query: multi-domain exploratory search on the Web. WWW 2010, Raleigh, USA. [FAGIN1999] R. Fagin. Combining fuzzy information from multiple systems. J. Comput. Syst. Sci., 58(1):83–99, 1999. [ILYAS1999] F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid. Rank-aware query optimization. In SIGMOD Conference, pages 203–214, 2004. [MARTINENGHI2010] D. Martinenghi and M. Tagliasacchi: Proximity Rank Join. To appear in PVLDB. [Carbonell and Goldstein 1998] J. Goldstein and J. Carbonell (1998), Summarization: Using MMR for Diversity-based Reranking. SIGIR’98. [BozzonEtAl2007] Alessandro Bozzon, et al. Role Based Access Control for the interaction with Search Engines. International Workshop on Collaborative Open Environments for Project-Centered Learning (COOPER) 2007, Crete, Greece. [BozzonEtAl2009] Alessandro Bozzon, Marco Brambilla, Piero Fraternali. Conceptual Modeling of Multimedia Search Applications using Rich Process Models. ICWE 2009, June 24-26, 2009, San Sebastian, Spain. [BozzonThesis2009] Alessandro Bozzon. Model-driven development of Search Based Web Applications. Ph.D. Thesis, Politecnico di Milano, April 2009. [BragaEtAl2010] D. Braga, S. Ceri, F. Corcoglioniti, M. Grossniklaus, and S. Vadacca: Panta Rhei: An Execution Model for Queries over Web Information Sources. http://www.search-computing.it/sites/cms.web.seco/files/pantarhei2010.pdf
  • 174. References - Papers [Allan 2005] J. Allan (2005), HARD track overview in TREC 2005: High accuracy retrieval from documents. [Voorhees 1999] E.M. Voorhees (1999), TREC-8 question answering track report [Järvelin and Kekäläinen 2002] K. Järvelin and J. Kekäläinen, Cumulated gain-based evaluation of IR techniques ACM Trans. IS, 20(4): 422-446, 2002 [Buckley and Voorhees (2004)] C. Buckley and E.M. Voorhees, Retrieval evaluation with incomplete information SIGIR’04. [De Beer and Moens (2006)] De Beer, Jan; Moens, Marie-Francine. Rpref: a generalization of Bpref towards graded relevance judgments SIGIR 2006, Seattle, USA, 6-11 August 2006, pages 637-638, ACM © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REFERENCES //
  • 175. References - Links. Search Computing Course Lecture Notes: http://www.search-computing.it/course Fabio Aiolli, Università di Padova: http://www.math.unipd.it/~aiolli/corsi/0809/IR/IR.html http://www.ir.disco.unimib.it/

Editor's Notes

  • #13: i.e. it might not be clear to the system whether the user is “recall-oriented” or “precision-oriented”
  • #32: In information retrieval, users express their information needs as queries submitted to the system. While in data management systems like databases users are often required to express queries in a formal, structured language (e.g., SQL, XQuery, which have exact matching predicates and unambiguous semantics), in information retrieval the semantics of the query corresponds to the semantics associated with its content, which is interpreted in order to retrieve the relevant results. Hence, it is not possible to provide a taxonomy for information retrieval queries based, for instance, on the expressive power of the underlying query language. Nonetheless, we can provide a functional classification of queries as follows.
  • #43: complex search is characterized by: multiple searches, possibly over multiple sessions and spanning multiple sources of information; a combination of exploration and more directed information finding activities; the need of note-taking, the variation of the search goal during the search process.
  • #67: From an high-level perspective, “search” is enabled by mechanisms which allow the extraction of contents from data repositories (e.g., text file, audio file, video file, databases, etc). Contents are therefore processed in order to build an index of the managed information, optimized for efficiently answer to users’ queries. Before being indexed, contents are analyzed and enriched with annotations 1 that build contents’ representation. Along with the index, search leverages on ranking models, i.e., mathematical methods that associates a score to the relevance of a content item w.r.t. a query. Once contents are indexed, multiple user interfaces (e.g., Web applications) provide users the means to interact with the search engine by executing queries and displaying the retrieved results.
  • #69: We define (i) an Indexing process (represented as a dashed line), which addresses the indexing of contents coming from the application data sources (thus involving data retrieval from external sources, transformation or aggregation of the retrieved data and, finally, their indexing); (ii) a Query and Result Presentation (QRP) process (represented as a solid line), which addresses the operations related to query execution, orchestration, and result-set composition; and (iii) a User Interaction process (represented as a dotted line), i.e., the way users interact with the application’s functionalities.
  • #78: One aspect of the proposed development framework is the definition of a methodology for the design and implementation of the application to be produced. A development approach based on a formal methodology and appropriate high-level modeling languages smoothly incorporates change management into the mainstream production life-cycle, and greatly reduces the risk of breaking the software engineering process when changes occur. The proposed methodology follows the MDD approach by relying on incremental, iterative design steps that foster separation of concerns among the actors involved in the SBA design. The Conceptual Design macro activity represents the core of the development lifecycle, since it involves the main design activities. In MDD terminology, the BPMN Process Model can be seen as a Computation Independent Model (CIM), which specifies SBA requirements for the CAI and QRP processes; as we will see, the UI process is instead addressed as an interaction pattern composition activity. The WebML application model is a Platform Independent Model (PIM), which exploits SOA and Web hypertext interfaces as a technical space. Finally, the application code is a Platform Specific Model (PSM) for the Java 2 technical space. Initially, requirements are conceptualized in a Domain Model, which formalizes the essential data objects managed by the application, and a Process Model, which pinpoints the workflow of the CAI, QRP and UI processes. The link between the domain and process models is established by the type of objects that flow between activities. The designed solutions do not take into account domain-specific information such as the schema of the adopted search technologies or the format of the annotations produced by the analysis components. Nonetheless, the focus on a specific class of applications allows one to include, in the business model, high-level concepts relative to the applications’ domain: for SBAs, for instance, the concepts of query, user, index and so on.
The use of a high-level model combined with coarse-grained domain concepts allows one to look at the designed application in perspective, possibly creating designs that apply to whole classes of applications (e.g., audiovisual search engines) rather than one-off solutions. Abstract-level notation, though, cannot be translated into running code, because it lacks the platform-specific details (e.g., the technologies adopted by actual search engines, analysis components, deployment platform) needed to enact code generation. The Domain Model and Process Model are then subject to a first (CIM to PIM) transformation, which produces the Application Model and process metadata. Therefore, coarse-grained design is followed by refinements that take into account more domain-specific information, such as the structure and format of the contents, the annotations and the indexes. To do so, a finer-grained model is adopted, in order to enable the definition of domain- and application-specific details that can lead to automatic code generation. The proposed approach is generic enough to accommodate alternative modeling languages, both for process and application design. This slide discusses how to derive an application model from a high-level process model. The proposed framework employs the BPMN modeling language for process specification and the WebML modeling language for the design of hypertexts and Web service orchestrations.
  • #79: Let’s now take a bird’s-eye view of some reference example designs for all three identified SBA processes. The CAI process can be defined as the work to be performed by the actors of an SBA to achieve the indexing of a content item. The goal of the domain model is to formalize the content- and index-related data and metadata managed by the search application. Such models build on five basic domain concepts:
+ Content Item: an individual information unit that is relevant in a search-based Web application for indexing purposes.
+ Annotation: the textual information associated with a content item for indexing and searching purposes. Such information can be of different kinds: manual annotations, provided by the content provider or by the user, and automatically generated annotations, produced by the search application during the Indexing process.
+ Usage Group: Content Items are published by one or more Content Providers, which are responsible for their publication. A Usage Group is an access profile specified by a content provider to define the set of operations on a given content item that are allowed to a set of users.
+ Index: the notion of Index, well known in many disciplines of computer science, denotes a data structure designed to optimize speed and performance in finding relevant content items for a search query.
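To make the domain concepts above more concrete, here is a minimal Python sketch of how they might relate to one another. It is an assumption-laden illustration, not the model used in the tutorial: every class field and method name (`item_id`, `provider`, `can`, `add`, `lookup`, and so on) is hypothetical, and the index is reduced to a simple term-to-item mapping built from annotation text.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    text: str
    source: str  # assumed values: "manual" or "automatic"


@dataclass
class UsageGroup:
    """Access profile: which users may perform which operations."""
    name: str
    users: set
    allowed_operations: set


@dataclass
class ContentItem:
    item_id: str
    provider: str  # the content provider that published the item
    annotations: list = field(default_factory=list)
    usage_groups: list = field(default_factory=list)

    def can(self, user, operation):
        """True if some usage group grants `operation` to `user`."""
        return any(
            user in group.users and operation in group.allowed_operations
            for group in self.usage_groups
        )


@dataclass
class Index:
    """Toy index: maps each annotation term to the ids of matching items."""
    postings: dict = field(default_factory=dict)

    def add(self, item):
        for annotation in item.annotations:
            for term in annotation.text.lower().split():
                self.postings.setdefault(term, set()).add(item.item_id)

    def lookup(self, term):
        return self.postings.get(term.lower(), set())


# Hypothetical example data.
video = ContentItem("v1", provider="acme")
video.annotations.append(Annotation("Vienna conference talk", "manual"))
video.usage_groups.append(UsageGroup("viewers", {"alice"}, {"view"}))

index = Index()
index.add(video)
```

With this data, looking up "vienna" in the index returns the item `v1`, and only the user `alice` is allowed the `view` operation on it.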
  • #82: User interaction design, instead, requires a small paradigm shift in the proposed methodology, since we manage it not as a process but as an assembly of standard interaction schemas expressed as patterns. The reason for this shift lies in the common knowledge that user interaction cannot be expressed as a linear process, given that users act driven by tasks that cannot always be serialized. Traditional information retrieval is inherently based on users searching for information, the so-called “information need”. Recent studies have extended the importance of this cognitive process, embedding it into a broader category named information seeking. This extension is motivated by the fact that information needs and retrieval stem from social, cultural, biological, and anthropological contexts, which broaden the ways in which information is gathered. A commonly accepted taxonomy identifies four modes of information seeking. (This taxonomy considers two orthogonal classification dimensions: Directed and Undirected refer, respectively, to whether an individual explicitly seeks information by specifying her need by means of a query, or is more or less randomly exposing herself to information; Active and Passive, instead, refer to whether the individual does anything actively to acquire information, or is passively available to absorb information but does not seek it out. With the advent of the so-called Web 2.0, the four information seeking interactions listed in the previous section have been enhanced by the availability of new features. The transformation of end users from passive recipients of content and communication into active contributors gave these interactions new flavors, providing additional means for all four interaction modes.)
We identified more than 30 patterns, which we organized into 3 categories: + Query and result presentation patterns, containing general-purpose patterns that enable the execution of queries addressed to the search application and the presentation of their results; + Information Interaction patterns, for the specification of the four information seeking modalities presented in the previous section; + Permission Management patterns, containing general-purpose patterns that enable usage permission management.
  • #87: Thanks to the implemented extensions, we inject more information into the higher-level model, which leads to: + finer-grained application models + fewer errors + more efficiency. Transformations were implemented in ATL, a language for model transformations. Here’s a graphical example of a model transformation between BPMN* activities and the WebML model, and here’s a hint of how transformations are coded.
  • #97:
+ Indri/Lemur: language modeling; BM25, Okapi, cosine similarity, inQuery
+ Lucene: TF-IDF, weighted by term occurrences; fielded search
+ Terrier: Okapi BM25, language modeling and TF-IDF; Divergence from Randomness
+ Your own re-ranking code using open search
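As a rough illustration of one of the ranking models named on this slide, the following Python sketch computes Okapi BM25 scores over a toy collection. It is not the implementation used by Indri, Lucene, or Terrier: the function name `bm25_scores`, the sample documents, the parameter defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant `log(1 + (N - n + 0.5) / (n + 0.5))` are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # term absent from this document
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection, already tokenized.
docs = [
    "lucene is a search library".split(),
    "terrier implements divergence from randomness".split(),
    "search search search with lucene and more search".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
```

The second document contains neither query term and scores zero. The third document outranks the first because of its repeated occurrences of "search", although term-frequency saturation (controlled by `k1`) and length normalization (controlled by `b`) keep the gap small.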
  • #98: Not enough comparative benchmarks out there. Hard to do; we really need standards. Optimize each platform, per hardware and data set. Lots of platforms, with different APIs, options and numerical settings. Need good, diverse data sets, small & large.
Lucene was the only solution that produced an index smaller than the input data size. It shaves off an additional 5 megabytes if one runs it in optimize mode, but at the cost of adding another ten seconds to indexing. sphinx and zettair index the fastest. Interestingly, I ran zettair in big-and-fast mode (which sucks up 300+ megabytes of RAM) but it ran slower by 3 seconds (maybe because of the nature of tweets). Xapian ran 5x slower than sqlite (which stores the raw input data in addition to the index) and produced the largest index file sizes. The default index_text method in Xapian stores positional information, which blew the index size up to 529 megabytes. One must use index_text_without_positions to make the size more reasonable. I checked my Xapian code against the examples and documentation to see if I was doing something wrong, but I couldn’t find any discrepancies. I also included a column about development issues I encountered. zettair was by far the easiest to use (simple command line) but required transforming the input data into a new format. I had some text issues with sqlite (which also needs to be recompiled with FTS3 enabled) and sphinx, given their strict input constraints. sphinx also requires a conf file, and it took some searching to find full examples of one. Lucene, zettair, and Xapian were the most forgiving when it came to accepting text inputs (zero errors).
  • #99: On a larger data set (3x larger than the Twitter one), we see zettair’s indexing performance improve (which makes sense, as it is designed more for larger corpora); zettair’s search speed should probably be a bit faster, because its search command-line utility prints some unnecessary stats. For multi-searching in sphinx, I developed a Java client (with the hopes of making it competitive with Lucene – the one to beat) which connects to the sphinx searchd server via a socket (that’s their API model in the examples). sphinx returned searches the fastest: ~3x faster than Lucene. Its indexing time was also on par with zettair. Lucene obtained the highest relevance and the smallest index size. Its index time could probably be improved by fiddling with its merge parameters, but I wanted to avoid numerical adjustments in this evaluation. Xapian has very similar search performance to Lucene but with significant indexing costs (both time and space > 3x). sqlite has the worst relevance because it doesn’t sort by relevance, nor does it seem to provide an ORDER BY function to do so.
  • #118: <!-- When a message arrives on the portType for the operation "process", instantiate a variable named "Request" --> <!-- Typically the request will contain a single Record. Multiple Records are produced, for example, by annotators that examine zip|rar|tgz archives. The extension activity is executed if the 'workflow-attribute' attribute on the record contains the value "split". Conditions are expressed as XPath expressions, and the attributes and annotations used must be explicitly made available to the BPEL workflow through configuration (of org.eclipse.smila.blackboard). -->
  • #120: RAP – Rich Ajax Platform. G-Eclipse: an extensible framework including a GRID model for seamless integration of GRID/Cloud resources. It supports different Grid/Cloud interfaces, including AWS.
  • #133: Example: the token “saw”. Stemming might return just “s”; lemmatization attempts to return “see” or “saw”, depending on whether the token is used as a verb or a noun.
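The contrast between the two techniques can be sketched with a toy Python example. Both functions are deliberately simplistic assumptions: `crude_stem` is a hypothetical suffix stripper (not the Porter algorithm or any library stemmer), and `LEMMAS` is a hand-written mini-dictionary standing in for the full lexicon and part-of-speech tagger that a real lemmatizer would need.

```python
def crude_stem(token):
    """Strip a few common suffixes with no regard for meaning or context."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Hypothetical mini-dictionary keyed by (token, part of speech).
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("indexes", "NOUN"): "index",
}


def lemmatize(token, pos):
    """Look up the dictionary form; fall back to the token itself."""
    return LEMMAS.get((token, pos), token)


stems = [crude_stem(t) for t in ("searching", "indexes", "saw")]
verb_lemma = lemmatize("saw", "VERB")
noun_lemma = lemmatize("saw", "NOUN")
```

The stemmer sees only the character string, so it treats every occurrence of a token identically. This toy version happens to leave “saw” untouched, whereas a more aggressive rule set could truncate it further, as the slide suggests. The lemmatizer needs the part of speech to decide between “see” and “saw”.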