Engineering Web Search Applications

Alessandro Bozzon, Marco Brambilla
Vienna, July 5, 2010

About the speakers
- Alessandro Bozzon, Post-doc @ Politecnico di Milano, http://home.dei.polimi.it/bozzon
- Marco Brambilla, Assistant Professor @ Politecnico di Milano, http://home.dei.polimi.it/mbrambil
- Research background and interests
  - Web engineering and model-driven development: WebML and WebRatio
  - Complex enterprise application design: BPM, SOA and integration with Web application development
  - Search engine and complex search application development
    - Search Computing: multi-domain search
    - Pharos: multimedia search framework
© 2010 Alessandro Bozzon, Marco Brambilla. July 5, 2010

About the tutorial
- Information Retrieval is a discipline more than 40 years old, tackled from a myriad of viewpoints
- This tutorial is:
  - Breadth-oriented
  - Driven by the development process
  - Illustrated with real-world case studies
- The tutorial is necessarily shallow, but we provide references and links

AGENDA
- Introduction: What are Web search applications?
- Requirements: What are their requirements?
- Design: How to design them?
- Implementation: How to implement them?
- Validation: How to measure their success?

INTRODUCTION // Search prevails
- Search is an integral part of people's online life
- Web search has become a standard (and often preferred) way of finding information
  - "... 92% of Internet users say the Internet is a good place to go for getting everyday information..." (2004 Pew Internet Survey)
- Web search engines are now the second most frequently used online computer application, after email
- Search is fully integrated into operating systems and is viewed as an essential part of most information systems

Some numbers …
- Web
  - Estimated size: ~60 billion pages (22/06/2010), http://www.worldwidewebsize.com/
  - More than 9.3 billion queries, just in the U.S., in May 2010, and growing: http://blog.nielsen.com/nielsenwire/online_mobile/top-u-s-search-sites-for-may-2010/
- Twitter
  - New tweets per day: 55 million
  - Search queries per day: 600 million
- Facebook
  - 400 million global users (and growing)
  - The average Facebook user spends 55 minutes per day

… more numbers …
- IDC Digital Universe report estimates:
  - Digital data grew by 62% between 2008 and 2009, to ~800,000 petabytes (PB)
  - More than 1.2 million PB in 2010
  - Expected to reach 35 ZB (zettabytes) by 2020
[Ramakrishnan and Tomkins 2007]

Information Retrieval
- Information retrieval (IR) deals with the representation, storage, organization of, and access to information items.
- An "old" discipline
- As an academic field of study:
  - IR is devoted to finding relevant documents, not to finding simple matches to patterns.
  - IR is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). [Manning et al., 2007]

Information Retrieval Applications
- Search ("ad hoc" retrieval)
  - Static document collection
  - Dynamic queries
  - [Figure: an ad-hoc query runs against a static document collection and yields a ranked result]
- Filtering
  - Queries are static
  - Document collection constantly changing
  - Example: corporate mails routed by predefined queries to different parts of the organization
  - [Figure: incoming documents enter a document routing system driven by predetermined queries or user profiles]
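The filtering case can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the department names and term sets in `ROUTES` are hypothetical, and a real routing system would use proper text analysis rather than whitespace splitting.

```python
# Hypothetical standing queries: one set of trigger terms per department.
ROUTES = {
    "sales": {"invoice", "order"},
    "support": {"error", "crash"},
}

def route(document, routes):
    """Return the departments whose standing query matches the incoming document."""
    words = set(document.lower().split())
    # A department matches when the document shares at least one term with its query.
    return sorted(dept for dept, terms in routes.items() if words & terms)

if __name__ == "__main__":
    print(route("Order 42 invoice attached", ROUTES))
    print(route("Crash after placing an order", ROUTES))
```

The queries stay fixed while documents keep arriving, which is the opposite of ad hoc search, where the collection is fixed and the queries change.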

The nature of information retrieval
… retrieving all objects which might be useful or relevant to the user information need
- Usually unstructured queries (no formal semantics)
  - The IR system "interprets" the contents of the information items
  - Examples: keyword-based queries, context queries, proximity, phrases, natural language queries…
  - Structural queries and, in recent systems, structured query languages are also supported (but with different semantics)
- Errors in the results are tolerated
- Core concept: relevance
  - Relevance ranking (according to the user need)
  - It is not clear what "degree of relevance" the user is happy with
  - The user starts from the top of the ranked list and explores down until satisfied

Information Retrieval is NOT Data Retrieval
- Data Retrieval (RDBMS, XML DB)
  - … retrieving all objects which satisfy clearly defined conditions expressed through a query language
- Data has a well-defined structure and semantics
- Formal query languages
  - Regular expressions, relational algebra expressions, etc.
- Results are EXACT matches -> errors are not tolerated
- No ranking w.r.t. the user information need
- Binary retrieval: does not allow the user to control the magnitude of the output
  - For a given query, the system may return:
    - Under-dimensioned output
    - Over-dimensioned output
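The contrast between the two retrieval styles can be made concrete with a toy sketch. The three documents and the scoring rule (counting query-term occurrences) are assumptions chosen for illustration, not a real ranking model.

```python
from collections import Counter

# Hypothetical mini-collection.
DOCS = {
    "d1": "cheap flights to vienna",
    "d2": "vienna hotels and vienna restaurants",
    "d3": "cheap hotels in milan",
}

def data_retrieval(docs, required_terms):
    """Exact match: a document either satisfies every condition or is excluded."""
    return sorted(
        doc_id for doc_id, text in docs.items()
        if all(term in text.split() for term in required_terms)
    )

def ranked_retrieval(docs, query_terms):
    """Toy relevance ranking: score = number of query-term occurrences."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Best score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    query = ["cheap", "vienna"]
    print(data_retrieval(DOCS, query))
    print(ranked_retrieval(DOCS, query))
```

The exact-match variant returns a single document, while the ranked variant also surfaces partially matching documents and lets the user decide how far down the list to go.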

The Information Retrieval Process
[Figure: a generic search-oriented application split into a back end and a front end. Components: Content Management, Query Analysis, Query Interaction, Search, Result Composition, Result Manipulation. Flow labels: q, q', r, r'.]

Search Engine vs. Search Application
- Search Engine: a data management system which uses information retrieval algorithms to retrieve information items from one or more sources upon the submission of a query
- Web Search Application: a data management system where search engines are a piece of a more complex puzzle, which includes:
  - data source integration (e.g., databases, legacy systems, the Web)
  - content analysis technologies
  - orchestration
  - user interfaces
  - Web-mediated social interactions, etc.

Characterization of the user information need
- It is not a simple problem: "blurred" goals
- Sensory gap: the gap between the object in the world and the information in a (computational) description
- Semantic gap: the lack of coincidence between the (computational) description of the information and its interpretation

Evaluating an IR System
- Precision: fraction of retrieved docs that are relevant, P(relevant|retrieved)
  - "Degree of soundness" of the system
  - Does not consider the total number of documents
- Recall: fraction of relevant docs that are retrieved, P(retrieved|relevant)
  - "Degree of completeness" of the system
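Both measures follow directly from the two sets involved. The sketch below computes them from a retrieved set and a relevant set; the function name and the document ids are illustrative, and returning 0.0 for an empty set is a convention assumed here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

if __name__ == "__main__":
    retrieved = ["d1", "d2", "d3", "d4"]
    relevant = ["d1", "d3", "d5"]
    print(precision_recall(retrieved, relevant))
```

In this example two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).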

Enterprise search
- Public Web search engines are the ones known to the general public
- But there is also a huge need (and market share!) for professional search over enterprise repositories
- Enterprise search is covered by
  - Packaged suites: Microsoft FAST, Autonomy IDOL, IBM OmniFind, Exalead
  - Frameworks: Apache UIMA (ex IBM), Smila, Solr

CASE STUDIES // Case Studies
- Textual search: YaGoBi
- Multimedia search: the PHAROS project
- Multi-domain search: the Search Computing project
- Example of Web search application: Chansonnier

YaGoBi
- THE Web Search
- 92% of market share in the U.S.
- Searching on
  - Web pages, blogs, news, books, scientific publications, emails
  - Images and videos (but only through textual descriptions)
  - Tweets
  - …

The PHAROS Project
- FP6 IP, 3 years, 12 partners, ~15 M€ budget
- Mission: develop a SOA-compliant, open and distributed technology platform for the development of information access solutions for audiovisual content
- www.pharos-audiovisual-search.eu

The Search Computing Project
- European Research Council (ERC), 2008 Call for "IDEAS Advanced Grants", 5 years (started in 2009)
- Mission: provide the abstractions, foundations, methods, and tools required to answer multi-domain queries by interacting with a constellation of cooperating search services, using ranking and joining of results as the dominant factors for service composition
- www.search-computing.org

Chansonnier
- BSc thesis project
- Mission: graduate :)
- Open source video analysis application based on open frameworks (SMILA / SOLR)
  - Crawling of Web video
  - Download of song lyrics
  - Analysis of lyrics text: language, emotion
  - Keyframe extraction for video snippets
- http://github.com/giorgiosironi/Chansonnier

REQUIREMENTS // Key Requirements and Design Dimensions for Web Search
[Figure: the dimensions surrounding a Web search application: Data Source, Data Format, Data Analysis, Search Engine, Query Format, User Interface, User Behavior, Social Interactions, Performance, Security]

Data Type
- Unstructured data: textual documents, blog posts
- (Semi-)structured data: software code, models, XML files
- Media: pictures, video, music

Data Analysis
An activity performed with the purpose of providing a representation of a content item suited to the application
- Textual analysis
  - Deals with basic language units (morphemes, roots, stems, words, phrases, sentences, etc.)
- Media analysis
  - Deals with media contents
  - Transcoding
  - Classification
  - Feature extraction

Search Engine _1
- Textual: textual contents represented as a collection of unstructured text terms
- Fielded: textual contents structured in fields (e.g., metadata)
- Semi-structured: textual contents organized in a complex (possibly heterogeneous) structure (e.g., XML, HTML)

Search Engine _2
- Content-based: media contents described by low-level features
- Geographic and other special dimensions
  - Content featuring geo-spatial features
  - Streaming content searched by temporal features (e.g., recency)

Query Format
Representation of the user information need
- Natural language
  - For instance through vocal interfaces
- Keyword
  - Set of text items, plus Boolean (AND/OR/NOT), proximity (lexical nearness) and/or wildcard conditions
- Fielded keyword
  - Text items defined on one or more fields
  - Queries to semi-structured search engines and faceted queries
- Content-based
  - Query by example (text, image, video, audio, etc.)
- Geographic and other special dimensions
  - Geographic coordinates plus spatial operator terms (near, north of, within X kilometers from, etc.)
  - Timestamps plus temporal operator terms (recent, near, interval, etc.)
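Keyword queries with Boolean conditions are typically answered from an inverted index that maps each term to the set of documents containing it. The following is a minimal sketch: the documents are made up, tokenization is a plain lowercase split, and the helper names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents containing every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(term, set()) for term in terms))

def boolean_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())

# Hypothetical documents.
DOCS = {
    "d1": "Wiener Schnitzel in Vienna",
    "d2": "Sachertorte in Vienna",
    "d3": "Schnitzel recipes",
}

if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    print(boolean_and(index, "schnitzel", "vienna"))
    print(boolean_or(index, "schnitzel", "sachertorte"))
    # vienna AND NOT schnitzel
    print(boolean_and(index, "vienna") & boolean_not(index, DOCS, "schnitzel"))
```

Proximity and wildcard conditions need extra structures (term positions, prefix lookups) that this sketch leaves out.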

YaGoBi
- Data Sources
  - Web: crawling of Web resources
  - Users: comments, preferences, relationships
- Data Types
  - Unstructured data: Web pages
  - Documents: PDF, PPT, DOC, etc.
- Data Analysis
  - Textual: for content, document, and user-generated comments
  - Media: some basic image analysis for color, faces, size
- Search Engine
  - Fielded: filetype, page title, site, page content
  - Content-based: image similarity in Google
- Query Format
  - Fielded keyword
  - Geographic

PHAROS
- Data Sources
  - Web: crawling of audio/video files
  - File system: NAS and content provider media archives
  - Users: comments, preferences, relationships
- Data Types
  - Structured data: content provider description metadata
  - Media: high-quality video and audio files
  - Semi-structured data: MPEG-7 description of processed media files and user annotations
- Data Analysis
  - Textual: for content metadata and user-generated comments
  - Media: for audio and video
    - Audio/video mood classification, image concept classification, music genre, danceability classification, face recognition and identification, speech to text

PHAROS
- Search Engine
  - Semi-structured: XML search engine for MPEG-7 content description
    - Plus geographic annotations and geo-based ranking
  - 3 content-based engines: one for music, one for images (shots of the video), one for face similarity
- Query Format
  - Fielded keyword: XQuery for the XML search engine
  - Query by example: for image, music and faces
  - MPQF: high-level query language
    - AND/OR/AND THEN for fielded keyword and by-example queries

Query Federation in PHAROS
[Figure: a composite query (JPG image, long/lat, XPath, keywords) goes through query analysis and federation, and is dispatched to four engines:
- Text search (inverted index): "amsterdam"
- XML search (semantic index): \\where[contains("amsterdam")] and \\topic[contains("building")]
- Geo search (R-tree index): 52.37N 4.89E
- Image search (similarity index): JPG]

User Behavior
- Search is evolving: content vs. intent
  - People don't want to search
  - People want to get tasks done and get answers
- Moving towards identifying a user's task
  - Enabling means for task completion
- Search as a process
  - [Figure: a search process from Start to End for the need "I am craving for a good Wiener Schnitzel and a Sachertorte in Vienna", with steps Search, Menu, Reviews, Map]
- Search applications must
  - Support the user in the search process
  - (Try to) infer the user intent to help them accomplish their task
Source: Ricardo Baeza-Yates, Next Generation Search, 2nd SeCo Workshop, Milan, 24/06/2010

Information Seeking [Bates, 2002]
Bates, Marcia J. 2002. Toward an integrated model for information seeking and searching. In: The Fourth International Conference on Information Needs, Seeking and Use in Different Contexts.

Information Foraging
- Information foraging applies the ideas from optimal foraging theory to understand how human users search for information.
- Assumption: humans use "built-in" foraging mechanisms that evolved to help our animal ancestors find food.
Some references
- Fu, Wai-Tat; Pirolli, Peter (2007), "SNIF-ACT: a cognitive model of user navigation on the world wide web", Human-Computer Interaction: 335–412
- Jason Withrow, "Do your links stink?", American Society for Information Science Bulletin, June 1, 2002
- Pirolli, Peter (2009), "An elementary social information foraging model", Proceedings of the 27th international conference on Human factors in computing systems: 605–614

Moving between patches
- Patches of information = websites
- Problem: should I continue foraging in the current patch or look for another patch?
  - Expected gain from continuing in the current patch vs. moving to another

Information seeking funnel [D. Rose, 2008]
- Wandering: the user does not have an information-seeking goal in mind.
- Exploring: the user has a general goal but not a plan for how to achieve it.
- Seeking: the user has started to identify information needs that must be satisfied, but the needs are open-ended.
- Asking: the user has a very specific information need that corresponds to a closed-class question.

Berrypicking vs. Orienteering vs. Teleporting ...
- Berrypicking: information needs change during interactions
  - M.J. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407–431, 1989.
- Orienteering [Teevan et al., CHI 2004]: the searcher issues a quick, imprecise query to get to approximately the right region of the information space, and then follows known paths in small steps that move closer to the goal.
  - Easy! ("perfect" query not needed)
- Teleporting: expert searchers issue longer queries to jump directly to the target.
  - Requires more effort and experience.

… vs. exploratory search
- Exploratory search: the user's intent is primarily to learn more about a topic of interest, by exploring various directions and sources
- "… exploratory search blends querying and browsing strategies" and is different "from retrieval that is best served by analytical strategies…"
  - Marchionini, G. Exploratory search: from finding to understanding. Communications ACM 49(4): 41-46 (2006)
Some references
- Definition and analysis of the problem: White, R. W., and Drucker, S. M. Investigating behavioral variability in web search. 16th WWW Conf. (Banff, Canada, 2007)
- Complex search and exploratory search: Aula, A., and Russell, D.M. Complex and Exploratory Web Search. ISSS: Information Seeking Support Systems Workshop (Chapel Hill, June 2008)

Multi-domain Exploratory Search
- "… search for upcoming concerts close to an attractive location (like a beach, lake, mountain, natural park, and so on), considering also availability of good, close-by hotels"
- Current approach the user can adopt:
  - Independently explore search services
  - Manually combine findings

Multi-domain Exploratory Search
- "… expand the search to get information about available restaurants near the candidate concert locations, news associated to the event and possible options to combine further events scheduled in the same days and located in a close-by place with respect to the first one…"

Existing Approaches _1
- Topic-based search: an instance of exploratory search centered on the goal of collecting information on a subject matter of interest from multiple sources
  - Kosmix: topic discovery engine, keyword search; a topic page summarizes the most relevant information on the subject
  - Hakia: resume pages for topics associated with the user's queries, natural language processing techniques

Existing Approaches _2
- Structured object search: process queries and present results that address entities or real-world objects described in Web pages
  - Google Squared: keyword search; results collected in a table (called a square) featuring all the attributes relevant to the result items as column headers
  - Google Fusion Tables: upload data tables (e.g., spreadsheet files) and join (or "fuse") the data in some column with other tables

Liquid Queries
- "A new paradigm allowing users to formulate and get responses to multi-domain queries through an exploratory information seeking approach, based upon structured information sources exposed as software services…"
- Composite answers obtained by aggregating search results from various domains
  - Highlight the contribution of each search service
  - Join of results based on the structural information afforded by the search service interfaces
- Refine the user query
- Re-shape the result list
Reference: Alessandro Bozzon, Marco Brambilla, Piero Fraternali, Stefano Ceri. Liquid Query: multi-domain exploratory search on the Web. WWW 2010, Raleigh, USA

Liquid Queries Definition _1
- Template-based approach
- It consists of subsetting and parametrizing the resource graph...
[Figure: a resource graph with nodes Concert, Artist, Exhibition, Restaurant, Hotel, Movie, Metro Station, Theatre, Photo, Landmark, News, next to a template subgraph with nodes Photo, Concert, Metro Station, Restaurant, News, Exhibition, Artist, Hotel. Legend: inputs, outputs; GR = global ranking]

Liquid Queries Definition _2
- And then characterizing the user interaction
- Plus:
  - Parametrization of global ranking
  - Data visualization options
  - … and so on
[Figure: the template subgraph (Photo, Concert, Metro Station, Restaurant, News, Exhibition, Artist, Hotel) annotated with an Expand interaction]

Result Exploration Support
- If the current set of combinations is not satisfactory, the user may ask for more values for a service (more one) or for all services (more all)
  - More concerts, more hotels, or more combinations
- Add new information about further domains for selected combinations (expand)
  - Find close-by restaurants or co-located events
- Aggregate information to ease analysis and readability (clustering, grouping)
  - Group events by venue
- Reduce the number of shown items through filtering
  - Total walked distance for the night
- Re-order (ranking or sorting)
- Calculate derived values from existing ones
  - Total walked distance for the night
- Alternative data visualization
  - Map, parallel coordinates, …
DEMO: http://demo.search-computing.org

User Intent
- Understand the user information need
- User intent taxonomy (Broder 2002)
  - Informational: want to learn about something (~40% / 65%)
  - Navigational: want to go to a given page (~25% / 15%)
  - Transactional: want to do something (web-mediated) (~35% / 20%)
- Grey areas
  - Find a good hub
  - Exploratory search
- Example queries from the slide: History nyonya food; Singapore Airlines; Jakarta Weather; Nikon Finepix; Car Rental Kuala Lumpur
[from SIGIR 2008 Tutorial, Baeza-Yates and Jones]

Contextual Content Delivery
- Context vs. personalization
- Trigger the right search depending on the context
  - Task
  - Location
  - User engagement
- Not interested in your personal profile
  - Your favorite restaurant? It depends on where you are!
From Ricardo Baeza-Yates, Next Generation Search, 2nd Search Computing Workshop, Milan, 24/06/2010
Demo: http://sandbox.yahoo.com/Motif

Relevance: the Top-k problem
- Relevance of the results with respect to the request is the main expectation of search engine users
- Top-k relevant items: quickly retrieve a number (k) of highest-ranking tuples in the presence of monotone ranking functions defined on the attributes of underlying relations
Some references
- R. Fagin. Combining fuzzy information from multiple systems. J. Comput. Syst. Sci., 58(1):83–99, 1999.
- F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid. Rank-aware query optimization. In SIGMOD Conference, pages 203–214, 2004
- D. Martinenghi and M. Tagliasacchi: Proximity Rank Join, to appear in PVLDB
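One well-known way to exploit monotonicity is a threshold-style algorithm: scan the ranked lists in parallel, fully score each newly seen object, and stop as soon as the k-th best score reaches the best score any unseen object could still achieve. The sketch below is an illustration of that idea and not necessarily the method of the cited papers. It assumes every object appears in every list, all lists have the same length, and random access by object id is available; the hotel data is invented.

```python
import heapq

def threshold_top_k(ranked_lists, k, aggregate=sum):
    """Return (top-k [(object, score)], sorted-access depth reached).

    ranked_lists: lists of (object_id, score), each sorted by score descending.
    aggregate: a monotone function over an iterable of scores (e.g. sum, min).
    """
    lookups = [dict(lst) for lst in ranked_lists]  # random access by object id
    seen = {}                                      # object id -> aggregated score
    depth = 0
    for depth in range(1, len(ranked_lists[0]) + 1):
        last_scores = []
        for lst in ranked_lists:
            obj, score = lst[depth - 1]            # sorted access at this depth
            last_scores.append(score)
            if obj not in seen:
                seen[obj] = aggregate(lookup[obj] for lookup in lookups)
        # No unseen object can score higher than the aggregate of the last scores read.
        threshold = aggregate(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth

# Hypothetical hotels ranked by two services (integer scores, 0 to 100).
BY_RATING = [("a", 90), ("b", 80), ("c", 30), ("d", 10)]
BY_PROXIMITY = [("b", 90), ("c", 70), ("a", 20), ("d", 10)]

if __name__ == "__main__":
    print(threshold_top_k([BY_RATING, BY_PROXIMITY], k=1))
```

With these lists the scan stops after two rows out of four: hotel "b" scores 170, and no unseen hotel can exceed 80 + 70 = 150.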

Result Diversification
- Relevance is not the only success factor for a result set
- User satisfaction is increased if the first items cover a good spectrum of options
  - If user intent is ambiguous, diversification tries to cover the most likely intents
  - If several top-k items are very similar, they can be clustered together
- Thus: an optimization problem
  - Objective: find the set of k elements that contains the most relevant and diverse items (relevance vs. diversity)
  - Maximal Marginal Relevance [Carbonell and Goldstein 1998]
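Maximal Marginal Relevance can be sketched as a greedy loop: at each step pick the candidate that maximizes lam * relevance minus (1 - lam) * its maximum similarity to the items already selected. The relevance scores, the Jaccard word-overlap similarity, and the example results below are assumptions made for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (illustrative choice)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def mmr_select(candidates, relevance, similarity, k, lam=0.5):
    """Greedy MMR: trade off relevance against redundancy with selected items."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            # Redundancy is the similarity to the closest already-selected item.
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical results for the ambiguous query "jaguar".
RELEVANCE = {
    "jaguar car review": 0.9,
    "jaguar car price": 0.85,
    "jaguar animal habitat": 0.6,
}

if __name__ == "__main__":
    print(mmr_select(RELEVANCE, RELEVANCE, jaccard, k=2, lam=0.5))
    print(mmr_select(RELEVANCE, RELEVANCE, jaccard, k=2, lam=1.0))
```

With lam = 0.5 the second pick is the animal result, covering the other likely intent; with lam = 1.0 the selection degenerates to plain relevance ranking.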

Performance
- Users don't want to waste time waiting for a search result
  - User satisfaction
- Performance is the leading factor for the evaluation of Web search applications
  - Queries per second (QPS)
  - Time to index
- Scalability
  - Content
  - Queries
- Distribution
  - Service-oriented computing
  - Content Delivery Networks
  - But intellectual property may be a concern
- More in section (ARCHITECTURE)

Other Requirements
- Social interaction
  - Content evaluation
  - User relationships and actions as additional content description
- Security & privacy
  - Access policies: collection vs. item level
  - Anonymity
    - Who I am = What I like + What I do + Where I am?
    - A search process tells a lot about who is doing it
Reference: Alessandro Bozzon, Tereza Iofciu, Wolfgang Nejdl, Antonio V. Taddeo, Sascha Tönnies, Role Based Access Control for the interaction with Search Engines, (COOPER) 2007, Crete, Greece.

DESIGN // Designing Web Search Applications
- Reference architecture
- Reference execution processes
- Set of design dimensions
- Development methodology
- Tools supporting the methodology

Search Computing: the architecture
- High-level query: "Where can I attend a DB scientific conference close to a beautiful beach reachable with cheap flights?"
  - Sub-query 1: "Where can I attend a DB scientific conference?"
  - Sub-query 2: "place close to a beautiful beach?"
  - Sub-query 3: "place reachable with cheap flight?"

Search Computing: the architecture
- Low-level query 1: ConfSearch("DB",placeX,dateY)
- Low-level query 2: TourSearch("Beach",PlaceX)
- Low-level query 3: Flight("cost<200",PlaceX,DateY)

Search Computing: the architecture
- Query plan: service invocations and operators execution
- Results, as presented:
  - ESWC - Crete - Olympic
  - CAISE - Hammamet - Alitalia
  - TOOLS - Malaga - EasyJet
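The last step of this pipeline, joining the ranked results of the three services on the shared place and date, can be sketched as follows. Everything concrete here is an assumption for illustration: the record fields, the placeholder dates D1 to D4, the integer scores, the extra non-matching rows, and the choice of summing scores as the global ranking. The conference, place and airline names are the ones shown in the result list above.

```python
# Hypothetical ranked results of the three low-level queries.
CONFERENCES = [
    {"conf": "ESWC", "place": "Crete", "date": "D1", "score": 9},
    {"conf": "CAISE", "place": "Hammamet", "date": "D2", "score": 8},
    {"conf": "TOOLS", "place": "Malaga", "date": "D3", "score": 7},
    {"conf": "XCONF", "place": "Milan", "date": "D4", "score": 10},  # no beach nearby
]
BEACHES = [
    {"place": "Crete", "score": 9},
    {"place": "Hammamet", "score": 6},
    {"place": "Malaga", "score": 5},
]
FLIGHTS = [
    {"airline": "Olympic", "place": "Crete", "date": "D1", "score": 5},
    {"airline": "Alitalia", "place": "Hammamet", "date": "D2", "score": 7},
    {"airline": "EasyJet", "place": "Malaga", "date": "D3", "score": 6},
    {"airline": "OtherAir", "place": "Crete", "date": "D9", "score": 9},  # wrong date
]

def join_services(conferences, beaches, flights):
    """Join on place (conference-beach) and on place+date (conference-flight)."""
    beach_by_place = {b["place"]: b for b in beaches}
    combinations = []
    for conf in conferences:
        beach = beach_by_place.get(conf["place"])
        if beach is None:
            continue  # the conference place has no matching beach
        for flight in flights:
            if flight["place"] == conf["place"] and flight["date"] == conf["date"]:
                combinations.append({
                    "conf": conf["conf"],
                    "place": conf["place"],
                    "airline": flight["airline"],
                    # Assumed global ranking: plain sum of the three service scores.
                    "score": conf["score"] + beach["score"] + flight["score"],
                })
    return sorted(combinations, key=lambda c: -c["score"])

if __name__ == "__main__":
    for combo in join_services(CONFERENCES, BEACHES, FLIGHTS):
        print(combo["conf"], combo["place"], combo["airline"], combo["score"])
```

A real query plan would pull results from each service in chunks and join them incrementally instead of materializing the full lists first.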

Design Dimensions

Design Dimension | Affected Process | Values
Retrieval Policy | Indexing | Push; Pull
Data Homogeneity | Indexing | Homogeneity; Heterogeneity
Data Analysis | Indexing | Mono Annotation; Multi Annotation; Mono Modal; Multi Modal
Search Technology | Indexing, Query and Result Presentation | Search Engine(s) Type: Homogeneity; Heterogeneity
Query Format | Query and Result Presentation, User Interface | Query Type: Mono Modal; Multi Modal; Mono Domain; Multi Domain
User Interaction | User Interface | Direct; Indirect; Active; Passive

Designing Web Search Applications - A MDD approach
- Clear separation of concerns among the involved actors
- Central role of models as key development artifacts
- Automatic code generation, etc.
Reference: Alessandro Bozzon, Marco Brambilla, Piero Fraternali. Conceptual Modeling of Multimedia Search Applications using Rich Process Models. ICWE 2009, June 24-26, 2009, San Sebastian, Spain

Development Methodology
- Inputs
  - Process models (e.g., BPMN)
  - Domain data and process metadata (e.g., ER/UML)
- Model-to-model transformation (e.g., Java / XSLT / ATL)
- Application models (DSL, e.g., WebML)
- Model-to-code transformation
- Running application

An example domain model: Content Analysis / ER
- Content: the objects that relate to the Content Items indexed by a search application
- Annotation: structure of the annotations associated with searchable Content Items during the indexing process
- Usage: usage groups of the application (RBAC model)
- Index: abstraction for the actual physical implementation of search engine indexes

An example process model: Content Analysis / BPMN - WebML
- Coarse indexing process model: Content Registration, Content Analysis, Content Indexation
- Refinement -> fine-grained process model: analysis of audiovisual content through face recognition and identification technologies
- M2M transformation -> application model: Face Recognition and Segmentation activity
- M2T transformation -> running CPA process
  - Console trace of the working annotation technology
  - Process advancement control UI

Modeling User Interface
- The information seeking interaction modes (searching, browsing, monitoring, being aware, social interactions)
- Distilled 30+ information seeking user interaction patterns
  - Query execution and result presentation
  - Keyword (faceted, similarity, geo) search specification and refinement...
  - Browsing, content organization, content-based awareness, etc.
  - Relationship setting, recommendation, etc.
- UI designed as an assembly of standard interaction patterns expressed in WebML
Reference: Alessandro Bozzon, Model-driven development of Search Based Web Applications, Ph.D. Thesis, Politecnico di Milano, April 2009.

Pharos: Modeling User Interface
http://www.youtube.com/watch?v=ZpxyNi6Ht50
[Figure: user interface areas: keyword refinement, faceted refinement, content-based refinement, result presentation]

MDD in Search Computing
- 4 artifact models: Search Service, Query, Query Parameters, Result
- A query plan model
- For the runtime query transformation

Search Computing Model Example: Search Service Model
- ServiceMart: abstraction (e.g., Hotel) of one or more Web service implementations (e.g., Bookings and Expedia), possibly ranked and chunked into pages
  - Attribute: atomic or composite
- AccessPattern: specifies RankingType and AttributeDirection (I/O)
- ConnectionPattern: defined as an input-output relationship between pairs of service marts (for joining them)
  - E.g., the output city of Concert used as input for Hotel
- ServiceInterface: physical interface of the service
  - Exact or Search (ranked)
  - Details about chunk size, cost
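As a rough illustration of how these concepts fit together, the model can be written down as a handful of Python dataclasses. The class names mirror the slide; every field name, the "I"/"O" encoding of attribute direction, and the validity check are assumptions of this sketch and not the actual Search Computing meta-model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Attribute:
    name: str
    direction: str  # "I" (input) or "O" (output); encoding assumed here

@dataclass
class AccessPattern:
    name: str
    ranked: bool
    attributes: List[Attribute]

    def inputs(self) -> List[str]:
        return [a.name for a in self.attributes if a.direction == "I"]

    def outputs(self) -> List[str]:
        return [a.name for a in self.attributes if a.direction == "O"]

@dataclass
class ServiceInterface:
    name: str
    kind: str        # "exact" or "search" (ranked)
    chunk_size: int
    cost: float

@dataclass
class ServiceMart:
    name: str
    access_patterns: List[AccessPattern]
    interfaces: List[ServiceInterface] = field(default_factory=list)

@dataclass
class ConnectionPattern:
    source: AccessPattern
    target: AccessPattern
    pairs: List[Tuple[str, str]]  # (output attribute of source, input attribute of target)

    def is_valid(self) -> bool:
        """Each pair must link an output of the source to an input of the target."""
        return all(
            out in self.source.outputs() and inp in self.target.inputs()
            for out, inp in self.pairs
        )

# Example from the slide: the output city of Concert feeds the input city of Hotel.
concert_ap = AccessPattern("ConcertByArtist", True, [
    Attribute("artist", "I"), Attribute("city", "O"), Attribute("date", "O"),
])
hotel_ap = AccessPattern("HotelByCity", True, [
    Attribute("city", "I"), Attribute("name", "O"),
])
concert = ServiceMart("Concert", [concert_ap])
hotel = ServiceMart("Hotel", [hotel_ap], [ServiceInterface("Expedia", "search", 10, 1.0)])
concert_to_hotel = ConnectionPattern(concert_ap, hotel_ap, [("city", "city")])

if __name__ == "__main__":
    print(concert_to_hotel.is_valid())
```

The chunk size and cost given to the `Expedia` interface are placeholder values.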

Search Computing Query Meta-model
- LogicalQuery
  - is a conjunctive query over services
  - can be defined at an abstract level (AccessPatternLevelQuery) or at the physical level (InterfaceLevelQuery)
- QueryClause
  - a LogicalQuery is composed of a set of QueryClauses
  - a QueryClause can refer to the service mart level or to the service interface level
  - Several types: InvocationClauses, PredicateClauses, JoinClauses, RankingClauses

Search Computing Model Transformations
- Vertical transformations for Queries and ServiceMarts
- QueryToPlan transformation
- Query Execution transformation (at runtime)
- Result transformation (at runtime)
Prototype: http://dbgroup.como.polimi.it/brambilla/SeCoMDA

Search Computing DSLs (& Transformations): Panta Rhei
- Describes both the execution flow and the data flow between nodes of a query plan
- Several types of nodes exist: service invocators, sorting, join, and chunk operators, clocks (defining the frequency of invocations), caches, and others
- The query result model is constructed stepwise, following the execution flow
Reference: D. Braga, S. Ceri, F. Corcoglioniti, M. Grossniklaus, and S. Vadacca: Panta Rhei: An Execution Model for Queries over Web Information Sources, http://www.search-computing.it/sites/cms.web.seco/files/pantarhei2010.pdf

IMPLEMENTATION // From the models to implementation
- Once the design phase is completed: IMPLEMENTATION TIME
- Never implement a search engine/app from scratch!!
- Start from your requirements and design, and:
  - Identify possible existing solutions (REUSE)
  - Select the one that best fits your needs (SHOPPING)
  - Implement what you need (DEPLOY vs. CONFIGURE)
- We will see: open source (products) vs. open search (services)
- A full-fledged model-driven approach can be devised, with model-to-code transformations that generate:
  - The code for the pieces of Web search applications that you need
  - The configuration for the tools of choice

Search Framework vs. Search Engine
- Search engines
  - "provide an interface to a group of items that enables users to specify criteria about an item of interest and have the engine find the matching items. The criteria are referred to as a search query" (Wikipedia)
- Search frameworks
  - Software components that target a set (possibly exhaustive) of the architectural layers of a search application
  - E.g., crawling + analysis + indexing/querying

Open Source Search vs. Open Search
- Open Source → build your own engine
- Open Search → exploit commercial engines
Credits: www2010 Tutorial Open Source Tools, Drake & Jones, Yahoo!

Open Source Search: high-level comparison
Product | License  | Lang.    | Docs            | Ranking       | Users      | Parallel | Scale | Support
Lucene  | Apache   | Java/C++ | Several         | Flexible      | Amazon     | Yes      | TB    | 5/5
Zettair | BSD-like | C        | HTML, TREC, TXT | Flexible      | Research   | No       | TB    | 1/5
Indri   | BSD-like | C++      | Many            | Very flexible | Research   | Yes      | TB    | 1.5/5
Sphinx  | GPL      | C++      | Many            | Flexible      | Craigslist | Yes      | YB    | 4/5
Xapian  | GPL      | C++      | Many            | Flexible      | GMane      | Yes      | TB    | 3/5
RDBMS row (two cells missing in the source): BSD, GPL; C; Limited; Maybe; GB; 4/5
Credits: extended version of www2010 Tutorial Open Source Tools, Drake & Jones, Yahoo!

Open Source Search Benchmark _1
- [Middleton+Baeza-Yates 07]: A comparison of open source search engines
- Vik Singh (Yahoo!), weekend project: index 1M tweets
  - http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
  - Source code available at http://github.com/zooie/opensearch

Lucene
- High-performance, scalable information retrieval (IR) library in Java
  - There are also PyLucene and CLucene
- Apache License
- Strong industrial support with proven scalability
  - Amazon, Netflix, Wikipedia
- Core API for full-text indexing and searching
- Plus plug-in modules
  - Text analysis: text analyzer, tokenizer, token filter, stemmer, N-gram filters, shingle filters
  - Spell-checkers, result highlighting, "more like this"
  - Fuzzy queries, regex queries
  - Geo ranking

Additional Indexing Features
- Documents can be
  - Updated and deleted
  - Boosted → doc.setBoost(1.5F);
- Fields can be
  - Indexed: to search in
  - Stored: to show the original content (e.g., abstract)
  - Coded in term vectors: to enable "more like this"
  - Multivalued (e.g., authors field)
  - Boosted → subjectField.setBoost(1.2F);
- There are built-in field types for numbers, dates, and times, to better support sorting or range search

Additional Querying Features
- Query types: Boolean, Prefix, Phrase, Wildcard, Fuzzy, Fielded
- Scoring function
  - TF-IDF, weighted by term occurrences
  - Term and document boost

More Features
- Thread and multi-JVM safety
  - Any number of read-only IndexReaders may be open at once on a single index
  - Only a single writer may be open on an index at once
  - IndexReaders may be open even while an IndexWriter is making changes to the index
  - Any number of threads can share a single instance of IndexReader or IndexWriter → these classes are not only thread safe, they also scale
- Lucene implements the ACID transactional model
  - Only one transaction (writer) may be open at once

Why Open Search?
- Search as a software service
  - No need for in-house engine development
- Search as a commodity
  - Internals are unknown; the features are taken off the shelf
- JavaScript
  - Access to search features through client-side programming (no server needed at all)
- But … you can search only for Web resources

Open Search APIs
- Google Ajax Search API: http://code.google.com/apis/ajaxsearch/
- Google Custom Search API: http://code.google.com/intl/en/apis/customsearch/
- Microsoft Bing API: http://www.bing.com/toolbox/developers/
- Yahoo Boss (Build your Own Search Service): http://developer.yahoo.com/search/boss/

Google Ajax Search API
- JavaScript widget
- REST API
- No limitations on the number of queries
- 8 results per query
- No change in the result order
- Query Web, Local, Video, Images, Blog, Book, News
- Very limited customization of result presentation
[Figure: code snippets from the Google Ajax Search API documentation]

Google Custom Search API
- Custom search engine for a Web site, blog, or a collection of Web sites
  - Max 5000 sites
- On-demand 24-hour Web indexing
- iFrame or Custom Search Element results for developers; XML for enterprise
- Few result personalization options

Microsoft Bing API
- REST APIs
- Query Ad, Image, News, Phonebook, Video, Web
- Unlimited traffic
- Results can be modified, but with some restrictions
  - You cannot re-rank or merge non-Bing sources

Yahoo! Boss (+ SearchMonkey)
- Unlimited queries
- Blend, re-order, discard
- Full presentation control
- Usage: http://boss.yahooapis.com/ysearch/{vert}/v1/{q}?appid={appid}&start=0&count=10&lang=en&format=xml&view=keyterms
- Verticals: Web, News, Images, Spelling
- In-query syntax: inurl, url, intitle, site, AND/OR, "-", "+"
- Notable web view fields
  - Delicious bookmarks
  - SearchMonkey (microformats)
  - Larger abstracts
  - Extracted entities (keyterms)
Credits: WWW 2010 Tutorial Open Search Tools - Drake & Jones
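The usage template above can be filled in programmatically. The following is only a sketch of the string assembly: the path layout (`{vert}/v1/{q}`) and the parameter names are taken from the slide, while the function name `boss_url`, the placeholder base address, and the sample application id are assumptions made for illustration. No request is sent.

```python
from urllib.parse import quote, urlencode


def boss_url(base, vertical, query, appid, start=0, count=10,
             lang="en", fmt="xml", view="keyterms"):
    """Fill the slide's URL template; this only builds a string."""
    params = urlencode({
        "appid": appid, "start": start, "count": count,
        "lang": lang, "format": fmt, "view": view,
    })
    # The query is part of the path, so it is percent-encoded separately.
    return f"{base}/{vertical}/v1/{quote(query)}?{params}"


# Placeholder base address and application id (both hypothetical).
url = boss_url("http://example.invalid/ysearch", "web",
               "search computing", "MY_APP_ID")
print(url)
```

Keeping the base address as an argument makes it easy to point the same code at a test double instead of a live service.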

SMILA: SeMantic Information Logistics Architecture
http://www.eclipse.org/smila/
- Open source search framework based on SOA principles and standards (e.g. BPEL, SCA), dedicated to the access and integration of (unstructured) information
- Standard interfaces for the integration of the main components of a search application
- Set of out-of-the-box components included
  - Crawlers (Web, FS) and agents (e.g. RSS feeds)
  - Lucene/Solr indexer
  - Interfaces for management, operation and monitoring of the framework and its components
- Written in Java
- Based on OSGi (Eclipse Equinox)
- Cloud-ready

Data Model
- Record: representation of an information item, composed of a set of
  - Attributes: textual metadata (e.g., MIME type)
  - Attachments: binary data (e.g., picture)
  - Annotations: associated with records, attributes, or attachments
- Attributes and attachments are usually produced during the discovery of data
- Annotations are usually produced during the indexing process

Chansonnier Data Model
- Record: a song
  - id: the download URI
- Attributes
  - Link, PageTitle, Description, Keywords, Title, Artists
  - Lyrics
  - Language, language confidence
  - Emotion, emotion confidence
- Attachments
  - Original videos
  - Extracted keyframes
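To make the record structure concrete, here is a small Python sketch of a song record with attributes, attachments, and annotations. It is an illustrative stand-in, not SMILA's actual (Java) API: the class name `Record`, the dictionary-based layout, and every sample value are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Illustrative stand-in for a record; not SMILA's real API."""
    id: str
    attributes: dict = field(default_factory=dict)   # textual metadata
    attachments: dict = field(default_factory=dict)  # binary data
    annotations: dict = field(default_factory=dict)  # notes keyed by the item they describe


# Discovery fills attributes and attachments (all values are made up).
song = Record(id="http://example.org/videos/song-42.flv")
song.attributes["Title"] = "Example Song"
song.attributes["Artists"] = ["Example Artist"]
song.attachments["OriginalVideo"] = b"\x00\x01"

# Later processing adds derived attributes and annotations.
song.attributes["Language"] = "en"
song.attributes["LanguageConfidence"] = 0.93
song.annotations["Lyrics"] = {"source": "hypothetical lyrics lookup"}

print(sorted(song.attributes))
```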

SMILA Architecture
- 3 macro components
  - Each one can run on a dedicated OSGi instance (distribution, replication)
  - Each one aggregates a set of OSGi bundles
- Set of data storages
  - Metadata
  - Binary data
  - Ontologies
  - Delta Indexing
[Figure labels: CONNECTIVITY, SEARCH, PROCESSING]

Processing Pipelines
- Orchestration performed through a BPEL engine (Apache ODE)
[Figure labels: Process Invocation; Condition on a record attribute; Condition on an annotation value; Activity Invocation]

Chansonnier Activities
- Lyrics Wiki: decorates a song with its lyrics by querying the LyricWiki service (http://lyrics.wikia.com/)
- Google Translate: identifies the language of a song's lyrics (with a given confidence)
- Synesketch: analyzes the song's lyrics in order to infer their dominant emotion
- FFMPEG: extracts the keyframes from the song video

Content Analysis
[Figure labels: Media Item, Text Item, Transcoding, Media Analysis, Text Analysis, Media Artifact Generation, Media Annotation, Text Annotation]

Text Processing
- Not all words are equally significant for representing the semantics of a document
  - Usually, nouns (or groups of nouns) are the most representative of a document's content
- Vocabulary: language used to describe documents and queries
- It is worthwhile to preprocess the text of the documents in the collection to determine the terms to be used as index terms
  - Index terms: subset of words selected to represent a document's content

Index Terms and Precision/Recall Trade-off
- Exhaustiveness
  - Cover the whole document content → assign a large number of terms to a document
- Specificity
  - Generic terms: low discriminative power, their frequency is high in all the documents (e.g., "and", "or", "of", etc.)
  - Specific terms: higher discriminative power, variable document frequency → their frequency denotes how representative they are of a document
- Recall
  - High frequency in the overall collection
  - Index expansion via associative techniques (thesauri, clustering)
- Precision
  - High frequency just in some documents

Text Analysis Process
- Document parsing
- Lexical analysis: manage digits, hyphens, punctuation marks, letter case
- Elimination of stopwords (e.g., "and", "or", "of", etc.)
- Thesaurus
- Phrases (noun groups)
- Stemming (reduction of a word to its grammatical root)
- Selection and weighting of index terms (nouns, adjectives, etc.)
[Figure labels: Document Parsing, Lexical Analysis, Stopwords Removal, Phrases, Stemming, Indexing, Weighting; Structure, Full text, Index Terms]

Document Parsing
- What format: pdf/word/excel/html?
- What language?
- What character set?
- Problems:
  - Documents being indexed can include docs in many different languages
  - Sometimes a document or its components can contain multiple languages/formats (e.g., a French email with a Portuguese pdf attachment)
  - What is a unit document? (An email? With attachments? An email with a zip containing documents?)

Lexical Analysis
- Process that transforms an input character stream (the original document's text) into a flow of words (tokens)
- GOAL: identification of words in the text
- Example
  - Input: "Friends, Romans and Countrymen"
  - Output tokens: Friends, Romans, Countrymen
- Each such token is now a candidate for an index entry, after further processing
- But what are valid tokens to emit?

Tokenization
- Trivial case: recognition of blanks as word separators
- Other cases might need to be addressed:
  - Phrases
    - Finland's capital -> Finland? Finlands? Finland's?
    - Hewlett-Packard -> Hewlett and Packard as two tokens?
    - San Francisco: one token or two? How do you decide it is one token?
  - Language issues (normalization)
    - Accents: résumé vs. resume
    - L'ensemble -> one token or two? L? L'? Le?
    - How are your users likely to write their queries for these words? Use locale?
  - Punctuation (e.g., U.S.A. vs. USA)
  - Numbers (100.45 vs. 100,45 vs. 1.0045 E+2)
  - Dates (e.g., March 1st 2009 vs. 03/01/09 vs. 1/03/2009)
  - Case folding
  - …
- It depends on the addressed language
  - E.g., in Chinese, spaces do not separate words (tokenization based on vocabulary)
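The difference between the trivial case and a slightly smarter tokenizer can be seen in a few lines of Python. This is a minimal sketch, not a production tokenizer: the function names are illustrative, and the choice to treat every non-word character as a separator is an assumption that answers the questions above in one particular (and debatable) way.

```python
import re


def whitespace_tokens(text):
    """Trivial case: blanks are the only separators."""
    return text.split()


def word_tokens(text):
    """Runs of letters, digits, or underscores; everything else separates."""
    return re.findall(r"\w+", text)


samples = ["Friends, Romans and Countrymen", "Finland's capital", "Hewlett-Packard"]
for sample in samples:
    print(whitespace_tokens(sample), word_tokens(sample))
```

With this rule "Finland's" becomes two tokens and "Hewlett-Packard" is split at the hyphen. The word "and" is still emitted as a token, because removing stopwords is a separate step.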

Stopword Removal
- Removal of high-frequency words, which carry less information
- Strategies
  - Statistical analysis on the indexed collection
  - Functional terms (articles, conjunctions, auxiliary verbs)
  - A-priori knowledge, based on the IR system domain
- Creation of a "stop-list" with all the terms to remove
  - An English stop list is about 200-300 terms (e.g., "been", "a", "about", "otherwise", "the", etc.)
  - http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
- Removes about 30%-50% of tokens (smaller dictionary)
- It can decrease recall (e.g., "to be or not to be", "let it be")
- Most Web search engines do not remove stopwords [ManningIR]
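A stop-list filter is a one-line operation once the list exists. The sketch below uses a tiny hand-picked stop list (an assumption; a real English list would hold the 200-300 terms mentioned above) and shows how a query made only of stopwords disappears entirely, which is the recall problem noted on the slide.

```python
# Tiny illustrative stop list; the entries are chosen for the examples only.
STOP_WORDS = {"a", "about", "and", "be", "been", "not", "of", "or",
              "otherwise", "the", "to"}


def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


print(remove_stopwords(["Friends", "Romans", "and", "Countrymen"]))
print(remove_stopwords("to be or not to be".split()))
```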

Phrases (noun groups)
- Phrases capture the meaning behind the bag of words and result in multi-term phrases
- Uses of phrases:
  - Added to the query: a query "New" "York" should be modified to search for "New York" → gains of more than 10% in precision and recall
  - Replacing terms in the index: empirically considered not as good as query rewriting

Phrases (noun groups) - Strategies
- Simple phrases
  - Many systems identify phrases as any pair of terms not separated by:
    - a stop term
    - a punctuation mark
    - a special character
  - Phrases occurring fewer than 25 times are removed (decrease in memory requirements)
- NLP
  - Part-of-speech and word-sense tagging: statistical or rule-based methods to identify the part of speech (noun, verb, adjective) of each token
  - Syntactic parsing: identify the key syntactic components of a sentence, usually by tagging according to POS and then applying a grammar (FSA and NFSA)
- Thesauri
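The "simple phrases" strategy can be sketched as follows: adjacent word pairs are counted unless a stop term, punctuation mark, or special character intervenes, and rare pairs are dropped. The function name, the regular expression used to detect punctuation, and the sample text are assumptions; the default threshold of 25 is the value quoted on the slide.

```python
import re
from collections import Counter


def simple_phrases(text, stop_words, min_count=25):
    """Count adjacent term pairs that are not broken by a stop term or punctuation."""
    counts = Counter()
    # Punctuation and special characters end a segment, so no pair spans them.
    for segment in re.split(r"[^\w\s]+", text.lower()):
        words = segment.split()
        for left, right in zip(words, words[1:]):
            if left not in stop_words and right not in stop_words:
                counts[(left, right)] += 1
    # Rare phrases are discarded to save memory.
    return {pair: n for pair, n in counts.items() if n >= min_count}


text = "New York is big. I love New York, and New York loves me."
stop = {"is", "i", "and", "me"}
print(simple_phrases(text, stop, min_count=2))
```

A low `min_count` is used in the call only because the sample text is tiny.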

Thesauri
- A thesaurus is a classification scheme composed of words and phrases whose organization aims at facilitating the expression of ideas in written text
  - E.g.: synonyms and homonyms
- Example entry from Roget's thesaurus:
  - cowardly (adjective): ignobly lacking in courage: cowardly turncoats.
  - Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered
- A thesaurus can be
  - Thematic: specific to the IR system's domain of application (most frequent case), e.g., Thesaurus of Engineering and Scientific Terms
  - Generic
- A thesaurus can be used to
  - Help the user formulate queries
  - Let the system modify queries
  - Select index terms

Thesauri
- Many kinds of thesauri have been developed for IR systems
- Hierarchical: synonyms (RT → related terms, UF → use for), generalization (BT → broader term), specialization (NT → narrower term)
  - ISO and ANSI standards, almost always thematic
  - Manually built and updated by domain experts
- Clustered: clusters (or synsets) of words
  - Non-typed semantic relationships among clusters
  - Each cluster is a set of words having a strong semantic relationship (usually UF)
  - WORDNET
  - Clustered thesauri can be automatically generated if no distinction is made among semantic relationships
- Associative: graph of words, where nodes represent words and edges represent semantic similarity among words
  - Edges can be oriented or not, according to the symmetry of the similarity relationship
  - Edges can be weighted (fuzzy pseudo-thesauri)
  - Can be automatically generated from a collection of documents using co-occurrence relationships
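As a rough illustration of how an associative thesaurus might be derived from co-occurrence, the sketch below builds an undirected, weighted graph in which two terms are linked by the number of documents that contain both. Using raw document co-occurrence counts as the similarity weight, the `min_weight` cut-off, and the sample documents are all simplifying assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(documents, min_weight=2):
    """Return {(term_a, term_b): weight} for terms sharing enough documents."""
    weights = Counter()
    for doc in documents:
        # Each unordered pair of distinct terms counts once per document.
        for a, b in combinations(sorted(set(doc)), 2):
            weights[(a, b)] += 1
    return {edge: w for edge, w in weights.items() if w >= min_weight}


docs = [["wine", "red", "heart"], ["wine", "red"], ["wine", "white"]]
print(cooccurrence_graph(docs))
```

Edges are stored with their terms in alphabetical order, which is one simple way to represent a symmetric (non-oriented) relationship.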

Stemming and Lemmatization
- Goals
  - Reduce terms to their "roots" before indexing
  - Reduce inflectional/variant forms to a base form (language dependent)
  - E.g.:
    - am, are, is -> be
    - car, cars, car's, cars' -> car
    - the boy's cars are different colors -> the boy car be different color
- Stemming: heuristic process that chops off the ends of words in the hope of achieving the goal correctly most of the time
  - Stemming collapses derivationally related words
- Lemmatization: NLP tool that uses dictionaries and morphological analysis of words in order to return the base or dictionary form of a word
  - Lemmatization collapses the different inflectional forms of a lemma
  - Not widely used because it harms performance

Stemming
- Many different algorithms:
  - Porter's algorithm: the most common algorithm for stemming English
    - Porter, Martin F. 1980. An algorithm for suffix stripping. Program 14:130–137.
    - http://www.tartarus.org/˜martin/PorterStemmer/
  - One-pass Lovins stemmer
    - Lovins, Julie Beth. 1968. Development of a stemming algorithm. Translation and
  - Lancaster
    - http://www.comp.lancs.ac.uk/computing/research/stemming/
    - Paice, Chris D. 1990. Another stemmer. SIGIR Forum 24:56–61
  - http://snowball.tartarus.org/demo.php
- Stemming increases recall while harming precision
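To show the heuristic flavour of stemming, here is a deliberately crude suffix stripper. It is not Porter's, Lovins', or the Lancaster algorithm: the rule table, the minimum stem length, and the function name are assumptions chosen only to reproduce the "car, cars, car's, cars'" example given earlier.

```python
# Ordered (suffix, replacement) rules; the first applicable rule wins.
RULES = [("sses", "ss"), ("ies", "y"), ("ing", ""), ("ed", ""),
         ("'s", ""), ("s'", ""), ("s", "")]


def crude_stem(word, min_stem=3):
    """Strip one suffix, provided at least min_stem characters remain."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word


print([crude_stem(w) for w in ["car", "cars", "car's", "cars'"]])
print(crude_stem("news"), crude_stem("is"))
```

The second call hints at the precision cost: "news" collapses onto "new", so a query for one now also matches documents about the other.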

Tools for text analysis _1
- Lucene and Solr contain many text analyzers working on several languages
  - http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
  - CharFilters, Tokenizers, Token Analyzers
- Apache Tika: http://tika.apache.org/
  - Toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries
- GATE (General Architecture for Text Engineering): http://gate.ac.uk/
  - ANNIE (A Nearly-New Information Extraction System): tokenizer, gazetteer, sentence splitter, part-of-speech tagger, named entities transducer, coreference tagger
  - Support for English, Spanish, Chinese, Arabic, French, German, Hindi, Italian, Cebuano, Romanian, Russian
- MALLET (Machine Learning for Language Toolkit): http://mallet.cs.umass.edu/index.php
  - Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

Tools for text analysis _2
- OpenNLP: http://opennlp.sourceforge.net/projects.html
  - Open source projects related to natural language processing
- Cognitive Computation Group, University of Illinois: http://l2r.cs.uiuc.edu/~cogcomp/software.php
  - Chunker, Part-of-Speech tagger, String similarity, Semantic Role Labeler, Named Entity Extractor, etc.
- Supersense Tagger: http://medialab.di.unipi.it/wiki/SuperSense_Tagger
  - Tool for assigning to each noun, verb, adjective and adverb of a sentence one of the 45 standard WordNet supersenses
- Wordnet Domains: http://wndomains.fbk.eu/hierarchy.html
- Synesketch: http://www.synesketch.krcadinac.com/
  - Open source textual emotion recognition

Multimedia Content Analysis
- Computers are not able to catch the underlying meaning of multimedia content. Annotation is needed.
- Manual annotation
  - Expensive
    - It can take up to 10x the duration of the video
    - Problems in scaling to millions of content items
  - Incomplete or inaccurate
    - People might not be able to holistically catch all the meanings associated with a multimedia object
  - Difficult
    - Some content is tedious to describe with words, e.g., a melody without lyrics
- Automatic annotation
  - Reasonably good quality
    - Some technologies have a ~90% precision
  - "Low" cost

Audio Segmentation
- GOAL: split an audio track according to the contained information
  - Music
  - Speech
  - Noise
  - …
- Additional usage
  - Identification and removal of ads

Video Segmentation
- Keyframe segmentation: segment a video track according to its keyframes
  - Fixed-length temporal segments
- Shot detection: automated detection of transitions between shots
  - A shot is a series of consecutive pictures taken contiguously by a single camera and representing a continuous action in time and space
CREDITS: Thorsten Hermes@SSMT2006

Speech Analysis
- Speaker identification: identify people participating in a discussion
  - Additional usage: vocal command execution
- Speech to text: automatically recognize spoken words belonging to an open dictionary
[Figure labels: ERIC, DAVID, JOHN]

Classification of Music Genre
- GOAL: automatically classify the genre and mood of a song
  - Rock, Pop, Jazz, Blues, etc.
  - Happy, aggressive, sad, melancholic, etc.
- Additional usage:
  - Automatic selection of songs for playlist composition
- Tutorial from PHAROS Summer School: http://www.pharos-audiovisual-search.eu/ res/files/SummerSchool/Programme_Summer_School_file.zip

Images: Low-level features
- GOAL: extract implicit characteristics of a picture
  - Luminosity
  - Orientations
  - Textures
  - Color distribution

Face Identification and Recognition
- GOAL: recognize and identify faces in an image
- Usage examples:
  - People counting
  - Security applications
CREDITS: Thorsten Hermes@SSMT2006

Image Concept Detection
- GOAL: recognize the context/concepts of an image
  - E.g., playground, seaside, road, ...
- Extraction of low-level features from raw data
  - Color histograms, color correlograms, color moments, co-occurrence texture matrices, edge direction histograms, etc.
- Features can be used to build discrete classifiers, which may associate semantic concepts to images or regions thereof
  - The MediaMill semantic search engine defines 491 semantic concepts
  - http://www.science.uva.nl/research/mediamill/demo
- Concepts can also be detected from text (e.g., from manual or automatic metadata) using NLP techniques

Tools for media analysis _1
- OpenCV: http://opencv.willowgarage.com/wiki/
  - Framework for image analysis
- Octave: http://www.gnu.org/software/octave/
  - High-level language, primarily intended for numerical computations; it works well with Matlab
- Marsyas (Music Analysis, Retrieval and Synthesis for Audio Signals): http://marsyas.sness.net/
  - Framework for music analysis and retrieval

Tools for media analysis _2
- TINA (TINA Is No Acronym): http://www.tina-vision.net/
  - Open source environment developed to accelerate the process of image analysis research
- Sphinx: http://cmusphinx.sourceforge.net/sphinx4/
  - Speech recognition system written entirely in Java
- WEKA: http://www.cs.waikato.ac.nz/ml/weka/
  - A collection of machine learning algorithms for data mining

Disclaimer
- This section is inspired by the WWW2010 tutorial by Dasdan, Tsioutsiouliklis, Velipasaoglu: Web Search Engine Metrics for Measuring User Satisfaction
  - http://analytics.ncsu.edu/reports/wsmt.pdf
VALIDATION //

Measures for IR Systems
- Measurable properties
  - How fast does it process (index) documents?
    - Number of documents/hour
    - Average document size
  - How fast does it search?
    - Latency as a function of index size
  - Expressiveness of the query language
    - Speed on complex queries
- The key measure: user happiness
  - What is this?
  - Speed of response/size of index are factors
  - But blindingly fast, useless answers won't make a user happy
- How do we quantify user happiness?

Measuring User Happiness
- Who is the user we are trying to make happy? Depends on the setting
- Web engine: users find what they want and return to the engine
  - Can measure the rate of returning users
- eCommerce site: users find what they want and make a purchase
  - Is it the end-user, or the eCommerce site, whose happiness we measure?
  - Measure time to purchase, or the fraction of searchers who become buyers?
- Enterprise (company/govt/academic): care about "user productivity"
  - How much time do my users save when looking for information?
  - Many other criteria having to do with breadth of access, secure access …

Evaluation measures
- Relevance
  - Of search results
- Coverage
  - Presence of content of interest in a catalog
- Diversity
  - Of the result set
- Discovery and latency
  - How many new resources (in the collection) are in the catalog?
  - How long did it take to get the new resources into the catalog?
  - Time to first click
- Freshness

Relevance as a measure of user happiness
- How do you measure relevance?
- In order to assess the performance of an IR system you need a test collection composed of:
  - A benchmark document collection
  - A benchmark suite of queries
  - A binary assessment of either Relevant or Irrelevant for each query-doc pair (gold standard, or ground truth)
- The test collection must be of a reasonable size
  - Need to average performance, since results are very variable over different documents and information needs

Evaluating Relevance
- Set-based evaluation
- Rank-based evaluation with explicit judgment
  - Absolute judgment
  - Preference judgment
- Rank-based evaluation with implicit judgment
  - Direct and indirect evaluation by clicks
- Model-based evaluation
  - Browsing models
  - User satisfaction
[Figure label: NOT COVERED HERE]

Information Need Translation
- Relevance is assessed relative to the need, not to the query
- E.g.,
  - Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
  - Query: wine red white heart attack effective
- A document is relevant if it addresses the stated information need, not just because it contains all the words in the query

Set-based evaluation
- The two most frequent and basic measures for IR effectiveness are precision and recall
- Precision: fraction of retrieved docs that are relevant
  - P(relevant|retrieved)
  - Provides a measure of the "degree of soundness" of the system
  - It does not consider the total number of relevant documents in the collection
- Recall: fraction of relevant docs that are retrieved
  - P(retrieved|relevant)
  - Provides a measure of the "degree of completeness" of the system
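Both measures reduce to simple set arithmetic. The sketch below treats the retrieved results and the relevant documents as sets of document ids; the function name, the sample ids, and the convention of returning 0.0 for an empty set are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).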

Precision / Recall
- Can get high recall (but low precision) by retrieving all docs for all queries!
- Recall is a non-decreasing function of the number of docs retrieved
  - Precision usually decreases (in a good system)
- Precision can be computed at different levels of recall
  - Perhaps most appropriate for web search: all people want are good matches on the first one or two results pages
- Precision-oriented users
  - Web surfers
- Recall-oriented users
  - Professional searchers, paralegals, intelligence analysts

F-Measure
- Combined measure that assesses the trade-off between precision and recall (weighted harmonic mean):
  $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$, where $\beta^2 = \frac{1 - \alpha}{\alpha}$
- Values of β<1 emphasize precision
- Values of β>1 emphasize recall
- People usually use the balanced F1 measure, i.e., with β = 1 or α = ½
- The harmonic mean is a conservative average
[CJ van Rijsbergen, Information Retrieval]
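A short sketch of the measure with β as a parameter follows. It assumes the standard weighted harmonic mean form, written out in the code comment; the function name and the guard for a zero denominator are illustrative choices.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean: (1 + b^2) * P * R / (b^2 * P + R)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # convention used here when the measure is undefined
    return (1 + b2) * precision * recall / denominator


# Same system (P = 0.5, R = 1.0) scored with three different weightings.
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_measure(0.5, 1.0, beta), 4))
```

With precision 0.5 and recall 1.0, the score moves toward the precision value as β drops below 1 and toward the recall value as β grows above 1.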

Difficulties in using precision/recall
- Average over large corpus/query…
  - Need human relevance assessments
  - People aren't reliable assessors
- Assessments have to be binary
  - Nuanced assessments?
- Heavily skewed by corpus/authorship
  - Results may not translate from one domain to another
- The relevance of one document is treated as independent of the relevance of other documents
  - This is also an assumption in most retrieval systems

Rank-Based Evaluation
- In ranked retrieval systems, P and R are values relative to a rank position
- Evaluation is performed by computing precision as a function of recall
  - The function is computed at each rank position in which a relevant document has been retrieved
  - The resulting values are interpolated, yielding a precision/recall plot
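The procedure can be sketched in two steps: collect a (recall, precision) point at every rank that holds a relevant document, then interpolate. The interpolation rule used here, taking the highest precision found at any equal or higher recall level, is the common textbook choice and is an assumption, since the slide does not say which rule it has in mind. The function names and the sample ranking are likewise illustrative.

```python
def precision_recall_points(ranking, relevant):
    """(recall, precision) at each rank where a relevant document appears."""
    relevant = set(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolate(points):
    """Replace each precision by the best precision at the same or higher recall."""
    result, best = [], 0.0
    for recall, precision in reversed(points):
        best = max(best, precision)
        result.append((recall, best))
    return result[::-1]


ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d4", "d5"}
points = precision_recall_points(ranking, relevant)
print(points)
print(interpolate(points))
```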

Measures for Rank-Based Evaluation
- Mean average precision (MAP)
  - Measure of quality at all recall levels
- Precision@K
  - Not all queries will have more than K relevant results
  - Even a perfect system may have a score less than 1.0 for some queries
- R-Precision [Allan 2005]
  - Use a variable result set cut-off for each query, based on the number of its relevant results
- Mean Reciprocal Rank (MRR) [Voorhees 1999]
  - Reciprocal of the rank of the first relevant result, averaged over a population of queries
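The per-query building blocks of these measures are short enough to write out. The sketch below follows the usual definitions; the function names and sample rankings are assumptions. MAP and MRR are then the means of `average_precision` and `reciprocal_rank` over a set of queries, and R-Precision is `precision_at_k` with K set to the number of relevant results for the query.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(doc in relevant for doc in ranking[:k]) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank holding a relevant result."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d4", "d5"}
print(average_precision(ranking, relevant))
print(precision_at_k(ranking, relevant, len(relevant)))  # R-Precision
print(precision_at_k(["d1", "d2"], {"d1"}, 2))  # perfect ranking, yet below 1.0
print(reciprocal_rank(["d2", "d1"], {"d1"}))
```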

Discounted Cumulative Gain (DCG) [Järvelin and Kekäläinen 2002]
- Gain adjustable for the importance of different relevance grades for user satisfaction
  - Example: Very useful: 3; Somewhat useful: 1; Not useful: 0
- Discounting desirable for web ranking
  - Most users don't browse deep
  - Search engines truncate the list of results returned
- DCG yields unbounded scores
  - For each query, divide the DCG by the best attainable DCG for that query → Normalized Discounted Cumulative Gain (nDCG)
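A compact sketch of DCG and nDCG follows, using the example gains of 3, 1, and 0. Two assumptions are made: the discount is the widely used 1/log2(rank + 1), which may differ from the exact formula shown on the original slide, and the "best attainable" DCG is computed by re-sorting the gains of the returned list rather than all judged documents for the query.

```python
import math


def dcg(gains):
    """Sum of gains, each discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the same gains in the ideal order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


print(dcg([3, 1, 0]), ndcg([3, 1, 0]))
print(dcg([0, 1, 3]), ndcg([0, 1, 3]))
```

The second ranking returns the same three documents in the worst order, so its DCG is lower and its nDCG falls below 1.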

Preference Judgment
- Kendall tau coefficient: $\tau = \frac{A - D}{A + D}$
  - Based on counts of preferences (A: preferences in agreement; D: preferences in disagreement)
  - Range in [-1, 1]
  - Robust for incomplete judgments
- Binary Preference (bpref), Buckley and Voorhees (2004): $\mathrm{bpref} = \frac{1}{R} \sum_{r} \left(1 - \frac{N_r}{R}\right)$
  - $N_r$ = number of non-relevant docs above relevant doc r, counted among the first R non-relevant docs
  - R = number of relevant results for the query
  - Designed for incomplete judgments
- Generalized to graded judgments: De Beer and Moens (2006)
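Both measures can be sketched directly from the quantities named above (A, D, N_r, R). The code assumes that the two rankings passed to `kendall_tau` contain the same items with no ties, and that documents without a judgment are simply skipped by `bpref`; the function names and sample data are illustrative.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """(A - D) / (A + D) over all item pairs of two rankings of the same items."""
    position_b = {doc: i for i, doc in enumerate(ranking_b)}
    agree = disagree = 0
    for x, y in combinations(ranking_a, 2):  # x is ranked above y in ranking_a
        if position_b[x] < position_b[y]:
            agree += 1
        else:
            disagree += 1
    return (agree - disagree) / (agree + disagree)


def bpref(ranking, relevant, nonrelevant):
    """Average over relevant docs of 1 - N_r / R; unjudged docs are ignored."""
    r_count = len(relevant)
    seen_nonrelevant, total = 0, 0.0
    for doc in ranking:
        if doc in nonrelevant:
            seen_nonrelevant += 1
        elif doc in relevant:
            # Only the first R non-relevant documents are counted.
            total += 1 - min(seen_nonrelevant, r_count) / r_count
    return total / r_count if r_count else 0.0


print(kendall_tau(["a", "b", "c"], ["a", "c", "b"]))
print(bpref(["n1", "r1", "u1", "r2"], {"r1", "r2"}, {"n1"}))
```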

Presentation Metrics
- How to present information?
  - Which information should be displayed
  - Where it should be displayed
  - Which presentation elements should be used (font, colors, design elements, interaction design)
  - Generalization
- How to measure success?
  - User studies: on-line, on-home, usability, eye tracking, focus groups, surveys
  - Log analysis
  - Editorial: comparative, perceived vs. actual

Heat Maps
- Golden Triangle
- The first result is always considered more trusted and more relevant by default
- Users spend less time reading the lower part of the page
[Marti A. Hearst, Search User Interfaces, Cambridge University Press, 2009]

Thank you for your attention! Questions?
- Alessandro Bozzon, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy, http://home.dei.polimi.it/bozzon
- Marco Brambilla, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy, http://home.dei.polimi.it/mbrambil
- http://www.search-computing.org/book
REFERENCES //

References - Books
- Modern Information Retrieval. Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Addison Wesley Longman Publishing Co. Inc., 2010
- [ManningIR] Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Cambridge University Press, 2008
- Information Retrieval: Algorithms and Heuristics. D.A. Grossman, O. Frieder. Springer, 2004
- Managing Gigabytes. I.H. Witten, A. Moffat, T.C. Bell. Morgan Kaufmann, 1999
- Mining the Web: Analysis of Hypertext and Semi Structured Data. S. Chakrabarti. Morgan Kaufmann, 2002
- Search User Interfaces. Marti A. Hearst. Cambridge University Press, 2009
- Search Computing - Challenges and Directions. Stefano Ceri, Marco Brambilla (eds.). Springer LNCS, vol. 5950, 2010

References - Tutorial
- Web Search Engine Metrics: Direct Metrics to Measure User Satisfaction. Ali Dasdan, Kostas Tsioutsiouliklis, Emre Velipasaoglu (Yahoo!). www2010
- Recent Progress on Inferring Web Searcher Intent. Eugene Agichtein (Emory University). www2010
- Applications of Open Search Tools. Rosie Jones, Ted Drake (Yahoo!). www2010
- [BAEZASeco2010] New Frontiers for Search. Ricardo Baeza-Yates. www2010
- Web Mining for Search. Ricardo Baeza-Yates and Rosie Jones (Yahoo!). SIGIR 2008

References - Papers
[Ramakrishnan and Tomkins 2007] Raghu Ramakrishnan, Andrew Tomkins. Toward a PeopleWeb. IEEE Computer 40(8): 63-72 (2007)
[Broder2002] A. Broder. A taxonomy of web search. SIGIR Forum, 36(2):3–10, 2002.
[BATES2002] Bates, Marcia J. Toward an integrated model for information seeking and searching. In: The Fourth International Conference on Information Needs, Seeking and Use in Different Contexts, 2002
[FU2007] Fu, Wai-Tat; Pirolli, Peter. SNIF-ACT: a cognitive model of user navigation on the world wide web. Human-Computer Interaction: 335–412, 2007
[Withrow2002] Jason Withrow. Do your links stink? American Society for Information Science Bulletin, June 1, 2002
[Pirolli2009] Pirolli, Peter. An elementary social information foraging model. Proceedings of the 27th international conference on Human factors in computing systems: 605–614, 2009
[D. Rose, 2008]
[BATES1989] M.J. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407–431, 1989.
[Teevan et al., CHI 2004] Teevan, J., Alvarado, C., Ackerman, M. and Karger, D. The perfect Search Engine is not Enough: A Study of Orienteering Behavior in Directed Search. Proceedings of ACM CHI 2004, pp. 415-422.
[MARCHIONINI2006] Marchionini, G. Exploratory search: from finding to understanding. Communications ACM 49(4): 41-46 (2006)
[WHITE2007] White, R. W., and Drucker, S. M. Investigating behavioral variability in web search. 16th WWW Conf. (Banff, Canada, 2007)
[AULA2008] Aula, A., and Russell, D.M. Complex and Exploratory Web Search. ISSS: Information Seeking Support Systems Workshop (Chapel Hill, June 2008)

References - Papers
[BozzonEtAL2010] Alessandro Bozzon, Marco Brambilla, Piero Fraternali, Stefano Ceri. Liquid Query: multi-domain exploratory search on the Web. WWW 2010, Raleigh, USA
[FAGIN1999] R. Fagin. Combining fuzzy information from multiple systems. J. Comput. Syst. Sci., 58(1):83–99, 1999.
[ILYAS1999] I. F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid. Rank-aware query optimization. In SIGMOD Conference, pages 203–214, 2004.
[MARTINENGHI2010] D. Martinenghi and M. Tagliasacchi. Proximity Rank Join. To appear in PVLDB
[Carbonell and Goldstein 1998] J. Goldstein and J. Carbonell (1998). Summarization: Using MMR for Diversity-based Reranking. SIGIR'98
[BozzonEtAl2007] Alessandro Bozzon et al. Role Based Access Control for the interaction with Search Engines. International Workshop on Collaborative Open Environments for Project-Centered Learning (COOPER) 2007, Crete, Greece.
[BozzonEtAl2009] Alessandro Bozzon, Marco Brambilla, Piero Fraternali. Conceptual Modeling of Multimedia Search Applications using Rich Process Models. ICWE 2009, June 24-26, 2009, San Sebastian, Spain
[BozzonThesis2009] Alessandro Bozzon. Model-driven development of Search Based Web Applications. Ph.D. Thesis, Politecnico di Milano, April 2009.
[BragaEtAl2010] D. Braga, S. Ceri, F. Corcoglioniti, M. Grossniklaus, and S. Vadacca. Panta Rhei: An Execution Model for Queries over Web Information Sources. http://www.search-computing.it/sites/cms.web.seco/files/pantarhei2010.pdf

References - Papers
[Allan 2005] J. Allan (2005). HARD track overview in TREC 2005: High accuracy retrieval from documents.
[Voorhees 1999] E.M. Voorhees (1999). TREC-8 question answering track report
[Järvelin and Kekäläinen 2002] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. IS, 20(4): 422-446, 2002
[Buckley and Voorhees (2004)] C. Buckley and E.M. Voorhees. Retrieval evaluation with incomplete information. SIGIR'04.
[De Beer and Moens (2006)] De Beer, Jan; Moens, Marie-Francine. Rpref: a generalization of Bpref towards graded relevance judgments. SIGIR 2006, Seattle, USA, 6-11 August 2006, pages 637-638, ACM

References - Links
Search Computing Course Lecture Notes: http://www.search-computing.it/course
Fabio Aiolli, Università di Padova: http://www.math.unipd.it/~aiolli/corsi/0809/IR/IR.html
http://www.ir.disco.unimib.it/


Editor's Notes