Indexing and Searching Cross
  Media Content in a Social
          Network
    Pierfrancesco Bellini, Daniele Cenni, Paolo Nesi


                 University of Florence
         Department of Systems and Informatics
Distributed Systems and Internet Technology Laboratory


               ECLAP Conference, May 7-9, 2012
ECLAP Social Network

   ECLAP is a Digital Library on Performing
    Arts connected with Europeana

   ECLAP is a Social Network (blogs,
    forums, comments, tagging, voting, …)
Goals/Requirements
   Develop an Indexing/Searching solution for ECLAP
    Social Network allowing:
       Indexing multilingual crossmedia content metadata and
        data (e.g. documents)
       Indexing portal blogs, forums, events, group pages,
        comments, etc.
       Efficient multilingual search (keyword search and
        advanced search) supporting:
            misspelled words (e.g. shespeare)
            partial word search
       Sorting and filtering search results
       Re-indexing all data without blocking the system
       Logging and monitoring user activity
       …
   Evaluate the Indexing/Searching service
ECLAP Data Model
   [Figure: data model diagram. Entities shown: Group/Channel,
    TaxonomyTerm, Content, Comment, Metadata (Performing Arts,
    Dublin Core, Technical), Blog, WebPage, Forum, Object,
    Playlist, Document, Collection, Annotation, AVObject,
    Image, Video, Audio. Associations carry multiplicities
    such as 0..n, 1..n and 1..2.]
Indexing
   Indexing & Search system
       Based on Apache Solr
   Multilingual aspects
       Translate the metadata or translate the query?
       We use metadata translation
   Indexing schema
       Dublin Core + DCTerms (multi language)
       Performing Arts
       Technical (provider, content type, GPS, IPR, duration, quality, …)
       Groups associations (multi language)
       Taxonomy associations (multi language)
       Comments & multi language tags
       FullText of the textual digital resources
Indexing
Media Type                                     DC (ML)   Tech   Perf. Arts   Full Text   Taxonomy, Group (ML)   Comment, Tags (ML)   Votes
Audio/Video/Image                              Y         Y      Y                        Y                      Y                    Y
Document (pdf, doc, …)                         Y         Y      Y            Y           Y                      Y                    Y
CrossMedia (html, MPEG21, …)                   Y         Y      Y            Y           Y                      Y                    Y
Aggregations (playlist, collection, …)         Y         Y      Y                        Y                      Y                    Y
Info text (blog, web pages, forum, events, …)  (Y)                           Y                                  Y
Indexing
   Multilingual fields
       title_en, title_it, title_de, title_fr, title_ca, …
   Catch-all fields
    Field            Component fields                Boost Weight
    text             pdf_*, doc_*, ppt_*, htm_*, …   1.0
    body             body_*                          0.5
    title            title_*                         3.1
    description      description_*                   2.0
    contributor      contributor_*                   0.8
    subject          subject_*                       1.5
    taxonomy         taxonomy_*                      0.8
    PerformingArts   PerformingArtsMetadata.#        1.0
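A minimal sketch of how the boost weights in the table above could be turned into a query-time setting. It assumes the weights are passed as `field^boost` pairs, the form accepted by the `qf` parameter of Solr's DisMax query parser; the slides do not say whether ECLAP applies the boosts this way, and the helper names `build_qf` and `language_fields` are hypothetical.

```python
# Boost weights copied from the catch-all table above.
CATCH_ALL_BOOSTS = {
    "text": 1.0,
    "body": 0.5,
    "title": 3.1,
    "description": 2.0,
    "contributor": 0.8,
    "subject": 1.5,
    "taxonomy": 0.8,
    "PerformingArts": 1.0,
}


def build_qf(boosts):
    """Render boosts as space-separated `field^weight` pairs (hypothetical helper)."""
    return " ".join(f"{field}^{weight}" for field, weight in boosts.items())


def language_fields(base, languages):
    """Per-language field names following the title_en / title_it pattern."""
    return [f"{base}_{lang}" for lang in languages]


qf = build_qf(CATCH_ALL_BOOSTS)
title_fields = language_fields("title", ["en", "it", "de", "fr", "ca"])
print(qf)
print(title_fields)
```

With these weights a match in a title counts roughly six times as much as the same match in a body field (3.1 versus 0.5).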
Indexing
   Re-indexing
       In case of a new indexing schema or index
        corruption, the search system should not be
        blocked
       Re-indexing is done on a separate indexing
        machine while the production system keeps using the
        current index
       During re-indexing, newly uploaded/modified
        content is marked to be re-indexed when the
        new index is put in production
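The re-indexing procedure above can be sketched as a small coordinator that remembers which items changed while a rebuild is running and re-indexes them once the new index goes live. This is an illustrative sketch only: the class `ReindexCoordinator` and the callbacks `swap_index` and `reindex` are hypothetical names, not part of the ECLAP code.

```python
class ReindexCoordinator:
    """Tracks content changed during an off-line index rebuild (illustrative)."""

    def __init__(self):
        self.rebuilding = False
        self.pending = set()  # ids uploaded/modified while the rebuild runs

    def start_rebuild(self):
        self.rebuilding = True
        self.pending.clear()

    def on_content_changed(self, content_id):
        """Mark the item if a rebuild is in progress; return True when marked."""
        if self.rebuilding:
            self.pending.add(content_id)
            return True
        return False

    def finish_rebuild(self, swap_index, reindex):
        """Put the new index in production, then re-index the marked items."""
        swap_index()
        for content_id in sorted(self.pending):
            reindex(content_id)
        self.pending.clear()
        self.rebuilding = False
```

Searches keep hitting the old index until `swap_index` runs, so the search system is never blocked by the rebuild.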
Searching
   Full text search
       Uses the catch-all fields to search for
        keywords in the most important fields in all
        languages (title, description, text, body,
        subject, …)
   Fuzzy search
       Allows matching mistyped words
   Deep search
       Allows searching for partial words
   Relevance & boosting of terms
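To make the fuzzy and partial-word searches above concrete, the sketch below computes the edit distance between the misspelling used earlier ("shespeare") and the intended word, and builds query terms in the Lucene classic query syntax (`term~` for fuzzy, `term*` for a trailing wildcard). How ECLAP actually implements "deep search" is not stated in the slides; a trailing wildcard only covers prefixes, and the helper names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_term(word):
    """Lucene classic syntax for a fuzzy match on one term."""
    return f"{word}~"


def prefix_term(word):
    """Lucene classic syntax for a trailing-wildcard (prefix) match."""
    return f"{word}*"


distance = edit_distance("shespeare", "shakespeare")
print(distance, fuzzy_term("shespeare"), prefix_term("shake"))
```

The two spellings differ by two inserted letters, which is why a fuzzy match can still reach the intended term.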
Searching
   Faceted search
Searching
   Advanced search
Search Facility Assessment
   Analysis performed over 3 months
   11294 visits (6032 unique visits)
   62768 page views (avg 5.76 pages per visit)
   7.29 minutes of permanence (time spent) on the portal
   30502 content accesses (view, play and
    download)
Search Facility Assessment
users                     # Full Text   # Faceted   # Last Posted   # Featured   # Popular
                          Query         Query       List            List         List
simple registered         323           24          4               22           17
partners                  1094          21          27              19           9
anonymous                 2634          147         234             302          213
Total                     4051          192         265             343          239
Clicks after query/list   1564          200         318             2799         231
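As a worked check of the table above, the sketch below re-derives the per-column totals from the three user classes and computes clicks per query or list request. Reading the last row as "clicks made on results after each kind of request" is an assumption; the slides do not define the ratio.

```python
COLUMNS = ["Full Text Query", "Faceted Query", "Last Posted List",
           "Featured List", "Popular List"]
ROWS = {
    "simple registered": [323, 24, 4, 22, 17],
    "partners": [1094, 21, 27, 19, 9],
    "anonymous": [2634, 147, 234, 302, 213],
}
REPORTED_TOTAL = [4051, 192, 265, 343, 239]
CLICKS_AFTER = [1564, 200, 318, 2799, 231]

# Column-wise sums over the three user classes.
totals = [sum(column) for column in zip(*ROWS.values())]

# Clicks per request for each access path (assumed interpretation).
clicks_per_request = {
    name: clicks / total
    for name, clicks, total in zip(COLUMNS, CLICKS_AFTER, totals)
}

for name, ratio in clicks_per_request.items():
    print(f"{name}: {ratio:.2f}")
```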
Search Facility Assessment
   Click order distribution
   [Figure: click order distribution chart; surviving label: "First page"]
Conclusions
   Solution allows indexing multilingual
    metadata and texts
   Searching & filtering results
   Search facility assessment shows that
    search is an actively used feature
Context & Assessment
   Context
       Social Network
            User and content items
       Content distribution portal
            Video on demand portal
       Archive, digital library, Performing Arts
            http://www.eclap.eu
   Assessment
       User behavior
            Log user actions on the Web portal
       User happiness
            Measure the level of user satisfaction with the
             services offered
Logging User Profile
   User Profile
       Registered or anonymous, uid (user id)
       Timestamp YY-mm-dd hh:mm:ss
       IP address, Proxy type etc.
       Platform (OS, Browser)
       GeoIP data (Country, Region, City)
       Friends, connections
          Betweenness, Eccentricity
          Joined groups

          User preferred contents
Understanding User behavior
   Online survey
       A simple module, on the right side of the portal
       Presenting 3 - 4 questions per topic (depending on the
        current portal section visited)
   Stat Drupal Modules
       Custom implemented modules
       Log User Activity
       Keep track and depict main figures about portal activity
       Can be filtered by date, user, type of content, group,
        type of activity (content enrichment, social promotion,
        networking etc.)
   Google Analytics
Understanding User behavior
   Top Metrics
      Avg # Visits/User
      Avg # Queries/User
      Avg # Clicks/User
      Avg Visit duration
      Avg Query length
      Query refinement rate
      Next Page Click Rate
      Back Page Click Rate
      Frequency of searching (once/day, week etc.)
      Success of searching (assessment...)
      …
Logging User Behavior
   Logging user activities on the portal
      Downloads/Views

      Queries

      Anonymous/Registered portal accesses
       (login/logout)
      Adding/Updating/Deleting digital contents

      Menu clicks

      Content Upload

      Content Management

      Social Promotion & Networking
Logging User Behavior
   Content Accesses (Download/View)
       Axmedis Content
          Pdf, Document, Video, Playlist, Slide, Flash, Image,
           Excel, Archive, Audio, Tool, Collection
       Drupal Content
          Page, Blog, Event, Forum, Group, Comment

   Distribution of Content Access per
       Access Type, Portal, Platform, Section, Locale,
        Country, Region, City, Axoid, Nid, Content Type,
        Partner, User, Timestamp
Logging User Behavior
   Queries (Simple, Faceted, Advanced)
       Distribution of Queries per
            User, Content type, Device, IP, User Agent, Query Type,
             Country, Region, City, Locale, Filter (faceted)
   Query Cloud
   Keyword Cloud
   IPR Wizard
       Definition and usage of IPR Models
   Metadata Editor
       Access and usage
            Add, Edit metadata
   Video Annotations
       Personal content
       Other users content
Logging User Behavior
   Social Promotion & Networking
       Analysis of
            Eccentricity
            Betweenness
            Connections
       Creation, Access of Public/Private Web Pages
       Activity on Forums, Blogs, Groups or between users
            New Contents
            Comments to Objects/Web Pages
            Invited People
            Featured Objects
            Recommendations, suggested content
            Export/Import of links to/from other SN
            Private Messages
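Eccentricity, one of the network measures listed above, is the greatest shortest-path distance from a user to any other user reachable from them. A minimal breadth-first-search sketch follows; the friendship graph is invented for illustration and the function covers only nodes reachable from the source.

```python
from collections import deque


def eccentricity(graph, source):
    """Largest BFS distance from source to any node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return max(dist.values())


# Hypothetical friendship chain: a - b - c - d
friends = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print({user: eccentricity(friends, user) for user in friends})
```

Users at the ends of the chain have the highest eccentricity, while the two in the middle sit closer to everyone else.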
Logging User Behavior
   Menu Clicks
       Distribution of clicks per
            User, IP, Locale, Timestamp etc.
       LAST POSTED, FEATURED, CALENDAR, ADVANCED SEARCH,
        UPLOAD AND INGEST, POPULAR, MY CONTENT, MY GROUPS ,
        MY COLLEAGUES, GET AFFILIATED, TERMS OF USE, PRIVACY
        POLICY, TOP RATED, COURSES, LESS POPULAR, UPLOAD NEW
        CONTENT, etc.
   Ranking/Voting
       # of ranked items
       Distribution per
            User, IP, Locale, Timestamp etc.
   QR Code
       Access from Mobile Devices
   Workflow
       Distribution of Workflow Type
   Content Upload
       Distribution of uploads per
            User, Partner, Timestamp
Content Access
                      September 1st – November 30th 2011

Affiliation                        # View/Play   # Download
DSI                                46            0
Not partners/Affiliated            1292          14
Partners/Affiliated (except DSI)   6712          119
Public Users                       21418         947

Affiliation                        # View/Play   # Download
DSI                                3             0
Not partners/Affiliated            100           4
Partners/Affiliated (except DSI)   218           11
Public Users                       2225          869
Menu Clicks
                 September 1st – November 30th 2011

Menu                         # Clicks
ABOUT->ECLAP DESCRIPTION     671
EVENTS->PAST AND FUTURE      536
SEARCH->GROUPS               524
ABOUT->ECLAP NEWS BLOG       463
CONTENT->LAST POSTED         265
CONTENT->FEATURED            343
HOWTO->UPLOAD AND INGEST     330
SEARCH->ADVANCED SEARCH      314
EVENTS->CALENDAR             298
ABOUT->ECLAP PARTNERS        269
ABOUT->MAIN CONTACT          249
CONTENT->POPULAR             239
Search
                     September 1st – November 30th 2011

Affiliation                        # Simple Queries   # Faceted Queries
DSI                                13                 0
Not partners/Affiliated            323                24
Partners/Affiliated (except DSI)   1094               21
Public Users                       2634               147

Affiliation                        # Advanced Queries
DSI                                0
Not partners/Affiliated            18
Partners/Affiliated (except DSI)   4
Drupal Stat Metrics
                September 1st – November 30th 2011

   Content Access per nid
Drupal Stat Metrics
              September 1st – November 30th 2011

   Views by Query
Drupal Stat Metrics
               September 1st – November 30th 2011

   Content Access per Platform
Understanding User behavior
   Drupal Stats (collapsible menus on the right)
Google Analytics vs Drupal Stats
Google Analytics
   Pros
       Traffic source data
       Bounce rate
       Recency (since when)
       Loyalty (how often)
       Session times
   Cons
       IP approach: each IP is considered a unique visitor
       Can’t deal with specific actions on the portal (e.g.
        downloads, queries)

Drupal Stats
   Pros
       Identity approach
       Actions
       Download
       User Access
       Queries
       Content type filtering
   Cons
       Can’t deal with traffic source data and bounce rate
       Session time is only a rough approximation
Sorting Results
   Sorting by
       Upload Time (date the document was first uploaded)
       Update Time (date the document was last updated)
       Score (document relevance to the search query)
   Combined with faceting and paging
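Sorting combined with faceting and paging maps naturally onto Solr's common request parameters (`q`, `sort`, `start`, `rows`, `facet`, `facet.field`, `fq`). The sketch below builds such a query string; the field names `upload_time`, `content_type` and `language` are hypothetical placeholders, not the ECLAP schema.

```python
from urllib.parse import urlencode


def build_select_query(keywords, sort_field="score", descending=True,
                       page=0, page_size=10, facet_fields=(), filters=()):
    """Build the query string for a Solr select request (illustrative)."""
    direction = "desc" if descending else "asc"
    params = [
        ("q", keywords),
        ("sort", f"{sort_field} {direction}"),
        ("start", page * page_size),  # offset of the first result on the page
        ("rows", page_size),
    ]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    params.extend(("fq", clause) for clause in filters)
    return urlencode(params)


# Third page of results, newest uploads first, faceted by a hypothetical field.
query = build_select_query("dance", sort_field="upload_time", page=2,
                           facet_fields=["content_type"],
                           filters=["language:en"])
print(query)
```

Selecting a facet value in the interface would add another `fq` clause, leaving the sort order and paging untouched.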
Suggestions
   Real-time: while the user types a query, the system
    suggests similar searches
       ecl…
           eclap
           eclap-de-2-1-1-user
           eclap-de-2-2-1-usergroup
           …
ECLAP Survey
Indexing/Searching Reqs
   Enriching search experience
       Results Sorting
       Suggestions
   Large # of contents (~ 10^4 to 10^6)
       External Indexing Service
   Hidden/Private contents management
   Monitoring Exceptions
       Email notifications
   Search Engine Friendly (Google, Bing, Yahoo etc.)
       content site crawling → HTML dumping
External Indexing Service 1/3
   Setup an external service to avoid server
    overloading when building the index
       Taxonomization
       Indexing (with exceptions monitoring)
       Index Synchronization
       Old Index replacement with new one
       Index updating
       Old contents cleaning (optional)
External Indexing Service 2/3
   Taxonomization
       Has a cost → pre-computing
       Digital content
       Execution Rule (JS)
       Indexed with object records

   Taxonomy          Parent
   Performing Arts   -
   Cinema            Performing Arts
   Music             Performing Arts
   Documentary       Cinema
   Historical        Cinema
   Classical         Music
   Pop               Music

   Taxonomy tree
       Performing Arts
           Cinema: Documentary, Historical
           Music: Classical, Pop

   Object Taxonomy
       Performing Arts
           Cinema: Documentary
           Music: Classical
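The pre-computation step can be sketched as expanding each object's taxonomy terms with all their ancestors before the object record is indexed, using the Taxonomy/Parent table from this slide. The function name `with_ancestors` is hypothetical, and the real ECLAP rule is a JS execution rule, not this Python code.

```python
# Parent of each taxonomy term, as listed on the slide (None marks the root).
PARENT = {
    "Performing Arts": None,
    "Cinema": "Performing Arts",
    "Music": "Performing Arts",
    "Documentary": "Cinema",
    "Historical": "Cinema",
    "Classical": "Music",
    "Pop": "Music",
}


def with_ancestors(terms, parent=PARENT):
    """Return the given terms plus every ancestor up to the root."""
    expanded = set()
    for term in terms:
        # Walk upwards; stop early when the rest of the path is already known.
        while term is not None and term not in expanded:
            expanded.add(term)
            term = parent[term]
    return expanded


# An object tagged only with two leaf terms.
object_terms = with_ancestors(["Documentary", "Classical"])
print(sorted(object_terms))
```

Storing the expanded set with the object record lets a search on "Cinema" match an object tagged only as "Documentary" without walking the tree at query time.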
External Indexing Service 3/3
   Indexing with exceptions monitoring
       Real-time notifying system
       Event time and type (add, update)
       Full stacktrace info
       Customizable recipients
       Object Indexing Recovery
            Resource Parse Error → Metadata Indexing
   Index synchronization
       During external indexing, contents may be
        updated/added/deleted on the original index
       Need to update these contents on the index (state flag)

       Indexed   External Indexed
       1         1
       0         1
Search Engine Friendly
   HTML dump service
      Java external service

      Periodically invoked by an AXCP rule

      Full metadata exporting

      Thumbnail

      Resource link

      Multilanguage

      Paginated results
Conclusions
   Drupal integrated solution for user behavior tracking
    and analysis
       Logging
       Stat Data Graph
       Online Survey
   External Indexing Service
       Avoids server overloading
       HA of query service
       Error recovery
       Detailed event notification system
       Index Optimization
   Dumping tool for portal contents (SEO)
       Full metadata HTML exporting
       Scheduled Service
Future Work
   Keep collecting Data
   Deeper Data Analysis
       User Sessions
          1st, 2nd, ..., nth click → average user behavior
       Depict a modular view of the system usage
          Popularity/Usability for each feature &
           functionality
       Social Network Analysis (SNA)
          Huge population
          User relationships, connections, friendships
Q&A
APPENDIX
Architecture (former)
   [Figure: architecture diagram. Components shown: Index Rebuilder,
    Indexing Rule JS, Rule JS, SolrJ Client, Grid Node, Rule Scheduler,
    AXCP, Solr Cell, XML/HTTP, JSP, Indexing Service, Indexing Module,
    Searching Module, Apache Solr, Apache Tomcat, Drupal, Apache HTTP]
Drupal
What is it?
Open source content management platform

Developed by Dries Buytaert in 2001

Written in PHP

Users: The Economist, Examiner.com, The
White House, data.gov.uk
Runs on a WEB server (e.g. Apache, IIS) and
a database (e.g. MySQL, PostgreSQL)
Apache Lucene
What is it?
High-performance, full-featured text
search engine library (indexing and
searching documents)
Developed by Doug Cutting (2000, on
SourceForge); joined the Apache Software
Foundation in 2001
Written entirely in Java
Users: Wikipedia, Technorati, Nabble,
TheServerSide, Akamai, SourceForge
Apache Lucene
Features
Ranked searching (best results returned first)
Powerful query types: phrase queries, wildcard
queries, proximity queries, range queries and more
Fielded searching (e.g., title, author, contents)
Date-range searching
Sorting by any field
Multiple-index searching with merged results
Allows simultaneous update and searching
Apache Lucene
Features
Documents added via IndexWriter

Document = a collection of fields

No config files, dynamic field typing

Flexible text analysis: tokenizers, filters

Search for documents via IndexSearcher
     Hits = search(Query,Filter,Sort,topN)

Scoring:    tf * idf * lengthNorm
Apache Solr
What is it?
A full text search server based on
Lucene (Lucene sub-project)
Developed by Yonik Seeley at CNET
Networks (2004), donated to the Apache
Software Foundation (2006)
Written in Java, deployable as a WAR
Users: CNET Reviews, CNET Channel,
shopper.com, news.com, nines.org,
krugle.com, oodle.com, booklooker.de
Apache Solr
Features
Advanced Full-Text Search Capabilities
Optimized for High Volume Web Traffic
Standards Based Open Interfaces (XML, JSON,
HTTP)
Web Administration Interface
Server statistics exposed over JMX for
monitoring
Scalability, efficient Replication to other Solr
Search Servers
Flexible and Adaptable with XML configuration
Extensible Plugin Architecture

Indexing and Searching Cross Media Content in a Social Network

  • 1. Indexing and Searching Cross Media Content in a Social Network Pierfrancesco Bellini, Daniele Cenni, Paolo Nesi University of Florence Department of Systems and Informatics Distributed Systems and Internet Technology Laboratory ECLAP Conference, May 7-9, 2012
  • 2. ECLAP Social Network  ECLAP is a Digital Library on Performing Arts connected with Europeana  ECLAP is a Social Network (blogs, forums, comments, tagging, voting, …)
  • 3. Goals/Requirements  Develop an Indexing/Searching solution for ECLAP Social Network allowing:  Indexing multilingual crossmedia content metadata and data (e.g. documents)  Indexing portal blogs, forums, events, group pages, comments, etc.  Efficient multilingual search (keyword search and advanced search) supporting:  misspelled words (e.g. shespeare)  partial word search  Sorting and filtering search results  re-index the whole data without blocking the system  Log and monitor users activity  …  Evaluate the Indexing/Searchig service
  • 4. ECLAP Data Model Group/Channel 0..n 0..n 0..n 0..n 0..n 0..n 1 0..n TaxonomyTerm Content Comment Performing Arts Metadata Dublin Core Technical 1..n Blog WebPage Forum Object 0..n Playlist Document Collection 0..n 1..2 0..n Annotation AVObject 1..n Image Video Audio 4
  • 5. Indexing  Indexing & Search system  Based on Apache Solr  Multilingual aspects  Translate the metadata or translate the query?  We use metadata translation  Indexing schema  Dublin Core + DCTerms (multi language)  Performing Arts  Technical (provider, content type, GPS, IPR, duration, quality, …)  Groups associations (multi language)  Taxonomy associations (multi language)  Comments & multi language tags  FullText of the textual digital resources
  • 6. Indexing Taxnmy, Comment, DC Perf. Full Group Tags Media Type (ML) Tech Arts Text (ML) (ML) Votes Audio/Video/ Image Y Y Y Y Y Y Document (pdf, doc, …) Y Y Y Y Y Y Y CrossMedia (html, MPEG21,…) Y Y Y Y Y Y Y Aggregations (playlist, Y Y Y Y Y Y collection, …) Info text (blog, web (Y) Y Y pages, forum, events, …)
  • 7. Indexing  Multilingual fields  title_en, title_it, title_de, title_fr, title_ca, …  Catch-all fields Component fields Boost Weight text pdf_*, doc_*, ppt_*, htm_*, … 1.0 body body_* 0.5 title title_* 3.1 description description_* 2.0 contributor contributor_* 0.8 subject subject_* 1.5 taxonomy taxonomy_* 0.8 PerformingArts PerformingArtsMetadata.# 1.0
  • 8. Indexing  Re-indexing  In case of new indexing schema or index corruption the search system should not be blocked  The re-indexing is done on a separete indexing machine while the production system uses the actual index  During re-index the new uploaded/modified content is marked to be reindexed when the new index is put in production
  • 9. Searching  Full text search  Uses the catch all fields to search for keywords in most important fields in all languages (title, description, text, body, subject,…)  Fuzzy search  Allows matching mistyped words  Deep search  Allows searching for partial words  Relevance & boosting of terms
  • 10. Searching  Faceted search
  • 11. Searching  Advanced search
  • 12. Search Facility Assessment  Analysis performed over 3 months  11294 visits (6032 unique visits)  62768 page views (avg 5.76 pages per visit)  7.29 minutes spent on the portal  30502 content accesses (view, play and download)
  • 13. Search Facility Assessment  Columns: users; # Full Text Query; # Faceted Query; # Last Posted List; # Featured List; # Popular List  simple registered: 323; 24; 4; 22; 17  partners: 1094; 21; 27; 19; 9  anonymous: 2634; 147; 234; 302; 213  Total: 4051; 192; 265; 343; 239  Clicks after query/list: 1564; 200; 318; 2799; 231
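The totals and the "clicks after query/list" row on this slide can be combined into a clicks-per-request ratio for each entry point. The short sketch below does only that arithmetic; the dictionary keys are shorthand labels chosen here, not names from the portal.

```python
# Totals and follow-up clicks copied from the slide's table.
totals = {"full_text": 4051, "faceted": 192, "last_posted": 265, "featured": 343, "popular": 239}
clicks = {"full_text": 1564, "faceted": 200, "last_posted": 318, "featured": 2799, "popular": 231}

# Average number of result clicks that followed one query or one list view.
clicks_per_request = {name: clicks[name] / totals[name] for name in totals}

for name, ratio in clicks_per_request.items():
    print(f"{name:12s} {ratio:.2f}")
```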
  • 14. Search Facility Assessment  Click order distribution First page
  • 15. Conclusions  The solution allows indexing multilingual metadata and texts  Searching & filtering results  The search facility assessment shows that search is an actively used feature
  • 16. Context & Assessment  Context  Social Network  User and content items  Content distribution portal  Video on demand portal  Archive, digital library, Performing Arts  http://www.eclap.eu  Assessment  User behavior  Log user actions on the Web portal  User happiness  Measure the level of user satisfaction about the exposed services
  • 17. Logging User Profile  User Profile  Registered or anonymous, uid (user id)  Timestamp YY-mm-dd hh:mm:ss  IP address, Proxy type etc.  Platform (OS, Browser)  GeoIP data (Country, Region, City)  Friends, connections  Betweenness, Eccentricity  Joined groups  User preferred contents
  • 18. Understanding User behavior  Online survey  A simple module, in the right side of the portal  Presenting 3 - 4 questions per topic (depending on the current portal section visited)  Stat Drupal Modules  Custom implemented modules  Log User Activity  Keep track and depict main figures about portal activity  Can be filtered by date, user, type of content, group, type of activity (content enrichment, social promotion, networking etc.)  Google Analytics
  • 19. Understanding User behavior  Top Metrics  Avg # Visits/User  Avg # Queries/User  Avg # Clicks/User  Avg Visit duration  Avg Query length  Query refinement rate  Next Page Click Rate  Back Page Click Rate  Frequency of searching (once/day, week etc.)  Success of searching (assessment...)  …
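A few of these metrics can be computed directly from a query log. The sketch below uses a hypothetical log layout (user id, session id, query text) and one possible definition of the query refinement rate. The slides do not define the metric, so that definition is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log rows: (user id, session id, query text).
LOG = [
    ("u1", "s1", "dance"),
    ("u1", "s1", "dance theatre"),
    ("u1", "s2", "opera"),
    ("u2", "s3", "shakespeare hamlet video"),
]


def avg_queries_per_user(log):
    per_user = defaultdict(int)
    for user, _session, _query in log:
        per_user[user] += 1
    return mean(per_user.values())


def avg_query_length(log):
    """Average number of whitespace-separated terms per query."""
    return mean(len(query.split()) for _user, _session, query in log)


def query_refinement_rate(log):
    """Share of queries that follow an earlier query in the same session (assumed definition)."""
    seen_sessions = set()
    refinements = 0
    for _user, session, _query in log:
        if session in seen_sessions:
            refinements += 1
        seen_sessions.add(session)
    return refinements / len(log)
```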
  • 20. Logging User Behavior  Logging user activities on the portal  Downloads/Views  Queries  Anonymous/Registered portal accesses (login/logout)  Adding/Updating/Deleting digital contents  Menu clicks  Content Upload  Content Management  Social Promotion & Networking
  • 21. Logging User Behavior  Content Accesses (Download/View)  Axmedis Content  Pdf, Document, Video, Playlist, Slide, Flash, Image, Excel, Archive, Audio, Tool, Collection  Drupal Content  Page, Blog, Event, Forum, Group, Comment  Distribution of Content Access per  Access Type, Portal, Platform, Section, Locale, Country, Region, City, Axoid, Nid, Content Type, Partner, User, Timestamp
  • 22. Logging User Behavior  Queries (Simple, Faceted, Advanced)  Distribution of Queries per  User, Content type, Device, IP, User Agent, Query Type, Country, Region, City, Locale, Filter (faceted)  Query Cloud  Keyword Cloud  IPR Wizard  Definition and usage of IPR Models  Metadata Editor  Access and usage  Add, Edit metadata  Video Annotations  Personal content  Other users content
  • 23. Logging User Behavior  Social Promotion & Networking  Analysis of  Eccentricity  Betweenness  Connections  Creation, Access of Public/Private Web Pages  Activity on Forums, Blogs, Groups or between users  New Contents  Comments to Objects/Web Pages  Invited People  Featured Objects  Recommendations, suggested content  Export/Import of links to/from other SN  Private Messages
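Eccentricity and betweenness are standard graph measures that can be computed on the user connection graph. The sketch below is a plain-Python illustration for a small unweighted, undirected graph, using breadth-first search for eccentricity and Brandes' algorithm for betweenness. It is not taken from the ECLAP code, and the graph representation (a dictionary of neighbour lists) is an assumption.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path hop counts from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def eccentricity(graph, node):
    """Greatest shortest-path distance from `node` to any reachable user."""
    return max(bfs_distances(graph, node).values())


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalised)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []                      # nodes in order of non-decreasing distance
        preds = {v: [] for v in graph}  # predecessors on shortest paths from s
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):       # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: value / 2 for v, value in score.items()}


friends = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
```

In the three-user chain `a - b - c`, user `b` lies on the only shortest path between `a` and `c`, so it is the only user with non-zero betweenness.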
  • 24. Logging User Behavior  Menu Clicks  Distribution of clicks per  User, IP, Locale, Timestamp etc.  LAST POSTED, FEATURED, CALENDAR, ADVANCED SEARCH, UPLOAD AND INGEST, POPULAR, MY CONTENT, MY GROUPS , MY COLLEAGUES, GET AFFILIATED, TERMS OF USE, PRIVACY POLICY, TOP RATED, COURSES, LESS POPULAR, UPLOAD NEW CONTENT, etc.  Ranking/Voting  # of ranked items  Distribution per  User, IP, Locale, Timestamp etc.  QR Code  Access from Mobile Devices  Workflow  Distribution of Workflow Type  Content Upload  Distribution of uploads per  User, Partner, Timestamp
  • 25. Content Access September 1st – November 30th 2011  First table (Affiliation: # View/Play; # Download)  DSI: 46; 0  Not partners/Affiliated: 1292; 14  Partners/Affiliated (except DSI): 6712; 119  Public Users: 21418; 947  Second table (Affiliation: # View/Play; # Download)  DSI: 3; 0  Not partners/Affiliated: 100; 4  Partners/Affiliated (except DSI): 218; 11  Public Users: 2225; 869
  • 26. Menu Clicks September 1st – November 30th 2011  Menu: # Clicks  ABOUT->ECLAP DESCRIPTION: 671  EVENTS->PAST AND FUTURE: 536  SEARCH->GROUPS: 524  ABOUT->ECLAP NEWS BLOG: 463  CONTENT->LAST POSTED: 265  CONTENT->FEATURED: 343  HOWTO->UPLOAD AND INGEST: 330  SEARCH->ADVANCED SEARCH: 314  EVENTS->CALENDAR: 298  ABOUT->ECLAP PARTNERS: 269  ABOUT->MAIN CONTACT: 249  CONTENT->POPULAR: 239
  • 27. Search September 1st – November 30th 2011  First table (Affiliation: # Simple Queries; # Faceted Queries)  DSI: 13; 0  Not partners/Affiliated: 323; 24  Partners/Affiliated (except DSI): 1094; 21  Public Users: 2634; 147  Second table (Affiliation: # Advanced Queries)  DSI: 0  Not partners/Affiliated: 18  Partners/Affiliated (except DSI): 4
  • 28. Drupal Stat Metrics September 1st – November 30th 2011  Content Access per nid
  • 29. Drupal Stat Metrics September 1st – November 30th 2011  Views by Query
  • 30. Drupal Stat Metrics September 1st – November 30th 2011  Content Access per Platform
  • 31. Understanding User behavior  Drupal Stats (collapsible menus on the right)
  • 32. Google Analytics vs Drupal Stats  Google Analytics, pros: traffic source data; bounce rate; recency (since when); loyalty (how often); session times  Google Analytics, cons: IP approach, each IP is considered a unique visitor; can’t deal with specific actions on the portal (e.g. downloads, queries)  Drupal Stats, pros: identity approach; actions (download, user access, queries); content type filtering  Drupal Stats, cons: can’t deal with traffic source data and bounce rate; session time is a raw approximation
  • 33. Sorting Results  Sorting by  Upload Time (first time doc uploading date)  Update Time (last time doc updating date)  Score (doc relevance to search query)  Combined with faceting and paging
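Solr exposes sorting, paging and faceting through the `sort`, `start`/`rows` and `facet`/`facet.field` request parameters, so the three options on this slide can be combined in one request. The sketch below only assembles the parameters. The index field names `upload_time`, `update_time` and `content_type` are hypothetical, since the slides name the sort criteria but not the fields.

```python
from urllib.parse import urlencode

# Hypothetical index field names; the slides only name the three sort criteria.
SORT_FIELDS = {
    "upload": "upload_time",
    "update": "update_time",
    "score": "score",  # Solr's built-in relevance pseudo-field
}


def search_params(query, sort_by="score", page=0, page_size=10, facet_fields=()):
    params = [
        ("q", query),
        ("sort", f"{SORT_FIELDS[sort_by]} desc"),
        ("start", page * page_size),  # paging offset
        ("rows", page_size),
    ]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return params


request = search_params("hamlet", sort_by="update", page=2, facet_fields=["content_type"])
print(urlencode(request))
```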
  • 34. Suggestions  REALTIME, while typing a query suggests similar searches  ecl…  eclap  eclap-de-2-1-1-user  eclap-de-2-2-1-usergroup  …
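A minimal sketch of prefix-based suggestions, assuming the candidates are previously seen search strings kept in a sorted list. The slides do not describe how ECLAP generates suggestions, so the `Suggester` class is purely illustrative.

```python
from bisect import bisect_left


class Suggester:
    """Prefix lookup over a sorted list of known search strings."""

    def __init__(self, past_queries):
        self.terms = sorted({query.lower() for query in past_queries})

    def suggest(self, prefix, limit=5):
        prefix = prefix.lower()
        start = bisect_left(self.terms, prefix)  # first term >= prefix
        matches = []
        for term in self.terms[start:]:
            if not term.startswith(prefix) or len(matches) == limit:
                break
            matches.append(term)
        return matches


suggester = Suggester(["eclap", "eclap-de-2-1-1-user", "eclap-de-2-2-1-usergroup", "dance"])
suggestions = suggester.suggest("ecl")
```

Because the list is sorted, all terms sharing a prefix are contiguous, so the lookup is one binary search followed by a short scan.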
  • 36. Indexing/Searching Reqs  Enriching search experience  Results Sorting  Suggestions  Large # of contents (~10^4 to 10^6)  External Indexing Service  Hidden/Private contents management  Monitoring Exceptions  Email notifications  Search Engine Friendly (Google, Bing, Yahoo etc.)  content site crawling HTML dumping
  • 37. External Indexing Service 1/3  Setup an external service to avoid server overloading when building the index  Taxonomization  Indexing (with exceptions monitoring)  Index Synchronization  Old Index replacement with new one  Index updating  Old contents cleaning (optional)
  • 38. External Indexing Service 2/3  Taxonomization  Has a cost (pre-computing)  Digital content  Execution Rule (JS)  Indexed with object records  Taxonomy table (Taxonomy: Parent)  Performing Arts: -  Cinema: Performing Arts  Music: Performing Arts  Documentary: Cinema  Historical: Cinema  Classical: Music  Pop: Music  Object taxonomy: Performing Arts, Cinema, Music, Documentary, Classical
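Reading the slide's table as a child-to-parent mapping, taxonomization can be sketched as expanding the terms assigned to an object with all of their ancestors before indexing. This reading of the garbled table is an assumption, and `with_ancestors` is a hypothetical helper, not the JS rule used in ECLAP.

```python
# Child -> parent mapping, as read from the slide's taxonomy table.
PARENT = {
    "Performing Arts": None,
    "Cinema": "Performing Arts",
    "Music": "Performing Arts",
    "Documentary": "Cinema",
    "Historical": "Cinema",
    "Classical": "Music",
    "Pop": "Music",
}


def with_ancestors(terms, parent=PARENT):
    """Return the given taxonomy terms plus every ancestor up to the root."""
    expanded = set()
    for term in terms:
        # Stop early when a term was already added through another branch.
        while term is not None and term not in expanded:
            expanded.add(term)
            term = parent[term]
    return expanded


indexed_terms = with_ancestors(["Documentary", "Classical"])
```

Pre-computing the expansion at indexing time means a search restricted to `Cinema` can match an object tagged only with `Documentary` without walking the tree at query time.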
  • 39. External Indexing Service 3/3  Indexing with exceptions monitoring  Real-time notifying system  Event time and type (add, update)  Full stacktrace info  Customizable recipients  Object Indexing Recovery  Resource Parse Error  Metadata Indexing  Index synchronization  During external indexing, contents may be updated/added/deleted on the original index  Need to update these contents on the index (state flag)  Indexed / External Indexed state flags: 1 1 0 1
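Indexing with exception monitoring can be sketched as a loop that keeps going when one object fails, records the event time, event type and full stack trace, and hands that record to a notifier. The function and field names below are hypothetical, and `notify` stands in for whatever delivers the message to the configured recipients (for example an email sender).

```python
import traceback
from datetime import datetime, timezone


def index_with_monitoring(objects, index_one, notify, event_type="add"):
    """Index every object; report failures instead of aborting the whole run."""
    failed = []
    for obj_id, payload in objects:
        try:
            index_one(obj_id, payload)
        except Exception as exc:
            notify({
                "object": obj_id,
                "event_type": event_type,
                "event_time": datetime.now(timezone.utc).isoformat(),
                "error": repr(exc),
                "stacktrace": traceback.format_exc(),  # full stack trace text
            })
            failed.append(obj_id)  # candidates for a later indexing recovery pass
    return failed


def demo_index(obj_id, payload):
    # Stand-in indexer that rejects empty resources.
    if not payload:
        raise ValueError(f"resource parse error for {obj_id}")


sent = []
failed_ids = index_with_monitoring([("a", "text"), ("b", "")], demo_index, sent.append)
```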
  • 40. Search Engine Friendly  HTML dump service  Java external service  Periodically invoked by an AXCP rule  Full metadata exporting  Thumbnail  Resource link  Multilanguage  Paginated results
  • 41. Conclusions  Drupal integrated solution for user behavior tracking and analysis  Logging  Stat Data Graph  Online Survey  External Indexing Service  Avoids server overloading  HA of query service  Error recovering  Detailed event notifying system  Index Optimization  Dumping tool for portal contents (SEO)  Full metadata HTML exporting  Scheduled Service
  • 42. Future Work  Keep collecting Data  Deeper Data Analysis  User Sessions  1st, 2nd..., nth click average user behavior  Depict a modular view of the system usage  Popularity/Usability for each feature & functionality  Social Network Analysis (SNA)  Huge Population  User relationships, connections, friendships
  • 43. References  P. Bellini, I. Bruno, D. Cenni, P. Nesi, "Micro grids for scalable media computing and intelligence on distributed scenarios", IEEE Multimedia, 2011  P. Bellini, I. Bruno, D. Cenni, P. Nesi, M. Paolucci, M. Serena, "Semantic Model for Cultural Heritage Social Network and Cross Media Content for Multiple Devices", Conference of the Italian Association of Artificial Intelligence, Workshop for Cultural Heritage, 15-17 September 2011, Palermo, Italy
  • 44. Q&A
  • 46. Architecture (former) [architecture diagram; components: Index Rebuilder, Indexing Rule (JS), SolrJ Client, AXCP Grid Rule Scheduler, Grid Node, Solr XML/HTTP, JSP, Indexing Module, Searching Module, Indexing Service, Apache Solr, Drupal, Apache Tomcat, Apache HTTP]
  • 47. Drupal  What is it?  Open source content management platform  Developed by Dries Buytaert in 2001  Written in PHP  Users: The Economist, Examiner.com, The White House, data.gov.uk  Runs on a web server (e.g. Apache, IIS) and a database (e.g. MySQL, PostgreSQL)
  • 48. Apache Lucene  What is it?  High-performance, full-featured text search engine library (indexing and searching documents)  Developed by Doug Cutting (2000, SourceForge), joined the Apache Software Foundation in 2001  Written entirely in Java  Users: Wikipedia, Technorati, Nabble, TheServerSide, Akamai, SourceForge
  • 49. Apache Lucene Features  Ranked searching (best results returned first)  Powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more  Fielded searching (e.g., title, author, contents)  Date-range searching  Sorting by any field  Multiple-index searching with merged results  Allows simultaneous update and searching
  • 50. Apache Lucene Features  Documents added via IndexWriter  Document = a collection of fields  No config files, dynamic field typing  Flexible text analysis: tokenizers, filters  Search for documents via IndexSearcher  Hits = search(Query,Filter,Sort,topN)  Scoring: tf * idf * lengthNorm
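The scoring line `tf * idf * lengthNorm` can be illustrated with a toy calculation. The component formulas below (square-root term frequency, logarithmic inverse document frequency, inverse-square-root length norm) are one common choice in the spirit of Lucene's classic TF-IDF similarity. Treat the exact formulas as assumptions, and note that real Lucene scoring includes further factors such as boosts and query normalization.

```python
import math


def tf(term_freq):
    """Dampened term frequency."""
    return math.sqrt(term_freq)


def idf(num_docs, doc_freq):
    """Rarer terms get a higher weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Shorter fields get a higher weight."""
    return 1.0 / math.sqrt(num_terms)


def score(term, doc_tokens, corpus):
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    return tf(doc_tokens.count(term)) * idf(len(corpus), doc_freq) * length_norm(len(doc_tokens))


corpus = [
    ["hamlet", "play"],
    ["hamlet", "a", "long", "play", "text"],
    ["dance"],
]
short_doc_score = score("hamlet", corpus[0], corpus)
long_doc_score = score("hamlet", corpus[1], corpus)
```

Both documents contain `hamlet` once, so the shorter one scores higher purely because of the length norm.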
  • 51. Apache Solr  What is it?  A full text search server based on Lucene (Lucene sub-project)  Developed by Yonik Seeley at CNET Networks (2004), donated to the Apache Software Foundation (2006)  Written in Java, deployable as a WAR  Users: CNET Reviews, CNET Channel, shopper.com, news.com, nines.org, krugle.com, oodle.com, booklooker.de
  • 52. Apache Solr Features  Advanced Full-Text Search Capabilities  Optimized for High Volume Web Traffic  Standards Based Open Interfaces (XML, JSON, HTTP)  Web Administration Interface  Server statistics exposed over JMX for monitoring  Scalability, efficient Replication to other Solr Search Servers  Flexible and Adaptable with XML configuration  Extensible Plugin Architecture