Content Management, Metadata & Semantic Web. Keynote Address, Net.ObjectDAYS 2001, Erfurt, Germany, September 11, 2001. Amit Sheth, CTO/SrVP, Voquette (www.voquette.com) [formerly Founder/CEO, Taalee, www.taalee.com]; Director, Large Scale Distributed Information Systems Lab, University of Georgia (lsdis.cs.uga.edu). Metadata Extraction is a patent-pending technology of Taalee, Inc. Semantic Engine and WorldModel are trademarks of Taalee, Inc.
Agenda What is Traditional Content Management New Content Management Challenges faced by Enterprises Semantic Content Management Metadata Metadata Descriptions and Standards (Automated) Metadata Creation/Extraction/Tagging Metadata Usage/Applications Semantics (and Semantic Web) Current and Future
Traditional Content Management:  Core Objectives and Features Primary Objective: Effectively create, manage and publish internal content, with  Existing content creation applications (MS-Office, Notes) and provide some new capabilities (Speech to text) (Basic, Syntactic) metadata Workflow or lifecycle support (from author to Web publication or distribution) Versioning and Rollback (Keyword-based/Syntactical) Search and Personalization Internal Distribution Web publishing Content Creation and Edition Content Management Content Personalization and Services Content Delivery
Technology/Product Provider Landscape. Traditional Content Management Companies: Interwoven, Vignette, Broadvision, Enprise, Documentum, Open Market. Three of several upcoming companies focusing on metadata, semantics and/or semantic web: Applied Semantics, Voquette (Taalee), Ontoprise. See http://business.semanticweb.org for more.
Enterprise Content Management  – sample user requirements (from a large Financial Svcs Company) “ If a new bond comes into inventory, then we should get a message, an alert...and be able to refine to say that  I only have California, Oregon and Washington clients ...."  “ In the month of July, I received 95 e-mails from my subscriptions. These e-mails included 61 that had 143 attachments that  had 67 more attachments. In total therefore, I received almost  400 documents including 5 different types (HTML,PDF, Word, Rich Media, …).  Even with this volume, I had  subscribed to only 10 categories  in the Equities area. There are a total  of 26 Equity Subscription  areas and a  total of 166 categories  to which a user can subscribe across all Product Areas.” Professional users of a traditional Content Management Product/Solution
Enterprise Content Management  – sample user requirements (from a large Financial Svcs Company) The real question is, " Which sales ideas may have significant relevance to my book of business ?" For example, an earnings warning on an equity rated Hold or Lower and not owned by any of my clients may not be of high relevance to me. Ideally, a  relevance analysis  would: Greatly reduce the volume of Product Area Ideas sent to every FA, hopefully to perhaps  10% to 20% or less of today's volume with ideas that are potentially actionable  for that FA and his/her client Result in FAs reading and evaluating the Product Area Ideas, taking  appropriate actions , and  generating sales because the Product Area Ideas would be relevant Result in  customer satisfaction  because clients would understand FAs are paying attention to their needs and developing focused ideas Professional users of a traditional Content Management Product/Solution
Enterprise Content Management  – sample product requirements (from a large Financial Svcs Company) “ Content generation is a more complex and probably costly problem to solve ... we reportedly create about  9 million messages a month  for field delivery. On average, this would mean  1,000 messages per month per ‘ big user’  or perhaps only 500 to 600 per ‘ little user’ .…I strongly believe an analysis is in order of the nature and necessity of generated content , the establishment of  content generation standards , the movement towards development and implementation of a  relevance engine, … “ Director  (Product Management) of a large company that uses a leading Content Management Product
New Enterprise Content Management Challenges More variety and complexity More formats (MPEG, PDF, MS Office, WM, Real, AVI, etc) More  types (Docs, Images -> Audio, Video, Variety of text-structured, unstructured) More sources (internal, extranet, internet, feeds) Information Overload Too much data, precious little information (Relevance) Creating Value from Content How to Distribute the right content to the right people as needed? (Personalization -- book of business) Customized delivery for different consumption options (mobile/desktop, devices) Insight, Decision Making (Actionable)
New Enterprise Content Management Technical Challenges Aggregation Feed handlers/Agents that understand content representation and media semantics Push-pull, Web-DB-Files, Structured-Semi-structured-Unstructured data of different types Homogenization and Enhancement Enterprise-wide common view Domain model, taxonomy/classification, metadata standards Semantic Metadata– created automatically if possible Semantic Applications Search, personalization, directory, alerts, etc. using metadata and semantics (semantic association and correlation), for improved relevance, intelligent personalization, customization
Creating and Serving Metadata to Power the Life-cycle of Content Applications Back End "A Web content repository without metadata is like a library without an index."  - Jack Jia, IWOV “ Metadata increases content value in each step of content value chain.”  Amit Sheth Where is the content? Whose is it? Produce Aggregate What is this content about? Catalog/ Index What other content is it related to? Integrate Syndicate What is the right content for this user? Personalize What is the best way to monetize this interaction? Interactive Marketing Broadcast, Wireline, Wireless, Interactive TV Semantic Metadata
A Metadata Classification Data   (Heterogeneous Types/Media) Content Independent Metadata   (creation-date, location, type-of-sensor...) Content Dependent Metadata   (size, max colors, rows, columns...) Direct Content Based Metadata (inverted lists,  document vectors, LSI) Domain Independent (structural) Metadata   (C++ class-subclass relationships, HTML/SGML Document Type Definitions, C program structure...) Domain Specific Metadata area, population (Census), land-cover, relief (GIS),metadata  concept descriptions from ontologies Ontologies Classifications Domain Models User More  Semantics for  Relevance  to tackle Information Overload!!
Semantics: “meaning or relationship of meanings, or relating to meaning” (Webster). Semantics is concerned with the relationship between linguistic symbols and their meaning or real-world objects; in an information system, with the meaning and use of data. Example: Palm -> Company, Product, Technology, Tree Name, part of a location (Palm Springs, Palm Beach). Semantics, Ontologies (Domain Models), Metamodels, Metadata, Content/Data
“The Web of data (and connections) with meaning in the sense that a computer program can learn enough about what the data means to process it. . . . Imagine what computers can understand when there is a vast tangle of interconnected terms and data that can automatically be followed.” (Tim Berners-Lee, Weaving the Web, 1999) A Content Management centric definition of Semantic Web: The concept that Web-accessible content can be organized and utilized semantically, rather than through syntactic and structural methods. Semantics: The Next Step in the Web’s Evolution
Next Generation: Semantic Content Management
Organizing Content Different and Related Objectives: Search, Browse, Summarization, Association/Relationships Indexing Clustering Classification Controlled Vocabulary, Reference Data/ Dictionary/Thesaurus  Metadata Knowledge Base (Entities/Objects  and  Relationships)
Statistical/AI  Techniques Customer  Article Feed  4715 Classification of  Article 4715 Customer  Training  Set Traditional Text Categorization Routing/Distribution Classify Place in a taxonomy feed Most traditional Content Management Products support  Categorization of unstructured content.. Standard Metadata Feed Source : iSyndicate    Posted Date : 11/20/2000
Knowledge-base &  Statistical/AI Techniques Article Feed 4715 Classification  of Article 4715 Customer  Training  Set & KB Routing/Distribution Classify Place in a taxonomy Taalee  Training  Set & KB Map to another taxonomy Metadata Catalog Semantic Engine™ Precise Personalization/ Syndication/Filtering Voquette/Taalee’s Categorization & Automatic Metadata Creation feed Standard  metadata Semantic  metadata FTE Company Analysis Conference Calls Earnings Stock Analysis ENT Company Analysis Conference Calls Earnings Stock Analysis NYSE Member Companies Market News IPOs Automated Content  Enrichment (ACE) Article 4715 Metadata Feed Source :  iSyndicate      Posted Date : 11/20/2000  Company Name :  France Telecom ,    Equant   Ticker Symbol :  FTE ,  ENT   Exchange :  NYSE   Topic :  Company News
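The enrichment step in the Article 4715 example can be sketched as a knowledge-base lookup. This is a minimal illustration, not Taalee's actual engine: the `KNOWLEDGE_BASE` dictionary, the `enrich` function, and the sample sentence are all hypothetical, and real entity recognition would be far more involved than substring matching.

```python
# Hypothetical knowledge base: company name -> reference facts.
# The two entries mirror the Article 4715 metadata shown on the slide.
KNOWLEDGE_BASE = {
    "France Telecom": {"ticker": "FTE", "exchange": "NYSE"},
    "Equant": {"ticker": "ENT", "exchange": "NYSE"},
}


def enrich(standard_metadata: dict, text: str) -> dict:
    """Return a copy of the standard metadata with semantic fields added."""
    # Naive entity spotting: a company counts as mentioned if its name occurs.
    companies = [name for name in KNOWLEDGE_BASE if name in text]
    enriched = dict(standard_metadata)
    enriched["company_name"] = companies
    enriched["ticker_symbol"] = [KNOWLEDGE_BASE[c]["ticker"] for c in companies]
    enriched["exchange"] = sorted({KNOWLEDGE_BASE[c]["exchange"] for c in companies})
    return enriched


# Standard metadata as delivered with the feed (values from the slide).
article = {"feed_source": "iSyndicate", "posted_date": "11/20/2000"}
# Invented sentence standing in for the article body.
text = "France Telecom said it would combine a data unit with Equant."
result = enrich(article, text)
```

The ticker and exchange values never appear in the text itself; they come from the knowledge base, which is the distinction the slide draws between standard and semantic metadata.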
Technologies for Organizing Content. Information Retrieval/Document Indexing: TF-IDF/statistical, Clustering, LSI. Statistical learning/AI: Machine learning, Bayesian, Markov Chains, Neural Network. Lexical, Natural language. Thesaurus, Reference data, Domain models (Ontology). Information Extractors. Reasoning/Inferencing: Logic based, Knowledge-based, Rule processing. The most powerful solutions combine several of these, addressing more of the objectives.
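Of the techniques listed, TF-IDF is the easiest to show in a few lines. The sketch below uses one common formulation (term frequency times the natural log of inverse document frequency); the function name and the three toy documents are illustrative assumptions, not material from the talk.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by its in-document frequency times log(N / document frequency)."""
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, already tokenized.
docs = [
    "earnings call for the company".split(),
    "touchdown pass for the team".split(),
    "company earnings beat forecast".split(),
]
weights = tf_idf(docs)
```

A term that occurs in every document gets weight zero, while a term unique to one document gets the highest weight there. This is purely statistical: it knows nothing about what "touchdown" means, which is why the slide pairs it with domain models and extractors.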
Multiple competing standards! Multiple heterogeneous metadata models with different tag names for the same data in the same GIS domain.
Kansas State FGDC Metadata Model: Theme keywords: digital line graph, hydrography, transportation...; Title: Dakota Aquifer; Online linkage: http://gisdasc.kgs.ukans.edu/dasc/; Direct Spatial Reference Method: Vector; Horizontal Coordinate System Definition: Universal Transverse Mercator ...
UDK Metadata Model: Search terms: digital line graph, hydrography, transportation...; Topic: Dakota Aquifer; Adress Id: http://gisdasc.kgs.ukans.edu/dasc/; Measuring Techniques: Vector; Co-ordinate System: Universal Transverse Mercator ...
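One way to reconcile the two models is a per-model mapping onto a shared vocabulary. The sketch below is an assumption about how such homogenization could look: the tag names are copied from the slide, but the shared field names (`keywords`, `title`, and so on) and the `homogenize` function are hypothetical.

```python
# Hypothetical mapping from each model's tag names to a shared vocabulary.
# Tag names are taken verbatim from the slide; target names are invented.
TAG_MAP = {
    "FGDC": {
        "Theme keywords": "keywords",
        "Title": "title",
        "Online linkage": "url",
        "Direct Spatial Reference Method": "spatial_method",
        "Horizontal Coordinate System Definition": "coordinate_system",
    },
    "UDK": {
        "Search terms": "keywords",
        "Topic": "title",
        "Adress Id": "url",
        "Measuring Techniques": "spatial_method",
        "Co-ordinate System": "coordinate_system",
    },
}


def homogenize(model: str, record: dict) -> dict:
    """Rename a record's tags to the shared vocabulary, dropping unmapped tags."""
    mapping = TAG_MAP[model]
    return {mapping[tag]: value for tag, value in record.items() if tag in mapping}


fgdc_record = {
    "Theme keywords": "digital line graph, hydrography, transportation",
    "Title": "Dakota Aquifer",
    "Direct Spatial Reference Method": "Vector",
    "Horizontal Coordinate System Definition": "Universal Transverse Mercator",
}
udk_record = {
    "Search terms": "digital line graph, hydrography, transportation",
    "Topic": "Dakota Aquifer",
    "Measuring Techniques": "Vector",
    "Co-ordinate System": "Universal Transverse Mercator",
}
```

After mapping, `homogenize("FGDC", fgdc_record)` and `homogenize("UDK", udk_record)` yield the same dictionary, so a single query can run over both sources.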
Basis for Semantics A.   Facts/Concepts/Terms/Entities Dictionary, Thesaurus, Reference Data, Vocabulary B.   Facts with Relationships Taxonomy/(Categories), Ontology Domain Modeling (e.g., Golf = golfer, tournament name, golf course, event) Knowledge Base
Ontology Standardizes meaning, description, representation of involved concepts/terms/attributes Captures the semantics involved via domain characteristics, resulting in semantic metadata “ Ontological Commitment” forms basis for knowledge sharing and reuse  Ontology provides semantic underpinning.
An Ontology Disaster eventDate description site => latitude, longitude site latitude longitude Natural Disaster Man-made Disaster damage numberOfDeaths damagePhoto Volcano Earthquake NuclearTest magnitude bodyWaveMagnitude conductedBy explosiveYield bodyWaveMagnitude < 10 bodyWaveMagnitude > 0 magnitude < 10 magnitude > 0 Terms/Concepts (Attributes) Functional Dependencies (FDs) Domain Rules Hierarchies
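The four ingredients named on the slide (terms, hierarchies, functional dependencies, domain rules) can be approximated with plain classes. This is a rough sketch under assumptions: the flattened slide does not show which attribute belongs to which class, so the assignment below is a guess, and `violates_site_dependency` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Disaster:
    # Terms/concepts shared by every disaster.
    event_date: str
    description: str
    site: str
    latitude: float
    longitude: float


@dataclass
class NaturalDisaster(Disaster):
    number_of_deaths: int = 0


@dataclass
class ManMadeDisaster(Disaster):
    conducted_by: str = ""


@dataclass
class Earthquake(NaturalDisaster):
    magnitude: float = 1.0

    def __post_init__(self) -> None:
        # Domain rule from the slide: 0 < magnitude < 10.
        if not 0 < self.magnitude < 10:
            raise ValueError("magnitude must be between 0 and 10 (exclusive)")


@dataclass
class NuclearTest(ManMadeDisaster):
    explosive_yield: float = 0.0
    body_wave_magnitude: float = 1.0

    def __post_init__(self) -> None:
        # Domain rule from the slide: 0 < bodyWaveMagnitude < 10.
        if not 0 < self.body_wave_magnitude < 10:
            raise ValueError("bodyWaveMagnitude must be between 0 and 10 (exclusive)")


def violates_site_dependency(events: list) -> bool:
    """Check the functional dependency site => latitude, longitude."""
    seen = {}
    for event in events:
        coords = (event.latitude, event.longitude)
        # A site seen earlier must always come with the same coordinates.
        if seen.setdefault(event.site, coords) != coords:
            return True
    return False
```

The class hierarchy carries the taxonomy, the `__post_init__` checks carry the domain rules, and the helper checks the functional dependency across a set of records.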
Controlled  Vocabularies/ Classifications/Taxonomies/Ontologies WordNet Cyc The Medical Subject Headings (MeSH):  NLM's controlled vocabulary used for indexing articles, for cataloging books and other holdings, and for searching MeSH-indexed databases, including  MEDLINE . MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts. Year 2000 MeSH includes more than 19,000 main headings, 110,000 Supplementary Concept Records (formerly Supplementary Chemical Records), and an entry vocabulary of over 300,000 terms.
Open Directory Project (ODP): Classification/Taxonomy & Directory
Metadata Specifications  (MetaModels) Metadata Domain Independent   (Dublin Core, RDF, DAML+OIL) Frameworks/Infrastructures   (XCM, XMI) Function Specific ICE (Syndication)  Domain (Application) Specific MARC (Library), FGDC and UDK (Geographic), PRISM (Publishing), FXML (Financial Transactions). RIXML (Buy-Sell Research/Financial Services),  IMS Learning Resource (Distance Learning). ….. Media Specific MPEGx, VoiceXML NewsML (News exchange)
Types of Specs and Standards  (or MetaModels) Domain Independent: (MCF), RDF, (MOF), DublinCore Media Specific: MPEG4, MPEG7, VoiceXML Domain/Industry Specific (metamodels): MARC (Library), FGDC and UDK (Geographic), NewsML (News), PRISM (Publishing), RIXML (Buy-Sell Research/Financial Services) Application Specific: ICE (Syndication), IMS Learning Resource (Distance Learning) Exchange/Sharing: XCM, XMI Orthogonal/(Other): RDFS, namespaces, ontologies, domain models, (DAML, OIL)
Dublin Core Metadata Initiative Simple element set designed for resource description  International, inter-discipline, W3C community consensus  “ Semantic” interface among resource description communities (very limited form of semantics) Source:www.desire.org
Dublin Core RDF
<xml>
  <?namespace href = "http://w3.org/rdf-schema" as = "RDF">
  <?namespace href = "http://metadata.net/DC" as = "DC">
  <RDF:Abbreviated>
    <RDF:Assertion RDF:HREF = "http://www.mysite.com/mydoc.html"
      DC:Title = "I've Never Metadata I've Never Liked"
      DC:Creator = "Mary Crystal"
      DC:Subject = "Metadata, Dublin Core, Stuff"/>
  </RDF:Abbreviated>
</xml>
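The slide uses an early draft RDF syntax. For comparison, the same three Dublin Core statements can be generated in the RDF/XML form of the later W3C Recommendation with Python's standard library. This is a sketch, not the syntax from the talk; `dublin_core_record` is a hypothetical helper, while the two namespace URIs are the published RDF and Dublin Core element-set namespaces.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
# Ask ElementTree to serialize with the conventional prefixes.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def dublin_core_record(resource_url: str, fields: dict) -> str:
    """Serialize Dublin Core fields about one resource as RDF/XML."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": resource_url}
    )
    for name, value in fields.items():
        # Each Dublin Core element becomes a child such as <dc:title>.
        ET.SubElement(description, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


record = dublin_core_record(
    "http://www.mysite.com/mydoc.html",
    {
        "title": "I've Never Metadata I've Never Liked",
        "creator": "Mary Crystal",
        "subject": "Metadata, Dublin Core, Stuff",
    },
)
```

The values are the ones shown on the slide; only the serialization differs.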
NewsML (Source: http://www.mediabricks.com) The content provider supplies NewsML packaged media content to the operator. The content can be categorized as current events, finance, sport, etc. (but no standard is specified) and updated hourly. The operator receives NewsML data from the content provider. The content server automatically pushes updated news articles to all news service subscribers. Consumers sign up for the news service directly on the device. When using the news service, the user browses through the categories and reads the news articles. The news articles are presented in a continuous flow (one after the other) without end-user interaction.
NewsML
Content-descriptive metadata:
  <HeadLine>Seattle attacked by Godzilla-like creature, Microsoft closes HQ</HeadLine>
  <DateLine>Seattle, Was., Aug 30, 2009 /AthensWire via COMTEX/ --</DateLine>
  <CopyrightLine>Copyright (C) 2009 AthensWire. All rights reserved.</CopyrightLine>
Administrative metadata:
  <Provider><Party FormalName="Comtex" /></Provider>
  <Source><Party FormalName="AthensWire" /></Source>
Rights metadata:
  <CopyrightDate>2009</CopyrightDate>
Descriptive metadata:
  <Language FormalName="en" />
  <Property FormalName="Location" Value="Seattle, Washington, United States, North America" />
  <Property FormalName="PublicCompany" Vocabulary="urn:newsml:comtexnews.net:20010201:DomesticPublicCompanies:1">
    <Property FormalName="CompanyName" Value="Microsoft Corp." />
    <Property FormalName="StockSymbol" Value="MSFT" />
    <Property FormalName="StockExchange" Value="Nasdaq" />
  </Property>
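A consumer of such a feed would typically flatten the nested Property elements into name/value pairs. The sketch below parses only the descriptive-metadata part of the slide's example; wrapping it in a single `DescriptiveMetadata` root and using dotted names for nested properties are assumptions made so that the fragment is parseable on its own.

```python
import xml.etree.ElementTree as ET

# Descriptive metadata from the slide, wrapped in one root element
# (the wrapper is an assumption made for this standalone fragment).
FRAGMENT = """
<DescriptiveMetadata>
  <Language FormalName="en" />
  <Property FormalName="Location"
            Value="Seattle, Washington, United States, North America" />
  <Property FormalName="PublicCompany">
    <Property FormalName="CompanyName" Value="Microsoft Corp." />
    <Property FormalName="StockSymbol" Value="MSFT" />
    <Property FormalName="StockExchange" Value="Nasdaq" />
  </Property>
</DescriptiveMetadata>
"""


def collect_properties(element, prefix: str = "") -> dict:
    """Flatten nested Property elements into dotted-name -> value pairs."""
    found = {}
    for prop in element.findall("Property"):
        name = prefix + prop.get("FormalName")
        if prop.get("Value") is not None:
            found[name] = prop.get("Value")
        # Recurse so that children of PublicCompany become PublicCompany.X.
        found.update(collect_properties(prop, prefix=name + "."))
    return found


properties = collect_properties(ET.fromstring(FRAGMENT))
```

The resulting dictionary has keys such as `Location` and `PublicCompany.StockSymbol`, which is the kind of fielded metadata that search and personalization can use directly.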
RIXML Financial metadata for Buy/Sell sides Highly domain-specific Schema (see next slide) [from UserGuide, p. 31] Example:  MorningCall.xml
RIXML Schema
Metadata Creation and Semanticization Automatic Content    Classification/Categorization Metadata Creation/Extraction:   Types of metadata created Semantic Engine and WorldModel are trademarks of Taalee, Inc. Metadata Extraction is a patented technology of Taalee, Inc.
Content Handling/Ingest Infrastructure/Exchange Feed Handlers Crawlers/Screen Scrapers/Bots Software Agents Centralized, Distributed, or Mobile/Migratory
Information Extraction for Metadata Creation METADATA EXTRACTORS Key challenge:  Create/extract as much (semantics) metadata automatically as possible WWW, Enterprise Repositories Digital Maps Nexis UPI AP Feeds/ Documents Digital Audios Data Stores Digital Videos Digital Images . . . . . . . . .
Extracting a Text Document: Syntactic approach
INCIDENT MANAGEMENT SITUATION REPORT Friday August 1, 1997 - 0530 MDT NATIONAL PREPAREDNESS LEVEL II
CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires have been staffed for structure protection.
SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between Galena and McGr The fire is active on the southern perimeter, which is burning into a continuous stand of black spruce. The fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern perimeter is 35% contained, while protection of the historic cabin continues.
CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident Management Team (Wehking) is assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up. The fire is contained. Major areas of heat have been mopped-up. All crews and overhead will mop-up where the fire burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this weekend, depending on the results of infrared scanning.
LAYOUT Date => day month int ‘,’ int
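The layout rule `Date => day month int ',' int` maps naturally onto a regular expression. The sketch below is one possible reading of that rule, assuming "day" means a weekday name and the two integers are day of month and year; `extract_report_date` is a hypothetical name.

```python
import re
from datetime import date

DAY = r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
# Date => day month int ',' int
DATE_RULE = re.compile(
    r"\b(" + DAY + r")\s+(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b"
)


def extract_report_date(text: str):
    """Return the first date matching the layout rule, or None if absent."""
    match = DATE_RULE.search(text)
    if match is None:
        return None
    _weekday, month, day_of_month, year = match.groups()
    return date(int(year), MONTHS.index(month) + 1, int(day_of_month))


header = "INCIDENT MANAGEMENT SITUATION REPORT Friday August 1, 1997 - 0530 MDT"
report_date = extract_report_date(header)
```

Here `report_date` is the date 1 August 1997. The rule is purely syntactic: it finds the date because of its shape, not because it understands that the report describes fires in Alaska.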
Extraction  Agent Web Page Enhanced Metadata Asset Taalee Extraction and Knowledgebase  Enhancement
Automatic Categorization & Metadata Tagging (unstructured text/transcript of A/V) ABSOLUTE CONTROL OF THE SENATE IS STILL IN QUESTION. AS OF TONIGHT, THE REPUBLICANS HAVE 50 SENATE SEATS AND THE DEMOCRATS 49. IN WASHINGTON STATE, THE SENATE RACE REMAINS TOO CLOSE TO CALL. IF THE DEMOCRATIC CHALLENGER UNSEATS THE REPUBLICAN INCUMBENT THE SENATE WILL BE EVENLY DIVIDED. IN MISSOURI, REPUBLICAN SENATOR JOHN ASHCROFT SAYS HE WILL NOT CHALLENGE HIS LOSS TO GOVERNOR MEL CARNAHAN WHO DIED IN A CRASH THREE WEEKS AGO. GOVERNOR CARNAHAN'S WIFE IS EXPECTED TO TAKE HIS PLACE. IN THE HIGHEST PROFILE SENATE EVENT OF THE NIGHT, HILLARY CLINTON WON THE NEW YORK SENATE SEAT. SHE IS THE FIRST FIRST LADY TO RUN MUCH LESS WIN. Video Segment with Associated Text Segment Description Semantic Metadata Auto Categorization
Video with Editorialized  Text on the Web Automatic Categorization & Metadata Tagging (Web page) Auto Categorization Semantic Metadata
Automatic Categorization & Metadata Tagging (Feed) Text From Bloomberg Auto Categorization Semantic Metadata
Taalee Metadata on Football Assets
Rich Media Reference Page: Baltimore 31, Pit 24 (http://www.nfl.com). Qadry Ismail and Tony Banks hook up for their third long touchdown, this time on a 76-yarder to extend the Ravens' lead to 31-24 in the third quarter.
League: Professional
Teams: Ravens, Steelers
Score: Bal 31, Pit 24
Players: Qadry Ismail, Tony Banks
Event: Touchdown
Produced by: NFL.com
Posted date: 2/02/2000
Crawler-provided text for indexing vs. agent-provided semantic metadata
Metadata from Typical Cataloging of Football Assets: Virage search on "football touchdown"
Jimmy Smith Interview Part Seven: Jimmy Smith explains his philosophy on showboating. URL: http://cbs.sportsline...
Brian Griese Interview Part Four: Brian Griese talks about the first touchdown he ever threw. URL: http://cbs.sportsline...
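The contrast on this slide, keyword hits on crawled text versus queries over semantic fields, can be shown with a toy catalog. Everything here is illustrative: the asset texts are shortened paraphrases, the `event` value for the interviews is an assumption (the slide shows no metadata for them), and both search functions are hypothetical.

```python
ASSETS = [
    {
        "title": "Baltimore 31, Pit 24",
        "text": "Third long touchdown, a 76-yarder, extends the lead to 31-24.",
        "metadata": {"league": "Professional", "event": "Touchdown"},
    },
    {
        "title": "Brian Griese Interview Part Four",
        "text": "Brian Griese talks about the first touchdown he ever threw.",
        "metadata": {"event": "Interview"},  # assumed, not on the slide
    },
    {
        "title": "Jimmy Smith Interview Part Seven",
        "text": "Jimmy Smith explains his philosophy on showboating.",
        "metadata": {"event": "Interview"},  # assumed, not on the slide
    },
]


def keyword_search(assets: list, word: str) -> list:
    """Titles of assets whose crawled text contains the word."""
    return [a["title"] for a in assets if word.lower() in a["text"].lower()]


def fielded_search(assets: list, **criteria) -> list:
    """Titles of assets whose metadata matches every given field exactly."""
    return [
        a["title"]
        for a in assets
        if all(a["metadata"].get(key) == value for key, value in criteria.items())
    ]
```

A keyword search for "touchdown" returns the play and also the Griese interview, which merely mentions a touchdown. A fielded search for `event="Touchdown"` returns only the play.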
Traditional Content  Management Agent Push Pull Information Extraction Agents Dynamic KB Custom WorldModel Relevant Metadata Enhancement Knowledge Management Aggregation & Metadata Extraction Knowledge Management  (Knowledge Base, Domain Model, Metadata) Agent Front End Portal Voquette  Semantic Applications Feeds (proprietary  formats,  standards-based,  NewsML) Corporate Repositories Web Sites One Approach to Extending Traditional CM:  Voquette’s Semantic Engine Technology Search Personalization Alerts Notifications Custom “research”  applications Content Metadata Metadata Metadata Metadata
Taalee/Voquette Semantic Platform Architecture. Content of all formats and media, push/pull: Web sites/pages (static, dynamic); Content Feeds (unstructured, semistructured/docs, tagged/XML); Corporate Repositories/databases. Homogenization/integration: with taxonomy (categorization); contextually relevant metadata with respect to the domain model, automatically generated from content and inferenced. © Taalee Inc.
Content which does contain the  words the user asked for Extractor Agents Content which does not contain the  words the user  asked for, but is  about  what he asked for. Value-added Metadata Content the user did not  think to ask for , but which he  needs to know . Semantic Associations + + Semantic Content End-User Semantic  Content
Metadata and Semantic Technology enabled Applications
Taalee’s Semantic Search Highly customizable, precise and freshest A/V search Context and Domain Specific Attributes Uniform Metadata for Content from Multiple  Sources, Can be sorted by any field Delightful, relevant information, exceptional targeting opportunity
Creating a Web of related information What can a context do?
Example (test on http://directory.mediaanywhere.com ) Search for company ‘Commerce One’ Links to news on companies that compete against Commerce One Links to news on companies Commerce One competes against (To view news on Ariba, click on the link for Ariba) Crucial news on Commerce One’s competitors (Ariba) can be accessed easily and automatically
What else can a context do? (a commercial perspective) Semantic Enrichment Semantic Targeting
Semantic/Interactive Targeting Precisely targeted through the use of Structured Metadata and integration from multiple sources Buy  Al Pacino  Videos Buy  Russell Crowe  Videos Buy  Christopher Plummer  Videos Buy  Diane Venora  Videos Buy  Philip Baker Hall  Videos Buy  The Insider  Video
Example 1 – Snapshots (“Jamal Anderson”) Click on first result for Jamal Anderson View metadata. Note that  Team name  and  League name  are also included in the metadata Search for ‘Jamal Anderson’ in ‘Football’ View the original source HTML page. Verify that the source page contains no mention of  Team name  and  League name . They were Taalee’s value-additions to the metadata to facilitate easier search.
Example 2 – Snapshots (“Gary Sheffield”) Click on first result for Gary Sheffield View metadata. Note that  Team name  and  League name  are also included in the metadata Search for ‘Gary Sheffield’ in ‘Baseball’ View the original source HTML page. Verify that the source page contains no mention of  Team name  and  League name . They were Taalee’s value-additions to the metadata to facilitate easier search.
Semantic Web – Intelligent Content (supported by Taalee Semantic Engine) Related Stock  News Industry News Technology  Products COMPANY EPA Regulations Competition COMPANIES in Same or Related INDUSTRY COMPANIES  in INDUSTRY with Competing  PRODUCTS Impacting INDUSTRY or Filed By COMPANY Important to INDUSTRY or COMPANY SEC Intelligent Content = What You Asked for + What you need to know!
Semantic Application – Equity Dashboard Focused relevant content organized by topic ( semantic categorization ) Automatic Content Aggregation from multiple content providers and feeds Related news not specifically asked for (Semantic Associations) Competitive research inferred automatically Automatic 3 rd  party content integration
Internal Source 1 Research Internal Source 2 External feeds/Web (e.g. Reuters) Voquette Metabase World Model Third-party Content Mgmt And Syndication Semantic Engine 1 2 3 4 Cisco  story from  Source 1 passed on to add semantic associations Consults Knowledge Base for  Cisco ’s competition Returns result: Lucent  is a competitor of  Cisco Lucent  story  from external  feeds picked for publishing as “semantically  related” to  Cisco  story – passed on to Dashboard Story on Lucent Story on Cisco XCM-compliant metadata, XML or other format Semantic Application ASP/Enterprise hosted Extractor  Agent 1 Extractor  Agent 2 Extractor  Agent 3 Metadata centric Content Management Architecture
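Steps 1 to 4 of the diagram amount to a relationship lookup followed by a filter. The sketch below assumes a knowledge base that stores only a "competes with" relation; the `COMPETITORS` dictionary, the story records, and `related_stories` are hypothetical stand-ins for the Semantic Engine and Metabase.

```python
# Hypothetical knowledge base holding one relationship type.
COMPETITORS = {"Cisco": {"Lucent"}}


def related_stories(story: dict, candidates: list, competitors: dict) -> list:
    """Pick candidate stories about competitors of the companies in `story`."""
    # Steps 2-3: consult the knowledge base for each company's competition.
    related_companies = set()
    for company in story["companies"]:
        related_companies |= competitors.get(company, set())
    # Step 4: keep external stories that mention any of those competitors.
    return [c for c in candidates if related_companies & set(c["companies"])]


cisco_story = {"headline": "Story on Cisco", "companies": ["Cisco"]}
external_feed = [
    {"headline": "Story on Lucent", "companies": ["Lucent"]},
    {"headline": "Story on IBM", "companies": ["IBM"]},
]
picked = related_stories(cisco_story, external_feed, COMPETITORS)
```

Only the Lucent story is picked, even though it never mentions Cisco; the link comes from the knowledge base rather than from shared words.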
Wireless Application of Semantic Metadata and Automatic Content Enrichment. Clicking on the link for Cisco Analyst Calls displays a listing sorted by date. Semantic filtering uses just the right metadata to meet screen and other constraints. E.g., Analyst Call focuses on the source and analyst name or company. The icon denotes additional metadata, such as “Strong Buy” by H&Q Analyst. MyStocks News Sports Music MyMedia $ My Stocks CSCO NT IBM Market CSCO Analyst Call Conf Call Earnings 11/08 ON24 Payne 11/07 ON24 H&Q 11/06 CBS Langlesis CSCO Analysis
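Semantic filtering for a small screen can be read as choosing which metadata fields to show per content type, then fitting them to the display. This is a guess at the idea, not the product's logic: `FIELDS_BY_TYPE`, the `Earnings` row, and `render_for_device` are hypothetical; only the Analyst Call fields (date, source, analyst) follow the slide.

```python
# Which metadata fields matter for each content type on a small screen.
# Only the "Analyst Call" row follows the slide; the rest is assumed.
FIELDS_BY_TYPE = {
    "Analyst Call": ["date", "source", "analyst"],
    "Earnings": ["date", "source"],
}


def render_for_device(item: dict, max_chars: int) -> str:
    """Build one listing line from the relevant fields, cut to the screen width."""
    fields = FIELDS_BY_TYPE.get(item["type"], ["date"])
    line = " ".join(str(item[f]) for f in fields if f in item)
    return line[:max_chars]


call = {
    "type": "Analyst Call",
    "ticker": "CSCO",
    "date": "11/07",
    "source": "ON24",
    "analyst": "H&Q",
    "rating": "Strong Buy",
}
line = render_for_device(call, max_chars=20)
```

The rating stays in the metadata but off the listing line, matching the slide's note that an icon signals additional metadata.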
Scene Description Tree Retrieve Scene Description Track “NSF Playoff” Node Enhanced XML Description MPEG-2/4/7 Enhanced Digital Cable Video MPEG Encoder MPEG Decoder Node = AVO Object Voquette/Taalee Semantic Engine Produced by: Fox Sports Creation Date: 12/05/2000 League: NFL Teams: Seattle Seahawks, Atlanta Falcons Players: John Kitna Coaches: Mike Holmgren, Dan Reeves Location: Atlanta Object Content Information (OCI) Metadata-rich Value-added Node Create Scene Description Tree GREAT USER EXPERIENCE Metadata’s role in emerging iTV infrastructure Channel sales through Video Server Vendors, Video App Servers, and Broadcasters License metadata decoder and semantic applications to device makers “NSF Playoff”
Metadata for  Automatic Content Enrichment Interactive Television This segment has embedded or referenced metadata that is used by personalization application to show only the stocks that user is interested in. This screen is customizable with interactivity feature using metadata such as whether there is a new Conference Call video on CSCO. Part of the screen can be automatically customized to  show conference call specific  information– including transcript, participation, etc. all of which are relevant metadata Conference Call itself can have  embedded metadata to  support personalization and interactivity.
Semantic Technology Features Unstructured Text Content Semi-Structured Content Structured Content Audio/Video Content with associated text (transcript, journalist notes) Create a Customized "World Model" (Taxonomy Tree with customized domain attributes) Automatically homogenize content feed tags Automatically categorize unstructured text Automatically create tags based on text itself Create and maintain a Customized Knowledge Base for any domain Automatically enhance content tags based on information beyond text Build contextually relevant custom research applications Contextual Search (an order of magnitude better than keyword-based search) Support push or pull delivery/ingestion of content Personalization/Alerts/Notifications Real Time Indexing (stories indexed for search/personalization within a minute) Provide the user with relevant information not explicitly asked for (Semantic Associations)
Along with the evolution of metadata and semantic technologies enabling the next generation of the Web, Content Management has entered the next generation of Enhanced Content Management.
Resources/References
RDF: www.w3.org/TR/REC-rdf-syntax/
ICE: www.icestandard.org
Meta Object Facility (MOF) Specification, Version 1.3, September 27, 1999: http://cgi.omg.org/cgi-bin/doc?ad/99-09-05
XML Metadata Interchange (XMI) Specification, Version 1.1, October 25, 1999: http://cgi.omg.org/cgi-bin/doc?ad/9910-02 http://cgi.omg.org/cgi-bin/doc?ad/99-10-03
DAML: www.daml.org
NEWSML: newsshowcase.reuters.com
PRISM: www.prismstandard.org/techdev/prismspec1.asp
RIXML: www.rixml.org
XCM: www.vignette.com
OIL: www.ontoknowledge.org/oil
SEMANTICWEB: www.semanticweb.org, business.semanticweb.org
VOICEXML: www.voicexml.org
MPEG7: www.darmstadt.gmd.de/mobile/MPEG7/
Taalee: www.taalee.com
Applied Semantics: www.appliedsemantics.com
Ontoprise: www.ontoprise.com
Multimedia Data Management: Using Metadata to Integrate and Apply Digital Media, Amit Sheth & Wolfgang Klas, Eds., McGraw Hill, ISBN: 0-07-057735-8, 1998.
Information Brokering, Vipul Kashyap & Amit Sheth, Kluwer Academic Publishers, 2001.
Voquette Semantic Technology White Paper.
Mysteries of Metadata, Speaker: Amit Sheth, Workshop at Content World 2001.
Infoquilt Project, LSDIS lab.
http://www.taalee.com
http://lsdis.cs.uga.edu/~amit

PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
Basic Mud Logging Guide for educational purpose
Renaissance Architecture: A Journey from Faith to Humanism
GDM (1) (1).pptx small presentation for students
RMMM.pdf make it easy to upload and study
master seminar digital applications in india
Sports Quiz easy sports quiz sports quiz
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
TR - Agricultural Crops Production NC III.pdf
PPH.pptx obstetrics and gynecology in nursing
Cell Types and Its function , kingdom of life
Computing-Curriculum for Schools in Ghana
Pharmacology of Heart Failure /Pharmacotherapy of CHF
Lesson notes of climatology university.
Module 4: Burden of Disease Tutorial Slides S2 2025
Microbial diseases, their pathogenesis and prophylaxis
O7-L3 Supply Chain Operations - ICLT Program
Pharma ospi slides which help in ospi learning
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Basic Mud Logging Guide for educational purpose

Content Management, Metadata and Semantic Web

  • 1. Content Management, Metadata & Semantic Web Keynote Address Net.ObjectDAYS 2001, Erfurt, Germany, September 11, 2001 Amit Sheth CTO/SrVP, Voquette (www.voquette.com) [formerly Founder/CEO, Taalee, www.taalee.com] Director, Large Scale Distributed Information Systems Lab, University Of Georgia (lsdis.cs.uga.edu) [email_address] Metadata Extraction is a patent-pending technology of Taalee, Inc. Semantic Engine and WorldModel are trademarks of Taalee, Inc.
  • 2. Agenda What is Traditional Content Management New Content Management Challenges faced by Enterprises Semantic Content Management Metadata Metadata Descriptions and Standards (Automated) Metadata Creation/Extraction/Tagging Metadata Usage/Applications Semantics (and Semantic Web) Current and Future
  • 3. Traditional Content Management: Core Objectives and Features Primary Objective: Effectively create, manage and publish internal content, with Existing content creation applications (MS-Office, Notes) and provide some new capabilities (Speech to text) (Basic, Syntactic) metadata Workflow or lifecycle support (from author to Web publication or distribution) Versioning and Rollback (Keyword-based/Syntactical) Search and Personalization Internal Distribution Web publishing Content Creation and Edition Content Management Content Personalization and Services Content Delivery
  • 4. Technology/Product Provider Landscape Traditional Content Management Companies Interwoven, Vignette, Broadvision, Enprise, Documentum, Open Market Three of several upcoming companies focusing on metadata, semantics and/or semantic web Applied Semantics, Voquette (Taalee), Ontoprise See http://business.semanticweb.org for more
  • 5. Enterprise Content Management – sample user requirements (from a large Financial Svcs Company) “If a new bond comes into inventory, then we should get a message, an alert...and be able to refine to say that I only have California, Oregon and Washington clients ....” “In the month of July, I received 95 e-mails from my subscriptions. These e-mails included 61 that had 143 attachments that had 67 more attachments. In total therefore, I received almost 400 documents including 5 different types (HTML, PDF, Word, Rich Media, …). Even with this volume, I had subscribed to only 10 categories in the Equities area. There are a total of 26 Equity Subscription areas and a total of 166 categories to which a user can subscribe across all Product Areas.” Professional users of a traditional Content Management Product/Solution
  • 6. Enterprise Content Management – sample user requirements (from a large Financial Svcs Company) The real question is, "Which sales ideas may have significant relevance to my book of business?" For example, an earnings warning on an equity rated Hold or Lower and not owned by any of my clients may not be of high relevance to me. Ideally, a relevance analysis would: Greatly reduce the volume of Product Area Ideas sent to every FA, hopefully to perhaps 10% to 20% or less of today's volume with ideas that are potentially actionable for that FA and his/her client Result in FAs reading and evaluating the Product Area Ideas, taking appropriate actions, and generating sales because the Product Area Ideas would be relevant Result in customer satisfaction because clients would understand FAs are paying attention to their needs and developing focused ideas Professional users of a traditional Content Management Product/Solution
  • 7. Enterprise Content Management – sample product requirements (from a large Financial Svcs Company) “ Content generation is a more complex and probably costly problem to solve ... we reportedly create about 9 million messages a month for field delivery. On average, this would mean 1,000 messages per month per ‘ big user’ or perhaps only 500 to 600 per ‘ little user’ .…I strongly believe an analysis is in order of the nature and necessity of generated content , the establishment of content generation standards , the movement towards development and implementation of a relevance engine, … “ Director (Product Management) of a large company that uses a leading Content Management Product
  • 8. New Enterprise Content Management Challenges More variety and complexity More formats (MPEG, PDF, MS Office, WM, Real, AVI, etc) More types (Docs, Images -> Audio, Video, Variety of text-structured, unstructured) More sources (internal, extranet, internet, feeds) Information Overload Too much data, precious little information (Relevance) Creating Value from Content How to Distribute the right content to the right people as needed? (Personalization -- book of business) Customized delivery for different consumption options (mobile/desktop, devices) Insight, Decision Making (Actionable)
  • 9. New Enterprise Content Management Technical Challenges Aggregation Feed handlers/Agents that understand content representation and media semantics Push-pull, Web-DB-Files, Structured-Semi-structured-Unstructured data of different types Homogenization and Enhancement Enterprise-wide common view Domain model, taxonomy/classification, metadata standards Semantic Metadata– created automatically if possible Semantic Applications Search, personalization, directory, alerts, etc. using metadata and semantics (semantic association and correlation), for improved relevance, intelligent personalization, customization
  • 10. Creating and Serving Metadata to Power the Life-cycle of Content Applications Back End "A Web content repository without metadata is like a library without an index." - Jack Jia, IWOV “Metadata increases content value in each step of content value chain.” Amit Sheth Where is the content? Whose is it? Produce Aggregate What is this content about? Catalog/ Index What other content is it related to? Integrate Syndicate What is the right content for this user? Personalize What is the best way to monetize this interaction? Interactive Marketing Broadcast, Wireline, Wireless, Interactive TV Semantic Metadata
  • 11. A Metadata Classification Data (Heterogeneous Types/Media) Content Independent Metadata (creation-date, location, type-of-sensor...) Content Dependent Metadata (size, max colors, rows, columns...) Direct Content Based Metadata (inverted lists, document vectors, LSI) Domain Independent (structural) Metadata (C++ class-subclass relationships, HTML/SGML Document Type Definitions, C program structure...) Domain Specific Metadata area, population (Census), land-cover, relief (GIS),metadata concept descriptions from ontologies Ontologies Classifications Domain Models User More Semantics for Relevance to tackle Information Overload!!
  • 12. Semantics “ meaning or relationship of meanings, or relating to meaning” (Webster) is concerned with the relationship between the linguistic symbols and their meaning or real-world objects meaning and use of data (Information System) Example: Palm -> Company, Product, Technology, Tree Name, part of location (Palm Spring, Palm Beach) Semantics, Ontologies (Domain Models), Metamodels, Metadata, Content/Data
  • 13. “The Web of data (and connections) with meaning in the sense that a computer program can learn enough about what the data means to process it . . . . Imagine what computers can understand when there is a vast tangle of interconnected terms and data that can automatically be followed.” (Tim Berners-Lee, Weaving the Web, 1999) A Content Management centric definition of Semantic Web: The concept that Web-accessible content can be organized and utilized semantically, rather than through syntactic and structural methods. Semantics: The Next Step in the Web’s Evolution
  • 14. Next Generation: Semantic Content Management
  • 15. Organizing Content Different and Related Objectives: Search, Browse, Summarization, Association/Relationships Indexing Clustering Classification Controlled Vocabulary, Reference Data/ Dictionary/Thesaurus Metadata Knowledge Base (Entities/Objects and Relationships)
  • 16. Statistical/AI Techniques Customer Article Feed 4715 Classification of Article 4715 Customer Training Set Traditional Text Categorization Routing/Distribution Classify Place in a taxonomy feed Most traditional Content Management Products support Categorization of unstructured content.. Standard Metadata Feed Source : iSyndicate   Posted Date : 11/20/2000
  • 17. Knowledge-base & Statistical/AI Techniques Article Feed 4715 Classification of Article 4715 Customer Training Set & KB Routing/Distribution Classify Place in a taxonomy Taalee Training Set & KB Map to another taxonomy Metadata Catalog Semantic Engine™ Precise Personalization/ Syndication/Filtering Voquette/Taalee’s Categorization & Automatic Metadata Creation feed Standard metadata Semantic metadata FTE Company Analysis Conference Calls Earnings Stock Analysis ENT Company Analysis Conference Calls Earnings Stock Analysis NYSE Member Companies Market News IPOs Automated Content Enrichment (ACE) Article 4715 Metadata Feed Source : iSyndicate   Posted Date : 11/20/2000 Company Name : France Telecom , Equant Ticker Symbol : FTE , ENT Exchange : NYSE Topic : Company News
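The difference between the standard metadata and the semantic metadata shown for Article 4715 on slide 17 can be illustrated with a small sketch. This is not Taalee's Semantic Engine; it is a toy lookup that assumes a hand-built knowledge base and naive substring matching. The `Company` class, the `enrich` function and the sample article sentence are hypothetical; only the company names, tickers, exchange and feed fields come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Company:
    name: str
    ticker: str
    exchange: str


# Hypothetical knowledge base holding the two entities named on the slide.
KNOWLEDGE_BASE = [
    Company("France Telecom", "FTE", "NYSE"),
    Company("Equant", "ENT", "NYSE"),
]


def enrich(text, standard_metadata):
    """Add semantic metadata for every known company mentioned in the text."""
    lowered = text.lower()
    found = [c for c in KNOWLEDGE_BASE if c.name.lower() in lowered]
    semantic = {
        "Company Name": [c.name for c in found],
        "Ticker Symbol": [c.ticker for c in found],
        "Exchange": sorted({c.exchange for c in found}),
    }
    # Standard (feed-level) metadata is kept; semantic metadata is added on top.
    return {**standard_metadata, **semantic}


# Made-up article text, used only to exercise the sketch.
article = "France Telecom said its Equant unit will hold a conference call on earnings."
metadata = enrich(article, {"Feed Source": "iSyndicate", "Posted Date": "11/20/2000"})
```

The ticker and exchange never appear in the article text; they come from the knowledge base, which is the point the slide makes about metadata created beyond the text itself.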
  • 18. Technologies for Organizing Content Information Retrieval/Document Indexing TF-IDF/statistical, Clustering, LSI Statistical learning/AI: Machine learning, Bayesian, Markov Chains, Neural Network Lexical, Natural language Thesaurus, Reference data, Domain models (Ontology) Information Extractors Reasoning/Inferencing: Logic based, Knowledge-based, Rule processing The most powerful solutions require combining several of these, addressing more of the objectives
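Of the techniques listed on slide 18, TF-IDF is the easiest to show in a few lines. The sketch below uses one common textbook formulation (term frequency times the log of inverse document frequency); the function name `tf_idf`, the whitespace tokenizer and the sample documents are assumptions for illustration, and real systems typically add smoothing and normalization.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (documents must be non-empty)."""
    tokenized = [doc.lower().split() for doc in documents]
    total_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(total_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Made-up sample documents.
docs = ["stock earnings call", "stock market news", "stock analysis"]
weights = tf_idf(docs)
```

A term that occurs in every document ("stock" here) gets weight zero, while a term confined to one document gets a positive weight, which is why purely statistical indexing favors distinctive words but knows nothing about what they mean.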
  • 19. Multiple competing standards! Multiple heterogeneous metadata models with different tag names for the same data in the same GIS domain Kansas State FGDC Metadata Model Theme keywords : digital line graph, hydrography, transportation... Title : Dakota Aquifer Online linkage : http://gisdasc.kgs.ukans.edu/dasc/ Direct Spatial Reference Method: Vector Horizontal Coordinate System Definition: Universal Transverse Mercator … … … ... UDK Metadata Model Search terms : digital line graph, hydrography, transportation... Topic : Dakota Aquifer Adress Id: http://gisdasc.kgs.ukans.edu/dasc/ Measuring Techniques: Vector Co-ordinate System: Universal Transverse Mercator … … … ...
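The FGDC and UDK records on slide 19 carry the same values under different tag names. A minimal sketch of homogenizing them, assuming a hand-written mapping to a shared vocabulary, follows. The common field names (`keywords`, `title`, and so on) and the `homogenize` function are hypothetical; the source tag names are copied from the slide, including its spelling of "Adress Id".

```python
# Source tag names as shown on the slide; target names are invented for this sketch.
FGDC_TO_COMMON = {
    "Theme keywords": "keywords",
    "Title": "title",
    "Online linkage": "url",
    "Direct Spatial Reference Method": "spatial_method",
    "Horizontal Coordinate System Definition": "coordinate_system",
}
UDK_TO_COMMON = {
    "Search terms": "keywords",
    "Topic": "title",
    "Adress Id": "url",
    "Measuring Techniques": "spatial_method",
    "Co-ordinate System": "coordinate_system",
}


def homogenize(record, mapping):
    """Rename the tags of one record into the common vocabulary."""
    return {mapping[tag]: value for tag, value in record.items() if tag in mapping}


fgdc_record = {
    "Title": "Dakota Aquifer",
    "Direct Spatial Reference Method": "Vector",
    "Horizontal Coordinate System Definition": "Universal Transverse Mercator",
}
udk_record = {
    "Topic": "Dakota Aquifer",
    "Measuring Techniques": "Vector",
    "Co-ordinate System": "Universal Transverse Mercator",
}

common_from_fgdc = homogenize(fgdc_record, FGDC_TO_COMMON)
common_from_udk = homogenize(udk_record, UDK_TO_COMMON)
```

Once both records are expressed in the common vocabulary they compare equal, so a single query can run over content described under either model.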
  • 20. Basis for Semantics A. Facts/Concepts/Terms/Entities Dictionary, Thesaurus, Reference Data, Vocabulary B. Facts with Relationships Taxonomy/(Categories), Ontology Domain Modeling (e.g., Golf = golfer, tournament name, golf course, event) Knowledge Base
  • 21. Ontology Standardizes meaning, description, representation of involved concepts/terms/attributes Captures the semantics involved via domain characteristics, resulting in semantic metadata “ Ontological Commitment” forms basis for knowledge sharing and reuse Ontology provides semantic underpinning.
  • 22. An Ontology Disaster eventDate description site => latitude, longitude site latitude longitude Natural Disaster Man-made Disaster damage numberOfDeaths damagePhoto Volcano Earthquake NuclearTest magnitude bodyWaveMagnitude conductedBy explosiveYield bodyWaveMagnitude < 10 bodyWaveMagnitude > 0 magnitude < 10 magnitude > 0 Terms/Concepts (Attributes) Functional Dependencies (FDs) Domain Rules Hierarchies
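The hierarchy and domain rules on slide 22 can be approximated in code. The sketch below is one reading of the flattened diagram, not a reconstruction of it: which attributes belong to which class is partly a guess, the Python names are invented, and only the two range rules (`magnitude` and `bodyWaveMagnitude` greater than 0 and less than 10) are taken directly from the slide.

```python
from dataclasses import dataclass


@dataclass
class Disaster:
    event_date: str
    description: str
    latitude: float   # the slide states: site => latitude, longitude
    longitude: float


@dataclass
class NaturalDisaster(Disaster):
    pass


@dataclass
class ManMadeDisaster(Disaster):
    pass


@dataclass
class Earthquake(NaturalDisaster):
    magnitude: float


@dataclass
class NuclearTest(ManMadeDisaster):
    conducted_by: str
    body_wave_magnitude: float


def rule_violations(event):
    """Check the two domain rules from the slide: 0 < value < 10."""
    problems = []
    for attribute in ("magnitude", "body_wave_magnitude"):
        value = getattr(event, attribute, None)
        if value is not None and not 0 < value < 10:
            problems.append(f"{attribute} must be greater than 0 and less than 10")
    return problems


# Made-up instances, used only to exercise the rules.
plausible = Earthquake("2001-01-01", "sample event", 0.0, 0.0, 6.5)
implausible = Earthquake("2001-01-01", "sample event", 0.0, 0.0, 12.0)
```

Extracted metadata that violates a domain rule can be rejected or flagged before it reaches the catalog, which is one practical use of the ontology beyond naming things consistently.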
  • 23. Controlled Vocabularies/ Classifications/Taxonomies/Ontologies WordNet Cyc The Medical Subject Headings (MeSH): NLM's controlled vocabulary used for indexing articles, for cataloging books and other holdings, and for searching MeSH-indexed databases, including MEDLINE . MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts. Year 2000 MeSH includes more than 19,000 main headings, 110,000 Supplementary Concept Records (formerly Supplementary Chemical Records), and an entry vocabulary of over 300,000 terms.
  • 24. Open Directory Project (ODP): Classification/Taxonomy & Directory
  • 25. Metadata Specifications (MetaModels) Metadata Domain Independent (Dublin Core, RDF, DAML+OIL) Frameworks/Infrastructures (XCM, XMI) Function Specific ICE (Syndication) Domain (Application) Specific MARC (Library), FGDC and UDK (Geographic), PRISM (Publishing), FXML (Financial Transactions). RIXML (Buy-Sell Research/Financial Services), IMS Learning Resource (Distance Learning). ….. Media Specific MPEGx, VoiceXML NewsML (News exchange)
  • 26. Types of Specs and Standards (or MetaModels) Domain Independent: (MCF), RDF, (MOF), DublinCore Media Specific: MPEG4, MPEG7, VoiceXML Domain/Industry Specific (metamodels): MARC (Library), FGDC and UDK (Geographic), NewsML (News), PRISM (Publishing), RIXML (Buy-Sell Research/Financial Services) Application Specific: ICE (Syndication), IMS Learning Resource (Distance Learning) Exchange/Sharing: XCM, XMI Orthogonal/(Other): RDFS, namespaces, ontologies, domain models, (DAML, OIL)
  • 27. Dublin Core Metadata Initiative Simple element set designed for resource description International, inter-discipline, W3C community consensus “ Semantic” interface among resource description communities (very limited form of semantics) Source:www.desire.org
  • 28. Dublin Core RDF <xml> <?namespace href = "http://w3.org/rdf-schema" as = "RDF"> <?namespace href = "http://metadata.net/DC" as = "DC"> <RDF:Abbreviated> <RDF:Assertion RDF:HREF = http://www.mysite.com/mydoc.html DC:Title = "I've Never Metadata I've Never Liked" DC:Creator = "Mary Crystal" DC:Subject = "Metadata, Dublin Core, Stuff"/> </RDF:Abbreviated> </xml>
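The markup on slide 28 appears to follow an early draft syntax. As a hedged sketch, the same three Dublin Core statements can be written in RDF/XML with Python's standard `xml.etree.ElementTree`, using the RDF syntax namespace and the Dublin Core 1.1 element namespace. The `describe` function is hypothetical; the title, creator, subject and document address are taken from the slide.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def describe(resource_url, title, creator, subject):
    """Build an rdf:Description with three Dublin Core elements and serialize it."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": resource_url}
    )
    for tag, value in (("title", title), ("creator", creator), ("subject", subject)):
        ET.SubElement(description, f"{{{DC_NS}}}{tag}").text = value
    return ET.tostring(root, encoding="unicode")


rdf_xml = describe(
    "http://www.mysite.com/mydoc.html",
    "I've Never Metadata I've Never Liked",
    "Mary Crystal",
    "Metadata, Dublin Core, Stuff",
)
```

Parsing `rdf_xml` back with `ET.fromstring` and iterating over the `dc:` elements recovers the same three values, which is a quick way to confirm the namespaces were applied as intended.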
  • 29. NewsML Source: http://www.mediabricks.com The content provider supplies NewsML packaged media content to the operator. The content can be categorized as current events, finance, sport, etc. (but no standard is specified) and updated hourly. The operator receives NewsML data from the content provider. The content server automatically pushes updated news articles to all news service subscribers. Consumers sign up for the news service directly on the device. When using the news service, the user browses through the categories and reads the news articles. The news articles are presented in a continuous flow (one after the other) without end-user interaction.
  • 30. NewsML Content-descriptive metadata: <HeadLine>Seattle attacked by Godzilla-like creature, Microsoft closes HQ</HeadLine> <DateLine>Seattle, Was., Aug 30, 2009 /AthensWire via COMTEX/ --</DateLine> <CopyrightLine>Copyright (C) 2009 AthensWire. All rights reserved.</CopyrightLine> Administrative metadata: <Provider><Party FormalName="Comtex" /></Provider> <Source><Party FormalName="AthensWire" /></Source> Rights metadata: <CopyrightDate>2009</CopyrightDate> Descriptive metadata: <Language FormalName="en" /> <Property FormalName="Location" Value="Seattle, Washington, United States, North America" /> <Property FormalName="PublicCompany" Vocabulary="urn:newsml:comtexnews.net:20010201:DomesticPublicCompanies:1"> <Property FormalName="CompanyName" Value="Microsoft Corp." /> <Property FormalName="StockSymbol" Value="MSFT" /><Property FormalName="StockExchange" Value="Nasdaq" /> </Property>
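The nested `Property` elements in the descriptive metadata on slide 30 can be flattened into a simple lookup table. The sketch below uses only the standard library; the wrapping `DescriptiveMetadata` element is an assumption added to make the fragment well-formed on its own, and the `flatten_properties` function and dotted key convention are hypothetical.

```python
import xml.etree.ElementTree as ET

# Property elements copied from the slide, wrapped in an assumed parent element.
FRAGMENT = """
<DescriptiveMetadata>
  <Language FormalName="en" />
  <Property FormalName="Location" Value="Seattle, Washington, United States, North America" />
  <Property FormalName="PublicCompany">
    <Property FormalName="CompanyName" Value="Microsoft Corp." />
    <Property FormalName="StockSymbol" Value="MSFT" />
    <Property FormalName="StockExchange" Value="Nasdaq" />
  </Property>
</DescriptiveMetadata>
"""


def flatten_properties(element, prefix=""):
    """Collect Property values, joining nested FormalNames with a dot."""
    result = {}
    for prop in element.findall("Property"):
        name = prefix + prop.get("FormalName", "")
        if prop.get("Value") is not None:
            result[name] = prop.get("Value")
        # Recurse into child Property elements, if any.
        result.update(flatten_properties(prop, name + "."))
    return result


properties = flatten_properties(ET.fromstring(FRAGMENT))
```

With the properties flattened, a filter such as "stories about companies listed on Nasdaq" becomes a dictionary lookup on `PublicCompany.StockExchange` rather than a text search.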
  • 31. RIXML Financial metadata for Buy/Sell sides Highly domain-specific Schema (see next slide) [from UserGuide, p. 31] Example: MorningCall.xml
  • 33. Metadata Creation and Semanticization Automatic Content Classification/Categorization Metadata Creation/Extraction: Types of metadata created Semantic Engine and WorldModel are trademarks of Taalee, Inc. Metadata Extraction is a patented technology of Taalee, Inc.
  • 34. Content Handling/Ingest Infrastructure/Exchange Feed Handlers Crawlers/Screen Scrapers/Bots Software Agents Centralized, Distributed, or Mobile/Migratory
  • 35. Information Extraction for Metadata Creation METADATA EXTRACTORS Key challenge: Create/extract as much (semantics) metadata automatically as possible WWW, Enterprise Repositories Digital Maps Nexis UPI AP Feeds/ Documents Digital Audios Data Stores Digital Videos Digital Images . . . . . . . . .
  • 36. Extracting a Text Document: Syntactic approach INCIDENT MANAGEMENT SITUATION REPORT Friday August 1, 1997 - 0530 MDT NATIONAL PREPAREDNESS LEVEL II CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires have been staffed for structure protection. SIMELS, Galena District, BLM . This fire is on the east side of the Innoko Flats, between Galena and McGr The fire is active on the southern perimeter, which is burning into a continuous stand of black spruce. The fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern perimeter is 35% contained, while protection of the historic cabin continues. CHINIKLIK MOUNTAIN, Galena District, BLM . A Type II Incident Management Team (Wehking) is assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up. The fire is contained. Major areas of heat have been mopped-up. All crews and overhead will mop-up where the fire burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this weekend, depending on the results of infrared scanning. LAYOUT Date => day month int ‘,’ int
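The layout rule at the end of slide 36, `Date => day month int ',' int`, is a purely syntactic pattern and maps naturally onto a regular expression. The sketch below is one possible encoding, not the extractor used in the talk; it narrows the two `int` slots to a one- or two-digit day and a four-digit year, and the names `DATE_RULE` and `extract_dates` are hypothetical.

```python
import re
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# day month int ',' int  ->  weekday name, month name, day of month, comma, year
DATE_RULE = re.compile(
    r"\b(" + "|".join(DAYS) + r")\s+(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b"
)


def extract_dates(text):
    """Return a datetime.date for every substring matching the layout rule."""
    found = []
    for _weekday, month, day_of_month, year in DATE_RULE.findall(text):
        found.append(date(int(year), MONTHS.index(month) + 1, int(day_of_month)))
    return found


report_header = "INCIDENT MANAGEMENT SITUATION REPORT Friday August 1, 1997 - 0530 MDT"
dates = extract_dates(report_header)
```

The rule finds the report date without knowing anything about fires or Alaska, which is both the strength and the limit of the syntactic approach: it yields the date but not what the date refers to.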
  • 37. Extraction Agent Web Page Enhanced Metadata Asset Taalee Extraction and Knowledgebase Enhancement
  • 38. Automatic Categorization & Metadata Tagging (unstructured text/transcript of A/V) ABSOLUTE CONTROL OF THE SENATE IS STILL IN QUESTION. AS OF TONIGHT, THE REPUBLICANS HAVE 50 SENATE SEATS AND THE DEMOCRATS 49. IN WASHINGTON STATE, THE SENATE RACE REMAINS TOO CLOSE TO CALL. IF THE DEMOCRATIC CHALLENGER UNSEATS THE REPUBLICAN INCUMBENT THE SENATE WILL BE EVENLY DIVIDED. IN MISSOURI, REPUBLICAN SENATOR JOHN ASHCROFT SAYS HE WILL NOT CHALLENGE HIS LOSS TO GOVERNOR MEL CARNAHAN WHO DIED IN A CRASH THREE WEEKS AGO. GOVERNOR CARNAHAN'S WIFE IS EXPECTED TO TAKE HIS PLACE. IN THE HIGHEST PROFILE SENATE EVENT OF THE NIGHT, HILLARY CLINTON WON THE NEW YORK SENATE SEAT. SHE IS THE FIRST FIRST LADY TO RUN MUCH LESS WIN. Video Segment with Associated Text Segment Description Semantic Metadata Auto Categorization
  • 39. Video with Editorialized Text on the Web Automatic Categorization & Metadata Tagging (Web page) Auto Categorization Semantic Metadata
  • 40. Automatic Categorization & Metadata Tagging (Feed) Text From Bloomberg Auto Categorization Semantic Metadata
  • 41. Taalee Metadata on Football Assets Rich Media Reference Page Baltimore 31, Pit 24 http://www.nfl.com Quandry Ismail and Tony Banks hook up for their third long touchdown, this time on a 76-yarder to extend the Ravens’ lead to 31-24 in the third quarter. Professional Ravens, Steelers Bal 31, Pit 24 Quandry Ismail, Tony Banks Touchdown NFL.com 2/02/2000 League: Teams: Score: Players: Event: Produced by: Posted date: Crawler provided text for indexing vs Agent provided semantic metadata Virage Search on football touchdown Jimmy Smith Interview Part Seven Jimmy Smith explains his philosophy on showboating. URL: http://cbs.sportsline... Brian Griese Interview Part Four Brian Griese talks about the first touchdown he ever threw. URL: http://cbs.sportsline... Metadata from Typical Cataloging of Football Assets
  • 42. Traditional Content Management Agent Push Pull Information Extraction Agents Dynamic KB Custom WorldModel Relevant Metadata Enhancement Knowledge Management Aggregation & Metadata Extraction Knowledge Management (Knowledge Base, Domain Model, Metadata) Agent Front End Portal Voquette Semantic Applications Feeds (proprietary formats, standards-based, NewsML) Corporate Repositories Web Sites One Approach to Extending Traditional CM: Voquette’s Semantic Engine Technology Search Personalization Alerts Notifications Custom “research” applications Content Metadata Metadata Metadata Metadata
  • 43. Taalee/Voquette Semantic Platform Architecture Content of all formats, media, push/pull: Web sites/pages: static, dynamic Content Feeds (unstructured, semistructured/docs, tagged/XML) Corporate Repositories/databases Homogenization/integration: with taxonomy (categorization) contextually relevant metadata with respect to domain model, automatically generated from content and inferenced © Taalee Inc.
  • 44. Content which does contain the words the user asked for Extractor Agents Content which does not contain the words the user asked for, but is about what he asked for. Value-added Metadata Content the user did not think to ask for , but which he needs to know . Semantic Associations + + Semantic Content End-User Semantic Content
  • 45. Metadata and Semantic Technology enabled Applications
  • 46. Taalee’s Semantic Search Highly customizable, precise and freshest A/V search Context and Domain Specific Attributes Uniform Metadata for Content from Multiple Sources, Can be sorted by any field Delightful, relevant information, exceptional targeting opportunity
  • 47. Creating a Web of related information What can a context do?
  • 48. Example (test on http://directory.mediaanywhere.com ) Search for company ‘Commerce One’ Links to news on companies that compete against Commerce One Links to news on companies Commerce One competes against (To view news on Ariba, click on the link for Ariba) Crucial news on Commerce One’s competitors (Ariba) can be accessed easily and automatically
  • 49. What else can a context do? (a commercial perspective) Semantic Enrichment Semantic Targeting
  • 50. Semantic/Interactive Targeting Precisely targeted through the use of Structured Metadata and integration from multiple sources Buy Al Pacino Videos Buy Russell Crowe Videos Buy Christopher Plummer Videos Buy Diane Venora Videos Buy Philip Baker Hall Videos Buy The Insider Video
  • 51. Example 1 – Snapshots (“Jamal Anderson”) Click on first result for Jamal Anderson View metadata. Note that Team name and League name are also included in the metadata Search for ‘Jamal Anderson’ in ‘Football’ View the original source HTML page. Verify that the source page contains no mention of Team name and League name . They were Taalee’s value-additions to the metadata to facilitate easier search.
  • 52. Example 2 – Snapshots (“Gary Sheffield”) Click on first result for Gary Sheffield View metadata. Note that Team name and League name are also included in the metadata Search for ‘Gary Sheffield’ in ‘Baseball’ View the original source HTML page. Verify that the source page contains no mention of Team name and League name . They were Taalee’s value-additions to the metadata to facilitate easier search.
  • 53. Semantic Web – Intelligent Content (supported by Taalee Semantic Engine) Related Stock News Industry News Technology Products COMPANY EPA Regulations Competition COMPANIES in Same or Related INDUSTRY COMPANIES in INDUSTRY with Competing PRODUCTS Impacting INDUSTRY or Filed By COMPANY Important to INDUSTRY or COMPANY SEC Intelligent Content = What You Asked for + What you need to know!
  • 54. Semantic Application – Equity Dashboard Focused relevant content organized by topic ( semantic categorization ) Automatic Content Aggregation from multiple content providers and feeds Related news not specifically asked for (Semantic Associations) Competitive research inferred automatically Automatic 3 rd party content integration
  • 55. Internal Source 1 Research Internal Source 2 External feeds/Web (e.g. Reuters) Voquette Metabase World Model Third-party Content Mgmt And Syndication Semantic Engine 1 2 3 4 Cisco story from Source 1 passed on to add semantic associations Consults Knowledge Base for Cisco ’s competition Returns result: Lucent is a competitor of Cisco Lucent story from external feeds picked for publishing as “semantically related” to Cisco story – passed on to Dashboard Story on Lucent Story on Cisco XCM-compliant metadata, XML or other format Semantic Application ASP/Enterprise hosted Extractor Agent 1 Extractor Agent 2 Extractor Agent 3 Metadata centric Content Management Architecture
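The four-step flow on slide 55 (a Cisco story arrives, the knowledge base is consulted for Cisco's competition, Lucent is returned, and a Lucent story is picked as semantically related) can be sketched as a lookup over story metadata. Everything below is illustrative: the `Story` class, the `COMPETES_WITH` table, the `semantically_related` function, and the headlines and the Intel row are invented; only the Cisco/Lucent relationship and the source labels come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    headline: str
    company: str   # company metadata produced by an extractor agent
    source: str


# Hypothetical knowledge base of "competes with" relationships.
COMPETES_WITH = {"Cisco": {"Lucent"}, "Lucent": {"Cisco"}}


def semantically_related(story, pool, knowledge_base=COMPETES_WITH):
    """Pick stories about competitors of the company the given story is about."""
    rivals = knowledge_base.get(story.company, set())
    return [candidate for candidate in pool if candidate.company in rivals]


# Made-up stories standing in for the internal source and the external feed.
cisco_story = Story("Story on Cisco", "Cisco", "Internal Source 1")
external_feed = [
    Story("Story on Lucent", "Lucent", "External feed"),
    Story("Story on Intel", "Intel", "External feed"),
]
related = semantically_related(cisco_story, external_feed)
```

The Lucent story never mentions Cisco and the user never asked for it; the association exists only because both stories carry company metadata and the knowledge base links the two companies.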
  • 56. Wireless Application of Semantic Metadata and Automatic Content Enrichment Clicking on the link for Cisco Analyst Calls displays a listing sorted by date. Semantic filtering uses just the right metadata to meet screen and other constraints. E.g., Analyst Call focuses on the source and analyst name or company. The icon denotes additional metadata, such as “Strong Buy” by H&Q Analyst. MyStocks News Sports Music MyMedia $ My Stocks CSCO NT IBM Market CSCO Analyst Call Conf Call Earnings 11/08 ON24 Payne 11/07 ON24 H&Q 11/06 CBS Langlesis CSCO Analysis
  • 57. Scene Description Tree Retrieve Scene Description Track “ NSF Playoff” Node Enhanced XML Description MPEG-2/4/7 Enhanced Digital Cable Video MPEG Encoder MPEG Decoder Node = AVO Object Voquette/Taalee Semantic Engine Produced by: Fox Sports Creation Date: 12/05/2000 League: NFL Teams: Seattle Seahawks, Atlanta Falcons Players: John Kitna Coaches: Mike Holmgren, Dan Reeves Location: Atlanta Object Content Information (OCI) Metadata-rich Value-added Node Create Scene Description Tree GREAT USER EXPERIENCE Metadata’s role in emerging iTV infrastructure Channel sales through Video Server Vendors, Video App Servers, and Broadcasters License metadata decoder and semantic applications to device makers “ NSF Playoff”
  • 58. Metadata for Automatic Content Enrichment Interactive Television This segment has embedded or referenced metadata that is used by personalization application to show only the stocks that user is interested in. This screen is customizable with interactivity feature using metadata such as whether there is a new Conference Call video on CSCO. Part of the screen can be automatically customized to show conference call specific information– including transcript, participation, etc. all of which are relevant metadata Conference Call itself can have embedded metadata to support personalization and interactivity.
  • 59. Semantic Technology Features Unstructured Text Content Semi-Structured Content Structured Content Audio/Video Content with associated text (transcript, journalist notes) Create a Customized "World Model" (Taxonomy Tree with customized domain attributes) Automatically homogenize content feed tags Automatically categorize unstructured text Automatically create tags based on text itself Create and maintain a Customized Knowledge Base for any domain Automatically enhance content tags based on information beyond text Build contextually relevant custom research applications Contextual Search (an order of magnitude better than keyword-based search) Support push or pull delivery/ingestion of content Personalization/Alerts/Notifications Real Time Indexing (stories indexed for search/personalization within a minute) Provide the user with relevant information not explicitly asked for (Semantic Associations)
  • 60. Along with the evolution of metadata and semantic technologies enabling the next generation of the Web, Content Management has entered the next generation of Enhanced Content Management.
  • 61. Resources/References RDF: www.w3.org/TR/REC-rdf-syntax/ ICE: www.icestandard.org Meta Object Facility (MOF) Specification, Version 1.3, September 27, 1999: http://cgi.omg.org/cgi-bin/doc?ad/99-09-05 XML Metadata Interchange (XMI) Specification, Version 1.1, October 25, 1999: http://cgi.omg.org/cgi-bin/doc?ad/9910-02 http://cgi.omg.org/cgi-bin/doc?ad/99-10-03 DAML: www.daml.org NEWSML: newsshowcase.reuters.com PRISM: www.prismstandard.org/techdev/prismspec1.asp RIXML: www.rixml.org XCM: www.vignette.com OIL: www.ontoknowledge.org/oil SEMANTICWEB: www.semanticweb.org , business.semanticweb.org VOICEXML: www.voicexml.org MPEG7: www.darmstadt.gmd.de/mobile/MPEG7/ Taalee: www.taalee.com Applied Semantics: www.appliedsemantics.com Ontoprise: www.ontoprise.com
  • 62. Multimedia Data Management: Using Metadata to Integrate and Apply Digital Media, Amit Sheth & Wolfgang Klas, Eds., McGraw Hill, ISBN: 0-07-057735-8, 1998. Information Brokering, Vipul Kashyap & Amit Sheth, Kluwer Academic Publishers, 2001. Voquette Semantic Technology White Paper. Mysteries of Metadata, Speaker – Amit Sheth, Workshop at Content World 2001. Infoquilt Project, LSDIS lab. http://www.taalee.com http://lsdis.cs.uga.edu/~amit

Editor's Notes

  • #14: 04/29/10 Taalee Proprietary & Confidential. Do not copy or distribute.
  • #17: Companies in categorization field: Autonomy, Metacode (bought by Interwoven), Semio, Inxight, etc. Typical strategies employed by competition: Statistical/AI/Parsing/NLP/Rules-based/Collaborative Filtering Result: Partial success in categorization Placement of a document in a node, solely based on above strategies (nothing to do with metadata describing it – the basis behind semantics) Resulting classification – rigid/static/ambiguous/fuzzy Captures only standard physical metadata (source, date, length etc.), which is often useless for categorization purposes
  • #18: Taalee performs categorization by placing importance on semantic metadata extracted from any document Strategies employed by Taalee: Knowledge-based/Statistical/Rules-based/AI techniques Result: Complete success in categorization! Precise category/categories identified for classifying document Resulting classification – flexible/dynamic/unambiguous/crisp Value-added metadata generated to capture the context/gist of the document Metadata => Great potential for Automated Content Enrichment (ACE) Classifying into or mapping to other taxonomies possible Promise to greatly enhance the current functioning of Content Manager and Syndication Software/Service
  • #22: Why? What is its use?