Incorporating Site-Level Knowledge to
Extract Structured Data from Web Forums

            Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma
                                             Web Search & Mining Group
                                                 Microsoft Research Asia


                                                             2009-04



Saturday, May 22, 2010
Web Forum Data
      • An important information resource containing a great deal of human
        knowledge.

      • Topics covered include recreation, sports, games,
        computers, art, society, science, home, and health.

      • 20% of the pages in search results are from forums.




Understanding Forum

      • Crawling
            – WWW’08: iRobot: An Intelligent Crawler for Web Forums
            – SIGIR’08: Exploring Traversal Strategy
            – KDD’09: Incremental Crawling

      • Data Extraction
            – WWW’09: Automation Data Extraction

      • Quality Assessment
            – SIGIR’09: Quality Assessment



Challenge

     • Leverage more site-level knowledge




Forum Sitemap
        • A sitemap is a directed graph consisting of a set of vertices and
          the corresponding links

    • Rui Cai, Jiangming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference
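
The sitemap definition above can be sketched as a small directed-graph structure. This is only an illustration: it assumes each vertex stands for a group of pages that share a layout, and the vertex names (board_list, thread_list, post_page, user_profile) are hypothetical, not taken from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sitemap:
    """Directed graph: vertices are page groups, edges are links between groups."""
    vertices: set = field(default_factory=set)
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_link(self, source: str, target: str) -> None:
        # Register both endpoints, then record the directed edge.
        self.vertices.update((source, target))
        self.edges[source].add(target)

    def successors(self, vertex: str) -> set:
        # Return a copy so callers cannot mutate the graph by accident.
        return set(self.edges.get(vertex, set()))


# Hypothetical forum layout, for illustration only.
sitemap = Sitemap()
sitemap.add_link("board_list", "thread_list")
sitemap.add_link("thread_list", "thread_list")  # pagination links back to the same group
sitemap.add_link("thread_list", "post_page")
sitemap.add_link("post_page", "user_profile")
```

A self-loop such as thread_list to thread_list is kept as an ordinary edge, which is one simple way to model pagination within a vertex.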



Page Clustering
    • Forum pages are generated from a database and a template
    • Layout is a robust way to describe a template
          – Layout can be characterized by the HTML elements in
            different DOM paths
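
As a rough sketch of the idea that layout can be characterized by HTML elements in different DOM paths, the code below counts elements per DOM path with Python's standard html.parser and compares two pages by the Jaccard overlap of their path sets. The Jaccard measure, the VOID_TAGS list, and the sample pages are assumptions made for illustration; this is not the clustering method from the talk.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag (a partial list, chosen for this sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DomPathCounter(HTMLParser):
    """Counts how many elements occur at each root-to-element tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed children.
            while self.stack.pop() != tag:
                pass


def dom_path_features(html):
    parser = DomPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def layout_similarity(a, b):
    """Jaccard overlap of the DOM path sets of two pages."""
    pa, pb = set(a), set(b)
    return len(pa & pb) / len(pa | pb) if pa | pb else 1.0


# Two list-like pages from the same (made-up) template, and one post-like page.
list_a = dom_path_features("<html><body><table><tr><td><a href='t1'>x</a></td></tr></table></body></html>")
list_b = dom_path_features("<html><body><table><tr><td><a href='t2'>y</a></td></tr>"
                           "<tr><td><a href='t3'>z</a></td></tr></table></body></html>")
post = dom_path_features("<html><body><div><p>hello</p></div></body></html>")
```

Pages rendered from the same template share their path set even when the number of records differs, so list_a and list_b score as identical while the post page scores low against both.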




Page Clustering

    • DOM path feature discovery
    • Clustering by virtual tables




Link Analysis

                         A Link = URL Pattern + Location
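
One way to read "A Link = URL Pattern + Location" is as a pair: a generalized form of the URL plus the DOM path where the anchor appears. The sketch below is an assumption about how such a pair could be built. The masking rules, the function names url_pattern and link_signature, and the example URLs are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlsplit


def url_pattern(url):
    """Generalize a URL: keep the path and parameter names, mask the values."""
    parts = urlsplit(url)
    # Replace digit runs in the path with a placeholder.
    path = re.sub(r"\d+", "{n}", parts.path)
    # Keep only the sorted query parameter names.
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    query = "?" + "&".join(f"{k}=*" for k in keys) if keys else ""
    return path + query


def link_signature(url, dom_path):
    """A link is described by its URL pattern plus its location in the page."""
    return (url_pattern(url), dom_path)


# Two thread links found in the same place share a signature.
sig_a = link_signature("http://example.com/forum/viewthread.php?tid=123&page=2",
                       "html/body/table/tr/td/a")
sig_b = link_signature("http://example.com/forum/viewthread.php?page=1&tid=999",
                       "html/body/table/tr/td/a")
```

Links with the same pattern but a different location (for example, the same thread URL in a sidebar) get a different signature, which is the point of combining the two parts.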



Inner-Page Features
    • The inclusion relation. Data records usually have inclusion relations.

    • The alignment relation. Since data is generated from a database and
      rendered via templates, data records with the same label may
      appear repeatedly in a page.

    • Time order. Since post records are generated sequentially along the
      timeline, the post times should be sorted in ascending or descending order.
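
The time-order feature lends itself to a direct check. The following minimal sketch assumes the candidate post times have already been parsed into datetime objects; the function name is_time_ordered and the sample timestamps are hypothetical.

```python
from datetime import datetime


def is_time_ordered(timestamps):
    """True if the sequence is sorted in ascending or descending order."""
    pairs = list(zip(timestamps, timestamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


# Candidate post times as they appear from top to bottom on a page.
fmt = "%Y-%m-%d %H:%M"
in_order = [datetime.strptime(s, fmt) for s in
            ["2009-04-01 09:00", "2009-04-01 09:30", "2009-04-02 08:15"]]
shuffled = [datetime.strptime(s, fmt) for s in
            ["2009-04-02 08:15", "2009-04-01 09:00", "2009-04-01 09:30"]]
```

A set of elements whose values fail this check is unlikely to be the post-time field, so the check can serve as evidence for or against a candidate labeling.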




Inner-vertex Features

Inter-vertex Features

Problem Setting

                         Author     Title     Content




Formulas of list page
    • Formulas for identifying list record
    • Formulas for identifying list title

Formulas of post page
    • Formulas for identifying post record
    • Formulas for identifying post author
    • Formulas for identifying post time
    • Formulas for identifying post content

Markov Logic Networks
      • An MLN can be viewed as a template for constructing Markov
        Random Fields.

      • With a set of formulas and constants, MLNs define a Markov
        network with one node per ground atom and one feature per
        ground formula. The probability of a state x in such a network
        is given by:
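
In the standard Markov logic formulation (Richardson and Domingos), that probability is P(X = x) = (1/Z) exp(sum_i w_i n_i(x)), where w_i is the weight of formula i, n_i(x) is the number of its true groundings in x, and Z normalizes over all states. The sketch below evaluates this by brute-force enumeration for a toy network. The two ground atoms and the single weighted rule are hypothetical and are not the formulas from the talk.

```python
import itertools
import math


def world_probability(world, atoms, ground_formulas):
    """P(X = x) = exp(sum_i w_i * n_i(x)) / Z, computed by enumerating all worlds."""

    def score(w):
        # Each ground formula contributes its weight when it holds in world w.
        return math.exp(sum(weight for weight, holds in ground_formulas if holds(w)))

    z = sum(
        score(dict(zip(atoms, values)))
        for values in itertools.product([False, True], repeat=len(atoms))
    )
    return score(world) / z


# Hypothetical ground atoms about one DOM element e.
atoms = ["IsLink(e)", "IsTitle(e)"]
# Hypothetical soft rule with weight 1.5: IsTitle(e) => IsLink(e).
ground_formulas = [(1.5, lambda w: (not w["IsTitle(e)"]) or w["IsLink(e)"])]

p_violating = world_probability({"IsLink(e)": False, "IsTitle(e)": True}, atoms, ground_formulas)
p_satisfying = world_probability({"IsLink(e)": True, "IsTitle(e)": True}, atoms, ground_formulas)
```

A world that violates the rule is less probable but not impossible, which is the difference between a weighted formula and a hard constraint. Enumeration is exponential in the number of ground atoms, so it only works for toy cases like this one.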




Markov Logic Networks
      • Divide DOM tree elements into three categories:

            – Text element
            – Hyperlink element
            – Inner element

      • Benefit

            – Reduce the number of possible groundings in inference.

            – Reduce the ambiguity and achieve better performance.
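
The three categories above could be assigned roughly as follows. The exact definitions are not given on the slide, so this sketch assumes that a text element is a non-empty text node, a hyperlink element is an <a> tag, and every other tag is an inner element. The class and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class ElementCategorizer(HTMLParser):
    """Counts DOM elements per category under the assumed definitions."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Anchors are hyperlink elements; all other tags count as inner elements.
        self.counts["hyperlink" if tag == "a" else "inner"] += 1

    def handle_data(self, data):
        # Ignore whitespace-only text nodes.
        if data.strip():
            self.counts["text"] += 1


def categorize(html):
    parser = ElementCategorizer()
    parser.feed(html)
    parser.close()
    return parser.counts


counts = categorize("<div><a href='/t/1'>Title</a><span>by alice</span></div>")
```

Typing elements this way lets each formula be grounded only over the category it applies to (for example, a title formula over hyperlink elements), which is how the split can reduce the number of groundings.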


Experiments

                         List Pages    Post Pages

Future works

                                           http://discussions.apple.com/

Conclusion
      • A template-independent approach to extracting
        structured data from web forum sites.

      • We can leverage the power of site-level information,
        such as the mutual information among pages
        within or across vertices of the sitemap.

      • http://research.microsoft.com/people/jmyang/

