Big Data Challenges: Getting Some
March 31, 2011




Gil Elbaz
   @factual
   @gilelbaz
Road to Information Singularity




                          Confidential
Networks Underlying Information Flow


    • Density: number of connecting paths
    • Plasticity: ease of forming new paths
    • Speed & Flow: rate of information transfer

    http://www.bloggingforbusinessbook.com/blogging-for-business/
The Internet

    http://www.amazon.com/Digital-Crossroads-American-Telecommunications-Internet/dp/0262140918
Search Engines




Social Networks: Facebook

    http://2.bp.blogspot.com/

    600 million Facebook users
    130 average friends
    8 friend requests / month
    15 messages / day / user
Trending of Unfriending




Unfriending




Another Network: The Brain




                  100 billion neurons

               1000 ‘hardwired’ synapses
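As a rough sense of scale, the figures quoted on the Facebook and brain slides can be turned into connection counts. This is a back-of-the-envelope sketch only: it assumes the "1000 synapses" figure is per neuron and that every Facebook friendship is mutual, neither of which the slides state explicitly.

```python
# Figures quoted on the slides. Reading "1000 synapses" as a per-neuron
# count is an assumption made for this illustration.
FACEBOOK_USERS = 600_000_000
AVG_FRIENDS = 130
NEURONS = 100_000_000_000
SYNAPSES_PER_NEURON = 1_000

# A mutual friendship is counted once per user, so halve the total.
facebook_edges = FACEBOOK_USERS * AVG_FRIENDS // 2
brain_edges = NEURONS * SYNAPSES_PER_NEURON

print(f"Facebook friendships: {facebook_edges:,}")
print(f"Brain synapses:       {brain_edges:,}")
print(f"Brain / Facebook:     {brain_edges / facebook_edges:,.0f}x")
```

Under these assumptions the social graph has about 39 billion connections and the brain about 100 trillion, a gap of roughly 2,500 times.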




                       http://welcometo[?]mart.com/2011/01
Web 3.0: Data Web




Web Scale Data = More Pain


                     Findability
                       Access
                       Rights
                     Economics
                     Standards
             Integration & Aggregation
                        Trust
Web 2.0 Model: Scale-Free Networks

    www.futureexploration.net
Book Data: Progress Being Made




    Google Book Search API
      Open Library Books API
         ISBNdb
           Amazon API
             LibraryThing
                GoodReads
                   WorldCat


Google Book Search API                 Amazon API
    Open Library Books API                LibraryThing
         ISBNdb     WorldCat           GoodReads

            Findability            Access
            Rights                 Economics
            Standards              Trust
Another Case Study: Local Data

    http://stevecheney.posterous.com/
Another Case Study: Local Data

    Possibilities are endless, but HOW DO YOU GET THE DATA?

     Twitter                                       Twitter
     Facebook        Examine Twitter sentiment     Facebook
                     (avoid dirty coffee shops)
     Loopt                                         Loopt
     Foursquare      Identify areas of highest     Foursquare
                     bike thefts
     Yelp                                          Yelp
     Zillow                                        Zillow
                     Correlate check-ins with
     Google          property values               Google
     Home Junction                                 Home Junction
HomeJunction




Factual is an Example of a New Information Network

      Citizens    Enterprises    Government    Machine Learned

                           INBOUND DATA

                 Aggregate      Mash      Curate
                   Dedupe       Canonicalize

          Developers      Publishers          Search Engines

        New Services, Capabilities, Software Apps
Factual’s Open Data Model

  Free; access via APIs, SDKs, and downloads, BUT…
     we ask you to contribute back into the ecosystem.

                                           Benefits

                                           • Drive down costs
                                           • Rapid iteration
                                           • Differentiate on user experience
                                           • Only need small % participation from world
                                              (e.g. Wikipedia)
Equivalence Measurements




                     =?
    Subway Sandwiches                 Subway
    52 E Court St                     52 West Court St
    Cincinnati, OH                    45202
    (513)-241-6699                    (800)-653-2323
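
One way to put a number on the "=?" question is to compare the two listings field by field and combine the results into a single score. The sketch below is illustrative only: the field weights, the use of `difflib` string similarity, and the `equivalence_score` helper are assumptions for this example, not the method described in the talk.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def digits(phone):
    """Keep only the digits of a phone number."""
    return re.sub(r"\D", "", phone)


def equivalence_score(rec_a, rec_b, weights=None):
    """Weighted agreement of name, address, and phone (weights are hypothetical)."""
    weights = weights or {"name": 0.4, "address": 0.4, "phone": 0.2}
    name = similarity(rec_a["name"], rec_b["name"])
    address = similarity(rec_a["address"], rec_b["address"])
    phone = 1.0 if digits(rec_a["phone"]) == digits(rec_b["phone"]) else 0.0
    return (weights["name"] * name
            + weights["address"] * address
            + weights["phone"] * phone)


# The two listings from the slide.
a = {"name": "Subway Sandwiches", "address": "52 E Court St", "phone": "(513)-241-6699"}
b = {"name": "Subway", "address": "52 West Court St", "phone": "(800)-653-2323"}
score = equivalence_score(a, b)
print(f"equivalence score: {score:.2f}")
```

The score lands in a middle band, which is the point of the slide: the names and street numbers agree, but "E" versus "West" and the different phone numbers argue against a match, so a threshold alone cannot settle it.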


Large-Scale Aggregation Technologies




Large-Scale Aggregation Technologies

                      Apartments = Apts
                      Center = Ctr
                      Corp = Corporation
                      Service = Svc
                      Attorney = Atty
                      Assoc = Associates
                      Inc = Incorporated
                      Assn = Association
                      Co = Company
                      Mount = Mt
                      Bros = Brothers
                      BBQ = Barbeque .....
                      [?]ori = Ted
Large-Scale Aggregation Technologies

                      Restaurant = Rstrnt
                      Restaurant = Restuarant
                      Hosp = Hospital
                      Billards = Billiards
                      Salon = Sln
                      Buffet = Buffett
                      Center = Ctr
                      Apartments = Apts
                      Boutique = Btq
                      Jewelers = Jewlers
                      Cleaners = Clnrs
                      Market = Mktz .....
                      Kragen = O'Reilly ?
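
Equivalences of this kind (Apts/Apartments, Ctr/Center, Jewlers/Jewelers) can be applied as a token-level lookup before names are compared. The sketch below is a minimal illustration: the `CANONICAL` table is hand-written here, and which side counts as the canonical form is an assumption. In a large-scale system such pairs would presumably be learned from the data rather than typed in.

```python
import re

# Variant -> canonical form. The choice of the long, correctly spelled form
# as canonical is an assumption made for this example.
CANONICAL = {
    "apts": "apartments", "ctr": "center", "corp": "corporation",
    "svc": "service", "atty": "attorney", "assoc": "associates",
    "inc": "incorporated", "assn": "association", "co": "company",
    "mt": "mount", "bros": "brothers", "bbq": "barbeque",
    "rstrnt": "restaurant", "restuarant": "restaurant",
    "hosp": "hospital", "billards": "billiards", "sln": "salon",
    "btq": "boutique", "jewlers": "jewelers", "clnrs": "cleaners",
}


def canonicalize(name):
    """Lowercase a business name and replace known variants token by token."""
    tokens = re.findall(r"[a-z0-9']+", name.lower())
    return " ".join(CANONICAL.get(token, token) for token in tokens)


print(canonicalize("Smith Bros. Jewlers Inc"))
print(canonicalize("Joe's BBQ Rstrnt"))
```

A pair like Kragen and O'Reilly cannot be handled this way: it reflects a rebranding rather than a spelling variant, so it needs outside knowledge instead of a string rule.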
Kragen O'Reilly?




Large-Scale Deduping




   • Specialized data compression & folding techniques
   • Eliminate redundant entities - endpoints and authority pages
   • Improves precision & recall
   • Enables real-time dedupe and crosswalks
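
The general shape of dedupe plus crosswalk can be sketched as: fold each record into a compact key, group records that share a key into one entity, and remember which source IDs map to which entity. This is a toy illustration, not Factual's technique; the `fold_key` rule (normalized name plus postcode), the record fields, and the source ID format are all hypothetical.

```python
import re
from collections import defaultdict


def fold_key(record):
    """Fold a record into a compact key: squashed name plus postcode."""
    name = re.sub(r"[^a-z0-9]", "", record["name"].lower())
    return (name, record["postcode"])


def dedupe(records):
    """Group records by folded key; return (entities, crosswalk)."""
    groups = defaultdict(list)
    for record in records:
        groups[fold_key(record)].append(record)

    entities = []
    crosswalk = {}  # source ID -> entity ID
    for entity_id, (key, members) in enumerate(groups.items()):
        entities.append({
            "entity_id": entity_id,
            "name": members[0]["name"].strip(),
            "postcode": key[1],
            "sources": [m["source_id"] for m in members],
        })
        for member in members:
            crosswalk[member["source_id"]] = entity_id
    return entities, crosswalk


records = [
    {"source_id": "a:1", "name": "Subway", "postcode": "45202"},
    {"source_id": "b:7", "name": "SUBWAY ", "postcode": "45202"},
    {"source_id": "c:3", "name": "Subway", "postcode": "90067"},
]
entities, crosswalk = dedupe(records)
print(len(entities), "entities;", crosswalk)
```

Because the key is computed from the record alone, a new record can be matched with a single lookup, which is one way to make dedupe work in real time. Exact-key matching misses near-duplicates, so a real system would pair it with fuzzier comparison.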

Shared Foundational Data

  • Commoditization of data
  • Head attributes for people, places, things decreasing in value
    • hCard data value driven to zero (visual of local data being
       identical on thousands of apps)
    • Entertainment: IMDb exposed all their data for non-commercial
       use (link to site map)
    • Yet there are still lots of errors in foundational data, so a
       “living” model is needed
LA Neighborhoods: Another Crowdsourcing Example




 • LA Times started with 87 neighborhoods based on census tracts
 • Incorporated 650+ user maps
 • Ended with 114 neighborhoods for LA City
 • Added 158 more neighborhoods for LA County
Ownership & Rights: LA Neighborhoods


  • Terms of Service: Creative Commons Attribution,
    Noncommercial, Share-Alike license
  • Can share and remix as long as it’s for noncommercial
    uses, attributed to the LA Times, and shared under
    the same terms
Evolving “Buy” Model


 • Data Marketplaces (“iTunes of data?”)
 • Data Search Engines
 • Microformats / Semantic Web Markups / Other Standards
 • Electronic Forms of T&Cs
Summary: Road to the Information Singularity


 • Rise in community storage and access
 • New common schemas and standards
 • Definitive, accountable sources of “open” data
 • Trends towards sharing of foundational data
 • 'Buy' models based on unique data, novel access
   methods, SLAs, value-added services
Thank you!
              Questions......

Gil Elbaz
  @factual
  @gilelbaz

Factual 2011 Web 2.0 Presentation
