Build Your Own Search Engine
Jeff Barr, Web Services Evangelist, Amazon Web Services
NGW044
Agenda
  • Amazon Web Services Overview
  • Looking Back
  • Build Your Own Search Engine
  • Q&A
Introduction And Background
  • Software development background
  • Veteran of several startups
  • Visual Studio team at Microsoft (DHTML, XML, Web Services)
  • 3.5 years with Amazon
  • Amazon Web Services Evangelist
What Is Amazon?
  • Online Retailer
      • Over 55 million active customer accounts
      • Seven countries: US, UK, Germany, Japan, France, Canada, China
  • Technology Consumer
      • Multi-National Web Sites
      • Vast Data Warehouse – 25 TB
      • World-Class Logistics – 21 fulfillment centers; 9 million ft²
  • Technology Provider
      • Hundreds of thousands of Amazon Associates
      • Over 1,050,000 active seller accounts
      • Over 150,000 software developers registered to use Amazon Web Services
What Is Alexa?
  • Amazon subsidiary since 1999
  • Alexa Toolbar
  • Web metrics
  • Traffic rankings
  • Web crawling
What Is Amazon Web Services?
  • APIs that give developers programmatic access to Amazon’s data and technology
  • Building-block web services
  • Web-scale infrastructure
  • E-commerce capability
  • Content, data, and information
  • New business models
  • Customer-created content
AWS Product Family
  • Amazon E-Commerce Service
      • Complete access to Amazon’s product catalog
      • Free + Associates commissions paid
  • Amazon Historical Pricing
      • Data warehouse access for product pricing
      • Monthly fee
  • Amazon Mechanical Turk
      • Artificial Artificial Intelligence
      • 10% commission
      • Paid workforce
  • Amazon Simple Queue Service
      • IT building block
      • In beta
  • Amazon S3
      • Storage for the internet
      • Charge by storage/bandwidth usage
  • Alexa Web Information Service
      • Data warehouse access for web crawl data
      • 10K calls per month free, then 15 cents per 1000 calls
  • Alexa Top Sites
      • Top sites by Alexa traffic rank
      • Charges by URL
  • Alexa Web Search Platform
      • Roll your own search engine
      • Pay for time, storage, bandwidth
Amazon S3: Simple Storage Service
  • Storage for the internet: a web service to read and write data
  • 15 cents per gigabyte-month to store data
  • 20 cents per gigabyte to access data
  • Private and public storage
  • Scalable, reliable, cost-effective, and simple!
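To make the two S3 rates concrete, here is a minimal sketch of a monthly bill estimate. It uses only the 2006 prices quoted on the slide; the function name is illustrative, and the assumption that charges are simply pro-rated per gigabyte (with no minimums or tiers) is mine, not something the slide states.

```python
# Rates quoted on the slide, kept in whole cents to avoid rounding surprises.
STORAGE_CENTS_PER_GB_MONTH = 15   # 15 cents per gigabyte-month stored
TRANSFER_CENTS_PER_GB = 20        # 20 cents per gigabyte accessed


def s3_monthly_cost_cents(stored_gb: int, transferred_gb: int) -> int:
    """Estimate one month's charge in cents (assumes simple per-GB pro-rating)."""
    if stored_gb < 0 or transferred_gb < 0:
        raise ValueError("gigabyte amounts must be non-negative")
    return (stored_gb * STORAGE_CENTS_PER_GB_MONTH
            + transferred_gb * TRANSFER_CENTS_PER_GB)


if __name__ == "__main__":
    cents = s3_monthly_cost_cents(stored_gb=100, transferred_gb=50)
    print(f"Estimated monthly charge: ${cents / 100:.2f}")
```

Storing 100 GB and transferring 50 GB in a month would come to 1500 + 1000 cents under these assumptions.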
Looking Back
Getting Online
  • History lesson: 1996 vs. 2006
  • A lot has changed
  • Let’s take a look
Going Online Then and Now
What does it take to bring a simple web site online?
  • Domain registration
  • DNS support
  • Network connection
  • Server hardware
  • Development tools
  • Publicity vehicle
  • Monetization system
Then And Now: Domain Registration
  • Then
      • Expensive ($70/year)
      • Single vendor
      • Multi-step, multi-day process
  • Now
      • Cheap ($10 or less / year)
      • Dozens of vendors
      • Single step, 10 minute process
Then And Now: DNS Support
  • Then
      • Leech off of friend or university
      • Long propagation times
      • Complicated
      • Days to understand & set up
  • Now
      • Free services (e.g. ZoneEdit)
      • Very short propagation time
      • Minutes to understand & set up
Then Versus Now: Network Connection
  • Then
      • 9600 baud modem
      • ISDN
      • T1
      • Expensive
  • Now
      • DSL
      • Dedicated hosting
      • Cheap
Then Versus Now: Server Hardware
  • Then
      • Start with dedicated PC
      • Upgrade to expensive Sun hardware
  • Now
      • Build your own PC
      • Hosting providers (EV1, BocaCom, Server Beach)
      • Expensive Sun hardware
 
Then And Now: Development Tools
  • Then
      • Text Editor
      • Shell Window
  • Now
      • Visual Web Developer
      • HTML Kit
      • Front Page
Then Versus Now: Publicity Vehicle
  • Then
      • Yahoo What’s New
      • Usenet
      • Press Release
      • Wired Magazine
  • Now
      • Blogs / RSS / Pings
      • Link sites
      • Word of Mouth
Then Versus Now: Monetization System
  • Then
      • Money? We are purists and we are doing this for fun!
      • Banner ads
      • Ad sales people
      • Large sites only
  • Now
      • Pay per click
      • Self serve
      • Monetize page views
Then: Building a Search Engine
  • Lots of Servers
  • Lots of Bandwidth
  • Lots of Software
  • Lots of Money
  • Lots of Intellectual Capital
  • Lots of Time
Now: Building a Search Engine
  • Use our infrastructure
  • Leverage Alexa’s Crawl
  • Alexa Web Search Platform
  • 300 TB Archive
  • 10 Billion web pages
  • Pay as you go
AWSP: Alexa Web Search Platform
  • Build your own search engine!
  • Process
      • Specify pages to access within the 300 TB archive
      • Write parallelizable application to process pages
      • Publish results as XML feed or as web service
  • Pricing – each of the following costs $1
      • 50 GB of data processing
      • 1 CPU hour
      • 1 GB of data downloaded
      • 4000 web service requests
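The "$1 per unit" pricing can be turned into a rough cost estimate. The sketch below takes the four unit sizes from the slide; the function name and the assumption that usage is billed proportionally (rather than rounded up to whole units) are illustrative guesses, not documented AWSP behavior.

```python
# What one US dollar buys, per the slide.
GB_PROCESSED_PER_USD = 50
CPU_HOURS_PER_USD = 1
GB_DOWNLOADED_PER_USD = 1
REQUESTS_PER_USD = 4000


def awsp_cost_usd(gb_processed: float, cpu_hours: float,
                  gb_downloaded: float, web_service_requests: int) -> float:
    """Sum the four metered charges, assuming proportional billing."""
    return (gb_processed / GB_PROCESSED_PER_USD
            + cpu_hours / CPU_HOURS_PER_USD
            + gb_downloaded / GB_DOWNLOADED_PER_USD
            + web_service_requests / REQUESTS_PER_USD)


if __name__ == "__main__":
    total = awsp_cost_usd(gb_processed=500, cpu_hours=20,
                          gb_downloaded=2, web_service_requests=8000)
    print(f"Estimated job cost: ${total:.2f}")
```

In this example the four charges are $10, $20, $2, and $2, so CPU time dominates the bill.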
AWSP Concepts
  • Interactive Node – Development
  • User Store – 12 TB of storage
  • Compute Node – Processing
  • Data Store
      • 4 billion documents per crawl
      • 3 crawls @ 100 TB: In Process, Current, Previous
      • All document types (HTML, Media, XML)
      • Document header data
AWSP Design Process
Great Idea → Write Code → Test Code → Identify Pages → Schedule Job → Run Job → Check Results → Publish Results
Great Ideas
  • Vertical search engine
  • Search engine optimization (SEO)
  • Search engine marketing (SEM)
  • Research
  • < your idea here >
Write Code
  • Run on Interactive Node
  • Linux command line
  • Interactive application development
  • Use Collection API for data retrieval
  • Use any language
  • Libraries for C, Java, Perl
  • Execution framework
  • Application processes one document
Write Code
Code can:
  • Examine document
  • Examine headers
  • Write to a collection
  • Write to <stdout>
  • Store data to Amazon S3
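The "application processes one document" model can be sketched as a single function that examines the headers and body of one page and emits one record to stdout. This is only an illustration in Python: the slides list libraries for C, Java, and Perl, and the real Collection API is not shown, so `process_document`, its arguments, and the sample input are all hypothetical stand-ins.

```python
import re
import sys
from typing import Mapping, Optional

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def process_document(url: str, headers: Mapping[str, str], body: str) -> Optional[str]:
    """Return a tab-separated "url<TAB>title" record, or None to skip the page."""
    # Examine headers: only HTML documents are of interest here.
    if "html" not in headers.get("Content-Type", "").lower():
        return None
    # Examine document: pull out the title and collapse its whitespace.
    match = TITLE_RE.search(body)
    title = " ".join(match.group(1).split()) if match else ""
    return f"{url}\t{title}"


def main() -> None:
    # Hypothetical stand-in for documents the execution framework would supply.
    sample = [
        ("http://example.com/", {"Content-Type": "text/html"},
         "<html><title>Example Domain</title></html>"),
        ("http://example.com/logo.png", {"Content-Type": "image/png"}, ""),
    ]
    for url, headers, body in sample:
        record = process_document(url, headers, body)
        if record is not None:
            sys.stdout.write(record + "\n")  # write to <stdout>


if __name__ == "__main__":
    main()
```

Keeping the per-document logic free of shared state is what lets the same code run unchanged on many compute nodes.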
Test Code
  • Run small test on Interactive Node
  • Use predefined document collection
  • Ensure proper functioning
  • Measure document processing time
Identify Pages
  • Choose a crawl
  • Choose pages within the crawl by
      • URL
      • Linkage
      • Alexa Traffic Rank (Top N)
      • Redirection status
      • Content
  • Define a Collection
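Defining a collection amounts to filtering the crawl by page attributes. The sketch below shows that idea on in-memory records, using three of the criteria from the slide (URL, traffic rank, redirection status). The `PageRecord` type and `define_collection` function are hypothetical; the actual AWSP selection interface is not described in the slides.

```python
from dataclasses import dataclass
from typing import Iterable, List
from urllib.parse import urlparse


@dataclass(frozen=True)
class PageRecord:
    url: str
    traffic_rank: int   # 1 = most traffic (assumed convention)
    is_redirect: bool


def define_collection(pages: Iterable[PageRecord], host_suffix: str,
                      top_n: int) -> List[PageRecord]:
    """Keep non-redirect pages on matching hosts ranked within the top N."""
    selected = []
    for page in pages:
        host = urlparse(page.url).hostname or ""
        on_host = host == host_suffix or host.endswith("." + host_suffix)
        if on_host and page.traffic_rank <= top_n and not page.is_redirect:
            selected.append(page)
    return selected


if __name__ == "__main__":
    crawl = [
        PageRecord("http://www.example.edu/a", 500, False),
        PageRecord("http://example.com/", 10, False),
        PageRecord("http://old.example.edu/", 200, True),
        PageRecord("http://big.example.edu/", 5000, False),
    ]
    for page in define_collection(crawl, host_suffix="edu", top_n=1000):
        print(page.url)
```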
 
 
 
 
 
 
 
Schedule Job
  • Allocate compute cluster resources
      • Time
      • Processors (1-10)
  • Each processor
      • 3.6 GHz CPU
      • 4 GB of RAM
      • 500 GB of local disk storage
  • Charged at $1 per CPU hour
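The per-document time measured during testing can be combined with the processor count and the $1 per CPU hour rate to size a job before scheduling it. This is a back-of-the-envelope sketch: it assumes the work divides evenly across processors with no startup or combine overhead, and the names `estimate_job` and `JobEstimate` are illustrative only.

```python
from dataclasses import dataclass

USD_PER_CPU_HOUR = 1                      # rate quoted on the slide
MIN_PROCESSORS, MAX_PROCESSORS = 1, 10    # range quoted on the slide
MS_PER_HOUR = 3_600_000


@dataclass(frozen=True)
class JobEstimate:
    cpu_hours: float
    wall_clock_hours: float
    cost_usd: float


def estimate_job(document_count: int, ms_per_document: int,
                 processors: int) -> JobEstimate:
    """Estimate CPU hours, elapsed time, and cost, assuming perfect parallelism."""
    if not MIN_PROCESSORS <= processors <= MAX_PROCESSORS:
        raise ValueError("processors must be between 1 and 10")
    cpu_hours = document_count * ms_per_document / MS_PER_HOUR
    return JobEstimate(
        cpu_hours=cpu_hours,
        wall_clock_hours=cpu_hours / processors,
        cost_usd=cpu_hours * USD_PER_CPU_HOUR,
    )


if __name__ == "__main__":
    print(estimate_job(document_count=3_600_000, ms_per_document=10, processors=5))
```

Under these assumptions, adding processors shortens the elapsed time but leaves the CPU-hour charge unchanged.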
 
 
Run Job
  • Job runs at specified time
  • Code instances created on each node
  • Job output combined automatically
  • Diagram: Collection → Compute Node #1 ... Compute Node N → Combine Results
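The flow in the diagram (split the collection across nodes, run the same code on each, combine the outputs) can be imitated locally. The sketch below uses threads as stand-ins for compute nodes and a word count as the per-node work; both choices, and the function names, are assumptions for illustration and say nothing about how AWSP actually distributes or merges work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Sequence


def run_on_node(documents: Sequence[str]) -> Counter:
    """Work done by one simulated node: count words in its share of documents."""
    counts: Counter = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts


def run_job(collection: Sequence[str], node_count: int) -> Counter:
    """Shard the collection round-robin, process shards in parallel, combine results."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    shards = [collection[i::node_count] for i in range(node_count)]
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(run_on_node, shards))
    combined: Counter = Counter()
    for partial in partials:
        combined.update(partial)   # adds counts from each node's output
    return combined


if __name__ == "__main__":
    docs = ["search engine", "web search", "web crawl"]
    print(run_job(docs, node_count=2).most_common())
```

Because each node's output is just a set of counts, the combined result is the same no matter how the documents are divided.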
Check Results
  • Monitor progress using portal
  • Final status email
  • Log files
  • Output
Publishing Results
  • Store data to S3
  • Create a new index for AWIS use
  • Publish data for access via web search
Q & A
 
For More Information:
  • AWSP: websearch.alexa.com
  • Alexa Blog: awis.blogspot.com
  • AWS Blog: aws.typepad.com
  • Amazon Web Services: aws.amazon.com
© 2006 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

