Try it the Google way!
Founders: Larry Page (currently President of Products) and Sergey Brin (President of Technology)
They created the “BackRub” web search engine in 1996, with the aim of bringing the web onto their own system.
History of Google so Far
  • Larry and Sergey (then Stanford graduate students) renamed BackRub to Google and, in 1998, started their company, “Google Inc.”
  • That same year they received their first funding cheque, worth $100,000.
  • In 2000, the Google Toolbar and AdWords were introduced.
  • AOL officially added Google as its search partner.
  • In 2003, Google launched its AdSense program.
Some Rough Statistics of Google (from August 29th, 1996)
  • Number of web pages fetched: 24 million
  • Total indexable HTML URLs: 75.2306 million
  • Total content downloaded: 207.022 gigabytes
Services Provided by Google apart from being a Search Engine
Try It The Google Way .
What made Google so popular?
Chief features are:
  • PageRank algorithm
  • Anchor text
Other features are:
  • BigFiles
  • Repository
  • Document index
  • Hit lists
PageRank Algorithm (Bringing Order to the Web)
A PageRank for 26 million web pages can be computed in a few hours on a medium-size workstation. First, citation graphs (link maps) are created, containing as many as an estimated 518 million hyperlinks. These maps are used to calculate the PageRank of each web page. A simple formula then gives every page a rank that is independent of any particular search query.
PageRank Formula
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  • T1 ... Tn are the pages that link to (cite) page A.
  • d is the damping factor (a value between 0 and 1), usually set to 0.85.
  • C(A) is the number of links going out of page A; likewise, C(Ti) is the number of links going out of each citing page Ti.
  • PageRank can be calculated using a simple iterative algorithm.
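The iterative calculation mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pagerank`, the dictionary-based graph, the starting value of 1.0, and the fixed number of iterations are all hypothetical choices, and every page that receives a link is assumed to have at least one outgoing link of its own.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages.

    `links` maps each page to the list of pages it links to.
    Assumption: every page that links out appears as a key in `links`.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}

    # Invert the graph: for each page, which pages cite it?
    inbound = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)

    # Assumed starting point: every page begins with rank 1.0.
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(pr[t] / len(links[t]) for t in inbound[page])
            for page in pages
        }
    return pr


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 3))
```

In this small graph, page C is cited by both A and B, so it ends up with a higher rank than B, which is cited only by A.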
Anchor Text
The text of a link usually describes the page the link points to, so Google associates anchor text with the target page rather than only with the page containing the link. Google maintains a separate database for these anchor indexes. This makes it possible to retrieve pages that have not been crawled. In such cases the search engine can even return a page that never actually existed but had hyperlinks pointing to it.
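To make the anchor-text idea concrete, here is a minimal sketch of an index that files each link's text under the page it points to. The names `build_anchor_index` and `search`, the tuple layout, and the example URLs are hypothetical; the slides do not describe Google's actual data structures.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Map each anchor word to the set of target URLs it was used for.

    `links` is an iterable of (source_url, anchor_text, target_url) tuples.
    """
    index = defaultdict(set)
    for _source, text, target in links:
        for word in text.lower().split():
            index[word].add(target)
    return index


def search(index, query):
    """Return the targets whose anchor text contains every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    links = [
        ("http://a.example/", "Stanford campus map", "http://maps.example/campus.png"),
        ("http://b.example/", "Stanford news", "http://news.example/"),
    ]
    index = build_anchor_index(links)
    # The image itself contains no text and was never fetched,
    # yet it can still be found through the words used to link to it.
    print(search(index, "campus map"))
```

Because the lookup relies only on the linking pages, the target does not need to have been downloaded, which is how an uncrawled page can still appear in results.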
Repository
The repository contains the full HTML of every web page. Each page is compressed using zlib, which achieves a compression ratio of about 3 to 1. The documents are stored one after the other, and each is prefixed by its docID, length, and URL.
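The record layout described above can be sketched with Python's `struct` and `zlib` modules. The exact header used here (a 64-bit docID followed by two 32-bit lengths, little-endian) and the function names are assumptions made for illustration, not the real on-disk format.

```python
import struct
import zlib

# Assumed header layout: docID (uint64), URL length (uint32),
# compressed page length (uint32), all little-endian.
HEADER = struct.Struct("<QII")


def pack_record(doc_id, url, html):
    """Build one repository record: header, then URL, then zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def read_records(blob):
    """Yield (doc_id, url, html) for records stored one after the other."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, body_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


if __name__ == "__main__":
    repository = pack_record(1, "http://a.example/", "<html>first</html>")
    repository += pack_record(2, "http://b.example/", "<html>second</html>")
    for record in read_records(repository):
        print(record)
```

Storing the lengths in the prefix is what lets a reader walk the file sequentially without any separate index.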
HIT LISTS
A hit list is a list of the occurrences of a particular word in a particular document, including position, font, and capitalization information.
DOCUMENT INDEX
The document index keeps information about each document. It is a fixed-width ISAM (indexed sequential access method) index, ordered by docID.
BIGFILES
BigFiles are virtual files that span multiple file systems and are addressable by 64-bit integers.
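As a rough illustration of how the three pieces of hit information can be stored compactly, the sketch below packs one hit into 16 bits. The bit widths (1 bit for capitalization, 3 bits for font size, 12 bits for position) are not stated in these slides; they are an assumption here, and the function names are hypothetical.

```python
def pack_hit(capitalized, font_size, position):
    """Pack one word occurrence into a 16-bit integer.

    Assumed layout: 1 capitalization bit, 3 font-size bits, 12 position bits.
    """
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if not 0 <= position < 4096:
        raise ValueError("position must fit in 12 bits")
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_hit(hit):
    """Return (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF


if __name__ == "__main__":
    # A hit list for one word in one document is then just a list of such integers.
    hit_list = [pack_hit(True, 3, 0), pack_hit(False, 1, 42)]
    print([unpack_hit(hit) for hit in hit_list])
```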
Google Architecture Overview
Crawling The Web
In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (typically about three). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. Googlebot is the search bot software used by Google; it collects documents from the web to build a searchable index for the Google Search engine.
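The division of labour between one URL source and several crawlers can be sketched as follows. This is a simplified stand-in, not the original design: it uses threads and a shared queue in place of a separate URLserver process and hundreds of open connections per crawler, and `run_crawl` and the `fetch` callback are hypothetical names. The fetch function is passed in so the example runs without network access.

```python
import queue
import threading


def run_crawl(urls, fetch, num_crawlers=3):
    """Hand out URLs from one shared queue to several crawler threads.

    `fetch` is any callable that takes a URL and returns the page content.
    """
    pending = queue.Queue()
    for url in urls:
        pending.put(url)

    results = {}
    lock = threading.Lock()

    def crawler():
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return  # no URLs left, so this crawler stops
            page = fetch(url)
            with lock:
                results[url] = page

    workers = [threading.Thread(target=crawler) for _ in range(num_crawlers)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    # A fake fetch function stands in for real HTTP requests.
    pages = run_crawl(
        ["http://a.example/", "http://b.example/"],
        fetch=lambda url: "<html>" + url + "</html>",
    )
    print(pages)
```

Keeping the URL list in one place means no two crawlers fetch the same page, which is the role the URLserver plays in the description above.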
What else can Google do?
  • Refine search results
  • Calculator
  • Currency converter
  • Time zones
  • Specific “filetype” search
  • Advanced search
  • I'm Feeling Lucky
  • Dictionary
  • Language translator
Created By: Anmol Buber (0713313015), Abhinav Singh (0713313003)
