Public Terabyte Dataset Project
Web crawling with Amazon’s EMR
Ken Krugler, Bixo Labs, Inc.
Hadoop Bay Area Meetup, 21 April 2010
About me
- Background in consulting, search, vertical crawl: Apple (Mac), I18N, expert systems, Palm OS
- Krugle search engine for open source code
- Open source projects: Nutch, Lucene, Solr; Bixo web mining toolkit; Tika content extraction
- Founder of Bixo Labs | http://bixolabs.com
  - Elastic web mining platform: Hadoop, Cascading, Bixo in EC2/EMR
In 20 Minutes I’ll Talk About…
- Public Terabyte Dataset project
- Amazon’s Elastic MapReduce
- Some really embarrassing mistakes
What is the Public Terabyte Dataset?
- Large-scale crawl of top US domains
- Sponsored by Concurrent/Bixo Labs
- Public dataset for use in Amazon’s cloud
- As expected, taking longer than expected (“pain is weakness leaving the body”)
- Questions, input? http://bixolabs.com/PTD/
Using Many Different Technologies
- AWS Elastic MapReduce: server farm
- Hadoop, Cascading: processing workflow
- Bixo, Tika: web crawling, extracting links
- AWS SimpleDB: maintaining crawl state
- AWS S3: saving results
- Apache Avro: storing results
One Example of Analyzing Results
- Tika charset detection is… not great
- Simple code to process Avro files
- Comparing meta-tag charsets with detected charsets
- Then make the results sexy in Excel
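The meta-tag comparison on this slide can be sketched roughly as follows. This is a hedged Python illustration only (the actual pipeline is Java/Cascading code over Avro files); the helper names and the regex-based extraction are assumptions, not the project's code:

```python
import re
from typing import Optional

def extract_meta_charset(html: str) -> Optional[str]:
    # Pull the author-declared charset from either <meta charset="...">
    # or the older <meta http-equiv="Content-Type" content="...; charset=...">.
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', html, re.IGNORECASE)
    return m.group(1).lower() if m else None

def detection_agrees(detected: str, declared: str) -> bool:
    # Mirror the leniency described in the speaker notes: treat us-ascii
    # and iso-8859-1 as acceptable answers when the page declares UTF-8.
    if declared == "utf-8" and detected in {"us-ascii", "iso-8859-1"}:
        return True
    return detected == declared
```

Run over the crawl records, the fraction of agreeing pages per declared charset yields the per-charset accuracy numbers that were then charted in Excel.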
Cascading Analysis Workflow
(diagram of the Cascading analysis workflow)
Why Use Avro For the Resulting Dataset?
- Originally tried WARC (Web ARChive format)
- But not really cross-language (Java, C)
- And not easily splittable
Amazon’s Elastic MapReduce
- Auto-configured Hadoop clusters
- Transient / on-demand
- Good for “bursty” jobs
- Low $$$/ops requirements
Effectively Using Elastic MapReduce
- Avoid the 10-second failure: use the --alive option
- Avoid the TB log file: never run with trace debugging
- Use the new Bootstrap Actions: tune configuration for larger clusters
Why Use SimpleDB?
- Need to maintain crawl state
- Too big for MySQL
- Too expensive with HBase
- Too painful with SequenceFiles
- SimpleDB to the rescue (sort of)
SimpleDB Fundamentals
- Distributed key/value store
- Some interesting query/update support
- Pay for usage, not storage
- Uses HTTP for requests: latency, throughput issues
- Shared resource, so there’s the “Back OFF!” issue
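The “Back OFF!” point deserves a concrete shape. A minimal retry-with-jittered-exponential-backoff sketch in Python; the zero-argument callable and the error type are placeholders for a real SimpleDB HTTP request and its throttling error, not the actual tap code:

```python
import random
import time

def with_backoff(request, max_retries=5, base_delay=0.1, sleep=time.sleep):
    # Retry a flaky request, doubling the delay (plus random jitter)
    # after each failure. SimpleDB is a shared service: when it starts
    # throttling, the polite (and required) response is to back off.
    # `sleep` is injectable so tests don't actually wait.
    for attempt in range(max_retries):
        try:
            return request()
        except IOError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

The jitter matters: without it, many mappers that were throttled together retry together and swamp the service again in lockstep.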
SimpleDB Tap - simple
SimpleDB Tap - batch puts
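A batch-puts tap presumably buffers writes and flushes them in chunks, since SimpleDB's BatchPutAttributes call accepts at most 25 items per request. The chunking itself is simple (a sketch, not the tap's actual implementation):

```python
def batches(items, size=25):
    # Split a buffered list of records into lists of at most `size`
    # elements; 25 is SimpleDB's per-call limit for BatchPutAttributes.
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Batching cuts the number of HTTP round trips by up to 25x, which matters when each request carries real latency.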
SimpleDB Tap - sharding
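Sharding here means spreading one logical table across several SimpleDB domains to raise the aggregate throughput ceiling. A sketch of the key-to-domain routing; the domain-name prefix and shard count are made-up illustration values, not the project's configuration:

```python
import hashlib

def shard_for(url, num_shards=8, prefix="ptd_crawl"):
    # Hash the URL and map it to one of `num_shards` SimpleDB domains.
    # A stable hash means the same URL always routes to the same domain,
    # so crawl-state lookups and updates stay consistent.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return "%s_%d" % (prefix, int(digest, 16) % num_shards)
```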
SimpleDB Tap - multithreading
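Because each SimpleDB request spends most of its time waiting on the network, a single-threaded Hadoop task gets poor throughput; the speaker notes mention running on the order of 100 threads per mapper. A minimal sketch of that fan-out, with `put_one` standing in for the actual HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_put(records, put_one, max_threads=100):
    # Issue many I/O-bound puts concurrently; results come back in
    # input order. Threads (not processes) suffice because the work
    # is network waiting, not CPU.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(put_one, records))
```

In the real tap this multithreading layer sits on top of Hadoop, and has to respect fairness and backoff since SimpleDB is a shared resource.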
SimpleDB Tap - distributed
Why is My Job Running So Sloooow?
- I blame Amazon: “EMR servers must suck”
- I blame Hadoop: it is an older version
- I blame Cascading: the workflow planner must have a bug
- Prehistoric caveman profiling: kill -QUIT to the rescue
Configuration Bugs, not Code Bugs
The Real Problems
- Fetching ALL the pages
- Tika language detection enabled
- (Re)building of the distributed data cache
- Generating log files, not results
Summary
- Public Terabyte Dataset is “getting there”
  - Useful for testing analysis code
  - Free, easy to use in EC2
- Elastic MapReduce works well
  - For bursty, occasional jobs
  - When coupled with other AWS services
- Many “bugs” are configuration problems
Any Questions?
- My email: [email_address]
- Blog post about sample PTD results: http://bixolabs.com/blog/2010/032010/04/21/first-sample-of-public-terabyte-dataset/
- Input for Machine Learning in EC2 talk: http://bixolabs.com/ml-talk/
 

Editor's Notes

  • #2: Elastic Web Mining 01 November 2009
  • #3: Over the previous four years I had a startup called Krugle, which provided code search for open source projects and inside large companies. We did a large, 100M-page crawl of the “programmer’s web” to find information about open source projects. Based on what I learned from that experience, I started the Bixo open source project. It’s a toolkit for building web mining workflows, and I’ll be talking more about that later. And that in turn led to Bixo Labs, which is a platform for quickly creating web mining apps. “Elastic” means the size of the system can easily be changed to match the web mining task.
  • #5: The goal is to generate a large, high-quality web crawl that results in an Amazon public dataset. Having a large set of public data can be very useful, and I'll show one example later. For Amazon, it's some honey to lure people into running jobs in EC2 & EMR. For us, it was a reason to really push the boundaries of what Bixo could handle. Bixo is the open source web mining toolkit project we sponsor, and we use it for a lot of things, but crawling 100M+ pages wasn't one of them. It also wound up being a good test for incremental releases of Cascading 1.1, which I think is now final. Why has it taken longer? Well, part of it is because any time you have to deal with a large slice of the web, it hurts. As Jimmy Liu at CMU said, the web is an endless series of edge cases. Plus we had some work to do to figure out how to effectively run in EMR.
  • #6: The three in bold (Elastic MapReduce, SimpleDB, Avro) are the ones that I’ll spend a bit more time talking about today.
  • #7: Surprising, given that it’s based on ICU, and that’s often the gold standard for internationalization. One simple approach to validating quality is to compare with any charset found in HTML meta tags. Which can still be wrong, but is usually correct, and way better than the HTTP response headers. Our input is the Avro files we generated from a sample crawl covering 1.7M top domains, based on US traffic reports from Alexa and Quantcast. BTW, please provide input on the format. Link to blog post about it at end of talk. We’d like to finalize before we generate a lot more data.
  • #8: I’m not expecting you to be able to understand or even read this, but it shows the actual workflow portion of a typical analysis app. There are three functions you’re not seeing, which pick the records to analyze, do the analysis, and then generate a report.
  • #9: I took a slice of data and calculated the accuracy of Tika when compared to the meta-tag charset. Now, the meta-tag data can lie, though it’s better than the HTTP response headers, and is usually pretty accurate. The vertical scale is accuracy, from 0 to 100%, and the horizontal scale is the number of pages for each charset, on a log scale so that you don’t wind up with all but a few charsets squished to the left. From this, you can see that what we want is for common charsets to have high accuracy, so points in the upper right. But we don’t get that, unfortunately. And this is with calling us-ascii and iso-8859-1 subsets of UTF-8. For some reason Tika really likes the gb18030 encoding; many UTF-8 pages are classified as this. Clearly we could re-run this with modified versions of Tika, or other open source detectors. In fact it would be great to find an intern to implement the approach that Ted Dunning has recommended, of using log-likelihood ratios, to see how that compares. Similar issues exist for Tika’s language detection, so it could be a two-fer. The key point is that by having a large enough data set, and an easy way to process it, you have the ability to quickly try new approaches, and feel confident about the end results.
  • #10: As Tom White pointed out, any input format needs to be splittable for efficiency. So it’s not just about which compression format to use (the answer to that is usually LZO); it relates to anything beyond simple one-line-per-record text files. There was some pain in creating a Cascading scheme that let us use Avro for reading & writing. But true to the just-in-time nature of open source projects, Hadoop input/output formats for Avro were recently committed. So we’ve posted an initial version of a Cascading scheme for Avro, which seems to be working well so far.
  • #11: I should ask first: how many people know about Amazon’s Elastic MapReduce? Don’t be shy. And how many have actually run jobs using it? Web mining, and crawling like we’re doing, is very bursty. You run a job, then analyze the results, figure out what went wrong, and run again. Even though it costs more per hour than raw EC2, often it's cheaper: no waking up to find that your 20-server cluster choked after 15 minutes, but you paid for 10 hours; no coming back from vacation and realizing that you forgot to terminate the cluster. Also, unlike our friends here at Yahoo, I don't have money to build out even a 20-server cluster. And for bursty jobs, most of the time those 20 servers are sitting idle, which increases the effective cost significantly.
  • #12: Log files get auto-uploaded to S3, which is nice. Except when they're huge, so you're waiting around for the file, and when it finally shows up it's too big to effectively download & examine. So never run in trace mode. EMR uses an older version of Hadoop, 0.18.3, which means I'd regularly get something working fine in EC2, and then it would fail in EMR due to some version dependency. Oh, and use Bootstrap Actions to tune your cluster. For example, with 50 slaves I was often running out of namenode listener threads.
  • #13: Every URL, status, additional bits of info. SQL starts having problems when you get past a few million things, and the seed list for the crawl was 1.7M top-level domains. After the first loop, there would be close to 70M URLs. 3 medium instances for three weeks = $250. And right now, for example, I'd be stressing about paying for servers I wasn't actually using; web crawling is very bursty. And secretly, I just wasn't excited about configuring and maintaining an HBase cluster. For a different type of crawl, HBase would make total sense; it’s not about you, it’s about me. Maybe this has never happened for any of you, but occasionally I look at a piece of code and I say to myself: WTF? Who the heck thought this was a good idea? Something like that happened to me the first time I looked at the Nutch code that handled updating the CrawlDB, where the state of the crawl is kept. I was thinking, I know Doug's a smart guy, why? So then I was busy trying to use SequenceFiles to store the crawl state, and I realized I was recreating exactly the same logic, only not as well. Which was my code telling me to just stop. Plus I found that copying this ever-growing file to and from S3 was increasingly time consuming.
  • #14: Paying for something you’re not using is like having a timeshare that you never visit. Right now I’ve got 75M URLs in SimpleDB, and it’s not costing me anything.
  • #15: This applies to interaction with other external, distributed, shared systems. Very similar to the issues related to web crawling, in fact. E.g. high latency, low throughput, and high error rates compared to disk I/O.
  • #19: At this point we’re starting to talk about real performance, and also real backoff issues. Even with 10 mappers, it’s easy to completely swamp SimpleDB. We were loading about 200K records/minute with 4 mappers and 100 threads/mapper. For accessing shared resources via HTTP (web pages, SimpleDB), you wind up having to layer a multithreading adapter on top of Hadoop to get reasonable performance. But since these are shared resources, you have additional constraints in dealing with fairness. And you have to be able to handle a much higher error rate; it's not like writing to a disk.
  • #20: Easy to blame lots of things: “EMR must get the dregs from the EC2 server pool. We're not getting rack locality for our data. The DNS servers at Amazon are being overwhelmed. Maybe there's a bad network card.” Todd Lipcon: Poor Man's Profiling.
  • #21: Why this picture? Because many of these mistakes have to do with configuration, not code.
  • #22: Lesson learned: early termination of outliers can give a huge effective performance boost. Language detection is on by default, and really slow, and I didn't need it. Lesson learned: you need to worry about the entire software stack, not just the top (your stuff) and the bottom (Hadoop). JVM reuse was not enabled. Lesson learned: avoid processing in setup (use a pre-built Bloom filter), or only do it in reducers where you have more control. The logging level was still set to trace, generating 100GB+ of data, e.g. logging each of the 1.7M domains in the whitelist that was getting put into the set.
  • #24: So that’s it. I think we’ve got time for questions. One more thing: in 20 minutes I couldn't cover some other aspects, one of which was using Mahout to do ML in Hadoop/EC2. Specifically, to classify links as spammy or not, based on analyzing the fetch process and the fetched content. If there's enough interest, I could finish that presentation and do something online; there’s a link at the bottom for anyone who’s interested.