Public Terabyte Dataset Project
Web crawling with Amazon’s EMR
Ken Krugler, Bixo Labs, Inc.
Hadoop Bay Area Meetup, 21 April 2010
About me
- Background in consulting, search, vertical crawl: Apple (Mac), I18N, expert systems, Palm OS
- Krugle search engine for open source code
- Open source projects: Nutch, Lucene, Solr; Bixo web mining toolkit; Tika content extraction
- Founder of Bixo Labs | http://bixolabs.com
  - Elastic web mining platform: Hadoop, Cascading, Bixo in EC2/EMR
In 20 Minutes I’ll Talk About…
- Public Terabyte Dataset project
- Amazon’s Elastic MapReduce
- Some really embarrassing mistakes
What is the Public Terabyte Dataset?
- Large-scale crawl of top US domains
- Sponsored by Concurrent/Bixo Labs
- Public dataset for use in Amazon’s cloud
- As expected, taking longer than expected (“pain is weakness leaving the body”)
- Questions, input? http://bixolabs.com/PTD/
Using Many Different Technologies
- AWS Elastic MapReduce: server farm
- Hadoop, Cascading: processing workflow
- Bixo, Tika: web crawling, extracting links
- AWS SimpleDB: maintaining crawl state
- AWS S3: saving results
- Apache Avro: storing results
One Example of Analyzing Results
- Tika charset detection is… not great
- Simple code to process Avro files
- Comparing meta-tag charsets with detected charsets
- Then make the results sexy in Excel
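The meta-tag comparison on this slide can be sketched roughly as follows. This is a hedged Python illustration only (the actual pipeline is Java/Cascading code over Avro files); the helper names and the regex-based extraction are assumptions, not the project's code:

```python
import re
from typing import Optional

def extract_meta_charset(html: str) -> Optional[str]:
    # Pull the author-declared charset from either <meta charset="...">
    # or the older <meta http-equiv="Content-Type" content="...; charset=...">.
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', html, re.IGNORECASE)
    return m.group(1).lower() if m else None

def detection_agrees(detected: str, declared: str) -> bool:
    # Mirror the leniency described in the speaker notes: treat us-ascii
    # and iso-8859-1 as acceptable answers when the page declares UTF-8.
    if declared == "utf-8" and detected in {"us-ascii", "iso-8859-1"}:
        return True
    return detected == declared
```

Run over the crawl records, the fraction of agreeing pages per declared charset yields the per-charset accuracy numbers that were then charted in Excel.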
Cascading Analysis Workflow
(diagram of the Cascading analysis workflow)
Why Use Avro For the Resulting Dataset?
- Originally tried WARC (Web ARChive format)
- But not really cross-language (Java, C)
- And not easily splittable
Amazon’s Elastic MapReduce
- Auto-configured Hadoop clusters
- Transient / on-demand
- Good for “bursty” jobs
- Low $$$/ops requirements
Effectively Using Elastic MapReduce
- Avoid the 10-second failure: use the --alive option
- Avoid the TB log file: never run with trace debugging
- Use the new Bootstrap Actions: tune configuration for larger clusters
Why Use SimpleDB?
- Need to maintain crawl state
- Too big for MySQL
- Too expensive with HBase
- Too painful with SequenceFiles
- SimpleDB to the rescue (sort of)
SimpleDB Fundamentals
- Distributed key/value store
- Some interesting query/update support
- Pay for usage, not storage
- Uses HTTP for requests: latency, throughput issues
- Shared resource, so there’s the “Back OFF!” issue
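The “Back OFF!” point deserves a concrete shape. A minimal retry-with-jittered-exponential-backoff sketch in Python; the zero-argument callable and the error type are placeholders for a real SimpleDB HTTP request and its throttling error, not the actual tap code:

```python
import random
import time

def with_backoff(request, max_retries=5, base_delay=0.1, sleep=time.sleep):
    # Retry a flaky request, doubling the delay (plus random jitter)
    # after each failure. SimpleDB is a shared service: when it starts
    # throttling, the polite (and required) response is to back off.
    # `sleep` is injectable so tests don't actually wait.
    for attempt in range(max_retries):
        try:
            return request()
        except IOError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

The jitter matters: without it, many mappers that were throttled together retry together and swamp the service again in lockstep.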
SimpleDB Tap - simple
SimpleDB Tap - batch puts
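A batch-puts tap presumably buffers writes and flushes them in chunks, since SimpleDB's BatchPutAttributes call accepts at most 25 items per request. The chunking itself is simple (a sketch, not the tap's actual implementation):

```python
def batches(items, size=25):
    # Split a buffered list of records into lists of at most `size`
    # elements; 25 is SimpleDB's per-call limit for BatchPutAttributes.
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Batching cuts the number of HTTP round trips by up to 25x, which matters when each request carries real latency.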
SimpleDB Tap - sharding
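Sharding here means spreading one logical table across several SimpleDB domains to raise the aggregate throughput ceiling. A sketch of the key-to-domain routing; the domain-name prefix and shard count are made-up illustration values, not the project's configuration:

```python
import hashlib

def shard_for(url, num_shards=8, prefix="ptd_crawl"):
    # Hash the URL and map it to one of `num_shards` SimpleDB domains.
    # A stable hash means the same URL always routes to the same domain,
    # so crawl-state lookups and updates stay consistent.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return "%s_%d" % (prefix, int(digest, 16) % num_shards)
```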
SimpleDB Tap - multithreading
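Because each SimpleDB request spends most of its time waiting on the network, a single-threaded Hadoop task gets poor throughput; the speaker notes mention running on the order of 100 threads per mapper. A minimal sketch of that fan-out, with `put_one` standing in for the actual HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_put(records, put_one, max_threads=100):
    # Issue many I/O-bound puts concurrently; results come back in
    # input order. Threads (not processes) suffice because the work
    # is network waiting, not CPU.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(put_one, records))
```

In the real tap this multithreading layer sits on top of Hadoop, and has to respect fairness and backoff since SimpleDB is a shared resource.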
SimpleDB Tap - distributed
Why is My Job Running So Sloooow?
- I blame Amazon: “EMR servers must suck”
- I blame Hadoop: it is an older version
- I blame Cascading: the workflow planner must have a bug
- Prehistoric caveman profiling: kill -QUIT to the rescue
Configuration Bugs, not Code Bugs
The Real Problems
- Fetching ALL the pages
- Tika language detection enabled
- (Re)building of the distributed data cache
- Generating log files, not results
Summary
- Public Terabyte Dataset is “getting there”
  - Useful for testing analysis code
  - Free, easy to use in EC2
- Elastic MapReduce works well
  - For bursty, occasional jobs
  - When coupled with other AWS services
- Many “bugs” are configuration problems
Any Questions?
- My email: [email_address]
- Blog post about sample PTD results: http://bixolabs.com/blog/2010/032010/04/21/first-sample-of-public-terabyte-dataset/
- Input for Machine Learning in EC2 talk: http://bixolabs.com/ml-talk/
 

Editor's Notes

  • #2: Elastic Web Mining 01 November 2009
  • #3: Over the previous four years I had a startup called Krugle, which provided code search for open source projects and inside large companies. We did a large, 100M-page crawl of the “programmer’s web” to find information about open source projects. Based on what I learned from that experience, I started the Bixo open source project. It’s a toolkit for building web mining workflows, and I’ll be talking more about that later. And that in turn led to Bixo Labs, which is a platform for quickly creating web mining apps. “Elastic” means the size of the system can easily be changed to match the web mining task.
  • #5: The goal is to generate a large, high-quality web crawl that results in an Amazon public dataset. Having a large set of public data can be very useful, and I'll show one example later. For Amazon, it's some honey to lure people into running jobs in EC2 & EMR. For us, it was a reason to really push the boundaries of what Bixo could handle. Bixo is the open source web mining toolkit project we sponsor, and we use it for a lot of things, but crawling 100M+ pages wasn't one of them. It also wound up being a good test for incremental releases of Cascading 1.1, which I think is now final. Why has it taken longer? Well, part of it is because any time you have to deal with a large slice of the web, it hurts. As Jimmy Liu at CMU said, the web is an endless series of edge cases. Plus we had some work to do to figure out how to effectively run in EMR.
  • #6: The three in bold (Elastic MapReduce, SimpleDB, Avro) are the ones that I’ll spend a bit more time talking about today.
  • #7: Surprising, given that it’s based on ICU, and that’s often the gold standard for internationalization. One simple approach to validating quality is to compare with any charset found in HTML meta tags. Which can still be wrong, but is usually correct, and way better than the HTTP response headers. Our input is the Avro files we generated from a sample crawl covering 1.7M top domains, based on US traffic reports from Alexa and Quantcast. BTW, please provide input on the format. Link to blog post about it at end of talk. We’d like to finalize before we generate a lot more data.
  • #8: I’m not expecting you to be able to understand or even read this, but it shows the actual workflow portion of a typical analysis app. There are three functions you’re not seeing, which pick the records to analyze, do the analysis, and then generate a report.
  • #9: I took a slice of data and calculated the accuracy of Tika when compared to the meta-tag charset. Now, the meta-tag data can lie, though it’s better than the HTTP response headers, and is usually pretty accurate. The vertical scale is accuracy, from 0 to 100%, and the horizontal scale is the number of pages for each charset, on a log scale so that you don’t wind up with all but a few charsets squished to the left. From this, you can see that what we want is for common charsets to have high accuracy, so points in the upper right. But we don’t get that, unfortunately. And this is with calling us-ascii and iso-8859-1 subsets of UTF-8. For some reason Tika really likes the gb18030 encoding; many UTF-8 pages are classified as this. Clearly we could re-run this with modified versions of Tika, or other open source detectors. In fact it would be great to find an intern to implement the approach that Ted Dunning has recommended, of using log-likelihood ratios, to see how that compares. Similar issues exist for Tika’s language detection, so it could be a two-fer. The key point is that by having a large enough data set, and an easy way to process it, you have the ability to quickly try new approaches, and feel confident about the end results.
  • #10: As Tom White pointed out, any input format needs to be splittable for efficiency. So it’s not just about which compression format to use (the answer to that is usually LZO); it relates to anything beyond simple one-line-per-record text files. There was some pain in creating a Cascading scheme that let us use Avro for reading & writing. But true to the just-in-time nature of open source projects, Hadoop input/output formats for Avro were recently committed. So we’ve posted an initial version of a Cascading scheme for Avro, which seems to be working well so far.
  • #11: I should ask first: how many people know about Amazon’s Elastic MapReduce? Don’t be shy. And how many have actually run jobs using it? Web mining, and crawling like we’re doing, is very bursty. You run a job, then analyze the results, figure out what went wrong, and run again. Even though it costs more per hour than raw EC2, often it's cheaper: no waking up to find that your 20-server cluster choked after 15 minutes, but you paid for 10 hours; no coming back from vacation and realizing that you forgot to terminate the cluster. Also, unlike our friends here at Yahoo, I don't have money to build out even a 20-server cluster. And for bursty jobs, most of the time those 20 servers are sitting idle, which increases the effective cost significantly.
  • #12: Log files get auto-uploaded to S3, which is nice. Except when they're huge, so you're waiting around for the file, and when it finally shows up it's too big to effectively download & examine. So never run in trace mode. EMR uses an older version of Hadoop, 0.18.3, which means I'd regularly get something working fine in EC2, and then it would fail in EMR due to some version dependency. Oh, and use Bootstrap Actions to tune your cluster. For example, with 50 slaves I was often running out of namenode listener threads.
  • #13: Every URL, status, additional bits of info. SQL starts having problems when you get past a few million things, and the seed list for the crawl was 1.7M top-level domains. After the first loop, there would be close to 70M URLs. 3 medium instances for three weeks = $250. And right now, for example, I'd be stressing about paying for servers I wasn't actually using; web crawling is very bursty. And secretly, I just wasn't excited about configuring and maintaining an HBase cluster. For a different type of crawl, HBase would make total sense; it’s not about you, it’s about me. Maybe this has never happened for any of you, but occasionally I look at a piece of code and I say to myself: WTF? Who the heck thought this was a good idea? Something like that happened to me the first time I looked at the Nutch code that handled updating the CrawlDB, where the state of the crawl is kept. I was thinking, I know Doug's a smart guy, why? So then I was busy trying to use SequenceFiles to store the crawl state, and I realized I was recreating exactly the same logic, only not as well. Which was my code telling me to just stop. Plus I found that copying this ever-growing file to and from S3 was increasingly time consuming.
  • #14: Paying for something you’re not using is like having a timeshare that you never visit. Right now I’ve got 75M URLs in SimpleDB, and it’s not costing me anything.
  • #15: This applies to interaction with other external, distributed, shared systems. Very similar to the issues related to web crawling, in fact. E.g. high latency, low throughput, and high error rates compared to disk I/O.
  • #19: At this point we’re starting to talk about real performance, and also real backoff issues. Even with 10 mappers, it’s easy to completely swamp SimpleDB. We were loading about 200K records/minute with 4 mappers and 100 threads/mapper. For accessing shared resources via HTTP (web pages, SimpleDB), you wind up having to layer a multithreading adapter on top of Hadoop to get reasonable performance. But since these are shared resources, you have additional constraints in dealing with fairness. And you have to be able to handle a much higher error rate; it's not like writing to a disk.
  • #20: Easy to blame lots of things: “EMR must get the dregs from the EC2 server pool. We're not getting rack locality for our data. The DNS servers at Amazon are being overwhelmed. Maybe there's a bad network card.” Todd Lipcon: Poor Man's Profiling.
  • #21: Why this picture? Because many of these mistakes have to do with configuration, not code.
  • #22: Lesson learned: early termination of outliers can give a huge effective performance boost. Language detection is on by default, and really slow, and I didn't need it. Lesson learned: you need to worry about the entire software stack, not just the top (your stuff) and the bottom (Hadoop). JVM reuse was not enabled. Lesson learned: avoid processing in setup (use a pre-built Bloom filter), or only do it in reducers where you have more control. The logging level was still set to trace, generating 100GB+ of data, e.g. logging each of the 1.7M domains in the whitelist that was getting put into the set.
  • #24: So that’s it. I think we’ve got time for questions. One more thing: in 20 minutes I couldn't cover some other aspects, one of which was using Mahout to do ML in Hadoop/EC2. Specifically, to classify links as spammy or not, based on analyzing the fetch process and the fetched content. If there's enough interest, I could finish that presentation and do something online; there’s a link at the bottom for anyone who’s interested.