AWS Quick Dirty Hadoop MapReduce EC2 S3

2. QUICK AND DIRTY PARALLEL PROCESSING ON THE CLOUD Daniel Sikar

3. EC2 S3

5. Tools: AWS command line tools

6. Elastic MapReduce Ruby library

7. Hadoop

8. s3cmd

9. Hadoop: MapReduce, Job Tracker, HDFS (distributed file system)

10. Hadoop MapReduce usage: data crunching in general (clicks, statistics, etc.)

11. Hadoop Project Management Committee

12. MapReduce ?

13. MapReduce: key-value pairs <key, value>

14. MapReduce

15. HTTP Logs
    Log file A:
    (...) FreeTouchScreenNokia5230 (...)
    (...) GetRidofAllSpeedCameras (...)
    (...) USManWinsLottery (...)
    (...) BNPToLaunchElectionManifesto (...)
    Log file B:
    (...) FreeTouchScreenNokia5230 (...)
    (...) BodyLanguageTellsAll (...)

16. MapReduce <FreeTouchScreenNokia5230, 1> + <FreeTouchScreenNokia5230, 1> = <FreeTouchScreenNokia5230, 2>
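The summation on slide 16 can be sketched in plain Python, without Hadoop. This is only an illustration of the idea: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and the two lists stand in for the page names visible in log files A and B on slide 15 (the elided parts of each line are left out).

```python
from collections import defaultdict

# Page names as they appear in the two example logs (elided fields omitted).
log_a = [
    "FreeTouchScreenNokia5230",
    "GetRidofAllSpeedCameras",
    "USManWinsLottery",
    "BNPToLaunchElectionManifesto",
]
log_b = [
    "FreeTouchScreenNokia5230",
    "BodyLanguageTellsAll",
]


def map_phase(records):
    """Emit a <key, 1> pair for every record."""
    return [(record, 1) for record in records]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


pairs = map_phase(log_a) + map_phase(log_b)
counts = reduce_phase(shuffle(pairs))
print(counts["FreeTouchScreenNokia5230"])  # 2
```

Each log can be mapped independently, which is what lets the map phase run in parallel; only the grouping step needs to see pairs from both logs.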

17. Hadoop Streaming: running MapReduce jobs with executable files and scripts
    $ <list> | mapper | reducer
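The `<list> | mapper | reducer` pipeline on slide 17 can be imitated locally. The sketch below assumes the usual streaming convention of one `key<TAB>value` record per line, and adds an explicit sort between the two stages because Hadoop sorts mapper output by key before the reducer sees it. The names `mapper`, `reducer`, and `run_pipeline` are illustrative, not part of Hadoop.

```python
import io
from itertools import groupby


def mapper(lines, out):
    """Write one 'key<TAB>1' line per non-empty input line."""
    for line in lines:
        key = line.strip()
        if key:
            out.write(f"{key}\t1\n")


def reducer(lines, out):
    """Sum counts for runs of identical keys; input must be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        out.write(f"{key}\t{total}\n")


def run_pipeline(lines):
    """Local stand-in for: <list> | mapper | sort | reducer."""
    mapped = io.StringIO()
    mapper(lines, mapped)
    shuffled = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    reducer(shuffled, reduced)
    return reduced.getvalue()


print(run_pipeline(["b\n", "a\n", "b\n"]), end="")
```

In a real streaming job the two functions would live in separate scripts that read `sys.stdin` and write `sys.stdout`; they are kept in one file here so the whole flow can be run in one go.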

19. Real life example of Hadoop Streaming usage

20. Wikipedia Page Access Logs

21. Wine Grape Varieties

22. Wikipedia WGV Page Access Stats

23. Business Decisions

24. Launching a virtual Hadoop Cluster
    $ elastic-mapreduce --create --name "Wiki log crunch" --alive --num-instances 20 --instance-type c1.medium
    Created job flow <job flow id>
    $ ec2din
    (...)

26. Hadoop Standalone Operation

27. Pseudo-Distributed Operation

28. Fully-Distributed Operation

29. NameNode

30. JobTracker

31. DataNode + TaskTracker

38. Add a step
    $ elastic-mapreduce --jobflow <jfid> --stream \
        --step-name "Wiki log crunch" \
        --input s3n://dsikar-wikilogs-2009/dec/ \
        --output s3n://dsikar-wikilogs-output/21 \
        --mapper s3n://dsikar-wiki-scripts/wikidictionarymap.pl \
        --reducer s3n://dsikar-wiki-scripts/wikireduce.pl
    Job tracker web UI: http://<instance public dns>:9100
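The contents of the mapper script named in the step on slide 38 are not shown in the slides. As a rough sketch of what a dictionary-filtering mapper could look like, the Python below assumes each pagecounts line has four space-separated fields (project, page title, view count, bytes transferred) and that only titles in a lookup set should be kept. The set `GRAPE_VARIETIES`, the function `map_line`, and the sample lines are all hypothetical.

```python
# Hypothetical dictionary of page titles to keep; the real list is not shown.
GRAPE_VARIETIES = {"Merlot", "Chardonnay", "Pinot_noir"}


def map_line(line, wanted=GRAPE_VARIETIES, project="en"):
    """Return 'title<TAB>views' for a wanted page, or None to skip the line.

    Assumes the layout: project, page title, view count, bytes transferred.
    """
    fields = line.split()
    if len(fields) != 4:
        return None  # malformed line
    proj, title, views, _size = fields
    if proj != project or title not in wanted or not views.isdigit():
        return None
    return f"{title}\t{views}"


# Made-up sample lines, for illustration only.
sample = [
    "en Merlot 12 345678",
    "en Main_Page 999 1234567",
    "fr Merlot 3 4567",
]
for line in sample:
    record = map_line(line)
    if record is not None:
        print(record)
```

As a streaming mapper, the same function would be applied to every line read from standard input, with the matching reducer summing the view counts per title.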

39. s3cmd
    # make bucket
    $ s3cmd mb s3://dsikar-wikilogs
    # put log files
    $ s3cmd put pagecounts-200912*.gz s3://dsikar-wikilogs/dec
    $ s3cmd put pagecounts-201004*.gz s3://dsikar-wikilogs/apr
    # list log files
    $ s3cmd ls s3://dsikar-wikilogs/
    # put scripts
    $ s3cmd put *.pl s3://dsikar-wiki-scripts/
    # delete log files
    $ s3cmd del --recursive --force s3://dsikar-wikilogs/
    # remove bucket
    $ s3cmd rb s3://dsikar-wikilogs/

40. Elastic MapReduce options: --create, --list, --jobflow, --describe, --stream, --terminate

41. Output files: part-00000, part-00001, part-00002, (...)

42. Further aggregation
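One way the part files from slide 41 might be aggregated further, once downloaded, is sketched below. It assumes each part file holds `key<TAB>count` lines, which the slides do not confirm; `aggregate_parts` is a hypothetical helper, and the demo files are made up. Within a single job a key normally lands in only one part file, so the summing matters mostly when outputs from several runs are combined.

```python
import glob
import os
import tempfile
from collections import Counter


def aggregate_parts(directory):
    """Sum 'key<TAB>count' lines across every part-* file in a directory."""
    totals = Counter()
    for path in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                key, _, value = line.rstrip("\n").partition("\t")
                if value.isdigit():  # skip lines without a numeric count
                    totals[key] += int(value)
    return totals


# Demo with two small made-up part files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "part-00000"), "w", encoding="utf-8") as f:
        f.write("Merlot\t12\nChardonnay\t7\n")
    with open(os.path.join(tmp, "part-00001"), "w", encoding="utf-8") as f:
        f.write("Merlot\t5\n")
    totals = aggregate_parts(tmp)

# Keys ranked by total count, highest first.
print(totals.most_common())
```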

43. Conclusion: Hadoop MapReduce provides out-of-the-box, ready-to-go distributed computing.

44. That's all folks and thanks for attending: QUICK AND DIRTY PARALLEL PROCESSING ON THE CLOUD Daniel Sikar
