HadoopDB

1. HadoopDB Miguel Angel Pastor Olivar miguelinlas3 at gmail dot com http://miguelinlas3.blogspot.com http://twitter.com/miguelinlas3

2. Contents Introduction, objectives and background

3. HadoopDB Architecture

4. Results

5. Conclusions

6. Introduction

7. General Analytics are important today

8. The amount of data is exploding

9. Previous problem -> Shared nothing architectures

10. Approaches: Parallel databases

11. Map/Reduce systems

12. Desired properties Performance Cheaper upgrades

13. Pricing model (cloud) Fault tolerance Transactional workloads: recover

14. Analytics environments: do not restart queries

15. Problems when scaling

16. Desired properties Heterogeneous environments Increasing number of nodes

17. Homogeneity is difficult to maintain Flexible query interface BI tools usually use JDBC or ODBC

18. UDF mechanism

19. Both SQL and non-SQL interfaces are desirable

20. Background: parallel databases Standard relational tables and SQL Indexing, compression, caching, I/O sharing Tables partitioned over nodes Transparent to the user

21. Tailored optimizer Meets performance Needs a highly skilled DBA

22. Background: parallel databases Flexible query interfaces UDF support varies across implementations Fault tolerance Does not score so well

23. Assumption: failures are rare

24. Assumption: dozens of nodes in clusters

25. Engineering decisions

26. Background: Map/Reduce

27. Background: Map/Reduce Satisfies fault tolerance

28. Works in heterogeneous environments

29. Drawback: performance No prior data modeling

30. No performance-enhancing techniques Interfaces Write M/R jobs in multiple languages

31. SQL not supported directly (Hive)

32. HadoopDB

33. Ideas Main goal: achieve the properties described before

34. Connect multiple single-datanode systems Hadoop responsible for task coordination and network layer

35. Queries parallelized across the nodes Fault tolerant and works on heterogeneous nodes

36. Parallel databases performance Query processing in database engine

37. Architecture background Hadoop distributed file system (HDFS) Block structured file system managed by central node

38. Files broken into blocks and distributed Processing layer (Map/Reduce framework) Master/slave architecture

39. Job and Task trackers
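
As a rough illustration of how a block-structured file system breaks a file into blocks and spreads them over nodes, the sketch below splits a file size into fixed-size ranges and places them round-robin. The function names, the node names, and the round-robin policy are assumptions made for this example; real HDFS placement is decided by the central node and also involves replication.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_round_robin(blocks, nodes):
    """Hypothetical placement policy: block i goes to node i mod len(nodes)."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement


MB = 1024 * 1024
blocks = split_into_blocks(600 * MB, 256 * MB)   # two full blocks plus a remainder
placement = assign_round_robin(blocks, ["node-a", "node-b"])
```

The last block is shorter than the others whenever the file size is not a multiple of the block size.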

40. Architecture

41. Database connector module Interface between database and task tracker

42. Responsibilities Connect to the database

43. Execute the SQL query

44. Return the results as key-value pairs Goal achieved Data sources are treated like data blocks in HDFS
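
The three connector responsibilities (connect, execute the SQL query, return key-value pairs) can be sketched as follows. This is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for a node's local database, and the function name `run_query_as_key_values` and the sample rows are made up for the example.

```python
import sqlite3


def run_query_as_key_values(conn, sql, key_column=0):
    """Execute sql and yield one (key, value) pair per result row.

    The key is the column at key_column; the value is a tuple of the
    remaining columns.
    """
    for row in conn.execute(sql):
        key = row[key_column]
        value = row[:key_column] + row[key_column + 1:]
        yield key, value


# Stand-in for a single node's local database (assumption: in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rankings (pageURL TEXT, pageRank INTEGER)")
conn.executemany(
    "INSERT INTO Rankings VALUES (?, ?)",
    [("a.example", 5), ("b.example", 42), ("c.example", 11)],
)

pairs = list(run_query_as_key_values(
    conn,
    "SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10 ORDER BY pageURL",
))
```

Because the connector hands back plain key-value pairs, the Map/Reduce layer can consume a database partition the same way it would consume a file block.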

45. Catalog module Metadata about databases Database location, driver class, credentials

46. Datasets in the cluster, replicas and partitioning Catalog stored as an XML file in HDFS

47. Planned to be deployed as a separate service
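
To make the catalog idea concrete, here is a minimal sketch that reads database location, driver class, credentials and partition information from an XML document. The element and attribute names (`catalog`, `node`, `database`, `partition`) and all values are hypothetical; the slides do not show the actual HadoopDB catalog schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout, invented for this example.
CATALOG_XML = """
<catalog>
  <node host="node-a">
    <database url="jdbc:postgresql://node-a/db0"
              driver="org.postgresql.Driver"
              user="analyst" password="secret">
      <partition table="UserVisits" id="0"/>
    </database>
  </node>
  <node host="node-b">
    <database url="jdbc:postgresql://node-b/db0"
              driver="org.postgresql.Driver"
              user="analyst" password="secret">
      <partition table="UserVisits" id="1"/>
    </database>
  </node>
</catalog>
"""


def partitions_for_table(catalog_xml, table):
    """Return (host, database url, partition id) for every partition of table."""
    root = ET.fromstring(catalog_xml.strip())
    result = []
    for node in root.findall("node"):
        for db in node.findall("database"):
            for part in db.findall("partition"):
                if part.get("table") == table:
                    result.append(
                        (node.get("host"), db.get("url"), int(part.get("id")))
                    )
    return result


locations = partitions_for_table(CATALOG_XML, "UserVisits")
```

A planner could use such a lookup to decide which nodes need to run a query for a given table.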

48. Data loader module Responsibilities: Globally repartitioning data

49. Breaking single-node data into chunks

50. Bulk-load data into single-node chunks Two main components: Global hasher Map/Reduce job reads from HDFS and repartitions Local hasher Copies from HDFS to local file system
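
The two-stage partitioning done by the data loader can be sketched as below: a global hasher assigns each record to a node, and a local hasher breaks one node's records into chunks. The function names, the CRC32 hash, and the `"local:"` salt are assumptions made for this illustration, not details taken from HadoopDB.

```python
import zlib


def stable_hash(key):
    """Deterministic hash of a string key (Python's built-in hash() is salted)."""
    return zlib.crc32(key.encode("utf-8"))


def global_hash(records, key_of, num_nodes):
    """Global hasher: repartition records across nodes by key."""
    per_node = [[] for _ in range(num_nodes)]
    for record in records:
        per_node[stable_hash(key_of(record)) % num_nodes].append(record)
    return per_node


def local_hash(node_records, key_of, num_chunks):
    """Local hasher: break one node's records into smaller chunks.

    The key is salted so the chunk index is computed from a different
    hash value than the one used to pick the node.
    """
    chunks = [[] for _ in range(num_chunks)]
    for record in node_records:
        index = stable_hash("local:" + key_of(record)) % num_chunks
        chunks[index].append(record)
    return chunks


records = [("10.0.0.%d" % i, float(i)) for i in range(20)]
per_node = global_hash(records, key_of=lambda r: r[0], num_nodes=3)
chunks = local_hash(per_node[0], key_of=lambda r: r[0], num_chunks=2)
```

Records with the same key always land on the same node and in the same chunk, which is what lets later queries on the partitioning key run locally.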

51. SMS Planner module SQL interface for analysts, based on Hive

52. Steps AST building

53. Semantic analyzer connects to catalog

54. DAG of relational operators

55. Optimizer restructuring

56. Convert plan to M/R jobs

57. DAG in M/R serialized in XML plan

58. SMS Planner extensions Update metastore with table references

59. Two phases before execution Retrieve data fields to determine partitioning keys

60. Traverse DAG (bottom up). Rule based SQL generator
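
A bottom-up traversal with a rule-based SQL generator can be illustrated on a very small operator plan. This is a simplified sketch: the `Operator` class, the three operator kinds, and the rules are invented for the example, and the plan is a linear chain rather than a general DAG.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operator:
    kind: str                          # "scan", "filter" or "project"
    arg: str                           # table name, predicate, or column list
    child: Optional["Operator"] = None


def to_sql(op):
    """Walk the operator chain bottom up and assemble one SQL statement."""
    clauses = {"select": "*", "from": None, "where": []}

    def visit(node):
        if node.child is not None:
            visit(node.child)          # visit children first: bottom-up order
        if node.kind == "scan":
            clauses["from"] = node.arg
        elif node.kind == "filter":
            clauses["where"].append(node.arg)
        elif node.kind == "project":
            clauses["select"] = node.arg
        else:
            raise ValueError(f"no SQL rule for operator {node.kind!r}")

    visit(op)
    sql = f"SELECT {clauses['select']} FROM {clauses['from']}"
    if clauses["where"]:
        sql += " WHERE " + " AND ".join(clauses["where"])
    return sql


plan = Operator("project", "pageURL, pageRank",
                Operator("filter", "pageRank > 10",
                         Operator("scan", "Rankings")))
sql = to_sql(plan)
```

Operators that have no rule raise an error here; in a real planner they would presumably stay in the Map/Reduce part of the plan instead of being pushed into SQL.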

61. Benchmarking

62. Environment Amazon EC2 “large” instances

63. Each instance 7.5 GB memory

64. 2 virtual cores

65. 850 GB storage

66. 64-bit Linux Fedora 8

67. Benchmarked systems Hadoop 256MB data blocks

68. 1024 MB heap size

69. 200 MB sort buffer HadoopDB Similar to the Hadoop configuration

70. PostgreSQL 8.2.5

71. No data compression

72. Benchmarked systems Vertica New parallel database (column store)

73. Used a cloud edition

74. All data is compressed DBMS-X Commercial parallel row store

75. Run on EC2 (no cloud edition available)

76. Data used HTTP log files, HTML pages, rankings

77. Sizes (per node): 155 million user visits (~20 GB)

78. 18 million rankings (~1 GB)

79. Stored as plain text in HDFS

80. Loading data

81. Grep Task

82. Selection Task Executed query SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10

83. Aggregation Task Smaller query SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7); Larger query SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP
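
One plausible way to picture how the larger aggregation query is split between the node databases and Map/Reduce: each node runs the SQL aggregation on its own partition, and a reduce step merges the partial sums. The sketch uses in-memory `sqlite3` databases as stand-ins for the per-node databases; the helper names and sample rows are made up for the illustration.

```python
import sqlite3
from collections import defaultdict

NODE_SQL = "SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP"


def make_node(rows):
    """Create a stand-in node database holding one partition of UserVisits."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE UserVisits (sourceIP TEXT, adRevenue REAL)")
    conn.executemany("INSERT INTO UserVisits VALUES (?, ?)", rows)
    return conn


def map_phase(conn):
    """Run the aggregation inside one node's database; return partial sums."""
    return conn.execute(NODE_SQL).fetchall()


def reduce_phase(partials):
    """Merge the partial sums for each sourceIP across all nodes."""
    totals = defaultdict(float)
    for source_ip, partial_sum in partials:
        totals[source_ip] += partial_sum
    return dict(totals)


nodes = [
    make_node([("1.1.1.1", 1.0), ("2.2.2.2", 2.0), ("1.1.1.1", 0.5)]),
    make_node([("1.1.1.1", 4.0), ("3.3.3.3", 8.0)]),
]
partials = [pair for conn in nodes for pair in map_phase(conn)]
totals = reduce_phase(partials)
```

The reduce step is needed here because the sample data is not partitioned on `sourceIP`, so the same address can appear on more than one node.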

84. Join Task Query SELECT sourceIP, COUNT(pageRank), SUM(pageRank), SUM(adRevenue) FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN '2000-01-15' AND '2000-01-22' GROUP BY UV.sourceIP; Not the same query

85. UDF Aggregation Task

86. Summary HadoopDB approaches the performance of parallel databases in the absence of failures PostgreSQL is not a column store

87. DBMS-X results are 15% overly optimistic

88. No compression in PostgreSQL data Outperforms Hadoop

89. Fault tolerance and heterogeneous environments

90. Benchmarks

91. Discussion Vertica is faster

92. Fewer nodes are needed to achieve performance of the same order of magnitude

93. Fault tolerance is important

94. Conclusions

95. Conclusion Approaches the performance of parallel databases while keeping fault tolerance

96. PostgreSQL is not a column store

97. Hadoop and Hive are relatively new open source projects

98. HadoopDB is flexible and extensible

99. References

100. References Hadoop web page

101. HadoopDB article

102. HadoopDB project

103. Vertica

104. Apache Hive

105. That's all!
