Web Services in Hadoop
Nicholas Sze and Alan F. Gates
@szetszwo, @alanfgates




REST-ful API Front-door for Hadoop
• Opens the door to languages other than Java
• Thin clients via web services instead of fat clients on a gateway machine
• Insulation from interface changes from release to release


                     HCatalog web interfaces
                               |
               +---------------+---------------+
               |               |               |
           MapReduce          Pig             Hive
               |               |               |
               +---------------+---------------+
                               |
                           HCatalog
                               |
               +---------------+---------------+
               |               |               |
             HDFS            HBase      External Store


Not Covered in this Talk
•  HttpFS (formerly known as Hoop) – same API as WebHDFS, but proxied
•  Stargate – REST API for HBase




HDFS Clients
• DFSClient: the native client
  – High performance (uses Hadoop RPC)
  – Java binding only
• libhdfs: a C client interface
  – Uses JNI => large overhead
  – Also Java-bound (requires a Hadoop installation)




HFTP
• Designed for cross-version copying (DistCp)
  – High performance (using HTTP)
  – Read-only
  – The HTTP API is proprietary
  – Clients must use HftpFileSystem (hftp://)


• WebHDFS is a rewrite of HFTP



Design Goals

• Support a public HTTP API

• Support Read and Write

• High Performance

• Cross-version

• Security



WebHDFS features
• HTTP REST API
  – Defines a public API
  – Permits non-Java client implementations
  – Supports common tools like curl and wget


• Wire Compatibility
  – The REST API will be maintained for wire compatibility
  – WebHDFS clients can talk to different Hadoop versions.




WebHDFS features (2)

• A Complete HDFS Interface
  – Supports all user operations
     – reading files
     – writing to files
     – mkdir, chmod, chown, mv, rm, …


• High Performance
  – Uses HTTP redirection to provide data locality
  – File reads and writes are redirected to the corresponding datanodes
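
To make these operations concrete, here is a minimal sketch of a few of them driven over plain HTTP from Python, assuming the requests library, pseudo authentication via the user.name parameter, and the namenode:50070 endpoint used in the examples later in this deck:

  import requests

  BASE = "http://namenode:50070/webhdfs/v1"
  USER = {"user.name": "szetszwo"}   # assumed user, for illustration only

  # mkdir: PUT ...?op=MKDIRS
  r = requests.put(BASE + "/user/szetszwo/data", params=dict(USER, op="MKDIRS"))
  print(r.json())                    # {"boolean": true} on success

  # mv: PUT ...?op=RENAME&destination=...
  requests.put(BASE + "/user/szetszwo/data",
               params=dict(USER, op="RENAME",
                           destination="/user/szetszwo/archive"))

  # rm: DELETE ...?op=DELETE&recursive=true
  requests.delete(BASE + "/user/szetszwo/archive",
                  params=dict(USER, op="DELETE", recursive="true"))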



WebHDFS features (3)

• Secure Authentication
  – Same as Hadoop authentication: Kerberos (SPNEGO)
    and Hadoop delegation tokens
  – Supports proxy users


• An HDFS Built-in Component
  – WebHDFS is a first-class, built-in component of HDFS
  – Runs inside NameNodes and DataNodes

• Apache Open Source
  – Available in Apache Hadoop 1.0 and above.

WebHDFS URI & URL
• FileSystem scheme:
          webhdfs://

• FileSystem URI:
          webhdfs://<HOST>:<HTTP_PORT>/<PATH>

• HTTP URL:
  http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..

  – Path prefix:    /webhdfs/v1
  – Query:          ?op=..
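
The mapping between the two forms is mechanical, so a client can derive one from the other. A small illustrative helper (the function name is ours, not part of any API):

  from urllib.parse import urlparse

  def webhdfs_http_url(fs_uri, op):
      # webhdfs://<HOST>:<HTTP_PORT>/<PATH> -> http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..
      u = urlparse(fs_uri)
      return "http://%s/webhdfs/v1%s?op=%s" % (u.netloc, u.path, op)

  print(webhdfs_http_url("webhdfs://namenode:50070/user/szetszwo/w.txt", "OPEN"))
  # http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN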



URI/URL Examples
•  Suppose we have the following file
     hdfs://namenode:8020/user/szetszwo/w.txt

•  WebHDFS FileSystem URI
    webhdfs://namenode:50070/user/szetszwo/w.txt

•  WebHDFS HTTP URL
http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=..

•  WebHDFS HTTP URL to open the file
http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN

Example: curl
•  Use curl to open a file

$curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN"

HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Content-Length: 0
Server: Jetty(6.1.26)




Example: curl (2)

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 21
Server: Jetty(6.1.26)

Hello, WebHDFS user!
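
The same exchange works from any HTTP library, not just curl. A sketch in Python with requests, which follows the 307 redirect to the datanode automatically (like curl -L):

  import requests

  r = requests.get("http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt",
                   params={"op": "OPEN"})
  print(r.status_code)   # 200
  print(r.text)          # Hello, WebHDFS user!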




Example: wget
•  Use wget to open the same file

$wget "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN" -O w.txt

Resolving ...
Connecting to ... connected.
HTTP request sent, awaiting response...
307 TEMPORARY_REDIRECT
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0 [following]




Example: wget (2)

--2012-06-13 01:42:10-- http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Connecting to 192.168.5.2:50075... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21 [application/octet-stream]
Saving to: `w.txt'

100%[=================>] 21                --.-K/s     in 0s

2012-06-13 01:42:10 (3.34 MB/s) - `w.txt' saved [21/21]




Example: Firefox

(Screenshot: opening the same WebHDFS OPEN URL in Firefox.)
HCatalog REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop
•  Uses JSON to describe metadata objects
•  Versioned, because we assume we will have to update it:
   http://hadoop.acme.com/templeton/v1/…
•  Runs in a Jetty server
•  Supports security
     –  Authentication done via Kerberos using SPNEGO
•  Included in HDP, runs on Thrift metastore server machine
•  Not yet checked in, but you can find the code on Apache’s JIRA
   HCATALOG-182




HCatalog REST API
Get a list of all tables in the default database:

Request:
   GET http://…/v1/ddl/database/default/table

Response:
   {
       "tables": ["counted","processed",…],
       "database": "default"
   }

Indicate the user with a URL parameter:
   http://…/v1/ddl/database/default/table?user.name=gates
Actions are authorized as the indicated user.
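
The same call from Python, as a sketch (the host is the placeholder from the versioning bullet on the previous slide):

  import requests

  r = requests.get("http://hadoop.acme.com/templeton/v1/ddl/database/default/table",
                   params={"user.name": "gates"})
  print(r.json())   # {"tables": [...], "database": "default"}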

HCatalog REST API
Create the new table “rawevents”:

Request:
   PUT http://…/v1/ddl/database/default/table/rawevents

   {"columns": [{ "name": "url", "type": "string" },
                { "name": "user", "type": "string" }],
    "partitionedBy": [{ "name": "ds", "type": "string" }]}

Response:
   {
       "table": "rawevents",
       "database": "default"
   }
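
Creating the same table from Python, again as a sketch against an assumed templeton server:

  import requests

  table = {"columns": [{"name": "url",  "type": "string"},
                       {"name": "user", "type": "string"}],
           "partitionedBy": [{"name": "ds", "type": "string"}]}

  r = requests.put(
      "http://hadoop.acme.com/templeton/v1/ddl/database/default/table/rawevents",
      params={"user.name": "gates"},
      json=table)                    # sends the JSON body shown above
  print(r.json())                    # {"table": "rawevents", "database": "default"}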




HCatalog REST API
Describe table “rawevents”:

Request:
   GET http://…/v1/ddl/database/default/table/rawevents

Response:
   {
       "columns": [{"name": "url","type": "string"},
                   {"name": "user","type": "string"}],
       "database": "default",
       "table": "rawevents"
   }




Job Management
•  Includes APIs to submit and monitor jobs
•  Any files needed for the job are first uploaded to HDFS via WebHDFS (see the sketch after this list)
   –  Pig and Hive scripts
   –  Jars, Python scripts, or Ruby scripts for UDFs
   –  Pig macros
•  Results from the job are stored to HDFS and can be retrieved via WebHDFS
•  The user is responsible for cleaning up output in HDFS
•  Job state information is stored in ZooKeeper or HDFS
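
A sketch of the upload step, using WebHDFS's two-step CREATE: the namenode answers with a 307 redirect naming a datanode, and the client sends the file bytes there (host, user, and file names are placeholders):

  import requests

  url = "http://namenode:50070/webhdfs/v1/user/gates/wordcount.pig"

  # Step 1: ask the namenode where to write; do not follow the redirect yet.
  r = requests.put(url, params={"op": "CREATE", "user.name": "gates"},
                   allow_redirects=False)
  datanode_url = r.headers["Location"]          # 307 TEMPORARY_REDIRECT

  # Step 2: send the script bytes to the datanode; 201 Created on success.
  with open("wordcount.pig", "rb") as f:
      requests.put(datanode_url, data=f)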




Job Submission
•  Can submit MapReduce, Pig, and Hive jobs
•  POST parameters include
   –  script to run or HDFS file containing script/jar to run
   –  username to execute the job as
   –  optionally an HDFS directory to write results to (defaults to user’s home directory)
   –  optionally a URL to invoke GET on when job is done


Request:
   POST http://hadoop.acme.com/templeton/v1/pig

Response:
   {"id": "job_201111111311_0012",…}




Find all Your Jobs
•  GET on queue returns all jobs belonging to the submitting user
•  Pig, Hive, and MapReduce jobs will be returned




Request:
   GET http://…/templeton/v1/queue?user.name=gates

Response:
   {"job_201111111311_0008",
    "job_201111111311_0012"}




Get Status of a Job
•  Doing a GET on the jobid returns information about a particular job
•  Can be used to poll to see if the job has finished
•  Used after the job finishes to retrieve job information
•  Doing a DELETE on the jobid kills the job




Request:
   GET http://…/templeton/v1/queue/job_201111111311_0012

Response:
   {…, "percentComplete": "100% complete",
       "exitValue": 0,…
       "completed": "done"
   }
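
Put together, a client can poll that endpoint until the job reports done, and issue a DELETE on the same URL to kill it. A minimal sketch using the field names shown above:

  import requests, time

  url = "http://hadoop.acme.com/templeton/v1/queue/job_201111111311_0012"

  while True:
      status = requests.get(url, params={"user.name": "gates"}).json()
      if status.get("completed") == "done":     # field from the response above
          break
      time.sleep(10)                            # poll every 10 seconds

  # To kill the job instead:
  # requests.delete(url, params={"user.name": "gates"})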




Future
•  Job management
   –  Job management APIs don’t belong in HCatalog
   –  They are only there by historical accident
   –  They need to move out to the MapReduce framework
•  Authentication needs more options than Kerberos
•  Integration with Oozie
•  Need a directory service
   –  Users should not need to connect to different servers for HDFS, HBase, HCatalog,
      Oozie, and job submission




