Copyright © 2013 Splunk Inc.

Hunk: Technical Overview
Agenda
1. What is Hunk?
2. Powerful Developer Platform
3. Preparation
4. Connect Hunk to HDFS and MapReduce
5. Create Virtual Indexes
6. MapReduce as the Orchestration Framework
7. Search Data in Hadoop
8. Flexible, Iterative Workflow for Business Users

Explore, Analyze, Visualize Data in Hadoop
Unlock business value of data in Hadoop

No fixed schema to search unstructured data

Fast to learn instead of scarce skills

Preview results while MapReduce jobs start

Integrated – explore, analyze and visualize

Easier app development than in raw Hadoop

Unmet Needs for Hadoop Analytics

OPTION 1: “Do it yourself” Hadoop / Pig
Problems
• Scarce skill sets to hire
• Need to know MapReduce
• Wait for slow jobs to finish
• Upfront schema (Pig)
• No interactive exploration
• No results preview
• No built-in visualization
• No granular authentication
• Slow time to value

OPTION 2: Hive or SQL on Hadoop
Problems
• Pre-defined fixed schema
• Need knowledge of data
• Miss data that “doesn’t fit”
• No results preview
• No built-in visualization
• No granular authentication
• Scarce skill sets to hire
• Slow time to value

OPTION 3: Extract to in-memory store
Problems
• Data too big to move
• Limited drill down to raw data
• No results preview
• Another data mart
• Expensive hardware

Integrated Analytics Platform for Hadoop Data
Explore, Analyze, Visualize, Dashboards, Share
• Full-featured, integrated product
• Insights for everyone
• Works with what you have today: Hadoop (MapReduce & HDFS)

About Hunk
• Delivery Model: Licensed install
• License Model: Size of Hadoop cluster (number of Hadoop DataNodes); Hunk does not require a Splunk Enterprise license
• Trial License: Free for 60 days
• Where Data is Stored and Read: HDFS or HDFS proprietary variants (MapR); needs read-only access to data
• Supported Hadoop Distributions: Hortonworks, Cloudera, MapR and Pivotal
• Indexes: Virtual Indexes
• Supported Operating Systems: 64-bit Linux
• Operations Management: Splunk App for HadoopOps
• Data Ingest Management: HDFS API or Flume / Scribe / Sqoop (not managed by Hunk); Splunk Hadoop Connect between Splunk Enterprise and HDFS

What Hunk Does Not Do
1. Hunk does not replace your Hadoop distribution
2. Hunk does not replace or require Splunk Enterprise
3. Hunk is interactive, but not built for real-time or needle-in-a-haystack search
4. No data ingest management
5. No Hadoop operations management

Product Portfolio
• Real-time indexing, real-time search
• Ad hoc analytics of historical data in Hadoop
• Splunk Hadoop Connect
• Splunk Apps
• App Dev & App Mgmt., IT Ops., Web Intelligence, Security & Compliance, Business Analytics
• Product and Service Analytics, Complete 360° Customer View, Security Analytics
• Developers building big data apps on top of Hadoop
• Vibrant and passionate developer community

Powerful Developer Platform with Familiar Tools
• Add new UI components
• With known languages and frameworks: JavaScript, Java, Python, PHP, C#, Ruby
• Integrate into existing systems: API

Integration Methods
Dashboards and Views, User Interface Extensibility
• Interactive dashboards and user workflows
• Simple or advanced XML or REST API and SDKs
• Custom styling, behavior & visuals
• iframe embed
• Integrate Hunk charts, dashboards and query results into other applications
• Create workflows that trigger an action in an external system or use REST endpoints

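One way to bring query results into another application over REST is Splunk's search export endpoint. The sketch below only builds the HTTP request and does not send it. It assumes Hunk exposes the same management REST API as Splunk Enterprise on the default management port 8089; the host name, credentials, index name and the `status` field are placeholders, not values from this deck.

```python
import base64
import urllib.parse
import urllib.request


def build_export_request(base_url, username, password, spl):
    """Build (but do not send) a POST to Splunk's search export endpoint."""
    # The export endpoint runs a one-off search and streams its results.
    url = base_url.rstrip("/") + "/services/search/jobs/export"
    body = urllib.parse.urlencode({
        "search": spl,            # full SPL, starting with the `search` command
        "output_mode": "json",    # ask for JSON instead of the default XML
    }).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", "Basic " + token)
    return request


# Placeholder host, credentials and virtual index name (all hypothetical).
req = build_export_request(
    "https://hunk.example.com:8089", "admin", "changeme",
    "search index=apache_logs | stats count by status",
)
# To actually run the search, pass `req` to urllib.request.urlopen()
# and read the streamed JSON lines from the response.
```

Building the request separately from sending it keeps the example testable without a running server.
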
Preparation
1. Who are the business and IT users?
2. What are your goals for analytics of data in Hadoop?
3. What are the potential use cases?
4. What is your Hadoop environment?
5. What are your Hadoop access policies?

Prerequisites
• Data in Hadoop to analyze
• Hadoop client libraries
• Hadoop access rights
• Java 1.6+
• HDFS scratch space
• DataNode local temp disk space

Get Started
1. Set up virtual or physical 64-bit Linux server
2. Download and install Hunk software
3. Start Splunk: ./splunk/bin/splunk start
4. Follow instructions to install or update Hadoop client libraries and Java

Hunk Server
Explore, Analyze, Visualize, Dashboards, Share
REST API, command line, ODBC (beta)
splunkweb
• Web and application server
• Python, AJAX, CSS, XSLT, XML
splunkd
• Search Head
• Virtual Indexes
• C++, Web Services
Hadoop Interface
• Hadoop Client Libraries
• Java
64-bit Linux OS

Hunk Uses Virtual Indexes

• Enables seamless use of almost the entire Splunk stack on data in Hadoop
• Automatically handles MapReduce
• Technology is patent pending
Examples of Virtual Indexes
Hunk Search Head >
• index = syslog (/home/syslog/…)
• index = apache_logs
• index = sensor_data
• index = twitter
External System 1, External System 2, External System 3

Point at Hadoop Cluster
• Specify basic properties about the Hadoop cluster
• Hunk works with any compression method supported by HDFS (e.g., gzip, bzip or lzo)

Set Additional Parameters
• Prepopulated fields save time and can be overwritten
• Add more MapReduce settings
• Configuration files can be edited manually: indexes.conf, props.conf and transforms.conf
• No restart is necessary if working with .conf files.

Define Virtual Indexes and Paths
External Resource (e.g. hadoop.prod)
• Virtual Index (e.g. twitter)
• Virtual Index (e.g. sensor data)
• Virtual Index (e.g. Apache logs)

Specify Virtual Index and data paths, and optionally:
• Filter files or directories using a whitelist or blacklist
• Extract metadata or time range from paths
• Use props/transforms.conf to specify search time processing

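The whitelist and blacklist filtering mentioned above can be pictured as two regular expressions applied to each file path. This is a minimal sketch of the idea, not Hunk's own code: the function name, the example paths, and the rule that a blacklist match wins over a whitelist match are all assumptions made for illustration.

```python
import re


def select_paths(paths, whitelist=None, blacklist=None):
    """Keep paths matching the whitelist (if given) and not matching the blacklist."""
    allow = re.compile(whitelist) if whitelist else None
    deny = re.compile(blacklist) if blacklist else None
    selected = []
    for path in paths:
        if allow and not allow.search(path):
            continue  # not whitelisted
        if deny and deny.search(path):
            continue  # explicitly blacklisted (assumed to take precedence)
        selected.append(path)
    return selected


# Hypothetical file listing under a virtual index's data path.
listing = [
    "/data/apache/2013-06-10/access.log.gz",
    "/data/apache/2013-06-10/access.log.gz.tmp",
    "/data/apache/2013-06-10/_SUCCESS",
    "/data/apache/2013-06-11/access.log.gz",
]
kept = select_paths(listing, whitelist=r"\.log(\.gz)?$", blacklist=r"/_")
print(kept)
```

Here the temporary file and the `_SUCCESS` marker are skipped, so only the two compressed log files would be searched.
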
Set Authentication and Access Control
• Splunk role-based access control
• No field-based access control
• LDAP/AD for authentication and group management
• Single sign on (tokens, certificates)

MapReduce as the Orchestration Framework
1. Copy splunkd binary (.tgz) from the Hunk Search Head to HDFS
2. Copy .tgz from HDFS to each TaskTracker (TaskTracker 1, TaskTracker 2, TaskTracker 3)
3. Expand in specified location on each TaskTracker
4. Receive binary in subsequent searches

Search Data in Hadoop
Diagram (steps 1 to 5): Hunk Search Head, JSON configs, MapReduce jobs, External Resource (e.g. hadoop.prod), JobTracker (MapReduce Resource Manager in YARN), Tasks, NameNode, DataNode / TaskTracker (Node in YARN), HDFS working directory
• Run a copy of splunkd to process

Data Processing Pipeline
• Raw data (HDFS)
• Custom processing (MapReduce/Java): you can plug in data preprocessors, e.g. Apache Avro or format readers
• stdin
• Indexing pipeline (splunkd/C++): event breaking, timestamping
• Search pipeline (splunkd/C++): event typing, lookups, tagging, search processors

Hunk Applies Schema on the Fly
• Structure applied at search time
• No brittle schema to work around
• Automatically find patterns and trends

Hunk applies schema for all fields – including transactions – at search time

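Applying structure at search time can be illustrated with a small sketch: raw lines are stored untouched, and a field-extraction rule is applied only when a query runs. The log lines, the regular expression and the field names below are hypothetical, and the code illustrates the general idea rather than how Hunk implements it.

```python
import re

# Raw events are stored untouched; nothing is parsed at write time.
raw_events = [
    '10.0.0.1 - - [10/Jun/2013:01:15:02] "GET /cart HTTP/1.1" 200 512',
    '10.0.0.2 - - [10/Jun/2013:01:15:09] "POST /checkout HTTP/1.1" 500 87',
    'unexpected line that fits no pattern',
]

# A hypothetical extraction rule, chosen at query time and easy to change.
ACCESS_LOG = re.compile(
    r'(?P<clientip>\S+) .* "(?P<method>[A-Z]+) (?P<uri>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)


def search(events, pattern, **filters):
    """Apply the pattern while searching and yield events whose fields match."""
    for raw in events:
        match = pattern.search(raw)
        fields = match.groupdict() if match else {}
        fields["_raw"] = raw  # the original text is always kept
        if all(fields.get(name) == value for name, value in filters.items()):
            yield fields


errors = list(search(raw_events, ACCESS_LOG, status="500"))
everything = list(search(raw_events, ACCESS_LOG))
```

The line that fits no pattern is still returned by an unfiltered search, with only its raw text, so data that "doesn't fit" is not lost.
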
Hunk Usage in HDFS
hdfs://<scratch_space_path>/
• bundles – Search Head bundles: keeps last 5 bundles
• packages – Hunk .tgz packages: no automatic cleanup
• dispatch/<sid> – Search scratch space: cleanup when sid is invalid

Search Optimization: Partition Pruning

• Most data types are stored in hierarchical directories
– Such as /<base_path>/<date>/<hour>/<hostname>/somefile.log

• You can instruct Hunk to extract fields and time ranges from a path
• Searches ignore directories that cannot possibly contain search results
– Such as directories whose time range falls outside the range the search asks for

Example time-based partition pruning
Search: index=hunk earliest_time="2013-06-10T01:00:00" latest_time="2013-06-10T02:00:00"

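The pruning idea above can be sketched in a few lines: derive a time range from each path and skip any path whose range cannot overlap the search window. This is an illustration only. The date format (YYYY-MM-DD), the hour format (HH), the example paths, and treating the latest time as exclusive are assumptions, not details taken from Hunk.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: /<base_path>/<date>/<hour>/<hostname>/somefile.log
# with <date> as YYYY-MM-DD and <hour> as HH (both formats are assumptions).
PATH_TIME = re.compile(r"/(?P<date>\d{4}-\d{2}-\d{2})/(?P<hour>\d{2})/")


def path_time_range(path):
    """Return the [start, end) hour covered by a path, or None if it has no time."""
    match = PATH_TIME.search(path)
    if match is None:
        return None
    start = datetime.strptime(match["date"] + " " + match["hour"], "%Y-%m-%d %H")
    return start, start + timedelta(hours=1)


def prune(paths, earliest, latest):
    """Keep only paths whose time range can overlap [earliest, latest)."""
    kept = []
    for path in paths:
        covered = path_time_range(path)
        if covered is None:
            kept.append(path)  # no time in the path: cannot rule it out
            continue
        start, end = covered
        if start < latest and end > earliest:
            kept.append(path)
    return kept


paths = [
    "/logs/2013-06-10/00/web01/somefile.log",
    "/logs/2013-06-10/01/web01/somefile.log",
    "/logs/2013-06-10/01/web02/somefile.log",
    "/logs/2013-06-10/02/web01/somefile.log",
    "/logs/2013-06-11/01/web01/somefile.log",
]
kept = prune(paths, datetime(2013, 6, 10, 1), datetime(2013, 6, 10, 2))
print(kept)
```

For the one-hour window in the example search, only the two files under the `01` hour directory of 2013-06-10 remain, so the other directories never need to be read.
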
Common Issues with Hunk Configuration
• User running Hunk lacks permission to write to HDFS or run MapReduce
• HDFS scratch space for Hunk is not writable
• DataNode or TaskTracker scratch space is not writable or out of disk
• Data reading permission issues

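For the local side of these issues (a scratch directory that is not writable or is out of disk), a quick preflight check might look like the sketch below. The function name and the free-space threshold are hypothetical. It covers only a local directory on one node; checking HDFS permissions requires Hadoop's own tools and is not shown.

```python
import os
import shutil
import tempfile


def check_scratch_dir(path, min_free_bytes):
    """Return a list of problems found with a local scratch directory."""
    if not os.path.isdir(path):
        return [f"{path}: directory does not exist"]
    problems = []
    # Creating files in a directory needs both write and execute permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path}: not writable by the current user")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"{path}: only {free} bytes free, need {min_free_bytes}")
    return problems


# Example: require 1 GiB free in the system temp directory (threshold is arbitrary).
print(check_scratch_dir(tempfile.gettempdir(), min_free_bytes=1024 ** 3))
print(check_scratch_dir("/no/such/scratch/dir", min_free_bytes=1024 ** 3))
```

An empty list means no problem was found; run the check as the same user that runs Hunk or the MapReduce tasks, since permissions depend on the account.
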
Search Performance with MapReduce
MapReduce considerations
• Stats/chart/timechart/top/etc. commands work well in a distributed environment
– They MapReduce well
• Time and order commands don’t work well in a distributed environment
– They don’t MapReduce well

Summary Indexing
• Useful for speeding up searches
• Summaries could have different retention policy
• In most cases resides on the search head
• Backfill is a manual (scripted) process

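Why stats-style commands "MapReduce well" can be seen in a small sketch: each split is reduced independently to a small partial result, and the partials merge by simple addition in any order. The data, host names and function names are made up for illustration, and this is not Hunk's implementation.

```python
from collections import Counter
from functools import reduce

# Three hypothetical input splits, each processed by a separate map task.
# Every event is (host, latency in milliseconds).
splits = [
    [("web01", 120), ("web02", 80), ("web01", 100)],
    [("web02", 90), ("web03", 300)],
    [("web01", 110), ("web03", 250), ("web03", 50)],
]


def map_split(events):
    """Map side: reduce one split to small per-host partial counts and sums."""
    counts, totals = Counter(), Counter()
    for host, latency_ms in events:
        counts[host] += 1
        totals[host] += latency_ms
    return counts, totals


def merge(left, right):
    """Reduce side: partial results combine by addition, in any order."""
    return left[0] + right[0], left[1] + right[1]


counts, totals = reduce(merge, map(map_split, splits))
avg_latency = {host: totals[host] / counts[host] for host in counts}
print(avg_latency)
```

An order-dependent computation, such as comparing each event with the one before it in time, cannot be split this way, because the result for one event depends on events that may live in a different split.
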
Mixed-mode Search
Streaming
• Transfers first several blocks from HDFS to the Hunk Search Head for immediate processing
Reporting
• Pushes computation to the DataNodes and TaskTrackers for the complete search

• Hunk starts the streaming and reporting modes concurrently
• Streaming results show until the reporting results come in
• Allows users to search interactively by pausing and refining queries

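The two modes can be pictured with a small simulation: a preview computed from the first few blocks and a complete result computed over all blocks, both started at the same time. The block contents, function names and the choice of two preview blocks are assumptions for illustration; the `reporting` function merely stands in for the MapReduce job.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical HDFS blocks, each a list of HTTP status codes.
blocks = [[200, 200, 500], [200, 404], [500, 500, 200], [200], [404, 200, 200]]


def count_by_status(selected_blocks):
    """The search itself: count events per status code."""
    counts = {}
    for block in selected_blocks:
        for status in block:
            counts[status] = counts.get(status, 0) + 1
    return counts


def streaming(all_blocks, first_n=2):
    """Preview: process only the first few blocks locally."""
    return count_by_status(all_blocks[:first_n])


def reporting(all_blocks):
    """Complete search: stands in for the MapReduce job over every block."""
    return count_by_status(all_blocks)


with ThreadPoolExecutor(max_workers=2) as pool:
    # Both modes are submitted together, so they run concurrently.
    preview_future = pool.submit(streaming, blocks)
    final_future = pool.submit(reporting, blocks)
    preview = preview_future.result()  # shown to the user first
    final = final_future.result()      # replaces the preview when ready

print("preview:", preview)
print("final:  ", final)
```

The preview is only a partial answer, but it arrives early enough for a user to notice a badly formed query and refine it before the full job finishes.
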
Interactively Question your Data in Hadoop
• Pause means stop fetching results from Hadoop
• Stop means treat the current results as final and kill the MapReduce job

Data Discovery Modes

Hunk supports almost all of the Search Processing Language (SPL), excluding
Transactions and Localize, which require Splunk Enterprise native indexes.
Flexible, Iterative Workflow for Business Users
Interactive Analytics: Explore, Analyze, Visualize, Model, Pivot, Share
• Preview results
• Normalization as it’s needed
• Faster implementation and flexibility
• Easy search language + data models & pivot
• Multiple views into the same data

Thank You

SplunkLive! Hunk Technical Overview
