Retrieving Big Data
For the non-developer
Intended Audience
People who do not write code
But don’t want to wait for IT to bring them data
Disclaimer
You will have to write code. Sorry...
Worth Noting
A common objection, “But I’m not a developer”
Coding does not make you a developer, any
more than patching some drywall makes you
a carpenter
Agenda
● The minimum you need to know about Big
Data (Hadoop)
o Specifically, HBase and Pig
● How you can retrieve data in HBase with Pig
o How to use Python with Pig to make querying easier
One Big Caveat
● We are not talking about analysis
● Analysis is hard
● Learning code and trying to understand an
analytical approach is really hard
● Following a straightforward Pig tutorial is
better than a boring lecture
Big Data in One Slide (oh boy)
● Today, Big Data == Hadoop
● Hadoop is both a distributed file system
(HDFS) and an approach to messing with
data on the file system (MapReduce)
o HBase is a popular database that sits on top of
HDFS
o Pig is a high level language that makes messing
with data on HDFS or in HBase easier
HBase in one slide
● HBase = Hadoop Database, based on
Google’s Bigtable
● Column-oriented database – basically one
giant table
Pig in one slide
● A data flow language we will use to write
queries against HBase
● Pig is not the developer’s solution for
retrieving data from HBase, but it works well
enough for the BI analyst (and, of course, we
aren’t developers)
Pig is easier...Not Easy
● If you have no coding background, Pig will
not be easy
● But it’s the best of a bad set of options right
now
● Not hating on SQL-on-Hadoop providers, but
with SQL you describe the entire result you
want in one statement, which quickly gets
complicated; Pig lets you build it up step by step
Here’s our HBase table
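(The table itself is a screenshot; a minimal ‘peeps’ table consistent with the examples that follow might look like this, with row keys and values purely illustrative:)
row key | info:first_name | info:last_name | info:friends_0 | info:friends_1
row1    | Steve           | Buscemi        | John           | Paul
row2    | Willie          | Nelson         |                |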
Let’s dive in - Load
raw = LOAD 'hbase://peeps'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name', '-loadKey true -limit 1')
    AS (id:chararray, first_name:chararray, last_name:chararray);
You have to specify each field and its type in
order to load it
Response is as expected
'info:first_name info:last_name',
AS (first_name:chararray, last_name:chararray);
Will return a first name and last name as
separate fields, e.g., “Steve”, “Buscemi”
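For example, a dump of raw from the load statement above might return a single tuple along these lines (values illustrative): (row1,Steve,Buscemi), that is, the row key plus the two requested columns.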
If you can write a Vlookup()
=VLOOKUP(C34, Z17:AZ56, 17, FALSE)
You can write a load statement in Pig.
Both are equally esoteric.
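Loosely, the pieces line up: the table range (Z17:AZ56) plays the role of 'hbase://peeps', the column index (17) plays the role of the 'info:...' column list, and just as VLOOKUP makes you spell out the match type, Pig makes you spell out each field’s type in the AS clause.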
But what if we don’t know the fields?
● Suppose we have a column family of friends
● Each record will contain zero to many
friends, e.g., friend_0: “John”, friend_1:
“Paul”
The number of friends is variable
● There could be thousands of friends per row
● And we cannot specify “friend_5” because
there is no guarantee that each record has
five friends
This is common...
● NoSQL databases are known for flexible
schemas and flat table structures
● Unfortunately, the way Pig handles this
problem utterly sucks...
Loading unknown friends
raw = LOAD 'hbase://SampleTable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
    AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);
Now we have info:friends_*, which is represented
as a “map”
A map is just a collection of key-value pairs
● They look like this: friend_1# ‘Steve’,
friend_2# ‘Willie’
● They are very similar to Python dictionaries
(see the sketch below)
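In Python terms, the same pairs might look like this (a minimal sketch):
friends = {'friend_1': 'Steve', 'friend_2': 'Willie'}
friends['friend_1']  # returns 'Steve': you supply the key to get the value back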
Here’s why they suck
● We can’t iterate over them
● To access a value, in this case a friend’s
name, I have to provide the specific key,
e.g., friend_5, to get the name of the fifth
friend
But I thought you said we didn’t
know the number of friends?
● You are right – Pig expects us to supply a
specific key for something we can’t know in advance
● If only there were some way to iterate over a
collection of key-value pairs…
Enter Python
● Pig may not allow you to iterate over a map,
but it does allow you to write User-Defined
Functions (UDFs) in Python
● In a Python UDF we can read the map in as a
Python dict and return its key-value pairs
Python UDF for Pig
@outputSchema("values:bag{t:tuple(key, value)}")
def bag_of_tuples(map_dict):
    return map_dict.items()
We are passing in a map, e.g., “Friend_1#Steve, Friend_2#Willie”,
which arrives as a Python dict, e.g., {‘Friend_1’: ‘Steve’, ‘Friend_2’:
‘Willie’}, and returning its items as a bag of tuples
Based on a blog post by Chase Seibert
We can add loops and logic too
@outputSchema("status:chararray")
def get_steve(map_dict):
    # .items() gives us key-value pairs; looping over the dict alone yields only keys
    for key, value in map_dict.items():
        if value == 'Steve':
            return "I hate that guy"
        else:
            return value
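To call it from Pig, register the file and use it like any other function (a sketch, assuming the UDF is saved as sample_udf.py and raw was loaded with the friends map as above):
register 'sample_udf.py' using jython as my_udf;
statuses = FOREACH raw GENERATE id, my_udf.get_steve(friends);
dump statuses;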
Or if you just want the data in Excel
register 'sample_udf.py' using jython as my_udf;
raw = LOAD 'hbase://peeps'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
    AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);
clean_table = FOREACH raw GENERATE id, FLATTEN(my_udf.bag_of_tuples(friends));
dump clean_table;
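To actually get a file you can open in Excel, swap the dump for a STORE into a comma-delimited file (the output path is illustrative):
STORE clean_table INTO 'peeps_friends' USING PigStorage(',');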
Final Thought
Make Your Big Data Small
● Prototype your Pig scripts on your local file
system (see the sketch below)
o Download some data to your local machine
o Start your Pig shell from the command line: pig -x
local
o Load - Transform - Dump
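A minimal local-mode session might look like this (the file name and fields are illustrative):
-- started with: pig -x local
raw = LOAD 'peeps_sample.csv' USING PigStorage(',')
    AS (id:chararray, first_name:chararray, last_name:chararray);
names = FOREACH raw GENERATE first_name, last_name;
DUMP names;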
Notes
Pig Tutorials
● Excellent video on Pig
● Mortar Data introduction to Pig
● Flatten HBase column with Python
Me
● codingcharlatan.com
● @GusCavanaugh
