Big Data for Oracle Professionals

Arup Nanda
Big Data Explorer
[Chart: growth over time]
Tweet @ArupNanda

Hadoop
Map/Reduce
YARN
NoSQL
Spark
Flume.
fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400] "GET /news/news.html HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12 -0400] "GET /download/windows/asctab31.zip HTTP/1.0" 200 1540096 "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" "Mozilla/4.7 [en]C-SYMPA (Win95; U)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - -

fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
petabytes
unpredictable format
transient.
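The log lines above follow the common Apache "combined" layout, so a small parser can turn each line into named fields. The sketch below is only an illustration: the regular expression and the function name parse_log_line are assumptions, not something from the slides, and real logs often contain lines that break the pattern, which is the "unpredictable format" point.

```python
import re
from datetime import datetime

# Combined log layout: host ident user [time] "request" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None when the line does not fit the pattern."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record

sample = ('fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] '
          '"GET /contacts.html HTTP/1.0" 200 4595 "-" '
          '"FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"')
record = parse_log_line(sample)
print(record["host"], record["path"], record["status"], record["size"])

# A truncated line simply fails to parse instead of raising an error.
truncated = parse_log_line("123.123.123.123 - -")
```

Returning None for a malformed line keeps the decision about bad records (skip, count, or log them) with the caller.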
Metadata Repository


Volume
Variety
Velocity
CUSTOMERS (CUST_ID, NAME, ADDRESS)

CUSTOMERS (CUST_ID, NAME, ADDRESS, SPOUSE)
CUSTOMERS (CUST_ID, NAME, ADDRESS)
SPOUSES (CUST_ID, NAME, CURRENT)

CUSTOMERS (CUST_ID, NAME, ADDRESS)
SPOUSES (CUST_ID, NAME, CURRENT)
EMPLOYERS (CUST_ID, NAME, CURRENT)
Name = Data
Relationship status = Data
Married to = Data
In a relationship with = Data
Friends = Data, Data, Data
Likes = Data, Data
Mutually Exclusive, Maybe not?
Multiple Data Points

First Name John
Spouse Jane
Child Jill
Goes to Acme School
First Name Martha
Child goes to Acme School

First Name John
Spouse Jane
Child Jill
Goes to Acme School
First Name Martha
First Name Martha
Teacher Mrs Gillen
Teacher Mrs Gillen
Jill

First Name John
Spouse Jane
Child Jill
Goes to Acme School
Teacher Mr Fullmeister
First Name Irene
Boyfriend Henry
Works at Starwood
Hobby Photography
Ex-Spouse Jane

First Name Irene
Key Value
Key-Value Pair
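The people on the preceding slides each carry a different set of attributes, which is awkward for a fixed table but natural for key-value pairs. As a rough sketch (the helper name values_for is an assumption, and the records reuse the attribute names and values as they appear on the slides), each person can be held as a dictionary whose keys need not match anyone else's:

```python
# Each person is a bag of key-value pairs; records need not share the same keys.
people = [
    {"First Name": "John", "Spouse": "Jane", "Child": "Jill",
     "Goes to": "Acme School", "Teacher": "Mr Fullmeister"},
    {"First Name": "Irene", "Boyfriend": "Henry", "Works at": "Starwood",
     "Hobby": "Photography", "Ex-Spouse": "Jane"},
]

def values_for(key, records):
    """Collect the value stored under `key` from every record that has it."""
    return [record[key] for record in records if key in record]

names = values_for("First Name", people)    # every record has this key
hobbies = values_for("Hobby", people)       # only some records have this key
employers = values_for("Employer", people)  # no record has it: empty list, not an error
print(names, hobbies, employers)
```

A missing attribute is simply an absent key, so no column has to be added and no child table created when a new kind of fact shows up.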

John Smith and his wife Jane,
along with their daughter Jill,
were strolling on the beach
when they heard a crash. John
ran towards …
Map

begin
   like_count := 0
   no_comment := 0
   while (there_are_remaining_posts) loop
      get next post
      extract status of "like" for the specific post
      if status = "like" then
         like_count := like_count + 1
      else
         no_comment := no_comment + 1
      end if
   end loop
end
Counter()
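The Counter() pseudocode can be written as a small runnable function. This is a minimal sketch, assuming each post is a dictionary with an optional "status" key; that data shape is an assumption made for illustration and is not given on the slide.

```python
def counter(posts):
    """Count posts marked "like" versus everything else, as in the pseudocode."""
    like_count = 0
    no_comment = 0
    for post in posts:
        # A post without a status, or with any other status, counts as "no comment".
        if post.get("status") == "like":
            like_count += 1
        else:
            no_comment += 1
    return {"Likes": like_count, "No Comments": no_comment}

# Hypothetical sample posts, made up for this example.
posts = [{"status": "like"}, {"status": None}, {"status": "like"}, {}]
result = counter(posts)
print(result)
```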
Counter() Counter() Counter()

Likes=100, No Comments=300
Likes=50, No Comments=350
Likes=150, No Comments=250
Likes=300, No Comments=900
Reduce
Map/Reduce
Map: dividing the work among different nodes
Reduce: collating the results to get the final answer
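The two phases can be imitated on one machine to make the arithmetic concrete. In this sketch the three chunks are synthetic and sized to reproduce the partial counts on the slides (100/300, 50/350, 150/250); the function names are assumptions, and everything runs sequentially rather than on separate nodes.

```python
from functools import reduce

def counter(statuses):
    """Map step: count likes and non-likes in one node's share of the posts."""
    likes = sum(1 for status in statuses if status == "like")
    return {"Likes": likes, "No Comments": len(statuses) - likes}

def merge_counts(left, right):
    """Reduce step: add two partial results key by key."""
    return {key: left[key] + right[key] for key in left}

def make_chunk(likes, no_comments):
    """Build a synthetic list of statuses with the given counts."""
    return ["like"] * likes + ["none"] * no_comments

# One chunk per node.
chunks = [make_chunk(100, 300), make_chunk(50, 350), make_chunk(150, 250)]

partials = [counter(chunk) for chunk in chunks]  # map: each chunk counted independently
total = reduce(merge_counts, partials)           # reduce: collate the partial results
print(total)
```

Because each chunk is counted without reference to the others, the map step is the part that can be spread across nodes; only the small partial results have to travel to the reducer.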

Counter() Counter() Counter()
Likes=100, No Comments=300
Likes=50, No Comments=350
Likes=150, No Comments=250
Likes=300, No Comments=900
• Divide the workload
• Submit and track the jobs
• If a job fails, restart it on another node
• …
Hadoop
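The "restart it on another node" idea can be illustrated with a toy scheduler. This is not Hadoop's API or its actual scheduling logic; run_job, flaky_run, and the node and task names are all hypothetical, and the tasks run one after another rather than in parallel.

```python
def run_job(tasks, nodes, run_task):
    """Run every task, moving to the next node whenever a node fails."""
    results = {}
    placement = {}
    for index, task in enumerate(tasks):
        for attempt in range(len(nodes)):
            # Start from a round-robin choice, then try the remaining nodes in order.
            node = nodes[(index + attempt) % len(nodes)]
            try:
                results[task] = run_task(node, task)
                placement[task] = node
                break
            except RuntimeError:
                continue  # this node failed; restart the task elsewhere
        else:
            raise RuntimeError(f"task {task!r} failed on every node")
    return results, placement

def flaky_run(node, task):
    """Pretend worker: node2 is down, every other node succeeds."""
    if node == "node2":
        raise RuntimeError("node2 is down")
    return f"{task} done"

results, placement = run_job(
    ["split-0", "split-1", "split-2"],
    ["node1", "node2", "node3"],
    flaky_run,
)
print(placement)
```

The caller only sees completed results; which node finally ran each task is bookkeeping the framework keeps to itself.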
Resource Management
Applications
YARN
Yet Another Resource Negotiator
Map Reduce v2.

Filesystem Filesystem Filesystem
1 2 3    2 3 1    3 1 2
Hadoop Distributed Filesystem (HDFS)
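The diagram shows each numbered block stored on more than one filesystem. A toy placement routine makes the consequence visible: with three copies of every block, losing nodes does not lose data. This is a simplified round-robin sketch with hypothetical names (place_blocks, readable_blocks, fs1 to fs3); real HDFS placement also takes racks and free space into account.

```python
def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {node: [] for node in nodes}
    for index, block in enumerate(blocks):
        for copy in range(replication):
            node = nodes[(index + copy) % len(nodes)]
            placement[node].append(block)
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the set of blocks still available after some nodes fail."""
    alive = set()
    for node, blocks in placement.items():
        if node not in failed_nodes:
            alive.update(blocks)
    return alive

placement = place_blocks([1, 2, 3], ["fs1", "fs2", "fs3"])
survivors = readable_blocks(placement, failed_nodes={"fs1", "fs2"})
print(placement)
print(survivors)
```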
Counter() Counter() Counter()
1 2 3    2 3 1    3 1 2
• Not shared storage
• Data is discrete
• Version control not required
• Concurrency not required
• Transactional integrity across nodes not required.
Comparison with RAC

Advantages of Hadoop
• Processors need not be super-fast
• Immensely scalable
• Storage is redundant by design
• No RAID level required.
Counter() Counter() Counter()
1 2 3    2 3 1    3 1 2
Scalable?
ACID Properties
Reliability at a cost
Large overhead in data processing

Website logs
Combine with structured data
SOAP Messages
Twitter, Facebook …
Data Access: through programs
NoSQL Databases

Key Value DB: Key -> Value
Document DB: Key -> Document
{
  empID: 1,
  empName: "Larry",
  salary: "infinity"
}
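The difference between the two kinds of store can be sketched with an in-memory dictionary standing in for the database. Everything here (put, get, find_by_field, and the dictionary itself) is a hypothetical stand-in, not the API of any particular NoSQL product: a key-value store only hands back the whole value for a key, while a document store can also look inside the document.

```python
import json

document_db = {}  # key -> JSON text, standing in for a real store

def put(key, document):
    """Store a document under a key, serialized as JSON text."""
    document_db[key] = json.dumps(document)

def get(key):
    """Key-value style access: fetch the whole document by its key."""
    return json.loads(document_db[key])

def find_by_field(field, value):
    """Document style access: search inside the stored documents."""
    matches = []
    for text in document_db.values():
        document = json.loads(text)
        if document.get(field) == value:
            matches.append(document)
    return matches

put(1, {"empID": 1, "empName": "Larry", "salary": "infinity"})

by_key = get(1)
by_name = find_by_field("empName", "Larry")
print(by_key["empName"], len(by_name))
```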
SQL-interface required
Hive
HiveQL

Creating a Hive Table
create table accounts (
  accno int,
  accname string,
  balance float
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/hive/db1.db/accounts';
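The table definition describes how to read plain comma-delimited text files that already sit in a directory; the column names and types are applied when the data is read. The sketch below imitates that idea in Python. The two sample rows are made up for illustration, and read_accounts is a hypothetical helper, not part of Hive.

```python
import io

# Column names and types, mirroring the accounts table definition.
SCHEMA = [("accno", int), ("accname", str), ("balance", float)]

def read_accounts(text_file, delimiter=","):
    """Apply the schema to each delimited line as it is read."""
    rows = []
    for line in text_file:
        fields = line.rstrip("\n").split(delimiter)
        rows.append({name: cast(value)
                     for (name, cast), value in zip(SCHEMA, fields)})
    return rows

# Hypothetical file contents; a real table would read files under the table's location.
data = io.StringIO("101,Alice,2500.50\n102,Bob,100.00\n")
accounts = read_accounts(data)
total_balance = sum(row["balance"] for row in accounts)
print(accounts[0], total_balance)
```

The file itself carries no type information, so the same bytes could be read through a different table definition without being rewritten.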
select count(*) as cnt
from store_sales ss
join household_demographics hd on (ss.ss_hdemo_sk = hd.hd_demo_sk)
join time_dim t on (ss.ss_sold_time_sk = t.t_time_sk)
join store s on (s.s_store_sk = ss.ss_store_sk)
where
  t.t_hour = 8
  and t.t_minute >= 30
  and hd.hd_dep_count = 2
order by cnt;
HiveQL

Map/Reduce
Divide the work and collate the results
Needs development in Java, Python, Ruby, etc.
Pig
A framework to work on the dataset in parallel
Pig Latin
Scripting language for Pig
SQL
select category, avg(pagerank)
from urls
where pagerank > 0.2
group by category
having count(*) > 1000000

Pig Latin
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
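Both versions describe the same pipeline: filter, group, filter the groups, then average. Written out step by step in Python, the procedural shape that Pig Latin exposes becomes easy to follow. The function name, parameter names, and sample rows are assumptions made for this sketch, and the group-size threshold is lowered so the tiny sample produces a result.

```python
from collections import defaultdict

def avg_pagerank_by_category(urls, min_pagerank=0.2, min_count=1000000):
    """Filter rows, group by category, keep the big groups, average each one."""
    groups = defaultdict(list)
    for url in urls:
        if url["pagerank"] > min_pagerank:                    # WHERE / first FILTER
            groups[url["category"]].append(url["pagerank"])   # GROUP BY
    return {
        category: sum(ranks) / len(ranks)                     # AVG
        for category, ranks in groups.items()
        if len(ranks) > min_count                             # HAVING / second FILTER
    }

# Hypothetical rows, made up for this example.
urls = [
    {"category": "news", "pagerank": 0.5},
    {"category": "news", "pagerank": 0.75},
    {"category": "news", "pagerank": 0.1},
    {"category": "sports", "pagerank": 0.9},
]
averages = avg_pagerank_by_category(urls, min_count=1)
print(averages)
```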

HBase: A database built on Hadoop
HiveQL: An SQL-like (but not the same) query language
Pig: Procedural logic without M/R code.
Normal programming languages, e.g. Python
YARN
Map/Reduce code in Java
Spark

Counter() Counter() Counter()
1 2 3    2 3 1    3 1 2
Hadoop processing in files
Memory is cheaper
Interactive processing needs faster access.
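Why memory matters for interactive work can be shown without any cluster. The sketch below is not Spark's API; FileBackedDataset and CachedDataset are hypothetical classes that only count how often the underlying file is read when several queries run against the same data.

```python
import os
import tempfile

class FileBackedDataset:
    """Re-reads the file for every query, like processing straight from disk."""
    def __init__(self, path):
        self.path = path
        self.reads = 0

    def rows(self):
        self.reads += 1
        with open(self.path) as handle:
            return [line.strip() for line in handle]

class CachedDataset(FileBackedDataset):
    """Reads the file once and serves later queries from memory."""
    def __init__(self, path):
        super().__init__(path)
        self._cache = None

    def rows(self):
        if self._cache is None:
            self._cache = super().rows()
        return self._cache

def count_status(dataset, status):
    return sum(1 for row in dataset.rows() if row == status)

# Write a small hypothetical data file to work on.
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as tmp:
    tmp.write("like\nnone\nlike\nnone\nnone\n")
    path = tmp.name

on_disk = FileBackedDataset(path)
in_memory = CachedDataset(path)
for dataset in (on_disk, in_memory):
    likes = count_status(dataset, "like")
    others = count_status(dataset, "none")

print(on_disk.reads, in_memory.reads)
os.remove(path)
```

Two queries cost two file reads in the first case and one in the second; the gap widens with every further query against the same data.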
Spark Core
Spark Shell    Spark SQL    MLlib    SparkR    PySpark
Can use Java, Python or Scala

Divide and conquer is the key
Non-shared division of data is important
Local access
Redundancy
Hadoop is a framework
You have to write the programs
Big data is batch-oriented
Hive is SQL-like
Pig Latin is a 4GL-like scripting language
Spark uses memory
Oh, I so want to Learn!
Cloudera – prebuilt VMs
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/cloudera_quickstart_vm.html
Hortonworks – prebuilt VMs
https://hortonworks.com/downloads/#sandbox

Thanks!
arup.blogspot.com @ArupNanda
