Gdynia TECH Group
Intro to Hadoop ecosystem
What is cool?
big data
distributed systems
libs (algorithms, collections, network, multithreading, serialization, ...)
patterns, methodologies, best practices
trends
What do we want to do?
technical presentations
hackathons
workshops
conferences/local events
trainings
Upcoming presentations...
Distributed caching with HazelCast
Storm - real time stream processing
TDD - myth or good practice.
Handling failures in distributed systems
Serialization for everybody
Test your code. Always.
SQL Server Reporting Services - make your users happy and your life easier
Upcoming presentations...
Reading (un)real-time feeds in Event Platform
Distributed computing and clustering done right
ActiveMQ usage in a SEM's Live Transcript process.
33 things we did wrong. EP lessons learned.
Who does it better? GitFlow implemented in EP and SEM.
Why Kafka is a standard?
Want to contribute? Contact us.
Q?
Introduction to Hadoop Ecosystem
What is NoSQL?
A NoSQL (often interpreted as "Not only SQL" [1][2]) database provides a
mechanism for storage and retrieval of data that is modeled by means other
than the tabular relations used in relational databases.
What is Big Data?
10TB
Hadoop is Big Data!?
What is Hadoop?
Google released the Google File System paper in October 2003.
Google released the MapReduce paper in December 2004.
In 2006, Doug Cutting went to work at Yahoo, which was
equally impressed by the Google File System and
MapReduce papers and wanted to build open source
technologies based on them.
The transformation into Hadoop being “behind every click”
(or every batch process, technically) at Yahoo was pretty
much complete by 2008.
By the time Yahoo spun out Hortonworks into a separate,
Hadoop-focused software company in 2011, Yahoo’s
Hadoop infrastructure consisted of 42,000 nodes and
hundreds of petabytes of storage.
What is Hadoop?
Hadoop
HDFS
Map Reduce
YARN
Other YARN applications
Storm
Spark
Tez
Samza
Impala
Hive
Hive is a data warehousing infrastructure based on
Hadoop. Hadoop provides massive scale-out and fault-tolerance
capabilities for data storage and processing.
Example
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;
Example
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN
friend_list f ON (u.id = f.uid)
WHERE pv.dt = '2008-03-03';
Example
SELECT pv_users.gender, count(DISTINCT pv_users.userid),
count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
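To connect the query above with the Map Reduce layer underneath, here is a minimal Python sketch of how such an aggregation could be evaluated as a map step, a shuffle and a reduce step. It runs in memory on a few made-up rows; the sample data and the names `map_phase`, `shuffle`, `reduce_phase` and `gender_stats` are hypothetical, and this is only an illustration of the idea, not how Hive itself is implemented.

```python
from collections import defaultdict

# Hypothetical sample of pv_users rows: (userid, gender)
PV_USERS = [
    (1, "F"),
    (1, "F"),
    (2, "M"),
    (3, "F"),
    (2, "M"),
]


def map_phase(rows):
    """Emit one (gender, userid) pair per input row."""
    for userid, gender in rows:
        yield gender, userid


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Per gender: (count(DISTINCT userid), count(*), sum(DISTINCT userid))."""
    result = {}
    for gender, userids in groups.items():
        distinct = set(userids)
        result[gender] = (len(distinct), len(userids), sum(distinct))
    return result


def gender_stats(rows):
    """Run the three steps in order on an in-memory list of rows."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    for gender, stats in sorted(gender_stats(PV_USERS).items()):
        print(gender, stats)
```

The grouping key of the query (gender) becomes the key emitted by the map step, so all rows for one gender end up in the same reduce call.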
Pig
Pig is a high-level scripting language that is used with
Apache Hadoop. Pig excels at describing data analysis
problems as data flows. Pig is complete in that you can do
all the required data manipulations in Apache Hadoop with
Pig.
Example
players = load 'baseball' as (name:chararray, team:chararray,
position:bag{t:(p:chararray)}, bat:map[]);
noempty = foreach players generate name,
((position is null or IsEmpty(position)) ? {('unknown')} :
position)as position;
pos = foreach noempty generate name, flatten(position) as position;
bypos = group pos by position;
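For readers new to Pig, the data flow in the script above can be followed step by step in plain Python. This is a rough sketch under assumptions: the sample records are invented, the function names are hypothetical, and Python lists and dicts stand in for Pig bags and maps. It mirrors the relations `noempty`, `pos` and `bypos`, nothing more.

```python
from collections import defaultdict

# Hypothetical records shaped like the 'baseball' input:
# (name, team, position bag, bat map)
PLAYERS = [
    ("Ann", "Reds", ["Pitcher", "Catcher"], {}),
    ("Bob", "Cubs", [], {}),
    ("Cy", "Mets", None, {}),
    ("Dee", "Reds", ["Pitcher"], {}),
]


def fill_empty(players):
    """noempty: keep name; replace a null or empty position bag with ['unknown']."""
    for name, _team, position, _bat in players:
        yield name, (position if position else ["unknown"])


def flatten(noempty):
    """pos: emit one (name, position) row per element of the position bag."""
    for name, positions in noempty:
        for position in positions:
            yield name, position


def group_by_position(pos):
    """bypos: collect the (name, position) rows that share a position."""
    groups = defaultdict(list)
    for name, position in pos:
        groups[position].append((name, position))
    return dict(groups)


def players_by_position(players):
    """Chain the three steps, like the three Pig statements after the load."""
    return group_by_position(flatten(fill_empty(players)))


if __name__ == "__main__":
    for position, rows in players_by_position(PLAYERS).items():
        print(position, rows)
```

Each Pig statement maps to one small function, which is the point of the "data flow" style: every relation is derived from the previous one.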
Other frameworks...
Apache Spark
Impala
Apache Tez
Apache Flink
Storm, Samza, Spark Streaming, Flink Streaming (real-time analytics)
HBase
When Would I Use Apache HBase?
Use Apache HBase™ when you need random, realtime read/write access to your
Big Data. This project's goal is the hosting of very large tables -- billions of rows X
millions of columns -- atop clusters of commodity hardware.
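The "billions of rows X millions of columns" idea is easier to picture as a sparse map than as a SQL table: a row only stores the columns it actually has, and every read or write goes straight to one row key. Below is a toy Python sketch of that access pattern. The class `SparseTable`, its methods and the sample keys are hypothetical and kept in memory; this is not the HBase client API, and column names such as `info:name` only imitate the family:qualifier naming convention.

```python
class SparseTable:
    """Toy in-memory stand-in for a wide, sparse table addressed by row key."""

    def __init__(self):
        # row key -> {column name -> value}; missing columns take no space
        self._rows = {}

    def put(self, row_key, column, value):
        """Random write: set one cell, creating the row if needed."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column, default=None):
        """Random read: fetch one cell by row key and column name."""
        return self._rows.get(row_key, {}).get(column, default)

    def stored_cells(self):
        """Number of cells actually stored across all rows."""
        return sum(len(columns) for columns in self._rows.values())


if __name__ == "__main__":
    table = SparseTable()
    table.put("user#42", "info:name", "Ala")
    table.put("user#42", "stats:clicks", 17)
    table.put("user#7", "info:name", "Olek")
    print(table.get("user#42", "stats:clicks"))
    print(table.get("user#7", "stats:clicks", default=0))
    print(table.stored_cells())
```

Two rows do not need to share the same columns, which is why a very wide table can stay cheap when most cells are empty.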
Q?

Editor's Notes

  • #2: To start, I'll tire you out a little… we'll answer a few questions together… I know, if you had known there would be questions, you wouldn't have come…, which is why I'm only telling you now
  • #3: Who does cool things?
  • #5: Show ourselves outside the company. Do you think there is nothing interesting to show? Well, yes, when I hear that tests make no sense below 10k of code
  • #6: If not, there are two possibilities: either you are wrong, or something is generally off
  • #8: This can stem from various things: lack of knowledge sharing - everyone sits in their own sandbox digging a hole with a toy shovel, while the room next door has an excavator
  • #11: 1. You are our future speakers… :) 2. There is a lot to gain: - respect - presentation skills - preparing a presentation can be very educational - building your own brand - a place for people who would like to present externally but have nowhere to try. We provide support: - help with preparing the presentation - choosing a topic - you want to show ‘something’ but have no topic, or don't know what might interest other people? We will find you a topic
  • #37: HDFS