BIG DATA
Lesson 2
Study : Jean-Antoine Moreau (Engineer - Lecturer) 
© Jean-Antoine Moreau 
copying and reproduction prohibited 
Managing my copyright ADAGP.
"Each problem that I solved became a rule, which
served afterwards to solve other problems."
René Descartes
Contact: http://www.jean-antoine-moreau.fr.nf
METADATA 
• Policies are needed to drive the data ecosystem;
• Build the policies around issues;
• Licensing starts from the source, the origin of the data;
• Linked data:
– On the web;
– Machine-readable;
– Non-proprietary;
– Non-proprietary format;
– RDF standard;
– Linked RDF;
• Linked Data using : 
– RDF format (Resource Description Framework); 
– SPARQL (SPARQL Protocol and RDF Query 
Language); 
– Common data model for access;
– Connect and describe the resources: RDF;
– Access them: SPARQL;
– Define the common vocabularies : 
• RDFS (Resource Description Framework Schema); 
• OWL (Web Ontology Language); 
• SKOS (Simple Knowledge Organization System); 
• The W3C recommendations:
– RDF (Resource Description Framework) / XML (eXtensible Markup Language);
– Simple triple: subject-predicate-object;
– SPARQL Query Language for RDF:
• the equivalent of an SQL query in a database.
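To make the triple model concrete, here is a minimal Python sketch. It is not an RDF store or a SPARQL engine: triples are plain tuples, and the hypothetical `match` helper plays the role of a single-pattern query. The example.org URIs and the sample data are invented for illustration.

```python
# Minimal sketch: RDF-like triples as (subject, predicate, object) tuples.
# The example.org URIs and the data are made up for illustration.
TRIPLES = [
    ("http://example.org/book/1", "http://example.org/title", "Discourse on the Method"),
    ("http://example.org/book/1", "http://example.org/author", "http://example.org/person/descartes"),
    ("http://example.org/book/2", "http://example.org/title", "Meditations"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts like a query variable."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Roughly: SELECT ?title WHERE { ?book <http://example.org/title> ?title }
titles = [o for (_, _, o) in match(TRIPLES, predicate="http://example.org/title")]
print(titles)
```

Leaving a position as `None` is what a variable such as `?title` does in a real query: it matches anything and the matched value is returned.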
• SPARQL-like syntax
SELECT ?title
FROM …
WHERE { http:// … }
• RDF syntax
<http:// …> Subject
<http:// …> Predicate
"xxx@domain name.com" Object (literal)
SPARQL 
• A query language that supports:
– Addition (insert);
– Modification (update);
– Querying;
– Removal (delete);
Diagram (components): application; SPARQL query and return; SPARQL engine; RDF data; XHTML and XML documents; RDF bridge (SQL - SPARQL).
• Data from Wikipedia
• DBpedia: RDF version of Wikipedia
• Dataset: http://dbpedia.org
• Pluggable to any RDF store; 
– pluggable filter for generic SPARQL endpoints; 
• Virtualization of data; 
– RDF/WML 
• Database use from mobile devices;
• KGRAM 
– Distributed Query Processing 
– Corese HTTP server 
– SPARQL Inference Rules 
– SPARQL Template Transformation Language 
– SPARQL approximate search 
– SPARQL Property Path extensions 
– SPIN Syntax 
– RDF Graph as Query Graph Pattern 
– SQL extension function
Dataset 
– University 
– Music 
– Science 
• Syntax
PREFIX …
(for example: PREFIX foaf: <URL>)
SELECT …
WHERE {
}
W3C (The World Wide Web Consortium)
http://www.w3.org
http://www.w3.org/egov/wiki/Main_Page
DATA SYSTEM 
NO SQL (Not Only SQL) 
• Google;
• Amazon;
• LinkedIn;
• Facebook;
• Project Voldemort;
• Cassandra Project;
• HBase;
• MongoDB;
• CouchDB;
These companies and projects design or operate NoSQL-type databases.
• Data System
• « Batch layer » :
– Hadoop (Java framework);
• « Speed layer » :
– MongoDB;
– HBase;
• Incremental algorithms; 
• Subset of big data; 
• Hadoop 
– Architecture 
• Hadoop Distributed File System (HDFS); 
• MapReduce; 
• HBase;
• ZooKeeper; 
• Hive; 
• Pig; 
• Cluster; 
• MapReduce
• Map step: ingestion and processing of the data as key/value pairs;
• Reduce step: merging of the records by key to form the final result.
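The two steps can be illustrated with a word count in plain Python. This is only a sketch of the idea on a single machine, not Hadoop code: the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) and the input lines are assumptions made for the example.

```python
from collections import defaultdict


def map_step(line):
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: merge all the values of one key into the final result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_step(line)]
    return dict(reduce_step(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big Data lesson", "data"])
print(counts)
```

In a real cluster the map calls run in parallel on different blocks of the input, and the grouping by key is what lets each reduce call work independently.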
SQOOP
Diagram: within the IT architecture, Sqoop sits between the relational database / data warehousing system and the Hadoop system (HBase, HDFS).
Sqoop: interface application for transferring
data between relational databases and Hadoop.
Hadoop Pig 
Pig Latin Tools for Big Data 
high-level platform for creating 
MapReduce programs used with Hadoop 
Apache Pig 
is a platform for analyzing large data sets 
that consists of a high-level language for 
expressing data analysis programs, coupled 
with infrastructure for evaluating these 
programs. 
• NoSQL
– Data models based on:
• Columns;
• Key/value pairs;
• Graphs;
• NewSQL
– ScaleDB (real time):
• real-time analytics across streaming data;
– NimbusDB;
– VoltDB;
– Clustrix;
– Oracle big data;
– Microsoft big data;
In the Enterprise
The Process 
The PROCESS 
Diagram: Big Data Appliance, Analyze, Visualization.
Acquire
Diagram: data sources (social networks, the Internet, datasets) feed the process steps Collect, Organize, Visualize, Analyze, Decide.
Approach to building a big data platform
• Using:
– Massively Parallel Processing (MPP) databases;
– the MapReduce framework;
– the Apache open-source project called Hadoop;
• Open source solutions:
– Jaspersoft;
– Pentaho Reporting;
– Talend (Extract, Transform, Load);
– Apache Hadoop and Cassandra for the
implementation of the MapReduce technology;
Analysis and reporting tools.
• Teradata: Teradata platform (DW, data warehouse) / Teradata Value Analyzer (data analytics) / Aster Data Analytic Platform (data analytics);
• Oracle: Oracle Data Integrator (data analytics) / Oracle Exadata Database Machine (acquisition + processing) / Oracle Data Warehousing (DW) / Oracle NoSQL Database (acquisition) / Oracle Big Data Appliance (DW) / Oracle Loader for Hadoop (data transformation) / Oracle R Enterprise (data analytics);
• EMC: Greenplum solutions (DW + data analytics + appliance);
• IBM: InfoSphere platform (DW + data analytics) / Netezza solutions (DW appliances);
• SAP: Sybase IQ VLDB Option (DW + data analytics);
• The IT infrastructure impact:
NoSQL systems:
• capture all data;
• without categorizing it;
SQL systems:
• impose metadata on the captured data;
• and validate data types;
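The contrast can be sketched in a few lines of Python. Nothing here is a real database API: the `SCHEMA` dictionary, the two store functions and the sample records are hypothetical and only illustrate "capture everything" versus "impose metadata and validate types".

```python
import json

# Hypothetical schema for the SQL-style path: field name -> expected type.
SCHEMA = {"user": str, "amount": float}


def store_schemaless(store, raw_record):
    """NoSQL-style capture: keep every record as it arrives, no categorizing."""
    store.append(json.loads(raw_record))


def store_with_schema(table, raw_record):
    """SQL-style capture: impose the schema and validate types before storing."""
    record = json.loads(raw_record)
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise TypeError(f"{field} must be {expected.__name__}")
    table.append(record)


documents, table = [], []
good = '{"user": "ann", "amount": 12.5}'
odd = '{"user": "bob", "comment": "no amount here"}'

for raw in (good, odd):
    store_schemaless(documents, raw)   # both records are captured

store_with_schema(table, good)         # accepted: matches the schema
try:
    store_with_schema(table, odd)      # rejected: fields do not match
    rejected = False
except ValueError:
    rejected = True

print(len(documents), len(table), rejected)
```

The schemaless store never loses a record but pushes interpretation to read time; the schema-checked table refuses anything it cannot categorize.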
PROCESS: Acquire (Hadoop), Organize (Data Warehouse), Analyze (Analytic), Decide (Application)
Diagram (components): Cloud Manager; NoSQL database management; database connector; enterprise-class applications; cloud CHD cluster; virtual machines; Java; Linux / UNIX system.
• Database analytic tools : 
– Statistical; 
– Prediction; 
– Data mining; 
– Text mining; 
– Graphs; 
– Plotting data on maps;
– Procedural logic; 
The new SQL Server integrates a Hadoop component:
an open-source framework specialized in
managing unstructured data.
Exchange interface between Hadoop and Excel.
Diagram (components):
– Familiar user tools: Word, Excel, PowerPoint, Visual Studio;
– Server Analysis Services;
– Enterprise Data Warehouse;
– Connectors;
– Apache Hadoop;
– Web / Internet;
– Enterprise Resource Planning, Customer Relationship Management, Line of Business;
– Unstructured data and structured data.
• NewSQL approach:
– Model; 
– Design; 
– Algorithm; 
– Motion; 
– ACID (will be explained in the next slide); 
– OLTP Real Time (On-line Transaction Processing); 
ACID 
• Atomicity:
– means that either all the tasks within a transaction are performed or none of them are.
• Consistency: 
– means the transaction meets all rules defined by the system at all times. The 
transaction does not violate those rules and the database must remain in a consistent 
state at the beginning and end of a transaction. There are no half-completed 
transactions. 
• Isolation: 
– no transaction has access to any other transaction that is in an intermediate or 
unfinished state. Each transaction is independent. 
• Durability: 
– means the transaction is complete and it will persist. The completed transaction will 
survive system failure, power loss and other types of system breakdowns.
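Atomicity and consistency can be observed with Python's built-in `sqlite3` module. The sketch below is only an illustration: the `account` table, the names and the amounts are invented, and a CHECK constraint stands in for "the rules defined by the system".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would make alice's balance negative
print(ok, failed, balances(conn))
```

In the second call the credit to bob succeeds first, then the debit violates the constraint; the rollback undoes the credit as well, so no half-completed transfer is left behind.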
• OLTP (On-line Transaction Processing) 
– Characterized by a large number of short on-line 
transactions (INSERT, UPDATE, DELETE). 
– The main emphasis for OLTP systems is put on very 
fast query processing, maintaining data integrity in 
multi-access environments and an effectiveness 
measured by number of transactions per second. 
– An OLTP database holds detailed and current data,
and the schema used to store transactional data is the
entity model (usually 3NF).
• OLAP (On-line Analytical Processing):
– Characterized by a relatively low volume of transactions.
– Queries are often very complex and involve aggregations.
For OLAP systems, response time is the effectiveness
measure.
• An OLAP database holds aggregated:
– historical data,
– stored in multi-dimensional schemas
• usually a star schema or a star-join schema.
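The difference between the two workloads can be shown on a toy table with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the `sale` table and its rows are invented, and a single flat table stands in for both the transactional schema and the analytical one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)

# OLTP style: many short transactions, each touching a single row.
orders = [
    ("north", 2013, 10.0),
    ("south", 2013, 5.0),
    ("north", 2014, 7.5),
    ("north", 2014, 2.5),
]
for region, year, amount in orders:
    with conn:  # one small transaction per order
        conn.execute(
            "INSERT INTO sale (region, year, amount) VALUES (?, ?, ?)",
            (region, year, amount),
        )

# OLAP style: one read-only query that aggregates the whole history.
report = conn.execute(
    "SELECT region, year, SUM(amount) FROM sale "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()
print(report)
```

The inserts are judged by how many of them complete per second; the report is judged by how long the single aggregate query takes to answer.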
Diagram: OLTP serves operations (business processes; master data and transactions). OLAP serves information needs (business strategy, decision making, data mining, analytics), built on the business data warehouse.
History 
• Scientific method and evolution:
– 330 BC: logical method, Aristotle.
– 1250: birth of testable methods.
– 1700: theoretical method, Newton.
– 1950: simulation methods.
– Today: link-analysis methods.
Some environments
– UNIX
– Linux
– Java programming language
– Tools
• Hadoop 
– The Apache Hadoop project develops open-source software, for reliable, 
scalable, distributed computing. 
• Sqoop 
– tool designed for efficiently transferring bulk data between Apache Hadoop 
and structured datastores such as relational databases. 
• Pig 
– Apache Pig is a platform for analyzing large data sets that consists of a high-level 
language for expressing data analysis programs, coupled with 
infrastructure for evaluating these programs
• Issues:
– Amount of data to be processed;
– Processing time;
– Complex and cumbersome queries;
– Storage costs;
– Securing the processing;
– Security and confidentiality of the processing results;
• Work process:
1. Installing the environment;
2. Migrating Data; 
3. Data Processing; 
Process of installing the computer equipment
• Network addressing;
• Definition of the machines;
• Tests;
• SSH configuration;
• Generating an RSA key pair;
• Tests;
• HDFS configuration;
• hadoop-site.xml file (Hadoop daemons and Map/Reduce jobs);
• hdfs-site.xml (configure secure HDFS);
• Tests;
• MapReduce configuration;
• Configuration of core-site.xml (on every host in your cluster);
• Tests; 
Process of installing computer equipment
Diagram: a Hadoop cluster with a master server host and slave server hosts H (n+1), and a MapReduce layer using:
• JobTracker
• Web interface
Process : 
– Format HDFS;
– Start HDFS;
– Start MapReduce;
– Start the cluster;
– .... 
– Stop the cluster; 
Data Migration 
• Hadoop implementation; 
• Application Active cluster; 
• ... 
• Application: Hadoop Hive > MapReduce;
• Objectives to be defined;
• Mirrored, to be used for:
– Data management of complex types;
– Reducing the response time of complex SQL queries;
– Natural language queries for dynamic filtering;
Data Migration 
• Data processing; 
• SQL queries are translated into : 
– Hadoop Pig Latin Applications 
Data Migration 
Processing to verify the container
and the contents of the migrated data.
Goals of this implementation in the business 
• Improve the customer experience; 
• Optimize processes;
• Optimize operational performance;
• Improve and enhance the business model of
the enterprise;
• There is a change in the processes the company uses to
collect, analyze, store, use, operate, present and sell its data.
In the future
DATA: image (visual), video, text, sound, plus smell and flavor, by the use of the chemical composition.
Storage and dispatching
DATA (image / visual, video, text, sound), using cloud computing.
DATA: structured data, semi-structured data, unstructured data, geotagged data.
Data Science 
Exploratory analysis
Discrete-event simulation
Predictive analytics
Data Mining 
Data Mining Model development 
Adjustment, Experiment 
Multilevel Modeling of Hierarchical 
and 
Longitudinal Data Using 
Transforming Data to Decision 
Transforming Data to Strategy 
Define your concept 
Data-preparation 
Clustering 
Virtualization 
Association 
Aggregation 
Evaluation 
Rules-regression 
Data mining 
Classification 
Decision Tree 
Presentation 
Visualization 
Examples of datasets
• http://labs.europeana.eu
• http://fisher.osu.edu/fin/fdf/osudata.htm
• http://www.ncbi.nlm.nih.gov
• https://data.nasdaq.com/
• http://smd.princeton.edu/
• http://archive.ics.uci.edu
• https://www.yelp.com/academic_dataset
• http://aws.amazon.com/fr/publicdatasets/
Categories of tools and software
• Clustering; 
• Segmentation; 
• Social Network Analysis; 
• Link Analysis; 
• Visualization; 
• Statistical Analysis; 
• Text Analysis; 
• Text Mining; 
• Information Retrieval; 
• Web Analytics and Social Media Analytics; 
• Web Usage Mining; 
Fast Data 
in 
Real Time Processing 
Data visualization 
• Data visualization shall provide information that is:
– Interpretable; 
– Relevant; 
– Innovative;
Project 
Tools: GDELT, an open source tool
Global Database of Events, Language, and Tone
Developed by Georgetown University 
GDELT Project 
• http://analysis.gdeltproject.org/
• http://analysis.gdeltproject.org/module-event-exporter.html
• Volume of data; 
• Variety of data; 
• Veracity of the data; 
Data processing in real time. 
• Using Data-at-Rest; 
• Security of Data-at-Rest; 
Principles for Architecting
Real-time Big Data Systems.
Layer 
• Batch Layer; 
• Serving Layer; 
• Speed Layer; 
Diagram: new data feeds both the batch layer (master data set) and the speed layer. The batch layer produces batch views, exposed through the serving layer; the speed layer produces real-time views. Queries read both the batch views and the real-time views.
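Diagram: the new data stream feeds two paths. Batch path: Hadoop (data in HDFS, MapReduce) recomputes the batch views, stored in HDFS. Stream path: stream processing incrementally updates the real-time views, stored in Apache HBase. The serving layer answers queries from these views. The diagram also carries the label "Quality Function Deployment".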
• HDFS 
– Hadoop Distributed File System. 
• HBase
– Apache HBase is the Hadoop database, a
distributed, scalable, big data store.
• The batch layer has two functions:
– managing the master dataset;
– pre-computing the batch views.
• The serving layer indexes the batch views
so that they can be queried with low latency.
• The speed layer compensates for the high
latency of updates to the serving layer and
deals with recent data only.
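How the three layers cooperate can be sketched in memory with plain Python. This is an illustration only, not Hadoop, HBase or Storm code: the names `new_data`, `run_batch` and `query`, and the page-view events, are assumptions made for the example.

```python
from collections import Counter

master_dataset = []         # batch layer: append-only master dataset
batch_view = Counter()      # serving layer: indexed result of the last batch run
realtime_view = Counter()   # speed layer: only data newer than the last batch run


def new_data(event):
    """New data goes to both the master dataset and the speed layer."""
    master_dataset.append(event)
    realtime_view[event] += 1


def run_batch():
    """Recompute the batch view from the whole master dataset, then reset the speed layer."""
    batch_view.clear()
    batch_view.update(master_dataset)
    realtime_view.clear()


def query(key):
    """Answer a query by merging the batch view and the real-time view."""
    return batch_view[key] + realtime_view[key]


for page in ["home", "about", "home"]:
    new_data(page)
run_batch()          # batch view now holds home=2, about=1
new_data("home")     # arrives after the batch run, so only the speed layer has it
print(query("home"), query("about"))
```

Between two batch runs the batch view is stale by design; the real-time view covers exactly that gap, which is why it only needs to hold recent data.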
Diagram: data processing time line, split between the batch layer (batch) and the real-time layer (real time).
The batch views are stored in HDFS and the real-time views are stored
in HBase.
Main architectural principles of the Big Data architecture:
– Immutability (immutable records);
– Recomputation.
• Master Dataset; 
• Managing the version of the Dataset; 
• Fault tolerance;
• Incremental algorithm;
• Recompute; 
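The relationship between recomputation and incremental algorithms can be sketched as follows. The helper names `recompute` and `increment` and the sample records are hypothetical; the point is only that an incremental update must give the same view as a full recompute over the immutable master dataset.

```python
def recompute(master_dataset):
    """Recompute: derive the view from scratch from the immutable master dataset."""
    totals = {}
    for user, amount in master_dataset:
        totals[user] = totals.get(user, 0) + amount
    return totals


def increment(view, record):
    """Incremental algorithm: build an updated view from an existing one and one new record."""
    user, amount = record
    updated = dict(view)
    updated[user] = updated.get(user, 0) + amount
    return updated


# Records are immutable tuples; a new version of the dataset only adds records.
master_v1 = (("ann", 10), ("bob", 5))
master_v2 = master_v1 + (("ann", 3),)   # new version, the old one is left untouched

view_v1 = recompute(master_v1)
view_v2_incremental = increment(view_v1, ("ann", 3))
view_v2_recomputed = recompute(master_v2)
print(view_v2_incremental == view_v2_recomputed)
```

Because the master dataset is never modified in place, a view damaged by a bug or a failure can always be rebuilt by recomputing, which is where the fault tolerance comes from.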
Modifying the Streaming Data 
Analytic Platform 
• Writing data with the storm-hdfs connector;
• Writing data with the storm-hbase connector.
• The storm-hdfs connector provides the following key features:
– Supports HDFS 2.x;
– Supports both text and sequence files;
– Configurable directory and file names;
– Customizable synchronization, rotation policies, and rotation actions;
– Tuple fails if HDFS update fails;
– Supports the Trident API;
– Supports writing to a Kerberized Hadoop cluster.
The primary classes of the storm-hdfs connector are HdfsBolt and
SequenceFileBolt, both located in the org.apache.storm.hdfs.bolt package. Use the
HdfsBolt class to write text data to HDFS and the SequenceFileBolt class to write
binary data.
• The storm-hbase connector provides the following key features:
– Supports Apache HBase and above;
– Supports incrementing counter columns;
– Tuples are failed if an update to an HBase table fails;
– Ability to group puts in a single batch;
– Supports writing to Kerberized (Kerberos) HBase clusters.
The storm-hbase connector enables Storm developers to collect several PUTs in a
single operation and write to multiple HBase column families and counter columns.
A PUT is an HBase operation that inserts data into a single HBase cell. Use the
HBase client's write buffer to batch automatically: hbase.client.write.buffer. The
primary interface in the storm-hbase connector is the
org.apache.storm.hbase.bolt.mapper.HBaseMapper interface.
Global Vision 
Architecture 
Synthesis 
Collection, processing, analysis, presentation (restitution), storage
• Collecting data:
– Retrieve the data;
– Transmit the data to:
• the processing units;
• the analysis units.
• Data transformation:
– Extract useful information from poorly structured or unstructured data;
– Use the extracted information to make the data consistent;
– Enable a metadata catalog to be built.
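The three transformation steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the log-line format, the function names (`extract`, `make_consistent`, `build_catalog`), and the catalog fields are all hypothetical and not taken from the course.

```python
import re
from collections import Counter

# Assumed input format: "<date> <level> <free-text message>"
LOG_LINE = re.compile(r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<level>\w+)\s+(?P<message>.*)")


def extract(raw_lines):
    """Pull structured fields out of free-text lines; skip lines that do not match."""
    records = []
    for line in raw_lines:
        match = LOG_LINE.match(line.strip())
        if match:
            records.append(match.groupdict())
    return records


def make_consistent(records):
    """Normalise values so that equivalent records look the same."""
    return [
        {"date": r["date"], "level": r["level"].upper(), "message": r["message"].strip()}
        for r in records
    ]


def build_catalog(records, source):
    """Describe the data set: a minimal metadata catalog entry."""
    return {
        "source": source,
        "record_count": len(records),
        "fields": sorted(records[0]) if records else [],
        "levels": dict(Counter(r["level"] for r in records)),
    }


def demo():
    raw = [
        "2015-03-01 error disk full ",
        "free text without structure",
        "2015-03-02  Warning  high latency",
        "2015-03-02 ERROR node down",
    ]
    clean = make_consistent(extract(raw))
    return build_catalog(clean, source="app.log")
```

The catalog entry describes the data rather than containing it, which is what lets later stages discover what fields and values a source provides.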
• Analysis:
– Creation of new information, in real time over the data set, by:
• identification;
• correlation;
• aggregation;
• projection.
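The four analysis operations can be illustrated on a tiny in-memory data set. This is a sketch under assumptions: the sample events and function names are invented, "projection" is read here in the relational sense (keeping a subset of fields) although the slide may equally mean forecasting, and a real-time system would apply the same operations to a stream rather than to a list.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical sample events.
EVENTS = [
    {"node": "a", "cpu": 10.0, "latency": 100.0},
    {"node": "a", "cpu": 20.0, "latency": 200.0},
    {"node": "b", "cpu": 30.0, "latency": 300.0},
    {"node": "b", "cpu": 40.0, "latency": 400.0},
]


def identify(events, threshold):
    """Identification: pick out the events that match a condition."""
    return [e for e in events if e["latency"] > threshold]


def correlate(events, x, y):
    """Correlation: Pearson coefficient between two numeric fields."""
    xs = [e[x] for e in events]
    ys = [e[y] for e in events]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)


def aggregate(events, key, field):
    """Aggregation: average of a field per group."""
    groups = defaultdict(list)
    for e in events:
        groups[e[key]].append(e[field])
    return {k: sum(v) / len(v) for k, v in groups.items()}


def project(events, fields):
    """Projection: keep only the listed fields of each event."""
    return [{f: e[f] for f in fields} for e in events]
```

For example, `aggregate(EVENTS, "node", "latency")` produces one average per node, which is new information that no single input event contains.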
• Data restitution:
– Allow visualization of the data;
– Enable data mining.
• Data storage:
– The ability to store very large amounts of structured and unstructured data:
• a data warehouse that stores the data produced by processing;
• an analysis cache that speeds up processing for restitution.
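The role of an analysis cache in front of the data warehouse can be sketched with Python's standard `functools.lru_cache`. The `WAREHOUSE` dictionary, the `total_sales` function, and the sample figures are hypothetical; a real deployment would query an actual warehouse and use a shared cache.

```python
from functools import lru_cache

# Hypothetical "warehouse": processed results keyed by (region, year).
WAREHOUSE = {
    ("north", 2014): [120, 80, 100],
    ("south", 2014): [60, 40],
}

warehouse_reads = 0  # counts how often the warehouse is actually scanned


@lru_cache(maxsize=128)
def total_sales(region, year):
    """Compute an aggregate from the warehouse; repeated calls are served from the cache."""
    global warehouse_reads
    warehouse_reads += 1
    return sum(WAREHOUSE.get((region, year), []))


def demo():
    first = total_sales("north", 2014)
    second = total_sales("north", 2014)   # cache hit: no warehouse scan
    return first, second, warehouse_reads
```

The second call returns the same result without touching the warehouse, which is how a cache accelerates repeated restitution queries. The trade-off is staleness: cached results must be invalidated when the underlying data changes.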
Criteria for building a technical base 
• elasticity; 
• adaptability; 
• versatility; 
• Elasticity, through technology:
– Storage technology:
• SAN (Storage Area Network);
• NAS (Network Attached Storage);
• DAS (Direct-Attached Storage);
– Servers:
• parallelism;
• cluster;
• …
Adaptability, Versatility
Using:
• Tools for integrating multiple data sources;
• Structured data:
– database;
– XML;
• Semi-structured data: e-mail;
• Unstructured data:
– video;
– image;
– website;
– social networks;
– open data.
• Reactivity: 
– Optimize the distribution of data processing 
between computing nodes. 
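One simple way to distribute processing between computing nodes is greedy load balancing: sort the tasks by cost and always hand the next one to the least-loaded node. The sketch below shows that heuristic only as an illustration; the task costs and node names are invented, and the course does not prescribe a particular scheduling algorithm.

```python
import heapq


def distribute(task_costs, node_names):
    """Greedy balancing: give each task (largest first) to the least-loaded node."""
    heap = [(0, name) for name in node_names]   # (current load, node)
    heapq.heapify(heap)
    assignment = {name: [] for name in node_names}
    # Largest tasks first; ties broken by task name for a deterministic result.
    for task, cost in sorted(task_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, name = heapq.heappop(heap)
        assignment[name].append(task)
        heapq.heappush(heap, (load + cost, name))
    loads = {name: sum(task_costs[t] for t in tasks) for name, tasks in assignment.items()}
    return assignment, loads


def demo():
    tasks = {"t1": 8, "t2": 7, "t3": 6, "t4": 5, "t5": 4}
    return distribute(tasks, ["node-1", "node-2"])
```

The heuristic does not guarantee an optimal split, but it keeps node loads close to each other, so the slowest node finishes sooner than it would under an arbitrary assignment.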
Big Data Lesson 2 Jean-Antoine Moreau

  • 1. BBIIGG DDAATTAA LLeessssoonn 22 Study : Jean-Antoine Moreau (Engineer - Lecturer) © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP.
  • 2. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. "Each problem that I solved became a rule, which served afterwards to solve other problems. » Rene Descartes Contact http://www.jean-antoine-moreau.fr.nf JAM 2
  • 3. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa METADATA • The Policies are need to drive the data ecosystem; • Build the policies based around issues; • Licencing from the source, the origins; Contact http://www.jean-antoine-moreau.fr.nf JAM 3
  • 4. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • Linked – On the web; – On Machines; – No Proprietary; – No Proprietary format; – RDF standard; – Linked RDF; Contact http://www.jean-antoine-moreau.fr.nf JAM 4
  • 5. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • Linked Data using : – RDF format (Resource Description Framework); – SPARQL (SPARQL Protocol and RDF Query Language); – Common data model to access; – Connect, describe the ressources : RDF; – Access to that : SPARQL; – Define the common vocabularies : • RDFS (Resource Description Framework Schema); • OWL (Web Ontology Language); • SKOS (Simple Knowledge Organization System); Contact http://www.jean-antoine-moreau.fr.nf JAM 5
  • 6. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • The W3C recommendation : – RDF (Resource Description Framework ) / XML (eXtensible Markup Language); – Simple triple : Subject-predicate-object; – SPARQL Query Language for RDF;  equivalent to the SQL query in a database. Contact http://www.jean-antoine-moreau.fr.nf JAM 6
  • 7. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • SPARQL – Like syntax Select ? Title Where { http:// …. } From … Contact http://www.jean-antoine-moreau.fr.nf JAM 7
  • 8. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • RDF syntaxe < http:// ….> Suject <http:// … > Predicate « xxx@domain name.com » Contact http://www.jean-antoine-moreau.fr.nf JAM 8
  • 9. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa SPARQL • An interogation language • Addition; • Modification; • Request; • Removing; Contact http://www.jean-antoine-moreau.fr.nf JAM 9
  • 10. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Application SPARQL Query return SPARQL engine RDF DATA XHTML XML Document RDF Bridge SQL - SPARQL Contact http://www.jean-antoine-moreau.fr.nf JAM 10
  • 11. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • Data from Wikipedia • DBPEDIA RDF version of forme wikipedia • Dataset:http://dbpedia.org Contact http://www.jean-antoine-moreau.fr.nf JAM 11
  • 12. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • Pluggable to any RDF store; – pluggable filter for generic SPARQL endpoints; • Virtualization of data; – RDF/WML • The mobile using database; Contact http://www.jean-antoine-moreau.fr.nf JAM 12
  • 13. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Contact http://www.jean-antoine-moreau.fr.nf JAM 13 • KGRAM – Distributed Query Processing – Corese HTTP server – SPARQL Inference Rules – SPARQL Template Transformation Language – SPARQL approximate search – SPARQL Property Path extensions – SPIN Syntax – RDF Graph as Query Graph Pattern – SQL extension function
  • 14. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Dataset – University – Music – Science Contact http://www.jean-antoine-moreau.fr.nf JAM 14
  • 15. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Contact http://www.jean-antoine-moreau.fr.nf JAM 15 • Syntax PREFIX … Foaf <URL> SELECT … Where { }
  • 16. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa WW33CC (The WWorld WWide WWeb CConsortium) http://www.w3.org http://www.w3.org/egov/wiki/Main_Page Contact http://www.jean-antoine-moreau.fr.nf JAM 16
  • 17. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa DATA SYSTEM NO SQL (Not Only SQL) Contact http://www.jean-antoine-moreau.fr.nf JAM 17
  • 18. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Contact http://www.jean-antoine-moreau.fr.nf JAM 18 • Google; • Amazon; • LinkeIn; • Facebook; • Project Voldemort; • Cassandra Project; • Hbase; • MongoDB; • CouchDB; design and operate databases NoSQL-type
  • 19. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • Data System • « Batch layer » : – HADOOP (JAVA framework); • « Speed Layer »: – Mongo BD; – H-Base; • Incremental algorithms; • Subset of big data; Contact http://www.jean-antoine-moreau.fr.nf JAM 19
  • 20. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • Hadoop – Architecture • Hadoop Distributed File System (HDFS); • MapReduce; • Hbase; • ZooKeeper; • Hive; • Pig; • Cluster; Contact http://www.jean-antoine-moreau.fr.nf JAM 20
  • 21. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Contact http://www.jean-antoine-moreau.fr.nf JAM 21 • MapReduce • step ingestion and processing of data as key / value pairs; • melting step recordings formed the key to the final result,
  • 22. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa SQOOP System Hadoop Contact http://www.jean-antoine-moreau.fr.nf JAM 22 relational database data warehousing system HBase HDFS IT architecture Sqoop interface application for transferring data between relational databases and Hadoop.
  • 23. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Hadoop Pig Pig Latin Tools for Big Data high-level platform for creating MapReduce programs used with Hadoop Contact http://www.jean-antoine-moreau.fr.nf JAM 23
  • 24. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Contact http://www.jean-antoine-moreau.fr.nf JAM 24
  • 25. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Contact http://www.jean-antoine-moreau.fr.nf JAM 25 • No SQL – Model using : • Columns; • Values; • Graphs; • New SQL – Scale DB (real time); • real-time analytics across stream data. – Nimbus DB; – Volt DB; – Clustrix; – ORACLE big data; – Microsoft big data;
  • 26. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa In the Entreprise The Process Contact http://www.jean-antoine-moreau.fr.nf JAM 26
  • 27. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa The PROCESS Contact http://www.jean-antoine-moreau.fr.nf JAM 27
  • 28. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Contact http://www.jean-antoine-moreau.fr.nf JAM 28 Big Data Appliance Analyze Visualization
  • 29. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Acquire Contact http://www.jean-antoine-moreau.fr.nf JAM 29 Social Network Internet Dataset Collect Organize Visualize Analyze Decide
  • 30. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Aproach to build a big data platform • Using : The Massive Parallel Processing (MPP); and Data Base; MapReduce framework; Apache Open Source project called Hadoop; Contact http://www.jean-antoine-moreau.fr.nf JAM 30
  • 31. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • Open source solutions : – Jasper soft; – Pantaho reporting; – Talend (Extract Transform Load); – Apache Hadoop et Cassandra for the implementation of the MapReduce technology; analysis and reporting tools Contact http://www.jean-antoine-moreau.fr.nf JAM 31
  • 32. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • Teradata : Teradata platform (DW) / Teradata Value Analyser (Data analytics) / Aster Data Analytic Platform (Data analytics); • Oracle : Oracle Data Integrator (Data analytics) / Oracle Exadata Database Machine (Acquisition + Traitement) / Oracle Data Warehousing (DW); • Oracle NoSQL Database (Acquisition) / Oracle Big Data Appliance (DW) / Oracle Loader for Hadoop (Data transformation) / Oracle R Enterprise (Data analytics); • EMC : Greenplum solutions (DW + Data analytics + Appliance • IBM : InfoSphere platform (DW + Data analytics) / Netezza solutions (DW appliances); • SAP : Sybase IQ VLDP Option (DW + Data analytics); Contact http://www.jean-antoine-moreau.fr.nf JAM 32
  • 33. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • The IT infrastructure impact: No SQL systems : • Capture all data; • without categorizing; SQL system : • Impose metadata on the data captured; • and validate data types; Contact http://www.jean-antoine-moreau.fr.nf JAM 33
  • 34. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Hadoop Data Warehouse Analytic Application Acquire Organize Analysis Decide PROCESS Contact http://www.jean-antoine-moreau.fr.nf JAM 34
  • 35. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Cloud Manager No SQL manage DATABASE Connector enterprise-class applications Cloud CHD cluster Virtual Machine Contact http://www.jean-antoine-moreau.fr.nf JAM 35 JAVA LINUX - UNIX System
  • 36. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • Database analytic tools : – Statistical; – Prediction; – Data mining; – Text mining; – Graphs; – Data Plotted is to map; – Procedural logic; Contact http://www.jean-antoine-moreau.fr.nf JAM 36
  • 37. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa New SQL server integrates Hadoop component Open source framework specialized in managing unstructured data Exchange interface Hadoop - Excel Contact http://www.jean-antoine-moreau.fr.nf JAM 37
  • 38. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Word Excel Power Point Visual Studio Familiar User Tools Contact http://www.jean-antoine-moreau.fr.nf JAM 38 Web / Internet Apache Hadoop Server Analysis Services Enterprise Data Warehouse connectors Unstructured Data and Structured Data
  • 39. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Word Excel Power Point Visual Studio Familiar User Tools Contact http://www.jean-antoine-moreau.fr.nf JAM 39 Web / Internet Apache Hadoop Server Analysis Services Enterprise Data Warehouse connectors Enterprise Ressource Planing – Customer Relation Management – Line of Business Unstructured Data and Structured Data
  • 40. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • New SQL aproach : – Model; – Design; – Algorithm; – Motion; – ACID (will be explained in the next slide); – OLTP Real Time (On-line Transaction Processing); Contact http://www.jean-antoine-moreau.fr.nf JAM 40
  • 41. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa ACID Contact http://www.jean-antoine-moreau.fr.nf JAM 41 • Atomicity: – means either the task or tasks within a transaction are performed or none are performed. • Consistency: – means the transaction meets all rules defined by the system at all times. The transaction does not violate those rules and the database must remain in a consistent state at the beginning and end of a transaction. There are no half-completed transactions. • Isolation: – no transaction has access to any other transaction that is in an intermediate or unfinished state. Each transaction is independent. • Durability: – means the transaction is complete and it will persist. The completed transaction will survive system failure, power loss and other types of system breakdowns.
  • 42. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • OLTP (On-line Transaction Processing) – Characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). – The main emphasis for OLTP systems is put on very fast query processing, maintaining data integrity in multi-access environments and an effectiveness measured by number of transactions per second. – In OLTP database there is detailed and current data, and schema used to store transactional databases is the entity model (usually 3NF). Contact http://www.jean-antoine-moreau.fr.nf JAM 42
  • 43. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • OLAP (On-line Analytical Processing) is : – Characterized by relatively low volume of transactions. – Queries are often very complex and involve aggregations. For OLAP systems a response time is an effectiveness measure. • In OLAP database there is aggregated: – Historical data, – Stored in multi-dimensional schemas • usually star schema or the star-join schema. Contact http://www.jean-antoine-moreau.fr.nf JAM 43
  • 44. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Business Process Master data and transactions OLTP Business Strategy Operations Contact http://www.jean-antoine-moreau.fr.nf JAM 44 OLAP information Data Mining Analytics Decsion Markink Business Data WareHouse
  • 45. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa History • Scientific method and evolution :  330 BC: Logical Method, Aristotle.  1250: Birth of testable methods.  1700: theoretical method, Newton.  1950: simulation methods.  Today method of link analysis. Contact http://www.jean-antoine-moreau.fr.nf JAM 45
  • 46. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa some environments – UNIX – LINUX – JAVA programming language – Tools Contact http://www.jean-antoine-moreau.fr.nf JAM 46 • Hadoop – The Apache Hadoop project develops open-source software, for reliable, scalable, distributed computing. • Sqoop – tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. • Pig – Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
  • 47. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • Problematics : – Amount of data to be processed; – Processing time; – complex and cumbersome queries; – Storage costs; – Secure Treatment Process; – Security and Confidentiality of results treatments; Contact http://www.jean-antoine-moreau.fr.nf JAM 47
  • 48. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • Process to work: 1. Installing the envirronnement; 2. Migrating Data; 3. Data Processing; Contact http://www.jean-antoine-moreau.fr.nf JAM 48
  • 49. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Process of installing computer equipment • Network addressing; • Definition of machinery; • Tests; • SSH configuration; • Generating an RSA key pair; • Tests; • HDFS configuration; • Hadoop-site.xml file (Hadoop daemons and Map/Reduce jobs); • HDFS-site.xml (configure secure HDFS); • Tests; • MapReduce configuration; • Configuration core-site.xml (on every host in your cluster); • Tests; Contact http://www.jean-antoine-moreau.fr.nf JAM 49
  • 50. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Process of installing computer equipment Hadoop Cluster Host Server Master H (n+1) Server Slave Contact http://www.jean-antoine-moreau.fr.nf JAM 50 Layer Map Reduce Using : •Job Tracker •Web Interface
  • 51. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Process : – Formatting the HDFS; – Start HDFS; – Start Map Reduce; – Start the Cluster; – .... – Stop the cluster; Contact http://www.jean-antoine-moreau.fr.nf JAM 51
  • 52. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Data Migration • Hadoop implementation; • Application Active cluster; • ... • Apllication HadoppHive> MapReduce; • Objectives would be defined; • Mirrored to used : – Data management of complex types; – Rédiction response time of complex SQL queries; – Natural language queries for dynamic flitrage; Contact http://www.jean-antoine-moreau.fr.nf JAM 52
  • 53. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Data Migration • Data processing; • SQL queries are translated into : – Hadoop Pig Latin Applications Contact http://www.jean-antoine-moreau.fr.nf JAM 53
  • 54. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Data Migration Treatment for the verification of container and contents of migrated data. Contact http://www.jean-antoine-moreau.fr.nf JAM 54
  • 55. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Goals of this implementation in the business • Improve the customer experience; • Optimize Process; • Optimize Operational Performance; • Improve, Enhance the Business Model of the Enterprise; Contact http://www.jean-antoine-moreau.fr.nf JAM 55
  • 56. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa • There is a change in the processes used by the company to: Contact http://www.jean-antoine-moreau.fr.nf JAM 56 Collect; Analysis; Store; Use; Operate; Present; Sale; its DDaattaa
  • 57. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa in the future DATA Image - visual Contact http://www.jean-antoine-moreau.fr.nf JAM 57 video text sound
  • 58. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa in the future DATA Image - visual Contact http://www.jean-antoine-moreau.fr.nf JAM 58 video text sound smell and flavor by the use of the chemical composition
  • 59. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Storage DATA Image - visual Contact http://www.jean-antoine-moreau.fr.nf JAM 59 video text sound Using Cloud Computing
  • 60. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa Storage Dispatching DATA Image - visual Contact http://www.jean-antoine-moreau.fr.nf JAM 60 video text sound Using Cloud Computing
  • 61. © Jean-Antoine Moreau copying and reproduction prohibited Managing my copyright ADAGP. BBiigg DDaattaa DATA semi-structured data Contact http://www.jean-antoine-moreau.fr.nf JAM 61 structured data Geotagged data unstructured data
  • 62. Big Data. Data science: exploratory analysis; discrete-event simulation; predictive analytics.
  • 63. Big Data. Data mining: model development; adjustment; experiment.
  • 64. Big Data. Multilevel Modeling of Hierarchical and Longitudinal Data Using
  • 65. Big Data. Transforming data to decision; transforming data to strategy.
  • 66. Big Data. Define your concept; data preparation; clustering; virtualization; association; aggregation; evaluation; rules-regression; data mining; classification; decision tree; presentation; visualization.
  • 67. Big Data. Examples of datasets • http://labs.europeana.eu • http://fisher.osu.edu/fin/fdf/osudata.htm • http://www.ncbi.nlm.nih.gov • https://data.nasdaq.com/ • http://smd.princeton.edu// • http://archive.ics.uci.edu • https://www.yelp.com/academic_dataset • http://aws.amazon.com/fr/publicdatasets/
  • 68. Big Data. Categories of tools and software: clustering; segmentation; social network analysis; link analysis; visualization; statistical analysis; text analysis; text mining; information retrieval; web analytics and social media analytics; web usage mining.
  • 69. Big Data. Fast data in real time.
  • 70. Big Data. Fast data in real time; processing; data visualization.
  • 71. Big Data. Data visualization should provide information that is: interpretable; relevant; innovative.
  • 72. Big Data project tools. GDELT, an open source tool: Global Database of Events, Language, and Tone. Developed by Georgetown University.
  • 73. Big Data. GDELT Project • http://analysis.gdeltproject.org/ • http://analysis.gdeltproject.org/module-event-exporter.html
  • 74. Big Data. Volume of data; variety of data; veracity of data; data processing in real time (velocity).
  • 75. Big Data. Using data at rest; security of data at rest.
  • 76. Big Data. Principles for architecting real-time Big Data systems.
  • 77. Big Data layers: batch layer; serving layer; speed layer.
  • 78. Big Data. [Diagram] New data; batch layer: master data set, batch views; speed layer: real-time views; serving layer: batch views and real-time views, each answering queries.
  • 79. Big Data. [Diagram] New data stream; batch recompute: Hadoop data, HDFS, MapReduce, batch views (HDFS); real-time increment: stream processing, incremental views, real-time view (Apache HBase); Quality Function Deployment; serving layer; query.
  • 80. HDFS: Hadoop Distributed File System. HBase: Apache HBase is the Hadoop database, a distributed, scalable, big data store.
  • 81. Big Data. The batch layer has two functions: managing the master dataset; pre-computing the batch views.
  • 82. Big Data. The serving layer indexes the batch views so that they can be queried with low latency.
  • 83. Big Data. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
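The interplay of the three layers described in slides 81 to 83 can be illustrated with a minimal in-memory sketch. It is only an illustration under stated assumptions: the function names (`ingest`, `run_batch`, `realtime_view`, `query`) and the page-hit events are hypothetical, and plain Python lists and counters stand in for HDFS, MapReduce, and HBase.

```python
from collections import Counter

# Append-only master dataset (input of the batch layer).
master_dataset = []
# Events that arrived since the last batch run (input of the speed layer).
recent_events = []


def ingest(event):
    """New data is sent to both the batch layer and the speed layer."""
    master_dataset.append(event)
    recent_events.append(event)


def run_batch():
    """Batch layer: recompute the batch view from the whole master dataset."""
    view = Counter(e["page"] for e in master_dataset)
    recent_events.clear()  # the new batch view now covers these events
    return view


def realtime_view():
    """Speed layer: a view built from recent data only."""
    return Counter(e["page"] for e in recent_events)


def query(batch_view, page):
    """Serving side: merge the batch view with the real-time view."""
    return batch_view[page] + realtime_view()[page]


ingest({"page": "home"})
ingest({"page": "home"})
batch_view = run_batch()      # slow, complete recomputation
ingest({"page": "home"})      # arrives after the batch run
ingest({"page": "about"})

home_hits = query(batch_view, "home")
about_hits = query(batch_view, "about")
```

In this sketch the batch view is stale as soon as new events arrive, and the real-time view covers exactly that gap until the next batch run.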
  • 84. Big Data. [Diagram: layer, data, processing, and time, each labelled batch or real time]
  • 85. Big Data. The batch views are stored in HDFS and the real-time views are stored in HBase.
  • 86. Big Data architectural principles. Main Big Data architecture: immutability; recomputation; immutable record.
  • 87. Big Data. Master dataset; managing the versions of the dataset; fault tolerance; incremental algorithm; recompute.
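The principles listed in slides 86 and 87 (immutable records, dataset versions, incremental algorithms, and recomputation) can be sketched in a few lines. This is a hedged illustration, not a prescribed implementation: the sensor records and the helper names `make_record`, `recompute_totals`, and `increment_totals` are assumptions made for the example.

```python
from types import MappingProxyType


def make_record(sensor, value, timestamp):
    """Return a read-only record: a stored fact is never updated in place."""
    return MappingProxyType({"sensor": sensor, "value": value, "ts": timestamp})


def recompute_totals(dataset):
    """Recomputation: rebuild the view from every record in the dataset."""
    totals = {}
    for record in dataset:
        totals[record["sensor"]] = totals.get(record["sensor"], 0) + record["value"]
    return totals


def increment_totals(totals, record):
    """Incremental algorithm: derive a new view from an old view and one record."""
    updated = dict(totals)
    updated[record["sensor"]] = updated.get(record["sensor"], 0) + record["value"]
    return updated


# Version 1 of the master dataset, stored as an immutable tuple.
master_v1 = (make_record("a", 2, 1), make_record("b", 5, 2))
view_v1 = recompute_totals(master_v1)

# A new fact creates version 2; version 1 is left untouched.
new_record = make_record("a", 3, 3)
master_v2 = master_v1 + (new_record,)

incremental_view = increment_totals(view_v1, new_record)
recomputed_view = recompute_totals(master_v2)
```

Because the master dataset is never modified, a view that was lost or corrupted can always be rebuilt with `recompute_totals`, which is the fault-tolerance argument behind recomputation; the incremental path is only a faster way to reach the same result.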
  • 88. Big Data. Modifying the streaming data analytic platform.
  • 89. Big Data. Writing data with the storm-hdfs connector; writing data with the storm-hbase connector.
  • 90. Big Data. The storm-hdfs connector provides the following key features: supports HDFS 2.x; supports both text and sequence files; configurable directory and file names; customizable synchronization, rotation policies, and rotation actions; a tuple fails if the HDFS update fails; supports the Trident API; supports writing to a Kerberized Hadoop cluster. The primary classes of the storm-hdfs connector are HdfsBolt and SequenceFileBolt, both located in the org.apache.storm.hdfs.bolt package. Use the HdfsBolt class to write text data to HDFS and the SequenceFileBolt class to write binary data.
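To make the ideas of synchronization and rotation policies in slide 90 more concrete, here is a toy model in plain Python. It does not use the Storm API: the class `RotatingWriter`, its parameters, and the in-memory "files" are hypothetical stand-ins, chosen only to show a count-based sync and a size-based rotation working together on delimited text records.

```python
class RotatingWriter:
    """Toy writer with count-based sync and size-based rotation (hypothetical)."""

    def __init__(self, prefix, max_bytes, sync_every):
        self.prefix = prefix          # configurable file name prefix
        self.max_bytes = max_bytes    # rotate once a file reaches this size
        self.sync_every = sync_every  # sync after this many pending lines
        self.files = {}               # file name -> lines already synced
        self.pending = []             # lines written but not yet synced
        self.index = 0                # rotation counter used in file names
        self.size = 0                 # bytes written to the current file

    def _name(self):
        return f"{self.prefix}-{self.index}.txt"

    def _sync(self):
        """Make the pending lines durable in the current file."""
        self.files.setdefault(self._name(), []).extend(self.pending)
        self.pending = []

    def write(self, fields, delimiter="|"):
        line = delimiter.join(str(f) for f in fields)
        self.pending.append(line)
        self.size += len(line.encode("utf-8")) + 1  # +1 for the newline
        if len(self.pending) >= self.sync_every:
            self._sync()
        if self.size >= self.max_bytes:
            self._sync()       # flush what is left before closing the file
            self.index += 1    # rotation: later lines go to a new file
            self.size = 0


writer = RotatingWriter(prefix="events", max_bytes=20, sync_every=2)
for i in range(5):
    writer.write(["user", i])
```

With these illustrative settings the first file is closed after three records and the remaining two land in a second file; in a real topology the thresholds would be far larger and the policies would come from the connector's own configuration.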
  • 91. Big Data. The storm-hbase connector provides the following key features: supports Apache HBase and above; supports incrementing counter columns; tuples fail if an update to an HBase table fails; ability to group puts in a single batch; supports writing to Kerberized (Kerberos) HBase clusters. The storm-hbase connector enables Storm developers to collect several puts in a single operation and write to multiple HBase column families and counter columns. A put is an HBase operation that inserts data into a single HBase cell. Use the HBase client's write buffer (hbase.client.write.buffer) to batch puts automatically. The primary interface in the storm-hbase connector is the org.apache.storm.hbase.bolt.mapper.HBaseMapper interface.
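The batching of puts and the counter columns mentioned in slide 91 can be pictured with a small simulation. This is a sketch under clear assumptions: `BufferedTable` is a hypothetical in-memory class, not the HBase client or the storm-hbase API, and its buffer limit counts puts, whereas the real `hbase.client.write.buffer` setting is a size in bytes.

```python
from collections import defaultdict


class BufferedTable:
    """Toy stand-in for a table client with a write buffer (hypothetical)."""

    def __init__(self, buffer_limit):
        self.buffer_limit = buffer_limit   # number of puts held before a flush
        self.buffer = []                   # puts waiting to be sent
        self.cells = {}                    # (row, family, qualifier) -> value
        self.counters = defaultdict(int)   # (row, family, qualifier) -> count
        self.batches_sent = 0

    def put(self, row, family, qualifier, value):
        """Queue one put; a put targets a single cell."""
        self.buffer.append((row, family, qualifier, value))
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def increment(self, row, family, qualifier, amount=1):
        """Increment a counter column."""
        self.counters[(row, family, qualifier)] += amount

    def flush(self):
        """Send all buffered puts as one batch."""
        if not self.buffer:
            return
        for row, family, qualifier, value in self.buffer:
            self.cells[(row, family, qualifier)] = value
        self.buffer = []
        self.batches_sent += 1


table = BufferedTable(buffer_limit=3)
for word in ["storm", "hbase", "storm", "hdfs"]:
    table.put(word, "cf", "seen", "yes")
    table.increment(word, "cf", "count")
table.flush()  # send whatever is still buffered
```

Four puts reach the table in two batches instead of four round trips, which is the point of grouping puts; the counter column keeps an exact count per row regardless of batching.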
  • 92. Big Data. Global vision: architecture synthesis.
  • 93. Big Data. Collect; processing; analysis; restitution; storage.
  • 94. Big Data. Collecting data: retrieve the data; transmit the data to the processing units and the analysis units.
  • 95. Big Data. Data transformation: extract useful information from poorly structured or unstructured data; use the extracted data to make them consistent; allow a metadata catalog to be built.
  • 96. Big Data. Analysis: creation of new information from the data set, in real time, by identification; correlation; aggregation; projection.
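The four analysis operations named in slide 96 can be shown on a tiny data set. The sketch below is one possible reading, not the author's definition: "identification" is interpreted here as selecting the records that match a condition, and the sample rows, field names, and function names are invented for the example.

```python
from math import sqrt

# Invented sample data: one row per observation.
rows = [
    {"city": "Paris", "temp": 10.0, "sales": 100.0},
    {"city": "Paris", "temp": 20.0, "sales": 200.0},
    {"city": "Lyon", "temp": 15.0, "sales": 150.0},
    {"city": "Lyon", "temp": 25.0, "sales": 250.0},
]


def identify(records, predicate):
    """Identification: keep the records that satisfy a condition."""
    return [r for r in records if predicate(r)]


def project(records, fields):
    """Projection: keep only the requested fields of each record."""
    return [{f: r[f] for f in fields} for r in records]


def aggregate(records, key, field):
    """Aggregation: sum one field per distinct value of the key."""
    sums = {}
    for r in records:
        sums[r[key]] = sums.get(r[key], 0.0) + r[field]
    return sums


def pearson(xs, ys):
    """Correlation: Pearson coefficient between two numeric series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


warm_days = identify(rows, lambda r: r["temp"] >= 20.0)
city_only = project(rows, ["city"])
sales_by_city = aggregate(rows, "city", "sales")
temp_sales_corr = pearson([r["temp"] for r in rows], [r["sales"] for r in rows])
```

Each function produces information that is not stored as such in the raw rows (a subset, a reduced view, a total per city, a strength of relationship), which is the sense of "creation of new information" in the slide.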
  • 97. Big Data. Data restitution: allow visualization of data; enable data mining.
  • 98. Big Data. Data storage: the ability to store very large amounts of structured and unstructured data; a data warehouse storing the data obtained from the processing; an analysis cache to speed up processing for restitution.
  • 99. Big Data. Criteria for building a technical base.
  • 100. Big Data. Criteria for building a technical base: elasticity; adaptability; versatility.
  • 101. Big Data. Elasticity, using technology. Storage technology: SAN (Storage Area Network); NAS (Network Attached Storage); DAS (Direct Attached Storage). Server: parallelism; cluster; …
  • 102. Big Data. Adaptability and versatility, using tools for integrating multiple data sources. Structured data: database; XML. Semi-structured data: e-mail. Unstructured data: video; image; website; social networks; open data.
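Integrating multiple data sources, as slide 102 describes, usually means mapping each source onto one common record shape. The sketch below shows that idea with the Python standard library only; the table, the XML layout, the e-mail text, and the common fields (`source`, `name`, `city`, `note`) are all assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET
from email import message_from_string

# Invented sample inputs for the XML and e-mail sources.
XML_DOC = "<customers><customer name='Bob'><city>Lyon</city></customer></customers>"
EMAIL_DOC = "From: carol@example.com\nSubject: Order\n\nPlease ship to Nice.\n"


def from_database():
    """Structured source: rows of a relational table (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
    conn.execute("INSERT INTO customers VALUES ('Alice', 'Paris')")
    rows = conn.execute("SELECT name, city FROM customers").fetchall()
    conn.close()
    return [{"source": "database", "name": n, "city": c, "note": None}
            for n, c in rows]


def from_xml(text):
    """Structured source: elements and attributes of an XML document."""
    root = ET.fromstring(text)
    return [{"source": "xml", "name": c.get("name"),
             "city": c.findtext("city"), "note": None}
            for c in root.findall("customer")]


def from_email(text):
    """Semi-structured source: known headers plus a free-text body."""
    msg = message_from_string(text)
    return [{"source": "email", "name": msg["From"],
             "city": None, "note": msg.get_payload().strip()}]


records = from_database() + from_xml(XML_DOC) + from_email(EMAIL_DOC)
```

The e-mail record leaves `city` empty even though the body mentions one: extracting that value from free text is the harder, unstructured part of the problem, which this sketch deliberately does not attempt.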
  • 103. Big Data. Reactivity: optimize the distribution of data processing between computing nodes.
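One simple way to think about distributing processing between computing nodes is a greedy load balancer: take the tasks from most to least expensive and always give the next one to the node with the smallest load so far. The sketch below illustrates that heuristic; the task names, their costs, and the function `distribute` are hypothetical, and the heuristic is not guaranteed to find the best possible split.

```python
import heapq


def distribute(task_costs, node_count):
    """Greedy assignment: the most expensive task goes to the least-loaded node."""
    # Heap of (current load, node id); the smallest load is popped first.
    heap = [(0, node) for node in range(node_count)]
    heapq.heapify(heap)
    assignment = {node: [] for node in range(node_count)}
    by_cost = sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True)
    for task, cost in by_cost:
        load, node = heapq.heappop(heap)
        assignment[node].append(task)
        heapq.heappush(heap, (load + cost, node))
    loads = {node: sum(task_costs[t] for t in tasks)
             for node, tasks in assignment.items()}
    return assignment, loads


# Invented processing costs, for example estimated seconds per task.
tasks = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
assignment, loads = distribute(tasks, node_count=2)
```

The overall response time is bounded by the most loaded node, so the quantity to watch is `max(loads.values())`; a cluster scheduler would also take data locality and node capacity into account, which this sketch ignores.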