DATA WAREHOUSING SOLUTION
USING APACHE SPARK
TEAM 18
AYUSH KHANDELWAL
GAURAV PARIDA
ANIL REDDY
MEHAK AGARWAL
INTRODUCTION TO DATA WAREHOUSE
A data warehouse is constructed by integrating data from multiple heterogeneous
sources. It supports analytical reporting, structured and/or ad hoc queries and decision
making.
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data. This data helps analysts make informed decisions in an
organization.
It is kept separate from the organization's operational database. There is no frequent
updating done in a data warehouse.
It possesses consolidated historical data, which helps the organization to analyze its
business.
Image taken from wikipedia.org/datawarehouse
KEY FEATURES
Subject Oriented - A data warehouse is subject oriented because it provides information around a
subject rather than the organization's ongoing operations.
Integrated - A data warehouse is constructed by integrating data from heterogeneous sources
such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
Time Variant - The data collected in a data warehouse is identified with a particular time period.
The data in a data warehouse provides information from the historical point of view.
Non-volatile - Non-volatile means the previous data is not erased when new data is added to it. A
data warehouse is kept separate from the operational database, and therefore frequent changes in
the operational database are not reflected in the data warehouse.
DATA WAREHOUSE VS OPERATIONAL DATABASE
An operational database is constructed for well-known tasks and workloads such as
searching particular records, indexing, etc. In contrast, data warehouse queries are
often complex and present data in a general form.
Operational databases support concurrent processing of multiple transactions.
Concurrency control and recovery mechanisms are required for operational
databases to ensure robustness and consistency of the database.
An operational database query may both read and modify data, while an
OLAP query needs only read-only access to stored data.
An operational database maintains current data. On the other hand, a data
warehouse maintains historical data.
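
The contrast above can be illustrated with a small sketch using Python's built-in sqlite3 module. The `orders` table and its columns are hypothetical and are not part of this project. The first query reads and modifies a single known record, as an operational workload would. The second is a read-only aggregate over historical rows, as a warehouse query would be.

```python
import sqlite3

# Hypothetical in-memory table, used only to contrast the two query styles.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, year, amount) VALUES (?, ?, ?)",
    [("north", 2011, 10.0), ("north", 2012, 30.0), ("south", 2012, 20.0)],
)

# Operational (OLTP) style: look up one known record, then read and modify it.
conn.execute("UPDATE orders SET amount = amount + 5.0 WHERE id = ?", (1,))
single_row = conn.execute("SELECT amount FROM orders WHERE id = ?", (1,)).fetchone()

# Warehouse (OLAP) style: a read-only aggregate over historical rows.
totals_by_year = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()
```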
APACHE SPARK
Open Source
Alternative to MapReduce for certain applications
A low-latency cluster computing system
For very large data sets
May be up to 100 times faster than MapReduce for:
Iterative algorithms
Interactive data mining
Used with Hadoop / HDFS
Released under the Apache License 2.0 (early pre-Apache releases used a BSD license)
SPARK FEATURES
Uses in-memory cluster computing
Memory access is faster than disk access
Has APIs in:
Scala
Java
Python
Can be accessed from Scala and Python shells
An Apache top-level project since February 2014 (previously in the Apache Incubator)
Scales to very large clusters
Uses in-memory processing for increased speed
Low-latency shell access
OUR DATA WAREHOUSE SOLUTION
Building a data warehouse requires a large amount of data to start with, combined with
immense computational resources.
This project creates a data-warehouse-like system that can perform basic
queries and some analytics.
Use cases that we are dealing with:
Ad hoc queries such as “best movies of 2012”, “best comedy movies”, etc.
Movie rating progression graph
Movie recommendation engine
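
As a rough illustration of the first use case, an ad hoc query such as “best comedy movies” can be sketched in plain Python as a mean rating per movie followed by a genre filter. This is not the project's Spark code. The sample records, the `best_movies` function and its `min_ratings` and `top_n` parameters are all assumptions made for illustration. On Spark RDDs the same shape would typically be a `map` to (movieid, (rating, 1)) pairs followed by `reduceByKey`.

```python
from collections import defaultdict

# Tiny hypothetical sample in the shape of movies.csv and ratings.csv.
movies = {
    1: ("Movie A (2012)", ["Comedy"]),
    2: ("Movie B (2012)", ["Drama"]),
    3: ("Movie C (2011)", ["Comedy", "Romance"]),
}
ratings = [  # (userid, movieid, rating)
    (10, 1, 4.0), (11, 1, 5.0), (10, 2, 3.0),
    (12, 3, 5.0), (11, 3, 2.0), (12, 2, 4.0),
]

def best_movies(movies, ratings, genre=None, min_ratings=1, top_n=10):
    """Rank movies by mean rating, optionally restricted to one genre."""
    sums = defaultdict(lambda: [0.0, 0])  # movieid -> [rating total, count]
    for _user, movie_id, rating in ratings:
        sums[movie_id][0] += rating
        sums[movie_id][1] += 1
    ranked = []
    for movie_id, (total, count) in sums.items():
        title, genres = movies[movie_id]
        if count < min_ratings:
            continue  # too few ratings to trust the mean
        if genre is not None and genre not in genres:
            continue
        ranked.append((title, total / count))
    # Highest mean first; ties are broken by title for a stable order.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]

best_comedies = best_movies(movies, ratings, genre="Comedy")
```

A minimum rating count is worth keeping in a real query, since a movie with a single 5.0 rating would otherwise outrank widely rated titles.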
MOVIELENS 20M DATASET
movielens.org is a movie rating and recommendation site run by GroupLens, a research lab at
the University of Minnesota.
GroupLens provides MovieLens datasets of different sizes for free at
http://grouplens.org/datasets/movielens/
For this project, we are using the MovieLens 20M dataset, which is the largest of all the
datasets provided by MovieLens.
Statistics about the dataset:
20 million ratings
465,000 tag applications
27,000 movies
DESCRIBING THE DATA
The data contains 4 CSV files of which only 2 are useful for this project:
movies.csv - movieid, title, genres
ratings.csv - userid, movieid, rating, timestamp
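
A minimal sketch of reading these two layouts with Python's csv module follows. The sample rows are made up. The sketch also rests on three assumptions that the slides do not state: genres are separated by `|`, the release year appears in parentheses at the end of the title, and the timestamp is an integer number of seconds since the Unix epoch.

```python
import csv
import io
import re

# Hypothetical sample rows; real files would be read from disk or HDFS.
MOVIES_CSV = 'movieid,title,genres\n1,"Example, The (2012)",Comedy|Romance\n2,Untitled,Drama\n'
RATINGS_CSV = "userid,movieid,rating,timestamp\n10,1,4.5,1350000000\n"

# Assumption: the release year is the trailing "(YYYY)" in the title.
YEAR_RE = re.compile(r"\((\d{4})\)\s*$")

def parse_movies(text):
    """Return {movieid: {"title", "year", "genres"}} from movies.csv text."""
    movies = {}
    for row in csv.DictReader(io.StringIO(text)):
        match = YEAR_RE.search(row["title"])
        movies[int(row["movieid"])] = {
            "title": row["title"],
            "year": int(match.group(1)) if match else None,
            "genres": row["genres"].split("|"),  # assumption: pipe-separated
        }
    return movies

def parse_ratings(text):
    """Return a list of (userid, movieid, rating, timestamp) tuples."""
    return [
        (int(r["userid"]), int(r["movieid"]), float(r["rating"]), int(r["timestamp"]))
        for r in csv.DictReader(io.StringIO(text))
    ]

movies = parse_movies(MOVIES_CSV)
ratings = parse_ratings(RATINGS_CSV)
```

Using a CSV parser rather than splitting on commas matters here, because a title may itself contain a comma inside quotes.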
SOME IDEAS FROM HIVE
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis.
Supports analysis of large datasets stored in Hadoop's HDFS and compatible file
systems such as Amazon S3 filesystem.
Provides a mechanism to project structure onto this data and query the data using a
SQL-like language called HiveQL.
FOREGROUND
Taking ideas from Apache Hive, we propose the following solution in this
project:
Dataset files are stored in HDFS.
An API has been developed using Flask instead of a graphical interface. An API
rule (route) has been defined for each query.
On requesting the API URL with the appropriate parameters, the results
are displayed in the browser window.
BACKGROUND
The dataset files are pushed to HDFS without any modifications, for faster access.
For each query, the files are read from HDFS and converted to Spark RDDs (Resilient
Distributed Datasets).
RDDs are a logical collection of data partitioned across machines. They can be
manipulated in parallel.
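
The idea of a partitioned collection processed in parallel can be sketched without Spark. In this plain-Python illustration, worker threads stand in for machines. The `partition` and `partial_sum_count` helpers are hypothetical names and are not part of any Spark API. Each partition is reduced independently, and the partial results are then combined.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(records, num_partitions):
    """Split records into num_partitions roughly equal slices."""
    return [records[i::num_partitions] for i in range(num_partitions)]

def partial_sum_count(chunk):
    """Work done independently on one partition."""
    return (sum(chunk), len(chunk))

ratings = [4.0, 3.5, 5.0, 2.0, 4.5, 3.0]
chunks = partition(ratings, 3)

# Each partition is processed by its own worker, with no shared state.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum_count, chunks))

# Combine the per-partition results into one global answer.
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
mean_rating = total / count
```

The mean is computed from per-partition (sum, count) pairs rather than per-partition means, because averaging the means gives the wrong result when partitions differ in size.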
The API call is parsed for parameters, and accordingly the corresponding query
function is called.
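
The parse-and-dispatch step might look roughly like the following standard-library sketch. The route paths, parameter names and placeholder query functions are all hypothetical, since the slides do not list the actual API rules. In a Flask application the same role would be played by `@app.route` handlers reading `request.args`.

```python
from urllib.parse import urlparse, parse_qs

def top_movies(genre=None, year=None):
    # Placeholder query function; a real one would run the Spark job.
    return {"query": "top", "genre": genre, "year": year}

def rating_progression(movieid):
    # Placeholder query function for the rating progression use case.
    return {"query": "progression", "movieid": int(movieid)}

# Hypothetical routes: URL path -> query function.
ROUTES = {
    "/top": top_movies,
    "/progression": rating_progression,
}

def handle(url):
    """Parse the URL, pick the query function by path, and call it."""
    parsed = urlparse(url)
    handler = ROUTES.get(parsed.path)
    if handler is None:
        return {"error": "unknown query"}
    # parse_qs returns a list per key; keep the first value of each parameter.
    params = {key: values[0] for key, values in parse_qs(parsed.query).items()}
    return handler(**params)

result = handle("http://localhost:5000/top?genre=Comedy&year=2012")
```

Query-string values arrive as strings, so each query function is responsible for converting them to the types it needs. A parameter name that the query function does not accept raises a `TypeError` in this sketch, where a real service would return an error response.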
The result of the query is handed over to Flask and displayed in the browser. GraphX
has been used for plotting the graph.
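
The data behind a rating progression graph can be sketched as a mean rating per calendar year for one movie. The yearly granularity, the function name and the sample rows are assumptions, and the timestamps are taken to be seconds since the Unix epoch. The resulting (year, mean) points are what a plotting step would consume.

```python
from collections import defaultdict
from datetime import datetime, timezone

def rating_progression(ratings, movie_id):
    """Return [(year, mean rating)] for one movie, ordered by year."""
    by_year = defaultdict(list)
    for _user, mid, rating, timestamp in ratings:
        if mid == movie_id:
            # Interpret the timestamp as UTC so results do not depend on locale.
            year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
            by_year[year].append(rating)
    return [(year, sum(vals) / len(vals)) for year, vals in sorted(by_year.items())]

# Hypothetical rows: (userid, movieid, rating, timestamp).
ratings = [
    (10, 1, 4.0, 1262304000),  # 2010-01-01 UTC
    (11, 1, 2.0, 1277942400),  # 2010-07-01 UTC
    (12, 1, 5.0, 1325376000),  # 2012-01-01 UTC
    (12, 2, 1.0, 1325376000),  # different movie, ignored
]
points = rating_progression(ratings, 1)
```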
