IMDb Data Integration
Large Scale Data Management - Spring 2018
Giuseppe Andreetti
Outline
• IMDb
• CSV description
• GAV
• Source Schema
• Global Schema
• Mapping
• Talend
• Data pre-processing and cleaning
• Data integration process
• Results
IMDb
IMDb, also known as the Internet Movie Database, is an online database of information related to films, television programs, home videos, video games and internet streams, including cast, production crew, personnel and fictional character biographies, plot summaries, trivia, and fan reviews and ratings.
Used dataset
This data integration project uses four files in .csv format:
• movies.csv
• rating_I.csv
• rating_II.csv
• rating_III.csv
These data sources are available on kaggle.com.
Movies.csv description
Fields: movieId, title, genres

It contains 27,779 entries.
Rating_I.csv description
Fields: userId, movieId, rating, timestamp

It contains 20,000,264 entries.
GAV - Global as view
An information integration system I is a triple <G, S, M>, where G is the global schema, S is the source schema and M is the mapping between them.
The most common scenario is the one in which the global schema is built by observing the data source schemas, through an intensional integration process of those schemas (think, for example, of a consolidation process, or of a situation in which we want to represent in an integrated way the whole information content of an organization's data architecture).
In this case the global schema is expressed in terms of the local schemas.
GAV - Global as view
Purpose:
• task based: a data integration program built for a specific purpose
• service based: a data integration query with parameters
• domain based: general-purpose data integration (supporting any query on that domain)
Type:
• Materialized: a copy of the data is kept, so that it can be manipulated directly
• Virtualized: data is requested from the sources at every query; no maintenance policy is needed, but it is riskier
Approach:
• axioms
• no axioms
Source Schema
r1: movieId, title, genres, year
r2: userId, movieId, rating, timestamp
Global Schema
movie: movieId, title, genres, year, userId, rating, timestamp, rating_avg
rating: movieId, rating_avg
Mapping
The global relations are obtained from the sources as follows:
• movie: r1 (movieId, title, genres, year) and r2 (userId, movieId, rating, timestamp) are joined on the join key movieId.
• rating: r2 (userId, movieId, rating, timestamp) is reduced by a group function, grouping on movieId and computing rating_avg.
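As a sketch, the two mappings can also be written as relational-algebra views over the sources (this notation is mine, not from the slides; γ denotes grouping/aggregation and ⋈ the join on movieId):

    \[
    \text{rating} \;=\; \gamma_{\,\text{movieId},\ \mathrm{AVG}(\text{rating})\,\rightarrow\,\text{rating\_avg}}\,(r2)
    \]
    \[
    \text{movie} \;=\; r1 \;\bowtie_{\text{movieId}}\; r2 \;\bowtie_{\text{movieId}}\; \text{rating}
    \]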
Talend
Talend is a software platform that provides data integration solutions, delivering timely and easy access to historical, live and emerging data. Talend runs natively in Hadoop using the latest innovations from the Apache ecosystem.
Talend combines big data components for Hadoop MapReduce 2.0 (YARN), HBase, HCatalog, Sqoop, Hive, Oozie and Pig into a unified open-source environment, to process large volumes of data quickly.
Talend Interface
[Screenshot of the Talend interface, with its main areas labelled: Data sources, Instruments, Workflow, Component settings, Terminal.]
Data Pre-Processing
The title field in movies.csv also contains the year of the movie.
To extract this information, the first attempt used tJavaRow, a Talend component that allows you to enter custom code and integrate it in job workflows.
However, once the code was written and compiled, the Talend shell returned an error on the conversion from String to int.
Data Pre-Processing
The second attempt used Pandas, a Python data analysis library.
A Python script extracts the year of each movie from the title field and generates a new field called year.
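A minimal sketch of that script, assuming the usual "Title (1995)" format of the title field (the actual script is not shown in the slides, and the output file name is hypothetical):

    import pandas as pd

    movies = pd.read_csv("movies.csv")  # movieId, title, genres

    # The year is assumed to be the four digits in parentheses at the end of the title.
    year = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
    movies["year"] = pd.to_numeric(year, errors="coerce").astype("Int64")

    movies.to_csv("movies_with_year.csv", index=False)  # hypothetical output file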
Data Integration
The data integration process was carried out using only the Talend tools.
In Talend it is possible to create a workflow in order to manage and integrate the data.
The large number of entries in both the .csv files and the database tables saturated the memory, and the terminal returned the error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
The workflow was therefore divided into four parts, because the configuration used is not powerful enough.
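The slides work around the error by splitting the Talend workflow into four jobs. As an aside, in a plain Python script the same memory pressure can be reduced by streaming the large CSVs in chunks instead of loading them at once (a sketch; the table name and connection string are hypothetical):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:password@localhost/IMDB")  # hypothetical credentials

    # Load the 20-million-row rating file in 500,000-row chunks,
    # so that only one chunk is held in memory at a time.
    for chunk in pd.read_csv("rating_I.csv", chunksize=500_000):
        chunk.to_sql("rating_staging", engine, if_exists="append", index=False)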
Data Integration: Job I
Union (keeping duplicates) of rating_I, rating_II and rating_III.
A new table called r2 was generated in the IMDB database.
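The equivalent step sketched in pandas (the slides do this with Talend components; the connection string below is hypothetical):

    import pandas as pd
    from sqlalchemy import create_engine

    # Union with duplicates: simply concatenate the three rating files.
    parts = [pd.read_csv(name) for name in ("rating_I.csv", "rating_II.csv", "rating_III.csv")]
    r2 = pd.concat(parts, ignore_index=True)  # userId, movieId, rating, timestamp

    engine = create_engine("mysql+pymysql://user:password@localhost/IMDB")  # hypothetical credentials
    r2.to_sql("r2", engine, if_exists="replace", index=False)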
Data Integration: Job II
A new table called rating was generated in the IMDB database.
Data Integration: Job II
• From database table r2 to database table rating.
r2 contains, for each user, the movies that they rated.
Entries were grouped by movieId, and rating_avg is the average of the ratings that share the same movieId.
The userId and timestamp fields were removed, since they are not needed in the rating table.
Table rating (screenshot).
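A pandas sketch of this grouping step, reusing the r2 DataFrame and engine from the Job I sketch (the slides perform it entirely inside Talend):

    # Group r2 by movieId and average the ratings; userId and timestamp drop out of the aggregation.
    rating = (
        r2.groupby("movieId", as_index=False)["rating"]
          .mean()
          .rename(columns={"rating": "rating_avg"})
    )
    rating.to_sql("rating", engine, if_exists="replace", index=False)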
Data Integration: Second problem
I tried to integrate the data contained in movies.csv with the new table r2 created in the previous step.
With the lookup model option "Load once" set, the Talend shell returned the memory error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I then tried the lookup model option "Reload at each row (cache)".
In this case the job works, but from the rows/s throughput I estimated the time needed to complete it at about 78 weeks (with my configuration).
The only way to continue with the data integration project was to reduce the number of records in the table r2, so they were reduced from 20 million to 80,000.
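One way to perform this reduction in pandas, reusing r2 and engine from the earlier sketches and assuming the 80,000 records were chosen by random sampling (the slides do not say how the rows were selected):

    # Downsample r2 to 80,000 rows so the subsequent joins fit in memory.
    r2_small = r2.sample(n=80_000, random_state=42)
    r2_small.to_sql("r2", engine, if_exists="replace", index=False)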
Data Integration: Job III
Join between movies.csv and the r2 table through the tMap component.
Data Integration: Job III
Inside the tMap component.
Data Integration: Job III
A new table called IMDBresults was generated in the IMDB database, starting from movies.csv and the table r2. It contains:
movieId, title, genres, year, userId, rating, timestamp
The tMap component was used, with an inner join on movieId and the "All matches" option activated.
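Roughly the same join sketched in pandas, reusing r2_small and engine from the earlier sketches and the hypothetical movies_with_year.csv produced in the pre-processing sketch ("All matches" corresponds to merge's default many-to-many behaviour):

    import pandas as pd

    movies = pd.read_csv("movies_with_year.csv")  # movieId, title, genres, year

    # Inner join on movieId, keeping every matching rating row for each movie ("All matches").
    imdb_results = movies.merge(r2_small, on="movieId", how="inner")
    imdb_results.to_sql("IMDBresults", engine, if_exists="replace", index=False)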
Data Integration: Job IV
Join between the IMDBresults and rating tables through the tMap component.
Data Integration: Job IV
Inside the tMap component.
Data Integration: Job IV
A new table called movie was generated in the IMDB2 database, starting from the table IMDBresults and the table rating contained in the IMDB database. It contains:
movieId, title, genres, year, userId, rating, timestamp, rating_avg
The tMap component was used, with an inner join on movieId and the "Unique match" option activated.
This operation took about 1 hour of computation.
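The final step sketched in pandas, reusing imdb_results and rating from the previous sketches; since rating has one row per movieId, an inner merge already yields a single match per row, mirroring the "Unique match" option (the IMDB2 connection string is hypothetical):

    from sqlalchemy import create_engine

    # Attach the per-movie average rating to every row of IMDBresults.
    movie = imdb_results.merge(rating, on="movieId", how="inner")

    engine2 = create_engine("mysql+pymysql://user:password@localhost/IMDB2")  # hypothetical credentials
    movie.to_sql("movie", engine2, if_exists="replace", index=False)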
Results
movie table (screenshot from Sequel Pro)
Results
rating table (screenshot from Sequel Pro)
