Oscar Castañeda, Xoom, a PayPal Service
Learning to Rank Datasets
for Search
#SAISDS8
About
• Data Scientist at Xoom, a PayPal service.
• Interests:
• Data Management,
• Dataset Search,
• Learning to Rank.
Spark cluster with Elasticsearch
http://bit.ly/2em6RUK
http://bit.ly/2ebM9HO
And Indexed RDatasets
Spark cluster with Elasticsearch Inside
Agenda
• Problem Statement and Motivation
• Elasticsearch Learning to Rank
• Data Pipeline: metadata extraction, judgement list extraction
• Demo: Beginnings of a Dataset Search Engine with machine-learned relevance ranking.
• Q&A
Problem Statement (1)
• Despite being a key corporate asset, datasets are generally not given the
importance they deserve and, as a result, they are hard to find.
Problem Statement (2)
• Specifically, teams within organizations
have a hard time finding datasets
relevant to their function.
Topics
• Indexing (Spark Summit East 2017).
• Ranking => today’s topic!
Questions
• How are datasets ranked?
• Can judgement lists (useful for ranking)
be generated at dataset production
time?
Overview
Rdatasets
Take ES snapshot
Restore ES snapshot
http://bit.ly/2e5H1jL
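The snapshot and restore steps can be sketched as the two REST calls involved. This is a minimal sketch that only builds the requests (it does not contact a cluster); the repository, snapshot, and index names are hypothetical, and the repository is assumed to be registered already.

```python
def snapshot_requests(repo, snapshot, indices):
    """Return (method, path, body) tuples for taking and restoring a snapshot.

    Uses the Elasticsearch snapshot endpoints under /_snapshot; the names
    passed in are illustrative, not taken from the talk.
    """
    take = (
        "PUT",
        f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true",
        {"indices": indices},
    )
    restore = (
        "POST",
        f"/_snapshot/{repo}/{snapshot}/_restore",
        {"indices": indices},
    )
    return take, restore


# Hypothetical names for the indexed Rdatasets.
take, restore = snapshot_requests("demo_repo", "rdatasets_snapshot", "rdatasets")
```

Sending these with any HTTP client against the source and target clusters would move the indexed Rdatasets from one to the other.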
Overview
Rdatasets
Data Pipelines:
• Extract Filename, Format, Field description.
• Extract Type information.
• Index CSV files.
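The extraction steps above can be sketched for a single CSV file as follows. This is a minimal sketch in plain Python rather than Spark; the type-inference rule, the record layout, and the sample data are assumptions made for illustration, not the pipeline from the talk.

```python
import csv
import io
import os


def _all_parse(values, cast):
    """True if every value can be converted with `cast`."""
    try:
        for value in values:
            cast(value)
    except ValueError:
        return False
    return True


def infer_type(values):
    """Guess a column type from raw strings (hypothetical rule)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    if _all_parse(non_empty, int):
        return "integer"
    if _all_parse(non_empty, float):
        return "float"
    return "string"


def extract_metadata(path, text):
    """Build an index-ready metadata record for one CSV dataset."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    # Transpose rows into columns so each field can be typed on its own.
    columns = list(zip(*body)) if body else [() for _ in header]
    filename = os.path.basename(path)
    return {
        "filename": filename,
        "format": os.path.splitext(filename)[1].lstrip(".").lower(),
        "fields": [
            {"name": name, "type": infer_type(column)}
            for name, column in zip(header, columns)
        ],
    }


# Made-up sample resembling an Rdatasets CSV file.
sample = "Sepal.Length,Species,count\n5.1,setosa,3\n4.9,setosa,7\n"
record = extract_metadata("datasets/iris.csv", sample)
```

The resulting `record` is the kind of document that could be indexed as "dataset features"; field descriptions would need an additional source, such as the dataset's documentation.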
Overview
Data Lake
Motivation (1)
• Organizing, indexing and ranking Datasets:
• Produced by individual data pipelines
• On Data Lake(s)
Motivation (2)
• Produce a ranking function for datasets that are generated as
part of running data pipelines.
• Extract “relevance judgements” and use them to bootstrap a
dataset ranking model (a posteriori vs. post hoc; Halevy et al., 2016).
• Refine the model in a feedback loop that leverages click-through data on
dataset profile pages.
Organizing, Indexing and Ranking Datasets
Architecture diagram (labels):
• Input data
• Data Pipeline
• Index dataset features
• ES Cluster
• Search
• Feedback logging
• Relevance judgements
• Create dataset profile pages.
• Using Tableau JavaScript API
• Dataset profile
• Team Dashboard
• Fetch dataset from Datalake.
• Fetch dataset from Datalake and run Data Pipeline on demand.
How do you rank datasets?
Ranking datasets
• Extraction of “relevance judgements” can be built into the data pipeline
for specific datasets, immediately after they are generated.
• The judgements are used to produce a ranking function for datasets.
• They are leveraged for training and to bootstrap a dataset ranking model
(a posteriori vs. post hoc; Halevy et al., 2016).
• Click-through data provides implicit feedback useful to adjust initial
relevance judgements.
• Leveraging click-through data on dataset profile pages.
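One way click-through data could adjust an initial judgement is sketched below. This is a simple shrinkage heuristic chosen for illustration, not the method used in the talk; the 0 to 4 grade scale, the `min_impressions` threshold, and the blending rule are all assumptions.

```python
def adjust_grade(initial_grade, clicks, impressions,
                 min_impressions=20, max_grade=4):
    """Blend a bootstrap grade with the observed click-through rate (CTR).

    With few impressions the initial grade is kept; as evidence accumulates,
    the grade moves toward CTR scaled to the grade range.
    """
    if impressions < min_impressions:
        return initial_grade  # too little evidence; keep the bootstrap grade
    ctr = clicks / impressions
    click_grade = ctr * max_grade
    # The weight grows with the number of impressions but never reaches 1.
    weight = impressions / (impressions + min_impressions)
    blended = (1 - weight) * initial_grade + weight * click_grade
    return round(blended)


kept = adjust_grade(4, clicks=1, impressions=10)       # below the threshold
demoted = adjust_grade(4, clicks=0, impressions=180)   # shown often, never clicked
promoted = adjust_grade(1, clicks=60, impressions=60)  # clicked every time
```

Raw click-through rate is biased by result position, so a production feedback loop would likely need a click model or interleaving rather than this direct blend.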
A posteriori vs. Post-hoc
• Halevy et al. (2016) advocate finding data in a post-hoc manner by collecting
and aggregating metadata after datasets are created or updated.
• We propose a so-called “a posteriori” approach where metadata is generated
as part of running pipelines using Spark.
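The "a posteriori" idea, where metadata is produced by the pipeline run itself, can be sketched with a wrapper around a pipeline step. The sketch uses plain Python instead of Spark to stay self-contained; `metadata_index`, `emits_metadata`, and the example step are hypothetical names, and the in-memory list stands in for an Elasticsearch index.

```python
import functools

metadata_index = []  # stand-in for an Elasticsearch index (assumption)


def emits_metadata(step):
    """Wrap a pipeline step so its metadata is recorded as soon as it finishes."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        name, rows = step(*args, **kwargs)
        metadata_index.append({
            "dataset": name,
            "produced_by": step.__name__,
            "row_count": len(rows),
            "fields": sorted(rows[0].keys()) if rows else [],
        })
        return name, rows
    return wrapper


@emits_metadata
def aggregate_transfers(transfers):
    """Hypothetical pipeline step: total amount per country."""
    totals = {}
    for t in transfers:
        totals[t["country"]] = totals.get(t["country"], 0) + t["amount"]
    rows = [{"country": c, "total": v} for c, v in sorted(totals.items())]
    return "transfers_by_country", rows


name, rows = aggregate_transfers([
    {"country": "GT", "amount": 50},
    {"country": "MX", "amount": 20},
    {"country": "GT", "amount": 25},
])
```

A post-hoc system would instead crawl the stored output later and reconstruct the same record without knowing which step produced it.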
• Halevy et al. (2016) advocate indexing after the fact.
• We prefer indexing immediately after the fact.
Pros (immediately after the fact)
• “Relevance judgements” can be extracted and
leveraged to bootstrap a ranked dataset index.
• The judgements can be refined in a feedback loop leveraging click-through
data on dataset profile pages.
• More granular metrics are available to evaluate
metadata regeneration.
Cons (immediately after the fact)
• Offline model development is disconnected and only
indirectly part of the feedback loop that uses click-through data.
• Looking at trees instead of the forest.
• Need to replay the indexing pipeline when things change
(per data pipeline).
Demo!
Demo Scenario
• Movies represent datasets.
• TMDB movie pages represent dataset profile pages.
• The Marketing team (also called the WAR team) is interested
in War movies (datasets).
Movie 1
https://www.themoviedb.org/movie/7555-rambo
Movie 2
https://www.themoviedb.org/movie/1370-rambo-iii
Movie 3
https://www.themoviedb.org/movie/1369-rambo-first-blood-part-ii
judgement file
Rambo
Rambo III
Rambo: First Blood Part II
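A judgement file for these three movies might look like the output of the sketch below. The line layout loosely follows the RankLib-style convention (grade, query id, then a comment holding the document id and title); the document ids come from the TMDB URLs above, while the grades, the 0 to 4 scale, and the query id are assumptions, since the slide does not show them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    grade: int   # 0 (irrelevant) to 4 (highly relevant); the scale is an assumption
    qid: int     # query id, e.g. one id per team interest such as "war movies"
    doc_id: str  # TMDB movie id, standing in for a dataset id
    title: str

    def to_line(self):
        """Serialize as '<grade> qid:<qid> # <doc_id> <title>' (tab separated)."""
        return f"{self.grade}\tqid:{self.qid}\t# {self.doc_id}\t{self.title}"


def parse_line(line):
    """Parse a line produced by Judgement.to_line back into a Judgement."""
    head, _, comment = line.partition("#")
    grade, qid = head.split()
    doc_id, _, title = comment.strip().partition("\t")
    return Judgement(int(grade), int(qid.split(":")[1]), doc_id, title)


# Grades below are hypothetical placeholders.
judgements = [
    Judgement(4, 1, "7555", "Rambo"),
    Judgement(4, 1, "1370", "Rambo III"),
    Judgement(4, 1, "1369", "Rambo: First Blood Part II"),
]
lines = [j.to_line() for j in judgements]
```

Before training, each line would also need feature values logged for the query and document pair; that step is omitted here.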
What have we seen?
• How to rank datasets on Elasticsearch using LTR.
• Extract relevance judgements immediately after
datasets are generated in Spark.
• Demo: Dataset Search with Spark and
Elasticsearch LTR.
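As a recap of the ranking step, a search request that re-ranks results with a trained model could be built as sketched below. This assumes the `sltr` query provided by the Elasticsearch Learning to Rank plugin, used inside a rescore phase; the field name `title`, the parameter name `keywords`, and the model name are hypothetical and would have to match the stored feature set and model.

```python
import json


def ltr_search_body(keywords, model_name, window_size=100):
    """Build a search body: a baseline match query, rescored by an LTR model."""
    return {
        "query": {"match": {"title": keywords}},
        "rescore": {
            # Only the top `window_size` baseline hits are re-ranked.
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "sltr": {
                        "params": {"keywords": keywords},
                        "model": model_name,
                    }
                }
            },
        },
    }


body = ltr_search_body("rambo", "dataset_ranker")
payload = json.dumps(body)
```

Rescoring keeps the model off the full index: the baseline query selects candidates cheaply and the learned model only orders the top window.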
Next Steps (1)
• Describe Datasets in a structured schema.org way
using Data Catalog Vocabulary [2].
• Build a knowledge graph and use GraphX to extract insights.
(Useful e.g. for column concept determination (Deng et al. 2013)).
• Build topic models based on structured Datasets using
Glint to perform scalable topic model extraction in Spark
(Jagerman and Eickhoff, 2016) [1].
[1] https://spark-summit.org/eu-2016/events/glint-an-asynchronous-parameter-server-for-spark/
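The first next step, describing datasets in a structured schema.org way, could look like the JSON-LD produced by this sketch. It uses the schema.org `Dataset`, `DataDownload`, and `PropertyValue` types; the helper name, the example URL, and the field list are hypothetical.

```python
import json


def describe_dataset(name, description, url, fields):
    """Return a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "distribution": {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": url,
        },
        "variableMeasured": [
            {"@type": "PropertyValue", "name": field} for field in fields
        ],
    }


# Hypothetical dataset and URL, used only to show the shape of the output.
doc = describe_dataset(
    "iris",
    "Measurements of iris flowers.",
    "https://example.org/datasets/iris.csv",
    ["Sepal.Length", "Species"],
)
jsonld = json.dumps(doc, indent=2)
```

Embedding such a document in a dataset profile page would make the same metadata usable by both the internal index and external crawlers.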
References
• Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. Goods: Organizing Google’s datasets. In Fatma Özcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 795–806. ACM, 2016. ISBN 978-1-4503-3531-7. doi: http://doi.acm.org/10.1145/2882903.2903730.
• Katja Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, Informatics Institute, University of Amsterdam, May 2013.
• Rolf Jagerman and Carsten Eickhoff. Web-scale topic models in Spark: An asynchronous parameter server. CoRR, abs/1605.07422, 2016. URL http://arxiv.org/abs/1605.07422.
• Dong Deng, Yu Jiang, Guoliang Li, Jian Li, and Cong Yu. Scalable column concept determination for web tables using large knowledge bases. PVLDB, 6(13):1606–1617, 2013. URL http://www.vldb.org/pvldb/vol6/p1606-li.pdf.
• Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. Multileave gradient descent for fast online learning to rank. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 457–466. ACM, February 2016.
• Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin Rostamizadeh, Warren Shen, Kenneth Wilder, Fei Wu, and Cong Yu. Applying WebTables in practice. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015.
Q&A
Thank You.
Email: ocastaneda@paypal.com
Twitter: @oscar_castaneda