Lightweight Collection and Storage of Software Repository Data with DataRover
Thomas Kowark, Christoph Matthies, Matthias Uflacker and Hasso Plattner
HPI, Enterprise Platform and Integration Concepts Chair, Potsdam, Germany
ASE 2016 Demo Track
September 5th
Christoph Matthies
Sep 5
DataRover
Background: Collecting Software Repository Data
[Diagram: development teams use a collaboration infrastructure (wiki, version control, issue tracker, CI server, …). ETL software extracts, transforms, and loads its data into an interlinked data set used by MSR* researchers.]
* MSR – Mining Software Repositories
● How do teams develop software?
● What separates good from bad teams?
● How are we doing as a team?
Related Work
■ Plugin/service-based architectures
□ One plugin/service per data source
□ Custom data schema
□ Alitheia-Core [Gousios et al., 2009], SOFAS [Ghezzi, 2012], SonarQube
■ Graphical ETL tools
□ Plugin for each data source connection
□ Visual creation of ETL processes
□ RapidMiner, KNIME
■ Collections of repository data
□ Pre-collected, cleansed, and interlinked data sets
□ Boa [Dyer et al., 2013] with custom query language
□ GHTorrent [Gousios, 2013 and ongoing], StackExchange dumps
Common Issues
■ Why doesn’t this mining tool support my new/updated data source?
□ “The development team has migrated to GitLab”
■ How are the peculiarities of my project reflected in the standard data schema and analyses?
□ “We use JIRA with custom fields”
■ Can I store this data in a graph or document database to perform network analyses or text mining?
□ “Neo4j already offers the graph algorithms that I need.”
□ “All my existing queries rely on MySQL.”
Lightweight Data Collection: DataRover
■ Goals
□ Minimal implementation effort for each data source
□ Separate collection and linking
□ Reuse existing implementations whenever possible
□ Allow focus on linking and analysis, not data collection
■ Concepts
□ Collection: Explorer (OAuth, query parameters) => JSON
– Stack Overflow client: ~12 LoC + logging
□ Linking: Define generic mappings using a GUI
– Map JSON attributes to links, new nodes, or node values
□ Storage: Graph database (Neo4j)
– No explicit database schema; connections are easily added at runtime
Data Collection: Explorers
https://bitbucket.org/tkowark/data-rover/src/b37e79847a7b08a604688133834a0592b9320b57/app/models/explorers/stackoverflow_explorer.rb
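The explorer idea (query parameters in, JSON out) can be sketched in a few lines. This is an illustrative Python sketch, not the DataRover implementation, which is written in Ruby. The `Explorer` class, its method names, the `fake_fetch` helper, and the `api.example.org` endpoint are all assumptions made for the example, and OAuth handling is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


class Explorer:
    """Hypothetical explorer: builds a query URL and returns parsed JSON."""

    def __init__(self, base_url, params=None, fetch=None):
        self.base_url = base_url
        self.params = dict(params or {})
        # The fetch function is injectable so the sketch runs without network access.
        self.fetch = fetch or self._http_get

    @staticmethod
    def _http_get(url):
        # Plain HTTP GET; only used when no fetch function is injected.
        with urlopen(url) as response:
            return response.read().decode("utf-8")

    def url(self, **extra):
        # Merge default and per-call parameters; sort them for a stable URL.
        query = urlencode(sorted({**self.params, **extra}.items()))
        return f"{self.base_url}?{query}" if query else self.base_url

    def explore(self, **extra):
        return json.loads(self.fetch(self.url(**extra)))


def fake_fetch(url):
    # Canned response standing in for a real API call.
    return json.dumps({"requested": url,
                       "items": [{"user_id": 1, "display_name": "alice"}]})


explorer = Explorer(
    "https://api.example.org/users",  # placeholder endpoint, not a real API
    params={"site": "stackoverflow", "pagesize": 100},
    fetch=fake_fetch,
)
result = explorer.explore(page=1)
print(result["requested"])
print(result["items"][0]["display_name"])
```

Keeping the data-source-specific part down to a base URL and a parameter dictionary is what makes a new explorer cheap to write under this design.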
From JSON to Property Graphs
■ Mappings: define transformations of JSON to a property graph
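A mapping of this kind can be pictured as a small declarative structure that says which JSON attributes become node values, which become new nodes, and which relation links them. The sketch below is a guess at the general shape, not DataRover's mapping format. The `Node` and `Graph` classes, the `QUESTION_MAPPING` layout, and the `asked_by` relation name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    key: str
    props: dict = field(default_factory=dict)


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # (label, key) -> Node
    edges: set = field(default_factory=set)    # (source id, relation, target id)

    def node(self, label, key):
        # Return the existing node for this label and key, or create it.
        return self.nodes.setdefault((label, str(key)), Node(label, str(key)))


# Hypothetical mapping: listed attributes become node values or new nodes,
# attributes that are not listed are ignored.
QUESTION_MAPPING = {
    "label": "Question",
    "key": "question_id",
    "values": ["title", "score"],
    "nodes": {
        "owner": {
            "label": "StackoverflowUser",
            "key": "user_id",
            "values": ["display_name"],
            "relation": "asked_by",
        }
    },
}


def apply_mapping(graph, record, mapping):
    """Turn one JSON record into nodes and edges according to the mapping."""
    node = graph.node(mapping["label"], record[mapping["key"]])
    for attr in mapping.get("values", []):
        if attr in record:
            node.props[attr] = record[attr]
    for attr, sub in mapping.get("nodes", {}).items():
        if attr in record:
            child = apply_mapping(graph, record[attr], sub)
            graph.edges.add(((node.label, node.key), sub["relation"],
                             (child.label, child.key)))
    return node


graph = Graph()
record = {
    "question_id": 42,
    "title": "How do I parse JSON?",
    "score": 3,
    "owner": {"user_id": 7, "display_name": "alice"},
    "tags": ["json"],  # not in the mapping, so it is dropped
}
apply_mapping(graph, record, QUESTION_MAPPING)
print(sorted(graph.nodes))
print(graph.edges)
```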
Linking Data
■ Linking performed by attribute equality
□ New relation indicating node similarity
□ Node merging in case of equal node types
■ For the Ruby on Rails GitHub repo: 2320 of 3075 users (about 75%) found in SO data
[Diagram: StackoverflowUser and GithubUser nodes linked by a same_as relation]
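The two linking rules can be sketched as follows: nodes that share an attribute value are merged when their types are equal and connected by a `same_as` relation when their types differ. This is a minimal sketch under assumptions. The node dictionaries, the function name `link_by_attribute`, and the choice of `email` as the matching attribute are illustrative, and the slides do not say which attribute was used for the Rails data.

```python
from collections import defaultdict


def link_by_attribute(nodes, attribute):
    """Return (remaining nodes, same_as links) after linking on attribute equality.

    Note: surviving nodes are updated in place when duplicates are merged.
    """
    groups = defaultdict(list)
    for node in nodes:
        value = node["props"].get(attribute)
        if value is not None:
            groups[value].append(node)

    merged_away = set()
    links = []
    for group in groups.values():
        representatives = {}  # node type -> surviving node of that type
        for node in group:
            survivor = representatives.setdefault(node["type"], node)
            if survivor is not node:
                # Equal node types: merge properties into the surviving node.
                for key, value in node["props"].items():
                    survivor["props"].setdefault(key, value)
                merged_away.add(node["id"])
        # Different node types: relate the survivors with a same_as link.
        survivors = list(representatives.values())
        for i, left in enumerate(survivors):
            for right in survivors[i + 1:]:
                links.append((left["id"], "same_as", right["id"]))

    remaining = [n for n in nodes if n["id"] not in merged_away]
    return remaining, links


nodes = [
    {"id": "gh:1", "type": "GithubUser",
     "props": {"email": "a@example.org", "login": "alice"}},
    {"id": "so:9", "type": "StackoverflowUser",
     "props": {"email": "a@example.org", "reputation": 120}},
    {"id": "gh:2", "type": "GithubUser",
     "props": {"email": "a@example.org", "company": "ACME"}},
    {"id": "gh:3", "type": "GithubUser",
     "props": {"email": "b@example.org"}},
]
remaining, links = link_by_attribute(nodes, "email")
print([n["id"] for n in remaining])
print(links)
```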
Storing Property Graphs
■ Export constructed interlinked graph
□ Reuse existing analysis
□ Use the technology you like / are most proficient in
■ Graph databases
□ Store the graph as-is
■ Relational databases
□ One table per node class
□ Separate relation tables
■ Document stores
□ One collection per node class
□ Links as properties or using internal document IDs
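The relational layout (one table per node class, separate relation tables) can be illustrated with Python's built-in `sqlite3` module. This is a sketch of the layout only, not DataRover's exporter. The function name, the node dictionary shape, and the `authored_by` relation are assumptions. Table and column names are interpolated directly into the SQL, which is acceptable for a sketch but would need validation in real use.

```python
import sqlite3


def export_to_sqlite(conn, nodes, edges):
    """Create one table per node class and one table per relation type."""
    by_class = {}
    for node in nodes:
        by_class.setdefault(node["class"], []).append(node)

    for cls, members in by_class.items():
        # Columns are the union of all property names seen for this class.
        columns = sorted({key for n in members for key in n["props"]})
        col_defs = "".join(f', "{c}"' for c in columns)
        conn.execute(f'CREATE TABLE "{cls}" (id TEXT PRIMARY KEY{col_defs})')
        placeholders = ", ".join("?" for _ in range(len(columns) + 1))
        for n in members:
            row = [n["id"]] + [n["props"].get(c) for c in columns]
            conn.execute(f'INSERT INTO "{cls}" VALUES ({placeholders})', row)

    for source, relation, target in edges:
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{relation}" (source_id TEXT, target_id TEXT)'
        )
        conn.execute(f'INSERT INTO "{relation}" VALUES (?, ?)', (source, target))


nodes = [
    {"id": "u1", "class": "User", "props": {"login": "alice"}},
    {"id": "u2", "class": "User", "props": {"login": "bob"}},
    {"id": "c1", "class": "Commit", "props": {"message": "Fix typo"}},
    {"id": "c2", "class": "Commit", "props": {"message": "Add tests"}},
]
edges = [("c1", "authored_by", "u1"), ("c2", "authored_by", "u2")]

conn = sqlite3.connect(":memory:")
export_to_sqlite(conn, nodes, edges)
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
print(tables)
```

A document-store export would follow the same grouping step and write one collection per node class instead of one table.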
Evaluation (ongoing)
■ Only storing what you really need
□ Rails commit data w/o file changes (58k commits, 3k users)
□ Example query: number of commits performed by each user
■ Future Work
□ User study (mapping creation time, error-proneness, clarity, etc.)
□ Measuring data import times for large datasets
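The "store only what you need" idea and the example query can be shown on toy data: file changes are dropped from each commit record before storage, and the commits are then counted per user. The attribute names (`sha`, `author`, `date`, `files`) and the sample records are invented for illustration and are not taken from the Rails data set.

```python
from collections import Counter

KEEP = ("sha", "author", "date")  # hypothetical attribute names


def slim(commit):
    """Keep only the attributes needed for the analysis, dropping file changes."""
    return {key: commit[key] for key in KEEP if key in commit}


raw_commits = [
    {"sha": "a1", "author": "alice", "date": "2016-01-04",
     "files": ["app/models/user.rb"]},
    {"sha": "b2", "author": "bob", "date": "2016-01-05",
     "files": ["README.md", "Gemfile"]},
    {"sha": "c3", "author": "alice", "date": "2016-01-06", "files": []},
]

commits = [slim(c) for c in raw_commits]
commits_per_user = Counter(c["author"] for c in commits)
print(commits_per_user.most_common())
```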
Summary
■ DataRover
□ Lightweight data collection: only the querying has to be coded
□ Minimalistic data sets tailored to specific use cases
□ Ease of mapping creation, visualized mappings
□ Data linkage
□ Storage in different target databases
■ Try it: http://bitbucket.org/tkowark/data-rover (MIT license)
□ Screencast: https://www.youtube.com/watch?v=mt4ztff4SfU
□ Sample datasets: https://bit.ly/kowark-ase-16-data
Picture Sources
■ web developer by Hugo Alberto from the Noun Project
■ Communication by Role Play from the Noun Project
■ Browser by icon 54 from the Noun Project
■ Mars Rover by LA Hall from the Noun Project
■ discussion by Milka Dahan from the Noun Project

