Analyzing Rich-Club Behavior
in Open Source Projects
OpenSym 2019, the 15th International Symposium on Open Collaboration
Skövde, Sweden
Mattia Gasparini (1), Javier Luis Cànovas Izquierdo (2),
Robert Clarisò (2), Marco Brambilla (1), Jordi Cabot (2)
(1) Politecnico di Milano, (2) Universitat Oberta de la Catalunya
Introduction
• Use Git and GitHub data to analyze the evolution,
success and management of Open Source
Software.
• Define developers' behavioral patterns.
• Discover how collaborations between
developers work.
Problem
Statement
ANALYSIS OF
COLLABORATION
NETWORKS
COMMITS, ISSUES AND
PULL REQUESTS AS
SOURCES
DISCOVER PRESENCE OF
SPECIFIC COLLABORATION
STRUCTURES: RICH-CLUBS
Rich-club coefficient
• Graph structural property:
It represents the tendency of well-connected nodes (i.e., hubs) to interact with other
well-connected nodes.
• Formulation:
\[
\phi(k) = \frac{2 E_k}{N_k (N_k - 1)}
\qquad
\rho(k) = \frac{\phi(k)}{\phi_{\mathrm{random}}(k)}
\]
$E_k$: number of edges between nodes of degree greater than or equal to $k$
$N_k$: number of nodes with degree greater than or equal to $k$
$\phi(k)$: rich-club coefficient
$\rho(k)$: normalized rich-club coefficient
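As a minimal sketch of the formulation above, the following pure-Python function evaluates the rich-club coefficient for a single threshold k. It assumes an undirected simple graph given as a list of (u, v) pairs and follows the slide's convention that the club contains every node of degree at least k. The function name and the toy graph are illustrative and do not come from the original study.

```python
from collections import defaultdict


def rich_club_coefficient(edges, k):
    """phi(k) for an undirected simple graph given as (u, v) pairs.

    Slide convention: the club is every node with degree >= k.
    Returns None when the club has fewer than two nodes.
    """
    # Deduplicate edges and drop self-loops so degrees are well defined.
    edge_set = {frozenset((u, v)) for u, v in edges if u != v}

    degree = defaultdict(int)
    for edge in edge_set:
        for node in edge:
            degree[node] += 1

    club = {node for node, d in degree.items() if d >= k}
    n_k = len(club)
    if n_k < 2:
        return None  # phi(k) is undefined: no pair of club members exists

    # E_k: edges whose two endpoints are both club members.
    e_k = sum(1 for edge in edge_set if edge <= club)
    return 2 * e_k / (n_k * (n_k - 1))


# Toy graph (made-up data): a triangle a-b-c plus a pendant node d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(rich_club_coefficient(toy_edges, 2))  # prints 1.0: the club {a, b, c} is a clique
```

For k = 1 the same toy graph gives 2·4 / (4·3) ≈ 0.67, since the pendant node joins the club without adding club-internal density.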
Related Work
• Rich-club phenomenon for a specific project [2],
or for a single FLOSS community [3].
• Study of the presence of a rich-club effect
across the whole GitHub social network [4].
• Analysis on open source communities exploiting
email exchanges among participants [5].
[2] Weifeng Pan, Bing Li, Yutao Ma, and Jing Liu. 2011. Multi-granularity evolution analysis of software using complex network theory
[3] Guido Conaldi. 2010. Flat for the few, steep for the many: Structural cohesion and Rich-Club effect as measures of hierarchy and control in FLOSS communities
[4] Antonio Lima, Luca Rossi, and Mirco Musolesi. 2014. Coding Together at Scale: GitHub as a Collaborative Social Network
[5] Sergi Valverde and Ricard V. Solé. 2007. Self-organization versus hierarchy in open-source social networks
Case Study
Top-100 starred projects on GitHub in 2016
926K commits produced by 50K Git users
1.3M issue-related events generated by
118K GitHub users
280K pull-request-related events
generated by 20K GitHub users
Analysis Pipeline
Data Collection &
Preprocessing
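• Cloning of Git repositories to
extract commit data, using Gitana
• GitHub issue and PR activity
retrieved by querying
GHArchive
• Duplicity and clashing problem:
different names for the same user (duplicity)
and the same name for different users (clashing)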
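The speaker notes for this slide mention using the commit SHA to associate Git commits with GitHub users. A rough sketch of that idea follows; the record layouts, the function name, and every value in the toy data are hypothetical, and the actual Gitana and GHArchive schemas are not reproduced here.

```python
def map_git_authors_to_logins(git_commits, github_commits):
    """Link Git author identities to GitHub logins through shared commit SHAs.

    git_commits:    iterable of (sha, author_name, author_email) from the clone
    github_commits: iterable of (sha, login) observed on the GitHub side
    Both layouts are assumptions made for this sketch.
    """
    login_by_sha = dict(github_commits)

    logins_by_identity = {}
    for sha, name, email in git_commits:
        login = login_by_sha.get(sha)
        if login is None:
            continue  # SHA not found on the GitHub side: leave unresolved
        logins_by_identity.setdefault((name, email), set()).add(login)
    return logins_by_identity


# Made-up records illustrating both problems.
git_commits = [
    ("sha1", "Alex", "alex@one.example"),
    ("sha2", "Alex", "alex@two.example"),      # clashing: same name, different user
    ("sha3", "A. Smith", "alex@one.example"),  # duplicity: same user, different name
    ("sha4", "Ghost", "ghost@none.example"),   # SHA unknown on the GitHub side
]
github_commits = [("sha1", "alex-s"), ("sha2", "alex-t"), ("sha3", "alex-s")]

resolved = map_git_authors_to_logins(git_commits, github_commits)
print(resolved)
```

Keying graph nodes by the resolved login rather than by the Git author name keeps the two "Alex" identities apart and merges "Alex" with "A. Smith".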
Graphs Construction
• Definition of 4 undirected graphs:
a. PR graph
b. Commits graph
c. Issues graph
d. Supergraph (a + b + c)
• Nodes: users
• Edges connect a pair of users if
they interacted on the same
element (issue, PR, file)
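The construction rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: interactions are assumed to arrive as (user, element) pairs, a graph is represented as a set of undirected edges, and the supergraph is taken to be the plain union of the three edge sets. All names and records below are made up.

```python
from itertools import combinations


def build_graph(interactions):
    """Return a set of undirected edges from (user, element) records.

    Two users are linked when they interacted on the same element.
    """
    users_by_element = {}
    for user, element in interactions:
        users_by_element.setdefault(element, set()).add(user)

    edges = set()
    for users in users_by_element.values():
        for u, v in combinations(sorted(users), 2):
            edges.add(frozenset((u, v)))
    return edges


# Made-up interaction records for one project.
pr_events = [("ana", "PR#1"), ("bo", "PR#1"), ("cy", "PR#2"), ("ana", "PR#2")]
commit_events = [("bo", "src/app.js"), ("cy", "src/app.js")]
issue_events = [("ana", "issue#7"), ("dee", "issue#7")]

pr_graph = build_graph(pr_events)
commits_graph = build_graph(commit_events)
issues_graph = build_graph(issue_events)

# Supergraph (a + b + c), read here as the union of the three edge sets.
supergraph = pr_graph | commits_graph | issues_graph
print(len(supergraph))  # prints 4
```

A user who never shares an element with anyone produces no edge and so does not appear in this representation; keeping a separate node set would be needed if isolated users matter.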
Graphs Example
[Figure: Materialize project graphs: (a) PR graph, (b) commits graph, (c) issues graph, (d) supergraph]
Rich-club Coefficient
Calculation
• Calculation using the algorithm
implementation included in
NetworkX [6]
• Normalized coefficient
ρ(k): rich-club effect
relevant if ρ(k) > 1
• Discard networks for which
randomization fails
[6] https://networkx.github.io/documentation/stable/reference/algorithms/rich_club.html
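To make the normalization step concrete, here is a hand-written, pure-Python sketch. It is not the NetworkX implementation used in the study: it randomizes the graph with degree-preserving double edge swaps and divides the original coefficient by the randomized one. The function names, the `swaps_per_edge` parameter, the toy graph, and the reading of "randomization fails" (too few successful swaps, or a zero coefficient in the randomized graph) are all assumptions.

```python
import random
from collections import Counter


def phi_by_degree(edges):
    """phi(k) for every k whose club (degree >= k) has at least 2 nodes."""
    degree = Counter(node for edge in edges for node in edge)
    phi = {}
    for k in range(1, max(degree.values(), default=0) + 1):
        club = {n for n, d in degree.items() if d >= k}
        if len(club) < 2:
            break
        e_k = sum(1 for edge in edges if edge <= club)
        phi[k] = 2 * e_k / (len(club) * (len(club) - 1))
    return phi


def degree_preserving_shuffle(edges, swaps, max_tries, rng):
    """Rewire a-b, c-d into a-d, c-b; None if too few swaps succeed."""
    edge_list = sorted(tuple(sorted(edge)) for edge in edges)
    if len(edge_list) < 2:
        return None
    present = set(edges)
    done = 0
    for _ in range(max_tries):
        if done == swaps:
            break
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible rewirings
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if len({a, b, c, d}) < 4 or new1 in present or new2 in present:
            continue  # would create a self-loop or a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return present if done == swaps else None


def normalized_rich_club(edges, swaps_per_edge=10, seed=0):
    """rho(k) = phi(k) / phi_random(k), or None when normalization fails."""
    rng = random.Random(seed)
    swaps = swaps_per_edge * len(edges)
    shuffled = degree_preserving_shuffle(edges, swaps, 10 * swaps, rng)
    if shuffled is None:
        return None  # randomization failed: discard this network
    phi, phi_rand = phi_by_degree(edges), phi_by_degree(shuffled)
    if any(phi_rand.get(k, 0) == 0 for k in phi):
        return None  # rho(k) undefined for some k: discard as well
    return {k: phi[k] / phi_rand[k] for k in phi}


# Toy graph (made-up): a 4-node clique, 3 spokes per clique node, outer ring.
core = [0, 1, 2, 3]
toy = {frozenset((u, v)) for u in core for v in core if u < v}
toy |= {frozenset((c, 4 + 3 * c + i)) for c in core for i in range(3)}
toy |= {frozenset((4 + i, 4 + (i + 1) % 12)) for i in range(12)}
print(normalized_rich_club(toy))
```

With NetworkX installed, `networkx.rich_club_coefficient(G, normalized=True)` covers the same ground. Its documentation defines the club as nodes with degree larger than k, so its indices are shifted by one relative to the "greater or equal" definition used in these slides.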
Rich-club Coefficient
Results
• 60 projects have a defined
coefficient for the
supergraph.
• Each graph presents a rich-club
effect, since ρ(k) > 1
for some k
Materialize [7]:
Rich-Club
Supergraph
Coefficient
The normalized coefficient peaks at k = 49:
the rich-club effect is strongest among
nodes of degree at least 49.
[7] https://materializecss.com
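Reading the peak off a coefficient curve and listing the corresponding club can be sketched as follows. The helper name and the numbers are invented for illustration and are not the Materialize measurements; the club is taken as every node of degree at least the peak k, matching the slide's wording.

```python
def peak_club(rho, degree):
    """Return (k_star, members): k_star maximizes rho(k), members have degree >= k_star."""
    k_star = max(rho, key=rho.get)
    members = sorted(n for n, d in degree.items() if d >= k_star)
    return k_star, members


# Made-up normalized coefficients and node degrees.
rho = {1: 1.0, 2: 1.1, 3: 1.4, 4: 0.9}
degree = {"ana": 4, "bo": 3, "cy": 3, "dee": 1}

print(peak_club(rho, degree))  # prints (3, ['ana', 'bo', 'cy'])
```

If several k share the maximum, `max` returns the first one encountered, so a tie-breaking rule should be chosen explicitly when that matters.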
Materialize:
Supergraph
Swift [8]:
Rich-Club
Supergraph
Coefficient
[8] https://swift.org/
Swift:
Supergraph
Rich-club Coefficient Results
Maximum coefficient distribution
• Distribution of the maximum
rich-club coefficient for each
type of graph across the studied
projects.
• Mean value around 1 for the issues
and commits graphs:
weak rich-club presence.
• Mean value around 1.4 for the PR
graphs: strong rich-club presence.
Further insights
Multi-club users
• 25 of the 60 projects have a set
of users belonging to multiple rich-clubs.
• Distribution of multi-club users
across the 25 projects.
• These developers form a sub-community
with strong influence at every level
of the project.
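Finding multi-club users reduces to counting in how many clubs each user appears. The sketch below assumes the club membership of each graph type has already been extracted as a set of user names; the function name, the `min_clubs` parameter, and the toy memberships are hypothetical.

```python
from collections import Counter


def multi_club_users(clubs, min_clubs=2):
    """Users that belong to the rich-club of at least `min_clubs` graphs."""
    counts = Counter(user for members in clubs.values() for user in set(members))
    return {user for user, n in counts.items() if n >= min_clubs}


# Made-up club memberships for one project.
clubs = {
    "pr": {"ana", "bo"},
    "commits": {"ana", "cy"},
    "issues": {"ana", "bo", "dee"},
}

print(sorted(multi_club_users(clubs)))               # prints ['ana', 'bo']
print(sorted(multi_club_users(clubs, min_clubs=3)))  # prints ['ana']
```

Setting `min_clubs` to the number of graph types restricts the result to users present in every club, which is the stricter reading of "multi-club".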
Conclusions
First systematic evaluation of rich-club
behavior in open source projects:
• 60% of the projects show rich-clubs in the
supergraph, mostly with a slight effect.
• Rich-club behavior could undermine the open
paradigm, but the phenomenon requires further
analysis.
• The strong rich-club presence in PR graphs may
be due to the criticality of that activity.
• 25 of the 60 projects have users belonging to
multiple rich-clubs.
Future Work
Weighted rich-club
coefficient
Rich-club effect at module
and ecosystem level
Time dimension to
highlight temporal clubs
Questions?


Editor's Notes

  • #3: GitHub is the most popular service to develop and maintain open source software. Each user interacts with many other users in the project development process (commits, issues, PRs), defining collaboration networks. Studying collaboration networks helps in discovering properties and behaviors that influence the development, management and success of an OSS project.
  • #4: Developers collaborate mostly with the same fixed subset of other important colleagues, instead of spreading the cooperation across every member of the team.
  • #5: Formally, it can be measured by the so-called rich-club coefficient ϕ(k). Intuitively, ϕ(k) measures how far the set of nodes with degree at least k is from being a complete subgraph. The value of ϕ(k) ranges from 0 (all nodes are disconnected) to 1 (a clique), with higher values showing a stronger rich-club behavior in the network. It is monotonically increasing even for random networks, so a normalized coefficient has been introduced in the literature: ϕ(k) is divided by the coefficient calculated for a random network with the same degree distribution as the original one.
  • #6: The presence or absence of rich-clubs in open source projects has not been studied in a systematic way, nor on a dataset as large as the one that GitHub can now provide.
  • #9: Clashing: the same name is used by different users. Duplicity: different names are used for the same user. Solution: use the commit SHA value to associate Git commits with GitHub users (if still present).
  • #11: Two users are connected in the PR graph if they commented/interacted on the same PR…
  • #13: Calculation of the rich-club coefficient is run for each project’s supergraph to have a global view of the effect. The maximum value for each project is shown: each of the 60 graphs presents rich-club behavior, even if most of them have values only slightly higher than 1. For this reason, we want to better understand the correspondence between the coefficient and the actual graphs.
  • #14: The first example that we take is the Materialize repository: the rich-club coefficient is presented with respect to node degree. A rich-club behavior is noticeable for a range of degrees, with a peak at k=49, which should correspond to groups of nodes with degree at least 49 connected to each other.
  • #15: This seems to go against the open source paradigm: the project is “owned” by a few users. It was established in 2014 by a team of 4 developers and has 3,853 commits and 252 contributors. Nevertheless, the project has only two top contributors (more than 1,000 commits), both belonging to the original team, and no frequent contributors.
  • #16: Mixed behavior: slightly above 1, then dramatically lower. The overall intuition is that the graph does not present rich-clubs.
  • #17: It was publicly announced by Apple in 2014 and was later open sourced in December 2015. Currently, the project has more than 84k commits and 674 contributors, with 14 top contributors (more than 1,000 commits) and 44 frequent contributors (between 100 and 1,000 commits). Remarkably, 4 of the top contributors and 21 of the frequent contributors do not belong to Apple according to their GitHub profile. This is a sign that the project has successfully attracted and retained external talent.
  • #18: In this table, the 10 projects with the highest coefficient for the supergraph are presented. Along with them, the coefficient for the other kinds of graphs is calculated when possible. In fact, these other graphs can also “hide” other club structures.
  • #19: Maximum coefficient distribution for each kind of graph, as a further insight. The blue line is the one already discussed. The green and orange lines show the maximum coefficient distribution for commits and issues: the density peaks at 1, meaning that most of the graphs do not present strong rich-clubs. The red line peaks around 1.4: most of the projects present evident rich-club structures. This behavior could be related to the fact that PRs are the most critical level in open-source software development and a few trustworthy developers are in charge of most of the tasks.
  • #20: We also focused our attention on the users: 25 of the 60 projects (about 42%) have users that belong to multiple clubs. The distribution presents the number of users shared across all the projects’ clubs: this means that, on average, 7 developers are in the PR club, as well as in the commits and issues clubs. These developers form a sub-community inside the project that has strong influence at all of the project’s levels.
  • #22: As the rich-club phenomenon is quite complex and its application to OSS communities is relatively new, plenty of further work can be done. First of all, we want to apply the weighted version of the coefficient to check whether other patterns arise. Second, we want to extend the analysis to the module and ecosystem levels. Third, we want to introduce the time variable: in this work the graphs are built using the entire data as a 1-year snapshot, but it is possible to build monthly graphs and check whether temporal clubs show up.
  • #23: With this, I have concluded the presentation. Thank you for the attention.