GitHub + F/OSS => 1 million SPDX
Large-scale license transparency using open data, open standards and F/OSS 
http://triplecheck.net http://searchcode.com
Speaker 
Slide #2 
Nuno Brito
• Free/open source contributor since 2005
• Wrote 100k lines of F/OSS code in the last 12 months
• SPDX contributor, co-founder of TripleCheck
Around the web
http://nunobrito.eu
Transparency 
Slide #3 
Take some source code as example 
Who developed the code? 
Which licenses are applicable? 
Was the code copied from somewhere else?
Size 
Slide #4 
A problem of scale 
Open licenses? > 300 types to choose from
> 5 million F/OSS projects 
> 100 million source code files
Practice 
Slide #5 
Applying licenses
• Burden on the developer (do it correctly, do enough)
• Expressed differently (difficult to understand)
• Scaling obstacles (scarce automation)
Transparency?
What to do?
Slide #6
Ideally, we'd have tooling that is...
a) Reachable 
b) Cooperative 
c) Free 
Choose two. (sad reality)
Choose three 
Slide #7 
Choose building blocks based on: 
a) Open standards 
b) Open data 
c) Reachable tools 
Learn, write, improve. 
Share.
Standards 
Slide #8 
SPDX: Open standard for software licensing
• Standardizes license descriptions
• Defines IDs for license terms
• http://spdx.org
Pros: Good docs, straightforward, getting better
Cons: Slow adoption, scarce tooling
Open data 
Slide #9 
GitHub: Targeting open data repositories
• API suited for intensive access
• Social coding
• Largest open source code collection
Pros: Reachable, diverse
Cons: Repositories processed one-by-one
Tooling 
Slide #10 
Custom-built tools for software licenses
• Large-scale repository data-mining
• Find applicable licenses inside content
• Share millions of SPDX documents
Pros: Learn by doing, modularized, single language
Cons: Built from scratch, needs consolidation
Step 1 
Slide #11 
Desktop tool/engine to discover licenses
• SPDX format as storage medium
• Identifies copyright and 18 license types
• Written in Java, released in Feb 2014 under the EUPL
http://spdx.org/tools/community/triplecheck-reporter
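A minimal sketch of what "identify copyright and license types" can mean in practice. This is not the TripleCheck engine (which is written in Java); the two patterns, the `scan` function and its output layout are assumptions made for illustration, and a real detector needs many more signatures.

```python
import re

# Hypothetical, heavily simplified signatures keyed by SPDX license ID.
# A production detector would match far more text and many more licenses.
LICENSE_PATTERNS = {
    "Apache-2.0": re.compile(r"Apache License,?\s+Version 2\.0", re.I),
    "MIT": re.compile(r"Permission is hereby granted, free of charge", re.I),
}

# Matches lines such as "Copyright (c) 2014 Jane Doe".
COPYRIGHT = re.compile(
    r"Copyright\s+(?:\(c\)\s*)?(\d{4}(?:\s*-\s*\d{4})?)\s+(.+)", re.I
)


def scan(text):
    """Return the SPDX IDs and copyright holders found in one file's text."""
    licenses = sorted(
        spdx_id for spdx_id, pattern in LICENSE_PATTERNS.items()
        if pattern.search(text)
    )
    holders = [match.group(2).strip() for match in COPYRIGHT.finditer(text)]
    return {"licenses": licenses, "copyright_holders": holders}
```

Calling `scan` on a file header that mentions the Apache License 2.0 and a copyright line would report the `Apache-2.0` ID and the holder's name.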
Desktop 
Slide #12
File detail 
Slide #13
SPDX file 
Slide #14
Customize 
Slide #15
Details 
Slide #16 
Under the hood
• 147 file extensions, 18 license types
• LOC, hashes (SHA1, MD5, SHA256, SSDEEP)
• Command line supported (Jenkins, cron)
• Fast, 40k files/minute (Pentium IV)
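The per-file facts listed above can be sketched with the Python standard library. The function name `file_facts` is hypothetical, LOC is assumed here to mean "all lines, blank ones included", and SSDEEP is left out because it needs a third-party library.

```python
import hashlib


def file_facts(data: bytes) -> dict:
    """Return the line count and three digests for one file's content."""
    return {
        # Assumption: every line counts, including blank and comment lines.
        "loc": len(data.splitlines()),
        "sha1": hashlib.sha1(data).hexdigest(),
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Storing several digests per file makes it possible to match the same file against catalogues that only publish one of them.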
Step 2
Slide #17
Discovering repositories with gitFinder
Create a list of online projects to use as components.
Get basic licensing information from each project.
• Write a text file listing each GitHub user (~7 million)
• For each user, find the repositories that are not forks (~10M)
• Split the repositories by language (197)
• For each language's list of repositories, download the code
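The filtering and splitting steps above can be sketched on in-memory records. This is not gitFinder's code: the record layout (`owner`, `name`, `fork`, `language`) and the function name are assumptions, loosely modelled on the kind of metadata a repository listing returns, and no network access is involved.

```python
from collections import defaultdict


def split_by_language(repos):
    """Drop forks, then group 'owner/name' identifiers by primary language."""
    by_language = defaultdict(list)
    for repo in repos:
        if repo["fork"]:
            continue  # only original (non-forked) repositories are kept
        language = repo.get("language") or "unknown"
        by_language[language].append(f'{repo["owner"]}/{repo["name"]}')
    return dict(by_language)
```

Each resulting list can then be handed to a separate download job, one per language.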
Performance 
Slide #18 
~70k repositories/day
• Single machine (i7, 8 GB RAM, CentOS)
• 9 parallel threads
• Resume/recover supported
• Released in Jun. 2014
https://github.com/triplecheck/gitfinder
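One way to combine parallel threads with resume/recover is a log of finished work that is consulted on every start. The sketch below assumes that approach; `process_all`, the log file format and the `worker` callback are hypothetical and do not describe how gitFinder itself does it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def process_all(repos, worker, done_path, threads=9):
    """Run worker(repo) in parallel, skipping repos finished in earlier runs."""
    done_file = Path(done_path)
    done = set(done_file.read_text().split()) if done_file.exists() else set()
    pending = [repo for repo in repos if repo not in done]

    with ThreadPoolExecutor(max_workers=threads) as pool, \
            done_file.open("a") as log:
        futures = {pool.submit(worker, repo): repo for repo in pending}
        for future in as_completed(futures):
            future.result()  # re-raise any error from the worker
            # Only the main thread writes, so the log needs no locking.
            log.write(futures[future] + "\n")
            log.flush()
    return len(pending)
```

If the process is interrupted, running it again with the same log path only handles the repositories that were not recorded as finished.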
Output 
Slide #19
Storage?
Slide #20
https://what-if.xkcd.com/29/ (CC BY-NC 2.5)
Storage
Slide #21
BigZip: 100+ million files in a single download
• Flat file, zip compression (per entry)
• Fast, simple, portable. Indexed search
https://github.com/triplecheck/big
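The idea of a flat file with per-entry compression and an index can be sketched as follows. This is not the BigZip on-disk format: the class name, the in-memory index and the use of `zlib` are assumptions chosen to keep the example short.

```python
import zlib


class FlatArchive:
    """Append-only single-file store; every entry is compressed on its own."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # entry name -> (offset, compressed length)
        open(path, "ab").close()  # create the archive file if it is missing

    def add(self, name, data: bytes):
        blob = zlib.compress(data)
        with open(self.path, "ab") as archive:
            offset = archive.seek(0, 2)  # position of the end of the file
            archive.write(blob)
        self.index[name] = (offset, len(blob))

    def read(self, name) -> bytes:
        offset, length = self.index[name]
        with open(self.path, "rb") as archive:
            archive.seek(offset)
            return zlib.decompress(archive.read(length))
```

Because each entry is compressed independently, a single file can be read back by seeking to its offset without unpacking anything else.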
How it looks 
Slide #22
Step 3 
Slide #23 
SPDX search engine
• One-click SPDX creation from open data
• Visualize license and copyright data
• Visit at http://searchcode.com/spdx
Example 
Slide #24 
Using the original URL...
• https://github.com/iuly/europa_kernel/
=>
• https://spdxhub.com/iuly/europa_kernel/
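The URL mapping shown on this slide amounts to swapping the host name and keeping the path. A small sketch, with the hypothetical helper name `to_spdx_url`:

```python
from urllib.parse import urlsplit, urlunsplit


def to_spdx_url(github_url: str) -> str:
    """Map a github.com repository URL to the spdxhub.com URL from the slide."""
    parts = urlsplit(github_url)
    if parts.netloc.lower() != "github.com":
        raise ValueError(f"not a github.com URL: {github_url}")
    return urlunsplit(parts._replace(netloc="spdxhub.com"))


print(to_spdx_url("https://github.com/iuly/europa_kernel/"))
# prints "https://spdxhub.com/iuly/europa_kernel/"
```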
Example 
Slide #25
SPDX-1M
Slide #26
“Do It Yourself” kit. Generate 1 million SPDX documents
• https://github.com/triplecheck/diy
• 1.2 million open source projects
• “Arduino” for software license detection
9 GB worth of SPDX? Grab:
http://triplecheck.net/public/storage/spdx.big
Screenshots 
Slide #27
Next step? 
Slide #28 
F2F: pinpointing non-original code
• Decompose code into blocks
• Tokenize/anonymize data
• Find code matches across knowledge base
ETA in Dec. 2014
https://github.com/triplecheck/f2f
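The three steps above can be illustrated with a toy fingerprinting scheme. F2F was still unreleased at the time of the talk, so nothing here reflects its actual design: the keyword list, the three-line block size, the `ID`/`NUM` placeholders and all function names are assumptions.

```python
import hashlib
import re

# Hypothetical keyword list; identifiers outside it are anonymized.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "void"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def anonymize(line):
    """Replace identifiers with ID and numbers with NUM, keep the rest."""
    out = []
    for token in TOKEN.findall(line):
        if token in KEYWORDS:
            out.append(token)
        elif token[0].isalpha() or token[0] == "_":
            out.append("ID")
        elif token.isdigit():
            out.append("NUM")
        else:
            out.append(token)
    return " ".join(out)


def block_fingerprints(source, size=3):
    """Hash every window of `size` consecutive non-blank, anonymized lines."""
    lines = [anonymize(line) for line in source.splitlines() if line.strip()]
    return {
        hashlib.sha1("\n".join(lines[i:i + size]).encode()).hexdigest()
        for i in range(max(len(lines) - size + 1, 0))
    }


def shared_blocks(candidate, known, size=3):
    """Count the blocks that the two sources have in common."""
    return len(block_fingerprints(candidate, size)
               & block_fingerprints(known, size))
```

Because identifiers and literals are replaced before hashing, a copied block still matches after its variables have been renamed.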
Preview 
Slide #29
Conclusion 
Slide #30 
What is now available for everyone
• Desktop tooling / detection engine
• Extraction of open data at scale
• Search engine for SPDX
Questions? 
Slide #31 
http://spdx.org
http://searchcode.com/spdx
http://github.com/triplecheck
Interesting stuff?
Let us know: @nn81 @boyte #linuxcon
http://xkcd.com/1118/
Backup slides 
Slide #32
Engine 
Slide #33
License DB 
Slide #34
Components 
Slide #35
Exporting 
Slide #36

2014 10-14: GitHub plus FOSS == 1 million SPDX
