Automating License Identification
with SPDX-Tool in Ada
Stéphane Carrez
Ada Developers Workshop 2025
https://github.com/stcarrez/spdx-tool
Challenge
● Find the license defined in a source file?
Which license is it?
/*
* drm_irq.c IRQ and vblank support
*
* author Rickard E. (Rik) Faith <faith@valinux.com>
* author Gareth Hughes <gareth@valinux.com>
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
* OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
* ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
* OTHER DEALINGS IN THE SOFTWARE.
*/
MIT
Which license is it?
/*
*
* We reserve no legal rights to the ANTLR--it is fully in the public domain.
* An individual or company may do whatever they wish with source code distributed
* with ANTLR or the code generated by ANTLR, including the incorporation of ANTLR,
* or its output, into commerical software.
*
* We encourage users to develop software with ANTLR. However, we do ask that credit
* is given to us for developing ANTLR. By "credit", we mean that if you use ANTLR or
* incorporate any source code into one of your programs (commercial product, research
* project, or otherwise) that you acknowledge this fact somewhere in the documentation,
* research report, etc... If you like ANTLR and have developed a nice tool with
* the output, please mention that you developed it using ANTLR. In addition,
* we ask that the headers remain intact in our source code. As long as these
* guidelines are kept, we expect to continue enhancing this system and expect
* to make other tools available as they are completed.
*/
ANTLR-PD
Which license is it?
/*
* This is free and unencumbered software released into the public domain.
*
* Anyone is free to copy, modify, publish, use, compile, sell, or
* distribute this software, either in source code form or as a compiled
* binary, for any purpose, commercial or non-commercial, and by any
* means.
*
* In jurisdictions that recognize copyright laws, the author or authors
* of this software dedicate any and all copyright interest in the
* software to the public domain. We make this dedication for the benefit
* of the public at large and to the detriment of our heirs and
* successors. We intend this dedication to be an overt act of
* relinquishment in perpetuity of all present and future rights to this
* software under copyright law.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
* OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
* ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
* OTHER DEALINGS IN THE SOFTWARE.
*
* For more information, please refer to <http://unlicense.org/>
*/
Unlicense
Which license is it?
%%
%% %CopyrightBegin%
%%
%% Copyright Ericsson AB 2010-2023. All Rights Reserved.
%%
%% Licensed under the Apache License, Version 2.0 (the "License");
%% you may not use this file except in compliance with the License.
%% You may obtain a copy of the License at
%%
%% http://www.apache.org/licenses/LICENSE-2.0
%%
%% Unless required by applicable law or agreed to in writing, software
%% distributed under the License is distributed on an "AS IS" BASIS,
%% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
%% See the License for the specific language governing permissions and
%% limitations under the License.
%%
%% %CopyrightEnd%
%%
Apache-2.0
Which license is it?
/*
* "THE BEER-WARE LICENSE" (Revision 42):
*
* <phk@FreeBSD.ORG> wrote this file. As long as you retain this notice
* you can do whatever you want with this stuff. If we meet some day,
* and you think this stuff is worth it, you can buy me a beer
* in return Poul-Henning Kamp
*/
Beerware
Which license is it?
/*
* Copyright (C) 2019 Intel Corporation
* Permission to use, copy, modify, and distribute this software for any
* purpose with or without fee is hereby granted, provided that the above
* copyright notice and this permission notice appear in all copies.
*
* THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
* WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
* MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
* ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
* WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
* ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
* OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*
*/
#ifndef _UAPI_LINUX_UM_TIMETRAVEL_H
#define _UAPI_LINUX_UM_TIMETRAVEL_H
ISC
Which license is it?
/*
* Copyright (c) 2013, 2014 Kenneth MacKay. All rights reserved.
* Copyright (c) 2019 Vitaly Chikunov <vt@altlinux.org>
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are
* met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
* HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
BSD-2-Clause
Which license is it?
<# Copyright (c) <year> <owner>.
Redistribution and use in source and binary forms, with or without modification, are permitted provided
that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used
to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE , EVEN IF ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE.
#>
BSD-3-Clause
Which license is it?
/*
* Copyright (c) 2005 Topspin Communications. All rights reserved.
* Copyright (c) 2005 Mellanox Technologies. All rights reserved.
* Copyright (c) 2013 Cisco Systems. All rights reserved.
*
* This software is available to you under a choice of one of two
* licenses. You may choose to be licensed under the terms of the GNU
* General Public License (GPL) Version 2, available from the file
* COPYING in the main directory of this source tree, or the
* BSD license below:
*
* Redistribution and use in source and binary forms, with or
* without modification, are permitted provided that the following
* conditions are met:
*
* - Redistributions of source code must retain the above
* copyright notice, this list of conditions and the following
* disclaimer.
*
* - Redistributions in binary form must reproduce the above
* copyright notice, this list of conditions and the following
* disclaimer in the documentation and/or other materials
* provided with the distribution.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
Linux-OpenIB
Which license is it?
/*
he.h
ForeRunnerHE ATM Adapter driver for ATM on Linux
Copyright (C) 1999-2000 Naval Research Laboratory
Permission to use, copy, modify and distribute this software and its
documentation is hereby granted, provided that both the copyright
notice and this permission notice appear in all copies of the software,
derivative works or modified versions, and any portions thereof, and
that both notices appear in supporting documentation.
NRL ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS" CONDITION AND
DISCLAIMS ANY LIABILITY OF ANY KIND FOR ANY DAMAGES WHATSOEVER
RESULTING FROM THE USE OF THIS SOFTWARE.
*/
CMU-Mach-nodoc
Next challenge
● Where is the license in the header?
Where is the license?
pragma Style_Checks (Off);
-- This file has been generated by
-- 'g++ -c -C -fdump-ada-spec /usr/include/sqlite3_h.ads'
-- and manually edited by Olivier Ramonat
-- Some sqlite3 experimental features have been removed
--** 2001 September 15
--**
-- The author disclaims copyright to this source code. In place of
-- a legal notice, here is a blessing:
--
-- May you do good and not evil.
-- May you find forgiveness for yourself and forgive others.
-- May you share freely, never taking more than you give.
--
--*************************************************************************
--** This header file defines the interface that the SQLite library
--** presents to client programs. If a C-function, structure, datatype,
--** or constant definition does not appear in this file, then it is
--** not a published API of SQLite, is subject to change without
--** notice, and should not be referenced by programs that use SQLite.
blessing
Where is the license?
"======================================================================
|
| Smalltalk dining philosophers
|
|
======================================================================"
"======================================================================
|
| Copyright 1999, 2000 Free Software Foundation, Inc.
| Written by Paolo Bonzini.
|
| This file is part of GNU Smalltalk.
|
| GNU Smalltalk is free software; you can redistribute it and/or modify it
| under the terms of the GNU General Public License as published by the Free
| Software Foundation; either version 2, or (at your option) any later version.
|
| GNU Smalltalk is distributed in the hope that it will be useful, but WITHOUT
| ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
| FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
| details.
|
| You should have received a copy of the GNU General Public License along with
| GNU Smalltalk; see the file COPYING. If not, write to the Free Software
| Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
|
======================================================================"
GPL-2.0 match 0.86
Where is the license?
;;; elec-pair.el --- Automatic parenthesis pairing -*- lexical-binding:t -*-
;; Copyright (C) 2013-2020 Free Software Foundation, Inc.
;; Author: João Távora <joaotavora@gmail.com>
;; This file is part of GNU Emacs.
;; GNU Emacs is free software: you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation, either version 3 of the License, or
;; (at your option) any later version.
;; GNU Emacs is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;; GNU General Public License for more details.
;; You should have received a copy of the GNU General Public License
;; along with GNU Emacs. If not, see <https://www.gnu.org/licenses/>.
;;; Commentary:
;;; Code:
GPL-3.0 match 0.85
Where is the license?
/* Leap second stress test
* by: John Stultz (john.stultz@linaro.org)
* (C) Copyright IBM 2012
* (C) Copyright 2013, 2015 Linaro Limited
* Licensed under the GPLv2
*
* This test signals the kernel to insert a leap second
* every day at midnight GMT. This allows for stressing the
* kernel's leap-second behavior, as well as how well applications
* handle the leap-second discontinuity.
*
* Usage: leap-a-day [-s] [-i <num>]
*
* Options:
* -s: Each iteration, set the date to 10 seconds before midnight GMT.
* This speeds up the number of leapsecond transitions tested,
* but because it calls settimeofday frequently, advancing the
* time by 24 hours every ~16 seconds, it may cause application
* disruption.
*
* -i: Number of iterations to run (default: infinite)
*
* Other notes: Disabling NTP prior to running this is advised, as the two
* may conflict in their commands to the kernel.
*
* To build:
* $ gcc leap-a-day.c -o leap-a-day -lrt
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*/
Introducing SPDX License
● System Package Data Exchange (SPDX):
– Created by the Linux Foundation in 2011 to improve license compliance
– Describes Bills of Materials (BOMs) for software components
– Registry of licenses with tags
● Examples:
// SPDX-License-Identifier: GPL-2.0
-- SPDX-License-Identifier: Apache-2.0
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
Introducing spdx-tool
● Why spdx-tool?
– Needed a way to convert license headers into SPDX license tags (> 3000 files to convert)
– Wanted to identify licenses by analyzing comment headers
● Key requirements:
– Support for any programming language
– Recognize standard and custom licenses
– Parallelized license analysis
Spdx-tool usage (1)
● Identify licenses used in a project:
spdx-tool
● List files matching a given license:
spdx-tool --only-licenses=BSD-3-Clause --files
Spdx-tool usage (2)
● Print license headers found in files:
spdx-tool --print-license --line-number src/intl.ads
Spdx-tool usage (3)
● Replace license headers with their SPDX tag:
spdx-tool --update=1..2,spdx src
● Meaning of the --update argument:
– “1..2” means keep the original lines
– To put the SPDX tag first, use “spdx,1..2”
License report on Linux 6.15.1
spdx-tool numbers
● 95% Ada, 5% JSON
● 192K CLOC of Ada... but 180K generated! (only 6K CLOC for spdx-tool itself)
● 560 license templates
● 720 programming languages identified
● 1900 source files analyzed per second (on a 24-core AMD machine)
Spdx-tool algorithm
1) Scan files and honor .gitignore rules
2) Prepare for analysis (read file & identify lines)
3) Find the programming language
4) Analyze comments
5) Identify the license from a repository of license templates (apply change on source file)
6) Print report
Parallel execution by a pool of tasks
License template repository
● License templates from https://spdx.org/:
– 560 templates in the repository
– A template defines the rules for matching (strict content, optional content, variable/patterns)
<<var;name="copyright";original="Copyright (c) <year> <owner>. ";match=".{0,5000}">>
Redistribution and use in source and binary forms
<<var;name="theme";original="";match="()|( of the theme)">>,
with or without modification,
<<var;name="tobe";original="are";match="are|is">>
permitted provided that the following conditions are met:
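One way to read the template rules above is as a recipe for building a regular expression: fixed text must appear literally and each `<<var;...>>` field contributes its `match` pattern. The sketch below illustrates that idea in Python, not in the tool's Ada, and it is an assumption about how such a template could be used rather than a description of how spdx-tool matches. `template_to_regex` and `literal` are hypothetical helpers; the template text is the excerpt shown on the slide.

```python
import re

# Matches one <<var;...>> field; only the "match" attribute is used here.
VAR = re.compile(r'<<var;name="[^"]*";original="[^"]*";match="([^"]*)">>')

# Excerpt of the template shown on the slide.
TEMPLATE = '''<<var;name="copyright";original="Copyright (c) <year> <owner>. ";match=".{0,5000}">>
Redistribution and use in source and binary forms
<<var;name="theme";original="";match="()|( of the theme)">>,
with or without modification,
<<var;name="tobe";original="are";match="are|is">>
permitted provided that the following conditions are met:'''


def literal(text):
    """Escape fixed template text, letting any whitespace run match."""
    return r"\s+".join(re.escape(word) for word in text.split())


def template_to_regex(template):
    """Turn fixed text and <<var>> fields into one compiled pattern."""
    parts = []
    pos = 0
    for field in VAR.finditer(template):
        parts.append(literal(template[pos:field.start()]))
        parts.append("(?:" + field.group(1) + ")")
        pos = field.end()
    parts.append(literal(template[pos:]))
    pattern = r"\s*".join(part for part in parts if part)
    return re.compile(pattern, re.IGNORECASE | re.DOTALL)


HEADER = ("Copyright (c) 2024 Jane Doe. Redistribution and use in source "
          "and binary forms, with or without modification, is permitted "
          "provided that the following conditions are met:")
print(bool(template_to_regex(TEMPLATE).fullmatch(HEADER)))
```

The sample header uses "is permitted" where the template's original text says "are", which the `tobe` variable allows. A real matcher would also need the normalization rules of the SPDX matching guidelines, which this sketch ignores.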
Embedded license templates
● Template files integrated by using Advanced Resource Embedder
https://github.com/stcarrez/resource-embedder
● Generates a simple function (perfect hash) to convert a license name into its content
package SPDX_Tool.Licenses.Files is
   Names_Count : constant := 562;
   Names : constant Name_Array;
   -- Returns the data stream with the given name or null.
   function Get_Content (Name : String) return
      access constant Buffer_Type;
   …
end SPDX_Tool.Licenses.Files;
Embedded inverted index
● A custom tool generates from the license repository:
– A list of tokens
– An inverted index
– Per-license token occurrences
– Defined as constants => put in the .rodata section
AFL-1.1: Licensed under the Academic Free License version 1.1.
T82 : aliased constant Token_Array := -- Tokens used by AFL-1.1
((244, 1), (964, 1), (1391, 1), (1392, 1), (5041, 1), (5115, 1));
L172 : aliased constant License_Index_Array := (13, 49, 82, 83, 84, 85, 86);
Index : constant Token_Index_Array := ((1, L0'Access), (2, L1'Access),
(3, L2'Access), …
(244, L172'Access)
-- Token ‘244’ (“Academic”) is used by license 13, 49, 82, …
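The three generated tables can be pictured with a small Python sketch (Python is used for illustration only; the tool generates Ada constants). `build_index` is a hypothetical helper, the whitespace tokenizer is a simplification, and the numbering scheme is an assumption; only the shape of the result (token list, per-license `(token, count)` pairs, token-to-licenses index) follows the slide.

```python
from collections import Counter, defaultdict


def build_index(templates):
    """Build token ids, per-license token counts and an inverted index.

    templates maps a license name to its text. License numbers follow the
    sorted order of the names; token ids start at 1.
    """
    token_ids = {}                 # token text -> token id
    per_license = {}               # license name -> [(token id, count), ...]
    inverted = defaultdict(list)   # token id -> [license number, ...]
    for number, name in enumerate(sorted(templates)):
        counts = Counter(templates[name].lower().split())
        entry = []
        for word in sorted(counts):
            token = token_ids.setdefault(word, len(token_ids) + 1)
            entry.append((token, counts[word]))
            inverted[token].append(number)
        per_license[name] = entry
    return token_ids, per_license, dict(inverted)


tokens, per_license, inverted = build_index({
    "A": "free software license",
    "B": "free beer",
})
print(tokens["free"], inverted[tokens["free"]], per_license["B"])
```

Looking up a token in `inverted` gives the licenses worth comparing against, which is what lets the later steps skip most of the templates.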
Spdx-tool architecture
[Diagram: Ada packages of SPDX_Tool (Configs, Defaults, Files, Infos, Languages, Licenses, Reports, with the package names Manager, Reader, Templates, Repository, Generated, Mimes, Modelines, Rules, Shell, Extensions, Filenames), the Alire crates used (ada_toml, utilada, magicada, intl, utilada_xml, sciada, ansiada, printer_toolkit) and the processing steps [1] to [6].]
Scan files and honor .gitignore
[Diagram: architecture overview for step [1]; Alire crates shown: utilada, intl.]
Step 1 : Scan files (main task)
● Identify files that must be analyzed:
– Scan directory tree, handle .gitignore files
– Scan and filter support provided by Ada Utility Library (using Ada.Directories; see the tree.adb example in Ada Utility Library)
– Add each file path to a queue for analysis
with Util.Files.Walk;
with Util.Files.Filters;
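The scan step can be sketched as a recursive directory walk that carries ignore patterns down the tree. This is a deliberately simplified Python illustration, not the Util.Files API: `load_patterns` and `scan` are hypothetical names, and only plain glob patterns on file names are honored (no negation with `!`, no anchored or nested path patterns, unlike real .gitignore semantics).

```python
import fnmatch
import os


def load_patterns(directory):
    """Read simple glob patterns from a .gitignore file, if present."""
    path = os.path.join(directory, ".gitignore")
    if not os.path.isfile(path):
        return []
    with open(path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle]
    # Skip blank lines and comments; "obj/" is treated like "obj".
    return [line.rstrip("/") for line in lines
            if line and not line.startswith("#")]


def scan(root):
    """Yield file paths under root, skipping ignored names."""
    def walk(directory, inherited):
        # Patterns of parent directories also apply to their children.
        patterns = inherited + load_patterns(directory)
        for name in sorted(os.listdir(directory)):
            if name == ".git":
                continue
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                continue
            path = os.path.join(directory, name)
            if os.path.isdir(path):
                yield from walk(path, patterns)
            else:
                yield path

    yield from walk(root, [])
```

Each yielded path would then be put on the analysis queue, for example with `for path in scan("."): todo.put(path)` where `todo` is a queue owned by the caller.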
Language detection and comment analysis
[Diagram: architecture overview for steps [2], [3] and [4]; Alire crates shown: utilada, magicada, intl.]
Step 2: prepare for analysis
● File analysis is done by a dedicated task (which picks a file to analyze from the queue)
– Avoid sharing data across tasks
– Avoid switching tasks for analysis
● Preparation for analysis:
– Read content as binary (Stream_Element, 8 KB)
– Prepare a data structure to represent each line and find line boundaries (within the 8 KB buffer)
with Util.Executors;
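The queue-and-workers structure described above can be sketched as follows. This is a Python illustration of the pattern only, not of Util.Executors: `analyze` is a hypothetical stand-in for the per-file work, and Python threads show the structure without giving the CPU parallelism that Ada tasks provide.

```python
import queue
import threading


def analyze(path):
    """Hypothetical stand-in for the per-file analysis (steps 2 to 5)."""
    return path, len(path)


def run_pool(paths, workers=4):
    """Analyze every path with a fixed pool of worker threads."""
    todo = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            path = todo.get()
            if path is None:           # sentinel: no more work
                break
            key, value = analyze(path)  # each file is handled by one worker
            with lock:                  # only the result table is shared
                results[key] = value

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for path in paths:
        todo.put(path)
    for _ in threads:
        todo.put(None)                  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results


print(run_pool(["a.c", "bb.adb"], workers=2))
```

Keeping all state for one file inside the worker that picked it up mirrors the "avoid sharing data across tasks" point: the only synchronized objects are the queue and the result table.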
Step 3: find the language
● Identify the language used in the source file (heavily inspired by the Linguist project):
– File extension heuristic mapping (JSON)
– Disambiguation rules (JSON)
– Emacs and vi modelines in header (-*-mode:..-*-)
– By using libmagic (only with --mimes option)
– Unix shell identification (‘#!’ on first line), mapping the shell interpreter to a language
● Identify generated files by looking at patterns
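Three of the heuristics listed above (Emacs modeline, `#!` interpreter, file extension) are easy to sketch. The Python below is illustrative only: the tables are tiny hypothetical excerpts, the order in which the heuristics are tried is an assumption, and disambiguation rules, vi modelines, libmagic and generated-file detection are left out.

```python
import os
import re

# Hypothetical excerpts; the tool loads much larger tables from JSON.
EXTENSIONS = {".adb": "Ada", ".ads": "Ada", ".c": "C", ".py": "Python"}
INTERPRETERS = {"sh": "Shell", "bash": "Shell", "python3": "Python"}
MODES = {"ada": "Ada", "c": "C", "python": "Python"}

# Emacs modeline such as "-*- mode: python -*-".
MODELINE = re.compile(r"-\*-.*?\bmode:\s*([\w+-]+).*?-\*-", re.IGNORECASE)
# "#!" line: interpreter path and an optional first argument.
SHEBANG = re.compile(r"^#!\s*(\S+)(?:\s+(\S+))?")


def detect(filename, first_line):
    """Return a language name or None (order of checks is an assumption)."""
    found = MODELINE.search(first_line)
    if found and found.group(1).lower() in MODES:
        return MODES[found.group(1).lower()]
    found = SHEBANG.match(first_line)
    if found:
        interpreter = os.path.basename(found.group(1))
        if interpreter == "env" and found.group(2):
            interpreter = found.group(2)   # "#!/usr/bin/env python3"
        if interpreter in INTERPRETERS:
            return INTERPRETERS[interpreter]
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(extension)


print(detect("run", "#!/usr/bin/env python3"))
print(detect("intl.ads", "-- SPDX-License-Identifier: Apache-2.0"))
```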
Step 3: find language
● Language detector defined as a private interface, with detectors in private child packages
package SPDX_Tool.Languages is
   ...
private
   type Detector_Type is limited interface;
   procedure Detect (Detector : in Detector_Type;
                     File     : in File_Info;
                     Content  : in out File_Type;
                     Result   : in out Detector_Result)
      is abstract;
end SPDX_Tool.Languages;
Step 4: analyze comments
● Extract useful content from comment headers:
– Identify most comment styles
Block comments: /* .. */   {* .. *}   <!-- .. -->   ### .. ###   (* .. *)   {- .. -}   <# .. #>   “ .. ”
Line comments: // ..   %% ..   -- ..   % ..   dnl ..   # ..   ; ..
– Ignore presentation markers
– Collect per-line tokens for license search
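For line comments, the extraction described above amounts to removing the comment marker, trimming decoration characters and splitting what is left into tokens. The Python sketch below shows that for a handful of markers; `comment_text` and `tokens` are hypothetical helpers, the marker list and the token pattern are assumptions, and block comments are not handled.

```python
import re

# Line-comment markers, longest first so that "%%" wins over "%".
LINE_MARKERS = ("//", "--", "%%", "dnl", "#", ";", "%")

# Characters treated as presentation markers at both ends of the text.
DECORATION = " \t*=-#;%/|"

# A token is a run of letters or digits, optionally joined by "." or "-".
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[.-][A-Za-z0-9]+)*")


def comment_text(line):
    """Return the text of a line comment, or None if it is not one."""
    stripped = line.strip()
    for marker in LINE_MARKERS:
        if stripped.startswith(marker):
            return stripped[len(marker):].strip(DECORATION)
    return None


def tokens(line):
    """Return the tokens found in the comment part of a line."""
    text = comment_text(line)
    return [] if text is None else TOKEN.findall(text)


print(tokens('%% Licensed under the Apache License, Version 2.0 (the "License");'))
print(tokens("--** 2001 September 15"))
```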
Step 4: result of analysis
%% Licensed under the Apache License, Version 2.0 (the "License");
type Line_Type is record
   Comment    : Comment_Mode := NO_COMMENT;
   Style      : Comment_Info;
   Line_Start : Buffer_Index := 1;
   Line_End   : Buffer_Size := 0;
   Tokens     : SPDX_Tool.Counter_Arrays.Array_Type;
   Licenses   : License_Index_Map := SPDX_Tool.EMPTY_MAP;
end record;
type Line_Array is array (Infos.Line_Number range <>) of Line_Type;
[Diagram: the comment line above mapped onto a Line_Type value. Line_Start and Line_End delimit the line, Comment.Text_Start and Comment.Text_Last delimit the comment text, Comment := LINE_COMMENT, and Tokens holds one occurrence count per token column (C309, C1393, C1394, C5063, ..., C2132) for the words Licensed, under, Apache, License, Version.]
License identification
[Diagram: architecture overview for step [5]; Alire crates shown: utilada, intl, sciada.]
Step 5: strict license identification
● License identification algorithm:
– Check for SPDX-License-Identifier: tags
→ if found, we are done (report as ‘SPDX’)
– Build a per-line bitmap of possible licenses (a bit is set if the line contains at least one token of the license)
– Scan the license templates which are referenced in the bitmaps
→ if exact match, we are done (report as ‘TMPL’)
– License exception support not finished
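The first two checks above can be sketched in a few lines of Python (for illustration; the tool itself is written in Ada). `find_spdx_tag` and `candidate_masks` are hypothetical helpers, the regular expression is an assumption about what a tag line looks like, and `inverted` stands for a token-to-license-numbers index of the kind described earlier.

```python
import re

# Tag value up to the end of the line, without a closing "*/" or "-->".
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+?)\s*(?:\*/|-->)?\s*$")


def find_spdx_tag(lines):
    """Return the first SPDX license expression found, or None."""
    for line in lines:
        found = SPDX_TAG.search(line)
        if found:
            return found.group(1)
    return None


def candidate_masks(line_tokens, inverted):
    """Return one bit mask per line.

    Bit i of a mask is set when license number i contains at least one of
    the tokens found on that line.
    """
    masks = []
    for line in line_tokens:
        mask = 0
        for token in line:
            for license_number in inverted.get(token, ()):
                mask |= 1 << license_number
        masks.append(mask)
    return masks


print(find_spdx_tag(["/* SPDX-License-Identifier: MIT */"]))
print(candidate_masks([["apache", "license"], ["foo"]],
                      {"apache": [2], "license": [0, 2]}))
```

Only the templates whose bit is set somewhere need to be compared against the header, which keeps the exact-match scan short.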
Step 5: guessing the license
● Tried several similarity algorithms (Jaccard, Sørensen-Dice, Tversky)
● Introducing tf-idf (used by search engines for ranking):
– term frequency (tf) in a document
– inverse document frequency (idf)
● Introducing cosine similarity
● Compute tf-idf and cosine similarity for each license template in the repository against the extracted license
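For reference, the usual textbook forms of these quantities are given below. They are not taken from the slides, and the exact variants used by the tool (for example a smoothed idf) are not stated. Here f_{t,d} is the number of occurrences of token t in document d, N is the number of license templates, n_t is the number of templates containing t, and a and b are the tf-idf vectors being compared.

```latex
\[
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad
w_{t,d} = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\]
\[
\cos(\mathbf{a}, \mathbf{b}) =
  \frac{\sum_t a_t \, b_t}{\sqrt{\sum_t a_t^2} \, \sqrt{\sum_t b_t^2}}
\]
```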
Step 5: compute cosine similarity
“Licensed under the Apache License”
[Diagram: the tokens of the sentence above (Licensed, under, Apache, License) form a count vector over the token columns (C309, C1393, C1394, C5063, ...). The counts are weighted by the IDF row, and the same is done for each license template row R1 to R560 (rows R107, R108, R109 and R416 are labelled Apache-1.0, Apache-1.1, Apache-2.0 and Pixar). A cosine similarity is computed for each row; the values shown are 0.95, 0.83, 0.56, 0.56 and 0.01.]
Step 5: cosine similarity
● Implemented with sciada crate:
– Sparse arrays as coordinate lists
– Provides vectorizers and transformers (convert data into numerical vectors)
– Similarities (Jaccard, …, Cosine)
– See similar.adb for a complete example
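The ranking idea can be tried out with a few lines of Python using sparse vectors stored as dictionaries. This is a sketch of plain tf-idf with cosine similarity, not of the sciada API: all function names are hypothetical, the idf is the unsmoothed logarithm, and the three token lists are made-up miniature templates.

```python
import math
from collections import Counter


def idf_weights(templates):
    """Inverse document frequency of every token seen in the templates."""
    total = len(templates)
    frequency = Counter(token for words in templates.values()
                        for token in set(words))
    return {token: math.log(total / count)
            for token, count in frequency.items()}


def vectorize(words, idf):
    """Sparse tf-idf vector (token -> weight); unknown tokens are ignored."""
    counts = Counter(word for word in words if word in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: (count / total) * idf[token]
            for token, count in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors, 0.0 if either is empty."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def best_match(words, templates):
    """Return (license name, similarity) of the closest template."""
    idf = idf_weights(templates)
    query = vectorize(words, idf)
    scores = {name: cosine(query, vectorize(text, idf))
              for name, text in templates.items()}
    return max(scores.items(), key=lambda item: item[1])


TEMPLATES = {
    "Apache-2.0": "licensed under the apache license version 2.0".split(),
    "MIT": "permission is hereby granted free of charge".split(),
    "Beerware": "you can buy me a beer under the license".split(),
}
print(best_match("licensed under the apache license".split(), TEMPLATES))
```

Tokens shared by many templates ("under", "the", "license") get a low idf and contribute little, so the rarer tokens ("apache", "licensed") dominate the score.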
Sciada package instantiations
package SPDX_Tool.Counter_Arrays is
   new SCI.Sparse.COO_Arrays (Row_Type      => License_Index,
                              Column_Type   => Token_Index,
                              Value_Type    => Count_Type,
                              Default_Value => 0);
function To_Float (Value : Float) return Float is (Value);
package Freq_Transformers is
   new SCI.Vectorizers.Transformers (Frequency_Type => Float,
                                     Arrays         => SPDX_Tool.Counter_Arrays,
                                     Convert        => To_Float);
type Confidence_Type is delta 0.001 range 0.0 .. 1.0;
package Confidence_Numbers is
   new SCI.Numbers.Number (Confidence_Type, "*" => Mul, "/" => Div);
package Confidence_Conversions is
   new SCI.Numbers.Conversion (Confidence_Numbers);
package Similarities is
   new SCI.Similarities.COO_Arrays (Arrays      => Freq_Transformers.Frequency_Arrays,
                                    Conversions => Confidence_Conversions,
                                    To_Float    => To_Float);
Print report
[Diagram: architecture overview for step [6]; Alire crates shown: utilada, intl, utilada_xml, ansiada, printer_toolkit.]
Conclusion
● Lessons learned:
– Write custom generation tools to speed up the project (fast inverted index generated at compilation time)
– Easy to get lost in types defined by generic packages
– Good performance with tasks on dedicated data sets (What Every Programmer Should Know About Memory by Ulrich Drepper, 2007)
● Needs improvement:
– License detection (heuristics, priorities, ...)
– Knowledge of languages and their comments
– License detection in images (EXIF) or PDFs
Questions?
More Related Content

TXT
Readme
TXT
Signlic
RTF
Credits
TXT
Third party attributions-click-to-call
PDF
Lbp6030 license uke_00
TXT
Licenses
PDF
Acrobat reader xi_3rd_party_read_me_ver_1
Readme
Signlic
Credits
Third party attributions-click-to-call
Lbp6030 license uke_00
Licenses
Acrobat reader xi_3rd_party_read_me_ver_1

Similar to Automating License Identification with SPDX-Tool in Ada (20)

RTF
Rm09a fin
PDF
Legal notices
PDF
Legal notices
PDF
Legal notices
PDF
Avisos legales
PDF
Legal notices
PDF
Legal notices
PDF
Legal notices
TXT
Acknow
PDF
Third party license
PDF
hamza xp
PDF
Third party license
PDF
Smart viewreporter
PDF
Conica fax driver operations user manual
PDF
Backburner install guide
TXT
Readme
PDF
LegalNotices.pdf
PDF
Legal notices
RTF
RTF
Acknowledgements
Rm09a fin
Legal notices
Legal notices
Legal notices
Avisos legales
Legal notices
Legal notices
Legal notices
Acknow
Third party license
hamza xp
Third party license
Smart viewreporter
Conica fax driver operations user manual
Backburner install guide
Readme
LegalNotices.pdf
Legal notices
Acknowledgements
Ad

More from Stephane Carrez (9)

PDF
Implementing a build manager in Ada
PDF
Porion a new Build Manager
PDF
Protect Sensitive Data with Ada Keystore
PDF
AKT un outil pour sécuriser vos données et documents sensibles
PDF
Ada for Web Development
PDF
Secure Web Applications with AWA
PDF
Persistence with Ada Database Objects (ADO)
PDF
Writing REST APIs with OpenAPI and Swagger Ada
PDF
IP Network Stack in Ada 2012 and the Ravenscar Profile
Implementing a build manager in Ada
Porion a new Build Manager
Protect Sensitive Data with Ada Keystore
AKT un outil pour sécuriser vos données et documents sensibles
Ada for Web Development
Secure Web Applications with AWA
Persistence with Ada Database Objects (ADO)
Writing REST APIs with OpenAPI and Swagger Ada
IP Network Stack in Ada 2012 and the Ravenscar Profile
Ad


Automating License Identification with SPDX-Tool in Ada

  • 1. Automating License Identification with SPDX-Tool in Ada. Stéphane Carrez, Ada Developers Workshop 2025
  • 3. https://github.com/stcarrez/spdx-tool 3 Which license is it ? /* * drm_irq.c IRQ and vblank support * * author Rickard E. (Rik) Faith <faith@valinux.com> * author Gareth Hughes <gareth@valinux.com> * * Permission is hereby granted, free of charge, to any person obtaining a * copy of this software and associated documentation files (the "Software"), * to deal in the Software without restriction, including without limitation * the rights to use, copy, modify, merge, publish, distribute, sublicense, * and/or sell copies of the Software, and to permit persons to whom the * Software is furnished to do so, subject to the following conditions: * * The above copyright notice and this permission notice (including the next * paragraph) shall be included in all copies or substantial portions of the * Software. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR * OTHER DEALINGS IN THE SOFTWARE. */ MIT
  • 4. https://github.com/stcarrez/spdx-tool 4 Which license is it ? /* * * We reserve no legal rights to the ANTLR--it is fully in the public domain. * An individual or company may do whatever they wish with source code distributed * with ANTLR or the code generated by ANTLR, including the incorporation of ANTLR, * or its output, into commerical software. * * We encourage users to develop software with ANTLR. However, we do ask that credit * is given to us for developing ANTLR. By "credit", we mean that if you use ANTLR or * incorporate any source code into one of your programs (commercial product, research * project, or otherwise) that you acknowledge this fact somewhere in the documentation, * research report, etc... If you like ANTLR and have developed a nice tool with * the output, please mention that you developed it using ANTLR. In addition, * we ask that the headers remain intact in our source code. As long as these * guidelines are kept, we expect to continue enhancing this system and expect * to make other tools available as they are completed. */ ANTLR-PD
  • 5. https://github.com/stcarrez/spdx-tool 5 Which license is it ? /* * This is free and unencumbered software released into the public domain. * * Anyone is free to copy, modify, publish, use, compile, sell, or * distribute this software, either in source code form or as a compiled * binary, for any purpose, commercial or non-commercial, and by any * means. * * In jurisdictions that recognize copyright laws, the author or authors * of this software dedicate any and all copyright interest in the * software to the public domain. We make this dedication for the benefit * of the public at large and to the detriment of our heirs and * successors. We intend this dedication to be an overt act of * relinquishment in perpetuity of all present and future rights to this * software under copyright law. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. * IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR * OTHER DEALINGS IN THE SOFTWARE. * * For more information, please refer to <http://unlicense.org/> */ Unlicense
  • 6. https://github.com/stcarrez/spdx-tool 6 Which license is it ? %% %% %CopyrightBegin% %% %% Copyright Ericsson AB 2010-2023. All Rights Reserved. %% %% Licensed under the Apache License, Version 2.0 (the "License"); %% you may not use this file except in compliance with the License. %% You may obtain a copy of the License at %% %% http://www.apache.org/licenses/LICENSE-2.0 %% %% Unless required by applicable law or agreed to in writing, software %% distributed under the License is distributed on an "AS IS" BASIS, %% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. %% See the License for the specific language governing permissions and %% limitations under the License. %% %% %CopyrightEnd% %% Apache-2.0
  • 7. https://github.com/stcarrez/spdx-tool 7 Which license is it ? /* * "THE BEER-WARE LICENSE" (Revision 42): * * <phk@FreeBSD.ORG> wrote this file. As long as you retain this notice * you can do whatever you want with this stuff. If we meet some day, * and you think this stuff is worth it, you can buy me a beer * in return Poul-Henning Kamp */ Beerware
  • 8. https://github.com/stcarrez/spdx-tool 8 Which license is it ? /* * Copyright (C) 2019 Intel Corporation * Permission to use, copy, modify, and distribute this software for any * purpose with or without fee is hereby granted, provided that the above * copyright notice and this permission notice appear in all copies. * * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. * */ #ifndef _UAPI_LINUX_UM_TIMETRAVEL_H #define _UAPI_LINUX_UM_TIMETRAVEL_H ISC
  • 9. https://github.com/stcarrez/spdx-tool 9 Which license is it ? /* * Copyright (c) 2013, 2014 Kenneth MacKay. All rights reserved. * Copyright (c) 2019 Vitaly Chikunov <vt@altlinux.org> * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are * met: * * Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * * Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT * HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ BSD-2-Clause
  • 10. https://github.com/stcarrez/spdx-tool 10 Which license is it ? <# Copyright (c) <year> <owner>. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE , EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. #> BSD-3-Clause
  • 11. https://github.com/stcarrez/spdx-tool 11 Which license is it ? /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Mellanox Technologies. All rights reserved. * Copyright (c) 2013 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. */ Linux-OpenIB
  • 12. https://github.com/stcarrez/spdx-tool 12 Which license is it ? /* he.h ForeRunnerHE ATM Adapter driver for ATM on Linux Copyright (C) 1999-2000 Naval Research Laboratory Permission to use, copy, modify and distribute this software and its documentation is hereby granted, provided that both the copyright notice and this permission notice appear in all copies of the software, derivative works or modified versions, and any portions thereof, and that both notices appear in supporting documentation. NRL ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS" CONDITION AND DISCLAIMS ANY LIABILITY OF ANY KIND FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE. */ CMU-Mach-nodoc
  • 14. https://github.com/stcarrez/spdx-tool 14 Where is the license ? pragma Style_Checks (Off); -- This file has been generated by -- 'g++ -c -C -fdump-ada-spec /usr/include/sqlite3_h.ads' -- and manually edited by Olivier Ramonat -- Some sqlite3 experimental features have been removed --** 2001 September 15 --** -- The author disclaims copyright to this source code. In place of -- a legal notice, here is a blessing: -- -- May you do good and not evil. -- May you find forgiveness for yourself and forgive others. -- May you share freely, never taking more than you give. -- --************************************************************************* --** This header file defines the interface that the SQLite library --** presents to client programs. If a C-function, structure, datatype, --** or constant definition does not appear in this file, then it is --** not a published API of SQLite, is subject to change without --** notice, and should not be referenced by programs that use SQLite. blessing
  • 15. https://github.com/stcarrez/spdx-tool 15 Where is the license ? "====================================================================== | | Smalltalk dining philosophers | | ======================================================================" "====================================================================== | | Copyright 1999, 2000 Free Software Foundation, Inc. | Written by Paolo Bonzini. | | This file is part of GNU Smalltalk. | | GNU Smalltalk is free software; you can redistribute it and/or modify it | under the terms of the GNU General Public License as published by the Free | Software Foundation; either version 2, or (at your option) any later version. | | GNU Smalltalk is distributed in the hope that it will be useful, but WITHOUT | ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS | FOR A PARTICULAR PURPOSE. See the GNU General Public License for more | details. | | You should have received a copy of the GNU General Public License along with | GNU Smalltalk; see the file COPYING. If not, write to the Free Software | Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. | ======================================================================" GPL-2.0 match 0.86
  • 16. https://github.com/stcarrez/spdx-tool 16 Where is the license ? ;;; elec-pair.el --- Automatic parenthesis pairing -*- lexical-binding:t -*- ;; Copyright (C) 2013-2020 Free Software Foundation, Inc. ;; Author: João Távora <joaotavora@gmail.com> ;; This file is part of GNU Emacs. ;; GNU Emacs is free software: you can redistribute it and/or modify ;; it under the terms of the GNU General Public License as published by ;; the Free Software Foundation, either version 3 of the License, or ;; (at your option) any later version. ;; GNU Emacs is distributed in the hope that it will be useful, ;; but WITHOUT ANY WARRANTY; without even the implied warranty of ;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ;; GNU General Public License for more details. ;; You should have received a copy of the GNU General Public License ;; along with GNU Emacs. If not, see <https://www.gnu.org/licenses/>. ;;; Commentary: ;;; Code: GPL-3.0 match 0.85
  • 17. https://github.com/stcarrez/spdx-tool 17 Where is the license ? /* Leap second stress test * by: John Stultz (john.stultz@linaro.org) * (C) Copyright IBM 2012 * (C) Copyright 2013, 2015 Linaro Limited * Licensed under the GPLv2 * * This test signals the kernel to insert a leap second * every day at midnight GMT. This allows for stressing the * kernel's leap-second behavior, as well as how well applications * handle the leap-second discontinuity. * * Usage: leap-a-day [-s] [-i <num>] * * Options: * -s: Each iteration, set the date to 10 seconds before midnight GMT. * This speeds up the number of leapsecond transitions tested, * but because it calls settimeofday frequently, advancing the * time by 24 hours every ~16 seconds, it may cause application * disruption. * * -i: Number of iterations to run (default: infinite) * * Other notes: Disabling NTP prior to running this is advised, as the two * may conflict in their commands to the kernel. * * To build: * $ gcc leap-a-day.c -o leap-a-day -lrt * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. */
  • 18. Introducing SPDX License
    ● System Package Data Exchange (SPDX):
      – Created by the Linux Foundation in 2011 to improve license compliance
      – Describes Bills of Materials (BOMs) for software components
      – Registry of licenses with tags
    ● Examples:
        // SPDX-License-Identifier: GPL-2.0
        -- SPDX-License-Identifier: Apache-2.0
        # SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
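The tag examples above share one shape: a comment marker, the fixed text `SPDX-License-Identifier:`, then a license expression. A minimal sketch of extracting that expression from one line follows. It uses only the standard `Ada.Strings.Fixed` package; the names `Find_Spdx_Tag` and `License_Of` are made up for illustration and are not taken from spdx-tool.

```ada
with Ada.Text_IO;
with Ada.Strings;
with Ada.Strings.Fixed;

procedure Find_Spdx_Tag is

   Tag : constant String := "SPDX-License-Identifier:";

   --  Hypothetical helper: return the license expression that follows
   --  the tag on Line, or an empty string when the line has no SPDX tag.
   function License_Of (Line : String) return String is
      Pos : constant Natural := Ada.Strings.Fixed.Index (Line, Tag);
   begin
      if Pos = 0 then
         return "";
      end if;
      return Ada.Strings.Fixed.Trim
        (Line (Pos + Tag'Length .. Line'Last), Ada.Strings.Both);
   end License_Of;

begin
   Ada.Text_IO.Put_Line
     (License_Of ("// SPDX-License-Identifier: GPL-2.0"));
   Ada.Text_IO.Put_Line
     (License_Of ("-- SPDX-License-Identifier: Apache-2.0"));
   Ada.Text_IO.Put_Line
     (License_Of ("# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)"));
end Find_Spdx_Tag;
```

The sketch does not strip a closing block-comment marker such as `*/`; a real implementation would need to handle that case.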
  • 19. Introducing spdx-tool
    ● Why spdx-tool?
      – Needed a way to convert license headers into SPDX license tags (> 3000 files to convert)
      – Wanted to identify licenses by analyzing comment headers
    ● Key requirements:
      – Support for any programming language
      – Recognize standard and custom licenses
      – Parallelized license analysis
  • 20. Spdx-tool usage (1)
    ● Identify licenses used in a project:
    ● List files matching a given license:
        spdx-tool --only-licenses=BSD-3-Clause --files spdx-tool
  • 21. Spdx-tool usage (2)
    ● Print license headers found in files:
        spdx-tool --print-license --line-number src/intl.ads
  • 22. Spdx-tool usage (3)
    ● Replace license headers by their SPDX tag:
        spdx-tool --update=1..2,spdx src
      – “1..2” means keep the original lines 1 to 2
      – To put the SPDX tag first, use “spdx,1..2”
  • 24. spdx-tool numbers
    ● 95 % Ada, 5 % JSON
    ● 192K CLOC of Ada, of which 180K are generated (only 6K CLOC for spdx-tool itself)
    ● 560 license templates
    ● 720 programming languages identified
    ● 1900 source files analyzed per second (on a 24-core AMD machine)
  • 25. Spdx-tool algorithm
    1) Scan files and honor .gitignore rules
    2) Prepare for analysis (read file & identify lines)
    3) Find the programming language
    4) Analyze comments
    5) Identify the license from a repository of license templates (apply change on source file)
    6) Print report
    Parallel execution by a pool of tasks
  • 26. License template repository
    ● License templates from https://spdx.org/:
      – 560 templates in the repository
      – A template defines the rules for matching (strict content, optional content, variable/patterns)
    Template excerpt:
        <<var;name="copyright";original="Copyright (c) <year> <owner>. ";match=".{0,5000}">>
        Redistribution and use in source and binary forms
        <<var;name="theme";original="";match="()|( of the theme)">>, with or without modification,
        <<var;name="tobe";original="are";match="are|is">> permitted provided that the following conditions are met:
  • 27. Embedded license templates
    ● Template files integrated by using Advanced Resource Embedder: https://github.com/stcarrez/resource-embedder
    ● Generates a simple function (perfect hash) to convert a license name into its content
        package SPDX_Tool.Licenses.Files is
           Names_Count : constant := 562;
           Names : constant Name_Array;
           --  Returns the data stream with the given name or null.
           function Get_Content (Name : String) return access constant Buffer_Type;
           …
        end SPDX_Tool.Licenses.Files;
  • 28. Embedded inverted index
    ● A custom tool creates from the license repository:
      – A list of tokens
      – An inverted index
      – Per-license token occurrences
      – All defined as constants, so they are put in the .rodata section
    AFL-1.1: Licensed under the Academic Free License version 1.1.
        T82 : aliased constant Token_Array := --  Tokens used by AFL-1.1
          ((244, 1), (964, 1), (1391, 1), (1392, 1), (5041, 1), (5115, 1));
        L172 : aliased constant License_Index_Array := (13, 49, 82, 83, 84, 85, 86);
        Index : constant Token_Index_Array :=
          ((1, L0'Access), (2, L1'Access), (3, L2'Access), …
           (244, L172'Access)
        --  Token ‘244’ (“Academic”) is used by license 13, 49, 82, …
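To illustrate how an inverted index of this kind can be used, here is a small self-contained sketch. It mirrors the idea of constant per-token license lists reached through `'Access`, but the types, the tiny three-token table and the function `Candidates` are assumptions made for the example, not the generated spdx-tool declarations.

```ada
with Ada.Text_IO;

procedure Inverted_Index_Demo is

   --  Hypothetical, tiny repository: 4 licenses and 3 tokens.
   type License_Index is range 1 .. 4;
   type Token_Index is range 1 .. 3;

   type License_Index_Array is array (Positive range <>) of License_Index;
   type License_List is access constant License_Index_Array;

   --  One constant list per token: the licenses whose template uses it.
   L1 : aliased constant License_Index_Array := (1, 2, 3);
   L2 : aliased constant License_Index_Array := (1 => 2);
   L3 : aliased constant License_Index_Array := (3, 4);

   Index : constant array (Token_Index) of License_List :=
     (L1'Access, L2'Access, L3'Access);

   type License_Set is array (License_Index) of Boolean;
   type Token_Array is array (Positive range <>) of Token_Index;

   --  Union of the licenses that use at least one of the given tokens.
   function Candidates (Tokens : Token_Array) return License_Set is
      Result : License_Set := (others => False);
   begin
      for T of Tokens loop
         for L of Index (T).all loop
            Result (L) := True;
         end loop;
      end loop;
      return Result;
   end Candidates;

   Found : constant License_Set := Candidates ((2, 3));

begin
   for L in Found'Range loop
      if Found (L) then
         Ada.Text_IO.Put_Line
           ("candidate license" & License_Index'Image (L));
      end if;
   end loop;
end Inverted_Index_Demo;
```

With tokens 2 and 3 the candidate set is licenses 2, 3 and 4; license 1 is never examined. That pruning is the reason to build the index ahead of time.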
  • 29. Spdx-tool architecture
    [Diagram: Ada packages of SPDX_Tool (labels: Files, Manager, Reader, Templates, Repository, Configs, Defaults, Infos, Licenses, Reports, Languages, Generated, Mimes, Modelines, Rules, Shell, Extensions, Filenames) and the Alire crates used (ada_toml, utilada, magicada, intl, utilada_xml, sciada, ansiada, printer_toolkit), annotated with steps [1] to [6].]
  • 30. Scan files and honor .gitignore
    [Diagram: the architecture diagram with step [1] highlighted (Files, Manager, Infos); Alire crates shown: utilada, intl.]
  • 31. Step 1: Scan files (main task)
    ● Identify files that must be analyzed:
      – Scan the directory tree, handle .gitignore files
      – Scan and filter support provided by the Ada Utility Library (using Ada.Directories); see the tree.adb example in the Ada Utility Library
      – Add the file path to a queue for analysis
        with Util.Files.Walk;
        with Util.Files.Filters;
  • 32. Language detection and comments analysis
    [Diagram: the architecture diagram with steps [2], [3] and [4] highlighted (Languages: Manager, Generated, Mimes, Modelines, Rules, Shell, Extensions, Filenames); Alire crates shown: utilada, magicada, intl.]
  • 33. Step 2: prepare for analysis
    ● File analysis is made by a dedicated task (which picks a file to analyze from the queue):
      – Avoid sharing data across tasks
      – Avoid switching tasks during analysis
    ● Preparation for analysis:
      – Read the content as binary (Stream_Element, 8 KB)
      – Prepare a data structure to represent each line and find line boundaries (within the 8 KB buffer)
        with Util.Executors;
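The queue-plus-dedicated-tasks arrangement described in steps 1 and 2 can be sketched with plain Ada tasking. This is an illustration only: spdx-tool relies on `Util.Executors`, whose API is not shown in the slides, so the sketch uses a protected object and a task type from the language itself. `Queue`, `Counter` and `Worker` are hypothetical names, and the "analysis" is reduced to counting files.

```ada
with Ada.Text_IO;
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Containers.Vectors;

procedure Worker_Pool_Demo is

   package Path_Vectors is new Ada.Containers.Vectors
     (Index_Type => Positive, Element_Type => Unbounded_String);

   --  Queue shared by the scanner (producer) and the analysis tasks.
   protected Queue is
      procedure Push (Path : in String);
      procedure Close;
      --  Blocks until a path is available or the queue is closed.
      entry Pop (Path : out Unbounded_String; Done : out Boolean);
   private
      Items  : Path_Vectors.Vector;
      Closed : Boolean := False;
   end Queue;

   protected body Queue is
      procedure Push (Path : in String) is
      begin
         Items.Append (To_Unbounded_String (Path));
      end Push;

      procedure Close is
      begin
         Closed := True;
      end Close;

      entry Pop (Path : out Unbounded_String; Done : out Boolean)
        when Closed or else not Items.Is_Empty is
      begin
         if Items.Is_Empty then
            Path := Null_Unbounded_String;
            Done := True;
         else
            Path := Items.First_Element;
            Items.Delete_First;
            Done := False;
         end if;
      end Pop;
   end Queue;

   --  Counts analyzed files; stands in for the real report data.
   protected Counter is
      procedure Increment;
      function Value return Natural;
   private
      Count : Natural := 0;
   end Counter;

   protected body Counter is
      procedure Increment is
      begin
         Count := Count + 1;
      end Increment;

      function Value return Natural is
      begin
         return Count;
      end Value;
   end Counter;

   task type Worker;

   task body Worker is
      Path : Unbounded_String;
      Done : Boolean;
   begin
      loop
         Queue.Pop (Path, Done);
         exit when Done;
         --  A real worker would read and analyze the file here,
         --  keeping all per-file data local to this task.
         Counter.Increment;
      end loop;
   end Worker;

begin
   declare
      Workers : array (1 .. 4) of Worker;
   begin
      for I in 1 .. 10 loop
         Queue.Push ("file" & Integer'Image (I));
      end loop;
      Queue.Close;
   end;  --  Leaving the block waits for every worker to terminate.
   Ada.Text_IO.Put_Line ("analyzed" & Natural'Image (Counter.Value));
end Worker_Pool_Demo;
```

Each worker owns the file it popped until it is done with it, so only the queue and the counter are shared. This matches the slide's goal of avoiding shared data across tasks.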
  • 34. Step 3: find the language
    ● Identify the language used in the source file (heavily inspired by the Linguist project):
      – File extension heuristic mapping (JSON)
      – Disambiguation rules (JSON)
      – Emacs and vi modeline in the header (-*-mode:..-*-)
      – By using libmagic (only with the --mimes option)
      – Unix shell identification (‘#!’ on the first line), mapping the shell interpreter to a language
    ● Identify generated files by looking at patterns
  • 35. Step 3: find language
    ● Language detector interface declared as a private interface, with detectors in private child packages
        package SPDX_Tool.Languages is
           ...
        private
           type Detector_Type is limited interface;
           procedure Detect (Detector : in Detector_Type;
                             File     : in File_Info;
                             Content  : in out File_Type;
                             Result   : in out Detector_Result) is abstract;
        end SPDX_Tool.Languages;
  • 36. Step 4: analyze comments
    ● Extract useful content from comment headers:
      – Identify most comment styles
        ● Block comments: /* .. */ {* .. *} <!-- .. --> ### .. ### (* .. *) {- .. -} <# .. #> “ .. ”
        ● Line comments: // .. %% .. .. -- .. % .. dnl .. # .. ; ..
      – Ignore presentation markers
      – Collect per-line tokens for license search
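As a rough illustration of the line-comment case, the sketch below strips a given marker from a line and returns the remaining text. It assumes the marker for the file's language is already known. `Comment_Text` is a hypothetical helper built on the standard `Ada.Strings.Fixed` package, not the spdx-tool analyzer, and it ignores block comments and presentation markers.

```ada
with Ada.Text_IO;
with Ada.Strings;
with Ada.Strings.Fixed;

procedure Line_Comment_Demo is
   use Ada.Strings.Fixed;

   --  Return the text after a line-comment Marker, or "" when the
   --  trimmed line does not start with that marker.
   function Comment_Text (Line, Marker : String) return String is
      Trimmed : constant String := Trim (Line, Ada.Strings.Both);
   begin
      if Trimmed'Length >= Marker'Length
        and then Trimmed
          (Trimmed'First .. Trimmed'First + Marker'Length - 1) = Marker
      then
         return Trim
           (Trimmed (Trimmed'First + Marker'Length .. Trimmed'Last),
            Ada.Strings.Both);
      end if;
      return "";
   end Comment_Text;

begin
   Ada.Text_IO.Put_Line
     (Comment_Text ("%% you may not use this file except in compliance",
                    "%%"));
   Ada.Text_IO.Put_Line
     (Comment_Text (";; This file is part of GNU Emacs.", ";;"));
   Ada.Text_IO.Put_Line
     (Comment_Text ("   -- SPDX-License-Identifier: Apache-2.0", "--"));
end Line_Comment_Demo;
```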
  • 37. Step 4: result of analysis
    Example line:
        %% Licensed under the Apache License, Version 2.0 (the "License");
        type Line_Type is record
           Comment    : Comment_Mode := NO_COMMENT;
           Style      : Comment_Info;
           Line_Start : Buffer_Index := 1;
           Line_End   : Buffer_Size := 0;
           Tokens     : SPDX_Tool.Counter_Arrays.Array_Type;
           Licenses   : License_Index_Map := SPDX_Tool.EMPTY_MAP;
        end record;
        type Line_Array is array (Infos.Line_Number range <>) of Line_Type;
    [Diagram: the example line annotated with Line_Start, Line_End, Comment.Text_Start and Comment.Text_Last, with Comment := LINE_COMMENT and Tokens holding the counts of the words of the line (under, Apache, License, Licensed, Version) in token columns C309, C1393, C1394, C5063, C2132, ...]
  • 39. Step 5: strict license identification
    ● License identification algorithms:
      – Check for SPDX-License-Identifier: tags → if found, we are done (report as ‘SPDX’)
      – Build a per-line bitmap of possible licenses (a bit is set if the line contains at least one token of the license)
      – Scan the license templates which are referenced in the bitmaps → if there is an exact match, we are done (report as ‘TMPL’)
      – License exception support is not finished
  • 40. Step 5: guessing the license
    ● Tried several similarity algorithms (Jaccard, Sørensen-Dice, Tversky)
    ● Introducing tf-idf (used by search engines for ranking):
      – term frequency in a document
      – inverse document frequency
    ● Introducing cosine similarity
    ● Compute tf-idf and cosine similarity for each license template in the repository against the extracted license
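The formulas on this slide were images and did not survive in the transcript. For reference, the textbook definitions are given below. Which exact variants spdx-tool uses (smoothing, logarithm base, normalization) is not stated in the slides, so these are the standard forms only. Here $f_{t,d}$ is the number of occurrences of token $t$ in document $d$, $N$ is the number of license templates, $n_t$ is the number of templates containing $t$, and $A$, $B$ are the weighted token vectors being compared.

```latex
\[
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
w_{t,d} = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
\]
\[
\cos(A,B) = \frac{A\cdot B}{\lVert A\rVert\,\lVert B\rVert}
          = \frac{\sum_t A_t B_t}{\sqrt{\sum_t A_t^2}\,\sqrt{\sum_t B_t^2}}
\]
```

A token that appears in almost every template gets an idf close to zero and contributes little to the similarity, while a rare token weighs heavily.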
  • 41. Step 5: compute cosine similarity
    [Figure: the token count vector of the phrase “Licensed under the Apache License” over token columns C309, C1393, C1394, C5063, ...; the IDF weights of those columns (1.3, 1.3, 0.2, 0.9, ...); the token rows R1 to R560 of the license templates, with rows R107, R108, R109 and R416 labelled Apache-1.0, Apache-1.1, Apache-2.0 and Pixar; and the resulting cosine similarities (0.95, 0.83, 0.56, 0.56, 0.01).]
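A minimal sketch of the cosine similarity computation follows. It uses dense `Float` vectors and the standard `Ada.Numerics.Elementary_Functions` package. The real tool uses sparse arrays from the sciada crate instead, and the weights below are invented for the example rather than taken from the figure.

```ada
with Ada.Text_IO;
with Ada.Numerics.Elementary_Functions;

procedure Cosine_Demo is

   type Vector is array (Positive range <>) of Float;

   --  Cosine of the angle between A and B, which must share the same
   --  index range; returns 0.0 when either vector is all zeros.
   function Cosine (A, B : Vector) return Float is
      use Ada.Numerics.Elementary_Functions;
      Dot, Norm_A, Norm_B : Float := 0.0;
   begin
      for I in A'Range loop
         Dot    := Dot + A (I) * B (I);
         Norm_A := Norm_A + A (I) * A (I);
         Norm_B := Norm_B + B (I) * B (I);
      end loop;
      if Norm_A = 0.0 or else Norm_B = 0.0 then
         return 0.0;
      end if;
      return Dot / (Sqrt (Norm_A) * Sqrt (Norm_B));
   end Cosine;

   --  Hypothetical weighted token vectors (one column per token).
   Header     : constant Vector := (1.0, 1.0, 1.0, 0.0);
   Template_A : constant Vector := (1.0, 1.0, 1.0, 1.0);
   Template_B : constant Vector := (0.0, 0.0, 0.0, 3.0);

begin
   Ada.Text_IO.Put_Line
     ("Template_A:" & Float'Image (Cosine (Header, Template_A)));
   Ada.Text_IO.Put_Line
     ("Template_B:" & Float'Image (Cosine (Header, Template_B)));
end Cosine_Demo;
```

`Template_A` shares three of its four tokens with the header and scores about 0.87, while `Template_B` shares none and scores 0.0. The template with the highest score would be reported as the guessed license.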
  • 42. Step 5: cosine similarity
    ● Implemented with the sciada crate:
      – Sparse arrays as coordinate lists
      – Provides vectorizers and transformers (convert data into numerical vectors)
      – Similarities (Jaccard, …, Cosine)
      – See similar.adb for a complete example
  • 43. Sciada package instantiations
        package SPDX_Tool.Counter_Arrays is
          new SCI.Sparse.COO_Arrays (Row_Type      => License_Index,
                                     Column_Type   => Token_Index,
                                     Value_Type    => Count_Type,
                                     Default_Value => 0);

        function To_Float (Value : Float) return Float is (Value);

        package Freq_Transformers is
          new SCI.Vectorizers.Transformers (Frequency_Type => Float,
                                            Arrays         => SPDX_Tool.Counter_Arrays,
                                            Convert        => To_Float);

        type Confidence_Type is delta 0.001 range 0.0 .. 1.0;

        package Confidence_Numbers is
          new SCI.Numbers.Number (Confidence_Type, "*" => Mul, "/" => Div);

        package Confidence_Conversions is
          new SCI.Numbers.Conversion (Confidence_Numbers);

        package Similarities is
          new SCI.Similarities.COO_Arrays (Arrays      => Freq_Transformers.Frequency_Arrays,
                                           Conversions => Confidence_Conversions,
                                           To_Float    => To_Float);
  • 44. Print report
    [Diagram: the architecture diagram with step [6] highlighted (Reports); Alire crates shown: utilada, intl, utilada_xml, ansiada, printer_toolkit.]
  • 45. Conclusion
    ● Lessons learned:
      – Write custom generation tools to speed up the project (fast inverted index generated at compilation time)
      – It is easy to get lost in types defined by generic packages
      – Good performance with tasks working on dedicated data sets (What Every Programmer Should Know About Memory by Ulrich Drepper, 2007)
    ● Improvements needed on:
      – License detection (heuristics, priorities, ...)
      – Knowledge of languages and their comments
      – License detection in images (EXIF) or PDFs