SOOCon24 - From OpenAI to Opensource AI: Navigating Between Commercial Ownership and Collaborative Openness

From OpenAI
to Open Source AI
Navigating Between Commercial Ownership and Collaborative Openness
https://stateofopencon.com/ #stateofopencon #soocon24 #openuk
https://hachyderm.io/@openuk
Raphaël Semeteys (and Luxin Zhang) - Worldline

Introduction
Raphaël Semeteys
• Open source since 1997, professionally since 2004
• Yoga Teacher, Creator of the QSOS method
• Head of DevRel at Worldline
7000+ engineers in over 40 countries
Managing 43+ billion transactions per year
€250M spent in R&D every year
Handling 150+ payment methods
We design payments technology that powers the growth of millions of businesses around the world

The early days of LLMs
From rule-based and simpler statistical models to LLMs
• 2010s: Word embeddings such as Word2Vec and GloVe
• 2017-2018: “Attention Is All You Need”, Transformers, BERT
• 2020s: Generative AI, ChatGPT, responsibility concerns

GenAI is having its Linux Moment
• Just like open source and the Internet, but much faster!
• Dynamics between collaborative openness and commercial ownership
• Need for clarity on licenses
Labs & Universities, Individuals, Enterprises, Commodities

Defining Openness of an LLM
• Pre-training Dataset
• Fine-tuning Dataset
• Reward Model
• Model
• Data Processing Code

Defining Openness of an LLM
Components scored: Model (weights), Pre-training Dataset, Fine-tuning Dataset, Reward model, Data Processing Code
Score - Level: Description
0 - Closed: No access to any public information, data or asset
1 - Published research only: Research paper(s) published but with no more information, data or asset
2 - Restricted access: Access to asset is possible only with special agreement (commercial, research…)
3 - Open with limitations: Access and reuse of asset is possible but with certain limitations on usage (ex. Open RAIL)
4 - Totally open: Access and reuse of asset is possible without restriction (ex. open source license)
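
The scale above can be sketched in code. This is only an illustrative encoding, not something shown in the talk: the names `Openness`, `COMPONENTS` and `make_profile` are assumptions, and the example scores are hypothetical rather than taken from any slide.

```python
from enum import IntEnum


class Openness(IntEnum):
    """The five levels of the openness scale (0 = closed, 4 = totally open)."""
    CLOSED = 0
    PUBLISHED_RESEARCH_ONLY = 1
    RESTRICTED_ACCESS = 2
    OPEN_WITH_LIMITATIONS = 3
    TOTALLY_OPEN = 4


# The five components that each receive a score.
COMPONENTS = (
    "model",
    "pre-training dataset",
    "fine-tuning dataset",
    "reward model",
    "data processing code",
)


def make_profile(scores):
    """Check that every component is scored and convert ints to Openness levels.

    Raises ValueError if a component is missing or a score is outside 0..4.
    """
    missing = [c for c in COMPONENTS if c not in scores]
    if missing:
        raise ValueError(f"missing components: {missing}")
    return {c: Openness(scores[c]) for c in COMPONENTS}


# Hypothetical example values, for illustration only.
example = make_profile({
    "model": 3,
    "pre-training dataset": 1,
    "fine-tuning dataset": 2,
    "reward model": 0,
    "data processing code": 4,
})

for component, level in example.items():
    print(f"{component}: {int(level)} ({level.name})")
```

Using an integer enum keeps the levels comparable, so scores for different components or models can be sorted or filtered directly.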

Market-Leading Player: OpenAI
Deviation from original vision of research transparency & openness
Non/For-profit (US)
Component: Score, Level description
Model: 4, Totally open
Dataset: 1, Published research only
Code: 1, Published research only
0, Closed

GPT-1 & 2, GPT-3 & 4, ChatGPT
research paper only
No training of other commercial LLMs

“You may not: […] Use Output to develop models that compete with OpenAI.”

Market-Leading Player: Google
Transition from open research to proprietary commercial approach
Enterprise (US)
Component: Score, Level description
Dataset: 2, Restricted access
Code: 4, Totally open
BERT, PaLM 2 & Gemini
1, Published research only
1, Published research only
0, Closed


Market-Leading Player: Meta
Journey to openness
Enterprise (US)
Component: Score, Level description
Dataset: 3, Open with limitations
Code: 4, Totally open
RoBERTa, Llama 2
3, Open with limitations
1, Published research only
1, Published research only

Restriction on usage: license for platforms with 700M+ users

Additional Commercial Terms. If, on the Llama 2 version release date, the
monthly active users of the products or services made available by or for
Licensee, or Licensee’s affiliates, is greater than 700 million monthly active
users in the preceding calendar month, you must request a license from
Meta, which Meta may grant to you in its sole discretion, and you are not
authorized to exercise any of the rights under this Agreement unless or
until Meta otherwise expressly grants you such rights.

Llama offspring: Alpaca and Vicuna
Fine-tuned models from Llama 2 by universities
Research (US)
Component: Score, Level description
Model: 3, Open with limitations
Pre-training Dataset: 1, Published research only
Fine-tuning Dataset: 2, Research use only
Code: 4, Under Apache 2 license
Restrictions from both Llama 2 and OpenAI (ShareGPT)
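
A minimal sketch of how restrictions accumulate in a derived model, assuming each upstream source simply contributes its own restrictions to anything built on it. The dictionary keys, the wording of the restrictions and the function name `inherited_restrictions` are illustrative assumptions, not legal text.

```python
# Restrictions summarized from the slides; labels and wording are illustrative.
UPSTREAM_RESTRICTIONS = {
    "Llama 2": {
        "license from Meta required above 700M monthly active users",
    },
    "ShareGPT": {
        "output may not be used to develop models that compete with OpenAI",
    },
}


def inherited_restrictions(sources):
    """Return the union of the restrictions attached to every upstream source."""
    combined = set()
    for source in sources:
        combined |= UPSTREAM_RESTRICTIONS[source]
    return combined


# A model fine-tuned from Llama 2 on ShareGPT conversations carries both sets.
vicuna_like = inherited_restrictions(["Llama 2", "ShareGPT"])
for restriction in sorted(vicuna_like):
    print("-", restriction)
```

The point of the sketch is that a fine-tuned model can never be less restricted than the base model and datasets it was built from.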

Collaborative foundational LLMs
Dataset fuzziness: please refer to the specific license depending on the subset you use
Notion of responsible usage
EleutherAI GPT-J, Non-profit (US)
• Model: 4, Access and reuse without restriction
• Dataset: 3, Open with limitations
• Code: 4, Completely open
Falcon, Research (UAE)
• Model: 3, Open with limitations
• Dataset: 4, Access and reuse without restriction
• Code: 1, General instructions
BLOOM, Research (EU)
• Model: 3, Open RAIL license
• Dataset: 3, Open with limitations
• Code: 4, Completely open
OpenLLaMA, Research (US)
• Model: 4, Access and reuse without restriction
• Dataset: 4, Access and reuse without restriction
• Code: 1, Just examples
Mistral/Mixtral, Enterprise (FR)
• Model: 4, Access and reuse without restriction
• Dataset: 0, No public information or access
• Code: 4, Completely open
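
The scores on this slide can be compared programmatically. The sketch below transcribes them as (model, dataset, code) tuples; the helper names and the idea of filtering on a minimum score are assumptions added for illustration, not a method from the talk.

```python
# Scores transcribed from the slide, in the order (model, dataset, code).
COMPONENTS = ("model", "dataset", "code")
SCORES = {
    "EleutherAI GPT-J": (4, 3, 4),
    "Falcon": (3, 4, 1),
    "BLOOM": (3, 3, 4),
    "OpenLLaMA": (4, 4, 1),
    "Mistral/Mixtral": (4, 0, 4),
}


def least_open_component(name):
    """Return (component, score) for the lowest-scoring component of a model."""
    scores = SCORES[name]
    lowest = min(scores)
    return COMPONENTS[scores.index(lowest)], lowest


def models_meeting(minimum):
    """List the models whose every component scores at least `minimum`."""
    return [name for name, scores in SCORES.items() if min(scores) >= minimum]


for name in SCORES:
    component, score = least_open_component(name)
    print(f"{name}: least open component is {component} ({score})")

print("All components at 3 or above:", models_meeting(3))
```

Looking at the lowest score rather than an average reflects the idea that one closed component is enough to limit reuse of the whole.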

This license is, in part, based on the Apache License Version 2.0,
with a series of modifications. The contribution of the Apache
License 2.0 to the framing of this document is acknowledged.
Please read this license carefully, as it is different to other ‘open
access’ licenses you may have encountered previously. Use of
Falcon180B for hosted services may require a separate license.

Collaborative fine-tuned LLMs
Impact of foundational model or pre-training datasets
Dolly, Enterprise (US)
• Model: 4, Based on GPT-J
• Pre-training Dataset: 3, Based on GPT-J
• Fine-tuning Dataset: 4, Access and reuse without restriction
• Reward model: 0, No public information available
• Code: 4, Open source
BLOOMChat, Enterprise (US)
• Model: 3, Based on BLOOM
• Pre-training Dataset: 3, Based on BLOOM
• Fine-tuning Dataset: 4, Dolly and LAION
• Reward model: 0, No public information available
• Code: 3, OpenRAIL
Zephyr, Enterprise (US)
• Model: 4, Based on Mistral
• Pre-training Dataset: 0, Based on Mistral
• Fine-tuning Dataset: 2, Research use only (OpenAI)
• Reward model: 3, Paper and code examples
• Code: 3, Example code available
LLM360, Consortium (UAE/US)
• Model: 4, Open source
• Pre-training Dataset: 4, RedPajama, Falcon, StarCoder
• Fine-tuning Dataset: 2, Research use only (OpenAI)
• Reward model: 0, No public information available
• Code: 4, Open source

BLOOMChat Use Restrictions
l. To provide medical advice and medical results interpretation; or
m. To generate or disseminate information for the purpose to be used for
administration of justice, law enforcement, immigration or asylum processes,
such as predicting an individual will commit fraud/crime
commitment.

Collaboration platform: Hugging Face
• Startup and ecosystem dedicated to democratizing AI
• Open source Transformers library
• LLM leaderboard: upload and assess models
• The “GitHub of AI”
• Collaborative space for exploring, sharing and experimenting with AI
• Hosts thousands of models, datasets, and demo applications
Enabler for collaboration and reuse

Hosting and resource paradigms
• Big players invest billions (Microsoft/OpenAI, AWS/Anthropic)
• CSPs selling shovels in the AI gold rush
Source: numind.ai
Closed models are centralized and resource-consuming

Hosting and resource paradigms
• Democratizing AI Computing
• Quantization, AI Chips
• Run models locally, in containers
• Emergence of smaller models for edge and mobile
• Small/Tiny Language Models: Gemini Nano, Microsoft Phi-2, Huawei TinyBERT
• Domain Specific Language Models: BloombergGPT, Harvey (law)
• Mixture of models: Mixtral 8x7B, OpenMoE → Mixture of licenses?
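
Quantization, mentioned above as one way to run models on smaller hardware, can be illustrated with a minimal sketch. It assumes simple symmetric 8-bit quantization with a single scale for the whole list of weights; real toolchains use more elaborate schemes, and the function names here are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Avoid dividing by zero when every weight is 0.
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print("quantized:", quantized)
print("max error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight is stored as a small integer instead of a 32-bit float, which cuts memory use at the cost of a rounding error bounded by about half the scale.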

Key takeaways
• Hyper-centralization leads to black boxes and closed solutions
• Openness
• Fosters collaboration and fuels community-driven innovation
• Enables inclusivity
• Just like open source software, beware of licenses and restrictions
• AI's democratization continually reshapes the landscape

Thank you
Raphaël Semeteys - Worldline
@RaphaelSemeteys
https://dev.to/raphiki
Check the two-part article co-written with Luxin Zhang

Image credits
• Opensource, Internet & GenAI evolution image generated with DALL-E
• Robot evolution from Freepik
• LLMs’ #parameters evolution from numind.ai
• Shovels in Gold rush image generated with DALL-E
• Logos from official websites
• Coffee cups from Freepik
