AI_dev Europe 2024 - From OpenAI to Opensource AI

Payments to grow your world
Navigating between Commercial Ownership and Collaborative Openness
Raphaël Semeteys
Head of DevRel
Open Source Expert
Senior Architect at Worldline
19 June 2024
Paris, France
From OpenAI to Open Source AI

We design payments technology that powers the growth of millions of businesses around the world.
• 7000+ engineers in over 40 countries
• Managing 43+ billion transactions per year
• €250M spent on R&D every year
• Handling 150+ payment methods

The early days of LLMs
From rule-based and simpler statistical models to LLMs
• 2010s: word embeddings (Word2Vec, GloVe)
• 2017-2018: “Attention Is All You Need”, Transformers
• 2020s: GenAI, ChatGPT, responsibility concerns
• Tomorrow? Small Language Models; mobile, agents & LAMs

GenAI is having its Linux Moment
• Just like open source and the Internet, but much faster!
• Dynamics between collaborative openness and commercial ownership
• Need for clarity on licenses
Labs & Universities
Individuals
Enterprises
Commodities

Defining Openness of a Model
• Pre-training Dataset
• Fine-tuning Dataset
• Reward Model
• Model
• Data Processing Code

Defining Openness of a Model
Components scored: Model (weights), Pre-training Dataset, Fine-tuning Dataset, Reward model, Data Processing Code

Score | Level | Description
0 | Closed | No access to any public information, data or asset
1 | Published research only | Research paper(s) published but with no more information, data or asset
2 | Restricted access | Access to asset is possible only with special agreement (commercial, research…)
3 | Open with limitations | Access and reuse of asset is possible but with certain limitations on usage
4 | Totally open | Access and reuse of asset is possible without restriction on usage (e.g. open source license)
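The scoring grid can be sketched as a small data structure. This is only an illustrative Python sketch, not tooling from the talk: the names `Openness`, `OpennessProfile`, `rated` and `least_open` are hypothetical, and the example scores are the ones shown on the Alpaca and Vicuna slide of this deck (no reward model score is given there, so it is left unrated).

```python
from dataclasses import dataclass, fields
from enum import IntEnum
from typing import Dict, Optional, Tuple


class Openness(IntEnum):
    """The five levels of the scoring grid (0 = closed, 4 = totally open)."""
    CLOSED = 0
    PUBLISHED_RESEARCH_ONLY = 1
    RESTRICTED_ACCESS = 2
    OPEN_WITH_LIMITATIONS = 3
    TOTALLY_OPEN = 4


@dataclass(frozen=True)
class OpennessProfile:
    """One score per component; None means the component is not rated."""
    model: Optional[Openness] = None
    pretraining_dataset: Optional[Openness] = None
    finetuning_dataset: Optional[Openness] = None
    reward_model: Optional[Openness] = None
    data_processing_code: Optional[Openness] = None

    def rated(self) -> Dict[str, Openness]:
        """Return only the components that carry a score."""
        values = {f.name: getattr(self, f.name) for f in fields(self)}
        return {name: score for name, score in values.items() if score is not None}

    def least_open(self) -> Tuple[str, Openness]:
        """Return the rated component with the lowest score.

        Raises ValueError if no component is rated.
        """
        return min(self.rated().items(), key=lambda item: item[1])


# Scores taken from the Alpaca and Vicuna slide of this deck.
alpaca_vicuna = OpennessProfile(
    model=Openness.OPEN_WITH_LIMITATIONS,
    pretraining_dataset=Openness.PUBLISHED_RESEARCH_ONLY,
    finetuning_dataset=Openness.RESTRICTED_ACCESS,
    data_processing_code=Openness.TOTALLY_OPEN,
)

name, score = alpaca_vicuna.least_open()
print(name, score.name, int(score))  # pretraining_dataset PUBLISHED_RESEARCH_ONLY 1
```

Keeping one score per component, rather than a single number per model, mirrors the grid: a model can have open weights while its datasets stay closed.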

Market-Leading Player: OpenAI
Deviation from original vision of research transparency & openness
Non/For-profit (US)
Component | Score | Level description
Model | 4 | Totally open
Dataset | 1 | Published research only
Code | 1 | Published research only
GPT-1 & 2 → GPT-3.x & 4.x/o, ChatGPT: 0 Closed (research paper only)

Market-Leading Player: OpenAI
GPT-3.x & 4.x/o: no training of other commercial LLMs
“You may not: […] Use Output to develop models that compete with OpenAI.”

Market-Leading Player: Google
Transition from open research to a pragmatic approach
Enterprise (US)
Component Score Level description
Dataset 2 Restricted access
Code 4 Totally open
1 Published research only
1 Published research only
0 Closed
→
3 Open with limitations
1 Published research only
4 Toolchain available
↔

Market-Leading Player: Google
You may not use nor allow others to use Gemma or
Model Derivatives to: [illegal activities, unlicensed
practice of a profession, abuse, security bypass,
promotion of hatred, abuse or violence, monitoring people
without consent, misinformation/defamation, automated
decisions concerning human rights and well-being, etc.]
Responsible AI contradicts Open Source Definition

Other Big Players
Catching up and making their mark in the GenAI Gold Rush
Partner for Infrastructure (inference and training)
Create their own (open) models

Market-Leading Player: Meta
Journey to openness
Enterprise (US)
Component Score Level description
Dataset 3 Open with limitations
Code 4 Totally open
RoBERTa
3 Open with limitations
1 Published research only
1 Published research only
→

Journey to openness
Restriction on usage: separate license required for platforms with 700M+ monthly active users
Additional Commercial Terms. If, on the Llama 2 version release date,
the monthly active users of the products or services made available by or
for Licensee, or Licensee’s affiliates, is greater than 700 million monthly
active users in the preceding calendar month, you must request a license
from Meta, which Meta may grant to you in its sole discretion, and you
are not authorized to exercise any of the rights under this Agreement
unless or until Meta otherwise expressly grants you such rights.

Journey to openness
Llama 3 is now more restrictive on redistribution and reuse
Redistribution and Use. If you distribute or make available the Llama Materials (or any
derivative works thereof), or a product or service that uses any of them, including
another AI model, you shall (A) provide a copy of this Agreement with any such Llama
Materials; and (B) prominently display “Built with Meta Llama 3” on a related website,
user interface, blogpost, about page, or product documentation. If you use the Llama
Materials to create, train, fine tune, or otherwise improve an AI model, which is
distributed or made available, you shall also include “Llama 3” at the beginning of any
such AI model name.

Llama 2 offspring: Alpaca and Vicuna
Fine-tuned models from Llama 2 by universities
Research (US)
Component | Score | Level description
Model | 3 | Open with limitations
Pre-training Dataset | 1 | Published research only
Fine-tuning Dataset | 2 | Research use only
Code | 4 | Under Apache 2 license
Restrictions from both Llama 2 and OpenAI (ShareGPT)
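How restrictions flow downstream can be sketched as a walk over a dependency graph. This is a hedged illustration only: the `UPSTREAM` graph is a simplified assumption built from this slide's note (Vicuna depending on Llama 2 and on ShareGPT, which in turn carries OpenAI output), the restriction strings paraphrase the license excerpts quoted in this deck, and the function name `inherited_restrictions` is hypothetical.

```python
from typing import Dict, List, Set

# Assumed, simplified dependency graph: asset -> assets it was built from.
UPSTREAM: Dict[str, List[str]] = {
    "Vicuna": ["Llama 2", "ShareGPT"],
    "ShareGPT": ["OpenAI output"],
    "Llama 2": [],
    "OpenAI output": [],
}

# Paraphrased from the license excerpts quoted in this deck.
RESTRICTIONS: Dict[str, List[str]] = {
    "Llama 2": ["license required for platforms with 700M+ monthly active users"],
    "OpenAI output": ["no use of output to develop competing models"],
}


def inherited_restrictions(asset: str) -> List[str]:
    """Collect the restrictions of an asset and of everything upstream of it."""
    seen: Set[str] = set()
    found: List[str] = []
    stack = [asset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against shared or cyclic dependencies
        seen.add(current)
        found.extend(RESTRICTIONS.get(current, []))
        stack.extend(UPSTREAM.get(current, []))
    return sorted(found)


for restriction in inherited_restrictions("Vicuna"):
    print(restriction)
```

The fine-tuned model adds no restriction of its own in this sketch, yet it ends up carrying both upstream ones.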

Collaborative foundational LLMs
Component | EleutherAI GPT-J | Falcon | BLOOM | OpenLLaMa | Mistral
Type | Non-profit (US) | Research (UAE) | Research (EU) | Research (US) | Enterprise (FR)
Model | 4 Access and reuse without restriction | 3 Open with limitations | 3 Open RAIL license | 4 Access and reuse without restriction | 4 Access and reuse without restriction
Dataset | 3 Open with limitations | 4 Access and reuse without restriction | 3 Open with limitations | 4 Access and reuse without restriction | 0 No public information or access
Code | 4 Completely open | 1 General instructions | 4 Completely open | 1 Just examples | 4 Completely open
Dataset fuzziness: please refer to the specific license depending on the subset you use
Notion of responsible usage

Collaborative foundational LLMs
Modified open source licenses
This license is, in part, based on the Apache License Version 2.0, with a
series of modifications. The contribution of the Apache License 2.0 to
the framing of this document is acknowledged. Please read this license
carefully, as it is different to other ‘open access’ licenses you may have
encountered previously. Use of Falcon180B for hosted services may
require a separate license.

Mistral AI’s French sauce
Navigating both open and closed waters
Just like with Open Source, rise of Community vs. Enterprise
Mix of AI Models
• Sparse Mixture-of-Experts (SMoE): Mixtral 8x7B, 8x22B
• Foundational and fine-tuned models
Mix of Business Models & Licenses
• “Open Source” models, mistral-finetune SDK
• Commercial: optimized Small, Large & Embed Models
• Sustainable openness: new non-production license for Codestral

Mistral AI’s French sauce
Navigating both open and closed waters
Just like with Open Source, revisiting openness in the cloud era
MNPL - 3.2. Usage Limitation
- You shall only use the Mistral Models and Derivatives (whether or not created
by Mistral AI) for testing, research, Personal, or evaluation purposes in Non-
Production Environments;
- Subject to the foregoing, You shall not supply the Mistral Models or
Derivatives in the course of a commercial activity, whether in return for
payment or free of charge, in any medium or form, including but not limited to
through a hosted or managed service (e.g. SaaS, cloud instances, etc.), or
behind a software layer.

Collaborative fine-tuned LLMs
Impact of foundational models or pre-training datasets
Component | Dolly | BLOOMChat | Zephyr | LLM360 | OLMo-Instruct
Type | Enterprise (US) | Enterprise (US) | Enterprise (US) | Consortium (UAE/US) | Research (US)
Model | 4 Based on GPT-J | 3 Based on BLOOM | 4 Based on Mistral | 4 Open source | 4 Open source
Pre-training Dataset | 3 Based on GPT-J | 3 Based on BLOOM | 0 Based on Mistral | 4 RedPajama, Falcon, StarCoder | 3 Dolma (ImpACT MR)
Fine-tuning Dataset | 4 Access and reuse without restriction | 4 Dolly and LAION | 2 Research use only (OpenAI) | 2 Research use only (OpenAI) | 3 Tülu 2 (ImpACT LR)
Reward model | 0 No public information available | 0 No public information available | 3 Paper and code examples | 0 No public information available | 4 UltraFeedback (MIT)
Code | 4 Open source | 3 OpenRAIL | 3 Example code available | |

Collaborative fine-tuned LLMs
AI2 ImpACT Licenses - Restrictions
[…] a. military weapons purposes […]
b. purposes of military surveillance […]
c. purposes of generating or disseminating information or content […] without
expressly and intelligibly disclaiming that the text is machine generated;
d. purposes of ‘real time’ remote biometric processing […]
e. fully automated decision-making without a human in the loop […] as spreading
misinformation […]
f. purposes of the predictive administration of justice, law enforcement, immigration,
or asylum processes, such as predicting an individual will commit fraud/crime
Responsible AI contradicts Open Source Definition

Other aspects of GenAI’s Linux Moment
Democratize and Decentralize (re)use and innovation
Collaborative Tools & Ecosystems
• Notebooks
• Communities
• New Business Models
Hardware Optimization
• AI Chips
• Quantization
• Decentralization
Open Source Tools & Frameworks
• Do One Thing Well
• Interoperable Standards
• Beyond Python

Key takeaways
• Closed APIs → Open Weights → Free AI (as in freedom)
• Datasets and upstream transitivity
• Competitive clauses
• Responsible AI restrictions
• Open Research → Competitive Market → Coopetitive Ecosystem
• Openness fosters reuse and collaboration
• Collaboration brings commoditization and innovation
Just like Open Source!

Thank you
Raphaël Semeteys - Worldline
@RaphaelSemeteys
raphiki.github.io
