About the value of personal data
Abstract:
“Data” is becoming a key production factor, comparable in importance to capital, land, or labour in an increasingly digital economy. Due to the unique characteristics of “data” as an economic good, data markets are underperforming, concentrated and geographically imbalanced, which undermines the immense potential of machine learning and artificial intelligence.
Aiming to unlock this potential, numerous entities trading data over the Internet have entered the market, adopting at least ten different business models. They face problems in protecting data ownership, dealing with a fragmented market that lacks secure and sovereign exchange standards, devising well-founded data valuation models, and setting explainable prices.
In that direction, practitioners from different disciplines have proposed methodologies to calculate the value of data from different perspectives, responding to different contexts and needs. Still, measuring the data economy remains a titanic challenge that will require major efforts from the scientific community, the public sector and industry in the coming years.
Thanks to the development of artificial intelligence (AI) and to the massive adoption of machine learning (ML) models, “data” is becoming a key production factor, comparable in importance to capital, land, or labour. However, due to the unique characteristics of “data” as an economic good (a freely replicable, non-depletable asset holding a highly combinatorial and context-specific value [1]), companies are reluctant to share them, exchanges often take place ad hoc and through barter arrangements [2], and most valuable data assets still remain unexploited in corporate “silos”.
As a result, the so-called data economy flourishes around a restricted number of champions, horizontally integrated across the value chain [3], and shows a significant geographical imbalance [4]. Unsurprisingly, unleashing the potential of data in the economy has become a key policy objective in the European Union [5], which predicts the size of the data economy to reach €827 billion for the EU27 countries in 2025 [6]. Some analysts have calculated that the data economy could reach US$2.5 trillion globally in the same year [7], with an additional potential of more than US$13 trillion from AI by 2030 [8].
The massive collection and exploitation of personal data by digital firms in exchange for services, often with little or no consent, has raised a general concern about privacy and data protection [9]. Apart from spurring recent legislative developments in this direction [10, 11], this concern has prompted prominent voices to warn about the unsustainability of current digital economics, some of which propose paying people for their data in a sort of worldwide data labour market as a potential solution to this dilemma [12]. Some go as far as estimating that a radical new market of data as labour would transfer 9% of the data economy from data-driven companies to data owners, amounting to US$20,000 of yearly income for a family of four, while increasing the overall size of the economy by 3% [13]. This figure is far above the roughly US$1,000 per individual that results from dividing the market capitalisation of the data-driven market champions by the global population at the beginning of 2023.
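As a back-of-envelope illustration of that last figure, the short Python sketch below simply divides an assumed combined market capitalisation of the data-driven champions by an assumed global population; both inputs are rough assumptions chosen for illustration, not measured values:

# Back-of-envelope sketch of the per-individual figure mentioned above.
# Both inputs are illustrative assumptions, not measured data.

champions_market_cap_usd = 8.0e12   # assumed combined market cap of data-driven champions (~US$8 trillion, early 2023)
world_population = 8.0e9            # assumed world population (~8 billion)

per_individual_usd = champions_market_cap_usd / world_population
print(f"Market capitalisation per individual: US${per_individual_usd:,.0f}")
# -> Market capitalisation per individual: US$1,000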
A few models have been proposed to calculate the value of personal data [14], often resulting in disparate and apparently contradictory outcomes. These models follow heterogeneous methods, such as relying on the market capitalisation of data-driven firms, their turnover, or the net income of data providers; analysing unit prices per user or per data volume; or evaluating the cost of a data breach. Some models quantify the economic and social impact of personal data use cases [15]. Finally, other valuation techniques turn to user surveys to determine people's willingness to pay to protect their privacy [16].
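To illustrate how disparate those outcomes can be, the following Python sketch applies several of the methods listed above to a single hypothetical data-driven firm; every input figure is an assumption chosen for illustration, not data from any real company's accounts:

# Illustrative sketch: heterogeneous valuation methods applied to one hypothetical firm.
# All figures below are assumptions for illustration only.

ASSUMED_MARKET_CAP_USD = 500e9        # market capitalisation of a hypothetical data-driven firm
ASSUMED_ANNUAL_REVENUE_USD = 60e9     # yearly turnover
ASSUMED_NET_INCOME_USD = 15e9         # net income
ASSUMED_MONTHLY_ACTIVE_USERS = 2e9    # user base
ASSUMED_BREACH_COST_PER_RECORD = 150  # assumed cost of a data breach per exposed record (US$)

valuations = {
    "market cap per user": ASSUMED_MARKET_CAP_USD / ASSUMED_MONTHLY_ACTIVE_USERS,
    "revenue per user": ASSUMED_ANNUAL_REVENUE_USD / ASSUMED_MONTHLY_ACTIVE_USERS,
    "net income per user": ASSUMED_NET_INCOME_USD / ASSUMED_MONTHLY_ACTIVE_USERS,
    "breach cost per record": ASSUMED_BREACH_COST_PER_RECORD,
}

# The same "user" ends up valued anywhere from a few dollars to hundreds of dollars
for method, value in valuations.items():
    print(f"{method:>24}: US${value:,.2f}")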
In spite of the difficulties and challenges of trading such a peculiar economic good, there already exists a relevant B2B data market that extends well beyond personal data. A recent study of companies trading data over the Internet revealed more than 2,000 such entities and identified ten different business models. Such models include data and digital service providers, marketplaces embedded in data management platforms (Snowflake, Carto, Cognite), and data marketplaces (DMs) that aim to mediate between sellers and buyers and manage data transactions. Among the latter, general-purpose DMs (AWS, Advaneo, DataRade) intend to trade any kind of data and are being challenged by niche DMs that target specific industries, such as automotive (Caruso, Otonomo), energy and logistics (Veracity), or finance (Refinitiv, S&P). Other niche DMs focus on specific data types, such as real-time IoT sensor data (e.g., IOTA, Terbine), or cover data sourcing for specific purposes, such as feeding ML algorithms (e.g., Nokia DM, DefinedCrowd). In addition, personal information management systems (PIMS, like Digi.me, Meeco, ErnieApp, or Swash) leverage recent data protection legislation to empower end users to take control of their personal data, to help them exert the rights granted to them by law, and to manage their consent to share their personal data with third parties. Some studies have identified the key challenges that these companies face, as well as existing or emerging technology that may help overcome them [17, 18].
An interesting trend has been spotted towards the distribution or federation of data exchange platforms [19], which can also benefit from the growing processing capabilities of the cloud edge. By commodifying and specialising data trading, data markets are moving away from horizontally integrated, monolithic, siloed data providers and towards distributed “niche” exchange platforms (Ocean Protocol, Settlemint) [17], often leveraging blockchain and their own cryptocurrencies to manage and settle transactions, and relying on federated learning [20] to process data where it is stored (Nokia DM, Acuratio). Two meaningful initiatives targeting those objectives are receiving substantial support from governments and key industry players in Europe: International Data Spaces and the Gaia-X project.
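As a rough illustration of the “process data where it is stored” idea, the following Python sketch implements a minimal, textbook-style federated averaging loop; it is a generic sketch under simplifying assumptions (linear models, synthetic data), not the protocol of any marketplace named above:

# Minimal federated averaging (FedAvg) sketch: each data holder trains locally and
# only model parameters (never raw data) are shared and averaged by a coordinator.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local linear-regression training via gradient descent."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Three data holders with private, synthetic datasets that never leave their premises
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    clients.append((X, y))

# Federated rounds: broadcast global weights, collect local updates, average them
global_w = np.zeros(2)
for _ in range(10):
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(local_ws, axis=0)

print("Learned weights:", np.round(global_w, 2))  # close to [2.0, -1.0]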
Data pricing remains a relevant problem that has long attracted the attention of researchers from very different disciplines [21]. Different schools resort to disparate techniques, such as running auctions, measuring the quality of data as a weighted sum of features, comparing the information provided by different queries over a single database, or quantifying the loss of privacy incurred by sharing a piece of data or the decreasing utility of noisy versions of a dataset. A recent study has gathered and analysed information about more than 200,000 data products offered by 43 data providers and marketplaces, and has identified which categories of data are most popular, which of them command the highest prices, which data features sellers use to set the price of data products, and which features very valuable products have in common [22]. Based on this metadata, some works have managed to compare prices across data marketplaces, and ML regression models have been trained to learn the relationship between prices and metadata as a first step towards predicting them and increasing the transparency of data markets [23].
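As an illustration of that last approach, the sketch below trains a regression model on a synthetic catalogue of data products whose metadata features (volume, update frequency, coverage, presence of personal data) are hypothetical stand-ins; it does not reproduce the datasets or models of [22, 23]:

# Illustrative sketch: learning the relationship between data-product prices and metadata.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
n = 1000

# Hypothetical metadata features for data products
volume_gb = rng.lognormal(mean=2, sigma=1, size=n)   # dataset size
update_freq = rng.integers(1, 365, size=n)            # updates per year
n_countries = rng.integers(1, 50, size=n)             # geographic coverage
is_personal = rng.integers(0, 2, size=n)              # contains personal data

# Synthetic prices with noise (a stand-in for observed marketplace prices)
price = (5 * volume_gb + 2 * update_freq + 30 * n_countries
         + 500 * is_personal + rng.normal(scale=100, size=n))

X = np.column_stack([volume_gb, update_freq, n_countries, is_personal])
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

model = GradientBoostingRegressor().fit(X_train, y_train)
print("MAE on held-out products:",
      round(mean_absolute_error(y_test, model.predict(X_test)), 1))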
The 'value of data' is oftentimes linked to that of personal data, and more specifically to its application in marketing and advertising. Some notable works have measured the prices observed in online advertising to different user profiles [24, 25], and there are tools to calculate the revenue that users generate for social networks like Facebook [26].
Finally, measuring the value of personal data before acquiring them would help avoid the indiscriminate replication of data, most of which eventually turns out to be useless and is filtered out during the training process. Knowing the value of data beforehand allows buyers to select and purchase only the products that are useful for their specific purposes. This valuation is neither necessarily dependent on the volume of data nor easily calculated through heuristics [27, 28, 29]. On the contrary, it often requires an ad hoc valuation for the buyer’s specific task, a functionality that data marketplaces can offer to potential buyers seeking to control the efficiency of such pre-processing [30, 31].
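One common way to perform such task-specific valuation is to score each candidate dataset by its marginal contribution to the buyer's own validation task, as in the Python sketch below; the sellers, features and add-it-in heuristic are illustrative assumptions rather than the specific methods of [30, 31]:

# Sketch of task-specific data valuation via marginal contribution: score each candidate
# dataset by how much adding it to the buyer's training data improves validation accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

def make_dataset(n, informative=True):
    """Synthetic binary-classification data; 'uninformative' sellers provide noise labels."""
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int) if informative else rng.integers(0, 2, size=n)
    return X, y

buyer_X, buyer_y = make_dataset(200)   # buyer's own (small) training data
val_X, val_y = make_dataset(500)       # buyer's validation task
sellers = {"seller_A": make_dataset(300, informative=True),
           "seller_B": make_dataset(300, informative=False)}

def task_score(train_X, train_y):
    model = LogisticRegression().fit(train_X, train_y)
    return accuracy_score(val_y, model.predict(val_X))

baseline = task_score(buyer_X, buyer_y)
for name, (sX, sy) in sellers.items():
    gain = task_score(np.vstack([buyer_X, sX]), np.concatenate([buyer_y, sy])) - baseline
    print(f"{name}: marginal gain in validation accuracy = {gain:+.3f}")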
In conclusion, despite the huge efforts of industry and the scientific community, measuring the value of data and measuring the data economy remain major challenges from both a technical and an economic perspective [32]. Furthermore, there is a growing need to reach consensus on standard methodologies to calculate the value of personal data for accounting purposes, for valuing data-intensive enterprises, for compensating people, for setting up data taxes [33, 34], or simply for selecting the best data to feed an ML model [35]. I firmly believe this is a thriving research field that will require the joint efforts of practitioners from different disciplines in the years to come.
References:
Image "Matrix Code Computer" by 0fjd125gk87 licensed by Pixabay