Data In, Data Out: Sustainable LLMs?
(NOTE: This is a repost of an earlier article, now included in the Events, Data, Information newsletter.)
I need to stop obsessing over AI, but... using LLMs to produce "content" has an entropy problem. I just want to get it off my chest and then move on.
Imagine I make a machine that loads samples of bread, chews them up, and then spits out fresh bread on command. Call it a Large Loaf Model. Let's say the output is pretty close to the real thing. Ok, so the texture is bland and regular, and it misses active leavening (yeast, sourdough, what have you). But in general the output passes for bread. Now what happens as I gain market share, and the percentage of LLM bread increases? Say the next model update incorporates 10% LLM bread. And after that 20%, and so on. For an increasing want of texture and leavening, the LLM bread ultimately devolves into something less than what you might have hoped. Model collapse.
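To make that feedback loop concrete, here's a minimal sketch in Python (my toy numbers, not a claim about any particular model): each generation refits a simple distribution on a mix of real data and the previous generation's output, with the synthetic share growing and a little of the tail lost in each fit. The "texture" (variance) steadily shrinks.

```python
# Minimal sketch of the "LLM bread" feedback loop (illustrative only).
# A one-dimensional "culture" is a normal distribution. Each generation,
# the model is refit on a mix of real data and the previous model's output,
# with the synthetic share growing. Fitting loses a little tail each time,
# so the variance ("texture") drifts toward zero: model collapse.
import random
import statistics

random.seed(42)

REAL_MU, REAL_SIGMA = 0.0, 1.0          # the original, human-made distribution
N = 10_000                              # samples per generation

mu, sigma = REAL_MU, REAL_SIGMA         # generation 0 = the real thing
for gen in range(1, 11):
    synthetic_share = min(0.1 * gen, 1.0)   # 10%, 20%, ... synthetic data
    sample = [
        random.gauss(mu, sigma) if random.random() < synthetic_share
        else random.gauss(REAL_MU, REAL_SIGMA)
        for _ in range(N)
    ]
    # Refit, but (like any finite model) keep only the bulk of the data,
    # trimming the rarest 2% of outliers -- the "leavening" that gets lost.
    sample.sort()
    trimmed = sample[N // 100 : N - N // 100]
    mu = statistics.fmean(trimmed)
    sigma = statistics.stdev(trimmed)
    print(f"gen {gen:2d}: synthetic={synthetic_share:.0%}  sigma={sigma:.3f}")
```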
This is an entropy problem if you see culture in terms of energy and work. Recent science is going there, notably in work by the cloud physicist Tim Garrett, summarized by the astronomer Rick Nolthenius (https://www.dr-ricknolthenius.com/Apowers/A7-K43-Garrett.pdf). According to Nolthenius, Garrett "...applies thermodynamic thinking to the ordered system which is Civilization, and sees a simple relation which has held true in real-world data." Which is to say, culture and its artifacts rely on energy input. Physicists are stepping into biological territory.
From the biology POV, Stuart Kauffman tentatively defines an autonomous agent as "...something that can both reproduce itself and do at least one thermodynamic work cycle." (https://www.edge.org/conversation/stuart_a_kauffman-the-adjacent-possible). He holds the autonomous agent as a sort of prototype of life. In his book "The Origins of Order" Kauffman describes fitness landscapes where molecules could "search" for optima in a multidimensional domain. It seems the burst of energy from the big bang gives us order for free. His multidimensional models are similar to physicists' models of a sink into entropy, except they're inverted... Biology is the local story of order emerging out of the physicist's general slide toward entropy.
Looking back, life shows a growth of complexity from microbial mats putting oxygen into the atmosphere, to mammals, to people analyzing their world via multidimensional models. Which pretty much brings us to now, and LLMs... multidimensional models that encode relationships between tokens of cultural artifacts. And if Kauffman is right about his models of emerging order, LLM behavior does have similarities to his search for optima.
Back to Nolthenius, he describes cultural growth like this:
"Useful work accomplishes innate human values – powering the networks of our relationships to each other and to material things, and the enhancement and growth of civilization."
"The analog for physical entropy S, is the amount of disorder Sc in the civilization+environment system."
"Growth in civilization must correspond to a reduction in civilization’s portion of Sc at the expense of greater Sc in the total environment system, powered by the expenditure of physical ENERGY."
So yeah, we need power to train our models. But we also use energy to produce the training data. Authors have to eat, and while good fiction is hard enough to produce, truth is even harder (as Yuval Harari points out in numerous interviews and in his book "Nexus"). The training source is culture, and we burn energy to produce it.
We should look at how culture grows from a perspective of data vs information. People develop ideas (theories) about the world around them. What they perceive in that world are affordances (see More on Data: Affordances), and their theories inform their relationships to the world. Ultimately, they see how to combine affordances into new affordances, and they create artifacts that express those combinations. These artifacts could be technologies (tools, processes, etc.) or compressed descriptions (symbols, text, images, etc.). The development and collection of these artifacts is what we see as culture. It's important to note that the artifacts are themselves data... facts in the material world. They record information events that occurred in the heads of people.
If organic thought and other responses emerge along Kauffman's lines (and intuitively that makes sense), then humans seem to do something very interesting. Kauffman introduces variables in his models that adjust the "ruggedness" of the resulting "fitness landscapes". It makes sense that people either play with these variables (probably by "feel"), or else they play with the dimensions that go into the currently active model (again, by "feel"), in order to respond/adapt to fluid environments. However you describe it, people are adept at switching and merging contexts on the fly.
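Kauffman's ruggedness knob can be sketched with a toy NK-style landscape (my simplification, not his actual model): N bits, where each bit's fitness contribution depends on itself plus K neighbors. With K = 0 a one-bit-flip hill climber always finds the single peak; as K grows, the landscape wrinkles and the climber gets stuck on scattered local optima.

```python
# Toy NK-style fitness landscape (a sketch of Kauffman's idea, not his code).
# N bits; each bit's fitness contribution depends on itself plus K neighbors.
# K = 0 gives a smooth, single-peaked landscape; larger K makes it rugged,
# so a simple one-bit-flip hill climber gets trapped on local optima sooner.
import random

def make_landscape(n, k, rng):
    # One random lookup table per locus: maps the (K+1)-bit neighborhood
    # pattern to a fitness contribution in [0, 1).
    return [
        {pattern: rng.random() for pattern in range(2 ** (k + 1))}
        for _ in range(n)
    ]

def fitness(genome, tables, k):
    n = len(genome)
    total = 0.0
    for i, table in enumerate(tables):
        # Neighborhood = this bit plus the next K bits (wrapping around).
        pattern = 0
        for j in range(k + 1):
            pattern = (pattern << 1) | genome[(i + j) % n]
        total += table[pattern]
    return total / n

def hill_climb(tables, n, k, rng):
    genome = [rng.randint(0, 1) for _ in range(n)]
    best = fitness(genome, tables, k)
    improved = True
    while improved:
        improved = False
        for i in range(n):                     # try every single-bit flip
            genome[i] ^= 1
            f = fitness(genome, tables, k)
            if f > best:
                best, improved = f, True
            else:
                genome[i] ^= 1                 # revert if no improvement
    return best

rng = random.Random(1)
N_BITS = 16
for K in (0, 2, 4, 8):
    tables = make_landscape(N_BITS, K, rng)
    peaks = [hill_climb(tables, N_BITS, K, rng) for _ in range(20)]
    print(f"K={K}: local optima found range from {min(peaks):.3f} to {max(peaks):.3f}")
```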
LLMs are reaching for something similar after they break culture down into tokens. Within a static context, they search for tokens that hang together much like you might look for a metaphor or the right hammer. But their search is statistical, and doesn't register affordance of any sort. If we want LLM output to contribute to our culture, we have to identify and record the affordance ourselves. In other words, LLMs break culture down to a data product, process the data, and output a modified data product. Data In, Data Out.
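Here's roughly what that statistical search looks like at the token level, reduced to a toy bigram table (the counts are made up for illustration, not real model weights): given the previous token, pick the next one in proportion to how often the pair co-occurred, with a temperature knob to reshape the odds. Nothing in the loop asks what the words afford.

```python
# Toy picture of token-level statistical search (illustrative, not an LLM).
# A bigram table stands in for the learned distribution: given the previous
# token, pick the next one in proportion to how often the pair co-occurred.
# Nothing here knows what the words afford; it only knows co-occurrence.
import random

# Hypothetical co-occurrence counts (made up for illustration).
bigram_counts = {
    "fresh": {"bread": 8, "ideas": 3, "paint": 1},
    "bread": {"rises": 5, "is": 4, "collapses": 1},
    "ideas": {"rise": 2, "collapse": 3, "spread": 4},
}

def next_token(prev, temperature=1.0, rng=random):
    counts = bigram_counts[prev]
    tokens = list(counts)
    # Temperature reshapes the distribution: low T -> always the top pair,
    # high T -> closer to uniform. Same statistics, different "creativity".
    weights = [counts[t] ** (1.0 / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
token = "fresh"
out = [token]
for _ in range(3):
    token = next_token(token, temperature=0.8, rng=rng)
    out.append(token)
    if token not in bigram_counts:
        break   # dead end: no statistics for what comes next
print(" ".join(out))
```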
So where's the entropy problem?
As Dries Buytaert points out, we might soon be flooded with AI-generated content. "AI-driven content management isn't a distant scenario. Soon, Content Management Systems (CMS) may deploy hundreds of AI agents making bulk edits across thousands of pages." (https://dri.es/how-ai-could-reshape-cms-platforms). At least in certain domains, including ones where truth should be particularly important, we're actively working to see the volume of generated content outstrip the volume of human-authored content. How do we keep our LLM bread from turning into an unpalatable mush?
People seem to be pretty efficient in the grand scheme of things. To produce a viable answer to a question, they seem to use a fraction of the energy that goes into an LLM. It would be interesting to measure the energy cost to produce a high school grad vs. the cost to make GPT-4 (a rough sketch of that comparison follows the list below). By comparison, the costs I can think of for LLMs include:
The infrastructure and implementation of LLM technology -- initial cost.
The power to train models -- ongoing cost.
The culture behind the training data -- initial + ongoing cost.
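Here's the rough sketch promised above. The human side uses the standard ~2,000 kcal/day metabolic figure over 18 years; the model side is a placeholder parameter, since real training-energy figures for frontier models aren't public. Treat the printed ratio as illustrative arithmetic, not a measurement, and note that it ignores everything upstream of the person (schooling, food production) and the model (chip fabrication, data centers) alike.

```python
# Back-of-envelope comparison (illustrative arithmetic, not a measurement).
# Human side: metabolic energy over 18 years at ~2,000 kcal/day.
# Model side: a placeholder training-energy figure; real numbers for
# frontier models are not public, so change this assumption freely.
KCAL_TO_KWH = 1.163e-3             # 1 kcal = 0.001163 kWh

def human_energy_kwh(years=18, kcal_per_day=2000):
    return years * 365 * kcal_per_day * KCAL_TO_KWH

ASSUMED_TRAINING_KWH = 50_000_000  # hypothetical placeholder, not a reported figure

human = human_energy_kwh()
print(f"Human, 18 years of meals: {human:,.0f} kWh")   # roughly 15,000 kWh
print(f"Assumed model training:   {ASSUMED_TRAINING_KWH:,} kWh")
print(f"Ratio (assumed):          {ASSUMED_TRAINING_KWH / human:,.0f}x")
```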
We take on this cost to get a mechanical advantage in what... Time to information in a specific domain? Let's assume we get that. There are more costs. We have to verify the output, especially if it generates a content product with exposure to liability. Yes, we verify human-authored text too, but human authors are better at self-correcting. They have less tendency to fabulate, and we can predict where they might go wrong just by talking to them. GenAI tends to leave misinformation pumpkins sprinkled arbitrarily through a text, so the verification cost drifts toward the cost of human authoring in the first place. (And can we put a cost on drudgery?)
Or maybe we want dynamic documents that tell you what they know? That renders the problem insidious. Once you ask the question, how do you evaluate the answer? Is it truth or fiction? Are there fabulation pumpkins in there, and do you have the expertise to recognize them? What are the sources of your answer, and do you know the citations are faithful?
(I'll point out that AI isn't necessary to produce a body of text that dynamically gives an answer. The military worked with IETMs - Interactive Electronic Technical Manuals - decades ago, where the idea was to plug a computer into an airplane and generate the maintenance pages required for the current state of that system. I myself used DITA, XSLT, and JavaScript to implement a document that filters content according to the current state of a software system. Work along these lines could go a long way toward developing bounded, verifiable conversational documents without using AI to generate the results.)
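For what it's worth, the filtering idea reduces to something very simple. This isn't my DITA/XSLT/JavaScript implementation, just a minimal Python analog with hypothetical element and attribute names: content blocks declare the system states they apply to, and rendering a document for the current state is a deterministic filter, with no generation involved.

```python
# Minimal analog of state-filtered documentation (a sketch of the idea,
# not the DITA/XSLT/JavaScript implementation described above).
# Each content block declares the system states it applies to; rendering
# a document for the current state is a deterministic filter, no generation.
from xml.etree import ElementTree as ET

DOC = """
<manual>
  <step applies-to="pump-offline">Close the intake valve.</step>
  <step applies-to="pump-online pump-degraded">Reduce flow to 50%.</step>
  <step>Log the maintenance action.</step>
</manual>
"""

def render_for_state(xml_text, current_state):
    root = ET.fromstring(xml_text)
    lines = []
    for step in root.findall("step"):
        applies_to = step.get("applies-to")
        # A step with no condition is always shown; otherwise it must
        # list the current state explicitly.
        if applies_to is None or current_state in applies_to.split():
            lines.append(step.text.strip())
    return lines

for line in render_for_state(DOC, "pump-degraded"):
    print("-", line)
```

Because the filter is deterministic and the source is bounded, every answer it gives can be traced back to a specific authored block, which is exactly the verifiability that generated text struggles with.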
Then there's the problem of model collapse. If we crowd out human-generated text with LLM text, what happens to the base for the next round of training? Because the LLM is a data machine, it can't do the work to maintain the relevance of that data to the human experience (i.e. information). But we're using these machines precisely to gain an advantage in that human experience.
So the dilemma is, who pays the cost for upkeep? It isn't free. For example, "...Kenyan workers paid as low as US$1.32/hour were hired to label toxic content such as 'textual descriptions of sexual abuse, hate speech, and violence' to help OpenAI develop automated filters that prevent the public from seeing these outputs... In the process, they reported being 'mentally scarred by the work...'" (The TESCREAL bundle: Eugenics and the promise of utopia through artificial general intelligence). This illustrates the cost of filtering out human noise. On top of that, who pays to filter out the inevitable creep toward sameness, the drift away from human relevance, the entropy that eventually overcomes every machine? Can we afford that upkeep without creating a class of info-elites supported by the exploitation of info-workers? And if the elites lose contact with their info QA, how will they guard against an info-uprising (data poisoning) by the exploited workers?
Or... Can we even hope to exploit workers who have the expertise to inject information into the training material? If we have to pay them for their information capabilities, wouldn't it be cheaper to just have them produce the artifact and leave the LLM out of the loop?
I just can't see a way around it, nothing is free, especially not information. And while you can transform energy to do work, you can't escape physical laws. Using LLMs and GenAI to produce fiction is a fun toy, and you might be able to make a business out of it. But using them to produce truth is another thing. And using AI to generate truth or fiction looks to me like using a nuclear reactor to clean your teeth. Some people do it (electric toothbrushes powered by nuclear energy), but considering the cost and infrastructure behind the technology on the one hand, and the efficiency of human power on the other, does it ultimately make sense? Somebody, prove me wrong!
Comment (Advisor, Mentor, Entrepreneur, co-founder of Turbonomic, 3 months ago):
Chris, yes, the energy aspect is quite important here. We now spend quite an amount of it to produce facts which we already learned and summarized centuries ago, like spending $$$$ and time to learn facts that could easily be taken from the textbooks in a local library. But then indeed: what does this energy bring when we spend it? New knowledge, new facts? Sometimes yes, sometimes not, a purely statistical outcome. And if not, does it reduce entropy or increase it? Remember the old question: "Workers lifted a grand piano to the roof of a skyscraper and burned it. What happened to the energy they spent?" This is not a joke.
Comment (IT Automation Tech Seller, 3 months ago):
This is really fascinating, Chris. I heard long ago that someone fed all the works of Bach into an AI and it produced a new work of Bach. Amazing, I thought, but as I came to understand AI more, I realized that the AI is simply continuing the pattern of the input. To put it another way, if you feed all of the works of the Beatles up to Rubber Soul into an AI, what you get is another Rubber Soul. You don't get Sergeant Pepper.