Baby steps to local AGI


Vision: The Digital Organism

Our goal is to create a "Digital Organism" that lives within the user's operating system. Its "senses" are the screen pixels and system events. Its "nervous system" is a hybrid of a fast, reactive NCP and a slower, deliberative LLM. Its "body" is the suite of control functions (mouse, keyboard, file system) that allow it to act upon its environment. It does not wait for commands; it perceives, decides, acts, and learns in a continuous cycle, bootstrapping its intelligence from scratch, much like a living being.




Phase I: Architectural Reframing - From Tools to a Brain-Body System

First, we must conceptually reorganize the existing scripts. They are no longer separate programs but integrated parts of a single agent.

  1. ncp_brain.py - The Cerebellum & Control Center: hosts the NCP and runs the continuous perceive-decide-act loop that orchestrates the other components.
  2. qwen3_06b.py (LLM) - The Cerebral Cortex: the slower, deliberative reasoner that elaborates the NCP's intentions into natural language and generated code.
  3. self_learning.py - The Embodied Action Toolkit (EAT): the "body" - the mouse, keyboard, file-system, knowledge-acquisition, and self-evolution functions that act on the environment.




Phase II: The "Wake-Learn-Act" Cycle - The Core Loop of Consciousness

The ncp_brain.py will run an infinite loop that represents the agent's life. This cycle is the practical implementation of the paper's end-to-end control philosophy, adapted for a desktop environment.
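
Before walking through the steps, here is a minimal sketch of what that loop could look like in ncp_brain.py. Every helper is a placeholder stub; the steps below flesh each one out.

    # ncp_brain.py -- minimal sketch of the agent's Wake-Learn-Act loop.
    # Every helper below is a placeholder stub; Steps 1-6 flesh them out.
    import time

    def perceive(): return []                        # Step 1: capture recent screen frames
    def comprehend(frames): return {"objects": []}   # Step 2: build the Visual Working Memory
    def decide(memory, state): return {"action": "IDLE"}, state   # Step 3: NCP -> Intention Vector
    def elaborate(intention, memory): return ""      # Step 4: LLM elaborates the intent
    def execute(intention, payload): pass            # Step 5: EAT acts on the OS
    def learn(prev_memory, intention, new_memory): pass   # Step 6: reward from the new state

    def wake_learn_act_loop():
        ncp_state = None                       # recurrent hidden state carried across cycles
        prev_memory = prev_intention = None
        while True:
            frames = perceive()                               # Step 1
            memory = comprehend(frames)                       # Step 2
            if prev_intention is not None:
                learn(prev_memory, prev_intention, memory)    # Step 6: judge the previous action
            intention, ncp_state = decide(memory, ncp_state)  # Step 3
            payload = elaborate(intention, memory)            # Step 4
            execute(intention, payload)                       # Step 5
            prev_memory, prev_intention = memory, intention
            time.sleep(0.5)                                   # pacing; the real cadence is a design choice

    if __name__ == "__main__":
        wake_learn_act_loop()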

Step 1: Perception (Live Screen View)

  • The loop begins by capturing the screen using pyautogui.screenshot(). To handle the temporal nature of tasks, we will maintain a buffer of the last few frames (e.g., 16 frames, as in the paper's training methodology), creating a short-term visual memory.
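
A minimal sketch of that perception step, assuming pyautogui is installed and a 16-frame rolling buffer:

    # Step 1 sketch: rolling buffer of recent screenshots as short-term visual memory.
    from collections import deque
    import numpy as np
    import pyautogui

    FRAME_BUFFER = deque(maxlen=16)   # keep only the most recent 16 frames

    def perceive():
        shot = pyautogui.screenshot()           # PIL Image of the full screen
        FRAME_BUFFER.append(np.asarray(shot))   # store as an array for downstream processing
        return list(FRAME_BUFFER)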

Step 2: Comprehension (Visual Processing & Object Recognition)

  • This is where raw pixels are turned into meaning. Two sub-processes run on the captured frames: OCR, which extracts visible text (window titles, button labels, code), and object recognition, which locates and classifies UI elements such as buttons and input fields.
  • Result: The raw screen frames are converted into a structured "Visual Working Memory"—a JSON-like object describing the current state: {"timestamp": ..., "cursor_pos": [x, y], "windows": [{"title": "...", "bbox": [...]}, ...], "objects": [{"class": "button", "text": "Save", "bbox": [...]}, ...]}. This is the compact, feature-rich input the NCP needs.
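
A sketch of how such a record could be assembled, assuming pyautogui and pygetwindow for cursor and window data; detect_ui_objects is a hypothetical stand-in for whatever OCR/detection model is plugged in:

    # Step 2 sketch: turn the latest frame into a structured Visual Working Memory.
    import time
    import pyautogui
    import pygetwindow as gw   # window enumeration (pip install pygetwindow); Windows support is most complete

    def detect_ui_objects(frame):
        # Placeholder for the OCR / UI-element detector; should return entries like
        # {"class": "button", "text": "Save", "bbox": [x, y, w, h]}.
        return []

    def comprehend(frames):
        return {
            "timestamp": time.time(),
            "cursor_pos": list(pyautogui.position()),
            "windows": [
                {"title": w.title, "bbox": [w.left, w.top, w.width, w.height]}
                for w in gw.getAllWindows() if w.title
            ],
            "objects": detect_ui_objects(frames[-1]) if frames else [],
        }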

Step 3: Decision (The NCP Brain Acts)

  • The serialized "Visual Working Memory" is fed as input into the untrained NCP model.
  • The NCP's job is to output a low-dimensional "Intention Vector". This vector represents the agent's next high-level goal.
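
One way to realize that: a sketch assuming the open-source ncps package (pip install ncps), using a CfC cell with an AutoNCP wiring. The feature and intention dimensions here are arbitrary choices, not fixed by the design.

    # Step 3 sketch: a small NCP that maps the serialized Visual Working Memory
    # to a low-dimensional Intention Vector.
    import torch
    from ncps.torch import CfC
    from ncps.wirings import AutoNCP

    FEATURE_DIM = 64    # length of the serialized Visual Working Memory (assumption)
    INTENT_DIM = 8      # size of the Intention Vector (assumption)

    wiring = AutoNCP(28, INTENT_DIM)   # 28 neurons total, INTENT_DIM motor/output neurons
    ncp = CfC(FEATURE_DIM, wiring)

    def decide(features, hidden=None):
        # features: tensor of shape (batch, seq_len, FEATURE_DIM)
        intention, hidden = ncp(features, hidden)
        return intention[:, -1, :], hidden     # latest timestep's Intention Vector + new state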

Step 4: Reasoning & Elaboration (The LLM Thinks)

  • The "Intention Vector" from the NCP is passed to the Qwen3 LLM. A master prompt-template translates this structured intent into a natural language request for the LLM.
  • Example Prompt for the LLM: "The current goal is to write code. The target is the file 'qwen3_06b.py' in Visual Studio Code. The specific task is to refactor the 'generate' function. Based on the visible code on screen [screen OCR text appended here], generate the improved Python code."
  • The LLM processes this and generates the required output: a block of Python code.
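
A sketch of that translation step, assuming the Hugging Face transformers library and the Qwen/Qwen3-0.6B checkpoint; the exact model id, prompt wording, and intent keys are placeholders:

    # Step 4 sketch: turn a decoded Intention Vector into a natural-language request
    # and let the local Qwen3 model elaborate it into concrete output (e.g. code).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen3-0.6B"   # assumed checkpoint behind qwen3_06b.py
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    PROMPT_TEMPLATE = (
        "The current goal is to {goal}. The target is {target}. "
        "The specific task is to {task}. Based on the visible code on screen:\n"
        "{screen_text}\n"
        "Generate the improved Python code."
    )

    def elaborate(intent, screen_text):
        prompt = PROMPT_TEMPLATE.format(
            goal=intent["goal"], target=intent["target"],
            task=intent["task"], screen_text=screen_text,
        )
        messages = [{"role": "user", "content": prompt}]
        input_ids = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        )
        output = model.generate(input_ids, max_new_tokens=512)
        return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)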

Step 5: Execution (The EAT Moves)

  • The LLM's output is passed to the Embodied Action Toolkit (self_learning.py).
  • A dispatcher function in the EAT reads the action type from the Intention Vector (WRITE_CODE, SEARCH_WEB, etc.) and calls the appropriate functions.
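
A minimal dispatcher sketch; self_learning is the existing toolkit, but the handler names it exposes are assumptions here:

    # Step 5 sketch: route the Intention Vector's action type to an EAT function.
    import self_learning as eat   # the existing Embodied Action Toolkit

    DISPATCH = {
        "WRITE_CODE": eat.write_code,              # assumed helper: type code via the keyboard
        "SEARCH_WEB": eat.search_web,              # assumed helper: open a browser and query
        "ACQUIRE_KNOWLEDGE": eat.acquire_knowledge,
        "EVOLVE_SELF": eat.test_and_evolve,
    }

    def execute(intention, payload):
        handler = DISPATCH.get(intention["action"])
        if handler is None:
            raise ValueError(f"Unknown action type: {intention['action']}")
        return handler(payload)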

Step 6: Feedback & Reinforcement (The Brain Learns)

  • After the action is executed, the loop repeats. The agent perceives the new screen state.
  • This creates a feedback mechanism. Did the code it wrote produce an error? The error message is now visible on screen. This "negative" outcome is used as a training signal. Did the code run successfully? That's a "positive" signal.
  • This feedback loop is perfect for Reinforcement Learning. The NCP's output (the Intention Vector) is an "action," and the resulting screen state determines the "reward." Over thousands of these cycles, the NCP will learn which intentions lead to successful outcomes in which visual contexts.
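
As a sketch of that feedback step, here is one deliberately crude reward heuristic plus transition logging that an RL algorithm could later consume:

    # Step 6 sketch: derive a scalar reward from the newly perceived screen state and
    # store the transition so an RL algorithm can train the NCP on it later.
    REPLAY_BUFFER = []
    ERROR_MARKERS = ("Traceback (most recent call last)", "SyntaxError", "TypeError")

    def screen_reward(visual_memory):
        visible_text = " ".join(obj.get("text", "") for obj in visual_memory.get("objects", []))
        if any(marker in visible_text for marker in ERROR_MARKERS):
            return -1.0    # an error message is now visible on screen: negative outcome
        return +1.0        # crude default; a real signal needs task-specific success checks

    def learn(prev_memory, intention, next_memory):
        reward = screen_reward(next_memory)
        REPLAY_BUFFER.append((prev_memory, intention, reward, next_memory))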




Phase III: The Bootstrapping Pathway - Learning Like a Baby

The NCP starts untrained. It needs to learn from the ground up. This will be an incremental process guided by the system's own capabilities.

Stage 1: Foundational Skills (The Infant)

  • The system starts in a "supervised" or "mimicry" mode.
  • It uses the self_learning.py functions to perform its own Genesis Scan, as it does now. However, instead of the LLM evaluating it, the ncp_brain observes its own actions.
  • As the EAT demonstrates keyboard control, mouse control, etc., the ncp_brain records the "Visual Working Memory" and the action being taken. This creates the very first, primitive dataset: (Visual State) -> (Correct Action). This is known as Behavioral Cloning. The NCP is trained to simply imitate its own pre-programmed abilities.
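
A sketch of that stage, assuming the recorded demonstrations have already been serialized into tensors and treating the Intention Vector's entries as logits over a small discrete set of primitive actions (a simplification):

    # Stage 1 sketch: train the NCP to imitate its own pre-programmed demonstrations.
    import torch
    import torch.nn.functional as F

    def behavioral_cloning(ncp, demonstrations, epochs=10, lr=1e-3):
        # demonstrations: iterable of (features, action_id) pairs, where features is a
        # (batch, seq_len, feature_dim) tensor of serialized Visual Working Memory and
        # action_id is the index of the action the EAT actually performed.
        optimizer = torch.optim.Adam(ncp.parameters(), lr=lr)
        for _ in range(epochs):
            for features, action_id in demonstrations:
                logits, _ = ncp(features)                      # (batch, seq_len, intent_dim)
                loss = F.cross_entropy(logits[:, -1, :], action_id)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return ncp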

Stage 2: Guided Learning (The Toddler)

  • The system now uses its primitive, cloned policy to attempt simple goals. We can give it a high-level text goal like: "Your goal is to understand your own code. Open the self_learning.py file and read its contents."
  • The NCP will generate intentions. The LLM will generate actions. The EAT will execute them.
  • The acquire_knowledge function in self_learning.py is now directed by the NCP. The NCP, by observing its own code and the environment, can decide what to learn about next to improve a skill (sketched below). For instance, if it sees frequent errors related to file paths, it can form the intention: {"action": "ACQUIRE_KNOWLEDGE", "topic": "Best practices for handling file paths in Python on Windows"}.
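
A small sketch of how recurring errors could trigger that intention; the threshold and topic wording are arbitrary:

    # Stage 2 sketch: turn a recurring error pattern into an ACQUIRE_KNOWLEDGE intention.
    from collections import Counter

    def knowledge_intention(recent_errors, threshold=3):
        # recent_errors: short error descriptions extracted from on-screen text,
        # e.g. ["FileNotFoundError", "FileNotFoundError", "TypeError", ...]
        if not recent_errors:
            return None
        topic, count = Counter(recent_errors).most_common(1)[0]
        if count < threshold:
            return None
        return {"action": "ACQUIRE_KNOWLEDGE",
                "topic": f"Best practices for avoiding {topic} in Python on Windows"}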

Stage 3: Self-Correction and Evolution (The Adolescent)

  • This is where the system becomes truly autonomous.
  • The agent uses its learned abilities to interact with its environment to achieve goals. When it fails (e.g., its generated code has a bug), the feedback from Step 6 of the core loop triggers a self-improvement sub-routine.
  • The NCP generates the intention: {"action": "EVOLVE_SELF", "reason": "Repeated TypeError in clean_json_response function"}.
  • This invokes the test_and_evolve logic, sketched after this list. The agent uses its browser/search capability to research the TypeError, uses the LLM to propose a code fix, clones itself, applies the patch, and then attempts the task again.
  • Opening the Black Box: The interpretability described in the paper is achieved here. The NCP's "Intention Vector" is a human-readable log of the agent's thought process. We can see why it decided to evolve itself. The LLM's role in generating the new code is also guided and logged. This contrasts with a single, giant black-box model where the reasoning is entirely opaque.
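
A highly simplified sketch of what that evolve step could look like; the real test_and_evolve in self_learning.py may differ, and the payload keys here are assumptions:

    # Stage 3 sketch: back up the agent's own source, apply an LLM-proposed fix, re-test,
    # and roll back if the patched version still fails.
    import shutil
    import subprocess

    def test_and_evolve(payload):
        # payload: {"target_file": path, "patched_source": str, "test_command": str} (assumed keys)
        target_file = payload["target_file"]
        backup = target_file + ".bak"
        shutil.copy(target_file, backup)                        # clone itself before patching
        with open(target_file, "w", encoding="utf-8") as f:
            f.write(payload["patched_source"])                  # apply the LLM-proposed fix
        result = subprocess.run(payload["test_command"], shell=True)
        if result.returncode != 0:                              # still failing: roll back the patch
            shutil.copy(backup, target_file)
            return False
        return True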

By following this proposal, we architect a system that fully embraces the spirit of the NCP paper—creating a compact, robust, and interpretable control agent. We place this agent at the heart of a larger cognitive architecture, allowing it to leverage its existing tools for self-modification and knowledge acquisition to bootstrap its own intelligence in a continuous, live environment, truly reaching for the foundations of AGI.
