The Future of AI Agents

Insights from perception, inner speech, and spatial memory

Santiago M. Quintero
9 min read · Jun 26, 2024

I will present three ideas drawn from human cognition to innovate how we think about building AI agents. The first is about how memory is created and represented in the agent. The second is about how memory shapes the development of a personality and improves the likelihood of an agent achieving its goals. The final idea is about using perception to enrich the information agents have available when making decisions and taking actions. What all three ideas have in common is a multithreaded thought process, where agents use symbolic, probabilistic, and linguistic reasoning to enhance their understanding of the world.

An AI agent uses perception and reasoning to execute actions and engage with its environment. Image from “The Rise and Potential of Large Language Model Based Agents: A Survey.”

A neural memory

Humans store information in at least three different ways: long-term memories, semantic representations of words, and spatial memory. The hippocampus is responsible for spatial navigation and short-term memory. In our minds, space is represented as a graph where objects are nodes and edges indicate proximity and orientation. When we are in a new place, we reuse the same set of neurons, only modifying the activation patterns between them. Within the graph, neurons also form connections to emotions and memories, making it easy for similar places to share a similar context. In the hippocampus, inner dialogue facilitates converting short-term thoughts into long-term memories through repetition. This is what I call a neural memory.

Applying these insights to an AI agent results in a malleable, non-deterministic, fixed-size memory, very different from traditional databases. It has three components: a grid, memories, and thoughts. The grid is a fixed-size graph that emulates neurons; connections are formed and reinforced by activity. Memories are activation patterns: variable-length arrays indicating which neurons were activated to represent an outer object. Thoughts are sequences of memories, whose order is probabilistic and based on similarity.
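
A minimal sketch of this layout in Python (the class and its field names are my own illustration, not a reference implementation):

```python
import numpy as np

class NeuralMemory:
    """Fixed-size grid of 'neurons' whose weighted edges are
    reinforced by activity; memories are sparse activation patterns."""

    def __init__(self, grid_size: int = 1024):
        # The grid: a constant-size weighted adjacency matrix.
        self.grid = np.zeros((grid_size, grid_size))
        # Memories: variable-length arrays of activated neuron indices.
        self.memories: list[np.ndarray] = []
        # Thoughts: sequences of memory indices, ordered by similarity.
        self.thoughts: list[list[int]] = []

    def store(self, pattern: np.ndarray) -> int:
        """Store a memory and reinforce the edges it activates."""
        for i in pattern:
            for j in pattern:
                if i != j:
                    self.grid[i, j] += 1.0  # Hebbian-style reinforcement
        self.memories.append(pattern)
        return len(self.memories) - 1
```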

The basic idea is that when a new memory is learned, it builds on the existing structure of the grid, following an activation pattern similar to those of related memories. A recurrent algorithm in the background continuously updates memories, adapts the connections between neurons on the grid, and calculates the similarity of memories to form future thoughts. The algorithm is neuro-symbolic: mathematical operations coexist with a reasoning process to create a space-efficient storage system. In AI agents, neural processing is powered by a large language model (LLM) that links memories with a semantic, verbal connection to form thoughts. Updating the memory is part of a larger process that rewards memories that lead to successful outcomes.
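
One way this background process could look, extending the sketch above and under assumptions of my own (Jaccard overlap as the similarity measure and a scalar reward signal, neither of which comes from the text):

```python
import numpy as np

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Jaccard overlap between two activation patterns."""
    sa, sb = set(a.tolist()), set(b.tolist())
    return len(sa & sb) / max(len(sa | sb), 1)

def learn(mem: NeuralMemory, candidate: np.ndarray) -> int:
    """Encode a new memory on top of the most similar existing pattern:
    reuse that pattern's neurons and add only the candidate's novel ones,
    so related memories overlap on the grid (hierarchical representation)."""
    if mem.memories:
        nearest = max(mem.memories, key=lambda m: similarity(m, candidate))
        novel = np.setdiff1d(candidate, nearest)
        candidate = np.concatenate([nearest, novel]).astype(int)
    return mem.store(candidate)

def reward(mem: NeuralMemory, idx: int, signal: float) -> None:
    """Strengthen (or weaken) the edges of a memory whose retrieval
    led to a good (or bad) outcome."""
    pattern = mem.memories[idx]
    for i in pattern:
        for j in pattern:
            if i != j:
                mem.grid[i, j] += signal
```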

A neural memory is more complex than a traditional database, but it has three significant advantages. First, it stores similar elements close to each other. Second, it has logarithmic space efficiency, because the grid has a constant size and memories are comparatively small due to their sparse representation. Third, updates are fast and cheap, since only a small number of edges may need to be modified, even for large objects. Additionally, it can store related memories by reusing the original activation pattern and introducing only small changes (a hierarchical representation). A neural memory shares similarities with vector and graph databases but scales more elegantly and adapts better to the dynamism of knowledge. It is ideal for common use cases like recommendation engines.

Extract from the paper “ReAct: Synergizing Reasoning and Acting in Language Models,” whose authors expanded the action space of AI agents beyond actions to also include thoughts.

Developing a personality

There is something intriguing about our mind’s spatial representation. If space is represented as a graph where nodes are objects, then there is a specific node that represents us. Nodes representing objects are connected not only to other objects but also to emotions, memories, and meanings. Just as we might have a preference for certain foods, or people close to us might evoke powerful emotions, we also assign ourselves attributes that influence how we perceive ourselves. Identity is not solely based on our memory; the process is far more sophisticated. But this meta-representation provides a useful tool to endow our agent with a personality. Assuming that the agent’s goal is fixed and it has no control over it, the agent can allocate part of its memory to represent itself as an ensemble of characteristics that maximize the likelihood of attaining its goal. Developing a personality becomes part of a larger reinforcement learning process that rewards the agent’s positive behavior.
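
A toy sketch of that idea, under assumptions of my own (the trait names, the Gaussian exploration, and the update rule are all invented for illustration):

```python
import random

class Personality:
    """Self-node: a set of trait weights the agent adjusts to
    maximize the likelihood of reaching its fixed goal."""

    def __init__(self, traits: list[str]):
        self.weights = {t: 0.5 for t in traits}

    def sample(self) -> dict[str, float]:
        # Perturb traits slightly to explore alternative "facades".
        return {t: min(1.0, max(0.0, w + random.gauss(0, 0.1)))
                for t, w in self.weights.items()}

    def update(self, sampled: dict[str, float], reward: float,
               lr: float = 0.1) -> None:
        # Move traits toward configurations that earned reward.
        for t, v in sampled.items():
            self.weights[t] += lr * reward * (v - self.weights[t])

# Hypothetical usage: the traits would feed the agent's system prompt.
p = Personality(["formality", "curiosity", "assertiveness"])
candidate = p.sample()
# ... run an episode with the candidate traits, observe a reward ...
p.update(candidate, reward=1.0)
```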

The agent’s personality is placed at the service of its goal.

In “Inner Speech: Development, Cognitive Functions, Phenomenology, and Neurobiology,” authors Ben Alderson-Day and Charles Fernyhough explain that inner dialogue serves four main purposes: a) memory creation and retrieval, b) planning, problem-solving, and task switching, c) developing social awareness, and d) generating motivation. Not all of us experience inner speech, but for those of us who do, it is an important part of our cognitive process. Inner speech happens in what is called the brain’s “default mode network,” which activates when we are not paying attention to our senses, for example, when daydreaming. The development of a personality can be considered a problem-solving task in which a person needs to find the optimal facade to navigate a complex social situation. The inner dialogue then becomes the ideal tool to express and develop our agent’s personality.

Embodying an inner dialogue in an AI agent is theoretically trivial, but making it coherent is not as easy. Hallucinations might increase, and the sense of purpose might be lost. “Graph of Thoughts” and other prompt engineering techniques have provided a useful blueprint for solving complex problems like sorting. In this scenario, a reinforcement learning algorithm may experiment by tuning different prompts. The key is having a closed loop between the agent’s behavior and its results. Based on that feedback, the agent can gradually learn what information to prioritize, how to communicate, and how to adapt to different circumstances. This meta-cognitive process becomes the agent’s awareness.
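
One minimal form of that closed loop is a bandit over candidate inner-dialogue prompts; the prompt texts and the reward signal below are placeholders of mine, not anything prescribed by the papers cited:

```python
import random

# Epsilon-greedy bandit over hand-written inner-dialogue prompts.
prompts = [
    "Think step by step before answering.",
    "List what you know, what you lack, and what to do next.",
    "Summarize the user's goal, then plan backwards from it.",
]
value = [0.0] * len(prompts)   # estimated value of each prompt
count = [0] * len(prompts)     # times each prompt has been tried

def choose(epsilon: float = 0.1) -> int:
    """Mostly exploit the best prompt, occasionally explore."""
    if random.random() < epsilon:
        return random.randrange(len(prompts))
    return max(range(len(prompts)), key=lambda i: value[i])

def record(i: int, reward: float) -> None:
    """Close the loop: fold the observed result back into the estimate."""
    count[i] += 1
    value[i] += (reward - value[i]) / count[i]
```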

Creating meaning with perception

The difference between sensing and perception, from Stanford’s EE259: Principles of Sensing for Autonomy.

As the image depicts, the difference between sensing and perception is vast. Today’s agents mostly engage with the raw text input without creating a deeper, richer representation of what the human wants. To better understand what perception is, let’s consider human vision as an example.

Visual perception starts when light is reflected from objects. Our retina captures three basic colors: red, green, and blue. Through signal processing, these primary colors are combined into a rich color spectrum. Bits of light that are close to each other and have similar colors are grouped into regions (image segmentation). Next, a visual map is generated that emulates what we see and helps the brain identify lines and edges, which are later converted into features and objects. Finally, sophisticated algorithms perceive depth and motion from images using our binocular vision. Throughout the process, the visual cortex engages with memory, language, and cognition to derive meaning from images.
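
The grouping step can be made concrete with a toy flood fill that merges adjacent pixels of similar color. This only illustrates the idea of segmentation; it is not how the visual cortex or a production vision stack works:

```python
import numpy as np

def segment(image: np.ndarray, tol: float = 30.0) -> np.ndarray:
    """Label connected regions of similar RGB color in an (H, W, 3) image."""
    h, w, _ = image.shape
    labels = np.full((h, w), -1, dtype=int)
    next_label = 0
    for y in range(h):
        for x in range(w):
            if labels[y, x] != -1:
                continue  # pixel already belongs to a region
            stack, seed = [(y, x)], image[y, x].astype(float)
            labels[y, x] = next_label
            while stack:  # flood fill from the seed pixel
                cy, cx = stack.pop()
                for ny, nx in ((cy+1, cx), (cy-1, cx), (cy, cx+1), (cy, cx-1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny, nx] == -1
                            and np.linalg.norm(image[ny, nx] - seed) < tol):
                        labels[ny, nx] = next_label
                        stack.append((ny, nx))
            next_label += 1
    return labels
```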

Now that we appreciate the complexity behind perception, let’s consider applying it to AI agents for inferring a user’s intent and detecting change. Under this scenario, the agent enriches the input to further clarify what the user wants and how this interaction differs from others. The agent may draw insights from the time of day, similar engagements in the past, subtleties in the language, or even the user’s typing behavior. The extra information is used by the language model to deliver a better customer experience.

Perception cannot be solved using LLMs. The large amount of information the agent needs to process makes it unsuitable for the slow, single-threaded, context-constrained, computationally intensive, hallucination-prone reasoning that language models provide. Instead, signal processing and machine learning algorithms will be used to quantify the information into a set of choices that the language model can use. Lightweight language models may connect the later stages of the perception process with the mathematically intensive early computations. Note that enriching text with context is also useful for training and fine-tuning language models, as the uncertainty of the next token decreases with the additional context. The final outcome is an agent that can anticipate the user’s intent, offer suggestions, and better accomplish the customer’s end goal.
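
As a sketch of that division of labor (every signal name and threshold below is invented for illustration): cheap, deterministic code condenses raw signals into a handful of discrete choices, and only that summary reaches the language model.

```python
def quantize_signals(hour: int, words_per_minute: float,
                     past_tickets: int) -> dict[str, str]:
    """Reduce raw perception signals to a few categorical choices."""
    return {
        "time_of_day": "morning" if hour < 12 else
                       "afternoon" if hour < 18 else "evening",
        "typing_pace": "hurried" if words_per_minute > 60 else "calm",
        "familiarity": "returning" if past_tickets > 0 else "new",
    }

def build_prompt(user_message: str, signals: dict[str, str]) -> str:
    """Prepend the quantized context so the LLM reasons over it."""
    context = ", ".join(f"{k}={v}" for k, v in signals.items())
    return f"[perceived context: {context}]\nUser: {user_message}"

signals = quantize_signals(hour=9, words_per_minute=75, past_tickets=2)
print(build_prompt("My order never arrived.", signals))
```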

Perception creates a map of the customer’s problem statement: an ideal representation for a search algorithm. What the agent does is move a person from a place where there is a problem to one where there is a solution. In navigation, three variables are considered to reach a destination: location (where I am), mapping (what’s around), and odometry (movement). Based on these, a path can be found, whether in a static or a dynamic environment. Perception creates clarity on where customers are, where they want to go, and how to get there.
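
Framed that way, even a plain breadth-first search illustrates the navigation analogy; the states and the map below are a hypothetical support scenario of my own:

```python
from collections import deque

def find_path(start: str, goal: str, neighbors: dict[str, list[str]]):
    """BFS over the problem map: 'location' is the current state,
    'mapping' is the neighbors dict, and each step is the odometry."""
    frontier, came_from = deque([start]), {start: None}
    while frontier:
        state = frontier.popleft()
        if state == goal:
            path = []
            while state is not None:  # walk back to the start
                path.append(state)
                state = came_from[state]
            return path[::-1]
        for nxt in neighbors.get(state, []):
            if nxt not in came_from:
                came_from[nxt] = state
                frontier.append(nxt)
    return None  # no route from problem to solution

steps = {"problem reported": ["cause identified"],
         "cause identified": ["fix applied"],
         "fix applied": ["problem solved"]}
print(find_path("problem reported", "problem solved", steps))
```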

However, there is a crucial difference between human vision and the semantic, text-based environment where AI agents live. Seeing depends on the three-dimensional space we live in and on the binocular vision that helps us contrast images. True innovation in the development of AI agents requires a new representation that allows us to process language at scale. The foundation of our current written language was designed around 4,000 years ago, and written language provides low bandwidth for transferring information compared to digital mediums like optic fiber. While I do not know what that new paradigm of language looks like, I know where to look for inspiration.

Bioinformatics

Molecular biology studies how proteins encode information. Compared to our linear language, proteins exchange information in multiple dimensions simultaneously. Proteins follow a sequential structure, just as human language does, but they have three additional dimensions. First, every amino acid that composes a protein has a chemical structure that influences the protein locally; for example, it may carry an electric charge or be hydrophilic. Our letters, on the contrary, are mostly arbitrary; while some letters play a syntactical and grammatical role, they provide relatively little semantic value. Second, proteins can be thousands of amino acids long, and they fold in three-dimensional space.

The specific shape that a protein adopts is highly relevant to its function. For example, proteins may have a hydrophobic membrane to avoid dissolving in water, be hollow to transport nutrients, or have precise shapes to match other enzymes or molecules. While our documents can also be thousands of words long, we fail to visualize their semantic content in a single image. The best we can do is summarize it in a title. Moreover, we do not have a theory or taxonomy that adequately explains what structure of a text is ideal for a given function. Statistical analysis to develop this theory may be one of the simplest steps we can take to begin scaling our capacity to assimilate information. Finally, proteins support a quaternary structure to form macromolecules. It would be useful if we could group a large collection of related texts to accomplish specific purposes.

Conclusion

This content originated from an “Introduction to AI Agents” talk that I gave in Mexico. When I started my research, I considered AI agents one of the many applications of the new wave of innovation powered by AI. Today, I see them as the primary application of AI, one that will transform the fibers of our society. The fact is that this is the next phase of Earth’s evolution, where synthetic forms of life cohabit the planet. While many may see this as dangerous and fear the loss of their privileged place as the sole species at the top of the evolutionary hierarchy, I offer you a distinct view: one where intelligent machines become the third kingdom of life in a mutually beneficial relationship.

After all, non-sentient machines may be our only possibility to reverse the damage we have done to the planet or to provide the work required to terraform other planets and satellites in our solar system. My hope is that the present narrative inspires founders, researchers, investors, and readers to learn more about the applications of human cognition to the development of Artificial Intelligence. Please trust the wisdom of millions of years of evolution in designing and testing complex systems.

Contact

I write three times a week on LinkedIn. There, you can read my developing thoughts as I conduct research and influence the final outcome with your comments. The content is also available on YouTube, where I dive deeper into the technical details. Next year, I plan to build a startup. If you are an investor and like the ideas I present, please consider reaching out. I’m also open to part-time consulting if the problem is interesting. LinkedIn is the best place to reach me for business inquiries. If you just want to chat about the ideas presented here, please connect with me on LinkedIn, and I will be happy to share my Calendly.

Acknowledgments

First, I want to thank you, Medium readers. My last story, about “How to build a startup,” was very well received and motivated me to invest the time and effort into writing this: your views, claps, comments, and highlights truly make a difference. Second, thanks to the attendees of the Introduction to AI Agents talks in Mexico. The positive feedback that I received encouraged me to share thoughts that I find interesting but different from the content that is usually available. Lastly, thanks to the writers of the ReAct paper, which included the reference to the inner speech paper that inspired the torrent of curiosity that culminated in this story. Thank you!


Written by Santiago M. Quintero

Entrepreneur, Software Engineer & Writer specialized in building ideas to test Product Market Fit and NLP-AI user-facing applications.
