Why pursue understanding agency in AI safety

30 November 2023

I believe that the question of ensuring human agency is broadly neglected by alignment researchers. Even if we get value alignment right and point the system at the correct goals, we can still end up in a bad future if agency is not a core consideration of the system from the beginning. Conceptions of powerful AI systems need to treat human agency as a fundamental consideration in their operation, above and beyond simply trying to get them to broadly care about the same things we do.

Furthermore, framing the problem around preserving the agency of a single human helps to gain traction on it, and I think it makes many downstream problems of alignment simpler.

Below is an example of a possible failure mode that arises even though we manage to broadly align the AI with our values.

Humanity builds an AI system with all the goals and values we might hope to align it with.

It then goes into the world doing its work to ensure these goals are worked towards according to these values.

Humans are now left in a scenario in which they have limited capacity to change the operations of the system. Even if human values are preserved, the operations of the system become essentially inscrutable to us, and humans lose their individual and collective agency.

One could argue that this represents a failure in alignment, but I think that such a system could easily contain what we call "our values", in that it would operate within a preferable distribution of behaviours, with the expectation that values improve and change over time. However, the underlying paradigm is still one that fundamentally disempowers humans.

Why should we worry about this failure mode when we still haven't solved value alignment? My answer is that we need to think from the perspective of agency because it helps us to understand the problem of the AI's relationship to the individual.

For example, how do we ensure that the AI system is preserving the agency of its human operator in their interactions? How do we measure the agency of the two agents in this interaction and model it, and what levers do we need to pull to get more of the kind of behaviours we want? When asked to perform a task, to what extent will completing that task disempower the human operator or other humans?

This view has advantages because it shifts the focus of getting alignment right away from a world-saving narrative, in which the AI carries out world-altering actions, and towards the specific dynamics of AGI development: how the system relates to individual users and how that interaction scales up. This is important because I expect it to reflect more closely the actual development of AGI as it is happening today, and because it is simpler to solve the problem of agent-to-agent interactions than to solve the one-to-many problem of AGI interacting with humanity. By solving the problem of agency preservation we inherently solve the problem of preserving human destiny, because human destiny is shaped through the accumulated actions of humanity as a whole.

One way to think of this is in terms of Scott Alexander's 'Ascended Economy', where the complexity is expressed in the external system that the economy develops, i.e. the complexity is in the databases and supply chains that are governed by autonomous AI systems whose moves at some point become inscrutable to us.

I am less confident about this point, but it seems to me that a model has a larger gap to cross before it starts thinking about instrumental goals such as self-improvement when its goal is to serve a particular person's needs while ensuring their agency, compared with an AGI in a lab that is asked to solve some wish-like specific reward function such as "do the pivotal act". It may be naive, but this framing feels much more helpful for how I expect things to actually happen.

I think this problem is also fundamental up and down the impact chain, regardless of whether you are worried about the complete extinction of the human race or simply about individuals being disempowered and jobs being taken away. The problem is common to both, and is thus palatable to a broad swath of research tastes.

This doesn't get rid of many of the original problems but rather inverts them. However, I see this as the paradigm we are moving towards, and thus as more worthy of our attention.

This view seems to be neglected by the alignment community, though I see it as central to the paradigm of OpenAI and Anthropic, given the nature of their products. It does not receive much attention in the discussion, perhaps because it is a "what is water" situation in which we are already living within the problem context. This paradigm seems more likely to persist, as distributed services offer a smoother track for product development in the private sector than single super-powered AI systems do.

Ultimately, I believe that focusing more on agendas that rest on the assumptions implied by this paradigm will lead to research more closely aligned with reality, and to more fruitful approaches to ensuring AI safety. These include agendas on personal autonomy, economic regulation, and understandability, as well as a deep understanding of the incentives that underlie our economic structures and of those that motivate our AI systems.
