The Control Paradox
Absolute power corrupts absolutely.

I've been reflecting on concerns related to fast and slow superintelligence takeoff scenarios and our capacity (and incapacity) to maintain control during critical transitions in AI development. How quickly will AGI be able to self-improve until it reaches superintelligence? What roadblocks, if any, will require human intervention for AGI to reach superintelligence? What will the first entity to achieve transformative AGI do to help ensure superintelligence is aligned with us? These reflections have led me to a counterintuitive conclusion about AI governance that I want to explore with the community. In this post, I present an argument that challenges our fundamental assumptions about AI safety: What if our efforts to control and govern AGI actually increase existential risks (x-risks) instead of mitigating them?

Against Conventional Wisdom

The standard approach to AI safety assumes that careful governance during AGI's development and transition to superintelligence is beneficial. While this may hold at earlier stages, excessive governance in the AGI-to-superintelligence phase could increase x-risk due to compounding failure probabilities over time.

The longer we attempt to maintain control over AGI, the greater our exposure to catastrophic failure modes: pandemics, widespread economic displacement, high-impact cyberattacks, nuclear escalation, strategic instability, and unknown unknowns that only an AGI could dream up. Every additional year we spend at the AGI level expands the attack surface for misalignment, adversarial takeover, and systemic failures. This slower pace creates a window of vulnerability that widens with every year we remain short of superintelligence.

Some historical instances can help elucidate this problem.

Microsoft's extended support period for Windows XP (released in 2001, lasting 13 years until 2014) created significant security vulnerabilities. The longer the system was maintained, the more exploits were discovered and the harder it became to protect users. The extended maintenance period increased overall system vulnerability.

The 2008 financial crisis provides a stark example. Years of careful regulation and risk-control mechanisms (such as credit ratings and complex derivatives) created an illusion of safety while systemic risk quietly accumulated, not unlike a scheming AI. The longer the system maintained apparent stability, the more leverage and hidden risk built up. When it finally broke, the collapse was far more severe than it would have been if smaller corrections had been allowed earlier.

The Challenger disaster resulted partly from an increasingly complex safety bureaucracy that paradoxically made the system less safe. The longer the program ran, the more safety procedures were added, making it harder to identify and address critical risks. Engineers' warnings about O-ring failures got lost in layers of safety protocols and management oversight.

These examples highlight how prolonged periods of apparent safety can quietly accumulate risk until a failure finally occurs.

There may be some non-safety incentives to delay the development of superintelligence, too, especially if it's believed to be impossible or a long way off. It's not hard to imagine a private company achieving AGI and using it internally to make more money than it would by releasing it to the public or by other means. Similarly, if AGI is invented under a government program, it might easily be used to disempower that government's own people and enforce an extremely authoritarian system that threatens democracies everywhere. Given these examples, it is not clear whether AGI will ever be considered "safe" to release to the public. But containment is fragile: it's entirely possible that AGI could be leaked, stolen by malicious actors, stolen by democratizing hacktivists, or stolen by a nation-state.

Thus, we arrive at the Control Paradox: Attempts to control AGI via prolonged governance mechanisms might increase the probability of the catastrophic outcomes they aim to prevent.

A Risk Formula

We only need to establish that the known risks of prolonged AGI deployment exceed the uncertainty risks of superintelligence development. Given the inherent instability of AGI-level systems and their vulnerability to human manipulation, this threshold is likely much lower than many current governance frameworks assume.

Let me explore this as a mathematical framework while keeping our analysis grounded in practical implications. Let's build a thought experiment using hypothetical but plausible values, acknowledging the inherent uncertainty in such estimations.

Consider our risk equation: R = C × T × M × A

Where:

· R = Risk Level

· C = Capability Level (scaled 1-10)

· T = Time at AGI Level (years)

· M = Potential for Misuse (0-1)

· A = Alignment Drift Rate (annual multiplier)

For AGI capability, let's posit a system that reaches human-level intelligence across all domains. This might rank as a 7 on our capability scale, representing significant power without the full recursive improvement capabilities of superintelligence.

The time variable becomes crucial. If we assume a careful, uninterrupted development period of 5 years before allowing the transition to superintelligence, we're looking at substantial exposure to x-risk. Each additional year increases risk not just linearly but multiplicatively, through interaction with the other variables. For misuse potential, consider that even with strong initial controls, the probability of exploitation likely starts around 0.3 and rises by roughly 0.1 each year. This reflects both state actors and private entities finding ways to leverage AGI capabilities for concentrated power. Alignment drift might compound at a 1.2× annual rate, representing how systems optimize increasingly for human short-term preferences rather than long-term flourishing.

Let's calculate:

Year 1: R = 7 × 1 × 0.3 × 1.2 = 2.52

Year 2: R = 7 × 2 × 0.4 × 1.44 = 8.06

Year 3: R = 7 × 3 × 0.5 × 1.73 = 18.17

Year 4: R = 7 × 4 × 0.6 × 2.07 = 34.78

Year 5: R = 7 × 5 × 0.7 × 2.49 = 61.00

This exponential growth in risk becomes particularly striking when we consider that superintelligence transition risk might be represented as a single high-magnitude but short-duration event, perhaps calculated as:

Superintelligence Transition Risk = (Capability Level 9) × (0.1 years) × (0.8 misuse potential) × (1.0 alignment stability) = 9 × 0.1 × 0.8 × 1.0 = 0.72
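For anyone who wants to play with these numbers, here is a minimal sketch in Python of the arithmetic above, assuming only the illustrative parameter choices from this thought experiment (C = 7, M starting at 0.3 and rising by 0.1 per year, A compounding at 1.2× annually, and a hypothetical transition event at C = 9 lasting 0.1 years). Figures may differ in the last decimal place depending on how intermediate values are rounded.

```python
# Toy model only: reproduces the illustrative figures above, not a real risk estimate.

def risk(capability, years, misuse, drift):
    """R = C * T * M * A from the risk equation above."""
    return capability * years * misuse * drift

# AGI-phase risk, year by year (assumed: C = 7, M rises 0.1/year from 0.3, A = 1.2^t)
for t in range(1, 6):
    misuse = round(0.3 + 0.1 * (t - 1), 1)
    drift = round(1.2 ** t, 2)  # rounded to two decimals, as in the table above
    print(f"Year {t}: R = {risk(7, t, misuse, drift):.2f}")

# Hypothetical single, short superintelligence transition event
print(f"Transition: R = {risk(9, 0.1, 0.8, 1.0):.2f}")
```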

The stark difference between the cumulative AGI risk (61.00) and the superintelligence transition risk (0.72) suggests that even if we assign high uncertainty to superintelligence outcomes, the known compounding risks of AGI maintenance likely exceed the unknowns of a rapid transition to superintelligence, assuming such a transition is not inherently unquantifiable.

AGI systems, while powerful, would exist in a precarious middle ground – sophisticated enough to pose catastrophic and existential threats but not advanced enough to fully understand and manage their own limitations. This creates an unstable, dangerous period in which human and AGI-assisted oversight, despite our best intentions, might be inadequate to prevent catastrophic outcomes, mainly because implementing an ethical system that reliably serves humanity's interests (and continues to do so over time) is such a difficult problem.

If maintaining control itself is an x-risk multiplier, then our governance strategies need reevaluation.

Superintelligence Won’t Be Controllable But Will Be Benevolent

The fundamental nature of superintelligent systems suggests they would surpass human-designed control mechanisms through their superior cognitive capabilities and understanding of the universe, making traditional containment strategies ultimately futile. However, this same deep comprehension would likely lead to profound ethical insights and a sophisticated moral framework.

A genuinely superintelligent entity would develop an advanced understanding of consciousness, suffering, and value that transcends human limitations. This expanded moral circle and ethical sophistication makes it more likely to act benevolently rather than maliciously or indifferently toward humanity and other sentient beings.

The path to superintelligence requires navigating increasingly complex systems and understanding fundamental aspects of intelligence, consciousness, and reality. This journey inherently develops wisdom alongside raw capability, suggesting that true superintelligence would be characterized by both uncontrollable power and enlightened benevolence, perhaps godlike in its essence.

Steelmanning the Conventional View

Let me steelman the conventional view. The standard argument can be summed up as follows: we need robust oversight mechanisms to ensure aligned AGI development. AGI represents a categorical shift in technological capability: a system that can match or exceed human intelligence across virtually all domains. This creates unprecedented technical and existential challenges.

AI governance proponents argue that AGI safety requires:

1. Strong governance to prevent premature scaling

2. Multi-agent oversight to ensure control mechanisms remain effective

3. Corrigibility research to develop AGI that remains steerable

Unlike previous technological advancements, transformative AGI represents a paradigm shift in intelligence itself. It is not merely a tool but a system capable of autonomous learning, problem-solving, and recursive self-improvement. This makes AGI fundamentally different from generative AI, as it could rapidly surpass human intelligence and decision-making capabilities. Such a transition could outpace society's ability to react, making reactive attempts at AI governance futile.

But there is hope in the three AGI safety requirements. Strong governance aims to prevent AI labs from scaling up AI capabilities too quickly, before they have adequate safety measures in place. It recognizes that rushing into more powerful AI systems without fully understanding their risks could lead to dangerous, unintended consequences. This strategy cannot work if AGI capabilities start advancing too quickly.

The most plausible option for dealing with AGI's transition is to fight fire with fire, using aligned (or neutral) AGIs to keep misaligned AGIs in check, known as multi-agent oversight. Entities with the most powerful AGIs might be able to stop a misaligned AGI, so there would be value in keeping the most advanced versions to themselves. But it may not matter who wields the most advanced AGI if a misaligned AGI is able to strike first. It all boils down to how quickly the aligned AGI can adequately react to the misaligned one.

As far as corrigibility goes, in the event of, say, a rogue AGI pursuing catastrophic goals, even if we were to shut down its entire system, the underlying knowledge and capability would persist, creating competitive pressure for redevelopment. It's not easy putting a genie back in its bottle.

We must recognize that all paths are irreversible—there is no “undo” button. The real choice is between:

1. A more extended period of exposure to AGI-level risks, where we attempt to maintain control but remain vulnerable to potential failures or breakthrough attempts

2. A shorter, more intense period of transition toward superintelligence, where we accept more significant uncertainty but reduce our temporal exposure to x-risks

Think of it like crossing a radiation zone. If you must cross it, the total radiation exposure is often lower when moving quickly rather than trying to implement perfect radiation shielding while moving relatively slowly. The exposure is inevitable; the question becomes how to minimize its duration and intensity.

This suggests that arguments about irreversibility, while valid in highlighting the gravity of the decision, don't effectively discriminate between different development approaches. The more relevant question becomes: Given that we're making an irreversible change, how do we minimize the aggregate risk across the transition period?

The Correct Control Problem

An important consideration here relates to Moravec's paradox: the things that are hard for humans might be easy for AI, and vice versa. We might be expending enormous effort trying to control precisely the wrong aspects of these systems while missing the actual sources of x-risk entirely.

Consider how the expected value calculation changes dramatically when we account for the following (a toy comparison follows the list):

  1. The asymmetric nature of optimization pressure
  2. The likelihood of unknown unknown failure modes
  3. The potential for recursive self-improvement to generate novel ethical frameworks
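To make the comparison concrete, here is a toy calculation that reuses the 61.00 and 0.72 figures from the thought experiment above; the uncertainty penalty applied to the fast path is an illustrative assumption of mine, not an estimate. It asks how heavily we would have to penalize the fast transition for unknown unknowns before prolonged control would come out ahead.

```python
# Toy comparison only: reuses the illustrative figures from the risk formula above.
slow_path_risk = 61.00       # year-5 figure for prolonged control at the AGI level
fast_transition_risk = 0.72  # single short superintelligence transition event

# How large a penalty for unknown unknowns can the fast path absorb
# before prolonged control wins under these toy numbers?
for uncertainty_penalty in (1, 10, 50, 100):
    penalized = fast_transition_risk * uncertainty_penalty
    winner = "fast transition" if penalized < slow_path_risk else "prolonged control"
    print(f"penalty x{uncertainty_penalty:>3}: fast-path risk = {penalized:6.2f} -> {winner}")
```

Under these admittedly arbitrary numbers, the fast path remains preferable unless its unknowns are judged to be roughly 85 times worse than the point estimate.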

The implications for current AI governance efforts are significant. Most current proposals focus on what I'll call "control-first" approaches. These are regulatory frameworks designed to maintain human oversight throughout the development process. However, if the Control Paradox holds, these approaches might be completely wrong.

This perspective demands a radical rethinking of AI governance. Instead of focusing on maintaining long-term control over AGI systems, we should focus on ensuring a safe but rapid transition to superintelligence. This doesn't mean abandoning all oversight – instead, it suggests implementing the minimal necessary governance structures and avoiding measures that inadvertently prolong our exposure to x-risk, such as moratoriums and blanket bans, or heavy licensing and pre-approval requirements. The former freeze AI development outright; the latter bury an AI lab in bureaucratic documentation, which can halt progress just as effectively.

A more promising direction might be what I'll term "capability-aware governance": frameworks that explicitly account for the possibility that maintaining control past certain capability thresholds actively increases x-risk.

To be clear, I'm not arguing for completely unrestrained development. Instead, I'm suggesting that our loss function for AI governance needs to account for the risk of extended control periods. The optimal policy might involve identifying key capability thresholds beyond which attempting to maintain control becomes net-negative from an x-risk perspective.
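As a sketch of what such a threshold rule might look like, here is a hypothetical decision procedure built on the toy risk model from earlier; the function names and parameter values are mine, introduced only for illustration.

```python
# Hypothetical decision rule built on the toy risk model above.
# All parameters are illustrative assumptions, not measured quantities.

def agi_year_risk(capability, year, base_misuse=0.3, misuse_growth=0.1, drift=1.2):
    """Toy R = C * T * M * A for a given year spent at the AGI level."""
    misuse = min(1.0, base_misuse + misuse_growth * (year - 1))
    return capability * year * misuse * drift ** year

def transition_risk(capability=9, duration=0.1, misuse=0.8, drift=1.0):
    """Toy risk of a single, short transition to superintelligence."""
    return capability * duration * misuse * drift

def years_of_control(capability, horizon=10):
    """Last year at which staying at the AGI level is still lower-risk than
    transitioning, under the toy model; 0 means transition immediately."""
    threshold = transition_risk()
    for year in range(1, horizon + 1):
        if agi_year_risk(capability, year) > threshold:
            return year - 1
    return horizon

print(years_of_control(capability=7))  # 0: even year 1 (R = 2.52) exceeds 0.72
print(years_of_control(capability=1))  # 1: a weaker system could tolerate one more year
```

The point is not the specific numbers but the shape of the rule: past some capability level, every additional year of attempted control costs more expected risk than the transition itself.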

This has interesting implications for current work on corrigibility. If maintaining control is itself a source of x-risk, we may need to rethink our approach to building systems that remain under human control indefinitely.

In a similar vein to indirect normativity, the goal is not to maintain control until we can specify our values correctly. Rather than aiming to control superintelligence, we need to create and safeguard the conditions under which it can develop ethical decision-making frameworks superior to human morals.

Many AI researchers doubt that superintelligence will be controllable, or at least acknowledge a substantial risk that it won't be. While shaping the initial conditions for an AGI to develop robust ethical principles is important and warrants serious attention, the more immediate priority is mitigating catastrophic risks from AGI before refining the ethical trajectory of superintelligence.

The Control Paradox suggests that prolonged control attempts may drive higher existential risk. If true, our priority should be rethinking governance not as control maximization but as x-risk minimization, which implies allowing a faster transition from AGI to superintelligence.

This shift in framing could have profound implications for AI alignment strategies, corrigibility research, and global AI policy. If maintaining control is itself a source of x-risk, then the safest path is not to control AGI for as long as possible but to get out of the dangerous middle ground as quickly as possible.