My focus will be on the actual arguments in the section on optimization pressure, since that seems to be the true objection here—the previous sections seem to be rhetoric and background, mostly accepting the theoretical basis for the discussion.
I take it this essay presumes that the pure version of the argument is true—if you were so foolish as to tell a sufficiently capable AGI ‘calculate as many digits of Pi as possible’ with no mitigations in place, and it has the option to take over the world to do the calculation faster, it’s going to do that.
However, I interpret you as saying that in practice this wouldn't happen, because of practical considerations and countermeasures? Is that right?
I take the quoted sections here to be the core arguments:
It’s true that you can’t get the coffee if you’re dead. But that doesn’t imply that any coffee-fetching plan must include personal security measures, or that you have to take over the world just to make an apple pie. What would push an innocuous goal into dangerous power-seeking?
The only way I can see this happening is if extreme optimization pressure is applied. And indeed, this is the kind of example that is often given in arguments for instrumental convergence.
...
But why would a system face extreme pressure like this? There’s no need for a paperclip-maker to verify its paperclips over and over, or for a button-pressing robot to improve its probability of pressing the button from five nines to six nines.
More to the point, there is no economic incentive for humans to build such systems. In fact, given the opportunity cost of building fortresses or using the mass-energy of one more star (!), this plan would have spectacularly bad ROI. The AI systems that humans will have economic incentives to build are those that understand concepts such as ROI. (Even the canonical paperclip factory would, in any realistic scenario, be seeking to make a profit off of paperclips, and would not want to flood the market with them.)
The implication here is that there are reasons not to seek power or do too much verification—it's dangerous, it's expensive and it's complicated. To overcome the optimization pressures acting against doing that, you'd need to exert even more powerful pressure in favor of it, which wouldn't be present if you had a truly bounded goal that already had, e.g., p~0.99 of happening without it, because the risk of disruption, or the cost in resources, would exceed the gain from power-seeking.
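To make that concrete, here is a toy version of the trade-off (the numbers are made up purely for illustration, they aren't from the essay): once the bounded goal already succeeds with p~0.99, the marginal probability that power-seeking buys is tiny, so any real cost or risk term in the objective swamps it.

```python
# Toy illustration of the argument above (all numbers are made-up assumptions):
# with a bounded goal that already succeeds with p ~ 0.99, the marginal value of
# power-seeking is small, so even a modest cost or risk penalty outweighs it.

GOAL_VALUE = 1.0  # reward for fetching the coffee / making the 32 paperclips

def expected_utility(p_success: float, extra_cost: float) -> float:
    """Expected reward minus whatever the plan costs (resources, disruption risk)."""
    return p_success * GOAL_VALUE - extra_cost

modest_plan = expected_utility(p_success=0.99, extra_cost=0.0)       # just do the task
power_seeking = expected_utility(p_success=0.9999, extra_cost=0.05)  # build the fortress first

print(modest_plan, power_seeking)  # 0.99 vs ~0.95: power-seeking loses
# Note: remove extra_cost and the power-seeking plan wins on raw probability alone,
# which is why everything below turns on whether the cost terms are in the objective.
```

The catch, and where the rest of this goes, is that this only holds if the cost term is actually in the objective.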
Let’s consider the verification question first. If you give me affordances, and then reward me based purely on a certain outcome, we agree I’ll use those affordances as best I can, even if the gains are minimal. A common version of this is someone going over their SAT answers for the sixth time: the stakes are so high, they might as well use all the time given to them. There are always students who will use every second you give them; they’d fall asleep at their desk if you let them, then wake up and keep trying.
The question is, why in practice wouldn’t you stop at a reasonable point given the cost? That ‘reasonable’ is based on the affordances given, and on what terms you effectively built into the reward function. Sure, if you put in a cost term, at some point it stops verifying, but you have to put in the cost term, or it will keep verifying. If you didn’t say exactly 32 paperclips, or require it to deliver you exactly those 32, it will make 32,000 paperclips instead, because that is a good way to ensure you made 32 good ones, and so on.
Thus your defense is to start with a bounded goal: ‘make 32 paperclips’ or ‘fetch me the coffee.’ Then you put in penalty terms (asymmetrical ones, I hope!) for things like costs and impacts. That could work.
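As a minimal sketch of what that could look like (the weights, the impact measure, and the exact functional form here are my own illustrative assumptions, not a worked-out proposal): a bounded goal term capped at the target, plus asymmetric penalties so that overshooting, burning resources, or having side effects is punished much harder than stopping at ‘good enough.’

```python
# Minimal sketch of 'bounded goal plus asymmetric penalties', assuming a scalar
# reward over (paperclips made, resources spent, side-effect impact).
# Weights and the impact measure are illustrative assumptions only.

def reward(paperclips: int, resources_spent: float, impact: float,
           target: int = 32,
           overshoot_weight: float = 10.0,  # asymmetric: overshooting is penalized hard
           cost_weight: float = 1.0,
           impact_weight: float = 5.0) -> float:
    # Bounded goal: no extra credit beyond the target.
    goal_term = min(paperclips, target) / target
    # Asymmetric penalty: making 32,000 paperclips "to be sure" is heavily discouraged.
    overshoot = max(paperclips - target, 0)
    return (goal_term
            - overshoot_weight * overshoot / target
            - cost_weight * resources_spent
            - impact_weight * impact)

print(reward(paperclips=32, resources_spent=0.01, impact=0.0))      # ~0.99
print(reward(paperclips=32_000, resources_spent=0.5, impact=0.2))   # hugely negative
```

The design choice doing the work is the asymmetry: making 32,000 paperclips ‘to be sure’ has to come out strictly worse than making 32 and stopping.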
You still have to worry that there will be a ‘way around’ those restrictions. For example, if there’s a way to make money that can then be spent, or to otherwise gain power or capabilities in a net-profitable way, and this is allowed without penalty, suddenly there’s a reason to go maximalist again, and why not? It’s certainly what I would do. Or if sufficient power lets it change the rules around its reward, of course. Or if there’s a way to game the specification that you didn’t anticipate. Again, that’s what I would look to do.
It is not trivial to specify exactly what you want here, but yes, it is possible to prevent instrumental convergence this way in a given case. The problem is that as the affordances and capabilities of the system increase, the attractiveness of these alternative strategies and its ability to find them increase, and your attempts to block them become more likely to fail—not that it’s impossible in theory to solve the issue in any given case.
The other problem is that if some people solve this while others do not, some systems will seek power and others will not, which does not solve our collective problem at all. The systems that don’t seek power quickly become irrelevant. And this is a strong argument, from the perspective of such a system and of its owner, for seeking power. If you intend to kill me to ensure you can fetch your boss’s coffee, then I cannot sit on my hands and be a humble assistant, or I will fail.
With fully maximalist goals you are in much deeper trouble, and often people give AIs maximalist goals—the most clicks or engagement, the most profits or paperclips, and so on. Then what do you actually want to happen?
Often the best way to do something really will be to seek power, or humans do choose this on reflection.
(E.g. IRL: oil companies overthrow governments, people fight world wars in order to ensure their freedom on their farm or to implement their favorite distribution of resources, engage in grand conspiracies or globe-spanning decades-long epic quests to win someone’s heart, wreck entire industries in order to protect a handful of jobs, work every day of their entire lives to earn more money without ever having a plan to spend it, etc.)
Most people spend most of their time pursuing instrumental goals—power, money, knowledge, skills, influence and so on. If you tell a system to ‘make the most money’ as many people will, what happens? It’s not that easy to put in sufficient correction terms, and when you do, you really do hurt the capabilities of the system to achieve the goals specified.
(Happy to do a call, I deleted like 3 attempts on this, and higher bandwidth / feedback likely helps here)
I think it’s an important crux of its own what level of such safety is necessary or sufficient to expect good outcomes. What is the default style of situation and use case? What can we reasonably hope to prevent from happening at all? Do our ‘trained professionals’ actually know what they have to do, especially without being able to cheaply make mistakes and iterate, even if they do have solutions available? Reality is often so much stupider than we expect.
Saying ‘it is possible to use a superintelligent system safely’ would, even if true, be highly insufficient, unless you knew how to do that, were willing to make the likely very, very large performance sacrifices necessary (pay the ‘alignment tax’) in the face of very strong pressures, and could also ensure that no one else did it differently and that this state persists.
Other than decelerationists, I don’t see people proposing paths towards keeping access to such systems sufficiently narrow, or constraining competitive dynamics such that people with such systems have the affordance to pay large alignment taxes. If it is possible to use such systems safely, that safety won’t come cheap.
I do think you are right that we disagree about the nature of such systems.
Right now, I think we flat out have no idea how to make an AGI do what we’d like it to do, and if we managed to scale up a system to AGI level using current methods, even the most cautious user would fail. I don’t think there is a localized ‘power-seeking’ thing that you can solve to get rid of this, either.
But yeah, as for the crux: it’s hard for me to pinpoint someone’s alternative mindset on how these systems are going to work that would make ‘use it safely’ a tractable thing to do.
Throwing a bunch of stuff out there I’ve encountered or considered, in the hopes some of it is useful.
I think you’re imagining maybe some form of… common sense? Satisficing rather than pure maximization? Risk aversion, model uncertainty and tail-risk concerns causing the AI to avoid disruptive actions if not pushed in such directions? A hill-climbing approach not naturally ‘finding’ solutions that require a lot of things to go right and that wouldn’t work without a threshold capabilities level (there’s a proof I don’t have a link to at the moment that gradient descent will almost always find the optimal solution rather than get stuck in a local optimum, but yeah, this does seem weird)? That the AI will develop habits and heuristics the way humans do, which will then guide its behavior and keep things in check? That it ‘won’t be a psychopath’ in some sense? That it will ‘figure out that we don’t want it to do these things’ and optimize for that instead of its explicit reward function, because that was the earlier best way to maximize its reward function?
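For what it’s worth, here is the kind of distinction I have in mind for the satisficing option in that list, as a toy sketch (the plans and scores are made up for illustration): a satisficer takes the first plan that clears a ‘good enough’ bar, while a maximizer keeps pushing toward the extreme, which is where the optimization-pressure worries come from.

```python
# Toy contrast between satisficing and maximizing over candidate plans.
# The plan list and scores are hypothetical, purely to show the difference in selection rule.

from typing import Callable, Optional

Plan = str

def satisfice(plans: list[Plan], score: Callable[[Plan], float],
              threshold: float) -> Optional[Plan]:
    for plan in plans:              # stop at the first acceptable plan
        if score(plan) >= threshold:
            return plan
    return None

def maximize(plans: list[Plan], score: Callable[[Plan], float]) -> Plan:
    return max(plans, key=score)    # always pushes to the extreme

plans = ["fetch coffee normally", "fetch coffee with armed escort", "take over the building first"]
score = {"fetch coffee normally": 0.95,
         "fetch coffee with armed escort": 0.97,
         "take over the building first": 0.999}.get

print(satisfice(plans, score, threshold=0.9))   # 'fetch coffee normally'
print(maximize(plans, score))                   # 'take over the building first'
```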
I don’t put actual zero chance on some of these things happening, although in each case I can then point to what the ‘next man up’ problem is down the line if things go down that road...