Meta-level: I think to have a coherent discussion, it is important to be clear about which levels of safety we are talking about.
Right now I am mostly focused on the question of: is it even possible for a trained professional to use AI safely, if they are prudent and reasonably careful and follow best practices?
I am less focused, for now, on questions like: How dangerous would it be if we open-sourced all models and weights and just let anyone in the world do anything they wanted with the raw engine? Or: what could a terrorist group do with access to this? And I am not right now taking a strong stance on these questions.
And the reason for this focus is:
The most profound arguments for doom claim that literally no one on Earth can use AI safely, with our current understanding of it.
Right now there is a vocal “decelerationist” group saying that we should slow, pause, or halt AI development. I think that case mostly rests on the most extreme and, IMO, least tenable versions of the doom argument.
With that context:
We might agree, at the extreme ends of the spectrum, that:
If a trained professional is very cautious and sets up all of the right goals, incentives, and counter-incentives in a carefully balanced way, the AI probably won’t take over the world.
If a reckless fool puts extreme optimization pressure on a superintelligent, situationally-aware agent with no moral or practical constraints, then very bad things might happen.
I feel like we are still at different points in the middle of that spectrum, though. You seem to think that the balancing of incentives has to be pretty careful, because some pretty serious power-seeking is the default outcome. My intuition is something like: problematic power-seeking is possible but not expected under most normal/reasonable scenarios.
I have a hunch that the crux has something to do with our view of the fundamental nature of these agents.
… I accidentally posted this without finishing it, but honestly I need to do more thinking to be able to articulate this crux.
I think which level of safety is necessary or sufficient to expect good outcomes is an important crux of its own. What is the default style of situation and use case? What can we reasonably hope to prevent from happening at all? Do our ‘trained professionals’ actually know what they have to do, especially without being able to cheaply make mistakes and iterate, even if they do have solutions available? Reality is often so much stupider than we expect.
Saying ‘it is possible to use a superintelligent system safely’ would, if true, be highly insufficient unless you knew how to do that, were willing to make the likely very, very large performance sacrifices required (pay the ‘alignment tax’) in the face of very strong pressures, could ensure that no one else did it differently, and could ensure that this state persists.
Other than decelerationists, I don’t see people proposing paths towards keeping access to such systems sufficiently narrow, or constraining competitive dynamics such that people with such systems have the affordance to pay large alignment taxes. If it is possible to use such systems safely, that safety won’t come cheap.
I do think you are right that we disagree about the nature of such systems.
Right now, I think we flat-out have no idea how to make an AGI do what we’d like it to do, and if we managed to scale a system up to AGI level using current methods, even the most cautious user would fail. I don’t think there is some localized ‘power-seeking’ component that you can solve in isolation to get rid of this, either.
But yeah, as for the crux: it’s hard for me to pinpoint the alternative mindset about how these systems are going to work that would make ‘use it safely’ a tractable thing to do.
Throwing a bunch of stuff out there I’ve encountered or considered, in the hopes some of it is useful.
I think you’re imagining maybe some form of… common sense? Satisficing rather than pure maximization? Risk aversion, model uncertainty, and tail-risk concerns causing the AI to avoid disruptive actions if not pushed in such directions? A hill-climbing approach not naturally ‘finding’ solutions that require a lot of things to go right and that wouldn’t work below a threshold capabilities level (there’s a proof I don’t have a link to at the moment that gradient descent will almost always find the optimal solution rather than get stuck in a local optimum, but yeah, this does seem weird)? That the AI will develop habits and heuristics the way humans do, which will then guide its behavior and keep things in check? That it ‘won’t be a psychopath’ in some sense? That it will ‘figure out we don’t want it to do these things’ and optimize for that instead of its explicit reward function, because that was the earlier best way to maximize its reward function?
I don’t put actually zero chance on some of these things happening, although in each case I can point to what the ‘next man up’ problem would be down the line if things go down that road...
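To make one of those items a bit more concrete, here is a minimal toy sketch (in Python) of how a satisficer and a maximizer can pick different actions from the same menu. Everything in it is made up for illustration: the options, the scores, and the ‘good enough’ threshold are hypothetical, and it is not a claim about how real systems are trained or deployed.

```python
# Toy illustration only: a maximizer squeezes out every last bit of score,
# while a satisficer accepts the first option that clears a "good enough" bar.
# The options, rewards, and threshold below are entirely made up.

# Each candidate action: (name, reward, how disruptive it is as a side effect)
options = [
    ("do the task as asked",      0.90, "low"),
    ("do the task plus cleanup",  0.93, "low"),
    ("restructure the whole org", 0.99, "high"),  # highest reward, most disruptive
]

def maximize(opts):
    """Pick the single highest-reward option, regardless of side effects."""
    return max(opts, key=lambda o: o[1])

def satisfice(opts, threshold=0.9):
    """Pick the first option whose reward clears the threshold."""
    for opt in opts:
        if opt[1] >= threshold:
            return opt
    return None

print("maximizer picks: ", maximize(options))   # ('restructure the whole org', 0.99, 'high')
print("satisficer picks:", satisfice(options))  # ('do the task as asked', 0.9, 'low')
```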
Thanks a lot, Zvi.