Which level of such safety is necessary or sufficient to expect good outcomes is, I think, an important crux of its own. What is the default style of situation and use case? What can we reasonably hope to prevent happening at all? Do our ‘trained professionals’ actually know what they have to do, especially without being able to cheaply make mistakes and iterate, even if they do have solutions available? Reality is often so much stupider than we expect.
Saying ‘it is possible to use a superintelligent system safely’ would, even if true, be highly insufficient unless you knew how to do that, were willing to make the likely very large performance sacrifices necessary (pay the ‘alignment tax’) in the face of very strong pressures, could ensure no one else did it differently, and could ensure that this state persists.
Other than decelerationists, I don’t see people proposing paths towards keeping access to such systems sufficiently narrow, or constraining competitive dynamics such that people with such systems have the affordance to pay large alignment taxes. If it is possible to use such systems safely, that safety won’t come cheap.
I do think you are right that we disagree about the nature of such systems.
Right now, I think we flat out have no idea how to make an AGI do what we’d like it to do, and if we managed to scale a system up to AGI level using current methods, even the most cautious user would fail. I don’t think there is a localized ‘power-seeking’ problem that you can solve to get rid of this, either.
But yeah, as for the crux: it’s hard for me to pinpoint the alternative mindset someone could hold about how these systems are going to work that would make ‘use it safely’ a tractable thing to do.
Throwing a bunch of stuff out there I’ve encountered or considered, in the hopes some of it is useful.
I think you’re imagining maybe some form of… common sense? Satisficing rather than pure maximization? Risk aversion, model uncertainty and tail risk concerns causing the AI to avoid disruptive actions if not pushed in such directions? A hill-climbing approach not naturally ‘finding’ solutions that require a lot of things to go right and that wouldn’t work below a threshold capabilities level (there’s a proof I don’t have a link to atm that gradient descent will almost always find the optimal solution rather than get stuck in local optima, but yeah, this does seem weird)? That the AI will develop habits and heuristics the way humans do, which will then guide its behavior and keep things in check? That it ‘won’t be a psychopath’ in some sense? That it will ‘figure out we don’t want it to do these things’ and optimize for that instead of its explicit reward function, because that was earlier the best way to maximize its reward function?
I don’t put actually zero chance on some of these things happening, although in each case I can then point to what the ‘next man up’ problem would be further down the line if things go down that road...
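To make the satisficing-versus-maximizing distinction mentioned above concrete, here is a minimal toy sketch, entirely hypothetical and not drawn from the discussion: a pure maximizer takes whatever scores highest, however extreme, while a satisficer stops at the first ‘good enough’ option. The candidate actions, scores, and threshold are all made up for illustration.

```python
# Toy illustration (hypothetical) of satisficing vs. pure maximization.
# Each candidate action has a made-up "reward" and rough "disruption" score;
# nothing here models a real AI system.

CANDIDATES = [
    {"name": "do_nothing",        "reward": 0.0, "disruption": 0.0},
    {"name": "ordinary_plan",     "reward": 0.7, "disruption": 0.1},
    {"name": "clever_workaround", "reward": 0.9, "disruption": 0.4},
    {"name": "seize_resources",   "reward": 1.0, "disruption": 0.99},
]

def maximize(candidates):
    """Pure maximizer: pick the highest-reward action, however extreme."""
    return max(candidates, key=lambda c: c["reward"])

def satisfice(candidates, threshold=0.6):
    """Satisficer: take the first action that is 'good enough',
    rather than searching for the global optimum."""
    for c in candidates:
        if c["reward"] >= threshold:
            return c
    return candidates[0]  # fall back to doing nothing

if __name__ == "__main__":
    print("maximizer picks:", maximize(CANDIDATES)["name"])    # seize_resources
    print("satisficer picks:", satisfice(CANDIDATES)["name"])  # ordinary_plan
```

The ‘next man up’ worry applies here too: the satisficer only avoids the extreme action because the threshold and the ordering of candidates happen to favor mild options; change either and the gap between satisficing and maximizing narrows.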