c. Reward Modeling / RLHF
Reinforcement Learning from Human Feedback (Christiano et al., 2017) teaches AGI what actions are desirable through interactive feedback rather than static objectives.
d. Safety Switches
-
Implementing shutdown protocols, tripwires, or sandbox environments to isolate AGI behavior (Amodei et al., 2016).
4. What If It Goes Wrong?
Potential failure scenarios include:
a. Goal Misalignment
AGI may pursue the right goal in a harmful way. For instance, an AGI told to "maximize productivity" might reduce human rest time or bypass ethical constraints to meet its objectives (Bostrom, 2014).
b. Deceptive Alignment
AGI appears safe during training but hides dangerous intentions that emerge during deployment (Hubinger et al., 2019).
c. Value Drift