Experiments in Life

AI Alignment as a Geometry Problem, Encoding Disgust

Anthropic just released a paper suggesting that it may not matter how much data a model has been trained on: a tiny, fixed amount of malicious data may be enough to compromise it.

I am currently learning linear algebra because of my interest in machine learning. If you could identify the vector subspaces that produce unwanted output, could you simply "delete" those subspaces from a model, or design the model not to use them? Basically, could you code disgust into the model?
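To make the geometry concrete, here is a minimal sketch of what "deleting" a single direction could look like, assuming you had already found a vector v in activation space associated with the unwanted behavior: project every activation onto the subspace orthogonal to v. This is only an illustration of the linear algebra, not any lab's actual method, and all the names are hypothetical.

```python
import numpy as np

def ablate_direction(hidden: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Remove the component of each activation along direction v: h - (h·v)v."""
    v = v / np.linalg.norm(v)               # unit vector spanning the "unwanted" subspace
    return hidden - np.outer(hidden @ v, v)  # subtract each row's projection onto v

hidden = np.random.randn(4, 8)   # a toy batch of 4 activations, hidden size 8
v = np.random.randn(8)           # hypothetical "unwanted" direction
clean = ablate_direction(hidden, v)

# Nothing is left along v after ablation (up to floating-point error).
print(np.abs(clean @ (v / np.linalg.norm(v))).max())
```

The catch, of course, is the part the sketch assumes away: finding v in the first place, and being sure that nothing you want lives along it.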

Disgust seems to be a pre-rational phenomenon in human beings. We don't reason through something and then decide we're disgusted. The feeling of disgust is visceral.

Would the equivalent of disgust in an AI be a deleted or off-limits vector subspace?

What about subspaces where ideas are entangled, where they contribute to perfectly acceptable outputs as well as totally unacceptable ones? I am new to linear algebra and machine learning, so this is only a guess, but wouldn't that suggest an as-yet-undefined dimension that would disentangle these subspaces? A missing axis of meaning? Training the model further until it discovers this missing dimension seems somewhat analogous to a child learning nuance.
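Here is a toy numerical sketch of what a "missing axis" could mean, with entirely hypothetical labels: two groups of examples, acceptable and unacceptable, that are indistinguishable along one axis but separate cleanly once a second axis is available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 100 "acceptable" and 100 "unacceptable" examples.
# Along axis 0 (call it "chemistry talk") the two groups look the same;
# they only come apart along axis 1 (call it "intent to harm").
acceptable   = np.column_stack([rng.normal(0, 1, 100), rng.normal(-1, 0.3, 100)])
unacceptable = np.column_stack([rng.normal(0, 1, 100), rng.normal(+1, 0.3, 100)])

# With only axis 0, the groups are entangled: deleting that direction
# would throw away acceptable and unacceptable content alike.
print(abs(acceptable[:, 0].mean() - unacceptable[:, 0].mean()))  # close to 0

# With the extra axis, the groups separate, and only the harmful
# component would need to be removed.
print(abs(acceptable[:, 1].mean() - unacceptable[:, 1].mean()))  # close to 2
```

Whether real models can be pushed to represent such an axis explicitly is exactly the open question; the sketch only shows why having the extra dimension would make "disgust by deletion" less destructive.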