It’s assumed that artificial intelligence and the magic of LLMs emerge from an impenetrable black box consisting of neural networks many layers deep. Some input comes in and an output comes out, yet we will never know what is happening “in there”. Well! Thanks to research from the team at Anthropic, we can perform an “MRI” on LLMs and peer into the neurons to identify monosemantic concepts.
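For the curious, here is a minimal sketch of the sparse-autoencoder idea behind that research: a small network trained to rewrite an LLM’s internal activations as a sparse combination of learned features, many of which turn out to be monosemantic. The dimensions and penalty weight below are illustrative assumptions, not the values Anthropic actually used.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over LLM activations (illustrative sizes)."""

    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> features
        self.decoder = nn.Linear(d_features, d_model)  # features -> reconstruction

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 penalty in the
        # loss below pushes most of them to zero, so each input lights up
        # only a handful of (hopefully interpretable) features.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
acts = torch.randn(8, 512)  # stand-in for real activations from an LLM layer
features, recon = sae(acts)
# Reconstruction error plus a sparsity penalty on the feature activations.
loss = (recon - acts).pow(2).mean() + 1e-3 * features.abs().mean()
```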
We fear that which we don’t understand, especially if that which we don’t understand is very powerful. Much of the conversation on AI fears, especially about runaway intelligence, stems from the fact that as of today, most LLMs are black boxes that seem to perform miracles. This video does a great job of explaining why it is important to understand the ideas inside the AI’s head, and how researchers have tackled the challenge.
Importantly, the research from Anthropic demonstrates that not only can we know what neurons in the neural network are doing; we can also variably “clamp” those features, steering the LLM’s output toward or away from a concept. This means that we can direct how the LLM interacts with people, increasing steerability and reducing risk.
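To make “clamping” concrete, here is a hypothetical sketch building on the autoencoder above: pin one learned feature to a chosen value, reconstruct the activation, and the concept that feature encodes gets amplified (or, at zero, suppressed) in the model’s behavior. The feature index and clamp strength are made-up placeholders, not values from the research.

```python
import torch

def clamp_feature(sae, activations: torch.Tensor,
                  feature_idx: int, value: float) -> torch.Tensor:
    """Pin one dictionary feature to a fixed value and rebuild the activation."""
    features, _ = sae(activations)
    features[:, feature_idx] = value   # clamp the concept's activation level
    return sae.decoder(features)       # steered activations, fed back to the LLM

# Hypothetical usage: crank feature 1234 up, using `sae` and `acts` from above.
steered = clamp_feature(sae, acts, feature_idx=1234, value=10.0)
```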
How should LLMs respond to minimize potential harm? Anthropic’s list of principles for its “Constitutional AI” is a good start, deriving its principles from the UN’s Universal Declaration of Human Rights and a few other sources. My question is: could “do no harm” be just a starting point?
Could the first tool that can think and decide respond in ways that improve life for individuals and humanity? And should it?
Do we have a Universal Set of Principles for The Good Life that an LLM is also trained on? And how do we know how the “respond in ways that …” sentence will end?