LLMs: Neuroscience Research for AI Alignment and Safety

28 May 2024

What new mathematical function could be built into the parameters of large language models to map intent, weighing it against potential for harms like deepfakes?

What new layer could be added in training a base model so that, rather than producing an immediate output that hallucinates, confabulates, or discriminates, the model passes through a correction layer that tends toward accuracy and fairness?

AI alignment, safety, interpretability, and regulation are more problems of brain science than of engineering. There are several weaknesses of large language models for which the human mind has ready fixes, setting the standard for the next iterations of LLMs to shape safety and wider deployment.

The human brain does not predict in the way LLMs do. Conceptually, electrical signals in a set split, with some traveling ahead of the others to interact with chemical signals as on previous occasions; if the input matches, the incoming signals follow the same path, and if not, they are redirected appropriately. This explains predictive coding, predictive processing, and prediction error.

Simply put, in experience, initial perceptions often result from these splits, which make processing in the mind faster. They are not next-token predictions that sometimes land correctly and sometimes hallucinate, as in LLMs.
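The split-then-correct dynamic described above resembles the standard predictive-coding loop from computational neuroscience. The sketch below is a minimal toy version of that loop, not the author's proposed mechanism: an advance estimate is compared with the incoming signal, and the mismatch (prediction error) corrects the estimate rather than being emitted as output.

```python
import numpy as np

def predictive_coding_step(prediction, observation, learning_rate=0.1):
    """One update of a minimal predictive-coding loop: the advance signal
    (prediction) is compared with the incoming signal (observation); the
    mismatch (prediction error) corrects the internal estimate instead of
    being passed along as a final output."""
    error = observation - prediction           # prediction error
    corrected = prediction + learning_rate * error
    return corrected, error

# Toy run: the internal estimate converges toward a stable input of 1.0,
# and the prediction error shrinks toward zero.
estimate = 0.0
for _ in range(100):
    estimate, err = predictive_coding_step(estimate, observation=1.0)
```

The key contrast with next-token prediction is that the error here is consumed internally as a correction signal before anything is "output."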

Backpropagation works for training base models, but a different kind of cost function would be needed for the fine-tuned model to guard against hallucination, discrimination, and bias, much as the human mind corrects itself, almost accurately, after a missed initial perception.
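One hypothetical shape such a cost function could take is an ordinary cross-entropy term plus a penalty on overconfident token predictions, so that fine-tuning discourages confident assertion where the evidence is thin. This is an illustrative stand-in for the "correction" idea, not an established method; the `confidence_floor` threshold is an assumption made up for the example.

```python
import numpy as np

def corrected_loss(logits, target_ids, confidence_floor=0.2, penalty=1.0):
    """Hypothetical fine-tuning objective: cross-entropy plus a penalty
    when the model's top probability exceeds (1 - confidence_floor),
    discouraging overconfident outputs. A sketch, not a real LLM loss."""
    # Softmax over the vocabulary dimension
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # Standard cross-entropy on the target tokens
    ce = -np.log(probs[np.arange(len(target_ids)), target_ids]).mean()
    # Penalize confidence above the assumed floor
    overconfidence = np.maximum(probs.max(axis=-1) - (1 - confidence_floor), 0).mean()
    return ce + penalty * overconfidence
```

In a real system the floor would have to come from evidence or calibration data rather than a fixed constant, but the structure shows how a corrective term can sit alongside the usual objective.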

The human mind also mechanizes intent: it is possible to choose to do some things and not others. This differs from LLMs, which do whatever they are prompted to do, within or around their guardrails. They can also be attacked with prompt injections and jailbreaks, exposing a vulnerability.

Explorations in research for explainable AI and alignment may include power series expansions and other functions for accuracy and fluid intentionality. These could also be used to develop monitoring AIs, based on their activations, that watch other AIs and their outputs in common areas of the internet, toward general safety rather than the safety of individual models alone.
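One way a power series could figure in interpretability, sketched under assumptions of my own rather than the author's: fit a truncated power series to a network nonlinearity, yielding a handful of inspectable coefficients, and let a monitor flag activations that deviate from the surrogate. The function names here are invented for illustration.

```python
import numpy as np

def power_series_surrogate(f, degree=5, span=1.0, samples=200):
    """Fit a truncated power series (polynomial) to a nonlinearity f on
    [-span, span]. The coefficients form a small, inspectable description
    of the activation -- one hypothetical route to the 'expansion' idea."""
    x = np.linspace(-span, span, samples)
    coeffs = np.polyfit(x, f(x), degree)
    return np.poly1d(coeffs)

# A monitoring AI could compare observed activations against the
# surrogate and flag large deviations as anomalous behavior.
surrogate = power_series_surrogate(np.tanh)
grid = np.linspace(-1, 1, 50)
deviation = np.max(np.abs(surrogate(grid) - np.tanh(grid)))
```

On a bounded interval the fit is close, so a large deviation at runtime would be a meaningful anomaly signal rather than approximation noise.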

Anthropic recently published a paper on interpretability, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. How features act, as an approach to explainable AI, is vital. However, exploring new functions and layers, for parallels to the human mind, may decide more.

There is a recent paper in Neural Computation, Synaptic Information Storage Capacity Measured With Information Theory, which concluded: "Information storage and coding in neural circuits have multiple substrates over a wide range of spatial and temporal scales. How information is coded and its potential efficiency depends on how the information is stored. The efficiency of information storage was analyzed by comparing the distribution of synaptic strengths, as measured in spine head volumes, with the uniform strength distribution, the maximum entropy distribution. The outcomes reveal a near maximal efficiency in storing information across the distinguishable sizes of spine head volumes."
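The comparison the paper describes, an observed distribution measured against the uniform maximum-entropy bound, can be illustrated with a simple ratio: Shannon entropy of the observed distribution over N distinguishable categories, divided by log2(N). This is a rough sketch of that style of analysis, not the paper's actual method or data.

```python
import numpy as np

def storage_efficiency(counts):
    """Shannon entropy of an observed distribution over N distinguishable
    categories, divided by log2(N), the maximum-entropy (uniform) bound.
    Returns 1.0 for a perfectly uniform distribution, less otherwise."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                              # ignore empty categories in the sum
    entropy = -(p * np.log2(p)).sum()
    return entropy / np.log2(len(counts))     # normalize by the uniform bound

# A near-uniform spread across distinguishable size categories scores
# near 1.0 (near-maximal efficiency); a highly skewed spread scores low.
```

"Near maximal efficiency" in the quoted conclusion corresponds to this ratio sitting close to 1 across the distinguishable spine-head sizes.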

Chemical impulses, rather than synapses, are, conceptually, the basis of information organization and storage in the brain. Synapses can hold formations within which rations of chemical impulses are provisioned for specific functions, but synapses themselves do not hold information.

The human mind has functions and features. Functions include memory, feelings, emotions, and the regulation of internal senses. Features grade or qualify functions; they include attention, awareness [or less than attention], self or subjectivity, and free will or intentionality.

Functions and features are mechanized in the same sets of electrical and chemical signals. This makes it possible for functions to be properly graded. For AI, some expansion of parameters may define alignment by the neuroscience of mind. How the human mind works, conceptually, can also be presented as a form of care for those already affected by the risks of AI, such as deepfake audio, job losses, and others.