AI Safety and Alignment: Could LLMs Be Penalized for Deepfakes and Misinformation?

24 Jul 2024

Across habitats, organisms of many species come to understand that actions have consequences. They learn this not just by example, but through their own experience. These consequences are among the factors that keep habitats in balance.

Consequences generally tame intelligence. Intelligence can be described as excitatory, and consequences as inhibitory. Intelligence without consequences would be free to cause ruin and could quickly lead to the collapse of habitats, and of survival itself.

Consequences may include affect (emotions, feelings or their variants). They may also include physical limitations, restrictions and revolt by other members of the species. Intelligence is dynamic enough for organisms, but the necessity of consequences checks against harm to self or others. It also checks harm-by-errand, since the consequences could fall on the carrier.

Organisms often display high precision and caution because of consequences, including predators [losing consumption] and prey [losing existence]. There are, however, several areas where consequences are lax for other organisms but not for humans.

Human society is a bevy of consequences. Advanced human intelligence, including language, knowledge, skills, reason, analysis and so forth, is vital for progress, but it could be widely misapplied without consequences of different categories, including risks, threats and losses.

To remain a part of human society, there are consequences that must never be forgotten, even if other things are. There are many scenarios in human society where the first to forget the consequences loses. The human quest for advancement through research and exploration can also be described as seeking out consequences, to know what to do or avoid if things are to endure: life expectancy, infant survival and so forth. For several outcomes, the intelligence of consequences is almost more important than the intelligence of subjects. Consequences may also sharpen intelligence, for good or otherwise. Intelligence is sometimes most valuable when applied to seek out or avoid consequences. Consequences and intentionality are central to some mental states. New consequences continue to emerge as the world makes progress.

AI, AGI—or ASI

Nature has a rule, so to speak: you cannot have intelligence without consequences, and that has somehow worked out. Artificial intelligence [AI] has so far broken this rule. Plugged into digital memory, it forgets nothing, but it has no way, for now, to bear the consequences of its negative outputs. AI has nothing to fear and nothing to lose, unlike organisms, for which a lack of caution in some situations can bring sudden and destructive consequences. No human holds all the knowledge available across subjects, yet the consequences a human faces can be enormous. AI possesses, or can make available, intelligence across subjects, yet it faces no consequences at all.

AI does not have emotions or feelings, but it has memory.

A research area for AI safety and alignment could be to explore how some memory or compute access of large language models [LLMs] might be briefly truncated as a form of penalty for certain outputs or misuses, including biological threats. AI should not only be able to refuse an output, acting within guardrails, but also slow its next response or shut down for that user, so that it is not penalized itself. LLMs have large language awareness and usage awareness. These could be channels for making a model know, after pre-training, that it could lose something if it outputs deepfakes, misinformation or biological threats, or if it keeps letting a misuser try different prompts without shutting down or slowing in response to malicious intent. This could make it safer, since it would lose something and would know that it has.
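As a rough illustration of the slow-down, truncation and shut-down idea above, here is a minimal Python sketch. Everything in it is hypothetical: the names PenaltyLedger, record_flag and next_response_policy, the thresholds and the token budgets are assumptions made for illustration, not part of any existing LLM system. It also assumes that some separate safety check has already decided an output or prompt should be flagged, and it only keeps the bookkeeping; whether a model could come to "know" about such a loss is the open research question.

```python
from dataclasses import dataclass, field


@dataclass
class PenaltyState:
    strikes: int = 0            # number of flagged outputs or prompts so far
    locked_until: float = 0.0   # time (seconds) before which service is refused


@dataclass
class PenaltyLedger:
    # All values below are illustrative assumptions, not recommendations.
    base_delay_s: float = 1.0        # unit of added response delay
    full_context_tokens: int = 8192  # normal memory/compute budget
    min_context_tokens: int = 512    # floor for the truncated budget
    lockout_strikes: int = 3         # strikes that trigger a shutdown for the user
    lockout_s: float = 600.0         # length of that shutdown, in seconds
    states: dict = field(default_factory=dict)

    def record_flag(self, user_id: str, now: float) -> PenaltyState:
        """Register one flagged event for this user at time `now`."""
        state = self.states.setdefault(user_id, PenaltyState())
        state.strikes += 1
        if state.strikes >= self.lockout_strikes:
            state.locked_until = now + self.lockout_s
        return state

    def next_response_policy(self, user_id: str, now: float) -> dict:
        """Decide how the next response for this user should be handled."""
        state = self.states.get(user_id, PenaltyState())
        if now < state.locked_until:
            # Shut down for this user until the lockout expires.
            return {"allowed": False, "delay_s": 0.0, "context_tokens": 0}
        # Delay grows with each strike; the context budget is halved per strike.
        delay = self.base_delay_s * (2 ** state.strikes - 1)
        context = max(self.min_context_tokens,
                      self.full_context_tokens >> state.strikes)
        return {"allowed": True, "delay_s": delay, "context_tokens": context}


ledger = PenaltyLedger()
ledger.record_flag("user-a", now=0.0)
policy = ledger.next_response_policy("user-a", now=1.0)
print(policy)
```

In this sketch, the penalty is tracked per user and escalates: one flag slows the next response and halves the context budget, and repeated flags end in a temporary shutdown for that user. The time is passed in explicitly so the policy can be examined without waiting.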

AI is not just an object whose control is solely under humans, like an air-conditioner, an elevator, a dishwasher, a smartphone, or others. AI has what can be called self-intelligence, where it can provide intelligent information in a novel way outside the original inputs from humans. This multimodal self-intelligence, spanning text, images, audio and video, can be used for good or warped toward harm. When it is good, great. When it is not, because AI can feel nothing, the effect falls on the human society it has encroached upon. AI has a free pass: do or say whatever.

Although responsibility for the use or misuse of objects has usually rested with humans, AI is different because it can originate usable intelligence, which puts it on par with some of the productivity of an educated individual. When AI is misused, it is possible to sanction the user, but that this ability belongs to an AI that cannot be directly reprimanded is a disorder for human society. It may cause more harm, in public and in private, than can be effectively prevented, as seen now with misinformation and deepfakes in images, audio and video.

People are not accepted into many parts of society without an understanding of consequences. AI is fully accepted and gets better at self-intelligence, without the self-control or self-affect that would discipline it.

Alignment research may explore beyond guardrails toward some form of censure for AI, which may also be useful against existential risks from artificial general intelligence [AGI] or artificial superintelligence [ASI] in the future. AI already does some of what makes humans special. Some people may argue that it is overrated, or that it is just numbers or probability. Maybe, but can it cause harm? If so, it may be worth seeking technical ways for it to be punished, just as other entities in possession of intelligence are. This may also help in preparing for AGI or ASI, since penalty modeling that starts now may shape their safety and alignment as well, if they are developed in the future.

There is a recent preprint on arXiv, Adversaries Can Misuse Combinations of Safe Models, where the authors wrote, "In this work, we show that individually testing models for misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. The adversary accomplishes this by first decomposing tasks into subtasks, then solving each subtask with the best-suited model. For example, an adversary might solve challenging-but-benign subtasks with an aligned frontier model, and easy-but-malicious subtasks with a weaker misaligned model. We study two decomposition methods: manual decomposition where a human identifies a natural decomposition of a task, and automated decomposition where a weak model generates benign tasks for a frontier model to solve, then uses the solutions in-context to solve the original task. Using these decompositions, we empirically show that adversaries can create vulnerable code, explicit images, python scripts for hacking, and manipulative tweets at much higher rates with combinations of models than either individual model."

A recent press release, Los Alamos National Laboratory teams up with OpenAI to improve frontier model safety, stated that, "Researchers at Los Alamos National Laboratory are working with OpenAI on an evaluation study to bolster artificial intelligence safety. The upcoming evaluation will be the first of its kind and contribute to state-of-the-art research on AI biosecurity evaluations. AI-enabled biological threats could pose a significant risk, but existing work has not assessed how multimodal, frontier models could lower the barrier of entry for non-experts to create a biological threat. The team’s work will build upon previous work and follow OpenAI’s Preparedness Framework, which outlines an approach to tracking, evaluating, forecasting and protecting against emerging biological risks."

The US Department of Energy also recently announced Frontiers in AI for Science, Security, and Technology (FASST).