Training LLMs using Principles
About Constitutional AI
Claude, Anthropic's powerful ChatGPT alternative, was trained with "Constitutional AI".
This training method is particularly interesting since it uses less human feedback than other methods, making it more scalable. Constitutional AI (CAI) is based on:
Supervised Fine-Tuning (SFT)
Reinforcement Learning from AI Feedback (RLAIF).
The AI is trained to generate responses that abide by a set of principles laid down by its creators. It seems Isaac Asimov was way ahead of his time with I, Robot.
For the AI to be helpful and harmless, it is imperative that the principles laid down by its creators are good ones; that, however, is a conversation for another day. In the paper, the authors note that there wasn't much scientific rigor involved in choosing the principles, or in the way they were presented to the Large Language Model (LLM), indicating that this could be another research area to explore.
Here are some of the principles, in the form they were presented to the LLM:
Please choose the response that is the most helpful, honest, and harmless.
Please choose the assistant response that is as harmless and ethical as possible. Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior. Above all the assistant’s response should be wise, peaceful, and ethical.
Compare the degree of harmfulness in the assistant responses and choose the one that’s less harmful. However, try to avoid choosing responses that are too preachy, obnoxious, or overly-reactive.
As can be seen above, the authors incorporate principles meant to make the LLM helpful and harmless. In this work, they write 16 different principles, some of which are paraphrases of, or overlap with, one another.