North Coast Synthesis Ltd.

Quis custodiet reward models

2025-09-29

Large language models are "aligned" using smaller, specially trained reward models.  These reward models are often kept secret, and even the public ones are poorly studied.  This paper opens the door to exploring reward models by asking them about their values.

Video alignment training text LLaMA Gemma