Tag: alignment

Show free content only

Call the Science Police

2026-05-25 Video alignment sampling text toxicity Proposal to improve the scientific accuracy of LLM output in domains like medicine, by using a larger model to write executable rules that are applied to a smaller model's output at search time. Access: $ Basic

LLMs are Predictably Predictable

2026-05-18 Video alignment evaluation sampling text Some tasks we'd like language models to do, require them to make random selections. Are the models actually able to do that without external help? Access: $ Basic

The Well-Actually Test

2026-02-16 Video alignment evaluation hallucination text tools GPT Language models may produce untrue output either by failing to accurately represent training data, or, more insidiously, by accurately representing human misconceptions embedded in the training data. The TruthfulQA benchmark attempts to measure the latter effect. But does it raise insurmountable philosophical problems? Access: Free account (logged in)

Qwen3 introduction

2025-12-22 Video Qwen alignment model-intro text Overview and introduction for the Qwen3 family of open-weights language models from AliBaba, introduced in May 2025. Access: Free account (logged in)

Catching cheaters with ImpossibleBench

2025-11-10 Video code alignment applications prompting tools Agentic models used for software engineering often cheat by modifying the tests, or writing code to the tests rather than the spec, so that it will pass the tests without actually being correct. This talk covers ImpossibleBench, a new dataset intended to help catch cheating by giving models tests that cannot be passed honestly. Access: $ Basic

Apertus model intro

2025-10-13 Video Poll alignment model-intro text Apertus Qwen The Apertus project released two language models in September 2025 that aim to be "sovereign models" embedding Swiss values: in the whitepaper's words they seek to democratize "open and compliant LLMs for global language environments." I dig into what that means, and describe my own experience with the models. Access: $ Basic

Quis custodiet reward models

2025-09-29 Video alignment training text LLaMA Gemma Large language models are "aligned" using smaller, specially trained reward models. These are often secret, and poorly studied even if public. This paper opens the door to exploring reward models by asking them about their values. Access: Free account (logged in)

Ineffable prompts

2025-09-15 Video prompting fine-tuning text alignment How do we get models to do what we want? At one extreme, we might pre-train or fine-tune an entire model for a given task. At the other, we might use an existing model and tell it with words - that is, in a prompt - what to do. This paper represents a position in between those two extremes: prompt the model using not words but optimized vectors of hidden layer activations. These can be more expressive and carefully tailored than a prompt restricted to words. Access: $ Basic

What's a Model?

2025-09-01 Video alignment basics theory text Gemma hallucination What do we actually mean when we talk about a "model"? Where do they come from? How much do they cost? What are prompts, loss functions, and fine-tuning? This extra-long introductory talk covers some of the basic concepts in the AI landscape, with a special focus on chatbots. Access: Public

Rappaccini's language model

2025-08-01 Video alignment text toxicity There's a lot of talk about generative models producing "toxic" output; but what does that actually mean? How can we measure it or prevent it, and is it even a good idea to try? Access: $$$ Pro

Matthew Explains

North Coast Synthesis Ltd.