Working with LLMs

When you're an LLM engineer building a product with LLMs in an org full of engineers who don't work with them, you get a lot of the following on Slack:

User: “I tried X and it didn’t work.”

You: *investigates*

You: “Nothing looks obviously wrong, it might’ve just been bad luck with the completion. Does it happen all the time? What happens if you re-run the query?”

User: “It worked the second time I tried it.”

You: “Great – sorry about that, but that’s just one of the quirks of LLMs – sometimes they do stupid things, seemingly at random.”

User: “But we need this to be 99% reliable.”

You: “We can try prompting things a bit better, but there’s no way for us to guarantee a certain level of reliability. LLMs are stochastic in nature, unfortunately.”

User: “Yes, but sometimes it doesn’t work.”

You: “Yeah, it sucks. There’s only so much we can do to shape the output distributions, it’s just the nature of an LLM. Sometimes you just get unlucky with a completion.”

User: “But can’t we just $hardcode_solution?”

You: “We could try that, but hardcoding can inadvertently introduce other issues: false positives, new edge cases, or spaghetti code as we try to patch every undesirable piece of LLM behaviour we observe. The best solution is to understand these edge cases, diagnose a root cause, and then improve the general approach so it’s more reliably correct.”

User: “I don’t get it – why can’t we just detect the bad behaviour, and then stop it from doing the bad behaviour?”

You: “Because patching specific edge cases that happen every now and then isn’t going to scale (see above). We don’t want to play whack-a-mole like that until we’re drowning in special-case code. Experience has shown time and time again that this isn’t the way.”

User: “But why can’t you just do it?”

And around and around we go.


The stochasticity of LLMs is a challenging property for many engineers and product managers (and becomes increasingly so as their experiential distance from AI / LLMs grows).

Before LLMs, most systems were largely deterministic: A goes in, B comes out. If there was variation in behaviour, it could usually be traced back to a root cause (a race condition, for example), although sometimes with some difficulty.

However, once you identify the root cause, you can usually reproduce the issue reliably. LLMs just do not work this way. They’re truly stochastic: they accept free-form input, produce free-form output, and sample from a distribution to predict each next token. As an LLM engineer, you try your best to shape those output probabilities with system prompts, few-shot examples, structured outputs, good documentation, type annotations, field descriptions, reinforcement, contrastive prompting, etc. The list goes on (and grows daily).
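
To make the “sampling from a distribution” point concrete, here’s a minimal sketch in Python. The tokens and logits are invented and no real model or library is involved; it’s only meant to show why the exact same input can produce different outputs on different runs.

```python
# A toy illustration, not any real model or API: sample the "next token"
# from a made-up probability distribution.
import math
import random

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    # Turn raw scores into a probability distribution over tokens.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(tokens: list[str], logits: list[float]) -> str:
    # random.choices draws stochastically, so re-running with the exact
    # same logits can return a different token.
    probs = softmax(logits)
    return random.choices(tokens, weights=probs, k=1)[0]

tokens = ["yes", "no", "maybe"]
logits = [4.0, 1.5, 0.5]  # hypothetical scores for the next token
print(sample_next_token(tokens, logits))  # usually "yes", but not always
```

Everything in that list (system prompts, few-shot examples, structured outputs, and so on) is, in effect, a way of pushing more probability mass onto the tokens you want; none of it removes the draw itself.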

Sometimes the LLM samples well, and we only know this because we receive the right answer at the end (post-hoc verification); sometimes it samples “poorly”, and we only know that because it gives us the wrong answer (again, only verifiable post-hoc).
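
A rough sketch of what that looks like in practice, with placeholder names (call_llm stands in for whatever client you use, and the JSON check is just one arbitrary example of a validity test): verification can only happen after the completion exists, which is also why “re-run it and it works” is such a common outcome.

```python
# Hypothetical sketch of post-hoc verification; `call_llm` and `looks_valid`
# are placeholders, not a real client or a recommended validation scheme.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a real model call")

def looks_valid(completion: str) -> bool:
    # We can only judge the finished output, never the sampling that produced it.
    try:
        return "answer" in json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return False

def answer(prompt: str, attempts: int = 3) -> str:
    for _ in range(attempts):
        completion = call_llm(prompt)
        if looks_valid(completion):  # the check happens after the fact
            return completion
    raise RuntimeError("no valid completion after retries")
```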

But at the end of the day, you’re sampling from a distribution. You’re going to sample a dud token once in a while that leads you astray. And an awkwardly-phrased user prompt (which you cannot validate!) might be all it takes to flatten that distribution just enough to tip the scales towards undesired behaviour a little too frequently.
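
To put toy numbers on that intuition: the sketch below uses temperature as a stand-in for anything that flattens the distribution (an awkward prompt, for instance), and the logits are made up. Even a modest amount of flattening turns a roughly 1% chance of an undesired token into something several times larger.

```python
# Toy numbers only: how a flatter distribution shifts probability toward
# undesired tokens. Temperature here is just a stand-in for "flattening".
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [6.0, 1.0, 0.5]  # [the desired token, two undesired ones]

for temperature in (1.0, 1.5, 2.0):  # higher temperature = flatter distribution
    probs = softmax(logits, temperature)
    print(f"temperature={temperature}: P(undesired) = {1 - probs[0]:.1%}")
    # prints roughly 1.1%, 5.8% and 12.7%: a small flattening, a much less reliable system
```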

Folks who frequently work with LLMs learn this intuitively, and aren’t thrown off by the variance in behaviour. They understand how to read the LLM’s vibes and adjust their prompt accordingly to elicit the desired behaviour (and can quickly and intuitively identify what’s triggering the failure modes behind the unwanted behaviour they’re observing). But this is a skill – in the same way it’s a skill for a staff engineer to glance at some logs in a system they understand and say, “ah hah – I suspect it’s this”, and be correct.

The problem is that engineers are used to knowing a lot of things. So when they encounter something that behaves unlike any of the software they’ve written before, but appears familiar (“I’m just calling a model behind an API”), they reach out of habit for solutions they think are correct, like trying to use code to govern stochastic edge cases that are difficult to identify deterministically. It’s understandable, but it’s a competency trap – they’ve mistaken the system in front of them for something they’re used to, and are misapplying their usual problem-solving approach.

For engineers new to LLMs, the trick is to see an LLM for what it is – an entirely new form of system, with its own set of patterns and paradigms. Treat it as something new and unfamiliar. Approach it with curiosity and humility. Hold your pre-existing ideas about how you think it works loosely in your mind. Learn to read the vibes. You’re learning something new.

And try to listen, at least a little, to the engineers who work with LLM systems every day.

And LLM engineers? Sometimes the people new to LLMs have more creative ideas than you do, too.