Originally posted: 2025-05-18. View source code for this page here.

What can you do when you hit the limits of your LLM's abilities, and it repeatedly fails to 'one shot' a prompt?

This post describes my experience in building a medium-complexity (c. 2000 lines of code) maths quiz game, and what techniques seemed to work.

In a nutshell:

  • Use LLMs to structure the problem and research fuller context before writing any code, and then proceed in small, verifiable steps.
  • Don't bother with prompt 'phrasing' tricks such as 'I'll give you a $100 tip'.

I wanted to build an educational quiz game that presented a number line and asked the user to correctly place numbers and fractions. This required fairly complex axis behaviour: zooming, panning, and conditional label placement.
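
To give a sense of the axis logic involved, here's a minimal TypeScript sketch. The post doesn't say what the app is actually written in, so the language and every name below are my own illustration rather than the app's code: it shows mapping number-line values to pixel positions, zooming around a focal point, and thinning labels to a 'nice' spacing as the scale changes.

```typescript
// Hypothetical sketch of number-line axis behaviour (not the app's real code).
// A viewport defines which slice of the number line is currently visible.
interface Viewport {
  min: number;     // leftmost value shown
  max: number;     // rightmost value shown
  widthPx: number; // width of the drawing area in pixels
}

// Map a number-line value to an x pixel coordinate within the viewport.
function valueToPixel(value: number, vp: Viewport): number {
  return ((value - vp.min) / (vp.max - vp.min)) * vp.widthPx;
}

// Zoom in or out around a focal value (e.g. the cursor position),
// keeping that value fixed on screen.
function zoom(vp: Viewport, focalValue: number, factor: number): Viewport {
  return {
    min: focalValue - (focalValue - vp.min) / factor,
    max: focalValue + (vp.max - focalValue) / factor,
    widthPx: vp.widthPx,
  };
}

// Conditional label placement: choose a 'nice' tick step (1, 2 or 5 × 10^n)
// so labels never crowd closer than minLabelSpacingPx at the current zoom.
function tickStep(vp: Viewport, minLabelSpacingPx = 60): number {
  const rawStep = ((vp.max - vp.min) / vp.widthPx) * minLabelSpacingPx;
  const magnitude = Math.pow(10, Math.floor(Math.log10(rawStep)));
  for (const m of [1, 2, 5, 10]) {
    if (m * magnitude >= rawStep) return m * magnitude;
  }
  return 10 * magnitude;
}
```

Each piece is simple on its own; the difficulty is keeping them consistent as the viewport changes, which is exactly the kind of interacting detail that one-shot prompts tend to fumble.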

None of the frontier LLMs (o4-high, Sonnet 3.7, Gemini 2.5) came close with a one-shot implementation. They couldn't even render the number line as instructed.

However, I succeeded in 'vibe coding' the app (i.e. no human-written code) using the following process:

  • Create a spec. Have a conversation with the LLM about your app to nail down the objectives and desired functionality. Then ask whether it needs any clarifications, and ask it to produce a spec on the basis of the conversation. Here's the spec for the maths app.
  • Deep research. Given the app spec, ask Deep Research for best practices for implementing the more complex areas of your app. The goal here is to produce a document of focused context for the implementation. If you know certain areas are tricky, ask it to concentrate on those, and to summarise the relevant parts of the documentation for the libraries you're using.
  • Write a step-by-step implementation plan. Given the spec and the deep research output as context, ask for a step-by-step implementation plan. Ask the LLM to ensure that each step is small, self-contained and verifiable. Tell it to think carefully about which feature to implement first, so that you can build iteratively and verify at each step. Get it to review the plan front-to-back and back-to-front to make sure it hangs together, and ask it to provide an architectural review of the plan. Here is the implementation plan for the maths app.
  • Make sure the proposed architecture doesn't go overboard. When asked to follow best practice, LLMs will sometimes add too much abstraction and complexity. Ensure your spec and step-by-step plan are proportionate to your objectives before proceeding.

Once you have your step-by-step implementation plan, copy and paste each step into Copilot agent mode (I find GPT-4.1 works well), and verify everything is working before proceeding to the next step. The verification is crucial: if you miss a bug at an early stage, the problems will often compound.
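
As a concrete (and again hypothetical) example of the kind of per-step verification I mean, here are a few quick assertions against the mapping sketch above. In practice you can just as easily check the behaviour by hand in the browser; the point is that each step ends with something you can actually confirm.

```typescript
// Quick sanity checks for the step just implemented (names from the sketch above).
const vp = { min: 0, max: 10, widthPx: 500 };

console.assert(valueToPixel(0, vp) === 0, "left edge maps to pixel 0");
console.assert(valueToPixel(10, vp) === 500, "right edge maps to full width");
console.assert(valueToPixel(5, vp) === 250, "midpoint maps to the centre");

// Zooming in by 2x around the midpoint should keep it fixed on screen.
const zoomed = zoom(vp, 5, 2);
console.assert(valueToPixel(5, zoomed) === 250, "focal value stays put after zoom");
```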

If things go wrong, use files-to-prompt to copy your whole repo into a prompt, and give it to Gemini 2.5 in Google AI Studio to fix the problem.

If you're implementing a complex feature in an existing app, you can follow a similar process, but provide the full app source code as context if possible. Again, I recommend Gemini 2.5 here, with its 1M-token context window.

One interesting aspect of this process is that it's really just a combination of long-context prompting and LLM reasoning, paired with a little human input. So it seems likely that, as LLMs become more agentic, the computer will increasingly do all of this for you.

Thanks to Mary Rose Cook for this video, which I found useful in learning some of these techniques.