worked with OpenAI to fine-tune gpt-4o and built the SOTA... | worked with OpenAI to fine-tune gpt-4o and built the SOTA...
worked with OpenAI to fine-tune gpt-4o and built the SOTA model for the
patched-codes/static-analysis-eval
benchmark. All the code and data
patched-codes/synth-vuln-fixes
on how we did it is available on their GitHub - https://github.com/openai/build-hours/tree/main/5-4o_fine_tuning.

Here are some tips based on our experience:

→ Establish baseline with "conditioning" / prompting

→ Task-specific datasets are ideal for PEFT; hard to beat gpt-4o on "broad" tasks

→ Add your best system prompt to each example

→ Ensure training data distribution is similar to inference data

→ Shorten instructions with concise prompts; may require more examples.

→ Define clear evaluation metrics (seriously, please eval!)

You can see more details on the benchmark and process here - https://www.patched.codes/blog/the-static-analysis-evaluation-benchmark-measuring-llm-performance-in-fixing-software-vulnerabilities build-hours/5-4o_fine_tuning at main · openai/build-hours