worked with OpenAI to fine-tune gpt-4o and built the SOTA model for the patched-codes/static-analysis-eval benchmark. All the code and data (patched-codes/synth-vuln-fixes) on how we did it is available on their GitHub - https://github.com/openai/build-hours/tree/main/5-4o_fine_tuning.
Here are some tips based on our experience:
→ Establish baseline with "conditioning" / prompting
→ Task-specific datasets are ideal for PEFT; hard to beat gpt-4o on "broad" tasks
→ Add your best system prompt to each example (see the JSONL sketch after this list)
→ Ensure training data distribution is similar to inference data
→ Fine-tuning lets you shorten long instructions into concise prompts, but may require more training examples
→ Define clear evaluation metrics (seriously, please eval! - see the scoring sketch at the end)
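To make the "system prompt in every example" tip concrete, here is a minimal sketch assuming the standard OpenAI chat-format JSONL for fine-tuning. The system prompt text and the toy example below are hypothetical; the real data lives in patched-codes/synth-vuln-fixes.

```python
import json

# Hypothetical system prompt - in practice, reuse the exact prompt that
# performed best during the prompting ("conditioning") baseline.
SYSTEM_PROMPT = (
    "You are a security engineer. Rewrite the given code so the reported "
    "vulnerability is fixed while preserving behavior."
)

# Toy stand-in for the real dataset (patched-codes/synth-vuln-fixes).
raw_examples = [
    {
        "vulnerable_code": 'query = "SELECT * FROM users WHERE id = " + user_id',
        "fixed_code": 'query = "SELECT * FROM users WHERE id = %s"',
    },
]

# OpenAI fine-tuning takes JSONL where each line is a full chat transcript.
# Prepending the same system prompt to every example keeps the training
# distribution aligned with how the model will be prompted at inference.
with open("train.jsonl", "w") as f:
    for ex in raw_examples:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": ex["vulnerable_code"]},
                {"role": "assistant", "content": ex["fixed_code"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```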
You can see more details on the benchmark and process here - https://www.patched.codes/blog/the-static-analysis-evaluation-benchmark-measuring-llm-performance-in-fixing-software-vulnerabilities
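For the eval tip, here is a hedged sketch of what a minimal scoring loop could look like: generate a fix for each sample and count the fraction the analyzer no longer flags. The regex "analyzer" and the toy model are placeholders for illustration, not the actual static-analysis-eval harness.

```python
import re

def analyzer_flags(code: str) -> bool:
    # Placeholder for a real static analyzer (e.g. Semgrep): here we only
    # flag naive string-concatenated SQL as an illustrative check.
    return bool(re.search(r'SELECT .*" \+ ', code))

def pass_rate(samples, generate_fix) -> float:
    # Fraction of samples whose generated fix is no longer flagged.
    passed = sum(
        1 for s in samples if not analyzer_flags(generate_fix(s["vulnerable_code"]))
    )
    return passed / len(samples)

# Toy usage: a hard-coded "model" that parameterizes the query.
samples = [{"vulnerable_code": 'query = "SELECT * FROM users WHERE id = " + user_id'}]

def fix_with_model(code: str) -> str:
    return 'query = "SELECT * FROM users WHERE id = %s"'

print(f"pass rate: {pass_rate(samples, fix_with_model):.0%}")
```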