Benchmarking OpenAI's gpt-4o-mini model for unit test generation
OpenAI just released gpt-4o-mini, a lightweight and cost-efficient version of their amazing gpt-4o model. In this blog we’ll assess the value of this new model for the development task of unit test generation, comparing the quality of the generated unit tests across gpt-4o-mini, gpt-4o, and a mixture run of both models.
If you want to learn more about our test criteria, please visit our previous blog describing the different methods we use for evaluating the results, including code coverage, mutation scores at three levels, and scope coverage (how successful we are in generating tests for all public methods in the project).
Let’s get to the benchmark and the results.
Setup and testing environment
To conduct our benchmark we used a popular OSS project, ts-morph, and the latest version of Early's VSCode extension for unit test generation.
Test project: ts-morph
ts-morph is an open-source project that provides a powerful and user-friendly API for working with the TypeScript Compiler API. It is designed to simplify the process of interacting with TypeScript code, enabling developers to create, manipulate and analyze TypeScript code programmatically.
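For context, here is a minimal sketch of the kind of programmatic manipulation ts-morph enables (a toy example for illustration, not code from the benchmark):

```typescript
import { Project } from "ts-morph";

// Create an in-memory project and add a TypeScript source file to it.
const project = new Project({ useInMemoryFileSystem: true });
const sourceFile = project.createSourceFile(
  "example.ts",
  "export function add(a: number, b: number) { return a + b; }"
);

// Inspect and manipulate the code programmatically.
const fn = sourceFile.getFunctionOrThrow("add");
console.log(fn.getParameters().map((p) => p.getName())); // ["a", "b"]
fn.rename("sum");
console.log(sourceFile.getFullText()); // the function is now named "sum"
```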
- GitHub stars: 4,600
- GitHub forks: 189 (now one more)
- Contributors: 58
- Total commits: 2,297
- License: MIT License
- Language: TypeScript
- Repository: https://github.com/dsherret/ts-morph
- Clone date: June 2024
- Tested package: packages/common/src
- Lines of app code (packages/common/src): 4,937
Setup:
- The original tests were removed from the project’s code to mimic a clean slate. This also lets us evaluate the generated unit tests in isolation from other forms of tests.
- Tests were generated only for packages/common/src
- Using Early's latest VSCode extension from July 18th 2024
- Attempting to generate unit tests for the 210 public methods in this project
The specific models we used for this benchmark are:
- gpt-4o-2024-05-13
- gpt-4o-mini-2024-07-18
- A mixture run: gpt-4o-mini first, falling back to gpt-4o for the methods where gpt-4o-mini was not successful (sketched below)
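Conceptually, the mixture run works as a simple fallback over the project’s public methods. The sketch below is our illustration of that idea; generateTests is a hypothetical helper, not part of Early’s extension or the OpenAI SDK:

```typescript
type Model = "gpt-4o-mini-2024-07-18" | "gpt-4o-2024-05-13";

// Hypothetical helper: asks the given model to generate unit tests for a method
// and returns the test code, or null if no green (compiling, passing) test was produced.
declare function generateTests(model: Model, methodName: string): Promise<string | null>;

// Mixture strategy: try the cheaper gpt-4o-mini first, and fall back to gpt-4o
// only for the methods where gpt-4o-mini did not succeed.
async function generateWithFallback(methodName: string): Promise<string | null> {
  const miniResult = await generateTests("gpt-4o-mini-2024-07-18", methodName);
  if (miniResult !== null) {
    return miniResult;
  }
  return generateTests("gpt-4o-2024-05-13", methodName);
}
```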
KPIs
- General data, such as the number of green (passing) and red (failing) unit tests, lines of code, time to generate, and more.
- Code coverage (all files)
- Mutation scores for public methods (three groups):
  - Mutation score (all methods)
  - Mutation score (only for methods that have unit tests)
  - Mutation score at 100% coverage: all methods that have unit tests AND whose code is 100% covered by those tests (indicative of quality tests)
- Scope coverage ratio: measures how successful we are at generating quality unit tests for the project’s public methods. Quality unit tests are defined as tests that achieve 100% coverage of their respective methods, and the scope coverage ratio is the share of public methods that have such quality tests.
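A minimal sketch of how this ratio can be computed, based on our reading of the definition above (the MethodReport shape and field names are illustrative, not the benchmark’s actual tooling):

```typescript
// Illustrative only: a per-method report with the coverage its generated tests achieve.
interface MethodReport {
  name: string;
  hasGeneratedTests: boolean;
  lineCoverage: number; // 0-100, coverage of this method by its own tests
}

// Scope coverage ratio: the share of public methods that have "quality" tests,
// i.e. generated tests that fully (100%) cover the method.
function scopeCoverageRatio(publicMethods: MethodReport[]): number {
  const qualityTested = publicMethods.filter(
    (m) => m.hasGeneratedTests && m.lineCoverage === 100
  ).length;
  return qualityTested / publicMethods.length;
}
```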
Benchmark results
First let’s look at the basic coverage and mutation scores.
We can see that gpt-4o-mini performs quite well compared to gpt-4o, with 58% coverage versus 53%, respectively. Mutation scores across all files average around 40% and increase as we move to the newer model or the mixture run; where unit test generation succeeded, mutation scores jump to around the 80% level.
If we dive one step further into the quality of these unit tests, we can see that quality goes up with gpt-4o-mini and is highest in the mixture run.
When generating unit tests automatically, we found a high correlation between high coverage and high mutation scores. In other words, when test generation succeeds, the resulting unit tests are in most cases high quality as well.
To understand how effective these tests are at the project level, we also looked at how successful we were in generating tests for all the public methods in the project. To do so, we used the scope coverage ratio defined earlier.
The project had 210 public methods. Here is the scope coverage for each model:
The normalized mutation score shows that the latest model is also more successful in generating tests for more of the project’s methods.
Let’s look at more supporting data that could explain these results. Specifically, we’ll look at the number of green and red unit tests.
To run Stryker we had to skip all red tests, since Stryker requires a green test suite. We decided to keep these tests, as some of them could be valuable and reveal real bugs; we will explore this in a different blog.
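For example, a red test can be excluded from the Stryker run with a standard skip annotation (illustrative; the exact mechanism used in the benchmark may differ):

```typescript
// it.skip is supported by Mocha, Jest, and Vitest alike, so Stryker only sees green tests.
it.skip("parses a malformed source file without throwing", () => {
  // Test body is kept for later review even though it currently fails.
});
```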
Code and unit test metrics
Although counting LoC (lines of code) is a questionable metric for quality and success, in this case the code represents the mocks, happy paths, edge cases, and complex logic in the generated tests for the methods where unit test generation succeeded.
No less impressive is the time it took to generate these tests: around 40 minutes on average, with an average LLM response time of 19 seconds!
In summary, we can see that gpt-4o-mini holds up quite well against gpt-4o for unit test generation. We have seen cases where each model performed better on different methods, but overall the ROI of gpt-4o-mini is substantial.
Try it out for yourself and see how Early can revolutionize your development workflow and reduce the cost of bugs!
Sharon Barr
Co-founder and CEO