In the world of Mixed-Integer Programming (MIP), the “gap”—the distance between our incumbent solution and the dual bound—remains the gold standard for measuring solution quality. This mathematical rigor is precisely why we use MIP: it provides a proven certificate of quality that other heuristic methods cannot match. However, while a solution’s proof of optimality is crucial, the path to that solution is equally important.
In many fast-paced business environments, having a good feasible point is the primary requirement: a proven optimal solution or a good dual bound is secondary. In such situations, the MIP framework’s greatest strength is its flexibility. Even when a problem cannot be solved to optimality within a tight time window, MIP solvers often deliver high-quality feasible solutions very early in the search. By measuring how quickly a solver narrows the distance to the optimal value, we can see how MIP extends its reach into real-time decision-making and other realms where speed is as vital as proof.
Unlike pure heuristics, the MIP framework maintains a deterministic stopping criterion and continues to provide a dual bound, offering a formal architecture to prove when a solution can no longer be improved. By adding the Primal Integral to other well-known metrics like time to optimality, the MIPfeas benchmark helps to classify MIP solvers by their ability to deliver high-value feasible points in the critical opening minutes of a solve.
The Metric: Defining the Primal Integral
To capture this “quality over time” performance, we look to the Primal Integral, a metric introduced by Timo Berthold [1, 2] to evaluate the effectiveness of primal heuristics. Instead of just looking at the final result, the Primal Integral accounts for every improvement the solver finds throughout the entire run. To understand how the Primal Integral is calculated, we can trace the progress of a single model—in this case, the MIPLIB 2017 instance supportcase7.
During a solve, the branch-and-cut algorithm tracks two key values: the incumbent (the best primal feasible solution found so far) and the dual bound (the theoretical limit of how good the solution could possibly be).
We transform and scale the area between incumbent values and the known optimal solution value by integrating the primal gap function p(t). Mathematically, if x* is the optimal value and x(t) is the incumbent at time t, the primal gap function p(t) (also called the penalty) measures the relative distance from x(t) to x*. In the MIPfeas benchmark, we have slightly adapted the original penalty structure introduced by Timo Berthold, taking advantage of the fact that all problems are minimization problems:
- The “No-Solution” Penalty: If no incumbent is found, we set the penalty to 2. We believe that in a business context, the absence of any feasible solution is a significantly greater hurdle than a poor one.
- Handling “Sign-Flip” and Zeros: When the incumbent and the optimal solution have opposite signs, we set the penalty to 1. When x* and x(t) are both close enough to 0 (<1e-6), we set the penalty to 0. There are other ways of handling these cases: the scripts calculating the primal integral shared as part of this blog provide an alternative in which the optimal value and all incumbents are shifted to strictly positive values, allowing the formula given below to be applied in all cases.
- The Progress Metric: Once a solution with the correct sign is found, the penalty is at most 1 and decreases to 0 as the incumbent approaches the optimal value:
$$ p(\bar{x}(t)) = \frac{\bar{x}(t) - x^*}{\max(|\bar{x}(t)|, |x^*|)} $$
The Primal Integral P(T) is the area under this curve over the total time T, scaled by (i.e., divided by) T. The value of the Primal Integral therefore lies between 0 and 2.
Mathematically, the integral is expressed as $ P(T) = \frac{1}{T} \int_{0}^{T} p(t)\, dt $.
By focusing on the area rather than just the final point, the Primal Integral rewards solvers that are “fast out of the gate.” A solver that finds a 99% optimal solution in the first 10 seconds will have a significantly lower (better) score than a solver that finds that same solution only in the final 10 seconds of a 600-second run.
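The penalty rules and the integral above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual create_primalintegral.py script; the function names and the (time, incumbent) trajectory format are our own choices:

```python
def primal_gap(x, x_star, tol=1e-6):
    """Penalty p for an incumbent x against the optimal value x_star."""
    if x is None:                       # no feasible solution found yet
        return 2.0
    if abs(x) < tol and abs(x_star) < tol:
        return 0.0                      # both values essentially zero
    if x * x_star < 0:
        return 1.0                      # sign flip between incumbent and optimum
    return abs(x - x_star) / max(abs(x), abs(x_star))

def primal_integral(trajectory, x_star, T):
    """Scaled area under the step function of primal gaps.

    trajectory: list of (time, incumbent) pairs sorted by time; the gap
    stays constant until the next improving incumbent is found.
    """
    area, t_prev, x_prev = 0.0, 0.0, None
    for t, x in trajectory:
        area += primal_gap(x_prev, x_star) * (t - t_prev)
        t_prev, x_prev = t, x
    area += primal_gap(x_prev, x_star) * (T - t_prev)
    return area / T

# A solver reaching a 1% gap after 10s scores far better than one
# finding the same solution only at 590s of a 600s run:
fast = primal_integral([(10.0, 101.0)], 100.0, 600.0)   # ≈ 0.043
late = primal_integral([(590.0, 101.0)], 100.0, 600.0)  # ≈ 1.967
```

The step-function treatment mirrors how the trace data is recorded: between two improvements, the gap of the last incumbent applies.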
The Rise of GPU-Accelerated Heuristics
An intriguing side aspect of this benchmark has been the recent, rapid advancement in GPU-based heuristics. While commercial MIP vendors have so far stayed on the sidelines of this development, the availability of open-source implementations like NVIDIA cuOpt allowed us to put these GPU-based methods to the test against more traditional stalwarts.
Benchmark Design and Methodology
To ensure a rigorous and transparent comparison between traditional and emerging technologies, we standardized the testing environment across a diverse problem set:
- The Test Suite: We selected 233 instances from the MIPLIB 2017 benchmark collection. To ensure the primal integral remains a meaningful metric of solution quality, we specifically excluded instances known to be infeasible.
- Computational Limits and the “Power of Defaults”: To simulate a high-performance workstation environment, each solver was allocated a strictly enforced time limit of 600 seconds and utilized 24 threads.
We explicitly note that for this benchmark, we performed a single run per instance-solver pair. In an ideal computational study, one would perform multiple runs using different random seeds or instance permutations to account for “performance variability”—the phenomenon where minor internal differences can lead to significantly different search paths. While we did not have the computational resources to perform multiple runs per combination, we believe that our large test set of 233 instances provides the statistical breadth necessary to mitigate the impact of “good or bad luck” on any single model, revealing the true underlying performance trends of each engine.
All solvers in this initial round were run using their default settings. We chose this approach for several strategic reasons:
- Fairness across Engines: Some solvers offer “meta-options” to prioritize feasibility over optimality, while others require manual adjustment of granular settings like heuristic frequency or branching rules. Sticking to defaults avoids the need for subjective “manual fiddling” to achieve parity.
- The Robustness of Defaults: Solver developers invest significant resources into ensuring that default parameters perform reliably across a diverse library of instances, providing a stable “out-of-the-box” intelligence baseline.
- Avoiding the Tuning Trap: While hyper-parameter tuning is powerful for specific models, it can often become a liability when software versions change or data distributions shift.
By prioritizing defaults, we are evaluating the general-purpose strategies prioritized by the developers. However, we view this as a starting point. Future iterations of the MIPfeas benchmark may expand to include entries with tuned feasibility configurations or even pure heuristic solvers whose sole purpose is to produce high-quality solutions rapidly. This evolution will allow us to compare the rigorous search of a standard MIP solver against specialized engines designed specifically for real-time, high-speed decision-making.
- The Contestants:
- Open-Source Ecosystem: We included Cbc, HiGHS, and SCIP, representing the most widely used and respected open-source solvers.
- GPU-Accelerated Innovation: We evaluated the GPU-based open-source cuOpt. To understand the impact of hardware scaling, we tested cuOpt across two distinct environments: the NVIDIA H100, a high-end, heavy-duty data center platform, and the NVIDIA GB10, a more accessible, commodity hardware platform.
- Commercial Performance: While we are unable to publish individual results for commercial solvers (see below), we believe it is essential to demonstrate the current performance gap between open-source solvers and the state-of-the-art in commercial optimization.
To bridge the gap between open-source and commercial performance without violating legal terms, we have grouped the results of three major commercial solvers—COPT, CPLEX, and Xpress—into three synthetic baselines: the Virtual Best, Mean, and Worst Commercial Solvers (VBCS, VMCS, and VWCS).
- The Virtual Best (VBCS): Created by taking the best (smallest) Primal Integral achieved by any of the three commercial solvers for every single instance.
- The Virtual Worst (VWCS): Created by taking the worst (largest) Primal Integral per instance. This provides a reliable “lower bound” on what one can expect from a top-tier commercial engine.
- The Virtual Mean (VMCS): The average of the three values per instance, offering a sense of the “average” performance across the commercial landscape.
It is important to note that the VBCS represents an “upper bound” on the potential of commercial technology; no single solver likely achieved this level of performance across all 233 instances. However, by including this metric, we provide a clear “north star” for the benchmark. Conversely, the VWCS and VMCS provide a grounded reality of the commercial sector.
This approach allows us, after many years of absence for some of these solvers from public benchmarks, to include them once again. It enables us to quantify exactly how much ground open-source solvers have gained while highlighting the remaining performance premium offered by the commercial sector in finding high-quality solutions under tight time constraints.
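Concretely, the three virtual baselines reduce to a per-instance min, mean, and max over the commercial results. A minimal sketch (the instance names and primal integral values here are made up for illustration):

```python
from statistics import mean

# Hypothetical primal integrals of the three commercial solvers per instance
commercial = {
    "supportcase7": [0.012, 0.051, 0.204],
    "instanceA":    [0.002, 0.009, 0.031],
}

# VBCS takes the best (smallest) value, VWCS the worst (largest),
# VMCS the average of the three, for every single instance.
virtual = {
    inst: {"VBCS": min(pis), "VMCS": mean(pis), "VWCS": max(pis)}
    for inst, pis in commercial.items()
}
```

Aggregating these per-instance values (see the summary statistics below in the results section) then yields the VBCS, VMCS, and VWCS rows reported for the benchmark.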
- Computational Environment & Hardware:
- All runs were conducted using GAMS version 53.1.0 (Linux x86_64 and arm64). The solver versions used were:
- Cbc: 2.10.11
- COPT: 8.0.3
- CPLEX: 22.1.2.0
- cuOpt: 26.2.0 (git hash: f73da24d)
- HiGHS: 1.12.0 (git hash: 9fa8db6f9)
- SCIP: 10.0.1 (git hash: 76a50c97d6)
- Xpress: 9.8.1
- Optimization benchmarks are ideally conducted on identical hardware; however, given the specialized requirements for GPU-accelerated solvers and the realities of high-performance computing access, this study utilized three distinct environments:
| Solver Group | CPU / Architecture | GPU / Accelerator |
|---|---|---|
| Cbc, COPT, CPLEX, HiGHS, SCIP, Xpress | AMD EPYC 9B45 | N/A |
| cuOpt-h100 | Intel(R) Xeon(R) Platinum 8481C @ 2.70GHz | NVIDIA H100 80GB HBM3 |
| cuOpt-gb10 | 20-core ARM-based (NVIDIA Grace Blackwell) | NVIDIA GB10 (120GB VRAM) |
Data Collection: The GAMS Solvetrace Advantage
Parsing solver logs is a notorious headache—formats change, and information is often missing. For this benchmark, we utilized the GAMS solvetrace facility. By using solver callbacks, solvetrace records the progress of the incumbent over time in a structured and consistent format. Crucially, it standardizes the measurement of wall-clock time across all engines, regardless of their internal timing defaults. This ensures a rigorous “apples-to-apples” comparison and allows for a more automated, high-fidelity calculation of the Primal Integral.
Note to the Community: Benchmarks Public or Private
One cannot mention benchmarking and optimization without considering the Mittelmann benchmarks. Over the course of many years, his public benchmarks have informed the community and spurred discussion and development. We are grateful to Prof. Mittelmann for his work in this area and particularly for including the current MIPfeas results on his benchmark page.
Recognizing the legal complexities of “DeWitt clauses”—legal provisions that prohibit the public disclosure of comparative benchmark results—we have worked constructively with our commercial solver vendors to find a path forward. By introducing the Virtual Best, Mean, and Worst (VBCS, VMCS, VWCS) categories, we have successfully bridged the gap—incorporating state-of-the-art commercial results into a public benchmark while respecting necessary legal frameworks.
However, we recognize that for many users, “virtual” is not enough. We have made all our benchmarking tools and scripts publicly available for those creating private, internal benchmarks. If you hold a license for a specific commercial solver, our framework is designed to be plug-and-play. You can easily integrate your licensed solver into our scripts to generate a direct, private comparison against our public contestants in your own environment. Please note, however, that while our tools facilitate internal evaluation, most commercial solvers require explicit approval from the vendor before any comparative benchmark results involving their software can be published or shared publicly.
For the academic community, the barrier to entry is even lower. Researchers can obtain a GAMSPy academic license, which provides integrated access to all the open-source solvers mentioned here (Cbc, HiGHS, SCIP), as well as a wide array of commercial MIP engines. Since the solvetrace facility is natively available within the GAMSPy ecosystem, every academic with access to compute power—whether it’s a standard CPU cluster or high-end GPU hardware—can replicate this benchmark or, perhaps more importantly, run it on their own unique sets of models.
Acknowledgments
This benchmark is the result of a broad collaboration across the optimization community, and we are deeply grateful to those who provided the technical foundations and guidance necessary to bring it to life.
First, we would like to thank Hans Mittelmann for his invaluable guidance and for the honor of hosting this benchmark on his renowned benchmarking site. We also extend our gratitude to ZIB (Zuse Institute Berlin) for their support and guidance throughout this project.
We extend our sincere thanks to the NVIDIA team, specifically Burcin Bozkaya, Akif Coerduek, and Chris Maes. Their efforts in implementing the incumbent callback within cuOpt were essential to enabling the GAMS solvetrace facility, and their overall guidance was instrumental throughout the creation of this benchmark and the accompanying blog post.
Finally, we are grateful for the valuable comments, insights, and support provided by our colleagues across the solver industry. Their expertise—ranging from high-level methodology to the fine details of the Primal Integral—was indispensable. Special thanks go to Timo Berthold (Xpress), Gerald Gamrath (COPT), and Ferenc Katai (CPLEX).
Results
The following table summarizes the performance of the contestants, including the virtual commercial solver groups, across the 233 MIPLIB instances. The individual solvers are listed alphabetically, followed by the virtual solvers, to provide a clear reference for specific engine results.
How to present a large volume of Primal Integral (PI) data in a meaningful way is a nuanced question. In the original Primal Integral paper, the Arithmetic Mean was used because the primal gap is already a scaled measure of performance. However, following a discussion with Timo Berthold, we have introduced additional statistical rigor to this benchmark.
Because PI spans multiple orders of magnitude and functions similarly to a time measurement (dynamically scaled by the gap), we include the Shifted Geometric Mean with a shift of 0.001. This ensures that the mean is not unduly skewed by outliers or extremely small values, providing a more robust “center” for the performance distribution. Moreover, we report additional statistics: the median and the interquartile range.
Since a Primal Integral of 2 represents a failure to find any integer-feasible point, we report the total count of instances (“Feasible found (#)”) where a solver successfully returned at least one primal feasible solution. We also record how many instances each solver managed to solve to optimality (“Proven Optimal (#)”). When viewed alongside the Primal Integral, this count offers a comprehensive look at how effectively a contestant balances its two primary tasks: the search for early, high-quality feasible points and the effort required to prove optimality.
| Metric | cbc | cuopt-gb10 | cuopt-h100 | highs | scip | VBCS | VMCS | VWCS |
|---|---|---|---|---|---|---|---|---|
| Arithmetic Mean | 0.5747 | 0.3065 | 0.2586 | 0.3591 | 0.4285 | 0.0389 | 0.1115 | 0.2040 |
| Arithmetic Mean [Scaled] | 14.7742 | 7.8797 | 6.6485 | 9.2305 | 11.0141 | 1.0000 | 2.8657 | 5.2433 |
| Geom. Mean (shift=0.001) | 0.1121 | 0.0887 | 0.0651 | 0.0567 | 0.0755 | 0.0050 | 0.0132 | 0.0214 |
| Geom. Mean (0.001) [Scaled] | 22.4290 | 17.7487 | 13.0263 | 11.3398 | 15.1058 | 1.0000 | 2.6511 | 4.2911 |
| Median | 0.1209 | 0.0967 | 0.0651 | 0.0484 | 0.0744 | 0.0028 | 0.0078 | 0.0129 |
| Interquartile Range | 0.9451 | 0.3289 | 0.3240 | 0.3154 | 0.5044 | 0.0113 | 0.0449 | 0.0827 |
| Feasible found (#) | 192 | 223 | 225 | 212 | 204 | 232 | 232 | 221 |
| Proven Optimal (#) | 77 | 60 | 57 | 97 | 77 | 215 | 180 | 145 |
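The shifted geometric mean used in the table can be reproduced in a few lines. This is a sketch in our own notation (the function name and the sample values are illustrative, not taken from the benchmark data):

```python
import math

def shifted_geometric_mean(values, shift=0.001):
    """Geometric mean of (v + shift), with the shift subtracted again."""
    log_sum = sum(math.log(v + shift) for v in values)
    return math.exp(log_sum / len(values)) - shift

# Four hypothetical primal integrals: the arithmetic mean is dominated
# by the single bad run, while the shifted geometric mean is far less so.
values = [0.0005, 0.02, 0.5, 2.0]
am = sum(values) / len(values)            # ≈ 0.63
sgm = shifted_geometric_mean(values)      # ≈ 0.074
```

The shift keeps near-zero primal integrals (instances solved almost instantly) from dragging the geometric mean toward zero, which is why it serves as the primary ranking statistic.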
Visualizing the Performance Landscape
To provide a deeper look into these numbers, we utilize three complementary representations. In these charts, the solvers are ordered by their shifted geometric mean (from worst to best) to highlight the performance hierarchy.
The Bar Chart offers a direct comparison of the overall shifted geometric mean, serving as the primary scoreboard for the benchmark.
The Boxplot shifts the focus to consistency, highlighting the median performance and the spread of the middle 50% of instances (the Interquartile Range); this helps distinguish between solvers that are consistently fast and those that vary wildly by model.
Finally, the Violin Plot provides a high-fidelity view of the solution density. It reveals exactly where solvers cluster their results—specifically showing the “penalty ceiling” at P(T)=2, which indicates how often a solver failed to find any feasible solution within the 600-second window.
Technical Appendix: The Data Pipeline
Model Pre-processing: MPS to GDX
The 233 MIPLIB instances are distributed in gzipped MPS format. Since GAMS and GAMSPy do not consume MPS files directly, we utilize the mps2gms utility. This tool converts the MPS file into a GDX (GAMS Data Exchange) file, which contains the complete model structure—objective function, bounds, variable types, and the constraint matrix. The instance, in the form of a GDX file, can then be solved with GAMS or GAMSPy via the generic scripts mipfeas.gms and mipfeas.py.
An important side effect of the file conversion process is that constraints may be reordered compared to the original MPS file. While this reordering creates a different internal matrix structure than the source file, it is not expected to impact the benchmark in any significant way. We recognize that performance variability remains a factor for any individual model, regardless of whether the original or a permuted matrix order is used. As previously noted, we rely on our large test set of 233 instances to smooth out these variations, ensuring that the resulting aggregate metrics reflect the genuine performance trends of each contestant.
We also used the utility to force a consistent minimization sense across all models to simplify the downstream integral calculations.
```sh
# Example conversion loop
cd ${HOME}/mipfeas/mps_files
for file in *.mps.gz
do
    # For GAMS users
    mps2gms $file GMS= PY= CONVERTSENSE=MIN COLUMNINTVARSAREBINARY=1
    # For GAMSPy users
    gamspy mps2gms $file --py /dev/null --convertsense MIN --columnintvarsarebinary 1
done
```
Job Submission and Execution
To handle the scale of 233 instances across multiple solvers, we utilized the Slurm workload manager. Each job was allocated 24 threads and a strict time limit.
A critical component of our execution script is the multi-layered timeout protection. Since some solvers may occasionally fail to respect the 600-second time limit, we wrap the call in the Linux timeout utility, while Slurm acts as the final arbiter to terminate the process:
```sh
cd ${HOME}/mipfeas/log_files
let "cnt=99"
for file in ../mps_files/*.gdx
do
    let "cnt++"
    f=`basename -- $file .gdx`
    for s in cbc cuopt-gb10 cuopt-h100 highs scip
    do
        # Submit to Slurm with appropriate hardware requests;
        # add options to specify partition, memory, GPU, and
        # other constraints
        sbatch --time 00:20:00 --cpus-per-task=24 -o ${HOME}/mipfeas/log_files/${f}_${s}.log runme.sh $f $s $cnt
    done
done
```
The previous script submits one job per instance-solver pair, each executing the script runme.sh:
```sh
#!/bin/bash
# Using /usr/bin/time to capture OS-level resource stats
# For GAMS users
/usr/bin/time -v timeout -k 750 730 gams mipfeas.gms --instance $1 --solver $2 optfile $3 resLim 600 threads 24 writeOutput 0
# For GAMSPy users
/usr/bin/time -v timeout -k 750 730 python mipfeas.py $1 $2 $3 600 24
```
Creating the Solvetrace
The core of the benchmark relies on the mip/solvetrace facility. Our customized mipfeas.gms and mipfeas.py scripts accept the instance and solver names as parameters, then dynamically generate the solver-specific option files required to trigger the trace.
Because the syntax for solvetrace can vary slightly between solvers, the script handles these nuances internally to ensure that every run produces a standardized .trc file. These trace files contain the vital “step-function” data of incumbent improvements over wall-clock time.
Data Processing and Integral Calculation
Once the raw .trc files are generated, a Python script (create_primalintegral.py) processes the results, computes the primal integral value and an optimal-solution indicator for each instance and solver, and stores them in CSV (primalintegral.csv and optimalindicator.csv) and Excel (mipfeas.xlsx, with sheets “Primal Integral Values” and “Optimal Solution Indicator”) files. This script obtains the optimal or best-known solution from the CSV file optimal_objective.csv. A second Python script (create_plots.py) processes primalintegral.csv and optimalindicator.csv to create the summary table and the three plots given above.
Downloads
All 233 instances (mps.gz and GDX format) with the generic GAMS and GAMSPy scripts are available for download (https://cloud.gams.com/s/SsDNbTdjk5Lo5QK). The trace and log files of the runs (excluding the commercial solvers), the CSV and Excel files, together with the Python scripts, are also available for download (https://cloud.gams.com/s/4mE2ZxkA3agzs7C). Results from new iterations of the benchmark will be available in the directory “mipfeas” (https://cloud.gams.com/s/XkfCnXBBqbPPtaZ).
1. Berthold, Timo. “Measuring the impact of primal heuristics.” Operations Research Letters 41.6 (2013): 611–614.
2. Berthold, Timo, and Csizmadia, Zsolt. “The confined primal integral: a measure to benchmark heuristic MINLP solvers against global MINLP solvers.” Mathematical Programming 188.2 (2021): 523–537.