Significant performance drop (5 ms to 35 ms) when deploying SQP_RTI to MicroAutoBox 3

Hi,

I am running a simulation with an SQP_RTI solver generated by acados and deploying this solver to a dSPACE MicroAutoBox 3 (MABX3) using Simulink Coder (dSPACE RTI).

I am observing a significant execution time discrepancy between the Simulink simulation (Host PC) and the compiled code running on the target hardware (MABX3).

System Setup:

  • Solver: SQP_RTI
  • QP Solver: PARTIAL_CONDENSING_HPIPM
  • Prediction Horizon: 100 steps
  • Target Hardware: dSPACE MicroAutoBox 3
  • Interface: Simulink (S-Function) → dSPACE RTI Build

The Issue:

  • In Simulink (Host PC): The solver execution time is approx. 5 ms.
  • On MicroAutoBox 3 (Target): The solver execution time increases to 35 ms.

Context: In my previous experience with other optimization setups (e.g., fmincon-based generated code), the performance gap between the Host PC and the MABX3 was usually negligible. However, with acados, I am seeing a 7x slowdown.

Questions:

  1. Is this performance drop expected due to the difference in CPU architecture/clock speed between a standard PC and the MABX3?
  2. Could this be related to compiler optimization flags (e.g., -O2 vs -O3) during the cross-compilation for dSPACE? I suspect the library or the generated code might not be fully optimized for the target.
  3. Given that I am using N=100, are there specific memory or cache considerations on embedded targets that could cause this drastic slowdown compared to a PC?

Any insights on how to debug this latency or specific flags to check for dSPACE deployment would be greatly appreciated.
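When comparing the two platforms, a single timing reading can be misleading. Below is a minimal host-side sketch that assumes `solve_once` is a hypothetical wrapper around one solver call. It reports min/median/max over many calls after a few warm-up calls, which separates a systematic slowdown from one-off outliers such as first-call costs. On the MABX3 the same idea applies with the target's own timer. acados also records internal timings (e.g. `time_tot`) that do not include the Simulink wrapper overhead.

```python
import statistics
import time
from typing import Callable, Dict


def summarize_latency(solve_once: Callable[[], object], repeats: int = 200,
                      warmup: int = 5) -> Dict[str, float]:
    """Call solve_once repeatedly and summarize wall-clock time in milliseconds.

    solve_once is a placeholder (hypothetical) for whatever triggers one
    solver call, e.g. a wrapper around the generated acados solve function.
    """
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    # Warm-up calls are not measured: they may include one-time costs
    # (allocation, cold caches) that distort a host/target comparison.
    for _ in range(warmup):
        solve_once()
    samples_ms = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        solve_once()
        samples_ms.append((time.perf_counter() - t0) * 1e3)
    samples_ms.sort()
    return {
        "min_ms": samples_ms[0],
        "median_ms": statistics.median(samples_ms),
        "max_ms": samples_ms[-1],  # worst case matters for real-time budgets
    }


if __name__ == "__main__":
    # Stand-in workload; replace with the actual solver call.
    def dummy_solve():
        return sum(i * i for i in range(10_000))

    print(summarize_latency(dummy_solve, repeats=50))
```

The median shows the typical cost, and the maximum shows whether occasional spikes rather than the average break the real-time budget.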

Thanks!

Hi,
yes, such a performance drop seems rather unexpected.
I don’t know the specifics of your two setups, but the MABX3 is quite powerful.

Could you double-check the compilation flags?
In particular, the BLASFEO_TARGET setting can make a big difference, as it determines whether vectorized (SIMD) instructions are used in the linear algebra routines.
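One quick check is to extract the effective optimization level and architecture flags from the compile command lines in the build log. This sketch assumes a GCC-compatible cross-compiler. Whether the dSPACE toolchain for the MABX3 follows the same rules is an assumption you would need to verify. With GCC, the last `-O` option on a command line takes effect, and no `-O` at all means `-O0`. A trailing `-O0` can therefore silently undo an earlier `-O2`. Note that BLASFEO_TARGET is chosen when BLASFEO itself is built, so the flags of the Simulink model build alone do not reveal it.

```python
from typing import List


def effective_opt_level(compile_cmd: str) -> str:
    """Effective optimization level of a GCC-style compile command.

    GCC uses the last -O option it sees; without any -O option it compiles
    at -O0. A bare "-O" is equivalent to -O1.
    """
    level = "0"
    for token in compile_cmd.split():
        if token.startswith("-O"):
            level = token[2:] or "1"
    return "-O" + level


def arch_flags(compile_cmd: str) -> List[str]:
    """Target/vectorization-related flags (e.g. -march=, -mcpu=, -mfpu=)."""
    prefixes = ("-march=", "-mcpu=", "-mtune=", "-mfpu=")
    return [t for t in compile_cmd.split() if t.startswith(prefixes)]


if __name__ == "__main__":
    # Hypothetical command line as it might appear in a build log.
    cmd = "gcc -O2 -c acados_solver_model.c -O0"
    print(effective_opt_level(cmd))  # -O0: the trailing flag wins
    print(arch_flags(cmd))           # []: no architecture flags given
```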

Alternatively, it could be that some of the solver's data or workspace doesn't fit into the cache.

Is the behavior of the solver the same, apart from runtime?
For example, if many more QP iterations are required on the target, that would explain an increase in runtime.
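The iteration question can be made quantitative by splitting the measured time into parts. acados reports statistics along these lines (e.g. `time_lin`, `time_qp`, `qp_iter`), but check the exact names for your version and interface. The `SolveStats` container and the numbers below are hypothetical and only illustrate the arithmetic. If the QP iteration counts match on both platforms but the time per QP iteration is much larger on the target, the solver is not behaving differently. The likely causes are then build configuration, cache effects, or clock speed.

```python
from dataclasses import dataclass


@dataclass
class SolveStats:
    """Timing of one SQP_RTI call (hypothetical container for illustration)."""
    lin_ms: float   # time spent linearizing / evaluating the model
    qp_ms: float    # time spent inside the QP solver
    qp_iters: int   # number of QP iterations in that call


def compare_platforms(host: SolveStats, target: SolveStats) -> dict:
    """Ratios target/host for each part of the solve time."""
    for s in (host, target):
        if s.qp_iters <= 0 or s.lin_ms <= 0 or s.qp_ms <= 0:
            raise ValueError("timings and iteration counts must be positive")
    per_iter_host = host.qp_ms / host.qp_iters
    per_iter_target = target.qp_ms / target.qp_iters
    return {
        # > 1 means the target needs more QP iterations (solver behaves differently)
        "qp_iterations": target.qp_iters / host.qp_iters,
        # > 1 with equal iteration counts points at slower linear algebra
        # (compiler flags, BLASFEO target, cache effects, clock speed)
        "time_per_qp_iteration": per_iter_target / per_iter_host,
        # slowdown of the model/sensitivity evaluation (typically generated code)
        "linearization": target.lin_ms / host.lin_ms,
    }


if __name__ == "__main__":
    # Made-up numbers, only to show how the result reads.
    host = SolveStats(lin_ms=2.0, qp_ms=3.0, qp_iters=10)
    target = SolveStats(lin_ms=14.0, qp_ms=21.0, qp_iters=10)
    print(compare_platforms(host, target))
```

Looking at linearization and QP time separately also shows whether the slowdown sits in the model evaluation or in the QP solver.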

Just to get an idea: What are your state and control dimensions?

Best,
Jonathan

Hello Jonathan,

I’m sorry for the late reply.
As you advised, I will check the BLASFEO_TARGET setting first.
Apart from the solve time, the exit flag and the control action are identical on both platforms, so I'll double-check whether I missed anything.

The model currently in use has 15 states and 8 control inputs.
It is strongly nonlinear, but with SQP_RTI the problem is solved within 10 ms on an AM4 Ryzen CPU.
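With 15 states, 8 controls and N = 100, a back-of-the-envelope estimate gives the order of magnitude of the data the QP works on. This is a crude sketch under simplifying assumptions: double precision and dense stage blocks. It ignores constraint matrices, vectors, HPIPM/BLASFEO workspace and padding, and it ignores how partial condensing regroups stages. The real footprint is therefore larger than this figure. The result, roughly 0.7 MB, can be compared with the cache sizes in the datasheets of both CPUs.

```python
def qp_data_bytes(nx: int, nu: int, n_horizon: int, bytes_per_value: int = 8) -> int:
    """Rough size of the stage-wise QP matrices, in bytes.

    Counts per stage k = 0..N-1: dynamics Jacobians A (nx x nx) and
    B (nx x nu) plus a dense stage Hessian ((nx+nu) x (nx+nu)); and a
    terminal Hessian (nx x nx). Constraint matrices, vectors, solver
    workspace, factorizations and memory padding are NOT included.
    """
    if min(nx, nu, n_horizon) < 0:
        raise ValueError("dimensions must be non-negative")
    nv = nx + nu
    per_stage = nx * nx + nx * nu + nv * nv
    terminal = nx * nx
    return (n_horizon * per_stage + terminal) * bytes_per_value


if __name__ == "__main__":
    size = qp_data_bytes(nx=15, nu=8, n_horizon=100)
    print(f"{size} bytes (~{size / 1024:.0f} KiB)")  # 701000 bytes (~685 KiB)
```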

Thank you for your advice.

Best,
Jeongsu