Just to be clear, in theory acados would already work on a Cortex-M4. Since the hardware does not support double-precision (DP) floating point, the compiler generates calls to software emulation routines instead (generally implemented using integer operations). This means that nothing special has to be done on your side, but performance will be much lower (I would guess an order of magnitude, maybe more) than for the same operations performed in hardware in single precision (SP). Maybe this is fast enough for you.
The alternative for more speed would be to use SP instead of DP. But your particular application (unstable dynamics around the upright position of the pendulum, together with long horizons) makes this alternative most likely infeasible in practice, at least with second-order optimization methods.
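To give a rough feeling for why unstable dynamics and long horizons are hard on SP, here is a minimal sketch. It is not acados code: it propagates a toy scalar system x_{k+1} = a * x_k, with a = 1.1 chosen arbitrarily as a stand-in for an unstable mode, and compares SP and DP against exact rational arithmetic. SP is emulated by rounding every result through a 32-bit float with `struct`; the helper names are made up for this illustration.

```python
import struct
from fractions import Fraction


def to_f32(x: float) -> float:
    """Round a Python float (DP) to the nearest SP value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def propagate(a: float, n_steps: int, single: bool) -> float:
    """Iterate x_{k+1} = a * x_k from x_0 = 1 (toy unstable system).

    With single=True, a and every intermediate result are rounded to SP.
    """
    if single:
        a = to_f32(a)
    x = 1.0
    for _ in range(n_steps):
        x = a * x
        if single:
            x = to_f32(x)
    return x


def abs_error(x: float, n_steps: int) -> float:
    """Absolute error of x against the exact value (11/10)**n_steps."""
    exact = Fraction(11, 10) ** n_steps
    return float(abs(Fraction(x) - exact))


# Assumed horizon lengths, only for illustration.
for n in (10, 50, 100):
    err_sp = abs_error(propagate(1.1, n, single=True), n)
    err_dp = abs_error(propagate(1.1, n, single=False), n)
    print(f"N={n:4d}  SP error={err_sp:.3e}  DP error={err_dp:.3e}")
```

The absolute error grows with the horizon in both precisions, but SP starts with far fewer digits to lose. The sketch only covers open-loop error growth; how this interacts with the linear algebra inside a second-order method is beyond what it shows.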
Hope this helps to clarify the situation.