GC Conventions
caller site should ensure the lifetime of GC typed arguments during call and call_indirect.
Fast Lower
Fast lowering passes focus on speed.
- For each locals it will be statically assigned to shadow stack slot according to local index.
- For temporary value in operand stack, which can only happened for
callorcall_indirect, it will be assigned according to stack-based counter.
WARNING
It is possible to have nested call. We should make sure the slot of first argument is different with the second and the third.
(call $foo
(call $tostack (local.get $arg0))
(call $bar
(call $tostack (local.get $arg1))
(call $tostack (local.get $arg2))
)
)Opt Lower
Optimized lowering passes will analysis the object liveness and dom tree, which cost more time then fast lower but produce better output.
Clean tostack Call
Some tmp_to_stack call is unless and can be removed.
tmp_to_stacka immutable global. (ImmutableGlobalToStackRemover)
Analyze Object Liveness
It will convert each object to SSA. analyze liveness of each SSA object.
The liveness information is important for later passes.
Filter Leaf Function
GC will be triggered only when __new and __collect is called. If there is no such call in the life cycle of an ssa object, then we do not need to store it on the stack.
Assign Shadow Stack Position
TODO
Shrink Wrap
Shrink wrapping is an optimization technique that moves stack frame setup and teardown operations closer to where they are actually needed, rather than placing them at function entry and exit points.
Benefits
- Reduces overhead for early function returns that don't use stack
- Improves performance for functions with multiple exit paths
Algorithm
WARPO will calculate prologue and epilogue following those rules
- prologue dominate all shadow stack usages.
- epilogue post-dominate all shadow stack usages.
- prologue and epilogue are not inside loop. (avoid cpu cost in high frequency loop)
- prologue dominate epilogue (before execute to epilogue, prologue must be executed).
- epilogue post-dominate prologue (after executed prologue, epilogue must be executed).
- prologue are not entry basic block. (make no sense)
- epilogue are not exit basic block. (make no sense)
ToStackReplacer
accept each call mapped stack offset as inputs.
replace
(call $__tmptostack (...))with
(i32.store offset=... (global.get $sp) (local.tee $tmp (...)))
(local.get $tmp)PrologEpilogInserter
insert prologue and epilogue
wrapper original function body
(global.set $sp (i32.sub (global.get $sp) (i32.const <MAX_OFFSET>)))
(local.set $tmp (<RESULT>))
(global.set $sp (i32.add (global.get $sp) (i32.const <MAX_OFFSET>)))
(return (local.get $tmp))PostLower
implement
~lib/rt/__decrease_sp.~lib/rt/__increase_sp.
Make Decision
Why we create function for each offset?
In PR#60, we change the implement of tostack.
The previous emitted code is
call $xxx#constructor
i32.const <OFFSET>
call $~lib/rt/__tostackwith function
(func $~lib/rt/__tostack (param $value i32) (param $offset i32) (result i32)
global.get $~lib/memory/__stack_pointer
local.get $offset
i32.add
local.get $value
i32.store $0 align=1
local.get $value
)It cannot be optimized well even subsequent passes will inline $~lib/rt/__tostack and do constant propagation for $offset.
After PR, the code will change to
call $xxx#constructor
call $~lib/rt/__tostack<OFFSET>with function
(func $~lib/rt/__tostack<OFFSET> (param $value i32) (result i32)
global.get $~lib/memory/__stack_pointer
local.get $value
i32.store $0 align=1 offset=<OFFSET>
local.get $value
)Then it is better than previous one since we can merge the add and store instruction.
In wasm's world, offset= is not equal to i32.add. since when operate ptr with i32.add, we need to consider integer overflow but with offset= not.
Why we inline tostack?
Even though ~lib/rt/__tostack<OFFSET> is better then ~lib/rt/__tostack. However, in the subsequent optimization process, if tostack cannot be inlined, it will interfere with the subsequent constant propagation optimization.
for example, binaryen opt passes cannot propagate constant for below bytecode. Because most of opt passes is intra-procedural. It does not know the result of tostack is always same as the first parameter.
(local.set $0 (call $tostack<0> (i32.const 10)))
(local.set $1 (i32.add (local.get $0) (i32.const 10))) ;; cannot be optimized to (local.set (i32.const 20))