[RFC] ldp/stp peephole optimizations

[RFC] ldp/stp peephole optimizations

Zhongwei Yao
Hi,

We are planning to add AArch64 LDP/STP (load/store pair of registers)
support to C2 code generation for better performance. I think LDP/STP
can be used in the following cases:
A). Register spill/unspill. We have observed many sequential single
stack load/store patterns in SPECjbb C2-generated code.
B). Besides spilling, LDP is generally not generated for multiple
LoadI/LoadL nodes either. Is there any risk (e.g. implicit check?) in
combining them, apart from the alignment issue?

I think a peephole pass is the best fit for the above optimizations
(gcc/llvm also have such peephole optimizations). However, the current
peephole rules in the C2 compiler are very limited, and I doubt whether
they really take effect: AArch64 has peephole optimizations disabled,
and x86 has them enabled, but the instruction sequences matched by its
rules seem to be very uncommon.

To address issue A): since spill/unspill is currently handled by the
common MachSpillCopyNode, I considered adding a peephole rule that
matches MachSpillCopyNode. But MachSpillCopyNode has no operands (e.g.
mem, src, dst) like an ordinary instruct defined in aarch64.ad. Even
if we extract them (mem, src, dst) as
MachSpillCopyNode::implementation() does, and even if we extend the
current peephole rule grammar, expressing such extraction in the
peephole grammar is complex.
So I prefer adding the following manually defined method peephole() to
MachSpillCopyNode:

    virtual MachNode *peephole(Block *block, int block_index,
                               PhaseRegAlloc *ra_, int &deleted);

This makes the patch relatively simple. My prototype patch for A) (it
still has some TODOs and hard-coded values, but it works fine):
    http://cr.openjdk.java.net/~zyao/RFC_A/

Addressing issue B) is somewhat more complicated: we need to extend the
current peephole rule syntax, as I don't think the current simple syntax
works for any useful peephole optimization like the ldp/stp one.

My extended syntax, which at least works for the ldp/stp optimizations:

------
  peepmatch ( loadI loadI );
  peepconstraint (0.mem$base == 1.mem$base, 0.mem$scale == 1.mem$scale,
                  0.mem$disp - 4 == 1.mem$disp, 0.dst != 1.dst); // new grammar is described below.
  peepreplace (loadPairI(1.mem 1.mem))
------

But loadPairI is hard to express with the current instruct semantics,
because an instruct in aarch64.ad is defined by a match rule, and the
match rule is an expression tree made of Ideal nodes.
However, the LDP instruction has no Ideal node (say, LoadPair) to
match, and adding a load-pair node to the arch-independent Ideal nodes
seems strange.

My proposed solution is to add a special arch-dependent operand like iRegIpair:

------
  operand iRegIpair(iRegI reg1, iRegI reg2)
  %{
   constraint(ALLOC_IN_RC(any_reg32));
   op_cost(0);
   format %{ "pair: reg1, reg2"%}; // hard coded format for now.
   interface(REG_INTER);
  %}
------

This requires updating ADLC to support the iRegIpair operand, because
unlike current operands, which have one register, iRegIpair has two.

Then use it as loadPairI's operand type, like:

------
instruct loadPairI(indOffI mem, iRegIpair dst)
%{
  match(Set dst mem); //no Ideal Node in match rule.
  ...

%}
------

Then we can use loadPairI in peephole rule's "peepreplace".

Peephole rules currently support only constraints between operands.
But to check whether two adjacent loads read from adjacent memory
addresses, we need to check an operand's members, as in (0.mem$disp -
4 == 1.mem$disp). My solution is to add new grammar like 0.mem$disp to
extract a member of an operand in ADLC (peep_constraint_parse()).
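
As a rough illustration of what the extended peepconstraint would have to
check, here is a minimal C++ sketch. LoadDesc and can_pair_loadI are
hypothetical names made up for this example; they are not ADLC-generated
classes, and the sketch only mirrors the four conditions written in the
rule above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of one loadI; not an ADLC-generated class.
struct LoadDesc {
  int     base;   // base register number of the memory operand
  int     scale;  // scale field of the memory operand
  int32_t disp;   // byte displacement of the memory operand
  int     dst;    // destination register number
};

// Mirrors the rule: 0.mem$base == 1.mem$base, 0.mem$scale == 1.mem$scale,
// 0.mem$disp - 4 == 1.mem$disp, 0.dst != 1.dst.
// insn0 and insn1 follow the numbering used in the peephole rule.
inline bool can_pair_loadI(const LoadDesc& insn0, const LoadDesc& insn1) {
  return insn0.base == insn1.base &&
         insn0.scale == insn1.scale &&
         insn0.disp - 4 == insn1.disp &&
         insn0.dst != insn1.dst;
}
```

With two loads from [x1, #4] and [x1, #8] into different registers the
predicate holds; it fails if the displacements are swapped or the
destinations are equal.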

Another issue with the peephole optimization is that it only matches
adjacent instructions in the same basic block. This leads to many
missed matches when loads are not scheduled next to each other.
So I propose delaying the peephole phase until just before final code
emission (the fill_buffer() function). This point is after
instruction scheduling, so we could match more adjacent loads.

My draft patch to address B) is at:
  http://cr.openjdk.java.net/~zyao/RFC_B/

What do you think? Any feedback is welcome!

--
Best regards,
Zhongwei

Re: [aarch64-port-dev ] [RFC] ldp/stp peephole optimizations

Dmitry Chuyko-2
Hi Zhongwei,

I'm not a reviewer. Thank you, it is a great idea to merge loads or
stores that are issued sequentially. Can you share some more data on that?

Is there a micro-benchmark or some sample program that shows better
performance on some hardware?
What are the numbers observed?
You mention SPECjbb, is it 2005 or 2015? Which configuration?
In SPECjbb do you see most sequential series only related to stack work?

-Dmitry


Re: [aarch64-port-dev ] [RFC] ldp/stp peephole optimizations

Andrew Haley
In reply to this post by Zhongwei Yao
Hi,

On 22/12/17 08:02, Zhongwei Yao wrote:

> My draft patch to address B) is at:
>   http://cr.openjdk.java.net/~zyao/RFC_B/
>
> What do you think? Welcome any feedback!

I wonder if merging ld/st pairs could be handled by the
MacroAssembler.  MachSpillCopyNode::peephole looks very complicated.
If you handled ldp/stp conversion in MacroAssembler then it'd work
everywhere, for any adjacent pair of loads or stores, not just for
spills in C2.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: [aarch64-port-dev ] [RFC] ldp/stp peephole optimizations

Andrew Haley
In reply to this post by Dmitry Chuyko-2
On 22/12/17 11:45, Dmitry Chuyko wrote:
> Is there a micro-benchmark or some sample program than shows better
> performance on some hardware?
> What are the numbers observed?

It's hard to do that in a meaningful way with a micro-benchmark
because the advantages of code size reduction don't apply
except with larger programs.  However, anything that reduces code
size without adversely affecting anything else is worth having.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: [aarch64-port-dev ] [RFC] ldp/stp peephole optimizations

Zhongwei Yao
In reply to this post by Dmitry Chuyko-2
Hi, Dmitry,

Thanks for your feedback!

As for performance, I have not noticed any change in SPECjbb2015
with my patches. But as Andrew said, it is worth having at least for
the better code size.

--
Best regards,
Zhongwei

Re: [aarch64-port-dev ] [RFC] ldp/stp peephole optimizations

Zhongwei Yao
In reply to this post by Andrew Haley
Hi, Andrew,

Thanks for your review!

On 23 December 2017 at 00:02, Andrew Haley <[hidden email]> wrote:

> Hi,
>
> On 22/12/17 08:02, Zhongwei Yao wrote:
>
>> My draft patch to address B) is at:
>>   http://cr.openjdk.java.net/~zyao/RFC_B/
>>
>> What do you think? Welcome any feedback!
>
> I wonder if merging ld/st pairs could be handled by the
> MacroAssembler.

I was also thinking about merging them in the assembler. My concern was
that an assembler usually does not do optimisation.

However, I took a quick look and I think it should be doable in the
assembler. For example, we can merge ldr in the assembler's ldr
instruction definition by checking whether the previous instruction
meets the constraints. The previous instruction can be recorded in the
Instruction_aarch64 class's destructor (if it is a ld/st instruction,
record it; if not, clear it).
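
The record-and-merge idea described above could look roughly like the
following toy C++ sketch. ToyAssembler and all of its members are
hypothetical and are not HotSpot classes; instructions are kept as text so
that the merge is visible, and only 32-bit loads in ascending address order
are handled. A real implementation would also have to clear the record at
labels and other points where the previous instruction may not execute
first.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy model only: ToyAssembler is a hypothetical stand-in, not a HotSpot class.
class ToyAssembler {
 public:
  // Emit "ldr w<rt>, [x<base>, #<offset>]", or fold it into the previous
  // load as an ldp when the two loads read adjacent words.
  void ldrw(int rt, int base, int32_t offset) {
    if (_have_last && _last_base == base && _last_offset + 4 == offset &&
        _last_rt != rt &&    // ldp needs two distinct destination registers
        _last_rt != base &&  // the first load must not overwrite the base
        fits_ldp32(_last_offset)) {
      _code.back() = "ldp w" + std::to_string(_last_rt) + ", w" +
                     std::to_string(rt) + ", [x" + std::to_string(base) +
                     ", #" + std::to_string(_last_offset) + "]";
      _have_last = false;    // an already merged pair is not merged again
      return;
    }
    _code.push_back("ldr w" + std::to_string(rt) + ", [x" +
                    std::to_string(base) + ", #" + std::to_string(offset) + "]");
    _have_last = true;
    _last_rt = rt;
    _last_base = base;
    _last_offset = offset;
  }

  // Any other instruction clears the record ("if not, clear it").
  void other(const std::string& text) {
    _code.push_back(text);
    _have_last = false;
  }

  const std::vector<std::string>& code() const { return _code; }

 private:
  // 32-bit ldp takes a signed 7-bit immediate scaled by 4.
  static bool fits_ldp32(int32_t offset) {
    return offset % 4 == 0 && offset >= -256 && offset <= 252;
  }

  std::vector<std::string> _code;
  bool _have_last = false;
  int _last_rt = 0;
  int _last_base = 0;
  int32_t _last_offset = 0;
};
```

Calling ldrw(1, 29, 16) followed by ldrw(2, 29, 20) leaves a single
"ldp w1, w2, [x29, #16]" in the buffer, while an intervening call to
other() prevents the merge.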

What do you think? If it is OK, I'll work out a prototype for merging
ldr in assembler.


> MachSpillCopyNode::peephole looks very complicated.
> If you handled ldp/stp conversion in MacroAssembler then it'd work
> everywhere, for any adjacent pair of loads or stores, not just for
> spills in C2.
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com>
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



--
Best regards,
Zhongwei

Re: [aarch64-port-dev ] [RFC] ldp/stp peephole optimizations

Andrew Haley
On 26/12/17 04:24, Zhongwei Yao wrote:

> I was also thinking about merging it in assembler. My concern was that
> assembler usually does not do optimisation.
>
> However, I've taken a quick check and I think it should be doable in
> assembler. For example, we can merge ldr in assembler's ldr instruct
> definition by checking if the previous instruct meets the constraints.
> For the previous instruction, we can record it in Instruction_aarch64
> class's destructor (if it is ld/st instruction, record it. if not,
> clear it.).
>
> What do you think? If it is OK, I'll work out a prototype for merging
> ldr in assembler.

Try doing it in MacroAssembler.  MacroAssembler::membar is an example of
where we already merge instructions.
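
For readers unfamiliar with that kind of emit-time merging, the general
pattern can be sketched as below. This is not the HotSpot source:
ToyBarrierAssembler, ToyInsn and the BarrierBits values are hypothetical,
and the sketch assumes that two adjacent barriers can be combined by OR-ing
their ordering constraints.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical bit values for the ordering constraints of a barrier.
enum BarrierBits : uint32_t {
  LoadLoad = 1, LoadStore = 2, StoreLoad = 4, StoreStore = 8
};

// Toy instruction: either a barrier with a constraint mask, or anything else.
struct ToyInsn {
  bool     is_barrier;
  uint32_t kind;  // meaningful only when is_barrier is true
};

class ToyBarrierAssembler {
 public:
  // Emit a barrier, or widen the previous one if it was the last thing emitted.
  void membar(uint32_t order_constraint) {
    if (!_code.empty() && _code.back().is_barrier) {
      _code.back().kind |= order_constraint;  // merge instead of emitting
    } else {
      _code.push_back({true, order_constraint});
    }
  }

  // Any non-barrier instruction ends the merge window.
  void other() { _code.push_back({false, 0}); }

  const std::vector<ToyInsn>& code() const { return _code; }

 private:
  std::vector<ToyInsn> _code;
};
```

The same shape (look at the last emitted instruction, rewrite it in place if
the new one can be folded in, otherwise emit normally) is what an ldp/stp
merge in the assembler would follow.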

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: [aarch64-port-dev ] [RFC] ldp/stp peephole optimizations

Zhongwei Yao
OK, I'll try doing it in MacroAssembler.

On 26 December 2017 at 18:13, Andrew Haley <[hidden email]> wrote:

> On 26/12/17 04:24, Zhongwei Yao wrote:
>> I was also thinking about merging it in assembler. My concern was that
>> assembler usually does not do optimisation.
>>
>> However, I've taken a quick check and I think it should be doable in
>> assembler. For example, we can merge ldr in assembler's ldr instruct
>> definition by checking if the previous instruct meets the constraints.
>> For the previous instruction, we can record it in Instruction_aarch64
>> class's destructor (if it is ld/st instruction, record it. if not,
>> clear it.).
>>
>> What do you think? If it is OK, I'll work out a prototype for merging
>> ldr in assembler.
>
> Try doing it in MacroAssembler.  MacroAssembler::membar is an example of
> where we already merge instructions.
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com>
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



--
Best regards,
Zhongwei

RE: [aarch64-port-dev ] [RFC] ldp/stp peephole optimizations

White, Derek
In reply to this post by Zhongwei Yao
Hi Zhongwei,

Great idea!

Here are some small comments on part A, stack spilling. I know you're thinking about a different approach to part "B", so I'm not sure if that means you're redoing "A" as well. These comments may only partially apply.

In aarch64.ad, MachSpillCopyNode::peephole(), around lines 3416-3419:
+  if (((src_lo_rc == rc_stack && dst_lo_rc == rc_int)
+      || (dst_lo_rc == rc_stack && src_lo_rc == rc_int))
+      && ((inst1_src_lo_rc == rc_stack && inst1_dst_lo_rc == rc_int)
+          || (inst1_dst_lo_rc == rc_stack && inst1_src_lo_rc == rc_int)))

This seems to match a lo-spill and a hi-unspill (or the reverse), instead of only both being spills or both unspills.
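
A small C++ sketch of the concern, under the assumption that nothing else
in the patch filters the direction (that is not visible in the fragment
above). The RC names follow the quoted code; posted_condition and
same_direction are hypothetical helper names used only for this example.

```cpp
#include <cassert>

// Register classes, named as in the quoted fragment.
enum RC { rc_bad, rc_int, rc_float, rc_stack };

inline bool is_spill(RC src, RC dst)   { return src == rc_int && dst == rc_stack; }
inline bool is_unspill(RC src, RC dst) { return src == rc_stack && dst == rc_int; }

// The condition as posted: each instruction is a spill OR an unspill,
// but the two directions are not tied together.
inline bool posted_condition(RC src, RC dst, RC inst1_src, RC inst1_dst) {
  return ((src == rc_stack && dst == rc_int) ||
          (dst == rc_stack && src == rc_int)) &&
         ((inst1_src == rc_stack && inst1_dst == rc_int) ||
          (inst1_dst == rc_stack && inst1_src == rc_int));
}

// A stricter variant: both instructions are spills, or both are unspills.
inline bool same_direction(RC src, RC dst, RC inst1_src, RC inst1_dst) {
  return (is_spill(src, dst) && is_spill(inst1_src, inst1_dst)) ||
         (is_unspill(src, dst) && is_unspill(inst1_src, inst1_dst));
}
```

For a spill followed by an unspill, posted_condition is true while
same_direction is false, which is the mismatch described above.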

Also, I think it would be good to factor out the ldp/stp offset range checks into a separate function (inlined). Then use that predicate, instead of inlined checks such as the new ones you added in spill() and unspill() in macroAssembler_aarch64.hpp, as well as the slightly odd pre-existing checks in spill_copy128().

And although these checks are only testing the upper-bound (which is OK for stack offsets), more generally we'd want to check both lower and upper bounds, and you could probably make use of the predicate in the part-B patch too.
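
Such a predicate might look like the following sketch. The name
ldp_stp_offset_ok is hypothetical; the bounds assume the signed-offset form
of LDP/STP for general-purpose registers, whose 7-bit signed immediate is
scaled by the register size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if `offset` can be encoded in the signed-offset
// form of ldp/stp. size_in_bytes is 4 for W registers and 8 for X registers.
inline bool ldp_stp_offset_ok(int64_t offset, int size_in_bytes) {
  const int64_t lo = -64 * static_cast<int64_t>(size_in_bytes);  // imm7 = -64
  const int64_t hi =  63 * static_cast<int64_t>(size_in_bytes);  // imm7 = 63
  return offset % size_in_bytes == 0 && offset >= lo && offset <= hi;
}
```

For 64-bit pairs this accepts multiples of 8 from -512 to 504; for 32-bit
pairs, multiples of 4 from -256 to 252. Checking both bounds in one place
would let spill(), unspill(), spill_copy128() and the part-B code share it.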

 - Derek
