Faster Vector128/64 compare on arm64 #75864

EgorBo · 2022-09-19T22:05:37Z

Apply @TamarChristinaArm's suggestions for faster vector comparison in #75849

bool Test1(Vector128<int> a, Vector128<int> b) => a == b;
bool Test2(Vector64<float> a, Vector64<float> b) => a != b;

Now emits:

; Method Tests:Test1
G_M48391_IG01:              
        A9BF7BFD          stp     fp, lr, [sp, #-0x10]!
        910003FD          mov     fp, sp
G_M48391_IG02:              
        6EA18C10          cmeq    v16.4s, v0.4s, v1.4s
-       6E31AA10          uminv   b16, v16.16b
-       0E013E00          umov    w0, v16.b[0]
-       7100001F          cmp     w0, #0
-       9A9F07E0          cset    x0, ne
+       6EB0AE10          uminp   v16.4s, v16.4s, v16.4s
+       4E083E00          umov    x0, v16.d[0]
+       B100041F          cmn     x0, #1
+       9A9F17E0          cset    x0, eq
G_M48391_IG03:              
        A8C17BFD          ldp     fp, lr, [sp], #0x10
        D65F03C0          ret     lr
; Total bytes of code: 36


; Method Tests:Test2
G_M64388_IG01:              
        A9BF7BFD          stp     fp, lr, [sp, #-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M64388_IG02:              
        0E21E410          fcmeq   v16.2s, v0.2s, v1.2s
-       2E31AA10          uminv   b16, v16.8b
-       0E013E00          umov    w0, v16.b[0]
-       7100001F          cmp     w0, #0
-       9A9F17E0          cset    x0, eq
+       4E083E00          umov    x0, v16.d[0]
+       B100041F          cmn     x0, #1
+       9A9F07E0          cset    x0, ne
G_M64388_IG03:              
        A8C17BFD          ldp     fp, lr, [sp], #0x10
        D65F03C0          ret     lr
-; Total bytes of code: 36
+; Total bytes of code: 32
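
For readers less familiar with the NEON sequence, the new Test1 codegen can be modelled in scalar C#. This is only an illustrative sketch, not the JIT's implementation: the class and method names are hypothetical, and lanes are modelled as plain `uint` arrays. `cmeq` yields an all-ones lane for every equal pair, `uminp` with identical operands folds the four 32-bit lanes into the low 64 bits, and `cmn x0, #1` then tests whether those 64 bits are all ones.

```csharp
using System;

// Hypothetical scalar model of the new 128-bit equality sequence.
static class Vector128CompareSketch
{
    // cmeq v16.4s, v0.4s, v1.4s: all-ones where lanes are equal, zero otherwise.
    static uint[] CompareEqualMask(uint[] a, uint[] b)
    {
        var mask = new uint[4];
        for (int i = 0; i < 4; i++)
            mask[i] = a[i] == b[i] ? uint.MaxValue : 0u;
        return mask;
    }

    // uminp v16.4s, v16.4s, v16.4s followed by umov x0, v16.d[0]:
    // with both operands the same, the low 64 bits hold
    // min(lane0, lane1) and min(lane2, lane3).
    static ulong PairwiseMinLow64(uint[] m)
    {
        uint lo = Math.Min(m[0], m[1]);
        uint hi = Math.Min(m[2], m[3]);
        return ((ulong)hi << 32) | lo;
    }

    // cmn x0, #1 sets the Z flag when x0 + 1 == 0, i.e. x0 is all ones;
    // cset x0, eq turns that flag into the boolean result.
    public static bool AllEqual(uint[] a, uint[] b)
        => PairwiseMinLow64(CompareEqualMask(a, b)) == ulong.MaxValue;
}
```

Under this model, `AllEqual` returns true only when every lane matches, since a single zero lane pulls one of the two pairwise minimums down to zero.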

ghost · 2022-09-19T22:05:49Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Apply @TamarChristinaArm's suggestions for faster vector comparison in #75849

bool Test1(Vector128<int> a, Vector128<int> b) => a == b;

Now emits:

; Method Test1
G_M3164_IG01:              
        A9BF7BFD          stp     fp, lr, [sp, #-0x10]!
        910003FD          mov     fp, sp
G_M3164_IG02:             
        6EA18C10          cmeq    v16.4s, v0.4s, v1.4s
        6EB0AE10          uminp   v16.4s, v16.4s, v16.4s
        4E083E00          umov    x0, v16.d[0]
        B100041F          cmn     x0, #1
        9A9F17E0          cset    x0, eq
G_M3164_IG03:              
        A8C17BFD          ldp     fp, lr, [sp], #0x10
        D65F03C0          ret     lr
; Total bytes of code: 36

Author:	EgorBo
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

TamarChristinaArm · 2022-09-19T22:12:28Z

G_M64388_IG02:              
        0E21E410          fcmeq   v16.2s, v0.2s, v1.2s
        0E043E00          umov    w0, v16.s[0]
        12800001          movn    w1, #0
        6B01001F          cmp     w0, w1
        9A9F07E0          cset    x0, ne

That doesn't look right. The 64-bit case should transfer the entire register, so I'd expect the same d[0] transfer to an x register and the same compare as in the 128-bit case, just without the compression step. It looks like this ignores the top 32 bits.
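
The problem described above can be reproduced with a small scalar model in C#. This is a sketch under assumptions: the class and method names are hypothetical, and the two `float` lanes of a Vector64 are modelled as an array. Reading only `s[0]` into a `w` register inspects lane 0 alone, so a difference confined to lane 1 goes unnoticed; reading `d[0]` into an `x` register covers both lanes.

```csharp
using System;

// Hypothetical scalar model of the Vector64<float> "a != b" sequences.
static class Vector64CompareSketch
{
    // fcmeq v16.2s, v0.2s, v1.2s: each 32-bit lane is all-ones when equal.
    static ulong EqualMask(float[] a, float[] b)
    {
        uint lane0 = a[0] == b[0] ? uint.MaxValue : 0u;
        uint lane1 = a[1] == b[1] ? uint.MaxValue : 0u;
        return ((ulong)lane1 << 32) | lane0;
    }

    // Models umov w0, v16.s[0]: only the low 32 bits reach the compare.
    public static bool NotEqualLow32Only(float[] a, float[] b)
        => unchecked((uint)EqualMask(a, b)) != uint.MaxValue;

    // Models umov x0, v16.d[0]; cmn x0, #1; cset x0, ne: all 64 bits are checked.
    public static bool NotEqualFull64(float[] a, float[] b)
        => EqualMask(a, b) != ulong.MaxValue;
}
```

With `a = {1f, 2f}` and `b = {1f, 3f}`, the 32-bit variant reports the vectors as equal while the 64-bit variant correctly reports them as different.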

EgorBo · 2022-09-19T22:13:12Z

Ah, so I thought, let me fix it 🙂

TamarChristinaArm · 2022-09-19T22:31:11Z

I assume the diff for the 128-bit one is reversed? Otherwise, that looks good to me now.

EgorBo · 2022-09-19T22:32:08Z

Yes, should be correct now, thanks!

EgorBo · 2022-09-20T15:06:39Z

@kunalspathak PTAL this one too please

kunalspathak

LGTM

EgorBo · 2022-09-21T22:07:27Z

@TamarChristinaArm it's too early to estimate the whole impact but I already see nice improvements from this, e.g.:

And even more here:

TamarChristinaArm · 2022-09-22T02:08:36Z

Naise!

DrewScoggins · 2022-09-22T16:34:43Z

Improvements: dotnet/perf-autofiling-issues#8660

Faster Vector compare on arm64

97b8eba

ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Sep 19, 2022

ghost assigned EgorBo Sep 19, 2022

Address feedback

e0f100c

kunalspathak approved these changes Sep 20, 2022

EgorBo merged commit bff967b into dotnet:main Sep 20, 2022

EgorBo mentioned this pull request Sep 20, 2022

Inefficient Vector128.Equals* comparisons on Arm64 #75849

Closed

EgorBo deleted the arm64-faster-vector-cmp branch September 20, 2022 15:57

DenisYaroshevskiy mentioned this pull request Sep 21, 2022

[FEATURE] Arm, faster all/any jfalcou/eve#1411

Closed

EgorBo mentioned this pull request Sep 22, 2022

Improve "vec == Vector128<>.Zero" #75999

Merged

TamarChristinaArm mentioned this pull request Sep 22, 2022

[Mono][Arm64]Improve the generated code for Vector128.Equals* on Arm64 #69325

Closed

This was referenced Sep 29, 2022

[Perf] Windows/arm64: 10 Improvements on 9/22/2022 11:37:30 PM dotnet/perf-autofiling-issues#8767

Closed

[Perf] Windows/arm64: 10 Improvements on 9/20/2022 7:22:22 PM dotnet/perf-autofiling-issues#8758

Closed

ghost locked as resolved and limited conversation to collaborators Oct 22, 2022
