Skip to content

Faster Vector128/64 compare on arm64 #75864

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 20, 2022
Merged

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Sep 19, 2022

Apply @TamarChristinaArm's suggestions for faster vector comparison in #75849

bool Test1(Vector128<int> a, Vector128<int> b) => a == b;
bool Test2(Vector64<float> a, Vector64<float> b) => a != b;

Now emits:

; Method Tests:Test1
G_M48391_IG01:              
        A9BF7BFD          stp     fp, lr, [sp, #-0x10]!
        910003FD          mov     fp, sp
G_M48391_IG02:              
        6EA18C10          cmeq    v16.4s, v0.4s, v1.4s
-       6E31AA10          uminv   b16, v16.16b
-       0E013E00          umov    w0, v16.b[0]
-       7100001F          cmp     w0, #0
-       9A9F07E0          cset    x0, ne
+       6EB0AE10          uminp   v16.4s, v16.4s, v16.4s
+       4E083E00          umov    x0, v16.d[0]
+       B100041F          cmn     x0, #1
+       9A9F17E0          cset    x0, eq
G_M48391_IG03:              
        A8C17BFD          ldp     fp, lr, [sp], #0x10
        D65F03C0          ret     lr
; Total bytes of code: 36


; Method Tests:Test2
G_M64388_IG01:              
        A9BF7BFD          stp     fp, lr, [sp, #-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M64388_IG02:              
        0E21E410          fcmeq   v16.2s, v0.2s, v1.2s
-       2E31AA10          uminv   b16, v16.8b
-       0E013E00          umov    w0, v16.b[0]
-       7100001F          cmp     w0, #0
-       9A9F17E0          cset    x0, eq
+       4E083E00          umov    x0, v16.d[0]
+       B100041F          cmn     x0, #1
+       9A9F07E0          cset    x0, ne
G_M64388_IG03:              
        A8C17BFD          ldp     fp, lr, [sp], #0x10
        D65F03C0          ret     lr
-; Total bytes of code: 36
+; Total bytes of code: 32

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 19, 2022
@ghost ghost assigned EgorBo Sep 19, 2022
@ghost
Copy link

ghost commented Sep 19, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Apply @TamarChristinaArm's suggestions for faster vector comparison in #75849

bool Test1(Vector128<int> a, Vector128<int> b) => a == b;

Now emits:

; Method Test1
G_M3164_IG01:              
        A9BF7BFD          stp     fp, lr, [sp, #-0x10]!
        910003FD          mov     fp, sp
G_M3164_IG02:             
        6EA18C10          cmeq    v16.4s, v0.4s, v1.4s
        6EB0AE10          uminp   v16.4s, v16.4s, v16.4s
        4E083E00          umov    x0, v16.d[0]
        B100041F          cmn     x0, #1
        9A9F17E0          cset    x0, eq
G_M3164_IG03:              
        A8C17BFD          ldp     fp, lr, [sp], #0x10
        D65F03C0          ret     lr
; Total bytes of code: 36
Author: EgorBo
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@TamarChristinaArm
Copy link
Contributor

G_M64388_IG02:              
        0E21E410          fcmeq   v16.2s, v0.2s, v1.2s
        0E043E00          umov    w0, v16.s[0]
        12800001          movn    w1, #0
        6B01001F          cmp     w0, w1
        9A9F07E0          cset    x0, ne

That doesn't look right, The 64-bit case should be transferring the entire register, so I'm expecting the same d[0] transfer here to an x register and the same compare as the 128-bit case, just without the compression step. Looks like this has ignored the top 32-bits.

@EgorBo
Copy link
Member Author

EgorBo commented Sep 19, 2022

G_M64388_IG02:              
        0E21E410          fcmeq   v16.2s, v0.2s, v1.2s
        0E043E00          umov    w0, v16.s[0]
        12800001          movn    w1, #0
        6B01001F          cmp     w0, w1
        9A9F07E0          cset    x0, ne

That doesn't look right, The 64-bit case should be transferring the entire register, so I'm expecting the same d[0] transfer here to an x register and the same compare as the 128-bit case, just without the compression step. Looks like this has ignored the top 32-bits.

Ah, so I thought, let me fix it 🙂

@TamarChristinaArm
Copy link
Contributor

I assume the diff for the 128-bit one is reversed? but otherwise that looks good to me now.

@EgorBo
Copy link
Member Author

EgorBo commented Sep 19, 2022

I assume the diff for the 128-bit one is reversed? but otherwise that looks good to me now.

Yes, should be correct now, thanks!

@EgorBo
Copy link
Member Author

EgorBo commented Sep 20, 2022

@kunalspathak PTAL this one too please

Copy link
Contributor

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@EgorBo EgorBo merged commit bff967b into dotnet:main Sep 20, 2022
@EgorBo EgorBo deleted the arm64-faster-vector-cmp branch September 20, 2022 15:57
@EgorBo
Copy link
Member Author

EgorBo commented Sep 21, 2022

@TamarChristinaArm it's too early to estimate the whole impact but I already see nice improvements from this, e.g.:
image

And even more here:

image

@TamarChristinaArm
Copy link
Contributor

Naise!

@DrewScoggins
Copy link
Member

Improvements: dotnet/perf-autofiling-issues#8660

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants