I Benchmarked the M7g Instance

2023.02.15


Hello, this is akky from the Delivery Department of the CX Business Division.

On February 13, M7g and R7g, new instance types powered by Graviton3, became generally available (GA).

https://aws.amazon.com/jp/about-aws/whats-new/2023/02/amazon-ec2-m7g-r7g-instances/

An article about the release has already been published on DevelopersIO:

M7g (general purpose) and R7g (memory optimized): new EC2 instances powered by AWS Graviton 3 have been released

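Compared with Graviton2, Graviton3 is said to deliver up to 25% better compute performance, 2x floating-point performance, and 2x cryptographic performance, but how does it perform in practice? I compared an M7g.large instance (Graviton3) with an M6g.large instance (Graviton2).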

Both have 2 vCPUs and 8 GB of RAM. I ran the tests in the Oregon region (us-west-2).

/proc/cpuinfo

M7g.large

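Compared with M6g, M7g reports many more extensions in its Features line. Graviton3 implements the Armv8.4-A instruction set while Graviton2 implements Armv8.2-A, and that difference shows up here.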

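The BogoMIPS values look odd at first glance (2100.00 on M7g versus 243.75 on M6g). On arm64 Linux this figure is derived from the system timer frequency rather than from the CPU clock, so it says nothing about relative performance.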

$ cat /proc/cpuinfo 
processor       : 0
BogoMIPS        : 2100.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd40
CPU revision    : 1

processor       : 1
BogoMIPS        : 2100.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd40
CPU revision    : 1

M6g.large

$ cat /proc/cpuinfo 
processor       : 0
BogoMIPS        : 243.75
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x3
CPU part        : 0xd0c
CPU revision    : 1

processor       : 1
BogoMIPS        : 243.75
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x3
CPU part        : 0xd0c
CPU revision    : 1
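
To make the difference between the two Features lines easier to see, here is a minimal Python sketch that lists the flags reported only on one side. The two strings are copied from the outputs above; the function name `feature_diff` is just a name I picked for illustration.

```python
# Feature flags copied from the two /proc/cpuinfo outputs above.
M7G_FEATURES = (
    "fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid "
    "asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm "
    "dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng"
)
M6G_FEATURES = (
    "fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid "
    "asimdrdm lrcpc dcpop asimddp ssbs"
)


def feature_diff(new, old):
    """Return the flags present in `new` but not in `old`, in `new`'s order."""
    old_flags = set(old.split())
    return [flag for flag in new.split() if flag not in old_flags]


added = feature_diff(M7G_FEATURES, M6G_FEATURES)
removed = feature_diff(M6G_FEATURES, M7G_FEATURES)
print(f"{len(added)} flags only on M7g:", " ".join(added))
print(f"{len(removed)} flags only on M6g:", " ".join(removed))
```

Every flag reported on M6g also appears on M7g, and the M7g-only flags include the ones relevant to the crypto test later in this article (sha3, sha512) as well as SVE.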

Unix Bench

This is a standard benchmark. It measures CPU and OS performance.

M7g.large

========================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: ip-172-31-5-119: GNU/Linux
   OS: GNU/Linux -- 5.15.0-1028-aws -- #32-Ubuntu SMP Mon Jan 9 12:29:05 UTC 2023
   Machine: aarch64 (aarch64)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0:  (2100.0 bogomips)
          
   CPU 1:  (2100.0 bogomips)
          
   01:56:19 up 8 min,  1 user,  load average: 0.09, 0.06, 0.01; runlevel 2023-02-15

------------------------------------------------------------------------
Benchmark Run: Wed Feb 15 2023 01:56:19 - 02:24:14
2 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       49993610.6 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     7679.0 MWIPS (9.9 s, 7 samples)
Execl Throughput                               2123.4 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        900104.6 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          247783.5 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       2661191.4 KBps  (30.0 s, 2 samples)
Pipe Throughput                             1391358.9 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 148827.8 lps   (10.0 s, 7 samples)
Process Creation                               5187.3 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   8162.8 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1635.7 lpm   (60.0 s, 2 samples)
System Call Overhead                         874665.4 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   49993610.6   4283.9
Double-Precision Whetstone                       55.0       7679.0   1396.2
Execl Throughput                                 43.0       2123.4    493.8
File Copy 1024 bufsize 2000 maxblocks          3960.0     900104.6   2273.0
File Copy 256 bufsize 500 maxblocks            1655.0     247783.5   1497.2
File Copy 4096 bufsize 8000 maxblocks          5800.0    2661191.4   4588.3
Pipe Throughput                               12440.0    1391358.9   1118.5
Pipe-based Context Switching                   4000.0     148827.8    372.1
Process Creation                                126.0       5187.3    411.7
Shell Scripts (1 concurrent)                     42.4       8162.8   1925.2
Shell Scripts (8 concurrent)                      6.0       1635.7   2726.2
System Call Overhead                          15000.0     874665.4    583.1
                                                                   ========
System Benchmarks Index Score                                        1304.0

------------------------------------------------------------------------
Benchmark Run: Wed Feb 15 2023 02:24:14 - 02:52:10
2 CPUs in system; running 2 parallel copies of tests

Dhrystone 2 using register variables      100238134.4 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    15359.6 MWIPS (9.9 s, 7 samples)
Execl Throughput                               3936.8 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks       1544096.3 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          436658.6 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       4763765.4 KBps  (30.0 s, 2 samples)
Pipe Throughput                             2781042.7 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 295290.1 lps   (10.0 s, 7 samples)
Process Creation                               9161.9 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  12216.4 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1660.8 lpm   (60.0 s, 2 samples)
System Call Overhead                        1747584.1 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0  100238134.4   8589.4
Double-Precision Whetstone                       55.0      15359.6   2792.7
Execl Throughput                                 43.0       3936.8    915.5
File Copy 1024 bufsize 2000 maxblocks          3960.0    1544096.3   3899.2
File Copy 256 bufsize 500 maxblocks            1655.0     436658.6   2638.4
File Copy 4096 bufsize 8000 maxblocks          5800.0    4763765.4   8213.4
Pipe Throughput                               12440.0    2781042.7   2235.6
Pipe-based Context Switching                   4000.0     295290.1    738.2
Process Creation                                126.0       9161.9    727.1
Shell Scripts (1 concurrent)                     42.4      12216.4   2881.2
Shell Scripts (8 concurrent)                      6.0       1660.8   2768.0
System Call Overhead                          15000.0    1747584.1   1165.1
                                                                   ========
System Benchmarks Index Score                                        2289.0

M6g.large


========================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: ip-172-31-5-119: GNU/Linux
   OS: GNU/Linux -- 5.15.0-1028-aws -- #32-Ubuntu SMP Mon Jan 9 12:29:05 UTC 2023
   Machine: aarch64 (aarch64)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0:  (243.8 bogomips)
          
   CPU 1:  (243.8 bogomips)
          
   04:46:43 up  1:40,  1 user,  load average: 0.41, 1.35, 1.81; runlevel 2023-02-15

------------------------------------------------------------------------
Benchmark Run: Wed Feb 15 2023 04:46:43 - 05:14:40
2 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       39738207.9 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     7198.9 MWIPS (9.9 s, 7 samples)
Execl Throughput                               2057.5 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        870263.3 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          248097.6 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       2308360.1 KBps  (30.0 s, 2 samples)
Pipe Throughput                             1335004.2 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  88057.3 lps   (10.0 s, 7 samples)
Process Creation                               4962.1 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   7268.3 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1478.5 lpm   (60.0 s, 2 samples)
System Call Overhead                         925608.8 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   39738207.9   3405.2
Double-Precision Whetstone                       55.0       7198.9   1308.9
Execl Throughput                                 43.0       2057.5    478.5
File Copy 1024 bufsize 2000 maxblocks          3960.0     870263.3   2197.6
File Copy 256 bufsize 500 maxblocks            1655.0     248097.6   1499.1
File Copy 4096 bufsize 8000 maxblocks          5800.0    2308360.1   3979.9
Pipe Throughput                               12440.0    1335004.2   1073.2
Pipe-based Context Switching                   4000.0      88057.3    220.1
Process Creation                                126.0       4962.1    393.8
Shell Scripts (1 concurrent)                     42.4       7268.3   1714.2
Shell Scripts (8 concurrent)                      6.0       1478.5   2464.2
System Call Overhead                          15000.0     925608.8    617.1
                                                                   ========
System Benchmarks Index Score                                        1172.9

------------------------------------------------------------------------
Benchmark Run: Wed Feb 15 2023 05:14:40 - 05:42:37
2 CPUs in system; running 2 parallel copies of tests

Dhrystone 2 using register variables       79409640.2 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    14393.4 MWIPS (9.9 s, 7 samples)
Execl Throughput                               3770.3 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks       1475162.1 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          432270.4 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       3315535.3 KBps  (30.0 s, 2 samples)
Pipe Throughput                             2667690.5 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 323040.4 lps   (10.0 s, 7 samples)
Process Creation                               9770.6 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  11168.5 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1472.6 lpm   (60.0 s, 2 samples)
System Call Overhead                        1852372.5 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   79409640.2   6804.6
Double-Precision Whetstone                       55.0      14393.4   2617.0
Execl Throughput                                 43.0       3770.3    876.8
File Copy 1024 bufsize 2000 maxblocks          3960.0    1475162.1   3725.2
File Copy 256 bufsize 500 maxblocks            1655.0     432270.4   2611.9
File Copy 4096 bufsize 8000 maxblocks          5800.0    3315535.3   5716.4
Pipe Throughput                               12440.0    2667690.5   2144.4
Pipe-based Context Switching                   4000.0     323040.4    807.6
Process Creation                                126.0       9770.6    775.4
Shell Scripts (1 concurrent)                     42.4      11168.5   2634.1
Shell Scripts (8 concurrent)                      6.0       1472.6   2454.4
System Call Overhead                          15000.0    1852372.5   1234.9
                                                                   ========
System Benchmarks Index Score                                        2141.7
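
As a note on how to read these reports: each INDEX value is RESULT / BASELINE × 10, and the final System Benchmarks Index Score is the geometric mean of the individual index values. The following Python sketch recomputes the score from the M7g.large single-copy table above as a sanity check. The numbers are copied from that table; the names `index_value` and `overall_score` are my own and not part of UnixBench.

```python
import math

# (BASELINE, RESULT) pairs copied from the M7g.large "1 parallel copy" table.
M7G_SINGLE = {
    "Dhrystone 2 using register variables": (116700.0, 49993610.6),
    "Double-Precision Whetstone": (55.0, 7679.0),
    "Execl Throughput": (43.0, 2123.4),
    "File Copy 1024 bufsize 2000 maxblocks": (3960.0, 900104.6),
    "File Copy 256 bufsize 500 maxblocks": (1655.0, 247783.5),
    "File Copy 4096 bufsize 8000 maxblocks": (5800.0, 2661191.4),
    "Pipe Throughput": (12440.0, 1391358.9),
    "Pipe-based Context Switching": (4000.0, 148827.8),
    "Process Creation": (126.0, 5187.3),
    "Shell Scripts (1 concurrent)": (42.4, 8162.8),
    "Shell Scripts (8 concurrent)": (6.0, 1635.7),
    "System Call Overhead": (15000.0, 874665.4),
}


def index_value(baseline, result):
    """Index of one test: the result relative to the baseline, scaled by 10."""
    return result / baseline * 10


def overall_score(table):
    """Geometric mean of all index values in the table."""
    indices = [index_value(b, r) for b, r in table.values()]
    return math.exp(sum(math.log(i) for i in indices) / len(indices))


print(f"Dhrystone index: {index_value(116700.0, 49993610.6):.1f}")
print(f"Overall score:   {overall_score(M7G_SINGLE):.1f}")
```

Because the score is a geometric mean, one test with a large gain (such as Pipe-based Context Switching) moves the total less than it would in an arithmetic mean.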

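Comparison

Single-threaded

Taking M6g as 100%, Dhrystone 2 came in at about 126% and Pipe-based Context Switching at about 169%. Most of the other tests were roughly 3-15% faster, File Copy 256 was about even, and System Call Overhead was about 5% slower on M7g. The overall index score was 11% higher.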

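Multi-threaded

The overall trend was similar to the single-threaded run, with an index score about 7% higher. The exceptions were Pipe-based Context Switching and Process Creation, which were slightly slower on M7g, and File Copy 4096, which was about 44% faster. (I do not know the reason for the File Copy 4096 result.)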
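
For anyone who wants to check the ratios, here is a small Python sketch that divides each M7g result by the corresponding M6g result for both runs. The values are copied from the four UnixBench tables above; the test names are shortened and the `ratios` helper is only an illustrative name.

```python
# test -> (M7g 1 copy, M6g 1 copy, M7g 2 copies, M6g 2 copies),
# copied from the UnixBench reports above.
RESULTS = {
    "Dhrystone 2": (49993610.6, 39738207.9, 100238134.4, 79409640.2),
    "Whetstone": (7679.0, 7198.9, 15359.6, 14393.4),
    "Execl Throughput": (2123.4, 2057.5, 3936.8, 3770.3),
    "File Copy 1024": (900104.6, 870263.3, 1544096.3, 1475162.1),
    "File Copy 256": (247783.5, 248097.6, 436658.6, 432270.4),
    "File Copy 4096": (2661191.4, 2308360.1, 4763765.4, 3315535.3),
    "Pipe Throughput": (1391358.9, 1335004.2, 2781042.7, 2667690.5),
    "Context Switching": (148827.8, 88057.3, 295290.1, 323040.4),
    "Process Creation": (5187.3, 4962.1, 9161.9, 9770.6),
    "Shell Scripts (1)": (8162.8, 7268.3, 12216.4, 11168.5),
    "Shell Scripts (8)": (1635.7, 1478.5, 1660.8, 1472.6),
    "System Call Overhead": (874665.4, 925608.8, 1747584.1, 1852372.5),
    "Index Score": (1304.0, 1172.9, 2289.0, 2141.7),
}


def ratios(results):
    """Return {test: (1-copy ratio, 2-copy ratio)}, each as M7g / M6g."""
    return {
        name: (m7g_1 / m6g_1, m7g_2 / m6g_2)
        for name, (m7g_1, m6g_1, m7g_2, m6g_2) in results.items()
    }


print(f"{'test':22s} {'1 copy':>8s} {'2 copies':>9s}")
for name, (single, multi) in ratios(RESULTS).items():
    print(f"{name:22s} {single:7.2f}x {multi:8.2f}x")
```

A ratio above 1.00 means M7g was faster in that test. Keep in mind that several tests have only 2 samples, so small differences may just be noise.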

OpenSSL speed

Cryptographic performance is also said to have improved, so I tested it with OpenSSL.

I compared SHA512 and AES-128-GCM.

M7g.large

$ openssl speed -evp sha512
Doing sha512 for 3s on 16 size blocks: 10558323 sha512's in 3.00s
Doing sha512 for 3s on 64 size blocks: 10609983 sha512's in 3.00s
Doing sha512 for 3s on 256 size blocks: 5556640 sha512's in 3.00s
Doing sha512 for 3s on 1024 size blocks: 2304995 sha512's in 3.00s
Doing sha512 for 3s on 8192 size blocks: 356516 sha512's in 3.00s
Doing sha512 for 3s on 16384 size blocks: 181427 sha512's in 3.00s
version: 3.0.2
built on: Mon Feb  6 17:57:17 2023 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-oZetzz/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_armcap=0xff
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha512           56311.06k   226346.30k   474166.61k   786771.63k   973526.36k   990833.32k
$ openssl speed -evp aes-128-gcm
Doing AES-128-GCM for 3s on 16 size blocks: 119146247 AES-128-GCM's in 2.99s
Doing AES-128-GCM for 3s on 64 size blocks: 82901661 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 256 size blocks: 33596767 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 1024 size blocks: 12051388 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 8192 size blocks: 1654743 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 16384 size blocks: 833177 AES-128-GCM's in 3.00s
version: 3.0.2
built on: Mon Feb  6 17:57:17 2023 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-oZetzz/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_armcap=0xff
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
AES-128-GCM     637571.89k  1768568.77k  2866924.12k  4113540.44k  4518551.55k  4550257.32k

M6g.large

$ openssl speed -evp sha512
Doing sha512 for 3s on 16 size blocks: 5653728 sha512's in 3.00s
Doing sha512 for 3s on 64 size blocks: 5649812 sha512's in 3.00s
Doing sha512 for 3s on 256 size blocks: 2621488 sha512's in 3.00s
Doing sha512 for 3s on 1024 size blocks: 1001815 sha512's in 3.00s
Doing sha512 for 3s on 8192 size blocks: 148089 sha512's in 3.00s
Doing sha512 for 3s on 16384 size blocks: 75075 sha512's in 3.00s
version: 3.0.2
built on: Mon Feb  6 17:57:17 2023 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-oZetzz/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_armcap=0xbf
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha512           30153.22k   120529.32k   223700.31k   341952.85k   404381.70k   410009.60k
$ openssl speed -evp aes-128-gcm
Doing AES-128-GCM for 3s on 16 size blocks: 64254877 AES-128-GCM's in 2.99s
Doing AES-128-GCM for 3s on 64 size blocks: 46072641 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 256 size blocks: 18846950 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 1024 size blocks: 6425855 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 8192 size blocks: 901257 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 16384 size blocks: 454285 AES-128-GCM's in 3.00s
version: 3.0.2
built on: Mon Feb  6 17:57:17 2023 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-oZetzz/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_armcap=0xbf
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
AES-128-GCM     343838.81k   982883.01k  1608273.07k  2193358.51k  2461032.45k  2481001.81k
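
As a reading aid for the output above: the figures in the summary row are in thousands of bytes per second, and they line up with the "Doing ... in N s" lines as operations × block size ÷ elapsed seconds ÷ 1000. The Python sketch below reproduces two of the M7g figures that way. The function name `throughput_k` is hypothetical, and this is only a check against the printed numbers, not OpenSSL's own code.

```python
def throughput_k(count, block_size, seconds):
    """Throughput in thousands of bytes per second for one block size."""
    return count * block_size / seconds / 1000


# M7g.large, sha512 on 16-byte blocks: 10558323 operations in 3.00 s
print(f"sha512      16 bytes: {throughput_k(10558323, 16, 3.00):.2f}k")

# M7g.large, AES-128-GCM on 16-byte blocks: 119146247 operations in 2.99 s
print(f"AES-128-GCM 16 bytes: {throughput_k(119146247, 16, 2.99):.2f}k")
```

The results match the 56311.06k and 637571.89k entries in the M7g tables.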

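Comparison

SHA512 was about 1.9 to 2.4 times faster depending on the block size, and AES-128-GCM was about 1.8 times faster. That is close to the advertised 2x improvement in cryptographic performance.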
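
The per-block-size speedups can be checked with a short Python sketch. The throughput values are copied from the summary rows of the `openssl speed` outputs above; the `speedups` helper is an illustrative name of my own.

```python
BLOCK_SIZES = (16, 64, 256, 1024, 8192, 16384)

# Throughput in 1000s of bytes per second, copied from the outputs above.
M7G = {
    "sha512": (56311.06, 226346.30, 474166.61, 786771.63, 973526.36, 990833.32),
    "AES-128-GCM": (637571.89, 1768568.77, 2866924.12, 4113540.44, 4518551.55, 4550257.32),
}
M6G = {
    "sha512": (30153.22, 120529.32, 223700.31, 341952.85, 404381.70, 410009.60),
    "AES-128-GCM": (343838.81, 982883.01, 1608273.07, 2193358.51, 2461032.45, 2481001.81),
}


def speedups(algorithm):
    """Return the M7g / M6g throughput ratio for each block size."""
    return [new / old for new, old in zip(M7G[algorithm], M6G[algorithm])]


for algorithm in M7G:
    cells = ", ".join(
        f"{size}B: {ratio:.2f}x"
        for size, ratio in zip(BLOCK_SIZES, speedups(algorithm))
    )
    print(f"{algorithm}: {cells}")
```

For SHA512 the gain grows with the block size, while for AES-128-GCM it stays roughly constant across block sizes.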

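Summary

I verified the performance improvements of Graviton3. The instance price is roughly 6% higher, but the results show performance that justifies it. I hope these instances will also become available in the Tokyo and Osaka regions.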