Discussion:
Unstable local network throughput
Ben RUBSON
2016-08-02 18:43:01 UTC
Hello,

I'm trying to reach the 40Gb/s max throughput between 2 hosts running a ConnectX-3 Mellanox network adapter.

FreeBSD 10.3 just installed, latest updates applied.
Network adapters running the latest firmware and drivers.
No workload at all, just iPerf as the benchmark tool.

### Step 1 :
I never managed to go beyond around 30Gb/s.
I did the usual tuning (MTU, kern.ipc.maxsockbuf, net.inet.tcp.sendbuf_max, net.inet.tcp.recvbuf_max...).
I played with adapter interrupt moderation.
I played with iPerf options (window / buffer size, number of threads...).
But it did not help.
Results fluctuate, throughput is not sustained, and using 2 or more iPerf threads did not help; it only made the results less consistent.
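
For the socket buffer and window sizes mentioned in this step, a rough lower bound is the bandwidth-delay product. The sketch below computes it with plain shell arithmetic. The 100 microsecond round-trip time is only an assumed figure for two directly connected hosts, not a measurement from these machines, and `bdp_bytes` is a helper name made up for illustration.

```shell
# bdp_bytes GBPS RTT_US
# Prints the bandwidth-delay product in bytes: the amount of data that must
# be in flight to keep a link of GBPS Gb/s busy at a round-trip time of
# RTT_US microseconds.
bdp_bytes() {
    # bits in flight = GBPS * 1e9 * RTT_US * 1e-6 = GBPS * 1000 * RTT_US
    echo $(( $1 * 1000 * $2 / 8 ))
}

# Example with an assumed RTT of 100 us on a 40Gb/s link:
# bdp_bytes 40 100
```

Socket buffers (and the iPerf window) smaller than this value would cap a single TCP stream below line rate regardless of other tuning.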

### Step 2 :
Let's start Linux on these 2 physical hosts.
I only had to use jumbo frames in order to achieve the 40Gb/s max throughput...
OK, network between the 2 hosts is not the root cause, and my hardware can run these adapters up to their max throughput.
Good point.

### Step 3 :
Go back to FreeBSD on these physical hosts.
Let's run this simple command to test FreeBSD itself :
# iperf -c 127.0.0.1 -i 1 -t 60
Strangely enough, the highest results are around 35GB/s.
Even more strange, from one run to another, I do not get identical results : sometimes 17Gb/s, sometimes 20, sometimes 30...
Throughput can also suddenly drop down, then increase again...
Power management in BIOS is totally disabled, as well as FreeBSD powerd, so CPU frequency is not throttled.
Another strange thing, increasing the number of iPerf threads (-P 2 for example), does not improve the results at all.
iPerf3 gave the same random results.
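
To quantify the run-to-run spread described in this step, the per-run results can be reduced to a minimum, maximum and mean. This is a minimal sketch, assuming one Gb/s figure per line has already been collected from each iPerf run (extracting that figure from the iPerf output is not shown); `summarize_runs` is a hypothetical helper name.

```shell
# summarize_runs
# Reads one throughput figure per line on stdin and prints min, max and mean.
summarize_runs() {
    awk '
        NR == 1 { min = max = $1 }
        {
            if ($1 < min) min = $1
            if ($1 > max) max = $1
            sum += $1
        }
        END { if (NR) printf "min=%s max=%s mean=%.1f\n", min, max, sum / NR }'
}

# Example with the three figures quoted in this step:
# printf '17\n20\n30\n' | summarize_runs
```

A large gap between min and max over identical runs is the instability being reported here, independent of the absolute numbers.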

### Step 4 :
Let's start Linux again on these 2 hosts.
Let's run the same simple command :
# iperf -c 127.0.0.1 -i 1 -t 60
Result : 45Gb/s.
With 2 threads : 90Gb/s.
With 4 threads : 180Gb/s.
So here we have the expected results, and they stay stable over time.

### Step 5 :
Does FreeBSD suffer when sending or when receiving ?
Let's start one host with Linux, the other one with FreeBSD.
Results :
Linux --> FreeBSD : around 30GB/s.
FreeBSD --> Linux : 40Gb/s.
So it sounds like FreeBSD suffers when receiving.

### Step 6 :
FreeBSD 11-BETA3 gave the same random results.

### Questions :
I think my tests show that there is something wrong with FreeBSD (tuning ? something else ?).
Do you have the same kind of random results on your hosts ?
Could you help me get sustained throughput at step 3, as we have at step 4 (I think this is what we should expect) ?
There would then be no reason not to achieve max throughput through Mellanox adapters themselves.

Thank you very much !

Best regards,

Ben
Hans Petter Selasky
2016-08-02 19:35:48 UTC
Post by Ben RUBSON
> Hello,
> I'm trying to reach the 40Gb/s max throughput between 2 hosts running a ConnectX-3 Mellanox network adapter.
> FreeBSD 10.3 just installed, latest updates applied.
> Network adapters running the latest firmware and drivers.
> No workload at all, just iPerf as the benchmark tool.
Hi,

The CX-3 driver doesn't bind the worker threads to specific CPU cores by
default, so if your system has more than one NUMA domain, the bottleneck
ends up being the high-speed link between the CPU cores and not the
card. A quick and dirty workaround is to "cpuset" iperf and the
interrupt and taskqueue threads to specific CPU cores.
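
A minimal sketch of the interrupt part of this workaround, under stated assumptions: the adapter's interrupts are assumed to show up in `vmstat -i` with a name matching `mlx4` (check the actual names on your host), the CPU numbers are placeholders, and `plan_irq_pinning` is a hypothetical helper. It only prints the `cpuset -l <cpu> -x <irq>` commands so they can be reviewed before being run.

```shell
# plan_irq_pinning PATTERN CPU [CPU...]
# Reads `vmstat -i` style lines on stdin ("irqNNN: name total rate") and
# prints one cpuset(1) command per matching IRQ, assigning the given CPUs
# round-robin. Nothing is executed.
plan_irq_pinning() {
    [ $# -ge 2 ] || { echo "usage: plan_irq_pinning PATTERN CPU..." >&2; return 1; }
    pattern=$1; shift
    awk -v pat="$pattern" -v cpus="$*" '
        BEGIN { n = split(cpus, cpu, " ") }
        $1 ~ /^irq[0-9]+:$/ && $2 ~ pat {
            irq = $1
            gsub(/[^0-9]/, "", irq)          # "irq264:" -> "264"
            printf "cpuset -l %s -x %s\n", cpu[(i++ % n) + 1], irq
        }'
}

# Possible use on the FreeBSD host (CPU ids are examples only):
# vmstat -i | plan_irq_pinning mlx4 8 9 10 11
```

The benchmark itself can be started on chosen cores in the same spirit, for example `cpuset -l 0-7 iperf -s`, with the core list picked from the NUMA domain the adapter is attached to.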

Are you using "options RSS" and "options PCBGROUP" in your kernel config?

Are you also testing CX-4 cards from Mellanox?

--HPS
Ben RUBSON
2016-08-02 20:11:53 UTC
Hi,
Thank you for your answer Hans Petter !
> The CX-3 driver doesn't bind the worker threads to specific CPU cores by default, so if your system has more than one NUMA domain, the bottleneck ends up being the high-speed link between the CPU cores and not the card. A quick and dirty workaround is to "cpuset" iperf and the interrupt and taskqueue threads to specific CPU cores.
My CPUs : 2x E5-2620v3 with ***@1866.
What is strange is that even without using the card (iPerf on localhost), as my results show, I have very low and unstable random throughput (compared to Linux on the same host).
> Are you using "options RSS" and "options PCBGROUP" in your kernel config?
I only installed FreeBSD 10.3 and updated it, so I use the GENERIC kernel.
RSS and PCBGROUP are not defined in /usr/src/sys/amd64/conf/GENERIC, so I think I do not use them.
> Are you also testing CX-4 cards from Mellanox?
No, I only have CX-3 at my disposal :)

Ben

PS : in my previous mail I sometimes used GB/s, of course you must read Gb/s everywhere.
Eugene Grosbein
2016-08-03 02:32:57 UTC
Post by Ben RUBSON
> Hello,
> I'm trying to reach the 40Gb/s max throughput between 2 hosts running a ConnectX-3 Mellanox network adapter.
If you have gateway_enable="YES" (sysctl net.inet.ip.forwarding=1)
then try to disable this forwarding setting and rerun your tests to compare results.
Ben RUBSON
2016-08-03 06:09:18 UTC
Post by Eugene Grosbein
> If you have gateway_enable="YES" (sysctl net.inet.ip.forwarding=1)
> then try to disable this forwarding setting and rerun your tests to compare results.
Thank you Eugene for this, but net.inet.ip.forwarding is disabled by default and I did not enable it.
# sysctl net.inet.ip.forwarding
net.inet.ip.forwarding: 0
Ben RUBSON
2016-08-03 16:57:28 UTC
> The CX-3 driver doesn't bind the worker threads to specific CPU cores by default, so if your system has more than one NUMA domain, the bottleneck ends up being the high-speed link between the CPU cores and not the card. A quick and dirty workaround is to "cpuset" iperf and the interrupt and taskqueue threads to specific CPU cores.
Hans Petter,

I'm testing "cpuset" and sometimes get better results, I'm still trying to get the best.
I've cpuset iperf & Mellanox interrupts, but what do you mean by taskqueue threads ?

Thank U !

Ben
Hans Petter Selasky
2016-08-03 18:02:06 UTC
Post by Ben RUBSON
> taskqueue threads ?
The mlx4 send and receive queues each have their own set of taskqueues.
Look in the output of "ps auxww".

--HPS
Ben RUBSON
2016-08-03 18:41:49 UTC
> The mlx4 send and receive queues each have their own set of taskqueues. Look in the output of "ps auxww".
I can't find them. I even unloaded and reloaded the driver in order to catch the differences, but I did not find any relevant process.
Here are the processes I have when the driver is loaded (I removed the lines for my own processes) :

# ps auxxw
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 11 2398.1 0.0 0 384 - RL Mon10pm 65969:09.19 [idle]
root 0 0.0 0.0 0 8288 - DLs Mon10pm 4:59.31 [kernel]
root 1 0.0 0.0 9492 872 - ILs Mon10pm 0:00.04 /sbin/init --
root 2 0.0 0.0 0 96 - DL Mon10pm 0:00.82 [cam]
root 3 0.0 0.0 0 176 - DL Mon10pm 0:17.88 [zfskern]
root 4 0.0 0.0 0 16 - DL Mon10pm 0:00.00 [sctp_iterator]
root 5 0.0 0.0 0 16 - DL Mon10pm 0:00.75 [enc_daemon0]
root 6 0.0 0.0 0 16 - DL Mon10pm 0:00.50 [enc_daemon1]
root 7 0.0 0.0 0 16 - DL Mon10pm 0:00.05 [enc_daemon2]
root 8 0.0 0.0 0 16 - DL Mon10pm 0:00.05 [enc_daemon3]
root 9 0.0 0.0 0 16 - DL Mon10pm 0:00.00 [g_mirror swap]
root 10 0.0 0.0 0 16 - DL Mon10pm 0:00.00 [audit]
root 12 0.0 0.0 0 1408 - WL Mon10pm 186:01.05 [intr]
root 13 0.0 0.0 0 48 - DL Mon10pm 0:05.24 [geom]
root 14 0.0 0.0 0 16 - DL Mon10pm 1:07.19 [rand_harvestq]
root 15 0.0 0.0 0 160 - DL Mon10pm 0:08.22 [usb]
root 16 0.0 0.0 0 32 - DL Mon10pm 0:00.23 [pagedaemon]
root 17 0.0 0.0 0 16 - DL Mon10pm 0:00.00 [vmdaemon]
root 18 0.0 0.0 0 16 - DL Mon10pm 0:00.00 [pagezero]
root 19 0.0 0.0 0 16 - DL Mon10pm 0:00.12 [bufdaemon]
root 20 0.0 0.0 0 16 - DL Mon10pm 0:00.13 [vnlru]
root 21 0.0 0.0 0 16 - DL Mon10pm 2:13.02 [syncer]
root 124 0.0 0.0 12360 1736 - Is Mon10pm 0:00.00 adjkerntz -i
root 618 0.0 0.0 13628 4868 - Ss Mon10pm 0:00.03 /sbin/devd
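
The listing above has one line per process, so threads living inside `[kernel]` and `[intr]` are not shown individually. On FreeBSD, `ps -H` or `procstat -t <pid>` list threads one per line with their thread IDs. The sketch below is only an illustration under assumptions: the thread name pattern `mlx` is a guess (the real names should be read from the `procstat` output), the CPU numbers are placeholders, and `plan_thread_pinning` is a hypothetical helper that prints `cpuset -l <cpu> -t <tid>` commands without running them.

```shell
# plan_thread_pinning PATTERN CPU [CPU...]
# Reads `procstat -t` style lines on stdin (PID TID COMM TDNAME ...) and
# prints one cpuset(1) command per thread whose line matches PATTERN,
# assigning the given CPUs round-robin. Nothing is executed.
plan_thread_pinning() {
    [ $# -ge 2 ] || { echo "usage: plan_thread_pinning PATTERN CPU..." >&2; return 1; }
    pattern=$1; shift
    awk -v pat="$pattern" -v cpus="$*" '
        BEGIN { n = split(cpus, cpu, " ") }
        # The header line is skipped because its second field is not numeric.
        $2 ~ /^[0-9]+$/ && $0 ~ pat {
            printf "cpuset -l %s -t %s\n", cpu[(i++ % n) + 1], $2
        }'
}

# Possible use on the FreeBSD host (pattern and CPU ids are examples only):
# procstat -t 0 | plan_thread_pinning mlx 8 9 10 11
```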
Ben RUBSON
2016-08-04 09:40:10 UTC
> The CX-3 driver doesn't bind the worker threads to specific CPU cores by default, so if your system has more than one NUMA domain, the bottleneck ends up being the high-speed link between the CPU cores and not the card. A quick and dirty workaround is to "cpuset" iperf and the interrupt and taskqueue threads to specific CPU cores.
OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf processes, and I'm able to reach max bandwidth.
Choosing the wrong NUMA (or both, or one for interrupts, the other one for iPerf, etc...) totally kills throughput.

However, full-duplex throughput is still limited, I can't manage to reach 2x40Gb/s, throttle is at about 45Gb/s.
I tried many different cpuset layouts, but I never went above 45Gb/s.
(Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck)
> Are you using "options RSS" and "options PCBGROUP" in your kernel config?
I will then give RSS a try.

Any other clue perhaps regarding the full-duplex limitation ?

Many thanks !

Ben
Ben RUBSON
2016-08-04 15:24:17 UTC
Post by Ben RUBSON
>> The CX-3 driver doesn't bind the worker threads to specific CPU cores by default, so if your system has more than one NUMA domain, the bottleneck ends up being the high-speed link between the CPU cores and not the card. A quick and dirty workaround is to "cpuset" iperf and the interrupt and taskqueue threads to specific CPU cores.
> OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf processes, and I'm able to reach max bandwidth.
> Choosing the wrong NUMA (or both, or one for interrupts, the other one for iPerf, etc...) totally kills throughput.
> However, full-duplex throughput is still limited, I can't manage to reach 2x40Gb/s, throttle is at about 45Gb/s.
> I tried many different cpuset layouts, but I never went above 45Gb/s.
> (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck)
>> Are you using "options RSS" and "options PCBGROUP" in your kernel config?
> I will then give RSS a try.
Without RSS :
A ---> B : 40Gbps (unidirectional)
A <--> B : 45Gbps (bidirectional)

With RSS :
A ---> B : 28Gbps (unidirectional)
A <--> B : 28Gbps (bidirectional)

Sounds like RSS does not help :/

Why, without RSS, do I have difficulties to reach 2x40Gbps (full-duplex) ?

Thank U !
Hans Petter Selasky
2016-08-04 15:33:10 UTC
Post by Ben RUBSON
Post by Ben RUBSON
>>> The CX-3 driver doesn't bind the worker threads to specific CPU cores by default, so if your system has more than one NUMA domain, the bottleneck ends up being the high-speed link between the CPU cores and not the card. A quick and dirty workaround is to "cpuset" iperf and the interrupt and taskqueue threads to specific CPU cores.
>> OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf processes, and I'm able to reach max bandwidth.
>> Choosing the wrong NUMA (or both, or one for interrupts, the other one for iPerf, etc...) totally kills throughput.
>> However, full-duplex throughput is still limited, I can't manage to reach 2x40Gb/s, throttle is at about 45Gb/s.
>> I tried many different cpuset layouts, but I never went above 45Gb/s.
>> (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck)
>>> Are you using "options RSS" and "options PCBGROUP" in your kernel config?
>> I will then give RSS a try.
> Without RSS :
> A ---> B : 40Gbps (unidirectional)
> A <--> B : 45Gbps (bidirectional)
> With RSS :
> A ---> B : 28Gbps (unidirectional)
> A <--> B : 28Gbps (bidirectional)
> Sounds like RSS does not help :/
> Why, without RSS, do I have difficulties to reach 2x40Gbps (full-duplex) ?
Hi,

Possibly because the packets are arriving at the wrong CPU compared to
what RSS expects. Then RSS will invoke a taskqueue to process the
packets on the correct CPU, if I'm not mistaken.

The mlx4 driver does not fully support RSS. The mlx5 driver does.

--HPS
Ben RUBSON
2016-08-04 15:33:19 UTC
Post by Ben RUBSON
Post by Ben RUBSON
>>>> The CX-3 driver doesn't bind the worker threads to specific CPU cores by default, so if your system has more than one NUMA domain, the bottleneck ends up being the high-speed link between the CPU cores and not the card. A quick and dirty workaround is to "cpuset" iperf and the interrupt and taskqueue threads to specific CPU cores.
>>> OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf processes, and I'm able to reach max bandwidth.
>>> Choosing the wrong NUMA (or both, or one for interrupts, the other one for iPerf, etc...) totally kills throughput.
>>> However, full-duplex throughput is still limited, I can't manage to reach 2x40Gb/s, throttle is at about 45Gb/s.
>>> I tried many different cpuset layouts, but I never went above 45Gb/s.
>>> (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck)
>>>> Are you using "options RSS" and "options PCBGROUP" in your kernel config?
>>> I will then give RSS a try.
>> Without RSS :
>> A ---> B : 40Gbps (unidirectional)
>> A <--> B : 45Gbps (bidirectional)
>> With RSS :
>> A ---> B : 28Gbps (unidirectional)
>> A <--> B : 28Gbps (bidirectional)
>> Sounds like RSS does not help :/
>> Why, without RSS, do I have difficulties to reach 2x40Gbps (full-duplex) ?
> Hi,
> Possibly because the packets are arriving at the wrong CPU compared to what RSS expects. Then RSS will invoke a taskqueue to process the packets on the correct CPU, if I'm not mistaken.
But even without RSS, I should be able to go up to 2x40Gbps, don't you think so ?
Nobody already did this ?
Ryan Stone
2016-08-04 18:15:09 UTC
Post by Ben RUBSON
> But even without RSS, I should be able to go up to 2x40Gbps, don't you
> think so ?
> Nobody already did this ?
Try this patch, which should improve performance when multiple TCP streams
are running in parallel over an mlx4_en port:

https://people.freebsd.org/~rstone/patches/mlxen_counters.diff
Ben RUBSON
2016-08-04 18:53:34 UTC
Post by Ben RUBSON
>> But even without RSS, I should be able to go up to 2x40Gbps, don't you think so ?
>> Nobody already did this ?
> https://people.freebsd.org/~rstone/patches/mlxen_counters.diff
Thank you very much Ryan.
I just tried it, but it does not help :/

Below is the CPU load during bidirectional traffic.
We clearly see the 4 CPUs allocated to Mellanox IRQs, the others to iPerf processes.
Spreading the IRQs over the 12 CPUs of the NUMA domain brings no improvement, only slightly less throughput.
Note that I get the same results if I only use 2 CPUs for IRQs.

27 processes: 1 running, 26 sleeping
CPU 0: 1.1% user, 0.0% nice, 16.7% system, 0.0% interrupt, 82.2% idle
CPU 1: 1.1% user, 0.0% nice, 18.9% system, 0.0% interrupt, 80.0% idle
CPU 2: 1.9% user, 0.0% nice, 17.8% system, 0.0% interrupt, 80.4% idle
CPU 3: 1.1% user, 0.0% nice, 15.2% system, 0.0% interrupt, 83.7% idle
CPU 4: 0.4% user, 0.0% nice, 16.3% system, 0.0% interrupt, 83.3% idle
CPU 5: 1.1% user, 0.0% nice, 14.4% system, 0.0% interrupt, 84.4% idle
CPU 6: 2.6% user, 0.0% nice, 17.4% system, 0.0% interrupt, 80.0% idle
CPU 7: 2.2% user, 0.0% nice, 15.2% system, 0.0% interrupt, 82.6% idle
CPU 8: 1.1% user, 0.0% nice, 3.0% system, 15.9% interrupt, 80.0% idle
CPU 9: 0.0% user, 0.0% nice, 3.0% system, 32.2% interrupt, 64.8% idle
CPU 10: 0.0% user, 0.0% nice, 0.4% system, 58.9% interrupt, 40.7% idle
CPU 11: 0.0% user, 0.0% nice, 0.4% system, 77.4% interrupt, 22.2% idle
CPU 12: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 13: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 14: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 15: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 16: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 17: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 18: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 19: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 20: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 21: 0.0% user, 0.0% nice, 0.0% system, 0.4% interrupt, 99.6% idle
CPU 22: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 23: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
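
A per-CPU listing like the one above can be screened for cores that are close to saturation. This is a minimal sketch, assuming the lines keep exactly this `CPU N: ... X% idle` layout; the 50% threshold is an arbitrary example and `busy_cpus` is a hypothetical helper name.

```shell
# busy_cpus MAX_IDLE
# Reads per-CPU lines on stdin ("CPU N: ...%% user, ..., X% idle") and prints
# "N X" for every CPU whose idle percentage is below MAX_IDLE.
busy_cpus() {
    awk -v max_idle="$1" '
        $1 == "CPU" && $NF == "idle" {
            cpu = $2;         sub(/:$/, "", cpu)    # "11:"   -> "11"
            idle = $(NF - 1); sub(/%$/, "", idle)   # "22.2%" -> "22.2"
            if (idle + 0 < max_idle + 0) print cpu, idle
        }'
}

# Example: feed it the saved listing and keep CPUs that are less than 50% idle.
# busy_cpus 50 < cpuload.txt
```

Applied to the listing above with a threshold of 50, only CPUs 10 and 11 (two of the IRQ cores) are reported, and neither is fully busy.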
Ben RUBSON
2016-08-04 21:47:29 UTC
Post by Ben RUBSON
>> But even without RSS, I should be able to go up to 2x40Gbps, don't you think so ?
>> Nobody already did this ?
> Try this patch
> (...)
I also just tested the NODEBUG kernel but I did not help.
Ben RUBSON
2016-08-04 21:49:16 UTC
Post by Ben RUBSON
>> But even without RSS, I should be able to go up to 2x40Gbps, don't you think so ?
>> Nobody already did this ?
> Try this patch
> (...)
I also just tested the NODEBUG kernel but it did not help.
Hans Petter Selasky
2016-08-05 08:30:04 UTC
Post by Ben RUBSON
Post by Ben RUBSON
>>> But even without RSS, I should be able to go up to 2x40Gbps, don't you think so ?
>>> Nobody already did this ?
>> Try this patch
>> (...)
> I also just tested the NODEBUG kernel but it did not help.
Hi,

When running these tests, do you see any CPUs fully utilized?

Did you check the RX/TX pause frame settings, and the mlx4 sysctl
statistics counters for signs of packet loss?

--HPS
