proposal: splitting NIC RSS up from stack RSS
Adrian Chadd
2016-07-14 20:06:30 UTC
Hi,

now that 11 is branched and marching on, I'd like to start pushing
some more improvements/evolution into the RSS side of things.

The short list of feedback from people is:

* it'd be nice to be able to configure per-device RSS keys on the fly;
* it'd be nice to be able to configure per-device RSS bucket mappings
on the fly;
* it'd be nice to be able to configure per-device RSS hash
configurations on the fly;
* it'd be nice to be able to configure per-bucket CPU set mappings on the fly;
* it'd be nice to split the RSS driver side, the RSS packet input side
and the RSS stack side of things up into separate options;
* UDP IPv6 RSS support would be nice (it works, but I need to
test/integrate bz's v6 udp locking changes for it to really matter);
* it'd be nice to scale linearly on incoming /and/ outgoing
connections. Right now incoming connections are easy, but outgoing
connections aren't so easy.

The other big things, mostly to be expected, are:

* it'd be nice if this were better documented;
* it'd be nice if we had easy examples of this stuff working, complete
with library bits in base.

I'm going to tidy up the NetworkRSS bits in the wiki soon and map out
a roadmap for 12 with some other bits and pieces.

The "can we have RSS for NICs but not for the stack, and have
keys/mapping/bucket configurable" is actually a biggish thing, as that
ties into people wanting to abuse things with netmap. They don't care
about the rest of the stack being RSS aware; they just want to be able
to control the NIC configurations from userspace and then get it
completely out of the way.

I'd appreciate any other feedback/comments/suggestions. If you're
using RSS and you haven't told me then please let me know!

thanks,


-adrian
Andrew Gallatin
2016-07-21 19:31:46 UTC
Post by Adrian Chadd
I'd appreciate any other feedback/comments/suggestions. If you're
using RSS and you haven't told me then please let me know!
Hi Adrian,

I'm a huge fan of your RSS work. In fact, I did a backport of RSS to
Netflix's stable-10 about 6 months ago. I was really interested in
breaking up the global network hashtables, where we see a lot of
contention. PCBGROUP didn't help much (just spread contention
around), so I was hoping that RSS would be the magic bullet.

Things may have progressed since then, but the real deficiencies that
I saw were:

o RSS (at the time) would only use a power-of-two number of cores.
Sadly, Intel and AMD are building lots of chips with oddball core
counts. So in a workload like ours where most work is initiated via
the NIC rx ithread, having a 14-core machine meant leaving almost 1/2
the machine mostly idle, while 8 cores were maxed out.

o There is (or was at the time) no library interface for RSS, and no
patches for popular web servers (like nginx) to use RSS. The only
example RSS users I could find were a few things from your blog.

These 2 things led me to abandon the backport, as I didn't have time
to address them on top of other work I was doing. I especially think
getting a real API and a real example consumer would help a lot.

Best regards,

Drew
Adrian Chadd
2016-07-21 22:39:33 UTC
hi,

Cool! Yeah, the RSS bits thing can be removed, as it's just doing a
bitmask instead of a % operator to do mapping. I think we can just go
to % and if people need the extra speed from a power-of-two operation,
they can reintroduce it.
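
To make the trade-off concrete, here is a minimal Python sketch (not the kernel code; the function names are made up for illustration) that contrasts a bitmask-based bucket mapping with a modulo-based one on a 14-CPU machine. It assumes the bitmask scheme rounds the bucket count down to the largest power of two that fits.

```python
def buckets_with_mask(ncpus):
    """Largest power of two <= ncpus: the bucket count a pure bitmask can address."""
    return 1 << (ncpus.bit_length() - 1)

def map_mask(hash_value, ncpus):
    # Power-of-two mapping: a cheap AND, but only buckets_with_mask(ncpus)
    # distinct CPU indexes are ever selected.
    return hash_value & (buckets_with_mask(ncpus) - 1)

def map_mod(hash_value, ncpus):
    # Modulo mapping: every CPU index in [0, ncpus) is reachable.
    return hash_value % ncpus

ncpus = 14
hashes = range(1 << 12)  # stand-in for RSS hash values
used_mask = {map_mask(h, ncpus) for h in hashes}
used_mod = {map_mod(h, ncpus) for h in hashes}
print(len(used_mask), len(used_mod))  # prints: 8 14
```

With the mask, only 8 of the 14 CPUs ever receive work, which matches the idle-cores behaviour described earlier in the thread; the modulo version reaches all 14.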

I'll add that to the list.

There's a librss, but I don't think I committed it to -HEAD. I'll go dig
it out and throw it into freebsd-head soon.



-adrian
Sepherosa Ziehau
2016-07-22 01:54:08 UTC
Post by Adrian Chadd
hi,
Cool! Yeah, the RSS bits thing can be removed, as it's just doing a
bitmask instead of a % operator to do mapping. I think we can just go
to % and if people need the extra speed from a power-of-two operation,
they can reintroduce it.
I thought about it a while ago (the most popular E5-2560v{1,2,3} only
has 6 cores, but E5-2560v4 has 8 cores! :). Since the raw RSS hash
value is '& 0x7f' (I believe most of the NICs use a 128-entry indirect
table as defined by MS RSS) to select an entry in the indirect table,
simply '%' on the raw RSS hash value probably will not work properly;
you will need (hash&0x7f)%mp_ncpus at least. And well, since the
indirect table's size is 128, you still will get some uneven CPU
workload for non-power-of-2 cpus. And if you take cpu affinity into
consideration, the situation will be even more complex ...
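
The unevenness can be counted directly. The following is a small Python sketch under stated assumptions: a 128-entry indirection table (which is indexed by 7 hash bits, i.e. a mask of 0x7f), filled round-robin across CPUs. The fill policy and all names here are illustrative, not taken from any driver.

```python
from collections import Counter

TABLE_SIZE = 128              # indirection table entries, as assumed in the post
INDEX_MASK = TABLE_SIZE - 1   # 0x7f: the low 7 hash bits select the entry

def build_table(ncpus):
    # Round-robin fill: entry i is served by CPU i % ncpus (assumed policy).
    return [i % ncpus for i in range(TABLE_SIZE)]

def cpu_for_hash(hash_value, table):
    # Same result as (hash & 0x7f) % ncpus for a round-robin table.
    return table[hash_value & INDEX_MASK]

def entries_per_cpu(ncpus):
    return Counter(build_table(ncpus))

for ncpus in (6, 8, 14):
    counts = entries_per_cpu(ncpus)
    print(ncpus, min(counts.values()), max(counts.values()))
# prints:
# 6 21 22
# 8 16 16
# 14 9 10
```

With 14 CPUs, two CPUs own 10 table entries and the other twelve own 9, so even perfectly uniform hash values leave those two CPUs with roughly 11% more traffic.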

Thanks,
sephe
Adrian Chadd
2016-07-22 19:23:35 UTC
Post by Sepherosa Ziehau
Post by Adrian Chadd
hi,
Cool! Yeah, the RSS bits thing can be removed, as it's just doing a
bitmask instead of a % operator to do mapping. I think we can just go
to % and if people need the extra speed from a power-of-two operation,
they can reintroduce it.
I thought about it a while ago (the most popular E5-2560v{1,2,3} only
has 6 cores, but E5-2560v4 has 8 cores! :). Since the raw RSS hash
value is '& 0x7f' (I believe most of the NICs use a 128-entry indirect
table as defined by MS RSS) to select an entry in the indirect table,
simply '%' on the raw RSS hash value probably will not work properly;
you will need (hash&0x7f)%mp_ncpus at least. And well, since the
indirect table's size is 128, you still will get some uneven CPU
workload for non-power-of-2 cpus. And if you take cpu affinity into
consideration, the situation will be even more complex ...
Hi,

Sure. The most annoying part is that a lot of the kernel
infrastructure for queueing packets (netisr) and scheduling stack work
(callouts) is indexed on CPU, not on "thing". If it were indexed on
"thing" then we could do a two stage work redistribution method that'd
scale O(1):

* packets get plonked into "thing" via some mapping table - eg, map
128 or 256 buckets to queues that do work / schedule call outs /
netisr; and
* the queues aren't tied to a CPU at this point, and they can get
shuffled around by using cpumasks.

It'd be really, really nice IMHO if we had netisr and callouts be
"thing" based rather than "cpu" based, so we could just shift work by
changing the CPU mask - then we don't have to worry about rescheduling
packets or work onto the new CPU when we want to move load around.
That doesn't risk out of order packet handling behaviour and it means
we can (in theory!) put a given RSS bucket into more than one CPU, for
things like TCP processing.
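
The two-stage idea above can be sketched roughly as follows. This is a toy Python model, not netisr or callout code: `WorkQueue`, `make_bucket_map`, and `enqueue` are hypothetical names, and the 128-bucket count is just one of the sizes mentioned above.

```python
NBUCKETS = 128

class WorkQueue:
    """Hypothetical 'thing': a work queue with an allowed-CPU set, not a fixed CPU."""
    def __init__(self, qid, cpumask):
        self.qid = qid
        self.cpumask = set(cpumask)
        self.packets = []

def make_bucket_map(nqueues):
    # Stage 1: a fixed table from RSS bucket to queue index.
    return [b % nqueues for b in range(NBUCKETS)]

def enqueue(hash_value, packet, bucket_map, queues):
    # O(1): one table lookup and one append, regardless of CPU count.
    q = queues[bucket_map[hash_value % NBUCKETS]]
    q.packets.append(packet)
    return q

queues = [WorkQueue(i, {i}) for i in range(4)]
bucket_map = make_bucket_map(len(queues))

q = enqueue(0xBEEF, "pkt-1", bucket_map, queues)
# Stage 2: move load by changing only the CPU mask; queued packets stay put.
q.cpumask = {2, 3}
q2 = enqueue(0xBEEF, "pkt-2", bucket_map, queues)
print(q is q2, q.packets, sorted(q.cpumask))
# prints: True ['pkt-1', 'pkt-2'] [2, 3]
```

Because the bucket-to-queue table never changes, packets of one flow stay in one queue and keep their order; only the set of CPUs allowed to service that queue moves.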

Trouble is, this is somewhat contentious. I could do the netisr change
without upsetting people, but the callout code honestly makes me want
to set everything (in sys/kern) on fire and start again. After all of
the current issues with the callout subsystem I kind of just want to
see hps finish his work and land it into head, complete with more
sensible lock semantics, before I look at breaking it out to not be
per-CPU based but instead allow subsystems to create their own worker
pools for callouts. I'm sure NFS and CAM would like this kind of thing
too.

Since people have asked me about this in the past, the side effect of
supporting dynamic hash mapping (even in software) is that for any given
flow, once you change the hash mapping you will have some packets in
said flow in the old queue and some packets in the new queue. For
things like stack TCP/UDP where it's using pcbgroups it can vary from
being slow to (eventually, when the global list goes away) plainly not
making it to the right pcb/socket, which is okay for some workloads
and not for others. That may be a fun project to work on once the
general stack / driver tidyups are done, but I'm going to resist doing
it myself for a while because it'll introduce the above uncertainties
which will cause out-of-order behaviour that'll likely generate more
problem reports than I want to handle.
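
A tiny Python simulation shows how the reordering arises. It is only an illustration under assumptions: two queues, one flow, and a `deliver` helper (a made-up name) that models the new queue's CPU happening to run before the old one.

```python
from collections import deque

def deliver(old_q, new_q):
    """Drain the new queue first, as if its CPU ran before the old queue's CPU."""
    out = []
    while new_q:
        out.append(new_q.popleft())
    while old_q:
        out.append(old_q.popleft())
    return out

queues = {0: deque(), 1: deque()}
mapping = {"flow-A": 0}            # hypothetical flow -> queue mapping

for seq in (1, 2, 3):              # packets queued under the old mapping
    queues[mapping["flow-A"]].append(seq)

mapping["flow-A"] = 1              # hash mapping changes mid-flow

for seq in (4, 5):                 # later packets land in the new queue
    queues[mapping["flow-A"]].append(seq)

order = deliver(queues[0], queues[1])
print(order)  # prints: [4, 5, 1, 2, 3]
```

Nothing in either queue is wrong on its own; the reordering comes purely from the flow being split across two queues that are serviced independently.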

(Read: since I'm doing this for free, I'm not going to do anything
risky, as I'm not getting paid to wade through the repercussions just
right now.)

FWIW, we had this same problem in ye olde past with squid and WCCP
with its hash based system. Squid's WCCP implementation was simple and
static. The commercial solutions (read: cisco, etc) implemented
handling the cache set changing / hash traffic map changing by having
the caches redirect traffic to the /old/ cache whenever the hash or
cache set changed. Squid didn't do this out of the box, so if the
cache topology changed it would send traffic to the wrong box and the
existing connections would break.
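
The difference between the two behaviours can be modelled in a few lines of Python. This is a toy model only, not the WCCP protocol: the modulo assignment, the `route` function, and the cache names are all assumptions made for illustration.

```python
def owner(flow_id, caches):
    # Static hash assignment: flow -> cache by modulo over the current cache set.
    return caches[flow_id % len(caches)]

def route(flow_id, is_new, old_caches, new_caches, redirect=True):
    """Return the cache that ends up handling a packet after a cache-set change."""
    if redirect and not is_new:
        # Existing connection: hand it back to the cache that owned it before.
        return owner(flow_id, old_caches)
    return owner(flow_id, new_caches)

old = ["cache-a", "cache-b"]
new = ["cache-a", "cache-b", "cache-c"]
flow = 5   # established while only two caches existed: 5 % 2 == 1 -> cache-b

print(route(flow, False, old, new, redirect=False), route(flow, False, old, new))
# prints: cache-c cache-b
```

Without the redirect step the established flow lands on cache-c, which has no state for it; with the redirect it returns to cache-b, where the connection lives.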



-adrian
