FreeBSD 10G forwarding performance @Intel


FreeBSD 10G forwarding performance @Intel

Alexander V. Chernikov-4
Hello list!

I'm quite stuck with bad forwarding performance on many FreeBSD boxes
doing firewalling.

Typical configuration is an E5645 / E5675 CPU with an Intel 82599 NIC.
HT is turned off.
(Configs and tunables below).

I'm mostly concerned with unidirectional traffic flowing to a single
interface (e.g. using a single route entry).

In most cases the system can forward no more than 700 (or 1400) kpps, which
is quite a bad number (Linux does, say, 5 Mpps on nearly the same hardware).


Test scenario:

Ixia XM2 (traffic generator) <> ix0 (FreeBSD).

Ixia sends 64-byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to
destinations in vlan11 (10.100.1.128 - 10.100.1.192).

Static arps are configured for all destination addresses.

Traffic level is slightly above or slightly below system performance.
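As an aside, the static ARP setup for the destination range can be scripted rather than entered by hand. A minimal sketch that just prints the commands (the MAC address is a placeholder for the Ixia receive port, not taken from the actual setup):

```shell
# Generate static ARP entries for the 65 destination hosts
# 10.100.1.128 - 10.100.1.192. The MAC address is a placeholder.
arp_cmds=$(for i in $(seq 128 192); do
    printf 'arp -S 10.100.1.%d 00:00:00:00:01:02\n' "$i"
done)
printf '%s\n' "$arp_cmds"
```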


================= Test 1  =======================
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE,
no firewall

Traffic: 1-1 flow (1 src, 1 dst)
(This is actually a bit different from what is described above.)

Result:
              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        878k   48k     0        59M       878k     0        56M     0
        874k   48k     0        59M       874k     0        56M     0
        875k   48k     0        59M       875k     0        56M     0

16:41 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf "  %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'
      STATE  C   TIME        CPU         COMMAND
       CPU6  6  17:28    100.00%      kernel{ix0 que}
       CPU9  9  20:42     60.06%    intr{irq265: ix0:que

16:41 [0] test15# vmstat -i | grep ix0
irq256: ix0:que 0                 500796        167
irq257: ix0:que 1                6693573       2245
irq258: ix0:que 2                2572380        862
irq259: ix0:que 3                3166273       1062
irq260: ix0:que 4                9691706       3251
irq261: ix0:que 5               10766434       3611
irq262: ix0:que 6                8933774       2996
irq263: ix0:que 7                5246879       1760
irq264: ix0:que 8                3548930       1190
irq265: ix0:que 9               11817986       3964
irq266: ix0:que 10                227561         76
irq267: ix0:link                       1          0

Note that the system is using 2 cores to forward, so 12 cores should be able
to forward 4+ Mpps, which is more or less consistent with the Linux results.
Note that interrupts on all queues are capped (as far as I understand from
the fact that AIM is turned off and the interrupt rates are the same as in
the previous test). Additionally, despite hw.intr_storm_threshold = 200k,
I'm constantly getting the
interrupt storm detected on "irq265:"; throttling interrupt source
message.
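As a side note, the storm guard that produces this message can be adjusted (or disabled) at runtime; a sketch, with the caveat that on the kernels I'm aware of a value of 0 disables the check entirely:

```shell
# Inspect, then disable, the interrupt storm guard so the ix queue
# interrupt is no longer throttled (0 turns the check off).
sysctl hw.intr_storm_threshold
sysctl hw.intr_storm_threshold=0
```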


================= Test 2  =======================
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE,
no firewall

Traffic: Unidirectional many-2-many

16:20 [0] test15# netstat -I ix0 -hw 1
              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        507k  651k     0        74M       508k     0        32M     0
        506k  652k     0        74M       507k     0        28M     0
        509k  652k     0        74M       508k     0        37M     0


16:28 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf "  %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'
      STATE  C   TIME        CPU         COMMAND
      CPU10  6   0:40    100.00%      kernel{ix0 que}
       CPU2  2  11:47     84.86%    intr{irq258: ix0:que
       CPU3  3  11:50     81.88%    intr{irq259: ix0:que
       CPU8  8  11:38     77.69%    intr{irq264: ix0:que
       CPU7  7  11:24     77.10%    intr{irq263: ix0:que
       WAIT  1  10:10     74.76%    intr{irq257: ix0:que
       CPU4  4   8:57     63.48%    intr{irq260: ix0:que
       CPU6  6   8:35     61.96%    intr{irq262: ix0:que
       CPU9  9  14:01     60.79%    intr{irq265: ix0:que
        RUN  0   9:07     59.67%    intr{irq256: ix0:que
       WAIT  5   6:13     43.26%    intr{irq261: ix0:que
      CPU11 11   5:19     35.89%      kernel{ix0 que}
          -  4   3:41     25.49%      kernel{ix0 que}
          -  1   3:22     21.78%      kernel{ix0 que}
          -  1   2:55     17.68%      kernel{ix0 que}
          -  4   2:24     16.55%      kernel{ix0 que}
          -  1   9:54     14.99%      kernel{ix0 que}
       CPU0 11   2:13     14.26%      kernel{ix0 que}


16:07 [0] test15# vmstat -i | grep ix0
irq256: ix0:que 0                  13654         15
irq257: ix0:que 1                  87043         96
irq258: ix0:que 2                  39604         44
irq259: ix0:que 3                  48308         53
irq260: ix0:que 4                 138002        153
irq261: ix0:que 5                 169596        188
irq262: ix0:que 6                 107679        119
irq263: ix0:que 7                  72769         81
irq264: ix0:que 8                  30878         34
irq265: ix0:que 9                1002032       1115
irq266: ix0:que 10                 10967         12
irq267: ix0:link                       1          0


Note that all cores are loaded more or less evenly, but the result is
_worse_. The first reason for this is the mtx_lock which is acquired twice
on every lookup (once in in_matroute(), where it can probably be removed,
and once again in rtalloc1_fib(); the latter is addressed by andre@ in
r234650).

Additionally, although the ithreads are each bound to a single CPU, the
kernel "ix0 que" taskqueue threads are not in the stock setup. However, a
configuration with 5 queues and 5 kernel threads bound to different CPUs
provides the same bad results.

================= Test 3  =======================
Kernel: FreeBSD-8-S June 4 SVN, +merged ifaddrlock, stock drivers, stock
routing, no FLOWTABLE, no firewall


     packets  errs idrops      bytes    packets  errs      bytes colls
        580k   18k     0        38M       579k     0        37M     0
        581k   26k     0        39M       580k     0        37M     0
        580k   24k     0        39M       580k     0        37M     0
................
Enabling ipfw _increases_ performance a bit:

        604k     0     0        39M       604k     0        39M     0
        604k     0     0        39M       604k     0        39M     0
        582k   19k     0        38M       568k     0        37M     0
        527k   81k     0        39M       530k     0        34M     0
        605k    28     0        39M       605k     0        39M     0


================= Test 3.1  =======================

Same as test 3; the only difference is the following:
route add -net 10.100.1.160/27 -iface vlan11

              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        543k  879k     0        91M       544k     0        35M     0
        547k  870k     0        91M       545k     0        35M     0
        541k  870k     0        91M       539k     0        30M     0
        952k  565k     0        97M       962k     0        48M     0
        1.2M  228k     0        91M       1.2M     0        92M     0
        1.2M  226k     0        90M       1.1M     0        76M     0
        1.1M  228k     0        91M       1.2M     0        76M     0
        1.2M  233k     0        90M       1.2M     0        76M     0

================= Test 3.2  =======================

Same as test 3, splitting destination into 4 smaller rtes:
route add -net 10.100.1.128/28 -iface vlan11
route add -net 10.100.1.144/28 -iface vlan11
route add -net 10.100.1.160/28 -iface vlan11
route add -net 10.100.1.176/28 -iface vlan11
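These splits could equally be generated in a loop; a sketch that just prints the commands (with step=4, prefix=30 and 16 iterations it would presumably give the 16-way split used in test 3.3 below, which is not spelled out in the original):

```shell
# Emit the four /28 route-split commands for 10.100.1.128 - 10.100.1.191.
route_cmds=$(base=128; step=16; prefix=28
for i in 0 1 2 3; do
    printf 'route add -net 10.100.1.%d/%d -iface vlan11\n' \
        $((base + i * step)) "$prefix"
done)
printf '%s\n' "$route_cmds"
```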

              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        1.4M     0     0       106M       1.6M     0       106M     0
        1.8M     0     0       106M       1.6M     0        71M     0
        1.6M     0     0       106M       1.6M     0        71M     0
        1.6M     0     0        87M       1.6M     0        71M     0
        1.6M     0     0       126M       1.6M     0       212M     0

================= Test 3.3  =======================

Same as test 3, splitting destination into 16 smaller rtes:
              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        1.6M     0     0       118M       1.8M     0       118M     0
        2.0M     0     0       118M       1.8M     0       119M     0
        1.8M     0     0       119M       1.8M     0        79M     0
        1.8M     0     0       117M       1.8M     0       157M     0


================= Test 4  =======================
Kernel: FreeBSD-8-S June 4 SVN, stock drivers, routing patch 1, no
FLOWTABLE, no firewall

              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        1.8M     0     0       114M       1.9M     0       114M     0
        1.7M     0     0       114M       1.7M     0       114M     0
        1.8M     0     0       114M       1.8M     0       114M     0
        1.7M     0     0       114M       1.7M     0       114M     0
        1.8M     0     0       114M       1.8M     0        74M     0
        1.5M     0     0       114M       1.8M     0        74M     0
          2M     0     0       114M       1.8M     0       194M     0


Patch 1 totally eliminates the mtx_lock on the fastforwarding path, to get
an idea of how much performance we can achieve. The result is nearly the
same as in 3.3.

================= Test 4.1  =======================

Same as test 4, same traffic level, enabling the firewall with a single
allow rule (evaluating RLOCK performance).
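For reference, the single-rule ruleset used for this test would look something like the following (a sketch; the rule number 100 is arbitrary, and the exact rule used is not shown in the original):

```shell
# One wide-open allow rule: every packet pays only the ruleset read lock
# plus a single rule evaluation.
ipfw add 100 allow ip from any to any
```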

22:35 [0] test15# netstat -I ix0 -hw 1
              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        1.8M  149k     0       114M       1.6M     0       142M     0
        1.4M  148k     0        85M       1.6M     0       104M     0
        1.8M  149k     0       143M       1.6M     0       104M     0
        1.6M  151k     0       114M       1.6M     0       104M     0
        1.6M  151k     0       114M       1.6M     0       104M     0
        1.4M  152k     0       114M       1.6M     0       104M     0

I.e., roughly a 10% performance loss.


================= Test 4.2  =======================

Same as test4, playing with number of queues.

5queues, same traffic level
        1.5M  225k     0       114M       1.5M     0        99M     0

================= Test 4.3  =======================

Same as test 4, HT on, number of queues = 16

              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        2.4M     0     0       157M       2.4M     0       156M     0
        2.4M     0     0       156M       2.4M     0       157M     0

However, enabling the firewall immediately drops the rate to 1.9 Mpps, which
is nearly the same as 4.1 (and a complicated firewall ruleset would probably
saturate the HT cores much faster).

================= Test 4.4  =======================

Same as test 4, with the kernel "ix0 que" Tx threads bound to specific CPUs
(corresponding to their RX queues):
18:02 [0] test15# procstat -ak | grep ix0 | sort -nk 2
     12 100045 intr             irq256: ix0:que  <running>
      0 100046 kernel           ix0 que          <running>
     12 100047 intr             irq257: ix0:que  <running>
      0 100048 kernel           ix0 que          mi_switch sleepq_wait
msleep_spin taskqueue_thread_loop fork_exit fork_trampoline
     12 100049 intr             irq258: ix0:que  <running>
..

test15# for i in `jot 12 0`; do cpuset -l $i -t $((100046+2*$i)); done

Result:
              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        2.1M     0     0       139M         2M     0       193M     0
        2.1M     0     0       139M       2.3M     0       139M     0
        2.1M     0     0       139M       2.1M     0        85M     0
        2.1M     0     0       139M       2.1M     0       193M     0

Quite a considerable increase; however, this works better only for uniform
traffic distributions.


================= Test 5  =======================
Same as test 4, making the radix code use rmlock (r234648, r234649).

Result: 1.7 MPPS.


================= Test 6  =======================
Same as test 4 + FLOWTABLE

Result: 1.7 MPPS.


================= Test 7  =======================
Same as test 4, build with GCC 4.7

Result: No performance gain


Further investigations:

================= Test 8  =======================
Test 4 setup with a kernel built with LOCK_PROFILING.

17:46 [0] test15# sysctl debug.lock.prof.enable=1 ; sleep 2 ; sysctl debug.lock.prof.enable=0

        920k     0     0        59M       920k     0        59M     0
        875k     0     0        59M       920k     0        59M     0
        628k     0     0        39M       566k     0        45M     0
         79k  2.7M     0       186M        57k     0       6.5M     0
         71k  878k     0        61M        73k     0       4.0M     0
        891k  254k     0        72M       917k     0        54M     0
        920k     0     0        59M       920k     0        59M     0


When enabled, forwarding performance goes down to 60 kpps.
It was enabled for 2 seconds (so around 130k packets were actually
forwarded); the results are attached as a separate file. Several hundred
lock contentions in ixgbe, that's all.

================= Test 9  =======================
Same as test 4 setup with hwpmc.
Results attached.

================= Test 10  =======================
Kernel: Freebsd-9-S.
No major difference


Some (my) preliminary conclusions:
1) The rte mtx_lock should (and can) be eliminated from the stock kernel
(and it can be done more or less easily for in_matroute).
2) The rmlock vs rwlock performance difference is insignificant (maybe
because of 3)).
3) There is lock contention between the ixgbe taskqueue threads and
ithreads. I'm not sure the taskqueue threads are necessary in the packet
forwarding case (as opposed to traffic generation).


Maybe I'm missing something else? (L2 cache misses or other things.)

What else can I do to debug this further?



Relevant files:
http://static.ipfw.ru/files/fbsd10g/0001-no-rt-mutex.patch
http://static.ipfw.ru/files/fbsd10g/kernel.gprof.txt
http://static.ipfw.ru/files/fbsd10g/prof_stats.txt

============= CONFIGS ====================

sysctl.conf:
kern.ipc.maxsockbuf=33554432
net.inet.udp.maxdgram=65535
net.inet.udp.recvspace=16777216
net.inet.tcp.sendbuf_auto=0
net.inet.tcp.recvbuf_auto=0
net.inet.tcp.sendspace=16777216
net.inet.tcp.recvspace=16777216
net.inet.ip.maxfragsperpacket=64


kern.random.sys.harvest.ethernet=0
kern.random.sys.harvest.point_to_point=0
kern.random.sys.harvest.interrupt=0


net.inet.ip.forwarding=1
net.inet.ip.fastforwarding=1
net.inet.ip.redirect=0

hw.intr_storm_threshold=20000

loader.conf:
kern.ipc.nmbclusters="512000"
ixgbe_load="YES"
hw.ixgbe.rx_process_limit="300"
hw.ixgbe.nojumbobuf="1"
hw.ixgbe.max_loop="100"
hw.ixgbe.max_interrupt_rate="20000"
hw.ixgbe.num_queues="11"


hw.ixgbe.txd=4096
hw.ixgbe.rxd=4096

kern.hwpmc.nbuffers=2048

debug.debugger_on_panic=1
net.inet.ip.fw.default_to_accept=1


kernel:
cpu HAMMER

ident           CORE_RELENG_7
options         COMPAT_IA32

makeoptions     DEBUG=-g                # Build kernel with gdb(1) debug symbols

options         SCHED_ULE               # ULE scheduler
options         PREEMPTION              # Enable kernel thread preemption
options         INET                    # InterNETworking
options         INET6                   # IPv6 communications protocols
options         SCTP                    # Stream Control Transmission Protocol
options         FFS                     # Berkeley Fast Filesystem
options         SOFTUPDATES             # Enable FFS soft updates support
options         UFS_ACL                 # Support for access control lists
options         UFS_DIRHASH             # Improve performance on big directories
options         UFS_GJOURNAL            # Enable gjournal-based UFS journaling
options         MD_ROOT                 # MD is a potential root device
options         PROCFS                  # Process filesystem (requires PSEUDOFS)
options         PSEUDOFS                # Pseudo-filesystem framework
options         GEOM_PART_GPT           # GUID Partition Tables.
options         GEOM_LABEL              # Provides labelization
options         COMPAT_43TTY            # BSD 4.3 TTY compat [KEEP THIS!]
options         COMPAT_FREEBSD4         # Compatible with FreeBSD4
options         COMPAT_FREEBSD5         # Compatible with FreeBSD5
options         COMPAT_FREEBSD6         # Compatible with FreeBSD6
options         COMPAT_FREEBSD7         # Compatible with FreeBSD7
options         COMPAT_FREEBSD32
options         SCSI_DELAY=4000         # Delay (in ms) before probing SCSI
options         KTRACE                  # ktrace(1) support
options         STACK                   # stack(9) support
options         SYSVSHM                 # SYSV-style shared memory
options         SYSVMSG                 # SYSV-style message queues
options         SYSVSEM                 # SYSV-style semaphores
options         _KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions
options         KBD_INSTALL_CDEV        # install a CDEV entry in /dev
options         AUDIT                   # Security event auditing
options         HWPMC_HOOKS
options         GEOM_MIRROR
options         MROUTING
options         PRINTF_BUFR_SIZE=100

# To make an SMP kernel, the next two lines are needed
options         SMP                     # Symmetric MultiProcessor Kernel

# CPU frequency control
device          cpufreq

# Bus support.
device          acpi
device          pci

device          ada
device          ahci

# SCSI Controllers
device          ahd             # AHA39320/29320 and onboard AIC79xx devices
options         AHD_REG_PRETTY_PRINT    # Print register bitfields in debug
                                          # output.  Adds ~215k to driver.
device          mpt             # LSI-Logic MPT-Fusion
# SCSI peripherals
device          scbus           # SCSI bus (required for SCSI)
device          da              # Direct Access (disks)
device          pass            # Passthrough device (direct SCSI access)
device          ses             # SCSI Environmental Services (and SAF-TE)

# RAID controllers
device          mfi             # LSI MegaRAID SAS

# atkbdc0 controls both the keyboard and the PS/2 mouse
device          atkbdc          # AT keyboard controller
device          atkbd           # AT keyboard
device          psm             # PS/2 mouse

device          kbdmux          # keyboard multiplexer

device          vga             # VGA video card driver

device          splash          # Splash screen and screen saver support

# syscons is the default console driver, resembling an SCO console
device          sc

device          agp             # support several AGP chipsets

## Power management support (see NOTES for more options)
#device         apm
## Add suspend/resume support for the i8254.
#device         pmtimer

# Serial (COM) ports
#device         sio             # 8250, 16[45]50 based serial ports
device          uart            # Generic UART driver

# If you've got a "dumb" serial or parallel PCI card that is
# supported by the puc(4) glue driver, uncomment the following
# line to enable it (connects to sio, uart and/or ppc drivers):
#device         puc

# PCI Ethernet NICs.
device          em              # Intel PRO/1000 adapter Gigabit Ethernet Card
device          bce
#device         ixgb            # Intel PRO/10GbE Ethernet Card
#device         ixgbe

# PCI Ethernet NICs that use the common MII bus controller code.
# NOTE: Be sure to keep the 'device miibus' line in order to use these NICs!
device          miibus          # MII bus support

# Pseudo devices.
device          loop            # Network loopback
device          random          # Entropy device
device          ether           # Ethernet support
device          pty             # Pseudo-ttys (telnet etc)
device          md              # Memory "disks"
device          firmware        # firmware assist module
device          lagg

# The `bpf' device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
# Note that 'bpf' is required for DHCP.
device          bpf             # Berkeley packet filter

# USB support
device          uhci            # UHCI PCI->USB interface
device          ohci            # OHCI PCI->USB interface
device          ehci            # EHCI PCI->USB interface (USB 2.0)
device          usb             # USB Bus (required)
#device         udbp            # USB Double Bulk Pipe devices
device          uhid            # "Human Interface Devices"
device          ukbd            # Keyboard
device          umass           # Disks/Mass storage - Requires scbus and da
device          ums             # Mouse
# USB Serial devices
device          ucom            # Generic com ttys


options         INCLUDE_CONFIG_FILE

options         KDB
options         KDB_UNATTENDED
options         DDB
options         ALT_BREAK_TO_DEBUGGER

options         IPFIREWALL              #firewall
options         IPFIREWALL_FORWARD      #packet destination changes
options         IPFIREWALL_VERBOSE      #print information about
                                          # dropped packets
options         IPFIREWALL_VERBOSE_LIMIT=10000    #limit verbosity

# MRT support
options         ROUTETABLES=16

device          vlan                    #VLAN support

# Size of the kernel message buffer.  Should be N * pagesize.
options         MSGBUF_SIZE=4096000


options         SW_WATCHDOG
options         PANIC_REBOOT_WAIT_TIME=4

#
# Hardware watchdog timers:
#
# ichwd: Intel ICH watchdog timer
#
#device          ichwd

device          smbus
device          ichsmb
device          ipmi




--
WBR, Alexander

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-performance
To unsubscribe, send any mail to "[hidden email]"

Re: FreeBSD 10G forwarding performance @Intel

Luigi Rizzo-4
On Tue, Jul 03, 2012 at 08:11:14PM +0400, Alexander V. Chernikov wrote:
> Hello list!
>
> I'm quite stuck with bad forwarding performance on many FreeBSD boxes
> doing firewalling.
...
> In most cases system can forward no more than 700 (or 1400) kpps which
> is quite a bad number (Linux does, say, 5MPPs on nearly the same hardware).

among the many interesting tests you have run, I am curious
whether you have tried to remove the update of the counters on route
entries. They might be another severe contention point.

cheers
luigi

Re: FreeBSD 10G forwarding performance @Intel

Alexander V. Chernikov-4
On 03.07.2012 20:55, Luigi Rizzo wrote:

> On Tue, Jul 03, 2012 at 08:11:14PM +0400, Alexander V. Chernikov wrote:
>> Hello list!
>>
>> I'm quite stuck with bad forwarding performance on many FreeBSD boxes
>> doing firewalling.
> ...
>> In most cases system can forward no more than 700 (or 1400) kpps which
>> is quite a bad number (Linux does, say, 5MPPs on nearly the same hardware).
>
> among the many interesting tests you have run, i am curious
> if you have tried to remove the update of the counters on route
> entries. They might be another severe contention point.

21:47 [0] m@test15 netstat -I ix0 -w 1
             input          (ix0)           output
    packets  errs idrops      bytes    packets  errs      bytes colls
    1785514 52785     0  121318340    1784650     0  117874854     0
    1773126 52437     0  120701470    1772977     0  117584736     0
    1781948 52154     0  121060126    1778271     0   75029554     0
    1786169 52982     0  121451160    1787312     0  160967392     0
21:47 [0] test15# sysctl net.rt_count=0
net.rt_count: 1 -> 0
    1814465 22546     0  121302076    1814291     0   76860092     0
    1817769 14272     0  120984922    1816254     0  163643534     0
    1815311 13113     0  120831970    1815340     0  120159118     0
    1814059 13698     0  120799132    1813738     0  120172092     0
    1818030 13513     0  120960140    1814578     0  120332662     0
    1814169 14351     0  120836182    1814003     0  120164310     0

Thanks, another good point. I forgot to merge this option from andre's
patch.

Another 30-40-50kpps to win.


+u_int rt_count  = 1;
+SYSCTL_INT(_net, OID_AUTO, rt_count, CTLFLAG_RW, &rt_count, 1, "");

@@ -601,17 +625,20 @@ passout:
         if (error != 0)
                 IPSTAT_INC(ips_odropped);
         else {
-               ro.ro_rt->rt_rmx.rmx_pksent++;
+               if (rt_count)
+                       ro.ro_rt->rt_rmx.rmx_pksent++;
                 IPSTAT_INC(ips_forward);
                 IPSTAT_INC(ips_fastforward);


>
> cheers
> luigi
>


--
WBR, Alexander

Re: FreeBSD 10G forwarding performance @Intel

Luigi Rizzo-4
On Tue, Jul 03, 2012 at 09:37:38PM +0400, Alexander V. Chernikov wrote:
...
> Thanks, another good point. I forgot to merge this option from andre's
> patch.
>
> Another 30-40-50kpps to win.

not much gain though.
What about the other IPSTAT_INC counters ?
I think the IPSTAT_INC macros were introduced (by rwatson ?)
following a discussion on how to make the counters per-cpu
and avoid the contention on cache lines.
But they are still implemented as a single instance,
and neither volatile nor atomic, so it is not even clear
that they can give reliable results, let alone the fact
that you are likely to get some cache misses.

the relevant macro is in ip_var.h.

Cheers
luigi

>
> +u_int rt_count  = 1;
> +SYSCTL_INT(_net, OID_AUTO, rt_count, CTLFLAG_RW, &rt_count, 1, "");
>
> @@ -601,17 +625,20 @@ passout:
>         if (error != 0)
>                 IPSTAT_INC(ips_odropped);
>         else {
> -               ro.ro_rt->rt_rmx.rmx_pksent++;
> +               if (rt_count)
> +                       ro.ro_rt->rt_rmx.rmx_pksent++;
>                 IPSTAT_INC(ips_forward);
>                 IPSTAT_INC(ips_fastforward);
>
>
> >
> >cheers
> >luigi
> >
>
>
> --
> WBR, Alexander

Re: FreeBSD 10G forwarding performance @Intel

Alexander V. Chernikov-4
On 04.07.2012 00:27, Luigi Rizzo wrote:
> On Tue, Jul 03, 2012 at 09:37:38PM +0400, Alexander V. Chernikov wrote:
> ...
>> Thanks, another good point. I forgot to merge this option from andre's
>> patch.
>>
>> Another 30-40-50kpps to win.
>
> not much gain though.
> What about the other IPSTAT_INC counters ?
Well, we should then remove all such counters (total, forwarded) and
per-interface statistics (at least for forwarded packets).
> I think the IPSTAT_INC macros were introduced (by rwatson ?)
> following a discussion on how to make the counters per-cpu
> and avoid the contention on cache lines.
> But they are still implemented as a single instance,
> and neither volatile nor atomic, so it is not even clear
> that they can give reliable results, let alone the fact
> that you are likely to get some cache misses.
>
> the relevant macro is in ip_var.h.
Hm. This seems to be just a per-vnet structure instance.
We've got some more real DPCPU stuff (sys/pcpu.h && kern/subr_pcpu.c)
which can be used for the global ipstat structure; however, since it is
allocated from a single area with no possibility to free, we can't use it
for per-interface counters.

I'll try to run tests without any possibly contested counters and report
the results on Thursday.

>
> Cheers
> luigi
>
>>
>> +u_int rt_count  = 1;
>> +SYSCTL_INT(_net, OID_AUTO, rt_count, CTLFLAG_RW,&rt_count, 1, "");
>>
>> @@ -601,17 +625,20 @@ passout:
>>          if (error != 0)
>>                  IPSTAT_INC(ips_odropped);
>>          else {
>> -               ro.ro_rt->rt_rmx.rmx_pksent++;
>> +               if (rt_count)
>> +                       ro.ro_rt->rt_rmx.rmx_pksent++;
>>                  IPSTAT_INC(ips_forward);
>>                  IPSTAT_INC(ips_fastforward);
>>
>>
>>>
>>> cheers
>>> luigi
>>>
>>
>>
>> --
>> WBR, Alexander


--
WBR, Alexander

Re: FreeBSD 10G forwarding performance @Intel

Luigi Rizzo-4
On Wed, Jul 04, 2012 at 12:31:56AM +0400, Alexander V. Chernikov wrote:

> On 04.07.2012 00:27, Luigi Rizzo wrote:
> >On Tue, Jul 03, 2012 at 09:37:38PM +0400, Alexander V. Chernikov wrote:
> >...
> >>Thanks, another good point. I forgot to merge this option from andre's
> >>patch.
> >>
> >>Another 30-40-50kpps to win.
> >
> >not much gain though.
> >What about the other IPSTAT_INC counters ?
> Well, we should then remove all such counters (total, forwarded) and
> per-interface statistics (at least for forwarded packets).

I am not saying to remove them for good, but at least have a
try at what we can hope to save by implementing them
on a per-CPU basis.

There is a chance that one will not
see big gains until the majority of such shared counters
are fixed (there are probably 3-4 at least on the non-error
path for forwarded packets), plus the per-interface ones
that are not even wrapped in macros (see if_ethersubr.c)

> >I think the IPSTAT_INC macros were introduced (by rwatson ?)
> >following a discussion on how to make the counters per-cpu
> >and avoid the contention on cache lines.
> >But they are still implemented as a single instance,
> >and neither volatile nor atomic, so it is not even clear
> >that they can give reliable results, let alone the fact
> >that you are likely to get some cache misses.
> >
> >the relevant macro is in ip_var.h.
> Hm. This seems to be just per-vnet structure instance.

yes, but essentially they are still shared by all threads within a vnet
(besides, you probably ran your tests in the main instance)

> We've got some more real DPCPU stuff (sys/pcpu.h && kern/subr_pcpu.c)
> which can be used for global ipstat structure, however since it is
> allocated from single area without possibility to free we can't use it
> for per-interface counters.

yes, those should be moved to a private, dynamically allocated
region of the ifnet (the number of CPUs is known at driver init
time, I hope). But again, for a quick test, disabling the
if_{i|o}{bytes|packets} updates should do the job, if you can count
the received rate by some other means.

> I'll try to run tests without any possibly contested counters and report
> the results on Thursday.

great, that would be really useful info.

cheers
luigi