With two separate TCP hash tables, each one becomes smaller. When
we remove the exclusive net lock from TCP, contention on the internet
PCB table mutex will be reduced. UDP has been split earlier into
IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with
assertions.
OK mvs@
Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be bsd visible
for userland as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embedded there.
OK claudio@
tcp6_ctlinput() cast a constant sockaddr_in6 to non-const sockaddr.
sa6_src may be &sa6_any which lives in read-only data section.
Better pass down the const addresses to syn_cache_lookup(). They
are needed for hash lookup and are not modified.
OK mvs@
Introduce global TCP SYN cache mutex. Divide timer function into
parts protected by mutex and sending with netlock. Split the flags
field into dynamic flags protected by mutex and fixed flags set during
initialization. Document whether fields of struct syn_cache are
protected by net lock or mutex.
input and OK sashan@
TCP syn cache variable scs_use is basically counting packet insertions
into syn cache. Prefer type long to exclude overflow on fast
machines. Due to counting downwards from a limit, it can become
negative. Copy it out as tcps_sc_uses_left via sysctl, and print
it as signed long long integer.
OK mvs@
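
The counting described above can be sketched as follows. This is only
an illustration, not the kernel code: the struct and function names
with the "sketch" prefix or suffix are hypothetical, and only the
scs_use field name is taken from the text.

```c
#include <stdio.h>

/* Hypothetical stand-in for a SYN cache set; not the kernel struct. */
struct syn_cache_set_sketch {
	long	scs_use;	/* insertions left, counted down from a limit */
};

/* Count one insertion. The value is allowed to drop below zero. */
static void
sketch_count_insert(struct syn_cache_set_sketch *set)
{
	set->scs_use--;
}

/* Format the counter as a signed 64 bit statistic would be printed. */
static int
sketch_format_uses_left(const struct syn_cache_set_sketch *set,
    char *buf, size_t len)
{
	return snprintf(buf, len, "%lld", (long long)set->scs_use);
}
```

Starting from a limit of 2, three insertions leave the counter at -1,
which is why a signed format is needed when printing it.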
The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and panics sometimes with pool_do_get: syncache free
list modified. Add a reference counter for timeout and list of syn
cache entries. Currently list refcount is not strictly necessary
due to exclusive netlock, but will be needed when we continue
unlocking.
Checking timeout_initialized() is not MP friendly, better do proper
initialization during object allocation. Refcount in btrace helps
to find leaks.
bug reported and fix tested by Peter J. Philipp
OK claudio@
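
A minimal single-threaded sketch of the reference counting idea,
assuming one reference for the list and one for a pending timeout.
All names are hypothetical and the counter is a plain integer; a real
MP-safe version would need atomic operations.

```c
#include <stdlib.h>

static int sketch_frees;	/* how often the last reference was dropped */

struct sc_sketch {
	unsigned int	sc_refcnt;	/* one per holder: list, timeout */
};

/* Allocate an entry; the counter is initialized here, not lazily. */
static struct sc_sketch *
sketch_alloc(void)
{
	struct sc_sketch *sc = calloc(1, sizeof(*sc));

	if (sc != NULL)
		sc->sc_refcnt = 1;	/* reference held by the caller */
	return sc;
}

/* Take an additional reference, e.g. when a timeout is scheduled. */
static struct sc_sketch *
sketch_take(struct sc_sketch *sc)
{
	sc->sc_refcnt++;
	return sc;
}

/* Drop a reference; the entry is freed only when the last one is gone. */
static void
sketch_rele(struct sc_sketch *sc)
{
	if (--sc->sc_refcnt == 0) {
		sketch_frees++;
		free(sc);
	}
}
```

With this scheme, removing the entry from the list while a timeout
still holds a reference does not free the memory under the timeout.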
After changing the tcp_now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.
As the timestamp option is 32 bit in the TCP protocol, use the lower
32 bit there. There are casts to 32 bits that should behave correctly.
Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.
OK yasuoka@
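
The arithmetic above can be checked with a small sketch. The function
names are hypothetical; the code only shows truncation to the lower
32 bits and wraparound-safe subtraction, not the actual kernel code.

```c
#include <stdint.h>

/* Lower 32 bits of a 64 bit millisecond counter, as the option carries. */
static uint32_t
sketch_ts_val(uint64_t now_ms)
{
	return (uint32_t)now_ms;
}

/* Elapsed ms between two 32 bit timestamps; unsigned math wraps cleanly. */
static uint32_t
sketch_ts_elapsed(uint32_t later, uint32_t earlier)
{
	return later - earlier;
}

/* Days until a 32 bit ms counter wraps: 2^32 ms / 86400000 ms per day. */
static double
sketch_wrap_days(void)
{
	return 4294967296.0 / 86400000.0;
}

/* Years covered by 2^63 ms, using 365.25 days per year. */
static double
sketch_wrap_years_63bit(void)
{
	return 9223372036854775808.0 / (86400000.0 * 365.25);
}
```

The first helper yields roughly 49.7 days and the second roughly
2.9*10^8 years, matching the numbers in the message.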
If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.
tested by jan@; OK claudio@ jan@
With a lot of tweaks, improvements and testing from bluhm.
Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.
ok bluhm
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@
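
A sketch of why the pseudo header sum can be precalculated without
the length, assuming IPv4 and plain host-order numbers for brevity.
Function names are hypothetical, and byte order handling and the mbuf
field ph_mss are left out.

```c
#include <stdint.h>

/* Add both 16 bit halves of a 32 bit value to a running sum. */
static uint32_t
sketch_add32(uint32_t sum, uint32_t v)
{
	return sum + (v >> 16) + (v & 0xffff);
}

/* Fold carries back in until the sum fits into 16 bits. */
static uint16_t
sketch_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Sum over source, destination and protocol, deliberately no length. */
static uint16_t
sketch_phdr_partial(uint32_t src, uint32_t dst, uint8_t proto)
{
	uint32_t sum = 0;

	sum = sketch_add32(sum, src);
	sum = sketch_add32(sum, dst);
	sum += proto;
	return sketch_fold(sum);
}

/* Per generated segment: add the length once it is finally known. */
static uint16_t
sketch_phdr_finish(uint16_t partial, uint16_t seglen)
{
	return sketch_fold((uint32_t)partial + seglen);
}

/* Reference: the same sum computed in one go, including the length. */
static uint16_t
sketch_phdr_full(uint32_t src, uint32_t dst, uint8_t proto, uint16_t len)
{
	uint32_t sum = 0;

	sum = sketch_add32(sum, src);
	sum = sketch_add32(sum, dst);
	sum += proto;
	sum += len;
	return sketch_fold(sum);
}
```

Because one's complement addition is associative, finishing the
partial sum with a segment length gives the same value as the full
computation for that segment.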
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec need
the software fallback in general.
For performance comparison or to work around possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@
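
The chopping to maximum segment size can be sketched with plain
lengths. This only shows the arithmetic under the assumption that
every segment except the last carries exactly mss bytes; the names
are hypothetical and no mbuf or header handling is shown.

```c
#include <stddef.h>

/* Number of segments needed for paylen bytes with at most mss each. */
static size_t
sketch_tso_segments(size_t paylen, size_t mss)
{
	if (mss == 0)
		return 0;
	return (paylen + mss - 1) / mss;
}

/* Payload length of segment idx (0-based); the last one may be short. */
static size_t
sketch_tso_seglen(size_t paylen, size_t mss, size_t idx)
{
	size_t off = idx * mss;

	if (mss == 0 || off >= paylen)
		return 0;
	return (paylen - off < mss) ? paylen - off : mss;
}
```

For a 4000 byte payload and an MSS of 1460 this gives three segments
of 1460, 1460 and 1080 bytes.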
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.
ok claudio
The tcp timer is not supposed to run during suspend, but getnsecuptime()
does. Because of this, sessions with TCP_KEEPALIVE enabled are reset after
a few hours of sleep.
Problem noticed by mlarkin@, investigation by yasuoka@, additional testing jca@
OK yasuoka@ jca@ cheloha@
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (e.g. Windows) or dropped by some
firewalls/routers, in such cases the window size had been fixed at
16KB, which severely limits throughput on high latency networks.
Also change "tcp_now" from a 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.
tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio
optional.
We have no interest in the pru_abort() return value. We call it only from
soabort() which is a dummy pru_abort() wrapper and has no return value.
Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.
ok guenther@
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@
on pru_rcvd() return value.
Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has the PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.
ok guenther@ bluhm@
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@
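
A user space sketch of reading a shared clock once per function,
assuming C11 atomics in place of whatever the kernel uses. All names
are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t sketch_now_ms;	/* advanced by a timer */

struct sketch_conn {
	uint64_t	t_rcvtime;	/* time of last received segment */
	uint64_t	t_checked;	/* time of the last idle check */
};

/* Advance the clock; may run concurrently with readers. */
static void
sketch_tick(uint64_t step_ms)
{
	atomic_fetch_add(&sketch_now_ms, step_ms);
}

/* All decisions in this function use one snapshot of the clock. */
static int
sketch_idle_expired(struct sketch_conn *c, uint64_t limit_ms)
{
	uint64_t now = atomic_load(&sketch_now_ms);	/* single read */

	c->t_checked = now;
	return (now - c->t_rcvtime) >= limit_ms;
}
```

Using the local copy for both the stored value and the comparison
keeps them consistent even if the clock advances in between.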
Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.
The key management and route domain sockets return EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of making
pru_sockaddr handler optional and return EOPNOTSUPP.
ok bluhm@
The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.
Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.
ok guenther@ bluhm@
PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.
Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.
ok bluhm@
We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT is used to destroy them on
listening socket destruction.
Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with a separate diff, and this time PRU_ABORT
requests were converted as is.
Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.
ok bluhm@
The former PRU_SEND error path of gre_usrreq() had a `control' mbuf(9)
leak. It was fixed in the new gre_send().
The former pfkeyv2_send() was renamed to pfkeyv2_dosend().
ok bluhm@
For the protocols which don't support the request, leave the handler NULL.
Do the NULL check within the corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.
ok bluhm@ guenther@
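
The wrapper pattern can be sketched like this. The structs and the
choice of a listen request are hypothetical and only serve as an
illustration; the point is the NULL check and the EOPNOTSUPP return.

```c
#include <errno.h>
#include <stddef.h>

struct sketch_socket;

/* Hypothetical handler table; unsupported requests stay NULL. */
struct sketch_usrreqs {
	int	(*pru_listen)(struct sketch_socket *);
};

struct sketch_socket {
	const struct sketch_usrreqs	*so_usrreqs;
	int				 so_listening;
};

/* Wrapper: callers never have to check the handler themselves. */
static int
sketch_pru_listen(struct sketch_socket *so)
{
	if (so->so_usrreqs->pru_listen == NULL)
		return EOPNOTSUPP;
	return (*so->so_usrreqs->pru_listen)(so);
}

/* A protocol that supports the request. */
static int
sketch_stream_listen(struct sketch_socket *so)
{
	so->so_listening = 1;
	return 0;
}

static const struct sketch_usrreqs sketch_stream_usrreqs = {
	.pru_listen = sketch_stream_listen,
};

/* A protocol that does not: the handler is simply left NULL. */
static const struct sketch_usrreqs sketch_dgram_usrreqs = {
	.pru_listen = NULL,
};
```

Protocols without support need no dummy function, and the error code
is produced in exactly one place.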
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.
Based on reverted diff from guenther@.
ok bluhm@
TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow attacking a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.
Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.
ok mvs@ bluhm@
This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.
ok deraadt@
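
A bounds check of this kind might look like the following sketch.
The descriptor struct and function are hypothetical and not the
sysctl framework itself; the 0..1 range for an on/off variable is an
assumption.

```c
#include <errno.h>

/* Hypothetical descriptor: a variable plus its allowed range. */
struct sketch_bounded {
	int	*valp;
	int	 minimum;
	int	 maximum;
};

/* Store newval only if it lies within the bounds, else EINVAL. */
static int
sketch_set_bounded(const struct sketch_bounded *b, int newval)
{
	if (newval < b->minimum || newval > b->maximum)
		return EINVAL;
	*b->valp = newval;
	return 0;
}
```

An out of range write leaves the old value untouched, so the rest of
the stack never sees an unexpected setting.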
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the other timers.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
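
The per-timer flag can be sketched as a bitmask that the handler
checks before doing any work. Names and the number of timers are
hypothetical; locking is omitted.

```c
#include <stdint.h>

#define SKETCH_NTIMERS	4

struct sketch_tcpcb {
	uint32_t	t_armed;			/* bit n: timer n armed */
	int		t_fired[SKETCH_NTIMERS];	/* work done per timer */
};

/* Arm timer n: set its flag (scheduling itself is not shown). */
static void
sketch_timer_arm(struct sketch_tcpcb *tp, int n)
{
	tp->t_armed |= 1U << n;
}

/* Cancel timer n: clear its flag even if the timeout may still run. */
static void
sketch_timer_cancel(struct sketch_tcpcb *tp, int n)
{
	tp->t_armed &= ~(1U << n);
}

/* Timeout handler: may be invoked after cancel; the flag decides. */
static void
sketch_timer_run(struct sketch_tcpcb *tp, int n)
{
	if ((tp->t_armed & (1U << n)) == 0)
		return;		/* canceled after it was scheduled */
	tp->t_armed &= ~(1U << n);
	tp->t_fired[n]++;
}
```

A handler that fires late after a cancel returns immediately, so
deleting the timeout no longer has to be perfectly timed.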
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@