SKB Segmentation in the Linux Kernel

Generic Segmentation Offload (GSO) usually runs before any device-specific code, which means device drivers do not have to be aware of segmentation of any kind. Basically, the segmentation may happen before calling dev_queue_xmit().

One important thing to notice is that a GSO segmentation function operates above the L2 protocol, because the L2 header size remains unchanged and the header is attached to each segment. The segmentation function is selected by the Ethertype, so we may have GSO for MPLS and for IPv4/TCP, for example.
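To make the "L2 header is attached to each segment" point concrete, here is a minimal user-space sketch in C. It is not kernel code: struct frame, l2_segment() and SKETCH_ETH_HLEN are names I made up for illustration, and the sketch ignores the L3/L4 header fix-ups (lengths, checksums, sequence numbers) that the real protocol callbacks perform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_ETH_HLEN 14      /* dst MAC + src MAC + Ethertype */
#define SKETCH_MAX_FRAME 2048   /* arbitrary buffer size for this sketch */

/* Hypothetical container for one output frame (not a kernel type). */
struct frame {
        unsigned char data[SKETCH_MAX_FRAME];
        size_t len;
};

/*
 * Split `payload` into chunks of at most `seg_size` bytes and prepend an
 * unchanged copy of the L2 header to every chunk.
 * Returns the number of frames written, or -1 on bad arguments or when
 * `out` is too small.
 */
static int l2_segment(const unsigned char *l2_hdr,
                      const unsigned char *payload, size_t payload_len,
                      size_t seg_size, struct frame *out, int max_out)
{
        size_t off = 0;
        int n = 0;

        if (seg_size == 0 || seg_size + SKETCH_ETH_HLEN > SKETCH_MAX_FRAME)
                return -1;

        while (off < payload_len) {
                size_t chunk = payload_len - off;

                if (chunk > seg_size)
                        chunk = seg_size;
                if (n == max_out)
                        return -1;

                /* Same L2 header, same size, on every segment. */
                memcpy(out[n].data, l2_hdr, SKETCH_ETH_HLEN);
                memcpy(out[n].data + SKETCH_ETH_HLEN, payload + off, chunk);
                out[n].len = SKETCH_ETH_HLEN + chunk;

                off += chunk;
                n++;
        }
        return n;
}
```

With a 3000-byte payload and a segment size of 1448, this yields three frames whose payloads are 1448, 1448 and 104 bytes, each starting with an identical 14-byte header.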

/**
 *      __skb_gso_segment - Perform segmentation on skb.
 *      @skb: buffer to segment
 *      @features: features for the output path (see dev->features)
 *      @tx_path: whether it is called in TX path
 *
 *      This function segments the given skb and returns a list of segments
 *
 *      It may return NULL if the skb requires no segmentation.  This is
 *      only possible when GSO is used for verifying header integrity.
 */
struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
                                  netdev_features_t features, bool tx_path)
{
         if (unlikely(skb_needs_check(skb, tx_path))) {
                int err;
 
                skb_warn_bad_offload(skb);
 
                if (skb_header_cloned(skb) &&
                     (err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
                        return ERR_PTR(err);
         }
 
         SKB_GSO_CB(skb)->mac_offset = skb_headroom(skb);
         SKB_GSO_CB(skb)->encap_level = 0;
 
         skb_reset_mac_header(skb);
         skb_reset_mac_len(skb);
 
         return skb_mac_gso_segment(skb, features);
}

skb_mac_gso_segment() then invokes the protocol-specific segmentation offload function for the protocol above L2.

/**
 *      skb_mac_gso_segment - mac layer segmentation handler.
 *      @skb: buffer to segment
 *      @features: features for the output path (see dev->features)
 */
struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb,
                                     netdev_features_t features)
{
        struct sk_buff *segs = ERR_PTR(-EPROTONOSUPPORT);
        struct packet_offload *ptype;
        int vlan_depth = skb->mac_len;
        __be16 type = skb_network_protocol(skb, &vlan_depth);

        if (unlikely(!type))
                return ERR_PTR(-EINVAL);

        __skb_pull(skb, vlan_depth);

        rcu_read_lock();
        list_for_each_entry_rcu(ptype, &offload_base, list) {
                if (ptype->type == type && ptype->callbacks.gso_segment) {
                        if (unlikely(skb->ip_summed != CHECKSUM_PARTIAL)) {
                                int err;

                                err = ptype->callbacks.gso_send_check(skb);
                                segs = ERR_PTR(err);
                                if (err || skb_gso_ok(skb, features))
                                        break;
                                __skb_push(skb, (skb->data -
                                                 skb_network_header(skb)));
                        }
                        segs = ptype->callbacks.gso_segment(skb, features);
                        break;
                }
        }
        rcu_read_unlock();

        __skb_push(skb, skb->data - skb_mac_header(skb));

        return segs;
}

ptype->callbacks.gso_segment() is the callback function for a specific type of protocol, such as MPLS. Suppose ‘ptype->type’ matches ‘type’, where ‘type’ is the MPLS Unicast Ethertype (0x8847); then mpls_gso_segment() is the callback function that gets called.

GSO: Packet Offload Callback Function

Here is some actual code that shows how to write a Linux kernel module for GSO (Generic Segmentation Offload). Basically, a callback function that performs the segmentation for a given type (e.g. MPLS, IPv4 fragmentation, ...) needs to be registered with the kernel stack. For this purpose, the module has to instantiate a packet_offload structure, because this is where the pointer to the callback function lives.

/**
 *      dev_add_offload - register offload handlers
 *      @po: protocol offload declaration
 *
 *      Add protocol offload handlers to the networking stack. The passed
 *      &proto_offload is linked into kernel lists and may not be freed until
 *      it has been removed from the kernel lists.
 *
 *      This call does not sleep therefore it can not
 *      guarantee all CPU's that are in middle of receiving packets
 *      will see the new offload handlers (until the next received packet).
 */
void dev_add_offload(struct packet_offload *po)
{
        struct list_head *head = &offload_base;

        spin_lock(&offload_lock);
        list_add_rcu(&po->list, head);
        spin_unlock(&offload_lock);
}
EXPORT_SYMBOL(dev_add_offload);
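The register-then-look-up pattern used by dev_add_offload() and skb_mac_gso_segment() can be modelled in user space. The sketch below is only an illustration under my own assumptions: offload_entry, sketch_add_offload() and sketch_dispatch() are hypothetical names, a plain singly linked list stands in for the RCU-protected kernel list, there is no locking, the type is kept in host byte order, and the callbacks just return a tag so we can see which one ran.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_P_IP      0x0800  /* IPv4 Ethertype */
#define SKETCH_ETH_P_MPLS_UC 0x8847  /* MPLS unicast Ethertype */

#define HANDLER_IP   1
#define HANDLER_MPLS 2

/* Hypothetical callback type: returns a tag identifying the handler. */
typedef int (*segment_fn)(int pkt_len);

/* Simplified stand-in for struct packet_offload. */
struct offload_entry {
        uint16_t type;                  /* Ethertype this entry handles */
        segment_fn gso_segment;         /* segmentation callback */
        struct offload_entry *next;
};

static struct offload_entry *offload_head;

/* Like the dev_add_offload() shown above: link the entry at the list head. */
static void sketch_add_offload(struct offload_entry *po)
{
        po->next = offload_head;
        offload_head = po;
}

/* Like the loop in skb_mac_gso_segment(): first matching type wins. */
static int sketch_dispatch(uint16_t type, int pkt_len)
{
        struct offload_entry *p;

        for (p = offload_head; p; p = p->next)
                if (p->type == type && p->gso_segment)
                        return p->gso_segment(pkt_len);

        return -1;      /* stands in for ERR_PTR(-EPROTONOSUPPORT) */
}

static int sketch_ip_segment(int pkt_len)
{
        (void)pkt_len;
        return HANDLER_IP;
}

static int sketch_mpls_segment(int pkt_len)
{
        (void)pkt_len;
        return HANDLER_MPLS;
}
```

Registering one entry for SKETCH_ETH_P_IP and one for SKETCH_ETH_P_MPLS_UC and then dispatching on 0x8847 reaches sketch_mpls_segment(), while an unregistered Ethertype falls through to the error value.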

Below is a code snippet from the actual kernel module for MPLS GSO. This is the callback function which performs MPLS segmentation.

static struct sk_buff *mpls_gso_segment(struct sk_buff *skb,
                                        netdev_features_t features)
{
         struct sk_buff *segs = ERR_PTR(-EINVAL);
         netdev_features_t mpls_features;
         __be16 mpls_protocol;
 
         if (unlikely(skb_shinfo(skb)->gso_type &
                                 ~(SKB_GSO_TCPV4 |
                                   SKB_GSO_TCPV6 |
                                   SKB_GSO_UDP |
                                   SKB_GSO_DODGY |
                                   SKB_GSO_TCP_ECN)))
                 goto out;
 
         /* Setup inner SKB. */
         mpls_protocol = skb->protocol;
         skb->protocol = skb->inner_protocol;
 
         /* Push back the mac header that skb_mac_gso_segment() has pulled.
          * It will be re-pulled by the call to skb_mac_gso_segment() below
          */
         __skb_push(skb, skb->mac_len);
 
         /* Segment inner packet. */
         mpls_features = skb->dev->mpls_features & features;
         segs = skb_mac_gso_segment(skb, mpls_features);
 
 
         /* Restore outer protocol. */
         skb->protocol = mpls_protocol;
 
         /* Re-pull the mac header that the call to skb_mac_gso_segment()
          * above pulled.  It will be re-pushed after returning
          * skb_mac_gso_segment(), an indirect caller of this function.
          */
         __skb_pull(skb, skb->data - skb_mac_header(skb));
out:
         return segs;
}
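The push/pull bookkeeping in the comments above is easier to follow with a toy model. This is a user-space sketch, not the kernel's sk_buff: struct mini_skb and the mini_* helpers are names I invented, offsets replace pointers, and the asserts are only a stand-in for sanity checks.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the few sk_buff fields involved (hypothetical type). */
struct mini_skb {
        unsigned char buf[256];
        size_t data;            /* offset in buf where the data starts now */
        size_t len;             /* bytes from `data` to the end of the packet */
        size_t mac_header;      /* offset in buf of the MAC header */
        size_t mac_len;         /* length of the MAC header */
};

/* Same idea as __skb_pull(): hide `n` bytes at the front of the data. */
static void mini_pull(struct mini_skb *skb, size_t n)
{
        assert(n <= skb->len);
        skb->data += n;
        skb->len -= n;
}

/* Same idea as __skb_push(): expose `n` bytes in front of the data again. */
static void mini_push(struct mini_skb *skb, size_t n)
{
        assert(n <= skb->data);
        skb->data -= n;
        skb->len += n;
}

/* Mirrors __skb_push(skb, skb->data - skb_mac_header(skb)): whatever was
 * pulled so far, move `data` back to the start of the MAC header. */
static void mini_restore_mac(struct mini_skb *skb)
{
        mini_push(skb, skb->data - skb->mac_header);
}
```

Starting with data at the MAC header, a pull of mac_len followed by a push of mac_len leaves the buffer exactly where it began, which is why the callback can undo the caller's pull before recursing and redo it afterwards.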

Make a packet_offload variable:

static struct packet_offload mpls_uc_offload __read_mostly = {
        .type = cpu_to_be16(ETH_P_MPLS_UC),
        .priority = 15,
        .callbacks = {
                .gso_segment    =       mpls_gso_segment,
        },
};

Register the callback in the kernel stack:

dev_add_offload(&mpls_uc_offload);

GSO in FreeBSD System

Text extracted from the GSO patch for FreeBSD.

“The use of large frames makes network communication much less demanding for the CPU. Yet, backward compatibility and slow links require the use of 1500 byte or smaller frames. Modern NICs with hardware TCP segmentation offloading (TSO) address this problem. However, a generic software version (GSO) provided by the OS has reason to exist, for use on paths with no suitable hardware, such as between virtual machines or with older or buggy NICs.

Much of the advantage of TSO comes from crossing the network stack only once per (large) segment instead of once per 1500-byte frame. GSO does the same both for segmentation (TCP) and fragmentation (UDP) by doing these operations as late as possible. Ideally, this could be done within the device driver, but that would require modifications to all drivers. A more convenient, similarly effective approach is to segment just before the packet is passed to the driver (in ether_output()).

Our preliminary implementation supports TCP and UDP on IPv4/IPv6; it only intercepts packets larger than the MTU (others are left unchanged), and only when GSO is marked as enabled for the interface.

Segments larger than the MTU are not split in tcp_output(), udp_output(), or ip_output(), but marked with a flag (contained in m_pkthdr.csum_flags), which is processed by ether_output() just before calling the device driver.

ether_output(), through gso_dispatch(), splits the large frame as needed, creating headers and possibly doing checksums if not supported by the hardware.”

Enabling and disabling GSO

A nice tutorial on Generic Segmentation Offloading can be found here. Below is a shortened extract from it.

“[…]”

To further illustrate segmentation offloading, and how to control it in Linux, consider the following tests performed on two Ubuntu computers, basil and ginger, connected on an Ethernet LAN. On basil (which has IP address 10.10.1.22) netcat in server mode is used to receive data:

sgordon@basil$ nc -l 5001

On ginger netcat in client mode is used to send 10,000 Bytes of data (stored in a file) to the server.

sgordon@ginger$ nc -p 5002 10.10.1.22 5001 < 10000bytes.txt 

tcpdump is used to see the captured IP packets, and in particular the size of the TCP segments. I could have used Wireshark, but the text output of tcpdump is easier to include in this page. ethtool is used to view and change the status of segmentation offloading (in this example, generic segmentation offload or GSO).

First note that ethtool shows us that generic segmentation offload is on.

sgordon@ginger$ sudo ethtool -k eth0
Offload parameters for eth0:
Cannot get device flags: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
large receive offload: off

Now, after running the netcat client, let's see the output from tcpdump (for clarity I have omitted the option fields from selected TCP segments):

sgordon@ginger$ sudo tcpdump -i eth0 -n 'not port 22'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
18:30:24.899687 IP 192.168.1.2.5002 > 10.10.1.22.5001: S 679249855:679249855(0) win 5840 
18:30:24.900583 IP 10.10.1.22.5001 > 192.168.1.2.5002: S 1420594303:1420594303(0) ack 679249856 win 5792 
18:30:24.900612 IP 192.168.1.2.5002 > 10.10.1.22.5001: . ack 1 win 92
18:30:24.900713 IP 192.168.1.2.5002 > 10.10.1.22.5001: . 1:2897(2896) ack 1 win 92
18:30:24.900735 IP 192.168.1.2.5002 > 10.10.1.22.5001: . 2897:4345(1448) ack 1 win 92 
18:30:24.902575 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 1449 win 68 
18:30:24.902591 IP 192.168.1.2.5002 > 10.10.1.22.5001: P 4345:7241(2896) ack 1 win 92 
18:30:24.903597 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 2897 win 91 
18:30:24.903607 IP 192.168.1.2.5002 > 10.10.1.22.5001: . 7241:8689(1448) ack 1 win 92 
18:30:24.903613 IP 192.168.1.2.5002 > 10.10.1.22.5001: P 8689:10001(1312) ack 1 win 92 
18:30:24.903617 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 4345 win 114 
18:30:24.905573 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 5793 win 136 
18:30:24.905587 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 7241 win 159 
18:30:24.906628 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 8689 win 181 
18:30:24.906637 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 10001 win 204 

Each line shows a captured packet. The TCP segments containing data can be identified by the sequence numbers. The number in parentheses indicates the number of bytes in this TCP segment. We can see from the capture that our 10,000 Bytes of data is broken into 5 segments containing 2896, 1448, 2896, 1448 and 1312 Bytes. But wait … 2896 Bytes in a TCP segment when the MSS is 1460? (In fact, with TCP header options like SACK and timestamp, the MSS in this capture is 1448.) This is Generic Segmentation Offloading at work: the OS is sending large segments, as captured above, and the real segmentation happens later, after the point where tcpdump captures the packets (the ethtool output above shows that hardware TSO is off, so the NIC is not doing it).

So now let's turn Generic Segmentation Offloading off using ethtool:

sgordon@ginger$ sudo ethtool -K eth0 gso off

And run the netcat transfer again and look at the tcpdump output this time:

sgordon@ginger$ sudo tcpdump -i eth0 -n 'not port 22'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
18:33:02.644356 IP 192.168.1.2.5002 > 10.10.1.22.5001: S 3144010294:3144010294(0) win 5840 
18:33:02.645427 IP 10.10.1.22.5001 > 192.168.1.2.5002: S 3901655238:3901655238(0) ack 3144010295 win 5792 
18:33:02.645471 IP 192.168.1.2.5002 > 10.10.1.22.5001: . ack 1 win 92 
18:33:02.645542 IP 192.168.1.2.5002 > 10.10.1.22.5001: . 1:1449(1448) ack 1 win 92 
18:33:02.645558 IP 192.168.1.2.5002 > 10.10.1.22.5001: . 1449:2897(1448) ack 1 win 92 
18:33:02.645567 IP 192.168.1.2.5002 > 10.10.1.22.5001: P 2897:4345(1448) ack 1 win 92 
18:33:02.647415 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 1449 win 68 
18:33:02.647433 IP 192.168.1.2.5002 > 10.10.1.22.5001: . 4345:5793(1448) ack 1 win 92 
18:33:02.647439 IP 192.168.1.2.5002 > 10.10.1.22.5001: . 5793:7241(1448) ack 1 win 92 
18:33:02.648437 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 2897 win 91 
18:33:02.648446 IP 192.168.1.2.5002 > 10.10.1.22.5001: . 7241:8689(1448) ack 1 win 92 
18:33:02.648451 IP 192.168.1.2.5002 > 10.10.1.22.5001: P 8689:10001(1312) ack 1 win 92 
18:33:02.648460 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 4345 win 114 
18:33:02.650414 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 5793 win 136 
18:33:02.650428 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 7241 win 159 
18:33:02.651469 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 8689 win 181 
18:33:02.651476 IP 10.10.1.22.5001 > 192.168.1.2.5002: . ack 10001 win 204

Now this is what we expect to see – 7 TCP segments each no larger than 1448 Bytes.
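The segment counts in the two captures follow from simple arithmetic, which can be sketched as below. The struct seg_plan type and plan_segments() function are hypothetical helpers of mine, and the numbers (10,000 bytes of data, an effective MSS of 1448) are the ones from the captures above.

```c
#include <assert.h>
#include <stddef.h>

/* Result of splitting a payload into MSS-sized pieces (hypothetical type). */
struct seg_plan {
        size_t count;           /* number of segments on the wire */
        size_t last_len;        /* payload bytes in the final segment */
};

/* How many segments of at most `mss` bytes are needed for `payload_len`
 * bytes, and how large is the last one? Returns {0, 0} for empty input. */
static struct seg_plan plan_segments(size_t payload_len, size_t mss)
{
        struct seg_plan p = { 0, 0 };

        if (mss == 0 || payload_len == 0)
                return p;

        p.count = (payload_len + mss - 1) / mss;        /* round up */
        p.last_len = payload_len - (p.count - 1) * mss;
        return p;
}
```

For 10,000 bytes and an MSS of 1448 this gives 7 segments with a final segment of 1312 bytes, matching the second capture. The 2896-byte entries in the first capture are simply two such segments (2 x 1448) still travelling as one unit.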

What's the conclusion of all this? What is taught in lectures and textbooks is not always what you see in practice. I suggest turning offloading optimisations off to demonstrate the basic concepts, and then turning them back on again to illustrate the practical performance optimisations applied at the expense of theoretical layering principles.

A sorted tree: Red-Black Tree

I found a short tutorial on how to use the kernel’s red-black tree data structure. The data structure is declared in linux/rbtree.h (kernel library).

“Red-black trees are a type of self-balancing binary search tree, used for storing sortable key/value data pairs. This differs from radix trees (which are used to efficiently store sparse arrays and thus use long integer indexes to insert/access/delete nodes) and hash tables (which are not kept sorted to be easily traversed in order, and must be tuned for a specific size and hash function where rbtrees scale gracefully storing arbitrary keys).” [Got from here]

[Figure: example of a red-black tree]

“In addition to the requirements imposed on a binary search tree the following must be satisfied by a red–black tree:

  1. A node is either red or black.
  2. The root is black. This rule is sometimes omitted. Since the root can always be changed from red to black, but not necessarily vice versa, this rule has little effect on analysis.
  3. All leaves (NIL) are black.
  4. If a node is red, then both its children are black.
  5. Every path from a given node to any of its descendant NIL nodes contains the same number of black nodes. The uniform number of black nodes in the paths from root to leaves is called the black-height of the red–black tree.” [Wikipedia]
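The colour rules above can be checked mechanically. The following is a small user-space sketch, not the kernel's linux/rbtree.h API: struct node, black_height() and is_red_black() are names I chose for illustration, a NULL child plays the role of a black NIL leaf, and the binary-search-tree ordering of the keys is assumed rather than verified.

```c
#include <assert.h>
#include <stddef.h>

enum color { RED, BLACK };      /* rule 1: a node is either red or black */

/* Hypothetical node type; NULL children stand for black NIL leaves. */
struct node {
        int key;
        enum color color;
        struct node *left, *right;
};

/* Returns the black-height of the subtree (NIL leaves count as 1),
 * or -1 if rule 4 or rule 5 is violated somewhere below `n`. */
static int black_height(const struct node *n)
{
        int lh, rh;

        if (!n)
                return 1;       /* rule 3: NIL leaves are black */

        if (n->color == RED &&
            ((n->left && n->left->color == RED) ||
             (n->right && n->right->color == RED)))
                return -1;      /* rule 4: a red node has black children */

        lh = black_height(n->left);
        rh = black_height(n->right);
        if (lh < 0 || rh < 0 || lh != rh)
                return -1;      /* rule 5: equal black count on every path */

        return lh + (n->color == BLACK ? 1 : 0);
}

/* Returns 1 if the colour rules 2 to 5 hold for the whole tree. */
static int is_red_black(const struct node *root)
{
        if (root && root->color != BLACK)
                return 0;       /* rule 2: the root is black */
        return black_height(root) > 0;
}
```

A black root with two red children passes the check; hanging another red node under one of those red children breaks rule 4, and recolouring only one child black breaks rule 5.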

Ryu controller: OSLO package issue

Hi,

I recently faced an issue with the Ryu SDN controller framework. The error was:

ImportError: No module named oslo.config.cfg

I solved it by running the following (in the root directory of Ryu, to upgrade the controller):

# pip2 install --upgrade ryu

I’m not sure, but I guess that either a deprecated version of the OSLO package was installed on my system or the version I had was newer than the one Ryu 3.21 needed (API parameters or names might have changed).

That’s all,

How to configure ONOS

ONOS is a carrier-grade network operating system (OpenFlow controller). Here I use version 1.3, and the configuration below worked for me.

Download ONOS
==============
$ cd ~; mkdir sdn; cd sdn
$ git clone -b onos-1.3 https://gerrit.onosproject.org/onos
 
Installation – Java, Maven, and Karaf
======================================
 
Maven and Karaf
 
$ cd ~/sdn
$ wget http://archive.apache.org/dist/karaf/3.0.3/apache-karaf-3.0.3.tar.gz
$ wget http://archive.apache.org/dist/maven/maven-3/3.3.1/binaries/apache-maven-3.3.1-bin.tar.gz
$ tar -zxvf apache-karaf-3.0.3.tar.gz -C ~/sdn
$ tar -zxvf apache-maven-3.3.1-bin.tar.gz -C ~/sdn
 
Oracle Java 8:
 
$ sudo apt-get install software-properties-common -y
$ sudo add-apt-repository ppa:webupd8team/java -y
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer oracle-java8-set-default -y
 
Setting Environment Variables (~/.bashrc). These environment variables may be optional (I’m not sure, but I configured them in my .bashrc).
 
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JRE_HOME=/usr/lib/jvm/java-8-oracle/jre
export KARAF_ROOT=$HOME/sdn/apache-karaf-3.0.3
export M2_HOME=$HOME/sdn/apache-maven-3.3.1
 
 
Development Environment Setup (~/.bashrc)
=========================================
 
$ export ONOS_ROOT=$HOME/sdn/onos
$ source $ONOS_ROOT/tools/dev/bash_profile
 
 
Building and packaging ONOS 
============================
$ cd ~/sdn/onos
$ mvn clean install
 
Selecting IP address (~/.bashrc)
================================
Put your own IP address here.
$ export ONOS_IP=A.B.C.D
  
Starting ONOS
============== 
$ ok clean # or onos-karaf
  
GUI: karaf/karaf
==================
 
http://<Your IP address>:8181/onos/ui/login.html