

General

Crossbow paper wins best paper award at Usenix LISA09 and BOF schedule

We had submitted another paper to the Usenix LISA 2009 conference in Baltimore, MD, which is being held from Nov 3-5, 2009. The paper is titled Crossbow Virtual Wire: Network in a Box. Yesterday we were informed that our paper won the Best Paper award for the conference. Woohoo!!

I met many people here at LISA who are already using Crossbow in very interesting ways. I got many requests to hold a BOF while we were here. So I hit our marketing VP for some beer budget (can't have a BOF without drinks) and we are now having a Crossbow and Solaris Networking BOF on Nov 4th, 2009 from 10.30 to 11.30pm in the Dover AB conference room. The venue details can be found on the Usenix LISA site here. So if you are already at the conference or in the general area of Maryland, Virginia, DC, etc., please do come by. It would be good to attach faces to names, and we will have chilled beer. We will also be showing the Virtual Wire Builder kit to build your own virtual network (all available in open source form).

Once again, the BOF details are:
Crossbow & Solaris Networking BOF at Usenix LISA 2009
Place: Marriott Waterfront Hotel, Baltimore, MD
Date: Nov 4th, 2009
Time: 10.30-11.30pm
Agenda: Virtual Wire Builder kit, Open discussion, Beer

Hope to see you there.


Solaris Networking

Crossbow Launch, Talk and BOF at Community One and Java One

On June 1, 2009, during Community One and Java One in San Francisco, California, Crossbow was formally launched as part of OpenSolaris 2009.06. The morning started with a keynote where John Fowler, EVP of the Sun Systems group, formally announced OpenSolaris 2009.06 as the beta for the next enterprise release of Solaris (the next release after Solaris 10). He and Greg Lavender then went on to show the Crossbow feature and the Virtual Wire demo. Later in the day I did a talk on Crossbow where Nicolas and Kais accompanied me and showed the Crossbow Virtual Wire demo in detail. Bill Franklin and some of his cohorts were dressed as Crossbow knights and charged into the room right after the talk. I think people got the shock of their lives. It was very entertaining.

The launch got a lot of visibility and very good press coverage, which can be seen on the Crossbow News page. The most notable ones were:
- An eWeek article where they tried some of the Virtual Wire features and liked it
- An Information Week article
- A Computer Week article which also had quotes from Dan Roberts talking about Crossbow
and many more, available here.

On June 2, 2009 we held the Crossbow BOF in the evening. Great showing and great support from the community:
- Joyent: Ben Rockwood talked about cloud deployment with Crossbow
- Veraz Networks: Xiaobo Wang talked about telco consolidation
- Reliant Security: Richard Newman talked about Crossbow in the small enterprise space
- Force 10 Networks: Michael O'Brien talked about extending Crossbow H/W lanes to the switch

So great stuff and a good closure for Phase 1 of the Crossbow project. The team members were pretty happy and relieved. Now we are trying to get the next intermediate phase going so we can complete the story for the next enterprise release of Solaris, which might or might not be called Solaris 11. Key things are more analytics (dlstat/flowstat), some security/anti-spoofing features, more usability, etc. More details are being discussed on the Crossbow Discussion page.


General

Crossbow Sigcomm09 papers are now online

Here are the details of the two Crossbow ACM Sigcomm09 papers.

Crossbow: From Hardware Virtualized NICs to Virtualized Networks
Abstract: This paper describes a new architecture for achieving network virtualization using virtual NICs (VNICs) as the building blocks. The VNICs can be associated with dedicated and independent hardware lanes that consist of dedicated NIC and kernel resources. Hardware lanes support dynamic polling, which enables the fair sharing of bandwidth with no performance penalty. VNICs ensure full separation of traffic for virtual machines within the host. A collection of VNICs on one or more physical machines can be connected to create a Virtual Wire by assigning them a common attribute such as a VLAN tag. The full paper is available here.

Crossbow: A vertically integrated QoS stack
Abstract: This paper describes a new architecture which addresses Quality of Service (QoS) by creating unique flows for applications, services, or subnets. A flow is a dedicated and independent path from the NIC hardware to the socket layer in which the QoS layer is integrated into the protocol stack instead of being implemented as a separate layer. Each flow has dedicated hardware and software resources, allowing applications to meet their specified quality of service within the host. The architecture efficiently copes with Distributed Denial of Service (DDoS) attacks by creating zero or limited bandwidth flows for the attacking traffic. The unwanted packets can be dropped by the NIC hardware itself at no cost. A collection of flows on more than one host can be assigned the same Differentiated Services Code Point (DSCP) label, which forms a path dedicated to a service across the enterprise network and enables end-to-end QoS within the data center. The full paper is available here.

Enjoy reading, and join us for the talk, BOF and party at Community One (see the previous entry) on June 1-2, 2009!!


General

Crossbow Research papers in SIGCOMM, Party, Community One/Java One etc

Last week was a very exciting week. Two of our research papers got accepted at SIGCOMM VISA09 and SIGCOMM WREN09. This year, SIGCOMM will be held in Barcelona, Spain from August 17-21 and has four focus areas. Two of them are Virtualization and Enterprise Networking, which is where we submitted papers on virtualization and flows respectively. We will make these papers available online very soon, once we submit the camera-ready copy to the ACM editors.

So comes the next question - where is the party? Well, the party is during Java One and Community One on June 1 and 2. Did I tell you that Community One is FREE and there is a big party in the evening? I think Crossbow gets formally announced as part of Community One itself, and we will have a talk on Crossbow titled Open Networking with Crossbow on June 1st at 2.40pm and a BOF on Crossbow on June 2nd at 5.30pm. We will also be hosting a demo pod during Java One.

Crossbow is the more visible initiative, but the last few months were pretty fruitful since we delivered not only Crossbow but also several parts of Clearview and Volo, amongst others. So please come by, help if you can, or just enjoy the sessions, enrich yourself and celebrate. Let me know if you are able to help out with the demo, manning the booths and answering questions.


Solaris Networking

Crossbow: Virtualized switching and performance

Saw Cisco's unified fabric announcement. Seems like they are going after cloud computing, which pretty much promises to solve the world hunger problem. Even if cloud computing can just solve the high data center cost problem and make compute, networking, and storage available on demand in a cheap manner, I am pretty much sold on it. The interesting part is that the world needs to move towards enabling people to bring their network onto the cloud and have compute, bandwidth and storage available on demand. For networking and network virtualization, this means that we need to go to open standards, open technology and off-the-shelf hardware. The users of the cloud will not accept a vendor or provider lock-in. The cloud needs to be built in such a manner that a user can take his physical network and migrate it to an operator's cloud, and at the same time have the ability to build his own clouds and migrate stuff between the two. Open Networking is the key ingredient here.

This essentially means that there is no room for custom ASICs and protocols, and the world of networking needs to change. This is what Jonathan was talking about, to a certain extent, around Open Networking and Crossbow. OpenSolaris with Crossbow makes things very interesting in this space. But it seems like people don't fully understand what Crossbow and OpenSolaris bring to the table. I saw a post from Scott Lowe and several others mentioning that Crossbow is pretty similar to VMware's network virtualization solutions and Cisco Nexus 1000v virtual switches. Let me take some time to explain a few very important things about Crossbow:
- It is open source and part of OpenSolaris. You can download it right here.
- It leverages NIC hardware switching and features to deliver isolation and performance for virtual machines. Crossbow not only includes H/W and S/W based VNICs and switches, it also offers virtualized routers, load balancers, and firewalls. These Virtual Network Machines can be created using Crossbow and Solaris Zones and have pretty amazing performance. All of these are connected together using the Crossbow Virtual Wire. You don't need to buy fancy and expensive virtualized switches to create and use a Virtual Wire.
- Using hardware virtualized lanes, Crossbow technology scales to multiples of 10gig traffic using off-the-shelf hardware.

Hardware based VNICs and Hardware based Switching
A picture is always worth a thousand words. The figure shows how Crossbow VNICs are built on top of real NIC hardware and how we do switching in hardware where possible. Crossbow also has a full featured S/W layer where it can do S/W VNICs and switching as well; the hardware is leveraged when available. It is important to note that most NIC vendors do ship with the necessary NIC classifiers and Rx/Tx rings, and these are pretty much mandatory for 10 gig NICs, which form the backbone for a cloud.

Virtual Wire: The essence of virtualized networking
The Crossbow Virtual Wire technology allows a person to take a full featured physical network (multiple subnets, switches and routers) and configure it within one or more hosts. This is the key to moving virtualized networks in and out of the cloud. The figure shows a two subnet physical network with multiple switches, different link speeds, connected via a router, and how it can be virtualized in a single box. A full workshop on virtualized networking is available here.

Scaling and Performance
Crossbow leverages the NIC's features pretty aggressively to create virtualization lanes that help traffic scale across a large number of cores and threads. For people wanting to build real or virtual appliances using OpenSolaris, the performance and scaling across 10 Gig NICs is pretty essential. The figure below shows an overview of hardware lanes.

More Information
There is a white paper and more detailed documents (including how to get started) at the Crossbow OpenSolaris page.

Tags: network virtualization, crossbow, cloud computing
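To give a quick, hedged taste of the hardware VNIC idea above, the commands below use the same dladm syntax that appears later in this blog; the physical link name ixgbe0 and the VNIC names are my own placeholders, not from any specific setup.

# dladm create-vnic -l ixgbe0 vnic_web
# dladm create-vnic -l ixgbe0 -p maxbw=2000 vnic_db
# dladm show-vnic

Where the NIC exposes hardware classification and Rx/Tx rings, each VNIC gets its own hardware lane; the maxbw=2000 property on the second VNIC caps it at roughly 2 Gbps, the same bandwidth-limit mechanism described in the posts below.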


Solaris Networking

Crossbow enables an Open Networking Platform

I came across this blog from Paul Murphy. You should read the second half of Paul's blog. What he says is pretty true. Crossbow delivered a brand new networking stack to Solaris which has scalability, virtualization, QoS, and better observability designed in (instead of patched in). The complete list of features delivered and under way is here. Coupled with a full fledged open source Quagga routing suite (RIP, OSPF, BGP, etc.), the IP Filter firewall, and a kernel load balancer, OpenSolaris becomes a pretty useful platform for building Open Networking appliances.

Apart from single box functionality, imagine you want to deliver a virtual router or a load balancer; it would be pretty easy to do so. OpenSolaris offers Zones, where you can deliver a pre-configured zone as a router, load balancer, or firewall. The difference is that this zone would be fully portable to another machine running OpenSolaris and would have no performance penalty. After all, we, aka the Crossbow team, guarantee that our VNICs with Zones do not have any performance penalties. You can also build a fully portable and pre-configured virtual networking appliance using a Xen guest, which can be made to migrate between any OpenSolaris or Linux host.

I noticed that a couple of folks on Paul's blog were asking why Crossbow NIC virtualization is different. Well, it is not just the NIC being virtualized but the entire data path along with it, called a Virtualization Lane. You can see the virtualization lane all the way from the NIC to the socket layer and back here. Not only is there one or more Virtualization Lanes per virtual machine; the bandwidth partitioning, Diffserv tagging, priority, CPU assignment, etc. are designed in as part of the architecture. The same concepts are used to scale the stack across multiples of 10gigE NICs over a large number of cores and threads (out of this world forwarding performance, anyone!).

And as mentioned before, Crossbow enables Virtual Wire: the ability to create a full featured network without any physical wires. Think of running network simulations and testing in a whole new light!!
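To make the "pre-configured zone as a router" idea a little more concrete, here is a minimal, hedged sketch using the same dladm and zonecfg steps the Virtual Wire workshop later in this blog walks through; the link, zone and path names are placeholders, and a real router zone would of course need one VNIC per subnet.

# dladm create-vnic -l e1000g0 vr_vnic0
# zonecfg -z vrouter
zonecfg:vrouter> create
zonecfg:vrouter> set zonepath=/zones/vrouter
zonecfg:vrouter> set ip-type=exclusive
zonecfg:vrouter> add net
zonecfg:vrouter:net> set physical=vr_vnic0
zonecfg:vrouter:net> end
zonecfg:vrouter> verify
zonecfg:vrouter> commit
zonecfg:vrouter> exit
# zoneadm -z vrouter install
# zoneadm -z vrouter boot
vrouter# svcadm enable network/ipv4-forwarding:default

Because the zone configuration only references a VNIC name, the whole appliance can be re-created unchanged on another OpenSolaris host.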


Solaris

Crossbow - Network Virtualization Architecture Comes to Life

December 5th, 2008 was a joyous occasion and a humbling one at the same time. A vision that was created 4 years back was coming to life. I still remember the summer of 2004 when Sinyaw threw a challenge at me - can you change the world? And it was fall of the same year when I unveiled the first set of Crossbow slides to him and Fred Zlotnik over a bottle of wine. Lots of planning and finally ready to start, but there were still hurdles in the way. We were still trying to finish Nemo aka GLDv3 - a high performance device driver framework which was absolutely required for Crossbow (we needed absolute control over the hardware). Nemo finished mid 2005, but then Nicolas, Yuzo, etc. left Sun and went to a startup. Thiru was still trying to finish Yosemite (the FireEngine follow-on). So in short, 2005 was basically more planning and prototyping (especially controlling the Rx rings and dynamic polling) on my part. I think it was early 2006 when work began on Crossbow in earnest. Kais moved over from the security group, Nicolas was back at Sun, and Thiru, Eric Cheng, Mike Lim (and of course me) came together to form the core team (which later expanded to 20+ people in early 2008). So it was a long standing dream and almost three years of hard work that finally came to life when Crossbow Phase 1 integrated in Nevada Build 105 (and will be available in the OpenSolaris 2009.06 release).

Crossbow - H/W Virtualized Lanes that Scale (10gigE over multiple cores)
One of the key tenets of the Crossbow design is the concept of H/W Virtualization Lanes: essentially tying a NIC receive and transmit ring, DMA channel, kernel threads, kernel queues, and processing CPUs together. There are no shared locks, counters or anything. Each lane gets to individually schedule its packet processing by switching its Rx ring independently between interrupt mode and poll mode (dynamic polling). Now you can see why Nemo was so important: without it, the stack couldn't control the H/W, and without Nemo, the NIC vendors wouldn't have played along with us in adding the features we wanted (stateless classification, Rx/Tx rings, etc.). Once a lane is created, we can program the classifier to spread packets between the lanes based on IP addresses and ports for scaling reasons. With the multiple cores and multiple threads that seem to be the way of life going forward and 10+ gigE of bandwidth (soon we will have IPoIB working as well), scaling really matters (and we are not talking about achieving line rate on 10 gigE with jumbo grams - we are talking about the real world: a mix of small and large packets, 10k connections and 1000s of threads).

To demonstrate the point, I captured a bunch of statistics while finishing the final touches to the data path and getting ready to beat some world records. The table below shows mpstat output along with the packets per second serviced for the Intel Oplin (10gigE) NIC on a Niagara2 based system. The NIC has all 8 Rx/Tx rings enabled and has 8 interrupts enabled (one for each Rx ring).
CPU minf mjf xcal  intr  ithr  csw icsw migr smtx srw syscl usr sys wt idl
 38    0   0    6    21     3   31    1    5   12   0    86   0   0  0  99
 39    0   0 2563  5506  3907 3282   28   34 1170   0   178   0  21  0  78
 40    0   0 2553  5117  3948 2410   38  150 1192   0   504   1  21  0  77
 41    0   0 2651  5221  4232 2011   25   53 1195   0   210   0  20  0  80
 42    0   0 3078  5700  4743 2069   21   28 1285   0   125   0  22  0  78
 43    0   0 3280  5837  4777 2118   19   24 1328   0   101   0  22  0  78
 44    0   0 3143 19566 18801 1773   50   44 1285   0    68   0  65  0  35
 45    0   0 4570  7748  6838 1984   23   27 1697   0   118   0  29  0  71

# netstat -ia 1
    input   e1000g    output         input  (Total)    output
packets errs  packets errs colls  packets errs  packets errs colls
      4    0        1    0     0    61284    0   128820    0     0
      3    0        2    0     0    61015    0   129316    0     0
      4    0        2    0     0    60878    0   128922    0     0

This link shows the interrupt binding, mpstat and intrstat output. You can see that the NIC is trying very hard to spread the load, but because the stack sees this as one NIC, there is one CPU (number 44) where all the 8 threads collide. It's like an 8-lane highway becoming a single lane during rush hour.

Now let's look at what happens when Crossbow enables a lane all the way up the stack for each Rx ring and also enables dynamic polling for each of them individually. If you look at the corresponding mpstat and intrstat output and the packets per second rate, you will see that the lanes really do work independently from each other, resulting in almost linear spreading and a much higher packets per second rate. The benchmark represents a webserver workload, and needless to say, Crossbow with dynamic polling on a per Rx ring basis almost tripled the performance. The raw stats can be seen here.

CPU minf mjf xcal  intr  ithr  csw icsw migr smtx srw syscl usr sys wt idl
 37    0   0 2507 11906 10272 4267  265  326  489   0   776   4  28  0  68
 38    0   0 2111 11793  9840 6503  336  314  472   0   615   3  32  0  65
 39    0   0  500 10409 10164  565    7  125  174   0  1413   6  23  0  70
 40    0   1  660 10423  9982  950   23  288  272   0  3834   8  34  0  58
 41    0   1  658 10490 10108  847   16  238  237   0  2549   8  29  0  64
 42    0   0  584 10605 10299  708   12  181  207   0  1828   7  26  0  67
 43    0   0  732 10829 10559  598    9  141  193   0  1485   7  25  0  68
 44    0   1  306   487    25 1091   17  282  330   0  4083   9  17  0  74

# netstat -ia 1
    input   e1000g    output         input  (Total)    output
packets errs  packets errs colls  packets errs  packets errs colls
      2    0        1    0     0   267619    0   522226    0     0
      2    0        2    0     0   275395    0   539920    0     0
      2    0        2    0     0   251023    0   482335    0     0

And finally, below we print some statistics from the MAC per Rx ring data structure (mac_soft_ring_set_t). For each Rx ring, we track the number of packets received via the interrupt path, the number received via the poll path, chains less than 10, chains between 10 and 50, and chains over 50 (each time we polled the Rx ring). You can see that the polling path brings a larger chunk of packets and in bigger chains. Keep in mind that for most OSes and most NICs, the interrupt path brings one packet at a time. This makes the Crossbow architecture more efficient for scaling as well as performance at higher loads on high B/W NICs.

Crossbow and Network Virtualization
Once we have the ability to create these independent H/W lanes, programming the NIC classifier is easy. Instead of spreading the incoming traffic for scaling, we program the classifier to send packets for a MAC address to an individual lane. The MAC addresses are tied to individual Virtual NICs (VNICs), which are in turn attached to guest virtual machines or Solaris Containers (Zones). The separation for each virtual machine is driven by the H/W and processed on the CPUs attached to the virtual machine (the poll thread and interrupts for the Rx ring for a VNIC are bound to the assigned CPUs).
The picture kind of looks like this (see the figure).

Since for NICs and VNICs we always do dynamic polling, enforcing a bandwidth limit is pretty easy. One can create a VNIC by simply specifying the B/W limit, priority, and CPU list in one shot, and the poll thread will enforce the limit by picking up only packets that meet the limit. Something as simple as:

freya(67)% dladm create-vnic -l e1000g0 -p maxbw=100,cpus=2 my_guest_vm

The above command will create a VNIC called my_guest_vm with a random MAC address and assign it a B/W of 100Mbps. All the processing for this VNIC is tied to CPU 2. It is features like this that make Crossbow an integral part of the Sun Cloud Computing initiative due to roll out soon.

Anyway, this should give you a flavour. There is a white paper and more detailed documents (including how to get started) at the Crossbow OpenSolaris page.

Tags: network virtualization, crossbow, cloud computing
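If the bandwidth cap needs to change after the VNIC is created, the same link property can be updated on the fly; a small hedged sketch (the 200Mbps value is arbitrary, and the property names simply mirror the create-vnic example above):

# dladm set-linkprop -p maxbw=200 my_guest_vm
# dladm show-linkprop -p maxbw,cpus my_guest_vm

set-linkprop is the same command the Virtual Wire workshop below uses to cap vnic1 at 100Mbps; show-linkprop just reads the current property values back.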


Solaris Networking

Virtual Wire: Network in a Box (Sun Tech Day in Hyderabad)

I did a session for developers during the Sun Tech Day in Hyderabad. Raju Alluri had printed out 100 copies of the workshop and we were carrying 100 DVDs with Crossbow iso images (they are available on the web here). The people just loved it. We had soooo underestimated the demand that the printouts and DVDs disappeared in less than a minute. I had a presentation that included 30 odd slides, but I couldn't even go past slide 7 since the workshop was so interesting to people. And between the tech day presentation and the user group meeting in the evening, people pointed out a lot of interesting uses and why this can be such a powerful thing.

The idea that you can create any arbitrarily complex physical network as a virtual wire and run your favorite workload, do performance analysis and debug it is very appealing to people. Remember that we are not simulating the network. This is the real thing, i.e. real applications running and real packets flowing. If your application runs on any OS, it will run on this virtual network and will send and receive real packets!!

The concept is pretty useful even to people like us, because now we don't need to pester our lab staff to create a network for us to test or experiment on. And the best part is, we can use xVM and run Linux and Windows as hosts as well. We are thinking of writing a book which reinvents how you learn networking in schools and universities. And oh, by the way, do people really care about CCNA now that they can do all this on their laptop :) If someone is interested in contributing real examples for this workshop module and the book, you are more than welcome. Just drop us a line.

Tags: networking, virtualization, crossbow


Solaris Networking

Network in a Box (Creating a real Network on your Laptop)

Crossbow: Network Virtualization & Resource Control

Objective
Create a real network comprising hosts, switches and routers as a virtual network on a laptop. The virtual network (called a Virtual Wire) is created using the OpenSolaris project Crossbow technology, and the hosts etc. are created using Solaris Zones (a lightweight virtualization technology). All the steps necessary to create the virtual topology are explained. Users can work through this hands-on demo/workshop and the exercises at the end to become an expert in:
- Configuring IPv4 and IPv6 networks
- Hands-on experience with OpenSolaris
- Configuring and managing a real router
- IP routing technologies including RIP, OSPF and BGP
- Debugging configuration and connectivity issues
- Network performance and bottleneck analysis
The users of this module need not have access to a real network, router and switches. All they need is a laptop or desktop running OpenSolaris Project Crossbow snapshot 2/28/2008 or later, which can be found at http://www.opensolaris.org/os/project/crossbow/snapshots.

Introduction
Crossbow (Network Virtualization and Resource Control) allows users to create a Virtual Wire with fixed link speeds in a box. Multiple subnets connected via a virtual router are pretty easy to configure. This allows network administrators to do a full network configuration, verify IP addresses, subnet masks and router ports and addresses. They can test connectivity and link speeds and, when fully satisfied, they can instantiate the configuration on the real network. Another great application is to debug problems by simulating a real network in a box. If network administrators are having issues with connectivity or performance, they can create a virtual network and debug their issues using snoop, kernel stats and dtrace. They don't need to use expensive H/W based network analyzers. Network developers and researchers working with protocols (like high speed TCP) can use OpenSolaris to write their implementation and then try it out with other production implementations. They can debug and fine tune their protocol quite a bit before sending even a single packet on the real network.

Note 1: Users can use Solaris Zones, Xen or ldom guests to create the virtual hosts, while Crossbow provides the virtual network building blocks. There is no simulation but real protocol code at work. Users run real applications on the hosts and clients, which generate real packets.

Note 2: The Solaris protocol code executed for a virtual network, or for Solaris acting as a real router or host, is common all the way to the bottom of the MAC layer. In the case of virtual networks, the device driver code for a physical NIC is the only code that is not needed.

Try it Yourself
Let's do a simple exercise. As part of this exercise, you will learn:
- How to configure a virtual network having two subnets connected via a virtual router using Crossbow and Zones
- How to set the various link speeds to simulate a multiple speed network
- How to do some performance runs to verify connectivity
What you need: a laptop or machine running a Crossbow snapshot from Feb 28, 2008 or later (http://www.opensolaris.org/os/project/crossbow/snapshots/).

Virtual Network Example
Let's take a physical network. The example in Fig 1a represents the real network, showing how my desktop connects to the lab servers. The desktop is on the 20.0.0.0/24 network while the server machines (host1 and host2) are on the 10.0.0.0/24 network.
In addition, host1 has a 10/100 Mbps NIC, limiting its connectivity to 100Mbps. (Fig. 1a shows this physical network.)

We will represent the network shown in Fig 1a on my Crossbow enabled laptop as a virtual network (Fig. 1b). We use Zones to act as host1, host2 and the router, while the global zone (gz) acts as the client (as a user exercise, create another client zone and assign VNIC6 to it to act as the client).

Note 3: The Crossbow MAC layer itself does the switching between the VNICs. The etherstub is created as a dummy device to connect the various virtual NICs. You can think of an etherstub as a virtual switch to help visualize the virtual network as a replacement for a physical network, where each physical switch is replaced by a virtual switch (implemented by a Crossbow etherstub).

Create the Virtual Network
Let's start by creating the 2 etherstubs using the dladm command:

gz# dladm create-etherstub etherstub1
gz# dladm create-etherstub etherstub3
gz# dladm show-etherstub
LINK
etherstub1
etherstub3

Create the necessary virtual NICs. VNIC1 will be limited to 100Mbps (we set that later) while the others have no limit:

gz# dladm create-vnic -l etherstub1 vnic1
gz# dladm create-vnic -l etherstub1 vnic2
gz# dladm create-vnic -l etherstub1 vnic3
gz# dladm create-vnic -l etherstub3 vnic6
gz# dladm create-vnic -l etherstub3 vnic9
gz# dladm show-vnic
LINK   OVER        SPEED    MACADDRESS        MACADDRTYPE
vnic1  etherstub1  - Mbps   2:8:20:8d:de:b1   random
vnic2  etherstub1  - Mbps   2:8:20:4a:b0:f1   random
vnic3  etherstub1  - Mbps   2:8:20:46:14:52   random
vnic6  etherstub3  - Mbps   2:8:20:bf:13:2f   random
vnic9  etherstub3  - Mbps   2:8:20:ed:1:45    random

Create the hosts and assign them the VNICs. Also create the virtual router and assign it VNIC3 and VNIC9 over etherstub1 and etherstub3 respectively. Both the virtual router and the hosts are created using Zones in this example, but you can easily use Xen or logical domains.

Create a base zone which we can clone. The first part is only necessary if you are on a zfs filesystem.

gz# zfs create -o mountpoint=/vnm rpool/vnm
gz# chmod 700 /vnm
gz# zonecfg -z vnmbase
vnmbase: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:vnmbase> create
zonecfg:vnmbase> set zonepath=/vnm/vnmbase
zonecfg:vnmbase> set ip-type=exclusive
zonecfg:vnmbase> add inherit-pkg-dir
zonecfg:vnmbase:inherit-pkg-dir> set dir=/opt
zonecfg:vnmbase:inherit-pkg-dir> set dir=/etc/crypto
zonecfg:vnmbase:inherit-pkg-dir> end
zonecfg:vnmbase> verify
zonecfg:vnmbase> commit
zonecfg:vnmbase> exit

This part takes 15-20 minutes:

gz# zoneadm -z vnmbase install

Now let's create the 2 hosts and the virtual router as follows:

gz# zonecfg -z host1
host1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:host1> create
zonecfg:host1> set zonepath=/vnm/host1
zonecfg:host1> set ip-type=exclusive
zonecfg:host1> add inherit-pkg-dir
zonecfg:host1:inherit-pkg-dir> set dir=/opt
zonecfg:host1:inherit-pkg-dir> set dir=/etc/crypto
zonecfg:host1:inherit-pkg-dir> end
zonecfg:host1> add net
zonecfg:host1:net> set physical=vnic1
zonecfg:host1:net> end
zonecfg:host1> verify
zonecfg:host1> commit
zonecfg:host1> exit
gz# zoneadm -z host1 clone vnmbase
gz# zoneadm -z host1 boot
gz# zlogin -C host1

Connect to the console and go through the sysid config. For this example, we assign 10.0.0.1/24 as the IP address for vnic1. You can specify this during sysidcfg. For the default route, specify 10.0.0.3.
You can say 'none' for naming service, IPv6, Kerberos, etc. for the purpose of this example.

Similarly create host2 and configure it with vnic2, i.e.:

gz# zonecfg -z host2
host2: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:host2> create
zonecfg:host2> set zonepath=/vnm/host2
zonecfg:host2> set ip-type=exclusive
zonecfg:host2> add inherit-pkg-dir
zonecfg:host2:inherit-pkg-dir> set dir=/opt
zonecfg:host2:inherit-pkg-dir> set dir=/etc/crypto
zonecfg:host2:inherit-pkg-dir> end
zonecfg:host2> add net
zonecfg:host2:net> set physical=vnic2
zonecfg:host2:net> end
zonecfg:host2> verify
zonecfg:host2> commit
zonecfg:host2> exit
gz# zoneadm -z host2 clone vnmbase
gz# zoneadm -z host2 boot
gz# zlogin -C host2

Connect to the console and go through the sysid config. For this example, we assign 10.0.0.2/24 as the IP address for vnic2. You can specify this during sysidcfg. For the default route, specify 10.0.0.3. You can say 'none' for naming service, IPv6, Kerberos, etc.

Now let's create the virtual router:

gz# zonecfg -z vRouter
vRouter: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:vRouter> create
zonecfg:vRouter> set zonepath=/vnm/vRouter
zonecfg:vRouter> set ip-type=exclusive
zonecfg:vRouter> add inherit-pkg-dir
zonecfg:vRouter:inherit-pkg-dir> set dir=/opt
zonecfg:vRouter:inherit-pkg-dir> set dir=/etc/crypto
zonecfg:vRouter:inherit-pkg-dir> end
zonecfg:vRouter> add net
zonecfg:vRouter:net> set physical=vnic3
zonecfg:vRouter:net> end
zonecfg:vRouter> add net
zonecfg:vRouter:net> set physical=vnic9
zonecfg:vRouter:net> end
zonecfg:vRouter> verify
zonecfg:vRouter> commit
zonecfg:vRouter> exit
gz# zoneadm -z vRouter clone vnmbase
gz# zoneadm -z vRouter boot
gz# zlogin -C vRouter

Connect to the console and go through the sysid config. For this example, we assign 10.0.0.3/24 as the IP address for vnic3 and 20.0.0.1/24 as the IP address for vnic9. You can specify this during sysidcfg. For the default route, specify 'none'. You can say 'none' for naming service, IPv6, Kerberos, etc. Now let's enable forwarding on the virtual router to connect the 10.x.x.x and 20.x.x.x networks:

vRouter# svcadm enable network/ipv4-forwarding:default

Note 5: The above is done inside the virtual router. Make sure you are in the window where you did the 'zlogin -C vRouter' above.

Now let's bring up VNIC6 and configure it, including setting up routes, in the global zone. You can easily create another host called host3 as the client on the 20.x.x.x network by creating a host3 zone and assigning it an address on the 20.0.0.0/24 subnet.

Let's configure VNIC6. Open an xterm in the global zone:

gz# ifconfig vnic6 plumb 20.0.0.3/24 up
gz# route add 10.0.0.0 20.0.0.1
gz# ping 10.0.0.1
10.0.0.1 is alive
gz# ping 10.0.0.2
10.0.0.2 is alive

Similarly, log in to host1 and/or host2 and verify connectivity:

host1# ping 20.0.0.3
20.0.0.3 is alive
host1# ping 10.0.0.2
10.0.0.2 is alive

Set up Link Speed
What we configured above are unlimited B/W links. We can configure a link speed on any of the links. For this example, let's configure a link speed of 100Mbps on VNIC1:

gz# dladm set-linkprop -p maxbw=100 vnic1

We could also have configured the link speed (or B/W limit) while creating the vnic itself by adding the -p maxbw=100 option to the create-vnic command.

Test the Performance
Start 'netserver' (or the tool of your choice) in host1 and host2.
You will have to install the tools in the relevant places.

host1# /opt/tools/netserver &
host2# /opt/tools/netserver &
gz# /opt/tools/netperf -H 10.0.0.2
TCP STREAM TEST to 10.0.0.2 : histogram
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
49152  49152   49152    10.00    2089.87

gz# /opt/tools/netperf -H 10.0.0.1
TCP STREAM TEST to 10.0.0.1 : histogram
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
49152  49152   49152    10.00    98.78

Note 6: Since 10.0.0.2 is assigned to VNIC2, which has no limit, we get the maximum speed possible. 10.0.0.1 is configured over VNIC1, which is assigned to host1; we just set its link speed to 100Mbps, and that is why we get only 98.78Mbps.

Cleanup

gz# zoneadm -z host1 halt
gz# zoneadm -z host1 uninstall

Delete the zone:

gz# zonecfg -z host1
zonecfg:host1> delete
Are you sure you want to delete zone host1 (y/[n])? y
zonecfg:host1> exit

In the same way, delete the host2 and vRouter zones. Make sure you don't delete vnmbase, since re-creating it takes time.

gz# ifconfig vnic6 unplumb

After you have deleted the zones, you can delete the vnics and etherstubs as follows.

Delete the VNICs:
# dladm delete-vnic vnic1
# dladm delete-vnic vnic2
# dladm delete-vnic vnic3
# dladm delete-vnic vnic6
# dladm delete-vnic vnic9

Delete the etherstubs:
# dladm delete-etherstub etherstub3
# dladm delete-etherstub etherstub1

Make sure that the VNICs are unplumbed (ifconfig vnic6 unplumb) and not assigned to a zone (delete the zone first) before you delete them. You need to delete all the vnics on an etherstub before you can delete the etherstub.

User Exercises
Now that you are familiar with the concepts and technology, you are ready to do some experiments of your own. Clean up the machine as mentioned above. The exercises below will help you master IP routing, configuring networks, and debugging for performance bottlenecks.

1. Recreate the virtual network as shown in Fig 1b, but this time create an additional zone called client and assign vnic6 to that client zone.

   client Zone        vRouter          host1   host2
        |              |   |             |       |
   ---- etherstub3 ----    ------- etherstub1 -------

   Run all your connectivity tests by logging (zlogin) into the client. Now change all IPv4 addresses to IPv6 addresses and verify that the client and hosts still have connectivity.

2. Leave the virtual network as in 1, but configure OSPF on the vRouter instead of the default RIP. Verify that you still have connectivity. Note the steps needed to configure OSPF.

3. Configure the 20.0.0.0 and 10.0.0.0 networks as two separate autonomous systems, assign them unique ASN numbers and configure unique BGP domains. Verify that connectivity still works. Note the steps needed to configure the BGP domains.

4. Clean up everything and recreate the virtual network in 1 above, but instead of statically assigning the IP addresses to hosts and clients, configure NAT on the vRouter to give out addresses on subnet 10.0.0.0/24 on vnic3 and addresses on 20.0.0.0/24 on vnic9. While creating the hosts and clients, configure them to get their IP address through DHCP.

5. Clean up everything and recreate the virtual network in 1 above. Add an additional router, vRouter2, which has a vnic on each of the 2 etherstubs:

            vRouter1
           /        \
   20.0.0.0/24    10.0.0.0/24
           \        /
            vRouter2

   This provides a redundant path from the client to the hosts. Experiment with running different routing protocols, assign different weights to each path and see which path you take from client to host (use traceroute to detect).
   Now configure the routing protocol on the two vRouters to be OSPF and play with link speeds to see how the path changes. Note the configuration and observations.

6. Clean up. Now introduce another virtual router between two of the subnets, i.e.:

   client Zone     vRouter1        vRouter2       host1   host2
        |           |    |          |    |          |       |
   -- etherstub3 ---     -etherstub2-    ----- etherstub1 -----
     20.0.0.0/24          30.0.0.0/24        10.0.0.0/24

   Now set the link (VNIC) between vRouter1 and etherstub2 to 75 Mbps. Use snmp from the client to retrieve the stats from vRouter1 and check where the packets are getting dropped when you run netperf from the client to host2. Remove the limit set earlier and instead set a link speed of 75 Mbps on the link between etherstub2 and vRouter2. Again use snmp to get the stats out of vRouter1. Do you see results similar to before? If not, can you explain why?

Conclusion and More Resources
Use the real example and configure the virtual network to get familiar with the techniques used. At this point, have a look at your own network and try to create it as a virtual network.
- Get more details on the OpenSolaris Crossbow page: http://www.opensolaris.org/os/project/crossbow
- You can find high level presentations, architectural documents, man pages, etc. at http://www.opensolaris.org/os/project/crossbow/Docs
- Join the crossbow-discuss@opensolaris.org mailing list at http://www.opensolaris.org/os/project/crossbow/discussions. Send in your questions or your configuration samples and we will put them in the use case examples.
- A similar virtual network example using the global zone as a NAT can be found on Nicolas's blog at http://blogs.sun.com/droux
- Kais has an example of dynamic bandwidth partitioning at http://blogs.sun.com/kais
- Venu talks about some of the cool Crossbow features at http://blogs.sun.com/iyer, which allow virtualizing services with Crossbow technology using flowadm.

Tags: networking, virtualization, crossbow
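As a rough, hypothetical illustration of the flowadm-based service virtualization mentioned in the last resource above (this sketch is mine, not part of the workshop; the flow name, port and bandwidth value are made up, and the exact attribute and property syntax may differ between Crossbow builds):

# flowadm add-flow -l vnic1 -a transport=tcp,local_port=80 http-flow
# flowadm set-flowprop -p maxbw=50M http-flow
# flowadm show-flow -l vnic1

This carves out a flow for HTTP traffic arriving on vnic1 and caps it at roughly 50Mbps, using the same maxbw idea applied to whole VNICs earlier in the workshop.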


Solaris

Project Crossbow: Network Virtualization and Resource Control going live

Hello and welcome to project Crossbow!! We are going to add network virtualization and resource control to Solaris without degrading performance. At this time, we are seeking members from the OpenSolaris community to become part of the Crossbow i-team. It is the charter of the i-team to gather requirements and deliver the project, including design, docs and testing. We would love to have members of the community get involved from day one. The participation opportunities include (but are not limited to):
- helping define the project
- gathering requirements
- designing the project
- writing code
- creating demos
- doing talks and evangelizing the project
Please send an email to me if you are interested. We can promise you that this will be a thrilling adventure and you will be living on the bleeding edge of technology! Project Crossbow is brought to you by the same people who created project FireEngine (the new stack architecture), project Nemo (GLDv3 - the new high performance device driver framework), project Yosemite (UDP performance), etc., to name a few.

Apart from active participation, you can also take part via the mailing lists and discussion groups, where we will be posting various documents for review and comments alongside the day-to-day discussion. The project Crossbow page is visible here. You can sign up for the discussion group here.


Solaris Networking

Niagara - Designed for Network Throughput

We finally announced Niagara based servers to the public! Billed as low cost, energy efficient, huge network throughput processors - marketing mumbo jumbo, you think?? Well, try it and you will see. I was privileged enough that one of the earliest prototypes landed on my desk (or in my lab, to be precise) so Solaris networking could be tailored to take advantage of the chip. And boy, together with Solaris, this thing rocks!!

So you know that Niagara is a multi core, multi threaded chip, and Solaris takes advantage of it in multiple ways. Let me highlight some of them.

Network performance
The load from the NIC is fanned out to multiple soft rings in the GLDv3 layer based on the source IP address and port information. Each soft ring in turn is tied to a Niagara thread and a Vertical Perimeter, such that packets from a connection have locality to a specific H/W thread on a core and the NIC has locality to a specific core. Think of this model as 4 H/W threads per core processing the NIC, such that if one thread stalls for a resource, the CPU cycles are not wasted. The result is amazing network performance for this beast: 5-6 times the performance of your typical x86 based CPU.

Virtualization
Imagine you are an ISP or someone wanting to consolidate multiple machines onto one physical machine. Niagara based platforms lend themselves beautifully to this concept, because there are so many H/W threads around, which appear as individual CPUs to Solaris. We have a project underway called Crossbow (details available on the Network Community page on OpenSolaris) which will allow you to carve the machine into multiple virtual machines (create virtual network stacks), tie specific CPUs to them and control the B/W utilization for each virtual machine on a shared NIC.

Real Time Networking/Offload
With GLDv3 based drivers and the FireEngine architecture in Solaris 10, the stack controls the rate of interrupts and can dynamically switch the NIC between interrupt and polling mode. Coupled with the Niagara platform, Solaris can run the entire networking stack on one core and provide real time capabilities to the application. Meanwhile, the applications themselves run on different cores without worrying about networking interrupts pinning them down. You can get pretty bounded latencies, provided the application can do some admission control. We are also planning to hide the core running networking from the application, effectively getting TOE for free without suffering from the drawbacks of offloading networking to a separate piece of hardware.
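For readers who want to experiment with the fanout behaviour described above on their own Solaris 10 box, the knobs are, as far as I recall, ip module tunables set in /etc/system; treat the exact names and values below as assumptions to be checked against the Solaris Tunable Parameters guide rather than gospel:

* /etc/system (reboot required)
* Fan incoming connections out across CPUs instead of binding them
* to the squeue of the interrupted CPU.
set ip:ip_squeue_fanout=1
* Number of soft rings used to spread the load from a fast NIC.
set ip:ip_soft_rings_cnt=8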


Solaris

Solaris Networking - The Magic Revealed (Part I)

Many of you have asked for details on Solaris 10 networking. The great news is that I finished writing the treatise on the subject, which will become a new section in the Solaris Internals book by Jim Mauro and Richard McDougall. In the meanwhile, I have used some excerpts to create a mini book (parts I and II) for the Networking community on OpenSolaris. Part II, containing the new high performance GLDv3 based device driver framework, a tuning guide for Solaris 10, etc., is below. Enjoy! As usual, comments (good or bad) are welcome.

Solaris Networking - The Magic Revealed (Part I)

Contents:
- Background
- Solaris 10 stack: Overview, Vertical perimeter, IP classifier, Synchronization mechanism
- TCP: Socket, Bind, Connect, Listen, Accept, Close, Data path, TCP Loopback
- UDP: UDP packet drop within the stack, UDP Module, UDP and Socket interaction, Synchronous STREAMS, STREAMS fallback
- IP: Plumbing NICs, IP Network MultiPathing (IPMP), Multicast
- Solaris 10 Device Driver framework: GLDv2 and Monolithic DLPI drivers (Solaris 9 and before), GLDv3 - A New Architecture, GLDv3 Link aggregation architecture, Checksum offload
- Tuning for performance
- Future
- Acknowledgments

1 Background
The networking stack of Solaris 1.x was a BSD variant and was pretty similar to the BSD Reno implementation. The BSD stack worked fine for low end machines, but Solaris wanted to satisfy the needs of low end customers as well as enterprise customers and as such migrated to the AT&T SVR4 architecture, which became Solaris 2.x. With Solaris 2.x, the networking stack went through a makeover and transitioned from a BSD style stack to a STREAMS based stack. The STREAMS framework provided an easy message passing interface which allowed the flexibility of one STREAMS module interacting with another STREAMS module. Using the STREAMS inner and outer perimeters, a module writer could provide mutual exclusion without making the implementation complex. The cost of setting up a STREAM was high, but the number of connection setups per second was not an important criterion and connections were usually long lived. Since the connections were long lived (NFS, ftp, etc.), the cost of setting up a new stream was amortized over the life of the connection.

During the late 90s, servers became heavily SMP, running large numbers of CPUs. The cost of switching processing from one CPU to another became high as the mid to high end machines became more NUMA centric. Since STREAMS by design did not have any CPU affinity, packets for a particular connection moved around to different CPUs. It was apparent that Solaris needed to move away from the STREAMS architecture. The late 90s also saw the explosion of the web, and the increase in processing power meant a large number of short lived connections, making connection setup time equally important. With Solaris 10, the networking stack went through one more transition where the core pieces (i.e. socket layer, TCP, UDP, IP, and device driver) use an IP classifier and serialization queues to improve connection setup time, scalability, and packet processing cost. STREAMS are still used to provide the flexibility that ISVs need to implement additional functionality.

2 Solaris 10 stack
Let's have a look at the new framework and its key components.

Overview
The pre-Solaris 10 stack uses STREAMS perimeters and kernel adaptive mutexes for multi-threading. TCP uses a STREAMS QPAIR perimeter, UDP uses a STREAMS QPAIR with PUTSHARED, and IP a PERMOD perimeter with PUTSHARED, with various TCP, UDP, and IP global data structures protected by mutexes.
The stack is executed by user-land threads executing various system calls, by the network device driver's read-side interrupt or device driver worker thread, and by STREAMS framework worker threads. The existing perimeters provide per module, per protocol stack layer, i.e. horizontal, protection. This can, and often does, lead to a packet being processed on more than one CPU and by more than one thread, leading to excessive context switching and poor CPU data locality. The problem is further compounded by the various places a packet can get queued under load and the various threads that finally process the packet.

The "FireEngine" approach is to merge all protocol layers into one STREAMS module which is fully multi-threaded. Inside the merged module, instead of using per data structure locks, we use a per CPU synchronization mechanism called the "vertical perimeter". The vertical perimeter is implemented using a serialization queue abstraction called an "squeue". Each squeue is bound to a CPU, and each connection is in turn bound to an squeue, which provides any synchronization and mutual exclusion needed for the connection specific data structures.

The connection (or context) lookup for inbound packets is done outside the perimeter, using an IP connection classifier, as soon as the packet reaches IP. Based on the classification, the connection structure is identified. Since the lookup happens outside the perimeter, we can bind a connection to an instance of the vertical perimeter, or squeue, when the connection is initialized, and process all packets for that connection on the squeue it is bound to, maintaining better cache locality. More details about the vertical perimeter and the classifier are given in later sections. The classifier also becomes the database for storing the sequence of function calls necessary for all inbound and outbound packets. This allows us to change the Solaris networking stack from the current message passing interface to a BSD style function call interface. The string of functions created on the fly (the event list) for processing a packet for a connection is the basis for an eventual new framework where other modules and 3rd party high performance modules can participate.

Vertical perimeter
An squeue guarantees that only a single thread can process a given connection at any given time, thus serializing access to the TCP connection structure by multiple threads (from both the read and write side) in the merged TCP/IP module. It is similar to the STREAMS QPAIR perimeter, but instead of just protecting a module instance, it protects the whole connection state from IP to sockfs.

The vertical perimeter, or squeue, by itself just provides packet serialization and mutual exclusion for the data structures, but by creating a per CPU perimeter and binding a connection to the instance attached to the CPU processing interrupts, we can guarantee much better data locality. We could have chosen between creating a per connection perimeter or a per CPU perimeter, i.e. an instance per connection or per CPU. The overheads involved with a per connection perimeter and the resulting thread contention give lower performance, which made us choose a per CPU instance. For a per CPU instance, we had the choice of queuing the connection structure for processing, or instead just queuing the packet itself and storing the connection structure pointer in the packet. The former approach leads to some interesting starvation scenarios where packets for a connection keep arriving, and the overheads needed to prevent such situations lowered performance.
Queuing the packets themselves preserves ordering, is much simpler, and is thus the approach we have taken for FireEngine.

As mentioned before, each connection instance is assigned to a single squeue and is thus only processed within the vertical perimeter. As an squeue is processed by a single thread at a time, all data structures used to process a given connection from within the perimeter can be accessed without additional locking. This improves both the CPU and thread context data locality of access for the connection meta data, the packet meta data, and the packet payload data. In addition, this allows the removal of per device driver worker thread schemes, which are problematic in solving a system wide resource issue, and allows additional strategic algorithms to be implemented to best handle a given network interface based on the throughput of the network interface and the system (e.g. fanning out per connection packet processing to a group of CPUs).

A thread entering the squeue may either process the packet right away or queue it for later processing by another thread or the worker thread. The choice depends on the squeue entry point and on the state of the squeue. Immediate processing is only possible when no other thread has entered the same squeue. The squeue is represented by the following abstraction:

typedef struct squeue_s {
        int_t           sq_flag;        /* Flags tell squeue status */
        kmutex_t        sq_lock;        /* Lock to protect the flag etc */
        mblk_t          *sq_first;      /* First packet */
        mblk_t          *sq_last;       /* Last packet */
        thread_t        sq_worker;      /* The worker thread for the squeue */
} squeue_t;

It is important to note that the squeues are created on the basis of one per H/W execution pipeline, i.e. cores, hyper threads, etc. The stack processing on a serialization queue (and its H/W execution pipeline) is limited to one thread at a time, but this actually improves performance: the new stack ensures that there are no waits for any resources such as memory or locks inside the vertical perimeter, and allowing more than one kernel thread to time share the H/W execution pipeline has more overhead than allowing a single thread to run uninterrupted.

Queuing Model - The queue is strictly FIFO (first in, first out) for both the read and write side, which ensures that no particular connection suffers or is starved. A read side or write side thread enqueues its packet at the end of the chain. It may then be allowed to process the packet or signal the worker thread, based on the processing model below.

Processing Model - After enqueueing its packet, if another thread is already processing the squeue, the enqueuing thread returns and the packet is drained later based on the drain model. If the squeue is not being processed and there are no packets queued, the thread can mark the squeue as being processed (represented by 'sq_flag') and process the packet. Once it completes processing the packet, it removes the 'processing in progress' flag and makes the squeue free for future processing.

Drain Model - A thread which was successfully able to process its own packet can also drain any packets that were enqueued while it was processing the request. In addition, if the squeue is not being processed but there are packets already queued, then instead of queuing its packet and leaving, the thread can drain the queue and then process its own packet. The worker thread is always allowed to drain the entire queue.

Choosing the correct drain model is quite complicated.
The choices are between "always queue", "process your own packet if you can", and "time bounded process and drain". These options can be applied independently to the read thread and the write thread. Typically, draining by an interrupt thread should always be time-bounded "drain and process", while the write thread can choose between "process your own" and time bounded "process and drain". For Solaris 10, the write thread behavior is a tunable with the default being "process your own", while the read side is fixed to "time bounded process and drain".

The signaling of the worker thread is another option worth exploring. If the packet arrival rate is low and a thread is forced to queue its packet, then the worker thread should be allowed to run as soon as the entering thread finishes processing the squeue, when there is work to be done. On the other hand, if the packet arrival rate is high, it may be desirable to delay waking up the worker thread, hoping for an interrupt to arrive shortly after to complete the drain. Waking up the worker thread immediately when the packet arrival rate is high creates unnecessary contention between the worker and interrupt threads. The default for Solaris 10 is delayed wakeup of the worker thread. Initial experiments on available servers showed that the best results are obtained by waking up the worker thread after a 10ms delay.

Placing a request on the squeue requires a per-squeue lock to protect the state of the queue, but this doesn't introduce scalability problems because it is distributed between CPUs and is only held for a short period of time. We also utilize optimizations which avoid context switches while still preserving the single-threaded semantics of squeue processing. We create an instance of an squeue per CPU in the system and bind the worker thread to that CPU. Each connection is then bound to a specific squeue and thus to a specific CPU as well. The binding of an squeue to a CPU can be changed, but the binding of a connection to an squeue never changes, because of the squeue protection semantics. In the merged TCP/IP case, the vertical perimeter protects the TCP state for each connection. The squeue instance used by each connection is chosen either at "open", "bind" or "connect" time for outbound connections, or at "eager connection creation time" for inbound ones.

The choice of the squeue instance depends on the relative speeds of the CPUs and the NICs in the system. There are two cases:
- CPU is faster than the NIC: incoming connections are assigned to the squeue instance of the interrupted CPU. For the outbound case, connections are assigned to the squeue instance of the CPU the application is running on.
- NIC is faster than the CPU: a single CPU is not capable of handling the NIC. Connections are bound in a random manner across all available squeues.
For Solaris 10, the determination of whether the NIC is faster or slower than the CPU is made by the system administrator by tuning the global variable 'ip_squeue_fanout'. The default is 'no fanout', i.e. assign the incoming connection to the squeue attached to the interrupted CPU. For the purposes of taking a CPU offline, the worker thread bound to that CPU removes its binding and restores it when the CPU comes back online. This allows the DR functionality to work correctly. When packets for a connection are arriving on multiple NICs (and thus interrupting multiple CPUs), they are always processed on the squeue the connection was originally established on. In Solaris 10, vertical perimeters are provided only for TCP based connections.
The interface to the vertical perimeter is done at the TCP and IP layers after determining that it is a TCP connection. Solaris 10 updates will introduce a general vertical perimeter for any use. The squeue APIs look like:

squeue_t *squeue_create(squeue_t *, uint32_t, processorid_t, void (*)(), void *, clock_t, pri_t);
void squeue_bind(squeue_t *, processorid_t);
void squeue_unbind(squeue_t *);
void squeue_enter(squeue_t *, mblk_t *, void (*)(), void *);
void squeue_fill(squeue_t *, mblk_t *, void (*)(), void *);

squeue_create() instantiates a new squeue; squeue_bind() and squeue_unbind() bind it to or unbind it from a particular CPU. Squeues, once created, are never destroyed. squeue_enter() is used to try to access the squeue, and the entering thread is allowed to process and drain the squeue based on the models discussed before. squeue_fill() just queues a packet on the squeue, to be processed later by the worker thread or another thread.

IP Classifier

The IP connection fanout mechanism consists of 3 hash tables: a 5-tuple hash table {protocol, remote and local IP addresses, remote and local ports} holding fully qualified TCP (ESTABLISHED) connections, a 3-tuple table {protocol, local address, local port} holding the listeners, and a single-tuple table for protocol listeners. As part of the lookup, a connection structure (a superset of all connection information) is returned. This connection structure is called 'conn_t' and is abstracted below:

typedef struct conn_s {
    kmutex_t      conn_lock;      /* Lock for conn_ref */
    uint32_t      conn_ref;       /* Reference counter */
    uint32_t      conn_flags;     /* Flags */
    struct ill_s  *conn_ill;      /* The ill packets are coming on */
    struct ire_s  *conn_ire;      /* ire cache for outbound packets */
    tcp_t         *conn_tcp;      /* Pointer to tcp struct */
    void          *conn_ulp;      /* Pointer for upper layer */
    edesc_pf      conn_send;      /* Function to call on write side */
    edesc_pf      conn_recv;      /* Function to call on read side */
    squeue_t      *conn_sqp;      /* Squeue for processing */
    /* Addresses and ports */
    struct {
        in6_addr_t connua_laddr;  /* Local address */
        in6_addr_t connua_faddr;  /* Remote address */
    } connua_v6addr;
#define conn_src    V4_PART_OF_V6(connua_v6addr.connua_laddr)
#define conn_rem    V4_PART_OF_V6(connua_v6addr.connua_faddr)
#define conn_srcv6  connua_v6addr.connua_laddr
#define conn_remv6  connua_v6addr.connua_faddr
    union {
        /* Used for classifier match performance */
        uint32_t conn_ports2;
        struct {
            in_port_t tcpu_fport; /* Remote port */
            in_port_t tcpu_lport; /* Local port */
        } tcpu_ports;
    } u_port;
#define conn_fport  u_port.tcpu_ports.tcpu_fport
#define conn_lport  u_port.tcpu_ports.tcpu_lport
#define conn_ports  u_port.conn_ports2
    uint8_t       conn_protocol;  /* protocol type */
    kcondvar_t    conn_cv;
} conn_t;

The interesting member to note is the pointer to the squeue, i.e. the vertical perimeter. The lookup is done outside the perimeter, and the packet is then processed or queued on the squeue the connection is attached to. Also, conn_recv and conn_send point to the read-side and write-side functions; the read-side function is 'tcp_input' if the packet is meant for TCP. The connection fan-out mechanism also has provisions for supporting wildcard listeners, i.e. INADDR_ANY. Currently, the connected and bind tables are primarily for TCP and UDP only. A listener entry is made during a listen() call. The entry is made into the connected table after the three-way handshake is complete for TCP.
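To connect the classifier with the vertical perimeter, here is a hedged sketch of how an inbound TCP segment could be matched against the 5-tuple table and then handed to the squeue the connection points to. Only the squeue_enter() prototype is taken from the text above; the structure layout, hash function and helper names are simplified inventions, and the real code also places an extra reference on the conn_t and falls back to the 3-tuple listener table on a miss.

#include <stdint.h>

typedef struct mblk   mblk_t;           /* opaque packet, as in STREAMS */
typedef struct squeue squeue_t;         /* opaque squeue */

typedef struct conn {                   /* reduced conn_t (IPv4 only) */
    struct conn *c_next;                /* hash chain */
    uint8_t      c_protocol;
    uint32_t     c_laddr, c_faddr;
    uint16_t     c_lport, c_fport;
    squeue_t    *c_sqp;                 /* vertical perimeter of this conn */
    void       (*c_recv)(struct conn *, mblk_t *);  /* e.g. tcp_input */
} conn_t;

#define CONN_HASH_SIZE  256
static conn_t *conn_hash[CONN_HASH_SIZE];

/* Prototype as listed above; the squeue serializes the processing. */
extern void squeue_enter(squeue_t *, mblk_t *, void (*)(), void *);

static unsigned
hash_5tuple(uint8_t proto, uint32_t laddr, uint32_t faddr,
    uint16_t lport, uint16_t fport)
{
    return ((proto ^ laddr ^ faddr ^ lport ^ fport) & (CONN_HASH_SIZE - 1));
}

/*
 * Classify an inbound segment to its fully qualified (ESTABLISHED)
 * connection and queue it on that connection's squeue.  The lookup runs
 * outside any perimeter; only the processing is serialized.
 */
void
classify_and_dispatch(uint8_t proto, uint32_t laddr, uint32_t faddr,
    uint16_t lport, uint16_t fport, mblk_t *mp)
{
    unsigned h = hash_5tuple(proto, laddr, faddr, lport, fport);
    conn_t *connp;

    for (connp = conn_hash[h]; connp != NULL; connp = connp->c_next) {
        if (connp->c_protocol == proto &&
            connp->c_laddr == laddr && connp->c_faddr == faddr &&
            connp->c_lport == lport && connp->c_fport == fport)
            break;
    }
    if (connp == NULL)
        return;     /* real code: try the 3-tuple listener table next */

    /* Real code: take an extra reference here so the conn_t cannot go away. */
    squeue_enter(connp->c_sqp, mp, (void (*)())connp->c_recv, connp);
}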
The IP Classifier APIs look like:

conn_t *ipcl_conn_create(uint32_t type, int sleep);
void ipcl_conn_destroy(conn_t *connp);
int ipcl_proto_insert(conn_t *connp, uint8_t protocol);
int ipcl_proto_insert_v6(conn_t *connp, uint8_t protocol);
conn_t *ipcl_proto_classify(uint8_t protocol);
int ipcl_bind_insert(conn_t *connp, uint8_t protocol, ipaddr_t src, uint16_t lport);
int ipcl_bind_insert_v6(conn_t *connp, uint8_t protocol, const in6_addr_t *src, uint16_t lport);
int ipcl_conn_insert(conn_t *connp, uint8_t protocol, ipaddr_t src, ipaddr_t dst, uint32_t ports);
int ipcl_conn_insert_v6(conn_t *connp, uint8_t protocol, in6_addr_t *src, in6_addr_t *dst, uint32_t ports);
void ipcl_hash_remove(conn_t *connp);
conn_t *ipcl_classify_v4(mblk_t *mp);
conn_t *ipcl_classify_v6(mblk_t *mp);
conn_t *ipcl_classify(mblk_t *mp);

The names of the functions are pretty self-explanatory.

Synchronization mechanism

Since the stack is fully multithreaded (barring the per-CPU serialization enforced by the vertical perimeter), it uses a reference-based scheme to ensure that a connection instance is available when needed. The reference count is implemented by the 'conn_t' member 'conn_ref' and protected by 'conn_lock'. The prime purpose of the lock is not to protect the bulk of 'conn_t' but just the reference count. Each time some entity takes a reference to the data structure (stores a pointer to it for later processing), it increments the reference count by calling the CONN_INC_REF macro, which acquires 'conn_lock', increments 'conn_ref' and drops 'conn_lock'. Each time the entity drops its reference to the connection instance, it does so using the CONN_DEC_REF macro. An established TCP connection is guaranteed to have 3 references on it: each protocol layer has a reference on the instance (one each for TCP and IP), and the classifier itself has a reference since it is an established connection. Each time a packet arrives for the connection and the classifier looks up the connection instance, an extra reference is placed, which is dropped when the protocol layer finishes processing that packet. Similarly, any timers running on the connection instance hold a reference to ensure that the instance is around whenever the timer fires. The memory associated with the connection instance is freed once the last reference is dropped.

3 TCP

Solaris 10 provides the same view of TCP as previous releases, i.e. TCP appears as a clone device, but it is actually a composite, with the TCP and IP code merged into a single D_MP STREAMS module. The merged TCP/IP module's STREAMS entry points for open and close are the same as IP's entry points, viz. ip_open and ip_close. Based on the major number passed during open, IP decides whether the open corresponds to a TCP open or an IP open. The put and service STREAMS entry points for TCP are tcp_wput, tcp_wsrv and tcp_rsrv. The tcp_wput entry point simply serves as a wrapper routine that enables sockfs and other modules on top to talk to TCP using STREAMS. Note that tcp_rput is missing, since IP calls TCP functions directly. IP's STREAMS entry points remain unchanged. The operational part of TCP is fully protected by the vertical perimeter, which is entered through the squeue_* primitives as illustrated in Fig 4. Packets flowing from the top enter TCP through the wrapper function tcp_wput, which then tries to execute the real TCP output processing function tcp_output after entering the corresponding vertical perimeter.
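In other words, the wrapper itself touches no TCP state; everything of substance happens inside the perimeter. A hedged sketch of the write-side hand-off follows (how the conn_t is actually obtained from the STREAMS queue is glossed over, and the names below are illustrative):

typedef struct mblk   mblk_t;
typedef struct squeue squeue_t;

typedef struct tcp_conn {
    squeue_t *conn_sqp;         /* squeue chosen at open/connect time */
    /* ... remainder of the conn_t / tcp_t state ... */
} tcp_conn_t;

/* Prototype as listed earlier; tcp_output() is where the real work happens. */
extern void squeue_enter(squeue_t *, mblk_t *, void (*)(), void *);
extern void tcp_output(void *connp, mblk_t *mp);

/*
 * Write-side wrapper: no TCP state is touched here, since that would race
 * with the read side.  The message is simply handed to the connection's
 * vertical perimeter, which runs tcp_output() single-threaded with respect
 * to everything else on this connection.
 */
void
tcp_wput_sketch(tcp_conn_t *connp, mblk_t *mp)
{
    squeue_enter(connp->conn_sqp, mp, (void (*)())tcp_output, connp);
}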
Similarly packets coming from the bottom try toexecute the real TCP input processing function tcp_input after enteringthe vertical perimeter. There are multiple entry points into TCPthrough the vertical perimeter. Fig. 4tcp_input - All inbound data packets and control messagestcp_output - All outbound data packets and control messagestcp_close_output - On user close tcp_timewait_output - timewait expirytcp_rsrv_input - Flowcontrol relief on read side.tcp_timer - All tcp timersThe Interface between TCP and IPFireEngine changes the interface between TCP and IP from the existingSTREAMS based message passing interface to a functional call basedinterface, both in the control and data paths. On the outbound side TCPpasses a fully prepared packet directly to IP by calling ip_output,while being inside the vertical perimeter. Similarly control messages are also passed directly as functionarguments. ip_bind_v{4, 6} receives a bind message as an argument,performs the required action and returns a result mp to the caller. TCPdirectly calls ip_bind_v{4, 6} in the connect(), bind() and listen()paths. IP still retains all its STREAMs entry point but TCP (/dev/tcp)becomes a real device driver i.e. It can't be pushed over other devicedrivers. The basic protocol processing code was unchanged. Lets have a look atcommon socket calls and see how they interact with the framework. SocketA socket open of TCP or open of /dev/tcp eventually calls into ip_open.The open then calls into the IP connection classifier and allocates theper-TCP endpoint control block already integrated with the conn_t. Itchooses the squeue for this connection. In the case of an internal openi.e by sockfs for an acceptor stream, almost nothing is done, and wedelay doing useful work till accept time. Bindtcp_bind eventually needs to talk to IP to figure out whether theaddress passed in is valid. FireEngine TCP prepares this request asusual in the form of a TPI message. However this messages is directlypassed as a function argument to ip_bind_v{4, 6}, which returns theresult as another message. The use of messages as parameters is helpfulin leveraging the existing code with minimal change. The port hashtable used by TCP to validate binds still remains in TCP, since theclassifier has no use for it. ConnectThe changes in tcp_connect are similar to tcp_bind. The full bind()request is prepared as a TPI message and passed as a function argumentto ip_bind_v{4, 6}. IP calls into the classifier and inserts theconnection in the connected hash table. The conn_ hash table in TCP isno longer used. ListenThis path is part of tcp_bind. The tcp_bind prepares a local bind TPImessage and passes it as a function argument to ip_bind_v{4, 6}. IPcalls the classifier and inserts the connection in the bind hash table.The listen hash table of TCP does not exist any more. AcceptThe pre Solaris 10 accept implementation did the bulk of the connectionsetup processing in the listener context. The three way handshake wascompleted in listener's perimeter and the connection indication wassent up the listener's STREAM. The messages necessary to perform theaccept were sent down on the listener STREAM and the listener wassingle threaded from the point of sending the T_CONN_RES message to TCPtill sockfs received the acknowledgment. If the connection arrival ratewas high, the ability of pre Solaris 10 stack to accept new connectionsdeteriorated significantly. Furthermore, there were some additional TCP overhead involved, whichcontribute to slower accept rate. 
When sockfs opened an acceptor STREAMto TCP to accept a new connection, TCP was not aware that the datastructures necessary for the new connection have already beenallocated. So it allocated new structures and initializes them butlater as part of the accept processing these are freed. Another majorproblem with the pre Solaris 10 design was that packets for a newlycreated connection arrived on the listener's perimeter. This requires acheck for every incoming packet and packets landing on the wrongperimeter need to be sent to their correct perimeter causing additionaldelay. The FireEngine model establishes an eager connection (a incomingconnection is called eager till accept completes) in its own perimeteras soon as a SYN packet arrives thus making sure that packets alwaysland on the correct connection. As a result it is possible tocompletely eliminate the TCP global queues. The connection indicationis still sent to the listener on the listener's STREAM but the accepthappens on the newly created acceptor STREAM (thus, there is no need toallocate data structures for this STREAM) and the acknowledgment can besent on the acceptor STREAM. As a result, sockfs doesn't need to becomesingle threaded at any time during the accept processing. The new model was carefully implemented because the new incomingconnection (eager) exists only because there is a listener for it andboth eager and listener can disappear at any time during acceptprocessing as a result of eager receiving a reset or listener closing. The eager starts out by placing a reference on the listener so that theeager reference to the listener is always valid even though thelistener might close. When a connection indication needs to be sentafter the three way handshake is completed, the eager places areference on itself so that it can close on receiving a reset but anyreference to it is still valid. The eager sends a pointer to itself aspart of the connection indication message, which is sent via thelistener's STREAM after checking that the listener has not closed. Whenthe T_CONN_RES message comes down the newly created acceptor STREAM, weagain enter the eager's perimeter and check that the eager has notclosed because of receiving a reset before completing the acceptprocessing. For TLI/XTI based applications, the T_CONN_RES message isstill handled on the listener's STREAM and the acknowledgment is sentback on listener's STREAMs so there is no change in behavior. CloseClose processing in tcp now does not have to wait till the referencecount drops to zero since references to the closing queue andreferences to the TCP are now decoupled. Close can return as soon asall references to the closing queue are gone. The TCP data structuresthemself may continue to stay around as a detached TCP in most cases.The release of the last reference to the TCP frees up the TCP datastructure. A user initiated close only closes the stream. The underlying TCPstructures may continue to stay around. The TCP then goes through theFIN/ACK exchange with the peer after all user data is transferred andenters the TIME_WAIT state where it stays around for a certain durationof time. This is called a detached TCP. These detached TCPs also needprotection to prevent outbound and inbound processing from happening atthe same time on a given detached TCP. Data pathTCP does not even need to call IP to transmit the outbound packet inthe most common case, if it can access the IRE. 
With a merged TCP/IP wehave the advantage of being able to access the cached ire for aconnection, and TCP can putnext the data directly to the link layerdriver based on the information in the IRE. FireEngine does exactly theabove. TCP LoopbackTCP Fusion is a protocol-less data path for loopback TCP connections inSolaris 10. The fusion of two local TCP endpoints occurs at connectionestablishment time. By default, all loopback TCP connections are fused.This behavior may be changed by setting the system wide tunable do tcpfusion to 0. Various conditions on both endpoints need to be met forfusion to be successful: They must share a common squeue. They must be TCP and not "raw socket". They must not require protocol-level processing, i.e. IPsec or IPQoS policy is not presentfor the connection.If it fails, we fall back to the regular TCP data path; if it succeeds,both endpoints proceed to use tcp fuse output() as the transmit path.tcp fuse output() enqueues application data directly onto the peer'sreceive queue; no protocol processing is involved. After enqueueing thedata, the sender can either push - by calling putnext(9F) - the data upthe receiver's read queue; or the sender can simply return and let thereceiver retrieve the enqueued data via the synchronous STREAMS entrypoint. The latter path is taken if synchronous STREAMS is enabled.Itgets automatically disabled if sockfs no longer resides directly on topof TCP module due to a module insertion or removal. Locking in TCP Fusion is handled by squeue and the mutex tcp fuse lock.One of the requirements for fusion to succeed is that both endpointsneed to be using the same squeue. This ensures that neither side candisappear while the other side is still sending data. By itself, squeueis not sufficient for guaranteeing safe access when synchronous STREAMSis enabled. The reason is that tcp fuse rrw() doesn't enter the squeue,and its access to tcp rcv list and other fusion-related fields needs tobe synchronized with the sender. tcp fuse lock is used for thispurpose. Rate Limit for Small Writes Flow control for TCP Fusion in synchronousstream mode is achieved by checking the size of receive buffer and thenumber of data blocks, both set to different limits. This is differentthan regular STREAMS flow control where cumulative size check dominatesdata block count check (STREAMS queue high water mark typicallyrepresents bytes). Each enqueue triggers notifications sent to thereceiving process; a build up of data blocks indicates a slow receiverand the sender should be blocked or informed at the earliest momentinstead of further wasting system resources. In effect, this isequivalent to limiting the number of outstanding segments in flight. The minimum number of allowable enqueued data blocks defaults to 8 andis changeable via the system wide tunable tcp_fusion_burst_min toeither a higher value or to 0 (the latter disables the burst check). 4 UDPApart from the framework improvements, Solaris 10 made additionalchanges in the UDP packets move through the stack. The internal codename for the project was "Yosemite". Pre Solaris 10, the UDP processingcost was evenly divided between per packet processing cost and per byteprocessing cost. The packet processing cost was generally due toSTREAMS; the stream head processing; and packet drops in the stack anddriver. The per byte processing cost was due to lack of H/W cksum andunoptimized code branches throughout the network stack. 
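Stepping back to TCP Fusion for a moment, the eligibility test described above amounts to a handful of checks made once, at connection establishment time. A simplified sketch follows; the structure and field names here are hypothetical, and the real code checks more conditions before fusing:

#include <stdbool.h>

typedef struct tcp_ep {
    struct squeue *te_sqp;      /* squeue the endpoint is bound to */
    bool           te_is_tcp;   /* TCP socket, not a raw socket */
    bool           te_ipsec;    /* IPsec policy present */
    bool           te_ipqos;    /* IPQoS policy present */
} tcp_ep_t;

/*
 * Decide whether two loopback endpoints can be fused.  If this fails, the
 * connection simply falls back to the regular TCP data path.
 */
bool
tcp_fusion_ok(const tcp_ep_t *a, const tcp_ep_t *b, bool do_tcp_fusion)
{
    if (!do_tcp_fusion)                     /* system-wide tunable */
        return (false);
    if (a->te_sqp != b->te_sqp)             /* must share a common squeue */
        return (false);
    if (!a->te_is_tcp || !b->te_is_tcp)     /* raw sockets are excluded */
        return (false);
    if (a->te_ipsec || b->te_ipsec || a->te_ipqos || b->te_ipqos)
        return (false);                     /* no protocol-level processing */
    return (true);
}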
UDP packetdrop within the stackAlthough UDP is supposed to be unreliable, the local area networks havebecome pretty reliable and applications tend to assume that there willbe no packet loss in a LAN environment. This assumption was largelytrue but pre Solaris 10 stack was not very effective in dealing withUDP overload and tended to drop packets within the stack itself. On Inbound, packets were dropped at more than one layers throughout thereceive path. For UDP, the most common and obvious place is at the IPlayer due to the lack of resources needed to queue the packets. Anotherimportant yet in-apparent place of packet drops is at the networkadapter layer. This type of drop is fairly common to occur when themachine is dealing with a high rate of incoming packets. UDP sockfs The UDP sockfs extension (sockudp) is an alternative path tosocktpi used for handling sockets-based UDP applications. It providesfor a more direct channel between the application and the network stackby eliminating the stream head and TPI message-passing interface. Thisallows for a direct data and function access throughout the socket andtransport layers. This allows the stack to become more efficient andcoupled with UDP H/W checksum offload (even for fragmented UDP),ensures that UDP packets are rarely dropped within the stack. UDP ModuleA fully multi-threaded UDP module running under the same protectiondomain as IP. It allows for a tighter integration of the transport(UDP) with the layers above and below it. This allows socktpi to makedirect calls to UDP. Similarly UDP may also make direct calls to thedata link layer. In the post GLDv3 world, the data link layer may alsomake direct calls to the transport. In addition, utility functions canbe called directly instead of using message-based interface. UDP needs exclusive operation on a per-endpoint basis, when executingfunctions that modify the endpoint state. udp rput other() deals withpackets with IP options, and processing these packets end up having toupdate the endpoint's option related state. udp wput other() deals withcontrol operations from the top, e.g. connect(3SOCKET) that needs toupdate the endpoint state. In the STREAMS world this synchronizationwas achieved by using shared inner perimeter entry points, and by usingqwriter inner() to get exclusive access to the endpoint. The Solaris 10 model uses an internal, STREAMS-independent perimeter toachieve the above synchronization and is described below: udp enter() - Enter the UDP endpoint perimeter. udp becomewriter() i.e.become exclusive on the UDP endpoint. Specifies a functionthat will be called exclusively either immediately or later when theperimeter is available exclusively.udp exit() - Exit the UDP endpoint perimeter.Entering UDP from the top or from the bottom must be done using udpenter(). As in the general cases, no locks may be held across theseperimeter. When finished with the exclusive mode, udp exit() must becalled to get out of the perimeter. To support this, the new UDP model employs two modes of operationnamely UDP MT HOT mode and UDP SQUEUE mode. In the UDP MT HOT mode,multiple threads may enter a UDP endpoint concurrently. This is usedfor sending or receiving normal data and is similar to the putsharedSTREAMS entry points. Control operations and other special cases calludp become writer() to become exclusive on a per-endpoint basis andthis results in transitioning to the UDP SQUEUE mode. squeue bydefinition serializes access to the conn t. 
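A caller-side sketch of that per-endpoint perimeter is shown below. The prototypes are simplified guesses based on the description above (they are not the actual kernel signatures), and the hand-off of the perimeter between the entering thread and a deferred exclusive function is glossed over:

typedef struct udp_ep udp_t;        /* opaque UDP endpoint (stand-in) */
typedef struct mblk   mblk_t;

/* Perimeter primitives described above; signatures simplified. */
extern void udp_enter(udp_t *);
extern void udp_exit(udp_t *);
extern void udp_become_writer(udp_t *, void (*func)(udp_t *, mblk_t *), mblk_t *);

/* Normal data: many threads may be inside the endpoint at once (UDP MT HOT). */
void
udp_send_data_sketch(udp_t *udp, mblk_t *mp)
{
    udp_enter(udp);
    /* ... per-packet work that does not modify endpoint state ... */
    udp_exit(udp);
}

/* The state update behind, e.g., connect(3SOCKET): must run exclusively. */
static void
udp_do_connect(udp_t *udp, mblk_t *mp)
{
    /* Runs with the endpoint held exclusively (UDP SQUEUE mode). */
}

void
udp_connect_sketch(udp_t *udp, mblk_t *mp)
{
    udp_enter(udp);
    /*
     * Either runs udp_do_connect() right away or defers it until the
     * endpoint can be made exclusive; once the exclusive work and any
     * queued messages are done, the endpoint reverts to MT HOT mode.
     */
    udp_become_writer(udp, udp_do_connect, mp);
    udp_exit(udp);
}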
When there are no morepending messages on the squeue for the UDP connection, the endpointreverts to MT HOT mode. In between when not all MT threads of anendpoint have finished, messages are queued in the endpoint and the UDPis in one of two transient modes, i.e. UDP MT QUEUED or UDP QUEUEDSQUEUE mode. While in stable modes, UDP keeps track of the number of threadsoperating on the endpoint. The udp reader count variable represents thenumber of threads entering the endpoint as readers while it is in UDPMT HOT mode. Transitioning to UDP SQUEUE happens when there is only asingle reader, i.e. when this counter drops to 1. Likewise, udp squeuecount represents the number of threads operating on the endpoint'ssqueue while it is in UDP SQUEUE mode. The mode transition to UDP MTHOT happens after the last thread exits the endpoint. Though UDP and IP are running in the same protection domain, they arestill separate STREAMS modules. Therefore, STREAMS plumbing is keptunchanged and a UDP module instance is always pushed above IP. Althoughthis causes an extra open and close for every UDP endpoint, it providesbackwards compatibility for some applications that rely on suchplumbing geometry to do certain things, e.g. issuing I POP on thestream to obtain direct access to IP9. The actual UDP processing is done within the IP instance. The UDPmodule instance does not possess any state about the endpoint andmerely acts as a dummy module, whose presence is to keep the STREAMSplumbing appearance unchanged. Solaris 10 allows for the following plumbing modes: Normal - IP is first opened and later UDP is pushed directly ontop. This is the default action that happens when a UDP socket ordevice is opened.SNMP - UDP is pushed on top of a module other than IP. When thishappens it will support only SNMP semantics.These modes imply that we don't support any intermediate module betweenIP and UDP; in fact, Solaris has never supported such scenario in thepast as the inter-layer communication semantics between IP andtransport modules are private. UDP andSocket interactionA significant event that takes place during socket(3SOCKET) system callis the plumbing of the modules associated with the socket's addressfamily and protocol type. A TCP or UDP socket will most likely resultin sockfs residing directly atop the corresponding transport module.Pre Solaris 10, Socket layer used STREAMs primitives to communicatewith UDP module. Solaris 10 allowed for a functionally callableinterface which eliminated the need to use T UNITDATA REQ message formetadata during each transmit from sockfs to UDP. Instead, data and itsancillary information (i.e. remote socket address) could be provideddirectly to an alternative UDP entry point, therefore avoiding theextra allocation cost. For transport modules, being directly beneath sockfs allows forsynchronous STREAMS to be used. This enables the transport layer tobuffer incoming data to be later retrieved by the application (viasynchronous STREAMS) when a read operation is issued, thereforeshortening the receive processing time. SynchronousSTREAMSSynchronous STREAMS is an extension to the traditional STREAMSinterface for message passing and processing. It was originally addedas part of the combined copy and checksum effort. It offers a way forthe entry point of the module or driver to be called in synchronousmanner with respect to user I/O request. In traditional STREAMS, thestream head is the synchronous barrier for such request. 
SynchronousSTREAMS provides a mechanism to move this barrier from the stream headdown to a module below. The TCP implementation of synchronous STREAMS in pre Solaris 10 wascomplicated, due to several factors. A major factor was the combinedchecksum and copyin/copyout operations. In Solaris 10, TCP wasn'tdependent on checksum during copyin/copyout, so the mechanism wasgreatly simplified for use with loopback TCP and UDP on the read side.The synchronous STREAMS entry points are called during requests such asread(2) or recv(3SOCKET). Instead of sending the data upstream usingputnext(9F), these modules enqueue the data in their internal receivequeues and allow the send thread to return sooner. This avoids callingstrrput() to enqueue the data at the stream head from within the sendthread context, therefore allowing for better dynamics - reducing theamount of time taken to enqueue and signal/poll-notify the receivingapplication allows the send thread to return faster to do further work,i.e. things are less serialized than before. Each time data arrives, the transport module schedules for theapplication to retrieve it. If the application is currently blocked(sleeping) during a read operation, it will be unblocked to allow it toresume execution. This is achieved by calling STR WAKEUP SET() on thestream. Likewise, when there is no more data available for theapplication, the transport module will allow it to be blocked againduring the next read attempt, by calling STR WAKEUP CLEAR(). Any newdata that arrives before then will override this state and causesubsequent read operation to proceed. An application may also be blocked in poll(2) until a read event takesplace, or it may be waiting for a SIGPOLL or SIGIO signal if the socketused is non-blocking. Because of this, the transport module deliversthe event notification and/or signals the application each time itreceives data. This is achieved by calling STR SENDSIG() on thecorresponding stream. As part of the read operation, the transport module delivers data tothe application by returning it from its read side synchronous STREAMSentry point. In the case of loopback TCP, the synchronous STREAM readentry point returns the entire content (byte stream) of its receivequeue to the stream head; any remaining data will be re-enqueued at thestream head awaiting the next read. For UDP, the read entry pointreturns only one message (datagram) at a time. STREAMs fallbackBy default, direct transmission and read side synchronous STREAMSoptimizations are enabled for all UDP and loopback TCP sockets whensockfs is directly above the corresponding transport module. There areseveral cases which require these features to be disabled; when thishappens, message exchange between sockfs and the transport module mustthen be done through putnext(9F). The cases are described as follows - Intermediate Module - A module is configured to be autopushed atopen time on top of the transport module via autopush(1M), or is IPUSH'd on a socket via ioctl(2). Stream Conversion - The imaginary sockmod module is I POP'd froma socket causing it to be converted from a socket endpoint into adevice stream.(Note that I INSERT or I REMOVE ioctl is not permitted on a socketendpoint and therefore a fallback is not required to handle it.) If a fallback is required, sockfs will notify the transport module thatdirect mode is disabled. 
The notification is sent down by the sockfsmodule in the form of an ioctl message, which indicates to thetransport module that putnext(9F) must now be used to deliver dataupstream. This allows for data to flow through the intermediate moduleand it provides for compatibility with device stream semantics. 5 IPAs mentioned before, all the transport layers have been merged in IPmodule which is fully multithreaded and acts as a pseudo device driveras well a STREAMs module. The key change in IP was the removal IPclient functionality and multiplexing the inbound packet stream. Thenew IP Classifier (which is still part of IP module) is responsible forclassifying the inbound packets to the correct connection instance. IPmodule is still responsible for network layer protocol processing andplumbing and managing the network interfaces. Lets have a quick look at how plumbing of network interfaces, multipathing, and multicast works in the new stack. Plumbing NICsPlumbing is a long sequence of operations involving message exchangesbetween IP, ARP and device drivers. Most set ioctls are typicallyinvolved in plumbing operations. A natural model is to serialize theseioctls one per ill. For example plumbing of hme0 and qfe0 can go on inparallel without any interference. But various set ioctls on hme0 willall be serialized. Another possibility is to fine-grain even further and serializeoperations per ipif rather than per ill. This will be beneficial onlyif many ipifs are hosted on an ill, and if the operations on differentipifs don't have any mutual interference. Another possibility is tocompletely multithread all ioctls using standard Solaris MT techniques.But this is needlessly complex and does not have much added value. Itis hard to hold locks across the entire plumbing sequence, whichinvolves waits, and message exchanges with drivers or other modules.Not much is gained in performance or functionality by simultaneouslyallowing multiple set ioctls on an ipif at the same time since theseare purely non-repetitive control operations. Broadcast ires arecreated on a per ill basis rather than per ipif basis. Hence trying tobring up more than 1 ipif simultaneously on an ill involves extracomplexity in the broadcast ire creation logic. On the other handserializing plumbing operations per ill lends itself easily to theexisting IP code base. During the course of plumbing IP exchangesmessages with the device driver and ARP. The messages received from theunderlying device driver are also handled exclusively in IP. This isconvenient since we can't hold standard mutex locks across the putnextin trying to provide mutual exclusion between the write side and readside activities. Instead of the all exclusive PERMOD syncq, this effectcan be easily achieved by using a per ill serialization queue. IP NetworkMultiPathing (IPMP)IPMP operations are all driven around the notion of an IPMP group.Failover and Failback operations operate between 2 ills, usually partof the same IPMP group. The ipifs and ilms are moved between the ills.This involves bringing down the source ill and could involve bringingup the destination ill. Bringing down or bringing up ills affectbroadcast ires. Broadcast ires need to be grouped per IPMP group tosuppress duplicate broadcast packets that are received. Thus broadcastire manipulation affects all members of the IPMP group. SettingIFF_FAILED or IFF_STANDBY causes evaluation of all ills in the IPMPgroup and causes regrouping of broadcast ires. 
Thus serializing IPMP operations per IPMP group lends itself easily to the existing code base. An IPMP group includes both the IPv4 and IPv6 ills.

Multicast

Multicast joins operate on both the ilg and ilm structures. Multiple threads operating on an ipc (socket) trying to do multicast joins need to synchronize when operating on the ilg. Multiple threads, potentially operating on different ipcs (socket endpoints) and trying to do multicast joins, could eventually end up manipulating the ilm simultaneously and need to synchronize their access to the ilm. Both are amenable to standard Solaris MT techniques. Considering all of the above, i.e. plumbing, IPMP and multicast, the common denominator is to serialize all exclusive operations on a per-IPMP-group basis, or, if IPMP is not enabled, on a per-phyint basis (e.g. the hme0 v4 and hme0 v6 ills taken together share a phyint). Of these, multicast has a potentially higher degree of multithreading, but it has to coexist with the other exclusive operations. For example, we don't want a thread to create or delete an ilm while a failover operation is already in progress trying to move ilms between 2 ills. So the lowest common denominator is to serialize multicast joins per physical interface or IPMP group.
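Schematically, every such exclusive operation is therefore funneled through a single serialization context whose scope depends on the configuration. The sketch below is purely illustrative; the type and field names are invented for the illustration and do not correspond to the actual ill/ipsq data structures:

typedef struct serial_q serial_q_t;     /* per-scope serialization queue */

typedef struct ipmp_group {
    serial_q_t *ig_sq;                  /* shared by all ills in the group */
} ipmp_group_t;

typedef struct phyint {
    ipmp_group_t *pi_group;             /* NULL when not in an IPMP group */
    serial_q_t   *pi_sq;                /* shared by the v4 and v6 ills */
} phyint_t;

/*
 * Pick the context on which an exclusive operation (plumbing ioctl,
 * failover/failback, multicast join, ...) must be serialized: the IPMP
 * group if there is one, otherwise the physical interface.
 */
serial_q_t *
exclusive_op_context(phyint_t *phyi)
{
    if (phyi->pi_group != NULL)
        return (phyi->pi_group->ig_sq);
    return (phyi->pi_sq);
}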


Solaris

Solaris Networking - The Magic Revealed (Part II)

Solaris Networking - The Magic RevealedMany of you have asked for details on Solaris 10 networking.  Thegreat news is that I finished writing the treatise on the subject whichwill become a new section in Solaris Internals bookby Jim Mauro and Richard Mcdougall. In the meawhile, I have used some excerpts to create a mini book (partII) for Networkingcommunity on OpenSolaris. Enjoy! As usual, comments(good or bad) are welcome. Solaris Networking - The Magic Revealed(Part II)6. Solaris 10DeviceDriver frameworkGLDv2 andMonolithic DLPI drivers (Solaris 9 and before)GLDv3 - A NewArchitectureGLDv3 Linkaggregation architectureChecksum offload7. Tuning forperformance:8. Future9. Acknowledgments6. Solaris 10 DeviceDriver frameworkLets have a quick look at how Network device drivers were implementedpre Solaris 10 and why they need to change with the new Solaris 10stack. GLDv2 andMonolithic DLPI drivers (Solaris 9 and before)Pre Solaris 10, network stack relays on DLPI1 providers, which arenormally implemented in one of two ways. The following illustrations(Fig 5) show a stack based on a so-called monolithic DLPI driver and astack based on a driver utilizing the Generic LAN Driver (GLDv2)module. Fig. 5The GLDv2 module essentially behaves as a library. The client stilltalks to the driver instance bound to the device but the DLPI protocolprocessing is handled by calling into the GLDv2 module, which will thencall back into the driver to access the hardware. Using the GLD modulehas a clear advantage in that the driver writer need not re-implementlarge amounts of mostly generic DLPI protocol processing. Layer two(Data-Link) features such as 802.1q Virtual LANs (VLANs) can also beimplemented centrally in the GLD module allowing them to be leveragedby all drivers. The architecture still poses a problem though whenconsidering how to implement a feature such as 802.3ad link aggregation(a.k.a. trunking) where the one-to-one correspondence between networkinterface and device is broken. Both GLDv2 and monolithic driver depend on DLPI messages andcommunicated with upper layers via STREAMs framework. This mechanismwas not very effective for link aggregation or 10Gb NICs. With the newstack, a better mechanism was needed which could ensure data localityand allow the stack to control the device drivers at much finergranularity to deal with interrupts. GLDv3 - A NewArchitectureSolaris 10 introduced a new device driver framework called GLDv3(internal name "project Nemo") along with the new stack. Most of themajor device drivers were ported to this framework and all future and10Gb device drivers will be based on this framework. This frameworkalso provided a STREAMs based DLPI layer for backword compatibility (toallow external, non-IP modules to continue to work). GLDv3 architecture virtualizes layer two of the network stack. There isno longer a one-to-one correspondence between network interfaces anddevices. The illustration below (Fig. 6) shows multiple devicesregistered with a MAC Services Module (MAC). It also shows two clients:one traditional client that communicates via DLPI to a Data-Link Driver(DLD) and one that is kernel based and simply makes direct functioncalls into the Data-Link Services Module (DLS). Fig. 6GLDv3 DriversGLDv3 drivers are similar to GLD drivers. The driver must be linkedwith a dependency on misc/mac. and misc/dld. 
It must callmac_register() with a pointer to an instance of the following structureto register with the MAC module:typedef struct mac { const char \*m_ident; mac_ext_t \*m_extp; struct mac_impl \*m_impl; void \*m_driver; dev_info_t \*m_dip; uint_t m_port; mac_info_t m_info; mac_stat_t m_stat; mac_start_t m_start; mac_stop_t m_stop; mac_promisc_t m_promisc; mac_multicst_t m_multicst; mac_unicst_t m_unicst; mac_resources_t m_resources; mac_ioctl_t m_ioctl; mac_tx_t m_tx;} mac_t;This structure must persist for the lifetime of the registration, i.e.it cannot be de-allocated until after mac_unregister() is called. AGLDv3 driver _init(9E) entry point is also required to callmac_init_ops() before calling mod_install(9F), and they are required tocall mac_fini_ops() after calling mod_remove(9F) from _fini(9E).The important members of this 'mac_t' structure are:'m_impl' - This is used by the MAC module to point to its privatedata. It must not be read ormodified by a driver.'m_driver' - This field should be set by the driver to point atits private data. Thisvalue will be supplied as the first argument to the driver entry points.'m_dip' - This field must be set to the dev_info_t pointer of thedriver instance callingmac_register(). 'm_stat' - typedef uint64_t (\*mac_stat_t)(void \*, mac_stat_t);This entry point is called to retrieve a value for one of thestatistics defined in themac_stat_t enumeration (below). All values should be stored andreturnedin 64-bit unsigned integers. Values will not be requested forstatistics that the driver has not explicitly declared to be supported.'m_start' - typedef int (\*mac_start_t)(void \*);This entry point is called to bring the device out of thereset/quiesced state that it was in when the interface was registered.No packets will be submitted by the MAC module fortransmission and no packets should be submitted by the driver forreception before this call is made. If this function succeeds then zeroshould be returned. If it fails then an appropriate errno value shouldbe returned. 'm_stop' - typedef void (\*mac_stop_t)(void \*);This entry point should stop the device and put it in a reset/quiescedstate such that the interface can be unregistered. No packets will besubmitted by the MAC for transmission once this call has been made andno packets should be submitted by the driver for reception once it hascompleted. 'm_promisc' - typedef int (\*mac_promisc_t)(void \*, boolean_t);This entry point is used to set the promiscuity of the device. If thesecond argument is B_TRUE then the device should receive all packets onthe media. If it is set to B_FALSE then only packets destined for thedevice's unicast address and the media broadcast address should bereceived. 'm_multicst' - typedef int (\*mac_multicst_t)(void \*, boolean_t, const uint8_t \*);This entry point is used to add and remove addresses to and from theset of multicast addresses for which the device will receive packets.If the second argument is B_TRUE then the address pointed to by thethird argument should be added to the set. If the second argument isB_FALSE then the address pointed to by the third argument should beremoved. 'm_unicst' - typedef int (\*mac_unicst_t)(void \*, const uint8_t \*);This entry point is used to set a new device unicast address. Once thiscall is made then only packets with the new address and the mediabroadcast address should be received unless the device is inpromiscuous mode. 
'm_resources' - typedef void (\*mac_resources_t)(void \*, boolean_t);This entry point is called to request that the driver register itsindividual receive resources or Rx rings. 'm_tx' - typedef mblk_t \*(\*mac_tx_t)(void \*, mblk_t \*);This entry point is used to submit packets for transmission by thedevice. The second argument points to one or more packets contained inmblk_t structures. Fragments of the same packet will be linked togetherusing the b_cont field. Separate packets will be linked by the b_nextfield in the leading fragment. Packets should be scheduled fortransmission in the order in which they appear in the chain. Anyremaining chain of packets that cannot be scheduled should be returned.If m_tx() does return packets that cannot be scheduled the driver mustcall mac_tx_update() when resources become available. If all packetsare scheduled for transmission then NULL should be returned. 'm_info' - This is an embedded structure defined as follows: typedef struct mac_info { uint_t mi_media; uint_t mi_sdu_min; uint_t mi_sdu_max; uint32_t mi_cksum; uint32_t mi_poll; boolean_t mi_stat[MAC_NSTAT]; uint_t mi_addr_length; uint8_t mi_unicst_addr[MAXADDRLEN]; uint8_t mi_brdcst_addr[MAXADDRLEN]; } mac_info_t;mi_media is set of be the media type; mi_sdu_min is the minimum payloadsize; mi_sdu_max is the maximum payload size; mi_cksum details thedevice cksum capabilities flag; mi_poll details if the driver supportspolling; mi_addr_length is set to the length of the addresses used bythe media; mi_unicst_addr is set with the unicast address of the deviceat the point at which mac_register() is called;mi_brdcst_addr is set tothe broadcast address of the media; mi_stat is an array of booleanvaluestypedef enum { MAC_STAT_IFSPEED = 0, MAC_STAT_MULTIRCV, MAC_STAT_BRDCSTRCV, MAC_STAT_MULTIXMT, MAC_STAT_BRDCSTXMT, MAC_STAT_NORCVBUF, MAC_STAT_IERRORS, MAC_STAT_UNKNOWNS, MAC_STAT_NOXMTBUF, MAC_STAT_OERRORS, MAC_STAT_COLLISIONS, MAC_STAT_RBYTES, MAC_STAT_IPACKETS, MAC_STAT_OBYTES, MAC_STAT_OPACKETS, MAC_STAT_ALIGN_ERRORS, MAC_STAT_FCS_ERRORS, MAC_STAT_FIRST_COLLISIONS, MAC_STAT_MULTI_COLLISIONS, MAC_STAT_SQE_ERRORS, MAC_STAT_DEFER_XMTS, MAC_STAT_TX_LATE_COLLISIONS, MAC_STAT_EX_COLLISIONS, MAC_STAT_MACXMT_ERRORS, MAC_STAT_CARRIER_ERRORS, MAC_STAT_TOOLONG_ERRORS, MAC_STAT_MACRCV_ERRORS, MAC_STAT_XCVR_ADDR, MAC_STAT_XCVR_ID, MAC_STAT_XVCR_INUSE, MAC_STAT_CAP_1000FDX, MAC_STAT_CAP_1000HDX, MAC_STAT_CAP_100FDX, MAC_STAT_CAP_100HDX, MAC_STAT_CAP_10FDX, MAC_STAT_CAP_10HDX, MAC_STAT_CAP_ASMPAUSE, MAC_STAT_CAP_PAUSE, MAC_STAT_CAP_AUTONEG, MAC_STAT_ADV_CAP_1000FDX, MAC_STAT_ADV_CAP_1000HDX, MAC_STAT_ADV_CAP_100FDX, MAC_STAT_ADV_CAP_100HDX, MAC_STAT_ADV_CAP_10FDX, MAC_STAT_ADV_CAP_10HDX, MAC_STAT_ADV_CAP_ASMPAUSE, MAC_STAT_ADV_CAP_PAUSE, MAC_STAT_ADV_CAP_AUTONEG, MAC_STAT_LP_CAP_1000FDX, MAC_STAT_LP_CAP_1000HDX, MAC_STAT_LP_CAP_100FDX, MAC_STAT_LP_CAP_100HDX, MAC_STAT_LP_CAP_10FDX, MAC_STAT_LP_CAP_10HDX, MAC_STAT_LP_CAP_ASMPAUSE, MAC_STAT_LP_CAP_PAUSE, MAC_STAT_LP_CAP_AUTONEG, MAC_STAT_LINK_ASMPAUSE, MAC_STAT_LINK_PAUSE, MAC_STAT_LINK_AUTONEG, MAC_STAT_LINK_DUPLEX, MAC_STAT_LINK_STATE, MAC_NSTAT /\* must be the last entry \*/ } mac_stat_t;The macros MAC_MIB_SET(), MAC_ETHER_SET() and MAC_MII_SET() areprovided to set all the values in each of the three groups respectivelyto B_TRUE.MACServices (MAC) moduleSome key Driver Support Functions:'mac_resource_add' - extern mac_resource_handle_t mac_resource_add(mac_t \*, mac_resource_t \*);Various members are defined as typedef void (\*mac_blank_t)(void \*, time_t, uint_t); 
typedef mblk_t \*(\*mac_poll_t)(void \*, uint_t); typedef enum { MAC_RX_FIFO = 1 } mac_resource_type_t; typedef struct mac_rx_fifo_s { mac_resource_type_t mrf_type; /\* MAC_RX_FIFO \*/ mac_blank_t mrf_blank; mac_poll_t mrf_poll; void \*mrf_arg; time_t mrf_normal_blank_time; uint_t mrf_normal_pkt_cnt; } mac_rx_fifo_t; typedef union mac_resource_u { mac_resource_type_t mr_type; mac_rx_fifo_t mr_fifo; } mac_resource_t;This function should be called from the m_resources() entry point toregister individual receive resources (commonly ring buffers of DMAdescriptors) with the MAC module. The returned mac_resource_handle_tvalue should then be supplied in calls to mac_rx(). The second argumentto mac_resource_add() specifies the resource being added. Resources arespecified by the mac_resource_t structure. Currently only resources oftype MAC_RX_FIFO are supported. MAC_RX_FIFO resources are described bythe mac_rx_fifo_t structure.This mac_blank function is meant to be used by upper layers to controlthe interrupt rate of the device. The first argument is the devicecontext meant to be used as the first argument to poll_blank.The other fields mrf_normal_blank_time and mrf_normal_pkt_cnt specifythe default interrupt interval and packet count threshold,respectively. These parameters may be used as the second and thirdarguments to mac_blank when the upper layer wants the driver to revertto the default interrupt rate.The interrupt rate is controlled by the upper layer by callingpoll_blank with different arguments. The interrupt rate can beincreased or decreased by the upper layer by passing a multiple ofthese values to the last two arguments of mac_blank. Setting theseavlues to zero disables the interrupts and NIC is deemed to be inpolling mode.The mac_poll is the driver supplied function is used by upper layer toretrieve a chain of packets (upto max count specified by secondargument) from the Rx ring corresponding to the earlier suppliedmrf_arg during mac_resource_add (supplied as first argument tomac_poll). 'mac_resource_update' - extern void mac_resource_update(mac_t \*);Invoked by the driver when the available resources have changed.'mac_rx' - extern void mac_rx(mac_t \*, mac_resource_handle_t, mblk_t \*);This function should be called to deliver a chain of packets, containedin mblk_t structures, for reception. Fragments of the same packetshould be linked together using the b_cont field. Separate packetsshould be linked using the b_next field of the leading fragment. If thepacket chain was received by a registered resource then the appropriatemac_resource_handle_t value should be supplied as the second argumentto the function. The protocol stack will use this value as a hint whentrying to load-spread across multiple CPUs. It is assumed that packetsbelonging to the same flow will always be received by the sameresource. If the resource is unknown or is unregistered then NULLshould be passed as the second argument.Data-Link Services(DLS) ModuleThe DLS module provides Data-Link Services interface analogous to DLPI.The DLS interface is a kernel-level functional interface as opposed tothe STREAMS message based interface specified by DLPI. This moduleprovides the interfaces necessary for upper layer to create and destroya dala link service; It also provides the interfaces necessary to plumband unplumb the NIC. The plumbing and unplumbing of NIC for GLDv3 baseddevice drivers is unchanged from the older GLDv2 or monolithic DLPIdevice drivers. 
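Pulling the registration entry points and mac_rx() together, the attach path of a GLDv3 driver might look roughly like the sketch below. This is illustrative only: 'mydrv' and its soft-state structure are hypothetical; error handling, DMA setup and several fields (m_dip, m_info, m_resources, m_ioctl) are omitted; and the mac_t, mac_resource_handle_t and statistics definitions are assumed to be the ones shown earlier (their header is not spelled out here, and the return type of mac_register() is assumed):

#include <sys/types.h>
#include <sys/stream.h>
/* mac_t, mac_resource_handle_t and friends as defined by the MAC module. */

extern int  mac_register(mac_t *);                              /* return type assumed */
extern void mac_rx(mac_t *, mac_resource_handle_t, mblk_t *);   /* as described above */

typedef struct mydrv_softc {
    mac_t                  sc_mac;          /* must persist until unregister */
    mac_resource_handle_t  sc_rx_handle;    /* from mac_resource_add() */
} mydrv_softc_t;

/* Minimal entry points; a real driver would program the hardware here. */
static int      mydrv_start(void *arg)  { return (0); }
static void     mydrv_stop(void *arg)   { }
static int      mydrv_promisc(void *arg, boolean_t on) { return (0); }
static int      mydrv_multicst(void *arg, boolean_t add, const uint8_t *mca) { return (0); }
static int      mydrv_unicst(void *arg, const uint8_t *mca) { return (0); }
static uint64_t mydrv_stat(void *arg, uint_t stat) { return (0); }  /* stat: mac_stat_t value */
static mblk_t  *mydrv_tx(void *arg, mblk_t *mp) { return (NULL); /* all scheduled */ }

static int
mydrv_mac_attach(mydrv_softc_t *sc)
{
    mac_t *macp = &sc->sc_mac;

    macp->m_driver   = sc;                  /* first arg to every entry point */
    macp->m_start    = mydrv_start;
    macp->m_stop     = mydrv_stop;
    macp->m_promisc  = mydrv_promisc;
    macp->m_multicst = mydrv_multicst;
    macp->m_unicst   = mydrv_unicst;
    macp->m_stat     = mydrv_stat;
    macp->m_tx       = mydrv_tx;
    /* m_dip, m_info (media, SDU, cksum/poll capabilities), m_resources
     * and m_ioctl would also be filled in here. */

    return (mac_register(macp));
}

/* Receive interrupt: hand the whole chain of packets to the stack. */
static void
mydrv_rx_intr(mydrv_softc_t *sc, mblk_t *chain)
{
    mac_rx(&sc->sc_mac, sc->sc_rx_handle, chain);
}

The sc_rx_handle stored above would be obtained by calling mac_resource_add() from the m_resources() entry point, which is what later lets the stack poll that receive ring directly instead of taking an interrupt per packet.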
The major changes are in data paths which allow directcalls, packet chains and much finer grained control over NIC.Data-LinkDriver (DLD)The Data-Link Driver provides a DLPI using the interfaces provided bythe DLS and MAC modules. The driver is configured using IOCTLs passedto a control node. These IOCTLs create and destroy separate DLPIprovider nodes. This module deals with DLPI messages necessary toplumb/unplumb the NIC and provides the backward compatibility for datapath via STREAMs for non GLDv3 aware clients.GLDv3 Linkaggregation architectureThe GLDv3 framework provides support for Link Aggregation as defined byIEEE 802.3ad. The key design principles while designing this facilitywere:Allow GLDv3 MAC drivers to be aggregated without code changeThe performance ofnon-aggregated devices must be preservedThe performance ofaggregated devices should be cumulative of line rate for each member i.e.minimal overheads due to aggregationSupport both manualconfiguration and Link Aggregation Control protocol (LACP)GLDv3 link aggregation is implement by means of a pseudo driver called'aggr'. It registers virtual ports corresponding to link aggregationgroups with the GLDv3 Mac layer. It uses the client interface providedby MAC layer to control and communicate with aggregated MAC ports asillustrated below in Fig 7. It also export a pseudo 'aggr' devicedriver which is used by 'dladm' command to configure and control thelink aggregated interface. Once a MAC port is configured to be part oflink aggregation group, it cannot be simultaneously accessed by otherMAC clients clients such as DLS layer. The exclusive access is enforcedby the MAC layer. The implementation of LACP is implemented by the'aggr' driver which has access to individual MAC ports or links. Fig. 7The GLDv3 aggr driver acts a normal MAC module to upper layer andappears as a standard NIC interface which once created with 'dladm',can be configured and managed by 'ifconfig'. The 'aggr' moduleregisters each MAC port which is part of the aggregation with the upperlayer using the 'mac_resource_add' function such that the data pathsand interrupts from each MAC port can be independently managed by theupper layers (see Section 8b). In short, the aggregated interface ismanaged as a single interface with possibly one IP address and the datapaths are managed as individual NICs by unique CPUs/Squeues providingaggregation capability to Solaris with near zero overheads and linearscalability with respect to number of MAC ports that are part of theaggregation group.Checksum offloadSolaris 10 improved the H/W checksum offload capability further toimprove overall performance for most applications. 16-bit one'scomplement checksum offload framework has existed in Solaris for sometime. It was originally added as a requirement for Zero Copy TCP/IP inSolaris 2.6 but was never extended until recently to handle otherprotocols. Solaris defines two classes of checksum offload:Full - Complete checksum calculation in the hardware, includingpseudo-header checksum computation for TCP and UDP packets. 
Thehardware is assumed to have the ability to parse protocol headers.Partial - "Dumb" one's complement checksum based on start, endand stuff offsets describing the span of the checksummed data and thelocation of the transport checksum field, with no pseudo-headercalculation ability in the hardware.Adding support for non-fragmented IPV4 cases (unicast or multicast) istrivial for both transmit and receive, as most modern network adapterssupport either class of checksum offload with minor differences in theinterface. The IPV6 cases are not as straightforward, because very fewfull-checksum network adapters are capable of handling checksumcalculation for TCP/UDP packets over IPV64.The fragmented IP cases have similar constraints. On transmit,checksumming applies to the unfragmented datagram. In order for anadapter to support checksum offload, it must be able to buffer all ofthe IP fragments (or perform the fragmentation in hardware) beforefinally calculating the checksum and sending the fragments over thewire; until then, checksum offloading for outbound IP fragments cannotbe done. On the other hand, the receive fragment reassembly case ismore flexible since most full-checksum (and all partial-checksum)network adapters are able to compute and provide the checksum value tothe network stack. During fragment reassembly stage, the network stackcan derive the checksum status of the unfragmented datagram bycombining the values altogether.Things were simplified by not offloading checksum when IP option werepresent. For partial-checksum offload, certain adapters limit the startoffset to a width sufficient for simple IP packets. When the length ofprotocol headers exceeds such limit (due to the presence of options),the start offset will wrap around causing incorrect calculation. Forfull-checksum offload, none of the capable adapters is able tocorrectly handle IPV4 source routing option.When transmit checksum offload takes place, the network stack willassociate eligible packets with ancillary information needed by thedriver to offload the checksum computation to hardware.In the inbound case, the driver has full control over the packets thatget associated with hardware-calculated checksum values. Once a driveradvertises its capability via DL CAPAB HCKSUM, the network stack willaccept full and/or partial-checksum information for IPV4 and IPV6packets. This process happens for both non-fragmented and fragmentedpayloads.Fragmented packets will first need to go through the reassembly processbecause checksum validation happens for fully reassembled datagrams.During reassembly, the network stack combines the hardware-calculatedchecksum value of each fragment.'dladm' - Newcommand for datalink administrationOver period of time, 'ifconfig' has become severely overloaded tryingto manage various layers in the stack. Solaris 10 introduced 'dladm'command to manage the data link services and ease the burden on'ifconfig'. 
The dladm command operates on three kinds of objects:

'link' - Data-links, identified by a name.
'aggr' - Aggregations of network devices, identified by a key.
'dev' - Network devices, identified by the concatenation of a driver name and an instance number.

The key of an aggregation must be an integer value between 1 and 65535. Some devices do not support configurable data-links or aggregations; the fixed data-links provided by such devices can be viewed using dladm but not configured. The GLDv3 framework allows users to select the outbound load-balancing policy across the members of an aggregation while configuring the aggregation. The policy specifies which dev object is used to send packets. A policy consists of a list of one or more layer specifiers separated by commas. A layer specifier is one of the following:

L2 - Select the outbound device according to the source and destination MAC addresses of the packet.
L3 - Select the outbound device according to the source and destination IP addresses of the packet.
L4 - Select the outbound device according to the upper layer protocol information contained in the packet. For TCP and UDP, this includes source and destination ports. For IPsec, this includes the SPI (Security Parameters Index).

For example, to use upper layer protocol information, the following policy can be used:
            -P L4
To use the source and destination MAC addresses as well as the source and destination IP addresses, the following policy can be used:
            -P L2,L3
The framework also supports the Link Aggregation Control Protocol (LACP) for GLDv3 based aggregations, which can be controlled by 'dladm' via the 'lacp-mode' and 'lacp-timer' subcommands. The 'lacp-mode' can be set to 'off', 'active' or 'passive'. When a new device is inserted into a system, during a reconfiguration boot, or during DR, a default non-VLAN data-link is created for the device. The configuration of all objects persists across reboots. In the future, 'dladm' and its private file where all persistent information is stored ('/etc/datalink.conf') will be used to manage device-specific parameters which are currently managed via 'ndd', driver-specific configuration files and /etc/system.

7. Tuning for performance:

The Solaris 10 stack is tuned to give stellar out-of-the-box performance irrespective of the H/W used. The secret lies in using techniques like dynamically switching between interrupt and polling mode, which gives very good latencies when the load is manageable, by allowing the NIC to interrupt per packet, and gives better throughput and well-bounded latencies when the load is very high, by switching to polling mode. The defaults are also carefully picked based on the H/W configuration. For instance, the 'tcp_conn_hash_size' tunable was very conservative pre Solaris 10: the default value of 512 hash buckets was selected based on the lowest supported configuration (in terms of memory). Solaris 10 looks at the free memory at boot time to choose the value for 'tcp_conn_hash_size'. Similarly, when a connection is 'reaped' from the time-wait state, the memory associated with the connection instance is not freed instantly (again based on the total system memory available) but is instead put on a 'free_list'. When new connections arrive within a given period, TCP tries to reuse memory from the 'free_list'; otherwise the 'free_list' is periodically cleaned up. In spite of these features, it is sometimes necessary to tweak some tunables to deal with extreme cases or specific workloads. We discuss some tunables below that control the stack behaviour.
Care should betaken to understand the impact otherwise the system might becomeunstable. Its important to note that for bulk of the applications andworkloads, the defaults will give the best results. 'ip_squeue_fanout' - Controls whether incoming connections fromone NIC are fanned out across all CPUs. A value of 0 means incomingconnections are assigned to the squeue attached to the interrupted CPU.A value of 1 means the connections are fanned out across all CPUs. Thelatter is required when NIC is faster than the CPU (say 10Gb NIC) andmultiple CPU need to service the NIC. Set via /etc/system by adding thefollowing line set ip:ip_squeue_fanout=1'ip_squeue_bind' - Controls whether worker threads are bound tospecific CPUs or not. When bound (default), they give better locality.The non default value (don't bind) is often chosen only when processorsets are to be created on the system. Unset via /etc/system by addingthe following line set ip:ip_squeue_bind=0'tcp_squeue_wput' - controls the write side squeue drain behavior.1 - Try to process your own packets but don't try to drainthesqueue2 - Try to process your own packet as well as any queuedpackets. The default value is 2 and can be changed via /etc/system by adding set ip:tcp_squeue_wput=1This value should be set to 1 whennumber of CPUs are far more than number of active NICs and the platformhas inherently higher memory latencies where chances of an applicationthread doing squeue drain and getting pinned is high. 'ip_squeue_wait' - Controls the amount of time in 'ms' a workerthread will wait before processing queued packets in the hope thatinterrupt or writer thread will process the packet. For servers whichsee enough traffic, the default of 10ms is good but for machines whichsee more interactive traffic (like desktops) where latency is an issue,the value should be set to 0 via /etc/system by adding set ip:ip_squeue_wait=0In addition, some protocol level tuning like changing themax_buf, high and low water mark, etc if beneficial specially on largememory systems.8. FutureThe future direction of Solaris networking stack will continue tobuild on better vertical integration between layers which will improvelocality and performance further. With the advent of Chipmultithreading and multi core CPUs, the number of parallel executionpipelines will continue to increase even on low end systems. A typical2 CPU machine today is dual core providing 4 execution pipelines andsoon going to have hyperthreading as well. The NICs are also becoming advanced offering multiple interrupts viaMSI-X, small classification capabilities, multiple DMA channels, andvarious stateless offloads like large segment offload etc. Future work will continue to leverage on these H/W trends includingsupport for TCP offload engines, Remote direct memory access (RDMA),and iSCSI.  Some other specific things that are being worked on:Network stack virtualization - With the industry wide trend ofserver consolidation and running multiple virtual machines on samephysical instance, its important the Solaris stack can be virtualizedefficiently.B/W Resource control - The same trend thats driving networkvirtualization is also driving the need to control the bandwidth usagefor various applications and virtual machines on same box efficiently. Support for high performance 3rd party modules - The currentSolaris 10 framework is still private to modules from Sun. 
8. Future

The future direction of the Solaris networking stack will continue to build on better vertical integration between layers, which will improve locality and performance further. With the advent of chip multithreading and multi core CPUs, the number of parallel execution pipelines will continue to increase even on low end systems. A typical 2 CPU machine today is dual core, providing 4 execution pipelines, and will soon have hyperthreading as well. The NICs are also becoming more advanced, offering multiple interrupts via MSI-X, limited classification capabilities, multiple DMA channels, and various stateless offloads like large segment offload. Future work will continue to leverage these H/W trends, including support for TCP offload engines, Remote Direct Memory Access (RDMA), and iSCSI. Some other specific things that are being worked on:

Network stack virtualization - With the industry wide trend of server consolidation and running multiple virtual machines on the same physical instance, it is important that the Solaris stack can be virtualized efficiently.

B/W resource control - The same trend that is driving network virtualization is also driving the need to efficiently control the bandwidth usage of the various applications and virtual machines on the same box.

Support for high performance 3rd party modules - The current Solaris 10 framework is still private to modules from Sun. STREAMS based modules are the only option for ISVs, and they miss the full potential of the new framework.

Forwarding performance - Work is being done to further improve the Solaris forwarding performance.

Network security with performance - The world is becoming increasingly complex and hostile. It is not possible to choose between performance and security anymore; both are a requirement. Solaris was always very strong in security, and Solaris 10 makes great strides in enabling security without sacrificing performance. Focus will continue on enhancing IPfilter performance and functionality, and on a whole new approach to detecting denial of service attacks and dealing with them.

9. Acknowledgments

Many thanks to Thirumalai Srinivasan, Adi Masputra, Nicolas Droux, and Eric Cheng for contributing parts of this text. Thanks are also due to all the members of the Solaris networking community for their help.


Solaris Networking

The world of Solaris Networking

The world of Solaris Networking

The D-Day has finally arrived. OpenSolaris is here. For me personally, it's a very nice feeling since I can now talk about the architecture and implementation openly with people and point them to the code. Before coming to Sun, I had always been in research labs where collaboration is the way of life. God - how much I missed that part at Sun, and thankfully I am now hoping to get it back.

One of the big changes in Solaris 10 was project FireEngine, which allowed Solaris to perform and scale. The important thing that I couldn't tell people before was where the wins came from. The bulk of them came from a lockless design called the vertical perimeter, implemented by means of a serialization queue. This allows packets, once picked up for processing, to be taken all the way up to the socket layer or all the way down to the device driver. With the aid of the IP classifier, we bind connections to squeues (which in turn are bound to CPUs), and this gives us better locality and scaling. The squeues also allow us to track the entire backlog per CPU. The GLDv3 based drivers allow IP to control the interrupts, and based on the squeue backlog, the interrupts are controlled dynamically to achieve even higher performance and avoid the havoc caused by interrupts. Some day I will tell you stories of how we dealt with 1Gb NICs when they arrived and CPUs were still pretty slow.

Coming back to collaboration, you will notice that the Solaris networking architecture looks very different compared to the SVR4 STREAMS based architecture or the BSD based architecture. It opens new doors for us: it allows us to do stack virtualization and resource control (project Crossbow) and tons of new things. We have set up a networking community page which has a brief discussion of some of the new projects we are doing, and we would love to hear what you think about it. The discussion forum on the same page would be an easy way to talk. We are open to suggestions on how you would like to see this go forward.

Enjoy, just like I have enjoyed Solaris for so many years!

Technorati Tags: OpenSolaris, Solaris


Solaris Networking

High Performance device driver framework (aka project Nemo)

A lot has happened since my last blog. I will talk about it one of these days. But the coolest thing we finished is called project Nemo. It's a high performance device driver framework which makes writing a device driver for Solaris a breeze. It's technically the GLDv3 framework, but we like to call it Nemo instead :)

So what can Nemo do for you? Well, it switches dynamically between interrupt and polling mode (all controlled by IP) to boost performance. Any device driver which supports turning its interrupts on and off can take advantage of this and boost performance by 20-25%, by cutting the number of interrupts in a more useful manner and improving latency at the same time. Way superior to interrupt coalescing, etc. Ben also finds it pretty useful here. Hey Ben, as you mentioned, a lot of people are finding Ethernet pretty useful in storage as well. I'll have some follow-on news on our iSCSI front soon. The initiator is already done and will be part of an S10 update, while we are seeing some pretty impressive numbers on a Solaris 10 iSCSI target with 10Gb.

Coming back to Nemo, it also does trunking for both 10Gb and 1Gb NICs in a pretty simple way. We demo'd a trunk of two 10Gb NICs on a 2 CPU machine during the Sun Labs open house in April, and we ran over 12Gbps over the trunk! There are some other cool things Nemo will allow us to do, and one of these days I will tell you the details.


Solaris Networking

Solaris vs Red Hat

Sorry guys, the heading is not mine. It's coming from the discussion at www.osnews.com where the Solaris 10 networking is being discussed. It is a pretty interesting discussion if you filter out a few of the usual postings where people don't really know the facts. I was surprised to see a large number of people who know Solaris voicing their positive opinions. Normally, people from the Solaris world are not very vocal on discussion groups and public forums, so that is a surprising (and good) change. Guys, keep it up!

Someone asked why we are not targeting Windows. Come on guys, you've got to be serious. I am an engineer, and do you think I design networking architecture targeted at beating Windows :) As pointed out in the comments, they are not even on my radar. Maybe in the next twenty years their technology will match our current stuff, but then we would have hopefully moved on :^) And yes, as I am told, we do beat Windows 2003 by 20-30% on a 2 CPU x86 box (Opteron 2x2.2GHz with 2Gb RAM) on webbench (static, dynamic and ecommerce). There are probably more benchmarks, but frankly we haven't had time to compare or publish. Our sole aim right now is to improve real customer workloads, and we are depending on customers to tell us these numbers.

As for AIX and HP-UX (and I am going to get in trouble now with my bosses for saying this), they just don't exist in any significant manner. I have talked to a large number of customers in the past two years, since part of our approach is to understand what the customer is having trouble with and what they will need going forward, and let me be really honest, I don't see HP-UX at all and very little AIX. Yes, I do see IBM and HP machines, but they are all running Linux (please no flames, this is just my experience).

Again, when we are designing and writing new code, we do like to set some targets. When it comes to scaling across a large number of CPUs, we have always done very well because that's where we focused. We never really looked at 1-2 CPU performance before, since it was always easy to add more processors on SPARC platforms. Linux, on the other hand, has really simple code that allowed it to perform very well on 1 CPU. So our challenge was to come up with an architecture that could beat Linux on the low end and still allow us to scale linearly on the high end, and sure enough, we created FireEngine. It's the same code that runs on SPARC platforms scaling linearly and runs pretty fast on 2 CPU x86 platforms. And as you add more CPUs on x86 (going to 4 and 8 and then dual core), we just become a very compelling architecture.

As for some people commenting on the validity of the numbers comparing Solaris 10 and Apache with RHEL AS3 and Apache on www.sun.com, they are on the same H/W. It's a 2x2.2GHz Opteron box (V20z) with 6Gb RAM and 2 Broadcom Gig NICs. The numbers were done on webbench and the other major web performance benchmark that we can't talk about since the numbers are not published yet. These numbers are for out of box Solaris 10 32bit with no tuning at all (the entire FireEngine focus was on out of box performance for real customer workloads). And frankly, we are not really interested in benchmarks, because all the Linux web performance numbers (for instance SPECweb99) are published using TUX or the Red Hat content accelerator. I haven't come across a single customer who is running TUX so far. So why doesn't someone publish a Linux Apache number without any benchmark specials, and we will be sure to put in the resources to meet or beat those numbers. That, I think, would be a fairer comparison.

And that's why I am far more impressed by customer quotes like the one from "Bill Morgan, CIO at Philadelphia Stock Exchange Inc.", where he said that Solaris 10 improved his trading capacity by 36%. Now we are not talking about a micro benchmark here but system level capacity. This was on a 12 way E4800 (SPARC platform). Basically, they loaded Solaris 10 on the same H/W and were able to do 36% more stock transactions per second.

And once again, I am not really anti Linux or anything. I just need something to compete against in a good natured way (HP-UX, AIX and IRIX are not around anymore, and I still can't bring myself down to compete with Windows). Before FireEngine, it was the Linux guys who used to pull my leg, asking when I would make Solaris perform as well as Linux on 1 CPU. Well, Solaris does perform now, and some of the guys who used to pull my leg took me out for beer when they loaded Solaris Express on their systems. And knowing them, I might be buying the next round somewhere down the line.

Oh, before I end, I wanted to touch on why we are not comparing against the RHEL AS4 beta. Well, it's not us who is doing the comparing but our customers. And that is because although Solaris 10 is due to ship now, things like FireEngine have been available and stable for almost a year. If I were to do the comparison, I would pick the latest in Red Hat, but I would compare it against Solaris 10 Update (due out 3-6 months after Solaris 10). And you know what, we haven't exactly been sitting around for the past year. Solaris 10 Update will improve performance over S10 FCS by another 20-25% on networking workloads.


Solaris Networking

Solaris 10 on x86 really performs

Someone pointed me to this article from George Colony, CEO, Forrester Research, and the real story from Tom Adelstein. Both are pretty interesting articles, but one of the feedbacks to Tom, "Untrue... Learn the Facts first", kind of got me motivated to write this blog.

"Solaris 10 on x86" can really match Linux in performance and, better yet, scale linearly over a large number of CPUs (remember that 8 CPU x86 blades are here already, and then we will start seeing 8 CPU, dual core blades). The new network architecture (FireEngine) in S10 allows the same code to give a huge performance win on 1 and 2 CPU configurations and linear scaling when more CPUs are added.

Take, for instance, web performance. We have improved 2 CPU performance by close to 50% (compared to Solaris 9) using a real web server like Apache, Sun One Web Server, Zeus, etc., without any gimmicks like kernel caching. It's just a plain webserver with TCP/IP and a dumb NIC. Some of our Solaris Express customers are telling us that we are outperforming RHEL AS3 by almost 15-35% on the same hardware.

Interested in more numbers? On static and dynamic webbench, Solaris 10 is on par with RHEL AS3 on a 2 CPU V20z, while it is ahead by 15% on the webbench Ecommerce benchmark. On the same box, we can saturate a 1Gb NIC using only 8-9% of a 2.2GHz Opteron processor, but the real killer deal is that our 10Gb drivers are coming up, and Alex Aizman from S2io just informed me that we are pushing close to 7.3Gbps of traffic on a V20z (with 2 x 1.6GHz Opterons) with more than 20% CPU to spare. We haven't even ported the driver to the high performance Nemo framework or enabled any hardware features yet. So I am expecting a huge upside in the next 2-3 months as the driver gets ported to Nemo (Paul and Yuzo should tell you more about Nemo sometime soon).

The improvements are not restricted to TCP only. We are doing a FireEngine follow-up for UDP which improves the Tibco benchmark by 130% and the VolanoMark benchmark by 30%. The customer tells us that we are outperforming RHEL AS3 by almost 15% on the same hardware. Adi et al. can add some more details about UDP performance.

And the big killer feature on Opterons: you can run a 64bit Oracle or webserver on 64bit Solaris to take advantage of the bigger VM space but leave the bulk of your apps as 32bit, and they run unchanged. I am not claiming the best performing OS title for Solaris 10 (at least not yet!), but guys, we are still ramping up! Every new project going into Solaris is now delivering double-digit performance improvements (the FireEngine architecture has opened the door), and soon I will claim that title :) I must add that all these gains come on the same hardware without applications needing to change at all. Just get the latest Solaris Express and see it for yourself.

And BTW, most of us at Sun are really pretty friendly towards Linux. Sure, we compete in a good natured way. And Tom did hit the nail on the head regarding why people at Sun don't like Red Hat - it really has to do with them having transformed free Linux into a not so free Linux.


Solaris Networking

When will you have enough performance?

For someone who never had a web page, this blog business is really frightening, so bear with me if I seem like a novice. I wonder if anyone actually reads these pages or if it's just the robots, crawlers and zombies generating the hits ;^) Anyway, since Carol (our PM) thinks this is a useful medium to tell people outside Sun what I am thinking, instead of them finding out when the product actually ships, here goes.

My name is Sunay Tripathi and I am a Sr. Staff Engineer in Solaris Networking and Security Technologies. Yes, we are the people who make the 'Net' work in 'Network is the computer'. I also go by as the architect of FireEngine, the new TCP/IP stack in Solaris 10, for people who have tried Solaris 10 already and are pretty happy with the performance (which is most).

So what am I working on these days? Well, I hear 10Gb is happening. And I also hear that 10Gb is not enough. People want 20-30Gbps of bandwidth coming into 4 CPU Opteron blades and still want meaningful processing power left!! Well, you do that and watch the interrupts go up like crazy and the system behave in more twisted ways than you can imagine, and trust me, it's not nice. But FireEngine comes to the rescue. We can tame the interrupts and do exactly what people want. I'll tell you the details some other day, unless John Fowler beats me to it by blogging soon.

Fairness and security are what keep me awake these days. A large section of customers tell me that they see 'http' literally disappearing in the next 3-5 years and everything becoming 'https' (SSL), and they don't want to sacrifice CPU just doing crypto, nor do they want crypto to overwhelm the rest of the traffic. Well, OK, they said QOS, but what they actually meant was fairness without any guarantees. I am hard pressed to see why CNN would go 'https', but they do have a point - Yahoo mail should really be SSL protected by default!! So I am building fairness into the architecture instead of as another add-on layer.

So let me tell you what else I do other than designing and writing code. I like to hang out with my old Stanford and IIT buddies, who keep telling me how we can combine forces to build the next big thing for the internet (some day). I also love watching my 11-month-old learn to walk. He is already hooked on my workstation and has his own desktop now. Not surprising, given that he sees his Mom and Dad spend 80% of their waking hours on these things. But what he really wants is my Acer Ferrari laptop running 64bit Solaris, and I tell him, dream on buddy :) My other passions are fast cars (after fast code) and Taekwondo. I am a black belt and used to practice with Stanford Taekwondo. A string of injuries last year kept me away, but I have started training again and will be back soon.

Well, that's who I am. But let me tell you the real reason why I am doing this (apart from the fact that even Sin-yaw and the rest of the perf team have blogs) - I actually want to hear back from you guys. Tell me what latest and greatest thing you are working on or dreaming of, and how Solaris can make it happen for you. I'm not sure how the feedback thing works on this blog, but you can always drop me a direct email. The address is pretty simple - sunay at sun dot com. I would also love to hear your opinions if you have already tried Solaris 10.

And as for when will you have enough performance? The answer is never!

