Tuesday Oct 23, 2012

Virtual Networks in Oracle Solaris - Part 5

           A
        long
        time
     ago in a
   blogosphere
  far, far away...

I wrote four blog entries to describe the new network virtualization features that were in Solaris 11 Express:

  • Part 1 introduced the concept of network virtualization and listed the basic virtual network elements.
  • Part 2 expanded on the concepts and discussed the resource management features.
  • Part 3 demonstrated the creation of some of these virtual network elements.
  • Part 4 demonstrated the network resource controls.
I had planned a final entry that added virtual routers to the list of virtual network elements, but Jeff McMeekin wrote a paper that discusses the same features. That paper is available at OTN. And this Jeff can't write any better than that Jeff...

All of the features described in those blog entries and that paper are also available in Solaris 11. It is possible that some details have changed, but the vast majority of the content is unchanged.

Thursday Jul 14, 2011

Extreme Oracle Solaris Virtualization

There will be a live webcast today explaining how to leverage Oracle Solaris' unmatched virtualization features. The webcast begins at 9 AM PT. Registration is required at Oracle.com.

Tuesday Mar 01, 2011

Virtual Network - Part 4

Resource Controls

This is the fourth part of a series of blog entries about Solaris network virtualization. Part 1 introduced network virtualization, Part 2 discussed network resource management capabilities available in Solaris 11 Express, and Part 3 demonstrated the use of virtual NICs and virtual switches.

This entry shows the use of a bandwidth cap on Virtual Network Elements (VNEs). This form of network resource control can effectively limit the amount of bandwidth consumed by a particular stream of packets. In our context, we will restrict the amount of bandwidth that a zone can use.

As a reminder, we have the following network topology, with three zones and three VNICs, one VNIC per zone.

All three VNICs were created on one ethernet interface in Part 3 of this series.

Capping VNIC Bandwidth

Using a T2000 server in a lab environment, we can measure network throughput with the new dlstat(1M) command. This command reports various statistics about data links, including the quantity of packets, bytes, interrupts, polls, drops, blocks, and other data. Because I am trying to illustrate the use of commands, not optimize performance, the network workload will be a simple file transfer using ftp(1). This method of measuring network bandwidth is reasonable for this purpose, but says nothing about the performance of this platform. For example, this method reads data from a disk. Some of that data may be cached, but disk performance may impact the network bandwidth measured here. However, we can still achieve the basic goal: demonstrating the effectiveness of a bandwidth cap.

With that background out of the way, first let's check the current status of our links.

GZ# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
e1000g2     phys      1500   unknown  --         --
e1000g1     phys      1500   down     --         --
e1000g3     phys      1500   unknown  --         --
emp_web1    vnic      1500   up       --         e1000g0
emp_app1    vnic      1500   up       --         e1000g0
emp_db1     vnic      1500   up       --         e1000g0
GZ# dladm show-linkprop emp_app1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
emp_app1     autopush        rw   --             --             --
emp_app1     zone            rw   emp-app        --             --
emp_app1     state           r-   unknown        up             up,down
emp_app1     mtu             rw   1500           1500           1500
emp_app1     maxbw           rw   --             --             --
emp_app1     cpus            rw   --             --             --
emp_app1     cpus-effective  r-   1-9            --             --
emp_app1     pool            rw   SUNWtmp_emp-app --             --
emp_app1     pool-effective  r-   SUNWtmp_emp-app --             --
emp_app1     priority        rw   high           high           low,medium,high
emp_app1     tagmode         rw   vlanonly       vlanonly       normal,vlanonly
emp_app1     protection      rw   --             --             mac-nospoof,
                                                                restricted,
                                                                ip-nospoof,
                                                                dhcp-nospoof
<some lines deleted>
Before setting any bandwidth caps, let's determine the transfer rates between a zone on this system and a remote system.

It's easy to use dlstat to determine the data rate to my home system while transferring a file from a zone:

GZ# dlstat -i 10 emp_app1
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   27.99M    2.11G   54.18M   77.34G
       emp_app1       83    6.72K        0        0
       emp_app1      339   23.73K    1.36K    1.68M
       emp_app1    1.79K  120.09K    6.78K    8.38M
       emp_app1    2.27K  153.60K    8.49K   10.50M
       emp_app1    2.35K  156.27K    8.88K   10.98M
       emp_app1    2.65K  182.81K    5.09K    6.30M
       emp_app1      600   44.10K      935    1.15M
       emp_app1      112    8.43K        0        0
The OBYTES column is simply the number of bytes transferred during that data sample. I'll ignore the 1.68MB and 1.15MB data points because the file transfer began and ended during those samples. The average of the other values leads to a bandwidth of 7.6 Mbps (megabits per second), which is typical for my broadband connection.
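
For the curious, here is the back-of-the-envelope arithmetic behind that number, using the four full samples above (which samples to average, and the assumption that dlstat reports binary megabytes, are my own):

    (8.38 + 10.50 + 10.98 + 6.30) MB / 4 ≈ 9.04 MB per 10-second sample
    9.04 MB x 2^20 bytes/MB x 8 bits/byte / 10 s ≈ 7.6 Mbps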

Let's pretend that we want to constrain the bandwidth consumed by that workload to 2 Mbps. Perhaps we want to leave all of the rest for a higher-priority workload. Perhaps we're an ISP and charge for different levels of available bandwidth. Regardless of the situation, capping bandwidth is easy:

GZ# dladm set-linkprop -p maxbw=2000k emp_app1
GZ# dladm show-linkprop -p maxbw emp_app1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
emp_app1     maxbw           rw       2          --             --
GZ# dlstat -i 20 emp_app1 
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   18.21M    1.43G   10.22M   14.56G
       emp_app1      186   13.98K        0        0
       emp_app1      613   51.98K    1.09K    1.34M
       emp_app1    1.51K  107.85K    3.94K    4.87M
       emp_app1    1.88K  131.19K    3.12K    3.86M
       emp_app1    2.07K  143.17K    3.65K    4.51M
       emp_app1    1.84K  136.03K    3.03K    3.75M
       emp_app1    2.10K  145.69K    3.70K    4.57M
       emp_app1    2.24K  154.95K    3.89K    4.81M
       emp_app1    2.43K  166.01K    4.33K    5.35M
       emp_app1    2.48K  168.63K    4.29K    5.30M
       emp_app1    2.36K  164.55K    4.32K    5.34M
       emp_app1      519   42.91K      643  793.01K
       emp_app1      200   18.59K        0        0
Note that for dladm, the default unit for maxbw is Mbps. The average of the full samples is 1.97 Mbps.

Between zones, the uncapped data rate is higher:

GZ# dladm reset-linkprop -p maxbw emp_app1
GZ# dladm show-linkprop  -p maxbw emp_app1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
emp_app1     maxbw           rw   --             --             --
GZ# dlstat -i 20 emp_app1
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   20.80M    1.62G   23.36M   33.25G
       emp_app1      208   16.59K        0        0
       emp_app1   24.48K    1.63M  193.94K  277.50M
       emp_app1  265.68K   17.54M    2.05M    2.93G
       emp_app1  266.87K   17.62M    2.06M    2.94G
       emp_app1  255.78K   16.88M    1.98M    2.83G
       emp_app1  206.20K   13.62M    1.34M    1.92G
       emp_app1   18.87K    1.25M   79.81K  114.23M
       emp_app1      246   17.08K        0        0
This five-year-old T2000 can move at least 1.2 Gbps of data internally, but that took five simultaneous ftp sessions. (A better measurement method, one that doesn't include the limits of disk drives, would yield better results, and newer systems, either x86 or SPARC, have higher internal bandwidth.) In any case, the maximum data rate is not the point here; the point is demonstrating the ability to cap that rate.

You can often resolve a network bottleneck while maintaining workload isolation by moving two separate workloads onto the same system, each in its own zone. However, you might choose to limit their bandwidth consumption. Fortunately, the NV tools in Solaris 11 Express enable you to accomplish that:

GZ# dladm set-linkprop -t -p maxbw=25m emp_app1
GZ# dladm show-linkprop -p maxbw emp_app1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
emp_app1     maxbw           rw      25          --             --
Note that the change to the bandwidth cap was made while the zone was running, potentially while network traffic was flowing. Also, changes made by dladm are persistent across reboots of Solaris unless you specify "-t" on the command line, as I did above; the "-t" makes the change temporary.

Data moves much more slowly now:

GZ# dlstat -i 20 emp_app1
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   23.84M    1.82G   46.44M   66.28G
       emp_app1      192   16.10K        0        0
       emp_app1    1.15K   79.21K    5.77K    8.24M
       emp_app1   18.16K    1.20M   40.24K   57.60M
       emp_app1   17.99K    1.20M   39.46K   56.48M
       emp_app1   17.85K    1.19M   39.11K   55.97M
       emp_app1   17.39K    1.15M   38.16K   54.62M
       emp_app1   18.02K    1.19M   39.53K   56.58M
       emp_app1   18.66K    1.24M   39.60K   56.68M
       emp_app1   18.56K    1.23M   39.24K   56.17M
<many lines deleted>
The data show an aggregate bandwidth of 24 Mbps.

Conclusion

The network virtualization tools in Solaris 11 Express include various resource controls. The simplest of these is the bandwidth cap, which you can use to effectively limit the amount of bandwidth that a workload can consume. Both physical NICs and virtual NICs may be capped by using this simple method. This also applies to workloads that are in Solaris Zones - both default zones and Solaris 10 Zones, which mimic Solaris 10 systems.

Next time we'll explore some other virtual network architectures.

Tuesday Feb 08, 2011

Virtual Network - Part 3

This is the third in a series of blog entries that discuss the network virtualization features in Oracle Solaris 11 Express. Part 1 introduced the concept of network virtualization and listed the basic virtual network elements that Solaris 11 Express (S11E) provides. Part 2 expanded on the concepts and discussed the resource management features which can be applied to those virtual network elements (VNEs).

This blog entry assumes that you have some experience with Solaris Zones. If you don't, you can read my earlier blog entries, or buy the book "Oracle Solaris 10 System Virtualization Essentials" or read the documentation.

This entry will demonstrate the creation of some of these VNEs.

For today's examples, I will use an old Sun Fire T2000 that has one SPARC CMT (T1) chip and 32GB RAM. I will pretend that I am implementing a 3-tier architecture in this one system, where each tier is represented by one Solaris zone. The mythical example provides access to an employee database. The 3-tier service is named 'emp' and VNEs will use 'emp' in their names to reduce confusion regarding the dozens of VNEs we expect to create for the services this system will deliver.

The commands shown below use the prompt "GZ#" to indicate that the command is entered in the global zone by someone with sufficient privileges. Similarly, the prompt "emp-web1#" indicates a command which is entered in the zone "emp-web1" as a sufficiently privileged user.

Fortunately, Solaris network engineers gathered all of the actions regarding the management of network elements (virtual or physical) into one command: dladm(1M). You use dladm to create, destroy, and configure datalinks such as VNICs. You can also use it to list physical NICs:

GZ# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
e1000g2     phys      1500   unknown  --         --
e1000g1     phys      1500   down     --         --
e1000g3     phys      1500   unknown  --         --
We need three VNICs for our three zones, one VNIC per zone. They will also have useful names - one for each of the tiers - and will share e1000g0:
GZ# dladm create-vnic -l e1000g0 emp_web1
GZ# dladm create-vnic -l e1000g0 emp_app1
GZ# dladm create-vnic -l e1000g0 emp_db1
GZ# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
e1000g2     phys      1500   unknown  --         --
e1000g1     phys      1500   down     --         --
e1000g3     phys      1500   unknown  --         --
emp_web1    vnic      1500   up       --         e1000g0
emp_app1    vnic      1500   up       --         e1000g0
emp_db1     vnic      1500   up       --         e1000g0
GZ# dladm show-vnic
LINK         OVER         SPEED  MACADDRESS        MACADDRTYPE         VID
emp_web1     e1000g0      0      2:8:20:3a:43:c8   random              0
emp_app1     e1000g0      0      2:8:20:36:a1:17   random              0
emp_db1      e1000g0      0      2:8:20:b4:5b:d3   random              0

The system has four NICs and three VNICs. Note that the name of a VNIC may not include a hyphen (-) but may include an underscore (_).

VNICs that share a NIC appear to be attached together via a virtual switch. That vSwitch is created automatically by Solaris. This diagram represents the NIC and VNEs we have created.

Now that these datalinks - the VNICs - exist, we can assign them to our zones. I'll assume that the zones already exist, and just need network assignment.

GZ# zonecfg -z emp-web1 info
zonename: emp-web1
zonepath: /zones/emp-web1
brand: ipkg
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
hostid:
fs-allowed:
GZ# zonecfg -z emp-web1
zonecfg:emp-web1> add net
zonecfg:emp-web1:net> set physical=emp_web1
zonecfg:emp-web1:net> end
zonecfg:emp-web1> exit
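
The same assignment can be made for the other two zones. A sketch, assuming the other zones are named emp-app1 and emp-db1 and are configured like emp-web1:

GZ# zonecfg -z emp-app1
zonecfg:emp-app1> add net
zonecfg:emp-app1:net> set physical=emp_app1
zonecfg:emp-app1:net> end
zonecfg:emp-app1> exit
GZ# zonecfg -z emp-db1
zonecfg:emp-db1> add net
zonecfg:emp-db1:net> set physical=emp_db1
zonecfg:emp-db1:net> end
zonecfg:emp-db1> exit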

After those steps are completed for all three zones and their matching VNICs, our earlier diagram would look like this:

Packets passing from one zone to another within a Solaris instance do not leave the computer, if they are in the same subnet and use the same datalink. This greatly improves network bandwidth and latency. Otherwise, the packets will head for the zone's default router.

Therefore, in the above diagram packets sent from emp-web1 destined for emp-app1 would traverse the virtual switch, but not pass through e1000g0.

This zone is an "exclusive-IP" zone, meaning that it "owns" its own networking. What is its view of networking? That's easy to determine: the zlogin(1M) command runs a complete command line inside the zone. By default, the command is run as the root user.

GZ# zoneadm -z emp-web1 boot
GZ# zlogin emp-web1 dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
emp_web1    vnic      1500   up       --         ?
GZ# zlogin emp-web1 dladm show-vnic
LINK         OVER         SPEED  MACADDRESS        MACADDRTYPE         VID
emp_web1     ?            0      2:8:20:3a:43:c8   random              0

Notice that the zone sees its own VNEs, but cannot see NEs or VNEs in the global zone, or in any other zone.

The other important new networking command in Solaris 11 Express is ipadm(1M). That command creates IP address assignments, enables and disables them, displays IP address configuration information, and performs other actions.

The following example shows the global zone's view before configuring IP in the zone:

GZ# ipadm show-if
IFNAME     STATE    CURRENT      PERSISTENT
lo0        ok       -m-v------46 ---
e1000g0    ok       bm--------4- ---
GZ# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/?             static   ok           127.0.0.1/8
lo0/?             static   ok           127.0.0.1/8
lo0/?             static   ok           127.0.0.1/8
e1000g0/_a        static   ok           10.140.204.69/24
lo0/v6            static   ok           ::1/128
lo0/?             static   ok           ::1/128
lo0/?             static   ok           ::1/128
lo0/?             static   ok           ::1/128

At this point the zone knows it has a datalink (which we saw above), but the IP tools show that no IP interface has been configured on it yet:

GZ# zlogin emp-web1 ipadm show-if
IFNAME     STATE    CURRENT      PERSISTENT
lo0        ok       -m-v------46 ---
GZ# zlogin emp-web1 ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
An ethernet datalink without an IP address isn't very useful, so let's configure an IP interface and apply an IP address to it:

GZ# zlogin emp-web1 ipadm create-if emp_web1
GZ# zlogin emp-web1 ipadm show-if
IFNAME     STATE    CURRENT      PERSISTENT
lo0        ok       -m-v------46 ---
emp_web1   down     bm--------46 -46

GZ# zlogin emp-web1 ipadm create-addr -T static -a local=10.140.205.82/24 emp_web1/v4static
GZ# zlogin emp-web1 ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
emp_web1/v4static static   ok           10.140.205.82/24
lo0/v6            static   ok           ::1/128

GZ# zlogin emp-web1 ifconfig emp_web1
emp_web1: flags=1000843 mtu 1500 index 2
        inet 10.140.205.82 netmask ffffff00 broadcast 10.140.205.255
        ether 2:8:20:3a:43:c8

The last command above shows the "old" way of displaying IP address configuration. The command ifconfig(1M) is still there, but the new tools dladm and ipadm provide a more consistent interface, with well-defined separation between datalink management and IP management.

Of course, if you want the zone's outbound packets to be routed to other networks, you must use the route(1M) command, the /etc/defaultrouter file, or both.
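
For example, a persistent default route could be added from the global zone like this; the router address 10.140.205.1 is just a placeholder for whatever the real default router is:

GZ# zlogin emp-web1 route -p add default 10.140.205.1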

Next time, I'll show a new network measurement tool and the ability to control the amount of network bandwidth consumed.

Thursday Jan 27, 2011

Virtual Networks - Part 2

This is the second in a series of blog entries that discuss the network virtualization features in Solaris 11 Express. The first entry discussed the basic concepts and the virtual network elements, including virtual NICs, VLANs, virtual switches, and InfiniBand datalinks.

This entry adds to that list the resource controls and security features that are necessary for a well-managed virtual network.

Virtual Networks, Real Resource Controls

In Oracle Solaris 11 Express, there are four main datalink resource controls:
  1. a bandwidth cap, which limits the amount of traffic passing through a datalink in a small amount of elapsed time
  2. assignment of packet processing tasks to a subset of the system's CPUs
  3. flows, which were introduced in the previous blog post
  4. rings, which are hardware or software resources that can be dedicated to a single purpose.
Let's take them one at a time. By default, datalinks such as VNICs can consume as much of the physical NIC's bandwidth as they want. That might be the desired behavior, but if it isn't, you can apply the property "maxbw" to a datalink. The maximum permitted bandwidth can be specified in Kbps, Mbps or Gbps. This value can be changed dynamically, so if you set it too low you can raise it later without disrupting the traffic flowing over that link. Solaris will not allow traffic to flow over that datalink at a rate faster than you specify.
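
As a preview of the syntax (demonstrated in detail in a later entry in this series), setting and inspecting a cap looks like this; the VNIC name web0 is just a placeholder:

GZ# dladm set-linkprop -p maxbw=100M web0
GZ# dladm show-linkprop -p maxbw web0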

You can "over-subscribe" this bandwidth cap: the sum of the bandwidth caps on the VNICs assigned to a NIC can exceed the rated bandwidth of the NIC. If that happens, the bandwidth caps become less effective.

In addition to the bandwidth cap, packet processing computation can be constrained to the CPUs associated with a workload.

First some background. When Solaris boots, it assigns interrupt handler threads to the CPUs in the system. (See Solaris CPUs for an explanation of the meaning of "CPU".) Solaris attempts to spread the interrupt handlers out evenly so that one CPU does not become a bottleneck for interrupt handling.

If you create non-default CPU pools, the interrupt handlers will retain their CPU assignments. One unintended side effect of this is a situation where the CPUs intended for one workload will be handling interrupts caused by another workload. This can occur even with simple configurations of Solaris Zones. In extreme cases, network packet processing for one zone can severely impact the performance of another zone.

To prevent this behavior, Solaris 11 Express offers the ability to assign a datalink's interrupt handler to a set of CPUs or a pool of CPUs. To simplify this further, the obvious choice is made for you by default for a zone that is assigned its own resource pool: when such a zone boots, a resource pool is created for the zone, a sufficient quantity of CPUs is moved from the default pool to the zone's pool, and interrupt handlers for that zone's datalink(s) are automatically reassigned to that resource pool.

Network flows enable you to create multiple lanes of traffic, allowing network processing to be parallelized. You can assign a bandwidth cap to a flow. Flows were introduced in the previous post and will be discussed further in future posts.
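
A sketch of both controls (CPU assignment and a flow with a bandwidth cap), using placeholder names: a VNIC called web0, a resource pool called webpool, and a flow that isolates HTTPS traffic. The first two commands bind the VNIC's packet processing to specific CPUs or to a pool; the last two create the flow and cap its bandwidth:

GZ# dladm set-linkprop -p cpus=1,2,3 web0
GZ# dladm set-linkprop -p pool=webpool web0
GZ# flowadm add-flow -l web0 -a transport=tcp,local_port=443 httpsflow
GZ# flowadm set-flowprop -p maxbw=50M httpsflow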

Finally, the newest high speed NICs support hardware rings: memory resources that can be dedicated to a particular set of network traffic. For inbound packets, this is the first resource control that separates network traffic based on packet information such as destination MAC address. By assigning one or more rings to a stream of traffic, you can commit sufficient hardware resources to it and ensure a greater relative priority for those packets, even if another stream of traffic on the same NIC would otherwise cause congestion and impact packet latency of all streams.

If you are using a NIC that does not support hardware rings, Solaris 11 Express supports software rings, which provide a similar effect.

Virtual Networks, Real Security

In addition to resource controls, Solaris 11 Express offers datalink protection controls. These controls are intended to prevent a user from creating improper packets that would cause mischief on the network. The mac-nospoof property requires that outgoing packets have a MAC address which matches the link's MAC address. The ip-nospoof property implements a similar restriction, but for IP addresses. The dhcp-nospoof property prevents improper DHCP assignment.
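
These protections are ordinary datalink properties, set with dladm. A sketch, again using a placeholder VNIC name web0; if I recall correctly, the allowed-ips property supplies the address list that ip-nospoof enforces:

GZ# dladm set-linkprop -p protection=mac-nospoof,ip-nospoof web0
GZ# dladm set-linkprop -p allowed-ips=10.1.1.25 web0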

Summary (so far)

The network virtualization features in Solaris 11 Express enable the creation of virtual network devices, leading to the implementation of an entire network inside one Solaris system. Associated resource control features give you the ability to manage network bandwidth as a resource and reduce the potential for one workload to cause network performance problems for another workload. Finally, security features help you minimize the impact of an intruder.

With all of the introduction out of the way, next time I'll show some actual uses of these concepts.

Wednesday Jan 05, 2011

Virtual Networks

Network virtualization is one of the industry's hot topics. The potential to reduce cost while increasing network flexibility easily justifies the investment in time to understand the possibilities. This blog entry describes network virtualization and some concepts. Future entries will show the steps to create a virtual network.

Introduction to Network Virtualization

Network virtualization can be described as the process of creating a computer network which does not match the topology of the underlying physical network. Usually this is achieved by using software tools on general-purpose computers or by using features of network hardware. A defining characteristic of a virtual network is the ability to re-configure the topology without manipulating any physical objects: devices or cables.

Such a virtual network mimics a physical network. Some types of virtual networks, for example virtual LANs (VLANs), can be implemented using features of network switches and computers. However, some other implementations do not require traditional network hardware such as routers and switches. All of the functionality of network hardware has been re-implemented in software, perhaps in the operating system.

Benefits of network virtualization (NV) include increased architectural flexibility, better bandwidth and latency characteristics, the ability to prioritize network traffic to meet desired performance goals, and lower cost from fewer devices, reduced total power consumption, etc.

The remainder of this blog entry will focus on a software-only implementation of NV.

A few years ago, networking engineers at Sun began working on a software project named "Crossbow." The goal was to create a comprehensive set of NV features within Solaris. Just like Solaris Zones, Crossbow would provide integrated features for creation and monitoring of general purpose virtual network elements that could be deployed in limitless configurations. Because these features are integrated into the operating system, they automatically take advantage of - and smoothly interoperate with - existing features. This is most noticeable in the integration of Solaris NV features and Solaris Zones. Also, because these NV features are a part of Solaris, future Solaris enhancements will be integrated with Solaris NV where appropriate.

The core NV features were first released in OpenSolaris 2009.06. Since then, those core features have matured and more details have been added. The result is the ability to re-implement entire networks as virtual networks using Solaris 11 Express. Here is an example of a virtual network architecture:

As you can guess from that example, you can create virtually :-) any network topology as a virtual network...

Oracle Solaris NV does more than is described here. This content focuses on the key features which might be used to consolidate workloads or entire networks into a Solaris system, using zones and NV features.

Virtual Network Elements

Solaris 11 Express implements the following virtual network elements.
  • NIC: OK, this isn't a virtual element, it's just on the list as a starting point.
    For a very long time, Solaris has managed Network Interface Cards (NICs). Solaris offers tools to manage NICs, including bringing them up and down, and assigning various characteristics to them, such as IP addresses, assignment to IP Multipathing (IPMP) groups, etc. Note that up through Solaris 10, most of those configuration tasks were accomplished with the ifconfig(1M) command, but in Solaris 11 Express the dladm(1M) and ipadm(1M) commands perform those tasks, and a few more. You can monitor the use of NICs with dlstat(1M). The term "datalink" is now used consistently to refer to NICs and things like NICs, such as...

  • A VNIC is a pseudo interface created on a datalink (a NIC or an etherstub, described next). Each VNIC has its own MAC address, which can be generated automatically or specified manually. For almost all purposes, a VNIC can be managed like a NIC. The dladm command creates, lists, deletes, and modifies VNICs. The dlstat command displays statistics about VNICs. The ipadm(1M) command configures IP interfaces on VNICs.
    Like NICs, VNICs have a number of properties that can be modified with dladm. These include the ability to force network processing of a VNIC to a certain set of CPUs, setting a cap (maximum) on permitted bandwidth for a VNIC, the relative priority of this VNIC versus other VNICs on the same NIC, and other properties.

  • Etherstubs are pseudo NICs, making internal networks possible. For a general understanding, think of them as virtual switches. The command dladm manages etherstubs; see the sketch after this list.

  • A flow is a stream of packets that share particular attributes such as source IP address or TCP port number. Once defined, a flow can be managed as an entity, including capping its bandwidth usage, setting its relative priority, etc. Flows are created and managed with the new flowadm(1M) command and monitored with flowstat(1M). Even if you don't set resource controls, flows benefit from dedicated kernel resources and more predictable, consistent performance. Further, you can directly observe detailed statistics on each flow, improving your ability to understand these streams of packets and set proper resource controls.

  • VLANs (Virtual LANs) have been around for a long time. For consistency, the commands dladm, dlstat and ipadm now manage VLANs.

  • InfiniBand partitions are virtual networks that use an InfiniBand fabric. They are managed with the same commands as VNICs and VLANs: dladm, dlstat, ipadm and others.
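
To make the etherstub element above more concrete, here is a sketch of a purely internal network: one etherstub acting as the virtual switch, with two VNICs attached to it (all of the names are placeholders):

GZ# dladm create-etherstub emp_stub0
GZ# dladm create-vnic -l emp_stub0 emp_int1
GZ# dladm create-vnic -l emp_stub0 emp_int2

Traffic between emp_int1 and emp_int2 stays inside the system and never touches a physical NIC.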

Summary

Solaris 11 Express provides a complete set of virtual network components which can be used to deploy virtual networks within a Solaris instance. The next blog entry will describe network resource management and security. Future entries will provide some examples.

Friday Apr 02, 2010

Solaris Virtualization Book

This blog has been pretty quiet lately, but that doesn't mean I haven't been busy! For the last 6 months I've been leading the writing of a book: _Oracle Solaris 10 System Virtualization Essentials_.

This book discusses all of the forms of server virtualization, not just hypervisors. It covers the forms of virtualization that Solaris 10 provides and those that it can use. These include Solaris Containers (also called Solaris Zones), VirtualBox, Oracle VM (x86 and SPARC; the latter was called Logical Domains), and Dynamic Domains.

One chapter is dedicated to the topic of choosing the best virtualization technology for a particular workload or set of workloads. Another chapter shows how to use each virtualization technology to achieve specific goals, including screenshots and command sequences. The last chapter of the book describes the need for virtualization management tools and then uses Oracle EM Ops Center as an example.

The book is available for pre-order at Amazon.com.

Wednesday Dec 09, 2009

Virtual Overhead?

So you're wondering about operating system efficiency or the overhead of virtualization. How about a few data points?

SAP created benchmarks that measure transaction performance. One of them, the SAP SD 2-Tier benchmark, behaves more like real-world workloads than most other benchmarks, because it exercises all of the parts of a system: CPUs, memory access, I/O and the operating system. The other factor that makes this benchmark very useful is the large number of results submitted by vendors. This large data set enables you to make educated performance comparisons between computers, or operating systems, or application software.

A couple of interesting comparisons can be made from this year's results. Many submissions use the same hardware configuration: two Nehalem (Xeon X5570) CPUs (8 cores total) running at 2.93 GHz, and 48GB RAM (or more). Submitters used several different operating systems: Windows Server 2008 EE, Solaris 10, and SuSE Linux Enterprise Server (SLES) 10. Also, two results were submitted using some form of virtualization: Solaris 10 Containers and SLES 10 on VMware ESX Server 4.0.

Operating System Comparison

The first interesting comparison is of different operating systems and database software, on the same hardware, with no virtualization. Using the hardware configuration listed above, the following results were submitted. The Solaris 10 and Windows results are the best results on each of those operating systems, on this hardware. The SLES 10 result is the best of any Linux distro, with any DB software, on the same hardware configuration.

Operating System          DB               Result (SAPS)
Solaris 10                Oracle 10g       21,000
Windows Server 2008 EE    SQL Server 2008  18,670
SLES 10                   MaxDB 7.8        17,380

(Note that all of the results submitted in 2009 cannot be compared against results from previous years because SAP changed the workload.)

With those data points, it's very easy to conclude that for transactional workloads, the combination of Solaris 10 and Oracle 10g is roughly 20% more powerful than Linux and MaxDB.

Virtualization Comparison

The virtualization comparison is also interesting. The same benchmark was run using Solaris 10 Containers and 8 vCPUs. It was also run using SLES 10 on VMware ESX, also using 8 vCPUs.

Operating System    Virtualization        DB           Result (SAPS)
Solaris 10          Solaris Containers    Oracle 10g   15,320
SLES 10             VMware ESX            MaxDB 7.8    11,230

Interpretation

Some of the 36% advantage of the Solaris Containers result is due to the operating systems and DB software, as we saw above. But the rest is due to the virtualization tools. The virtualized and non-virtualized results for each OS had only one difference: virtualization was used. For example, the two Solaris 10 results shown above used the same hardware, the same OS, the same DB software and the same workload. The only difference was the use of Containers and the limitation of 8 vCPUs.

If we assume that Solaris 10/Oracle 10g is consistently 21% more powerful than SLES 10/MaxDB on this benchmark, then it's easy to conclude that VMware ESX has 13% more overhead than Solaris Containers when running this workload.
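
One way to arrive at that 13% figure (my reconstruction of the reasoning, using the numbers above):

    15,320 SAPS / 1.21 ≈ 12,660 SAPS   the SLES 10/MaxDB result we would expect if
                                       VMware's overhead matched that of Containers
    12,660 / 11,230 ≈ 1.13             the actual result is about 13% lower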

However, the non-virtualized performance advantage of the Solaris 10 configuration over that of SLES 10 may be different with 8 vCPUs than with 8 cores. If Solaris' advantage is less, then the overhead of VMware is even worse. If the advantage of Solaris 10 Containers/Oracle over VMware/SLES 10/MaxDB with 8 vCPUs is more than the non-virtualized results, then the real overhead of VMware is not quite that bad. Without more data, it's impossible to know.

But one of those three cases (same, less, more) is true. And the claims by some people that VMware ESX has "zero" or "almost no" overhead are clearly untrue, at least for transactional workloads. For compute-intensive workloads, like HPC, the overhead of software hypervisors like VMware ESX is typically much smaller.

What Does All That Mean?

What does that overhead mean for real applications? Extra overhead means longer response times for transactions or fewer users per workload, or both. It also means that fewer workloads (guests) can be configured per system.

In other words, response time should be better (or maximum number of users should be greater) if your transactional workload is running in a Solaris Container rather than in a VMware ESX guest. And when you want to add more workloads, Solaris Containers should support more of those workloads than VMware ESX, on the same hardware.

Qualification

Of course, the comparison shown above only applies to certain types of workloads. You should test your workload on different configurations before committing yourself to one.

Disclosure

For more detail, see the results for yourself.

SAP wants me to include the results:
Best result for Solaris 10 on 2-way X5570, 2.93GHz, 48GB:
Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, 21,000 SAPS, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, Oracle 10g, Solaris 10, Cert# 2009033.
Best result for any Linux distro on 2-way X5570, 2.93GHz, 48GB:
HP ProLiant DL380 G6 (2 processors, 8 cores, 16 threads) 3,171 SAP SD Users, 17,380 SAPS, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, MaxDB 7.8, SuSE Linux Enterprise Server 10, Cert# 2009006.
Result on Solaris 10 using Solaris Containers and 8 vCPUs:
Sun Fire X4270 (2 processors, 8 cores, 16 threads) run in 8 virtual cpu container, 2,800 SAP SD Users, 2x 2.93 GHz Intel Xeon X5570, 48 GB memory, Oracle 10g, Solaris 10, Cert# 2009034.
Result on SuSE Enterprise Linux as a VMware guest, using 8 vCPUs:
Fujitsu PRIMERGY Model RX300 S5 (2 processors, 8 cores, 16 threads) 2,056 SAP SD Users, 2x 2.93 GHz Intel Xeon X5570, 96 GB memory, MaxDB 7.8, SUSE Linux Enterprise Server 10 on VMware ESX Server 4.0, Cert# 2009029.
SAP and R/3 are registered trademarks of SAP AG in Germany and other countries.

Addendum, added December 10, 2009:

Today an associate reminded me that previous SAP SD 2-tier results demonstrated the overhead of Solaris Containers. Sun ran four copies of the benchmark on one system, simultaneously, one copy in each of four Solaris Containers. The system was a Sun Fire T2000, with a single 1.2GHz SPARC processor, running Solaris 10 and MaxDB 7.5:

  1. 2006029
  2. 2006030
  3. 2006031
  4. 2006032

The same hardware and software configuration - but without Containers - already had a submission:
2005047

The sum of the results for the four Containers can be compared to the single result for the configuration without Containers. The single system outpaced the four Containers by less than 1.7%.

Second Addendum, also added December 10, 2009:

Although this blog entry focused on a comparison of performance overhead, there are other good reasons to use Solaris Containers in SAP deployments. At least 10, in fact, as shown in this slide deck. One interesting reason is that Solaris Containers is the only server virtualization technology supported by both SAP and Oracle on x86 systems.


Friday Sep 05, 2008

got (enough) memory?

DBAs are in for a rude awakening.

A database runs most efficiently when all of its data is held in RAM. Insufficient RAM causes some data to be sent to a disk drive for later retrieval. This process, called 'paging', can have a huge performance impact. The effect can be shown numerically by comparing the time to retrieve data from disk (about 10,000,000 nanoseconds) to the access time for RAM (about 20 ns) - a difference of roughly 500,000x.

Databases are the backbone of most Internet services. If a database does not perform well, no amount of improvement of the web servers or application servers will achieve good performance of the overall service. That explains the large amount of effort that is invested in tuning database software and database design.

These tasks are complicated by the difficulty of scaling a single database to many systems in the way that web servers and app servers can be replicated. Because of those challenges, most databases are implemented on one computer. But that single system must have enough RAM for the database to perform well.

Over the years, DBAs have come to expect systems to have lots of memory, either enough to hold the entire database or at least enough for all commonly accessed data. When implementing a database, the DBA is asked "how much memory does it need?" The answer is often padded to allow room for growth. That number is then increased to allow room for the operating system, monitoring tools, and other infrastructure software.

And everyone was happy.

But then server virtualization was (re-)invented to enable workload consolidation.

Server virtualization is largely about workload isolation - preventing the actions and requirements of one workload from affecting the others. This includes constraining the amount of resources consumed by each workload. Without such constraints, one workload could consume all of the resources of the system, preventing other workloads from functioning effectively. Most virtualization technologies include features to do this - to schedule time using the CPU(s), to limit use of network bandwidth... and to cap the amount of RAM a workload can use.

That's where DBAs get nervous.

I have participated in several virtualization architecture conversations which included:
Me: "...and you'll want to cap the amount of RAM that each workload can use."
DBA: "No, we can't limit database RAM."

Taken out of context, that statement sounds like "the database needs infinite RAM." (That's where the CFO gets nervous...)

I understand what the DBA is trying to say:
DBA: "If the database doesn't have sufficient RAM, its performance will be horrible, and so will the performance of the web and app servers that depend on it."

I completely agree with that statement.

To clear up the misunderstanding: the database is not expected to use less memory than before. The "rude awakening" is adjusting one's mindset to accept the notion that a RAM cap on a virtualized workload is the same as having a finite amount of RAM - just like a real server.

This also means that system architects must understand and respect the DBA's point of view, and that a virtual server must have available to it the same amount of RAM that it would need in a dedicated system. If a non-consolidated database needed 8GB of RAM to run well in a dedicated system, it will still need 8GB of RAM to run well in a consolidated environment.

If each workload has enough resources available to it, the system and all of its workloads will perform well.

And they all computed happily ever after.

P.S. A system running multiple consolidated workloads will need more memory than any one of the unconsolidated systems had - but less than the aggregate amount they had.

Considering that need, and the fact that most single-workload systems were running at 10-15% CPU utilization, I advise people configuring virtual server platforms to focus more effort on ensuring that the computer has enough memory for all of its workloads, and less effort on achieving sufficient CPU performance. If the system is 'short' on CPU power by 10%, performance will be 10% less than expected. That rarely matters. But if the system is 'short' on memory by 10%, excessive paging can cause transaction times to increase by 10 times, 100 times, or more.

Thursday Aug 21, 2008

Virtual Eggs, One Basket

One of the hottest computer industry trends is virtualization. If you skim off the hype, there is still a great deal to be excited about. Who doesn't like reducing the number of servers to manage, and reducing the electric power consumed by servers and by the machines that move the heat they create (though I suppose that the power utilities and the coal and natural gas companies aren't too thrilled by virtualization...)?

But there are some critical factors which limit the consolidation of workloads into virtualized environments (VEs). One often-overlooked factor is that the technology which controls VEs is a single point of failure (SPOF) for all of the VEs it is managing. If that component (a hypervisor for virtual machines, an OS kernel for operating system-level virtualization, etc.) has a bug which affects its guests, they may all be impacted. In the worst case, all of the guests will stop working.

One example of that was the recent licensing bug in VMware. If the newest version of VMware ESX was in use, the hypervisor would not permit guests to start after August 12. EMC created a patch to fix the problem, but solving it and testing the fix took enough time that some customers could not start some workloads for about one day. For some details, see http://www.networkworld.com/news/2008/081208-vmware-bug.html and http://www.deploylinux.net/matt/2008/08/all-your-vms-belong-to-us.html.

Clearly, the lesson from this is the importance of designing your consolidated environments with this factor in mind. For example, you should never configure both nodes of a high-availability (HA) cluster as guests of the same hypervisor. In general, don't assume that the hypervisor is perfect - it's not - and that it can't fail - it can.

Thursday May 10, 2007

Virtual DoS

No, not that DOS. I'm referring to Denial-of-Service.

A team at Clarkson University including a professor and several students recently performed some interesting experiments. They wanted to determine how server virtualization solutions handled a guest VM which performed a denial-of-service attack on the whole system. This knowledge could be useful when virtualizing guests that you don't trust. It gives you a chance to put away the good silver.

They tested VMware Workstation, Xen, OpenVZ, and Solaris Containers. (It's a shame that they didn't test VMware ESX. VMware Workstation and ESX are very different technologies. Therefore, it is not safe to assume that the paper's conclusions regarding VMware Workstation apply to ESX.) After reading the paper, my conclusion for Solaris Containers is "they have non-default resource management controls to contain DoS attacks, and it's important to enable those controls."

Fortunately, with the next update to Solaris 10 (due this summer) those controls are much easier to use. For example, the configuration parameters used in the paper, and shown below, limit a Container's use of physical memory, virtual memory, and amount of physical memory which can be locked so that it doesn't get paged out:

add capped-memory
 set physical=128M 
 set swap=512M 
 set locked=64M 
end
Further, the following parameters limit the number of execution threads that the Container can use, turn on the fair-share scheduler and assign a quantity of shares for this Container:
set max-lwps=175 
set scheduling-class=FSS 
set cpu-shares=10
All of those parameters are set using the zonecfg(1M) command. One benefit of the centralization of these control parameters is that they move with a Container when it is moved to another system.
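
Putting them together, the whole set can be entered in one zonecfg session. A sketch, assuming a zone named untrusted1 (a placeholder) has already been configured:

# zonecfg -z untrusted1
zonecfg:untrusted1> add capped-memory
zonecfg:untrusted1:capped-memory> set physical=128M
zonecfg:untrusted1:capped-memory> set swap=512M
zonecfg:untrusted1:capped-memory> set locked=64M
zonecfg:untrusted1:capped-memory> end
zonecfg:untrusted1> set max-lwps=175
zonecfg:untrusted1> set scheduling-class=FSS
zonecfg:untrusted1> set cpu-shares=10
zonecfg:untrusted1> exit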

I partly disagree with the authors' statement that these controls are complex to configure. The syntax is simple - and a significant improvement over previous versions - and an experienced Unix admin can determine appropriate values for them without too much effort. Also, a GUI is available for those who don't like commands: the Solaris Container Manager. On the other hand, managing these controls does require Solaris administration experience, and there are no default values. It is important to use these features in order to protect well-behaved workloads from misbehaving workloads.

It also is a shame that the hardware used for the tests was a desktop computer with limited physical resources. For example it had only one processor. Because multi-core processors are becoming the norm, it would be valuable to perform the same tests on a multi-core system. The virtualization software would be stressed in ways which were not demonstrated. I suspect that Containers would handle that situation very well, for two reasons:

  1. There is almost no overhead caused by the use of Containers - the workload itself does not execute any extra code just because it is in a Container. Hypervisor-like solutions like Xen and VMware have longer code paths for network and storage I/O than would occur without virtualization. The additional cores would naturally support more guests, but the extra code paths would limit scalability. Containers would not suffer from that limitation.
  2. Solaris has extraordinary scalability - efficiently running workloads that have hundreds or thousands of processes on 144 cores in a Sun Fire E25K. None of the other solutions tested for this paper have been shown to scale beyond 4-8 cores. I booted 1,000 Containers on an 8-socket system.

Also, the test system did not have multiple NICs. The version of Solaris that was used includes a new feature called IP Instances. This feature allows a Container to be given exclusive access to particular NIC. No process outside that Container can access that NIC. Of course, multiple NICs are required to use that feature...
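
A sketch of how a Container would be given exclusive access to a NIC with IP Instances, assuming a zone named untrusted1 and a spare NIC e1000g1 (both placeholders):

# zonecfg -z untrusted1
zonecfg:untrusted1> set ip-type=exclusive
zonecfg:untrusted1> add net
zonecfg:untrusted1:net> set physical=e1000g1
zonecfg:untrusted1:net> end
zonecfg:untrusted1> exit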

The paper Quantifying the Performance Isolation Properties of Virtualization Systems will be delivered at the ACM's Workshop on Experimental Computer Science.

Thursday Mar 29, 2007

Virtualization HCLs

Did you know that Solaris Containers has the largest HCL of any server virtualization solution?

Here are three examples:

  1. Solaris 10 HCL: 790 x86/x64 systems + 75 SPARC systems = 865 total systems (March 28, 2007)
  2. VMware: 305 x86/x64 systems (March 21, 2007)
  3. Xen publishes specific component requirements (e.g. "1.5GHz single CPU minimum") instead of an HCL
Note that the Solaris Containers functionality is available on all Solaris 10 systems, and is exactly the same on all hardware architectures.

Is that metric relevant? Many factors should affect your virtualization choice. One of them is hardware choice: "does my choice of server virtualization technology limit my choice of hardware platform?"

The data points above show sufficient choice in commodity hardware for most people, but Containers maximizes your choice, and only Containers is supported on multiple hardware architectures.

Thursday Mar 15, 2007

Spawning 0.5kZ/hr (Part 2)

As I said last time, zone-clone/ZFS-clone is time- and space-efficient. And that entry looked briefly at cloning zones. Now let's look at the integration of zone-clones and ZFS-clones.

Enter Z^2 Clones

Instead of copying every file from the original zone to the new zone, a clone of a zone that 'lives' in a ZFS file system is actually a clone of a snapshot of the original zone's file system. As you might imagine, this is fast and small. When you use zone-clone to install a zone, most of the work is merely copying zone-specific files around. Because all of the files start out identical from one zone to the next, and because each zone is a snapshot of an existing zone, there is very little disk activity, and very little additional disk space is used.
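
For reference, the cloning procedure itself is short. A sketch with placeholder names (web1 is the existing zone, web2 the new one, and /zones is assumed to be a ZFS file system): export the source zone's configuration, edit the copy (at minimum the zonepath and any network settings), import it as the new zone, and then clone:

# zonecfg -z web1 export -f /tmp/web2.cfg
# vi /tmp/web2.cfg
# zonecfg -z web2 -f /tmp/web2.cfg
# zoneadm -z web2 clone web1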

But how fast is the process of cloning, and how small is the new zone?

I asked myself those questions, and then used a Sun Fire X4600 with eight AMD Opteron 854 processors and 64GB of RAM to answer them. Unfortunately, the system only had its internal disk drives, and one disk drive was the bottleneck most of the time. I created a zpool from one disk slice on that drive, which is neither robust nor efficient. But it worked.

Creating the first zone took 150 seconds, including creating the ZFS file system for the zone, and used 131MB in the zpool. Note that this is much smaller than the disk space used by other virtualization solutions. Creating the next nine zones took less than 50 seconds, and used less than 20MB, total, in the zpool.

The length of time to create additional zones gradually increased. Creation of the 200th through 500th zones averaged 8.2 seconds each. Also, the disk space used gradually increased per zone. After booting each zone several times, they each used 6MB-7MB of disk space. The disk space used per zone increased as each zone made its own changes to configuration files. But the final rate of creation was 489 zones per hour.

But will they run? And are they as efficient at memory usage as they are at disk usage?

I booted them from a script, sequentially. This took roughly 10 minutes. Using the "memstat" tool of mdb, I found that each zone uses 36MB of RAM. This allowed all 500 zones to run very comfortably in the 64GB on this system. This small amount was due to the model used by sparse-root zones: a program that is running in multiple zones shares the program's text pages.

The scalability of performance was also excellent. A quick check of CPU usage showed that all 500 zones used less than 2% of the eight CPUs in the system. Of course, there weren't any applications running in the zones, but just try to run 500 guest operating systems in your favorite hypervisor-based virtualization product...

But why stop there? 500 zones not enough for you? Nah, me neither. How about 1,000 zones? That sounds like a good reason for a "Part 3."

Conclusion

New features added recently to Solaris zones improve on their excellent efficiency:

  1. making a copy of a zone was reduced from 30 minutes to 8 seconds
  2. disk usage of zones decreased from 100MB to 7 MB
  3. memory usage stayed extremely low - roughly 36MB per zone
  4. CPU cycles used just by unbooted zones is zero, and by running zones (with no applications) is negligible

So, maybe computers hate me for pushing them out of their comfort zone. Or maybe it's something else.

About

Jeff Victor writes this blog to help you understand Oracle's Solaris and virtualization technologies.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
