Monday Dec 03, 2012

My error upgrading 4.0 to 4.2 - What NOT to do...

Last week, I was helping a client upgrade from the 2011.1.4.0 code to the newest 2011.1.4.2 code. We downloaded the 4.2 update from MOS, uploaded and unpacked it on both controllers, and upgraded one of the controllers in the cluster with no issues at all. As this was a brand-new system with no networking or pools made on it yet, there were no resources to fail back and forth between the controllers. Each controller had its own private management interface (igb0 and igb1), and that's it. So we took controller 1 as the passive controller and upgraded it first. It came back up with no issues and was now on the 4.2 code. Great. We then did a takeover on controller 1, making it the active head (although there were no resources for it to take), and then proceeded to upgrade controller 2.

Upon upgrading the second controller, we ran the health check with no issues. We then ran the update, and it ran and rebooted normally. However, something strange then happened. It took longer than normal to come back up, and when it did, we got the "cluster controllers on different code" error message that one gets when the two controllers of a cluster are running different software versions. But we had just upgraded the second controller to 4.2, so they should have been the same, right???

Going into the Maintenance-->System screen of controller 2, we saw something very strange. The "current version" was still 4.0, and the 4.2 code was there but was in the "previous" state with the rollback icon, as if it were the OLDER code and not the newer code. I have never seen this happen before. I would have thought it was a bad 4.2 code file, but it worked just fine on controller 1, so I don't think that was it. Other than the fact that the code did not update, there was nothing else going on with this system. It had no yellow lights, no errors in the Problems section, and no errors in any of the logs. It had been out of the box only a few hours, and didn't even have a storage pool yet.

So.... We deleted the 4.2 code, uploaded it from scratch, ran the health check, and ran the upgrade again. Once again, it seemed to go great, rebooted, and came back up with the same issue: it booted to 4.0 instead of 4.2. See the picture below.... HERE IS WHERE I MADE A BIG MISTAKE....

I SHOULD have instantly called support and opened a Sev 2 ticket. They could have done a shared shell, gotten the correct Fishworks engineer to look at the files and the code, determined which file was messed up, and fixed it. The system was up and working just fine; it was simply on an older code version, not really a huge problem at all.

Instead, I went ahead and clicked the "Rollback" icon, thinking that the system would roll back to the 4.2 code. Ouch... What happened was that the system said, "Fine, I will delete the 4.0 code and boot to your 4.2 code"... Which was stupid on my part, because something was wrong with the 4.2 code file here, and the 4.0 was just fine.

So now the system could not boot at all, the 4.0 code was completely missing from the system, and even a high-level Fishworks engineer could not help us. I had messed it up good. We could only get to the ILOM, and I had to re-image the system from scratch using a hard-to-get-and-use FishStick USB drive. These are tightly controlled and difficult to get, almost always handcuffed to an engineer who will drive out to re-image a system. This took another day of my client's time.

So.... If you see a "previous version" of your system code which is actually a version higher than the current version... DO NOT ROLL IT BACK.... It did not upgrade for a very good reason.

In my case, after the system was re-imaged to a code level just three versions back, we once again tried the same 4.2 code update, and it worked perfectly the first time. The system is now great and stable. Lesson learned.

By the way, our buddy Ryan Matthews wanted to point out the best-practice, supported way of performing an upgrade of an active/active ZFSSA, where both controllers are doing some of the work. These steps would not have helped me with the above issue, but it's important to follow the correct procedure when doing an upgrade.

1) Upload software to both controllers and wait for it to unpack
2) On controller "A" navigate to configuration/cluster and click "takeover"
3) Wait for controller "B" to finish restarting, then login to it, navigate to maintenance/system, and roll forward to the new software.
4) Wait for controller "B" to apply the update and finish rebooting
5) Login to controller "B", navigate to configuration/cluster and click "takeover"
6) Wait for controller "A" to finish restarting, then login to it, navigate to maintenance/system, and roll forward to the new software.
7) Wait for controller "A" to apply the update and finish rebooting
8) Login to controller "B", navigate to configuration/cluster and click "failback"
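For the script-minded, the ordering of the steps above can be sketched as a tiny Python helper that emits them in sequence. The action labels here are my own illustrative shorthand for the BUI/CLI operations, not real appliance commands:

```python
def rolling_upgrade_steps(a="A", b="B"):
    """Return the ordered steps of the rolling active/active upgrade
    described above. Controller names and action labels are
    illustrative shorthand, not actual appliance commands."""
    return [
        (a, "upload software"),   # step 1: unpack on both controllers
        (b, "upload software"),
        (a, "takeover"),          # step 2: B restarts; its resources move to A
        (b, "apply update"),      # steps 3-4: roll B forward, wait for reboot
        (b, "takeover"),          # step 5: A restarts; resources move to B
        (a, "apply update"),      # steps 6-7: roll A forward, wait for reboot
        (b, "failback"),          # step 8: return A's resources to A
    ]
```

The key invariant the ordering protects is that the controller being updated is always the passive one, so clients never lose their resources mid-update.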

Thursday Nov 15, 2012

New code release today - 2011.1.4.2

Wow, two blog entries in the same day! When I wrote the large 'Quota' blog entry below, I did not realize there would be a micro-code update going out the same evening.

So here it is. Code 2011.1.4.2 has just been released. You can get the readme file for it here:

Download it, of course, through the MOS website.

It looks like it fixes a pretty nasty bug. Get it if you think it applies to you. Unless you have a great reason NOT to upgrade, I would strongly advise you to upgrade to 2011.1.4.2. Why? Because the readme file says they STRONGLY RECOMMEND YOU ALL UPGRADE TO THIS CODE IMMEDIATELY using LOTS OF CAPITAL LETTERS.

That's good enough for me. Be sure to run the health check like the readme tells you to. 

**Updated after I posted the above... What worries me is that 2011.1.5.0 was supposed to be out pretty soon, as in weeks. So if they put this 1.4.2 version out now, instead of just adding these three fixes to the 1.5.0 code, they must be pretty important.  

Quotas - Using quotas on ZFSSA shares and projects and users

So you don't want your users to fill up your entire storage pool with their MP3 files, right? Good idea to make some quotas. There are some good tips and tricks here, including a helpful workflow (a script) that will allow you to set a default quota for all of the users of a share at once.

Let's start with some basics. I made a project called "small" and inside it I made a share called "Share1". You can set quotas at the project level, which will affect all of the shares in it, or you can do it at the share level like I am here. Go to the share's General property page.

First, I'm using a Windows client, so I need to make sure I have my SMB mountpoint. Do you know this trick yet? Go to the Protocol page of the share. See the SMB section? It needs a resource name to make the UNC path for the SMB (Windows) users. You do NOT have to type this name in for every share you make! Do this at the Project level. Before you make any shares, go to the Protocol properties of the Project, and set the SMB Resource name to "On". This setting will automatically make the SMB resource name of every share in the project the same as the share name. Note the UNC path name I got below. Since I did this at the Project level, I didn't have to lift a finger for it to work on every share I make in this project. Simple.

So I have now mapped my Windows "Z:" drive to this Share1. I logged in as the user "Joe". Note that my computer shows my Z: drive as 34GB, which is the entire size of my Pool that this share is in. Right now, Joe could fill this drive up and it would fill up my pool. 

Now, go back to the General properties of Share1. In the "Space Usage" area, over on the right, click on the "Show All" text under the Users & Groups section. Sure enough, Joe and some other users are in here and have some data. Note this is also a handy window to use just to see how much space your users are using in any given share. 

Ok, Joe owes us money from lunch last week, so we want to give him a quota of 100MB. Type his name in the Users box. Notice how it now shows you how much data he's currently using. Go ahead and give him a 100M quota and hit the Apply button.

If I go back to "Show All", I can see that Joe now has a quota, and no one else does.

Sure enough, as soon as I refresh my screen back on Joe's client, he sees that his Z: drive is now only 100MB, and he's more than halfway full.

That was easy enough, but what if you wanted the whole share to have a quota, so that the share itself, no matter who uses it, can only grow to a certain size? That's even easier. Just use the Quota box on the left-hand side. Here, I set a quota of 300MB on the share.

So now I log off as Joe, and log in as Steve. Even though Steve does NOT have a quota, my Z: drive now shows as 300MB. This would affect anyone, INCLUDING the ROOT user, because you specified the Quota on the SHARE, not on a person.

Note that back in the Share, if you click the "Show All" text, the window does NOT show Steve, or anyone else, as having a quota of 300MB. Yet the limit applies, because it's on the share itself, not on any user, so this panel does not show it.
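To summarize how the two quota types interact, here's a toy model (my own illustration, not the appliance's actual accounting code): a client simply sees the tightest limit that applies to it.

```python
# Toy model of quota visibility, matching the behavior described above:
# a user quota caps that user, a share quota caps everyone (even root),
# and with no quotas at all you see the whole pool. Illustrative only.
def visible_limit_mb(pool_mb, share_quota_mb=None, user_quota_mb=None):
    limits = [pool_mb]
    if share_quota_mb is not None:
        limits.append(share_quota_mb)
    if user_quota_mb is not None:
        limits.append(user_quota_mb)
    return min(limits)

print(visible_limit_mb(34_000, user_quota_mb=100))   # Joe's view: 100
print(visible_limit_mb(34_000, share_quota_mb=300))  # Steve's view: 300
print(visible_limit_mb(34_000))                      # no quotas: 34000 (the whole 34GB pool)
```

This also explains why the "Show All" panel doesn't list the share quota under any user: it's a property of the share, not of the people in it.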

Ok, here is where it gets FUN....

Let's say you do NOT want a quota on the SHARE, because you want SOME people, like Root and yourself, to have FULL access to it and the ability to fill the whole thing up if you darn well feel like it. HOWEVER, you want to give the other users a quota. HOWEVER, you have, say, 200 users, and you do NOT feel like typing in each of their names and giving them each a quota, and they are not all members of an AD global group you could use or anything like that. Hmmmmmm....

No worries, mate. We have a handy-dandy script that can do this for us. Now, this script was written a few years back by Tim Graves, one of our ZFSSA engineers out of the UK. This is not my script. It is NOT supported by Oracle support in any way. It does work fine with the 2011.1.4 code as best as I can tell, but Oracle, and I, are NOT responsible for ANYTHING that you do with this script. Furthermore, I will NOT give you this script, so do not ask me for it. You need to get this from your local Oracle storage SC. I will give it to them. I want this only going to my fellow SCs, who can then work with you to have it and show you how it works. 

Here's what it does...
Once you add this workflow to the Maintenance-->Workflows section, you click it once to run it. Nothing seems to happen at this point, but something did. 

 Go back to any share or project. You will see that you now have four new, custom properties on the bottom.

Do NOT touch the bottom two properties, EVER. Only touch the top two. Here, I'm going to give my users a default quota of about 40MB each. The beauty of this script is that it will only affect users who do NOT already have any kind of personal quota. It will only change people who have no quota at all. It does not affect the Root user.

After I hit Apply on the Share screen, nothing will happen until I go back and run the script again. The first time you run it, it creates the custom properties. The second and all subsequent times you run it, it checks the shares for any users and applies your quota number to each one of them, UNLESS they already have one set. Notice in the readout below how it did NOT apply to my Joe user, since Joe already had a quota set.

Sure enough, when I go back to the "Show All" in the share properties, all of the users who did not have a quota now have one of 39.1MB. Hmmm... I did my math wrong, didn't I?
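My guess at where the math went sideways: the workflow takes a value in decimal bytes, while the BUI reports sizes in binary megabytes (MiB). The input value below is my assumption, chosen because it reproduces both the 39.1MB here and the 391MB seen after adding a zero:

```python
# Decimal bytes vs. binary megabytes (MiB): a likely source of the
# "wrong math". The entered values are hypothetical, not confirmed.
MIB = 2 ** 20  # 1,048,576 bytes

entered = 41_000_000               # hypothetical value typed into the workflow
print(round(entered / MIB, 1))     # 39.1 -- what the BUI reports

entered_x10 = 410_000_000          # "adding a zero on the end"
print(round(entered_x10 / MIB, 1)) # 391.0 -- matches the 391MB seen later
```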

 That's OK, I'll just change the number of the Custom Default quota again. Here, I am adding a zero on the end.

After I click Apply, and then run the script again, all of my users, except Joe, now have a quota of 391MB.

You can customize a person at any time. Here, I took the Steve user and specifically gave him a Quota of zero. Now when I run the script again, he is different from the rest, so he is no longer affected by the script. Under Show All, I see that Joe is at 100, and Steve has no Quota at all. I can do this all day long. Yes, you will have to re-run the script every time new users get added. The script only applies the default quota to users who are present at the time the script is run. However, it would be a simple thing to schedule the script to run each night, or to make an alert that runs the script when certain events occur.

For you power users, if you ever want to delete these custom properties and remove the script completely, you will find these properties under the "Schema" section under the Shares section. You can remove them there. There's no need to, however; they don't hurt a thing if you just don't use them.

 I hope these tips have helped you out there. Quotas can be fun. 

Friday Oct 26, 2012

Replication - between pools in the same system

OK, I fully understand that it's been a LONG time since I've blogged with any tips or tricks on the ZFSSA, and I'm way behind. Hey, I just wrote TWO BLOGS ON THE SAME DAY!!! Make sure you keep scrolling down to see the next one too, or you may have missed it. To celebrate, for the one or two of you out there who are still reading this, I've got something for you. The first TWO people who make any comment below, with your real name and email so I can contact you, will get some cool Oracle SWAG that I have to give away. Don't get excited, it's not an iPad, but it's pretty good stuff. Only the first two, so if you already see two below, then settle down.

Now, let's talk about Replication and Migration.  I have talked before about Shadow Migration here:
Shadow Migration lets one take an NFS or CIFS share in one pool on a system and migrate that data over to another pool in the same system. That's handy, but right now it's only for filesystems like NFS and CIFS. It will not work for LUNs. LUN shadow migration is a roadmap item, however.

So.... What if you have a ZFSSA cluster with multiple pools, and you have a LUN in one pool but later decide it would be better off in the other pool? No problem. Replication to the rescue. What's that? Replication is only for replicating data between two different systems? Who told you that? We've been able to replicate to the same system for a few code updates now. The instructions below will also work just fine if you're setting up replication between two different systems. After replication is complete, you can easily break replication, change the new LUN into a primary LUN, and then delete the source LUN. Bam.

Step 1. Set up a target system. In our case, the target system is ourself, but you still have to set it up as if it were far away. Go to Configuration-->Services-->Remote Replication. Click the plus sign and set up the target, which is the ZFSSA you're on now.

Step 2. Now you can go to the LUN you want to replicate. Take note which Pool and Project you're in. In my case, I have a LUN in Pool2 called LUNp2 that I wish to replicate to Pool1.

 Step 3. In my case, I made a Project called "Luns" and it has LUNp2 inside of it. I am going to replicate the Project, which will automatically replicate all of the LUNs and/or Filesystems inside of it.  Now, you can also replicate from the Share level instead of the Project. That will only replicate the share, and not all the other shares of a project. If someone tells you that if you replicate a share, it always replicates all the other shares also in that Project, don't listen to them.
Note below how I can choose not only the Target (which is myself), but I can also choose which Pool to replicate it to. So I choose Pool1.

Step 4. I did not choose a schedule or pick the "Continuous" button, which means my replication will be manual only. I can now push the Manual Replicate button in my Actions list and watch it start. You will see both a barber-pole animation and an update in the status bar at the top of the screen noting that a replication event has begun. This also goes into the event log.

 Step 5. The status bar will also log an event when it's done.

Step 6. If you go back to Configuration-->Services-->Remote Replication, you will see your event.

Step 7. Done. To see your new replica, go to the other pool (Pool1 for me), and click the "Replica" area below the words "Filesystems | LUNs". Here, you will see any replicas that have come in from any of your sources. It's a simple matter from here to break the replication, which will change this to a "Local" LUN, and then delete the original LUN back in Pool2.

Ok, that's all for now, but I promise to give out more tricks sometime in November!!! There's very exciting stuff coming down the pipe for the ZFSSA: both new hardware and new software features that I'm just drooling over. That's all I can say, but contact your local sales SC to get an NDA roadmap talk if you want to hear more.

Happy Halloween,

New Write Flash SSDs and more disk trays

In case you haven't heard, the write SSDs for the ZFSSA have been updated. Much faster now for the same price. Sweet.

The new write-flash SSDs have a new part number of 7105026, so make sure you order the right ones. It's important to note that you MUST be on code level 2011.1.4.0 or higher to use them.

They have increased in IOPS from 6,000 to 11,000, and increased throughput from 200MB/s to 350MB/s.  

Also, you can now add six SAS HBAs (up from four) to the 7420, allowing one to have three SAS channels with 12 disk trays each, for a new total of 36 disk trays. With 3TB drives, that's about 2.5 petabytes. Is that enough for you?
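The arithmetic behind that 2.5 petabyte figure checks out, assuming the usual 24 drives per disk tray (my assumption; raw capacity only, before any RAID or mirroring overhead):

```python
# Rough raw-capacity check for the expanded 7420: 36 trays of 3TB drives.
# Assumes 24 drives per tray, which is typical for these disk shelves.
trays, drives_per_tray, tb_per_drive = 36, 24, 3
raw_tb = trays * drives_per_tray * tb_per_drive
print(raw_tb)                    # 2592 TB raw
print(round(raw_tb / 1024, 2))   # ~2.53 PB -- the "2.5 Petabytes"
```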

Make sure you add new cards to the correct slots. I've talked about this before, but here is the handy-dandy matrix again so you don't have to go find it. Remember the rules: you can have six of any one kind of card (like six 10GigE cards), except IB, which is still four max. You really only get 8 slots, since you have two SAS cards no matter what. If you want more than 12 disk trays, you need two more SAS cards, so think about later expansion, too. In fact, if you're going to have two different speeds of drives (in other words, you want to mix 15K-speed and 7,200-speed drives in the same system), I would highly recommend two different SAS channels. So I would want four SAS cards in that system, no matter how many trays you have.

Thursday Aug 16, 2012

New 2011.1.4.0 code is ready

Ok, folks, the newest code for the ZFSSA is now out and available on My Oracle Support. It's pretty easy to find under the "Patches and Updates" tab. Or you can look for it under its patch number:


So this is now version 2011.1.4.0, also called AK-2011. It has a VERY large list of fixes for various issues. You can find the readme file with all of the issues fixed in this release here:

**Update 8-27-12 ** IMPORTANT news for Exadata clients backing up to a ZFSSA over Infiniband: 
Earlier, I reported on this blog about a special IDR code release to fix a certain issue, "CR 7162888, IB infiniband interface stop communicating on both heads". It turns out that this is a duplicate of CR 7013410, which IS fixed in this public code release of 2011.1.4.0.
So, go ahead and use this public release, and do not worry about any special IDR unless you are working with support on some other issue. - Steve

Thursday Aug 09, 2012

Phone Home- just like E.T.

Hmmm, still no update, so they have changed the ETA from "July" to "It will be out when it's ready". I have not heard of a new ETA, so please don't ask.

In the meantime, there are plenty of you who do not have the Automated Service Request (ASR) feature turned on for your ZFSSA systems. This is better known as "Phone Home". It's not only extremely handy and free, but it could possibly save your job and your company lots of time and money. You really, really want to turn it on. The Phone Home feature on the ZFSSA does two things. First, it creates a support ticket with Oracle support in the event of some failure on the ZFSSA. That's good. You will see an email that this happened, and can then go track it with your account on the MOS (My Oracle Support) website. If you're wondering what issues will force an ASR, go back and read my blog entry from last year here:

You need to make sure your ZFSSA system can get to the internet and has access through your firewalls to the following sites and ports:
1.  on port 443
2.  on port 443

The other thing it does is send heartbeats to the Oracle ASR phone home database. As a pre-sales engineer, I find this very handy for my clients. I'll show you a few screenshots below of what I can see in the phone home database for my clients. This lets me keep track of minor, major, and critical issues my customers are having with their ZFSSA systems. I can see how their storage pools are setup and if they are becoming too full. This is something I like to track and keep my eye on for my customers. I will offer to create summary reports for them on a monthly basis to help them keep track of their systems, as many of them don't look at them very often or have too many other things to worry about. Your local Storage SC can also access this database and show you what he or she sees for your systems on Phone Home.

Here is where you setup Phone Home in the ZFSSA BUI interface:

Here is an example of what I can see in the Phone Home database. 

In this example, just lower down on the same screen, you can see the list of issues this system has had over the years, and a click will get you more detail. This system had some hard drives replaced in January, and one in May, but nothing of note since then. Many of the other 'Major' alerts you see below were actually just cluster peer takeovers done for testing and upgrades.

Friday Jul 27, 2012

Installing new ZFSSA systems

I had not realized how long it has been since my last blog entry. I've been so busy installing ZFSSAs that I've been a flake about my blog. I haven't hit anything Earth-shattering to report, and like most of you, I'm really waiting for the release of the new code, which I heard was supposed to be out in July. Hey, we have two more days before we can say it's late, right?  :)

This week, I spent two days setting up and configuring a large 7420 cluster. The funny thing is, a day later, I spent only 30 minutes setting up two different 7420 single-head systems. It's really funny how easy it is to set up a single-head, single-tray system.

Monday Jun 04, 2012

New Analytic settings for the new code

If you have upgraded to the new 2011.1.3.0 code, you may find some very useful settings for Analytics. If you didn't already know, the analytic datasets have the potential to fill up your OS hard drives. The more datasets you use and create, the faster this can happen. Since they take a measurement every second, forever, some of these metrics can grow to multiple GB in a matter of weeks. The traditional 'fix' was to go into Analytics -> Datasets about once a month and clean up the largest datasets. You did this by deleting them. Ouch. Now you lost all of that historical data that you might have wanted to check out many months from now. Or, you had to export each metric individually to a CSV file first. Not very easy or fun. You could also suspend a dataset and have it not collect data at all. Well, that fixed the problem, didn't it? Of course, you now had no data to go look at. Hmmmm....

All of this is no longer a concern. Check out the new Settings tab under Analytics...

Now, I can tell the ZFSSA to keep every second of data for, say, 2 weeks, and then average the 60 seconds of each minute into a single 'minute' value. I can go even further and ask it to average those 60 minutes of data into a single 'hour' value. This allows me to effectively shrink my older datasets to 1/3600th of their size!!! Very cool. I can now allow my datasets to go on forever, and really never have to worry about them filling up my OS drives.
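A quick back-of-envelope shows where that 1/3600 comes from: one sample per second averaged down to one per hour. The per-sample record size here is an assumption purely for illustration:

```python
# Back-of-envelope for the rollup: one sample per second, averaged
# first into minutes and then into hours. Record size is assumed.
SECONDS_PER_DAY = 86_400
record_bytes = 20  # assumed bytes per stored sample

per_day_seconds = SECONDS_PER_DAY * record_bytes
per_day_hours = (SECONDS_PER_DAY // 3600) * record_bytes  # 24 samples/day

print(per_day_seconds)  # 1728000 bytes/day at one-second resolution
print(per_day_hours)    # 480 bytes/day once averaged to hours: 1/3600th
```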

That's great going forward, but what about those huge datasets you already have? No problem. Another new feature in 2011.1.3.0 is the ability to shrink the older datasets in the same way. Check this out. I have here a dataset called "Disk: I/O ops per second" that is about 6.32MB on disk. (You need not worry so much about the "In Core" value, as that is in RAM and fluctuates all the time. Once you stop viewing a particular metric, you will see that shrink over time. Just relax.)

When one clicks on the trash can icon to the right of the dataset, it used to delete the whole thing, and you would have to re-create it from scratch to get the data collecting again. Now, however, it gives you this prompt:

As you can see, this allows you to once again shrink the dataset by averaging the second data into minutes or hours.

Here is my new dataset size after I do this. It shrank from 6.32MB down to 2.87MB, but I can still see my metrics going back to the time I began the dataset.

Now, you do understand that once you do this, as you look back in time at the minute or hour data, you are going to see much larger time values, right? You will need to decide what granularity you can live with, and for how long. Check this out.

Here is my Disk: Percent utilized from 5-21-2012 2:42 pm to 4:22 pm:

After I went through the delete process to change everything older than 1 week to "Minutes", the same date and time looks like this:

Just understand what this will do and how you want to use it. Right now, I'm thinking of keeping the last 6 weeks of data as "seconds", then the last 3 months as "minutes", and then "hours" forever after that. I'll check back in six months and see how the sizes look.
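To get a feel for what that retention policy means, here's a sketch of the sample counts it keeps per dataset (dataset size then scales roughly linearly with samples kept; the 30-day month and one-year horizon are simplifying assumptions of mine):

```python
# Sample counts kept under the retention policy sketched above:
# 6 weeks of one-second data, 3 months of minutes, then hours.
# Assumes 30-day months; "hours forever" is shown for one year.
def samples(weeks_of_seconds=6, months_of_minutes=3, years_of_hours=1):
    sec = weeks_of_seconds * 7 * 86_400       # one sample per second
    mins = months_of_minutes * 30 * 24 * 60   # one sample per minute
    hrs = years_of_hours * 365 * 24           # one sample per hour
    return sec, mins, hrs

sec, mins, hrs = samples()
print(sec, mins, hrs)  # 3628800 129600 8760
```

Notice that the hourly tail is essentially free: a whole extra year of hourly history costs fewer samples than a single day of one-second data.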


Friday May 18, 2012

New code is out- Version 2011.1.3.0


The newest version of the ZFSSA code, 2011.1.3.0, is now out and available on MOS.

I will be writing more about one of its many new, useful features early next week. It's very cool, and has to do with how you can now change the size of your analytic datasets.


ak-2011. Release Notes


This minor release of the Sun ZFS Storage Appliance software contains significant bug fixes for all supported platforms. Please carefully review the list of CRs that have been addressed and all known issues prior to updating.

Among other issues, this release fixes some memory fragmentation issues (CRs 7092116 and 7105404), includes improvements to DTrace Analytics, and failover improvements to DNS, LDAP, and the SMB Domain Controller.

This release requires appliances to be running the 2010.Q3.2.1 micro release or higher prior to updating to this release. In addition, this release includes update health checks that are performed automatically when an update is started, prior to the actual update. If an update health check fails, it can cause the update to abort. The update health checks help ensure that component issues which may impact an update are addressed; it is important to resolve all hardware component issues prior to performing an update.

Deferred Updates

When updating from a 2010.Q3 release to a 2011.1 release, the following deferred updates are available and may be reviewed in the Maintenance System BUI screen. See the "Maintenance:System:Updates#Deferred_Updates" section in the online help for important information on deferred updates before applying them.

1. RAIDZ/Mirror Deferred Update (Improved RAID performance)
This deferred update improves both latency and throughput on several important workloads. These improvements rely on a ZFS pool upgrade provided by this update. Applying this update is equivalent to upgrading the on-disk ZFS pool to version 29.

2. Optional Child Directory Deferred Update (Improved snapshot performance)
This deferred update improves list retrieval performance and replication deletion performance by improving dataset rename speed. These improvements rely on a ZFS pool upgrade provided by this update. Before this update has been applied, the system will be able to retrieve lists and delete replications, but will do so using the old, much slower, recursive rename code. Applying this update is equivalent to upgrading the on-disk ZFS pool to version 31.

Supported Platforms

Issues Addressed

The following CRs have been fixed in this release:

4325892 performance decrease if 1st nameserver is down
4377911 RFE for improved DNS resolver failover performance
6822262 Windows Media Service/SQL server cannot connect to cifs share
6822586 SMB should try UDP first to get LDAP SRV record and retry with TCP if truncated UDP response
6941854 idmap_getwinnamebygid() and idmap_getwinnamebyuid() need to work for builtin names
6953716 Exception 'Native message: datasets is undefined' while append dataset to current worksheet
6973870 Specify retention time for Analytics data
6991949 panic will happen during some error injection stress
6996698 The SRAO may terminate irrelevant memory copy
6997450 gcpu_mca_process() doesn't return a right disp for poisoned error
7023548 replacement failed for faulted readzilla
7040757 smb_com_write_andx NULL pointer dereference panic in mbc_marshal_get_uio
7044065 Replay records within a dataset in parallel
7047976 zil replay assertion failure with full pool
7048780 I/O sent from iSCSI initiator embedded in VirtualBox completes with a status TASK SET FULL
7052406 NFS Server shouldn't take zero copy path on READ if no write chunk list provided
7052703 zl_replay_lock needs to be initialised and destroyed
7066080 s11n code generated by fcc should include strings.h
7066138 fcc must define _INT64_TYPE
7066170 configurable max- and min-units for akInputDuration
7066552 NFS Server Fails READS with NFS4ERR_INVAL when using krbi or krb5p
7071147 DC failover improvements
7071628 ldap_cachemgr exits after failure in profile refresh, only when using sasl/GSSAPI authentication
7071916 ztest/ds_3 missing log records: replayed X < committed Y
7074722 dataspan can be marked read-only even if it has dirty subspans
7080443 LDAP client failover doesn't work
7080790 ZIL: Assertion failed: zh->zh_replay_seq < *replayed_seq (0x1a < 0x1a)
7084762 ztest/ds_3 missing log records: replayed X < committed Y - Part 2
7089422 ldap client uses bindTimeLimit instead of searchTimeLimit when searching for entries
7090133 Large READs are broken with krb5i or krb5p with NFS Zero-Copy turned on
7090153 table-free akInputRadio
7090166 ak_dataspan_stashed needs a reality check
7091223 command to prune datasets
7092116 Extremely sluggish 7420 node due to heap fragmentation
7093687 LDAP client/ldap_cachemgr: long delays in failover to secondary Directory Server
7098553 deadlock when recursive zfs_inactive collides with zfs_unmount
7099848 Phone Home logs still refer to SUN support
7102888 akShow() can clobber CSS
7103620 akInputRadio consumers must be explicit in their use of subinputs as labels
7104363 Influx of snapshots can stall resilvering
7105404 appliance unavailable due to zio_arena fragmentation
7107750 SMB kernel door client times out too early on authentication requests
7108243 ldap_cachemgr spins on configuration error
7114579 Operations per second broken down by share reports "Datum not present" for most time periods
7114890 Ak Build tools should accommodate double-slashes in paths
7117823 RPC: Can't decode result after READ of zero bytes
7118230 Need to deliver CMOS images for Lynxplus SW 1.5
7121760 failure to post an alert causes a umem double-free
7122403 akCreateLabel(): helper function for creating LABEL elements
7122405 several Analytics CLI commands do not check for extra arguments
7122426 akParseDateTime() could be a bit more flexible
7123096 panic: LU is done with the task but LPORT is not done, itask ffffff9f59c540a0 itask_flags 3204
7125626 fmtopo shows duplicate target-path for both sims in a tray starting in 2010.q3.4 and 2011.1.1
7126842 NTLMSSP negotiation fails with 0xC00000BB (NT_STATUS_NOT_SUPPORTED)
7128218 uio_to_mblk() doesn't check for esballoca() failure
7129787 status could be uninitialized in netlogon_logon function
7130441 CPU is pegging out at 98%
7131965 SMB stops serving data (still running) Need to reset smb to fix issue
7133069 smbserver locks up on 7320 running 2011.1. Can't kill the service
7133619 Need to deliver CMOS image for SW 1.3 for Otoro
7133643 Need to deliver CMOS image for SW 1.2 for Otoro+
7142320 Enable DNS defer-on-fail functionality
7144155 idmap kernel module has lock contention calling zone_getspecific()
7144745 Online help Application Integration - MOS should be replaced with OTN
7145938 Add maximum cards for 7420 10GbE, FC, Quad GbE, and InfiniBand
7146346 Online Help: Document Sun ZFS Backup Appliance
7149992 Update doc to include 7320 DRAM
7152262 double-digit firefox version number throws off appliance version checks
7153789 ldapcachemgr lint warnings
7154895 Remove CMOS images for Lynx+ SW 1.5
7155512 lint warning in usr/src/cmd/ldapcachemgr/cachemgr_change.c
7158091 BUI Alert Banner and Wait Dialog are not functioning correctly
7158094 Still not ready for 9005
7158519 dataset class authorization doesn't work as expected
7158522 pruning an unsaved dataset does nothing but looks like it working continuously
7160553 NMI does not panic appliance platforms using apix
7161060 system hang due to physical memory exhaustion seen when major shift in workload
7165883 arc data shrinks continuously after arc grew to reach its steady state

Thursday May 03, 2012

Analytics & Threshold Alerts

Alerts are great not only for letting you know when there's some kind of hardware event; they can also be proactive and let you know a bottleneck is coming BEFORE it happens. Check these out. There are two kinds of alerts in the ZFSSA. When you go to Configuration-->Alerts, you first see the plus sign by the "Alert Actions" section. Those are pretty self-explanatory and not what I'm talking about today. Click on "Threshold Alerts", and then click the plus sign by those.

This is what I'm talking about. The default one that comes up, "CPU: Percent Utilization", is a good one to start with. I don't mind if my CPUs go to 100% utilization for a short time. After all, we bought them to be used, right? If they stay over 90% for more than 10 minutes, however, something is up; maybe we have workloads on this machine that it was not designed for, or we don't have enough CPUs in the system and need more. So we can set up an alert that will keep an eye on this for us and send us an email if it ever occurs. Now I don't have to keep watching it all the time. For an even better example, keep reading...
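To make the semantics concrete, here is a minimal Python sketch of what a threshold alert like this does: watch a stream of utilization samples and fire only once the value has stayed above the limit for the whole duration. This is just an illustration of the logic, not actual ZFSSA code, and the sample rate is an assumption:

```python
from collections import deque

def make_threshold_watch(limit_pct=90, duration_samples=600):
    """Track per-second CPU utilization samples; fire once the value
    has stayed above limit_pct for duration_samples in a row
    (600 one-second samples = 10 minutes)."""
    window = deque(maxlen=duration_samples)

    def observe(sample_pct):
        window.append(sample_pct)
        # Fire only when the window is full AND every sample is high.
        return (len(window) == duration_samples
                and all(s > limit_pct for s in window))
    return observe

# Short 5-sample window just for demonstration.
observe = make_threshold_watch(limit_pct=90, duration_samples=5)
fired = [observe(r) for r in [95, 96, 97, 98, 99]]
# Only the 5th consecutive high sample trips the alert:
# fired == [False, False, False, False, True]
```

Note that a single dip below the limit resets the clock, which is exactly why a sustained-duration alert filters out harmless short bursts of 100% CPU.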

What if you want to keep your eyes on whether your Readzillas or Logzillas are being over-utilized? In other words, do you have enough of them? Perhaps you only have 2 Logzillas, and you think you may be better off with 4, but how do you prove it? No problem. Here in Threshold Alerts, click on the Threshold drop-down box, and choose the "Disk: Percent Utilization for Disk: Jxxxxx 013" entry, which in my case is the Logzilla drive in the Jxxxxx tray.

Wait. What's that? You don't have a choice in your drop-down for the Threshold item you are looking for, such as an individual disk?
Well, we will have to fix that.

Leave Alerts for now, and join me over in Analytics. Start with a worksheet showing the "Disk: Percent utilization broken down by Disk" chart. You do have this, as it's already one of your built-in datasets.

Now, expand it so you can see all of your disks, and find one of your Readzilla or Logzilla drives. (Hint: It will NOT be disk 13 like my example here. Logzillas are always in the 20, 21, 22, or 23 slots of a disk tray. Go to your Configuration-->Hardware screens and you can easily find out which drives are which for your system).

Now, click on that drive to highlight it, like this: 

 Click on the Drill Button, and choose to drill down on that drive as a raw statistic. You will now have a whole new data chart, just for that one drive.

 Don't go away yet. You now need to save that chart as a new dataset, which will keep it in your ZFSSA analytic metrics forever. Well, until you delete it.
Click on the "Save" button, the second to last button on that chart. It looks like a circle with white dots on it (it's supposed to look like a reel-to-reel tape spindle).

Now go to your "Analytics-->Datasets", and you will see a new dataset in there for it. 

 Go back to your Threshold Alerts, and you will now be able to make an alert that will tell you if this specific drive goes over 90% for more than 10 minutes. If this happens a lot, you probably need more Readzillas or Logzillas.

I hope you like these Alerts. They may take some time to set up at first, but in the long run you may thank yourself. It might not be a bad idea to send the email alerts to a mail distribution list, instead of a single person who may be on vacation when the alert is hit.  Enjoy. 

Thursday Apr 19, 2012

Route Table Stuff

Let's talk about your Routing Table.

I have never installed a ZFSSA, ever, without having to edit this table. If you believe that you do not need to edit your routing table, then you are wrong.
:)  Ok, maybe not. Maybe you only have your ZFSSA connected to one network with only a few systems on it. I guess it's possible. Even in my simulator, however, I had to edit the routing table so I could use it no matter how I had my laptop connected, at home over a VPN or at work or using a public Wifi. So I'm going to bet a nice dinner that you, or someone, should be checking this out.

First things first. I'm going to assume you have a cluster. I try really hard to only sell clusters, but yes, I know there are plenty of single-nodes out there too. Single-node people can skip these first two paragraphs. It's very important in your cluster to have a 1GigE management interface on each of the two controllers. You really want to be able to manage each controller, even when one of them is down, right? So best practice is to use the 'igb0' port for controller 1 management and the 'igb1' port for controller 2 management. It's important to make these ports 'Private' in the cluster configuration screen, so they do NOT fail over to the other controller when a cluster takeover takes place for whatever reason. Igb0 and igb1 are two of the four built-in 1GigE ports. You can still use igb2 and igb3 for data, either alone or as an aggregate, and don't make them private, so they DO fail over in a cluster takeover event. Now go to your remote workstation, which may be on a different subnet, and you should be able to ping and connect to Controller 1 using igb0.
Now, back to the routing table. You have probably noticed that you cannot ping or connect to the other controller, and you think something is wrong. Not to worry, everything is fine. You just need to tell your routing table, which is shared between the heads, how to talk to that other port, igb1. You see, you already have a default route set up for port igb0; that's why it works. Your new, private igb1, however, does not know how to speak back to the remote system you are now using to manage the BUI from a different subnet. So make a new default route for igb1 and point it at the default gateway, which is the router it needs to use in order to cross subnets. See the picture below. Note how I have a default route for "ZFS1-MGMT" for port igb0. This shows a green light because I'm currently on ZFS1, and it sees this port just fine. I also have a default route for "ZFS2-MGMT" from port igb1. This route has a blue light, showing it as inactive. That's because this controller, ZFS1, has nothing plugged into its igb1 port. That's perfect. Hit "Apply". Now count to 10. Now, from your remote host, go ahead and ping or connect to Controller 2, and it works!!! This is because your controllers share a routing table, and when you added that igb1 route, it propagated over to the other controller, where igb1 is plugged in; over there that route has a green light and works fine. You will see from Controller 2's point of view that igb1 has a green light and igb0 has a blue light.  (continued below the picture)

Now it's time to set up any static routes you may need. If you have different subnets for your 1GigE management and your IB or 10GigE data (a very good idea), then you will need to make these. It's important to have routes for this, as you do not want data coming in over the 10GigE pipe but then returning over the 1GigE pipe, right? That will happen if this is not set up correctly. Make your routes, as the picture example shows with a 10Gig aggregate we called "Front-end-IP". Any traffic coming in from subnet 172.20.69 will use this pipe.

Lastly, check your multi-homing model button up top. I like 'Adaptive'. 'Loose' is the default; it lets your packets traverse your routes even though they may go over the wrong route, so it seems like your system is working. This can very well be an illusion. Your ping may work, but it may be coming from the wrong interface, as "Loose" basically means the ZFSSA just doesn't care or enforce any rules. "Strict", on the other hand, is great if you want total enforcement. If you are very good with your routes, are positive you have them right, and want to ensure that a packet never goes the wrong way, even if that means dropping the packet, then use Strict. I'm using Adaptive here, which is a happy medium.  From the help file: The "Adaptive" choice will prefer routes with a gateway address on the same subnet as the packet's source IP address: 1) An IP packet will be accepted on an IP interface so long as its destination IP address is up on the appliance. 2) An IP packet will be transmitted over the IP interface tied to the route that most specifically matches an IP packet's destination address. If multiple routes are equally specific, prefer routes that have a gateway address on the same subnet as the packet's source address. If no eligible routes exist, drop the packet.
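The route-selection behavior the help file describes can be sketched in a few lines of Python. This is only an illustration of the "Adaptive" preference order (most-specific match first, then gateway-on-source-subnet as the tie-break), not appliance code; the route entries and the /24 "same subnet" simplification are my own assumptions:

```python
import ipaddress

def pick_route(routes, dst, src):
    """Pick an outgoing interface the way 'Adaptive' is described:
    the most-specific matching destination wins; among equally
    specific routes, prefer a gateway on the same subnet as the
    packet's source. routes is a list of (network, gateway,
    interface) string tuples."""
    dst, src = ipaddress.ip_address(dst), ipaddress.ip_address(src)
    candidates = [(ipaddress.ip_network(net), ipaddress.ip_address(gw), ifc)
                  for net, gw, ifc in routes
                  if dst in ipaddress.ip_network(net)]
    if not candidates:
        return None  # no eligible route: drop the packet
    best = max(net.prefixlen for net, _, _ in candidates)
    ties = [c for c in candidates if c[0].prefixlen == best]
    # Tie-break: gateway on the same subnet as the source address
    # (assuming /24 subnets for this sketch).
    for net, gw, ifc in ties:
        if gw in ipaddress.ip_network(f"{src}/24", strict=False):
            return ifc
    return ties[0][2]

# Hypothetical table: a default route out the management port and a
# static route for the 10GigE data subnet, as in the example above.
routes = [("0.0.0.0/0", "192.168.1.1", "igb0"),
          ("172.20.69.0/24", "172.20.69.1", "Front-end-IP")]
```

With this table, traffic for 172.20.69.x goes out "Front-end-IP" because the /24 static route is more specific than the default, which is exactly why data never falls back onto the 1GigE management pipe.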

Update 4/23/12- My colleague Darius rightfully wanted me to point out how important it is to set up a static route for replication. You do not want replication to go over a private management port by mistake, as this will cause it to fail when one controller or the other goes down for maintenance.

I hope this helps. Routing can be fun. 

Saturday Apr 14, 2012

New SPC2 benchmark- The 7420 KILLS it !!!

This is pretty sweet. The new SPC2 benchmark came out last week, and the 7420 not only came in 2nd of ALL speed scores, but came in #1 for price per MBPS.

Check out this table. The 7420 score of 10,704 makes it really fast, but that's not the best part. The price one would have to pay in order to beat it is ridiculous. You can go see for yourself at
The only system on the whole page that beats it was over twice the price per MBPS. Very sweet for Oracle.

So let's see, the 7420 is the fastest per $.
The 7420 is the cheapest per MBPS.
The 7420 has incredible, built-in features, management services, analytics, and protocols. It's extremely stable and as a cluster has no single point of failure. It won the Storage Magazine award for best NAS system this year.

So how long will it be before it's the number 1 NAS system in the market? What are the biggest hurdles still stopping the widespread adoption of the ZFSSA? From what I see, it's three things: 1. Administrator's comfort level with older legacy systems. 2. Politics 3. Past issues with Oracle Support.  

I see all of these issues crop up regularly. Number 1 just takes time and education. Number 3 takes time with our new, better, and growing support team. Many of them came from Oracle, and there were growing pains when Oracle went from a straight software model to having to also support hardware. Number 2 is tricky, but it's the job of the sales teams to break through the internal politics and help their clients see the value in Oracle hardware systems. Benchmarks like this will help.

Thursday Apr 12, 2012

Hybrid Columnar Compression

You heard me in the past talk about the HCC feature for Oracle databases. Hybrid Columnar Compression is a fantastic, built-in, free feature of Oracle 11gR2. One used to need an Exadata to make use of it. However, last October, Oracle opened it up and now allows it to work on ANY Oracle DB server running 11gR2, as long as the storage behind it is a ZFSSA for dNFS, or an Axiom for FC.

If you're not sure why this is so cool or what HCC can do for your Oracle database, please check out this presentation. In it, Art will explain HCC, show you what it does, and give you a great idea why it's such a game-changer for those holding lots of historical DB data.

Did I mention it's free? Click here:

Monday Apr 02, 2012

New ZFSSA code release - April 2012

A new version of the ZFSSA code was released over the weekend.

In case you have missed a few, we are now on code 2011.1.2.1. This minor update is very important for our friends with the older SAS1 cards on the older 7x10 systems. This 2.1 minor release was made specifically for them, and fixes the issue that their SAS1 card had with the last major release. They can now go ahead and upgrade straight from the 2010.Q3.2.1 code directly to 2011.1.2.1.

If you are on a 7x20 series, and already running 2011.1.2.0, there is no real reason why you need to upgrade to 1.2.1, as it's really only the Pandora SAS1 HBA fix. If you are not already on 1.2.0, then go ahead and upgrade all the way to 2011.1.2.1.

I hope everyone out there is having a good April so far. For my next blog, the plan is to work off the Analytic tips I did last week and expand on which Analytics you want to really keep your eyes on, and also how to setup alerts to watch them for you.

You can read more and keep up on your releases here:



Wednesday Mar 28, 2012

Fun tips with Analytics

If you read this blog, I am assuming you are at least familiar with the Analytic functions in the ZFSSA. They are basically amazing, very powerful and deep.

However, you may not be aware of some great, hidden functions inside the Analytic screen.

Once you open a metric, the toolbar looks like this:

Now, I’m not going over every tool, as we have done that before, and you can hover your mouse over them and they will tell you what they do. But…. Check this out.
Open a metric (CPU Percent Utilization works fine), and click on the “Hour” button, which is the 2nd clock icon. That’s easy, you are now looking at the last hour of data. Now, hold down your ‘Shift’ key, and click it again. Now you are looking at 2 hours of data. Hold down Shift and click it again, and you are looking at 3 hours of data. Are you catching on yet?
You can do this with not only the ‘Hour’ button, but also with the ‘Minute’, ‘Day’, ‘Week’, and the ‘Month’ buttons. Very cool. It also works with the ‘Show Minimum’ and ‘Show Maximum’ buttons, allowing you to go to the next iteration of either of those.

One last button you can Shift-click is the handy ‘Drill’ button. This button usually drills down on one specific aspect of your metric. If you Shift-click it, it will display a “Rainbow Highlight” of the current metric. This works best if this metric has many ‘Range Average’ items in the left-hand window. Give it a shot.

Also, one will sometimes click on a certain second of data in the graph, like this:

 In this case, I clicked 4:57 and 21 seconds, and the 'Range Average' on the left went away and was replaced by the time stamp. At this point it seems to some people that you are now stuck and cannot get back to an average for the whole chart. However, you can actually click on the time stamp of "4:57:21" right above the chart. Even though your mouse pointer does not change into the typical finger cursor that most links show, you can click it, and it will change your range back to the full metric.

Another trick you may like is to save a certain view or look of a group of graphs. Most of you know you can save a worksheet, but did you know you could Sync them, Pause them, and then Save it? This will save the paused state, allowing you to view it forever the way you see it now. 

Heatmaps. Heatmaps are cool, and look like this: 

Some metrics use them and some don't. If you have one, and wish to zoom it vertically, try this. Open a heatmap metric like my example above (I believe every metric that deals with latency will show as a heatmap). Select one or two of the ranges on the left. Click the "Change Outlier Elimination" button. Click it again and check out what it does. 

Enjoy. Perhaps my next blog entry will be the best Analytic metrics to keep your eyes on, and how you can use the Alerts feature to watch them for you.


Wednesday Mar 21, 2012

Using all Ten IO slots on a 7420

So I had the opportunity recently to actually use up all ten slots in a clustered 7420 system. That actually uses 20 slots, or 22 if you count the Clustron cards. I thought it was interesting enough to share here. This is at one of my clients here in southern California.

You can see the picture below. We have four SAS HBAs instead of the usual two. This is because we wanted to split up the back-end traffic for different workloads. We have a set of disk trays coming from two SAS cards for nothing but Exadata backups. Then, we have a different set of disk trays coming off of the other two SAS cards for non-Exadata workloads, such as regular user file storage. 
We have 2 Infiniband cards which allow us to do a full mesh directly into the back of the nearby, production Exadata, specifically for fast backups and restores over IB. You can see a 3rd IB card here, which is going to be connected to a non-production Exadata for slower backups and restores from it.
The 10Gig card is for client connectivity, allowing other, non-Exadata Oracle databases to make use of the many snapshots and clones that can now be created using the RMAN copies from the original production database coming off the Exadata. This allows a good number of test and development Oracle databases to use these clones without affecting performance of the Exadata at all.
We also have a couple FC HBAs, both for NDMP backups to an Oracle/StorageTek tape library and also for FC clients to come in and use some storage on the 7420.

 Now, if you are adding more cards to your 7420, be aware of which cards you can place in which slots. See the bottom graphic just below the photo. 
Note that the slots are numbered 0-4 for the first 5 cards, then the "C" slots which is the dedicated Cluster card (called the Clustron), and then another 5 slots numbered 5-9.

Some rules for the slots:

  • Slots 1 & 8 are automatically populated with the two default SAS cards. The only other slots you can add SAS cards to are 2 & 7.
  • Slots 0 and 9 can only hold FC cards. Nothing else. So if you have four SAS cards, you are now down to only four more slots for your 10Gig and IB cards. Be sure not to waste one of these slots on an FC card, which can go into 0 or 9 instead. 
  • If at all possible, slots should be populated in this order: 9, 0, 7, 2, 6, 3, 5, 4

Monday Mar 12, 2012

Good papers and links for the ZFSSA

So I have a pretty good collection of links and papers for the ZFSSA, and instead of giving them out one-at-a-time when asked, I thought it may be easier to do it this way. Many of the links from my old blog last May no longer work, so here is an updated list of some good spots to check out.

These are for ZFS, in general, not the ZFSSA, but it gives one good insight to how ZFS functions:

Tuesday Mar 06, 2012

New 7420 hardware released today

Some great new upgrades to the 7420 were announced and released today. You can now get 10-core CPUs in your 7420, allowing you to have 40 cores in each controller. Even better, you can now also go to a huge 1TB of DRAM for your L1ARC in each controller, using the new 16GB DRAM modules.

So your new choices for the new 7420 hardware are 4 x 8-core or 4 x 10-core models. Oracle is no longer going to sell the 2 x CPU models, and they are also going to stop selling the 6-core CPUs, both as of May 31st. Also, you can now order 8GB or 16GB modules, meaning that the minimum amount of memory is now 128GB, and can go to 1TB in each controller. No more 64GB, as the 4GB module has also been phased out (starting today, actually).

Now before you get upset that you can no longer get the 2-CPU model, be aware that there was also a price drop, so the 4 x 8-core CPU model is a tad LESS than the old 2 x 8-core CPU model. So stop complaining.

It's the DRAM that I'm most excited about. I don't have a single ZFSSA client that I know of that has a CPU bottleneck. So the extra cores are great, but not amazing. What I really like is that my L1ARC can now be a whole 1TB. That's crazy, and will be able to drive some fantastic workloads. I can now place your whole, say 800GB, database entirely in DRAM cache, and not even have to go to the L2ARC on SSDs in order to hit 99% of your reads. That's sweet. 

Friday Feb 24, 2012

New ZFSSA code release today

The first minor release of the 2011.1.1 major release for the ZFSSA came out yesterday.

You can get the code via MOS, under the "Patches and updates" tab. Just click the "Product or Family (advanced)" link, then type "ZFS" in the search window, and it takes you right to it. Or search on its patch ID, which is 13772123.

Along with some other fixes, the most important piece of this update is the RPC flow control fix, which will greatly help those using the ZFSSA to backup an Exadata over Infiniband. 

If you're not already on the major release of 2011.1.1, I urge you to update to it as soon as you can. You can jump right to this new 2011.1.1.1 code, as long as you are already on 2010.Q3.2.1 or higher. You don't need to go to 2011.1.1 first, just jump to 2011.1.1.1.

If you are using your ZFSSA to backup an Exadata, I urge you to get on 2011.1.1.1 ASAP, even if it means staying late and scheduling special time to do it.

It's also important to note that if you have a much older ZFSSA (one of the 7x10 models that use the older SAS1 HBAs, not the SAS2 HBAs), you should NOT upgrade to the 2011.1 code. The latest code that supports your SAS1 systems is 2010.Q3.4.2.

 **Update 2-26-12:  I noted a few folks saying the link was down, however that may have been a burp in the system, as I just went into MOS and was able to get 2011.1.1.1 just fine. So delete your cookies and try again. - Steve

Thursday Feb 23, 2012

Great new 7320 benchmark

A great new benchmark has been put up on SPEC for our mid-class 7320. You can see it here:

What's cool about this benchmark is the fact this is not only our middle-sized box, but it used only 136 drives to reach this rather high 134,140 NFS Ops/sec number. If you look at the other systems tested here, you will notice that they must use MANY more drives (at presumably a much higher cost) in order to meet or beat those IOPS.

Check these out here...

For example, a FAS6080 should be far faster than our smaller 7320, right? But it only scored 120,011, even though it used 324 disks. The Isilon S200 with 14 nodes and 679 drives only scored 115,911. I would hate to find out what that system's street price is. I'm pretty sure it's higher than our 7320 with 136 drives. Now, of course all of these benchmark numbers are unrealistic to most people, as they are done in perfect conditions with each manufacturer's engineers tuning and tweaking the system the best they can, right? True, but if that's the case, and the other folks tuned and configured those boxes just like we did, it still seems like a fair fight to me, and our results are just head and shoulders above the rest on a cost-per-IOP basis. I don't see anything on this site that touches our IOPS with the same amount of drives and presumably the same price range. Please point out if I missed anything here; I might be wrong.
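The per-spindle comparison is easy to check with simple arithmetic on the numbers quoted above; a quick sketch:

```python
# Ops-per-spindle check using the benchmark figures cited in this post:
# (NFS ops/sec score, number of drives used).
systems = {
    "ZFSSA 7320": (134_140, 136),
    "FAS6080": (120_011, 324),
    "Isilon S200 (14 nodes)": (115_911, 679),
}

ops_per_drive = {name: round(ops / drives)
                 for name, (ops, drives) in systems.items()}
# Roughly 986 ops/sec per drive for the 7320, versus about 370 for
# the FAS6080 and about 171 for the 14-node Isilon S200.
```

Since drive count is a reasonable stand-in for cost here, roughly 2.5x to 5.8x more work per spindle is the whole cost-per-IOP argument in one division.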

I really love the ones that go so far overboard on this site... Check out the 140 node Isilon. Let's see... Wow, it's over one million IOPS!!!! That's impressive, until you see it's using 3,360 disk drives. That's funny. PLEASE let me know if you have a 140 node Isilon up and running. I'd love to see it. I'd also love to know what it costs.

Tuesday Feb 07, 2012

Tip- Setting up a new cluster

I haven’t given out a real tip for a while now, but this issue popped up on me last week, so I thought I would pass it along. I had a horrible time setting up a new 7320 cluster, for the sole reason that I screwed it up by not doing it in the right order. This caused my install, which should have been done in 1 hour, to take me over 3 hours to complete.

So let me tell you what I did wrong, and then I'll tell you the way I should have done it.

Out of the box, my client's two new 7320 controller heads were one software revision behind, at 2010.Q3.4.2, so I wanted to upgrade them to the newest version, 2011.1.1.1. So far, so good, right? Well, here was my mistake. I configured controller A via the serial interface, gave it IP numbers, went into the BUI, and did the upgrade to 2011.1.1.1. No problem. Now I wanted to bring the other one up and do the same thing. However, I knew that controller B in a cluster must be in the initial, factory-reset state in order to be joined to a cluster. You can't configure it first; if you do, you must factory-reset it in order to join a cluster. So I brought controller B up, but I didn't configure it, and I went to controller A to start the cluster setup process. Big mistake. The process starts, but because the two controllers are on two different software versions, the cluster process cannot continue. This hoses me (that's southern California slang for "messes me up"), because now controller B has started the cluster setup process, and going to the serial connection just shows it hung in a "configuring cluster" state. Rebooting it does not help, as it's still in the "configuring cluster" state once it comes back up.

So.... now I have 2 choices. I can downgrade controller A back to 2010.Q3.4.2, or I can factory-reset controller B, bring it up as a single controller, upgrade it to 2011.1.1.1, factory-reset it again, and then finally be able to add it to the cluster via controller A's cluster setup process. I opt for the second choice, as I do not want to downgrade controller A, which is working just fine. Remember, controller B is currently hosed, messed up, or wanked, depending on how you want to say it.
It's stuck. So to get it back to a state I can work with, I need to do the trick I talked about way back in this blog on May 31, 2011. I had to use the GRUB menu, use the -c trick on the kernel line, and reset the machine, erasing all configuration on it. Now I could bring it up as a single controller, upgrade it, factory-reset it, and then have it join the cluster. That all worked fine; it just took me two hours to do it all.

Here's what I should have done.

Bring up controller A, config it and log into the BUI. Now bring up controller B. Do NOT config it in any way. Using controller A, setup clustering in the cluster menu.

Once the two controllers are clustered and all is well, NOW go ahead and upgrade controller A to the latest code. Once it reboots, go ahead and upgrade controller B. Everything's fine. You see, if the cluster has already been made, it's perfectly fine to upgrade one controller at a time. The software lets you do that. The software does NOT let you setup a NEW cluster if the controllers are not on the same software level. 

So that is the cluster setup safety tip of the day, kids. Have fun. 

Tuesday Jan 31, 2012

New Power Calculator is up

The Oracle Power Calculator for the new 3TB, 600GB, and 300GB drive versions of the ZFSSA is now up and running.

From this page, you can click on the "Power Calculators" link on top to go back out to the main screen where you will find power calculators for all of Oracle hardware. 

Friday Jan 20, 2012

New Storage Magazine awards for NAS... Check this out...

Well, it's hard to be quiet about this. Storage Magazine just came out with the January 2012 issue, showing Oracle Storage doing quite well (#1) with the Oracle ZFSSA 7420 and 7320 family. Check out pages 37-43 of this month's Storage Magazine.

Storage Magazine: (pages 37-43)


Thursday Jan 12, 2012

New ZFSSA simulator download

I've just been informed that the simulator download has been updated to the latest version of 2011.1.1.

So instead of trying to upgrade your older simulator, it is possible to download and install a new one already at the latest code. Mine upgraded just fine, but some people have reported errors during the upgrade, usually when using a computer or laptop without enough memory, or due to a variety of other problems. You can get the simulator here:

Tuesday Jan 10, 2012

Even more ZFSSA announcements

The new announcements for the ZFSSA just keep on coming.

Oracle has released today the 3TB drives for the 7420 and 7320 disk trays. So you now can choose 2TB and 3TB 7,200 RPM drives and 300GB and 600GB 15,000 RPM drives in your 7420 and 7320 systems.

Now, the 2TB drives have a last order date of May 31, 2012, so after that it will be 3TB only for the slower-speed drives.

Also, has anyone checked out the new local replication feature that just came out in the 2011.1.1 software release? I'm going to play with it this week and I'll do a write up on it soon.


Thursday Jan 05, 2012

New ZFSSA firmware release is available in MOS


In case you have not been paying attention, the new 2011.1.1.0 software release for the ZFSSA is out and available for download inside the My Oracle Support website.

To find it, go to the "Patches & Updates" tab, and then do the advanced family search. Type in "ZFSSA" and it will take you right to it (choose 2011.1 in the next submenu).

You need to have your systems on 2010.Q3.2.1 or greater in order to upgrade to 2011.1.1, so be prepared.

It also includes a new OEM grid control plug-in for the ZFSSA.

Here are some details about it from the readme file: 

Sun ZFS Storage Software 2011.1.1.0 (ak-2011.…): This major software update for Sun ZFS Storage Appliances contains numerous bug fixes and important firmware upgrades. Please carefully review all release notes below prior to updating.
Seven separate patches are provided for the 2011.1.1.0 release:


This release includes a variety of new features, including:

  • Improved RMAN support for Oracle Exadata
  • Improved ACL interoperability with SMB
  • Replication enhancements - including self-replication
  • InfiniBand enhancements - including better connectivity to Oracle Exalogic
  • Datalink configuration enhancements - including custom jumbogram MTUs
  • Improved fault diagnosis - including support for a variety of additional alerts
  • Per-share rstchown support


This release also includes major performance improvements, including:

  • Significant cluster rejoin performance improvements
  • Significant AD Domain Controller failover time improvements
  • Support for level-2 SMB Oplocks
  • Significant zpool import speed improvements
  • Significant NFS, iSER, iSCSI and Fibre Channel performance improvements due to elimination of data copying in critical datapaths
  • ZFS RAIDZ read performance improvements
  • Significant fairness improvements during ZFS resilver operations
  • Significant Ethernet VLAN performance improvements

Bug Fixes

This release includes numerous bug fixes, including:

  • Significant clustering stability fixes
  • ZFS aclmode support restored and enhanced
  • Assorted user interface and online help fixes
  • Significant ZFS, NFS, SMB and FMA stability fixes
  • Significant InfiniBand, iSER, iSCSI and Fibre Channel stability fixes
  • Important firmware updates

Wednesday Dec 28, 2011

New Storage Eye Charts

My new Storage Eye Chart is out. You can get it from the bookmark link on the right-hand side of this page.

Version 10 adds the Axiom and 2500M2 to a new page and also updates the ZFSSA with the new updates.

I hope everyone out there has a very happy New Year. See you in January. 

Tuesday Dec 06, 2011

New SSDs announced today

Thought you should know about three new announcements for the ZFSSA.

--Write-flash-cache SSDs have gone from 18GB to 73GB each.
--New long-range transceivers for the 10GigE cards are now available
--3TB drives for the 7120 model are here today. The 3TB drives for the 7320 and 7420 are NOT here yet, but close.

Effective December 6, 2011, we are pleased to announce three new options for Oracle’s Sun ZFS Storage portfolio:
1. Availability of a 73GB Write Flash Cache for 7320 and 7420.  This new SSD features 4X the capacity and almost double the write throughput and IOPS performance of its predecessor.  In comparison to the current 18GB SSD, this new 73GB SSD significantly enhances the system write speed.  As an example, a recent test on a particular 7420 system demonstrated a 7% improvement in system write performance while using half the number of SSDs.  The 73GB SSD is also available to our customers at a lower list price point.  This is available as an ATO or X Option.
2. Availability of the standard Sun 10 GbE Long Range Transceiver for the current 1109A-Z 10GbE card as a configurable option for ZFS Storage Appliance.  This Long Range Transceiver enables 10 GbE optical connectivity for distances greater than 2 km.
3. Availability of a new 7120 base model featuring integrated 11 x 3TB HDDs and a 73GB Write Flash Cache.  (Note that availability of the 3TB drive is limited to the 7120 base model internal storage only – it is not available in the disk shelves at this time.)

Additionally, we are announcing End-of-Life for the following two items:
1. 2TB drive-equipped base model of the 7120, with a Last Order Date of December 31, 2011.
2. 18GB Write Flash Cache, with a Last Order Date of January 10, 2012.

Wednesday Oct 26, 2011

VDEV - What is a VDEV and why should you care?

Ok, so we can finally talk VDEVs. Going back to my blog on disk calculations, I told you how the calculator works, and the way you can see how many drive spindles you would have for any particular RAID layout. Let's use an example of nine trays of 24 drives each, using 1TB drives.

 Yes, I know we no longer offer 1TB drives, but this is the graphic I had, so just roll with me. Now, if we were setting this up in the ZFSSA BUI, it would look like this:

 So that's all great and it all lines up, right? Well, the one thing the BUI doesn't show very well is the VDEVs. You can figure it out in your head if you know what you're doing, but the calculator can do it for you if you just add the "-v" option right after the .py script name on the python command line. Doing that for the above example will give you back this:

Notice the new column for VDEVs. Cool. So now I can see the breakdown of Virtual Devices that each type of RAID will create out of my physical devices (spindles). In this case, my nine trays of 24 spindles is 216 physical devices.  
- If I do something silly and make that a 'Stripe', then I would get 1 big virtual device made up of 216 physical devices.
- I could also make it a 'Mirror', which will give me 106 virtual devices, each made up of 2 physical devices.
- A RAIDz1 pool will give me 53 virtual devices, each with 4 physical devices to make my 3+1 stripes.
- Finally, for the sake of this conversation, a RAIDz2 choice will give me only 15 VDEVs, each with 14 physical drives that make 12+2 stripes. You don't get 14 data drives, you get 14 drives per stripe, so you need to remember that 2 of those are parity drives in a RAIDz2 stripe when you calculate your usable space.
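For illustration, here is a hypothetical Python sketch of that "-v" math. This is not the actual calculator script; the per-layout spare counts are assumptions taken from the example above, since the appliance sets aside spares before carving the remaining spindles into VDEVs.

```python
# Hypothetical sketch of the "-v" VDEV math, not the real calculator
# script. Spares are reserved first, then the rest of the spindles are
# carved into VDEVs of a fixed stripe width.

def vdev_count(total_drives, stripe_width, spares):
    """Number of whole VDEVs after setting aside the spares."""
    return (total_drives - spares) // stripe_width

total = 9 * 24  # nine trays of 24 drives = 216 spindles

# (stripe width per VDEV, spares) -- spare counts from the example above
layouts = {
    "mirror": (2, 4),    # 2-drive mirrors
    "raidz1": (4, 4),    # 3+1 stripes
    "raidz2": (14, 6),   # 12+2 stripes
    "raidz3": (53, 4),   # 50+3 stripes
}
for name, (width, spares) in layouts.items():
    print(f"{name}: {vdev_count(total, width, spares)} VDEVs "
          f"of {width} drives each, {spares} spares")
```

Running it reproduces the table: 106 mirrors, 53 RAIDz1 VDEVs, 15 RAIDz2 VDEVs, and 4 RAIDz3 VDEVs.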

Now, why do you care how many VDEVs you have? It's all about throughput.  Very simply stated, the more VDEVs you have, the more data can be pushed into the system by more users at once. Now, that's very simplistic, and it really depends on your workload. There are exceptions, as I have found, but for the most part, more VDEVs will equal better throughput for small, random IO. This is why a Mirrored pool is almost always the best way to set up a high-throughput pool for small, random IO such as a database. Look at all the VDEVs a mirrored pool gives you.

Think of it this way: Say you have a 256K block of data you want to write to the system, using a 128K record size. With a mirrored pool, ZFS will split your 256K write into 2 records of 128K each, and send them down to exactly 2 of the VDEVs to write out to 4 physical drives. Now, you still have a whopping 104 other VDEVs not doing anything, and they could all be handling other users' workloads at the same time. Take the same example using a RAIDz1 pool. ZFS will again break up your 256K block into two 128K chunks and send it to 2 VDEVs, each with 4 physical drives, with each data drive of the 3+1 stripe getting about 43K. That's all fine, but while those 8 physical drives are working on that data, they can't do anything else, and you only have 51 other VDEVs to handle everyone else's workload.
As an extreme example, let's check out a RAIDz3 pool. You only get 4 VDEVs, each with 53 drives, each in a 50+3 stripe. Writing that same 256K block with a 128K record size will still split it over 2 VDEVs, and you only have 2 left for others to use at the same time. In other words, it will take the IOPS of 106 physical spindles to write that one stupid 256K block, while in the Mirrored pool, it would have only taken the IOPS of 4 physical spindles, leaving you with tons of other IOPS. 
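The arithmetic in the two paragraphs above can be sketched in a few lines. This is a toy model under the stated assumptions: each 128K record lands on exactly one VDEV, and that write busies every drive in the VDEV.

```python
# Toy model: how many physical spindles one write ties up.
# Assumes ZFS splits the IO into recordsize chunks and each chunk goes
# to exactly one VDEV, busying every drive in that VDEV.

def spindles_busied(io_kb, recordsize_kb, vdev_width):
    records = io_kb // recordsize_kb   # 256K / 128K = 2 records
    return records * vdev_width        # each record busies one full VDEV

write = 256   # KB written
record = 128  # KB record size

print("mirror :", spindles_busied(write, record, 2))   # 4 spindles
print("raidz1 :", spindles_busied(write, record, 4))   # 8 spindles
print("raidz3 :", spindles_busied(write, record, 53))  # 106 spindles
```

Same 256K write, wildly different cost: 4 spindles busy on the mirrored pool versus 106 on the wide RAIDz3 stripes.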

Make sense?

Like I said, Mirroring is not always the best way to go. I've seen plenty of examples where we choose other pools over Mirrored after testing. That is the key. You need to test your workload with multiple pool types before picking one. If you don't have that luxury, make your best, educated guess based on the knowledge that in general, high throughput random IO does better with more VDEVs, and large, sequential files can do very well with larger stripes found in RAIDz2. 

As a side note, we recommend the RAIDz1 pool for our Exadata backups to a ZFSSA. After testing, we found that, yes, the mirrored pool did go a bit faster, but not enough to justify the drop in capacity. We also found that the RAIDz1 pool was about 20% faster for backups and restores than the RAIDz2 pool, so the extra capacity of RAIDz2 didn't justify giving up that speed. Now, some people may disagree and say they don't care about capacity, they want the fastest no matter what, and go with Mirrored even in this scenario. That's fine, and that's the beauty of the ZFSSA, where you are allowed to experiment with many choices and options and choose the right balance for your company and your workload.
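The capacity side of that trade-off is easy to check with the 216-spindle, 1TB example from earlier (raw data-drive capacity only, before ZFS overhead; VDEV counts and stripe shapes are the ones from the list above):

```python
# Raw usable capacity = VDEVs x data drives per VDEV x drive size.
# VDEV counts and stripe shapes from the 216 x 1TB example above.

DRIVE_TB = 1

layouts = {
    "mirror": (106, 1),   # 106 mirror pairs, 1 data drive each
    "raidz1": (53, 3),    # 53 x (3+1) stripes
    "raidz2": (15, 12),   # 15 x (12+2) stripes
}
for name, (vdevs, data_drives) in layouts.items():
    print(f"{name}: {vdevs * data_drives * DRIVE_TB} TB raw usable")
```

So mirroring gives you 106TB raw against 159TB for RAIDz1 and 180TB for RAIDz2, which is exactly the capacity gap being weighed against speed here.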

Have fun. Steve 


This blog is a way for Steve to send out his tips, ideas, links, and general sarcasm. Almost all related to the Oracle 7000, code named ZFSSA, or Amber Road, or Open Storage, or Unified Storage. You are welcome to contact me with any comments or questions.

