Solaris x86 rootnex warning in syslog

You're probably here because you searched for "x86 rootnex warning" or something like that on Google. :-) If so, you came to the right place. Below I'm including a slightly edited heads-up message that I sent out when I put back the bug fix related to this warning. It gives some details on the warning.

But first, a little background. Towards the end of s10 development, I fixed a couple of old x86 rootnex bugs. One of these bugs let a DMA bind operation succeed when it should have failed. The fix ended up finding a few driver bugs. Since we were getting relatively close to s10 RR, and this was a hard-to-diagnose problem, I ended up printing a warning when we hit this condition. We found a few 3rd party drivers with sgllen-related bugs in Solaris Express after this was fixed.

I heard some folks still occasionally run into this, so I figured I'd put this info out there. Here's the rootnex code in question and here's the original bug. Below is a slightly edited version of the heads-up that I sent out.


My recent putback of:

(P1) 3001685 ddi: DMA breakup routines do not match DDI specification
     4926500 ddi: x86 DMA breakup with sgllen==1 can give ncookies!=1
     4796610 ddivs_dmae/ddi_check_dma_handle_* 3 assertions fail due to a product bug

could cause existing buggy drivers, which are thought to be functioning
correctly, to start failing.


------------------


 o description of the problem - what's wrong

   The x86 root nexus's ddi_dma_*_bind_handle() implementation
   can wrongly return more cookies than a driver specifies is
   the maximum that it can handle (when DDI_DMA_PARTIAL is not
   specified). This can be a very serious problem, causing silent
   data corruption. There can be a lot of corner cases depending
   on memory fragmentation, making it difficult to test for.
   Unfortunately, the fix for these bugs can expose, and has exposed,
   minor bugs in existing drivers, which could be hard to identify
   and debug. An example is the elxl driver which specified that
   it could only handle 1 cookie for the tx data buffers, but
   really could handle 2 (and was getting 2 occasionally).


 o does the problem happen on sparc?  why not?

   This problem does *not* happen on SPARC. This is a bug
   in the x86 rootnex driver. This part of the code is not
   shared with any SPARC nexus driver code.


 o what rootnex was doing wrong

   The rootnex driver would return success from a DMA bind
   operation when the cookie count was larger than the maximum
   the driver specified that it could handle, and the driver
   specified that it couldn't handle partial mappings.


 o what rootnex is now doing right

   The rootnex driver fails a DMA bind operation when
   the cookie count would be larger than the maximum
   that a driver specifies that it can handle, and the
   driver specifies that it cannot handle partial mappings.


 o what drivers were doing wrong

   The vast majority of drivers aren't doing anything wrong.
   There may be a few which are not handling the DMA bind
   operation correctly. For example, if a driver cannot
   handle partial DMAs, it should be able to handle the
   following # of cookies ((max possible bind size / page size) + 1).
   If it cannot handle that number of cookies, it must be
   able to expect and correctly handle a failed bind
   operation.


 o what they need to do right.

   If a driver cannot handle partial DMAs, it should be able
   to handle the following # of cookies
   ((max possible bind size / page size) + 1).
   If it cannot handle that number of cookies, it must be
   able to expect and correctly handle a failed bind
   operation.
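
   As a rough illustration of the cookie-count formula above, here is a
   minimal userland sketch in C. The function name worst_case_cookies is
   made up for this example and is not part of the DDI. It follows the
   formula literally and assumes cookies are split only at page
   boundaries; if the maximum bind size is not a whole number of pages,
   rounding the division up first is the more conservative choice.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a DDI routine): the number of cookies a
 * driver should be prepared for when it cannot handle partial DMAs,
 * using the formula from the text:
 *     (max possible bind size / page size) + 1
 * The "+ 1" covers a buffer that does not start on a page boundary
 * and so spills onto one extra page.
 */
static size_t
worst_case_cookies(size_t max_bind_size, size_t page_size)
{
	assert(page_size > 0);
	return (max_bind_size / page_size + 1);
}
```

   For example, with an assumed 64K maximum bind size and 4K pages,
   worst_case_cookies(65536, 4096) gives 17, which is the value sgllen
   would need to be set to.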


 o does this mean a driver which was working ok before
   could now be broken, because a driver may have been
   written to work around the rootnex bug?

   No. A driver didn't need to work around the bug. It does
   mean that a driver which was working, or appeared to
   work, before could now be broken. This could be because the
   code path for the failing bind operation has never been
   tested, or because the driver never expected the bind operation
   to fail (i.e. the driver has a bug which was never
   debugged because the bind never failed before, unless someone
   noticed memory corruption).


 o instructions for using flags

       There are two patchables, rootnex_bind_fail and rootnex_bind_warn. The
       following table explains the behavior of the new bind operation
       failure. The current default behavior is to fail the bind and
       print one warning message per major number.

 rootnex_bind_fail |  rootnex_bind_warn  | Results if sgllen < *ccount
                   |                     | && !DDI_DMA_PARTIAL
 ---------------------------------------------------------------------
        0          |          0          | behaves like code today (bind succeeds, no warning message)
        0          |          1          | bind succeeds, print one warning/major#
        1          |          0          | fails the bind, no warning message
        1          |          1          | fails the bind, print one warning/major#
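
       The table above can be read as a small decision function. The
       following is only a userland model written for illustration, not the
       rootnex code itself; bind_result_t and model_bind are names invented
       here, and the model ignores the once-per-major-number limit on the
       warning.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type for this model; not a rootnex structure. */
typedef struct bind_result {
	bool succeeded;	/* would the bind return success? */
	bool warned;	/* would a warning be eligible to print? */
} bind_result_t;

/*
 * Model of the table: bind_fail and bind_warn stand in for the
 * rootnex_bind_fail and rootnex_bind_warn patchables, sgllen is the
 * driver's declared maximum cookie count, ccount is the number of
 * cookies the bind would need, and partial_ok is true when the driver
 * passed DDI_DMA_PARTIAL.
 */
static bind_result_t
model_bind(int bind_fail, int bind_warn, unsigned sgllen,
    unsigned ccount, bool partial_ok)
{
	bind_result_t r = { true, false };

	/* Only binds that exceed sgllen without DDI_DMA_PARTIAL are affected. */
	if (sgllen < ccount && !partial_ok) {
		r.succeeded = (bind_fail == 0);
		r.warned = (bind_warn != 0);
	}
	return (r);
}
```

       With the defaults (both patchables set to 1), a driver declaring
       sgllen = 1 whose buffer needs 2 cookies sees the bind fail and a
       warning; with both set to 0 the same bind succeeds silently, as it
       did before the fix.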

       To revert to the previous behavior, which incorrectly returns
       success from these ddi_dma_*_bind_handle() operations, put the
       following in /etc/system, then reboot.
          set rootnex:rootnex_bind_fail = 0

       To disable the warning message, put the following in /etc/system
       then reboot.
          set rootnex:rootnex_bind_warn = 0

       Anyone fixing bugs found by the warning should make sure they test their
       fix *without* rootnex_bind_fail = 0 and with kmem_flags = 0x3f to ensure they
       clean up correctly after the failure (since it's likely that code path
       hasn't been tested), e.g. put the following in /etc/system:
          set rootnex:rootnex_bind_fail = 1
          set kmem_flags = 0x3f


 o so you've got a driver's source code.  how can you
   inspect the code to check for possible errors
   and how can you fix them?

   First you must understand what is the maximum possible bind
   size that your driver will see. This may not be trivial.
   For example, the ata driver handles ~64K maximum buffers
   for normal operation, but newfs goes through /dev/rdsk
   which doesn't break buffers down to ~64K buffers initially.
   Once you understand maximum possible bind size, make sure
   sgllen is set to ((max possible bind size / page size) + 1)
   and that you actually handle that many cookies. Or make
   sure you handle the fail case correctly.


 o can you run your fixed driver on earlier Solaris releases?

   Yes. A correctly written driver will run fine on both S10
   and earlier Solaris releases. This bug fix only fixes a
   problem of not correctly identifying a failure case. If a
   driver doesn't generate a failure case, or correctly
   handles the failure case, there will not be any problems.


 o What can I do if my system doesn't boot anymore?

   In the unlikely event your system doesn't boot anymore,
   you can recover it by setting a deferred breakpoint
   in rootnex_attach, clearing the rootnex_bind_fail patchable,
   and then, once you've booted, adding set rootnex:rootnex_bind_fail = 0
   to /etc/system.

   An example is provided below...

     Boot args: 

     Type    b [file-name] [boot-flags]       to boot with options
     or      i                                to enter boot interpreter
     or                                       to boot with defaults

                       <<< timeout in 5 seconds >>>

     Select (b)oot or (i)nterpreter: b kmdb -d
     Loading kmdb...

     Welcome to kmdb
     kmdb: Unable to determine terminal type: assuming `vt100'
     [0]> ::bp rootnex`rootnex_attach
     [0]> :c
     SunOS Release 5.10 Version gate:2004-10-18 32-bit
     Copyright 1983-2004 Sun Microsystems, Inc.  All rights reserved.
     Use is subject to license terms.
     Loaded modules: [ ufs unix krtld genunix specfs ]
     kmdb: stop at rootnex`rootnex_attach
     kmdb: target stopped at:
     rootnex`rootnex_attach: pushl  %ebp
     [0]> rootnex`rootnex_bind_fail?W 0
     rootnex`rootnex_bind_fail:      0x1             =       0x0
     [0]> :c
     Hostname: ...
     [hostname] console login: root
     Password: 

     [CUT]
     bfu'ed from /ws/on10-gate/archives/i386/nightly-nd on 2004-10-18
     Sun Microsystems Inc.   SunOS 5.10      s10_68  December 2004
     # echo "set rootnex:rootnex_bind_fail = 0" >> /etc/system
     #