Re: aac(4) timeout problems + fix

To: tech@openbsd.org
Subject: Re: aac(4) timeout problems + fix
From: Brandin L Claar <blc3@only.arl.psu.edu>
Date: Fri, 4 Oct 2002 14:02:57 -0400
Cc: "Benninghoff, John" <John.Benninghoff@rbcdain.com>, Srebrenko Sehic <haver@insecure.dk>, niklas@openbsd.org
In-reply-to: <20021001184630.GA24527@hellspawn.insecure.dk>
References: <D6CE5018C5563D47A8A823409AA4F19403301A13@MAIL3.corp.isib.net> <20021001184630.GA24527@hellspawn.insecure.dk>
Reply-to: blc3@only.arl.psu.edu
Sender: owner-tech@openbsd.org
User-agent: Mutt/1.3.25i

On Tue, Oct 01, 2002 at 08:46:30PM +0200, Srebrenko Sehic wrote:
> Actually, I just got a total freeze on a 2550 with PERC 3/Di running
> 3.1-STABLE + this patch. Machine ran fine for a 1½ month, but died
> today. The console was dead, I could not ping the machine, nothing.

I haven't seen a single lockup or crash yet, and I've been running this
patch on 2 different machines (one of them since I created the patch
5 months ago).

> Someone should _really_ look into either fixing the timeout/instability
> issues with aac(4) or remove the driver from the supported list.

This is a much simpler patch (against -current), and maybe it will stand
a better chance of being committed.  This patch makes 2 changes.  First,
it lowers the value of sc->sc_link.openings to 128.  If I recall correctly,
I tried values going down from 512, halving each time.  Both 512 (the
original value) and 256 caused periodic pauses during heavy disk activity.
I have never seen a problem with 128 on my hardware.  As far as I know,
nobody who has tried my patch has ever seen a problem with 128 either.  It
could be that 255 works just as well; I have never tried it.  The right
value should be a call for somebody much more familiar with OpenBSD SCSI
internals than I am.

Second, this patch actually checks the return value of aac_start().  If
this value isn't checked, ccbs can be dequeued without anything ever
going on the fib queue.  Later, the timeout fires, causing the familiar
"timed out" message.  This patch ensures ccbs won't get lost when the
fib queue is full.  Originally, I tried a patch with just this change.
That caused the SCSI layer to spit out "not queued" errors, which is
what initially led me to lower the openings value.  The reason is that
the SCSI layer appears to temporarily shut off disk activity for a
process any time the driver refuses to handle a request immediately.
The effect is the same as controlling the speed of your car with an
on/off switch.
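
To illustrate the second change outside the driver, here is a minimal
standalone sketch.  It is not the aac(4) code: struct fib_queue,
FIB_QUEUE_SLOTS, sketch_start(), submit_unchecked() and submit_checked()
are all hypothetical names made up for the example.  The only thing taken
from the patch is the convention that the start routine returns 0 when the
command was actually queued, and that the caller reports success only in
that case.

```c
#include <stddef.h>

/* Hypothetical queue depth, kept tiny so the "full" case is easy to hit. */
#define FIB_QUEUE_SLOTS 4

/* Hypothetical stand-in for the adapter's bounded command queue. */
struct fib_queue {
	int	slots[FIB_QUEUE_SLOTS];
	size_t	used;
};

/*
 * Stand-in for the start routine: returns 0 if the command was placed
 * on the queue, nonzero if the queue was full and nothing was queued.
 */
static int
sketch_start(struct fib_queue *q, int cmd)
{
	if (q->used == FIB_QUEUE_SLOTS)
		return (1);
	q->slots[q->used++] = cmd;
	return (0);
}

/*
 * Old behaviour: the result is ignored, so the caller is told the
 * command was accepted even when it never reached the queue.  Such a
 * command can only ever end in a timeout.
 */
static int
submit_unchecked(struct fib_queue *q, int cmd)
{
	(void)sketch_start(q, cmd);
	return (1);
}

/*
 * Patched behaviour: report acceptance only if the command was really
 * queued, so the caller can retry instead of losing it.
 */
static int
submit_checked(struct fib_queue *q, int cmd)
{
	if (sketch_start(q, cmd) == 0)
		return (1);
	return (0);
}
```

With the queue already full, submit_unchecked() still returns 1 although
nothing was queued, while submit_checked() returns 0 and leaves the
decision to the caller.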

I have a Dell 2650 with an AMI MegaRAID using the ami driver.  I have 
noticed similar periodic pauses with it during heavy disk activity.
NetBSD and FreeBSD both seem to be able to support tagged queueing
without manually setting openings in the device driver.  It certainly
seems the scsi layer could handle this more gracefully.  That's just an
observation.


Index: sys/dev/ic/aac.c
===================================================================
RCS file: /home/cvsup/src/sys/dev/ic/aac.c,v
retrieving revision 1.14
diff -c -r1.14 aac.c
*** sys/dev/ic/aac.c    27 Mar 2002 15:02:59 -0000      1.14
--- sys/dev/ic/aac.c    3 Oct 2002 20:43:03 -0000
***************
*** 241,247 ****
        sc->sc_link.adapter_softc = sc;
        sc->sc_link.adapter = &aac_switch;
        sc->sc_link.device = &aac_dev;
!       sc->sc_link.openings = AAC_ADAP_NORM_CMD_ENTRIES; /* XXX optimal? */
        sc->sc_link.adapter_buswidth = AAC_MAX_CONTAINERS;
        sc->sc_link.adapter_target = AAC_MAX_CONTAINERS;

--- 241,247 ----
        sc->sc_link.adapter_softc = sc;
        sc->sc_link.adapter = &aac_switch;
        sc->sc_link.device = &aac_dev;
!       sc->sc_link.openings = 128; /* 512 caused not queued errors */
        sc->sc_link.adapter_buswidth = AAC_MAX_CONTAINERS;
        sc->sc_link.adapter_target = AAC_MAX_CONTAINERS;

***************
*** 1569,1579 ****
                    sizeof(struct aac_sg_entry);
        }

!       aac_start(ccb);
!
!       xs->error = XS_NOERROR;
!       xs->resid = 0;
!       return (1);
  }

  
/********************************************************************************
--- 1569,1580 ----
                    sizeof(struct aac_sg_entry);
        }

!       if (aac_start(ccb) == 0) {
!               xs->error = XS_NOERROR;
!               xs->resid = 0;
!               return (1);
!       }
!       return (0);
  }

  
/********************************************************************************


-- 
Brandin Claar
Assistant Research Engineer
Penn State Applied Research Lab
