Discussion:
[OmniOS-discuss] iSCSI target hang, no way to restart but server reboot
Matej Zerovnik
2015-03-27 13:07:55 UTC
Permalink
Hello!

We are having issues with iSCSI at work. Every now and then, the iSCSI
target just hangs. We are unable to kill it, restart it or do
anything else to restore the service. The only option to get the iSCSI
target back to a working state is to reboot the whole server and lose
all the sessions (around 100 clients).

The weird thing is that only the iSCSI target hangs. I can ssh to the
server and work on it without any problem; there is no load, and there is
nothing in the log files. The iSCSI target just locks up and all
connections to it drop (probably due to timeouts).

The server is an IBM 3550 M4 with dual Xeon E5-2640 CPUs and 160GB of memory.
Hard drives are mounted in a Supermicro JBOD, which is attached via an
LSI Logic SAS2308 SAS HBA.

We are using OmniOS v11 r151006.

Has anyone encountered similar trouble?
Any recommendations on what to do or how to solve this problem?

Matej
Dan McDonald
2015-03-27 14:43:39 UTC
Permalink
Post by Matej Zerovnik
Hello!
We are having issues with iSCSI on work. Every now and then, iSCSI target just hangs up. We are unable to kill it, restart it or do anything else to restore the service. The only option to restore iscsi target back to working state, is to reboot the whole server and loose all the sessions (around 100 clients).
Weird thing is, that only iscsi target hangs. I can ssh to server and work on it without any problem, there is no load or anything else, nothing in log files, just iscsi target locks up and all connections to iscsi target drop (probably timeout)
Server is a IBM 3550 M4 with dual Xeon E5-2640 CPUs and 160GB of memory.
Hard drives are mounted in a Supermicro JBOD, which is attached via SAS HBA LSI Logic SAS2308.
We are using OmniOS v11 r151006.
Anyone encounter similar troubles?
Any recomendations what to do or how to solve that problem?
I'd move to 012 or wait the short amount of time until 014 hits the streets. Then see if your problem persists.

Dan
Matej Zerovnik
2015-03-27 14:54:27 UTC
Permalink
It just happened about 2 hours ago... The whole system did not crash,
but 2 clients lost the connection.

This is what I see in logs:
Mar 27 13:55:51 storage.host.org scsi: [ID 107833 kern.notice]
/***@0,0/pci8086,***@1/pci1000,***@0 (mpt_sas0):
Mar 27 13:55:51 storage.host.org Timeout of 0 seconds expired
with 1 commands on target 68 lun 0.
Mar 27 13:55:51 storage.host.org scsi: [ID 107833 kern.warning] WARNING:
/***@0,0/pci8086,***@1/pci1000,***@0 (mpt_sas0):
Mar 27 13:55:51 storage.host.org Disconnected command timeout
for target 68 w500304800039d83d, enclosure 3
Mar 27 13:55:52 storage.host.org scsi: [ID 365881 kern.info]
/***@0,0/pci8086,***@1/pci1000,***@0 (mpt_sas0):
Mar 27 13:55:52 storage.host.org Log info 0x31140000 received
for target 68 w500304800039d83d.
Mar 27 13:55:52 storage.host.org scsi_status=0x0,
ioc_status=0x8048, scsi_state=0xc
Mar 27 15:08:31 storage.host.org iscsit: [ID 744151 kern.notice] NOTICE:
login_sm_session_bind: add new conn/sess continue
Mar 27 15:10:53 storage.host.org scsi: [ID 107833 kern.notice]
/***@0,0/pci8086,***@1/pci1000,***@0 (mpt_sas0):
Mar 27 15:10:53 storage.host.org Timeout of 0 seconds expired
with 1 commands on target 68 lun 0.
Mar 27 15:10:53 storage.host.org scsi: [ID 107833 kern.warning] WARNING:
/***@0,0/pci8086,***@1/pci1000,***@0 (mpt_sas0):
Mar 27 15:10:53 storage.host.org Disconnected command timeout
for target 68 w500304800039d83d, enclosure 3
Mar 27 15:10:54 storage.host.org scsi: [ID 365881 kern.info]
/***@0,0/pci8086,***@1/pci1000,***@0 (mpt_sas0):
Mar 27 15:10:54 storage.host.org Log info 0x31140000 received
for target 68 w500304800039d83d.
Mar 27 15:10:54 storage.host.org scsi_status=0x0,
ioc_status=0x8048, scsi_state=0xc

I read in the archives that these errors happen when you have SATA
drives on a SAS expander and one of the drives misbehaves:
A command did not complete and the mpt driver reset the target.
If that target is an expander, then everything behind the expander can
reset, resulting in the aborts of any in-flight commands, as follows...

iostat -Ei | grep Error reports that one device has 6 hard errors and 6
device not ready errors, but that is a local drive, attached to a
different controller (LSI Megaraid).

I wouldn't like to do a major upgrade, since this is a production
machine. Too scary:)

Matej
Post by Dan McDonald
Post by Matej Zerovnik
Hello!
We are having issues with iSCSI on work. Every now and then, iSCSI target just hangs up. We are unable to kill it, restart it or do anything else to restore the service. The only option to restore iscsi target back to working state, is to reboot the whole server and loose all the sessions (around 100 clients).
Weird thing is, that only iscsi target hangs. I can ssh to server and work on it without any problem, there is no load or anything else, nothing in log files, just iscsi target locks up and all connections to iscsi target drop (probably timeout)
Server is a IBM 3550 M4 with dual Xeon E5-2640 CPUs and 160GB of memory.
Hard drives are mounted in a Supermicro JBOD, which is attached via SAS HBA LSI Logic SAS2308.
We are using OmniOS v11 r151006.
Anyone encounter similar troubles?
Any recomendations what to do or how to solve that problem?
I'd move to 012 or wait the short amount of time until 014 hits the streets. Then see if your problem persists.
Dan
Dan McDonald
2015-03-27 14:56:53 UTC
Permalink
Post by Matej Zerovnik
A command did not complete and the mpt driver reset the target.
If that target is an expander, then everything behind the expander can
reset, resulting in the aborts of any in-flight commands, as follows...
You read correctly. You should not have SATA drives on a SAS expander. You are setting yourself up for failure.
Post by Matej Zerovnik
iostat -Ei | grep Error reports that one device has 6 hard errors and 6 device not ready errors, but that is a local drive, attached to a different controller (LSI Megaraid).
LSI Megaraid, ESPECIALLY with 006, is not going to be as good as either mpt_sas, or a more modern build of OmniOS (I'm hoping to get one very good change in before I close 014's illumos synching).
Post by Matej Zerovnik
I wouldn't like to do a major upgrade, since this is a production machine. Too scary:)
You should plan for it, however. SATA drives on SAS expanders is a recipe for disaster, as you're seeing.

Dan
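For reference, an OmniOS release upgrade goes into a fresh boot
environment, so the running system stays bootable and the change can be
rolled back. A minimal sketch (the r151014 publisher URL below is an
assumption; check the release notes for the correct origin):

beadm create pre-upgrade-backup       # keep the current BE around for rollback
pkg set-publisher -G '*' -g http://pkg.omniti.com/omnios/r151014/ omnios
pkg update --be-name omnios-r151014   # installs into a new boot environment
# reboot into the new BE; if it misbehaves:
#   beadm activate pre-upgrade-backup && init 6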
Matej Zerovnik
2015-03-27 15:03:19 UTC
Permalink
Post by Dan McDonald
Post by Matej Zerovnik
iostat -Ei | grep Error reports that one device has 6 hard errors and 6 device not ready errors, but that is a local drive, attached to a different controller (LSI Megaraid).
LSI Megaraid, ESPECIALLY with 006, is not going to be as good as either mpt_sas, or a more modern build of OmniOS (I'm hoping to get one very good change in before I close 014's illumos synching).
Only the rpool is on the MegaRAID; the storage is on the LSI Logic SAS2308
HBA, which I think uses the mpt_sas driver. What change do you plan on
putting in? Does it concern the mpt_sas driver?
Post by Dan McDonald
Post by Matej Zerovnik
I wouldn't like to do a major upgrade, since this is a production machine. Too scary:)
You should plan for it, however. SATA drives on SAS expanders is a recipe for disaster, as you're seeing.
Is there better support for SATA drives in newer OmniOS?

Matej
Dan McDonald
2015-03-27 15:04:57 UTC
Permalink
Post by Matej Zerovnik
Only rpool is on megaraid, the storage is on LSI Logic SAS2308 HBA, which I think is using mpt_sas driver. What change do you plan on putting it? Does it concern mpt_sas driver?
There are some mpt_sas improvements, but no amount of driver improvements can fix the failure modes caused by SATA drives in SAS expanders. You Just Can't Fix That.
Post by Dan McDonald
Post by Matej Zerovnik
I wouldn't like to do a major upgrade, since this is a production machine. Too scary:)
You should plan for it, however. SATA drives on SAS expanders is a recipe for disaster, as you're seeing.
Is there a better support for SATA drives in newer omnios?
Not when you're using them in situations that are operationally dangerous.

Were you a paying customer, we would tell you we don't support SATA drives in SAS expanders.

Sorry,
Dan
Narayan Desai
2015-03-27 15:13:42 UTC
Permalink
Having been on the receiving end of similar advice, it is a frustrating
situation to be in, since you have (and will likely continue to have) the
hardware in production, without much option for replacement.

When we had systems like this, we had a lot of success being aggressive in
swapping out disks that were showing signs of going bad, even before
critical failures occurred, and also looking at SMART statistics and
aggressively replacing anything flagged there. This made the situation manageable.
Basically, having sata drives in sas expanders means the system is brittle,
and you should treat it as such. Look for:
- errors in iostat -En
- high service times in iostat -xnz
- smartctl (this causes harmless sense messages when devices are probed,
but it is easy enough to ignore these)
- any errors reported out of lsiutil, showing either problems with
cabling/enclosures, or devices
- decode any sense errors reported by the lsi driver

Aggressively replace devices implicated by these, and hope for the best
(a rough sketch of these checks follows below). The best may or may not be
what you're hoping for, but may be livable; it was for us.
good luck
-nld
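A minimal sketch of those checks rolled into one sweep (device paths are
placeholders, some SATA disks behind SAS HBAs may need an explicit
smartctl device type, and lsiutil is interactive, so it is only noted here):

#!/bin/sh
# Rough health sweep for a SATA-behind-expander pool (illustrative only).

# 1. Per-device error counters; non-zero hard/transport errors or a
#    predictive-failure flag make a disk a replacement candidate.
iostat -En | egrep 'Errors|Predictive'

# 2. Service times; look for one device with asvc_t far above its peers.
iostat -xnz 30 2

# 3. SMART overall health per disk (device names below are placeholders).
for d in /dev/rdsk/c0t*d0s0; do
    echo "== $d"
    smartctl -H "$d"
done

# 4. lsiutil (run interactively against the HBA) can additionally show
#    per-phy link error counters, which point at cabling/enclosure trouble
#    rather than at a particular disk.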
Post by Matej Zerovnik
Only rpool is on megaraid, the storage is on LSI Logic SAS2308 HBA,
which I think is using mpt_sas driver. What change do you plan on putting
it? Does it concern mpt_sas driver?
There are some mpt_sas improvements, but no amount of driver improvements
can fix the failure modes caused by SATA drives in SAS expanders. You Just
Can't Fix That.
Post by Matej Zerovnik
Post by Dan McDonald
Post by Matej Zerovnik
I wouldn't like to do a major upgrade, since this is a production
machine. Too scary:)
Post by Matej Zerovnik
Post by Dan McDonald
You should plan for it, however. SATA drives on SAS expanders is a
recipe for disaster, as you're seeing.
Post by Matej Zerovnik
Is there a better support for SATA drives in newer omnios?
Not when you're using them in situations that are operationally dangerous.
Were you a paying customer, we would tell you we don't support SATA drives
in SAS expanders.
Sorry,
Dan
Dave Pooser
2015-03-28 03:51:17 UTC
Permalink
Post by Narayan Desai
Having been on the receiving end of similar advice, it is a frustrating
situation to be in, since you have (and will likely continue to have) the
hardware in production, without much option for replacement.
When we had systems like this, we had a lot of success being aggressive in
swapping out disks that were showing signs of going bad, even before
critical failures occurred. Also looking at SMART statistics, and
aggressively replacing those as well.
<snip>
Post by Narayan Desai
Aggressively replace devices implicated by these, and hope for the best.
The best may or may not be what you're hoping for, but may be livable; it
was for us.
Also bear in mind it's entirely possible to mix SAS and SATA drives in the
same enclosure and even the same vdev -- so as you're aggressively
replacing SATA drives, replace them with SAS drives and your system will
become less brittle. Assuming you're using enterprise SATA drives, their
SAS siblings are not much more expensive (often about a $20 difference) and
the reliability gains will be significant.
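For what it's worth, the swap itself is a routine in-place replace once the
SAS disk is seated in the bay; the pool and device names here are
placeholders:

# pull the failing SATA disk, seat the SAS disk in the same bay, then:
zpool replace tank c5t50014EE0AAAAAAAAd0 c5t5000C500BBBBBBBBd0
zpool status -v tank   # let the resilver finish before swapping the next one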
--
Dave Pooser
Cat-Herder-in-Chief, Pooserville.com
Matej Zerovnik
2015-04-10 10:11:15 UTC
Permalink
On Wednesday, the server crashed again. We switched to a new server (same
model, xServer 3550 M4), installed OmniOS r151014 and updated the LSI
firmware from P15 to P19.

So far, everything is humming along nicely, and there are no more errors in
the logs (the errors were from the SAS expander and not from a particular
drive, at least according to the target number).

New SAS drives are on order, since we want to go HA as well.

Thanks everyone for help and answers, Matej
Post by Dave Pooser
Post by Narayan Desai
Having been on the receiving end of similar advice, it is a frustrating
situation to be in, since you have (and will likely continue to have) the
hardware in production, without much option for replacement.
When we had systems like this, we had a lot of success being aggressive in
swapping out disks that were showing signs of going bad, even before
critical failures occurred. Also looking at SMART statistics, and
aggressively replacing those as well.
<snip>
Post by Narayan Desai
Aggressively replace devices implicated by these, and hope for the best.
The best may or may not be what you're hoping for, but may be livable; it
was for us.
Also bear in mind it's entirely possible to mix SAS and SATA drives in the
same enclosure and even the same vdev-- so as you're aggressively
replacing SATA drives replace them with SAS drives and your system will
become less brittle. Assuming you're using enterprise SATA drives, their
SAS siblings are not much more expensive (often about $20 difference) and
the reliability gains will be significant.
Matej Zerovnik
2015-03-31 12:08:01 UTC
Permalink
Post by Narayan Desai
Having been on the receiving end of similar advice, it is a
frustrating situation to be in, since you have (and will likely
continue to have) the hardware in production, without much option for
replacement.
When we had systems like this, we had a lot of success being
aggressive in swapping out disks that were showing signs of going bad,
even before critical failures occurred. Also looking at SMART
statistics, and aggressively replacing those as well. This made the
situation manageable. Basically, having sata drives in sas expanders
- errors in iostat -En
- high service times in iostat -xnz
- smartctl (this causes harmless sense messages when devices are
probed, but it is easy enough to ignore these)
- any errors reported out of lsiutil, showing either problems with
cabling/enclosures, or devices
- decode any sense errors reported by the lsi driver
Aggressively replace devices implicated by these, and hope for the
best. The best may or may not be what you're hoping for, but may be
livable; it was for us.
When the errors happened to you, were you able to use the pool itself while
only the iSCSI target froze, or did you have trouble with the pool itself
as well?

Because on our end, when the iSCSI target freezes, the zpool is perfectly
OK. We can access it and use it locally, but the iSCSI target is frozen and
can't be restarted.
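For reference, the restart attempts look roughly like this (the standard
SMF and COMSTAR commands; none of them bring a wedged target back, since
the target state lives in the iscsit/stmf kernel modules rather than in a
user process that could simply be killed):

svcs -xv svc:/network/iscsi/target:default    # does SMF consider the service healthy?
svcadm restart network/iscsi/target
svcadm disable -st network/iscsi/target && svcadm enable network/iscsi/target
stmfadm list-target -v                        # COMSTAR's view of targets and sessions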

I will check my system with iostat and smartctl, but we are using
Seagate drives, so some of the smartctl stats look useless at first sight :)

Matej
Narayan Desai
2015-03-31 12:54:37 UTC
Permalink
We were primarily using the machines for serving iscsi to VMs, and we'd see
bad cascading failures (iscsi lun timeouts would cause the watchdog to kick
in on the linux hosts, resetting the initiator, meanwhile the VM would
decide that the virtio devices in the VM were dead, requiring a client
reboot). In some cases, the problems would happen across all luns; in
others it would be just particular luns. I assume this followed the
severity of the situation with the failing drive (or the number of failing
drives, before we got aggressive about replacement). Similarly, we'd see a
range of behaviors with local pool commands, ranging from everything
looking alright to zpool commands hanging or running *extremely* slowly.

I'd hacked up some quick scripts to correlate info from the different
sources. They are here:
https://github.com/narayandesai/diy-lsi
They may or may not be portable, but demonstrate all of the info gathering
methods we found useful. Another thing that was useful was maintaining a
pool inventory (stored somewhere else) with device addresses, serial
numbers, and jbod bay mappings. Having to map that out while things are
falling apart is seriously sad times.
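A minimal sketch of capturing that inventory while everything is still
healthy (the off-box destination path and the sas2ircu controller index are
assumptions; adapt to whatever tooling you have):

#!/bin/sh
# Snapshot pool layout, per-disk serials and enclosure/slot mappings to
# somewhere that is not this server.
DEST=/net/backuphost/inventory/$(hostname)-$(date +%Y%m%d)   # placeholder path
mkdir -p "$DEST"
zpool status -v    > "$DEST/zpool-status.txt"
iostat -En         > "$DEST/iostat-En.txt"     # vendor, model and serial per device
sas2ircu 0 DISPLAY > "$DEST/sas2ircu-0.txt"    # enclosure/slot and WWN per bay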

fwiw, you might still be ok with seagate drives; we were only using the
self-check predictive failure flag, as opposed to anything more
complicated.
good luck
-nld
Post by Matej Zerovnik
Post by Narayan Desai
Having been on the receiving end of similar advice, it is a frustrating
situation to be in, since you have (and will likely continue to have) the
hardware in production, without much option for replacement.
When we had systems like this, we had a lot of success being aggressive
in swapping out disks that were showing signs of going bad, even before
critical failures occurred. Also looking at SMART statistics, and
aggressively replacing those as well. This made the situation manageable.
Basically, having sata drives in sas expanders means the system is brittle,
- errors in iostat -En
- high service times in iostat -xnz
- smartctl (this causes harmless sense messages when devices are probed,
but it is easy enough to ignore these)
- any errors reported out of lsiutil, showing either problems with
cabling/enclosures, or devices
- decode any sense errors reported by the lsi driver
Aggressively replace devices implicated by these, and hope for the best.
The best may or may not be what you're hoping for, but may be livable; it
was for us.
When errors happened to you, were you able to use the pool itself and
only iscsi target froze or did you have troubles with the pool itself as
well...
Because on our end, when iscsi target freezes, zpool is perfectly ok. We
can access it and use it locally, but iscsi target is frozen and can't be
restarted.
I will check my sistem with iostat and smartctl, but we are using seagate
drives, so some of the smartctl stats are useless on 1st sight:)
Matej