Missing disk on AIX



AIX LVM marks a disk as "missing" when it cannot confirm that the disk belongs to the volume group it is defined in. This note describes ways to diagnose the problem and possible solutions for returning the disk to an active state.

One main symptom is that the lsvg command shows the disk in a "missing" state:

# lsvg -p datavg

datavg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk5            active            646         486         130..00..98..129..129
hdisk6            missing           646         6           00..00..00..00..06
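As a rough sketch, this check can be scripted by filtering the lsvg -p output for any physical volume whose state is not "active". The function name list_inactive_pvs is hypothetical; it reads lsvg -p output on standard input and assumes the column layout shown above.

```shell
# Hypothetical helper: print "name state" for every PV that is not "active".
# Assumes input in the "lsvg -p" layout shown above:
# a "vgname:" line, a PV_NAME header line, then one row per physical volume.
list_inactive_pvs() {
    awk '$1 != "PV_NAME" && NF >= 2 && $2 != "active" { print $1, $2 }'
}

# Usage on a live system (the volume group name is only an example):
#   lsvg -p datavg | list_inactive_pvs
```

No output means every physical volume in the group was reported as active.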


Diagnosing the problem

Start by gathering the following data.
All of these steps can be performed by a non-root user.

* Are there any disk errors for this physical volume in the system error report?
$ errpt | less
or for more in-depth error information
$ errpt -a | less

* Is the disk marked in an "Available" state in lsdev output?
$ lsdev -Cc disk -l hdiskX

* Does the disk show up in lspv?
$ lspv

* Is the disk in an "active" state in lsvg?
$ lsvg -p VGNAME

Resolving the problem

Try to read the PVID directly off the drive. This technique uses a lower-level command that bypasses the ODM and will print out information recorded on the disk. You will need to be root or su'ed to root to run this command and many of the ones that follow.

# lquerypv -h /dev/hdiskX 80 10
00000080   00050A85 A9B17061 00000000 00000000  |......pa........|

In the above output the first column is the offset, and the PVID is the 2nd and 3rd columns combined (here 00050A85A9B17061).
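To make the comparison less error-prone, the two values can be pulled out with small filters. This is only a sketch: the function names are hypothetical, both read command output on standard input, and the lspv layout (disk name, PVID, volume group) is assumed to match the lspv sample shown later in this note.

```shell
# Hypothetical helper: join columns 2 and 3 of the offset-0x80 line printed by
# "lquerypv -h /dev/hdiskX 80 10" into one lowercase PVID string.
pvid_from_lquerypv() {
    awk '$1 == "00000080" { print tolower($2 $3); exit }'
}

# Hypothetical helper: print the PVID that lspv lists for the disk named in $1.
# Prints "none" when lspv shows no PVID for that disk.
pvid_from_lspv() {
    awk -v d="$1" '$1 == d { print tolower($2); exit }'
}

# Usage on a live system, as root (hdisk6 is only an example):
#   on_disk=$(lquerypv -h /dev/hdisk6 80 10 | pvid_from_lquerypv)
#   in_odm=$(lspv | pvid_from_lspv hdisk6)
#   [ "$on_disk" = "$in_odm" ] && echo "PVIDs match" || echo "PVIDs differ"
```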

Does the PVID returned match the output from lspv?

1. If yes, there may have been a temporary problem accessing the disk.

1a. Try running:

# varyonvg VGNAME

which should force the volume group to go out and probe all physical volumes belonging to it.

1b. If varyonvg does not go out and find the disk, we may have to force it into an available state so LVM will check it. To do this use:

# chpv -va hdiskX

After this you may need to run varyonvg again.


2. If a PVID is returned but does not match what lspv or the ODM show, then does it exist in the VGDA?

The PVIDs in the VGDA are most easily viewed using lqueryvg:

# lqueryvg -Ptp hdiskX
( -P will list the PVIDs only )
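The lookup can be scripted as a simple search for the on-disk PVID in the lqueryvg output. This is a minimal sketch: pvid_in_vgda is a hypothetical name, and the only assumption about the lqueryvg output format is that each PVID appears as a hex string somewhere on a line.

```shell
# Hypothetical helper: succeed (exit 0) if the 16-digit PVID given as $1
# appears in the "lqueryvg -Ptp hdiskX" output supplied on standard input.
# Exit 1 means not found; exit 2 means $1 is not a 16-digit hex string.
pvid_in_vgda() {
    pvid=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    case "$pvid" in
        *[!0-9a-f]*|"") return 2 ;;
    esac
    [ "${#pvid}" -eq 16 ] || return 2
    # Compare case-insensitively by lowercasing the command output too.
    tr 'A-F' 'a-f' | grep -q -- "$pvid"
}

# Usage on a live system, as root (disk name and PVID are only examples):
#   lqueryvg -Ptp hdisk6 | pvid_in_vgda 00050A85A9B17061 && echo "PVID is in the VGDA"
```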

If the PVID on the drive is in the VGDA, but not in the ODM, the ODM can be updated by forcing a re-read of the PVID from the drive using the chdev command.

Do NOT run this command unless you have verified that there is a PVID on the drive AND that the same PVID is in the VGDA on that drive.

# chdev -a pv=yes -l hdiskX

After this, check that the physical volume shows up with no errors:

# lspv
# lsvg -p VGNAME


3. If the PVID on disk does not exist in the VGDA, a new PVID can be written to the drive and ODM, and the VGDA updated with that new PVID.

      NOTE: You should be suspicious that this may not be the proper disk for this volume group. For example if a LUN was unmapped and then a different one remapped accidentally, that LUN may belong to a completely different volume group!

      You can check this using lqueryvg:

      # lqueryvg -Atp hdiskX

      Run this against the disk in question and against a known good disk in the volume group. Compare PVIDs, logical volume names, etc. to ensure it really belongs to the same volume group. If it does not, do not proceed with the steps below.

The volume group will be removed and re-imported using recreatevg.

3a. First get a list of all disks that are part of this volume group

# lsvg -p VGNAME

3b. The next steps will require that all logical volumes in the volume group be closed, so unmount any filesystems and stop any applications that are using raw logical volumes.

3c. Now remove the volume group:

# varyoffvg VGNAME
# exportvg VGNAME

3d. And bring it back in using recreatevg. In this instance we DO NOT want recreatevg to add the default prefixes onto the logical volume names and filesystem mount points, so we add flags to prevent that.

When using recreatevg in this manner, it is IMPORTANT to list ALL disks belonging to this volume group. Unlike importvg, recreatevg needs a complete list of physical volumes in order to completely import the volume group and all logical volumes. The exception to this is when "-f" is used.

# recreatevg -L / -Y NA -y VGNAME hdiskX hdiskY hdiskZ

This will write new PVIDs to all drives listed on the command-line and update the VGDA with those PVIDs. It will also import and vary on the volume group.
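Because the disk list from step 3a has to be complete, it can help to generate the recreatevg command line from saved lsvg -p output rather than typing it by hand. The sketch below only prints the command for review and does not run it; build_recreatevg_cmd is a hypothetical name, and the lsvg -p output must be captured before the volume group is exported, since lsvg can no longer report on it afterwards.

```shell
# Hypothetical helper: print (do not execute) the recreatevg command from step 3d.
# $1 = volume group name; stdin = saved output of "lsvg -p VGNAME" from step 3a.
build_recreatevg_cmd() {
    vg=$1
    # Collect the first column of every PV row, skipping the "vgname:" line
    # (a single field) and the PV_NAME header line.
    disks=$(awk '$1 != "PV_NAME" && NF >= 2 { printf "%s%s", sep, $1; sep = " " }')
    [ -n "$disks" ] || return 1
    printf 'recreatevg -L / -Y NA -y %s %s\n' "$vg" "$disks"
}

# Usage (the file name is only an example):
#   lsvg -p datavg > /tmp/datavg.pvs        # step 3a, before varyoffvg/exportvg
#   build_recreatevg_cmd datavg < /tmp/datavg.pvs
```

Review the printed command against the saved disk list before running it as root.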


4. If no PVID is returned at all, or the lqueryvg command hangs, then there is a disk problem. No LVM commands will fix this issue. Contact the team that supports the disk type being used and have them find a solution to the problem.

Even a brand-new drive, or one completely clean of any LVM information should return with either a PVID or all zeroes:

# lspv | grep hdisk7
hdisk7          none                                None

# lquerypv -h /dev/hdisk7 80 10
00000080   00000000 00000000 00000000 00000000  |................|
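The possible results of reading the PVID area (a PVID, all zeroes, or nothing at all) can be told apart with a small filter. This is a sketch under the assumption that the lquerypv output looks like the samples above; classify_lquerypv is a hypothetical name. A command that hangs never produces output, so that case still has to be caught by the operator.

```shell
# Hypothetical helper: classify the output of "lquerypv -h /dev/hdiskX 80 10"
# supplied on standard input.
#   "pvid <id>"   a PVID is recorded on the disk
#   "no-pvid"     the PVID area is all zeroes (clean or brand-new disk)
#   "unreadable"  no offset-0x80 line was returned, which points to a disk problem
classify_lquerypv() {
    awk '
        $1 == "00000080" { found = 1; id = $2 $3 }
        END {
            if (!found)            print "unreadable"
            else if (id ~ /^0+$/)  print "no-pvid"
            else                   print "pvid " tolower(id)
        }
    '
}

# Usage on a live system, as root (hdisk7 is only an example):
#   lquerypv -h /dev/hdisk7 80 10 | classify_lquerypv
```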


Examples of errors that may accompany this issue.

This list is not complete, but may help you to identify the source of the problem. Many of these errors have been seen in conjunction with a disk marked as "missing".

LVM Errors you may see in errpt:

LABEL: LVM_SA_STALEPP
IDENTIFIER: EAA3D429
Description: PHYSICAL PARTITION MARKED STALE

LABEL: LVM_SA_PVMISS
IDENTIFIER: F7DDA124
Description: PHYSICAL VOLUME DECLARED MISSING

LABEL: LVM_SA_WRTERR
IDENTIFIER: 52715FA5
Description: FAILED TO WRITE VOLUME GROUP STATUS AREA

LABEL: LVM_IO_FAIL
IDENTIFIER: E86653C3
Description: I/O ERROR DETECTED BY LVM

LABEL: LVM_QUORUMNOQUORUM
IDENTIFIER: 5BEAD71B
Description: Activation of a no quorum volume group without 100% of the disks.

LABEL: LVM_MISSPVADDED
IDENTIFIER: 26120107
Description: PHYSICAL VOLUME DEFINED AS MISSING


Filesystem errors you may see in errpt:

LABEL: J2_METADATA_EIO
IDENTIFIER: 78ABDDEB
Description: META-DATA I/O ERROR

LABEL: J2_FSCK_REQUIRED
IDENTIFIER: B6DB68E0
Description: FILE SYSTEM RECOVERY REQUIRED

LABEL: J2_LOG_EIO
IDENTIFIER: C1348779
Description: LOG I/O ERROR


Disk errors that may be seen during this problem:

LABEL: SC_DISK_ERR1
IDENTIFIER: 747725D9

LABEL: SC_DISK_ERR2
IDENTIFIER: B6267342

LABEL: SC_DISK_ERR7
IDENTIFIER: DE3B8540
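To scan the error report for just the entries listed in this note, the summary output of errpt can be filtered on its first column. This is a sketch: related_errpt_entries is a hypothetical name, the identifier list is copied from the lists above, and it assumes the default errpt summary layout in which the identifier is the first field on each line.

```shell
# Hypothetical helper: keep only errpt summary lines whose identifier
# (first field) is one of the identifiers listed in this note.
related_errpt_entries() {
    awk '
        BEGIN {
            n = split("EAA3D429 F7DDA124 52715FA5 E86653C3 5BEAD71B 26120107 " \
                      "78ABDDEB B6DB68E0 C1348779 747725D9 B6267342 DE3B8540", ids, " ")
            for (i = 1; i <= n; i++) known[ids[i]] = 1
        }
        ($1 in known)
    '
}

# Usage on a live system:
#   errpt | related_errpt_entries
```

Since the lists in this note are not complete, an empty result does not rule out a disk or LVM problem.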
