Monday, 29 December 2014

Clearing fmadm faults and emptying the fault database on M5000 servers

On Oracle M5000 SPARC servers, the steps below are applied in order to clear the hardware faults reported by fmadm. First, the faults recorded on the server are listed as follows.

root@FENERBAH2:~$ fmadm faulty -a
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
May 30 2012     ef9dfccb-37bc-63d1-f8df-fe6031378ca8  PCIEX-8000-3S  Critical 

Host        : FENERBAH2
Platform    : SUNW,SPARC-Enterprise     Chassis_id  : GALATA2
Product_sn  :

Fault class : fault.io.pciex.device-interr max 50%
              fault.io.pciex.bus-linkerr 25%
Affects     : dev:////pci@12,600000/pci@0
              dev:////pci@12,600000
                  faulted but still in service
FRU         : "iou#1-pci#3" (hc:///component=iou#1-pci#3)
                  faulty

Description : A problem has been detected on one of the specified devices or on
              one of the specified connecting buses.
              Refer to http://sun.com/msg/PCIEX-8000-3S for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : If a plug-in card is involved check for badly-seated cards or
              bent pins. Otherwise schedule a repair procedure to replace the
              affected device(s).  Use fmadm faulty to identify the devices or
              contact Sun for support.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
May 31 2012     e4b06bcb-e8ed-6f9c-c1cc-ee8216c1ec63  SUNOS-8000-FU  Major    

Host        : FENERBAH2
Platform    : SUNW,SPARC-Enterprise     Chassis_id  : GALATA2
Product_sn  :

Fault class : defect.sunos.eft.undiag.fme
FRU         : None
                  faulty

Description : The diagnosis engine encountered telemetry for which it was
              unable to perform a diagnosis.  Refer to
              http://sun.com/msg/SUNOS-8000-FU for more information.

Response    : Error reports have been logged for examination by Sun.

Impact      : Automated diagnosis and response for these events will not occur.

Action      : Ensure that the latest Solaris Kernel and Predictive Self-Healing
              (PSH) patches are installed.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
May 30 2012     4eb0d6f7-19fe-ef16-f047-d81526351d1b  PCIEX-8000-MH  Major    

Host        : FENERBAH2
Platform    : SUNW,SPARC-Enterprise     Chassis_id  : GALATA2
Product_sn  :

Fault class : fault.io.pciex.device-interr-unaf
Affects     : dev:////pci@12,600000/pci@0
                  faulted but still in service
FRU         : "iou#1-pci#3" (hc:///component=iou#1-pci#3)
                  faulty

Description : Too many recovered errors have been detected, which indicates a
              problem with the specified PCIEX device. This may degrade into an
              unrecoverable fault.
              Refer to http://sun.com/msg/PCIEX-8000-MH for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : Schedule a repair procedure to replace the affected device.  Use
              fmadm faulty to identify the device or contact Sun for support.
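
The EVENT-ID values in this listing are what the later steps operate on. As a minimal sketch, they could be pulled out of the output with a small filter like the one below. The function name extract_event_ids is hypothetical, and the sketch assumes the EVENT-ID is the only whitespace-separated field shaped like a 36-character lowercase hex UUID. On some Solaris releases the default awk is an older implementation, so nawk may be the safer choice there.

```shell
#!/bin/sh
# Print every EVENT-ID found in `fmadm faulty -a` output read from stdin.
# Assumption: an EVENT-ID is a field of five lowercase hex groups joined by
# dashes with a total length of 36 characters.
extract_event_ids() {
    awk '{
        for (i = 1; i <= NF; i++)
            if (length($i) == 36 &&
                $i ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/)
                print $i
    }'
}
```

It would be used as `fmadm faulty -a | extract_event_ids`.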

If the underlying problems have been fixed, the faults are marked as repaired in the FMD database.

root@FENERBAH2:~$ fmadm repair 4eb0d6f7-19fe-ef16-f047-d81526351d1b
fmadm: recorded repair to 4eb0d6f7-19fe-ef16-f047-d81526351d1b
root@FENERBAH2:~$ fmadm repair e4b06bcb-e8ed-6f9c-c1cc-ee8216c1ec63
fmadm: recorded repair to e4b06bcb-e8ed-6f9c-c1cc-ee8216c1ec63
root@FENERBAH2:~$ fmadm repair ef9dfccb-37bc-63d1-f8df-fe6031378ca8
fmadm: recorded repair to ef9dfccb-37bc-63d1-f8df-fe6031378ca8
root@FENERBAH2:~$
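
When there are many faults, the same repair command can be run in a loop. The sketch below assumes a POSIX shell such as bash or ksh; repair_all and the DRY_RUN switch are hypothetical names added for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Mark every event ID read from stdin (one per line) as repaired.
# DRY_RUN is a hypothetical switch for this sketch: with the default of 1 the
# commands are only printed; set DRY_RUN=0 to actually run fmadm.
repair_all() {
    while read -r id; do
        [ -n "$id" ] || continue            # skip blank lines
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "fmadm repair $id"
        else
            fmadm repair "$id"
        fi
    done
}
```

The IDs are supplied on stdin, for example from a file containing one EVENT-ID per line.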

After the faults have been marked as repaired, the remaining steps are applied as follows. The subheadings below are kept in their original English wording.

Clear ereports and resource cache

In this step, the log files and the contents of some directories under /var/fm/fmd are deleted.

root@FENERBAH2:~$ cd /var/fm/fmd/
root@FENERBAH2:/var/fm/fmd$ ls
ckpt    errlog  fltlog  rsrc    xprt
root@FENERBAH2:/var/fm/fmd$ ls -al
total 299
drwxr-xr-x   5 root     sys            7 May 30  2012 .
drwxr-xr-x   3 root     sys            3 May 25  2012 ..
drwx------   4 root     sys            4 May 31  2012 ckpt
-rw-r--r--   1 root     root       80995 Dec 27 18:24 errlog
-rw-r--r--   1 root     root       62538 Jun  5 15:17 fltlog
drwx------   2 root     sys            7 Mar  9  2013 rsrc
drwx------   2 root     sys            2 May 25  2012 xprt
root@FENERBAH2:/var/fm/fmd$ rm e*
root@FENERBAH2:/var/fm/fmd$ ls
ckpt    fltlog  rsrc    xprt
root@FENERBAH2:/var/fm/fmd$ rm f*
root@FENERBAH2:/var/fm/fmd$ rm ckpt/eft/*
root@FENERBAH2:/var/fm/fmd$ rm rsrc/*
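
The e* and f* globs above happen to match only errlog and fltlog in this directory. A slightly more defensive sketch names the files explicitly, so that nothing else is removed if the directory contents differ. The helper clear_fmd_files is hypothetical, and its directory argument is an assumption added so it can be tried on a scratch copy first.

```shell
#!/bin/sh
# Remove the fmd logs and caches by name instead of with e*/f* globs.
# clear_fmd_files is a hypothetical helper; the directory argument defaults
# to /var/fm/fmd, the path used in the transcript.
clear_fmd_files() {
    dir=${1:-/var/fm/fmd}
    rm -f "$dir/errlog" "$dir/fltlog"        # error log and fault log
    rm -f "$dir"/ckpt/eft/* "$dir"/rsrc/*    # eft checkpoints, resource cache
}
```

Calling it with a temporary directory that mimics the layout is a cheap way to check the behaviour before pointing it at the real path.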

Clearing out FMA files with no reboot needed

The commands below show how this database can be cleared without rebooting the server.

root@FENERBAH2:/var/fm/fmd$ svcs -a |grep fmd
online         Nov_06   svc:/system/fmd:default
root@FENERBAH2:/var/fm/fmd$ svcadm disable -s svc:/system/fmd:default
root@FENERBAH2:/var/fm/fmd$ cd /var/fm/fmd/
root@FENERBAH2:/var/fm/fmd$ ls
ckpt  rsrc  xprt
root@FENERBAH2:/var/fm/fmd$ ls -al
total 15
drwxr-xr-x   5 root     sys            5 Jun  5 15:19 .
drwxr-xr-x   3 root     sys            3 May 25  2012 ..
drwx------   4 root     sys            4 May 31  2012 ckpt
drwx------   2 root     sys            2 Jun  5 15:19 rsrc
drwx------   2 root     sys            2 May 25  2012 xprt
root@FENERBAH2:/var/fm/fmd$ find /var/fm/fmd -type f -exec ls {} \;
root@FENERBAH2:/var/fm/fmd$ find /var/fm/fmd -type f -exec rm {} \;
root@FENERBAH2:/var/fm/fmd$ svcadm enable svc:/system/fmd:default
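
The order matters here: fmd is stopped, the files are removed, and fmd is started again. A minimal sketch of that sequence as one function follows. The names run, clear_fma_no_reboot, FMD_DIR and DRY_RUN are all hypothetical; with the default DRY_RUN=1 the commands are only printed, not executed.

```shell
#!/bin/sh
# Stop fmd, delete its files, start fmd again, in that order.
FMD_FMRI=svc:/system/fmd:default
FMD_DIR=${FMD_DIR:-/var/fm/fmd}

# Print the command in dry-run mode (default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

clear_fma_no_reboot() {
    # -s makes svcadm wait until the service is actually disabled.
    run svcadm disable -s "$FMD_FMRI" || return 1
    # Remove regular files only; the directory tree itself is kept.
    run find "$FMD_DIR" -type f -exec rm {} \;
    # Re-enable fmd even if the cleanup reported errors.
    run svcadm enable "$FMD_FMRI"
}
```

In dry-run mode the find line is printed without the backslash before the semicolon, because the shell has already consumed it.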

Reset the FMD serd modules

At this stage the modules are reset as follows.

root@FENERBAH2:/var/fm/fmd$ fmadm reset cpumem-diagnosis
fmadm: cpumem-diagnosis module has been reset
root@FENERBAH2:/var/fm/fmd$ fmadm reset cpumem-retire
fmadm: cpumem-retire module has been reset
root@FENERBAH2:/var/fm/fmd$ fmadm reset eft
fmadm: eft module has been reset
root@FENERBAH2:/var/fm/fmd$ fmadm reset io-retire
fmadm: io-retire module has been reset
root@FENERBAH2:/var/fm/fmd$
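
The four resets can also be written as a loop over the module names shown in the transcript. This is only a sketch: reset_modules and DRY_RUN are hypothetical names, and by default the commands are printed rather than run.

```shell
#!/bin/sh
# Reset each fmd module named in the transcript, one at a time.
# DRY_RUN is a hypothetical switch: 1 (default) prints, 0 runs fmadm.
reset_modules() {
    for module in cpumem-diagnosis cpumem-retire eft io-retire; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "fmadm reset $module"
        else
            # Report a failed reset but continue with the remaining modules.
            fmadm reset "$module" || echo "reset failed for $module" >&2
        fi
    done
}
```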

The procedure is now complete. From this point on, listing the faults with the command below prints nothing.

root@FENERBAH2:/var/fm/fmd$ fmadm faulty -a
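
For scripted checks, the "prints nothing" condition can be tested directly. The sketch below treats any non-blank line on stdin as a remaining fault. The function name no_faults_listed is hypothetical.

```shell
#!/bin/sh
# Exit 0 when stdin contains no non-blank lines, 1 otherwise.
no_faults_listed() {
    # NF is non-zero for any line with at least one field.
    awk 'NF { found = 1 } END { exit found + 0 }'
}
```

It would be used as `fmadm faulty -a | no_faults_listed && echo "no faults listed"`.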


