Storage Cell MS service failed to start after reboot - RS-7445 [Serv MS is absent] [It will not be restarted]

RS-7445 [Serv MS is absent] [It will not be restarted] [] [] [] [] [] [] [] [] [] []


 2021-11-18T09:19:32.879857+01:00

RS version=19.3.13.0.0,label=OSS_19.3.13.0.0_LINUX.X64_201022,Thu_Oct_22_23:43:28_PDT_2020

[RS] Started Service RS_MAIN with pid 58411

[RS] Kill previous monitoring processes for RS_BACKUP, MS and CELLSRV

2021-11-18T09:19:32.991137+01:00

[RS] Started monitoring process /opt/oracle/cell/cellsrv/bin/cellrsbmt with pid 58424

2021-11-18T09:19:33.057173+01:00

RSBK version=19.3.13.0.0,label=OSS_19.3.13.0.0_LINUX.X64_201022,Thu_Oct_22_23:43:28_PDT_2020

[RS] Started Service RS_BACKUP with pid 58425

[RS] Kill previous monitoring process for core RS

2021-11-18T09:19:33.159572+01:00

[RS] Started monitoring process /opt/oracle/cell/cellsrv/bin/cellrssmt with pid 58429

2021-11-18T09:19:42.312339+01:00

[RS] Started monitoring process /opt/oracle/cell/cellsrv/bin/cellrsmmt with pid 58623

2021-11-18T09:21:53.198369+01:00

[RS] Start service MS failed with error: -74.

2021-11-18T09:21:53.238382+01:00

[RS] Monitoring process /opt/oracle/cell/cellsrv/bin/cellrsmmt (pid: 58623, srvc_pid: 58675) returned with error: 162

2021-11-18T09:21:53.238610+01:00

[RS] Service MS with pid 58675 is no longer present

Errors in file /opt/oracle/cell/log/diag/asm/cell/StorageCell0013-man/trace/rstrc_58411_mmt.trc (incident=25):

RS-7445 [Serv MS is absent] [It will not be restarted] [] [] [] [] [] [] [] [] [] []

Incident details in: /opt/oracle/cell/log/diag/asm/cell/StorageCell0013-man/incident/incdir_25/rstrc_58411_mmt_i25.trc
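To pull the relevant failure lines out of a long alert log, a filter like the following can help. This is only a sketch: `rs_failures` is a hypothetical helper name, it assumes the alert log layout shown above (a timestamp line followed by message lines), and the path in the usage comment is a placeholder that must be adjusted.

```shell
# Print each RS failure line prefixed with the timestamp line that precedes it.
# Reads alert-log text on stdin. rs_failures is a hypothetical helper name.
rs_failures() {
  awk '
    # Remember the most recent timestamp line (e.g. 2021-11-18T09:21:53...).
    /^ *[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T/ { ts = $1; next }
    # Print the failure messages seen in this incident.
    /failed with error|RS-7445|is no longer present/ { print ts "  " $0 }
  '
}

# Usage (placeholder path, adjust to your cell):
# rs_failures < /path/to/cell/trace/alert.log
```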


---

[root@StorageCell0013-man ~]# imageinfo


Kernel version: 4.14.35-1902.306.2.1.el7uek.x86_64 #2 SMP Wed Oct 21 20:57:15 PDT 2020 x86_64

Cell version: OSS_19.3.13.0.0_LINUX.X64_201022

Cell rpm version: cell-19.3.13.0.0_LINUX.X64_201022-1.x86_64


Active image version: 19.3.13.0.0.201022

Active image kernel version: 4.14.35-1902.306.2.1.el7uek

Active image activated: 2021-03-31 09:39:20 +0200

Active image status: success

Active node type: STORAGE

Active system partition on device: /dev/md24p6

Active software partition on device: /dev/md24p8


Cell boot usb partition: not found


mount: special device /dev/md6 does not exist

Inactive image version: undefined

Rollback to the inactive partitions: Impossible

---

1. Please provide the below details for analysis.

++ Date/Time of crash and a summary of events leading up to the crash.

last reboot

uname -a

Linux StorageCell0013-man.dbaas.ing.net 4.14.35-1902.306.2.1.el7uek.x86_64 #2 SMP Wed Oct 21 20:57:15 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux

uptime

10:59:11 up 4 days, 4:10, 1 user, load average: 0.08, 0.09, 0.09


++ List of Crashed nodes

++ Exadata Machine Type X8-2

++ How many compute nodes/cell nodes? 6 (X6 CN)

++ (Full / Half / Quarter Rack / One eighth): Full / upgraded from X6


++ Storage server image version (# Imageinfo):

Kernel version: 4.14.35-1902.306.2.1.el7uek.x86_64 #2 SMP Wed Oct 21 20:57:15 PDT 2020 x86_64

Cell version: OSS_19.3.13.0.0_LINUX.X64_201022

Cell rpm version: cell-19.3.13.0.0_LINUX.X64_201022-1.x86_64

Active image version: 19.3.13.0.0.201022

Active image kernel version: 4.14.35-1902.306.2.1.el7uek

Active image activated: 2021-03-31 09:39:20 +0200

Active image status: success

Active node type: STORAGE

Active system partition on device: /dev/md24p6

Active software partition on device: /dev/md24p8

Cell boot usb partition: not found

Inactive image version: undefined

Rollback to the inactive partitions: Impossible


++ Compute node image version (# Imageinfo) :

Kernel version: 4.14.35-1902.306.2.1.el7uek.x86_64 #2 SMP Wed Oct 21 20:57:15 PDT 2020 x86_64

Image kernel version: 4.14.35-1902.306.2.1.el7uek

Image version: 19.3.13.0.0.201022

Image activated: 2021-07-17 21:45:47 +0200

Image status: success

Node type: COMPUTE

System partition on device: /dev/mapper/VGExaDb-LVDbSys1

++ RDBMS version:

/u01/app/oracle/product/12.2.0.1/dbhome_200415

/u01/app/oracle/product/19.7.0.0/dbhome_2

/u01/app/oracle/product/19.7.0.0/dbhome_4

++ Grid Home version:

19.7.0.0

++ Bare metal or OVM:

Bare metal

++ On premises or Cloud OCI/OCI2:

On premises

---

Solution 

Jump to table of contents

Dump continued from file: /opt/oracle/cell/log/diag/asm/cell/StorageCell0013-man/trac

[TOC00001]

RS-7445 [Serv MS is absent] [It will not be restarted] [] [] [] [] [] [] [] [] [

[TOC00001-END]

2021-11-18 09:21:53.097 :000023C6: Failed to heartbeat MS (port: 5043 timeout: 60 sec)

2021-11-18 09:21:53.097 :000023C7: socket open error: Port no: 8888. Received errorno 111. Connection refused

2021-11-18 09:21:53.197 :000023C8: mon_proc_pid oldpid: 58675

2021-11-18 09:21:53.197 :000023C9: pid 58675 has disappeared

2021-11-18 09:21:53.197 :000023CA: start service MS failed with error: -74.

2021-11-18 09:21:53.198 :000023CB: Error : start service failed
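The trace shows errno 111 (connection refused) on port 8888, which is the port the WLST sessions in the redeploy output further down use to reach msServer. A quick way to check whether anything accepts TCP connections there is sketched below. `port_open` is a hypothetical helper, and it assumes `bash` and the coreutils `timeout` command are available on the cell.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# port_open is a hypothetical helper; it relies on the bash /dev/tcp feature.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage, with the two ports named in the trace:
# port_open localhost 8888 && echo "8888 open" || echo "8888 refused or unreachable"
# port_open localhost 5043 && echo "5043 open" || echo "5043 refused or unreachable"
```

A refused connection here only confirms that nothing is listening; it does not say why MS failed to come up.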

First, make sure no Java (MS) server processes are still running:

ps -ef|grep java

ps -ef|grep -e msServer
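Note that a plain `ps -ef|grep java` also lists the `grep` process itself, which is easy to misread as a running Java server. A small variant that avoids this is sketched below; `java_procs` is a hypothetical helper name, and the bracket trick simply keeps the pattern from matching grep's own command line.

```shell
# Keep only lines that mention java or msServer.
# The [j] / [m] brackets stop the pattern from matching this grep itself.
# java_procs is a hypothetical helper name.
java_procs() {
  grep -e '[j]ava' -e '[m]sServer'
}

# Usage:
# if ps -ef | java_procs; then
#   echo "Java processes still running"
# else
#   echo "No Java processes found"
# fi
```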

If none are running, redeploy MS with the script below and confirm the result:

/opt/oracle/cell/cellsrv/deploy/scripts/unix/setup_dynamicDeploy



===================

[root@StorageCell0013-man unix]# pwd

/opt/oracle/cell/cellsrv/deploy/scripts/unix

[root@StorageCell0013-man unix]# ls

celladmin_create.sh  cell_env.csh       cell_limits.sh   common_image_install_func  exadata-capacity-on-demand          freespace.sh                 install_util_lib.sh  permissions_check.py  swupdate

celld                cell_env.sh        cell_updown.sh   common_imag_install.sh     exadata-capacity-on-demand.service  hwadapter                    migrate.sh           rs

celld.service        cellfixfsperms.sh  collect_jmap.sh  exa_config_parser          freeall.sh                          install_model_properties.sh  mscore               setup_dynamicDeploy

[root@StorageCell0013-man unix]# sh setup_dynamicDeploy

unzipping wls

CLASSPATH=/usr/java/default/lib/tools.jar:/opt/oracle/cell19.3.13.0.0_LINUX.X64_201022/cellsrv/deploy/wls/wlserver_12.2/wlserver/modules/features/wlst.wls.classpath.jar:


PATH=/opt/oracle/cell19.3.13.0.0_LINUX.X64_201022/cellsrv/deploy/wls/wlserver_12.2/wlserver/server/bin:/opt/oracle/cell19.3.13.0.0_LINUX.X64_201022/cellsrv/deploy/wls/wlserver_12.2/wlserver/../oracle_common/modules/thirdparty/org.apache.ant/1.9.8.0.0/apache-ant-1.9.8/bin:/usr/java/default/jre/bin:/usr/java/default/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/oracle/cell19.3.13.0.0_LINUX.X64_201022/cellsrv/bin:/sbin:/usr/sbin:/opt/MegaRAID/storcli/:/root/bin:/opt/oracle/cell19.3.13.0.0_LINUX.X64_201022/cellsrv/deploy/wls/wlserver_12.2/wlserver/../oracle_common/modules/org.apache.maven_3.2.5/bin


Your environment has been set.


Initializing WebLogic Scripting Tool (WLST) ...


Welcome to WebLogic Server Administration Scripting Shell


Type help() for help on available commands




Exiting WebLogic Scripting Tool.


/opt/oracle/cell19.3.13.0.0_LINUX.X64_201022/cellsrv/deploy/wls/wlserver_12.2/wlserver/server/bin

subject= /CN=localhost/OU=Oracle Exadata/O=Oracle Corporation/L=Redwood City/ST=California/C=US

RSA key ok

Successfully verified old security identity and certificates.

Generating a 2048 bit RSA private key

.....................................................................+++

.........................+++

writing new private key to '/opt/oracle/cell19.3.13.0.0_LINUX.X64_201022/cellsrv/deploy/config/security/key.original.pem'

-----

sleep until wls is ready ...


Initializing WebLogic Scripting Tool (WLST) ...


Welcome to WebLogic Server Administration Scripting Shell


Type help() for help on available commands


Connecting to t3://localhost:8888 with userid weblogic ...

Successfully connected to Admin Server "msServer" that belongs to domain "msdomain".


Warning: An insecure protocol was used to connect to the server.

To ensure on-the-wire security, the SSL port or Admin port should be used instead.


Location changed to edit tree.

This is a writable tree with DomainMBean as the root.

To make changes you will need to start an edit session via startEdit().

For more help, use help('edit').


Starting an edit session ...

Started edit session, be sure to save and activate your changes once you are done.

Activating all your changes, this may take a while ...

The edit lock associated with this edit session is released once the activation is completed.


The following non-dynamic attribute(s) have been changed on MBeans

that require server re-start:

MBean Changed : Security:Name=myrealmMSUserAuthenticator

Attributes changed : ControlFlag


MBean Changed : Security:Name=myrealm

Attributes changed : AuthenticationProviders


Activation completed


weblogic.Deployer invoked with options:  -verbose -name MS -source /opt/oracle/cell19.3.13.0.0_LINUX.X64_201022/cellsrv/lib/MS.war -targets msServer -user weblogic -adminURL t3://localhost:8888 -deploy

<Nov 23, 2021 1:50:41 PM CET> <Info> <J2EE Deployment SPI> <BEA-260121> <Initiating deploy operation for application, MS [archive: /opt/oracle/cell19.3.13.0.0_LINUX.X64_201022/cellsrv/lib/MS.war], to msServer .>

Task 0 initiated: [Deployer:149026]deploy application MS on msServer.

Task 0 completed: [Deployer:149026]deploy application MS on msServer.

Target state: deploy completed on Server msServer

java.lang.Exception: [Deployer:149169]Requires server restart for completion.



Target Assignments:

+ MS  msServer


Initializing WebLogic Scripting Tool (WLST) ...


Welcome to WebLogic Server Administration Scripting Shell


Type help() for help on available commands


Connecting to t3://localhost:8888 with userid weblogic ...

Successfully connected to Admin Server "msServer" that belongs to domain "msdomain".


Warning: An insecure protocol was used to connect to the server.

To ensure on-the-wire security, the SSL port or Admin port should be used instead.


Location changed to edit tree.

This is a writable tree with DomainMBean as the root.

To make changes you will need to start an edit session via startEdit().

For more help, use help('edit').


Starting an edit session ...

Started edit session, be sure to save and activate your changes once you are done.

Saving all your changes ...

Saved all your changes successfully.

Activating all your changes, this may take a while ...

The edit lock associated with this edit session is released once the activation is completed.

Activation completed


Initializing WebLogic Scripting Tool (WLST) ...


Welcome to WebLogic Server Administration Scripting Shell


Type help() for help on available commands


Connecting to t3://localhost:8888 with userid weblogic ...

Successfully connected to Admin Server "msServer" that belongs to domain "msdomain".


Warning: An insecure protocol was used to connect to the server.

To ensure on-the-wire security, the SSL port or Admin port should be used instead.


Location changed to edit tree.

This is a writable tree with DomainMBean as the root.

To make changes you will need to start an edit session via startEdit().

For more help, use help('edit').


Starting an edit session ...

Started edit session, be sure to save and activate your changes once you are done.

Saving all your changes ...

Saved all your changes successfully.

Activating all your changes, this may take a while ...

The edit lock associated with this edit session is released once the activation is completed.

Activation completed

0

[root@StorageCell0013-man unix]#


[root@StorageCell0013-man unix]# service celld status

Redirecting to /bin/systemctl status celld.service

● celld.service - celld

   Loaded: loaded (/etc/systemd/system/celld.service; enabled; vendor preset: disabled)

   Active: inactive (dead) since Thu 2021-11-18 09:18:29 CET; 5 days ago

 Main PID: 33770 (code=exited, status=0/SUCCESS)


Nov 18 06:53:02 StorageCell0013-man.dbaas.ing.net celld[33770]: Starting MS services...

Nov 18 06:55:15 StorageCell0013-man.dbaas.ing.net celld[33770]: The STARTUP of MS services was not successful.

Nov 18 06:55:15 StorageCell0013-man.dbaas.ing.net celld[33770]: CELL-01554: MS startup failed for unknown reasons.

Nov 18 06:55:15 StorageCell0013-man.dbaas.ing.net celld[33770]: Starting CELLSRV services...

Nov 18 06:55:34 StorageCell0013-man.dbaas.ing.net celld[33770]: The STARTUP of CELLSRV services was successful.

Nov 18 06:55:34 StorageCell0013-man.dbaas.ing.net systemd[1]: Started celld.

Nov 18 09:18:23 StorageCell0013-man.dbaas.ing.net systemd[1]: Stopping celld...

Nov 18 09:18:23 StorageCell0013-man.dbaas.ing.net celld[57157]: Stopping the RS, CELLSRV, and MS services...

Nov 18 09:18:29 StorageCell0013-man.dbaas.ing.net celld[57157]: The SHUTDOWN of services was successful.

Nov 18 09:18:29 StorageCell0013-man.dbaas.ing.net systemd[1]: Stopped celld.



[root@StorageCell0013-man unix]# cellcli -e alter cell shutdown services all


Stopping the RS, CELLSRV, and MS services...

CELL-01509: Restart Server (RS) not responding.

Getting the state of CELLSRV services... unknown

Getting the state of MS services... unknown

Getting the state of RS services... stopped



[root@StorageCell0013-man unix]# cellcli -e alter cell startup services all


Starting the RS, CELLSRV, and MS services...

Getting the state of RS services...  running

Starting CELLSRV services...

The STARTUP of CELLSRV services was successful.

Starting MS services...

The STARTUP of MS services was not successful.

CELL-01554: MS startup failed for unknown reasons.


==========

The redeploy alone did not help: MS still failed to start. The ILOM was not reachable either, so a cold restart of the storage server was performed. After that, the MS service started automatically.
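Once the cell is back, the state of all three services can be confirmed with CellCLI. The check below is a sketch: `all_services_running` is a hypothetical helper, and it assumes the usual `name: value` layout of `list cell ... detail` output.

```shell
# Return 0 only if exactly three *Status attributes are present and all are "running".
# Reads CellCLI output on stdin. all_services_running is a hypothetical helper name.
all_services_running() {
  awk '
    /Status:/ { n++; if ($2 != "running") bad = 1 }
    END { if (n == 3 && !bad) exit 0; exit 1 }
  '
}

# Usage on the storage cell:
# cellcli -e list cell attributes rsStatus,msStatus,cellsrvStatus detail \
#   | all_services_running && echo "RS, MS and CELLSRV are running"
```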


