2177064 – FAQ: SAP HANA Service Restarts and Crashes

1. Which indications exist for problems with SAP HANA service restarts and crashes?

The following SAP HANA alerts indicate problems with service restarts and crashes:

Alert  Name                SAP Note  Description
4      Restarted services  1909660   Identifies services that have restarted since the last time the check was performed.
52     Crashdump files     1977218   Identifies new crashdump files that have been generated in the trace directory of the system.

SQL: “HANA_Configuration_MiniChecks” (SAP Notes 1969700, 1999993) returns a potentially critical issue (C = ‘X’) for one of the following individual checks:

Check ID Details
110 Everything started
111 Host startup time variation (s)
115 Service startup time variation (s)
650 Number of crash dumps (last day)
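
As a rough manual cross-check for check 650, the following Python sketch counts crash dump files in a trace directory that were modified within the last day. It is only an illustration and not how the mini check itself is implemented: the function name is hypothetical, the file modification time is assumed to approximate the crash time, and the file name pattern is assumed from the usual crash dump naming (<service>_<host>.<port>.crashdump.<timestamp>.trc).

```python
import time
from pathlib import Path


def count_recent_crashdumps(trace_dir, max_age_seconds=24 * 3600, now=None):
    """Count crash dump files in trace_dir modified within max_age_seconds."""
    now = time.time() if now is None else now
    count = 0
    # Assumed naming: <service>_<host>.<port>.crashdump.<timestamp>.trc
    for path in Path(trace_dir).glob("*.crashdump.*.trc"):
        # Assumption: the modification time is close to the time of the crash
        if now - path.stat().st_mtime <= max_age_seconds:
            count += 1
    return count
```

Call it with the path of the trace directory of the host in question; a result greater than 0 corresponds to the situation that check 650 reports.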

2. How can I check when SAP HANA hosts and services were (re-)started?

The following options exist to check for host and service startup times:

Tool Location Details
SAP HANA Studio Administration -> Landscape In the ‘Services’ and ‘Hosts’ tabs you can find several details including the startup times of SAP HANA services and hosts.
DBACOCKPIT Configuration In the ‘Services’ and ‘Hosts’ sections you can find several details including the startup times of SAP HANA services and hosts.
SAP Note 1969700 SQL: “HANA_Startup_StartupTimes” This SQL statement provides an overview of host and / or service startup times including “startup delays”, i.e. the time by which a host or service was started after the first started host or service.
Example (host with a significantly later startup time than the other hosts): [screenshot not included]
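
The notion of a “startup delay” can be illustrated with a short Python sketch. This is not the SQL statement from SAP Note 1969700; the function name, host names and timestamps are hypothetical and only show how the delay relative to the first started host is derived.

```python
from datetime import datetime


def startup_delays(start_times):
    """Return the seconds each entry started after the earliest one.

    start_times maps a host or service name to its startup datetime.
    """
    first = min(start_times.values())
    return {name: (ts - first).total_seconds() for name, ts in start_times.items()}


# Hypothetical startup times for three hosts (illustrative values only)
hosts = {
    "host01": datetime(2015, 6, 1, 10, 0, 0),
    "host02": datetime(2015, 6, 1, 10, 0, 20),
    "host03": datetime(2015, 6, 1, 12, 30, 0),
}
delays = startup_delays(hosts)
# host03 started 9000 seconds (2.5 hours) after host01, which would stand out
```

A large delay for a single host, as for host03 here, suggests that this host was restarted separately from the rest of the system.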

3. What are typical reasons for service restarts and crashes?

The following main reasons exist for restarts and crashes:

  • Explicit restart by administrator or software component
  • Software bugs (SAP HANA, operating system)
  • Hardware problems
  • Configuration issues

4. How can the actual reason for service restarts and crashes be identified?

The following possibilities exist to identify reasons for restarts and crashes:

Approach Details
daemon.<host>.<port>.<id>.trc The daemon trace file typically provides a good overview of when and why services were stopped or started.
<service>_<host>.<port>.crashdump.<timestamp>.trc Crash dump files are written when SAP HANA needs to be stopped due to unforeseen problems. See SAP Note 2000003 (“Which types of dumps can be created in SAP HANA environments?”) for more information.
Analyzing the content of these files is typically a key step to identify the root cause. Particularly important are the following sections:

  • CRASH_SHORTINFO: General crash information
  • CRASH_STACK: Call stack where crash happened
  • THREADS: Overview of thread activities when crash happened

With key words from CRASH_SHORTINFO and CRASH_STACK you can search for additional information using the following tools:

  • SAP Notes
  • SAP incidents (only available SAP internally)
  • SAP crash search inspector (only available SAP internally, see SAP Note 2163520)

If you can’t find a solution, you can open an SAP incident on component HAN-DB in order to request assistance from SAP.

<service>_<host>.<port>.emergencydump.<timestamp>.trc Emergency dump files can be treated similarly to crash dump files, but in this case the shutdown happens in a more controlled manner. See SAP Note 2000003 (“Which types of dumps can be created in SAP HANA environments?”) for more information.
When a crash dump is created, it is also useful to check the normal trace file of the service in question and its alert trace, because a crash can be the consequence of a preceding activity like a failover. This preceding context may be visible in the service-specific trace files.
Discussion with administrator If an administrator performed the restart manually, you can discuss the reasons with them.
Check of high availability features There are situations in which high availability features of SAP HANA (see SAP Note 2057595) or other components (e.g. external cluster software) result in a restart or move of services and hosts. If this happens without an apparent reason (e.g. a crash or hang situation), the responsible high availability solution needs to be checked.
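
To get quickly to the CRASH_SHORTINFO, CRASH_STACK and THREADS sections of a large crash dump file, a small script can help. The following Python sketch assumes that each section starts with a line beginning with the section name in square brackets (e.g. “[CRASH_STACK]”) and ends with a line “[OK]”; this layout is an assumption and should be verified against your own dump files. The function name is hypothetical.

```python
import re

# Assumed section header: a line starting with "[SECTION_NAME]"
SECTION_START = re.compile(r"^\[([A-Z][A-Z_0-9]*)\]")


def extract_sections(dump_text, wanted=("CRASH_SHORTINFO", "CRASH_STACK", "THREADS")):
    """Return {section name: list of lines} for the wanted sections."""
    sections = {}
    current = None
    for line in dump_text.splitlines():
        match = SECTION_START.match(line)
        if match and match.group(1) == "OK":
            current = None  # assumed end-of-section marker
            continue
        if match:
            name = match.group(1)
            current = name if name in wanted else None
            if current:
                sections[current] = [line]  # keep the header line as well
            continue
        if current:
            sections[current].append(line)
    return sections
```

Read the dump file as text, pass its content to extract_sections, and use the returned CRASH_SHORTINFO and CRASH_STACK lines as the source of key words for the search described above.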

5. What does a typical crash dump analysis look like?

As an example, consider the following CRASH_SHORTINFO and CRASH_STACK sections of a crash dump (excerpt not included):

Based on the information marked yellow we can already draw some conclusions:

  • The crash was caused by a thread accessing a file in the DATA area asynchronously.
  • The file name is datavolume_0000.dat.
  • The error accessing this file was an “Input/output error”.
  • As a consequence the thread and the whole service crashed due to signal 6 (SIGABRT).
  • An input / output error in this context indicates that the file was (temporarily) not accessible or the connection was interrupted.
  • The likeliest area of the root cause is a layer below SAP HANA, like hardware or operating system.
  • Another explanation for this kind of input / output error can be found in SAP Note 2062631 (ping-pong situation).
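
The signal number and the error code mentioned in such a dump can be translated into their symbolic names with the Python standard library. This is a minimal sketch with hypothetical function names; it assumes that it is run on a Linux system, so that the numbering matches the host that wrote the dump.

```python
import errno
import os
import signal


def describe_signal(number):
    """Return the symbolic name of a signal number."""
    return signal.Signals(number).name


def describe_errno(code):
    """Return (symbolic name, message text) for an operating system error code."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)
```

On Linux, describe_signal(6) returns SIGABRT and describe_errno(5) returns EIO together with the corresponding message text, which matches the conclusions drawn from the dump above.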

6. Which typical errors can be found in crash dumps and what do they mean?

Some important errors reported in crash dumps can be found below. The main error is marked bold; other errors provide context information and may deviate:

Error: Error during asynchronous file transfer
       rc=5: Input/output error
Reason: Error accessing a file on disk
Troubleshooting steps:
  • The input/output error can also happen if a file in the persistence is no longer accessible for a service because an SAP HANA failover mechanism has moved it away from the host. See SAP Note 2062631, which describes ping-pong failover situations that can be responsible.
  • Check in collaboration with your hardware and OS partners why the mentioned file can’t be accessed sporadically or permanently.

Error: Cannot lock file “<file>” for write access
       rc=11: Resource temporarily unavailable
Reason: Error locking a file on disk
Troubleshooting steps:
  • Another process may have locked the file (backup, cluster, virus scan, …). See SAP Note 1880382 and use lsof to check for processes having set a lock on the file.
  • If the error happens during an SAP HANA failover, it might be a timing issue or a consequence of another problem. Example: The nameserver fails over and the new nameserver tries to set a lock on its DATA file although the previous nameserver is somehow still running and locking the file. In this case, also check which error resulted in the failover and analyze that problem to avoid it in the future.
  • The Linux aio-max-nr setting may not be large enough (see SAP-internal Note 1868829).
  • Check in collaboration with your hardware and OS partners why the mentioned file can’t be locked sporadically or permanently.

Error: Error during asynchronous file transfer
       rc=28: No space left on device
Reason: No disk quota or disk space available
Troubleshooting steps:
  • Make sure that sufficient space is available on file system level.
  • If you have configured disk quota limitations, make sure that they aren’t hit (see SAP Note 1921354 for GPFS).
  • Use GPFS version or higher where a related GPFS bug is fixed (SAP Note 1846872).
  • Make sure that the hddpool quota of GPFS isn’t exceeded (SAP Note 2051052).
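
When several dumps have to be screened, the rc values listed above can be looked up automatically. The following Python sketch searches a text for “rc=<number>” and returns the error text and reason from the table above for the known values. The dictionary and function names are hypothetical, and the sketch only covers the three return codes listed in this note.

```python
import re

# Error text and reason per rc value, taken from the table above
KNOWN_RC = {
    5: ("Input/output error", "Error accessing a file on disk"),
    11: ("Resource temporarily unavailable", "Error locking a file on disk"),
    28: ("No space left on device", "No disk quota or disk space available"),
}

RC_PATTERN = re.compile(r"rc=(\d+)")


def classify_rc(text):
    """Return (rc, error text, reason) for each known rc value found in text.

    Results are in order of first appearance, without duplicates.
    """
    found = []
    seen = set()
    for match in RC_PATTERN.finditer(text):
        rc = int(match.group(1))
        if rc in KNOWN_RC and rc not in seen:
            seen.add(rc)
            found.append((rc, *KNOWN_RC[rc]))
    return found
```

An empty result only means that none of the three listed return codes was found; the dump still has to be analyzed manually in that case.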

7. What kind of information should I provide to SAP in case of a crash?

In any case, you should attach the crash dump to the SAP incident. If possible, you can instead create a full system info dump (SAP Note 1732157), which contains the crash dump as well as several other potentially important files and load information.
