Full root partition on Brocade FC switch

Posted: June 8, 2021/Under: HPE, HPE | Storage/By: vNote42

If you work with Brocade (or Broadcom) Fibre Channel (FC) switches, you have probably had experiences similar to mine. Normally the switches run without any issues after setup. Sometimes the firmware needs to be updated or the zoning needs to be extended. Otherwise they just run. Last week I had the opposite experience: I saw very strange behavior caused by a full root partition on a Brocade FC switch.

Symptoms

Fortunately, a full root partition on a Brocade FC switch does not cause an outage, at least not in this case. Observed symptoms:

The built-in web server did not run. Connection attempts via http(s) got no response.
The switch was shown with yellow status in the GUI of another switch in the fabric.

Environment

This situation occurred in an environment containing two fabrics. Each fabric contains 4 to 5 switches. All switches except one 8G switch run Fabric OS v8.2.2c. Remarkably, the issue occurred in both fabrics, on 3 switches in total, and all of these switches are HPE BladeSystem c7000 interconnect switches.

Problem

For the following commands and troubleshooting actions, you need to log in as root. The user admin is not allowed to run such low-level system commands.

These symptoms are caused by a full root partition. To check this, run df -h. On an affected switch, the output shows the root file system at or near 100% usage.
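The check can also be scripted. The following is only a minimal sketch: the helper name fs_usage_percent is made up for illustration, and it assumes that df accepts the POSIX -P option and that awk is available, which I have not verified on every Fabric OS release.

```shell
# Print the usage percentage (without the % sign) of the file system
# that holds the given path. Defaults to the root file system.
# fs_usage_percent is a hypothetical helper name, not a Fabric OS command.
fs_usage_percent() {
    # -P forces one line per file system; field 5 is the capacity column.
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

usage=$(fs_usage_percent /)
if [ "$usage" -ge 95 ]; then
    echo "root partition is ${usage}% full: free some space before any reboot"
else
    echo "root partition is ${usage}% full"
fi
```

The threshold of 95 matches the level used in the steps of this post; adjust it as needed.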

In my case this was caused by thousands of pid-files. Linux-based systems use these files to store process ID (pid) information for later use. Such files are stored in the directory /var/run. In this case, /var/run held a huge number of pid-files for the process weblinkerfcgd.
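To see which process is responsible for most of the pid-files, the files can be counted per name prefix. This is a sketch under assumptions: count_by_prefix is a hypothetical helper name, and it relies on find supporting -maxdepth and on sed, sort and uniq being present on the switch.

```shell
# Count *.pid files in a directory, grouped by the part of the file name
# before the first dot, most frequent prefix first.
# count_by_prefix is a hypothetical helper name used for illustration.
count_by_prefix() {
    find "${1:-/var/run}" -maxdepth 1 -type f -name '*.pid' |
        sed -e 's|.*/||' -e 's|\..*||' |   # strip directory, then everything after the first dot
        sort | uniq -c | sort -rn
}
```

Called as count_by_prefix /var/run, this lists one line per prefix with its file count. Because find streams the file names, the Argument list too long error does not occur.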

Solution

First and foremost, DO NOT REBOOT THE SWITCH WITH A FULL ROOT PARTITION! This can lead to a no longer functioning switch.

To solve the problem, the switch does need a reboot, but only after some space has been freed on the root partition. The reboot stops the system from trying to create new files. An hareboot is sufficient. Here are the steps that solved the problem:

Delete some files to reduce the filling level to 95% or below. See the Note section below.
Note the current time by running date, then restart the switch with the command hareboot.
After the reboot the web server should be running. Unfortunately, rebooting the switch does not clean up unused pid-files, so this must be done manually. It is important not to delete the current pid-file of the weblinkerfcgd process. Find this file by running the following commands. In this example, the reboot took place on June 2, 2021 at 14:11.

cd /var/run
touch -t 202106021411 /tmp/current
find . -newer /tmp/current -print0 | xargs -0 echo

The touch command creates a file whose time-stamp find uses as a reference. The output shows all current pid-files, meaning those created after the reboot. One of them should look like weblinkerfcgd.*.pid, with * being the process ID. Example for the next step: weblinkerfcgd.1644.pid
Now it is time to remove all unneeded pid-files. To do so, you can run this command:

find . -type f -name 'weblinkerfcgd.*.pid' -not -name 'weblinkerfcgd.1644.pid' | xargs rm
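Before deleting anything, it can be reassuring to preview which files would be removed. The following is a minimal sketch, not the method used on the switches above: purge_stale_pids and the DRY_RUN variable are hypothetical names, and it assumes the find on the switch supports -maxdepth and the -exec ... {} + form.

```shell
# Remove weblinkerfcgd pid-files in a directory, except the one to keep.
# With DRY_RUN=1 (the default in this sketch) candidates are only listed.
# purge_stale_pids and DRY_RUN are hypothetical names for illustration.
purge_stale_pids() {
    dir=$1     # directory to clean, e.g. /var/run
    keep=$2    # file name of the current pid-file, e.g. weblinkerfcgd.1644.pid
    if [ "${DRY_RUN:-1}" = 1 ]; then
        find "$dir" -maxdepth 1 -type f -name 'weblinkerfcgd.*.pid' \
            ! -name "$keep" -print
    else
        # -exec ... {} + batches the file names, so the argument list
        # never grows beyond what rm can accept.
        find "$dir" -maxdepth 1 -type f -name 'weblinkerfcgd.*.pid' \
            ! -name "$keep" -exec rm -f {} +
    fi
}
```

Usage: run purge_stale_pids /var/run weblinkerfcgd.1644.pid first to review the list, then repeat the call with DRY_RUN=0 in front of it to actually delete.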

Note

At the time of writing I do not know the exact root cause. If I get more information from support, I will post it here.

The root partition should be kept below 85% usage. The following also helps to reduce used space:

Because of the number of files, you cannot simply use the command rm with a wildcard. You will get the error Argument list too long. To create some space before rebooting the switch, it can be helpful to delete a subset of these files. This can be done with a for-loop like this:

for i in weblinkerfcgd.9*; do rm "$i"; done

This loop does the same as rm weblinkerfcgd.9* but without the error, because rm is called once per file.
There is a log-file hasm.log and a hasm.log.save in /var/log. In this scenario hasm.log was >30MB in size. A reboot moves hasm.log to hasm.log.save and starts a new hasm.log. I downloaded both the current and the saved log-file, in case support would ask for them, and then cleared the log.
The command supportsave -R cleans up some unneeded files.
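To spot other large files such as hasm.log, the log directory can be scanned by size. This is only a sketch: large_files is a hypothetical helper name, and the k suffix for -size is a GNU/BusyBox extension that I assume is available on the switch.

```shell
# List regular files below a directory that are larger than a given
# size in KiB (default: 10240 KiB, roughly 10 MB).
# large_files is a hypothetical helper name used for illustration.
large_files() {
    find "${1:-/var/log}" -type f -size +"${2:-10240}"k
}
```

Called as large_files /var/log 10240, it prints the paths of all files larger than about 10 MB, which makes candidates for download and cleanup easy to find.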
