Full root partition on Brocade FC switch
If you work with Brocade (now Broadcom) Fibre Channel (FC) switches, you have probably had experiences similar to mine. Normally the switches run without any issues after setup. Sometimes the firmware needs to be updated or the zoning needs to be extended. Otherwise they just run. Last week I had the opposite experience: I saw very strange behavior caused by a full root partition on a Brocade FC switch.
Symptoms
Fortunately, a full root partition on a Brocade FC switch does not cause an outage, at least not in this case. Observed symptoms:
- The built-in web server did not run, so there was no response when trying to connect via http(s).
- The switch is shown with yellow status in the GUI of another switch in the fabric.
Environment
This situation occurred in an environment containing two fabrics. Each fabric contains 4 to 5 switches. All switches, except for one 8G switch, are running Fabric OS v8.2.2c. Remarkably, the issue occurred in both fabrics, on 3 switches in total. All of these switches are HPE BladeSystem c7000 interconnect switches.
Problem
For the following commands and troubleshooting actions, you need to log in as root. The user admin is not allowed to run such low-level system commands.
These symptoms are caused by a full root partition. To check this, run df -h. In the output, a full root partition shows up as the root file system at or near 100% usage.
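The check above can also be reduced to a single number, which is handy when several switches have to be compared. The following is only a minimal sketch: the function names usage_percent and check_root are hypothetical, and it assumes that df accepts -P (POSIX output format) and that awk is available in the root shell of the switch, which I have not verified for every Fabric OS version.

```shell
#!/bin/sh
# Hypothetical helper: print the usage of a mount point as a bare number.
# Assumption: df supports -P and awk is installed.
usage_percent() {
    # Line 2 of "df -P" holds the data; column 5 is the capacity, e.g. "97%".
    df -P "${1:-/}" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Hypothetical threshold check; 85 is the level recommended in the Note section.
check_root() {
    limit="${1:-85}"
    used="$(usage_percent /)"
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: root partition at ${used}% (limit ${limit}%)"
    else
        echo "OK: root partition at ${used}%"
    fi
}

check_root 85
```

The threshold is passed as an argument, so the same function can be used with 95 before the reboot and with 85 afterwards.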
In my case this was caused by thousands of pid files. Linux-based systems use these files to store process ID (pid) information for later use. Such files are stored in the directory /var/run. Here, pid files for the process weblinkerfcgd were abundant.
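To see which process is responsible for the flood of files, it can help to count the files in /var/run grouped by name prefix. This is a sketch under assumptions: count_by_prefix is a hypothetical helper, and it relies on find with -maxdepth as well as sed, sort and uniq being present on the switch.

```shell
#!/bin/sh
# Hypothetical helper: count the files in a directory, grouped by the part of
# the file name before the first dot, largest group first.
# Assumption: find supports -maxdepth; sed, sort and uniq are installed.
count_by_prefix() {
    dir="${1:-/var/run}"
    find "$dir" -maxdepth 1 -type f \
        | sed -e 's|.*/||' -e 's|\..*||' \
        | sort | uniq -c | sort -rn
}

# Example call (not executed here):
#   count_by_prefix /var/run | head -n 5
```

In a situation like the one described here, I would expect weblinkerfcgd to appear at the top of the list with a count in the thousands.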
Solution
First and foremost: DO NOT REBOOT THE SWITCH WITH A FULL ROOT PARTITION! This can leave you with a switch that no longer works.
Solving the problem does require a reboot, but only after some space has been freed. The reboot stops the system from trying to create new files. A hareboot is sufficient. Here are the steps that solved the problem:
- Delete some files to bring the usage down to 95% or less. See the Note below.
- Restart the switch with the command hareboot. Before the restart, note the current time by running date.
- After the reboot the web server should be running. Unfortunately, rebooting the switch does not clean up unused pid files, so this must be done manually. It is important not to delete the current pid file of the weblinkerfcgd process. Find this file by running the following commands. Here, the time of the reboot was 2nd June 2021 at 14:11.
cd /var/run
touch -t 202106021411 /tmp/current
find . -newer /tmp/current -print0 | xargs -0 echo
The touch command creates a file that find uses as a time stamp. The output shows all current pid files. One of them should look like weblinkerfcgd.*.pid, where * is the process ID. Example for the next step: weblinkerfcgd.1644.pid
- Now it is time to remove all unneeded pid files. To do so, you can run this command:
find . -type f -name 'weblinkerfcgd.*.pid' -not -name 'weblinkerfcgd.1644.pid' | xargs rm
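Because deleting the wrong pid file is the main risk in the last step, a dry run before the actual deletion may be worthwhile. The following sketch wraps the cleanup in a hypothetical function clean_stale_pids. It is not the command I ran on the switches. It uses find with -exec ... {} + instead of xargs, which also passes the files to rm in batches, and it assumes that find supports -maxdepth.

```shell
#!/bin/sh
# Hypothetical helper: remove stale weblinkerfcgd pid files but keep one file.
# Arguments: directory, name of the pid file to keep, mode.
# With any mode other than "delete" the candidates are only printed.
clean_stale_pids() {
    dir="$1"
    keep="$2"
    mode="${3:-dry-run}"
    if [ "$mode" = "delete" ]; then
        find "$dir" -maxdepth 1 -type f -name 'weblinkerfcgd.*.pid' \
            ! -name "$keep" -exec rm -f {} +
    else
        find "$dir" -maxdepth 1 -type f -name 'weblinkerfcgd.*.pid' \
            ! -name "$keep" -print
    fi
}

# Example calls (not executed here), using the pid file from the step above:
#   clean_stale_pids /var/run weblinkerfcgd.1644.pid
#   clean_stale_pids /var/run weblinkerfcgd.1644.pid delete
```

The first call only lists what would be removed. The second call deletes those files and leaves weblinkerfcgd.1644.pid and all other files in the directory untouched.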
Note
At the time of writing I do not know the exact root cause. If I get more information from support, I will post it here.
The root partition should be kept below 85% usage. The following also helps to reduce used space:
- Because of the number of files, you cannot simply use the command rm with a wildcard. You will get the error Argument list too long. To create some space before rebooting the switch, it can be helpful to delete a subset of these files. This can be done with a for-loop like this:
for i in weblinkerfcgd.9*; do rm "$i"; done
This loop does the same as rm weblinkerfcgd.9* but without the error.
- There are two log files, hasm.log and hasm.log.save, in /var/log. In this scenario hasm.log was >30MB in size. A reboot moves hasm.log to hasm.log.save and starts a new hasm.log. I downloaded the current and the saved log file, in case support would ask for them, and cleared it.
- The command supportsave -R cleans up some unneeded files.
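One way to clear a log file such as hasm.log is to truncate it in place instead of deleting it. The sketch below shows the idea with a hypothetical function clear_log. It assumes the log has already been downloaded, and I cannot say for certain how the process writing hasm.log reacts to a truncated file, so treat it as an illustration only.

```shell
#!/bin/sh
# Hypothetical helper: empty a log file in place after showing its size.
# Assumption: a copy of the log has already been saved elsewhere.
clear_log() {
    log="$1"
    if [ ! -f "$log" ]; then
        echo "no such file: $log" >&2
        return 1
    fi
    ls -l "$log"    # show the size before clearing
    : > "$log"      # truncate to zero bytes; the file itself stays in place
}

# Example call (not executed here):
#   clear_log /var/log/hasm.log
```

Truncating keeps the file name and permissions, and the freed space should show up in df -h right away.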