Full root partition on Brocade FC switch
If you work with Brocade (now Broadcom) Fibre Channel (FC) switches, you have probably had experiences similar to mine. Normally the switches run without any issues after setup. Sometimes the firmware needs to be updated or the zoning needs to be extended. Otherwise they just run. Last week I had the opposite experience: I saw very strange behavior caused by a full root partition on a Brocade FC switch.
Fortunately, a full root partition on a Brocade FC switch does not cause an outage, at least not in this case. Observed symptoms:
- The built-in web server did not run, so there was no response when trying to connect via http(s).
- The switch was shown with yellow status in the GUI of another switch in the fabric.
This situation occurred in an environment containing two fabrics. Each fabric contains 4 to 5 switches. All switches except one 8G switch run Fabric OS v8.2.2c. Remarkably, the issue occurred in both fabrics, on 3 switches in total, and all of these switches are HPE BladeSystem c7000 interconnect switches.
For the following commands and troubleshooting actions, you need to log in as root. The user admin is not allowed to run such low-level system commands.
These symptoms are caused by a full root partition. To check this, run df -h. A full root partition shows a usage at or close to 100% for the file system mounted on /.
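The check can also be scripted. The following is a minimal sketch, assuming the switch's df accepts the POSIX -P option (not verified on Fabric OS); the function names usage_percent and check_root_partition are made up for illustration, and the default limit of 85% is the target value mentioned in the notes of this post.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Fabric OS.

# Print the usage of a mount point as a bare integer (e.g. 97 for 97%).
usage_percent() {
    mount_point=${1:-/}
    # -P forces the POSIX layout: one line per file system, capacity in column 5.
    df -P "$mount_point" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 1 and print a warning when / is at or above the given limit.
check_root_partition() {
    limit=${1:-85}
    used=$(usage_percent /)
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: / is ${used}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: / is ${used}% full"
}
```

Calling check_root_partition without arguments compares the usage of / against 85%; pass another number to use a different limit.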
In my case this was caused by thousands of pid files. Linux-based systems use these files to store the process ID (pid) of a process for later use. Such files are stored in the directory /var/run. Here, pid files for the process weblinkerfcgd were abundant.
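To see which process is responsible for the flood of pid files, the files can be counted per process name. This is a minimal sketch, assuming pid files are named name.pid or name.number.pid and that find, sed, sort, uniq and awk are available on the switch; the function name count_pid_files is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count pid files per process name, most frequent first.
count_pid_files() {
    dir=${1:-/var/run}
    find "$dir" -type f -name '*.pid' \
        | sed -e 's|.*/||' -e 's/\..*$//' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
    # sed strips the directory part, then everything from the first dot on,
    # so weblinkerfcgd.1644.pid is counted under weblinkerfcgd.
}
```

Running count_pid_files /var/run prints one line per process name with the number of pid files in front, so an outlier like weblinkerfcgd stands out at the top.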
First and foremost, DO NOT REBOOT THE SWITCH WITH A FULL ROOT PARTITION! This can leave you with a switch that no longer works.
To solve the problem, you do need to reboot the switch, but only after freeing some space. The reboot stops the system from trying to create new files. A hareboot is sufficient. Here are the steps that solved the problem:
- Delete some files to reduce the filling level to 95% or less. See Notes beneath.
- Restart the switch. Use the command hareboot to do so. Before the restart, note the current time on the switch.
- After the reboot the web server should be running. Unfortunately, rebooting the switch does not clean up unused pid files, so this must be done manually. It is important not to delete the current pid file of the weblinkerfcgd process. Find this file by running the following commands from within /var/run. Here, the time of the reboot was 2nd June 2021 at 14:11.
touch -t 202106021411 /tmp/current
find . -newer /tmp/current -print0 | xargs -0 echo
The touch command creates a file that find uses as a time stamp. The output shows all current pid files. One of them should look like weblinkerfcgd.*.pid, where * is the process ID. Example for the next step: weblinkerfcgd.1644.pid
- Now it is time to remove all unneeded pid files. To do so, you can run this command:
find . -type f -name 'weblinkerfcgd.*.pid' -not -name 'weblinkerfcgd.1644.pid' | xargs rm
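The last two steps can also be combined so that the pid of the current process does not have to be typed in. The sketch below is a variation on the commands above, not the exact method used here: it selects weblinkerfcgd pid files by age (not newer than the reference file created with touch) instead of excluding one file by name. It assumes the current pid file was written after the reboot; the function name stale_pid_files is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list or delete weblinkerfcgd pid files that are
# NOT newer than a reference file.
# Usage: stale_pid_files DIR REFERENCE [delete]
stale_pid_files() {
    dir=$1
    ref=$2
    action=${3:-list}
    if [ "$action" = "delete" ]; then
        # -exec ... {} + passes the files to rm in batches,
        # so "Argument list too long" cannot occur.
        find "$dir" -type f -name 'weblinkerfcgd.*.pid' ! -newer "$ref" -exec rm -f {} +
    else
        # Default is a dry run that only prints the candidates.
        find "$dir" -type f -name 'weblinkerfcgd.*.pid' ! -newer "$ref" -print
    fi
}

# Example on the switch, as root (reboot time taken from this post):
#   touch -t 202106021411 /tmp/current
#   stale_pid_files /var/run /tmp/current          # dry run
#   stale_pid_files /var/run /tmp/current delete   # actually remove
```

Running the dry run first and checking that the current pid file is not listed is a cheap safeguard before deleting anything.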
At the time of writing I do not know the exact root cause. If I get more information from support, I will post it here.
Notes: The root partition should be kept below 85% filling level. The following also helps to reduce used space:
- Because of the number of files, you cannot simply use the command rm. You will get the error Argument list too long. To create some space before rebooting the switch, it can be helpful to delete a subset of these files. This can be done with a for loop like this:
for i in weblinkerfcgd.9*; do rm "$i"; done
This loop does the same as rm weblinkerfcgd.9* but without the error.
- There are log files in /var/log. In this scenario hasm.log was >30MB in size. A reboot moves hasm.log to hasm.log.save and starts a new hasm.log. I downloaded the current and the saved log file, in case support would ask for them, and then cleared the log.
- supportsave -R cleans up some unneeded files.
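The first two notes can be sketched as small shell helpers. The for loop avoids the error because the shell expands the pattern internally and each rm call receives only one file name, so the kernel's limit on the argument list of a single command is never reached. Both function names are hypothetical, and the k suffix for find -size is assumed to be supported by the find on the switch.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Fabric OS.

# Delete all files in DIR whose names start with PREFIX, one rm call per file.
delete_by_prefix() {
    dir=$1
    prefix=$2
    for f in "$dir"/"$prefix"*; do
        # If nothing matched, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        rm -f -- "$f"
    done
}

# List files below DIR that are larger than SIZE_KB kilobytes.
large_files() {
    dir=${1:-/var/log}
    size_kb=${2:-10240}
    find "$dir" -type f -size +"${size_kb}"k -print
}

# Example on the switch, as root:
#   delete_by_prefix /var/run weblinkerfcgd.9
#   large_files /var/log 10240
```

Deleting by prefix removes only part of the pid files, which is enough to get below the filling level needed for a safe hareboot.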