Another Reason Why I Don't Like Linux
By templedf on Jun 09, 2005
As I've said in previous posts, I'm not a big fan of Linux. I will gladly use Linux over Windows any day, but on an absolute scale, it really doesn't do much for me. Here's another example of why.
In a previous post, I mentioned that I was busting my butt on an important project with a short deadline. As luck would have it, this project involved setting up a complex Grid Engine cluster on a rack of Linux servers. Part of the complexity came from a series of scripts that Grid Engine was supposed to run in response to events in the cluster. For the purpose of this rant, the only two that are interesting are suspend.sh and resume.sh. By the nature of what I was doing, these scripts already had a low trust quotient. I had just written them, and they both kicked off some rather opaque processes, so I was anything but certain that they were doing what they were supposed to, much less that they were doing it in response to the proper events.
After I got everything put where it was supposed to be, I tested the system. Much to my surprise, the suspend.sh script actually did what it was supposed to do, when it was supposed to do it. The resume.sh script, however, didn't appear to do anything. Fine. I added a line to resume.sh that touched a file in /tmp so I could see that it was at least being called. Nothing. So I went through the Grid Engine configuration, looking for a reason why it wasn't working. Nothing. I tried running resume.sh by hand. Worked fine.
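For anyone trying the same trick, a slightly more informative version of that trace line might look like the sketch below. It is only an illustration: the trace_call helper and the TRACE_FILE path are hypothetical names, not part of the original scripts or of Grid Engine. Recording the user and working directory as well as the time helps tell "never called" apart from "called in an environment I didn't expect".

```shell
#!/bin/sh
# Hypothetical tracing helper for the top of a script such as resume.sh.
# TRACE_FILE and trace_call are assumed names, not from the original setup.
TRACE_FILE="${TRACE_FILE:-/tmp/resume_trace.log}"

trace_call() {
    # Append one line per invocation: timestamp, script name, PID, user, cwd.
    printf '%s %s pid=%s user=%s cwd=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$$" "$(id -un)" "$(pwd)" \
        >> "$TRACE_FILE"
}

# First thing the script does, before any real work.
trace_call "resume.sh"
```

If the trace file never gains a line after a resume event, the script is not being started at all, which points away from the script body and toward whatever is supposed to launch it.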
Then I started getting creative. I changed the configuration to use suspend.sh on both events (suspend and resume), and suspend.sh worked in both cases. Back to a problem with the resume.sh script. I ran it by hand again, just to be sure. Worked fine. I changed the configuration to run a third script (instead of resume.sh) on resume. Worked fine. I copied the contents of resume.sh into the third script. Worked fine! I removed resume.sh and made it a hard link to the third script, and switched the configuration back to using resume.sh. Nothing! I switched the configuration back to the third script. Worked fine!
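One sanity check that fits the hard-link experiment is confirming that both names really do point at the same file. The sketch below compares inode numbers; same_inode is a hypothetical helper name, and the check assumes both paths live on the same filesystem, which hard links require anyway.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if two paths are hard links to the same file.
# Assumes both paths are on the same filesystem.
same_inode() {
    # ls -di prints the inode number first; awk strips any padding.
    a=$(ls -di -- "$1" | awk '{print $1}')
    b=$(ls -di -- "$2" | awk '{print $1}')
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example use (paths are placeholders):
# same_inode ./third.sh ./resume.sh && echo "same file under two names"
```

If the two names share an inode and only one of them works, the contents are ruled out and the name itself becomes the remaining suspect.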
In the end, my best guess as to what wasted four hours of my time on an extremely time-critical project is that the NFS daemon on either the client or the server somehow decided that the name resume.sh was bad and should be ignored. Gah!
I freely admit that if I had trusted the scripts and the Grid Engine configuration more, I would have found the problem much sooner. I spent a lot of time misguidedly trying to debug both of them. Nevertheless, the fact that I even had to deal with a ridiculous problem like that makes me never want to log in to a Linux box again!