I am at something of an impasse with the 1% bug.
In order to gain ground I need to be able to see where the program is stuck on the destination machine. There are three ways to do this:
1. Have the community report which workunit stalled on their machine and attempt to reproduce it.
2. Hook up a debugger on the target machine and have the person at the keyboard create a dump file of the process.
3. Introduce a trigger into the executable so that on a certain action it causes it to dump its own backtraces.
Option one proves difficult just in managing the sheer number of workunits to look at. Roughly 550 workunits a day are being aborted or have exceeded their allotted CPU time. R@H hasn’t been able to reproduce the problem in the lab with the workunits they have looked at and are continuing to look at.
Option two doesn’t scale very well. Of all the people who are hitting this problem, only a small fraction know how to create a dump file with a debugger, and only a small fraction of those are willing to spend the time to compress the 200MB to 350MB file and break it into smaller pieces to email to me so I can look at them. Then, of course, there is only one of me, and I still have all my other BOINC work to do, like fixing bugs in the 5.3.x clients so we can ship 5.4.0!
Option three didn’t hit me till Monday night. As part of the feature work we did for CPDN, we introduced a way for the core client to notify the science application that it was being aborted so it could clean up after itself. When I burned the midnight oil to deliver the backtrace functionality for R@H 4.94, I completely forgot that the 5.2.x clients don’t send the abort command to the science application. At 4am I had the functionality working for Windows and checked it in.
Fast forward to today. I went looking through the results on Ralph@Home and discovered that the backtraces were not being logged the way I thought they should have been. After further investigation I realized that the 5.2.x clients were sending the quit command instead of the abort command. Talk about killing morale. I have posted in the Ralph@Home forums that people should upgrade, and I’ve been seeing results come back with 5.3.28, which is good. I’m just not sure when I’ll have enough information about the bug.
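To make the quit-versus-abort difference concrete, here is a minimal Python sketch of the idea behind option three. It is not the BOINC API or the Rosetta code: the message names `ABORT` and `QUIT` and the functions `handle_control_message` and `simulate` are hypothetical, made up for illustration. The only point is that an application which dumps its stacks on abort logs nothing when an older client sends quit instead.

```python
import io
import sys
import traceback

# Hypothetical message names; the real BOINC control messages are not shown here.
ABORT = "abort"
QUIT = "quit"


def handle_control_message(message, log):
    """Return True if the application should stop; dump stacks only on abort."""
    if message == ABORT:
        log.write("*** abort requested, dumping backtraces ***\n")
        # sys._current_frames() maps each thread id to its current stack frame.
        for thread_id, frame in sys._current_frames().items():
            log.write(f"Thread {thread_id}:\n")
            log.write("".join(traceback.format_stack(frame)))
        return True
    if message == QUIT:
        # A plain quit exits quietly, so no diagnostic data is written.
        return True
    # Any other message is ignored and the application keeps running.
    return False


def simulate(message):
    """Deliver one control message and return (stopped, log_text)."""
    log = io.StringIO()
    stopped = handle_control_message(message, log)
    return stopped, log.getvalue()
```

Calling `simulate("abort")` returns a log containing a stack trace for every thread, while `simulate("quit")` stops the application with an empty log, which is roughly what the results from the 5.2.x clients looked like.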
We are pretty close to having 5.4 ready for public release, I believe in a week or less. But a big problem remains: it typically takes a few months for a new stable client to reach a high enough level of adoption that patterns emerge that can be tracked.
After some discussions with David Baker, we are going to drop the maximum amount of time allotted for a workunit to run on a machine. That’ll cut down a good chunk of the wasted CPU cycles. I am also selling the idea of releasing the PDB file (the Windows debug symbols) with the Rosetta application for the public project. Granted, it is a 30MB file, but without it none of the diagnostic stuff built into the BOINC API for tracking down bugs will work. Isn’t a 30MB insurance policy for an abort or crash worth it if the project can get something useful out of it that leads to bug fixes?
-- Rom