BOINC Application Optimization: The Good, the Bad, and the Ugly

Somebody pointed out a thread to me on E@H:
http://einstein.phys.uwm.edu/forum_thread.php?id=4480

I have to say that I’m a little shocked at some of the attitudes I’ve seen from some of the participants.

First let me clear up some misunderstandings about what validators and assimilators for a BOINC server cluster are supposed to do. Validators only check that there is agreement between the machines that have crunched the same workunit. If all of the machines agree on what the numbers are, then the results are considered valid and flagged for assimilation. Assimilators just copy the result data from the BOINC database/file system to the project’s internal database for analysis. Only after assimilation does a result finally have meaning in the context of the project’s goal; prior to that it is just a collection of numbers, and BOINC doesn’t have a clue whether they are correct or not.
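To make that division of labor concrete, here is a rough sketch in C++ of the kind of comparison a validator performs. The struct and function names are made up for illustration, and this is not the actual BOINC validator code; the point is simply that it checks agreement between machines, not scientific correctness.

    // Rough, illustrative sketch only -- not the actual BOINC validator code.
    // The names here (ResultData, results_agree) are invented for the example.
    #include <cmath>
    #include <vector>

    struct ResultData {
        std::vector<double> values;   // numbers parsed from one result file
    };

    // Two results "agree" if every value matches within a small tolerance.
    // The validator has no idea whether the numbers are scientifically right.
    bool results_agree(const ResultData& a, const ResultData& b, double tolerance)
    {
        if (a.values.size() != b.values.size()) return false;
        for (size_t i = 0; i < a.values.size(); i++) {
            if (std::fabs(a.values[i] - b.values[i]) > tolerance) return false;
        }
        return true;   // flag the results as valid and ready for assimilation
    }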

Projects are free to add additional logic to their validators and assimilators to try to weed out incorrect results, but to some degree it is still just a guess. If they already knew what the correct answer was, they would not have needed to send out the work in the first place.

For projects that are searching for something, their results can be broken down into two camps: something that needs further investigation, and background noise. What separates something that needs further investigation from something that is background noise? There is some value, or set of values, in the result files that exceeds one or more thresholds. Some thresholds may also have a cap, in which case an interesting value or set of values has to fall between the two bounds. We can then refer to the lower and upper bound of a threshold as a threshold window. Those thresholds are typically calibrated against the default client a project sends out. Tests are run against the default client using special workunits that contain various samples of data that expose what the application is looking for, so the scientists can make sure the client is working as it is supposed to.
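In code form the idea is roughly this. It is an illustrative sketch only; the names are invented, and a real project works with whole result files rather than a single scalar value:

    // Illustrative sketch of a threshold window test; the names are invented
    // and a real result contains far more than one value.
    struct ThresholdWindow {
        double lower;   // a value has to exceed this to be interesting...
        double upper;   // ...but stay below this cap
    };

    // A result value is flagged for further investigation only if it falls
    // inside the window; everything else is treated as background noise.
    bool needs_investigation(double value, const ThresholdWindow& window)
    {
        return value >= window.lower && value <= window.upper;
    }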

So now the crux of the problem: changing instruction sets for an application can and will change the level of precision of the data returned to the project.

Optimized SSE/SSE2/SSE3/3DNow applications change how the mathematical operations are performed vs. an un-optimized application. Whether that adversely affects the project depends entirely on how the project handles data types internally. If a project doesn’t release the source code or test workunits for their application, then somebody optimizing the application with a disassembler or hex editor is making an assumption about how calculations are being performed and what they can do to optimize them. If they are wrong, then something might be flagged as noise when it should have been flagged as needing to be investigated. What if something is missed because the thresholds are geared for a different range of values than what the optimized application is producing?
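Here is a deliberately contrived but runnable C++ illustration of why that matters. The threshold value is invented purely for the example; the only point is that the same nominal number, carried at two different precisions, can land on opposite sides of a cut-off:

    // Illustrative only: the same nominal value, stored at two precisions,
    // landing on opposite sides of a threshold chosen just for this example.
    #include <cstdio>

    int main()
    {
        float  as_single = 0.1f;   // stored as roughly 0.100000001490116
        double as_double = 0.1;    // stored as roughly 0.100000000000000006

        double threshold = 0.1000000007;   // arbitrary cut-off for the example

        std::printf("single precision flagged: %s\n", (as_single > threshold) ? "yes" : "no");
        std::printf("double precision flagged: %s\n", (as_double > threshold) ? "yes" : "no");
        return 0;
    }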

The SSE/SSE2/SSE3 instruction sets use 128-bit registers (3DNow! uses the 64-bit MMX registers), while the original x87 FPU uses 80-bit registers. Most programming languages store floating point numbers as either 32-bit single precision floats or 64-bit double precision floats. Quite a bit of the performance improvement that these newer instruction sets provide comes from packing multiple numbers into a register and then performing mathematical operations on them all at once, SIMD-style. So you can fit four single precision floats, or two double precision floats, into a single 128-bit register. Depending on the instruction, each individual result is rounded to 32 bits or 64 bits, rather than being carried at the 80-bit extended precision the x87 FPU uses for intermediate results. That means that in the worst-case scenarios an optimized application rounds a computation either higher or lower than the original application would.
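As a hedged illustration (not taken from any project’s application), here is a small C++ program using the SSE intrinsics from <xmmintrin.h>. It computes the same dot product two ways: once with packed single-precision SSE operations that round every intermediate to 32 bits, and once accumulating in long double, which on most x86 compilers maps to the x87 80-bit format. With chains of operations like this the two answers will generally differ in the last digits, although the exact amount depends on the inputs and the compiler:

    // Illustrative only: the same dot product computed two ways.
    //   1) packed single-precision SSE, rounding every intermediate to 32 bits
    //   2) accumulation in long double (80-bit x87 format on most x86 compilers)
    #include <xmmintrin.h>
    #include <cstdio>

    int main()
    {
        float a[4] = {1.1f, 2.2f, 3.3f, 4.4f};
        float b[4] = {5.5f, 6.6f, 7.7f, 8.8f};

        // Packed path: four 32-bit floats share one 128-bit XMM register.
        float p[4];
        _mm_storeu_ps(p, _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
        float packed_sum = ((p[0] + p[1]) + p[2]) + p[3];   // 32-bit accumulation

        // Extended-precision path: keep every intermediate at long double width.
        long double extended_sum = 0.0L;
        for (int i = 0; i < 4; i++) {
            extended_sum += (long double)a[i] * b[i];
        }

        std::printf("packed 32-bit : %.10f\n", (double)packed_sum);
        std::printf("extended      : %.10f\n", (double)extended_sum);
        return 0;
    }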

You might be thinking: why don’t projects just enlarge the threshold window so that those small rounding errors can get through? Some of them have, but others still need to investigate how using different instructions affects the system overall. A few of the science applications perform calculations on the results of previous calculations over and over again. How large would the threshold window have to be if those calculations on previous calculations happened 1,000,000 or 10,000,000 times?
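To get a feel for how fast that drift can grow, here is a simple C++ sketch (my own illustration, not any project’s code) that accumulates the same nominal value 10,000,000 times in single and in double precision. The exact numbers vary by compiler and hardware, but on a typical machine the single-precision total ends up tens of thousands of units away from the nominal 1,000,000, while the double-precision total stays within a small fraction of a unit:

    // Illustrative sketch: accumulating the same nominal value 10,000,000 times
    // in single precision and in double precision.
    #include <cstdio>

    int main()
    {
        const int iterations = 10000000;
        float  single_total = 0.0f;
        double double_total = 0.0;

        for (int i = 0; i < iterations; i++) {
            single_total += 0.1f;   // rounded to a 24-bit significand every step
            double_total += 0.1;    // rounded to a 53-bit significand every step
        }

        // Nominally both totals should be 1,000,000; the single-precision one
        // drifts far enough that no reasonable threshold window could hide it.
        std::printf("single precision: %f\n", single_total);
        std::printf("double precision: %f\n", double_total);
        return 0;
    }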

Here is an example of two different Intel SSE CPU instructions (one working on packed data, the other using the whole register) producing different results on the same processor:
http://softwareforums.intel.com/ISN/Community/en-US/forums/thread/5484332.aspx

Note that the example above was using the Intel IPP library. That is how easily rounding problems can be introduced when optimizing.

For those who are quick to say “by using optimized applications I’m doing more science because I can process workunits faster”, my response is:
Only if the project’s backend databases and tools are equipped to deal with the differences; otherwise something might be missed. If you processed the workunit quickly but sent back numbers outside the target threshold windows, have you really helped the project?

Another common thing I’ve seen is: “I’ve run the standard application and the optimized application across X number of workunits I’ve been assigned and they produced the same result files, so the optimized application must be good in all scenarios.” My response is:
What that really means is that no rounding issues occurred with the workunits you had access to. Without the test workunits a project uses internally you really don’t know if you covered all your bases.

The good news in all of this is that the projects are listening and are working with the optimizers to incorporate the needed changes into the projects’ default applications. Please be patient during the transition, though; it is going to take a bit of time to double-check everything and make sure it is all in working order.

In case you are curious, I do not use any optimized clients on any of my machines. To me the science applications are big black boxes; I don’t know enough about what they do under the hood to make smart changes for the better. I’ll wait for optimization changes to be released by the projects, which means that their backend systems can account for any changes to the data.

At the end of the day, most of the projects are probably not concerned with the problem of verifying data that has been flagged as interesting; their concern is missing something interesting that was flagged as background noise.

----- Rom

References:
http://en.wikipedia.org/wiki/IA-32
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
http://en.wikipedia.org/wiki/SSE2
http://en.wikipedia.org/wiki/SSE3
http://en.wikipedia.org/wiki/SSE4
http://en.wikipedia.org/wiki/3dnow
http://docs.sun.com/source/806-3568/ncg_goldberg.html

[08/15/2006] Adding a few more reference articles; the banking industry is still battling rounding errors in its software.
http://www.regdeveloper.co.uk/2006/08/12/floating_point_approximation/
http://cch.loria.fr/documentation/IEEE754/