Woooooooo Hooooooooooo!!!!!!
Time to throw a little party!!!!
—– Rom
Rom Walton's view of the world
We are getting close to releasing a new version of the BOINC client software. Not only does the new version contain a bunch of fixes for the client itself, but it also includes a new version of the DLLs required by the BOINC Windows Runtime Debugger.
It wasn’t always referred to as the BOINC Windows Runtime Debugger; we used to just refer to it as the stackwalker technology. Over the last couple of weeks I’ve made significant enough changes to the infrastructure on the Windows platform that I feel we can call it something different.
Here is a list of additions that I think will help out:
Upgrading to the latest version of Microsoft’s debugger technology vastly increased the accuracy of the stack traces on Windows when the correct symbol files can be found. Another huge benefit is the use of the symbol store technology.
Symbol stores are basically websites that contain compressed versions of the symbol files. Now for those who have not been following my previous posts the symbol files are pretty big. R@H for instance had a symbol file that was 30MB uncompressed. Now when a crash occurs the debugger will attempt to go to the symbol stores that are defined in the search path and download the correct symbol file for the application. This is a huge win for a project since the symbol files (PDB) for the application do not need to be included for each workunit. Those who are on dial-up connections will have less to download.
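As a rough illustration of the search path idea, here is a small sketch that assembles a symbol search path using the `srv*<local cache>*<server URL>` form that Microsoft's symbol server support understands. The function name and the example directories are made up for illustration; only the two store URLs are taken from this post.

```python
def build_symbol_search_path(local_dirs, stores):
    """Join plain directories and (cache_dir, url) symbol stores with ';'.

    Each store becomes one 'srv*<cache_dir>*<url>' entry, so downloaded
    symbol files are kept in cache_dir and reused on the next crash.
    """
    parts = list(local_dirs)
    for cache_dir, url in stores:
        parts.append(f"srv*{cache_dir}*{url}")
    return ";".join(parts)


# Hypothetical directories, used only for this example.
path = build_symbol_search_path(
    [r"C:\example\app"],
    [
        (r"C:\example\symcache", "http://msdl.microsoft.com/download/symbols"),
        (r"C:\example\symcache", "http://boinc.berkeley.edu/symstore"),
    ],
)
print(path)
```

The debugger walks the entries in order, so putting the application's own directory first means a locally shipped PDB is found before anything is downloaded.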
Hopefully we can get the rest of the code written so that projects can maintain their own symbol stores. There are many huge wins with getting this technology enabled within BOINC.
Here is a sample of what the new debugger engine produces:
BOINC Windows Runtime Debugger Version 5.5.0
Dump Timestamp : 04/16/06 23:41:39
Debugger Engine : 4.0.5.0
Symbol Search Path:
C:\BOINCSRC\Main\boinc_samples\win_build\Release;
C:\BOINCSRC\Main\boinc_samples\win_build\Release;
srv*c:\windows\symbols*http://msdl.microsoft.com/download/symbols;
srv*C:\DOCUME~1\romw\LOCALS~1\Temp\symbols*http://boinc.berkeley.edu/symstore
ModLoad: 00400000 00060000 C:\BOINCSRC\Main\boinc_samples\win_build\Release\uppercase_5.10_windows_intelx86.exe (PDB Symbols Loaded)
ModLoad: 7c800000 000c0000 C:\WINDOWS\system32\ntdll.dll (5.2.3790.1830) (PDB Symbols Loaded)
File Version : 5.2.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.1830
ModLoad: 77e40000 00102000 C:\WINDOWS\system32\kernel32.dll (5.2.3790.1830) (PDB Symbols Loaded)
File Version : 5.2.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.1830
ModLoad: 5e8d0000 000ce000 C:\WINDOWS\system32\OPENGL32.dll (5.2.3790.1830) (PDB Symbols Loaded)
File Version : 5.2.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.1830
ModLoad: 77ba0000 0005a000 C:\WINDOWS\system32\msvcrt.dll (7.0.3790.1830) (PDB Symbols Loaded)
File Version : 7.0.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 7.0.3790.1830
ModLoad: 77f50000 0009c000 C:\WINDOWS\system32\ADVAPI32.dll (5.2.3790.1830) (PDB Symbols Loaded)
File Version : 5.2.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.1830
ModLoad: 77c50000 0009f000 C:\WINDOWS\system32\RPCRT4.dll (5.2.3790.1830) (PDB Symbols Loaded)
File Version : 5.2.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.1830
ModLoad: 77c00000 00048000 C:\WINDOWS\system32\GDI32.dll (5.2.3790.2606) (PDB Symbols Loaded)
File Version : 5.2.3790.2606 (srv03_sp1_gdr.051230-1233)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.2606
ModLoad: 77380000 00092000 C:\WINDOWS\system32\USER32.dll (5.2.3790.1830) (PDB Symbols Loaded)
File Version : 5.2.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.1830
ModLoad: 68720000 00020000 C:\WINDOWS\system32\GLU32.dll (5.2.3790.1830) (PDB Symbols Loaded)
File Version : 5.2.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.1830
ModLoad: 73860000 0004c000 C:\WINDOWS\system32\DDRAW.dll (5.3.3790.1830) (PDB Symbols Loaded)
File Version : 5.3.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version: 5.3.3790.1830
ModLoad: 73b30000 00006000 C:\WINDOWS\system32\DCIMAN32.dll (5.2.3790.0) (PDB Symbols Loaded)
File Version : 5.2.3790.0 (srv03_rtm.030324-2048)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.0
ModLoad: 76aa0000 0002d000 C:\WINDOWS\system32\WINMM.dll (5.2.3790.1830) (PDB Symbols Loaded)
File Version : 5.2.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.1830
ModLoad: 71b70000 00036000 C:\WINDOWS\system32\uxtheme.dll (6.0.3790.1830) (PDB Symbols Loaded)
File Version : 6.00.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 6.00.3790.1830
ModLoad: 4b8d0000 00051000 C:\WINDOWS\system32\MSCTF.dll (5.2.3790.1830) (PDB Symbols Loaded)
File Version : 5.2.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.1830
ModLoad: 69500000 004f3000 C:\WINDOWS\system32\nvoglnt.dll (6.14.10.7801) (-exported- Symbols Loaded)
File Version : 6.14.10.7801
Company Name : NVIDIA Corporation
Product Name : NVIDIA Compatible OpenGL ICD
Product Version: 6.14.10.7801
ModLoad: 02e50000 00006000 C:\WINDOWS\system32\ctagent.dll (1.0.0.11) (-exported- Symbols Loaded)
File Version : 1, 0, 0, 11
Company Name : Creative Technology Ltd
Product Name : ctagent
Product Version: 1, 0, 0, 11
ModLoad: 60970000 0000a000 C:\WINDOWS\system32\mslbui.dll (5.2.3790.1830) (PDB Symbols Loaded)
File Version : 5.2.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.1830
ModLoad: 03030000 00118000 C:\BOINCSRC\Main\boinc_samples\win_build\Release\dbghelp.dll (6.5.3.7) (PDB Symbols Loaded)
File Version : 6.5.0003.7 (vbl_core_fbrel(jshay).050527-1915)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version: 6.5.0003.7
ModLoad: 03250000 00046000 C:\BOINCSRC\Main\boinc_samples\win_build\Release\symsrv.dll (6.5.3.8) (PDB Symbols Loaded)
File Version : 6.5.0003.8 (vbl_core_fbrel(jshay).050527-1915)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version: 6.5.0003.8
ModLoad: 032a0000 00031000 C:\BOINCSRC\Main\boinc_samples\win_build\Release\srcsrv.dll (6.5.3.7) (PDB Symbols Loaded)
File Version : 6.5.0003.7 (vbl_core_fbrel(jshay).050527-1915)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version: 6.5.0003.7
ModLoad: 77b90000 00008000 C:\WINDOWS\system32\version.dll (5.2.3790.1830) (PDB Symbols Loaded)
File Version : 5.2.3790.1830 (srv03_sp1_rtm.050324-1447)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version: 5.2.3790.1830
*** UNHANDLED EXCEPTION ****
Reason: Breakpoint Encountered (0x80000003) at address 0x7C822583
*** Dump of the Worker(offending) thread: ***
eax=00000000 ebx=00000000 ecx=77e4245b edx=7c82ed54 esi=77e424a8 edi=00454f20
eip=7c822583 esp=00a1fd64 ebp=00a1ffb4
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
ChildEBP RetAddr Args to Child
00a1fd60 0040203b 00000000 00000000 00000000 00000001 ntdll!_DbgBreakPoint@0+0x0 FPO: [0,0,0]
00a1ffb4 004239ce 77e66063 00000000 00000000 00000000 uppercase_5.10_windows_intelx86!worker+0x0 (c:\boincsrc\main\boinc_samples\uppercase\upper_case.c:174)
00a1ffb8 77e66063 00000000 00000000 00000000 00000000 uppercase_5.10_windows_intelx86!foobar+0x0 (c:\boincsrc\main\boinc\api\graphics_impl.c:75) FPO: [1,0,0]
00a1ffec 00000000 004239c0 00000000 00000000 00000000 kernel32!_BaseThreadStart@8+0x0 (c:\boincsrc\main\boinc\api\graphics_impl.c:75)
*** Dump of the Timer thread: ***
eax=0002625a ebx=00000000 ecx=00000000 edx=00b1feb0 esi=00000001 edi=00000000
eip=7c82ed54 esp=00b1ff0c ebp=00b1ffb8
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
ChildEBP RetAddr Args to Child
00b1ff08 7c822114 76aba0d3 00000002 00b1ff70 00000001 ntdll!_KiFastSystemCallRet@0+0x0 FPO: [0,0,0]
00b1ff0c 76aba0d3 00000002 00b1ff70 00000001 00000001 ntdll!_NtWaitForMultipleObjects@20+0x0 FPO: [5,0,0]
00b1ffb8 77e66063 00000000 00000000 00000000 00000000 WINMM!_timeThread@4+0x0
00b1ffec 00000000 76aba099 00000000 00000000 49474542 kernel32!_BaseThreadStart@8+0x0
*** Dump of the Graphics thread: ***
eax=00000000 ebx=7738e3f7 ecx=00000000 edx=00000000 esi=0012fc00 edi=7739ca9d
eip=7c82ed54 esp=0012fbb4 ebp=0012fbd8
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
ChildEBP RetAddr Args to Child
0012fbb0 7739c78d 77392f3a 0012fc00 00000000 00000000 ntdll!_KiFastSystemCallRet@0+0x0 FPO: [0,0,0]
0012fbd8 00424c3f 0012fc00 00000000 00000000 00000000 USER32!_NtUserGetMessage@16+0x0
0012fcb0 00423ca3 00000001 00000001 00000001 00000001 uppercase_5.10_windows_intelx86!win_graphics_event_loop+0x14 (c:\boincsrc\main\boinc\api\windows_opengl.c:571) FPO: [0,46,0]
0012fcd0 004220eb 00401078 0045c3b0 0040233c 00401078 uppercase_5.10_windows_intelx86!boinc_init_graphics_impl+0x30 (c:\boincsrc\main\boinc\api\graphics_impl.c:84) FPO: [2,7,0]
0012fcdc 0040233c 00401078 00454f00 004483a4 00000002 uppercase_5.10_windows_intelx86!boinc_init_graphics+0x4b (c:\boincsrc\main\boinc\api\graphics_api.c:45) FPO: [1,0,0]
0012fcf4 004023b1 00000002 0012fd0c 00142550 0012fd0c uppercase_5.10_windows_intelx86!main+0xa (c:\boincsrc\main\boinc_samples\uppercase\upper_case.c:233) FPO: [2,0,0]
0012fe98 004035b4 00400000 00000000 001425a7 00000001 uppercase_5.10_windows_intelx86!WinMain+0x0 (c:\boincsrc\main\boinc_samples\uppercase\upper_case.c:110) FPO: [4,100,0]
0012ffc0 77e523cd 00000000 00000000 7ffd9000 8707adb0 uppercase_5.10_windows_intelx86!WinMainCRTStartup+0x1d (f:\vs70builds\3077\vc\crtbld\crt\src\crt0.c:251)
0012fff0 00000000 00403430 00000000 78746341 00000020 kernel32!_BaseProcessStart@4+0x0 (f:\vs70builds\3077\vc\crtbld\crt\src\crt0.c:251)
Exiting…
Howdy Folks,
Well, error rates are still dropping on R@H. It generally takes a few weeks for the old version of an application to filter out of the system.
Here was the pass percentage breakout from yesterday:
Version | OS | Total Results | Pass Rate | Fail Rate |
483 | Darwin | 6733 | 90.24 | 9.76 |
483 | Windows | 99095 | 95.74 | 4.26 |
482 | Darwin | 213 | 96.71 | 3.29 |
482 | Linux | 9387 | 96.48 | 3.52 |
482 | Windows | 6000 | 84.68 | 15.32 |
As you can see, the failure rate for Windows dropped by more than 10 percentage points, from 15.32% with version 482 to 4.26% with version 483.
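For anyone who wants to check the arithmetic, here is a small sketch that recomputes the fail rates from the pass rates in the table above and compares the two Windows versions. The row data is copied from the table; the function names are just for this example.

```python
# (version, os, total_results, pass_rate_percent), copied from the table above.
ROWS = [
    ("483", "Darwin", 6733, 90.24),
    ("483", "Windows", 99095, 95.74),
    ("482", "Darwin", 213, 96.71),
    ("482", "Linux", 9387, 96.48),
    ("482", "Windows", 6000, 84.68),
]


def fail_rate(pass_rate):
    """Fail rate in percent, rounded to two decimals like the table."""
    return round(100.0 - pass_rate, 2)


def fail_rate_for(version, os_name):
    """Look up one row and return its fail rate in percent."""
    for ver, name, _total, pass_rate in ROWS:
        if (ver, name) == (version, os_name):
            return fail_rate(pass_rate)
    raise KeyError((version, os_name))


old = fail_rate_for("482", "Windows")
new = fail_rate_for("483", "Windows")
drop_points = round(old - new, 2)
print(f"Windows fail rate: {old}% -> {new}% ({drop_points} points lower)")
```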
Here is what the current error type breakout looks like:
App Version | OS | Exit Status | Error Count |
483 | Darwin | -197 (0xffffff3b) ERR_ABORTED_VIA_GUI | 4 |
483 | Darwin | -186 (0xffffff46) ERR_RESULT_DOWNLOAD | 14 |
483 | Darwin | -185 (0xffffff47) ERR_RESULT_START | 290 |
483 | Darwin | 1 Unknown error number | 11 |
483 | Darwin | 2 Unknown error number | 77 |
483 | Darwin | 4 Unknown error number | 223 |
483 | Darwin | 5 Unknown error number | 17 |
483 | Darwin | 131 (0x83) Unknown error number | 21 |
483 | Windows | -2147483645 (0x80000003) Unknown error number | 29 |
483 | Windows | -2147483641 (0x80000007) Unknown error number | 1 |
483 | Windows | -1073741819 (0xc0000005) Unknown error number | 672 |
483 | Windows | -1073741818 (0xc0000006) Unknown error number | 5 |
483 | Windows | -1073741811 (0xc000000d) Unknown error number | 935 |
483 | Windows | -1073741795 (0xc000001d) Unknown error number | 5 |
483 | Windows | -1073741794 (0xc000001e) Unknown error number | 1 |
483 | Windows | -1073741783 (0xc0000029) Unknown error number | 1 |
483 | Windows | -1073741675 (0xc0000095) Unknown error number | 1 |
483 | Windows | -1073741674 (0xc0000096) Unknown error number | 1 |
483 | Windows | -1073741515 (0xc0000135) Unknown error number | 143 |
483 | Windows | -1073741502 (0xc0000142) Unknown error number | 285 |
483 | Windows | -1073740972 (0xc0000354) Unknown error number | 3 |
483 | Windows | -1073740791 (0xc0000409) Unknown error number | 16 |
483 | Windows | -529697949 (0xe06d7363) Unknown error number | 102 |
483 | Windows | -197 (0xffffff3b) ERR_ABORTED_VIA_GUI | 154 |
483 | Windows | -187 (0xffffff45) ERR_RESULT_UPLOAD | 4 |
483 | Windows | -186 (0xffffff46) ERR_RESULT_DOWNLOAD | 598 |
483 | Windows | -185 (0xffffff47) ERR_RESULT_START | 210 |
483 | Windows | -177 (0xffffff4f) ERR_RSC_LIMIT_EXCEEDED | 1 |
483 | Windows | -164 (0xffffff5c) ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED | 39 |
483 | Windows | -1 (0xffffffff) Unknown error number | 74 |
483 | Windows | 0 | 6 |
483 | Windows | 1 Unknown error number | 806 |
483 | Windows | 3 Unknown error number | 15 |
483 | Windows | 128 (0x80) Unknown error number | 79 |
483 | Windows | 1073741845 (0x40000015) Unknown error number | 1 |
483 | Windows | 1073807364 (0x40010004) Unknown error number | 32 |
482 | Darwin | -197 (0xffffff3b) ERR_ABORTED_VIA_GUI | 2 |
482 | Darwin | -185 (0xffffff47) ERR_RESULT_START | 3 |
482 | Darwin | 1 Unknown error number | 1 |
482 | Darwin | 131 (0x83) Unknown error number | 1 |
482 | Linux | -197 (0xffffff3b) ERR_ABORTED_VIA_GUI | 16 |
482 | Linux | -186 (0xffffff46) ERR_RESULT_DOWNLOAD | 46 |
482 | Linux | -185 (0xffffff47) ERR_RESULT_START | 1 |
482 | Linux | 1 Unknown error number | 22 |
482 | Linux | 2 Unknown error number | 25 |
482 | Linux | 4 Unknown error number | 1 |
482 | Linux | 7 Unknown error number | 1 |
482 | Linux | 11 (0xb) Unknown error number | 29 |
482 | Linux | 13 (0xd) Unknown error number | 33 |
482 | Linux | 26 (0x1a) Unknown error number | 1 |
482 | Linux | 131 (0x83) Unknown error number | 154 |
482 | Linux | 139 (0x8b) Unknown error number | 1 |
482 | Windows | -1073741819 (0xc0000005) Unknown error number | 98 |
482 | Windows | -1073741811 (0xc000000d) Unknown error number | 30 |
482 | Windows | -1073741502 (0xc0000142) Unknown error number | 7 |
482 | Windows | -529697949 (0xe06d7363) Unknown error number | 7 |
482 | Windows | -197 (0xffffff3b) ERR_ABORTED_VIA_GUI | 561 |
482 | Windows | -187 (0xffffff45) ERR_RESULT_UPLOAD | 9 |
482 | Windows | -186 (0xffffff46) ERR_RESULT_DOWNLOAD | 26 |
482 | Windows | -185 (0xffffff47) ERR_RESULT_START | 4 |
482 | Windows | -177 (0xffffff4f) ERR_RSC_LIMIT_EXCEEDED | 4 |
482 | Windows | -164 (0xffffff5c) ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED | 115 |
482 | Windows | 1 Unknown error number | 57 |
481 | Linux | -186 (0xffffff46) ERR_RESULT_DOWNLOAD | 1 |
481 | Linux | 1 Unknown error number | 2 |
481 | Linux | 11 (0xb) Unknown error number | 2 |
481 | Linux | 131 (0x83) Unknown error number | 4 |
479 | Linux | 1 Unknown error number | 1 |
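Most of the big negative numbers in the Windows rows are just 32-bit Windows status codes printed as signed integers. The sketch below converts an exit status back to its unsigned hex form and looks it up in a small table. The names in the table are the standard ones from the Windows SDK headers, not something BOINC reported at the time, and the table is deliberately partial.

```python
# Partial lookup table; names come from the Windows SDK (ntstatus.h),
# except 0xE06D7363, which is the code MSVC uses for a thrown C++ exception.
KNOWN_STATUS = {
    0x80000003: "STATUS_BREAKPOINT",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000142: "STATUS_DLL_INIT_FAILED",
    0xE06D7363: "Microsoft C++ exception",
}


def to_unsigned32(exit_status):
    """Reinterpret a signed exit status as an unsigned 32-bit value."""
    return exit_status & 0xFFFFFFFF


def describe(exit_status):
    """Return '0x........ NAME' for an exit status from the table above."""
    code = to_unsigned32(exit_status)
    return f"0x{code:08x} {KNOWN_STATUS.get(code, 'Unknown error number')}"


for status in (-1073741819, -2147483645, -1073741502, -197):
    print(status, "=>", describe(status))
```

Anything not in the table, including BOINC's own small negative codes such as -197, simply falls through to the same "Unknown error number" text the server prints.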
This weekend I’m going to try and get a register dump of each thread added to the diagnostic output. Along with that I would like to get the function pointers and function parameters added to the diagnostic output.
I did manage to shrink the PDB file size for R@H down to 7MB which still seems to be a little steep for mass consumption. So maybe with the function pointers and parameters we can continue to bring down the error rates.
—– Rom
I am at something of an impasse with the 1% bug.
In order to gain ground I need to be able to see where the program is stuck on the destination machine. There are three ways to do this:
1. Have the community report which workunit stalled on their machine and attempt to reproduce it.
2. Hook up a debugger on the target machine and have the person at the keyboard create a dump file of the process.
3. Introduce a trigger into the executable so that on a certain action it causes it to dump its own backtraces.
Option one proves difficult just in managing the sheer number of workunits to look at. Roughly 550 workunits a day are being aborted or have exceeded their allotted CPU time. R@H hasn’t been able to reproduce the problem in the lab with the workunits they have looked at and are continuing to look at.
Option two doesn’t scale very well. Of all the people who are hitting this problem, only a small fraction of them know how to create a dump file with a debugger, and only a small fraction of them are willing to spend the time to compress and break the 200MB to 350MB file into smaller pieces to email them to me so I can look at them. Then of course there is only one of me and I still have all my other BOINC work to do, like fixing bugs in the 5.3.x clients so we can ship 5.4.0!
Option three didn’t hit me till Monday night. As part of the feature work we did for CPDN we introduced a way for the core client to notify the science application that it was being aborted so it could clean up after itself. Well, I completely forgot that the 5.2.x clients don’t send the abort command to the science application when I burned the midnight oil to deliver the backtrace functionality for R@H 4.94. At 4am I had the functionality working for Windows and checked it in.
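To make option three a little more concrete, here is a rough sketch of the idea in Python rather than in the BOINC API itself: when some trigger fires (in BOINC's case the abort message), the process walks every one of its own threads and writes out a backtrace for each. Everything here, including the function names, is hypothetical and only meant to show the shape of the technique.

```python
import sys
import threading
import traceback


def dump_all_threads():
    """Return a text report with one backtrace per live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append(f"*** Dump of the {names.get(ident, ident)} thread: ***")
        lines.extend(entry.rstrip() for entry in traceback.format_stack(frame))
    return "\n".join(lines)


def on_abort_requested():
    """Hypothetical hook: call this when the abort message arrives."""
    sys.stderr.write(dump_all_threads() + "\n")


# Small demo: a worker thread that is stuck waiting, like a stalled workunit.
ready = threading.Event()
stop = threading.Event()


def worker():
    ready.set()
    stop.wait()


thread = threading.Thread(target=worker, name="Worker", daemon=True)
thread.start()
ready.wait()
report = dump_all_threads()
stop.set()
thread.join()
print(report)
```

The useful property is that the report shows where each thread is sitting at the moment of the trigger, which is exactly what is missing when a workunit simply stalls.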
Fast forward to today. I went looking through the results on Ralph@Home and discovered that the backtraces were not being logged like I thought they should have been. After further investigation I realized that the 5.2.x clients were sending the quit command instead of the abort command. Talk about killing morale. I have posted in the Ralph@Home forums that people should upgrade and I’ve been seeing results come back with 5.3.28 which is good. I’m just not sure when I’ll have enough information about the bug.
We are pretty close to having 5.4 ready for public release, I believe in a week or less. But a big problem remains: typically it takes a few months for a new stable client to reach a high enough level of adoption that patterns emerge that can be tracked.
After some discussions with David Baker we are going to drop the maximum amount of time allotted for a workunit to run on a machine. That’ll keep a good chunk of the wasted CPU cycles down. I am also selling the idea of releasing the PDB file with the Rosetta application for the public project. Now granted, it is a 30MB file. Without it none of the diagnostic stuff built into the BOINC API for tracking down bugs will work. Isn’t a 30MB insurance policy for an abort or crash worth it if the project can get something useful out of it which will lead to bug fixes?
—– Rom
Results on RALPH@Home, which is R@H’s alpha project, have been very promising.
To give an idea about how large this problem was for R@H I guess I need to provide some numbers. So here goes:
R@H receives roughly 115k results a day.
Roughly there are 16k failures a day.
Of those 16k failures a day, 5.5k fell under the ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED and 0xc0000005 banner. Those are the two error codes used when something really really really bad has happened on Windows. There are another 1.5k errors that have cryptic Windows error codes which may or may not be related.
Now how does this translate to RALPH@Home? Well, if you work under the assumption that RALPH@Home is a mini R@H, then the percentages should be roughly the same.
That said, sure enough RALPH@Home had roughly the same breakdown of errors that the public project had. Here are some rough stats for RALPH@Home:
RALPH@Home receives roughly 1k results a day.
Before 4.93 was released for Beta the failure rate was 150 or so a day.
Now with 4.93 in the mix it has dropped to 100 or so a day.
Keep in mind that the Mac and Linux clients have not been updated yet and so their error rates remain unchanged.
RALPH@Home went from a 25% failure rate down to a 12% failure rate. Now if you remove the results from Linux and the Mac the failure rate for the Windows client is floating at 5%.
I’ll include the current error rates in the public project and RALPH@Home below.
Now I’m on to the next biggest problem which has been deemed the ‘1% bug’.
For those who noticed the error code 1 in the charts below, that error code is given when Rosetta could not find something in one of the pre-staged files downloaded to your machine or when the application felt something really bad had happened and it couldn’t continue. With 4.82 the actual error data was being written to a different log file than the one BOINC sends back to the server. Starting with 4.94 the reason for the application quitting will be logged and sent back to the server in a way that can be easily tracked and fixed without having to write the workunit names in the forums.
—– Rom
Public Project Results:
App Version | OS | Exit Status | Error Count |
482 | Darwin | -197 (0xffffff3b) ERR_ABORTED_VIA_GUI | 5 |
482 | Darwin | -186 (0xffffff46) ERR_RESULT_DOWNLOAD | 3 |
482 | Darwin | -185 (0xffffff47) ERR_RESULT_START | 83 |
482 | Darwin | 1 Unknown error number | 10 |
482 | Darwin | 4 Unknown error number | 135 |
482 | Darwin | 5 Unknown error number | 9 |
482 | Darwin | 6 Unknown error number | 1 |
482 | Darwin | 131 (0x83) Unknown error number | 26 |
482 | Windows | -2147483641 (0x80000007) Unknown error number | 18 |
482 | Windows | -1073741819 (0xc0000005) Unknown error number | 1797 |
482 | Windows | -1073741811 (0xc000000d) Unknown error number | 880 |
482 | Windows | -1073741795 (0xc000001d) Unknown error number | 2 |
482 | Windows | -1073741674 (0xc0000096) Unknown error number | 4 |
482 | Windows | -1073741571 (0xc00000fd) Unknown error number | 63 |
482 | Windows | -1073741515 (0xc0000135) Unknown error number | 2 |
482 | Windows | -1073741502 (0xc0000142) Unknown error number | 336 |
482 | Windows | -1073740972 (0xc0000354) Unknown error number | 2 |
482 | Windows | -529697949 (0xe06d7363) Unknown error number | 226 |
482 | Windows | -197 (0xffffff3b) ERR_ABORTED_VIA_GUI | 466 |
482 | Windows | -187 (0xffffff45) ERR_RESULT_UPLOAD | 3 |
482 | Windows | -186 (0xffffff46) ERR_RESULT_DOWNLOAD | 316 |
482 | Windows | -185 (0xffffff47) ERR_RESULT_START | 248 |
482 | Windows | -177 (0xffffff4f) ERR_RSC_LIMIT_EXCEEDED | 49 |
482 | Windows | -164 (0xffffff5c) ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED | 3761 |
482 | Windows | -1 (0xffffffff) Unknown error number | 4 |
482 | Windows | 0 | 18 |
482 | Windows | 1 Unknown error number | 1004 |
482 | Windows | 3 Unknown error number | 52 |
482 | Windows | 128 (0x80) Unknown error number | 7 |
482 | Windows | 1073807364 (0x40010004) Unknown error number | 23 |
481 | Linux | -197 (0xffffff3b) ERR_ABORTED_VIA_GUI | 7 |
481 | Linux | -186 (0xffffff46) ERR_RESULT_DOWNLOAD | 15 |
481 | Linux | -185 (0xffffff47) ERR_RESULT_START | 4 |
481 | Linux | 0 | 1 |
481 | Linux | 1 Unknown error number | 221 |
481 | Linux | 11 (0xb) Unknown error number | 25 |
481 | Linux | 26 (0x1a) Unknown error number | 2 |
481 | Linux | 131 (0x83) Unknown error number | 144 |
481 | Windows | -2147483645 (0x80000003) Unknown error number | 1 |
481 | Windows | -197 (0xffffff3b) ERR_ABORTED_VIA_GUI | 3 |
Total | 9976 |
RALPH@Home Results:
App Version | OS | Exit Status | Error Count |
493 | Windows | -1073741819 (0xffffffffc0000005) Unknown error number | 4 |
493 | Windows | -1073741811 (0xffffffffc000000d) Unknown error number | 19 |
493 | Windows | -1073741678 (0xffffffffc0000092) Unknown error number | 1 |
493 | Windows | -529697949 (0xffffffffe06d7363) Unknown error number | 5 |
493 | Windows | -197 (0xffffffffffffff3b) ERR_ABORTED_VIA_GUI | 5 |
493 | Windows | -186 (0xffffffffffffff46) ERR_RESULT_DOWNLOAD | 5 |
493 | Windows | 0 | 2 |
493 | Windows | 1 Unknown error number | 5 |
493 | Windows | 3 Unknown error number | 1 |
492 | Windows | -1073741819 (0xffffffffc0000005) Unknown error number | 3 |
491 | Windows | -197 (0xffffffffffffff3b) ERR_ABORTED_VIA_GUI | 1 |
485 | Darwin | -185 (0xffffffffffffff47) ERR_RESULT_START | 22 |
485 | Darwin | 4 Unknown error number | 6 |
485 | Darwin | 131 (0x83) Unknown error number | 1 |
484 | Linux | 11 (0xb) Unknown error number | 3 |
484 | Linux | 131 (0x83) Unknown error number | 6 |
Total | 89 |
So, as many of you probably already know, I’ve been brought onboard as a consultant with the Rosetta@Home project. A big issue they were experiencing was related to random crashes when BOINC would notify them that it was time to quit and for another application to begin.
I believe I have found and fixed this style of bug, but alas only time and testing will tell.
To understand this bug I need to explain how things work with a science application. When a science application starts and notifies BOINC that it supports graphics, three threads are created to manage what is going on.
The worker thread is the heavy lifter of the science application; it handles all the science. The majority of the memory allocations and de-allocations happen in this thread.
The graphics thread is responsible for displaying the graphics window and for hiding and showing the window at BOINC’s request.
The timer thread is responsible for processing the suspend/resume/quit/abort messages from BOINC as well as notifying BOINC of trickles.
Now when the science application received the quit request it would call the C Runtime Library function exit, which is supposed to shut down the application. Part of this shutdown operation calls the Win32 API ExitProcess. ExitProcess would let the threads continue to run while cleaning up the heap, which is a holdout for letting DLLs decrement their ref counts and unload themselves if nobody else is using them. Well, therein lies the problem: the worker thread was still running, trying to allocate and de-allocate memory from a heap that had already been freed by ExitProcess.
This in turn would cause an access violation which shows up in the log file as 0xc0000005.
Science applications now have the option of requesting a hard termination which stops all executing threads and then cleans up after the process. In essence the application calls TerminateProcess on itself. What this also means is that the application has no chance of writing any more information to a state file or checkpoint file when the BOINC API hasn’t been notified that a checkpoint is in progress. Use with care. It also means that BOINC should no longer believe that a task is invalid from a random crash.
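A loose analogy in Python, not BOINC code: a normal exit runs the registered cleanup handlers while a background thread is still busy, whereas a hard exit stops the whole process on the spot and skips cleanup entirely. Here `os._exit` plays the role that TerminateProcess plays on Windows, and the child script and helper name are invented for this sketch.

```python
import subprocess
import sys

# Child program: a busy "worker" thread plus a cleanup handler.
CHILD = """
import atexit, os, sys, threading, time

def worker():
    # Stands in for the science thread that keeps allocating memory.
    while True:
        _ = bytearray(1024)
        time.sleep(0.001)

atexit.register(lambda: print("cleanup ran"))
threading.Thread(target=worker, daemon=True).start()
time.sleep(0.05)

if sys.argv[1] == "hard":
    os._exit(0)  # immediate termination: no cleanup handlers run
sys.exit(0)      # normal exit: cleanup handlers run first
"""


def run_child(mode):
    """Run the child with 'soft' or 'hard' exit and return what it printed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout.strip()


print("soft exit output:", repr(run_child("soft")))
print("hard exit output:", repr(run_child("hard")))
```

The hard exit never prints the cleanup message, which mirrors the trade-off described above: nothing can crash during shutdown, but nothing gets a last chance to write a checkpoint either.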
I believe this will take care of quite a few ‘crash on close’ style of bugs. What was really annoying about this kind of bug is that it crashes in a different location each time. Sometimes it would crash in the timer thread and sometimes in the worker thread. A good chunk of the time the clients would report an empty call stack which doesn’t give us anything to work off of.
This style of bug would affect slower machines more than the faster machines. The bug wouldn’t surface if the timer thread could finish all the CPU instructions needed from the time exit was called to the time ExitProcess actually kills the threads in one OS thread scheduling cycle.
I think Rosetta@Home hit this bug more often than most projects because of the amount of memory it allocates while doing its thing: 150MB per process. That was just enough to get it to happen on my machine if I left it running for 10 minutes with the graphics running.
It looks like both Einstein@Home and Rosetta@Home are going to be testing this out in the next few days. I’m excited to see what this change does for the success rates of the tasks being assigned to client machines.
—– Rom
Well, tomorrow I’ll be taking a trip to the Rosetta@Home project.
They are going to be explaining how Rosetta works so I can try and help them out with the problems they are having with the BOINC interface code. I believe it’ll be a great learning experience for both Rosetta@Home and BOINC.
It seems every time we learn about a new project, there is another way of doing something that is just slightly different from any other project.
—– Rom
Occasionally I’m hit up with the question, “Why doesn’t BOINC Manager expose this feature or that feature?”
Well, quite simply, the goals for BOINC Manager are:
Now a feature that made it in and doesn’t conform to the goals above is the ability to control a remote computer. We put that in mostly to show off the GUI RPCs and provide ourselves with an easy way to make sure we didn’t take too many liberties with assuming needed files would be local. An exception would be the gui_rpc_auth.cfg file, which is in need of some help to deal with custom ports and the like.
We knew we weren’t going to be able to make everybody happy with just one way to present the UI, and each group of users presented their own challenges. For instance, somebody new to distributed computing would need to know what the various fields in the UI mean, while old hats would just consider it clutter.
We need to get to the point where we have a tweak-able UI. Something that can start out being pretty verbose in the beginning, and can be made more spartan in nature when the user doesn’t need as much hand holding.
—– Rom
Well I completed moving around all my equipment and stuff.
Back to working on BOINC.
—– Rom
So I got my hands on the Feb CTP of Windows Vista. All I can say is… WOW, nice eye candy.
I haven’t been gutsy enough to actually load it up on a real computer yet; instead I installed it on a virtual machine using Microsoft Virtual PC 2004. I did run into a few problems in the beginning: it turns out that Vista didn’t like my captured DVD-ROM, and I had to capture a virtual DVD-ROM to keep the Vista setup program from crashing. A couple of problems with running Vista under emulation are that Vista removed support for ISA devices, so it doesn’t support the emulated sound card, and of course the emulated video card doesn’t support Aero Glass.
Instructions for setting up Vista to run on Virtual PC 2004 can be found here.
The first thing I did after getting it all setup was to install BOINC (Berkeley Open Infrastructure for Network Computing) on it. There are a few things I need to do to improve the user experience of BOINC on Vista, for instance opening up the firewall so that the various components can talk to one another.
One thing I noticed is that services run in terminal services session 0 instead of the console session. This completely ruined my plans for a more secure design without the need of the local system account to display graphics for science applications. When you sign in with a real user account the console is moved to whichever session is assigned to the user logging in. I have not discovered a way to move an application window from one terminal server session to another.
My revised design up until this point was to create two user accounts for BOINC, one for boinc.exe to use and manage the science applications and the other one for the science applications themselves. The account for boinc.exe would manage the file permissions such that the contents of the BOINC directory itself would be off limits to the science applications, and the account for the science applications would be limited to the project and slots directories.
Making all of this work with the screensaver involved the screensaver asking boinc.exe, ‘Which user account is running the science applications?’, to which boinc.exe would pass back the name. The screensaver would then modify the current desktop and window station ACLs to allow that user to create and display windows within that desktop and window station. Then the science application would be informed of which window station and desktop to display itself on. Everything was fine until Vista.
We may only be able to support the single-user installation scenario for graphics on Vista until I can figure out a better way to display graphics in the screensaver.
—– Rom