Long (never ending?) work unit

Message boards : Number crunching : Long (never ending?) work unit

Profile ChertseyAl
Joined: 23 Sep 07
Posts: 16
Credit: 1,038,886
RAC: 0
Message 1673 - Posted: 11 Jul 2010, 17:10:40 UTC

This WU has been running now for 63 hours:

http://www.enigmaathome.net/result.php?resultid=15555105
http://www.enigmaathome.net/workunit.php?wuid=14533239

Shows 65 hours remaining, progress indicator resets to zero, but I am aware that this is a known bug :)

Is it worth allowing this to run? I need to switch off the host within the next day or so to relocate it, and am concerned that the WU will start from zero again as it isn't checkpointing.

Thanks.
Profile ChertseyAl
Joined: 23 Sep 07
Posts: 16
Credit: 1,038,886
RAC: 0
Message 1674 - Posted: 12 Jul 2010, 17:38:40 UTC - in response to Message 1673.  

Is it worth allowing this to run?


Anyone?

thinking_goose

Joined: 12 Nov 07
Posts: 119
Credit: 2,750,621
RAC: 0
Message 1675 - Posted: 12 Jul 2010, 23:49:00 UTC

I'd let it run. If it still hasn't finished before the time you need to move the machine, so be it. I've given up looking at the estimated times these units take to complete- I get a fairly accurate prediction by looking at the time it has taken to complete similar work units and just take it from there. So far they're only a few minutes either way.
Profile ChertseyAl
Joined: 23 Sep 07
Posts: 16
Credit: 1,038,886
RAC: 0
Message 1677 - Posted: 17 Jul 2010, 18:21:22 UTC - in response to Message 1675.  

The outcome:

The WU was still running at around 100 hours when I had to power down. On rebooting it retained its runtime, and an outstanding Spinhenge unit started to run. Pretty soon that was running far longer than expected. Rebooted again, Spinhenge unit ran normally and finished (although in 3 times the usual time). Enigma WU got stuck again and at around 104 hours reset to zero hours.

Then I remembered that BOINC had blown up about a month ago on this host (all tasks went to computation error, started accepting tasks from random projects etc).

BOINC version was 5.10.28! Upgraded to my favourite version, 5.10.45, and all seemed well, Enigma WU completed in an apparent 4 hours (more than double the usual time, and of course it had been crunching for 104 hours before that!). Still getting stuck tasks on Enigma and other projects.

Uninstalled BOINC, then installed 6.10.56. Somehow it migrated all of the outstanding WUs (I guess they don't get deleted on uninstall?). Took hours to install for some reason, kept freezing.

Seemed to be OK, but got bitten by the stupid Max CPU Time bug, sorry 'feature', and other settings weren't correctly migrated. Straightened the settings out, all apparently OK.

Except now I've got another hung Enigma WU.

Conclusion: I think the host is broken in strange and mysterious ways :/


noderaser
Joined: 24 Dec 08
Posts: 88
Credit: 1,496,863
RAC: 0
Message 1678 - Posted: 18 Jul 2010, 5:40:22 UTC

Is something else eating up all the resources, preventing BOINC from getting any to complete its tasks?
Profile ChertseyAl
Joined: 23 Sep 07
Posts: 16
Credit: 1,038,886
RAC: 0
Message 1679 - Posted: 18 Jul 2010, 18:48:55 UTC - in response to Message 1678.  

Is something else eating up all the resources, preventing BOINC from getting any to complete its tasks?


Nope.

CPU usage with BOINC suspended is no more than a few percent, and that's being consumed by the VNC connection (like most of my hosts, I only connect via VNC). With BOINC running it's a solid 100%.

Memory usage minimal when running Enigma. It's a single core machine with 1GB of RAM. The only projects it struggles with would be CosNo and ViP.

Been running EDGeS solidly for 24 hours now with no problems, and that uses a lot of memory. It's almost as if the 'lite' projects like Enigma and Spinhenge are the problematic ones.

Putting things in perspective, this host cost me 20 or 30 UKP a few years ago and no longer really does anything 'useful' on my network. All that it does now is provide a small (manual) cache of some hefty files from my NAS, and I was using the front panel USB ports as charger sockets for my mp3 player and sat nav :)

I'll play around with it for a while, if nothing else as a testbed for different BOINC versions (I'm not liking the latest stuff at all!).



Cartoonman

Joined: 9 May 09
Posts: 1
Credit: 2,027,871
RAC: 0
Message 1684 - Posted: 23 Jul 2010, 1:37:03 UTC

This sounds like the WU application is somehow corrupted and thus isn't working properly (a simple re-installation won't get rid of your project data). As far as anyone else has reported, and certainly for me, the WUs are performing fine (the only problem is that the progress indicator is highly inaccurate, though the time left to finish is easily determined from CPU time).

According to your computer stats, you're running XP, so your WU files and stats should be in your Application Data folder. You can find it in its default location at:

"C:/Documents and Settings/[your user acc with BOINC*]/Application Data/BOINC/projects/(the Enigma@home folder)" (or run a search for "BOINC"; the folder that turns up inside an Application Data folder is the one)

Once you're in the projects folder, the Enigma folder is easily discernible. Delete the entire folder, but make sure that BOINC isn't running. After you've deleted the folder, restart BOINC, let it re-download all of the necessary files and applications for Enigma, and see how the WUs run after that.

*If you allowed all users to use BOINC, it would be in the All Users folder.
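
For anyone who would rather list the candidate folders than hunt for them by hand, here is a minimal Python sketch. It assumes the XP data directory described above (the All Users variant is used as the example root), and `candidate_project_dirs` is a hypothetical helper name, not part of BOINC. It only lists the sub-folders of projects and deletes nothing, so picking out and removing the Enigma folder (with BOINC shut down) is still a manual step.

```python
from pathlib import Path


def candidate_project_dirs(data_root):
    """Return the sub-folders of <data_root>/projects, sorted by name."""
    projects = Path(data_root) / "projects"
    if not projects.is_dir():
        # Wrong root, or BOINC keeps its data somewhere else on this host.
        return []
    return sorted(p for p in projects.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Assumed location, taken from the post; adjust for a per-user install.
    root = Path("C:/Documents and Settings/All Users/Application Data/BOINC")
    for folder in candidate_project_dirs(root):
        print(folder)
```

If nothing is printed, the root is probably wrong for that install, and the search for "BOINC" suggested above is the fallback.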

Profile ChertseyAl
Joined: 23 Sep 07
Posts: 16
Credit: 1,038,886
RAC: 0
Message 1687 - Posted: 25 Jul 2010, 9:09:04 UTC - in response to Message 1684.  

Latest gripping news ... ;-)

Other projects are running intermittently slowly, and occasionally the whole of Windows just runs very slowly.

Checked all of the obvious things and ran a few diagnostics. The only obvious 'problem' was a very high CPU temperature, close to the point that the processor throttles internally. So I left ThrottleWatch running, but that didn't show anything.

Heatsink was clean, fan blowing plenty of air, so took the heatsink off, cleaned it up and refitted with fresh silver heatsink compound. Seemed to run even slower. Removed, cleaned, refitted with ceramic heatsink compound and things seemed a little better. For a while. Now it's hanging again. I'm not convinced that the junction temperature is as high as is reported, because the heatsink is barely warm, and cooling it down with freezer spray has very little effect on the processor temperature. So, that might be a red herring.

Next plan is to change the memory. I'll have a gigabyte of working memory becoming free in the next few days when I upgrade a different machine.

This machine is probably going to be scrapped shortly (I only keep the 10 best machines running BOINC, and this is one of the slowest), so I'm not that bothered. But I'd like to find out what's happening just to satisfy my own curiosity :)

Profile TJM
Project administrator
Project developer
Project scientist
Joined: 25 Aug 07
Posts: 843
Credit: 267,994,998
RAC: 0
Message 1695 - Posted: 1 Aug 2010, 12:19:21 UTC

Just for info: all the Enigma workunits have fixed lengths, so if a WU suddenly takes more than a couple of hours of CPU time, it's probably broken - unless it runs on really old hardware. I think that even a Pentium III can go through most of the workunit types in less than 12 hours.

I think that the host has some kind of hardware problem, perhaps memory errors. I've already seen a similar problem on a machine with faulty RAM: the O/S itself was stable, but most of the results returned completely random data, usually from completely different WU ranges. Probably the data got corrupted in memory and the app was randomly jumping from one setting to another; this would also explain runtimes varying from normal to tens of hours.


Profile ChertseyAl
Joined: 23 Sep 07
Posts: 16
Credit: 1,038,886
RAC: 0
Message 1699 - Posted: 3 Aug 2010, 17:49:57 UTC - in response to Message 1695.  

all the Enigma workunits have fixed lengths, so if a WU suddenly takes more than a couple of hours of CPU time, it's probably broken

I think that the host has some kind of hardware problem, perhaps memory errors.


Thanks for that. I'm pretty certain it's a memory problem now. I don't remember when or where I got the memory that's in there at the moment. I'll fit some different memory when I get around to it.

Profile ChertseyAl
Joined: 23 Sep 07
Posts: 16
Credit: 1,038,886
RAC: 0
Message 1722 - Posted: 21 Aug 2010, 17:48:54 UTC - in response to Message 1699.  

I'm pretty certain it's a memory problem now.


And indeed it was.

At some stage I'd fitted some spare PC3200 based on the Crucial scanner tool which recommended PC2700 or PC3200. Different PC3200 DIMMs showed the same problem. A friend suggested that although PC3200 was 'better', the mobo might not work well with it.

So replaced the memory with 1GB of PC2700. Every project I've run now works properly. Not tried Enigma yet though as I'm mopping up milestones on other projects ;)
elgordodude

Joined: 3 Jun 10
Posts: 9
Credit: 1,289,107
RAC: 0
Message 1890 - Posted: 5 Jan 2011, 1:21:05 UTC - in response to Message 1695.  

Just for info: all the Enigma workunits have fixed lengths, so if a WU suddenly takes more than a couple of hours of CPU time, it's probably broken - unless it runs on really old hardware. I think that even a Pentium III can go through most of the workunit types in less than 12 hours.

I think that the host has some kind of hardware problem, perhaps memory errors. I've already seen a similar problem on a machine with faulty RAM: the O/S itself was stable, but most of the results returned completely random data, usually from completely different WU ranges. Probably the data got corrupted in memory and the app was randomly jumping from one setting to another; this would also explain runtimes varying from normal to tens of hours.



Just checked in on my PIII in the corner and found an m4-pldrv64 WU that's been running for 41 hours. The host has been reliable; it even did okay with those 210s a month or two ago. For the moment it has started running some regular pldrvs in high priority and those look okay.

I don't think I've seen this type of WU before. Is it supposed to take this long, or is this the beginning of that box's swan song?

Here's the task: http://www.enigmaathome.net/result.php?resultid=20309640

Here's the host: http://www.enigmaathome.net/show_host_detail.php?hostid=34839
TenthReality

Joined: 6 Sep 09
Posts: 6
Credit: 550,574
RAC: 0
Message 1892 - Posted: 5 Jan 2011, 1:48:07 UTC
Last modified: 5 Jan 2011, 1:52:48 UTC

The pldrv64 series is taking nearly 6 hours on a host where the average time is 20 minutes. So given what you've linked for the PIII host, I don't think 41 hours is that unheard of. Is the % complete going up at all?

Before too long this unit will be returned by the linked host, for comparison:

http://www.enigmaathome.net/result.php?resultid=20678922
http://www.enigmaathome.net/show_host_detail.php?hostid=42242

Fairly long WU's for Enigma, some of the longest I've seen to date.
elgordodude

Joined: 3 Jun 10
Posts: 9
Credit: 1,289,107
RAC: 0
Message 1893 - Posted: 5 Jan 2011, 2:15:06 UTC - in response to Message 1892.  
Last modified: 5 Jan 2011, 2:20:56 UTC

Unfortunately it's a Linux box, so the progress bar swings wildly regardless of elapsed time on all tasks. Which reminds me, any news on the new Linux wrapper? Generally I don't care, but it would be useful today. Currently, it's at 73.344, but as I said that's meaningless.

The average times on that box are around 200 minutes, she is an old girl, but if you're saying 20 minutes extrapolates to 360 on your machine, then I guess I'm looking at a runtime around 3,600 minutes, or 60 hours.

So it should be around 66% done. I'll let it run and see what happens, as long as it keeps taking regular work units at high priority when they get close, worst case is it will time out on the 14th.
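
The extrapolation above, written out as a rough Python sketch. The figures are the ones quoted in the thread ("nearly 6 hours" is taken as exactly 360 minutes), and `extrapolate_minutes` is just an illustrative helper name. It assumes the slowdown factor seen on the other host carries over unchanged to the PIII, which is only a guess.

```python
def extrapolate_minutes(ref_usual, ref_long, my_usual):
    """Scale my usual runtime by the slowdown seen on the reference host."""
    return my_usual * (ref_long / ref_usual)


ref_usual = 20      # minutes: usual WU on the other host
ref_long = 6 * 60   # minutes: "nearly 6 hours" for a pldrv64, rounded up
my_usual = 200      # minutes: usual WU on the PIII

estimate = extrapolate_minutes(ref_usual, ref_long, my_usual)
estimate_hours = estimate / 60
fraction_done = 41 / estimate_hours  # 41 hours elapsed when first checked

print(f"estimated runtime: {estimate:.0f} min ({estimate_hours:.0f} h)")
print(f"fraction done after 41 h: {fraction_done:.0%}")
```

The slowdown factor works out to 18, which gives the 60 hour figure and puts the task a little over two-thirds of the way through at the 41 hour mark.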

Generally though does anyone know what's up with these super units? Like these and the pldrv210, is the code really complicated, or is there a ton of it, or both?

Should have looked closer: your task is listed as pldrv64, and those haven't been a problem. This one is really weird, because it's listed as m4-pldrv. I just downloaded some new tasks on another box that were labeled m3-pldrv as a download, but then showed pldrv as a task. Is it possible this task is corrupted given the m4 designation?
TenthReality

Joined: 6 Sep 09
Posts: 6
Credit: 550,574
RAC: 0
Message 1898 - Posted: 5 Jan 2011, 19:22:41 UTC

I did not even notice the naming difference between the two. I'm not 100% positive, but aren't the m3/m4 prefixed ones imported from the M4 project, whereas the ones without a prefix are workunits that come from our side of things? Are we just looking at 2 different things trying to crack the same long message, which would explain the similar timing? Also, right around New Year's there was a prefix renaming on new units that was a "Happy New Year" type thing and has since gone away. I'm wondering if something occurred during the name/rename situation.

Perhaps our awesome admin can chime in here to try to figure this one out.
elgordodude

Joined: 3 Jun 10
Posts: 9
Credit: 1,289,107
RAC: 0
Message 1899 - Posted: 5 Jan 2011, 23:07:06 UTC - in response to Message 1898.  

Well all is good, it completed in a hair under 57 hours for a whopping 817 credits! However, it gets weirder. I know you're on Windows, and may not be familiar with the bug, but until now I've never seen a progress bar work on Linux.

Once I started to keep an eye on it I noticed the progress bar moving at a rather consistent .010 every few minutes so I wrote down some timestamps.

41 1/2 hours - 73.344 (the figure stated earlier)
42 1/2 hours - 74.775
43 1/2 hours - 76.263
57 hours - 100

This gives a shockingly accurate estimate right around 1.75% per hour. Has anyone else seen behavior like this on a Linux machine? Especially with these work units. The app hasn't changed: I'm crunching a 3pldrv210 and a pldrv59 that are both appropriately jumping erratically.
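
For anyone who wants to check the rate, here is a small Python sketch using the readings quoted above. The finish is taken as exactly 57 hours, although it was actually "a hair under", and the helper names are only illustrative.

```python
# (elapsed hours, progress %) readings taken from this post
samples = [(41.5, 73.344), (42.5, 74.775), (43.5, 76.263), (57.0, 100.0)]


def interval_rates(points):
    """Percent per hour between each pair of consecutive readings."""
    return [
        (p2 - p1) / (t2 - t1)
        for (t1, p1), (t2, p2) in zip(points, points[1:])
    ]


def overall_rate(points):
    """Average percent per hour over the whole run, using the last reading."""
    hours, percent = points[-1]
    return percent / hours


for rate in interval_rates(samples):
    print(f"{rate:.3f} %/hour")
print(f"overall: {overall_rate(samples):.3f} %/hour")
```

The whole-run average comes out at about 1.75% per hour. The two one-hour intervals in the middle are a little slower, somewhere between 1.4 and 1.5, so the bar was steady rather than perfectly linear.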

So why did this one work? Was it intentional, is there something different about the code that caused it to happen to work, or is this a case of a thousand monkeys on a thousand typewriters?
TenthReality

Joined: 6 Sep 09
Posts: 6
Credit: 550,574
RAC: 0
Message 1900 - Posted: 6 Jan 2011, 1:41:24 UTC

The pldrv210 series appears to be taking the same length of time as the 64s, should you run into one.





Copyright © 2024 TJM