Pascal again available, Turing may be coming soon

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2958172931

RAC: 713142

6 Dec 2018 23:18:25 UTC

Message 168089

Looking at https://einsteinathome.org/host/12291110/tasks/0/0?sort=desc&order=Task+ID, I see a number of task records - some running, some 'Aborted via GUI'. Did you abort those? I'm suspicious - I once investigated and diagnosed a website problem like that.

This needs serious investigation - but not at this time of night, on this side of the pond. I'm not sure whether I'm seeing 'fallback to CPU' (as I suggested at SETI), or 'killed for running too long' (badly reported by this website), or 'reporting pseudo-progress, user lost patience' (I'll explain that one in the morning). Please record and report all available evidence.

Keith Myers

Joined: 11 Feb 11

Posts: 4964

Credit: 18740796454

RAC: 7133877

7 Dec 2018 0:08:39 UTC

Message 168090

No. I always set NNT (No New Tasks) and then ask for tasks only when I'm down to a couple on board. That means Einstein always sends me more than I can finish before the deadline, so I immediately abort enough tasks to get down to my usual 120 on board. That is the best number for me to finish tasks before deadline across all hosts.

So any that say aborted and were received within a minute of the send time are ones I aborted on purpose.

I'm coming up on 8 hours of run_time so far on the 104X task. The progress count is slowing the closer it gets to 100%. When it was at 97%, it took 40 seconds for the next 0.001% to count. Now it is taking over 3 minutes to clock the next 0.001%. Currently at 99.991%.
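To put those figures in perspective, here is a rough arithmetic sketch. It assumes the time per 0.001% step stays fixed at the latest observed value, which is optimistic given that each step has been taking longer than the last. The function name and the integer representation of progress are illustrative only.

```python
def minutes_remaining(progress_milli_pct, seconds_per_step):
    """Lower-bound estimate of the time left, in minutes.

    progress_milli_pct: progress in units of 0.001% (99.991% -> 99991).
    seconds_per_step: observed time for the counter to advance 0.001%.
    Assumes the step time does not grow any further (an assumption).
    """
    steps_left = 100_000 - progress_milli_pct
    return steps_left * seconds_per_step / 60.0


# Slowdown between the 97% observation (40 s) and the latest one (180 s).
slowdown = 180 / 40
print(slowdown)

# At 99.991% with at least 180 s per step, nine steps remain.
print(minutes_remaining(99_991, 180))
```

Because the step time is still growing, the real figure would be higher than this estimate, and a counter that keeps slowing as it nears 100% may never get there.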

Keith Myers

Joined: 11 Feb 11

Posts: 4964

Credit: 18740796454

RAC: 7133877

7 Dec 2018 0:07:10 UTC

Message 168092

If it weren't for the special case of this being the first 104X task to run on Linux on a Turing card without failing immediately, as it always has on Windows hosts, I would have aborted the task after seeing such slow progress. I don't want to change conditions until it either fails on its own or the server decides it has been running too long. I don't even want to exit BOINC, which is sometimes necessary on Seti to clear a stalled elapsed run_time counter. Also, I don't see any checkpoints written for the task. It has never stopped running since it started, so without a checkpoint the card hasn't switched to a different project, which would normally have happened after 60 minutes.

Also I see that for some reason I lost my SWAN_SYNC=1 parameter in the /etc/environment file somewhere along the way and the tasks on that host have been run with BLOCKING sync instead of SPIN sync like normal.

I will change that once the task finishes. I will also add a GPU exclude option to my cc_config file on that host so that both Einstein and GPUGrid.net stay off the 2080 card.
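For readers unfamiliar with the exclusion mentioned above, the following is a minimal sketch of what such a cc_config.xml could look like, generated with Python's standard library. It uses the BOINC client's `<exclude_gpu>` option with its `<url>`, `<device_num>` and `<type>` elements. The two project URLs are assumptions and must match the URLs the client is actually attached with. Device 0 is assumed to be the 2080, and the helper name `build_cc_config` is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def build_cc_config(project_urls, device_num, gpu_type="NVIDIA"):
    """Return cc_config.xml text with one <exclude_gpu> block per project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for url in project_urls:
        exclude = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(exclude, "url").text = url
        ET.SubElement(exclude, "device_num").text = str(device_num)
        ET.SubElement(exclude, "type").text = gpu_type
    raw = ET.tostring(root, encoding="unicode")
    # Pretty-print so the result is easy to compare with a hand-written file.
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Assumed project URLs; replace with the URLs shown by the local client.
PROJECT_URLS = ["https://einsteinathome.org/", "http://www.gpugrid.net/"]

config_text = build_cc_config(PROJECT_URLS, device_num=0)
print(config_text)
```

The client only picks up a changed cc_config.xml when it restarts or is told to re-read its config files, so the printed text would need to be saved into the BOINC data directory first.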

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117640376163

RAC: 35168324

7 Dec 2018 0:09:00 UTC

Message 168093

My impressions:-

Keith mentioned he downloaded a swag of the 0104X tasks. Having seen the poor performance, I suspect he decided to abort a bunch of them. He still has just as many left. I'm fairly confident that this was a genuine user decision rather than something suspicious.
The biggest problem is that the host in question is listed as having 4 coprocessors, one of which is the card of interest and another is a 1080Ti. Because the stderr.txt output is larger than the limit, only the trailing stuff is returned and the GPU identification is lost from the truncated bit at the start. You only see the initial information if the task fails quite early and the log to that point is within the limit. I don't know how you easily identify which device was involved in a 'ran to completion' type report. Maybe I just haven't looked hard enough.
At the time I looked, there were 5 completed and returned 0104X tasks but no way of confirming which device completed them. From Keith's earlier description, it's highly likely these were all done on the 1080Ti.
My guess is that the 'progress' Keith was reporting for the 2080 was some sort of simulated progress and that the task hasn't completed (and probably never will). The task properties should be examined to see if an initial checkpoint has actually been written. I suspect not, and in that case I suspect the task will ultimately fail with a TIME LIMIT EXCEEDED type message.

EDIT: I was just doing a response to Richard and hadn't seen Keith's reports before I posted.

@keith. Check if checkpoints are being written and if so how often. If not, continuing on with the task is a complete waste of time.

Cheers,
Gary.

Keith Myers

Joined: 11 Feb 11

Posts: 4964

Credit: 18740796454

RAC: 7133877

7 Dec 2018 0:30:18 UTC

Message 168094

This is what the task properties show for the task. So I think it is checkpointing.

104X task on 2080 on Linux

Keith Myers

Joined: 11 Feb 11

Posts: 4964

Credit: 18740796454

RAC: 7133877

7 Dec 2018 0:39:28 UTC

Message 168095

The properties showed the slot it is in, so I grabbed the stderr.txt file.

08:00:32 (108405): [normal]: This Einstein@home App was built at: Feb 15 2017 10:50:14

08:00:32 (108405): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia'.
08:00:32 (108405): [debug]: 1e+16 fp, 4.8e+09 fp/s, 2206736 s, 612h58m56s05
08:00:32 (108405): [normal]: % CPU usage: 0.100000, GPU usage: 1.000000
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0104X.dat --alpha 4.4228137297 --delta -0.0345036602638 --skyRadius 5.817760e-08 --ldiBins 15 --f0start 692.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.71528666e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah0104X_0700_954884.dat --debug 1 --device 0 -o LATeah0104X_700.0_0_0.0_954884_2_0.out
output files: 'LATeah0104X_700.0_0_0.0_954884_2_0.out' '../../projects/einstein.phys.uwm.edu/LATeah0104X_700.0_0_0.0_954884_2_0' 'LATeah0104X_700.0_0_0.0_954884_2_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0104X_700.0_0_0.0_954884_2_1'
08:00:32 (108405): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
08:00:32 (108405): [debug]: glibc version/release: 2.27/stable
08:00:32 (108405): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0x2608ba0 , 0x2608730]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce RTX 2080" by: NVIDIA Corporation
Max allocation limit: 2082553856
Global mem size: 8330215424
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0104X.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 10
% Read 1018 binary points
read_checkpoint(): Couldn't open file 'LATeah0104X_700.0_0_0.0_954884_2_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1018
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 38 df1dot: 2.71528666e-15 f1dot_start: -1e-13 f1dot_band: 1e-13
% Filling array of photon pairs

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117640376163

RAC: 35168324

7 Dec 2018 0:41:25 UTC

Message 168096 in response to message 168094

The properties page for that task should show a whole bunch of information.

One of the lines should read:-

CPU time at last checkpoint xx:yy:zz

which gives the hours, minutes, seconds after task start for when the checkpoint was written. If the fields are zeroes or blank (I forget which) then a checkpoint has never been written.

Cheers,
Gary.

Keith Myers

Joined: 11 Feb 11

Posts: 4964

Credit: 18740796454

RAC: 7133877

7 Dec 2018 0:48:19 UTC

Message 168097 in response to message 168096

Gary Roberts wrote:

The properties page for that task should show a whole bunch of information.

One of the lines should read:-

CPU time at last checkpoint xx:yy:zz

which gives the hours, minutes, seconds after task start for when the checkpoint was written. If the fields are zeroes or blank (I forget which) then a checkpoint has never been written.

The screenshot shows the checkpoint time.

CPU time since checkpoint 08:21:58

Ohh, the wording is different if it has checkpointed.

CPU time since last checkpoint

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117640376163

RAC: 35168324

7 Dec 2018 1:19:51 UTC

Message 168099

There should be three lines of 'times'.

CPU time at last checkpoint
CPU time
Elapsed time

In my haste I mis-described the first of these, which was the only one of the three I talked about. I should have said that the hours, minutes, seconds fields refer to CPU time that had been clocked up at the point the properties were being interrogated. I use older versions of BOINC - 7.2.42 and 7.6.33 so I don't know if the properties page is any different on more recent versions. It quite easily could be.

Do you have the above three lines (or their equivalents)? What are the three sets of numbers?

On another point, if the stderr snip you posted was the entire contents, it shows that the task was stuck on the first binary point (1 out of 1018 in total). If there was more and you keep scrolling down through all the binary points, you should eventually find one where a checkpoint was written. If there is no more, then no checkpoints have been written and the task is not making actual progress.

Cheers,
Gary.
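The check Gary describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a tool from the project: it looks for the `% Binary point N/M` lines seen in the posted stderr.txt and for a `*.cpt` file in the slot directory, on the assumption that the checkpoint file name ends in `.cpt` as in the `read_checkpoint()` message. The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as "% Binary point 1/1018" from the posted log.
BINARY_POINT = re.compile(r"^% Binary point (\d+)/(\d+)")


def furthest_binary_point(stderr_text):
    """Return (highest point reached, total points), or None if none logged."""
    best = None
    for line in stderr_text.splitlines():
        match = BINARY_POINT.match(line.strip())
        if match:
            point, total = int(match.group(1)), int(match.group(2))
            if best is None or point > best[0]:
                best = (point, total)
    return best


def has_checkpoint_file(slot_dir):
    """True if any *.cpt file exists in the given slot directory."""
    return any(Path(slot_dir).glob("*.cpt"))


# Excerpt in the style of the stderr.txt posted earlier in the thread.
sample = """% Sky point 1/1
% Binary point 1/1018
% Creating FFT plan.
% Filling array of photon pairs
"""
print(furthest_binary_point(sample))
```

A result that stays at the first binary point after hours of run time, together with no `.cpt` file in the slot directory, would support the conclusion that the task is not making real progress.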

Keith Myers

Joined: 11 Feb 11

Posts: 4964

Credit: 18740796454

RAC: 7133877

7 Dec 2018 3:05:35 UTC

Message 168101

Yes, after looking at the stderr.txt and seeing that it had never written a single checkpoint, I just aborted the task. That wasted 9 hours of production for Seti on the 2080.

I tried to exclude the 2080 from both Einstein and GPUGrid, but that stops all Seti CPU tasks from running, leaving only the four GPU tasks running.

What a mess. Reverted back to basic cc_config to get running normally. For now I will just have to suspend running both projects till someone can explain why excluding gpu device 0 stops all cpu tasks.
