Concept: Large WU branch of E@H?

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282700
RAC: 0

RE: You'd quickly hit

Message 76828 in response to message 76827

Quote:

You'd quickly hit personal bandwidth limits, even assuming E@H doesn't have something to detect bots (either a search spider or a crude DDoS attack) thrashing its servers and throttle you on their end. And downloading every WU's webpage, instead of just the ones that could've been updated, would be a clear case of bad implementation.

OKIE DOKIE...

peanut
Joined: 4 May 07
Posts: 162
Credit: 9644812
RAC: 0

Just for the record. I think

Just for the record, I think my biggest problem is that I have to add delays of 2 or more seconds all over the place in my AppleScripts. Each time I fetch a task page, I have to wait for my browser to load it. Like I said before, I have automated what I would do by hand. A better way of fetching individual tasks would definitely improve my scripts. I first have to fetch each of my results pages for my 400+ tasks not yet deleted by E@H (that's about 400/20 = 20 web page fetches). Then I compare them to the tasks I have already saved. Only the ones with a different state require me to re-fetch the individual task page; that generally means fetching maybe 50 or more task pages. Then I fetch the computer pages for new computers (host IDs, that is) that I have not seen before.

If I didn't need to insert so many wait states in my scripts, I would have a much faster setup. Those wait states are pure wasted time, and they are enormous when you are talking about computer instruction times. The way AppleScript talks to my AppleWorks spreadsheet is not very efficient either. Unlike Excel, where you can write macros that are relatively quick, AppleWorks doesn't have a built-in macro language; I have to write AppleScripts to do the things Excel macros would do. I think that is a big bottleneck as well. My job is definitely not a shining example of computer science. It is brute force at its worst (I do use a binary search to see if a WU is in my list already, because a linear search was WAY TOO SLOW).
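For reference, a minimal sketch of that kind of lookup in Java (the other language discussed in this thread); the IDs are made up, and the point is just that sorting once and binary-searching beats a linear scan over every saved WU:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class WuLookup {
        public static void main(String[] args) {
            // Hypothetical workunit IDs already saved locally.
            List<Long> savedWuIds = new ArrayList<Long>();
            Collections.addAll(savedWuIds, 408L, 101L, 333L, 205L);

            // Binary search requires a sorted list; sort once up front.
            Collections.sort(savedWuIds);

            // Each lookup then costs O(log n) comparisons instead of the
            // O(n) scan a linear search would pay for every task checked.
            long candidate = 333L;
            boolean alreadySaved = Collections.binarySearch(savedWuIds, candidate) >= 0;
            System.out.println(candidate + " already saved? " + alreadySaved);
        }
    }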

Hopefully, I didn't put everyone to sleep. zzzzzzzzzzzzzzzzz

And if anyone at E@H thinks this might be damaging to the servers, please speak up and I will stop. I only run my job once a day, but it does look at 70+ pages on average a night.

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282700
RAC: 0

RE: Just for the record. I

Message 76830 in response to message 76829

Quote:
Just for the record, I think my biggest problem is that I have to add delays of 2 or more seconds all over the place in my AppleScripts. Each time I fetch a task page, I have to wait for my browser to load it. Like I said before, I have automated what I would do by hand. A better way of fetching individual tasks would definitely improve my scripts.
...
If I didn't need to insert so many wait states in my scripts, I would have a much faster setup. Those wait states are pure wasted time, and they are enormous when you are talking about computer instruction times.

You mention the browser loading up, etc., etc...

Why aren't you just retrieving the web page as a plain-text I/O stream in code, parsing the stream for what you want, and then just dumping the data that way? I figure there have to be delimiters that could be used to find the pieces of data you want. I admit I've never attempted something like that in Java, but I know it is doable in other languages... That's what I was going to look into once I had more time...
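For what it's worth, a minimal Java sketch of that approach might look like this; the URL and the "Outcome" marker are illustrative stand-ins, not the actual E@H page layout:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class PageFetch {
        public static void main(String[] args) throws IOException {
            // Hypothetical task-page URL; substitute the real E@H result URL.
            URL url = new URL("http://einstein.phys.uwm.edu/workunit.php?wuid=12345");

            // Read the raw HTML as a character stream -- no browser rendering.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));
            StringBuilder page = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            in.close();

            // Crude delimiter-based parse: pull out whatever sits after a
            // known marker in the HTML (the marker here is illustrative).
            String html = page.toString();
            int start = html.indexOf("Outcome");
            if (start >= 0) {
                System.out.println(html.substring(start, Math.min(start + 80, html.length())));
            }
        }
    }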

peanut
Joined: 4 May 07
Posts: 162
Credit: 9644812
RAC: 0

RE: You mention browser

Message 76831 in response to message 76830

Quote:

You mention the browser loading up, etc., etc...

Why aren't you just retrieving the web page as a plain-text I/O stream in code, parsing the stream for what you want, and then just dumping the data that way? I figure there have to be delimiters that could be used to find the pieces of data you want. I admit I've never attempted something like that in Java, but I know it is doable in other languages... That's what I was going to look into once I had more time...

I just use the browser (Safari in my case) because it was the easiest thing I knew how to use for internet access. I have never programmed something to fetch web pages. Processing the web page would be easy; HTML and XML have delimiters like that, which are easy to parse out. I was surprised by how well AppleWorks parses the pages from E@H. All I do is "Select All" in Safari and copy it into my AppleWorks spreadsheet, and I get a nice column of cells with the page data neatly parsed out into individual cells.

If you find a way to read over the internet in Java, let me know. I would be interested in that. I suppose I could do a little Googling on the subject too. I have seen examples of Perl programs that are basically mini webservers using sockets. I suppose something like that is what I should look into for dealing with internet access more efficiently. Perl has excellent string-manipulation features too.

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282700
RAC: 0

RE: I just use the browser

Message 76832 in response to message 76831

Quote:

I just use the browser (Safari in my case) because it was the easiest thing I knew how to use for internet access. I have never programmed something to fetch web pages. Processing the web page would be easy; HTML and XML have delimiters like that, which are easy to parse out.

Retrieving the pages as text streams will cut down on the overhead of the browser having to render the document. That's probably not a lot of overhead, perhaps half a second per page, so about 200 seconds across your 400 tasks, or a bit over three minutes. The big savings come from not having to do copy/paste and other macro operations. Just being able to do the scraping this way should cut total runtime at least in half, if not more.

I've decided against the idea of splitting the task by hosts for the threads. Instead, a master list could be formed and then split into thirds or quarters to spawn 3 or 4 worker threads. I/O contention could possibly be an issue; it depends on how Java handles it. I could see breaking the list into sub-lists to isolate the threads, with each thread storing its data in its own static ArrayList / Collection. If contention over a single master result list turned out not to be a problem, that idea could be junked in favor of a single static ArrayList. I don't know for sure how I'd do it all; just some ideas rattling around upstairs... ;-)
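For what it's worth, a minimal sketch of that sublist-per-thread idea in Java, assuming a made-up parse step (each worker just records a placeholder string where the real fetch/parse would go):

    import java.util.ArrayList;
    import java.util.List;

    public class WuScraper {
        // Each worker handles the workunits in its own sublist and keeps
        // its results private, so there is no contention on a shared list.
        static class Worker extends Thread {
            private final List<Long> wuIds;
            final List<String> results = new ArrayList<String>();

            Worker(List<Long> wuIds) { this.wuIds = wuIds; }

            public void run() {
                for (long id : wuIds) {
                    // Placeholder for the real fetch/parse step.
                    results.add("parsed wu " + id);
                }
            }
        }

        public static void main(String[] args) throws InterruptedException {
            List<Long> master = new ArrayList<Long>();
            for (long i = 1; i <= 12; i++) master.add(i);

            // Split the master list into (roughly) equal quarters.
            int workers = 4;
            int chunk = (master.size() + workers - 1) / workers;
            List<Worker> pool = new ArrayList<Worker>();
            for (int i = 0; i < master.size(); i += chunk) {
                Worker w = new Worker(master.subList(i, Math.min(i + chunk, master.size())));
                pool.add(w);
                w.start();
            }

            // Merge the per-thread lists once everyone is done.
            List<String> all = new ArrayList<String>();
            for (Worker w : pool) {
                w.join();
                all.addAll(w.results);
            }
            System.out.println(all.size() + " results collected");
        }
    }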

General design ideas (a rough code sketch follows the list):

    * Hit the first result page for the user and get the (up to) 20 tasks on that page. Just get the workunit IDs and nothing else. Since you're tracking the other host(s) assigned as well, there's no need for anything else at this level.
    * If the text "next 20" shows up, grab the anchor tag that comes right before it for your next URL.
    * Repeat until "next 20" is no longer in the stream.
    * At this point you have a list of all of your workunit ids. You can then break them out and run threads or just plow through them all in a single thread.
    * You'd then need your saved data.
    * Saved data should have a "workunit complete" data element that signifies that all hosts assigned to that workunit have reported in or errored out (anything but pending/unassigned).
    * If a workunit is complete, skip.
    * If a workunit is not complete, use the workunitID to build the URL and grab the text stream and parse that stream.
    * If the workunit is now complete, mark it as such.
    * Keep going until the end of the list / sublist.
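As mentioned above, here is a rough Java sketch of that flow. Every helper below is a labeled stand-in for real fetch/parse/storage code, and the URLs are illustrative, not the actual E@H ones:

    import java.util.ArrayList;
    import java.util.List;

    public class ResultWalker {
        public static void main(String[] args) throws Exception {
            // Hypothetical first "Your results" URL; the real one carries your user ID.
            String url = "http://einstein.phys.uwm.edu/results.php?userid=XXXX";
            List<Long> wuIds = new ArrayList<Long>();

            // Walk the paged result list by following the "next 20" link.
            while (url != null) {
                String html = fetchPage(url);           // see the text-stream example above
                wuIds.addAll(extractWorkunitIds(html)); // pull wuid=... out of the anchors
                url = nextPageUrl(html);                // null once "next 20" is absent
            }

            // Re-check only the workunits not already marked complete.
            for (long id : wuIds) {
                if (savedAsComplete(id)) continue; // skip finished ones
                String html = fetchPage("http://einstein.phys.uwm.edu/workunit.php?wuid=" + id);
                if (allHostsReported(html)) markComplete(id);
            }
        }

        // Stand-ins for the real fetch/parse/storage code.
        static String fetchPage(String url) { return ""; }
        static List<Long> extractWorkunitIds(String html) { return new ArrayList<Long>(); }
        static String nextPageUrl(String html) { return null; }
        static boolean savedAsComplete(long id) { return false; }
        static boolean allHostsReported(String html) { return false; }
        static void markComplete(long id) { }
    }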

This method is very thorough. You could cut back on the number of page hits if you only looked at results when your specific resultID is granted credit or has an error condition, but then you won't know what's going on with a given task until toward the end of its life cycle with respect to your machines.

One other data problem I see is that newer versions of host merging allow you to merge a computer by name. For example, suppose someone calls their AM2-based X2 6000 system "myPC", then goes out and gets a Phenom processor and drops it into their compatible AM2+ motherboard. If they didn't do a merge, you'd see this as a "new host". However, if they did a merge because the name of the computer was the same, then unless you're actively checking the host details, you'd never know the system just changed from 2 CPUs to 4 (assuming that statistic means the number of physical cores)... This new capability isn't installed here at E@H, but it is up and running at SETI. I would guess that E@H will get it with a server code upgrade at some point...

Oh, and if you're storing this in local spreadsheets, you'd probably want some sort of "aging" capability, where you could export to a historical archive periodically. However, since MySQL is free, you might as well set up a database in it. You may even have one available in your hosting account... ;-)
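If the database route looks appealing, a minimal JDBC sketch like this would do the inserts; the database name, table, columns, and credentials are all made up, and MySQL Connector/J would need to be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class WuArchive {
        public static void main(String[] args) throws Exception {
            // Older Connector/J versions need the driver registered explicitly.
            Class.forName("com.mysql.jdbc.Driver");

            // Hypothetical local database; adjust URL and credentials to taste.
            Connection db = DriverManager.getConnection(
                    "jdbc:mysql://localhost/eah_stats", "user", "password");

            PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO workunits (wu_id, state, fetched_on) VALUES (?, ?, CURDATE())");
            insert.setLong(1, 12345L);
            insert.setString(2, "complete");
            insert.executeUpdate();

            insert.close();
            db.close();
        }
    }

Aging then becomes a periodic INSERT ... SELECT into an archive table instead of a spreadsheet export.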

peanut
Joined: 4 May 07
Posts: 162
Credit: 9644812
RAC: 0

I have MySQL installed and

I have MySQL installed and running. Most spreadsheets eventually hit a wall with big data sets, so eventually I will have to migrate from a spreadsheet to a true database, if my computers don't fry first from being left on all the time. But hey, that would just force me to buy the latest and greatest off-the-shelf box out there when my systems do fry. The new Mac Pros are going to be interesting to watch when they appear on BOINC projects.

I will probably eventually do a total redesign of the scripts we have been discussing here. Or I may just get sick of keeping track of things and just be happy with the stats provided. For now, the scripts do what I want; they're just slow. Off to work I go...

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282700
RAC: 0

RE: I have MySQL installed

Message 76834 in response to message 76833

Quote:

I have MySQL installed and running. Most spreadsheets eventually hit a wall with big data sets, so eventually I will have to migrate from a spreadsheet to a true database, if my computers don't fry first from being left on all the time. But hey, that would just force me to buy the latest and greatest off-the-shelf box out there when my systems do fry. The new Mac Pros are going to be interesting to watch when they appear on BOINC projects.

I will probably eventually do a total redesign of the scripts we have been discussing here. Or I may just get sick of keeping track of things and just be happy with the stats provided. For now, the scripts do what I want; they're just slow. Off to work I go...

Well, I pondered something just now and see a potential problem with streaming the task list in code: the "Your Results" button functionality is tied to being actively logged in. I'm assuming this is just a cookie, and thus it may require pulling information from the cookie to satisfy the login requirements. Not sure; I'd have to cross that bridge when I got to it...
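If it does turn out to be just a cookie, a sketch along these lines might cover it; the cookie name and value here are placeholders, since the actual E@H login scheme hasn't been checked:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class LoggedInFetch {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://einstein.phys.uwm.edu/results.php?userid=XXXX");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();

            // Replay the session cookie the browser received at login.
            // The name/value are placeholders, not the real E@H scheme.
            conn.setRequestProperty("Cookie", "auth=YOUR_SESSION_TOKEN");

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
        }
    }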

Also, Dan's statement about bulk importing being the wrong way to do it was correct, but I was in a grumpy mood at the time, in case it wasn't obvious... I still think I'm right about threading being faster, though...

peanut
Joined: 4 May 07
Posts: 162
Credit: 9644812
RAC: 0

For anyone interested, here

For anyone interested, here is my main AppleScript file. I am not sure how well it will view through a browser, but here it 'tis. It is a Mac text file; I saved it with TextEdit.

myapplescript driver.txt

There are very few comments in it, since I never figured on anyone else looking at it. And as I said before, my domain-to-IP mapping changes randomly, so if it doesn't load quickly, my site is temporarily inaccessible. Static IPs are not something I am willing to pay for.

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 197

RE: Also, Dan's statement

Message 76836 in response to message 76834

Quote:

Also, Dan's statement about bulk importing being the wrong way to do it was correct, but I was in a grumpy mood at the time, in case it wasn't obvious... I still think I'm right about threading being faster, though...

Honestly, I wasn't in the most cheerful mood either. Threading might get some speedup, but I still say the right initial approach is to unsnarl the main bottleneck, which from glancing at the script's source (it displayed well) seems to be that it's using scripting to drive the input into the spreadsheet instead of writing a CSV and then loading it. Having to insert manual wait states isn't going to help it any, either. There might be other bottlenecks, but writing the spreadsheet data programmatically as a CSV instead of entering it via scripting the GUI should show a good return in and of itself.
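To illustrate Dan's point, a minimal Java sketch of the write-a-CSV-then-import approach; the file name and columns are made up:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class CsvDump {
        public static void main(String[] args) throws IOException {
            // Write the scraped rows in one pass; the spreadsheet (or MySQL's
            // LOAD DATA INFILE) can then import the whole file at once, with
            // no cell-by-cell GUI scripting and no wait states.
            PrintWriter out = new PrintWriter(new FileWriter("workunits.csv"));
            out.println("wu_id,state,host_id"); // header row (columns are illustrative)
            out.println("12345,complete,98765");
            out.println("12346,pending,98766");
            out.close();
        }
    }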

While not directly comparable, I've done number crunching in Excel/VBA and in C#, with the latter being 3+ orders of magnitude faster for the same algorithm, and that was after doing a lot of Excel-specific tuning that got roughly a 4-5x speedup in and of itself. The actual margin might be even larger, since I never timed the C# end more precisely than ~1 s between clicking start and the GUI reporting it was done.

Arion
Joined: 20 Mar 05
Posts: 147
Credit: 1626747
RAC: 0

RE: As a side question:

Message 76837 in response to message 76819

Quote:
As a side question: Anyone know of a project that has monster WUs? I'd like something that takes about a day to complete on the newer Intel Core 2 Quads or Duos.

Climate Prediction is really good. It may take 2 or 3 months to finish a workunit, but it sends up trickles every day, and the daily value remains pretty much constant until the whole unit is complete. As an aside: I have an AMD 3700+ with 1 GB of RAM running CPDN, and I am averaging about 350-375 credits a day on that system. I also have an AMD X2 3800 with 1 GB of RAM that is running about 600-640 credits a day. Trickles are computed once per day for your daily total. If you're looking for something where you pretty much know exactly what your daily credits are (and you don't have to wait for verification by a wingman), then this is a good alternative. NOTE: with 2+ GB of RAM you get bigger units that may take considerably longer to complete, possibly 4 months on my X2s.

I switched to CPDN in June of last year, when they were having so much trouble here with the AMDs. While all 3 systems were running CPDN, I ran up 300,000 credits from June '07 until December '07. My main system, an X2 4800 with 2 GB, is back on Einstein and SETI, and I'm looking to convert the X2 3800 back to Einstein and SETI as soon as both workunits are finished. I'm going to leave my 3700+ dedicated to CPDN. My biggest reason for changing them back is that I was with SETI and Einstein for a long time and prefer the science those projects are doing. I figure I've contributed to the "fad" project on climate change enough, with all three systems, to make myself feel I've done something important. Leaving the 3700+ to continue at CPDN will be my lasting contribution to that project while I come home and attend to my preferred science projects.

I'm assuming that with your Core Duo system(s), your times and credits would be considerably higher than mine.

HTH

Arion

All three systems here are 24/7
