GPU Version Planned?

Message boards : Number crunching : GPU Version Planned?

To post messages, you must log in.

AuthorMessage
nhojpatrick

Send message
Joined: 17 Mar 10
Posts: 1
Credit: 1,246,012
RAC: 0
Message 2478 - Posted: 15 Mar 2013, 11:23:45 UTC

Any plans for a GPU version of the application?

Not sure if the current algorithm would be easily portable to be executed on a gpu. If you want me to take a look let me know.
ID: 2478 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile TJM
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 25 Aug 07
Posts: 843
Credit: 70,812,678
RAC: 391,480
Message 2479 - Posted: 15 Mar 2013, 12:11:10 UTC - in response to Message 2478.  
Last modified: 15 Mar 2013, 12:12:25 UTC

If you'd like to take a look, the sources are available at http://www.bytereef.org (enigma-suite).
However, the algorithm is rather complex and I doubt it could be easily tweaked to run on GPU.
M4 Project homepage
M4 Project wiki
ID: 2479 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Chromatix

Send message
Joined: 8 Feb 09
Posts: 10
Credit: 996,969
RAC: 0
Message 2485 - Posted: 16 Mar 2013, 13:05:37 UTC

I've intermittently looked at making a GPU version of this. Each time, however, I did not seriously consider directly porting the current algorithm, which as you suspected is not easily adaptable to GPUs.

A brief summary of the current algorithm is: for each rotor/ring combination, do a hill-climb on the steckerboard. The outer loop is highly parallellisable, which is well-suited to dividing work between computers using BOINC. The inner loop is almost completely serial, which makes it an extremely poor fit for a GPU. The only practical optimisation would be to run multiple hill-climbs on the same rotor combination in parallel.

The alternative I was thinking of is to turn the algorithm inside out - put the hill-climbing algorithm on the outside, and the rotor combinations on the inner loop. This makes the inner loop amenable to GPU acceleration, allowing much higher performance per computer. In fact, this is broadly similar to how the original, electromechanical Bombe worked.

The hill-climbing algorithm would then need to be adapted to use the fast rotor-running capability effectively, especially if we still want to make use of more than one computer. Since each complete rotor run would take on the order of several seconds with GPU assistance, the algorithm can put a relatively large amount of CPU effort into deciding which steckerboard combination to try next, without impeding the GPU's work in the slightest.

This allows it to maintain lists of combinations already tried, worth trying soon, and queued to try - and hand the latter out to the GPU (and/or other computers) as required. Separate lists should be maintained for each selection of rotors (I-V, I-VIII, I-VIII), which will help to minimise random noise effects. With this sophisticated state record, the outer algorithm would be able to try multiple search strategies simultaneously, and avoid duplicating work - luxuries which are not available on the inner loop.

This is all the more relevant now that the Rasch message has been analysed and found to be relatively resistant to the basic hill-climbing attack (even without the transcription errors). A more powerful hill-climbing algorithm might have been able to break this message sooner, even without the benefit of GPU acceleration, because it might have been better able to recover from the misleading IC metrics. More thorough analysis of the stecker space around the Rasch message would suggest specific strategies that would help with such recalcitrant messages.

There remains, of course, the challenge of actually rewriting the code in this new form. I did start on it several times in the past, but some other project became more urgent and I then forgot about it. I believe at the last attempt, I had rewritten the rotor run code on a more optimised pattern, still in C, but was having trouble getting correct results out of it. A fresh start might reveal better results.
ID: 2485 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile TJM
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 25 Aug 07
Posts: 843
Credit: 70,812,678
RAC: 391,480
Message 2486 - Posted: 16 Mar 2013, 19:39:27 UTC - in response to Message 2485.  

Actually there is another possibilty.

The current setup is like this:

Work unit -> a "block" of rings + message key range, each work unit is a single hillclimb pass over this range. There is possibility to run "n" passes, but it's unused on CPU due to long execution time, especially for M4.

The target number of restarts is done by sending the same workunit "n" times to the client.



I think that the GPU could take a range of rings + message keys, but instead of running 1 pass, it could run "n" passes simultaneously, store the best result, iterate to next rings/message key step, repeat the process and compare the result again. Reasonable work unit lenght would be eventually achieved by tweaking the range of ring settings (or worst case - message key).
It's the number of restarts that matters, I believe that with enough restarts everything that's not seriously garbled could be broken. From the server point of view it does not matter at all if one client does n-passes over the same WU, or n-clients run 1 pass.


M4 Project homepage
M4 Project wiki
ID: 2486 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote

Message boards : Number crunching : GPU Version Planned?




Copyright © 2017 TJM