## Queen of the Races

The Queens N puzzle has lots of opportunity for parallelism, so I thought I’d see what my dual core Windows system could do. Using the code from the previous post, I got pretty much what I’d expect: two cores, two threads is as good as it gets; adding more threads doesn’t do any good if there aren’t any cores to support them.

724 solution(s) : 1 threads : 15 seconds
724 solution(s) : 2 threads : 9 seconds
724 solution(s) : 3 threads : 10 seconds
724 solution(s) : 4 threads : 10 seconds
724 solution(s) : 5 threads : 9 seconds

_____________________________________________

The code is simple:

• Two job queues: pending and solutions
– The thread reads from the pending queue and solves a queens n puzzle, starting at the second row (thus multiple threads are solving different puzzles).
– Write the solution to the solutions queue.
– Repeat
• Write N puzzles to the pending queue.
• Wait for N solutions to show up in the solutions queue (each “solution” is a list of zero or more solutions for the starting board position).
• Print some stats
• Start another thread and repeat

While it is possible, with this code, for one thread to do all the work (by hogging the input queue), that is unlikely (the thread would have to solve a puzzle really fast to read the next puzzle before another thread got it).

N   := 10;              // a 10x10 board has 724 solutions
pin := Thread.Pipe();   // pending puzzles
pout:= Thread.Pipe();   // solutions
numThreads := 0;
do(5){  // test with one to five threads
   fcn(pin,pout){ // create a puzzle solving thread
      foreach job in (pin){ pout.write(job()); }
   }.launch(pin,pout);
   numThreads += 1;

   time := Time.Clock.time;
   foreach c in ([1..N]){ pin.write(queensN.fp(N,2,T(T(1,c)))); }
   Atomic.waitFor(fcn(len,n){ len() == n }.fp(pout.len,N));
   pout.close();
   queens := pout.walker().cat();

   println("%s solution(s) : %s threads : %s seconds".fmt(
      queens.len(), numThreads, Time.Clock.time - time));
   pout.open();         // reopen output queue
}
pin.close();    // let the threads exit
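The same two-queue pattern can be sketched in Python, with queue.Queue standing in for zkl’s Thread.Pipe. This is a sketch of the pattern only; the jobs here are trivial stand-ins for partially solved puzzles, and the names are mine, not zkl’s:

```python
import queue
import threading

def worker(pending, solutions):
    """Read jobs from the pending queue until the sentinel, run each,
    and write the result to the solutions queue."""
    while True:
        job = pending.get()
        if job is None:              # sentinel: let the thread exit
            break
        solutions.put(job())

pending, solutions = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(pending, solutions))
           for _ in range(2)]
for t in threads:
    t.start()

N = 10
for c in range(1, N + 1):            # one job per first-row column
    pending.put(lambda c=c: c * c)   # stand-in for solving a sub-puzzle
results = sorted(solutions.get() for _ in range(N))

for _ in threads:                    # stop the workers
    pending.put(None)
for t in threads:
    t.join()
print(results)
```

As in the zkl version, it is possible for one thread to hog the pending queue, but in practice the workers interleave.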

## Queen for a day

I was reading the Pure manual (“Pure is a modern-style functional programming language based on term rewriting”) trying to learn the basics of functional programming (which, for some reason, I find very difficult) and saw this nice solution to the Queens N puzzle (the Wikipedia article has a nice animation of the backtracking algorithm):

queens n = search n 1 [] with
  search n i p = [reverse p] if i>n;
               = cat [search n (i+1) ((i,j):p)
                      | j = 1..n; safe (i,j) p];
  safe (i,j) p = ~any (check (i,j)) p;
  check (i1,j1) (i2,j2) = i1==i2 || j1==j2 ||
                          i1+j1==i2+j2 || i1-j1==i2-j2;
end;

• Put a queen in the first row
• Put a queen in the next row, where it isn’t attacked
• Continue until there is a queen on every row
• Otherwise, back up one row and try the next column

For example, here is a failure on a 4×4 board (BTW, 4×4 is the smallest size with a solution (there are 2), other than a 1×1 board):

For the first row, there can be a queen in every column, I’ll pick column 1 (1,1).
For the second row, there are two possibilities: (2,3) and (2,4). If (2,3) is used, all of row 3 is under attack so there is no solution using these two positions. We would then try (2,4) (which also isn’t in a solution). Then back up to the first row and try a queen in the second column (which does have a solution).

Ok, can I duplicate this algorithm in zkl? Well, no, I don’t have list comprehensions,  so I have to figure out how to fake around that.

But first, some simple things:
A queen attacks a square if they are on same row, column or diagonal. The diagonal is the interesting part: A diagonal is 45º so the line formula y = mx+b is y = ±1x + b –> y ± x = b. If a queen at (u,v) is on the same line then u ± v = b –> y ± x = u ± v, which is the “checks” code.
A square is “safe” if it isn’t being attacked; for us, that means checking a square against every queen to see if it is being attacked and, if it isn’t, it is safe. My code (in zkl) is:
not qs.filter1(isAttacked,a,b)
where qs is a list of queens in (row,col) pairs (eg L(L(1,1),L(2,3)) for two queens), a,b is the square in question and isAttacked is a function of three parameters: a queen pair, an x and a y and returns True if the queen attacks the square (kinda bass-ackwards but fits the data formats). So, anyway, filter1 checks each queen in qs to see if it attacks (a,b) (and stops at the first positive (the 1 in filter1)).

Now the hard part, the descent. What I do is build a row by putting a queen on every safe square:
[1..N].filter(isSafe.fpM("101",row,queens))
   .apply(fcn(c,r,qs){ qs+T(r,c) },row,queens);

The first line generates a list of safe squares by column and the second line converts the column into a (row,column) pair and adds it to the list of queens already on the board. From the above picture, the second row would be: L(L(1,1),L(2,3)) and L(L(1,1),L(2,4)). Filter removes all the unsafe columns and isSafe.fpM builds a version of isSafe (call it isSafe′) that takes one parameter, the column. This is partial function application; I “fix” the first and third parameters so that a call to isSafe′(1) is actually isSafe(row,1,queens). (I have to do this because lambda functions are not lexically scoped and thus don’t see variables that are next to them.)

If the row I just built is the last row, it is a solution and I’m done. Otherwise, I have to look at the next row:
qs.apply(self.fcn.fp(N,row+1))
Here, for each set of queens, I recurse to the next row. Again, .fp builds a new function such that f’(queens) is actually f(N,row+1,queens). This results in a list of [partial] solutions and empty lists if there isn’t a solution.

Now that I have a list (of lists) of solutions (and empty lists for non-solutions) for each set of queens, distill that down to a single list of solutions and backtrack:
.flatten()
And we are done.

fcn isAttacked(q,a,b){
   x,y := q;
   (x==a or y==b or x+y==a+b or x-y==a-b)
}
fcn isSafe(a,b,qs){ (not qs.filter1(isAttacked,a,b)) }
fcn queensN(N=8,row=1,queens=T){
   qs := [1..N].filter(isSafe.fpM("101",row,queens))
         .apply(fcn(c,r,qs){ qs+T(r,c) },row,queens);
   if (row == N) return(qs);
   return(qs.apply(self.fcn.fp(N,row+1)).flatten());
}

queens := queensN(4);
println(queens.len()," solution(s):");
queens.apply2(Console.println);

2 solution(s):
L(L(1,2),L(2,4),L(3,1),L(4,3))
L(L(1,3),L(2,1),L(3,4),L(4,2))
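For comparison, here is my reading of the same algorithm in Python, where list comprehensions do the work of filter/apply/flatten. A sketch, not the zkl code; the names are Python-ified:

```python
def is_attacked(q, a, b):
    """True if queen q (a (row,col) pair) attacks square (a,b)."""
    x, y = q
    return x == a or y == b or x + y == a + b or x - y == a - b

def is_safe(a, b, qs):
    """Square (a,b) is safe if no queen in qs attacks it."""
    return not any(is_attacked(q, a, b) for q in qs)

def queens_n(n=8, row=1, queens=()):
    # extend each partial board with every safe square in this row
    qs = [queens + ((row, c),) for c in range(1, n + 1)
          if is_safe(row, c, queens)]
    if row == n:
        return qs                      # full boards are solutions
    # recurse into the next row and flatten the results
    return [s for q in qs for s in queens_n(n, row + 1, q)]

print(len(queens_n(4)))   # 2
```

The comprehension in queens_n also makes it easy to check the 4×4 walkthrough above: no solution of queens_n(4) starts with a queen at (1,1).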

___________________________________________________________________

Here is the second solution. After the first row, there is only one possibility for each subsequent row.

____________________________________________________________________________

What if we want only one solution? Again, borrowing from the Pure examples, it is easiest (and simpler and way faster) just to avoid the backtracking headaches altogether by throwing an exception with the first solution:

fcn queensN1(N=8,row=1,queens=T){
   if (row > N) throw(Exception.Msg(queens));
   [1..N].filter(isSafe.fpM("101",row,queens))
      .apply(fcn(c,r,qs){ qs+T(r,c) },row,queens)
      .apply2(self.fcn.fp(N,row+1));
}
reg queens;
// the thrown Msg carries the solution as its payload
try { queensN1() } catch { queens = __exception.payload; }
println("A solution: ",queens);

A solution: L(L(1,1),L(2,5),L(3,8),L(4,6),L(5,3),L(6,7),L(7,2),L(8,4))

(apply2() is the same as apply but throws away the results, which we don’t care about as we are just using apply to loop/recurse over the sets of queens)
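The same escape-with-an-exception trick reads naturally in Python. This is a sketch with my own names; the Found exception class and the helpers are Python stand-ins, not zkl’s:

```python
class Found(Exception):
    """Carries the first complete board out of the recursion."""
    def __init__(self, queens):
        self.queens = queens

def is_attacked(q, a, b):
    x, y = q
    return x == a or y == b or x + y == a + b or x - y == a - b

def is_safe(a, b, qs):
    return not any(is_attacked(q, a, b) for q in qs)

def queens_n1(n=8, row=1, queens=()):
    if row > n:
        raise Found(queens)          # first solution: bail all the way out
    for c in range(1, n + 1):
        if is_safe(row, c, queens):
            queens_n1(n, row + 1, queens + ((row, c),))

def first_solution(n=8):
    try:
        queens_n1(n)
    except Found as e:
        return e.queens
    return None                      # no solution for this board size

print(first_solution(8))
```

No backtracking bookkeeping at all: the exception unwinds every pending stack frame in one go.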

## A threaded web server in 60 lines

I wrote this toy web server to play with sockets and threads. The web pages are actually in the code, just to make it easy (and self contained). One thread listens for incoming requests and forwards the request to a thread pool. The worker threads read from a request queue, the thread whose read succeeds gets the job. It then processes the request, writes the response and goes back to reading the request queue.

I added a CGI form to play with blowing apart a CGI query string into a hash table (2 lines but doesn’t handle all character encodings) and giving the threads a little more to do.

You can download the source from http://zenkinetic.com/Documents/httpServer.zkl (use Save As, otherwise the embedded HTML will really mess things up).

const PORT = 8080;

/* ******************************************************** */
/* ********************** The Server ********************** */
/* ******************************************************** */

// A class to process requests from clients (eg browsers)
//
class Servlet {
   fcn init(jobPipe) { self.launch(jobPipe); }
   fcn liftoff(jobPipe) {
      while(1) {    // read request, write response, repeat
         socket := jobPipe.read();   // wait for a client connection
         if (socket.wait(60) != 1)   // what is Chrome doing?
            { socket.close(); continue; }
         request := socket.read();
         try { processRequest(request,socket); } catch {}
      }
   }
   fcn splashdown(h,e) { println("Servlet died before its time"); }
}

// map requested page to fcn
var getMap = D("/",homePage, "/favicon.ico",favicon,
               "/testPage",testPage);

fcn processRequest(request,socket) {
   println("vvvvvvvvv ",vm,"\n",request.text,"\n^^^^^^^^^^^");
   req := request.text.split("\r\n");
   method,page := req[0].split();

   response := "";
   switch(method){      // GET, HEAD, POST, etc
      case("GET") {
         response =
            ( if (n:=page.find("?")) cgi(page[n+1,*]) else
              getMap.find(page,homePage)() );
      }
      //   else do something
   }
   socket.write(response); socket.close();    // no Keep-Alive
}

//////////////// Start the server ////////////////////

var jobPipe = Thread.Pipe();    // a queue of requests

// Start a pool of Servlets to service requests
do(4){ Servlet(jobPipe) }   // pool size is arbitrary

// Create the HTTP server listen socket
// Sits here forever passing client HTTP connects to Servlets
serverSocket := Network.TCPServerSocket.open(PORT);
println("HTTP server started at port ",PORT);
serverSocket.listen(fcn(socket) { jobPipe.write(socket); });

/* ******************************************************** */
/* ********************* The Web Site ********************* */
/* ******************************************************** */

fcn cgi(queryString) {   // GET /?value=8&fact=calc HTTP/1.1
   args := queryString.replace("+"," ");
   args  = args.split("&").apply("split","=").toDictionary();
   try {
      x := BigNum(args["value"]);
      if (x > 0) return(homePage(x,"%,d".fmt(factTail(x))));
   } catch {}
   homePage();
}
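The two-line query-string split reads much the same in Python. Like the original, this hand-rolled sketch handles “+” but ignores other character encodings (%-escapes):

```python
def parse_query(query_string):
    """Blow apart a CGI query string into a dict.
    "value=8&fact=calc" -> {"value": "8", "fact": "calc"}"""
    args = query_string.replace("+", " ")
    return dict(pair.split("=", 1) for pair in args.split("&"))

print(parse_query("value=8&fact=calc"))
```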

## Compiling in parallel and deferred threads

There is a lot of parallelism in compilers. I thread tokenizing, parsing and some other stuff. I tried compiling each function in a thread (ie one thread per function) but that overwhelmed my two core system and was slower than a single thread. My next attempt at parallelizing the compile phase was to split the convert-the-parse-tree-to-asm-code and convert-asm-code-to-VM-instructions stages. Functions are the only objects that actually contain runnable code so it was pretty easy to do the split. Another observation is that there isn’t always code that needs assembling so it would be nice to avoid creating a thread in that case (threads aren’t cheap).

Performance: This was good for about a 15% speed up (AMD dual core). The asm thread appears to be easily able to keep up with the compiler and probably should be given more work to do.

The first order of business is to create a thread class that will assemble code. And only create that thread if something needs to be assembled. The thread works by sucking in functions (plus some supporting info) to assemble from an inter-thread pipe, calling the assembler and writing the results to another pipe (so they can be consumed by another thread and squished into the object being compiled):

class FcnAsmThread {
   var in = Thread.Pipe(), out = Thread.Pipe();
   fcn init {     // --> Deferred Thread/Pipe, Pipe
      // Only start thread if actually need it
      returnClass(
         T(Deferred(fcn{ self.launch(); in }),out));
   }
   // asm a fcn and write it to out in a thread
   fcn liftoff {
      foreach pFcn,vmClass,cblock,defaultArgs,index in (in) {
         reg names = T(pFcn.name,pFcn.argNames.xplode());
         // .finish() calls Asm.asm()
         reg func = self.fcn.embryo(names,vmClass,
                       cblock.finish(),defaultArgs,
                       pFcn.isPrivate);
         // Classes are not thread safe, so pass the results back
         out.write(T(vmClass,func,index));
      }
      out.close();
   }
}

Crib sheet:

• Creating a new instance of this class doesn’t return a reference to itself (self), it returns two pipes. Thus, the new class is “cloaked” and the only way to access it is via the pipes. returnClass is used to override the normal return mechanism.
• The input pipe isn’t really a pipe but a Deferred object, which, when first referenced (as in in.write()), causes the contained function to run, which, in this case, launches the thread and returns the pipe, and the original operation (write) is done. Subsequent references will get the pipe.
• T() creates a read only list.
• A call to launch() creates the thread and the new thread calls the liftoff function when it starts running.
• fcn.embryo() creates an “embryonic” function that is ready to be inserted into a Class.
• A foreach loop on a pipe reads from the pipe until it is closed.
• Exiting liftoff stops the thread.
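The cloaked, lazily started thread can be sketched in Python: the caller gets only the two queues, and the worker thread is created on the first write. All names here are mine, and the “assembly” is a stand-in:

```python
import queue
import threading

def make_asm_thread():
    """Return (in_pipe, out_pipe); the worker starts only when used."""
    in_q, out_q = queue.Queue(), queue.Queue()
    started = threading.Event()

    def liftoff():
        while True:
            job = in_q.get()
            if job is None:                 # pipe closed: exit thread
                break
            out_q.put(("assembled", job))   # stand-in for Asm.asm()
        out_q.put(None)                     # signal out pipe closed

    class DeferredPipe:
        """Launch the thread on first write, then act like the queue."""
        def write(self, job):
            if not started.is_set():
                started.set()
                threading.Thread(target=liftoff, daemon=True).start()
            in_q.put(job)
        def close(self):
            if started.is_set():            # don't start it just to stop it
                in_q.put(None)
        @property
        def was_started(self):
            return started.is_set()

    return DeferredPipe(), out_q

pin, pout = make_asm_thread()
assert not pin.was_started      # nothing used it yet: no thread exists
pin.write("fcn1")
pin.write("fcn2")
pin.close()
results = []
while (item := pout.get()) is not None:
    results.append(item)
print(results)
```

Note the was_started check mirrors the .BaseClass.toBool() trick below: closing must not accidentally start the thread.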

________________________________

The thread is readied at the start of a file (or chunk of text) compilation. This is a bit awkward as the compiler is written as a reentrant state machine, so there needs to be some deduction as to when to fire off the thread. Since a file is encapsulated as a class (a “RootClass”), do it when one (and there can be only one) of those is encountered. If a source code error is encountered, the thread needs to be shut down, but care needs to be taken so that, if the thread hasn’t started, attempting to stop it doesn’t start it.

fcn compileClass {
…
reg conBlock,in,out;
if (pClass.isInstanceOf(Parser.RootClass)) {
   // Create a thread to asm the compiled fcns
   // in is a Deferred Thread/Pipe, out is a Pipe
   in,out = FcnAsmThread(); pClass.asmPipe = in;
   onExit(fcn(in,out){
      // all done compiling this file
      if (in.BaseClass.toBool()) { // thread was started
         if (vm.xxception) in.clear();  // error: flush the pipe
         else {
            // add fcns to their Class
            …
         }
         in.close();   // stop the thread
      }
      // else thread wasn't started, do nothing
      return(Void);
   },in,out); // onExit
   conBlock = compileBlock(classBlock,vmClass);
}
…
}

Crib sheet:

• All changes and accesses to the Deferred are made in the compiler thread;  a good thing as Deferred objects are not thread safe.
• A “block” contains the source for a class; if a class (such as the RootClass/file) contains classes, compileClass is recursive.
• Asking a Deferred to convert itself to a Bool (eg if (Deferred) … ) is a reference to the underlying object (the pipe). We don’t want that. Using .BaseClass.toBool() prevents evaluation and directs the question to the container. The result is True if the Deferred has been evaluated (ie has the thread been created?). I need this check because if I just call in.close() and the thread hasn’t started, it would start.
• Closing the input pipe stops the thread (since all it does is read from the pipe).
• Once the thread has stopped, out contains all the assembled functions.
• The onExit keyword creates a deferred object (in this case, a function) that is evaluated when the function exits/returns. Using onExit means this code will always be run (I don’t have to worry about exceptions or whatever) and it puts the thread finish-up code close to where the thread is created.
– If an error caused the compilation to stop, clear and close the pipe, which will stop the thread.
– If the thread was started, functions were assembled, so close the pipe, stop the thread and put the functions into their class.
– No matter what happens, I know the thread will be stopped.
– Returning Void ensures that in won’t be mistakenly evaluated.

______________________________________

The compileFcn function does a bunch of stuff then sends the function code out to be assembled.

fcn compileFcn(pFcn,lexBlock,vmClass) {
…
reg rootClass = lexBlock.findRootClass();
// send the fcn off to the asm thread (starting it if need be)
rootClass.asmPipe.write(
   T(pFcn,vmClass,cblock,defaultArgs,
     lexBlock.fcnIndex(fcnName)));
…
}

Crib sheet:

• The RootClass contains the pipe to the thread. As the compiler is a state machine, the RootClass is located in the passed in context (parse tree).
• If the thread hasn’t started yet, calling write will cause the Deferred object to evaluate and start the thread and then perform the actual write.

## Chained Comparisons

A + B + C is something I’m calling a “chained” operation: a sequence of binary operators of equal precedence. Standard stuff. So why aren’t comparison operators chainable? Probably because it isn’t used much. I looked through thousands of lines of code and found two places where it would be useful, one in the tokenizer:

else if (0xD800 <= u <= 0xDFFF)
   // error, illegal Unicode character

and the other in a script to print a character (or “.” if the character isn’t “printable”):

bytes.reduce(fcn(d,c){
   d.append(if (0x20 <= c <= 0x7E) c.toChar() else ".") },
   d.clear())

Pretty simple stuff. Another reason could be that the parsing gets kinda ugly after three operands. So, to give myself brownie points for implementing it and to keep things simple and sane, I set the following conditions:

1. One or two operators: “A O1 B” or “A O1 B O2 C”, where On is a comparison operator (==, !=, <, <=, >, >=).
2. If there are two operators, the operators must be of equal precedence. This means you can group ==, != or <, <=, >, >=.
“A==B<C” is just painful to parse and if it isn’t used, why bother? “(A==B)<C” doesn’t work because of 5.
3. Logic operators “and” and “or” terminate a chain. Don’t break “A==B and C==D” but parse “(A==B==C) and D==E”. Those ()s are required because I don’t want to chase the expression tree.
4. “A O1 B O2 C” –> “(A O1 B) and (B O2 C)” where B is evaluated only once.
5. ()s scope, just because that is sane. It isn’t very useful as “(A O1 B O2 C) O3 D” is “(BOOL) O3 D”, so D has to be a boolean for the expression to be meaningful.

4 is the stinker. You basically have to rewrite the expression and use a stack or temporary. If the operands are complicated, you could have a real mess on your hands. 1 is a counter, 2 is a check, 3 is just a write barrier and 5 is a stack.
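Condition 4 is exactly the semantics Python’s native chained comparisons have: the middle operand is evaluated only once. A quick check with a side-effecting function shows it:

```python
calls = 0
def b():
    """A middle operand with a visible side effect."""
    global calls
    calls += 1
    return 5

# "1 < b() < 10" behaves like "(1 < b()) and (tmp < 10)"
# where b()'s value is saved in a temporary, so b runs only once
result = 1 < b() < 10
print(result, calls)   # True 1
```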

I compile these expressions in two steps:

1. Convert the expression to RPN, inserting special tokens to save/restore a result and do the “and”.
“A O1 B O2 C” –> “A, B, saveB, O1, restoreB, C, O2, and+”
2. Compile the RPN.
Figure out how to save/restore, and and+ has to clean up if the result is False (because then the entire expression is False, so we can shortcut and are done). This is why operators are limited to two; additional terms would have to be saved somewhere and I’d rather not do the stack foo or temp management.

Now I can do stuff like a()==try{b()}catch{c()}==d(). But I don’t, sigh.

## Infinite sequences and lazy evaluation

I’ve been trying to wrap my brain around functional programming, without much success unfortunately. I really like the ideas but I seem to think imperatively. So I’m trying to take little bites and see if I can express them in language I can grok. Some of the ideas are straightforward, like map, reduce, etc. A really difficult one for me is infinite sequences and lazy evaluation. An infinite sequence seems like a really simple concept: a function that, when called, produces the next value in a sequence or just continually pumps out values. Such as a thread that writes random numbers to a pipe or a function that calculates Fibonacci numbers:

fcn fibSeq1 {
   var a = 1, b = 1;   // initial values
   reg t = a;
   a = b; b += t;
   return(t);
}
foreach i in (20) { print(fibSeq1(),","); }
–> 1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987,1597,2584,4181,6765,

Lazy evaluation is a really cool concept: if you don’t have to compute, don’t; just remember what you need to do if somebody calls your bluff. For me, (ignoring && and ||), the classic example is deferred module loading:

// Create a placeholder but don’t actually do anything
var Date = Deferred(Import,"Time.Date");
<bunch of stuff, maybe leaving this control flow>
// import module and call function in that module
println(Date.prettyDay());
–> Tuesday, the 10th of August 2010

OK, that was easy to digest, how about infinitely recursive sequences? Again, Fibonacci:

fcn fibSeq2(a,b) { return(a,self.fcn(b,a+b)) }
fibSeq2(1,1) –> (1,2), (2,3), (3,5), (5,8), …

Given any two consecutive Fibonacci numbers, this gets into some serious navel gazing calculating the rest of the infinite sequence. Until the stack blows anyway. My question is/was, how can this be useful? All it does is produce a core dump. The Haskell people say this is indeed very useful. How so? Well, with lazy evaluation, you can control the recursion and get it to stop when you have calculated the desired result. Check out this Haskell lazy evaluation example (now that I kinda know what is happening, it is lucid, but that took quite a bit of time). My “quest” was to transpose that example into my iterative world view. The question is: given an infinitely recursive sequence, how can I get the nth item in that sequence? The sequence is represented as
fibSeq3(1,1);
which is a slightly tweaked fibSeq2, as I don’t have support for implicit lazy evaluation. To get the third fib:
nthFib(fibSeq3(1,1),3);
Thus the first veil presents itself: how can I pass a call to fibSeq3 without getting that infinite recursion? Just defer it:
nthFib( Deferred(fibSeq3,1,1) ,3);
OK, that gives me a place holder to the start of an infinite sequence. Since fibSeq3 returns a pair, where the first item is the nth fib, if the count reaches 1, we are done and the answer is that first item. Otherwise, decrement the count and get the next fib pair:

fcn nthFib(seq,n){
   if (n <= 1)          // done, stop recursion
      return(seq[0]);
   return(self.fcn(seq[1],n-1));  // tail recursive
}

How to stop fibSeq3’s infinite recursion at the nth item? That is where the weird turn pro: the “seq[n]” is magic. How can you subscript an infinite list? Doesn’t subscripting force evaluation of a lazy calculation? Well, yes, but we can put that off:

fcn fibSeq3(a,b) { // a is int, b is int or Deferred
   return(a,       // do NOT force evaluation of b!
      Deferred(fcn(a,b){ fibSeq3(b,a+b) }, a,b))
}

What we have done is remove the explicit recursion from fibSeq2 and instead made it implicit by returning the next item and a “this is how to calculate the (next+1)th item using this very same function”. I guess you could call this unrolling the recursion.
Aside on the clumsy Deferred syntax: If I used Deferred(fibSeq3,b,a+b), the “a+b” would force evaluation (eager evaluation of parameters to Deferred), which is what I DON’T want. So I wrap the call in another function and let it put off the evaluation of b.
As you may have inferred, a Deferred is just a wrapper on a bit of [unexecuted] code and parameters that is run when something references through it. After the first evaluation, the result is cached for subsequent requests.
Note that this has been very carefully coded to return a number and a lazy – we can key off the number and avoid evaluation of the lazy. That is the key to this whole mess.

And that is all there is to it:
nthFib(Deferred(fibSeq3,1,1), 3) –> 2
nthFib(Deferred(fibSeq3,1,1), 8) –> 21
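The same trick can be played in Python, with a zero-argument lambda standing in for Deferred and an explicit call doing the forcing. A sketch with my own names:

```python
def fib_seq(a, b):
    """Return this item plus an unevaluated 'how to get the rest'.
    The lambda is the Deferred: nothing recurses until it is called."""
    return (a, lambda: fib_seq(b, a + b))

def nth_fib(seq, n):
    head, rest = seq
    if n <= 1:
        return head                   # done: never force the tail
    return nth_fib(rest(), n - 1)     # force exactly one more step

print(nth_fib(fib_seq(1, 1), 3))   # 2
print(nth_fib(fib_seq(1, 1), 8))   # 21
```

As in the zkl version, the recursion is unrolled: each step hands back a number plus a recipe, and only the steps actually demanded ever run.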

To blow away a little bit of the smoke from the mirrors, here are the steps of nthFib(defer{fibSeq3(1,1)}, 3):

nthFib(Deferred,3) == nthFib(fibSeq3(1,1),3)
n > 1 so call nthFib(Deferred[1],2) == nthFib(fibSeq3(1,1)[1],2)
Deferred[1] forces evaluation of fibSeq3(1,1) –>
   List(1,Deferred) == List(1,fibSeq3(1,1+1))
Deferred[1] –> Deferred == fibSeq3(1,1+1)
so calling nthFib(Deferred[1],2) == nthFib(Deferred,2) ==
   nthFib(fibSeq3(1,2),2)
nthFib(Deferred,2) == nthFib(fibSeq3(1,2),2) –> nthFib(fibSeq3(1,2)[1],1)
fibSeq3(1,2) –> List(1,Deferred) == List(1,fibSeq3(2,1+2))
nthFib(Deferred,1) == nthFib(fibSeq3(2,3),1)
Since n == 1, no more recursion:
   answer is fibSeq3(2,3)[0] and again, subscripting forces evaluation
fibSeq3(2,3) –> List(2,fibSeq3(3,2+3))
List(2,fibSeq3(3,2+3))[0] –> 2

## Lock free tables

One of the things that really sucks about concurrent programming is that you are always having to wait on shared data. Have a global table? You have to wait on a lock to write, you have to wait on a lock to read. And locks are not cheap if multiple cores have to sync their caches (I’m thinking about spin locks; OS mutexes really suck time wise). Of course the functional people would say “just say no” to writable tables and Erlang would say no to shared data (I think) but I’m not that advanced and I’m programming in C. It sure would be nice to get rid of locks or at least confine them to writers (especially if the table doesn’t change very much).

So, what would such a table look like? Consider
void *table[N];
My code could request an entry in the table
n = registerThing(&blob);
and I could pass n to any thread (perhaps buried in a struct) that wanted to look at a blob. I do this when space is at a premium in a struct and N is small. The best part? There is NO locking, no waits. I think writing will cause cores to resync their caches. Here is how it’s done.

Writer code
CAtomicInt  N;
void       *zklObjectTable[256];

int registerThing(void *blob) {
   int slot = C_ATOMIC_INT_INC(&N); // returns new value
   slot--;
   if (slot > 255) vmThrow(NoVM,E_VM_ERROR,"Too many blobs!");
   zklObjectTable[slot] = blob;
   return slot;
}

blob.name = "foo";
foo = registerThing(&blob);

How does this work? First of all, it requires the ability to atomically increment an int. If I can do that, I can reserve a slot in the table no matter how many threads are trying to do the same thing. And once I have my slot, I can take my sweet time filling it in before I let others know about it. The drawback is you can’t iterate over the table because you could wind up looking at a slot as it is changing.
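The reserve-then-fill idea can be approximated in Python. In CPython, next() on an itertools.count increments in C under the GIL, so it hands out unique values across threads; that is an assumption about the CPython implementation (not a portable guarantee), but it mimics the C_ATOMIC_INT_INC scheme well enough to demonstrate it:

```python
import itertools
import threading

N = 256
table = [None] * N
counter = itertools.count()        # stand-in for CAtomicInt

def register_thing(blob):
    """Reserve a unique slot, then fill it in at our leisure."""
    slot = next(counter)           # the 'atomic increment'
    if slot >= N:
        raise RuntimeError("Too many blobs!")
    table[slot] = blob
    return slot

slots = []
slots_lock = threading.Lock()      # only the test bookkeeping is locked

def worker(tid):
    mine = [register_thing(("blob", tid, i)) for i in range(50)]
    with slots_lock:
        slots.extend(mine)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(slots), len(set(slots)))   # 200 200
```

Four threads race for slots, yet no slot is handed out twice and no thread ever waits on the table itself.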

Here are relevant macros for GCC and MSVC:
GCC:
typedef int volatile CAtomicInt;

MSVC:
typedef long int volatile CAtomicInt;
#define C_ATOMIC_INT_INC(ai) InterlockedIncrement(ai)

_________________________________________________________________

How about if I have a linked list? Not quite so sweet as I have to serialize writers but readers are completely lock free. The magic here is the ability to swap pointers atomically. I’ll show prepending to a list because that is how I add to these types of lists.

Writer code
SpinLock  lock;
Node     *listHead;   // readers start here, no lock needed

// Prepend item to the list in a thread safe manner
// 1. Fill in the new item, pointing it at the current head
// 2. Atomically prepend the new item to the list header
SPIN_LOCK_ACQUIRE(&lock);
   item->next = listHead;  // item is fully built before it is visible
   listHead   = item;      // the atomic pointer swap
SPIN_LOCK_RELEASE(&lock);

So, if 100 threads want to add to the list (at the same time), only one at a time can change the list. That lucky thread will atomically add a new item to the list. This vital step removes the need for readers to lock.
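A Python rendering of the writer-locked, reader-lock-free list: writers serialize on a lock, but a reader just grabs the head pointer and walks a snapshot that can never change under it, because cons cells are never modified once published. The Node class and names are mine:

```python
import threading

class Node:
    """An immutable-once-published cons cell."""
    __slots__ = ("value", "next")
    def __init__(self, value, next):
        self.value = value
        self.next = next          # never reassigned after publish

head = None
write_lock = threading.Lock()

def prepend(value):
    global head
    with write_lock:              # serialize writers only
        # build the node fully, THEN publish it with one store
        head = Node(value, head)

def snapshot():
    # readers take no lock: one read of head pins a consistent list
    node, out = head, []
    while node is not None:
        out.append(node.value)
        node = node.next
    return out

for v in (1, 2, 3):
    prepend(v)
print(snapshot())   # [3, 2, 1]
```

A reader that starts walking just before a prepend simply sees the old (still valid) list, which is exactly the behavior the C version relies on.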