TODO Add conversion instructions int->float and float->int Add pseudo code illustrations for instructions Copyright (C) 2001 Michael Leonhard Mike Leonhard mike at tamale dot net http://tamale.net/ Sebae Bytecode spec v1 draft * Introduction to Sebae The Sebae bytecode specification defines the format of programs that will run in the Sebae Virtual Machine. The code of a Sebae bytecode program is divided into sections. Code sections are called definitions, or defns for short. Every defn has an id number that is implied from its location in the file. The first defn is 1, the second is 2, and so on. Defn ids are denoted by a 32-bit unsigned integer. A bytecode program may not contain more than 4294967295 defns. Integers are stored and managed in big endian format. Host architectures that use a different format will need to translate the integers at load time. The LOAD and SAVE instructions will require translation in real time. Floating point numbers are stored and managed in IEEE-754 format. Sebae bytecode must be valid. Bytecode validity is determined when the bytecode is loaded. Sebae programs must also have legal execution. The execution of an illegal instruction will cause the virtual machine to halt. The VM has a single stack. Stack elements are 32bit values. New stack elements may be created. New stack elements always appear at the top of the stack. Stack elements may also be discarded. Only the top-most stack element may be discarded. Stack elements may be referenced and changed in place. In this way, the stack behaves like a resizable set of machine registers. The Sebae VM has a theoretical maximum stack size of 2.14 billion elements. The size of the stack may be limited by the host. This maximum stack size may be obtained with the STACKSIZE isntruction. Exceeding this limit will cause the machine to halt. The VM has a single memory block. The memory addressed in 32bit increments. The address 0 refers to the first 32 bits of the memory block. The address of 1 is the second set of 32 bits, that is the 33rd bit to the 64th bit. 32bit values may be taken from stack elements and written to this memory. Values may be loaded from the memory into stack elements. The Sebae VM can address 16GB of memory. Many hosts will not provide 16GB of memory to the bytecode application. These hosts will allocate a block of memory of fixed size. The size of this memory block may be obtained with the MEMSIZE instruction. Attempts to access non-existent memory will cause the machine to halt. During the execution of a defn, the area at the top of the stack is called the stack frame. Only stack elements in the frame may be referenced or modified. Each Sebae defn has a CONSUME number and a PRODUCE number. These numbers are called the defn's stack-effect. The CONSUME number specifies how many stack elements appear in the frame when the defn begins execution. The VM sets the frame before executing the defn. The position of the bottom of the frame is calculated by taking the number of stack elements on the stack and subtracting the defn's CONSUME number. The position of the bottom of the frame remains constant throughout the execution of the defn. The size of the frame, however, may change if instructions in the defn add or remove stack elements. When the defn concludes execution, the number of stack elements in the frame must match the specified PRODUCE number. The stack-effect of a defn may be verified by examining its instructions. A valid defn will specify a correct stack-effect. When a defn begins execution which would consume more elements than currently exist on the stack, the virtual machine will halt. Stack elements may be specified as parameters to certain instructions. Instructions in a defn may only work with stack elements in the current stack frame. Stack elements are thus referenced as offsets from the base of the frame. These offsets are specified by an integer stored in a 8bit value. Stack frame offsets may be in the range of 0 to 255. The stack frame has a size limit of 256 elements. A valid defn will never exceed this stack frame size limit. Instructions in a valid defn will reference only stack elements which exist in the stack frame. * Sebae bytecode file format File format: bytecode file signature: SEB1 defn list length: 0x0000 FF12 defn list: defn1 defn2 defn3 ... denfn data block: data Defn format: stack elements consumed: one byte 0x01 stack elements produced: one byte 0x00 Sebae instructions: ... defn-terminal instruction: HALT or JUMP A valid Sebae bytecode file begins with the four byte signature, SEB1. Immediately following the signature is an unsigned integer stored in four bytes. This integer specifies the length of the defn list in bytes. It is stored in big endian format. The defn list follows its length field. The list contains the defns of the bytecode program. The defns are assigned ids by their order of appearence in the defn list. The first defn is assigned id 1. The second defn, id 2, and so on. Program execution begins at the first defn. A defn begins with two bytes. The first byte is an unsigned integer which specifies the number of stack elements consumed by the defn. This is the CONSUME value. The second byte is an unsigned integer which specifies the number of stack elements produced by the defn. This is the PRODUCE value. These two values are together called the defn's stack-effect. A defn's PRODUCE and CONSUME values each have a maximum of 255. This is limit is imposed by their storage in an eight-bit byte. The first defn in a valid bytecode file must have a CONSUME number of zero. Following the stack effect is a sequence of Sebae instructions. The defn is terminated by a defn-terminal instruction, HALT or JUMP. A bytecode file may contain a data block. This block may contain any data. The data block follows the defn list. The data will be available to the bytecode program as stream number 0. The streams API has not yet been created. * Sebae instruction set A Sebae instruction is a number stored in a single byte. Some instructions require parameters. Parameters follow the instruction byte in the order they appear in this document. An instruction parameter labeled `value' is a 32bit value used directly by the instruction. An instruction parameter labeled `element' is an 8bit number which references a stack element. The actual stack element is determined in the VM at run time. * Execution and code instructions 0 HALT Halts the program. All VM functions cease unconditionally. 1 JUMP element Causes execution to switch to the defn id held in element. If the specified defn does not exist then this instruction behaves like HALT. Before execution begins in the new defn, the stack base is set to the current stack height minus the consume amount specified in the defn. If the resulting stack base would be less than 0 then the instruction behaves like HALT. If the new defn would create enough stack elements to overflow the stack, then this instruction behaves like HALT. 2 LOADCODE element1 Load defns from bytecode stored in memory starting at the address in element1. The address is a 32bit unsigned integer. Take the defn-id of the first defn that is loaded and store it in a new stack element at the top of the stack. This defn-id is the VECTOR of all the defns loaded by this instruction. Take the number of new defns and store it in a new stack element at the top of the stack. If the bytecode is invalid then this instruction behaves like HALT. 3 VECTOR Take the VECTOR for the current defn and store it in a new stack element at the top of the stack. Libraries may use this instruction to find their peer defns. * Stack instructions 4 PUSH value Put the value into a new element at the top of the stack. 5 POP Remove and discard the topmost stack element. A valid POP instruction may occur only when the current stack frame contains stack elements. 6 ASSIGN element1 element2 Take the value in element1 and put it in element2. 7 STACKSIZE Store the maximum number of elements of the stack in a new stack element at the top of the stack. 8 STACKHEIGHT Store the current number of stack elements in a new stack element at the top of the stack. Following this instruction the stack will be one element higher. * Arithmetic instructions Instructions which work with floating point numbers are prefixed with FP. 9 ADD element1 element2 10 FPADD element1 element2 Find the sum of element1 and element2. Store this value in a new stack element at the top of the stack. 11 SUBTRACT element1 element2 12 FPSUBTRACT element1 element2 Subtract element2 from element1. Store this value in a new stack element at the top of the stack. 13 MULTIPLY element1 element2 14 FPMULTIPLY element1 element2 Multiply element1 by element2. Store this value in a new stack element at the top of the stack. 15 DIVIDE element1 element2 16 FPDIVIDE element1 element2 Divide element1 by element2. Store this value in a new stack element at the top of the stack. If the value in element2 is 0 then this instruction behaves like HALT. 17 REMAINDER element1 element2 18 FPREMAINDER element1 element2 Obtain the remainder of the division of element1 by element2. Store this value in a new stack element at the top of the stack. If the value in element2 is 0 then this instruction behaves like HALT. 19 GREATERTHAN elment1 element2 20 FPGREATERTHAN elment1 element2 If the value in element1 is greater than the value in element2, then store the integer 1 in a new stack element at the top of the stack. Else store the integer 0 in a new stack element at the top of the stack. Boolean instructions: 21 AND element1 element2 Obtain the bitwise AND of element1 and element2. Store this value in a new stack element at the top of the stack. 22 OR element1 element2 Obtain the bitwise OR of element1 and element2. Store this value in a new stack element at the top of the stack. 23 XOR element1 element2 Obtain the bitwise XOR of element1 and element2. Store this value in a new stack element at the top of the stack. 24 ROTATE element1 element2 Take the value in element1 and rotate its bits by the number specified in element2. Take this new value and store it in a new stack element at the top of the stack. A positive number in element2 will cause the bits to be rotated toward bit 0. A negative number in element2 will cause the bits to be rotated toward bit 31. If element2 contains 0 then this instruction simply duplicates element1 into a new stack element. 25 EQUAL element1 element2 If the bits in element1 are identical to the bits in element2, then store the integer 1 in a new stack element at the top of the stack. If they are not identical, then store the integer 0 in a new stack element at the top of the stack. 26 DECIDE element1 element2 element3 If the integer stored in element3 is 1 then take the value in element1 and store it in a new stack element at the top of the stack. If the integer stored in element3 is 0 then take the value in element2 and store it in a new stack element at the top of the stack. If value stored in element3 is not 1 or 0 then this instruction behaves like HALT. Memory instructions: 27 LOAD element1 element2 Take the value from the memory location specified by element1 and put it in element2. 28 SAVE element1 element2 Take the value in element1 and put it in the memory location specified by element2. 29 MEMSIZE Store the size of the memory block in a new stack element at the top of the stack. Example program: SEB1 //file signature 0x0000 0036 //defn list length DEFN 0 4; //defn1 is main() //frame = stack PUSH 1; //frame[0] = 1; //int x = 1 PUSH 0; //frame[1] = 0; //create parameter ASSIGN 0 1; //frame[1] = frame[0]; //set parameter = x PUSH 3; //frame[2] = 3; //ret = defn3 PUSH 2; //frame[3] = 2; //prepare jump to defn2 JUMP 3; //defn2(); //addtwo( x ); DEFN 3 2; //defn2 is addtwo( int n ) //frame[0] = n //frame[1] = return defn //frame[2] = this defn id POP; //del frame[2]; //discard our defn id PUSH 2; //frame[2] = 2; //int a = 2 ADD 0 2; //frame[3] = frame[0] + frame[2];//int b = n + a ASSIGN 3 0; //frame[0] = frame[3]; //n = b POP; //del frame[3]; //del b POP; //del frame[2]; //del a JUMP 1; //defn3(); //return n DEFN 3 0; //defn3 is main() continued //frame[0] = x //frame[1] = n //frame[2] = this defn id POP; //del frame[2]; //discard our defn id ASSIGN 1 0; //frame[0] = frame[1]; //x = n POP; //del frame[1]; //del n POP; //del frame[0]; //del x HALT; //program finish test //data block >>>EOF<<< Acknowledgements: Thanks to Andrew Robbins (adu@OPN) for the long discussions on irc. Thanks to Darius Bacon (demoncrat@OPN) for the inspiration I found in his Froth vm (http://www.accesscom.com/~darius/froth/). Thanks to William Lee Irwin III (wli@OPN) for the long discussions on irc. Changes/Ramblings 08-28-01 Chatted with demoncrat. He says that it would be safe to specify IEEE-754 format for floating point numbers. 08-28-01 Clarified theoretical and practical limits for stack and mem. Removed ambiguity with int/float types for the EQUAL instruction. EQUAL now compares bits and reports with an integer, 1 or 0. FPGREATERTHAN now returns integer, 1 or 0. Decided not to change DECIDE to make it less rigid. C compilers really should support the division operator, /, for floats. Keeping ROTATE as it is, not changing it into RSH, LSH, etc. Clarified addressing of memory, mem is addressed in 32bit chunks. This gives Sebae VM the ability to address 16GB of memory. ;) Merged the `to be added' instructions into their sections. Renumbered the instructions, 30 total. 08-27-01 Removing the NOT instruction because it's useless. 08-23-01 After a long irc chat late at night I decided on these changes to the bytecode format. defn consume and produce should be 8bit all stack element references will be 8bit stack frame may contain at most 256 elements 32bit offset to data block I just completed rewriting this spec to match these changes. I greatly improved the paragraphs describing the stack and its behavior. I want to code this now, but it's nearly 11pm and I have to arise at 7am tomorrow morning. Also I have not been feeling well. I suspect this is due to my frequent late hours. I have been too eager to move this project forward during my brief break from academic studies. 08-21-01 I wanted to go on to irc and talk through this, but my meager global communications services are unavailable at the moment. The Changelog will have to be my listening ear. =) I was implementing the code that will actually execute the defns when it occurred to me that the Entry Defn is redundant. Using four bytes as part of the file header is functionally identical to using a twelve byte defn to JUMP to the program's entry point: DEFN 0 0; PUSH 0x0000 0007; JUMP 1; Thus I have decided to remove the Entry Defn-Id from the file format. Bytecode programs will now begin execution at the first defn. Now I get to go back and fix the assembler and disassembler. =/ The bytecode loader in svm's bytecode.c is very nice. I think I will combine the disassembler with svm. That is, the disassembler will use the same bytecode loading and verification code. Hmm it would be nice to have a function called Sebae_ExportAssembly(). Or maybe a function that just returns a textual assembly representation of a specific defn; this would be very useful for a debugger. The disassembler could then just loop through the defns, calling Sebae_Disassemble(), and write the output to the `.sa' file. When it is finished it can loop through the pipe-0 and write that to a `.sa.bin' file. And finally output the DATA directive to the `.sa' file. Yes. I like this idea. With a bit more thought... it might be useful to use the SebaeVM struct in the assembler also. This would mean adding a Sebae_Assemble() function. This could be difficult. Sebae_Assemble() would need to work on only one defn at a time. Well, I could write a couple of functions to process the assembly code. Sebae_StripComments() would be useful. So also would be Sebae_StripWhitespace(). After which the implementation of Sebae_Assemble() could be used to split the text into defns and call Sebae_AssembleDefn() for each one. This would create the program in the SebaeVM struct. The assembler would then also require functions to obtain the bytecode representation of the program. This is good because I was already planning to implement such code. Sebae_Bytecode() functions would be useful for storing a running program. I can see applications in clustering and process migration which would make use of this capability. It would be easy to dump the stack and memory to the bytecode's data block. It would also be easy to add some defns to the program which loaded the stack and memory again. The hard part is getting this code to run when the bytecode is loaded. The Entry Defn-id would make this trivial... just add a few defns to the program that restore the state, and point the entry point to these defns. But if the first defn is automatically executed then it would be neccessary to modify the first defn, or replace it. Now what would happen if the first defn was replaced with a JUMP to the restore defns. The first defn could be relocated... *sigh* this is all useless. A VM that will support process migration will need to store more data with the program. The bytecode file format is not and SHOULD NOT be extended to facilitate process migration. It may be possible to write bytecode that can restore itself to a previous state, but it will never be possible to restore the external states that such distributed code is sure to make use of. By external states I mean the state of network connections, pipes, queries, etc. These could easily be managed by the custom VM implemenations that will support clustering. Ok so I've made up my mind. Entry Defn-id has got to go. Also I will not worry about saving the state of a running bytecode program. In the future there may be time and resources to worry about clustering. I have also been playing around with another idea. This idea is to enlarge the defn-list-terminator (three null bytes) into a marker. The marker is guaranteed never to exist in the valid bytecode. I believe it would need to be thirteen null bytes. The most parameters that any instruction has is three, or twelve bytes. The HALT instruction, zero, has no parameters. Wait... there could be a DECIDE 0 0 0; HALT; which would match the marker. So I guess it would need to be fourteen zero bytes... DECIDE 0 0 0; HALT; HALT; is invalid. Ok so the marker is this, 0 0 0 0 0 0 0 0 0 0 0 0 0 0. Five, five, and four. Ok but what if you had code before the marker: DECIDE 0 0 0; HALT; 0 0 0 0 0 0 0 0 0 0 0 0 0 0; Then the signature would match incorrectly beginning with the parameters of DECIDE. No, what is needed is a signature which will never match any portion of bytecode. Ok we know that HALT can only appear at the end of a defn. Also HALT can be preceded by at most, three zeros. Also, if we're terminating the defn-list with a signature, then we can remove the magic from DEFN 0 0; HALT; and make it a valid defn again. This means that we can have sets of three zeros anywhere in the bytecode. (0,0,0)a (0,0,0,0 | 0,0,0 | 0,0 | 0)b (0,0,0)c Obviously this means that any number of zeros can exist in a valid bytecode file. Perhaps this signature should consist of the highest byte, 0xFF? Let's see... There is no 0xFF instruction, so it can only occur as a parameter or in a defn header. DECIDE 0xFF 0xFF 0xFF; EQUAL 0xFF 0xFF; JUMP 0xFF; DEFN 0xFF 0xFF; It looks like there can only be, at most, three 0xFF bytes in a valid bytecode file. Err, that is in the defn-list of a bytecode file. Any data may occur in the data block. So if we use four full bytes as our signature, FF,FF,FF,FF, then what about this scenario? JUMP FF; FF,FF,FF,FF; The parameter to JUMP is incorrectly treated as the beginning of the signature. We can fix this by adding a null byte to our signature: JUMP FF; FF,FF,FF,FF,0; This gives us five bytes that should not occur anywhere in the defn-list of a valid bytecode program. Is a five bytes a good number for a signature? A four byte signature could be made with FF,FF,FF,FE. Yes, this is better. Because there is no 0xFE instruction, these four bytes cannot exist in the defn-list of a valid bytecode file. I've made up my mind then. The defn-list will be terminated by four bytes: 0xFF 0xFF 0xFF 0xFE Immediately following these bytes will be the size of the data block. Hmmm... what is the purpose of the data-block-size? Well the advantages I can think of are these: a) Size of block need not be calculated from the bytecode file size. b) Certain environments will not provide the bytecode file size. c) Detect of file truncation The first two are not important. The last one is the most useful, to detect the truncation of the file. Truncation of the defn-list implies the loss of the data block and its size field. Such severe truncation would be detected when the defn-list-terminator signature is not found. So now the question is whether the VM should be repsonsible for verifying the data block. Even if the data block was bad, wouldn't it be the responsibility of the bytecode program to deal with it? What could the VM do? Prompt the user, "Data block corrupted, run anyway?" In which case the bytecode would still have to deal with the problem. I think that the data block size field is useless. The data should be entirely the responsibility of the bytecode. With these deliberations I have come to this conclusion: Data-block-size field MUST GO. NOW. Skat. Skidaddle. =) Hmm interesting... it should be straightforward to translate Sebae bytecode into an imperative language, like C. Maybe Sebae bytecode is more like an `intermediate format' than I had originally thought? oldnote1 I had originally planned to have three jump-like instructions: GOTO would take an immediate value. JUMP would take an element. IFJUMP would take three elements, the first two beind defn ids and the last being a 1 or 0 to determine which defn to switch to. I removed GOTO and IFJUMP because they are redundant. It occurred to me that IFJUMP could easily be synthesized with JUMP and a far more useful instruction, DECIDE. DECIDE will be an arithmetic instruction. The primary function of a computer is to manipulate numbers. It makes sense to create the primary logic instruction to effect numbers. Although GOTO is obviously redundant to JUMP, it has two benefits. The first is that it would allow optimizations in the VM by moving the destination check to load-time rather than run-time. The second is that GOTO would not leave the defn id on the stack. I want Sebae to be minimal so I will forgo these benefits in exchange for a simpler VM. oldnote2 I have been encouraged by several people to add a form of POP that takes a parameter from the stack. This POPN would pop the specified number of elements from the stack. I cannot implement this instruction without severely crippling the VM. Sebae bytecode may be read and translated into optimized native machine code. The Sebae stack is just a model. Implementations need not maintain a contiguous set of bytes to represent the stack. Adding the POPN instruction would mandate a traditional stack and would destroy the register model that I like so much. POPN would make the VM too complex. I need Sebae to be minimal. If you need LISPy or FORTHy stacks then you should implement them in the memory block. oldnote3 I must decide about other mathematical instructions and floating point numbers. It would be ideal to have all the capabilities of a scientific calculator in the VM. This could cause problems with imprecise values and programs producing different results on different implementations. Perhaps these functions would be better suited to a function library? Would they all fit in range from 30 to 254? oldnote4 DECIDE is the logic instruction. All computed decisions can be boiled down to Yes or No. DECIDE translates this yes/no decision into choice. EOF