Zyme

an evolvable language for genetic programming

Zyme is an esoteric language for genetic programming: creating computer programs by means of natural selection.

For successful evolution mutations must generate a wide range of phenotypic variation, a feat nearly impossible when randomly modifying the code of conventional programming languages. Zyme is designed to maximize the likelihood of a bytecode mutation creating a novel yet non-fatal change in program behavior.

Diverging from conventional register or stack-based architectures, Zyme uses a unique molecular automaton-based virtual machine, mimicking an abstract cellular metabolism. This design enables fuzzy control flow and precludes invalid runtime states, transforming potential crashes into opportunities for adaptation.

Interactive demonstration

Editor

Compile the source code

Genetic programming begins with an initial program, often generated randomly. However, evolution can be impeded by an unsuitable choice of this initial program. Two common ways in which programs prove inadequate are:

  • The program may be too small and lack complexity, rendering it overly fragile.
  • The program's behaviour might not produce output in the required format (or any output at all), preventing meaningful evaluation. For example, if a program should return a single byte representing a category in a classification problem, but fails to return anything, there's nothing to assess.

The Zyme assembler compiles human-readable source code into the mutable bytecode, enabling you to build bespoke initial programs for specific machine learning tasks.

Bytecode

Compile source code to generate bytecode

Mutate the bytecode

Click on each byte to mutate it.

This "Hello Creatures." source code generates three strands, which are the primitive data structure of the Zyme virtual machine. Strands are just arrays of bytes, however - given they are homoiconic - they are interpreted as both code (a sequence of instructions) and data (a sequence of integer values).

The strands are concatenated, with empty or null bytes 00 as separators to produce the bytecode ready for mutation.

Any binary data is valid bytecode on the Zyme virtual machine. This design ensures that programs cannot crash; however, they may instead return no output or perform no useful function.

Try to mutate the bytecode to reveal the hidden message1.

Console

hex view:

Execute the program

The Zyme virtual machine executes input bytecode by first splitting it into its constituent strands, then evaluating each independently and in parallel. Within each strand, execution proceeds sequentially, with each byte encoding an individual opcode. These instructions manipulate the state of the current strand to perform various arithmetic, logical and control flow operations.

Strands interact by binding to one another to form temporary complexes. Specific machine instructions control when and how binding occurs. For instance, the WAIT instruction opens a binding site, and the LOAD instruction binds to a waiting open site.

Binding can only occur successfully when the motif sequences associated with the binding sites match. Motifs are specified after the instruction, e.g., WAIT %”ABCDABCD” |. Importantly, these motifs do not require exact matches, allowing for fuzzy control flow.

Zyme in practice

While mutating the “Hello Creatures.” program above to produce garbled output (or revealing the “Secret!” message) may illustrate the quirky features of Zyme, it is far from real-world machine learning. This example was selected as a familiar context to illustrate core features of the virtual machine, like its fuzzy matching motif system, rather than reflecting a practical application. Nevertheless, Zyme does include tools for creating sophisticated initial programs beyond this simple demonstration. Further, preliminary experiments indicate that mutating these programs can generate non-trivial variation in behaviour, indicating that, maybe, just maybe, Zyme could serve as a foundation for a novel machine learning approach.

Language features

The design of Zyme's language balances the capacity for abstraction with evolutionary potential. While abstraction and encapsulation help developers manage complexity by dividing programs into independent components and modules, this can conflict with the goal of being an evolvable language. A primary benefit of modularization — preventing unintended interactions between independent components — can impede evolution. These interactions aren't inherently harmful; rather, the risk lies in manipulating data structures in ways that breach abstraction boundaries, potentially leading to unpredictable or undefined behaviour.

Traditional programming languages view this undefined behaviour as dangerous. However, in the context of an evolvable language like Zyme, this unpredictability becomes a feature. It's precisely this potential for unexpected interactions that drives evolution in the system.

Zyme employs macros to create abstractions without restricting runtime interactions. These macros can group functionally related strands, allowing them to be treated as single units in the source code. During compilation, any explicit trace of the groups is lost as all strands are ultimately compiled to byte arrays. This ensures that when resulting bytecode undergoes mutation, nothing prevents potential interactions between strands, regardless of their original source code organisation.

Source 1 illustrates a macro that transforms a given sequence of instructions into a ‘function’ call by emulating a stack frame. Instead of executing instructions directly within a 'caller' strand, they are placed in a separate 'callee' strand, which is then invoked through the motif binding system. The outcome is that this macro abstracts over the motif system involved in managing multiple strands, making it transparent to the user at the point of use (shown in Source 2).

To elaborate, the first line in Source 1 defines the remote macro, which takes a single argument, block - the block of instructions to be called in a separate stack frame. The macro's body uses a nested lexical scope, denoted by { ... }, where new definitions are locally confined. Within this scope, the first statement defines address, which serves as the unique motif for strand binding. Rather than specifying it literally, % is used to generate a fresh motif: a new motif guaranteed to be unique by the compiler. This feature allows multiple calls to the macro, as each instance will contain a distinct motif locally bound to address.

Next a new strand is generated that will open a new binding site containing the address motif, using WAIT instruction as seen in the initial interactive demonstration. The block of instructions provided as the argument, block is then inserted into this strand, followed by the RTRN instruction. This will become the ‘callee’ strand.

The macro concludes with the emit keyword, specifying what remains in place of the lexical scope after macro evaluation. In this instance, it's a sequence of instructions that initiates the 'function' call using the CALL instruction on the address motif. This will become part of the ‘callee’ strand. Note, both the generated 'callee' strand and the instructions in the 'caller' strand now operate with the same shared motif.

When executed in the virtual machine, the CALL instruction causes the two strands to bind, temporarily shifting execution control from the 'caller' strand's call site to the 'callee' strand. Execution resumes where the 'callee' strand was waiting - after the address motif consumed by the WAIT instruction, at the start of the block argument passed to the macro. Once the block instructions complete, the RTRN instruction triggers strand unbinding, returning control to the 'caller' strand's waiting point: after the address motif consumed by the CALL instruction.

Try this macro by copying Source 1 into the initial interactive demonstration source code, and substituting the original trim_motif macro with Source 2. Note, in Source 2, the square brackets [ ... ] group instructions into a single argument for the remote macro. These changes do not change the initial program’s behaviour, however the resulting bytecode now contains an additional strand.

Despite targeting a virtual machine with an esoteric instruction set, the Zyme enables developers to create abstractions and write structured code, without compromising on evolutionary potential.

Source 1

let remote block => {
    let address => % ;
    WAIT address | block RTRN ;

    emit => CALL address | ;
};

Source 2

let trim_motif =>
    remote [ insert_empty |> |> ] ;

Machine learning and mutations

For Zyme to serve as a viable foundation for a machine learning approach, it must be possible to effectively evolve its bytecode. This means significantly improving program performance through iterative random mutations.

This approach to optimising or improving computer programs - genetic programming - is not new and has a rich literature. The process operates with a population of programs in a two-step cycle of mutation and selection. Traditionally, the first generation starts with a population of random programs, in this case we use a population constructed in the manner we described earlier. This process operates as follows:

  1. mutation - randomly mutate programs in the current population of programs to generate new candidate programs.
  2. selection - evaluate each new candidate program's performance by assessing its ability to produce desired outputs for given inputs, similar to supervised learning. Choose the top-performing programs to form the next generation's population.

The process repeats, and theoretically the performance of the programs should improve over successive generations.

But does this work in practice for Zyme?

In the loose sense, Zyme works; we can observe performance improvements over time as programs undergo mutation. While it is not hard to find mutations that improve the performance at a given task, this phenomenon isn't unique to Zyme; randomly adjusting parameters of any machine learning system or statistical model would likely exhibit some degree of improvement over time. The critical issue is whether mutations of Zyme bytecode can reliably produce non-trivial improvements, potentially yielding results competitive with established machine learning methods.

This is something we are actively exploring. We are currently testing Zyme on small-scale problems to assess its effectiveness and resolve unforeseen issues that emerged after the design phase. The results, while limited so far, are encouraging.

We have been exploring many different ways to mutate the bytecode - often referred to as genetic operators - from single point mutations to more elaborate cross over techniques. Interestingly, we have observed that programs quickly increase their resistance to mutations over the first few generations. In multiple trials, we've observed that a specific mutation regime (combining point mutations and crossover) initially results in only a ~3% survival rate of mutated programs. Where survival here means the program still outputs a correctly formatted return value, allowing for evaluation of program performance. However, in subsequent generations the survival rate can increase up to as high as ~60%.

It's tempting to attribute this phenomenon to bloat-induced statistical artifact. Bloat happens when programs expand uncontrollably, often incorporating 'dead-code' regions in the bytecode that don't provide any functionality or improve performance directly. This surplus code may dilute effects of mutation, and by potentially protecting the functional bytecode, result in an increase in mutation resistance.

While we've observed bloat in Zyme, we don’t think this is driving the increase in mutation resistance and survival rate. Surprisingly, as the survival rate of mutated programs increases, we're also observing a significant increase in variation in performance among these programs at each generation.

In early generations, when survival rates are ~3%, most surviving mutations are synonymous – new programs typically maintain the same performance as their predecessors, with performance improvements being very rare. However, in later generations as survival rates climb, we observe a shift: fewer synonymous mutations and more non-fatal mutations that alter performance levels. This simultaneous increase in mutation resistance, survival rate, and performance variation seems counterintuitive if bloat were the primary driver?

We are not sure what is going on here, but this is one of many interesting early observations. Our next steps are to start scaling up to see how Zyme performs on less trivial problems.

1 While there are many mutations that can reveal the “Secret!” message, the most reliable method is to mutate the first byte of the second strand corresponding to the WAIT instruction of the strand containing “Hello Creatures.”. This corresponds to the 20th byte, 0F in the bytecode.

This mutation prevents the "Hello Creatures." strand from opening a binding site, preventing any other strand from binding to it. Consequently, the LOAD instruction falls back to the binding site of the strand containing the "Secret!" message, which is normally a weak match. As a result, the "Secret!" message is printed instead of "Hello Creatures.".