skip to content

SWO - A Compiled Language

Have you ever wondered how programming languages worked? So did I, and that is why I decided to create my own compiled language, SWO.

Overview

In this post, I go over some of the early development of SWO, as well as what building your own compiled language entails. Since there are many tutorials on compilers out there, instead of having this post be a generic tutorial, I have decided to gear it broader to my concept if and general development of my language.

The Concept

The idea of creating a compiled programming language began to take shape in May of 2022. At that time, I had no major projects in development (except for, as required by every developer, dozens of abandoned half-baked ideas). At this time, I was desperately searching for new things to do. Finding myself unable to come up with anything new, and feeling incredibly bored, I decided to make an attempt at something I had been wanting to do for a while. This task seemed daunting, however, as I would later learn, was not nearly as hard as I thought it would be. Thus, SWO, aka SWO Wants Options, my compiled programming language, came into existence.

The Beginning

When I first started, I had a few major decisions to make. Did I want to build a compiled language, an interpreted one, or transpile it? What language was my compiler going to be written in? Did I want to follow a tutorial, or wing it? After some deliberation, I decided that I would write my compiler in C#. I had a decent amount of C# experience from my time making games in Unity, and I felt that OOP could be helpful. Once I chose the implementation language, all that was left was to choose how my language (SWO) would work. For me, I was able to understand compilation much easier and quicker than in interpretation. This left me with two choices, a transpiled language and a compiled one. In the end, I decided to go with a compiled language for the extra challenge. Naturally, with a compiled language, I decided to target LLVM IR and use the LLVM C-API C# bindings (known as LLVMSharp). For readers who have not previously heard of LLVM, it is essentially a compiler creation framework that is used by several major languages including C, C++, and Rust. LLVM IR refers to the LLVM Intermediate Representation, a low level language above assembly meant to be used as an intermediary for compilation. This means that my compiler can generate LLVM IR, and LLVM will turn that into runnable assembly / object code.

First Months

My first months of development on my compiler were marked by turmoil and constant rewriting. Due to the fact that I knew next to nothing about parsers, compilers, LLVM, and low level programming and general, development pace can best be described as slow. I also didn't want to spend weeks studying LLVM and low level programming; I find the easiest (and most fun way) to learn is through actual programming. I have never read a programming book, and generally, when I am learning something new, I like to just dive straight in with a project. While this might not be the most efficient way of learning, it is my favorite way. Now that you have the background for SWO, I will try to get more into the actual development (and be a little less boring).

In these first few months, I found several useful resources:

  • LLVM Kaleidoscope Tutorial
    • While I didn't know C at the time I used this guide, It was still very helpful for figuring out how the LLVM API worked.
  • LLVMSharp Kaleidoscope Tutorial
    • This tutorial is simply a clone of the LLVM Kaleidoscope Tutorial, and altough it only contains the first few chapters, I still found it helpful to have the Kaleidoscope Tutorial in a language I understood.
  • Mapping High Level Constructs to LLVM IR
    • This website contains examples on what certain high level language components look like in LLVM IR and thus was very helpful for the IR generation component of SWO.
  • A Complete Guide to LLVM
    • Here you can find a very broad (and helpful) overview of LLVM and how you can use it to create your own compiler. I found this useful for some of the specific API functions as well as general information on LLVM IR

The Language

Seeing as this project exists for me solely to be fun and act as a learning experience, I did not have any concrete ideas as to what I wanted SWO to look like when I first began development. I decided that SWO needed to have a "gimmick" and spent several months trying to come up with ideas. First, I settled on making a compiler with syntax configurable through TOML files, but soon realized that this idea merely added work for myself with little gain. Currently, I still am unsure of what the gimmick should be. I have toyed with the idea of making it a stack only allocated functional language. In theory, this could eliminate the need for garbage collection or manual memory management. However, currently, I am unsure if this is feasible or even worth it.

The SWO language (in its current state) is essentially a worse version of C. The language supports functions (returning, arguments, etc), variables (globals, locals, arrays), structs, references and dereferences, if statements, and for loops. I plan to add generics, functions as types, anonymous functions, and macros. I wish to get SWO to a point at which I can bootstrap it, and thus, that is my far off goal.

The Compiler

The SWO compiler is made up of 3 major components, a lexer (tokenizer), a parser, and a code generator. Each component serves its own function and integrates with the others to produce runnable programs.

The Lexer

The lexer, also known as a tokenizer, takes in the raw input from a file and turns it into "tokens". While technically unnecessary, this piece simplifies the next component, the parser.

The Parser

This is by far the largest (and most complex) component of SWO, and any language. The parser takes the tokens created by the lexer, and turns them into an AST (Abstract Syntax Tree). An AST is a high level representation of the strucutre of a programming language. Here is a pseudocode representation of what one might look like:

Code:

while (i < 10) {
    printf("%d", i);
    i++;
}

AST:

while loop {
    condition { variable { name = i } less than number { value = 10 } }
    body {
        function call {
            name = printf
            arg #1 = string { value = %d }
            arg #2 = variable { name = i }
        }
        variable assignment {
            target = variable { name = i }
            final = binary expression { left hand side = {variable { name = 1} } right hand side = number { value = 1 } operator = + }
        }

    }
}

As you can see, the parser will convert high level code into its base components and represent them as a tree data structure.

The Generator

The code generator is the final piece of the SWO compiler. The code generator takes the AST from the parser, and using the LLVMSharp library, generates LLVM IR code from the SWO code. The generation step essentially transpiles the SWO code into LLVM IR. This generated LLVM IR is then optimized and compiled down into object code which can then be linked into an executable.

Conclusion

While it may seem like a daunting task, writing your own programming language is a feasible project for many developers. Not only is it a realistic and attainable goal, but one that will teach you many things and take you on a fun and interesting journey.