Ever wondered how a programming language actually works?
As a newbie programmer, I wondered how a computer understands and executes my code, because I had learned in school that computers only understand binary (0s and 1s). So how can a simple program be executed by a machine that only works with two digits? This question led me to explore compilers and interpreters.
After studying compilers and interpreters, another question arose: “Okay, if interpreters and compilers enable the computer to understand my program, how do they work behind the scenes?” From here, some complex terminology started to emerge. Building your own programming language essentially means building your own compiler (or interpreter), and that process consists of several major steps.
1. Syntax and Grammar Formation :
Syntax is nothing more than the set of rules for how code is written in the language you are building, while the grammar defines its structure. This is the most essential part of creating a programming language and requires a lot of thought. To design the grammar formally, one popular approach is BNF (Backus-Naur Form).
What is BNF? BNF is a notation that describes the grammar of a language using production rules.
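For example, here is what a BNF grammar for a tiny arithmetic language might look like (the rule names are my own choices for this illustration, not a standard):

```
<expression> ::= <term> | <expression> "+" <term> | <expression> "-" <term>
<term>       ::= <factor> | <term> "*" <factor> | <term> "/" <factor>
<factor>     ::= <number> | "(" <expression> ")"
<number>     ::= <digit> | <digit> <number>
<digit>      ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
```

Each rule says that the symbol on the left can be replaced by one of the alternatives on the right. The later steps below will keep coming back to this little arithmetic language.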
2. Lexical Analyzer and Parser :
In the second step, we need to perform lexical analysis on the source code. Essentially, this converts your source code into a stream of tokens. Remember the term “token”.
Token : Tokens are the smallest meaningful units of the source code, such as keywords, identifiers, operators, and literals.
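As a rough sketch (not a production lexer), here is what a tiny lexer for the arithmetic grammar above might look like in Python; the token names and regular expressions are my own choices for this example:

```python
import re

# Each token type is paired with a regular expression that matches it.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),     # integer literals
    ("PLUS",   r"\+"),
    ("MINUS",  r"-"),
    ("STAR",   r"\*"),
    ("SLASH",  r"/"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),     # whitespace, discarded
]

def tokenize(source):
    """Convert source text into a list of (type, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            match = re.match(pattern, source[pos:])
            if match:
                if name != "SKIP":
                    tokens.append((name, match.group()))
                pos += match.end()
                break
        else:
            raise SyntaxError(f"Unexpected character: {source[pos]!r}")
    return tokens

print(tokenize("3 * (2 + 5)"))
# [('NUMBER', '3'), ('STAR', '*'), ('LPAREN', '('), ('NUMBER', '2'),
#  ('PLUS', '+'), ('NUMBER', '5'), ('RPAREN', ')')]
```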
Parser : Now, we have a stream of tokens, which are passed to a parser. The parser will convert these tokens into an Abstract Syntax Tree (AST) using the grammar we defined earlier.
Now, what is an AST? The AST is a hierarchical representation of the code, where each node represents an expression or statement in the language.
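To make this concrete, here is a minimal recursive-descent parser sketch that turns the tokens from the lexer above into an AST built from nested tuples (one possible representation among many); it handles repetition with loops instead of using the left-recursive grammar rules directly:

```python
def parse(tokens):
    """Parse a token list into an AST of nested tuples, e.g. ('*', 3, ('+', 2, 5))."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def expression():              # expression -> term (("+" | "-") term)*
        nonlocal pos
        node = term()
        while peek() in ("PLUS", "MINUS"):
            op = tokens[pos][1]
            pos += 1
            node = (op, node, term())
        return node

    def term():                    # term -> factor (("*" | "/") factor)*
        nonlocal pos
        node = factor()
        while peek() in ("STAR", "SLASH"):
            op = tokens[pos][1]
            pos += 1
            node = (op, node, factor())
        return node

    def factor():                  # factor -> NUMBER | "(" expression ")"
        nonlocal pos
        kind, value = tokens[pos]
        if kind == "NUMBER":
            pos += 1
            return int(value)
        if kind == "LPAREN":
            pos += 1
            node = expression()
            pos += 1               # consume the closing ")"
            return node
        raise SyntaxError(f"Unexpected token: {kind}")

    return expression()

print(parse(tokenize("3 * (2 + 5)")))   # ('*', 3, ('+', 2, 5))
```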
3. Semantic Analysis (Optional but Common) :
After parsing and generating the AST, the next step (depending on the language) could be semantic analysis. This step checks the meaning and consistency of the parsed code. For example, it can check for the following (a small sketch follows this list):
Type consistency: Does the expression make sense? (For instance, are you trying to add a number to something that isn't a number?)
Variable declaration: Has the variable been declared before use?
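As a sketch of the second check, assume an AST where variables appear as ('var', name) nodes and assignments as ('assign', name, expr) nodes (hypothetical shapes I'm inventing just for this example). A declaration check can then walk the tree like this:

```python
def check_declarations(node, declared):
    """Walk an AST node and report variables used before they are assigned."""
    if not isinstance(node, tuple):
        return                                  # literals need no checking
    kind = node[0]
    if kind == "var":
        name = node[1]
        if name not in declared:
            raise NameError(f"Variable '{name}' used before declaration")
    elif kind == "assign":
        _, name, value = node
        check_declarations(value, declared)     # check the right-hand side first
        declared.add(name)                      # the variable is now declared
    else:                                       # binary operators like '+', '*'
        for child in node[1:]:
            check_declarations(child, declared)

# A tiny "program": x = 5; y = x + 1
program = [("assign", "x", 5), ("assign", "y", ("+", ("var", "x"), 1))]
declared = set()
for statement in program:
    check_declarations(statement, declared)
print(declared)   # {'x', 'y'}
```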
4. Intermediate Representation (IR) and Code Generation :
If you’re writing a compiler, after parsing, the AST can be transformed into an intermediate representation (IR). IR is a simplified version of the code that makes it easier to optimize and generate machine code. It serves as a bridge between the high-level code and machine-level code. Finally, the IR is passed to the code generation phase, where it’s converted into the target machine code (binary) or byte-code (for a virtual machine).
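For instance, a compiler might lower the arithmetic AST from earlier into a flat, stack-based byte-code. The instruction names below (PUSH, ADD, and so on) are invented for this sketch, not any real instruction set:

```python
def compile_to_bytecode(node, code=None):
    """Lower the nested-tuple AST into a flat list of stack-machine instructions."""
    if code is None:
        code = []
    if isinstance(node, int):
        code.append(("PUSH", node))             # constants are pushed directly
    else:
        op, left, right = node
        compile_to_bytecode(left, code)         # emit code for both operands first
        compile_to_bytecode(right, code)
        opcode = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}[op]
        code.append((opcode,))
    return code

print(compile_to_bytecode(("*", 3, ("+", 2, 5))))
# [('PUSH', 3), ('PUSH', 2), ('PUSH', 5), ('ADD',), ('MUL',)]
```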
If you’re writing an interpreter, this phase may not exist. Instead, the interpreter walks through the AST or IR directly and executes the instructions in real time.
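A tree-walking interpreter for the same little language can be as small as this (again, just a sketch assuming the nested-tuple AST used above):

```python
def evaluate(node):
    """Execute the program by recursively evaluating each AST node."""
    if isinstance(node, int):                   # a literal evaluates to itself
        return node
    op, left, right = node
    a, b = evaluate(left), evaluate(right)      # evaluate both sub-trees first
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op == "/":
        return a / b
    raise ValueError(f"Unknown operator: {op}")

print(evaluate(("*", 3, ("+", 2, 5))))   # 21
```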
5. Optimization (Optional) :
An optional but important phase in compilers is optimization. Here, the IR is analyzed and transformed to make the final machine code more efficient. For instance, instead of calculating 3 * 2 at runtime, the compiler might replace it with the constant 6.
6. Execution :
In the final step (for both compiled and interpreted languages), the generated machine code or byte-code is executed by the machine. If it's an interpreter, the instructions are executed directly from the AST or byte-code. If it's a compiler, the code runs after being compiled.
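To close the loop, here is a minimal stack-machine sketch that executes the byte-code produced by the code-generation example above (the instruction names are the same invented ones, so this illustrates the idea rather than any particular virtual machine):

```python
def run(bytecode):
    """Execute the toy stack-machine byte-code produced by compile_to_bytecode."""
    stack = []
    for instr in bytecode:
        op = instr[0]
        if op == "PUSH":
            stack.append(instr[1])              # push a constant onto the stack
        else:
            b, a = stack.pop(), stack.pop()     # pop the two operands
            if op == "ADD":
                stack.append(a + b)
            elif op == "SUB":
                stack.append(a - b)
            elif op == "MUL":
                stack.append(a * b)
            elif op == "DIV":
                stack.append(a / b)
    return stack.pop()

print(run(compile_to_bytecode(("*", 3, ("+", 2, 5)))))   # 21
```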
Wrapping Up :
To summarize the whole process:
Syntax & Grammar Formation: Define rules for your language.
Lexical Analysis (Lexer): Convert the source code into tokens.
Parsing (Parser): Use grammar to convert tokens into an AST.
(Optional) Semantic Analysis: Validate the meaning and logic.
(Optional) IR Generation: Create an intermediate representation.
Code Generation: Turn the IR into machine code or byte-code.
(Optional) Optimization: Improve the efficiency of the generated code.
Execution: Run the machine code or byte-code.