15-122 Programming Homework 11 12 Page 1 of 16
15-122: Principles of Imperative Computation, Spring 2020
Programming Homework 11 12: The C0VM
Due: Thursday 23
rd
April, 2020 and
Thursday 30
th
April, 2020 by 9pm
In this assignment you will implement a virtual machine for C0, the C0VM. It has been
influenced by the Java Virtual Machine (JVM) and the LLVM, a low-level virtual machine
for compiler backends. We kept its definition much simpler than the JVM, following the
design of C0. Bytecode verification, one of the cornerstones of the JVM design, fell victim
to this simplification; in this way the machine bears a closer resemblance to the LLVM. It
is a fully functional design and should be able to execute arbitrary C0 code.
The purpose of this assignment is to give you practice in writing C programs in the kind of
application where C is indeed often used in practice. C is appropriate here because a virtual
machine has to perform some low-level data and memory manipulation that is difficult to
make simultaneously efficient and safe in a higher-level language. It should also help you
gain a deeper understanding of how C0 (and, by extension, C) programs are executed.
The code handout for this assignment is on Autolab. The file README.txt in the code
handout goes over the contents of the handout and explains how to hand the assignment
in. There is a TWENTY (20) PENALTY-FREE HANDIN LIMIT for the checkpoint ,
and a TWENTY (20) PENALTY-FREE HANDIN LIMIT for the full assignment. Every
additional handin for each will incur a small (5%) penalty (even if using a late day). Your
score for each part of this assignment (checkpoint and full) will be the score of your last
Autolab submission. The checkpoint is for Tasks 1 and 2 only. This is far, far less than half
the assignment.
After the checkpoint, you can no longer earn points for Tasks 1 and 2, although the full
autograder will continue to run tests against them. It will also run tests for Tasks 3 and 4, but
not for Task 5 until after the handin deadline has passed. This means that you must do all
your own testing for Task 5 and in order to earn full points you must convince yourself that
you fully understand it and have tested it for correct behavior and for memory leaks where
appropriate (you will see later in the writeup that certain memory leaks are unavoidable).
About this writeup
The C0VM is defined in stages, and we have test programs which exercise only part of the
specification. We strongly recommend that you construct your implementation following
these stages and debug and test each stage (on your own and with the autograder) before
moving on to the next. Each part has its own challenges, but each part should be relatively
small and self-contained.
This document describes the structure of the C0VM first and then the instruction set
c© Carnegie Mellon University 2020
15-122 Programming Homework 11 12 Page 2 of 16
(bytecodes) for the C0VM. After this, the document will describe the tasks you need to
perform, step by step. Read this document very carefully as you prepare to do your work.
1 The Structure of the C0VM
Compiled code to be executed by the C0 virtual machine is represented in a byte code format,
typically stored in a file ending in extension .bc0 which we call the bytecode file. This file
contains numerical and string constants as well as byte code for the functions defined in the
C0 source. The precise form of this file is specified in Appendix A and in lib/c0vm.h.
1.1 The Type c0_value
C0 has so-called small types int, bool, char, string, t[], and t*. Values of these types
can be passed to or from functions and held in variables. In the C0VM, we will represent
each of these types in one of two ways: the primitive types are represented as 32-bit signed
integers (which we will abbreviate as w32) and the reference types are represented as void
pointers (which we will abbreviate as *).
C0 type C0VM type C type
int w32 int32_t
bool w32 int32_t
char w32 int32_t
t[] * void*
t* * void*
In lib/c0vm.h, we create a special type c0_value of C0 values. A c0_value can
store both values of primitive type (which we write as x, i, or n) and values of reference or
pointer type (which we write as a). We can turn primitive types into C0 values with the
function int2val, and we can turn C0 values which we know to be primitive types back into
integers with val2int. Similarly, ptr2val(x) and val2ptr(x) move between reference
types (represented by void*) and C0 values. We always have val2int(int2val(x)) == x
for any integer x and val2ptr(ptr2val(a)) == a for any pointer a. There's a function
val_equal(v1,v2) which checks whether the two C0 values v1 and v2 are equal.
1.2 Runtime Data
The C0VM defines several types of data that are used during the execution of a program.
You'll need to consider the first three of these (the operand stack, bytecode, and the program
counter) to get started with Task 1.
The execute function you are extending in this assignment is passed a struct bc0_file
containing all the information from the bytecode. As you read this section, you should refer
to both Appendix A and lib/c0vm.h where this struct is described.
c© Carnegie Mellon University 2020
15-122 Programming Homework 11 12 Page 3 of 16
1.2.1 The Operand Stack (S)
The C0VM is a stack machine, similar in design to Clac and the JVM. This means arithmetic
operations and other instructions pop their operands from an operand stack and push their
result back onto the operand stack.
The C0VM is defined in terms of operand stack S, a stack of C0 values. Because these
values are such an important part of the VM, we define stacks that only contain C0 values in
lib/c0v_stack.h. We recommend writing some helper functions like push_int(S, i) and
pop_ptr(S) early on so that you won't have to repetitively write c0v_push(S, int2val(i))
and val2ptr(c0v_pop(S)) over and over.
1.2.2 Bytecode (P)
In Clac, the program instructions were strings, and they were stored in a queue of strings.
In the C0VM, instructions for the current C0 function are stored in an array of (unsigned)
bytes (ubyte *P). Each function in a compiled C0 program is represented by a struct
function_info that is stored in the array bc0->function_pool. The main() function
that you want to run first is always stored as the first struct in this array, at index 0. When
you start your VM, you should initialize P to be the bytecode code stored in that struct.
You don't ever need to allocate any memory to store bytecode. Just refer to the bytecode
given to the execute function.
1.2.3 The Program Counter (pc)
The program counter pc holds the address of the program instruction currently being exe-
cuted. It always starts at 0 when you begin executing a function. Unless a non-local transfer
of control occurs (goto, conditional branch, function call or return), the program counter is
incremented by the number of bytes in the current instruction before the next instruction is
fetched and interpreted.
1.2.4 Local variables (V)
Locals should be stored in a c0_value array. Every struct function_info has a field
num_vars which specifies how big this array needs to be in order to store all the locals of
that function. The four components S, P, pc, and V are everything that we need to know in
order to represent the current state of any function.
1.2.5 The Call Stack
The C0VM has a call stack consisting of frames, each one containing local variables, a local
operand stack, and a return address. The call stack grows when a function is called and
shrinks when a function returns, deallocating the frame during the return.
Every frame consists of the four components that represent the current state of some
function call that got interrupted in order to call another function. It contains the operand
stack S for computing expression values, a pointer P to the function's byte code, a return
address pc, which is the address of the next instruction in the interrupted function, and an
c© Carnegie Mellon University 2020
15-122 Programming Homework 11 12 Page 4 of 16
array of locals V. At any point during the execution there is a current frame as well as a
calling frame. The latter becomes the current frame when a function returns to its caller.
1.2.6 Constant Pools
Numerical constants requiring more than 8 bits and all string constants occurring in the
program will be allocated in constant pools, called the integer pool and the string pool. They
never change during program execution.
1.2.7 Function Pools
Functions, either C0 functions defined in a source file or library functions, are kept in pools
called the function pool and the native pool, respectively. Functions in the function pool are
stored with their bytecode instructions, while functions in the native pool store an index into
a table of C function pointers that the C0VM implementation can dereference and invoke.
1.2.8 The Heap
At runtime, the C0 heap contains C0 strings, C0 arrays (t[]) and C0 cells (t*). C0 arrays
have to store size information so that dynamic array bounds checks can be performed. You
will use the C heap to implement the C0 heap: that is, you will allocate strings, arrays, and
cells directly using calls to xmalloc and xcalloc as defined earlier in this course.
Since C0 is designed as a garbage-collected language, you will not be able to free space
allocated on behalf of C0 unless you are willing to implement (or use) a garbage collector.
We do not view this as a memory leak of the C0VM implementation. On the other hand,
temporary data structures required for the C0VM's own operation should be properly freed.
This means that when you are testing your own code using valgrind, you must be
aware of which bc0 files should be free of valgrind memory leak reports and which should
produce reports of memory leaks. The autograder will not penalize valgrind reports of
memory leaks for tests that allocate space on the C0 heap.
Note that the autograder will also not penalize for memory leaks for tests that are sup-
posed to end in a runtime error.
1.3 Runtime Errors
In order to fully capture the behavior of C0 programs, you must correctly issue errors for
things like dereferencing NULL, indexing an array outside of its bounds, and dividing by zero.
Check the C0 Language Reference for details on what kinds of errors you must handle, and
then use the following provided functions (defined in c0vm_abort.h and c0vm_abort.c)
to issue appropriate error messages:
void c0_user_error(char *err); // for calls to error() from C0
void c0_assertion_failure(char *err); // for failed assertions from C0
void c0_memory_error(char *err); // for memory-related errors
void c0_arith_error(char *err); // for arithmetic-related errors
c© Carnegie Mellon University 2020
15-122 Programming Homework 11 12 Page 5 of 16
For unexpected situations that arise while executing bytecode, situations which could indi-
cate a bug in your VM, you may use the standard C library functions abort or assert to
abort your program. See Section 3.3 for more details on this distinction.
2 Instruction Set
We group the instructions by type, in order of increasing complexity from the implementation
point of view. We recommend implementing them in order and aborting with an appropriate
message when an unimplemented instruction is encountered. Each task in this assignment
corresponds to one or more sections, and tasks are summarized in Section 3.1.
2.1 Stack Manipulation (Task 1)
There are three instructions for direct stack manipulation without regard to types.
0x59 dup S, v -> S, v, v
0x57 pop S, v -> S
0x5F swap S, v1, v2 -> S, v2, v1
2.2 Arithmetic Instructions (Task 1)
Arithmetic operations in C0 are defined using modular arithmetic based on a two's comple-
ment signed representation. This does not match your implementation language (C) very
well, where the result of signed arithmetic overflow is undefined. There are two solutions
to this problem: first, because unsigned arithmetic overflow is defined to be modular arith-
metic, you could cast int32_t values to uint32_t, perform unsigned arithmetic, then cast
back. This should not be required in your implementations, however: we will compile your
code with the -fwrapv command-line argument that forces gcc to treat integer arithmetic
to be defined as signed two's complement arithmetic. (Remember, however, that this is not
part of the C standard.)
Casting signed integers to unsigned integers and/or using the -fwrapv command-line
argument does not do everything that is needed to ensure C0 compliance, but it goes a long
way. We recommend a careful reading of the arithmetic operations in the C0 Reference, as
well as the notes on casting integers in C.
For this implementation strategy to be correct, it is important to verify that our C
environment does indeed use a two's complement representation and that the C type of int
has 32 bits. The provided main function (see the file c0vm_main.c) performs these checks
before starting the abstract machine and aborts the execution if necessary.
In the instruction table below (and for subsequent tables), we use w32 for the type of
primitive values and * for the type of reference values. Each line has an opcode in hex
notation, followed by the operation mnemonic, followed by the effect of the operation, first
on the stack, then any other effect.
0x60 iadd S, x:w32, y:w32 -> S, x+y:w32
0x7E iand S, x:w32, y:w32 -> S, xy:w32
c© Carnegie Mellon University 2020
15-122 Programming Homework 11 12 Page 6 of 16
0x6C idiv S, x:w32, y:w32 -> S, x/y:w32
0x68 imul S, x:w32, y:w32 -> S, x*y:w32
0x80 ior S, x:w32, y:w32 -> S, x|y:w32
0x70 irem S, x:w32, y:w32 -> S, x%y:w32
0x78 ishl S, x:w32, y:w32 -> S, x<0x7A ishr S, x:w32, y:w32 -> S, x>>y:w32
0x64 isub S, x:w32, y:w32 -> S, x-y:w32
0x82 ixor S, x:w32, y:w32 -> S, x^y:w32
Safety violations in idiv, irem, ishr, and ishl can cause run-time errors (use the pro-
vided function c0_arith_error to generate a message). Please refer to the C0 language
specification for important details.
We have omitted negation -x, which the compiler can simulate with 0-x, and bitwise
negation ~x, which the compiler can simulate with x^(-1).
2.3 Constants (Tasks 1 and 2)
We can push constants onto the operand stack. There are three different forms: (1) a
constant null which is directly coded into the instruction, (2) a small signed (byte-sized)
constant which is an instruction operand and must be sign-extended to 32 bits, and (3)
a constant stored in the constant pool. For the latter, we distinguish constants of primitive
type from those of reference type because they are stored in different pools.
The two constant loading instructions ildc and aldc take two unsigned bytes as operands,
which must be combined into an unsigned integer index for the appropriate constant pool.
The integer pool stores the constants directly, and the index given to ildc is an index into
the integer pool. The string pool is one large array of character strings, each terminated by
’\0’. The index given to aldc indicates the position of the first character; its address is
therefore of type char* in C and understood by C as a string.
0x01 aconst_null S -> S, null:*
0x10 bipush S -> S, x:w32 (x = (w32)b, sign extended)
0x13 ildc