Dissecting C++ Part 2 : Compiler
In this part we will learn how the C++ compiler works. We write C++ as text; a source file is just a text file, and we need some way to transform that text into an actual application that our computer can run. Going from that text form to an actual executable takes two main operations: compiling and linking.
The only thing the C++ compiler needs to do is take our text files and convert them into an intermediate format called an object file. Those object files can then be passed on to the linker, and the linker can do its stuff. The compiler does several things when it produces these object files.
First it needs to pre-process our code, which means any preprocessor statements are evaluated then and there. Once our code is pre-processed we move on to tokenizing and parsing, basically sorting out this "English" C++ language into a format that the compiler can actually understand.
This results in something called an abstract syntax tree (AST), which is a tree-shaped representation of our code.
At the end of the day the compiler's job is to convert all of our code into either constant data or instructions. Once the abstract syntax tree is created, the compiler can start generating code. This code is the actual machine code that our CPU will execute.
We will use the same example as in part 1: a console_log function that is defined in another file, console.cpp.
For every cpp file that our project contains, we have to tell the compiler "hey, compile this cpp file".
Every single one of those files will result in an object file. These cpp files are essentially what are called translation units.
We have to realize that C++ doesn't care about files. Files are not something that exists in C++. In Java, for example, your class name has to be tied to your filename and your folder hierarchy has to be tied to your package, because Java expects certain files to exist. In C++ that is not the case!
A file is just a way to feed the compiler with source code. You are responsible for telling the compiler what kind of file this is and how the compiler should treat it.
Now of course if you create a file with the extension .cpp, the compiler is going to treat it as a C++ file. Similarly, if I create a file with a .h or .c extension, it's going to treat it as a header file or a C file respectively.
These are just default conventions that are in place. They are how the compiler will deal with a file if you don't tell it otherwise, and you can override any of them.
I can go around making a .jay file and telling the compiler to compile it, and that would be totally fine as long as I tell the compiler "hey, this is a C++ file, please compile it like a C++ file".
So, that being said, every C++ file that we feed into the compiler is compiled as a translation unit, and that translation unit results in an object file. It's actually quite common to include cpp files in other cpp files and create one big cpp file with a lot of files in it. If you do something like that and compile that one big cpp file, you are going to get one translation unit and thus one object file.
That's why there is a terminology split between what a translation unit is and what a cpp file actually is.
A cpp file doesn't necessarily have to equal a translation unit. However, if you just make a project with individual cpp files and you never include them in each other, then yes, every cpp file will be a translation unit and every cpp file will generate an object file.
Now these object files can be pretty big: you can see that console.obj is 46KB while outro.obj is just 4KB. The reason is that we are including iostream, which has a lot of stuff in it. That also makes them pretty complicated, so before we dive in and take a look at what's actually in an object file, let's create something a little bit simpler. Create a new file called math.cpp.
math.cpp will have a very basic multiply function which multiplies 2 numbers together and returns the result.
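The original listing is not reproduced here, so the following is only a sketch of what math.cpp might look like. The parameter names a and b and the local variable result match the names used later when we read the assembly; the exact layout is an assumption.

```cpp
// math.cpp (sketch): a deliberately tiny translation unit with no includes.
int multiply(int a, int b)
{
    // Storing the product in a local first is redundant, but it makes the
    // unoptimized assembly output easier to follow later on.
    int result = a * b;
    return result;
}
```

Because this file includes nothing, the object file the compiler produces for it stays small.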
Now compile math.cpp and check the output directory. It will have math.obj, which is just 4KB.
Before we take a look at what exactly is in that object file, let's talk about the first stage of compilation, i.e. pre-processing.
During the pre-processing stage the compiler will basically just go through all of our preprocessor statements and evaluate them. The ones that we commonly use are #include, #define, #pragma, #if and #ifdef.
So let's take a look at one of the most common preprocessor statements that we have: #include.
How does that work?
#include is actually really simple. You specify which file you want to include, and the pre-processor will open that file, read all of its contents and just paste them into the file where you wrote your include statement. That's it! We can prove that.
Create a new header file called endbrace.h. Just add a '}' to the file and save it, that's it.
Now go to math.cpp, remove the closing brace '}' and compile it. You will get an error. Instead of solving this like a normal person (by adding the closing brace), let's include our endbrace.h header file where the brace used to be and compile again.
And look, it compiles successfully. Of course it did, because all the compiler did was open endbrace.h, copy whatever was in there and paste it into math.cpp. Header files solved: now you should know exactly how they work and how you can use them.
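To make the copy-and-paste behaviour concrete, here is a sketch of the experiment. The two files are shown as comments, since a single listing cannot contain two files, and the body of multiply is assumed to be the simple version described earlier. The code underneath is what the compiler effectively sees once the pre-processor has run.

```cpp
// endbrace.h (the whole file is one character):
//     }
//
// math.cpp as written, with the closing brace replaced by an include:
//     int multiply(int a, int b)
//     {
//         int result = a * b;
//         return result;
//     #include "endbrace.h"
//
// After pre-processing, the include line has been replaced by the contents
// of endbrace.h, so the compiler sees a perfectly ordinary function:
int multiply(int a, int b)
{
    int result = a * b;
    return result;
}
```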
There is actually a way to tell the compiler to output a file which contains the result of all the pre-processor evaluation that has happened. Right click on your project and go to Properties -> C/C++ -> Preprocessor and change Preprocess to a File to Yes. (Make sure you are changing the current configuration and platform so that it applies.)
Now compile it again. If we bring up our output directory, we will see a new .i file (math.i), which is our pre-processed C++ code.
Let's open it in a text editor and look at it. Here you can see what the pre-processor has actually generated. Our source code (math.cpp) had #include "endbrace.h", and in the pre-processed code that line has simply been replaced by the closing brace that was in the header file. Pretty simple stuff!
Now let's add some more preprocessor statements, but first remove that #include "endbrace.h" from math.cpp (it's annoying) and put the closing brace back. Now let's define INTEGER as int (don't ask me why I would ever do that -_- , this is just an example). The #define preprocessor statement will basically just do a search for INTEGER and replace it with int. So let's replace every int in our function with the word INTEGER. Hit compile!
Now open the math.i file again and you can see what's happened: it just looks normal, with int everywhere we wrote INTEGER. Let's play around with it a bit more.
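As a sketch of that step (the function body is assumed to be the same simple multiply as before), math.cpp with the macro might look like this:

```cpp
// Every later occurrence of the token INTEGER is replaced with int by the
// pre-processor, before the compiler proper ever sees the code.
#define INTEGER int

INTEGER multiply(INTEGER a, INTEGER b)
{
    INTEGER result = a * b;
    return result;
}
```

In math.i the #define line is gone and the function reads int multiply(int a, int b), exactly as if we had typed int ourselves.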
Change your math.cpp back to normal (replace INTEGER with int and remove the #define statement). Now let's add #if. The #if preprocessor statement lets us include or exclude code based on a given condition. So in math.cpp write #if 1, which basically means true, above the function, and then write #endif after the end of the function. Go ahead and compile the file and open the math.i file again. You will see it's exactly the same as before.
Now if we change #if 1 to #if 0, compile again and take a look at the math.i file, we will have no code at all.
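A sketch of the #if 1 version, again assuming the same simple multiply body, could look like this:

```cpp
#if 1   // change this to 0 and everything up to #endif is dropped
int multiply(int a, int b)
{
    int result = a * b;
    return result;
}
#endif
```

With #if 0 the function never reaches the compiler, so any other code that tried to call multiply would fail to build.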
It's another great example of how preprocessor statements work. Now I promise this one is the last :p . Let's include iostream (massive iostream) and compile.
Now go ahead and take a look at the math.i file (it's massive, almost 56K lines).
It's all iostream. Of course iostream also includes other files, which include other files in turn, so it's kind of like recursion.
You can now hopefully see why those object files were so big: they included iostream, and that is a lot of code. Alright, that's all for the preprocessor statements. Once the pre-process stage is done, we can actually move on to compiling our simple C++ code into machine code.
Go back to math.cpp, remove iostream and compile it again. Also go back to your project properties and set Preprocess to a File back to No. (We need to disable it so that we get our .obj file again.) Now build your project again.
Now let's take a look at what's inside our math.obj file. If you try to open math.obj with a text editor, you will see it's binary, which doesn't really help us much. Since that is completely unreadable, let's convert it into a form that we can actually read.
There are several ways we can do this, but let's just do it with Visual Studio. Right click on your project -> Properties -> C/C++ -> Output Files -> change Assembler Output to Assembly-Only Listing and click OK.
Now hit Ctrl+F7 to compile math.cpp. You will get a new file, math.asm. (Ctrl+F7 only compiles the file that you currently have open; if you build the whole project you will get more .asm files.)
Now open math.asm with a text editor. This is basically a readable version of what that object file actually contains. If we go down to line 29 we will see a function called multiply, followed by a bunch of assembly instructions.
These are the actual instructions that our CPU will execute when we run the function. We are not going to go into huge detail about all this assembly code, but if we take a look we will see that our multiplication actually happens on line 44. We load the a variable into our EAX register and then perform an imul instruction, which is a multiplication instruction, with the b variable. We then store that result in a variable called result and move it back into EAX to return it. The reason this double move happens is that we made a variable called result in our math.cpp which stores the multiplication result and then returns it (which is completely redundant).
This is another great example of why unoptimized code is slow: if you set your compiler not to optimize, it does extra stuff like this for no reason. Let's go back to our code, get rid of that result variable, simply return the multiplication result and then compile it.
You will see that your math.asm looks slightly different, because now we are doing the imul and just returning the result. EAX is going to contain our return value.
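A sketch of the simplified function, under the same naming assumptions as before:

```cpp
// No intermediate variable: the product is returned directly, so even the
// unoptimized build has no extra store to a local and load back from it.
int multiply(int a, int b)
{
    return a * b;
}
```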
Now all this may look like a lot of code. That's because we are compiling in debug mode, which doesn't do any optimization and does extra stuff to make sure that our code is as verbose and as easy to debug as possible. Now go back to the project, right click on it -> Properties -> C/C++ -> Optimization -> set Optimization to Maximize Speed.
Now go to the Code Generation section and change Basic Runtime Checks to Default, which won't perform runtime checks.
Let's hit Ctrl+F7 and look at that assembly file again. Wow, that looks a lot smaller: we've basically just got our variables being loaded into a register, then the multiplication, and that's it. You should now have a basic idea of what the compiler does when you tell it to optimize: it optimizes 😎
That was a very simple example, so let's take a look at something slightly different. Change your math.cpp so that multiply takes no parameters and simply returns 5 * 2, then go to the properties and disable optimization. Compile it again.
Open math.asm. You can see that what the compiler has done is actually really simple. It has simply moved 10 into our EAX register, which is the register that stores our return value.
So if we take a look at our math.cpp again, the compiler has basically simplified our 5 * 2 to 10, because there is no need to do something like 5 * 2 at runtime. This is called constant folding: anything that is constant and can be worked out at compile time is worked out at compile time.
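A sketch of the function for this experiment might look like the following. The parameterless signature is an assumption; the only thing the text tells us is that the function returns 5 * 2.

```cpp
// Both operands are compile-time constants, so the compiler can fold the
// expression and emit the value 10 directly instead of an imul instruction.
int multiply()
{
    return 5 * 2;
}
```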
Let's make it more interesting by adding another function, Log. It's not going to log anything, because then we would have to include iostream, which I don't want (it's going to complicate stuff). It's just going to return the message. We also change multiply back to multiplying a and b, and call Log from it before doing the multiplication. Compile it!
Now let's take a look at what the compiler has generated. Scroll down a bit and you will see our Log function, which doesn't really do much other than return our message. You can see it moving our message pointer into EAX, which is our return register. If we scroll further we also have our multiply function, and in that function we have a call to Log. Right before we do our multiplication using imul, we call this Log function.
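A sketch of math.cpp at this stage might look like the following. The const char* type and the "Multiply" string are assumptions made for illustration; the text only says that Log returns its message and that multiply calls it before multiplying.

```cpp
// Log does no real logging; it just hands its argument straight back.
const char* Log(const char* message)
{
    return message;
}

int multiply(int a, int b)
{
    // The return value is ignored, so this call has no observable effect.
    Log("Multiply");
    return a * b;
}
```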
Now you might be wondering why the Log function's name is decorated with random-looking characters and signs. This is the function's decorated (mangled) name, which encodes its signature and has to uniquely identify your function. We will talk more about this in the linking part, but essentially, when we have multiple object files and our functions are defined across them, it's going to be the linker's job to link all of them together, and the way it does that is by looking up these decorated names.
In this case the call is a little bit stupid, because we are simply calling Log and not even storing the return value. This can be optimized quite a bit. If we go back, turn optimization on again with Maximize Speed and compile, you will see that the call just disappears entirely.
Yup, the compiler just decided "this code does nothing, I am going to remove it". You should now understand the gist of how the compiler works. It takes our source files and outputs object files which contain machine code and any constant data that we have defined.
That's basically it. Now that we have these object files, we can link them into one executable which contains all of the machine code that we actually need to run, and that's how we make a program in C++. Hope you learned something new!
Part 3 : Linkers