Execution Process of a C Program

In this lesson, we will discuss each step of the execution process of a C program in detail. This process, commonly known as the C build process, is a frequent topic in software interviews. Keep in mind that this lesson might not be beginner friendly. If you are a beginner, go through it, understand as much as you can, and come back to it once you have completed the entire course.

What happens during the Execution Process of a C Program?

After we are done writing a C program, we save it with the .c extension. As discussed in previous lessons, this file can’t be executed directly and has to go through a series of steps. The main thing that happens during the execution process of a C program is the conversion of the high-level source code into an executable binary file. This process involves many steps and tools, but to simplify, the three main distinct steps are listed below:

  • Compilation: Each source file is compiled or assembled into an object file.
  • Linking: All of the object files from the previous step are linked together to produce a single object file. This file is also called the relocatable program.
  • Relocation: Physical memory addresses are assigned to the relative offsets within the relocatable program in a process called relocation.

There are many more intermediate steps too apart from the three steps listed above and we will discuss them below. The result of all these steps is a file containing an executable binary image ready to run on your system. Let’s then begin and try to understand the processes in more detail.

Preprocessor

Figure: The preprocessor converts .c files into .i files

Preprocessing is the first step in the execution process of a C program. In this step, the saved .c file is run through a separate program called the preprocessor. The input file for this stage is the .c file and the output file is a .i file (intermediate file). Usually, the preprocessor program is built into the compiler and performs the functions below.

  • The preprocessor strips out all single-line and multi-line comments present in the input .c file.
  • The preprocessor replaces each #include directive with the contents of the named header file, so that the declarations of the functions it provides are visible to the rest of the program. The function definitions themselves are brought in later, at link time.
  • Finally, the preprocessor evaluates the remaining directives, that is, lines starting with the pound symbol (#). For example, every name introduced with #define is replaced by its user-defined value throughout the code.

The final result of the preprocessor program is pure C code without any directives or comments. This is finally saved in the .i file.

Note: Mistakes introduced in the preprocessor stage, such as a macro that expands to unintended code, can be hard to detect, because the output of the preprocessor is fed directly into the compiler and is not normally inspected.

Compiler

Figure: The compiler converts .i files into .s or .asm files

In this stage of the execution process in C, the pure C code saved in the .i file is fed to the compiler and converted into architecture-specific assembly code. This conversion is not a one-to-one mapping of lines; rather, it is a decomposition of C operations into numerous assembly operations, each of which is a very basic task. The output of this process is a .s or a .asm file.

Parts of a Compiler

Figure: The two main parts of a compiler

A compiler can be broadly divided into two parts: the Front End and the Back End. The front end is generally responsible for analyzing the source code and detecting errors. The back end is responsible for the actual assembly code generation as well as optimisation of the generated code. The actual tasks of both these parts are detailed below.

Front End: Analysis

As discussed above the front end of the compiler is responsible for the analysis of the source code to find out all sorts of possible errors. This can be visualized clearly in the image below.

Figure: Visualization of the front end of a compiler

  • The first stage of the front-end part of the compiler consists of scanning the input text from the source code and tokenization of the text by identifying tokens such as keywords, identifiers, operators and literals. These scanned tokens are then passed through a parsing tool that ensures that the tokens are organized according to the C rules to avoid any syntax errors.
  • The second stage consists of checking the parsed sentences from the first stage for correct meaning. This is nothing but checking for semantic errors. The key part of semantic analysis has to do with the variables that are present in the source code. The compiler maintains the information about all the declared variables in a structure called the “Symbol Table”. Here the variables are looked up along with their attributes such as type, scope and so on.
  • Suppose the statement passes the semantic check and is found to be meaningful and correct. In that case, the compiler takes its next action: translating the statement into an internal representation. What this means is that the compiler takes the high-level language construct (regardless of the original source code language) and converts it into a form which is closer to assembly, so that it can be compiled for different targets.
Types of Semantic Errors
1: Undeclared variables, that is, variables used without any declaration.
2: Unavailable variables in a given scope, although they have been declared.
3: Incompatible types, for example: if a structure variable is used as an operand of an addition, that is a semantic error, because the + operator is not defined for structures. (Adding characters is legal in C, since char is an integer type.)

Back End: Synthesis

The second part of the compiler, also referred to as the Back End, deals with the actual code generation as well as optimization of the generated code.

  • The first thing that happens in the back end of the compiler is optimization. The current generation of compilers is smart enough to do this very efficiently based on the target system where the code is going to be executed. There are many ways in which optimization can take place such as transforming the code into smaller or faster but functionally equivalent blocks, inline expansion of functions, dead code removal, loop unrolling, register allocation and more.
  • The last thing that the compiler’s back end does is the actual code generation by converting the optimized intermediate code structure into assembly code.

Assembler

Figure: The assembler converts .s or .asm files into .o or .obj files

In this stage of the execution process of a C program, the assembly code that was generated by the compiler in the previous stage (.s or .asm files) gets converted into object code (.o or .obj file) by the assembler. The current generation of compilers can do this without the need for an independent assembler program, as assemblers are now integrated into the compilers.

Furthermore, the output file generated here is made up of opcodes and data sections. What this means is that after the code generation is finished, the compiler allocates memory for code and data in sections. Each of these sections consists of different information and is defined by a different name for the information stored in them. To understand how the compiler does this (allocates memory to the code and data), we first need to have an idea about the various memory segments and what they contain.

Memory Sections in C

The final output of the execution process of a C program is the binary executable file (for example, a .exe file on Windows). When this file is executed, the program instructions inside it are loaded into RAM in an organized manner. Computers do not execute these instructions directly from secondary memory (hard disk or SSD), because the access time of secondary storage is much higher than that of RAM. But RAM has limited storage capacity, so it is necessary for programmers to use this limited but high-speed storage efficiently.

The memory layout in C mainly consists of six components as mentioned below:

  • Heap
  • Stack
  • Text or Code Segment
  • Command Line Arguments
  • Uninitialized Data Segments
  • Initialized Data Segments
Figure: Memory layout of a C program

Each of these segments has its own read and write permissions and a segmentation fault may occur when a program tries to access any of the segments in a way that is not allowed. This will finally result in a program crash. Let’s now try to understand each of these segments in a little more detail.

Text / Code Segment (.text)

After the compilation is completed during the execution process of a C program, a binary file containing the machine code is generated. This file is used to execute the program by loading it into RAM. The instructions contained in this binary file are stored in the text segment of the memory layout. Usually, the text segment is shareable, so that only a single copy is required in memory for frequently executed programs such as text editors, the C compiler, shells and so on. The text segment also has read-only permissions to prevent any accidental modification of the machine code. This segment is also kept at the bottom end of the memory layout to avoid any accidental spillover from the stack and heap sections.

Initialized data segment (.data)

The initialized data segment, usually referred to as the data segment, is part of the virtual memory space of a C program and contains external, global, static and constant variables whose values are initialized at the time of variable declaration in the program. The values of most of these variables can change later during program execution, so this memory segment has read-write permissions; variables declared const are the exception. Based on this, the segment can be further divided into two parts: a read-write area and a read-only area. An example code snippet is mentioned below for clarity on this.

In the example above, the global variable global_var and the pointer hello are stored in the read-write part of the initialized data segment same as the static variable a but the global variable global_var2 is stored in the read-only part of the initialized data segment because it was declared with the keyword const.

Uninitialized data segment (.bss)

The uninitialized data segment also referred to as the bss (block started by symbol) segment contains memory for all the uninitialized variables in the program. All uninitialized variables in bss are initialized to arithmetic 0 and all uninitialized pointers to NULL pointers by the kernel before the C program executes. This data segment also contains memory for all static and global variables that were initialized with the arithmetic 0 at the start of the execution. Since values of variables stored in this memory segment can change during the program run time, this data segment also has read-write permissions.

Expected Outcome

In the example above, both the global and static variables global_variable and static_variable are uninitialized. Hence, they are stored in the bss segment in the memory layout of the C program. Also, before the program execution begins, these values are initialized with the arithmetic value 0 by the kernel as can be verified with the output of this program.

Stack Segment

The stack segment stores the values of local variables and the parameters passed to a function, along with additional information such as the return address, which is where execution resumes after the function call completes. This segment follows a LIFO (Last In First Out) structure and typically grows down towards lower addresses, opposite to the way the heap segment grows. A stack pointer register keeps track of the top of the stack as push and pop operations are performed on this memory segment. The example below will help further understand the variables stored in the stack memory segment.

In the above example, all the variables are stored in the stack segment because they are declared inside their parent function’s scope. These variables only take up space in memory while their functions are executing. For example, in the above code, when the main() function starts its execution, a stack frame for it is created and pushed onto the program stack with the data of all the variables declared in it, such as local and name. Then, inside the main() function itself, the function foo() is called. So, another stack frame is created for it and pushed onto the program stack separately, with all the variables declared inside it, which are a and b. After the execution of foo() is completed, its stack frame is popped and the memory of its variables is deallocated. The same happens to the stack frame of the main() function when the program ends.

Heap Segment

The heap segment is used for memory which is allocated during the run time of the program (dynamic memory allocation). The heap begins at the end of the uninitialized (bss) segment and grows towards the stack. Library functions like malloc, calloc, realloc and free are used to manage these allocations in the heap segment. On many Unix-like systems, these functions internally use the brk and sbrk system calls to change the size of the heap segment. This segment is also shared among all the shared libraries and dynamically loaded modules in a process.

The above code is a small example of how dynamic memory allocation can take place in the heap segment during run time. Here, 1 byte of memory is allocated for a char at the time of program execution using the malloc function, and the pointer var refers to it. That byte is stored in the heap segment of the memory layout.

Command Line Arguments

When a program is executed with arguments passed from the console, the values received through argc and argv, along with the environment variables, are stored in this part of the memory layout in C.

The above example explains how command-line arguments are passed and used in a C program. The actual work of the code is out of the scope of this lesson and will be covered in later lessons.

Object File

The final output of the assembler is the object (.obj) file. To understand what is going to happen in the next stages of the execution process of a c program, we need to have a basic idea about the contents present inside the object file created after the completion of the assembly process. The below-mentioned list details the contents of the object file.

Figure: Contents of an object file in a C program

  • The object file contains all the sections of the static memory layout i.e. Text / Code Segment (.text), the initialized data segment (.data) and the uninitialized data segment (.bss). These are available throughout the whole program.
  • The object file also contains the symbol table which is used to store all variable names and their attributes.
  • The Debug info section contains the mapping between the original source code and the information needed by the debugger.
  • The exports section contains global symbols for either functions or variables.
  • The imports section contains symbol names that are needed from other object files.

The exports, imports, and symbol table sections are used by the linker during the linking stage described below.

Linker

Figure: The linker converts .o or .obj files into a relocatable file

In this stage of the execution process of a C program, all the different object files created by the assembler get converted into one relocatable file by the linker. While combining the object files, the linker mainly performs two operations: Symbol Resolution and Relocation. Let us try to understand them in a little more detail.

Symbol Resolution

In a multi-file program containing many source files or references to multiple libraries, there might be references to labels defined in other files. While creating the object files, the assembler marks these references as “unresolved”. When these object files are then passed through the linker, it determines the values of these references from the other object files and fills the code with the correct values. If the linker is not able to locate the definitions of these labels in any of the generated object files, it reports an “unresolved reference” linking error. If the linker finds the same symbol defined in two object files, it reports a “redefinition” error.

Relocation

Figure: Relocating code and data

Once the linker has completed the symbol resolution step, it is able to associate each symbol reference in the code with exactly one symbol definition in the input object files. At this point, the linker knows the exact sizes of the code and the data sections in the input object files. It can therefore begin the relocation step, where it merges the input modules and assigns run-time addresses to each symbol. Primarily, relocation consists of two steps: Section Merging and Section Placement.

Section Merging

Figure: Section merging in the linker

In the final output file, the linker is responsible for merging the sections from the input object files into the output file. By default, the sections with the same name from each file are placed in continuation with each other in the final output file and the labels to the references are patched to reflect the new run time addresses.  

Section Placement

Figure: Section placement in the linker

In each of the assembled object files, the code section starts from the address 0. Therefore all the labels are assigned values relative to the start of the section. When the final executable file is created, the entire merged code section is placed at some address X chosen by the linker. Therefore all the references to the labels defined in the merged section need to be patched and incremented by a value of X, to point to this new location in the program memory.

Locator

Figure: The locator is responsible for the creation of the final binary image file

In this step in the execution process of a C program, the process of assigning physical addresses to the relocatable file produced by the linker takes place using a special tool referred to as the locator. This tool performs the conversion from the relocatable program produced by the linker to an executable binary image. An additional input needed in this step is the target-specific linker file.

The linker script file, or linker configuration file, tells the locator how to map the executable to the proper addresses. It controls the memory layout of the output file, as it provides information on the memory layout of the target board as input to the locator. This information consists of the physical memory layout (Flash / SRAM) and also the placement of the program in different regions. The linker file is highly compiler-dependent as well as target-dependent, so each combination will have its own format.

Figure: Sample linker file for the TMS320F28069M microcontroller

This was the final step in the execution process of a C program and if all these steps were completed without any errors then the user is left with the final executable binary image file which can be now loaded into the target computer to run the program.

Summary

To summarize, the entire execution process of a C program can be quickly understood with the flow diagram below.

Figure: The complete execution process of a C program
