Introduction to Code Analysis

This document is a rewrite of the section on code analysis in the Standard Introduction. As such, it is presented here in addition to the Standard Introduction.

The motive for reading source code

The best way to understand a program is to read it carefully. First, try to discern the purpose of the various sections of the program. Create a high-level processing chart, in your mind at least. Then examine the program iteratively, understanding ever finer details of the code. The rest of this document outlines various tactics for accomplishing this strategic goal.

Decisions, Decisions...

Now that you have decided to understand how this program works, you have to decide how to accomplish the task. That depends to a large extent on your talents and experience level. The various tactics available are detailed below. Which ones you use is up to you. Find out which techniques work for you. Write your own tools if you think it will help.

The tool IDBG came about because I wanted a tool that not only made it easy to insert instrumentation into programs but, more importantly, made it easy to remove that instrumentation without leaving corrupted code behind.

Technique

Roughly speaking, analytical techniques can be classified as static or dynamic. Static techniques start with reading the source code. Dynamic techniques use debuggers and similar tools to trace the program flow at execution time.

It is good to begin with dynamic analysis. With static analysis alone it is somewhat more difficult, in the beginning, to fathom the operation of the program. What you examine with dynamic analysis is fact. Instrumentation reporting on the program flow is not subject to opinion. It is good to start from a position of truth.

Static analysis

Run the program

Run the program with various input documents and see what it does. Which inputs are valid, and what outputs are produced? From this data, you should be able to discern a starting point for analysis.

Read available program documentation

Read all available documentation, especially anything containing words like Hacking, Tour, and Quick Start. You should also try to validate the documented use cases. Does the program actually work the way it is documented? Again, the running program does not lie; the documentation often does!

Directory structure

Examine the directory structure of the source code.   How the development directories are structured will give you hints on the module and source code relationships.

File structure

Another important aspect of reading source code is to notice the names and prefixes used in module and function names. Try to discern any systematic naming conventions and the meaning of abbreviations. For example, in Ruby, does GC relate to Garbage Collection or Graphic Context? Ruby is very good about naming conventions.

Investigation

Investigate the context in which names are used. What is the relationship between the basic functions of the program and the generally accepted names and abbreviations used by software professionals? Ruby uses hash tables extensively. Ask yourself, why?

Data structures

Understanding the data structures used by a program is half the battle. Knowing them gives a programmer a step up in understanding the source code. What types of structures are used? We have already mentioned hash tables. Ruby also uses the stack extensively. Look into what is stored on the stack.

C definition files (*.h)

C definition files can be mined extensively for information: structures and unions, obviously, but also variable names and the documentation associated with them. In the case of Ruby, there are a large number of interesting constructs defined in these files.

Call Graphs

Use tools to create both static call graphs and dynamic flow data.   Call graphs help the programmer grasp the overall processing flow.   Dynamic processing flow reports how the program works when processing different input data.   That is, the processing flow will often change depending on the data being processed.   This is especially true with interpreters.

Functions

As a better understanding of the overall program is reached, the inner workings of its functions need to be better understood. You move from understanding functions from a black-box perspective to understanding how the black box actually works.

Rewrite it and run it

Another technique for understanding how a particular section of code works is to modify it. You will often find that the section of code does not conform completely to your understanding. The process of determining why it surprised you will often lead to a deeper understanding of the source code!

Reading Change History

Reading the Change Log for a program will give a lot of insight into the workings of the program. Look at the changes and the stated reason for each change. Try to work out why the change was necessary. Were there a number of changes in the same area of the code?

In addition, when a source code management system such as CVS is used, there is a record of the actual changes and a history of release notes. Again, examining this information is invaluable. The best feature of source code management systems is that the exact code changes are maintained!

Tools for Dynamic Analysis

With dynamic analysis, you analyze the processing flow using a debugger and implanted print statements.

IDBG Debugger

Printf debugging is normally considered a primitive form of dynamic analysis.  However, with the IDBG Suite of programs,  it is somewhat easier to use effectively.  The tool suite is made up of four programs.  

IDBG produces a control file with simple commands that idbg/rdbg understand.   It comprises a preamble header for loading support programs,  followed by an entry for each function in a program file.   Without enhancement,  IDBG will print a function header for each function as it executes.  

GDB and SourceNavigator

GDB and SourceNavigator (snavigator) allow a programmer to explore a program at both static and dynamic levels.

DDD - Data Display Debugger

In addition, when DDD (the Data Display Debugger) is used, data is somewhat easier to understand. DDD is a front-end GUI for several debuggers. For example, the following snapshot of DDD shows how a linked list can be displayed.

(Reference)

See: "http://www.gnu.org/software/ddd/"

Ctrace, Strace and IDBG Programs

The program ctrace allows programmers to install tracing instrumentation into their programs. It is usually targeted narrowly at critical sections of the program's code.

The program IDBG can be used as a stand-alone debugger,   or used to enhance the versatility of Ctrace.

The program strace prints out the system calls executed during a program's execution.

See: "http://www.vicente.org/ctrace/"

See: "http://www.hawthorne-press.com/idbg/"

See: "http://www.liacs.nl/~wichert/strace"

Recommended Reading

"Programming Languages" Ravi Sethi, Tom Stone; Addison-Wesley Pub Co; ISBN: 0201590654; 2nd edition (February 1996)

Tools for static analysis

global

See: "http://www.gnu.org/software/global"

For C/C++, Bash, Java, and other languages. Provides cross-reference and tagging functions. This program is a more extensive and capable version

cscope

See: "http://cscope.sourceforge.net"

Cscope is a curses-based source code viewer. It has many features in common with global. It can generate a cross-reference list, find symbols, and perform many other functions. However, snavigator is a better alternative.

ctags and etags

See: "http://ctags.sourceforge.net"

Basically for the C language (ctags for vi and etags for Emacs). These programs generate a tag file for the target editor. The tag file records the positions of functions and variables in the targeted files, which allows these items to be located quickly and easily from a text editor.

lxr

See: "http://lxr.sourceforge.net"

A tool developed to support reading the Linux source code. The name comes from Linux Cross Referencer.

doxygen

See: "http://www.stack.nl/~dimitri/doxygen"

This program extracts comments from a source code repository and combines this information with cross-reference data. Doxygen helps you navigate through large code bases.

cxref

See: "http://www.gedanken.demon.co.uk/cxref"

Documentation is produced for each of the following:

cflow

See: "http://wh58-508.st.uni-magdeburg.de/sparemint/html/packages/cflow.html"

GNU cflow analyzes a collection of C source files and prints a graph, charting control flow within the program.

GNU cflow is able to produce both direct and inverted flow graphs for C sources. Optionally a cross-reference listing can be generated. Two output formats are implemented: POSIX and GNU (extended).

Input files can optionally be preprocessed before analyzing.

The package also provides an Emacs major mode for examining the produced flowcharts in Emacs.

SXT

See: "http://sxt.freeservers.com"

The relationships between functions can be visualized. Programs are provided to display a call graph and a data structure graph.

VCG

See: "http://rw4.cs.uni-sb.de/users/sander/html/gsvcg1.html"

The VCG tool reads a textual and readable specification of a graph and visualizes the graph. If not all positions of nodes are fixed, the tool lays out the graph using several heuristics, such as reducing the number of crossings, minimizing the size of edges, and centering nodes.

graphviz

See: "http://www.research.att.com/sw/tools/graphviz"

Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. Automatic graph drawing has many important applications in software engineering, database and web design, networking, and in visual interfaces for many other domains.




The original work is Copyright © 2002 - 2004 Minero AOKI.
Translations and Additions by C.E. Thornton
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.