What is AST,CFG,CLANG, how can we use it in deadcode removal algorithm? [closed]
I am about to write a dead-code removal algorithm using C language for an online event with our team.
The requirements are.....
- To read a C program source file,Which has many forms of dead-codes.
- And our output should be a file, Which is free from all dead-codes.
While surfing the internet, we came across the SO links...
How can I know which parts in the code are never used?
Dead code detection in legacy C/C++ project
Before seeing these links,we had the basic idea... Reading the input C file, line by line using normal file stream and store in an string array. Then to analyze those strings and determine the very basic dead codes like if(0) and if(1) etc.. And making a stack, for maintaining the parenthesis. And so more...
But this has a great problem, that this idea will lead us to do more with string operations rather than removing dead-codes.
But After seeing these link... We came to know about Clang library,Abstract Syntax Tree,Control-Flow-Graph etc...
But we are开发者_StackOverflow社区 very newbie to those libraries and those concepts. We came to know that they are used to parse the C code.
Hence we need some basic ideas about these AST,CFG and some basic guidance, explaining how can we use that in our code...
Can we include that clang library as a normal library like math.h?
Where can we download that library?
Does we can use those Clang libraries in windows?
I can explain to you the concept of control flow graphs, but I am not familiar with the library itself.
The concept is simple. Imagine any sequential lines of code (that is without if
, goto
or function call or labels) as one node of a graph. Every goto
or function call creates a directional link from the current node to the node where the goto
label is or the function it is calling. Remember that a function itself could be a graph and not a simple node, because it may have if
s or other function calls inside. Each function call also creates a directional link from leaf nodes of the function (where the function return
s) to the node containing the codes right after the function call. (That can create a lot of links outgoing from the function graph because the function can be called in many parts of the code)
Likewise, if you have an if
, you have two direction links from the current node to both the if
part and the else
part of the if
statement (unless you detect if(0)
or if(1)
like you said in which case there is only one link to the proper location)
The root of your graph is the entry point of main
. Now what you must do to find dead code is to simply traverse the graph from the root position (using DFS or BFS for example) and in the end see which nodes were NOT visited. This shows you the dead codes, that is places in the code that no matter what direction your program takes, it won't reach those locations.
If you want to implement this yourself, you can take a recursive approach (similar to parsing the code but simpler). For example if you see an if
you say:
typedef char *line;
FlowGraph *get_flow_graph(line *code)
{
FlowGraph *current_node = malloc(sizeof *current_node);
current_node->flow_to = malloc(some_maximum * sizeof *current_node->flow_to);
current_node->flow_to_count = 0;
...
if (is_if_statement(code[0]))
{
FlowGraph *if_part = get_flow_graph(code + 1);
FlowGraph *else_part = get_flow_graph(code + find_matching_else(code));
current_node->flow_to[current_node->flow_to_count++] = if_part;
current_node->flow_to[current_node->flow_to_count++] = else_part;
}
else
...
}
You can see examples of control and data flow graphs extracted by the DMS Software Reengineering Toolkit.
We have done this on very large applications (26 million lines of C) using DMS's data flow analysis machinery and its C Front End, including a points-to analysis, which is a practical necessity if you really want to find dead functions in a large C system.
精彩评论