注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

中吴南顾惟一笑

成功法则就是那19个字

 
 
 

日志

 
 

Understanding C parsers generated by GNU Bison  

2010-08-06 17:44:10|  分类: R&D |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |
Satya Kiran Popuri
Graduate Student
University of Illinois at Chicago
Chicago, IL 60607
spopur2 [at] uic [dot] edu
Wed Sep 13 12:24:25 CDT 2006

Table of Contents

Introduction

Bison is a general-purpose parser generator that converts an annotated context-free grammar into an LALR(1) or GLR parser for that grammar. The generated parser is implemented as a C or C++ program with a parsing function that can be called from application programs.

It can be useful to understand the generated parser code in some situations. You may want to hand optimize the parser, or split the big yyparse() function into several smaller functions, or you want to understand how a real world parser is implemented as compared to the `theoretical' ways of compiler design books.

This document is an attempt to describe the implementation of an LALR(1) parser in C as generated by Bison 2.3. I used a simple grammar to demonstrate the working of the parser and the nature of parsing tables. You will also find a comparison of these tables to the uncompressed tabular scheme given in the popular book "Compilers - Principles, Techniques and Tools" by Aho, Sethi and Ullman, also called as the "Dragon Book" and many other books on compiler design.

Prerequisites

I am assuming that you are familiar with context free grammars and LR parsing techniques; a little hands on experience with Bison will also help. You must be familiar with the theory and the usual terminology like shifts, reductions, parse stacks and error handling. If you are not familiar with Bison at all, I strongly recommend that you read the Bison user manual before reading this document.

The LR parser

An LR parser is essentially a shift-reduce (bottom up) parsing algorithm driven by parsing tables and a stack. For an overview of the LR parsing algorithm, you can refer to the Dragon book or this excellent wikipedia entry.

The parsing algorithm is same (at least in theory) for all LR parsers. They differ only in the tables generated by the parser generator. In our case, Bison generates LALR(1) parsing tables in the form of C/C++ arrays. The generated yyparse() function implements a parsing routine that uses these tables and a stack to guide the parse.

An example grammar

I will use the following little grammar as an example to demonstrate the internals of a C parser generated by Bison. It will also be used to show the traditional "theoretical" tables and the real world parsing tables and the relationship between them.

(1) L → L;E
(2) L → E
(3) E → E,P
(4) E → P
(5) P → a
(6) P → (M)
(7) M → ε
(8) M → L

A Bison grammar specification for this grammar is like this:

%%
L : L ';' E
| E
;
E : E ',' P
| P
;
P : 'a'
| (M)
;
M : /* nothing */
| L
;
%%

Dragon Book style tables

The (uncompressed) tabular scheme described in the popular Dragon book and many other books on compiler theory serves well to gain a theoretical understanding of the working of an LR parser. Let us examine this scheme for the grammar given in the previous section.

The action and GOTO parts of the table are shown below. This LALR(1) table is calculated by hand.

  action   GOTO
state ; , a ( ) $   L E P M
0     s12 s6       1 8 9  
1 s2         acc          
2     s12 s6         3 9  
3 r1 s4     r1 r1          
4     s12 s6           5  
5 r3 r3     r3 r3          
6     s12 s6 r7     7 8 9 10
7 s2       r8            
8 r2 s4     r2 r2          
9 r4 r4     r4 r4          
10         s11            
11 r6 r6       r6          
12 r5 r5     r5 r5          

The resulting automaton has 13 states. A string belongs to the language generated by this grammar if we encounter the end of string while in state 1. The working of an LR parsing algorithm based on this type of table is fairly simple and is clearly explained in the wikipedia entry and in standard compiler theory books;

We can see that most of the table is either empty or filled with repeated entries. The empty slots in the action part of the table are errors; The GOTO part of the table has only 4 valid entries with the rest of the cells empty. These empty slots are not errors.. they are simply unreachable; It is impossible to detect an error on a non-terminal symbol because of the correctness of the LR algorithm.

A real world useful grammar would end up with hundreds of states and thousands of empty slots in such a table. There is a lot of scope for optimization in a practical parser generator. In fact many books including the Dragon book suggest some scheme for compression of parsing tables. Bison takes a different approach to table compression and so far I haven't come across this method in any of the standard text books.

Tables generated by Bison

Instead of generating a big sparse matrix like the one shown above, Bison generates several smaller one dimensional tables as C arrays. Not all tables are used directly by the parsing routine - some are used to print debugging information and error recovery. Before diving into the essential tables, I must describe some unique concepts utilized by Bison that are not used in theoretical table computing methods.

Bison's Augmented Grammar

Bison augments your input grammar with the following rule:

$accept : <start-symbol> $end

$accept and $end are fictitious symbols. This is somewhat similar to augmenting the grammar with a new start symbol as described by most books. In addition to the new start symbol $accept, Bison also has a terminal symbol $end which will help generate a "final" state of the automaton. This final state always has just one production:

$accept : <start-symbol> $end .

with the dot at the end of the augmented rule. This helps the parsing routine recognize when to accept the input string just by looking at the current state. Also it obviates the necessity to represent "accept" as an entry in the LALR tables. There is no notion of "final state" in theoretical LALR parsing (although it exists in an LR(0) characteristic FSM), but Bison uses one state as the final state by introducing the $end symbol.

One important implication of this augmentation is that the "Rule numbering" will now change for Bison generated tables. The $accept rule is considered rule #1, so in the tables produced by Bison, an r6 will be denoted by r7 and so on.

Default Reductions

In states where there is a reduction, the action for many symbols will be the same. For example in the table above, states 5, 8, 9, and 11 have a common reduce action for different terminal symbols. Some substantial space saving is made by making the most common reduction in that state a "default" action. For example state 5 can have the default action as r3. The parser generated by Bison will do a reduction using rule #3 if it finds itself in state 5 regardless of the look-ahead symbol! This might lead to some illegal reductions. But it will not matter because the error will be caught ultimately before another shift can take place.

Modified traditional tables

Taking the above two aspects into consideration, our dragon book style table will now change considerably. There are now 14 states due to the additional $end terminal symbol. And there is no "accept" entry in the table any more. State 8 simply is an accepting state.

The table given below is constructed directly from the .output report file generated by Bison 2.3. You can generate this report file by using the command line option --report=all. This is not the table that will be used for parsing in a Bison generated parser. This table here is only to gain a perspective on how the final tables will look like.

Table constructed from Bison report Hand calculated table - rearranged
  action   GOTO
state ; , a ( ) $end   L E P M
0     s1 s2       3 4 5  
1 r5 r5 r5 r5 r5 r5          
2 r7 r7 s1 s2 r7 r7   6 4 5 7
3 s9         s8          
4 r2 s10 r2 r2 r2 r2          
5 r4 r4 r4 r4 r4 r4          
6 s9 r8 r8 r8 r8 r8          
7         s11            
8* acc acc acc acc acc acc          
9     s1 s2         12 5  
10     s1 s2 s11         13  
11 r6 r6 r6 r6 r6 r6          
12 r1 s10 r1 r1 r1 r1          
13 r3 r3 r3 r3 r3 r3          
  action   GOTO
state ; , a ( ) $   L E P M
0     s12 s6       1 8 9  
12 r5 r5     r5 r5          
6     s12 s6 r7     7 8 9 10
1 s2         acc          
8 r2 s4     r2 r2          
9 r4 r4     r4 r4          
7 s2       r8            
10         s11            
-                      
2     s12 s6         3 9  
4     s12 s6           5  
11 r6 r6       r6          
3 r1 s4     r1 r1          
5 r3 r3     r3 r3          

The table constructed from the Bison report file is matched with our hand computed table. State 1 generated by Bison corresponds to state 12 in our table and so on. Of course there is no state equivalent to state 8 in the hand computed table because we didn't have the $end symbol and the notion of a final state.

Symbol numbers

Bison assigns a symbol number to each grammar symbol (terminal or non-terminal). Of course Bison has its own symbol table like any other compiler (Bison is a compiler compiler remember). As a rule, Bison always assigns numbers to terminal symbols first and then proceeds to non-terminals. i.e Non-terminals symbols always have higher numbers than terminals. Also, Bison introduces 4 new symbols in the symbol table:

$accept : The new start (non-terminal) symbol of the grammar.
$end : The fictitious end symbol. Symbol number 0 is reserved for this symbol.
error : The 'built-in' error symbol used for error productions. Symbol number 1 is reserved for it.
$undefined : A symbol that represents a token that Bison cannot recognize. (It is not a part of the grammar or not defined using %token). Symbol number 2 is reserved for this symbol.

Numbers for terminals begin from 3 onwards. You can check the symbol numbers assigned to various symbols by looking at the yytname array generated in the output parser. For our example grammar, the symbol numbers look like this:

Symbol Number
$end 0
error 1
$undefined 2
';' 3
',' 4
'a' 5
'(' 6
')' 7
$accept 8
L 9
E 10
P 11
M 12

Symbols 0 through 7 are terminals and the rest are non-terminals. These symbol numbers are used to index various tables as we will see in a moment.

Compressing the parsing table

Let us look at how we can optimally represent our 'modified traditional table'. The table is still a sparse matrix with a lot of white space and repeated information. We can start by collecting all repeated data at one place. The default reductions are our prime targets. They can all be listed in an array indexed by state number:

default_reductions[] = {none,r5,r7,none,r2,r4,r8,none,none,none,none,r6,r1,r3}

We did waste some space for states where there are no default reductions, but that is far less than all the locations used up for repeating reduce actions. Some of the states have only reduce actions and nothing else. So, these rows are pretty much "taken care of" by the above array. Observe that we have listed a default reduction for state 2 (r7) even though it had shift actions on a couple of symbols. Hence, this array may be used only after ensuring that there are no shift actions for the current state on the current look-ahead symbol.

The next obvious target would be the GOTO part of the table which is mostly empty. Since most states do not have a GOTO part, the best space saving scheme here would be to have an array indexed by non-terminal (column) instead of state (row). But each non-terminal can have a different GOTO state depending on the current state of the automaton. So, we choose to list out the most popular GOTO state of each non-terminal in an array like this:

default_gotos[] = { 3, 4, 5, 7 }

That's one entry each for non-terminals L, E, P, M. We still have to take care of the other GOTO states for L, E and P. For now this table has saved us a lot of wasted GOTO white space.

The task now has been reduced to representing the shift actions properly for each state and some GOTOs. This is what is left of the action and GOTO tables. I have zapped default reductions and also states that have only default reductions. Also the terminals and non-terminals are now replaced by their symbol numbers and instead of representing shifts as s1, s2 and so on, I have used only the state numbers as there are no reduction entries now:

  action
state 3 4 5 6 7 0
0     1 2    
2     1 2    
3 9         8
4   10        
6 9          
7         11  
9     1 2    
10     1 2 11  
12   10        
  GOTO
state 9 10 11
2 6    
9   12  
10     13

To compress this part of the table, Bison follows a method described by Tarjan and Yao [3]. It is a fairly complex method combining the idea of Trie data structures and "double displacement" with some of the authors' own ideas thrown in to improve time and space complexity results.

Double displacement is very straight forward. We flatten out the above 2-D table into a simple one dimensional array by mapping all non-white-space entries into some location of the array. We keep the order of elements in each row intact and displace each row by some amount such that no two non-white-space entries in each row occupy the same position in the one-dimensional array. Our mapping will be defined by a set of displacements D(i) for each row i; position (i,j) in the above table is mapped to position D(i) + j in our one-dimensional array. Here is a sample displacement table and the corresponding one dimensional table:

D = {4,6,0,0,12,9,8,0,13} and
T = {9,10,1,2,11,8,1,2,1,2,1,2,9,11,10}

The displacement table D will act as a directory to one dimensional table T that contains the actual entries from the 2-D table; the offset into T for element (i,j) is calculated by D[i] + j. e.g., to find the entry at (3,1)(action for state 4 on symbol 4) in the 2-D table above, we compute D[3] + 1 = 0+1 = 1 and hence T[1] = 10 is the entry we really want.

Even with the above scheme we have a lot of repeated entries in the table T which are really the same states (e.g, states 1 and 2). So this method is combined with column displacements and "Tries" to obtain a more compressed table organization. If this discussion has whet your appetite for more, you can refer to [3] for more details.

The final objective of Bison is to build a one-dimensional table T that specifies what to do in each state. This table will be indexed by a (one dimensional) directory table D. In the above discussion, displacements into T are indexed by the 2-D array index (i,j). But we would rather want to index into T by state number and symbol number instead (since the rows and columns of the 2D table are headed by states and symbols). Bison indexes D by state number and displacement into T by symbol number. If the next look-ahead symbol has number k, table entry for current state can be retrieved as follows:

Action for state n on symbol y = T[ D[n] + k ]

As an example, the action for state 0 is s1 on symbol number 5 ('a'). So if T[0] = 1 then D[0] = -5; to account for action s2 on symbol number 6, we would have T[ D[0] + 6] = T[1] = 2 and so on;

But there is a special case that we need to take care of. There can be some 'explicit' errors in the action table that cannot be over written by default reductions. These errors are due to any %nonassoc declarations in the grammar. In that case, we would have some reductions in the action part "not taken care of" by the default_reductions[] array.

To represent these reductions in the same table T, Bison generates negated rule numbers in T. The negative sign is just to differentiate the shifts (which are positive and represent state numbers) from the reductions. The explicit errors will be represented by another negative number which is out of range of rule numbers and defined explicitly by a macro (see YYTABLE_NINF below). We do not have this situation in our grammar.

Also D will have a specially defined negative value that will indicate that the current state has only a default reduction (like state 1 for example). The parser would always go for a default reduction if this value is encountered in D.

A similar directory table ND (say) can be built for the non-terminals symbols that have non-default GOTO states. This table will contain one reference entry for each non-terminal symbol. These entries are displacements in T, but indexed by state number.

If ND[P] = 2, and the current state is 10, we refer to T[12] to get the GOTO of state 10 on P. But given that current state S, the parser needs to choose between ND and default_gotos to pick the GOTO state. Here we can make ND[S] to be a value that is out of bounds of table T and hence force the parser to pick the GOTO from default_gotos.

Bison also has a guard table that is checked to see if we are within legal bounds of table T. Read on for the actual descriptions of these tables to connect all the dots.

Table descriptions

Bison generates parsing tables as linear arrays as seen in previous sections. All variable names in the output parser begin with 'yy' to avoid naming conflicts with user programs. What follows is a brief description of each table with some examples referring to the sample grammar.

yytranslate

This table maps lexical token numbers to their symbol numbers. If you have %token declarations in your grammar, Bison assigns token numbers to the different tokens; If you just use character representations like we have done in our grammar above, Bison just maps their ASCII values to the symbol numbers. So, in our case, yytranslate[97] = 5 which is 'a' (see symbol table above). The full listing of yytranslate is given below.

/* YYTRANSLATE[YYLEX] -- Bison symbol number corresponding to YYLEX.  */
static const yytype_uint8 yytranslate[] =
{
0, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
6, 7, 2, 2, 4, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 5, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 1, 2
};

Most entries point to symbol #2 which is $undefined. Only the symbols defined in our grammar have been given valid symbol numbers. Whenever yyparse() requires a token, it calls yylex and then translates the returned token number into its internal symbol number using this table.

yydefact

This table lists default reductions for each state. yydefact[state] = rule number to use for a default reduction in that state. Here is the yydefact table produced for our grammar:
/* YYDEFACT[STATE-NAME] -- Default rule to reduce with in state
STATE-NUM when YYTABLE doesn't specify something else to do. Zero
means the default is an error. */
static const yytype_uint8 yydefact[] =
{
0, 6, 8, 0, 3, 5, 9, 0, 1, 0,
0, 7, 2, 4
};

I will get to YYTABLE (mentioned in the comment) in a moment. Note that a rule number 0 in this table means error. Rule number 0 is not used internally by Bison because of some limitations of the internal representation of the input grammar.

If you take a look back at our modified dragon book style table, it is easy see what yydefact is trying to say. One important observation is in order here - rule numbers are incremented by 1 in this table because of the additional rule ($accept → L $end). So what was really r5 for state 1, has become r6 in yydefact.

yydefgoto

This is a compressed form of the GOTO part of our traditional table. It has as many entries as there are non-terminals in the grammar. Each entry specifies the state to transition to on each non-terminal. So here goes:

static const yytype_int8 yydefgoto[] =
{
-1, 3, 4, 5, 7
};

yydefgoto[nth non-terminal] = most common GOTO state for the nth non-terminal.

n starts from zero. An index into this array is obtained by subtracting the number of terminal symbols from the symbol number of the non-terminal. For example, the symbol number for E is 10 and number of tokens in the grammar is 8, so:

Thus yydefgoto[E] = yydefgoto[10-8] = state 4.

yydefgoto is consulted whenever the parser reduces stack contents using a rule. Later we will see that yydefgoto is consulted only after checking with another goto table, and that will explain how we manage to go to state 6 from state 2 on L instead of going to state 3 as this table specifies for L.

As a final note observe that the entry for the zeroth non-terminal ($accept) is -1. The stack will never be reduced with the $accept rule.

yyr1 and yyr2

yyr1 specifies the symbol number of the LHS of each rule. Remember that 0 is never used as a rule number, so this table has NRULES + 1 entries, where NRULES is the number of rules in the grammar (Which is 9 in our case). Here is the listing:

/* YYR1[YYN] -- Symbol number of symbol that rule YYN derives.  */
static const yytype_uint8 yyr1[] =
{
0, 8, 9, 9, 10, 10, 11, 11, 12, 12
};

So, rule #1 has $accept as LHS, and hence yyr1[1] = 8 (see symbol table given previously) and so on. When a reduction takes place, We need to know the LHS symbol of the rule used for reduction to transition to an appropriate state. That is where this table comes into use.

yyr2 specifies the length (number of symbols) of the right hand side of each rule. Here is a listing produced by Bison:

/* YYR2[YYN] -- Number of symbols composing right hand side of rule YYN.  */
static const yytype_uint8 yyr2[] =
{
0, 2, 3, 1, 3, 1, 1, 3, 0, 1
};

Rule #2 (L → L;E) has 3 symbols on the RHS, and hence yyr2[2] = 3. This table is also used at the time of a reduction. The number of states to be popped off the stack is same as the number of symbols on the right hand side of the reducing rule.

yytable

This table is a mixed bag of state numbers and rule numbers in some pre-calculated order. This is the table T we discussed in the previous section. yytable works closely with yycheck, yypact and yypgoto tables (described below) to indicate to the parser either the state to pushed next on to the parse stack or a rule to use for reduction. I will just dump the table listing here. The entries won't make sense until we look at yypact and yypgoto.

/* YYTABLE[YYPACT[STATE-NUM]].  What to do in state STATE-NUM.  If
positive, shift that token. If negative, reduce the rule which
number is the opposite. If zero, do what YYDEFACT says.
If YYTABLE_NINF, syntax error. */
#define YYTABLE_NINF -1
static const yytype_uint8 yytable[] =
{
8, 1, 2, 9, 11, 10, 9, 6, 12, 0,
0, 0, 13
};

One thing worth noting here is the definition of YYTABLE_NINF - the "negative infinity" value for yytable is the highest negative entry which in our case is -1 (since there are no negative values in yytable). This value is used to determine explicit error situations.

yypgoto

This table gives a reference to GOTO entries for non-terminals that can transition the automaton to different states based on previous state. For example, the symbol L can take the automaton to state 3 if the present state is 0, but the GOTO on L for state 2 is state 6. Similarly for symbols E and P there are different GOTO states based on the current state. The most common GOTO is already defined in YYDEFGOTO. The job of YYPGOTO is to indicate the anomalies. Lets take a look:

/* YYPGOTO[NTERM-NUM].  */
static const yytype_int8 yypgoto[] =
{
-5, 5, -1, 2, -5
};

That's our ND table that we discussed in the compressing parsing tables section. It is indexed by non-terminal symbol number; One entry each for $accept, L, E, P, M.

Let us say the current state on top of the stack after reduction by rule #5 (P → a) is state 10 (see example parse in the appendix). Now the GOTO transition for state 10 on P is really state 13 (according to our traditional tables), which is not the yydefgoto entry for P.

The parser adds yypgoto[P] i.e yypgoto[3] to the current exposed state number (state 10). The result is 12 (since yypgoto[3]=2). Now yytable[12] happens to be 13, so this is the new state to be pushed on to the stack.

How does the parser know that it has to pick the state value from yytable (via yypgoto) and not from yydefgoto? For this, there is a guard table called yycheck that will indicate this fact. It is only after checking with this table that a proper pick is made. We will see this operation in the section describing the parsing routine yyparse().

yypact

This table specifies the portion of yytable that describes what to do in state S. It is indexed by the symbol number of the token symbols. This is like the directory table D that was described in the previous section on compressing parsing tables.

#define YYPACT_NINF -5
static const yytype_int8 yypact[] =
{
-4, -5, -4, 0, 1, -5, 3, -3, -5, -4,
-4, -5, 1, -5
};

This is the first table consulted by the parsing loop. If yypact[cur-state] = YYPACT_NINF, its time for a default reduction using yydefact. This means that the state has only reductions as in state 1 (r5); otherwise the entry in yypact[cur-state] is to be added to the symbol number of the current look-ahead token and the resulting number is used as an index into yytable to get the next action (see comments in yytable code).An example:

Suppose we are in state zero and the look-ahead token is 'a'(symbol number 5). Now yypact[0] is -4, so the index into yytable is 5-4 = 1. yytable[1] is 1 which means "shift this symbol and go to state 1". See yycheck below for some more checking information. If the yytable entry specifies a negative value say -3, it means that we should be reducing the stack with rule #3.

yycheck

This is like a guard table. This table is used for various checks. Again this table is another mixed bag - of symbol numbers and state numbers. There is a very good explanation for this table inside Bison source code (src/tables.h). So I will just quote that here:

   YYCHECK = a vector indexed in parallel with YYTABLE.  It indicates,
in a roundabout way, the bounds of the portion you are trying to
examine.

Suppose that the portion of YYTABLE starts at index P and the index
to be examined within the portion is I. Then if YYCHECK[P+I] != I,
I is outside the bounds of what is actually allocated, and the
default (from YYDEFACT or YYDEFGOTO) should be used. Otherwise,
YYTABLE[P+I] should be used.

Here is the table listing:

static const yytype_int8 yycheck[] =
{
0, 5, 6, 3, 7, 4, 3, 2, 9, -1,
-1, -1, 10
};

Examples of usage:

Suppose we are in state 0 and the look-ahead token in 'a'. yypact[0] = -4 and this, added to symbol number of 'a' (5) gives the index in yytable as we have seen before. Before accessing yytable[1] and proceeding to shift the token, the parser must check that yycheck[1] has the symbol number for the current token. If the test fails, it is time for a default reduction! But in our case, yycheck[1] contains 5 which really is the symbol number 'a', so we can safely consult yytable for what to do next.

yycheck also helps with special case reductions where the GOTO on a non-terminal for the current state is not the most common state (specified by yydefgoto); Consider the same example given in the yypgoto section: we were in state #10 and the reduction rule was rule #5 (P → a). We added yypgoto[P]=2 to current state number (10) to get 12. Before consulting yytable the parser checks with yycheck[12]. If this value is 10, then we know that state# 10 is a special case, otherwise, we use yydefgoto to decide the transition.

Other helper tables

There are other tables that Bison will output to help with printing debug information and parsing error recovery and verbose output. Here is a list with brief descriptions.

yyrhs - A -1 separated list of RHS symbol numbers of all rules. yyrhs[n] = first symbol on the RHS of rule #n.

yyprhs[n] - Index in yyrhs of the first RHS symbol of rule n.

yyrline[n] - Line # in the .y grammar source file where rule n is defined.

yytname[n] - A string specifying the symbol for symbol number n. yytoknum[n] - Token number of token n.

Summary of tables

Here is a one line summary of the main tables:

yytranslate: maps token numbers returned by yylex() to Bison's internal symbol numbers.
yydefact: lists default reduction rules for each state. 0 represents an error.
yydefgoto: lists default GOTOs for each non-terminal symbol. It is only used after checking with yypgoto.
yyr1: symbol number of lhs of each rule. Used at the time of a reduction to find the next state.
yyr2: length of rhs of each rule. Used at the time of reduction to pop the stack.
yytable: a highly compressed representation of the actions in each state. Negative entries represent reductions. There is a negative infinity to detect errors.
yypgoto: accounts for non-default GOTOs for all non-terminal symbols.
yypact: directory into yytable indexed by state number. The displacements in yytable are indexed by symbol number.
yycheck: guard table used to check legal bounds within portions of yytable.

Bison generated parsing routine - yyparse()

Lets get down to the operation of yyparse - the LR parsing routine based on the tables just described. To demystify the routine, I have copied here snippets of code from the generated C parser file and altered them slightly. The most important change here is that I have cut out error handling to make the basic operation very clear. Many checks and macro calls have been removed to expose the bare parsing algorithm. All explanation of the code is found in comments.

/*These are global variables*/

/* The look-ahead symbol. */
int yychar;

/* The semantic value of the look-ahead symbol. */
YYSTYPE yylval;

int
yyparse()
{
int yystate; /* current state */
int yyn; /* This is an all purpose variable! One time it may represent a state, the next time, it might be a rule! */

int yyresult; /* Result of parse to be returned to the called */

int yytoken=0; /* current token */

/* The state stack: This parser does not shift symbols on to the stack. Only a stack of states is maintained. */

int yyssa[YYINITDEPTH]; /*YYINITDEPTH is 200 */
int *yyss = yyssa /* Bottom of state stack */
int *yyssp; /* Top of state stack */

/* The semantic value stack: This stack grows parallel to the state stack. At each reduction, semantic values are popped
* off this stack and the semantic action is executed
*/
YYSTYPE yyvsa[YYINITDEPTH];
YYSTYPE *yyvs = yyvsa; /* Bottom of semantic stack */
YYSTYPE *yyvsp; /* Top of semantic stack */

/* POP the state and semantic stacks by N symbols - useful for reduce actions */
#define YYPOPSTACK(N) (yyvsp -= (N), yyssp -= (N))


int yylen = 0; /* This variable is used in reduce actions to store the length of RHS of a rule; */

/* Ok done declaring variables. Set the ball rolling */

yystate = 0; /* Initial state */
yychar = YYEMPTY /* YYEMPTY is -2 */

yyssp = yyss; /* Top = bottom for state stack */
yyvsp = yyvs; /* Same for semantic stack */

goto yysetstate; /* Well, gotos are used for extracting maximum performance. */


/* Each label can be thought of as a function */

yynewstate: /* Push a new state on the stack */

yyssp ++; /*Just increment the stack top; actual 'pushing' will happen in yysetstate */


yysetstate:

*yyssp = yystate; /* Ok pushed state on state stack top */

goto yybackup; /* This is where you will find some action */


yybackup: /* The main parsing code starts here */

yyn = yypact[yystate]; /* Refer to what yypact is saying about the current state */

if ( yyn == YYPACT_NINF) /* If negative infinity its time for a default reduction */

goto yydefault; /* This label implements default reductions; see below */


/* Check if we have a look-ahead token ready. This is LALR(1) parsing */

if (yychar == YYEMPTY)

yychar = YYLEX; /* Macro YYLEX is defined as yylex() */

if (yychar <= YYEOF) /* YYEOF is 0 - the token returned by lexer at end of input */

yychar = yytoken = YYEOF; /* set all to EOF */

else

yytoken = yytranslate[yychar]; /* Translate the lexer token into internal symbol number */


/* Now we have a look-ahead token. Let the party begin ! */


yyn = yyn + yytoken; /* This is yypact[yystate] + yytoken */


/* Observe this check carefully. We are checking that yyn is within the bounds of yytable
* and also if yycheck contains the current token number.
*/
if ( yyn < 0 || YYLAST < yyn || yycheck[yyn] != yytoken ) /* YYLAST is the highest index in yytable */

goto yydefault; /* Its time for a default reduction */

/* Ok, yyn is within bounds of yytable */

yyn = yytable[yyn]; /* This is yytable[ yypact[yystate] + yytoken ] */

if (yyn <= 0) /* If yytable happens to contain a -ve value, its not a shift - its a reduce */
{
if (yyn == 0 || yyn == YYTABLE_NINF) /* But check for out of bounds condition*/
goto yyerrlab; /* Label to handle errors */

/* Other wise reduce with rule # -yyn */

yyn = -yyn;
goto yyreduce; /* Label to implement reductions */
}

/* Last check: See if we reached final state! */
if (yyn == YYFINAL) /* YYFINAL is 8 in our case */
YYACCEPT; /* macro deined as 'goto acceptlab - a label to finish up */

/* That completes all checks; If we reached here, there is no other option but to shift */

yystate = yyn; /* Now, yyn (= yytable[ yypact[yystate] + yytoken ]) is a state that has to be pushed */

*++yyvsp = yylval; /* Push the semantic value of the symbol on the semantic stack */

goto yynewstate; /* This will increment state stack top and the following yysetstate that will do the pushing */


yydefault: /* A label to implement default reductions */

yyn = yydefact[yystate]; /* Get the default reduction rule for this state */

if ( yyn == 0 ) /* This state has no default reduction. Something is wrong */

goto yyerrlab;

goto yyreduce; /* Ok, got the default reduction rule # in yyn; go ahead and reduce the stack */



yyreduce: /* A lablel that implements reductions on stack. */

/* By the time we are here, yyn contains the rule# to use for reducing the stack. */

/* Steps for reduction:
* 1. Find the length of RHS of rule #yyn
* 2. Execute any semantic actions by taking the values from the semantic stack
* 3. POP 'length' symbols from the state stack and 'length' values from semantic stack
* 4. Find the LHS of rule #yyn
* 5. Find the GOTO of state currently on top of stack on LHS symbol
* 6. Push that state on top of stack
*
*/

yylen = yyr2[yyn]; /* Get length of RHS */

/* Default semantic action - $$=$1 */
yyval = yyvsp[1-yylen];

/* Execute semantic actions */
switch ( yyn ) /* Each rule has its own semantic action */
{

default: break; /* We didn't have any semantic actions in the grammar.*/

}

YYPOPSTACK (yylen); /* This will pop both state and semantic stacks. See definition of this macro above */

yylen = 0; /* re-initialize yylen */

*++yyvsp = yyval; /* Push the result of semantic evaluation on top of semantic stack */


/* Now shift the result of reduction (steps 4 - 6) */

yyn = yyr1[yyn]; /* Reuse yyn at every opportunity. For now, yyn is the LHS symbol (number) of the rule */

/* First check for anomalous GOTOs, otherwise use Default GOTO (YYDEFGOTO)
*
* Observe that if we subtract no. of terminals (YYNTOKENS) from symbol number of a nonterminal, we get
* an index into yypgoto or yydefgoto for that non-terminal.
*/

yystate = yypgoto[yyn - YYNTOKENS] + *yyssp;

/* A couple of checks are needed before we know this is not a default GOTO
* 1. yystate must be within bounds of yytable. ( 0 to YYLAST )
* 2. yycheck must contain the state currently on top of the stack
*/
if ( 0 <= yystate && yystate <= YYLAST && yycheck[yystate] = *yyssp)

yystate = yytable[yystate]; /* Take the GOTO from yytable */

else

yystate = yydefgoto[yyn - YYNTOKENS]; /* Otherwise use the default GOTO */

goto yynewstate; /* Simply push the newly found state on top of stack and continue */

} /* End of yyparse() */

That completes a very basic and simplified parsing routine based on the tables produced by Bison. The actual generated parser does other things like checking the stack for overflows, relocating the stacks when necessary and a lot of error checking. If you use error productions, the parser has to get back to a sane state where the error token can be shifted next. Also, it has to print appropriate verbose output and be able to call yyerror(); There is also cleanup code for symbols that are popped off the semantic stack to avoid memory leaks. I have ignored all this code here to focus on the outline of the parsing routine yyparse().

An example parse

Bison has a facility to enable parse tracing. If you set yydebug=1 before calling yyparse(), the parser will produce a trace of shifts and reductions as the input string is being processed. I produced such a trace for a sample input string in the language generated by our example grammar. Following this trace with the parsing routine and the tables can help in appreciating the table scheme and the operation of yyparse() function.

Sample string: a,a;a,a$
$ is my EOF character. Here is the trace listing.

Starting parse
Entering state 0
Reading a token: Next token is token 'a' ()
Shifting token 'a' ()
Entering state 1
Reducing stack by rule 5 (line 26):
$1 = token 'a' ()
-> $$ = nterm P ()
Stack now 0
Entering state 5
Reducing stack by rule 4 (line 24):
$1 = nterm P ()
-> $$ = nterm E ()
Stack now 0
Entering state 4
Reading a token: Next token is token ',' ()
Shifting token ',' ()
Entering state 10
Reading a token: Next token is token 'a' ()
Shifting token 'a' ()
Entering state 1
Reducing stack by rule 5 (line 26):
$1 = token 'a' ()
-> $$ = nterm P ()
Stack now 0 4 10
Entering state 13
Reducing stack by rule 3 (line 23):
$1 = nterm E ()
$2 = token ',' ()
$3 = nterm P ()
-> $$ = nterm E ()
Stack now 0
Entering state 4
Reading a token: Next token is token ';' ()
Reducing stack by rule 2 (line 21):
$1 = nterm E ()
-> $$ = nterm L ()
Stack now 0
Entering state 3
Next token is token ';' ()
Shifting token ';' ()
Entering state 9
Reading a token: Next token is token 'a' ()
Shifting token 'a' ()
Entering state 1
Reducing stack by rule 5 (line 26):
$1 = token 'a' ()
-> $$ = nterm P ()
Stack now 0 3 9
Entering state 5
Reducing stack by rule 4 (line 24):
$1 = nterm P ()
-> $$ = nterm E ()
Stack now 0 3 9
Entering state 12
Reading a token: Next token is token ',' ()
Shifting token ',' ()
Entering state 10
Reading a token: Next token is token 'a' ()
Shifting token 'a' ()
Entering state 1
Reducing stack by rule 5 (line 26):
$1 = token 'a' ()
-> $$ = nterm P ()
Stack now 0 3 9 12 10
Entering state 13
Reducing stack by rule 3 (line 23):
$1 = nterm E ()
$2 = token ',' ().
$3 = nterm P ()
-> $$ = nterm E ()
Stack now 0 3 9
Entering state 12
Reading a token: Now at end of input.
Reducing stack by rule 1 (line 20):
$1 = nterm L ()
$2 = token ';' ()
$3 = nterm E ()
-> $$ = nterm L ()
Stack now 0
Entering state 3
Now at end of input.
Stack now 0 3
Cleanup: popping nterm L ()

References

[1] Nigel P. Chapman "LR Parsing Theory and practice". Cambridge University Press 1987. This is an excellent book on the theory and implementation of LR parsers. Unfortunately this book is out of print now. The example grammar used through out this article is taken from this book.

[2] Aho, Sethi, Ullman "Compilers - Principles Techniques and Tools". Pearson Education. The "red Dragon book" is a classic. It is the best book I have read on compiler design.

[3] Robert Endre Tarjan and Andrew Chi-Chih Yao in their paper "Storing a Sparse Table", Communications of the ACM, Volume 22, issue 11, pages 606-611. Available here.

[4] GNU Bison source code. http://ftp.gnu.org/pub/gnu/bison/

[5] GNU Bison user manual. http://www.gnu.org/software/bison/manual

Copying this document

Copyright ?2006 Satya Kiran Popuri.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled ``GNU Free Documentation License''.
  评论这张
 
阅读(880)| 评论(0)
推荐 转载

历史上的今天

在LOFTER的更多文章

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2017