Chapter 9: yacc crash course

Outline

Parser and scanner

How to write parsers for programming languages has been an active area of research for a long time, and a fairly firm, established approach exists. If we limit ourselves to a grammar that is not too strange (or ambiguous), we can solve this problem by following that approach.

The first part consists in splitting the string into a list of words (or tokens). The program that does this is called a scanner or lexer. The term “lexical analyzer” is also used, but it is too long to say, so we’ll use the name scanner.

When it comes to scanners, common sense first says “words are generally separated by spaces”. And in practice, that’s how it was done in most programming languages, because it’s the easiest way.

There can also be exceptions. For example, in old Fortran, white space had no meaning. This means a white space did not end a word, and you could put spaces in the name of a variable. However, that made parsing very complicated, so the compiler vendors, one by one, started ignoring that part of the standard. Finally Fortran 90 followed this trend and made the significance of white space part of the standard.

By the way, it seems the reason white space had no meaning in Fortran 77 was that when writing programs on punch cards it was easy to make errors in the number of spaces.

List of symbols

I said that the scanner spits out a list of words (tokens), but, to be exact, what the scanner creates is a list of “symbols”, not words.

What are symbols? Let’s take numbers as an example. In a programming language, 1, 2, 3, 99 are all “numbers”. They can all be handled the same way by the grammar. Where we can write 1, we can also write 2 or 3. That’s why the parser does not need to handle them in different ways. For numbers, “number” is enough.

“number”, “identifier” and the like can be grouped together as “symbols”. But be careful not to mix them up with Ruby’s Symbol class.

The scanner first splits the string into words and determines what these symbols are. For example, NUMBER or DIGIT for numbers, IDENTIFIER for names like “name”, IF for the reserved word if. These symbols are then given to the next phase.
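
For instance, here is a sketch of what the scanner’s output might look like (the symbol names, such as EQ, are hypothetical and chosen only for illustration). For a line of source code like if x == 0 then foo end, the scanner could hand the parser this sequence of symbols:

IF IDENTIFIER EQ NUMBER THEN IDENTIFIER END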

Parser generator

The list of words and symbols spit out by the scanner is going to be used to form a tree. This tree is called a syntax tree.

The name “parser” is also sometimes used to include both the scanner and the creation of the syntax tree. However, we will use the narrow sense of “parser”, the creation of the syntax tree. How does this parser make a tree from the list of symbols? In other words, on what should we focus to find the tree corresponding to a piece of code?

The first way is to focus on the meaning of the words. For example, let’s suppose we find the word var. If the definition of the local variable var has been found before this, we’ll understand it’s the reading of a local variable.

Another way is to only focus on what we see. For example, if after an identifier comes a ‘=’, we’ll understand it’s an assignment. If the reserved word if appears, we’ll understand it’s the start of an if statement.

The latter method, focusing only on what we see, is the current trend. In other words, the language must be designed so that it can be analyzed just by looking at the list of symbols. This choice was made because this way is simpler, can be more easily generalized and can therefore be automated using tools. These tools are called parser generators.

The most used parser generator under UNIX is yacc. Like many others, ruby’s parser is written using yacc. The input file for this tool is parser.y. That’s why to be able to read ruby’s parser, we need to understand yacc to some extent. (Note: Starting from 1.9, ruby requires bison instead of yacc. However, bison is mainly yacc with additional functionality, so this does not diminish the interest of this chapter.)

This chapter will be a simple presentation of yacc to be able to understand parse.y, and therefore we will limit ourselves to what’s needed to read parse.y. If you want to know more about parsers and parser generators, I recommend you a book I wrote called “Rubyを256倍使 うための本 無道編” (The book to use 256 times more of Ruby - Unreasonable book). I do not recommend it because I wrote it, but because in this field it’s the easiest book to understand. And besides it’s cheap so it won’t make me rich.

Nevertheless, if you would like a book from someone else (or can’t read Japanese), I recommend O’Reilly’s “lex & yacc programming” by John R. Levine, Tony Mason and Doug Brown. And if you are still not satisfied, you can also read “Compilers” (also known as the “dragon book” because of the dragon on its cover) by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman.

Grammar

Grammar file

The input file for yacc is called the “grammar file”, as it’s the file where the grammar is written. The convention is to name this grammar file *.y. It is given to yacc, which generates C source code. This file can then be compiled as usual (figure 1 shows the full process).

File dependencies
Figure 1: File dependencies
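
Concretely, the process in figure 1 amounts to something like this on the command line (a sketch; the compiler command cc and the file name parse.y are only examples):

% yacc parse.y
% cc -c y.tab.c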

The output file name is always y.tab.c and can’t be changed. Recent versions of yacc usually allow changing it on the command line, but for compatibility it was safer to keep y.tab.c. By the way, it seems the tab of y.tab.c comes from table, as lots of huge tables are defined in it. We should now have a look at the file.

The grammar file’s content has the following form:

▼ General form of the grammar file
%{
Header
%}
%union ....
%token ....
%type ....

%%
Rules part
%%
User defined part

yacc’s input file is first divided into 3 parts by %%. The first part, called the definition part, contains various definitions and setups. Between %{ and %} we can write anything we want in C, like for example necessary macros. After that, the instructions starting with % are special yacc instructions. We’ll explain each one the first time we use it.

The middle part of the file is called the rules part, and is the most essential part for yacc. It’s where the grammar we want to parse is written. We’ll explain it in detail in the next section.

The last part of the file, the user defined part, can be used freely by the user. yacc just copies this part verbatim in the output file. It’s used for example to put auxiliary routines needed by the parser.

What does yacc do?

What yacc takes care of is mainly this rules part in the middle. yacc takes the grammar written there and uses it to make a function called yyparse(). It’s the parser, in the narrow sense of the word.

Since this is the narrow sense, a scanner is also needed. However, yacc won’t take care of it; it must be provided by the user. The function for the scanner is named yylex().
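
So the division of labor looks roughly like the following minimal sketch: yacc generates yyparse(), while main() and yylex() are written by the user, and yyparse() calls yylex() internally to get the symbols.

int main(void)
{
    return yyparse();   /* generated by yacc; calls the user's yylex() internally */
}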

Even though yacc creates yyparse(), it only takes care of its core part. The “actions” we’ll mention later are outside its scope. You might think the part done by yacc is too small, but that’s not the case. It’s because this “core part” is so important that yacc has survived to this day, even though we keep complaining about it.

But what on earth is this core part? That’s what we’re going to see.

BNF

When we want to write a parser in C, its code will be “cut the string this way, make this an if statement…” When using parser generators, we say the opposite, that is “I would like to parse this grammar.” Doing this creates for us a parser to handle the grammar. This means telling the specification gives us the implementation. That’s the convenient point of yacc.

But how can we tell the specification? With yacc, the method of description used is the BNF (Backus-Naur Form). Let’s look at a very simple example.

if_stmt: IF expr THEN stmt END

Let’s see separately what’s at the left and at the right of the “:”. The part on the left side, if_stmt, is equal to the right part… is what I mean here. In other words, I’m saying that:

if_stmt and IF expr THEN stmt END are equivalent.

Here, if_stmt, IF, expr... are all “symbols”. expr is the abbreviation of expression, stmt of statement. This must surely be the definition of the if statement.

One definition is called a rule. The part at the left of “:” is called the left side and the right part called the right side. This is quite easy to remember.

But something is missing. We do not want an if statement without being able to use else. And even if we could write else, having to always write the else even when it’s useless would be cumbersome. In this case we could do the following:

if_stmt: IF expr THEN stmt END
       | IF expr THEN stmt ELSE stmt END

“|” means “or”.

if_stmt is either “IF expr THEN stmt END” or “IF expr THEN stmt ELSE stmt END”.

That’s it.

Here I would like you to pay attention to the split done with |. With just this, one more rule is added. In fact, punctuating with | is just a shorter way to repeat the left side. The previous example has exactly the same meaning as the following:

if_stmt: IF expr THEN stmt END
if_stmt: IF expr THEN stmt ELSE stmt END

This means two rules are defined in the example.

This is not enough to complete the definition of the if statement. That’s because the symbols expr and stmt are not sent by the scanner; rules must be defined for them too. To be closer to Ruby, let’s boldly add some rules.

stmt   : if_stmt
       | IDENTIFIER '=' expr   /* assignment */
       | expr

if_stmt: IF expr THEN stmt END
       | IF expr THEN stmt ELSE stmt END

expr   : IDENTIFIER       /* reading a variable */
       | NUMBER           /* integer constant */
       | funcall          /* FUNction CALL */

funcall: IDENTIFIER '(' args ')'

args   : expr             /* only one parameter */

I used two new elements. First, comments of the same form as in C, and a character expressed using '='. This '=' is also of course a symbol. Symbols like “=” are different from numbers as there is only one variety of them. That’s why for such symbols we can also write '=' directly. It would be great to be able to use strings as well, for example for reserved words, but due to limitations of the C language this cannot be done.

We add rules like this, until we finish writing the whole grammar. With yacc, the left side of the first written rule is “the whole grammar we want to express”. So in this example, stmt expresses the whole program.

It was a little too abstract. Let’s explain this a little more concretely. By “stmt expresses the whole program”, I mean that stmt, and all the sequences of symbols that the rules declare equivalent to it, are recognized as programs. For example, stmt is equivalent to stmt. Of course. Then expr is equivalent to stmt. That’s expressed like this in the rules. Then, NUMBER is equivalent to stmt. That’s because NUMBER is expr and expr is stmt.

We can also say that more complicated things are equivalent.

              stmt
               ↓
             if_stmt
               ↓
      IF expr THEN stmt END
          ↓        ↓
IF IDENTIFIER THEN expr END
                    ↓
IF IDENTIFIER THEN NUMBER END

In the sequence developed here, all the symbols at the end are ones sent by the scanner. That means this sequence of symbols is a correct program. Or, putting it the other way around, if this sequence of symbols is sent by the scanner, the parser can understand it by developing it in the reverse order.

IF IDENTIFIER THEN NUMBER END
                    ↓
IF IDENTIFIER THEN expr END
          ↓        ↓
      IF expr THEN stmt END
               ↓
             if_stmt
               ↓
              stmt

And stmt is a symbol expressing the whole program. That’s why this sequence of symbols is a correct program for the parser. When that is the case, the parsing routine yyparse() ends returning 0.

By the way, the technical term expressing that the parser succeeded is that it “accepted” the input. The parser is like a government office: if you do not fill in the documents exactly as asked, it will refuse them. The accepted sequences of symbols are the ones for which all the boxes were filled correctly. Parser and government office are strangely similar, for instance in the fact that they care about details in the specification and that they use complicated terms.

Terminal symbols and nonterminal symbols

Well, in the confusion of the moment I used without explaining it the expression “symbols coming from the scanner”. So let’s explain this. I use one word “symbol” but there are two types.

The first type of symbols are the ones sent by the scanner. They are, for example, IF, THEN, END, '=', ... They are called terminal symbols, because, as in the quick expansion we did before, they end up aligned at the end. In this chapter terminal symbols are always written in capital letters. However, symbols like '=' between quotes are special. Symbols like this are all terminal symbols, without exception.

The other type of symbols are the ones that never come from the scanner, for example if_stmt, expr or stmt. They are called nonterminal symbols. As they don’t come from the scanner, they only exist in the parser. Nonterminal symbols also always appear at one moment or the other as the left side of a rule. In this chapter, nonterminal symbols are always written in lower case letters.

How to test

I’m now going to tell you the way to process the grammar file with yacc.

%token A B C D E
%%
list: A B C
    | de

de  : D E

First, put all terminal symbols used after %token. However, symbols written between quotes (like '=') do not have to be declared. Then, put %% to mark a change of section and write the grammar. That’s all.

Let’s now process this.

% yacc first.y
% ls
first.y  y.tab.c
%

Like most Unix tools, “silence means success”.

There are also implementations of yacc that need semicolons at the end of (groups of) rules. In that case we need to do the following:

%token A B C D E
%%
list: A B C
    | de
    ;

de  : D E
    ;

I hate these semicolons so in this book I’ll never use them.

Void rules

Let’s now look a little more at some of the established ways of grammar description. I’ll first introduce void rules.

void:

There’s nothing on the right side; this rule is “void”. For example, the two following definitions of target mean exactly the same thing.

target: A B C

target: A void B void C
void  :

What is the use of such a thing? It’s very useful. For example in the following case.

if_stmt : IF expr THEN stmts opt_else END

opt_else:
        | ELSE stmts

Using void rules, we can express cleverly the fact that “the else section may be omitted”. Compared with the rule written earlier using two definitions, this way is shorter and we do not have to disperse the handling in multiple places.
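
Another place where a void rule comes in handy is making a function call’s argument list optional. A sketch, reusing the earlier funcall and args rules (the name opt_args is made up here):

funcall : IDENTIFIER '(' opt_args ')'

opt_args:                  /* no arguments at all */
        | args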

Recursive definitions

The next example is a little harder to understand.

list: ITEM         /* rule 1 */
    | list ITEM    /* rule 2 */

This expresses a list of one or more items, in other words any of the following lists of symbols:

ITEM
ITEM ITEM
ITEM ITEM ITEM
ITEM ITEM ITEM ITEM
      :

Do you understand why? First, according to rule 1 list can be read ITEM. If you merge this with rule 2, list can be ITEM ITEM.

list: list ITEM
    = ITEM ITEM

We now understand that the sequence of symbols ITEM ITEM is equivalent to list. By applying rule 2 again to list, we can say that 3 ITEMs are also equivalent to list. By continuing this process, the list can grow to any size.

Now the next example. The following expresses lists of zero or more ITEMs.

list:
    | list ITEM

First, the first rule means “list is equivalent to (void)”. By void I mean the list with zero ITEMs. Then, by looking at rule 2 we can say that “list ITEM” is equivalent to one ITEM. That’s because list is equivalent to void.

list: list   ITEM
    = (void) ITEM
    =        ITEM

By applying the same replacement operations multiple times, we can understand that list expresses a list of zero or more items.

With this knowledge, “lists of 2 or more ITEMs” or “lists of 3 or more ITEMs” are easy, and we can even create “lists of an even number of elements”.

list:
    | list ITEM ITEM
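
Recursion combined with the character symbols from before also gives the classic comma-separated list. A sketch, extending the earlier args rule so that funcall can take any number of arguments:

args    : expr             /* one argument */
        | args ',' expr    /* a list of arguments, a comma, and one more expr */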

Construction of values

This abstract talk has lasted long enough, so in this section I’d like to move on to something more concrete.

Shift and reduce

For the moment we have only seen how to write grammars, but what we want is to be able to build the full syntax tree. However, I’m afraid to say that, as can be expected, there’s no way to build the syntax tree by just expressing the rules. That’s why this time we’ll go a little farther and I’ll explain how to build the syntax tree.

We’ll first see what the parser does during the execution. We’ll use the following simple grammar as an example.

%token A B C
%%
program: A B C

In the parser there is a stack called the semantic stack. The parser pushes on it all the symbols coming from the scanner. This move is called “shifting the symbols”.

[ A B ] ← C   shift

And when the end of the stack matches the right side of a rule, that right side is replaced on the stack by the rule’s left side.

[ A B C ]
    ↓         reduction
[ program ]

This move is called “reducing A B C to program”. The term sounds rather grand, but in short it is like collecting the white, green and red dragon tiles in mahjong and completing a daisangen... or maybe not.

And since program expresses the whole program, having just a program on the stack probably means the whole program has been found. Therefore, if the input ends exactly here, it is accepted.

Let’s try it with a slightly more complicated grammar.

%token IF E S THEN END
%%
program : if

if      : IF expr THEN stmts END

expr    : E

stmts   : S
        | stmts S

The input from the scanner is the following.

IF  E  THEN  S  S  S  END

The transitions of the semantic stack over time are shown below.

Stack                            Move
                                 initially empty
IF                               shift IF
IF E                             shift E
IF expr                          reduce E to expr
IF expr THEN                     shift THEN
IF expr THEN S                   shift S
IF expr THEN stmts               reduce S to stmts
IF expr THEN stmts S             shift S
IF expr THEN stmts               reduce stmts S to stmts
IF expr THEN stmts S             shift S
IF expr THEN stmts               reduce stmts S to stmts
IF expr THEN stmts END           shift END
if                               reduce IF expr THEN stmts END to if
program                          reduce if to program
                                 accept

Lastly, just one note: a reduction does not necessarily decrease the number of symbols on the stack. If there are void rules, a symbol can be generated out of “nothing”.

Action

Well, here comes the important part. Shifting and reducing on the semantic stack alone would not mean very much, whatever we do with it. Since our final goal was to build a syntax tree, we are in trouble unless it leads to that. What does yacc have up its sleeve? “We’ll let you hook into the moment the parser performs a reduction” is yacc’s answer. That hook is called the parser’s action. An action is written at the end of a rule, like this:

program: A B C { /* Here action */ }

The part enclosed in { and } is the action. Written like this, the action is executed the moment A B C is reduced to program. What you do in an action is up to you: you can write pretty much any C code.

Value of symbols

From here it gets even more important: every symbol has “a value”. Both terminal and nonterminal symbols. As terminal symbols come from the scanner, their values also come from the scanner. For example 1, or 9, or maybe 108 for the symbol NUMBER. For the symbol IDENTIFIER it might be “attr”, “name” or “sym”. Anything is fine. The value is pushed on the semantic stack together with its symbol. The following figure shows the stack just after shifting S, each symbol with its value.

IF       expr      THEN      stmts     S
Value    Value     Value     Value     Value

As we saw a moment ago, according to the rules, stmts S can now be reduced to stmts. If an action is written on that rule, it gets executed, and at that moment the values of the symbols corresponding to the right side are passed to the action.

IF    expr   THEN   stmts  S      /*Stack  */
Value    Value     Value     Value     Value
                    ↓     ↓
            stmts:  stmts  S      /* Rule */
                    ↓     ↓
                  { $1  +  $2; }  /* Action */

In this way, with $1, $2, $3, ... an action can take the values of the symbols corresponding to the rule’s right side. yacc rewrites $1, $2 and so on into expressions that point into the stack. The values could really be of various C types, but for the time being let’s suppose they are int. Conversely, the value of the left side symbol is represented by $$: assigning to $$ sets the value that will be left on the stack after the reduction.

IF    expr   THEN   stmts  S      /* the stack just before the reduction */
Value    Value     Value     Value     Value
                    ↓     ↓
            stmts:  stmts  S      /* the rule whose right side matches the end of the stack */
              ↑    ↓     ↓
            { $$  = $1  +  $2; }  /* the action */

IF    expr   THEN   stmts         /* the stack just after the reduction */
Value    Value     Value     Value

One last bit of terminology: the value of a symbol is sometimes called its “semantic value”. The stack that holds them is therefore the semantic value stack, or, for short, the semantic stack.

yacc and types

Unfortunately, the story won’t be complete without talking about types, troublesome as that is. What exactly is the type of a symbol’s value? To give the conclusion first, it is the type YYSTYPE. This is surely the abbreviation of either YY Stack TYPE or Semantic value TYPE; either way it makes no difference. And YYSTYPE is of course a typedef of some other type: the union specified with the %union directive in the definition part.

But we have not written %union until now. Why didn’t that cause an error? Because yacc thoughtfully provides a default for us. The default in C is of course int. So YYSTYPE defaults to int.

int is fine for the examples in a yacc book or for a pocket calculator, but to build a syntax tree we want to use structs, pointers and various other things. So, for example, we use %union as follows:

%union {
    struct node {
        int type;
        struct node *left;
        struct node *right;
    } *node;
    int num;
    char *str;
}

As this is not meant for real use, the type and member names are arbitrary. Note that, differently from a normal C union, there is no semicolon after the closing brace of the %union block.

When you write this, in y.tab.c it becomes the following:

typedef union {
    struct node {
        int type;
        struct node *left;
        struct node *right;
    } *node;
    int num;
    char *str;
} YYSTYPE;

And the semantic stack then looks like this:

YYSTYPE yyvs[256];       /* the stack itself (yyvs = YY Value Stack) */
YYSTYPE *yyvsp = yyvs;   /* pointer to the current top of the stack */

You can probably guess what happens next. A reference to a symbol’s value inside an action ...

/* the action as written in the grammar file */
target: A B C { func($1, $2, $3); }

/* after conversion, what it looks like in y.tab.c */
{ func(yyvsp[-2], yyvsp[-1], yyvsp[0]); }

... is naturally converted like this.

In this case, because the default int was used, the stack can just be accessed directly. But if YYSTYPE is a union, we also have to specify the member, otherwise the value cannot be accessed. There are two ways to do this: tying a member to each symbol, or specifying the member each time.

Let’s start with the general way: specifying the member per symbol. For terminal symbols we use %token, for nonterminal symbols %type, as follows:

%token<num> A B C    /* the value of A, B and C is of type int (member num) */
%type<str> target    /* the value of target is of type char* (member str) */
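
For example, a small sketch of what this buys us in actions (the grammar itself is made up): once NUMBER and expr are declared with the num member, the $n and $$ below automatically refer to that member of the union.

%union { int num; }
%token<num> NUMBER
%type<num>  expr
%%
expr  : expr '+' NUMBER  { $$ = $1 + $3; }
      | NUMBER           { $$ = $1; }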

On the other hand, you can also specify the member each time, writing the member name inside $, a bit like a cast:

%union { char *str; }
%%
target: { $<str>$ = "the member is specified like a cast"; }

This method should be used as little as possible. Deciding the member per symbol is the basic approach.

Connection of parser and scanner

That covers everything about values inside the parser. All that remains is to explain the protocol for connecting with the scanner, and we will have covered all the essential points.

First, a recap: the scanner is the function yylex(). The (terminal) symbol itself is returned (as an int) as the return value of the function. Since yacc #defines constants with the same names as the symbols, for the symbol NUMBER you just have to write NUMBER. The value of the symbol is passed through the global variable yylval. This yylval is also of type YYSTYPE, exactly like in the parser. In other words, if it is defined with %union, it becomes a union. But this time the member is not chosen for us: we have to write the member name ourselves. A very simple example would look like this:

static int
yylex()
{
    yylval.str = next_token();
    return STRING;
}

The relationships up to this point are summarized in figure 2; please check them one by one. yylval, $$, $1, $2, ... — all the variables that form the interface are of type YYSTYPE.

Relationships between yacc-related variables and functions
Figure 2: Relationships between yacc-related variables and functions
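
Putting all the pieces together, here is a minimal, self-contained sketch of a complete grammar file. Everything in it — the crude scanner, the grammar, the member names — is made up for illustration; it merely sums one or more numbers separated by +.

%{
#include <stdio.h>
#include <ctype.h>

int yylex(void);
void yyerror(const char *msg);
%}

%union {
    int num;
}
%token<num> NUMBER
%type<num>  expr

%%
program : expr            { printf("result = %d\n", $1); }

expr    : expr '+' NUMBER { $$ = $1 + $3; }
        | NUMBER          { $$ = $1; }

%%
/* A very crude scanner: digits become NUMBER, '+' is returned as itself,
   anything else ends the input (returning 0 tells yyparse() "end of input"). */
int
yylex(void)
{
    int c = getchar();
    while (c == ' ') c = getchar();
    if (isdigit(c)) {
        int n = 0;
        while (isdigit(c)) { n = n * 10 + (c - '0'); c = getchar(); }
        ungetc(c, stdin);
        yylval.num = n;      /* the value goes through yylval */
        return NUMBER;       /* the symbol is the return value */
    }
    if (c == '+') return '+';
    return 0;
}

void
yyerror(const char *msg)
{
    fprintf(stderr, "%s\n", msg);
}

int
main(void)
{
    return yyparse();        /* 0 if the input was accepted */
}

Processing this with yacc and compiling y.tab.c as in figure 1 should give a small program that reads a line like 1 + 2 + 3 and prints result = 6.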

Embedded actions

I explained that actions are written at the end of a rule, but in fact they can also be written in the middle of a rule:

target: A B { puts("embedded action"); } C D

This is called an embedded action. An embedded action is simply syntactic sugar for the following:

target: A B dummy C D

dummy :     /* Empty rule */
        {
            puts("embedded action");
        }

From this you can also tell exactly when the embedded action is executed. And you can take its value too: in this example it would be $3.
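
A quick sketch of what that means in practice (with the default int YYSTYPE; the value 42 is arbitrary): the value the embedded action assigns to $$ can be read back later as $3, because the action occupies the third position of the right side.

target: A B { $$ = 42; } C D
        { printf("%d\n", $3); }    /* prints 42 */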

Practical topics

With this, you may think there is nothing to fear from yacc anymore.

If you thought so, that is rather naive. Why yacc is still to be feared — that is what comes next.

So far I wrote casually “when the right side of a rule matches the end of the stack”, but what happens when there is a rule like the following?

target  : A B C
        | A B C

When the sequence of symbols A B C comes, we can’t know which rule matched. Not even a human could understand such a thing, so yacc cannot either. When it finds this kind of strange grammar, yacc complains that a reduce/reduce conflict occurred: it means several rules can be reduced at the same time.

% yacc rrconf.y
conflicts:  1 reduce/reduce

Normally you would think this kind of thing only happens by accident, but what about the following example? The sequences of symbols described are exactly the same.

target  : abc
        | A bc

abc     : A B C

bc      :   B C

This one can happen relatively easily. Especially while kneading rules over and over, you end up writing this sort of thing without noticing.

There is also a similar pattern, like the following:

target  : abc
        | ab C

abc     : A B C

ab      : A B

When the sequence A B C comes, we don’t know whether to choose one abc or to combine ab with C. In such a case yacc grumbles that a shift/reduce conflict occurred: it means there is a rule that can be shifted and a rule that can be reduced at the same time.

% yacc srconf.y
conflicts:  1 shift/reduce

The famous example of a shift/reduce conflict is the “dangling else problem”. It occurs, for example, in the if statement of the C language. Let’s simplify the story and write it down:

stmt     : expr ';'
         | if

expr     : IDENTIFIER

if       : IF '(' expr ')' stmt
         | IF '(' expr ')' stmt  ELSE stmt

Here the expression is just an IDENTIFIER (a variable), and the body of the if is just one statement. Now, what happens if the following program is parsed with this grammar?

if (cond)
    if (cond)
        true_stmt;
    else
        false_stmt;

Written like this it looks obvious, but in fact it can also be interpreted as follows:

if (cond) {
    if (cond)
        true_stmt;
}
else {
    false_stmt;
}

In other words, the problem is whether the else attaches to the inner or the outer if.

Still, a shift/reduce conflict is a relatively harmless collision compared with a reduce/reduce conflict. Why? Because in most cases things go well if shift is chosen. Choosing shift is roughly synonymous with “connect the elements that are the closest”, which matches human intuition rather well. In fact, shifting also does the right thing for the dangling else. For that reason, when a shift/reduce conflict occurs, yacc follows this practice and chooses shift by default.

Look-ahead

As an experiment, let’s feed the following grammar to yacc.

%token A B C
%%
target  : A B C   /*Rule 1 */
        | A B     /*Rule 2*/

However you think about it, this should conflict. At the point where A B has been read, rule 1 wants to shift and rule 2 wants to reduce: in other words, this should be a shift/reduce conflict. However ...

% yacc conf.y
%

Strange — there is no conflict. Why?

In fact, the parser made with yacc can “look ahead” (pre-read) exactly one symbol. Before actually shifting or reducing, it can peek at the next symbol and decide what to do.

This is also taken into account when the parser is generated: if rules can be distinguished with one symbol of look-ahead, yacc does not make them conflict. In the rules above, for example, if a C comes after A B, only rule 1 is possible, so rule 1 is chosen (shift). If the input ends there, rule 2 is chosen (reduce).

Note that the word “look-ahead” has two meanings. One is the look-ahead done when processing *.y with yacc. The other is the look-ahead done when actually running the generated parser. Look-ahead at execution time is not very difficult, but the look-ahead done by yacc itself is very complicated. That’s because it has to decide the parser’s behavior by estimating, from the grammar rules alone, every possible input configuration that could occur at execution time.

Actually, since “every” is impossible, it handles “quite a lot” of them. How large a portion of all possible patterns a look-ahead algorithm can handle determines its strength. The conflict resolution algorithm that yacc uses when processing the grammar file is called LALR(1), and it is among the most powerful of the algorithms in practical use.

A lot has been said, but since in this book we only read rules and do not write them, there is no need to worry too much. What I wanted to introduce here is not the look-ahead used when processing grammars, but the look-ahead done at execution time.

Operator precedence

Since the abstract talk went on for a while, here is something more concrete. Let’s try defining rules for binary operators (infix operators) such as + and *. There are established practices for this too, so let’s just follow them obediently. Below, something like a calculator for the four basic arithmetic operations is defined:

expr    : expr '+' expr
        | expr '-' expr
        | expr '*' expr
        | expr '/' expr
        | primary

primary : NUMBER
        | '(' expr ')'

primary could be translated as “the smallest grammatical unit”. The point is that an expr wrapped in parentheses becomes a primary.

If we write this grammar in a file and compile it, this happens:

% yacc infix.y
16 shift/reduce conflicts

Lots of conflicts. Thinking about it for about five minutes should be enough to understand why: this rule is troublesome in a case like the following.

1 - 1 - 1

This can be interpreted in both of the following two ways:

(1 - 1) - 1
1 - (1 - 1)

Of course, the former is the natural one as an arithmetic expression. But what yacc does is, to the last, processing of appearance; no meaning enters into it at all. It gives no consideration whatsoever to the meaning the symbol - may carry. To have human intent correctly reflected, we have to indicate what we want, step by step.

So what should we do? We just have to write this in the definition part:

%left '+' '-'
%left '*' '/'

These directives specify the precedence and the associativity of the operators at the same time. Let’s explain them in order.

I think the word “precedence” comes up often when talking about the grammar of programming languages. Theoretically speaking it is complicated, so intuitively: it decides which operator the parentheses attach to in a case like the following.

1 + 2 * 3

If * has the higher precedence, it becomes this:

1 + (2 * 3)

If + has the higher precedence, it becomes this:

(1 + 2) * 3

In this way, by making some operators stronger and others weaker, we can resolve shift/reduce conflicts. That is operator precedence.
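
As a sketch (reusing the file name infix.y from before): adding the two %left lines from above to the definition part of the calculator grammar should make all sixteen shift/reduce conflicts disappear, because yacc now resolves each of them using the declared precedence.

%token NUMBER
%left '+' '-'
%left '*' '/'
%%
expr    : expr '+' expr
        | expr '-' expr
        | expr '*' expr
        | expr '/' expr
        | primary

primary : NUMBER
        | '(' expr ')'

Running yacc on it should now stay silent — and, as we saw, silence means success.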

But what should we do when we fall into the same situation with operators of the same precedence? For example like this:

1 - 2 - 3

This time both operators are -, so their precedences are exactly the same. In such a case, the ambiguity is resolved using the associativity. There are three kinds of associativity, left, right and nonassoc, and they are interpreted as follows:

Associativity                Interpretation
left (left-associative)     (1 - 2) - 3
right (right-associative)   1 - (2 - 3)
nonassoc (non-associative)  parse error

For arithmetic operators, left associativity is the most common. Right associativity is mainly used for = of assignment and for not of negation:

a = b = 1    # (a = (b = 1))
not not a    # (not (not a))

Typical examples of nonassoc are the comparison operators:

a == b == c   # parse error
a <= b <= c   # parse error

(However, in Python and some other languages, chained three-term comparisons are allowed, so this is not a universal rule.)

So, %left, %right and %nonassoc from before are used, as their names suggest, to express the associativity. The precedence is indicated by the order in which they are written: the lower the line, the higher the precedence. Operators on the same line have the same precedence.

%left  '+' '-'    /* left-associative, precedence 3 */
%left  '*' '/'    /* left-associative, precedence 2 */
%right '!'        /* right-associative, precedence 1 */

The original work is Copyright © 2002 - 2004 Minero AOKI.
Translated by Vincent ISAMBART
Translations and Additions by C.E. Thornton
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike2.5 License.