hub: minipeg

--- a/minipeg.1

+++ b/minipeg.1

@@ -1,43 +1,18 @@

-.\" Copyright (c) 2007,2016 by Ian Piumarta

-.\" All rights reserved.

-.\"

-.\" Permission is hereby granted, free of charge, to any person obtaining a

-.\" copy of this software and associated documentation files (the 'Software'),

-.\" to deal in the Software without restriction, including without limitation

-.\" the rights to use, copy, modify, merge, publish, distribute, and/or sell

-.\" copies of the Software, and to permit persons to whom the Software is

-.\" furnished to do so, provided that the above copyright notice(s) and this

-.\" permission notice appear in all copies of the Software.  Acknowledgement

-.\" of the use of this Software in supporting documentation would be

-.\" appreciated but is not required.

-.\"

-.\" THE SOFTWARE IS PROVIDED 'AS IS'.  USE ENTIRELY AT YOUR OWN RISK.

-.\"

-.\" Last edited: 2016-07-22 09:47:29 by piumarta on zora.local

-.\"

-.TH PEG 1 "September 2013" "Version 0.1"

+.TH MINIPEG 1

 .SH NAME

-peg, leg \- parser generators

+minipeg \- parser generator

 .SH SYNOPSIS

-.B peg

-.B [\-hvV \-ooutput]

+.B minipeg

+.B [\-hvVP \-ooutput]

 .I [filename ...]

-.sp 0

-.B leg

-.B [\-hvV \-ooutput]

-.I [filename ...]

 .SH DESCRIPTION

-.I peg

-and

-.I leg

-are tools for generating recursive\-descent parsers: programs that

+.I minipeg

+is a tool for generating recursive\-descent parsers: programs that

 perform pattern matching on text.  They process a Parsing Expression

 Grammar (PEG) [Ford 2004] to produce a program that recognises legal

 sentences of that grammar.

-.I peg

-processes PEGs written using the original syntax described by Ford;

-.I leg

-processes PEGs written using slightly different syntax and conventions

+.I minipeg

+processes PEGs written with syntax and conventions

 that are intended to make it an attractive replacement for parsers

 built with

 .IR lex (1)

@@ -47,20 +22,18 @@

 .I lex

and

 .IR yacc ,

-.I peg

-and

-.I leg

+.I minipeg

 support unlimited backtracking, provide ordered choice as a means for

 disambiguation, and can combine scanning (lexical analysis) and

 parsing (syntactic analysis) into a single activity.

.PP

-.I peg

+.I minipeg

 reads the specified

 .IR filename s,

 or standard input if no

 .IR filename s

 are given, for a grammar describing the parser to generate.

-.I peg

+.I minipeg

 then generates a C source file that defines a function

 .IR yyparse().

 This C source file can be included in, or compiled and then linked

@@ -81,9 +54,7 @@

 .IR yacc (1),

 for example.)

 .SH OPTIONS

-.I peg

-and

-.I leg

+.I minipeg

 provide the following options:

.TP

 .B \-h

@@ -102,95 +73,88 @@

.TP

 .B \-V

 writes version information to standard error then exits.

-.SH A SIMPLE EXAMPLE

-The following

-.I peg

-input specifies a grammar with a single rule (called 'start') that is

-satisfied when the input contains the string "username".

+.SH EXAMPLE: A CALCULATOR

+Here we show a simple desk calculator supporting the four common arithmetic

+operators and named variables.  The intermediate results of arithmetic

+evaluation will be accumulated on an implicit stack by returning them

+as semantic values from sub\-rules.

.nf

-    start <\- "username"

+    %{

+    #include <stdio.h>     /* printf() */

+    #include <stdlib.h>    /* atoi() */

+    int vars[26];

+    %}

+    Stmt    = \- e:Expr EOL                  { printf("%d\\n", e); }

+            | ( !EOL . )* EOL               { printf("error\\n"); }

+    Expr    = i:ID ASSIGN s:Sum             { $$ = vars[i] = s; }

+            | s:Sum                         { $$ = s; }

+    Sum     = l:Product

+                    ( PLUS  r:Product       { l += r; }

+                    | MINUS r:Product       { l \-= r; }

+                    )*                      { $$ = l; }

+    Product = l:Value

+                    ( TIMES  r:Value        { l *= r; }

+                    | DIVIDE r:Value        { l /= r; }

+                    )*                      { $$ = l; }

+    Value   = i:NUMBER                      { $$ = atoi(yytext); }

+            | i:ID !ASSIGN                  { $$ = vars[i]; }

+            | OPEN i:Expr CLOSE             { $$ = i; }

+    NUMBER  = < [0\-9]+ >    \-               { $$ = atoi(yytext); }

+    ID      = < [a\-z]  >    \-               { $$ = yytext[0] \- 'a'; }

+    ASSIGN  = '='           \-

+    PLUS    = '+'           \-

+    MINUS   = '\-'           \-

+    TIMES   = '*'           \-

+    DIVIDE  = '/'           \-

+    OPEN    = '('           \-

+    CLOSE   = ')'           \-

+    \-       = [ \\t]*

+    EOL     = '\\n' | '\\r\\n' | '\\r' | ';'

+    %%

+    int main()

+    {

+      while (yyparse())

+        ;

+      return 0;

+    }

.fi

-(The quotation marks are

-.I not

-part of the matched text; they serve to indicate a literal string to

-be matched.)  In other words,

-.IR  yyparse ()

-in the generated C source will return non\-zero only if the next eight

-characters read from the input spell the word "username".  If the

-input contains anything else,

-.IR yyparse ()

-returns zero and no input will have been consumed.  (Subsequent calls

-to

-.IR yyparse ()

-will also return zero, since the parser is effectively blocked looking

-for the string "username".)  To ensure progress we can add an

-alternative clause to the 'start' rule that will match any single

-character if "username" is not found.

-.nf

-    start <\- "username"

-           / .

-.fi

-.IR yyparse ()

-now always returns non\-zero (except at the very end of the input).  To

-do something useful we can add actions to the rules.  These actions

-are performed after a complete match is found (starting from the first

-rule) and are chosen according to the 'path' taken through the grammar

-to match the input.  (Linguists would call this path a 'phrase

-marker'.)

-.nf

-    start <\- "username"    { printf("%s\\n", getlogin()); }

-           / < . >         { putchar(yytext[0]); }

-.fi

-The first line instructs the parser to print the user's login name

-whenever it sees "username" in the input.  If that match fails, the

-second line tells the parser to echo the next character on the input

-the standard output.  Our parser is now performing useful work: it

-will copy the input to the output, replacing all occurrences of

-"username" with the user's account name.

.PP

-Note the angle brackets ('<' and '>') that were added to the second

-alternative.  These have no effect on the meaning of the rule, but

-serve to delimit the text made available to the following action in

-the variable

-.IR yytext .

-.PP

 If the above grammar is placed in the file

-.BR username.peg ,

+.BR calc.peg ,

 running the command

.nf

-    peg \-o username.c username.peg

+    $ minipeg \-o calc.c calc.peg

.fi

 will save the corresponding parser in the file

-.BR username.c .

-To create a complete program this parser could be included by a C

-program as follows.

+.BR calc.c .

+The program can then be compiled with a C compiler and run:

.nf

-    #include <stdio.h>      /* printf(), putchar() */

-    #include <unistd.h>     /* getlogin() */

-    #include "username.c"   /* yyparse() */

-    int main()

-    {

-      while (yyparse())     /* repeat until EOF */

-        ;

-      return 0;

-    }

+    $ cc \-o calc calc.c

+    $ ./calc

+    a=5

+    5

+    a+5

+    10

.fi

-.SH PEG GRAMMARS

+.SH MINIPEG GRAMMARS

 A grammar consists of a set of named rules.

.nf

-    name <\- pattern

+    name = pattern

.fi

The

@@ -253,6 +217,50 @@

 (These variable names are historical; see

 .IR lex (1).)

.TP

+.IB @{\ action\ }

+Actions prefixed with an 'at' symbol will be performed during parsing,

+at the time they are encountered while matching the input text with a

+rule.

+Because of back-tracking in the PEG parsing algorithm, actions

+prefixed with '@' might be performed multiple times for the same input

+text.

+(The usual behaviour of actions is that they are saved up until

+matching is complete, and then those that are part of the

+final derivation are performed in left-to-right order.)

+The variable

+.I yytext

+is available within these actions.

+.TP

+.IB exp \ ~ \ {\ action\ }

+A postfix operator

+.BI ~ {\ action\ }

+can be placed after any expression and will behave like a normal

+action (arbitrary C code) except that it is invoked only when

+.I exp

+fails.  It binds less tightly than any other operator except alternation and sequencing, and

+is intended to make error handling and recovery code easier to write.

+Note that

+.I yytext

+and

+.I yyleng

+are not available inside these actions, but the pointer variable

+.I yy

+is available to give the code access to any user\-defined members

+of the parser state (see "CUSTOMISING THE PARSER" below).

+Note also that

+.I exp

+is always a single expression; to invoke an error action for any

+failure within a sequence, parentheses must be used to group the

+sequence into a single expression.

+.nf

+    rule = e1 e2 e3 ~{ error("e[12] ok; e3 has failed"); }

+         | ...

+    rule = (e1 e2 e3) ~{ error("one of e[123] has failed"); }

+         | ...

+.fi

+.TP

 .B <

 An opening angle bracket always matches (consuming no input) and

 causes the parser to begin accumulating matched text.  This text will

@@ -339,102 +347,14 @@

 if each individual element within it matches, from left to right.

.PP

 Sequences can be separated into disjoint alternatives by the

-alternation operator '/'.

+alternation operator '|'.

.TP

-.RB sequence\-1\  / \ sequence\-2\  / \ ...\  / \ sequence\-N

+.RB sequence\-1\  | \ sequence\-2\  | \ ...\  | \ sequence\-N

 Each sequence is tried in turn until one of them matches, at which

 time matching for the overall pattern succeeds.  If none of the

 sequences matches then the match of the overall pattern fails.

.PP

-Finally, the pound sign (#) introduces a comment (discarded) that

-continues until the end of the line.

-.PP

-To summarise the above, the parser tries to match the input text

-against a pattern containing literals, names (representing other

-rules), and various operators (written as prefixes, suffixes,

-juxtaposition for sequencing and and infix alternation operator) that

-modify how the elements within the pattern are matched.  Matches are

-made from left to right, 'descending' into named sub\-rules as they are

-encountered.  If the matching process fails, the parser 'back tracks'

-('rewinding' the input appropriately in the process) to find the

-nearest alternative 'path' through the grammar.  In other words the

-parser performs a depth\-first, left\-to\-right search for the first

-successfully\-matching path through the rules.  If found, the actions

-along the successful path are executed (in the order they were

-encountered).

-.PP

-Note that predicates are evaluated

-.I immediately

-during the search for a successful match, since they contribute to the

-success or failure of the search.  Actions, however, are evaluated

-only after a successful match has been found.

-.SH PEG GRAMMAR FOR PEG GRAMMARS

-The grammar for

-.I peg

-grammars is shown below.  This will both illustrate and formalise

-the above description.

-.nf

-    Grammar         <\- Spacing Definition+ EndOfFile

-    Definition      <\- Identifier LEFTARROW Expression

-    Expression      <\- Sequence ( SLASH Sequence )*

-    Sequence        <\- Prefix*

-    Prefix          <\- AND Action

-                     / ( AND | NOT )? Suffix

-    Suffix          <\- Primary ( QUERY / STAR / PLUS )?

-    Primary         <\- Identifier !LEFTARROW

-                     / OPEN Expression CLOSE

-                     / Literal

-                     / Class

-                     / DOT

-                     / Action

-                     / BEGIN

-                     / END

-    Identifier      <\- < IdentStart IdentCont* > Spacing

-    IdentStart      <\- [a\-zA\-Z_]

-    IdentCont       <\- IdentStart / [0\-9]

-    Literal         <\- ['] < ( !['] Char  )* > ['] Spacing

-                     / ["] < ( !["] Char  )* > ["] Spacing

-    Class           <\- '[' < ( !']' Range )* > ']' Spacing

-    Range           <\- Char '\-' Char / Char

-    Char            <\- '\\\\' [abefnrtv'"\\[\\]\\\\]

-                     / '\\\\' [0\-3][0\-7][0\-7]

-                     / '\\\\' [0\-7][0\-7]?

-                     / '\\\\' '\-'

-                     / !'\\\\' .

-    LEFTARROW       <\- '<\-' Spacing

-    SLASH           <\- '/' Spacing

-    AND             <\- '&' Spacing

-    NOT             <\- '!' Spacing

-    QUERY           <\- '?' Spacing

-    STAR            <\- '*' Spacing

-    PLUS            <\- '+' Spacing

-    OPEN            <\- '(' Spacing

-    CLOSE           <\- ')' Spacing

-    DOT             <\- '.' Spacing

-    Spacing         <\- ( Space / Comment )*

-    Comment         <\- '#' ( !EndOfLine . )* EndOfLine

-    Space           <\- ' ' / '\\t' / EndOfLine

-    EndOfLine       <\- '\\r\\n' / '\\n' / '\\r'

-    EndOfFile       <\- !.

-    Action          <\- '{' < [^}]* > '}' Spacing

-    BEGIN           <\- '<' Spacing

-    END             <\- '>' Spacing

-.fi

-.SH LEG GRAMMARS

-.I leg

-is a variant of

-.I peg

-that adds some features of

-.IR lex (1)

-and

-.IR yacc (1).

-It differs from

-.I peg

-in the following ways.

+The following elements can appear in addition to rules.

.TP

 .BI %{\  text... \ %}

 A declaration section can appear anywhere that a rule definition is

@@ -444,108 +364,31 @@

 generated C parser code

 .I before

 the code that implements the parser itself.

+.PP

+The pound sign (#) introduces a comment (discarded) that

+continues until the end of the line.

.TP

-.IB name\  = \ pattern

-The 'assignment' operator replaces the left arrow operator '<\-'.

-.TP

+.BI %% \ text...

+A double percent '%%' terminates the rules (and declarations) section of

+the grammar.  All

+.I text

+following '%%' is copied verbatim to the generated C parser code

+.I after

+the parser implementation code.

+.PP

+Some notes regarding rules and patterns follow.

+.PP

 .B rule\-name

 Hyphens can appear as letters in the names of rules.  Each hyphen is

 converted into an underscore in the generated C source code.  A

 single hyphen '\-' is a legal rule name.

-.nf

-    \-       = [ \\t\\n\\r]*

-    number  = [0\-9]+                 \-

-    name    = [a\-zA\-Z_][a\-zA_Z_0\-9]* \-

-    l\-paren = '('                    \-

-    r\-paren = ')'                    \-

-.fi

-This example shows how ignored whitespace can be obvious when reading

-the grammar and yet unobtrusive when placed liberally at the end of

-every rule associated with a lexical element.

.TP

-.IB seq\-1\  | \ seq\-2

-The alternation operator is vertical bar '|' rather than forward

-slash '/'.  The

-.I peg

-rule

-.nf

-    name <\- sequence\-1

-          / sequence\-2

-          / sequence\-3

-.fi

-is therefore written

-.nf

-    name = sequence\-1

-         | sequence\-2

-         | sequence\-3

-         ;

-.fi

-in

-.I leg

-(with the final semicolon being optional, as described next).

-.TP

-.IB @{\ action\ }

-Actions prefixed with an 'at' symbol will be performed during parsing,

-at the time they are encountered while matching the input text with a

-rule.

-Because of back-tracking in the PEG parsing algorithm, actions

-prefixed with '@' might be performed multiple times for the same input

-text.

-(The usual behviour of actions is that they are saved up until

-matching is complete, and then those that are part of the

-final derivation are performed in left-to-right order.)

-The variable

-.I yytext

-is available within these actions.

-.TP

-.IB exp \ ~ \ {\ action\ }

-A postfix operator

-.BI ~ {\ action\ }

-can be placed after any expression and will behave like a normal

-action (arbitrary C code) except that it is invoked only when

-.I exp

-fails.  It binds less tightly than any other operator except alternation and sequencing, and

-is intended to make error handling and recovery code easier to write.

-Note that

-.I yytext

-and

-.I yyleng

-are not available inside these actions, but the pointer variable

-.I yy

-is available to give the code access to any user\-defined members

-of the parser state (see "CUSTOMISING THE PARSER" below).

-Note also that

-.I exp

-is always a single expression; to invoke an error action for any

-failure within a sequence, parentheses must be used to group the

-sequence into a single expression.

-.nf

-    rule = e1 e2 e3 ~{ error("e[12] ok; e3 has failed"); }

-         | ...

-    rule = (e1 e2 e3) ~{ error("one of e[123] has failed"); }

-         | ...

-.fi

-.TP

 .IB pattern\  ;

 A semicolon punctuator can optionally terminate a

 .IR pattern .

+.PP

+Within actions you can access and manipulate named values.

.TP

-.BI %% \ text...

-A double percent '%%' terminates the rules (and declarations) section of

-the grammar.  All

-.I text

-following '%%' is copied verbatim to the generated C parser code

-.I after

-the parser implementation code.

-.TP

 .BI $$\ = \ value

 A sub\-rule can return a semantic

 .I value

@@ -559,72 +402,9 @@

 is associated with the

 .I identifier

 and can be referred to in subsequent actions.

-.PP

-The desk calculator example below illustrates the use of '$$' and ':'.

-.SH LEG EXAMPLE: A DESK CALCULATOR

-The extensions in

-.I leg

-described above allow useful parsers and evaluators (including

-declarations, grammar rules, and supporting C functions such

-as 'main') to be kept within a single source file.  To illustrate this

-we show a simple desk calculator supporting the four common arithmetic

-operators and named variables.  The intermediate results of arithmetic

-evaluation will be accumulated on an implicit stack by returning them

-as semantic values from sub\-rules.

-.nf

-    %{

-    #include <stdio.h>     /* printf() */

-    #include <stdlib.h>    /* atoi() */

-    int vars[26];

-    %}

-    Stmt    = \- e:Expr EOL                  { printf("%d\\n", e); }

-            | ( !EOL . )* EOL               { printf("error\\n"); }

-    Expr    = i:ID ASSIGN s:Sum             { $$ = vars[i] = s; }

-            | s:Sum                         { $$ = s; }

-    Sum     = l:Product

-                    ( PLUS  r:Product       { l += r; }

-                    | MINUS r:Product       { l \-= r; }

-                    )*                      { $$ = l; }

-    Product = l:Value

-                    ( TIMES  r:Value        { l *= r; }

-                    | DIVIDE r:Value        { l /= r; }

-                    )*                      { $$ = l; }

-    Value   = i:NUMBER                      { $$ = atoi(yytext); }

-            | i:ID !ASSIGN                  { $$ = vars[i]; }

-            | OPEN i:Expr CLOSE             { $$ = i; }

-    NUMBER  = < [0\-9]+ >    \-               { $$ = atoi(yytext); }

-    ID      = < [a\-z]  >    \-               { $$ = yytext[0] \- 'a'; }

-    ASSIGN  = '='           \-

-    PLUS    = '+'           \-

-    MINUS   = '\-'           \-

-    TIMES   = '*'           \-

-    DIVIDE  = '/'           \-

-    OPEN    = '('           \-

-    CLOSE   = ')'           \-

-    \-       = [ \\t]*

-    EOL     = '\\n' | '\\r\\n' | '\\r' | ';'

-    %%

-    int main()

-    {

-      while (yyparse())

-        ;

-      return 0;

-    }

-.fi

-.SH LEG GRAMMAR FOR LEG GRAMMARS

+.SH MINIPEG GRAMMAR FOR MINIPEG GRAMMARS

 The grammar for

-.I leg

+.I minipeg

 grammars is shown below.  This will both illustrate and formalise the

 above description.

.nf

@@ -922,7 +702,7 @@

 .IR free ()

 to manage them explicitly.  The example in the following section

 demonstrates one approach to resource management.

-.SH LEG EXAMPLE: EXTENDING THE PARSER'S CONTEXT

+.SH EXAMPLE: EXTENDING THE PARSER'S CONTEXT

The

 .I yy

 variable passed to actions contains the state of the parser plus any

@@ -978,10 +758,8 @@

.fi

 .SH DIAGNOSTICS

-.I peg

-and

-.I leg

-warn about the following conditions while converting a grammar into a parser.

+.I minipeg

+warns about the following conditions while converting a grammar into a parser.

.TP

 .B syntax error

 The input grammar was malformed in some way.  The error message will

@@ -1074,8 +852,6 @@

 any character at all ("." expression) when it attempts to read beyond

 the end of the input.

 .SH BUGS

-You have to type 'man peg' to read the manual page for

-.IR leg (1).

.PP

 The 'yy' and 'YY' prefixes cannot be changed.

.PP

@@ -1121,27 +897,24 @@

and

 .IR yacc (1)

 which influenced the syntax and features of

-.IR leg .

+.IR minipeg .

.PP

 The source code for

-.I peg

-and

-.I leg

+.I minipeg

 whose grammar parsers are written using themselves.

.PP

 The latest version of this software and documentation:

.nf

-    http://piumarta.com/software/peg

+    https://github.com/andrewchambers/minipeg

.fi

 .SH AUTHOR

-.IR peg ,

-.I leg

-and this manual page were written by Ian Piumarta (first\-name at

-last\-name dot com) while investigating the viability of regular and

-parsing\-expression grammars for efficiently extracting type and

-signature information from C header files.

+.IR minipeg

+and this manual page were originally written by Ian Piumarta

+under the project name peg/leg.

+.IR minipeg

+is a fork of peg/leg by Andrew Chambers aiming to simplify the project.

.PP

 Please send bug reports and suggestions for improvements to the author

-at the above address.

+at the project address.
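
Reviewer note: the calculator grammar promoted to the top of the manual page relies on ordered choice with backtracking (for example, Expr first tries "ID ASSIGN Sum" and falls back to "Sum"). The hand-written C sketch below mirrors that behaviour so the semantics can be checked without running minipeg. It is not the code minipeg generates. The names calc_eval, tok, skip and pos are hypothetical, and two behaviours are assumptions that differ from the grammar: end of string is accepted in place of EOL, and division by zero makes the parse fail.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

static int vars[26];        /* one slot per variable 'a'..'z', as in the grammar */
static const char *pos;     /* current input position (hypothetical name) */

static int sum(int *out);

/* The '-' rule: skip blanks and tabs. */
static void skip(void) { while (*pos == ' ' || *pos == '\t') pos++; }

/* A single-character token followed by optional blanks, e.g. PLUS = '+' - */
static int tok(char c)
{
    if (*pos != c) return 0;
    pos++;
    skip();
    return 1;
}

/* ID = < [a-z] > -      yields the variable index */
static int id(int *out)
{
    if (*pos < 'a' || *pos > 'z') return 0;
    *out = *pos++ - 'a';
    skip();
    return 1;
}

/* NUMBER = < [0-9]+ > - */
static int number(int *out)
{
    char *end;
    if (!isdigit((unsigned char)*pos)) return 0;
    *out = (int)strtol(pos, &end, 10);
    pos = end;
    skip();
    return 1;
}

/* Expr = ID ASSIGN Sum | Sum      (second alternative tried after rewinding) */
static int expr(int *out)
{
    const char *start = pos;
    int i, s;
    if (id(&i) && tok('=') && sum(&s)) { *out = vars[i] = s; return 1; }
    pos = start;            /* backtrack: undo whatever the first alternative consumed */
    return sum(out);
}

/* Value = NUMBER | ID !ASSIGN | OPEN Expr CLOSE */
static int value(int *out)
{
    const char *start = pos;
    int i;
    if (number(out)) return 1;
    pos = start;
    if (id(&i) && *pos != '=') { *out = vars[i]; return 1; }
    pos = start;
    if (tok('(') && expr(out) && tok(')')) return 1;
    pos = start;
    return 0;
}

/* Product = Value ( TIMES Value | DIVIDE Value )* */
static int product(int *out)
{
    int l, r;
    if (!value(&l)) return 0;
    for (;;) {
        const char *start = pos;
        if (tok('*') && value(&r)) { l *= r; continue; }
        pos = start;
        if (tok('/') && value(&r)) {
            if (r == 0) return 0;   /* assumption: fail instead of dividing by zero */
            l /= r;
            continue;
        }
        pos = start;
        break;
    }
    *out = l;
    return 1;
}

/* Sum = Product ( PLUS Product | MINUS Product )* */
static int sum(int *out)
{
    int l, r;
    if (!product(&l)) return 0;
    for (;;) {
        const char *start = pos;
        if (tok('+') && product(&r)) { l += r; continue; }
        pos = start;
        if (tok('-') && product(&r)) { l -= r; continue; }
        pos = start;
        break;
    }
    *out = l;
    return 1;
}

/* Stmt = - Expr EOL      returns 1 and stores the value on success, 0 otherwise */
int calc_eval(const char *line, int *result)
{
    assert(line != NULL && result != NULL);
    pos = line;
    skip();
    if (!expr(result)) return 0;
    return *pos == '\0' || *pos == '\n' || *pos == '\r' || *pos == ';';
}
```

Calling calc_eval("a=5", &r) stores 5 in r, and a following calc_eval("a+5", &r) stores 10, matching the session shown in the EXAMPLE section of the patched manual page.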
