Extensible Term Language Specification

Constantine Plotnikov

<constantine.plotinikov@gmail.com>

Revision History
Revision 0.2.1	2009-01-19
The specification finally carries much more substance than before. EBNF is provided to specify some layers, and the definition of the grammar language in itself is included in the document.
Revision 0.2.0	2006-02-05
This is the first draft of the specification. ETL now uses itself to define grammar.

Abstract

This specification describes the syntax and semantics of the ETL meta-language. The language is intended to provide a framework for creating human-readable and human-writable languages that can be extended and combined.

List of Examples

A.1. grammar-0_2_1.g.etl
B.1. doctype.g.etl
C.1. default.g.etl

Chapter 1. Overview

The language consists of three layers: Lexical, Phrase, and Term (syntax). Each layer delimits the underlying layer and annotates objects of the underlying layer.

Character Level

This is a primitive layer provided by the underlying runtime. It produces a sequence of characters from some data source.

Lexical Level

On this level a character stream is translated into a sequence of tokens.

Phrase Level

On this level the stream of tokens is translated into blocks, segments, and annotated tokens. Tokens are annotated as belonging to one of the following classes:

Ignorable
Control
Significant

Only significant tokens have to be considered by other levels during parsing. The others are just passed through.

Term Level

On this level source code is mapped to an abstract syntax tree.

Unlike most other syntax definition frameworks, this level uses the notions of statements and operators rather than some form of BNF.

The reasons are the following:

The abstractions of statement and operator have been used by language designers for a long time. However, these abstractions are not directly expressed by meta-languages.
Using higher-level constructs directly provides more high-level extension points. It is possible to add new statements and operators rather than new productions that have to be integrated with existing languages.

Chapter 2. Lexical Level

Abstract

This chapter describes the lexical level of the parser.

Table of Contents

1. Overview
2. New Lines
3. White Spaces
4. Comments
5. Brackets
6. Strings
7. Numbers
8. Identifiers
9. Graphics
10. Other
11. Complete Lexical Level Grammar

1. Overview

The lexical level is quite traditional. The current definition of the lexical level is incomplete with respect to Unicode: outside of strings and comments, only ASCII characters are supported.

The lexical level is fixed and cannot be extended by the grammar writer. ^[1]

Another feature is that there are no keywords on the lexical level. Whether a specific token is a keyword or not is context dependent. Keywords are treated as local rather than global.

[1]	source	`::=`	tokens , EOF
[2]	tokens	`::=`	token *
[3]	token	`::=`	identifier \| integer \| integer-with-suffix \| float \| float-with-suffix \| graphics \| semicolon \| comma \| open-round \| close-round \| open-square \| close-square \| open-curly \| close-curly \| whitespace \| newline \| block-comment \| line-comment \| documentation-comment \| string

2. New Lines

Both UNIX and DOS new line styles are supported.

[4]	newline	`::=`	( CR , LF ) \| ( LF , CR ) \| LF \| CR	/* longest matching sequence is selected */
[5]	CR	`::=`	U+000D
[6]	LF	`::=`	U+000A
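The longest-match rule of production [4] can be sketched in Python as follows. This is a minimal illustration rather than part of the specification; the function name `split_newlines` and the token kind names are assumptions made for the example.

```python
def split_newlines(text):
    """Split text into ('newline', lexeme) and ('other', lexeme) pairs.

    A two-character sequence CR LF or LF CR is preferred over a single
    CR or LF, which mirrors the "longest matching sequence" note.
    """
    tokens = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("\r\n", "\n\r"):
            tokens.append(("newline", pair))
            i += 2
        elif text[i] in "\r\n":
            tokens.append(("newline", text[i]))
            i += 1
        else:
            j = i
            while j < len(text) and text[j] not in "\r\n":
                j += 1
            tokens.append(("other", text[i:j]))
            i = j
    return tokens
```

Note that two identical characters in a row, such as LF LF, are reported as two separate new lines, because only the mixed pairs are listed in the production.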

3. White Spaces

Tab, space, and new line characters outside of strings and comments are considered white space tokens. A conforming parser may merge individual characters into bigger tokens.

[7]	whitespace	`::=`	( TAB \| SPACE )+
[8]	TAB	`::=`	U+0009
[9]	SPACE	`::=`	U+0020

Other kinds of Unicode spaces and new lines will be supported in the future.

4. Comments

Comments might be of three kinds:

Block comments

These are traditional C++/Java/C# comments that start with /* and end with */ . Nested block comments are not allowed.

Line Comments

These are traditional C++/Java/C# line comments that start with // and end at the end of the line.

Documentation Comments

These are traditional C# documentation comments. These comments are a specialization of line comments and are treated as line comments in places where documentation comments are not expected. Documentation comments start with /// and last until the end of the line.

A single format for documentation comments has been selected to make different languages defined with the ETL framework consistent. The C# comment format has an advantage over the Java format in that it allows any text inside comments. In Java documentation comments it is not possible to use block comments or documentation comments in sample code.

[10]	documentation-comment	`::=`	'/','/','/',(~( newline ))*
[11]	line-comment	`::=`	'/','/',( ~( newline \|'/'), (~( newline ))*)?
[12]	block-comment	`::=`	'/', '*', (~'*' \| ('*',~'/'))*, '*', '/'
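The three comment kinds can be told apart by their first characters. The following Python sketch illustrates this under stated assumptions: the function name `scan_comment` and its return shape are hypothetical, and the block comment is ended at the first `*/`, which is how the "no nesting" rule is read here.

```python
def scan_comment(text, pos=0):
    """Return (kind, end) for a comment starting at text[pos], else None.

    `end` is the index just past the comment. Line and documentation
    comments stop before the new line characters, which are not part of
    the comment token.
    """
    if text.startswith("///", pos):
        kind = "documentation-comment"
    elif text.startswith("//", pos):
        kind = "line-comment"
    elif text.startswith("/*", pos):
        # Block comments do not nest: the first "*/" closes the comment.
        close = text.find("*/", pos + 2)
        if close < 0:
            raise ValueError("unterminated block comment")
        return "block-comment", close + 2
    else:
        return None
    end = pos
    while end < len(text) and text[end] not in "\r\n":
        end += 1
    return kind, end
```

For the input `/* a /* b */ c */` the sketch reports a block comment that ends after the first `*/`, so the remaining ` c */` is left for the following tokens.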

5. Brackets

ETL recognizes the traditional bracket kinds [] , () , {} . The biggest difference from C-like languages is that '[' can be directly followed by a graphics token, and ']' can be directly preceded by a graphics token. The text [++i++] will be parsed as three tokens: [++ , i , and ++] . So it is good style to always put spaces after an open square bracket and before a close square bracket. The graphic suffixes and prefixes make it easy to introduce brackets with custom semantics into a language.

[13]	open-round	`::=`	'('	/* also known as parenthesis. */
[14]	close-round	`::=`	')'
[15]	open-square	`::=`	'[', graphics-modifier ?
[16]	close-square	`::=`	graphics-modifier ?, ']'
[17]	open-curly	`::=`	'{'	/* also known as braces. */
[18]	close-curly	`::=`	'}'

Apart from the optional graphics modifier of square brackets, all brackets are singleton characters: they always consist of a single character from the stream.
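The effect of graphics characters attached to square brackets can be illustrated with a small Python sketch. It assumes that a graphics modifier is simply a run of graphics characters (the production graphics-modifier is not defined in this chapter), it ignores comments, strings, and numbers, and the names `G`, `TOKEN`, and `tokenize` are made up for the example.

```python
import re

# Character class with the graphics characters listed in the Graphics section.
G = r"[~+\-%^&*|<=:?!>.@/\\$`]"

# Alternatives are tried in order, so square brackets with attached
# graphics characters win over plain graphics tokens.
TOKEN = re.compile(
    rf"(?P<open_square>\[{G}*)"
    rf"|(?P<close_square>{G}*\])"
    rf"|(?P<identifier>[A-Za-z_][A-Za-z0-9_]*)"
    rf"|(?P<graphics>{G}+)"
    rf"|(?P<whitespace>[ \t]+)"
)

def tokenize(text):
    """Return (kind, lexeme) pairs, skipping white space."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "whitespace":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

With this sketch, `tokenize("[++i++]")` yields three tokens, while `tokenize("[ ++i++ ]")` yields five, which is why spaces inside square brackets are recommended.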

6. Strings

String definitions follow the C and Java tradition most closely. The biggest difference is that the lexer itself makes no distinction between char and string literals.

The current version supports only two kinds of quotes for strings: " and ' ^[2] .

A string can be prefixed by an identifier, as in UTF8"Some text". Two-letter prefixes that start with an upper case or lower case letter Q (Unicode code points U+0051 and U+0071) are reserved for future use in Unicode support.

The lexer also supports multiline strings that can include newline characters directly in the text. A multiline string starts with a sequence of three quote characters and ends with a sequence of three quotes of the same type. TODO backslash semantics.

[19]	string	`::=`	identifier ? , ( same-quote-string(quote='\'') \| same-quote-string(quote='\"') \| same-quote-multiline-string(quote='\'') \| same-quote-multiline-string(quote='\"') )
[20]	same-quote-string(quote:char)	`::=`	quote,(~( newline \|quote\| backslash ) \| escape-sequence )+, quote
[21]	same-quote-multiline-string(quote:char)	`::=`	quote,quote,quote,(~((quote, quote, quote)\| backslash ) \| escape-sequence )+, quote, quote, quote
[22]	escape-sequence	`::=`	backslash ,( quote-char \| backslash \| (TODO any char?)
[23]	backslash	`::=`	U+005C	/* the character "\\", also known as "REVERSE SOLIDUS" */
[24]	quote-char	`::=`	'"' \| '\''

7. Numbers

Numbers are borrowed from the Ada programming language almost as is. There are two formats for numbers: decimal and based.

Decimal numbers are like the ones in other languages. However, a floating point number must have digits on both sides of the "." character.

Based numbers allow using any base from 2 to 36 inclusive. In based numbers the exponent is a decimal number that specifies the power of the base by which the mantissa is multiplied. For example, 2#1#E10 is the floating point number 1024 , and 36#10.0#E-1 is the floating point number 1 .

Numbers can have underscore symbols in the mantissa. They are ignored when the value is evaluated and can be used to improve the readability of a number, for example: 16#7FFF_FFFF# .

Numbers are divided into two classes: integers and floating point numbers. Floating point numbers are those that contain '.' in the mantissa or have an exponent specified.

Numbers can have an optional suffix that is an identifier. The suffix cannot start with an upper case or lower case letter "E", to avoid conflict with the exponent specification, and it cannot start with an underscore character, to prevent confusion with the separator of number parts. The suffix was introduced to support typed numeric constants like those used in C and Java. For example: 36#XYZ#ul , 36#XYZ#i32 .

[25]	integer-with-suffix	`::=`	integer , numeric-suffix
[26]	float-with-suffix	`::=`	float , numeric-suffix
[27]	numeric-suffix	`::=`	identifier	[ The numeric suffix must start with valid character. ]
[28]	integer	`::=`	based-integer \| decimal-integer
[29]	float	`::=`	based-float \| decimal-float
[30]	based-integer	`::=`	numeric-base , '#', based-digits , '#'	[ Extended digits must conform to base. ]
[31]	based-float	`::=`	( numeric-base , '#', based-digits , '.', based-digits , '#', numeric-exponent ?) \| ( based-integer , numeric-exponent )	[ Extended digits must conform to base. ]
[32]	numeric-base	`::=`	decimal-integer
[33]	decimal-float	`::=`	decimal-integer , (( '.', decimal-integer , numeric-exponent ?) \| numeric-exponent )
[34]	numeric-exponent	`::=`	('e'\|'E'),('+'\|'-')?, decimal-integer
[35]	decimal-integer	`::=`	digit , ( digit \| ( underscore +, digit ) )*
[36]	based-digits	`::=`	extended-digit , ( extended-digit \| ( underscore +, extended-digit ) )*
[37]	extended-digit	`::=`	digit \| alpha	/* Value of alpha characters are interpreted as number in alphabet + 10 (for example 'A' maps to 10, 'b' maps to 11). Mapping is case insensitive. */
[44]	underscore	`::=`	'_'
[45]	alpha	`::=`	'A'..'Z'\|'a'..'z'
[46]	digit	`::=`	'0'..'9'

Extended digits must conform to base.

Value of each extended digit in based numeric literal must be strictly less than value of the specified base. For example if base is 12, only digits '0'..'9' and 'A'..'B' might appear.

The numeric suffix must start with valid character.

The numeric suffix cannot start with "E" (U+0045: Latin Capital Letter E), "e" (U+0065: Latin Small Letter E), or "_" (U+005F: Low Line; Underscore).
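The evaluation of based literals can be sketched in Python as follows. This is an illustration under assumptions, not a reference implementation: the function name `based_value` is hypothetical, numeric suffixes are not handled, and the exponent sign is treated as optional so that the example 2#1#E10 from the text is accepted.

```python
import re
from fractions import Fraction

# extended-digit, ( extended-digit | ( underscore+, extended-digit ) )*
DIGITS = r"[0-9A-Za-z](?:_*[0-9A-Za-z])*"
BASED = re.compile(
    rf"(\d(?:_*\d)*)#({DIGITS})(?:\.({DIGITS}))?#(?:[eE]([+-]?\d(?:_*\d)*))?"
)

def based_value(literal):
    """Return an int for based integers and a float for based floats."""
    m = BASED.fullmatch(literal)
    if m is None:
        raise ValueError(f"not a based literal: {literal!r}")
    base = int(m.group(1).replace("_", ""))
    if not 2 <= base <= 36:
        raise ValueError(f"base out of range: {base}")
    whole = m.group(2).replace("_", "")
    frac = (m.group(3) or "").replace("_", "")
    exponent = m.group(4)
    # int() raises ValueError for digits that are not strictly less than
    # the base, which matches the "must conform to base" constraint.
    mantissa = Fraction(int(whole + frac, base), base ** len(frac))
    if m.group(3) is None and exponent is None:
        return int(mantissa)
    power = int(exponent.replace("_", "")) if exponent else 0
    # The exponent is a power of the base, not of ten.
    return float(mantissa * Fraction(base) ** power)
```

Exact rational arithmetic is used for the mantissa so that the examples from the text evaluate without rounding surprises before the final conversion to a floating point value.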

8. Identifiers

Identifiers are quite typical. Currently an identifier must start with a letter or underscore and continue with letters, underscores, or digits.

[38]	identifier	`::=`	( alpha \| underscore ), ( alpha \| digit \| underscore )*
[44]	underscore	`::=`	'_'
[45]	alpha	`::=`	'A'..'Z'\|'a'..'z'
[46]	digit	`::=`	'0'..'9'

9. Graphics

The name of the category and its definition are borrowed from Prolog. A graphics token is a non-empty sequence of the following characters: ~+-%^&*|<=:?!>.@/\$ . Such tokens are typically used to define operators. Note that backslash does not have any special meaning in a graphics token.

[39]	graphics	`::=`	( safe-graphics-char \| '*' \| ( '/', safe-graphics-char ))+, graphics-char ?	[ Comments have higher priority than graphics. ]
[40]	safe-graphics-char	`::=`	'~' \| '+' \| '-' \| '%' \| '^' \| '&' \| '\|' \| '<' \| '=' \| ':' \| '?' \| '!' \| '>' \| '.' \| '@' \| '`' \| '$' \| backslash
[41]	graphics-char	`::=`	'/' \| '*' \| safe-graphics-char

Comments have higher priority than graphics.

Comments have higher priority than graphics tokens. Therefore the character sequences '/*' and '//' are always interpreted as the start of a comment, and a graphics token never ends with the character '/' when the next character is '/' or '*' .

10. Other

There are two singleton characters that do not fall into the graphics category: semicolon and comma.

[42]	semicolon	`::=`	';'
[43]	comma	`::=`	','

11. Complete Lexical Level Grammar

This section provides an aggregate view of the productions of the lexical level.

Lexical grammar

[1]	source	`::=`	tokens , EOF
[2]	tokens	`::=`	token *
[3]	token	`::=`	identifier \| integer \| integer-with-suffix \| float \| float-with-suffix \| graphics \| semicolon \| comma \| open-round \| close-round \| open-square \| close-square \| open-curly \| close-curly \| whitespace \| newline \| block-comment \| line-comment \| documentation-comment \| string
[4]	newline	`::=`	( CR , LF ) \| ( LF , CR ) \| LF \| CR	/* longest matching sequence is selected */
[5]	CR	`::=`	U+000D
[6]	LF	`::=`	U+000A
[7]	whitespace	`::=`	( TAB \| SPACE )+
[8]	TAB	`::=`	U+0009
[9]	SPACE	`::=`	U+0020
[10]	documentation-comment	`::=`	'/','/','/',(~( newline ))*
[11]	line-comment	`::=`	'/','/',( ~( newline \|'/'), (~( newline ))*)?
[12]	block-comment	`::=`	'/', '*', (~'*' \| ('*',~'/'))*, '*', '/'
[13]	open-round	`::=`	'('	/* also known as parenthesis. */
[14]	close-round	`::=`	')'
[15]	open-square	`::=`	'[', graphics-modifier ?
[16]	close-square	`::=`	graphics-modifier ?, ']'
[17]	open-curly	`::=`	'{'	/* also known as braces. */
[18]	close-curly	`::=`	'}'
[19]	string	`::=`	identifier ? , ( same-quote-string(quote='\'') \| same-quote-string(quote='\"') \| same-quote-multiline-string(quote='\'') \| same-quote-multiline-string(quote='\"') )
[20]	same-quote-string(quote:char)	`::=`	quote,(~( newline \|quote\| backslash ) \| escape-sequence )+, quote
[21]	same-quote-multiline-string(quote:char)	`::=`	quote,quote,quote,(~((quote, quote, quote)\| backslash ) \| escape-sequence )+, quote, quote, quote
[22]	escape-sequence	`::=`	backslash ,( quote-char \| backslash \| (TODO any char?)
[23]	backslash	`::=`	U+005C	/* the character "\\", also known as "REVERSE SOLIDUS" */
[24]	quote-char	`::=`	'"' \| '\''
[25]	integer-with-suffix	`::=`	integer , numeric-suffix
[26]	float-with-suffix	`::=`	float , numeric-suffix
[27]	numeric-suffix	`::=`	identifier	[ The numeric suffix must start with valid character. ]
[28]	integer	`::=`	based-integer \| decimal-integer
[29]	float	`::=`	based-float \| decimal-float
[30]	based-integer	`::=`	numeric-base , '#', based-digits , '#'	[ Extended digits must conform to base. ]
[31]	based-float	`::=`	( numeric-base , '#', based-digits , '.', based-digits , '#', numeric-exponent ?) \| ( based-integer , numeric-exponent )	[ Extended digits must conform to base. ]
[32]	numeric-base	`::=`	decimal-integer
[33]	decimal-float	`::=`	decimal-integer , (( '.', decimal-integer , numeric-exponent ?) \| numeric-exponent )
[34]	numeric-exponent	`::=`	('e'\|'E'),('+'\|'-')?, decimal-integer
[35]	decimal-integer	`::=`	digit , ( digit \| ( underscore +, digit ) )*
[36]	based-digits	`::=`	extended-digit , ( extended-digit \| ( underscore +, extended-digit ) )*
[37]	extended-digit	`::=`	digit \| alpha	/* Value of alpha characters are interpreted as number in alphabet + 10 (for example 'A' maps to 10, 'b' maps to 11). Mapping is case insensitive. */
[38]	identifier	`::=`	( alpha \| underscore ), ( alpha \| digit \| underscore )*
[39]	graphics	`::=`	( safe-graphics-char \| '*' \| ( '/', safe-graphics-char ))+, graphics-char ?	[ Comments have higher priority than graphics. ]
[40]	safe-graphics-char	`::=`	'~' \| '+' \| '-' \| '%' \| '^' \| '&' \| '\|' \| '<' \| '=' \| ':' \| '?' \| '!' \| '>' \| '.' \| '@' \| '`' \| '$' \| backslash
[41]	graphics-char	`::=`	'/' \| '*' \| safe-graphics-char
[42]	semicolon	`::=`	';'
[43]	comma	`::=`	','
[44]	underscore	`::=`	'_'
[45]	alpha	`::=`	'A'..'Z'\|'a'..'z'
[46]	digit	`::=`	'0'..'9'

^[1] This is not as big a problem as it looks. Basic tokens like strings, identifiers, and numbers are repeated from language to language.

^[2] The character ` is considered to be graphics, as it is used as graphics in most languages. Prefixed strings make it possible to have many different string tokens anyway.

Chapter 3. Phrase Level

The notion of the phrase layer was initially borrowed from Dylan. Then the idea of what such a syntax should do was significantly affected by Python line syntax.

On the phrase level a forest of blocks and segments is built. A source is a sequence of ignorable tokens and segments. A segment is a sequence of tokens and blocks terminated by a semicolon. A block is a sequence of segments and tokens enclosed in curly brackets.

Ignorable: These tokens can be ignored while parsing terms. Tokens in this category are white space, new lines, and comments (except documentation comments).
Control: Control tokens designate the start and end of blocks and segments. Because they are processed by the phrase parser, they can be ignored on the term parser level.
Significant: These tokens are parsed by term parsers. They are tokens like identifiers, numbers, and strings. Documentation comments are also considered significant tokens.

For example, consider the following text.

  { a ;}; 
  a {b;} c; 
  /// a 
  a;

The text above is interpreted by the parser as follows.

There are three segments on the top level.
The first segment consists of one block with an ignorable white space token and a single nested segment with the significant token a and the white space token that follows it. Semicolons and braces are reported as control tokens.
The next segment is similar, but contains significant and ignorable tokens in addition to the block.
The last segment starts with a documentation token, which is followed by white space.
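The interpretation described above can be reproduced with a small Python sketch of the phrase level. It is only an illustration: the toy lexer knows identifiers, documentation comments, white space, and the control tokens, ignorable tokens are dropped rather than annotated, and the names `lex`, `parse_sequence`, and `parse_source` are assumptions made for the example.

```python
import re

TOKEN = re.compile(
    r"(?P<doc>///[^\r\n]*)|(?P<ws>\s+)|(?P<ctl>[{};])|(?P<sig>[A-Za-z_]\w*)"
)

def lex(text):
    """Toy lexer: returns (kind, lexeme) pairs for the sketch below."""
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

def parse_sequence(tokens, pos, in_block):
    """Parse a segment sequence; return (segments, next position)."""
    segments, current = [], []
    while pos < len(tokens):
        kind, text = tokens[pos]
        pos += 1
        if kind == "ws":                      # ignorable
            continue
        if kind != "ctl":                     # significant
            current.append(text)
        elif text == ";":                     # control: ends a segment
            segments.append(current)
            current = []
        elif text == "{":                     # control: opens a block
            block, pos = parse_sequence(tokens, pos, True)
            current.append({"block": block})
        else:                                 # control: "}" closes a block
            if not in_block or current:
                raise ValueError("unexpected '}'")
            return segments, pos
    if in_block or current:
        raise ValueError("unterminated block or segment")
    return segments, pos

def parse_source(text):
    """Return the list of top-level segments of text."""
    return parse_sequence(lex(text), 0, False)[0]
```

Applied to the example text, `parse_source` returns three top-level segments: one holding a block, one holding `a`, a block, and `c`, and one holding the documentation comment followed by `a`.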

Phrase grammar

[47]	source	`::=`	segment-sequence , EOF
[48]	segment-sequence	`::=`	( segment \| ignorable )*
[49]	ignorable	`::=`	lexer::block-comment \| lexer::line-comment \| lexer::whitespace \| lexer::newline
[50]	control(token:Token)	`::=`	token
[51]	block	`::=`	control ( lexer::open-curly ), segment-sequence , control ( lexer::close-curly )
[52]	segment	`::=`	(( significant \| block ), ( significant \| block \| ignorable )*)?, control ( lexer::semicolon )
[53]	significant	`::=`	lexer::identifier \| lexer::integer \| lexer::integer-with-suffix \| lexer::float \| lexer::float-with-suffix \| lexer::graphics \| lexer::comma \| lexer::open-round \| lexer::close-round \| lexer::open-square \| lexer::close-square \| lexer::documentation-comment \| lexer::string

Chapter 4. Term Layer Abstract Syntax Tree Model

A grammar can be considered a mapping from source to an abstract syntax tree that can be represented by the following model:

AST Object Model

[54]	source	`::=`	object *
[55]	object(namespace:URI, name: Identifier)	`::=`	( property \| list-property )*	[ Tree should be well formed ]
[56]	list-property(name: Identifier)	`::=`	( object + \| value + )?
[57]	property(name: Identifier)	`::=`	( object \| value )?
[58]	value	`::=`	phrase::significant

Tree should be well formed

Properties with the same name should have the same interpretation within the context of the same object type. For example, a simple property may not also be a list property in the context of the same object type. And if it contains objects in one context, it cannot contain values in another. It is a grammar error if a grammar can produce a tree that is not well formed. A processor may fail to detect this error.
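The well-formedness constraint can be checked on a concrete tree with a short Python sketch. The representation is an assumption made for illustration: an object is a dictionary with a `type` (namespace URI and name) and `props`, where each property maps to a pair of a list flag and its items, and items are either strings (values) or nested objects. The function name `check_well_formed` is hypothetical.

```python
def check_well_formed(roots):
    """Return a list of conflict descriptions; empty means well formed."""
    seen = {}        # (object type, property name) -> [is_list, content kind]
    problems = []
    stack = list(roots)
    while stack:
        obj = stack.pop()
        for name, (is_list, items) in obj["props"].items():
            key = (obj["type"], name)
            entry = seen.setdefault(key, [is_list, None])
            if entry[0] != is_list:
                problems.append(f"{key}: both list and simple property")
            for item in items:
                kind = "object" if isinstance(item, dict) else "value"
                if entry[1] is None:
                    entry[1] = kind
                elif entry[1] != kind:
                    problems.append(f"{key}: holds both objects and values")
                if kind == "object":
                    stack.append(item)
    return problems
```

Such a check works on one produced tree only; deciding whether a grammar can ever produce a tree that is not well formed is a separate question that this sketch does not address.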

The AST model is only one of the possible views of the object model. Such a view can be useful for tools that are primarily interested in significant information from the source.

Chapter 5. Term Layer Document Object Model

This section specifies the document object model from the point of view of the client APIs. How this document object model is built from source code is specified in the chapter Term Layer Grammar Language. Note that if we remove all nodes except objects, properties, and values, what remains is the AST model, which should also follow the well-formedness constraint.

Chapter 6. Term Layer Grammar Language

This chapter specifies the grammar language using EBNF and plain text. The specification also features a definition of the ETL grammar language using the ETL grammar language itself.

Chapter 7. Term Layer

Table of Contents

1. Overview

2. Grammar Organization

3. Context Definition

3.1. Statement Definition
3.2. Attributes Definition
3.3. Documentation Definition
3.4. Operator Definition
3.5. Fragment Definition

4. General Syntax Expressions

4.1. Token expressions
4.2. Sequences
4.3. Keywords and Sequences
4.4. Modifiers
4.5.
4.6. Usage of fragments
4.7. Syntax Operators
4.8. Operand Expressions
4.9. Object expression and let statement
4.10. Documentation lines

1. Overview

Parsers on the term layer delimit the stream of tokens from the phrase parser by objects and properties. Another way to look at the term parsing process is that the parser maps source code to an AST.

An AST is assumed to consist of objects and properties. Object types are designated by name and namespace. Properties are designated by name.

The AST structure is closely related to the tree produced by the phrase layer. A segment sequence on the source level or block level is always described by some grammar and by some context.

A statement declaration in the grammar describes the syntax of a single segment.

A grammar defines a mapping from a sequence of tokens to an AST.

2. Grammar Organization

The top level element of a grammar is the grammar object. A grammar consists of contexts. Each context has syntax definitions, fragments, and imports.

Syntax definitions define the syntax constructs available in a context. There are two major classes of syntax definitions: statements and operators. Each statement describes the syntax of one segment. Operators describe the syntax of expressions. The syntax of expressions is formed in a modular way, just like in Prolog. Almost the entire operator level is borrowed from Prolog. The only major addition is composite operators, which allow more syntax constructs to appear in the grammar.

Top Level

[59]	grammar	`::=`	"grammar" grammar-name "{" ( grammar-include \| grammar-import \| namespace-declaration \| context-definition ) * "}" ";"
[60]	grammar-name	`::=`	lexer::identifier ("." lexer::identifier )+
[61]	string	`::=`	lexer::string(quote="\"") \| lexer::string(quote="\'")
[62]	systemid	`::=`	string
[63]	publicid	`::=`	string
[64]	grammar-reference	`::=`	( systemid ( "public" publicid )? ) \| ( "public" publicid )
[65]	grammar-include	`::=`	"include" grammar-reference ";"
[66]	grammar-import	`::=`	"import" grammar-import-name "=" grammar-reference ";"
[67]	grammar-import-name	`::=`	lexer::identifier
[68]	namespace-declaration	`::=`	"namespace" namespace-prefix "=" string ";"
[69]	namespace-prefix	`::=`	lexer::identifier

3. Context Definition

A context contains syntax definitions. Like grammars, contexts can be included into each other and imported from each other. A context contains three kinds of definitions. Operators are used to define expressions. They are either composite or simple. Simple operators are just like Prolog's. Composite operators allow using complex expressions in the place of the operator, so complex operators like the Java new operator can be defined. Statements define constructs that can occur at block level. Among other things, an expression statement can be defined. A context can also contain reusable blocks of syntax.

A context includes all definitions inherited through grammar and context include operations. A context can add new definitions to the set of inherited definitions. It is also possible to redefine inherited definitions. Removing a definition is not directly possible. However, it is possible to create a "def" with the same name as a definition in the parent context. This removes the statement or operator.

Context Definition

[70]	context-name	`::=`	lexer::identifier
[71]	context-definition	`::=`	"context" context-name "{" ( context-import \| context-include \| statement-definition \| operator-definition \| attributes-definition \| fragment-definition \| documentation-definition )* "}" ";"
[72]	context-import	`::=`	"import" context-name ("from" grammar-import-name )? ";"
[73]	context-include	`::=`	"include" context-name ";"
[74]	definition-name	`::=`	lexer::identifier
[75]	operator-definition	`::=`	"op" "composite"? definition-name "(" ( "f" \| "xf" \| "fx" \| "yf" \| "fy" \| "xfx" \| "xfy" \| "yfx" \| "yfy" ) ( "," operator-precedence ( "," phrase::significant )? )? ")" "{" "}" ";"
[76]	operator-precedence	`::=`	lexer::integer
[77]	statement-definition	`::=`	"statement" definition-name "{" "}" ";"
[78]	fragment-definition	`::=`	"def" definition-name "{" "}" ";"
[79]	attributes-definition	`::=`	"attributes" definition-name "{" "}" ";"
[80]	documentation-definition	`::=`	"documentation" definition-name "{" "}" ";"

3.1. Statement Definition

This definition specifies a statement for the context. Each segment in the source should match one of the statements. The root syntax expression in the statement must be an object creation expression.

Syntax

Statement definition

[77] statement-definition ::= "statement" definition-name "{" "}" ";"

Example

statement Include {
	^ g:GrammarInclude {
		% include {
			ref(GrammarRef);
		};
	};
};

3.2. Attributes Definition

This definition specifies the mapping for attributes in this context. There can be only one attributes definition per context.

Attributes are a standard prefix for all statements in this context. They allow defining constructs like Java annotations and C# attributes. The attributes behave as if they were inserted into the object creation construct inside a statement as its first element. However, because they are common to all statements, this does not cause a conflict.

Because attributes are assumed to be defined inside an object context, they should specify the property to which they are mapped.

Syntax

Attributes definition

[79] attributes-definition ::= "attributes" definition-name "{" "}" ";"

Example

attributes Attributes {
    @ attributeSets += {
        ^ ej:AttributeSet {
            % @ % [ {
                @ attributes += list , {
                    expression(Expression,precedence=100);
                };
            } % ];
        };
    }+;
};

3.3. Documentation Definition

This definition specifies the mapping for documentation in this context. There can be only one documentation definition per context.

If there is no documentation definition in the context, documentation comments are treated as normal line comments, so they will not be seen by parsers that look only for AST events.

Syntax

Documentation definition

[80] documentation-definition ::= "documentation" definition-name "{" "}" ";"

Example

/// Documentation mapping definition. 
documentation Documentation {
    @ documentation += doclines wrapper g:DocumentationLine.text;
};

3.4. Operator Definition

This definition specifies an infix, prefix, or suffix operator. It is also possible to specify primary expressions using this construct.

The associativity is specified as it is done in Prolog.
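The meaning of the associativity specifiers can be sketched in Python. The sketch follows the usual Prolog reading, in which "y" admits an argument of the same or lesser precedence and "x" admits only an argument of strictly lesser precedence; that ETL follows Prolog exactly in this respect is an assumption here, and the function name `argument_limits` is hypothetical.

```python
def argument_limits(associativity, precedence):
    """Return (left_max, right_max) precedences for operator arguments.

    None means that the operator has no argument on that side.
    """
    left, sep, right = associativity.partition("f")
    if sep == "":
        raise ValueError(f"no 'f' in associativity: {associativity!r}")

    def limit(marker):
        if marker == "":
            return None               # no argument at this place
        if marker == "y":
            return precedence         # same or lesser precedence
        if marker == "x":
            return precedence - 1     # strictly lesser precedence
        raise ValueError(f"bad argument marker: {marker!r}")

    return limit(left), limit(right)
```

For instance, `argument_limits("yfx", 100)` gives `(100, 99)`, which makes an operator declared this way left associative: an expression of the same precedence is accepted on the left but not on the right.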

Syntax

Operator definition

[75] operator-definition ::= "op" "composite"? definition-name "(" ( "f" | "xf" | "fx" | "yf" | "fy" | "xfx" | "xfy" | "yfx" | "yfy" ) ( "," operator-precedence ( "," phrase::significant )? )? ")" "{" "}" ";"

Example

TBD

3.5. Fragment Definition

This definition specifies a reusable fragment that can be used in statement definitions and other fragment definitions.

Syntax

Fragment definition

[78] fragment-definition ::= "def" definition-name "{" "}" ";"

Example

/// String token definition
def String {
    string(quote="\"") | string(quote='\'');
};

/// This definition provides way of referencing other grammars.
def GrammarRef {
   {
      @ systemId = ref(String);
      % public {
         @ publicId = ref(String);
      }?;
   } | {
      % public {
         @ publicId = ref(String);
      };
   };    
};

4. General Syntax Expressions

ETL syntax expressions are evaluated with respect to some token stream. The expression might consume some tokens and yield some AST related events. Some expressions do both.

4.1. Token expressions

All token expressions consume a single token of the specified kind and yield token text as a value event.

Syntax

[81] syntax-wrapper ::= "wrapper" syntax-object-qualified-name "." syntax-property-name

Example

@ name += identifier wrapper g:DocumentationLine.text;

4.2. Sequences

Sequence expression causes syntax expressions to be executed in the specified order.

4.3. Keywords and Sequences

4.4. Modifiers

4.5.

4.6. Usage of fragments

Fragments are defined with the "def" statement. Note that a referenced fragment is included into the definition, so it is not possible to refer to fragments recursively.

A fragment reference expression has the same effect as if the text of the referenced fragment definition had been written in place of the fragment reference. The only difference is that the fragment definition uses the namespace declarations of the grammar where it is defined.
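The inclusion semantics, and the reason recursion cannot work, can be shown with a Python sketch. The data model is an assumption made for illustration: each fragment body is a list of parts, where a part is either a piece of syntax text or a `("ref", name)` tuple, and the function name `expand` is hypothetical. Namespace handling is left out.

```python
def expand(name, fragments, _active=()):
    """Return the body of fragment `name` with all references inlined."""
    if name in _active:
        chain = " -> ".join(_active + (name,))
        raise ValueError("recursive fragment reference: " + chain)
    result = []
    for part in fragments[name]:
        if isinstance(part, tuple) and part[0] == "ref":
            # Inline the referenced fragment in place of the reference.
            result.extend(expand(part[1], fragments, _active + (name,)))
        else:
            result.append(part)
    return result
```

A usage example with made-up fragment bodies:

```python
fragments = {
    "String": ["string"],
    "GrammarRef": ["@ systemId =", ("ref", "String")],
}
print(expand("GrammarRef", fragments))
```

Because every reference is replaced by the referenced text, a fragment that refers to itself, directly or through other fragments, would expand without end; the sketch reports this case as an error instead.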

Syntax

[82] fragment-ref ::= "ref" "(" definition-name ")"

Example

/// String token definition
def String {
    string(quote="\"") | string(quote='\'');
};

/// This definition provides way of referencing other grammars.
def GrammarRef {
   {
      @ systemId = ref(String);
      % public {
         @ publicId = ref(String);
      }?;
   } | {
      % public {
         @ publicId = ref(String);
      };
   };    
};
/// Grammar include
statement Include {
   ^ g:GrammarInclude {
      % include {
         ref(GrammarRef);
      };
   };
};

4.7. Syntax Operators

4.8. Operand Expressions

Operand expressions differ from most other kinds of syntax expressions. They neither consume tokens nor generate values at the place where they are defined. They are used to specify properties for the left and right arguments of operators. This syntax expression is allowed only in the top level object, and it should not be wrapped.

Currently it is possible to use these expressions anywhere in the top level object. However, future versions of the specification might limit the valid places to the first and last statements of the operator object.

Syntax

[83] syntax-operand ::= "right" | "left"

Example

/// Plus expression
op Plus(yfx,100,+) {
    ^ c:Plus { 
         @values += left; 
         @values += right; 
    };
};
/// Java-like conditional expression
op composite Cond(xfy, 1400) {
    ^ c:Cond { 
         @ condition = left;
         % ?; 
         @ thenPart = expression;       
         % :;
         @ elsePart = right;
    };         
};

4.9. Object expression and let statement

Syntax

[84]	syntax-object-name	`::=`	lexer::identifier
[85]	syntax-property-name	`::=`	lexer::identifier
[86]	syntax-object-qualified-name	`::=`	namespace-prefix ":" syntax-object-name

Example

4.10. Documentation lines

This expression consumes documentation lines and yields them as values. This expression may occur only in the context of a documentation definition.

Like other expressions that directly yield values, the doclines expression supports wrappers.

Syntax

[87]	syntax-doclines	`::=`	"doclines" syntax-wrapper
[81]	syntax-wrapper	`::=`	"wrapper" syntax-object-qualified-name "." syntax-property-name

Example

/// Documentation mapping definition. 
documentation Documentation {
    @ documentation += doclines wrapper g:DocumentationLine.text;
};

Appendix A. ETL 0.2.1 Grammar

This section provides the text of the ETL grammar language defined using the grammar language itself. It is expected that parsers will actually use this definition (possibly stripped of comments) during the parsing process. It is expected that a simplified bootstrap parser will read this grammar (for example, error recovery is not needed for such a parser, since it will only read a correct grammar), and then a compiled parser is used to parse all other grammars.

Example A.1. grammar-0_2_1.g.etl

     1: // Reference ETL Parser for Java
     2: // Copyright (c) 2000-2009 Constantine A Plotnikov
     3: //
     4: // Permission is hereby granted, free of charge, to any person 
     5: // obtaining a copy of this software and associated documentation 
     6: // files (the "Software"), to deal in the Software without restriction,
     7: // including without limitation the rights to use, copy, modify, merge, 
     8: // publish, distribute, sublicense, and/or sell copies of the Software, 
     9: // and to permit persons to whom the Software is furnished to do so, 
    10: // subject to the following conditions:
    11: //
    12: // The above copyright notice and this permission notice shall be 
    13: // included in all copies or substantial portions of the Software.
    14: // 
    15: // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
    16: // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
    17: // MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 
    18: // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
    19: // BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN 
    20: // ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 
    21: // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 
    22: // SOFTWARE. 
    23: doctype public "-//IDN etl.sf.net//ETL//Grammar 0.2.1";
    24: /// This is a grammar for 0.2.1 syntax defined using 0.2.1 syntax.
    25: ///
    26: /// This is a definition for the grammar language itself.
    27: /// This grammar is actually used for parsing other grammars. 
    28: /// The text of this specific grammar itself is parsed using bootstrap 
    29: /// parser, then the grammar is compiled using normal grammar compilation 
    30: /// path.
    31: ///
    32: /// The parsing model is AST building. The parser tries to match syntax 
    33: /// constructs and creates AST according to specified AST constructs. 
    34: /// AST is assumed to contain objects and properties. So AST is directly 
    35: /// mappable to object models like C#, JavaBeans, EMOF, MOF, and EMF.
    36: /// 
    37: /// Properties are identified by name and objects are identified by 
    38: /// namespace URI and name. The object identification idea is borrowed from XMI
    39: /// and it is even possible to generate XMI-file without prior knowledge of 
    40: /// metamodel.
    41: ///
    42: /// There are two kinds of syntax constructs: expressions and statements.
    43: ///
    44: /// The expression model is borrowed from Prolog, and operators are taken from
    45: /// Prolog almost as is. Each operator has precedence and associativity. 
    46: /// Associativity has format "AfA" where A can be "", "x", or "y". Blank 
    47: /// specifies that there is no argument at this place. The "y" matches an
    48: /// expression of the same or lesser precedence and the "x" matches an 
    49: /// expression of lesser precedence. For example, "+" and "-" from C are yfx 
    50: /// operators: x+y-z is parsed as (x+y)-z. The assignment operator from C is 
    51: /// an xfy operator: a=b=c is parsed as a=(b=c). A yfy operator accepts the same
    52: /// precedence on both sides. If .. is a yfy operator of the same level as "+",
    53: /// then a + b + c .. a + b + c will be parsed as ((a+b)+c)..((a+b)+c).
    54: ///
    55: /// Also "f" operators have special semantics, they appear on level 0 and 
    56: /// are primary operators. They have neither a left nor a right part.
    57: ///
    58: /// Operators can be simple or composite. Simple operators have just a token 
    59: /// specified. See the definition of "|" and "?" operators in this grammar.
    60: /// Composite operators allow more complex syntax constructs. Composite 
    61: /// operators are usually used to define primary level of the grammar. However 
    62: /// they can be used to specify non primary operators too. Java method 
    63: /// invocation and array access operators are examples of this. Composite 
    64: /// operator can use all syntax expressions used in statements.
    65: ///
    66: /// Statement defines content of segment returned from term parser. The 
    67: /// statement is defined using generic constructs like pattern, lists, choice, 
    68: /// and tokens.
    69: /// 
    70: /// @author const
    71: grammar net.sf.etl.grammars.Grammar {
    72: 	namespace default g = "http://etl.sf.net/etl/grammar/0.2.1";
    73: 	
    74: 	/// This abstract context contains definition used across this grammar. 
    75: 	context abstract Base {
    76: 		
    77: 		/// String token definition. Two types of string are understood 
    78: 		/// by the grammar language, and they have the same semantics.
    79: 		def String {
    80: 			string(quote="\"") | string(quote='\'');
    81: 		};
    82: 
    83: 		/// Documentation mapping definition. This mapping is used 
    84: 		/// by all statements in the grammar.
    85: 		documentation Documentation {
    86: 			@ documentation += doclines wrapper g:DocumentationLine.text;
    87: 		};		
    88: 
    89: 		/// Definition of object name expression. This reusable fragment
    90: 		/// is used in places where object name is required. The used prefix
    91: 		/// is defined by {@link #GrammarContent.Namespace} definition.
    92: 		def ObjectNameDef {
    93: 			^ g:ObjectName {
    94: 				@ prefix = identifier;
    95: 				% :; 
    96: 				@ name = identifier;
    97: 			};
    98: 		};
    99: 
   100: 		/// Definition for wrapper section fragment. This fragment is attached
   101: 		/// to syntax expressions that match and produce individual tokens. 
   102: 		/// The fragment specification causes this token to be wrapped
   103: 		/// into the specified object and property. 
   104: 		def WrapperDef {
   105: 			% wrapper {
   106: 				ref(WrapperObject);
   107: 			};
   108: 		};
   109: 
   110: 		/// Definition for wrapper specification fragment. Wrapper
   111: 		/// is usually attached to tokens. When token matches, its value
   112: 		/// is wrapped into specified object and property. 
   113: 		def WrapperObject {
   114: 			^ g:Wrapper {
   115: 				@ object = ref(ObjectNameDef);
   116: 				% .;
   117: 				@ property = identifier;
   118: 			};
   119: 		};
   120: 	};
   121: 
   122: 
   123: 	
   124: 	/// This is base mapping syntax context. Mapping context might
   125: 	/// contain blank statements and let statements.
   126: 	context abstract BaseMappingSyntax {
   127: 		include Base;
   128: 
   129: 		/// Let statement. It is used to define mapping from syntax 
   130: 		/// to property of the object. The statement matches expression
   131: 		/// after "=" or "+=" and yields property assignment. All objects 
   132: 		/// or values that are encountered in the property are assigned 
   133: 		/// to property of top object specified by name property.
   134: 		///
   135: 		/// The "+=" version assumes list property, the "=" version assumes 
   136: 		/// property with upper multiplicity equal to 1.
   137: 		/// {@example #CompositeOperatorSyntax
   138: 		///   @ name = identifier; // match single identifier
   139: 		///   // match non-empty sequence of numbers separated by comma
   140: 		///   @ numbers += list , { integer | float;}; 
   141: 		/// }
   142: 		statement Let {
   143: 			% @;
   144: 			@ name = identifier;
   145: 			@ operator = token(+=) | token(=);
   146: 			@ expression = expression;
   147: 		};
   148: 		
   149: 		/// This is blank statement. It is used to attach attributes 
   150: 		/// and documentation comments. It may be used, for example,
   151: 		/// for attaching annotations after the last statement.
   152: 		statement BlankSyntaxStatement {
   153: 		};
   154: 	};
   155: 
   156: 	/// This is base syntax context. This context might contain object
   157: 	/// specification in addition to let statement.
   158: 	context abstract BaseSyntax {
   159: 		include BaseMappingSyntax;
   160: 
   161: 		/// Utility definition used in different parts of the syntax
   162: 		/// It allows to specify a block consisting of syntax statements.
   163: 		/// It matches the specified statements in the sequence. 
   164: 		/// {@example #CompositeSyntax
   165: 		///   {% let; @name = identifier; % =; @value = expression;};
   166: 		/// }
   167: 		def SequenceDef  {
   168: 			^ g:Sequence {
   169: 				@ syntax += block;
   170: 			};
   171: 		};
   172: 
   173: 		/// Object expression. It is used to specify context of parsing. 
   174: 		/// 
   175: 		/// The expression matches its content and creates a context object.
   176: 		/// All properties that are directly or indirectly specified in the 
   177: 		/// content are assumed to be specified in context of this object
   178: 		/// unless a new object directive is encountered.
   179: 		///
   180: 		/// It is an error to specify value or object generators inside object
   181: 		/// without property layer.
   182: 		/// {@example #CompositeSyntax
   183: 		///   ^ t:Ref {% ref % (; @name = identifier; % );};
   184: 		/// }
   185: 		op composite ObjectOp(f) {
   186: 			% ^;
   187: 			@ name = ref(ObjectNameDef);
   188: 			@ syntax = ref(SequenceDef);
   189: 		};
   190: 	};
   191: 
   192: 
   193: 	/// This is base composite syntax context. 
   194: 	context abstract BaseCompositeSyntax {
   195: 		include BaseSyntax;
   196: 		
   197: 		/// Expression statement. This statement is a container for expression.
   198: 		/// The statement has the same semantics as expression contained in it.
   199: 		statement ExpressionStatement {
   200: 			@ syntax = expression;
   201: 		};
   202: 	};
   203: 
   204: 
   205: 	/// This syntax is used inside documentation statement. 
   206: 	context DocumentationSyntax {
   207: 		include BaseMappingSyntax;
   208: 		/// This is doclines expression. It matches sequence of documentation lines.
   209: 		/// {@example #DocumentationSyntax;
   210: 		///   @ documentation += doclines wrapper xj:DocumentationLine.text;
   211: 		/// };
   212: 		op composite DoclinesOp(f) {
   213: 			% doclines;
   214: 			@ wrapper = ref(WrapperDef)?;
   215: 		};
   216: 	};	
   217: 
   218: 	/// This context specifies syntax for simple operations
   219: 	context SimpleOpSyntax {
   220: 		include BaseCompositeSyntax;
   221: 		
   222: 		/// This expression matches left operand in expression. It is used
   223: 		/// in let expression to specify property to which left operand 
   224: 		/// of operator should be assigned.
   225: 		/// {@example #ContextContent
   226: 		///   op Minus(yfx, 500, -) {
   227: 		///     @ minuend = left;
   228: 		///     @ subtrahend = right;
   229: 		///   };
   230: 		/// };
   231: 		op composite Left(f) {
   232: 			^ g:OperandOp {
   233: 				@ position = token(left);
   234: 			};
   235: 		};
   236: 		
   237: 		/// This expression matches right operand in expression. It is used
   238: 		/// in let expression to specify property to which right operand 
   239: 		/// of operator should be assigned.
   240: 		op composite Right(f) {
   241: 			^ g:OperandOp {
   242: 				@ position = token(right);
   243: 			};
   244: 		};
   245: 	};
   246: 	
   247: 	/// This context contains definition of primitive syntax operators
   248: 	context abstract CompositeOperatorsSyntax {
   249: 		include BaseCompositeSyntax;
   250: 		
   251: 		/// Choice operator. It matches one of two alternatives. It is 
   252: 		/// an error if both alternatives match an empty sequence 
   253: 		/// or might start with the same token. Note that it is not an error
   254: 		/// if one alternative starts with a generic token kind (for example
   255: 		/// a string quoted with double quotes) and another one starts with 
   256: 		/// a specific token like token "my string".
   257: 		op ChoiceOp(xfy,300,|) {
   258: 			@ options += left; @ options += right;
   259: 		};
   260: 
   261: 		/// First choice operator. It tries to match the first 
   262: 		/// alternative, then the second one. This operator never
   263: 		/// produces conflicts even if the second alternative matches
   264: 		/// the first one.
   265: 		op FirstChoiceOp(xfy,200,/) {
   266: 			@ first = left; @ second = right;
   267: 		};
   268: 
   269: 		/// This operator matches empty sequence of tokens or its operand.
   270: 		op OptionalOp(yf,100,?) {
   271: 			@ syntax = left;
   272: 		};
   273: 		
   274: 		/// This operation matches non empty sequence of specified operand.
   275: 		op OneOrMoreOp(yf,100,+) {
   276: 			@ syntax = left;
   277: 		};
   278: 
   279: 		/// This operation is composition of optional and one or more operators.
   280: 		op ZeroOrMoreOp(yf,100,*) {
   281: 			@ syntax = left;
   282: 		};
   283: 
   284: 	};
   285: 	
   286: 	/// This context defines expressions that might happen in context 
   287: 	/// of modifiers expressions.
   288: 	context ModifiersSyntax {
   289: 		include BaseMappingSyntax;
   290: 
   291: 		/// This is modifier specification. It can contain optional wrapper.
   292: 		op composite ModifierOp(f) {
   293: 			% modifier;
   294: 			@ value = token;
   295: 			@ wrapper = ref(WrapperDef)?; 
   296: 		};
   297: 	};
   298: 	
   299: 	/// Free form composite syntax
   300: 	context CompositeSyntax {
   301: 		include CompositeOperatorsSyntax;
   302: 
   303: 		/// A keyword definition statement. It could happen only
   304: 		/// as part of {@link #PatternOp}
   305: 		def KeywordStmtDef {
   306: 			^ g:KeywordStatement {
   307: 				% % {
   308: 					@ text = token;
   309: 				};
   310: 			};
   311: 		};
   312: 		
   313: 		
   314: 		/// This is a sequence of keywords and blocks separated by white spaces. 
   315: 		/// It is used to define literal syntax patterns in the grammar. Keywords
   316: 		/// are just parsed and are not reported to the parser. The content of 
   317: 		/// each block is a sequence of syntax expressions and it is passed through
   318: 		/// to the root sequence. Note that two blocks must be separated by one
   319: 		/// or more keywords.
   320: 		op composite PatternOp(f) {
   321: 			^ g:Sequence {
   322: 				@ syntax += {
   323: 					{ 
   324: 					  ref(KeywordStmtDef); 
   325: 					  block?;
   326: 					}+ | {
   327: 						block;
   328: 						{ 
   329: 					  		ref(KeywordStmtDef); 
   330: 					  		block?;
   331: 						}*; 
   332: 					};
   333: 				};
   334: 			};
   335: 		};
   336: 
   337: 		/// Reference to definition in this context or in included context.
   338: 		/// The expression is replaced with content of original definition.
   339: 		/// Recursion is not allowed to be created using references.
   340: 		op composite RefOp(f) {
   341: 			% ref % ( {
   342: 				@ name = identifier;
   343: 			} % ) ;
   344: 		};
   345: 
   346: 		/// Block reference. The statement matches a block that contains
   347: 		/// statements of the specified context. If no context is specified,
   348: 		/// reference to current context is assumed. Block produces a possibly
   349: 		/// empty sequence of objects, so it should happen in context 
   350: 		/// of a list property.
   351: 		op composite BlockRef(f) {
   352: 			% block; 
   353: 			% ( {
   354: 			 @ context = identifier ;
   355: 			} % ) ?; 
   356: 		};
   357: 		
   358: 		/// This is reusable fragment used to specify expression precedence
   359: 		def ExpressionPrecedenceDef {
   360: 			% precedence % = {
   361: 				@ precedence = integer;
   362: 			};
   363: 		};
   364: 		/// Expression reference. This reference matches expression from 
   365: 		/// specified context and of specified precedence. If context is omitted,
   366: 		/// current context is assumed. The expression production always 
   367: 		/// produces a single object as result if parsing is successful.
   368: 		op composite ExpressionRef(f) {
   369: 			% expression;
   370: 			% ( {
   371: 				{
   372: 					ref(ExpressionPrecedenceDef);
   373: 				} | {	
   374: 					@ context = identifier;
   375: 					% , {
   376: 						ref(ExpressionPrecedenceDef);
   377: 					}?;
   378: 				};	
   379: 			} % ) ?;
   380: 		};
   381: 
   382: 
   383: 		/// This construct matches sequence separated by the specified 
   384: 		/// separator. This construct is just useful shortcut. The separator 
   385: 		/// can be any specific token. The expression 
   386: 		/// {@example #CompositeSyntax
   387: 		/// list , { 
   388: 		///   ref(Something);
   389: 		/// }; 
   390: 		/// }
   391: 		/// is equivalent to 
   392: 		/// {@example #CompositeSyntax
   393: 		/// {
   394: 		///    ref(Something); 
   395: 		///    % , {
   396: 		///       ref(Something); 
   397: 		///    }*;
   398: 		/// };
   399: 		/// }
   400: 		op composite ListOp(f) {
   401: 			% list {
   402: 				@ separator = token;
   403: 				@ syntax = ref(SequenceDef);
   404: 			};
   405: 		};
   406: 
   407: 		
   408: 		/// This construct matches set of modifiers. This construct
   409: 		/// matches any number of modifiers in any order. Each modifier
   410: 		/// matches and produces its text as a value. Wrapper specified 
   411: 		/// for modifiers construct applies to all modifiers inside it
   412: 		/// unless overridden by modifier.
   413: 		op composite ModifiersOp(f) {
   414: 			% modifiers;
   415: 			@ wrapper = ref(WrapperDef)?; 
   416: 			@ modifiers += block(ModifiersSyntax);
   417: 		};
   418: 		
   419: 		/// This construct matches any token or token specified in brackets.
   420: 		/// It produces a value of its text. If no token is specified, 
   421: 		/// the construct matches any significant token with exception of 
   422: 		/// documentation comment. See this grammar for numerous examples of its
   423: 		/// usage (including this definition).
   424: 		///
   425: 		/// Optional wrapper causes wrapping value produced by this expression
   426: 		/// into specified wrapper.
   427: 		op composite TokenOp(f) {
   428: 			% token {
   429: 				% ( {
   430: 					@ value = token;
   431: 				} % ) ?;
   432: 			};
   433: 			@ wrapper = ref(WrapperDef)?; 
   434: 		};
   435: 		/// This operator matches string with specified quote kind.
   436: 		/// The quote must be specified. The operator produces matched text
   437: 		/// as a value.
   438: 		///
   439: 		/// The operator optionally supports prefixed and multiline strings.
   440: 		/// Only strings that match the specific prefix could be specified.
   441: 		///
   442: 		/// Optional wrapper causes wrapping value produced by this expression
   443: 		/// into specified wrapper.
   444: 		op composite StringOp(f) {
   445: 			% string % ( {
   446: 				% prefix % = {
   447: 					@ prefix += list | { 
   448: 						identifier;
   449: 					};	 
   450: 				} % , ?;
   451: 				% quote % = {
   452: 					@ quote = ref(String);
   453: 				};	
   454: 				% , % multiline % = {
   455: 					@ multiline = token(true); 
   456: 				}?;
   457: 			} % );
   458: 			@ wrapper = ref(WrapperDef)?; 
   459: 		};
   460: 		
   461: 		/// This operator matches any identifier. The operator produces matched text
   462: 		/// as a value.
   463: 		///
   464: 		/// Optional wrapper causes wrapping value produced by this expression
   465: 		/// into specified wrapper.
   466: 		op composite IdentifierOp(f) {
   467: 			% identifier;
   468: 			@ wrapper = ref(WrapperDef)?; 
   469: 		};
   470: 
   471: 
   472: 		/// This operator matches integer without suffix or with specified suffix
   473: 		/// The operator produces matched text as a value.
   474: 		///
   475: 		/// Optional wrapper causes wrapping value produced by this expression
   476: 		/// into specified wrapper.
   477: 		op composite IntegerOp(f) {
   478: 			% integer {
   479: 				% ( {
   480: 					% suffix % = {
   481: 						@ suffix += list | { 
   482: 							identifier;
   483: 						};
   484: 					}?;
   485: 				} % ) ?;
   486: 				@ wrapper = ref(WrapperDef)?; 
   487: 			};
   488: 		};
   489: 
   490: 
   491: 		/// This operator matches float without suffix or with specified suffix.
   492: 		/// The operator produces matched text as a value.
   493: 		///
   494: 		/// Optional wrapper causes wrapping value produced by this expression
   495: 		/// into specified wrapper.
   496: 		op composite FloatOp(f) {
   497: 			% float;
   498: 			% ( {
   499: 				% suffix % = {
   500: 					@ suffix += list | { 
   501: 						identifier;
   502: 					};
   503: 				}? ;
   504: 			} % ) ?;
   505: 			@ wrapper = ref(WrapperDef)?; 
   506: 		};
   507: 
   508: 
   509: 		/// This operator matches any graphics token.
   510: 		/// The operator produces matched text as a value.
   511: 		///
   512: 		/// Optional wrapper causes wrapping value produced by this expression
   513: 		/// into specified wrapper.
   514: 		op composite GraphicsOp(f) {
   515: 			% graphics;
   516: 			@ wrapper = ref(WrapperDef)?; 
   517: 		};
   518: 
   519: 	};
   520: 	
   521: 	/// Composite operator syntax. 
   522: 	/// Note that this definition is oversimplified. There is an additional 
   523: 	/// constraint that "left" and "right" expressions might happen only on top 
   524: 	/// level. The construct will possibly be adjusted later. 
   525: 	context CompositeOpSyntax {
   526: 		include SimpleOpSyntax;
   527: 		include CompositeSyntax;
   528: 	};
   529: 	
   530: 	/// This context defines content of context statement. So it defines itself.
   531: 	context ContextContent {
   532: 		include Base;
   533: 
   534: 		/// This is blank statement. It is used to attach attributes 
   535: 		/// and documentation comments.
   536: 		statement BlankContextStatement {
   537: 		};
   538: 
   539: 
   540: 		/// Operator associativity definition. It matches any valid 
   541: 		/// associativity.
   542: 		def OpAssociativity {
   543: 			token(f) | token(xf) | token(yf) |token(xfy) |
   544: 			token(xfx) |token(yfx) |token(fx) | token(fy) | token(yfy);
   545: 		};
   546: 
   547: 
   548: 		/// Operator definition. There are two kinds of operators: simple
   549: 		/// and composite.
   550: 		///
   551: 		/// If the operator definition does not contain a single object creation
   552: 		/// expression it is assumed to have a content wrapped in the object
   553: 		/// creation expression with default namespace and operator name as an
   554: 		/// object name.
   555: 		statement OperatorDefinition {
   556: 			% op;
   557: 			modifiers wrapper g:Modifier.value {
   558: 				@ isComposite = modifier composite;
   559: 			};
   560: 			@ name = identifier;
   561: 			% ( {
   562: 				@ associativity = ref(OpAssociativity);
   563: 				% , {
   564: 					@ precedence = integer;
   565: 				    % , {
   566: 						@ text = token;
   567: 					} % ) {
   568: 						@ syntax += block(SimpleOpSyntax);
   569: 					} | % ) {
   570: 						@ syntax += block(CompositeOpSyntax);
   571: 					};	
   572: 				} | % ) {
   573: 					@ syntax += block(CompositeOpSyntax);
   574: 				};
   575: 			};
   576: 		};
   577: 
   578: 		/// Attributes definition. Attributes can be applied only to
   579: 		/// statements. To apply them to expressions, define a composite
   580: 		/// operator that uses the same syntax. Such operator and attributes
   581: 		/// declaration can share syntax through def statement.
   582: 		statement Attributes {
   583: 			% attributes;
   584: 			@ name = identifier;
   585: 			@ syntax += block(CompositeSyntax);
   586: 		};
   587: 		
   588: 		/// Statement definition. Statement attempts to match entire segment.
   589: 		/// If statement matches part of segment and there are some 
   590: 		/// unmatched significant tokens left, it is a syntax error.
   591: 		///
   592: 		/// If the statement definition does not contain a single object creation
   593: 		/// expression it is assumed to have a content wrapped in the object
   594: 		/// creation expression with default namespace and statement name as an
   595: 		/// object name.
   596: 		statement Statement {
   597: 			% statement;
   598: 			@ name = identifier;
   599: 			@ syntax += block(CompositeSyntax);
   600: 		};
   601: 
   602: 		/// Documentation syntax. It matches documentation comments before
   603: 		/// start of grammar. The definition is used to specify property
   604: 		/// where documentation is put.
   605: 		statement DocumentationSyntax {
   606: 			% documentation;
   607: 			@ name = identifier;
   608: 			@ syntax += block(DocumentationSyntax);
   609: 		};
   610: 
   611: 		/// A fragment definition. It is used to define reusable parts of the 
   612: 		/// syntax. References to definitions are replaced with content of the 
   613: 		/// definition, so it is an error for definition to refer to itself 
   614: 		/// through ref construct.
   615: 		statement Def {
   616: 			% def;
   617: 			@ name = identifier;
   618: 			@ syntax += block(CompositeSyntax);
   619: 		};
   620: 		
   621: 		/// Include operation causes all definitions except redefined ones
   622: 		/// to be included in this context. It is an error if two definitions 
   623: 		/// are available using different paths. If a wrapper chain is specified,
   624: 		/// the statements will be wrapped into the specified chain.
   625: 		statement ContextInclude {
   626: 			% include;
   627: 			@ contextName = identifier;
   628: 			@ wrappers += % wrapper {
   629: 				list / {
   630: 					ref(WrapperObject);
   631: 				};
   632: 			}?;
   633: 		};
   634: 
   635: 		/// Import operation makes context referenceable from this context or 
   636: 		/// allows redefinition of context reference.
   637: 		statement ContextImport {
   638: 			% import;
   639: 			@ localName = identifier;
   640: 			% = {
   641: 				@ contextName = identifier;
   642: 				% from {
   643: 					@ grammarName = identifier;
   644: 				}?;
   645: 			};
   646: 		};
   647: 	};
   648: 	
   649: 	/// This context defines grammar content.
   650: 	context GrammarContent {
   651: 		include Base;
   652: 		
   653: 		/// This definition provides way of referencing other grammars.
   654: 		def GrammarRef {
   655: 			{
   656: 				@ systemId = ref(String);
   657: 				% public {
   658: 					@ publicId = ref(String);
   659: 				}?;
   660: 			} | {
   661: 				% public {
   662: 					@ publicId = ref(String);
   663: 				};
   664: 			};				
   665: 		};
   666: 
   667: 		/// This is blank statement. It is used to attach attributes 
   668: 		/// and documentation comments.
   669: 		statement BlankGrammarStatement {
   670: 		};
   671: 
   672: 		
   673: 		/// This is an include statement. Include causes all context from 
   674: 		/// included grammar to be added to current grammar. The definitions
   675: 		/// from grammar include are added only if current grammar does not 
   676: 		/// have definitions with the same name.
   677: 		///
   678: 		/// Grammar imports and context imports also follow this inclusion rule.
   679: 		/// It is an error to include two different non-shadowed definitions by
   680: 		/// different include paths.
   681: 		statement GrammarInclude {
   682: 			% include;
   683: 			ref(GrammarRef);
   684: 		};
   685: 
   686: 		/// This is grammar import statement. A statement allows contexts of this
   687: 		/// grammar to import context from specified grammar.
   688: 		statement GrammarImport {
   689: 			% import {
   690: 				@ name = identifier;
   691: 			} % = {
   692: 				ref(GrammarRef);
   693: 			};	
   694: 		};
   695: 
   696: 
   697: 		/// Namespace declaration is used to declare namespace prefix. The 
   698: 		/// prefix declaration is local to grammar and is not inherited in 
   699: 		/// the case of grammar include. 
   700: 		///
   701: 		/// The namespace can have a default modifier. This namespace will 
   702: 		/// be used along with operator or statement name in case when 
   703: 		/// there are several children in the definition or when the only 
   704: 		/// child is not an object creation expression. 
   705: 		statement Namespace {
   706: 			% namespace;
   707: 			modifiers wrapper g:Modifier.value {
   708: 				@ defaultModifier = modifier default;
   709: 			};
   710: 			@ prefix = identifier;
   711: 			% = ;
   712: 			@ uri = ref(String);
   713: 		};
   714: 
   715: 		/// Context definition. This definition is used to define context.
   716: 		/// Context may be default and abstract. Abstract contexts 
   717: 		/// cannot be used for parsing and are used only in context include.
   718: 		/// Abstract contexts may be imported only by abstract contexts.
   719: 		///
   720: 		/// Default context is a context that is used to parse source when
   721: 		/// no context is specified in doctype. 
   722: 		statement Context {
   723: 			% context;
   724: 			modifiers wrapper g:Modifier.value {
   725: 				@ abstractModifier = modifier abstract;
   726: 				@ defaultModifier = modifier default;
   727: 			};
   728: 			@ name = identifier;
   729: 			@ content += block(ContextContent);
   730: 		};
   731: 	};
   732: 	
   733: 	/// This context contains definition of grammar construct itself
   734: 	context default GrammarSource {
   735: 		include Base;
   736: 
   737: 		/// This is blank statement. It is used to attach attributes 
   738: 		/// and documentation comments. It is ignored during grammar
   739: 		/// compilation.
   740: 		statement BlankTopLevel {
   741: 		};
   742: 
   743: 		/// Grammar statement. It defines grammar. Grammar name is purely 
   744: 		/// informative and is used in reported events to identify grammar
   745: 		/// by logical name rather than by URI that happens to be current grammar
   746: 		/// location.
   747: 		///
   748: 		/// Grammar can be abstract; in that case it cannot be instantiated
   749: 		/// and referenced from doctype. It can be only included into other 
   750: 		/// grammars.
   751: 		statement Grammar {
   752: 			% grammar;
   753: 			modifiers wrapper g:Modifier.value {
   754: 				@ abstractModifier = modifier abstract;
   755: 			};
   756: 			@ name += list . {identifier;};
   757: 			@ content += block(GrammarContent);
   758: 		};
   759: 	};
   760: };
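
The header comment of the grammar above describes Prolog-style associativity ("x" and "y" operand slots). The following Python sketch shows one way such rules can drive an expression parser. It is not taken from the reference parser: the operator table, the upper precedence bound of 1200, the tuple-based tree, and all function names are assumptions chosen for this illustration, and only infix operators over identifiers are handled.

```python
import re

# Hypothetical operator table: token -> (precedence, associativity).
OPERATORS = {
    "+": (500, "yfx"),
    "-": (500, "yfx"),
    "..": (500, "yfy"),
    "=": (700, "xfy"),
}
MAX_PRECEDENCE = 1200  # assumed upper bound, as commonly used in Prolog


def tokenize(text):
    """Split text into identifiers and the operator tokens listed above."""
    return re.findall(r"[A-Za-z_]\w*|\.\.|[-+=]", text)


def parse(tokens, max_prec=MAX_PRECEDENCE):
    """Parse from the front of tokens; return (tree, precedence of tree)."""
    if not tokens:
        raise ValueError("operand expected")
    left, left_prec = tokens.pop(0), 0  # primary operands are on level 0
    while tokens and tokens[0] in OPERATORS:
        prec, assoc = OPERATORS[tokens[0]]
        # "y" accepts the same or lesser precedence, "x" only lesser.
        left_limit = prec if assoc[0] == "y" else prec - 1
        right_limit = prec if assoc[-1] == "y" else prec - 1
        if prec > max_prec or left_prec > left_limit:
            break
        op = tokens.pop(0)
        right, _ = parse(tokens, right_limit)
        left, left_prec = (op, left, right), prec
    return left, left_prec


def show(tree):
    """Render a tree with explicit parentheses around every operator."""
    if isinstance(tree, str):
        return tree
    op, left, right = tree
    return "(" + show(left) + op + show(right) + ")"


def parenthesize(text):
    tokens = tokenize(text)
    tree, _ = parse(tokens)
    if tokens:
        raise ValueError("unexpected token: " + tokens[0])
    return show(tree)
```

With this table, `parenthesize("x+y-z")` groups to the left, `parenthesize("a=b=c")` groups to the right, and the yfy operator `..` lets a chain of `+` appear on both of its sides without parentheses.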

Appendix B. Document Type Grammar

This section provides the grammar for the doctype directive. Parsers normally hard-code this grammar, but they should still generate the events that the grammar in this section specifies.

Example B.1. doctype.g.etl

     1: // Reference ETL Parser for Java
     2: // Copyright (c) 2000-2007 Constantine A Plotnikov
     3: //
     4: // Permission is hereby granted, free of charge, to any person 
     5: // obtaining a copy of this software and associated documentation 
     6: // files (the "Software"), to deal in the Software without restriction,
     7: // including without limitation the rights to use, copy, modify, merge, 
     8: // publish, distribute, sublicense, and/or sell copies of the Software, 
     9: // and to permit persons to whom the Software is furnished to do so, 
    10: // subject to the following conditions:
    11: //
    12: // The above copyright notice and this permission notice shall be 
    13: // included in all copies or substantial portions of the Software.
    14: // 
    15: // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
    16: // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
    17: // MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 
    18: // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
    19: // BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN 
    20: // ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 
    21: // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 
    22: // SOFTWARE. 
    23: doctype public "-//IDN etl.sf.net//ETL//Grammar 0.2.1";
    24: 
    25: /// This is a grammar for doctype declaration. The doctype can be encountered 
    26: /// as the first statement in the source code of ETL-based language. This 
    27: /// grammar is hard-coded in the parser for obvious reasons. So this file is 
    28: /// for information only.
    29: ///
    30: /// Note that this grammar does not support documentation comments. The mapping
    31: /// for these comments differs between different contexts and there is no 
    32: /// universal mapping that is suitable for all.
    33: ///
    34: /// <author>const</author>
    35: grammar net.sf.etl.grammars.DoctypeDeclaration {
    36: 	namespace dc = "http://etl.sf.net/etl/doctype/0.2.1";
    37: 	
    38: 	/// This is the only context in the grammar	
    39: 	context default DoctypeContext {
    40: 		
    41: 		/// A definition for string used in the grammar. Two kinds of string are allowed. 
    42: 		/// <example>
    43: 		///		'aaa'
    44: 		///		"aaa"
    45: 		/// </example>
    46: 		def String {
    47: 			string(quote="\"") | string(quote='\'');
    48: 		};
    49: 
    50: 		/// A doctype statement that declares grammar associated 
    51: 		/// with the file. The doctype statement is an obvious rip-off of XML doctype. 
    52: 		/// Inline grammar is not supported yet. 
    53: 		///
    54: 		/// System identifier or public identifier or both might be used. 
    55: 		/// <example>
    56: 		/// 	doctype public "-//IDN etl.sf.net/ETL/Grammar 0.2";
    57: 		/// 	doctype "http://etl.sf.net/2005/etl/grammar.g.etl" public '-//IDN etl.sf.net/ETL/Grammar 0.2';
    58: 		/// 	doctype 'mygrammar.g.etl';
    59: 		/// </example>
    60: 		statement DoctypeStatement {
    61: 			^ dc:DoctypeDeclaration {
    62: 				% doctype {
    63: 					{
    64: 						@ systemId = ref(String);
    65: 						% public {
    66: 							@ publicId = ref(String);
    67: 						}?;
    68: 					} | {
    69: 						% public {
    70: 							@ publicId = ref(String);
    71: 						};
    72: 					};
    73: 					% context {
    74: 						@ context = ref(String);
    75: 					}?;
    76: 				};
    77: 			};
    78: 		};
    79: 	};
    80: };
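
The shape of the doctype statement defined above (a system identifier with an optional public identifier, or a public identifier alone, followed by an optional context) can be illustrated with a small Python sketch. This is not the reference parser: the function name `parse_doctype`, the regular-expression approach, the simplified string handling (no escape sequences), and the stripping of quotes from the values are all assumptions made for illustration.

```python
import re

# Simplified string token: single- or double-quoted, no escape handling (assumption).
_STRING = r"""(?:"[^"]*"|'[^']*')"""

# Mirrors the alternatives of DoctypeStatement:
#   doctype ( systemId [public publicId] | public publicId ) [context context] ;
_DOCTYPE = re.compile(
    rf"""^\s*doctype\b\s*
        (?:
            (?P<system>{_STRING})(?:\s*public\b\s*(?P<public1>{_STRING}))?
          |
            public\b\s*(?P<public2>{_STRING})
        )
        (?:\s*context\b\s*(?P<context>{_STRING}))?
        \s*;\s*$""",
    re.VERBOSE,
)


def _unquote(text):
    """Drop the surrounding quotes of a matched string, if there is a match."""
    return text[1:-1] if text is not None else None


def parse_doctype(line):
    """Return the properties of a doctype statement, or None if it does not match."""
    match = _DOCTYPE.match(line)
    if match is None:
        return None
    return {
        "systemId": _unquote(match.group("system")),
        "publicId": _unquote(match.group("public1") or match.group("public2")),
        "context": _unquote(match.group("context")),
    }
```

For instance, `parse_doctype("doctype 'mygrammar.g.etl';")` yields a system identifier only, while a statement that has neither identifier, such as `doctype;`, is rejected.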

Appendix C. Default Grammar

The grammar specified in this appendix should be used to parse a source when the grammar for that source is not available or was compiled with errors. The default grammar is able to parse any source without syntax errors.
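
The fallback rule can be sketched in Python as follows. The conditions are taken from the documentation comment of default.g.etl; everything else is an assumption for illustration: the function `choose_grammar`, its parameters, the use of `OSError` and `ValueError` to signal IO and validity failures, and the decision to resolve a parser-level default through the same lookup as a doctype reference.

```python
DEFAULT_GRAMMAR = "net.sf.etl.grammars.DefaultGrammar"


def choose_grammar(doctype, parser_default, locate, compile_grammar):
    """Return the grammar to use for a source, falling back to the default grammar.

    All parameters are hypothetical:
      doctype         -- grammar reference from the doctype statement,
                         or None if the statement is missing or malformed
      parser_default  -- grammar reference configured on the parser, or None
      locate          -- callable mapping a reference to a location, or None
      compile_grammar -- callable returning a compiled grammar; raises OSError
                         on IO failure and ValueError for an invalid grammar
    """
    reference = doctype if doctype is not None else parser_default
    if reference is None:
        # No usable doctype and no default grammar specified for the parser.
        return DEFAULT_GRAMMAR
    location = locate(reference)
    if location is None:
        # The referenced grammar cannot be located.
        return DEFAULT_GRAMMAR
    try:
        return compile_grammar(location)
    except (OSError, ValueError):
        # The grammar was located but could not be read or is invalid.
        return DEFAULT_GRAMMAR
```

Because the default grammar is hard-coded, the sketch returns it by name rather than compiling it.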

Example C.1. default.g.etl

// Reference ETL Parser for Java
// Copyright (c) 2000-2009 Constantine A Plotnikov
//
// Permission is hereby granted, free of charge, to any person
// obtaining a copy of this software and associated documentation
// files (the "Software"), to deal in the Software without restriction,
// including without limitation the rights to use, copy, modify, merge,
// publish, distribute, sublicense, and/or sell copies of the Software,
// and to permit persons to whom the Software is furnished to do so,
// subject to the following conditions:
//
// The above copyright notice and this permission notice shall be
// included in all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
// EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
// MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
// BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
// ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
// CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
doctype public "-//IDN etl.sf.net//ETL//Grammar 0.2.1";

/// This is a default grammar. It is used if one of the following happened:
/// <ul>
///   <li>Doctype directive is missing or it has invalid syntax and default
///       grammar is not specified for parser.</li>
///   <li>Grammar referenced by doctype statement cannot be located.</li>
///   <li>Grammar is located but failed to be parsed because of IO error
///       or it is invalid (some syntax or semantic errors).</li>
/// </ul>
///
/// Note that this grammar is hard-coded and it is provided here just for
/// informational purposes.
///
/// <author>const</author>
grammar net.sf.etl.grammars.DefaultGrammar {
	namespace default d = "http://etl.sf.net/etl/default/0.2.1";

	/// The only context in this grammar
	context default DefaultContext {

		/// Documentation mapping definition.
		documentation DefaultDocumentation {
			@ documentation += doclines wrapper d:DefaultDocumentationLine.text;
		};

		/// Default statement that matches anything
		statement DefaultStatement {
			@ content += {
				{
					^ d:DefaultBlock { @ content += block; };
				} | {
					^ d:DefaultTokens { @ values += token+; };
				};
			}*;
		};
	};
};
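
The effect of DefaultStatement can be pictured with a short Python sketch: each run of consecutive tokens becomes one DefaultTokens object, and each nested block becomes one DefaultBlock object whose content is parsed with the same statement. The input model (a token is a `str`, a block is a list of segments), the dictionary output, and the function name `default_statement` are assumptions for illustration, not the data structures of the reference parser.

```python
def default_statement(segment):
    """Map one segment (a list of tokens and nested blocks) to a DefaultStatement.

    Hypothetical input model: a token is a str; a block is a list of segments.
    """
    content = []
    run = []  # current run of consecutive tokens (the `token+` part)
    for item in segment:
        if isinstance(item, str):
            run.append(item)
            continue
        if run:
            # A block ends the current token run.
            content.append({"type": "DefaultTokens", "values": run})
            run = []
        content.append({
            "type": "DefaultBlock",
            # Each segment of the block is mapped by the same statement.
            "content": [default_statement(inner) for inner in item],
        })
    if run:
        content.append({"type": "DefaultTokens", "values": run})
    return {"type": "DefaultStatement", "content": content}
```

Since the sketch accepts any mixture of tokens and blocks, including an empty segment, it reflects why this grammar never reports a syntax error at the term level.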

Glossary

Extensible Term Language: This is a temporary name for the meta-language being defined by this specification.