Extensible Term Language Specification

Constantine Plotnikov

Revision History
Revision 0.2  2006-02-05

This is the first draft of the specification. ETL now uses itself to define its own grammar.


Table of Contents

1. Introduction
2. Overview
3. Lexical Level
Overview
White Spaces and New Lines
Comments
Brackets
Strings
Numbers
Identifiers
Graphics
Other
4. Phrase Level
5. Term Layer
Overview
Grammar Organization
General Syntax Expression
6. Non-normative: Parser architectures
Kinds of Parsers
Error Recovery
Glossary

Chapter 1. Introduction

This specification describes the syntax and semantics of the ETL meta-language. The language is intended to provide a framework for the creation of human-readable and human-writable languages that can be extended and combined together.

Some of these goals have already been achieved by existing languages, but there is not yet a standard way to achieve all of them together.

The specification contains few new ideas; it mostly borrows ideas from other languages and integrates them. Most of the features presented here have appeared in one form or another in other languages. Among the languages that have influenced this specification are:

  • Ada

  • C, C++, Java, C#

  • Dylan

  • E

  • Eiffel

  • Prolog

  • Python

  • Scheme and Lisp

  • TeX

  • XML and SGML

Other languages have influenced ETL more indirectly or less noticeably, but it is still possible to find their traces in the language design.

Chapter 2. Overview

The language consists of three layers: the lexical, phrase, and term (syntax) layers. Each layer delimits the output of the underlying layer and annotates its objects.

Character Level

This is a primitive layer provided by the underlying runtime. It produces a sequence of characters from some data source.

Lexical Level

On this level the character stream is translated into a sequence of tokens.

Phrase Level

On this level the stream of tokens is translated into blocks, segments, and annotated tokens. Tokens are annotated as belonging to one of the following classes:

  • Ignorable

  • Control

  • Significant

Only significant tokens have to be considered by higher levels during parsing. The others are simply passed through.

Term Level

On this level the source code is mapped to an abstract syntax tree.

Unlike most other syntax definition frameworks, this level uses the notions of statements and operators rather than some form of BNF.

There are the following reasons for this:

  • The abstractions of statement and operator have been used by language designers for a long time. However, these abstractions are not directly expressed by most meta-languages.

  • Using higher-level constructs directly provides higher-level extension points. It becomes possible to add new statements and operators rather than new productions that have to be integrated with existing languages.

Chapter 3. Lexical Level

Abstract

This chapter describes the lexical level of the parser.

Overview

The lexical level is quite traditional. The current definition of the lexical level is incomplete with respect to Unicode. More valid Unicode characters will be added to the language in the future.

White Spaces and New Lines

Tab, space, and new line characters outside of strings and comments are considered white space.

Comments

Comments can be of three kinds:

Block comments

These are traditional C++/Java/C# block comments that start with /* and end with */. Nested block comments are not allowed.

Line Comments

These are traditional C++/Java/C# line comments that start with // and end at the end of the line.

Documentation Comments

These are traditional C# documentation comments. These comments are a specialization of line comments and are treated as line comments in places where documentation comments are not expected. Documentation comments start with /// and last until the end of the line.

A single format for documentation comments has been selected to make the different languages defined with the ETL framework consistent. The C# comment format has an advantage over the Java format in that it allows any text inside the comment: in Java documentation comments it is not possible to use block comments or documentation comments in sample code.
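
As a non-normative illustration, the sketch below classifies an already extracted comment token according to the rules above. The class and method names are invented for this example and are not defined by the specification.

    // Non-normative sketch: classifies a comment token by its prefix,
    // following the rules above. All names here are invented for illustration.
    public final class CommentKindDemo {
        enum CommentKind { BLOCK, LINE, DOCUMENTATION }

        static CommentKind classify(String comment) {
            if (comment.startsWith("///")) {
                // Treated as a LINE comment where documentation comments are not expected.
                return CommentKind.DOCUMENTATION;
            }
            if (comment.startsWith("//")) {
                return CommentKind.LINE;
            }
            if (comment.startsWith("/*")) {
                return CommentKind.BLOCK; // nested block comments are not allowed
            }
            throw new IllegalArgumentException("not a comment: " + comment);
        }

        public static void main(String[] args) {
            System.out.println(classify("/* block */"));    // BLOCK
            System.out.println(classify("// line"));        // LINE
            System.out.println(classify("/// documented")); // DOCUMENTATION
        }
    }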

Brackets

The following bracket types are recognized by ETL:

Braces, Curly Brackets

These are traditional curly brackets: { }.

Square Brackets

These are traditional square brackets: [ ].

Parentheses, Round Brackets

These are traditional round brackets: ( ).

All brackets are singleton tokens: a bracket token always consists of a single character from the stream.

Strings

The string definition is closest to the C and Java tradition. The biggest difference is that, lexically, there is no distinction between character and string literals.

The current version supports only three kinds of quotes for strings:

"

This is a usual double quote.

'

This is a usual single quote.

`

This is a usual back quote.
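
As a non-normative sketch, the code below only checks that a token is enclosed in one of the three supported quote characters; escape sequences and other details of string syntax are not modelled here, and all names are invented for this illustration.

    // Non-normative sketch: checks that a token is enclosed in one of the three
    // supported quote characters. Escape sequences are not modelled here.
    public final class StringTokenDemo {
        static boolean isStringLiteral(String token) {
            if (token.length() < 2) {
                return false;
            }
            char quote = token.charAt(0);
            boolean knownQuote = quote == '"' || quote == '\'' || quote == '`';
            return knownQuote && token.charAt(token.length() - 1) == quote;
        }

        public static void main(String[] args) {
            System.out.println(isStringLiteral("\"hello\"")); // true
            System.out.println(isStringLiteral("'c'"));       // true: no separate char literal kind
            System.out.println(isStringLiteral("`quoted`"));  // true
            System.out.println(isStringLiteral("hello"));     // false
        }
    }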

Numbers

Numbers are borrowed from the Ada programming language almost as is. There are two formats for numbers: decimal and based.

Decimal numbers are like those in other languages. However, a floating point number must have digits on both sides of the . character.

Based numbers have the format base#mantissa#[exponent], where the optional exponent part has the form E followed by an optional sign and decimal digits. The base is a decimal number from 2 to 36 inclusive. The mantissa is a sequence of digits and letters valid for that base. The exponent is a decimal number that specifies the power of the base by which the mantissa is multiplied. For example, 2#1#E10 is the floating point number 1024, and 36#10.0#E-1 is the floating point number 1.

Numbers can contain the underscore character in the mantissa. It is ignored when the value is evaluated and can be used to improve the readability of the number, for example: 16#7FFF_FFFF#.

Numbers can have an optional suffix that must consist of alphabetic characters. The suffix cannot start with an uppercase or lowercase letter "E", to avoid a conflict with the exponent specification. Suffixes are supported to allow the typed numeric constants used in C and Java, for example: 36#XYZ#ul.
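
As a non-normative sketch, the code below shows one way a based literal as described above could be evaluated: underscores are stripped, the mantissa is interpreted in the given base, and the result is multiplied by the base raised to the optional exponent; a trailing suffix is simply ignored. All names are invented for this illustration.

    // Non-normative sketch: evaluates a based literal of the form
    // base#mantissa#[exponent][suffix] as described above.
    public final class BasedNumberDemo {

        static double evaluate(String literal) {
            int firstHash = literal.indexOf('#');
            int secondHash = literal.indexOf('#', firstHash + 1);
            int base = Integer.parseInt(literal.substring(0, firstHash));

            // Evaluate the mantissa digit by digit, honouring an optional '.'.
            String mantissa = literal.substring(firstHash + 1, secondHash).replace("_", "");
            double value = 0.0;
            double fractionWeight = 0.0;
            for (char c : mantissa.toCharArray()) {
                if (c == '.') {
                    fractionWeight = 1.0;
                    continue;
                }
                int digit = Character.digit(c, base);
                if (fractionWeight == 0.0) {
                    value = value * base + digit;
                } else {
                    fractionWeight /= base;
                    value += digit * fractionWeight;
                }
            }

            // Optional exponent: a power of the base, e.g. E10 or E-1.
            String rest = literal.substring(secondHash + 1);
            if (rest.length() > 1 && (rest.charAt(0) == 'E' || rest.charAt(0) == 'e')) {
                int end = 1;
                while (end < rest.length()
                        && (Character.isDigit(rest.charAt(end)) || rest.charAt(end) == '+' || rest.charAt(end) == '-')) {
                    end++;
                }
                value *= Math.pow(base, Integer.parseInt(rest.substring(1, end)));
            }
            return value; // any remaining alphabetic suffix (e.g. "ul") is simply ignored here
        }

        public static void main(String[] args) {
            System.out.println(evaluate("2#1#E10"));       // 1024.0
            System.out.println(evaluate("36#10.0#E-1"));   // approximately 1.0
            System.out.println(evaluate("16#7FFF_FFFF#")); // 2.147483647E9
        }
    }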

Identifiers

Identifiers are quite typical. Currently, an identifier must start with a letter or underscore and continue with letters, underscores, or digits.

Graphics

The name of this category and its definition are borrowed from Prolog. A graphics token is a non-empty sequence of the following characters: ~+-%^&*|<=:?!>.@/. Note that // and /* start comments and are therefore not considered part of a graphics token.
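
A non-normative sketch of how identifier and graphics tokens as described above could be recognized with regular expressions (ASCII letters only, since the Unicode rules are still incomplete). Comments are checked first because // and /* take priority over graphics tokens. All names are invented for this illustration.

    // Non-normative sketch: regular expressions for identifier and graphics tokens.
    import java.util.regex.Pattern;

    public final class SimpleTokenDemo {
        // A letter or underscore followed by letters, underscores, or digits (ASCII only here).
        static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
        // A non-empty sequence of the graphics characters ~ + - % ^ & * | < = : ? ! > . @ /
        static final Pattern GRAPHICS = Pattern.compile("[~+\\-%^&*|<=:?!>.@/]+");

        static String classify(String token) {
            if (token.startsWith("//") || token.startsWith("/*")) {
                return "comment"; // comments take priority over graphics tokens
            }
            if (IDENTIFIER.matcher(token).matches()) {
                return "identifier";
            }
            if (GRAPHICS.matcher(token).matches()) {
                return "graphics";
            }
            return "other";
        }

        public static void main(String[] args) {
            System.out.println(classify("_value1")); // identifier
            System.out.println(classify("+="));      // graphics
            System.out.println(classify("//x"));     // comment, not graphics
        }
    }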

Other

There are two singleton characters that do not fall into the graphics category: the semicolon and the comma.

Chapter 4. Phrase Level

The notion of a phrase layer was initially borrowed from Dylan. Later, the idea of what such a syntax should do was significantly influenced by Python's line syntax.

On the phrase level a forest of blocks and segments is built. A source is a sequence of ignorable tokens and segments. A segment is a sequence of tokens and blocks terminated by a semicolon. A block is a sequence of segments and tokens enclosed in curly brackets.

Ignorable

These tokens can be ignored when parsing terms. Tokens in this category are white space, new lines, and comments (except documentation comments).

Control

Control tokens designate the start and end of blocks and segments. Because they are processed by the phrase parser, they can be ignored at the term parser level.

Significant

These are the significant tokens that are parsed by term parsers, such as identifiers, numbers, and strings. Documentation comments are also considered significant tokens.

As an example, let us consider the following text.

    	{ a ;};
    	a {b;} c;
    	/// a
    	a;
    

The text above is interpreted by the phrase parser as follows:

  • There are three segments at the top level.

  • The first segment consists of one block containing an ignorable white space token and a single nested segment with the significant token a and the white space token that follows it. The semicolons and braces are reported as control tokens.

  • The next segment is similar, but it contains significant and ignorable tokens in addition to the block.

  • The last segment starts with a documentation comment token, which is followed by white space.
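
A non-normative sketch of the kind of structure the phrase parser produces, using the first segment of the example, { a ;};. The record types and names below are invented for this illustration and are not an API of the specification.

    // Non-normative sketch: a minimal model of phrase-level output and the
    // first segment of the example expressed in that model.
    import java.util.List;

    public final class PhraseTreeDemo {
        enum TokenClass { IGNORABLE, CONTROL, SIGNIFICANT }

        record Token(TokenClass tokenClass, String text) {}
        // A segment is a sequence of tokens and blocks terminated by a semicolon.
        record Segment(List<Object> content) {}
        // A block is a sequence of segments and tokens in curly brackets.
        record Block(List<Object> content) {}

        public static void main(String[] args) {
            Segment first = new Segment(List.of(
                    new Block(List.of(
                            new Token(TokenClass.CONTROL, "{"),
                            new Token(TokenClass.IGNORABLE, " "),
                            new Segment(List.of(
                                    new Token(TokenClass.SIGNIFICANT, "a"),
                                    new Token(TokenClass.IGNORABLE, " "),
                                    new Token(TokenClass.CONTROL, ";"))),
                            new Token(TokenClass.CONTROL, "}"))),
                    new Token(TokenClass.CONTROL, ";")));
            System.out.println(first);
        }
    }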

Chapter 5. Term Layer

Overview

Parsers on the term layer delimit the stream of tokens from the phrase parser into objects and properties. Another way to look at the term parsing process is that the parser maps source code to an AST.

The AST is assumed to consist of objects and properties. Object types are designated by a name and a namespace. Properties are designated by a name.

The AST structure is closely related to the tree produced by the phrase layer. A sequence of segments at the source level or block level is always described by some grammar and by some context of that grammar.

A statement declaration in the grammar describes the syntax of a single segment.

The grammar defines a mapping from the sequence of tokens to the AST.
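
A non-normative sketch of the assumed AST value shape: an object has a type designated by a namespace and a name, and named properties whose values are token text or other objects. The namespaces and names below are hypothetical examples, not part of any real grammar.

    // Non-normative sketch: the assumed shape of AST values on the term layer.
    // Property values may be token text, other objects, or lists of them.
    import java.util.LinkedHashMap;
    import java.util.Map;

    public final class AstObjectDemo {
        record AstObject(String namespace, String name, Map<String, Object> properties) {}

        public static void main(String[] args) {
            Map<String, Object> properties = new LinkedHashMap<>();
            properties.put("name", "x");
            properties.put("value", new AstObject("urn:example:expressions", "IntegerLiteral",
                    Map.of("text", "42")));
            AstObject statement = new AstObject("urn:example:statements", "LetStatement", properties);
            System.out.println(statement);
        }
    }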

Grammar Organization

The top-level element of the grammar is the grammar object. The grammar consists of contexts. Each context has syntax definitions, fragments, and imports.

Syntax definitions define the syntax constructs available in that context. There are two major classes of syntax definitions: statements and operators. Each statement describes the syntax of one segment. Operators describe the syntax of expressions. The expression syntax is formed in a modular way, just as in Prolog; almost the entire operator level is borrowed from Prolog. The only major addition is composite operators, which allow more syntax constructs to appear in the grammar.
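
A non-normative sketch of one possible in-memory shape for a grammar as organized above: a grammar holds contexts, and a context holds statement and operator definitions (fragments and imports are omitted). Prolog-style precedence and associativity values are shown because the operator level is borrowed from Prolog; all names and numbers are invented for this illustration.

    // Non-normative sketch: a possible in-memory model of a grammar, its contexts,
    // and their statement and operator definitions. Fragments and imports are omitted.
    import java.util.List;

    public final class GrammarModelDemo {
        record Statement(String name) {}
        // Prolog-style operator kinds: infix (xfx, xfy, yfx), prefix (fx, fy), postfix (xf, yf).
        enum Associativity { XFX, XFY, YFX, FX, FY, XF, YF }
        record Operator(String name, int precedence, Associativity associativity) {}
        record Context(String name, List<Statement> statements, List<Operator> operators) {}
        record Grammar(String name, List<Context> contexts) {}

        public static void main(String[] args) {
            Context expressions = new Context("Expressions",
                    List.of(new Statement("ExpressionStatement")),
                    List.of(new Operator("+", 500, Associativity.YFX),
                            new Operator("*", 400, Associativity.YFX)));
            Grammar grammar = new Grammar("ExampleGrammar", List.of(expressions));
            System.out.println(grammar);
        }
    }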

General Syntax Expression

TBD. See the grammar.g.etl file as an example and as the definition of the syntax constructs.

Chapter 6. Non-normative: Parser architectures

Kinds of Parsers

The following kinds of parsers are possible for ETL.

Pull Parser

This kind of parser allows reading events from a stream. It is the most generic kind of parser. Such parsers are also usually more efficient and allow reading unlimited streams of data. In the case of ETL, reading unlimited streams requires that expressions in the source are short, because expressions generally require unlimited look-ahead, so events have to be stored until the end of the expression is reached.

This kind of parser is less convenient to use directly than AST or DOM parsers (in cases where both are applicable) because it is not possible to organize a second pass over the document.
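
A non-normative sketch of the general shape of a pull parser API: the caller repeatedly advances the parser and inspects the current event. The interface and all its names are invented for this illustration and do not describe any real implementation.

    // Non-normative sketch: the general shape of a pull-style term parser API.
    public interface TermPullParser extends AutoCloseable {
        enum EventKind { OBJECT_START, OBJECT_END, PROPERTY_START, PROPERTY_END, VALUE, CONTROL, IGNORABLE, EOF }

        /** Advances to the next event; returns false when the end of input is reached. */
        boolean advance();

        /** The kind of the current event. */
        EventKind kind();

        /** The token text associated with the current event, if any. */
        String tokenText();
    }

A caller would typically loop while advance() returns true, buffering the events of the current expression as noted above before handing them to later processing stages.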

Push parser

This kind of parser is simpler to implement but less convenient to use than a pull parser. However, in languages like E that have a convenient asynchronous API, the distinction between pull and push parsers is small, as it is possible to convert a push parser into a pull parser using queues.

DOM parser

A DOM parser builds a tree rather than an event stream. This kind of parser has a close relationship with the pull parser: a pull parser can be considered an iterator over the DOM model, and the DOM model can be considered a structured record of the events produced by a pull or push parser.

AST parser

This parser builds a tree of objects from the source code. Such a parser is normally used by compilers and other tools that are only interested in the significant aspects of the source code and do not generally care how these constructs are represented.
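
As a non-normative sketch of the relationship between event-based and tree-based parsers described above, the code below replays a flat list of events into a tree, so the tree is literally a structured record of the events. The event encoding and all names are invented for this illustration.

    // Non-normative sketch: building a tree as a structured record of parser events.
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    public final class EventTreeDemo {
        record Node(String name, List<Object> children) {}

        /** Replays a flat event list ("start NAME", "text TEXT", "end") into a tree. */
        static Node buildTree(List<String> events) {
            Deque<Node> stack = new ArrayDeque<>();
            stack.push(new Node("root", new ArrayList<>()));
            for (String event : events) {
                if (event.startsWith("start ")) {
                    Node child = new Node(event.substring("start ".length()), new ArrayList<>());
                    stack.peek().children().add(child);
                    stack.push(child);
                } else if (event.equals("end")) {
                    stack.pop();
                } else {
                    stack.peek().children().add(event);
                }
            }
            return stack.pop();
        }

        public static void main(String[] args) {
            Node tree = buildTree(List.of("start Statement", "text a", "end"));
            System.out.println(tree);
        }
    }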

Error Recovery

At least the following error recovery strategies are possible for ETL term parsers:

No recovery

This is the simplest error recovery strategy. When a syntax error is encountered, parsing stops.

Segment-based

In this error recovery strategy, if a syntax error is encountered, everything up to the end of the current segment is ignored.
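
A non-normative sketch of segment-based recovery at the token level: after a syntax error, tokens are skipped up to the semicolon that terminates the current segment (taking nested blocks into account), and parsing resumes with the next segment. In a real parser the phrase level already reports segment boundaries; the list-of-strings token model and all names here are invented for this illustration.

    // Non-normative sketch: segment-based error recovery.
    import java.util.List;

    public final class SegmentRecoveryDemo {
        /** Returns the index of the first token after the current segment ends. */
        static int skipToNextSegment(List<String> tokens, int errorPosition) {
            int depth = 0; // nesting level of blocks inside the broken segment
            int i = errorPosition;
            while (i < tokens.size()) {
                String token = tokens.get(i);
                if (token.equals("{")) {
                    depth++;
                } else if (token.equals("}")) {
                    depth--;
                } else if (token.equals(";") && depth <= 0) {
                    return i + 1; // this semicolon terminates the broken segment
                }
                i++;
            }
            return i; // end of input also ends the segment
        }

        public static void main(String[] args) {
            List<String> tokens = List.of("a", "?", "{", "b", ";", "}", ";", "c", ";");
            int resumeAt = skipToNextSegment(tokens, 1);
            System.out.println("resume at token: " + tokens.get(resumeAt)); // c
        }
    }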

Section-based

In this error recovery strategy, if a syntax error is encountered, significant tokens are ignored until a token is encountered that designates the start of the next section, a list separator, a closing bracket, or a block. Recovery points are scoped: it is only possible to reach recovery points in an enclosing scope. Section, list, and bracket constructs create new scopes. Bracket ends, list separators, and section or block starts form new recovery points.

This strategy can produce more reported errors, as there is a chance that recovery will work incorrectly in some cases.

Glossary

Extensible Term Language

This is a temporary name for the meta-language being defined by this specification.

Dylan Programming Language

This is a functional object-oriented language with support for a macro-based extensibility model. The home page of the language is at http://www.opendylan.org/. See the Wikipedia entry on Dylan for a description.

E Programming Language

This is a functional object-oriented language with support for asynchronous communication and a capability security model. The home page of the language is at http://www.erights.org. See the Wikipedia entry on E for a description.