Revision History

Revision | Date | Description
---|---|---
0.2.1 | 2009-01-19 | Updated the document for the ETL framework version 0.2.1.
0.2.0 | 2006-02-05 | First release of the document.
Abstract
ETL is a framework that allows easy creation of textual domain specific languages and new programming languages. While DSLs are a hot topic and there are a number of competing offerings, why yet another framework? This article attempts to answer this question.
There are two primary competing approaches to DSL creation: the model-based approach and the text-based approach.
The model-based approach focuses on modeling the problem domain directly. The model is kept in files that are not supposed to be modified in any way that bypasses the tool or the tool libraries. The Eclipse GMF framework is an example of such tools; Microsoft also has an offering in this area.
The model is usually kept in a consistent form with resolved references. One of the biggest challenges of this approach is distributed evolution of the models: if model versions diverge strongly enough, they stop working together.
MDA tools usually provide a nice GUI for working with models. This is both a blessing and a curse. The model editing GUI allows for a very nice presentation, and it allows expressing some concepts in a compact and easy-to-understand form. The GUI is also able to prevent the user from making mistakes. On the other hand, working from the keyboard is usually much faster than using the mouse.
For some purposes such frameworks have been very successful (UML, ER diagrams). For others they are not so good: for example, a diagram for a complex BPEL process quickly gets out of control. However, the XML representation is not much better.
The alternative, and longer-established, solution is the creation of a textual domain specific language. This approach has its own advantages over the model-based approach.
It is possible to work with the language using the great range of tools that can work with text.
It is possible to have incorrect text, and this is a very good thing for distributed changes. If the text has become incorrect, it is possible to correct it. For example, VCS merge tools intentionally generate incorrect text (conflict markers) in order to allow the user to merge conflicting changes. With model-based tools, this is very hard to do.
It is easy to input text; the keyboard is still the fastest input device when it comes to complex information. But the correctness of the input has to be checked, since it is possible to write gibberish.
You have probably experienced most of the advantages of textual languages through programming. After all, a programming language describes the program execution process in higher-level constructs than machine code, and it does not allow execution of arbitrary instructions. Languages like SQL describe this process with quite a few assumptions.
There are several ways to create textual DSLs, and we will consider them in the next sections.
This is the oldest way of creating a new language, and there is no way around it when creating a language at a higher level than any existing one: the first Fortran parser had to be coded in assembler.
The problem with this approach is that it requires considerable effort to create and maintain such a parser, and human time is the scarce resource that DSLs are supposed to save. So it is used only when there is no better alternative.
The primary advantage of this approach is that almost any textual language can be created using it.
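To make the amount of manual work concrete, here is a minimal sketch of a hand-written recursive-descent parser for arithmetic expressions (the `ExprParser` class and the tiny grammar are purely illustrative and not part of any real tool). Even for this toy language, tokenization, precedence handling, and error reporting all have to be coded and maintained by hand.

```java
// Grammar (informal EBNF):
//   expr   ::= term (('+' | '-') term)*
//   term   ::= factor (('*' | '/') factor)*
//   factor ::= NUMBER | '(' expr ')'
final class ExprParser {
    private final String src;
    private int pos;

    ExprParser(String src) { this.src = src; }

    double parse() {
        double value = expr();
        skipSpaces();
        if (pos != src.length()) {
            throw new IllegalArgumentException("Unexpected input at " + pos);
        }
        return value;
    }

    private double expr() {
        double value = term();
        while (true) {
            skipSpaces();
            if (eat('+')) value += term();
            else if (eat('-')) value -= term();
            else return value;
        }
    }

    private double term() {
        double value = factor();
        while (true) {
            skipSpaces();
            if (eat('*')) value *= factor();
            else if (eat('/')) value /= factor();
            else return value;
        }
    }

    private double factor() {
        skipSpaces();
        if (eat('(')) {
            double value = expr();
            skipSpaces();
            if (!eat(')')) throw new IllegalArgumentException("Expected ')' at " + pos);
            return value;
        }
        int start = pos;
        while (pos < src.length()
                && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
            pos++;
        }
        if (start == pos) throw new IllegalArgumentException("Expected number at " + pos);
        return Double.parseDouble(src.substring(start, pos));
    }

    private boolean eat(char c) {
        if (pos < src.length() && src.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    public static void main(String[] args) {
        System.out.println(new ExprParser("2 * (3 + 4)").parse()); // prints 14.0
    }
}
```

Every new operator, statement, or error message means another round of such hand-written code, which is exactly the maintenance cost discussed above.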
The parsing process itself can be described by its own DSL. The archetype of most DSLs that describe the parsing process is the Extended Backus–Naur Form (EBNF), which is able to describe context-free grammars.
Parser generators like Yacc and ANTLR compile an EBNF-based DSL into executable code in a general-purpose programming language like C or Java.
This makes creation of new languages quite simple, and it is a very popular technology. However, this approach is not without difficulties:
Generated parsers work well on correct sources, but error recovery is very difficult and not automated. Because LL and LR grammars are very flexible, it is possible to define almost any language, and this prevents the development of a meaningful generic error recovery policy. By default, generated parsers usually stop at the first error in the source.
Incremental parsers, which are required for editors (for example, Eclipse), are difficult to develop. They are usually coded manually for each language.
Automatically generated parsers support a limited set of usage modes. For example, a parser might create an AST or execute actions when some construct is detected; the actions are usually hard-coded into the grammar. Also, generated parsers usually work in a push model (corresponding to the SAX API in the XML world) rather than a pull model (corresponding to the StAX API); the difference is sketched below.
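To make the push/pull terminology concrete, here is a minimal sketch using the standard Java SAX (push) and StAX (pull) APIs; the `PushVsPull` class name and the inline sample document are just for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PushVsPull {
    static final String XML = "<doc><item>a</item><item>b</item></doc>";

    public static void main(String[] args) throws Exception {
        // Push model (SAX): the parser drives and calls back into our handler.
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(XML)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    System.out.println("SAX start: " + qName);
                }
            });

        // Pull model (StAX): the application drives and asks for the next event.
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(XML));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                System.out.println("StAX start: " + reader.getLocalName());
            }
        }
        reader.close();
    }
}
```

In the push case the parser owns the control flow; in the pull case the application does, which is why pull parsers are often easier to embed into hand-written processing code.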
Parser generators generate code that has to be compiled and then used. This code is usually strongly coupled with a specific version of the runtime. It is quite a regular situation that two ANTLR-generated parsers cannot live together because of a version incompatibility between their runtimes. Compare this with the XML world, where different parsers provide the same external interface and use the same grammar definition files (DTD and XML Schema) that can be updated independently of the parser.
It is impossible for the user of a compiler that uses such a parser to update the grammar recognized by the parser in order to add some minor extension to the language. For example, a user cannot add support for the C# "using" statement to a Java compiler. Another possible example is the "foreach" statement and enumerations, which were missing until Java 1.5. Or Java closures, where there is still no consensus and you can use none of the proposals. Such a tool extension requires two things:
Adding support for new constructs to the parser. This is very hard using current parser technologies, particularly considering that many production-level parsers are written manually in order to support error recovery.
Adding support to the compiler. This is the much easier part, because the technology for extending transformation components is well developed. Compilers often use it internally to transform high-level constructs into more primitive forms. This is just a question of a compiler plug-in architecture.
This may look like a minor issue: why would one want to change the compiler? But it looks that way only because we have become used to the restriction. It disables user innovation at the language level.
At the code library level there is a lot of innovation, because users can use reusable libraries and develop custom code based on them. The best ideas from custom code libraries are selected and integrated into standard libraries.
However, at the language level there is much less innovation by users of compilers. Creating a new dialect of a language takes too much effort, and a new feature has to be used in at least three projects to be considered reusable. Most of the time it is economically unfeasible to invest in a new tool chain. If code were written this way, we would not have the ability to define even simple procedures and would have to ask our library providers for new ones; all custom code would be one big main method.
The only people who have full power to innovate are compiler developers. Others have to live with the restrictions laid down by compiler developers.
There is no standard and portable syntax for defining grammars. Each tool uses its own tool-specific language. This makes publishing executable specifications impossible: the specification of a language has to be transformed into a tool-specific specification.
Languages do not mix well. To create a new language that combines the features of two, one has to build a tool chain almost from the ground up. For example, there is the Java language and the SQL language, and there is also a hybrid language named SQLJ. It has not gained much support because it requires a separate tool chain: its own compiler, its own editor support, and, for the editor, a new parser that can work with errors in both the SQL and the Java code.
Another problem is that the languages themselves are very different. SQL code does not look like Java code, and programmers have to switch between different rules for literal values (for example, number literals are different in SQL, and SQL escapes a quote in a string literal by doubling it, as in 'it''s', while Java uses backslash escape sequences) and other syntax peculiarities.
There are no reusable or standard components shared between languages. When a new language is designed, the language designer cannot take some (hypothetical) standard module like "OASIS Common Arithmetic Syntax" and reuse it in the language.
Could it be that it is natural for textual DSLs to work in this way and there is no way out? XML is a clear counterexample to this claim. XML overcomes many limitations of parser generators at a great cost in generality.
Most XML text editors have an error recovery policy that more or less works for all XML files. It might not be an ideal one, but no additional coding is required to support it.
Incremental parsers are possible to create, and they work for any XML-based language.
Different kinds of parsers can be used with any XML-based language: JAXB, DOM, SAX, StAX, and so on. This includes push, pull, and DOM models.
XML parsers are usually dynamic, and they need a DTD mostly for validation. There is no need for code generation in most cases (JAXB being the exception).
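As a rough illustration of this dynamic style, the following sketch loads a grammar (an XML Schema, assumed here to live in a file named greeting.xsd) at runtime, validates a document against it, and then parses the document with a generic DOM parser; no code is generated from the grammar at any point.

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DynamicValidation {
    public static void main(String[] args) throws Exception {
        String xml = "<greeting>hello</greeting>";

        // The grammar is loaded dynamically from a schema file (assumed path);
        // no classes are generated from it.
        Schema schema = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new File("greeting.xsd"));

        // Validate the document against the dynamically loaded grammar.
        schema.newValidator().validate(new StreamSource(new StringReader(xml)));

        // Parse the same text into a DOM tree with a generic parser.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
    }
}
```

The same generic parser and validator work for any XML-based language; only the schema file changes.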
It is possible to introduce new elements into an XML-based language. To do this, there is no need to copy the entire XML schema, and the cost of an extension is usually proportional to the size of the extension. The tools usually need minimal effort to support new elements. Consider Spring XML configuration files as an example: they have many extensions specific to different frameworks, and the grammar is very modular.
I think that one of the reasons for the popularity of XML is that it allowed easy innovation at the syntax level, together with a ready-to-use tool chain that supported the process.
XML grammar definition languages are portable between different parsers (provided that they are supported; quite a few such languages have appeared recently, and RELAX NG support is coming).
The different XML-based languages look like each other and mix very well.
There is a lot of reuse going on in XML. For example, the WS-Security grammar reused the XML Signature and XML Encryption grammars, and DocBook reused XLink. The reuse stories could go on.
XML is likely the most used tool for creating textual DSLs right now, and this despite its verbosity. People use something other than XML only if they cannot live with XML (programming languages, for example, and even there attempts to use XML have been made).
Note that XML and SGML were created as a basis for DSLs that allow writing textual documents, and for this purpose they serve quite well. This document, for example, was written in XML. XML just became used and misused in ways that were not originally intended.
I think that there are several reasons for the limitations of parser generators and the success of XML.
Languages are defined using constructs of too low a level. If we read language specifications, they talk about statements, operators, parts of statements, and names. However, when we read the grammar, even in language specifications, we see productions and tokens.
Because of this, when we want to introduce a new statement or operator into the language, we have to map it to an update of the productions. This is routine and error-prone work, and an obvious place for automation.
Basic low-level constructs like blocks and literal values are designed completely independently, and the rules differ between languages. If languages are mixed together, it is difficult for a language user to understand what is allowed and what is not. It is also difficult to mix languages because of the reserved word concept: what is a reserved word in one context might have no significance in another.
There are no common standard interfaces for the language tool chain. For XML there is a standard tool chain that includes editors, parsers, and transformation components. DSL creators either have to use XML, which has a ready tool chain, or they have to reproduce such a tool chain themselves. Producing a tool chain is a very heavy investment.
The new generation of textual DSL tools, like Microsoft's M, tries to partially overcome these problems (for example, grammars are loaded dynamically). But I do not believe that they will be able to overcome the disadvantages completely while they stick to EBNF.
To summarize the points above, I believe that these problems of generic parser generators are caused by the fact that the task they solve is too broad. If we reduce the scope of the tool, we can gain the advantages of a common generic tool chain. After all, this has been done with XML before.
But the question was: does the resulting language have to be just as ugly? There was already a less verbose solution in the form of LISP syntax. It allowed extensibility to a great degree, but its syntax carried quite a few similarities with XML on the meta level; it was just less verbose.
ETL was created as an attempt to keep most of the advantages of the approaches provided by parser generators and the XML tool environment, but with the goal of enabling the creation of terse programming languages. The starting point was the following list of goals and principles (in no particular order):
The framework should work with plain text.
It should be possible to use generic text editors that are not aware of ETL to edit ETL texts.
Usage of special hidden markup in the text should not be required.
The framework should not restrict editing operations; in particular, it should be possible to save and load incorrect text in an editor as an intermediate state.
It should be possible to define a language that is reasonably easy to read and write. As a dog-food test, the grammar definition language should be written in itself, and the same parsing pipeline should be used to parse grammars.
It should use high-level constructs like operators and statements, because these are the constructs actually used in the language definition process. Translation of these constructs into an executable parser should be the task of the tool chain rather than of the language designer.
It should be possible to define a new language that extends an existing language with new operators and statements.
It should be possible to define and reuse language modules.
It should be possible to redefine existing operators and statements when a language is extended.
It should be possible to define languages independently and to combine them. That is, things like the hypothetical "OASIS Common Arithmetic Syntax" mentioned earlier should be possible.
It should have common structural principles for organizing source code, at least at the lexical and phrase levels.
It should be possible to create a common reusable tool chain that contains the following components (we already have this for XML):
A generic parser with a working error recovery model and a syntax construct identification model. The parser should be useful in different situations such as editors, interpreters/shell scripts, compilers, and code analyzers. Note that editors and compilers put very heavy requirements on error recovery: basically, the source should always parse and errors should be reported.
A generic editor that can be specialized for a specific language, uses the generic parser, and provides basic services like outline and syntax highlighting. Note that this is believed to be possible with planned additions to the parser; however, it has not yet been verified in practice.
The framework should be suitable for the definition of different classes of languages that are based on statements and expressions, including:
Imperative
Scripting
Functional
Rule-based
Such languages should be definable in a way that is more or less natural to them.
Note that it is not a requirement to be able to define existing languages, except by sheer luck. This framework is targeted at the construction of new languages; support for old languages would have put the framework back into the production/token area. This is similar to XML: it is not possible to define the RTF syntax using XML, but it is possible to define an XML language for documents.
The parser should not be the center of the application; it should produce an output that is processed by further components (for example, MDA tools).
The project is more or less successful in following these principles. With any luck, ETL will be a viable replacement for XML in domains where XML is currently misused only because of its good tool support.
ETL does not directly compete with XML, because it does well in areas where XML is too verbose to be useful as a surface syntax, and it has some disadvantages in areas where XML shines. For example, ETL is unsuitable for the creation of articles like this one. For web services, the ETL parser is possibly too complex, and easy writing and reading by humans is not significant in that area (however, it is still required that the text can be analyzed, and XML provides this).
Note that the resulting language definition framework is able to do more in the tool chain for domain specific languages because the set of allowed languages is restricted. The allowed languages are a strict subset of the LL(1) languages, and not even all LL(1) languages are supported: the languages should follow the rules of the lexical and phrase levels.
Tools like ANTLR and Yacc can handle a much larger set of languages, but for that larger set far fewer common tools can be developed, because those languages have much less in common. And because these languages have little in common, it is not possible to merge them cleanly.