ETL Framework 0.2.1a Readme

Constantine Plotnikov

Revision History
Revision 0.2.1 2009-01-19

Updated the document to reflect changes in version 0.2.1a.

Revision 0.2.0 2006-02-05

This is the first release of the document.

Abstract

This document describes the structure of the distribution and where to look to get started with the framework.


Table of Contents

1. Package Status
2. Framework Overview
3. Package Contents
4. Possible Next Steps

1. Package Status

This is the second public release of the Extensible Term Language (ETL) Framework for Java. The release is of alpha quality.

The biggest known quality issues are the following:

  1. Documentation quality. There is a definite lack of documentation and samples. If you create a test, a sample, or a tutorial, please consider contributing it to the project or publishing it on the web for further reference.

  2. Language stability. The grammar language is expected to change. Some planned changes are already known and are listed in the document PlannedChanges.html, which will be expanded as feedback is received. Some backward-incompatible changes might also be introduced.

  3. API stability. The API is expected to change in response to feedback and because of planned changes in the grammar language. The AST-building parser will also be extended to support more data types. Current support is very limited and restricted to the needs of the demos and the framework itself.

  4. Implementation limitations. The current implementation does not implement some features completely. This version also assumes that a grammar does not change after it has been read, so the framework cannot be used in applications where the grammar is expected to change while work is in progress. The parser is therefore suitable for most kinds of command-line tools, but not yet for IDEs like IDEA or Eclipse. This issue is considered very important and is the next high-priority item. Full Unicode support is also missing: non-ASCII characters are currently supported only inside strings. Unicode support is planned for a later version, because the major ideas of the framework can be evaluated without it.

  5. Test coverage. There are few tests, and they mostly cover success conditions. There are very few tests of error conditions. The grammar definition language has a lot of error conditions; some of them were checked during development because of bugs in the tests, but others have yet to surface. Other bugs are expected to surface because of the limited testing of the framework. Please submit all bugs you find to http://sourceforge.net/tracker/?group_id=153075&atid=786315 .

  6. Performance. Parsing is quite fast, but grammar compilation is relatively slow. To alleviate this, future versions of the framework will allow saving compiled grammars and possibly even generating Java code. The AST-building BeansTermParser and FieldTermParser also use the reflection API heavily, with unknown performance impact. No profiling has been done so far; the current priority is to get things right before making them fast.

    There is also a not-yet-investigated problem with the startup time of utility applications and tests. It is possibly related to class loading.

  7. Licensing. The license might change to some other permissive open source license, such as Apache License 2.0 or Eclipse Public License 1.0. Please keep this in mind when submitting patches, samples, or tutorials.

If you encounter other quality issues, please report them as bugs.

2. Framework Overview

The primary focus of the framework is the development of a language that allows defining extensible and reusable grammars. This implementation is intended to be a Java implementation of this grammar definition language rather than the single definitive implementation of the language.

The language consists of three layers:

  1. Lexical

  2. Phrase

  3. Syntax

From a logical point of view, the parser defined by a grammar takes source code and produces an AST. The AST consists of objects, properties, and values. A source is parsed into a sequence of objects, and objects contain properties. A property belongs to some object and may contain other properties or values. Values correspond directly to tokens.

Note that AST parsers create a "shallow" domain model. The AST is a strict tree of objects, properties, and values. Resolving references in the AST is the task of other components in the processing queue. For example, the grammar compiler has a view component that provides a view of the grammar that is usable for further processing.

However, the actual pull parser interfaces return many more events than just those about properties, objects, and values. The returned stream of events includes all tokens from the lexical level, including whitespace and comments, and all additional tokens from the phrase level. This generic pull parser is used to implement AST-building parsers that ignore most of that stream of events. There are already three AST-building parsers in the project: one that builds a JavaBeans-based AST, one that builds an EMF-based AST, and one that builds an AST using reflection (this one is also used internally to create the AST of grammars).

The same pull parser can also be used by text editors to implement syntax highlighting and outline views.

The pull parsers in the stack all work in a very similar way: each one partitions and annotates the stream returned from the lower level.

The lexer works with a character stream and partitions characters into tokens. The phrase parser partitions the sequence of tokens into blocks and segments, and annotates tokens as control (ones that have already been used to mark up the phrases, such as opening and closing braces and semicolons), ignorable, or significant. The term parser partitions the stream from the phrase layer into objects and properties, and annotates significant tokens as structural or value tokens.
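The layering can be pictured with a toy example. The following Java sketch is not the ETL API: the class LayeredTokensSketch, the Role enum, and both methods are hypothetical names invented for illustration, and the lexical rules are deliberately simplistic. It only shows the idea that a lower layer partitions characters into tokens and a higher layer annotates those tokens without discarding any of them.

```java
import java.util.ArrayList;
import java.util.List;

public class LayeredTokensSketch {
    /** Roles a phrase layer could assign to tokens (hypothetical names). */
    enum Role { CONTROL, IGNORABLE, SIGNIFICANT }

    /** Lexical layer: partition characters into tokens, keeping whitespace. */
    static List<String> lex(String source) {
        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < source.length()) {
            int start = i;
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                while (i < source.length() && Character.isWhitespace(source.charAt(i))) {
                    i++;
                }
            } else if (Character.isLetterOrDigit(c)) {
                while (i < source.length() && Character.isLetterOrDigit(source.charAt(i))) {
                    i++;
                }
            } else {
                i++; // any other character becomes a one-character token
            }
            tokens.add(source.substring(start, i));
        }
        return tokens;
    }

    /** Phrase layer: annotate a token; the token itself is left unchanged. */
    static Role annotate(String token) {
        if (token.equals("{") || token.equals("}") || token.equals(";")) {
            return Role.CONTROL;
        }
        if (token.trim().length() == 0) {
            return Role.IGNORABLE;
        }
        return Role.SIGNIFICANT;
    }

    public static void main(String[] args) {
        for (String token : lex("a { b; }")) {
            System.out.println(annotate(token) + " [" + token + "]");
        }
    }
}
```

Because no token is dropped, a consumer such as a highlighter can still see whitespace and braces, while an AST builder can skip everything that is not significant.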

All pull parsers in the project are failure resistant, provided that the underlying reader works and no bugs are encountered. If invalid input is encountered, the parsers report the problem and try to continue. An error can result in incomplete objects (for example, if a name is required by the grammar and a syntax error happens before the value for the name is parsed, the name will not be returned by the parser), but all open objects are guaranteed to be closed. Even if a closing curly bracket is omitted from the source, block-closing events will be reported at the end of the source.

Also, if a syntax error is detected in a segment, the rest of the segment is marked as erroneous, but the term parser continues with the next segment. The only serious problem that might generate a lot of errors is a missing opening or closing brace. The statements will be misinterpreted as belonging to another segment, and that will result in a lot of syntax errors. But the parser will continue to parse the source until the end of the file. Such behavior is very important for editors, where a lot of invalid input is likely to be encountered. Note that the current version of the parser supports reparsing parts of the source at the segment level; the focus of the next version will be support at the expression level.
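The guarantee that every opened block is eventually closed can be illustrated with a small sketch. This is not how the ETL phrase parser is implemented; the class BlockRecoverySketch, the events method, and the event strings are assumptions made up for this example. It only demonstrates the recovery policy described here: report the problem, keep going, and emit the missing closing events at the end of the source.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockRecoverySketch {
    /** Turn braces into block events without ever failing on bad nesting. */
    static List<String> events(String source) {
        List<String> out = new ArrayList<String>();
        int depth = 0; // number of blocks that are currently open
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '{') {
                depth++;
                out.add("BLOCK_START");
            } else if (c == '}') {
                if (depth == 0) {
                    // Report the stray brace and continue with the next character.
                    out.add("ERROR unmatched '}' at offset " + i);
                } else {
                    depth--;
                    out.add("BLOCK_END");
                }
            }
        }
        // End of source: close every block that is still open.
        while (depth > 0) {
            depth--;
            out.add("ERROR missing '}'");
            out.add("BLOCK_END");
        }
        return out;
    }

    public static void main(String[] args) {
        // The outer block is never closed in this input.
        for (String event : events("{ a { b }")) {
            System.out.println(event);
        }
    }
}
```

A consumer of such a stream can rely on start and end events being balanced, which keeps outline views and AST builders simple even for broken input.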

3. Package Contents

This section lists items of interest for reviewers of the framework. If you are reading this document on the web and have not downloaded the ETL framework yet, you can download it from the sourceforge.net download site. Note that some links work only if the directory xmlout contains the output of the demo script or if the XML samples have not been removed.

  1. The directory doc contains some documents that might be useful to read. Documentation is very limited, but at least it is there:

    1. Readme.html – this document

    2. InstallationInstructions.html – installation instructions for the package.

    3. CalculatorTutorial.html – this document demonstrates most of the ETL extensibility constructs while creating a simple calculator language.

    4. ProblemStatement.html – this document describes the motivation for the framework and the problem it attempts to solve.

    5. ETL-specification.html – this is a draft specification of the framework. Note that it is quite informal right now.

    6. PlannedChanges.html – this document contains the list of planned changes to the language and runtime.

    7. DesignLog.html – this document lists some design decisions for the language.

  2. There are some sample grammars (you need to download the xmlout version of the package or generate the XML files according to the installation instructions). If you are reading this document on the web, the links should work. Most of the grammars are accompanied by sample sources that conform to them. Most of the samples are only syntactically correct. They are usually intentionally semantically incorrect in order to cover more features of the grammar in a single source.

    1. The grammar src/parser/src/net/sf/etl/grammars/grammar.g.etl ( AST ) is the grammar of the ETL language itself, written in ETL. This is possibly the place to start and to return to later. Documentation comments in the grammar explain the usage of some constructs.

    2. The grammar src/parser/src/net/sf/etl/grammars/default.g.etl is the grammar of the default language. This grammar is used when there is an error while locating or compiling a grammar. Note that the parser for this grammar is hard-coded.

    3. The grammar src/parser/src/net/sf/etl/grammars/doctype.g.etl is the grammar of the doctype directive, which specifies the location of the grammar for the source.

    4. The grammar src/tests/src/net/sf/etl/samples/ej/grammars/EJ.g.etl is a grammar of a Java-like language. The grammar is quite complete and also demonstrates some extensibility features. It reuses generic control flow statement definitions from CommonControlFlow.g.etl and operator definitions from CommonOperators.g.etl. The grammar itself is extended by AsyncEJ.g.etl to add a few operators and new statements to interface and class. The tests also contain a HelloWorld sample and an almost minimal grammar required to support it.

    5. The grammar src/tests/src/net/sf/etl/samples/ecore/grammars/Ecore.g.etl contains a grammar of a language that can be used to define Ecore models. The grammar is incomplete and supports only the supplied demo files. The grammar EcoreJava.g.etl from the same directory demonstrates grammar extension by adding a Java-specific declaration to the content of the datatype construct. The files Library.ecore.etl ( AST ) and SchoolLibrary.ecore.etl ( AST ) demonstrate the languages defined by these grammars.

    6. Grammar imports are demonstrated by src/tests/src/net/sf/etl/tests/data/imports/MainGrammar.g.etl and ExpressionGrammar.g.etl from the same directory. The sample that conforms to these grammars is Test.i.etl ( AST ).

    7. The grammar src/tests/src/net/sf/etl/tests/data/fallbacks/Fallbacks.g.etl is quite simple. However, it has two samples, EmptyFallbacks.test.etl ( AST ) and NonEmptyFallbacks.test.etl ( AST ), with intentional errors that are used to test error recovery. They also demonstrate how syntax highlighting might work when there is an error in the source.

    8. The directory src/samples/calculator/src/calculator/grammars/ contains the grammars and calculator script files used for the calculator tutorial. See the calculator tutorial for an explanation of the directory content.

    9. The directory src/samples/events/ contains the grammar events2.g.etl and the sample events2.sm.etl ( AST ), which define a state machine (based on the chapter from Martin Fowler's book). There is also a more verbose version of the grammar, events.g.etl, and the sample events.sm.etl ( AST ), but the resulting AST is the same.

    10. The directory src/samples/agreement/ contains a sample that reimplements the agreement language from Martin Fowler's article as a textual DSL. The resulting sample is sample.plans.etl ( AST ), which is defined using the grammar plans.g.etl. The plans grammar imports the accounting grammar accounting.g.etl, which in turn includes the formula grammar formula.g.etl.

  3. The public API packages of the framework are (see javadoc comments for more details):

    1. net.sf.etl.parsers – this is the interface to the pull parser. The parser has no dependencies on external components.

    2. net.sf.etl.parsers.errors – this is a package that contains resources with descriptions of errors.

    3. net.sf.etl.parsers.beans – this is a utility parser that builds an AST using the java.beans and java.lang.reflect APIs.

    4. net.sf.etl.parsers.utils – these are utilities that help to construct your own tree parsers.
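To give a feel for what building an AST through java.beans and reflection involves, here is a minimal sketch that assigns a bean property by name. It uses only the standard java.beans.Introspector API and does not call or reproduce anything from net.sf.etl.parsers.beans; the classes BeanPropertySketch and NamedNode and the method setProperty are hypothetical.

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

public class BeanPropertySketch {
    /** A hypothetical AST node class; it is not part of ETL. */
    public static class NamedNode {
        private String name;

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }
    }

    /** Assign a property by name, as a reflective AST builder might do. */
    static void setProperty(Object bean, String property, Object value) {
        try {
            BeanInfo info = Introspector.getBeanInfo(bean.getClass());
            for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
                if (pd.getName().equals(property) && pd.getWriteMethod() != null) {
                    pd.getWriteMethod().invoke(bean, value);
                    return;
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Cannot set property " + property, e);
        }
        throw new IllegalArgumentException("No writable property: " + property);
    }

    public static void main(String[] args) {
        NamedNode node = new NamedNode();
        setProperty(node, "name", "HelloWorld");
        System.out.println(node.getName());
    }
}
```

Looking up descriptors on every call is simple but repeats the introspection work each time; a real builder would probably cache the descriptors per class.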

  4. The test suite demonstrates some work with the API, and some tests check internals of the framework.

    1. The package net.sf.etl.tests.lexer contains tests for the lexer. The tests can be examined to understand which kinds of tokens are understood by the framework. A related test is in the package net.sf.etl.tests.utils. It tests a utility class that converts tokens to values.

    2. The package net.sf.etl.tests.phrase_parser contains a test for the phrase parser. The phrase parser tests are incomplete, and many error conditions have not yet been covered by unit tests. However, most of them were tested during parser development.

    3. The package net.sf.etl.tests.term_parser demonstrates usage of the primary parser of the framework.

      • FallbackTest, HelloWorldTestCase and ECoreTestCase use the generic pull term parser to read a source.

      • ImportsTestCase uses BeansTermParser to read a source into JavaBeans.

  5. There are some utilities that might help you get a feel for the framework. They are launched from the ant build script build.xml, which is located in the root of the package. The script has been tested with ant 1.7.1. The script creates the directory xmlout and converts different ETL sources to a number of XML forms with the following extensions (the script can also be used to recompile sources and samples):

    .beans.xml

    Files with this extension are created using the net.sf.etl.parsers.utils.etl2beans.ETL2Beans application. The application parses source code and creates JavaBeans from it. The beans are then serialized with java.beans.XMLEncoder.

    .s.xml

    Files with this extension are created using the net.sf.etl.parsers.xml.ETL2XML application in the structural mode. In this mode, the application parses source code and emits structurally significant events (start and end of objects and properties, and values) to the output file.

    .p.xml

    Files with this extension are created using the net.sf.etl.parsers.xml.ETL2XML application in the presentation mode. In this mode, the application parses source code and emits almost all events to the output file as XML tags. If a stylesheet has been specified, a reference to it is emitted to the output file too, and the resulting files can be opened in a web browser.

    .html

    Files with this extension are created using the net.sf.etl.parsers.xml.ETL2XML application in the html mode. In this mode, the application parses source code, creates presentation XML in memory, applies an XSLT stylesheet to it, and saves the resulting HTML.

    .o.html

    Files with this extension are created using the net.sf.etl.parsers.xml.ETL2XML application in the outline html mode. In this mode, the application parses source code, creates presentation XML in memory, applies the outline XSLT stylesheet to it, and saves the resulting HTML. In addition to the normal HTML output that is generated for .html files, in this mode an object tree is dumped to the file. Both the tree and the source are clickable, so it is easy to understand the correspondence between the source and the resulting object tree.

    .c.html, .c.p.xml

    Files with these extensions are created using the net.sf.etl.parsers.xml.ETL2XML application in the html and presentation modes. The only difference from the normal files is that they are created using a custom XSLT stylesheet rather than the generic one. The demo script creates such files for EJ sources and grammar files.

    The following targets in the demo script might be interesting.
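The html modes listed here amount to applying an XSLT stylesheet to presentation XML held in memory. The sketch below shows that step with the standard javax.xml.transform API only. The XML vocabulary (source and token elements with a kind attribute), the stylesheet, and the class XsltHtmlSketch are invented for illustration and do not reflect the actual ETL2XML output format or the stylesheets shipped with the package.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltHtmlSketch {
    /** Hypothetical presentation XML; the real ETL2XML vocabulary differs. */
    static final String XML =
        "<source><token kind=\"keyword\">if</token></source>";

    /** Hypothetical stylesheet that wraps each token in a styled span. */
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" indent=\"no\"/>"
        + "<xsl:template match=\"/source\">"
        + "<pre><xsl:apply-templates/></pre>"
        + "</xsl:template>"
        + "<xsl:template match=\"token\">"
        + "<span class=\"{@kind}\"><xsl:value-of select=\".\"/></span>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Apply a stylesheet to in-memory XML and return the resulting HTML. */
    static String toHtml(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter html = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml(XML, XSLT));
    }
}
```

Swapping the stylesheet while keeping the same presentation XML is what distinguishes a generic rendering from a custom one in this kind of pipeline.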

    genxml-all

    This target generates all demo XML files except .beans.xml files. The target genxml-all-with-beans generates the content of the xmlout directory in the same form as in the "xmlout" version of the package, but it also recompiles the ETL sources.

    genxml

    This is a parameterized target that generates .s.xml , .p.xml , .html , and .o.html files from the specified directory. The target uses the generic stylesheet.

    genxml-c

    This is a parameterized target that generates .p.xml and .html files from the specified directory. The target uses an explicitly specified stylesheet and can add an additional suffix.

    genxml-imports-beans

    This target demonstrates how to generate .beans.xml files.
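For readers unfamiliar with the .beans.xml form, the following sketch shows what serializing a bean with java.beans.XMLEncoder looks like in plain Java. It does not invoke ETL2Beans or the ant target; the Statement bean and the class BeansXmlSketch are hypothetical stand-ins for beans produced by a parser.

```java
import java.beans.XMLEncoder;
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

public class BeansXmlSketch {
    /** A hypothetical bean standing in for a parsed AST node. */
    public static class Statement {
        private String text;

        public Statement() {
        }

        public String getText() {
            return text;
        }

        public void setText(String text) {
            this.text = text;
        }
    }

    /** Serialize one bean to XML text with java.beans.XMLEncoder. */
    static String toXml(Object bean) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        XMLEncoder encoder = new XMLEncoder(out);
        encoder.writeObject(bean);
        encoder.close(); // flushes the stream and finishes the XML document
        try {
            return out.toString("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Statement statement = new Statement();
        statement.setText("hello");
        System.out.println(toXml(statement));
    }
}
```

XMLEncoder writes only properties that differ from those of a freshly constructed bean, so the bean class needs to be public and have a public no-argument constructor.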

  6. The *.jar files in the root directory are the parser library and utility jars:

    etl-parser.jar

    The core parser library. The library has dependencies only on standard JDK 1.5 classes. This is the only required library for ETL.

    etl-xml.jar

    This library contains utilities that convert ETL sources to different XML forms. These utilities are useful for debugging. The library depends on the OASIS catalog support provided in lib/resolver.jar and on a StAX parser.

    etl-tests.jar

    This library contains compiled tests. It is used by the genxml-imports-beans ant target.

  7. The directory lib contains libraries used in the project.

    lib/resolver.jar

    This is the Apache OASIS catalog support library used by the ETL2XML utilities.

    lib/stax/*.jar

    These are the StAX implementation libraries that are needed on Java 5. They are used by the ETL2XML utility. The Woodstox StAX implementation is used. On Java 6, the bundled implementation can be used instead.
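As a rough idea of what writing XML through StAX looks like, the sketch below uses the standard javax.xml.stream API, which is bundled with Java 6 and later and supplied by a separate implementation on Java 5. The element and attribute names (object, property, type, name) and the class StaxEventsSketch are assumptions for illustration only; they are not the format produced by ETL2XML.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class StaxEventsSketch {
    /** Write one hypothetical object with one property as XML text. */
    static String write() {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("object");
            writer.writeAttribute("type", "Greeting");
            writer.writeStartElement("property");
            writer.writeAttribute("name", "text");
            writer.writeCharacters("hello"); // the value inside the property
            writer.writeEndElement(); // closes property
            writer.writeEndElement(); // closes object
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Cannot write XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(write());
    }
}
```

The code depends only on the javax.xml.stream interfaces, so the same source works with either the bundled implementation or a library such as Woodstox on the classpath.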

4. Possible Next Steps

You are welcome to try the framework and to report bugs, usability problems, or other issues to http://sourceforge.net/tracker2/?func=browse&group_id=153075&atid=786315 . If you have comments of another nature about the project, please post them at the project forum. I will also track the places where I have posted the announcement for some time.

As a sample project, you could try defining your own DSL, reading it into JavaBeans, and generating code from it or interpreting it directly. The field model and JavaBeans are best suited to cases where the domain model is small. It is also relatively easy to define a custom AST model using the existing abstract tree parser. There is also an EMF-based AST parser that is available from the project SVN.

The calculator tutorial will lead you through the basic steps required to create your own DSL using ETL. More advanced tutorials will be supplied in later versions of the framework.

Note that the framework does not yet provide support for creating text editors. This use case will be supported in later versions of the framework. It is already known how to support it, but doing so requires some refactoring of the library core.