
[LONG][WML] Mp4h parsing: summary



Hi folks,

This post tries to summarize how tags can be parsed, so that you can
decide how it should be done.

Tags begin with a left-angle bracket ("<" sign), followed by a succession
of characters (the tag name), attributes, and a right-angle bracket (">"
sign). We suppose here that tag names match the "[a-z][a-z*]*" regular
expression (this is of course a simplified expression).

When it finds a "<" character, the parser reads the next characters and
decides which category the token belongs to:
  1. Mp4h tags
  2. Complex HTML tag
  3. Simple HTML tag
  4. Embedded HTML tag
  5. Masqueraded HTML tag
  6. Not a tag.

These categories need some explanation:
  1. Mp4h tags
     There is no need to distinguish simple and complex tags, because
     this feature has been determined at definition time.

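  2. Complex HTML tag
     An HTML tag with a body and a matching end tag, e.g. <b>foo</b>.

  3. Simple HTML tag
     An HTML tag without a body, e.g. <br>.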

  4. Embedded HTML tag
     This is an HTML tag, but it is embedded into another programming
     language (e.g. javascript, ePerl,...) and thus cannot be parsed.

  5. Masqueraded HTML tag
     In ePerl and other programming languages, the less-than sign is
     part of the syntax.  For instance it is valid to write
        open (IN, "<foo ") or die "File foo not found";
     or
        print <<EOT
        hello world!
        EOT
        ;
     Mp4h believes that these are respectively <FOO> and <EOT> tags
     if these tags have been defined.  There is no way to automatically
     prevent mp4h from parsing these expressions, so they *must* be
     protected against mp4h expansion.

  6. Not a tag.
     The "<" sign does not begin a tag, because it is not followed
     by a letter (like "<:", "<?", "<!--", etc.)
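
As a rough illustration of this first decision, here is a minimal Python
sketch. It is not mp4h code: the name classify, the defined_tags set and
the returned labels are invented for the example, and it relies on the
simplified tag-name expression given above. It only separates category 1,
category 6 and "some HTML tag"; categories 4 and 5 depend on the
surrounding language and cannot be detected this way, and end tags such
as </b> are not handled.

```python
import re

# Simplified tag-name pattern quoted earlier in this post.
TAG_NAME = re.compile(r"[a-z][a-z*]*")

def classify(text, pos, defined_tags):
    """Classify the token starting at text[pos], which must be '<'.

    Hypothetical helper: returns 'mp4h', 'html' or 'not-a-tag'.
    """
    if text[pos] != "<":
        raise ValueError("expected '<' at position %d" % pos)
    match = TAG_NAME.match(text, pos + 1)
    if match is None:
        # '<' is not followed by a letter: '<:', '<?', '<!--', ...
        return "not-a-tag"
    if match.group(0) in defined_tags:
        return "mp4h"
    # Simple, complex, embedded or masqueraded: undecidable from here.
    return "html"
```

With defined_tags containing "foo", the Perl line open (IN, "<foo ") is
reported as an mp4h tag, which is exactly the masquerading problem of
category 5.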

Now the question is: how do we want these tags to be parsed?
Below I explain roughly how it was done in WML 2.0.0 and WML 2.0.1 and
what I suggest for WML 2.0.2, and give some pros and cons of each
approach. After that you have to decide which one (or another one) best
fits our needs.

There is a consensus on cases 1 and 6: mp4h tags are parsed and expanded,
whereas "invalid tags" are simply printed as text.
There has been no consensus on case 5 yet, but there is no alternative
IMO, and I will not discuss this point in this message. Feel free to do
so, but you need killer arguments ;-)

A) WML 2.0.0
============

  2. Complex HTML tag
     The less-than sign has no special meaning; it is printed like any
     other character.

  3. Simple HTML tag
     The less-than sign has no special meaning; it is printed like any
     other character.

  4. Embedded HTML tag
     The less-than sign has no special meaning; it is printed like any
     other character.

PROS:
  * Embedded HTML tags do not require any special treatment; just write
       document.write('<img src="foo.png" alt="');
       document.write(text+'">');

CONS:
  * HTML tags must be quoted
wrong: <ifeq 0 0 <b>foo</b>>
                   ^
               Close <ifeq
right: <ifeq 0 0 "<b>foo</b>">

wrong: <ifeq 0 0
          <group <a href="foo.html">Click here</a> now>>
                                   ^             ^
                             Close <group     Close <ifeq
right: <ifeq 0 0
          <group "<a href=\"foo.html\">Click here</a>" now>>

  * Because of this need for quotes, inner quotes must be escaped
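
The "wrong" examples fail because, inside a tag, "<" is an ordinary
character and the first unquoted ">" ends the tag. A minimal Python
sketch of that rule follows. It is an illustration only, not WML 2.0.0
code: find_tag_end is a made-up name, the handling of backslash-escaped
quotes is an assumption based on the "right" example above, and nested
mp4h tags such as <group> are ignored although a real parser must track
them.

```python
def find_tag_end(text, start=0):
    """Return the index of the '>' that closes the tag opened at text[start].

    Inside the tag, '<' is an ordinary character; only double quotes
    (with backslash escapes) can hide a '>'.  Returns -1 if unclosed.
    """
    in_quotes = False
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            i += 2          # skip the escaped character
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == ">" and not in_quotes:
            return i
        i += 1
    return -1

wrong = '<ifeq 0 0 <b>foo</b>>'
right = '<ifeq 0 0 "<b>foo</b>">'
print(wrong[:find_tag_end(wrong) + 1])   # <ifeq 0 0 <b>
print(right[:find_tag_end(right) + 1])   # the whole tag
```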

B) WML 2.0.1
============

  2. Complex HTML tag
     This tag is parsed like a simple tag, i.e. attributes are read
     and other characters are ordinary text.

  3. Simple HTML tag
     The parser reads attributes.

  4. Embedded HTML tag
     These tags are expanded as in 2 or 3, and so must be protected
     against expansion when necessary.

PROS:
  * Better respects the structure of the document.

CONS:
  * Complex HTML tags are split into more than one token, which
    implies that parsing may differ depending on whether the tag is
    defined or not.

  * Embedded HTML tags must be protected, for instance
       document.write('<img src="foo.png" alt="');
       document.write(text+'">');
    is replaced by
       document.write('<'+'img src="foo.png" alt="');
       document.write(text+'">');
    All tags belonging to this category must be changed into category 6
    with appropriate fixes.

  * Quoting has not been well defined; I tried to be compatible with
    Meta-HTML when it was impossible.

C) Suggestions
==============

  2. Complex HTML tag
     The parser reads attributes and body.

  3. Simple HTML tag
     The parser reads attributes. To let mp4h know this tag is simple,
     it must conform to the XHTML standard and contain a trailing slash.

  4. Embedded HTML tag
     When this is a complex HTML tag but appears as a simple tag, e.g.
       <navbar:header><table><tr></navbar:header>
     a trailing star marks this tag as a non-parsable complex HTML tag
       <navbar:header><table*><tr*></navbar:header>
     When it is simple, there is no problem because of the trailing slash
       <navbar:epilog><br /></navbar:epilog>
     When this tag cannot be parsed, it must be protected with a
     leading star, which will make it fall into category 6.
       document.write('<*img src="foo.png" alt="');
       document.write(text+'">');
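
To make the three markers of this proposal concrete, here is a small
Python sketch of how one "<...>" token could be sorted under scheme C.
It is only one reading of the rules above, not an implementation: the
name classify_c and the returned labels are invented, the token is
assumed to be already cut out up to its closing ">", and the trailing
star is assumed to sit directly after the tag name.

```python
def classify_c(token):
    """Sort one '<...>' token according to the markers of scheme C.

    Hypothetical labels:
      'text'             leading star, falls into category 6
      'simple'           trailing slash, XHTML style: <br />
      'unparsed-complex' trailing star on the name: <table*>
      'complex'          anything else: attributes and body are read
    """
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("not a complete tag token: %r" % token)
    inner = token[1:-1]
    if inner.startswith("*"):
        return "text"
    if inner.rstrip().endswith("/"):
        return "simple"
    parts = inner.split(None, 1)
    name = parts[0] if parts else ""
    if name.endswith("*"):
        return "unparsed-complex"
    return "complex"

for tok in ("<table>", "<table*>", "<br />", "<*img src=\"foo.png\">"):
    print(tok, "->", classify_c(tok))
```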

PROS:
  * Best respects the structure of the document.
  * Fewer side effects, syntax cannot be broken by defining mp4h tags.

CONS:
  * Embedded HTML tags must be protected, for instance
       document.write('<img src="foo.png" alt="');
       document.write(text+'">');
    is replaced by
       document.write('<*img src="foo.png" alt="');
       document.write(text+'">');
    All tags belonging to this category must be changed into category 6
    with appropriate fixes.

  * All HTML tags must be protected in templates and simple tags must
    contain a trailing slash.


Note About Quoting:
===================

I did not want to explain how quoting is performed, because it is a
complex task. If you want to play with Meta-HTML and mp4h quoting,
consider these lines:
  <img url="foo.png" alt="This is $ref->{'name'}" /> foo bar
  print "<img url="foo.png" alt="This is $ref->{'name'}" /> foo bar"
  <group "<img url="foo.png" alt="This is $ref->{'name'}" /> foo bar" />
  <group <img url="foo.png" alt="This is $ref->{'name'}" /> foo bar />
  <ifeq 0 0 "<img url="foo.png" alt="This is $ref->{'name'}" /> foo bar" />
and the same lines when escaping quotes.

It is quite hard to provide a robust syntax, but here is how it could
be done with the suggested scheme above:

   Quotes should never be escaped, unless they must be escaped in output.

So in the lines above, only the 2nd must be changed to
 print "<img url=\"foo.png\" alt=\"This is $ref->{'name'}\" /> foo bar"
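
The only escaping left under this rule is the one the output language
itself requires. As a small illustration, the Python sketch below
produces the escaped form of the 2nd line; escape_double_quotes is a
made-up helper, and it assumes the text contains no backslashes that
would themselves need escaping.

```python
def escape_double_quotes(s):
    """Backslash-escape every double quote in s (hypothetical helper)."""
    return s.replace('"', '\\"')

html = '<img url="foo.png" alt="This is $ref->{\'name\'}" /> foo bar'
# Wrap the escaped text in a double-quoted Perl string, as in line 2.
print('print "%s"' % escape_double_quotes(html))
```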

-- 
Denis Barbier
WML Maintainer

______________________________________________________________________
Website META Language (WML)                www.engelschall.com/sw/wml/
Official Support Mailing List                   sw-wml@engelschall.com
Automated List Manager                       majordomo@engelschall.com