[LONG][WML] Mp4h parsing: summary
- From: Denis Barbier <nospam@thanx>
- Date: Fri, 2 Jun 2000 16:42:27 +0200 (CET)
Hi folks,
this post tries to give clear hints about tag parsing to let you decide
how this must be done.
Tags begin with a left-angle bracket ("<" sign), followed by a succession of
characters (tag name), attributes, and a right-angle bracket (">" sign).
We suppose here that tag names match the "[a-z][a-z*]*" regular
expression (this is of course a simplified expression).
On finding a "<" character, the parser reads the following characters and
decides which category the token belongs to:
1. Mp4h tags
2. Complex HTML tag
3. Simple HTML tag
4. Embedded HTML tag
5. Masqueraded HTML tag
6. Not a tag.
These categories need some explanations:
1. Mp4h tags
There is no need to distinguish simple and complex tags, because
this feature has been determined at definition time.
2. Complex HTML tag
An HTML tag with a body and a closing tag, e.g. <b>foo</b>.
3. Simple HTML tag
An HTML tag without a body, e.g. <br>.
4. Embedded HTML tag
This is an HTML tag, but it is embedded into another programming
language (e.g. javascript, ePerl,...) and thus cannot be parsed.
5. Masqueraded HTML tag
In ePerl or other programming languages, the less-than sign ("<") is part
of the syntax. For instance it is valid to write
open (IN, "<foo ") or die "File foo not found";
or
print <<EOT
hello world!
EOT
;
Mp4h believes that these are respectively <FOO> and <EOT> tags
if these tags have been defined. There is no way to automatically
prevent mp4h from parsing these expressions, so they *must* be
protected against mp4h expansion.
6. Not a tag.
The "<" sign does not begin a tag, because it is not followed
by a letter (like "<:", "<?", "<!--", etc.)
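The first decision described above (is this "<" a tag at all, and if so
is its name a defined mp4h tag?) can be sketched in Python. This is only
an illustration, not mp4h's actual code: the function name, the `defined`
set and the case-insensitive matching are assumptions (the latter guessed
from the <EOT> example). Telling categories 2, 3 and 4 apart needs more
than the tag name, so the sketch lumps them together as "html tag".

```python
import re

# Simplified tag-name pattern quoted in the post; case-insensitivity
# is an assumption made for this sketch.
TAG_NAME = re.compile(r"[a-z][a-z*]*", re.IGNORECASE)

def tag_candidates(text, defined):
    """Return (position, name, kind) for every "<" found in text.

    kind is "not a tag" when no tag name follows the "<",
    "mp4h tag" when the lowercased name is in `defined`,
    and "html tag" otherwise.
    """
    found = []
    for lt in re.finditer("<", text):
        m = TAG_NAME.match(text, lt.end())
        if m is None:
            found.append((lt.start(), None, "not a tag"))
            continue
        name = m.group().lower()
        kind = "mp4h tag" if name in defined else "html tag"
        found.append((lt.start(), name, kind))
    return found

# The two Perl fragments from category 5, with <foo> and <eot> defined.
perl = 'open (IN, "<foo ") or die "File foo not found";\nprint <<EOT\nhello world!\nEOT\n;'
for pos, name, kind in tag_candidates(perl, {"foo", "eot"}):
    print(pos, name, kind)
```

With both tags defined, the "<foo" and the second "<" of "<<EOT" are
reported as mp4h tags, which is exactly why such code must be protected.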
Now the question is: how do we want these tags to be parsed?
Below I roughly explain how it was done in WML 2.0.0 and WML 2.0.1 and
what I suggest for WML 2.0.2, with some pros and cons of each approach.
After that you have to decide which one (or another one) best fits
our needs.
There is a consensus on cases 1 and 6: mp4h tags are parsed and expanded,
whereas "invalid tags" are simply printed as text.
There has been no consensus on case 5 yet, but IMO there is no
alternative, and I will not discuss this point in this message. Feel free
to do so, but you will need killer arguments ;-)
A) WML 2.0.0
============
2. Complex HTML tag
The less-than sign has no special meaning; it is printed like any
other character.
3. Simple HTML tag
The less-than sign has no special meaning; it is printed like any
other character.
4. Embedded HTML tag
The less-than sign has no special meaning; it is printed like any
other character.
PROS:
* embedded HTML tags do not require any special treatment, just write
document.write('<img src="foo.png" alt="');
document.write(text+'">');
CONS:
* HTML tags must be quoted
wrong: <ifeq 0 0 <b>foo</b>>
                   ^
                   Close <ifeq
right: <ifeq 0 0 "<b>foo</b>">
wrong: <ifeq 0 0
<group <a href="foo.html">Click here</a> now>>
                         ^             ^
                         Close <group  Close <ifeq
right: <ifeq 0 0
<group "<a href=\"foo.html\">Click here</a>" now>>
* Because of this need for quotes, inner quotes must be escaped.
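To see why the unquoted forms go wrong, here is a rough Python model of
the 2.0.0 behaviour as described above: only defined mp4h tags open a
nesting level, any unquoted ">" closes the innermost open mp4h tag, and
a backslash escapes the next character inside double quotes. It is a
sketch under those assumptions, not the real parser; `closing_positions`
and the `defined` set are names made up for the illustration.

```python
import re

TAG_NAME = re.compile(r"[a-z][a-z*]*")  # simplified pattern from the post

def closing_positions(text, defined):
    """Return [(tag_name, index of the ">" that closes it), ...]."""
    stack, closed = [], []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == "\\":          # escaped character: skip it entirely
                i += 2
                continue
            if ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
        elif ch == "<":
            m = TAG_NAME.match(text, i + 1)
            if m and m.group() in defined:
                stack.append(m.group())   # only mp4h tags are tracked
        elif ch == ">" and stack:
            closed.append((stack.pop(), i))
        i += 1
    return closed

defined = {"ifeq", "group"}
# Unquoted: the ">" of <b> (index 12) already closes <ifeq.
print(closing_positions('<ifeq 0 0 <b>foo</b>>', defined))
# Quoted: <ifeq is closed by the final ">".
print(closing_positions('<ifeq 0 0 "<b>foo</b>">', defined))
```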
B) WML 2.0.1
============
2. Complex HTML tag
This tag is parsed like a simple tag, i.e. attributes are read
and other characters are ordinary text.
3. Simple HTML tag
Parser reads attributes.
4. Embedded HTML tag
These tags are expanded as in 2 or 3, and so must be protected
against expansion when necessary.
PROS:
* Better respect of the structure of the document.
CONS:
* Complex HTML tags are split into more than one token, which
means that parsing may differ depending on whether the tag is defined.
* Embedded HTML tags must be protected, for instance
document.write('<img src="foo.png" alt="');
document.write(text+'">');
is replaced by
document.write('<'+'img src="foo.png" alt="');
document.write(text+'">');
All tags belonging to this category must be changed into category 6
with appropriate fixes.
* Quoting has not been well defined; I tried to be compatible with
Meta-HTML when it was impossible.
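If many scripts need the '<'+'img' treatment, the rewrite could be
automated. The helper below is a hypothetical sketch (the name
`protect_js` is made up here), and it assumes every "<" followed by a
letter sits inside a single-quoted JavaScript string literal; it would
produce broken code for double-quoted strings or for "<" used as an
operator directly before an identifier.

```python
import re

def protect_js(line):
    """Split '<tag' into '<'+'tag' so that "<" is no longer followed by a letter."""
    return re.sub(r"<(?=[a-z])", "<'+'", line)

print(protect_js("document.write('<img src=\"foo.png\" alt=\"');"))
print(protect_js("document.write(text+'\">');"))  # nothing to protect here
```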
C) Suggestions
==============
2. Complex HTML tag
Parser reads attributes and body.
3. Simple HTML tag
Parser reads attributes. To let mp4h know this tag is simple,
it must conform to the XHTML standard and contain a trailing slash.
4. Embedded HTML tag
When this is a complex HTML tag but appears as a simple tag, e.g.
<navbar:header><table><tr></navbar:header>
a trailing star marks this tag as a non-parsable complex HTML tag
<navbar:header><table*><tr*></navbar:header>
When it is simple, there is no problem because of the trailing slash
<navbar:epilog><br /></navbar:epilog>
When this tag cannot be parsed, it must be protected with a
leading star, which will make it fall into category 6.
document.write('<*img src="foo.png" alt="');
document.write(text+'">');
PROS:
* Best respect of the structure of the document.
* Fewer side effects, syntax cannot be broken by defining mp4h tags.
CONS:
* Embedded HTML tags must be protected, for instance
document.write('<img src="foo.png" alt="');
document.write(text+'">');
is replaced by
document.write('<*img src="foo.png" alt="');
document.write(text+'">');
All tags belonging to this category must be changed into category 6
with appropriate fixes.
* All HTML tags must be protected in templates and simple tags must
contain a trailing slash.
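The markers of the suggested scheme can be summarised in a few lines of
Python. This is a sketch of the proposal only, not an implementation:
`classify_suggested` is a hypothetical name, it receives one complete
tag as a string, it ignores mp4h tags (which are known from their
definitions), and it reuses the simplified name pattern from the top of
this post, in which a trailing star counts as part of the name.

```python
import re

TAG_NAME = re.compile(r"[a-z][a-z*]*")  # simplified pattern from the post

def classify_suggested(tag):
    """Classify one HTML tag string under the suggested markers."""
    if not tag.startswith("<"):
        return "not a tag"
    m = TAG_NAME.match(tag, 1)
    if m is None:
        # Covers the leading star (<*img ...) as well as <:, <?, <!--
        return "not a tag"
    if m.group().endswith("*"):
        return "unparsed complex HTML tag"   # e.g. <table*>
    if tag.rstrip().endswith("/>"):
        return "simple HTML tag"             # e.g. <br />
    return "complex HTML tag"                # attributes and body are read

for t in ['<br />', '<table*>', '<*img src="foo.png" alt="', '<table>']:
    print(t, "->", classify_suggested(t))
```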
Note About Quoting:
===================
I did not want to explain how quoting is performed, because it is a
complex task. If you want to play with Meta-HTML and mp4h quoting,
consider these lines:
<img url="foo.png" alt="This is $ref->{'name'}" /> foo bar
print "<img url="foo.png" alt="This is $ref->{'name'}" /> foo bar"
<group "<img url="foo.png" alt="This is $ref->{'name'}" /> foo bar" />
<group <img url="foo.png" alt="This is $ref->{'name'}" /> foo bar />
<ifeq 0 0 "<img url="foo.png" alt="This is $ref->{'name'}" /> foo bar" />
and the same lines when escaping quotes.
It is quite hard to provide a robust syntax, but here is how it could
be done with the scheme suggested above:
Quotes should never be escaped, unless they must be escaped in the output.
So in the lines above only the 2nd must be changed to
print "<img url=\"foo.png\" alt=\"This is $ref->{'name'}\" /> foo bar"
--
Denis Barbier
WML Maintainer
______________________________________________________________________
Website META Language (WML) www.engelschall.com/sw/wml/
Official Support Mailing List sw-wml@engelschall.com
Automated List Manager majordomo@engelschall.com