#2805 Invalid HTML

SlimerDude Wed 8 Jul 2020

Hi,

I know this is pedantic; I mention it only because it may be affecting SEO and the like...

The fantom.org website is declared with an XHTML DOCTYPE:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

But the <head> contains a <meta> void tag that is never closed, which is invalid XML.

fansh> xml::XParser(web::WebClient(`https://fantom.org/`).getStr.replace("&ndash;", "-").in).parseDoc

xml::XErr: Expecting end of element 'meta' (start line 7) [line 21, col 3]
  xml::XParser.err (XParser.fan:986)
  xml::XParser.parseElemEnd (XParser.fan:505)
  xml::XParser.next (XParser.fan:174)
  xml::XParser.parseElem (XParser.fan:69)
  xml::XParser.parseDoc (XParser.fan:46)
  xml::XParser.parseDoc (XParser.fan)
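
For illustration, the same kind of failure can be reproduced with any strict XML parser. Below is a minimal sketch in Python rather than Fantom; the markup is a made-up sample (not copied from fantom.org) and `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, not the real fantom.org <head>.
UNCLOSED = "<head><meta charset='utf-8'><title>t</title></head>"
SELF_CLOSED = "<head><meta charset='utf-8'/><title>t</title></head>"


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(UNCLOSED))     # False: <meta> is still open at </head>
print(is_well_formed(SELF_CLOSED))  # True: <meta/> closes itself
```

An HTML parser accepts the first form because `meta` is a void element there, while an XML parser has no such notion and expects every element to be closed.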

Beyond this, there also seems to be a missing </div> end tag somewhere, which makes the page invalid even as HTML.

fansh> xml::XParser(web::WebClient(`https://fantom.org/`).getStr.replace("&ndash;", "-").replace("1.0'", "1.0'/").in).parseDoc

xml::XErr: Expecting end of element 'div' (start line 51) [line 98, col 3]
  xml::XParser.err (XParser.fan:986)
  xml::XParser.parseElemEnd (XParser.fan:505)
  xml::XParser.next (XParser.fan:174)
  xml::XParser.parseElem (XParser.fan:69)
  xml::XParser.parseDoc (XParser.fan:46)
  xml::XParser.parseDoc (XParser.fan)

It was noticed by a colleague who was using the Fantom site to test a bug fix to HTML Parser.

matthew Fri 10 Jul 2020

Thanks for reporting. I'll look at fixing both these things next week.

matthew Mon 13 Jul 2020

@SlimerDude - I switched the doctype to the HTML5 doctype, so you won't be able to parse it as XML anymore. I also fixed the unmatched <div> tag.

SlimerDude Tue 14 Jul 2020

Hi Matthew, cool, the homepage is looking good and parsing nicely!

However, parsing this page, and any other forum page, gives:

sys::ParseErr: End tag </span> does not match start tag <h2>

<h2 onclick='Sidewalk.toggleComment(this);'>
  <span style='color:#800000;'>matthew</span>
  </span>
  <span title='13-Jul-2020 Mon 18:02:07 UTC'>Yesterday</span>
</h2>

Seems there's an extraneous span end tag making an appearance.

matthew Tue 14 Jul 2020

I'm sure these are not the only cases of unbalanced tags. I pushed a fix to the website, but I might suggest that improperly formatted HTML is a very common thing in general, and that your HtmlParser should be able to continue parsing more gracefully in the face of issues like this. If you find an unexpected </end> tag, log it as an error and keep parsing - or maybe have an optional strict mode or something.
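
One way to read this suggestion is sketched below, in Python rather than Fantom. `TagBalanceChecker` and its `strict` flag are hypothetical names made up for illustration; they are not part of HtmlParser or any Fantom API. The sketch only tracks tag balance on top of Python's standard `html.parser` and does not build a document tree.

```python
from html.parser import HTMLParser

# HTML void elements never take an end tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical checker: records unbalanced tags instead of stopping."""

    def __init__(self, strict: bool = False):
        super().__init__()
        self.strict = strict
        self.stack = []    # currently open (non-void) elements
        self.errors = []   # problems collected in lenient mode

    def _report(self, msg: str) -> None:
        # Strict mode fails fast; lenient mode logs and carries on.
        if self.strict:
            raise ValueError(msg)
        self.errors.append(msg)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        elif tag in self.stack:
            # Some inner elements were never closed: report each, then close.
            while self.stack[-1] != tag:
                self._report(f"Missing end tag </{self.stack.pop()}>")
            self.stack.pop()
        else:
            self._report(f"Unexpected end tag </{tag}>")


# The forum markup quoted earlier in this thread.
SNIPPET = """
<h2 onclick='Sidewalk.toggleComment(this);'>
  <span style='color:#800000;'>matthew</span>
  </span>
  <span title='13-Jul-2020 Mon 18:02:07 UTC'>Yesterday</span>
</h2>
"""

checker = TagBalanceChecker()
checker.feed(SNIPPET)
checker.close()
print(checker.errors)  # ['Unexpected end tag </span>']
```

Constructing the checker with `strict=True` raises `ValueError` at the first problem instead, which is roughly the optional strict mode described above.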

SlimerDude Thu 16 Jul 2020

Don't worry @matthew, I'll not bug you again on this matter! :)
