htmlparser-developer Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Derrick O. <Der...@Ro...> - 2003-12-08 00:17:01
The 'lexer integration' subject line is wearing a little thin, since it's been a while since the lexer integration issues were completed, so from now on I'll try to label it appropriately.

I've removed the scanners that didn't do anything anymore, leaving the script and JSP scanners. Instead of registering a scanner to enable returning a specific tag, you now add a tag to a new class called PrototypicalNodeFactory. These 'prototype' tags are cloned as needed to be returned from the parser.

All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDOMScanners()', so tags are fully nested. This is different behaviour, so you will need to recurse into returned nodes to get at what you want. If you want to return only some of the derived tags while keeping most as generic tags and a flatter structure, there are various constructors and manipulators on the factory. See the javadocs and the examples in the tests package.

Nearly all the old scanner tests are folded into the tag tests.

I've changed the operation of toString() for CompositeTags. It now returns an indented listing of children, so the output from the Parser mainline looks better.

TODO
=====

1.3.1
------
It looks like there are enough bugs and requests to warrant another 1.3 point release with some patched files. I hate to work on a branch, but it may be the only way to get everyone off my back.

Filters
-------
Implement the new filtering mechanism for NodeList.searchFor().

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

Augment Lexer State Machines
----------------------------------------
There are some changes needed in the lexer state machines to handle JSP constructs and also whitespace either side of attribute equals signs.
Currently the latter is handled by a kludgy fixAttributes() method applied after a tag is parsed, but it would be better handled in the state machine initially. The former isn't handled at all, and would involve all nodes possibly having children (a remark or string node can have embedded JSP, e.g. <!-- this remark, created on <%@ date() %>, needs to be handled -->). So some design work needs to be done to analyze the state transitions and gating characters.

toHtml(verbatim/fixed)
-----------------------------
One of the design goals for the new Lexer subsystem was to be able to regurgitate the original HTML via the toHtml() method, so the original page is unmodified except for any explicit user edits, e.g. link URL edits. But the parser fixes broken HTML without asking, so you can't get back an unadulterated page from toHtml(). A lot of test cases assume fixed HTML. Either a parameter on toHtml() or another method would be needed to provide the choice of the original HTML or the fixed HTML. There's some initial work on eliminating the added virtual end tags commented out in TagNode, but it will also require a way to remember broken tags, like ...<title>The Title</title</head><body>...

GUI Parser Tool
---------------------
Some GUI-based parser application showing the HTML parse tree in one panel and the HTML text in another, with the selected tree node being highlighted in the text, or the text cursor setting the selected tree node, would be really good. A filter builder tool to graphically construct a program to extract a snippet from an HTML page would blow people away.

Applications
-----------
Rework all the applications for a better 'out of the box' experience for new and novice users. Fix all the scripts in /bin (for unix and windows) and add any others that don't exist already.

Clean Up
------------
The integration process needs to be revamped to make use of the $Name: CVS substitution, so a checkin isn't required for every integration.
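
For the attribute item under Augment Lexer State Machines above, here is a minimal sketch of how whitespace on either side of the equals sign might be absorbed by the state machine itself rather than patched up afterwards. It is not the actual Lexer code: the class AttributeLexerSketch, its State enum and the parse() method are hypothetical names made up for illustration, and the input is assumed to be only the attribute portion of a tag, with no JSP handling at all.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of the HTML Parser library.
final class AttributeLexerSketch {
    private enum State { BEFORE_NAME, IN_NAME, AFTER_NAME, BEFORE_VALUE, IN_QUOTED, IN_UNQUOTED }

    // Parses e.g.  href = "a.html" class=big checked  into name/value pairs.
    // Attributes without a value are stored with a null value.
    static Map<String, String> parse(String text) {
        Map<String, String> attributes = new LinkedHashMap<>();
        StringBuilder name = new StringBuilder();
        StringBuilder value = new StringBuilder();
        State state = State.BEFORE_NAME;
        char quote = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean space = Character.isWhitespace(c);
            switch (state) {
                case BEFORE_NAME:
                    if (!space) { name.append(c); state = State.IN_NAME; }
                    break;
                case IN_NAME:
                    if (c == '=') state = State.BEFORE_VALUE;
                    else if (space) state = State.AFTER_NAME;   // may still be followed by '='
                    else name.append(c);
                    break;
                case AFTER_NAME:
                    if (c == '=') {
                        state = State.BEFORE_VALUE;             // whitespace before '=' absorbed
                    } else if (!space) {
                        attributes.put(name.toString(), null);  // previous name had no value
                        name.setLength(0);
                        name.append(c);
                        state = State.IN_NAME;
                    }
                    break;
                case BEFORE_VALUE:
                    if (c == '"' || c == '\'') { quote = c; state = State.IN_QUOTED; }
                    else if (!space) { value.append(c); state = State.IN_UNQUOTED; }
                    break;                                      // whitespace after '=' absorbed
                case IN_QUOTED:
                    if (c == quote) { store(attributes, name, value); state = State.BEFORE_NAME; }
                    else value.append(c);
                    break;
                case IN_UNQUOTED:
                    if (space) { store(attributes, name, value); state = State.BEFORE_NAME; }
                    else value.append(c);
                    break;
            }
        }
        // Flush whatever was in progress when the input ended.
        if (state == State.IN_NAME || state == State.AFTER_NAME) {
            attributes.put(name.toString(), null);
        } else if (state != State.BEFORE_NAME) {
            store(attributes, name, value);
        }
        return attributes;
    }

    private static void store(Map<String, String> attributes, StringBuilder name, StringBuilder value) {
        attributes.put(name.toString(), value.toString());
        name.setLength(0);
        value.setLength(0);
    }

    public static void main(String[] args) {
        // prints {href=a.html, class=big, checked=null}
        System.out.println(parse("href = \"a.html\" class=big checked"));
    }
}
```

The extra AFTER_NAME state is what makes a separate fix-up pass unnecessary in this sketch: whitespace after a name is held as undecided until the next character shows whether it is an equals sign or the start of another attribute.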

Block/Inline
----------------
The tag-enders and end-tag-enders lists only partially implement the HTML specification's rules for block and inline tags. By ensuring block tags don't overlap, a better parsing job could be done, e.g.

<FORM> .... <TABLE> ... </FORM></TABLE>

would be rearranged as

<FORM> .... <TABLE> ... </TABLE></FORM>

This needs some design work.
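
One possible shape for that rearrangement is a stack of open tags, sketched below. This is only an illustration under stated assumptions, not a design for the parser: NestingRepairSketch and repair() are hypothetical names, the input is assumed to be already split into bare tag tokens (no attributes) and text tokens, and every tag is treated as a block tag, which sidesteps the block versus inline distinction that the real work would have to address.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch, not part of the HTML Parser library.
final class NestingRepairSketch {
    // Returns the tokens with end tags moved or dropped so that tags never overlap.
    static List<String> repair(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();   // names of currently open tags, innermost first
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (!open.contains(name)) {
                    continue;                      // stray end tag: its element was already closed
                }
                String top;
                do {                               // close inner tags first, then the named one
                    top = open.pop();
                    out.add("</" + top + ">");
                } while (!top.equals(name));
            } else if (token.startsWith("<")) {
                open.push(token.substring(1, token.length() - 1));
                out.add(token);
            } else {
                out.add(token);                    // plain text passes through unchanged
            }
        }
        while (!open.isEmpty()) {                  // close anything left open at end of input
            out.add("</" + open.pop() + ">");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> broken = Arrays.asList("<FORM>", "text", "<TABLE>", "cell", "</FORM>", "</TABLE>");
        // prints [<FORM>, text, <TABLE>, cell, </TABLE>, </FORM>]
        System.out.println(repair(broken));
    }
}
```

In this sketch the misplaced </FORM> forces the inner TABLE closed first, and the later </TABLE> is dropped as a stray. Note that this silently changes the page, so it runs into the same original versus fixed HTML question raised under toHtml(verbatim/fixed).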