htmlparser-user Mailing List for HTML Parser

Hi

I have been using htmlparser 1.5 for a short while now and am getting 
pretty frustrated. :-) Not because it doesn't seem to be a good utility 
with a rich API, but because the documentation needed to get started 
with key issues is severely lacking. Much of my frustration seems to be 
due to the API's flexibility: the ability to do the same thing using 
different techniques. For someone getting started with the parser, 
comparing and picking between these techniques is difficult, and I 
implore the developers and users who have experience with the API to 
share their knowledge in relevant forums, add more demos and write a 
JavaWorld (or other) "getting started" article on basic usage. This 
would also give the htmlparser project more exposure among Java 
developers who are unaware of it.

Anyway, as the title says, I need to transform some HTML for 
presentation in another web site. To do this I need to remove some tags 
(and their children), alter some tags, remove some attributes, alter 
some attributes, etc.
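
To make the goal concrete, here is a minimal sketch of the kind of transformation I mean. It is written against plain Java, not htmlparser: the Element and Transformer classes are hypothetical stand-ins invented for this illustration, and text nodes are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical minimal element tree; NOT an htmlparser class.
final class Element {
    final String name;
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<Element> children = new ArrayList<>();

    Element(String name) {
        this.name = name;
    }

    Element attr(String key, String value) {
        attributes.put(key, value);
        return this;
    }

    Element add(Element child) {
        children.add(child);
        return this;
    }

    // Serialise this element and its subtree back to markup.
    String toHtml() {
        StringBuilder sb = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sb.append(' ').append(e.getKey())
              .append("=\"").append(e.getValue()).append('"');
        }
        sb.append('>');
        for (Element child : children) {
            sb.append(child.toHtml());
        }
        return sb.append("</").append(name).append('>').toString();
    }
}

// Hypothetical helper showing the two operations: drop unwanted tags
// together with their children, and strip unwanted attributes.
final class Transformer {
    static void clean(Element root, Set<String> unwantedTags,
                      Set<String> unwantedAttributes) {
        // Strip the unwanted attributes from this element.
        root.attributes.keySet().removeAll(unwantedAttributes);
        for (Iterator<Element> it = root.children.iterator(); it.hasNext(); ) {
            Element child = it.next();
            if (unwantedTags.contains(child.name)) {
                it.remove();            // the whole subtree goes with it
            } else {
                clean(child, unwantedTags, unwantedAttributes);
            }
        }
    }
}
```

With a body element that contains a script and a paragraph, calling Transformer.clean with "script" as an unwanted tag and "style" as an unwanted attribute should leave only the paragraph, minus its style attribute. What I am looking for is the idiomatic htmlparser way of getting the same effect.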

Up to now I have been following a simple procedure, using a combination 
of registering tag handlers and applying filters to achieve my goal. 
From what I have seen there seem to be many ways of retrieving tag 
content, but only one way to change tag content while parsing (by 
overriding doSemanticAction() in a tag subclass).

Ok, my code looks something like this:

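import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.NotFilter;
import org.htmlparser.tags.BodyTag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.MetaTag;
import org.htmlparser.tags.ScriptTag;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// register tag handlers for tags I know I want to handle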

PrototypicalNodeFactory factory = new PrototypicalNodeFactory();

factory.registerTag(new LinkTag() {
   public void doSemanticAction() throws ParserException {
     // do something with the Link tag or attributes
   }
});

factory.registerTag(new BodyTag() {
   public void doSemanticAction() throws ParserException {
     // do something with the body tag or attributes
   }
});

//factory.registerTag(new general tag that processes all tags
//                    even those that have specific handlers
//                    see text below

Parser parser = new Parser(url);
parser.setNodeFactory(factory);

// this step seems a bit ridiculous. why doesn't parser have a
// public NodeList getNodeList() method? Or am I missing something?

NodeList list = new NodeList();

for (NodeIterator iter = parser.elements(); iter.hasMoreNodes(); ) {
   list.add(iter.nextNode());
}

// get the body tag only. the input can be a complete html document
// or just an html fragment (containing a body tag)

NodeList body = list.extractAllNodesThatMatch(
   new NodeClassFilter(BodyTag.class), true);

// remove unwanted tags and their children

body.keepAllNodesThatMatch(new NotFilter(
      new NodeClassFilter(ScriptTag.class)), true);
body.keepAllNodesThatMatch(new NotFilter(
      new NodeClassFilter(MetaTag.class)), true);

// print the result for debugging

for (int i = 0; i < body.size(); i++) {
   System.out.print(body.elementAt(i).toHtml());
}

Does this look correct? It seems to get the result I want...

However, what I can't quite figure out is how to register a general tag 
listener that will take care of removing an attribute from all tags (or 
a subset of them), while still being able to register the more specific 
tag listeners at the same time. Is this possible? I tried to create a 
subclass of TagNode and override the doSemanticAction() method, but it 
didn't work: the method was never called.
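
To show the behaviour I am after, here is a rough sketch in plain Java. None of this is htmlparser API: SimpleTag and TagDispatcher are hypothetical names made up for the example, and the order (general handler first, then the specific one) is just my assumption about what would be convenient.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for a parsed tag: a name plus its attributes.
// This is NOT an htmlparser class; it only illustrates the dispatch idea.
final class SimpleTag {
    final String name;
    final Map<String, String> attributes = new LinkedHashMap<>();

    SimpleTag(String name) {
        this.name = name.toUpperCase(Locale.ROOT);
    }

    SimpleTag attr(String key, String value) {
        attributes.put(key, value);
        return this;
    }
}

// Hypothetical dispatcher: one general handler that sees every tag,
// plus optional handlers registered per tag name.
final class TagDispatcher {
    private final Map<String, Consumer<SimpleTag>> specific = new HashMap<>();
    private Consumer<SimpleTag> general = tag -> { };

    void setGeneralHandler(Consumer<SimpleTag> handler) {
        general = handler;
    }

    void registerHandler(String tagName, Consumer<SimpleTag> handler) {
        specific.put(tagName.toUpperCase(Locale.ROOT), handler);
    }

    // Run the general handler first, then the tag-specific one if present.
    void process(SimpleTag tag) {
        general.accept(tag);
        Consumer<SimpleTag> handler = specific.get(tag.name);
        if (handler != null) {
            handler.accept(tag);
        }
    }
}
```

For example, a general handler of `t -> t.attributes.remove("style")` combined with a handler registered for "a" that sets a target attribute would strip style from every tag and still apply the link-specific change. I would like to know whether the node factory (or some other part of the library) can be made to behave like this.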

Am I on the right track? Any ideas?

Regards
Lee Francis Wilhelmsen

-- 
Programs should be written for people to read, and only
incidentally for machines to execute.
-- Structure and Interpretation of Computer Programs, MIT Press
