htmlparser-user Mailing List for HTML Parser


Derrick,
Thanks again for your help. By the way, I didn't have much luck with the
threading issue. I figured upgrading would be a good course of action. I
think I am getting the hang of the new parser now...

Once again, thanks and that's two I owe you...

Steve

-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of
Derrick Oswald
Sent: Thursday, February 26, 2004 8:17 PM
To: htm...@li...
Subject: Re: [Htmlparser-user] Getting title tag text

Steve,

I think you've hit on the nub of the difference between the lexer and 
the parser.
The lexer simply returns nodes, in order, and doesn't try to match end 
tags with start tags. So yes, you will get a TitleTag, but it hasn't 
been fed its children.
The parser, on the other hand, collects the nodes between the
start and end tags so as to "know" that the thing between the TITLE and
/TITLE tags is the "title of the document". See the home page for another
explanation: http://htmlparser.sourceforge.net/

What you can do with the Lexer is get the next node *after* the TITLE
tag and assume it's a plain text title in a string node (people do funny
things with HTML, so you're bound to see <TITLE><B>My Title</B></TITLE>
and stuff like that, which I'm not sure is even completely handled by
the parser code, so you have to be careful). Or perhaps get the *next*
StringNode from the lexer, which is presumably the title for the same
reasons as outlined before, but you have to watch out for empty
<TITLE></TITLE> constructs. Or you can use the Parser and hope it does
the 'right thing'. If it doesn't, let us know.
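
For what it's worth, the "next text node after TITLE" idea can be sketched in
plain Java. This is a toy illustration only and not htmlparser code: the names
TitleSketch, Token, lex and titleOf are made up for the example, and the
tokenizer is deliberately naive (it assumes no '>' inside comments or
attribute values). It returns the first non-blank text token after a TITLE
start tag, skips nested tags such as <B>, and yields an empty string when
/TITLE arrives first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy illustration: hypothetical types, NOT the htmlparser Lexer or its nodes.
final class TitleSketch {

    /** A flat token in document order: either a tag or a run of text. */
    static final class Token {
        final boolean isTag;
        final String value; // upper-cased tag name ("TITLE", "/TITLE") or raw text

        Token(boolean isTag, String value) {
            this.isTag = isTag;
            this.value = value;
        }
    }

    /** Splits HTML into tags and text without pairing start and end tags. */
    static List<Token> lex(String html) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            if (html.charAt(pos) == '<') {
                int close = html.indexOf('>', pos);
                if (close < 0) {
                    close = html.length(); // unterminated tag: take the rest
                }
                String inside = html.substring(pos + 1, close).trim();
                // Keep only the tag name, dropping any attributes.
                String name = inside.split("\\s+")[0].toUpperCase(Locale.ROOT);
                tokens.add(new Token(true, name));
                pos = close + 1;
            } else {
                int next = html.indexOf('<', pos);
                if (next < 0) {
                    next = html.length();
                }
                tokens.add(new Token(false, html.substring(pos, next)));
                pos = next;
            }
        }
        return tokens;
    }

    /** First non-blank text after a TITLE start tag, or "" if there is none. */
    static String titleOf(List<Token> tokens) {
        boolean inTitle = false;
        for (Token t : tokens) {
            if (t.isTag) {
                if (t.value.equals("TITLE")) {
                    inTitle = true;
                } else if (t.value.equals("/TITLE") && inTitle) {
                    return ""; // empty <TITLE></TITLE> construct
                }
                // Any other tag (for example <B> inside the title) is skipped.
            } else if (inTitle && !t.value.trim().isEmpty()) {
                return t.value.trim();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String html = "<html><!--remark--><head><title>Yahoo!</title></head>";
        System.out.println(titleOf(lex(html)));
    }
}
```

The empty-title check matters because without it the scan would run past
/TITLE and pick up the first piece of body text as the "title".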

Derrick

Steve McCann wrote:

>Using the following code, the assert for the title fails (getTitle()
>returns an empty string). Is it not possible to retrieve that
>information using the lexer rather than the parser? I am using HTML
>Parser Integration Release 1.4-20040125.
>
>Thank you,
>Steve
>
>    public void testTitleScan() throws ParserException
>    {
>        String inputHTML = "<html><!--remark--><head><title>Yahoo!</title></head>";
>        Lexer lexer = new Lexer (new Page (inputHTML));
>
>        PrototypicalNodeFactory factory = new PrototypicalNodeFactory (new TitleTag ());
>        lexer.setNodeFactory (factory);
>
>        Node node;
>        while (null != (node = lexer.nextNode ()))
>        {
>            if (node instanceof TitleTag)
>            {
>                TitleTag titleTag = (TitleTag) node;
>                String test = titleTag.getTitle ();
>                assertEquals ("Title", "Yahoo!", test);
>            }
>            if (node instanceof RemarkNode)
>            {
>                RemarkNode remarkNode = (RemarkNode) node;
>                String test = remarkNode.toPlainTextString ();
>                assertEquals ("Remark", "remark", test);
>            }
>        }
>    }


htmlparser-user — The user mailing list for users of the htmlparser library