htmlparser-user Mailing List for HTML Parser

Brought to you by: derrickoswald

htmlparser-user — The user mailing list for users of the htmlparser library

Re: [Htmlparser-user] using parser/lexer on non-html markup pages

From: Rob E. <re...@ap...> - 2004-08-31 15:40:56

Thanks, Derrick.  I'll download the latest version now.

Rob.

Derrick Oswald wrote:
> Rob,
> 
> This may be a bug that was recently (July 28) fixed:
> 
> Bug #995703 Parser Crash and bug #988846 Linkbean getLinks()
> segmentation fault were fixed by no longer testing for content type
> "text/XXX" in Page, and instead issuing a warning when this is
> discovered at the Parser level.
> 
> What's the exception message? Is it "...does not contain text"?
> If so, either download a new version or remove the test in Page.java:
> 
>          type = getContentType ();
> -         if (type != null && !type.startsWith ("text"))
> -             throw new ParserException (
> -                 "URL "
> -                 + connection.getURL ().toExternalForm ()
> -                 + " does not contain text");
>          charset = getCharset (type);
>          try
> 
> Derrick
> 
> 
> Rob Eger wrote:
> 
>> Okay, I seem to have figured out how to make the parser do what I 
>> need, except for one small issue - if the files have a .xml extension 
>> it throws an exception saying the file "does not contain text".  If I 
>> append .html onto it, things work fine.
>>
>> Is there a way to make the parser accept .xml as a valid file extension?
>>
>> Thanks,
>> Rob.
>>
>>
>> Derrick Oswald wrote:
>>
>>> Rob,
>>>
>>> I haven't had any problems parsing XML with htmlparser.
>>> An example is provided for parsing RSS feeds which are XML:
>>>    http://htmlparser.sourceforge.net/wiki/index.php/RSSFeeds
>>>
>>> Derrick
>>
>>
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>

Re: [Htmlparser-user] using parser/lexer on non-html markup pages

From: Derrick O. <Der...@Ro...> - 2004-08-30 22:24:40

Rob,

This may be a bug that was recently (July 28) fixed:

Bug #995703 Parser Crash and bug #988846 Linkbean getLinks() segmentation fault
were fixed by no longer testing for content type "text/XXX" in Page, and instead issuing
a warning when this is discovered at the Parser level.

What's the exception message? Is it "...does not contain text"?
If so, either download a new version or remove the test in Page.java:

          type = getContentType ();
-         if (type != null && !type.startsWith ("text"))
-             throw new ParserException (
-                 "URL "
-                 + connection.getURL ().toExternalForm ()
-                 + " does not contain text");
          charset = getCharset (type);
          try

Derrick


Rob Eger wrote:

> Okay, I seem to have figured out how to make the parser do what I 
> need, except for one small issue - if the files have a .xml extension 
> it throws an exception saying the file "does not contain text".  If I 
> append .html onto it, things work fine.
>
> Is there a way to make the parser accept .xml as a valid file extension?
>
> Thanks,
> Rob.
>
>
> Derrick Oswald wrote:
>
>> Rob,
>>
>> I haven't had any problems parsing XML with htmlparser.
>> An example is provided for parsing RSS feeds which are XML:
>>    http://htmlparser.sourceforge.net/wiki/index.php/RSSFeeds
>>
>> Derrick
>
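
To illustrate the change described in this message, here is a minimal standalone sketch of the two behaviours: rejecting any content type that does not start with "text", versus recording a warning and carrying on. It does not use the library's own classes. The names ContentTypeCheck, strictCheck, lenientCheck and warnings are hypothetical, and IllegalStateException stands in for the library's ParserException.

```java
import java.util.ArrayList;
import java.util.List;

public class ContentTypeCheck {
    // Hypothetical stand-in for whatever feedback channel the parser uses for warnings.
    static final List<String> warnings = new ArrayList<>();

    /** Strict variant: refuse anything whose content type does not start with "text". */
    static void strictCheck(String type, String url) {
        if (type != null && !type.startsWith("text")) {
            // IllegalStateException is an assumption made to keep the sketch self-contained.
            throw new IllegalStateException("URL " + url + " does not contain text");
        }
    }

    /** Lenient variant: record a warning for a non-text content type and continue. */
    static void lenientCheck(String type, String url) {
        if (type != null && !type.startsWith("text")) {
            warnings.add("URL " + url + " has content type " + type + ", parsing anyway");
        }
    }

    public static void main(String[] args) {
        String url = "file:listing.xml"; // illustrative value only
        lenientCheck("application/xml", url);
        System.out.println("warnings: " + warnings);
        try {
            strictCheck("application/xml", url);
        } catch (IllegalStateException e) {
            System.out.println("strict check failed: " + e.getMessage());
        }
    }
}
```

With the lenient variant, a file reported as a non-text type is still handed to the parser, which appears to be the behaviour the newer version provides.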

Re: [Htmlparser-user] using parser/lexer on non-html markup pages

From: Rob E. <re...@ap...> - 2004-08-30 18:09:36

Okay, I seem to have figured out how to make the parser do what I need, 
except for one small issue - if the files have a .xml extension it 
throws an exception saying the file "does not contain text".  If I 
append .html onto it, things work fine.

Is there a way to make the parser accept .xml as a valid file extension?

Thanks,
Rob.
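
One plausible explanation for the extension dependence, offered as an assumption rather than something confirmed in this thread: for local files, Java typically guesses the content type from the file name, and the type guessed for a .xml name may not start with "text". The sketch below prints what the running JVM guesses for two hypothetical file names. The class GuessType and the helper looksLikeText are illustrative names, not part of htmlparser.

```java
import java.net.URLConnection;

public class GuessType {
    /** True when a "starts with text" test would accept the guessed type (or no type is known). */
    static boolean looksLikeText(String fileName) {
        String type = URLConnection.guessContentTypeFromName(fileName);
        return type == null || type.startsWith("text");
    }

    public static void main(String[] args) {
        // Hypothetical file names; the guessed types depend on the JVM's MIME table.
        for (String name : new String[] {"listing.html", "listing.xml"}) {
            System.out.println(name + " -> "
                + URLConnection.guessContentTypeFromName(name)
                + " (accepted: " + looksLikeText(name) + ")");
        }
    }
}
```

If the second name is reported as not accepted on your JVM, that would be consistent with the "does not contain text" exception appearing only for .xml files.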


Derrick Oswald wrote:
> Rob,
> 
> I haven't had any problems parsing XML with htmlparser.
> An example is provided for parsing RSS feeds which are XML:
>    http://htmlparser.sourceforge.net/wiki/index.php/RSSFeeds
> 
> Derrick
> 
> Neil Aggarwal wrote:
> 
>> Rob:
>>
>> Your input is XML.  You should use an XML parser like xerces
>> http://xml.apache.org/xerces2-j/index.html
>> to parse it.
>>
>> Neil
>>  
>>
>>> -----Original Message-----
>>> From: htm...@li... [mailto:htm...@li...] On Behalf Of Rob Eger
>>> Sent: Friday, August 27, 2004 3:51 PM
>>> To: htm...@li...
>>> Subject: [Htmlparser-user] using parser/lexer on non-html markup pages
>>>
>>>
>>> I've been using the HTMLParser to parse html pages up until now 
>>> (works great by the way), but I was just given a small new project to 
>>> parse a set of marked up files.  Basically info in tags.
>>>
>>> The files contain blocks (many per file) like this:
>>>
>>> <listing id="324" key="xyz">
>>>    <name>random name</name>
>>>    <lineBlock heading="header1">
>>>       <line lineNum="1">contents of the line</line>
>>>       <line lineNum="2">more line contents</line>
>>>    </lineBlock>
>>> </listing>
>>>
>>> and so on...
>>>
>>> I tried just re-using some of the code I was using for parsing html 
>>> (added some custom tags to handle the specific tags I'm dealing 
>>> with), but it didn't work at first pass.  Not sure why, nothing 
>>> obvious stands out.
>>>
>>> Can I use the parser (or would the lexer be better) to do this at 
>>> all? Or am I trying to fit a square peg in a round hole?
>>>
>>> Thanks,
>>> Rob.
>>>
>>>   
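
For comparison with the XML-parser route suggested in this thread, here is a minimal sketch that reads the sample listing with the XML parser bundled in the JDK (JAXP) rather than with htmlparser. It assumes the input is well-formed XML with a single root element; a file holding many listing blocks would need a wrapping root element, which the original post does not show. The class ListingReader and the method readLines are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ListingReader {
    /** Returns "lineNum: text" for every <line> element found in the given XML string. */
    static List<String> readLines(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList lines = doc.getElementsByTagName("line");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < lines.getLength(); i++) {
            Element line = (Element) lines.item(i);
            result.add(line.getAttribute("lineNum") + ": " + line.getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // The sample block from the original post, as a single well-formed document.
        String xml =
            "<listing id=\"324\" key=\"xyz\">"
            + "<name>random name</name>"
            + "<lineBlock heading=\"header1\">"
            + "<line lineNum=\"1\">contents of the line</line>"
            + "<line lineNum=\"2\">more line contents</line>"
            + "</lineBlock>"
            + "</listing>";
        for (String entry : readLines(xml)) {
            System.out.println(entry);
        }
    }
}
```

A strict XML parser rejects input that is not well formed, so htmlparser may still be the more forgiving choice for loosely marked-up files.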