htmlparser-developer Mailing List for HTML Parser

Sourceforge CVS services can be set up to send email notification when 
commits occur.
I've set this up for myself, but if people think it's useful I can set 
up an email list.

Oops, I should have said that the parser_core.jar outputs a stream of 
undifferentiated *nodes*, not tags.

Derrick Oswald wrote:

>
> Since it's a library incorporated within other applications, size is 
> always an issue.
> There are two aspects though, disk footprint (jar size) and memory usage.
> Usually, there is a speed/memory usage trade-off to be made, which is 
> only sometimes reflected in the disk footprint size.
> With current desktop hardware, people usually trade off memory for speed.
> It's only with embedded or mobile applications you concentrate on disk 
> size and memory consumption.
>
> Regarding your picture, the layers won't necessarily follow the 
> current package structure.
> For example, logging is integral to the core parser to report 
> problems, and the beans layer removes all HTML tags so it can't be 
> used by upper layers. In order to decide the breakdown in layers, a 
> poll of users regarding typical use-cases might be in order.
>
> Let's say there are two major groupings:
>
> 1) extraction of all or part of the information on a page to be 
> consumed by another application.
> 2) rewriting URLs, content, specific tags, clean-up, reformatting or 
> pretty printing HTML text
>
> This would suggest three configuration items (jars):
>
> parser_applications.jar - Sample applications, GUI tools, beans, tests
> parser_edit.jar - Rewriting tools, DOM-type hierarchical editing, 
> visitors, smart tags
> parser_core.jar - Read-only core parser, stream of undifferentiated tags
>
> If a program's parser usage involves extraction, it need only use the 
> parser_core.jar and pass through the data in a stream-like fashion. 
> But if rewriting is in order, it uses both parser_core.jar and 
> parser_edit.jar and the parser presents the full HTML document as a 
> hierarchy of tag-specific nodes. All else goes into parser_applications.
>
> We could probably get parser_core.jar below 25KB, or in that range.
>
> Derrick
>
> Somik Raha wrote:
> <snip>
>
>> [1] I find the parser's differentiating factor is its size - time and 
>> time again the feedback I've received is that folks love its being 
>> below 100K. Size almost directly maps on to simplicity. And that 
>> impacts the other important area - performance.
>>  
>> [2] I hate to pay for what I don't need - when folks get tons of 
>> stuff that they don't need, they are paying for the needs of a few.
>>  
>> At the same time, I think it is a challenge to be able to accommodate 
>> new requests and still keep the parser light. I see a natural layer 
>> forming:
>>  
>>
>>   +----------------------------------------+
>>   |        Sample Applications, GUI        |
>>   |   +--------------------------------+   |
>>   |   |        Logging Mechanism       |   |
>>   |   |  +--------------------------+  |   |
>>   |   |  |        Beans             |  |   |
>>   |   |  |  +--------------------+  |  |   |
>>   |   |  |  |    Scanners        |  |  |   |
>>   |   |  |  | +---------------+  |  |  |   |
>>   |   |  |  | |  Core Parser  |  |  |  |   |
>>   |   |  |  | +---------------+  |  |  |   |
>>   |   |  |  +--------------------+  |  |   |
>>   |   |  |                          |  |   |
>>   |   |  +--------------------------+  |   |
>>   |   |     default, log4j, jdk1.4     |   |
>>   |   +--------------------------------+   |
>>   +----------------------------------------+
>>  
>> If we can perform this separation in the design and the packaging, it 
>> might allow people to choose what they need. We don't have to follow 
>> the "one size fits all" policy.
>>  
>> What are your thoughts? I am not sure how we'd achieve this 
>> separation or whether it really makes sense - so please jump in with 
>> your two cents.
>>  
>> Regards,
>> Somik
>>  
>
>
> <snip>
>

Since it's a library incorporated within other applications, size is 
always an issue.
There are two aspects though, disk footprint (jar size) and memory usage.
Usually, there is a speed/memory usage trade-off to be made, which is 
only sometimes reflected in the disk footprint size.
With current desktop hardware, people usually trade off memory for speed.
It's only with embedded or mobile applications you concentrate on disk 
size and memory consumption.

Regarding your picture, the layers won't necessarily follow the current 
package structure.
For example, logging is integral to the core parser to report problems, 
and the beans layer removes all HTML tags so it can't be used by upper 
layers. In order to decide the breakdown in layers, a poll of users 
regarding typical use-cases might be in order.

Let's say there are two major groupings:

1) extraction of all or part of the information on a page to be consumed 
by another application.
2) rewriting URLs, content, specific tags, clean-up, reformatting or 
pretty printing HTML text

This would suggest three configuration items (jars):

parser_applications.jar - Sample applications, GUI tools, beans, tests
parser_edit.jar - Rewriting tools, DOM-type hierarchical editing, 
visitors, smart tags
parser_core.jar - Read-only core parser, stream of undifferentiated tags

If a program's parser usage involves extraction, it need only use the 
parser_core.jar and pass through the data in a stream-like fashion. But 
if rewriting is in order, it uses both parser_core.jar and 
parser_edit.jar and the parser presents the full HTML document as a 
hierarchy of tag-specific nodes. All else goes into parser_applications.
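
To make the core/edit split concrete, here is a rough sketch of the two modes of use. Everything in it is hypothetical and for illustration only: the class names (CoreVsEditSketch, Node, Element) and methods (nodes, extractText, tree) are not the parser's actual API, and the tokenizer is deliberately naive (no attributes, comments, or unclosed tags). The "core" part hands out undifferentiated nodes one at a time; the "edit" part builds a hierarchy on top of that same stream.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class CoreVsEditSketch {

    /** Undifferentiated node: a tag or a run of text, no tag-specific subclasses. */
    static final class Node {
        final boolean tag;
        final String content; // tag body without angle brackets, or raw text
        Node(boolean tag, String content) { this.tag = tag; this.content = content; }
    }

    /** "Core" layer (hypothetical): lazily yields nodes; not a real HTML lexer. */
    static Iterator<Node> nodes(final String html) {
        return new Iterator<Node>() {
            private int pos = 0;
            public boolean hasNext() { return pos < html.length(); }
            public Node next() {
                if (!hasNext()) throw new NoSuchElementException();
                int close = html.indexOf('>', pos);
                if (html.charAt(pos) == '<' && close >= 0) {
                    Node n = new Node(true, html.substring(pos + 1, close));
                    pos = close + 1;
                    return n;
                }
                // Text runs up to the next '<'; an unterminated '<' is kept as text.
                int open = html.indexOf('<', pos + 1);
                int end = open < 0 ? html.length() : open;
                Node n = new Node(false, html.substring(pos, end));
                pos = end;
                return n;
            }
        };
    }

    /** Extraction use case: a single pass over the stream, nothing retained. */
    static String extractText(String html) {
        StringBuilder out = new StringBuilder();
        for (Iterator<Node> it = nodes(html); it.hasNext();) {
            Node n = it.next();
            if (!n.tag) out.append(n.content);
        }
        return out.toString();
    }

    /** "Edit" layer (hypothetical): a node in the document hierarchy. */
    static final class Element {
        final String name; // raw tag content, "#text" for text, "#root" for the top
        final String text; // set only for "#text"
        final List<Element> children = new ArrayList<>();
        Element(String name, String text) { this.name = name; this.text = text; }
    }

    /** Builds the hierarchy from the core stream; assumes well-nested tags. */
    static Element tree(String html) {
        Element root = new Element("#root", null);
        Deque<Element> stack = new ArrayDeque<>();
        stack.push(root);
        for (Iterator<Node> it = nodes(html); it.hasNext();) {
            Node n = it.next();
            if (!n.tag) {
                stack.peek().children.add(new Element("#text", n.content));
            } else if (n.content.startsWith("/")) {
                if (stack.size() > 1) stack.pop(); // closing tag ends current element
            } else {
                Element e = new Element(n.content, null);
                stack.peek().children.add(e);
                stack.push(e);
            }
        }
        return root;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        System.out.println(extractText(html));               // stream-only usage
        System.out.println(tree(html).children.get(0).name); // hierarchy usage
    }
}
```

The point of the sketch is the dependency direction: extractText needs only the iterator, while tree is layered on top of it and is the only part that holds the whole document in memory.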

We could probably get parser_core.jar below 25KB, or in that range.

Derrick

Somik Raha wrote:
<snip>

> [1] I find the parser's differentiating factor is its size - time and 
> time again the feedback I've received is that folks love its being 
> below 100K. Size almost directly maps on to simplicity. And that 
> impacts the other important area - performance.
>  
> [2] I hate to pay for what I don't need - when folks get tons of stuff 
> that they don't need, they are paying for the needs of a few.
>  
> At the same time, I think it is a challenge to be able to accommodate 
> new requests and still keep the parser light. I see a natural layer 
> forming:
>  
>
>   +----------------------------------------+
>   |        Sample Applications, GUI        |
>   |   +--------------------------------+   |
>   |   |        Logging Mechanism       |   |
>   |   |  +--------------------------+  |   |
>   |   |  |        Beans             |  |   |
>   |   |  |  +--------------------+  |  |   |
>   |   |  |  |    Scanners        |  |  |   |
>   |   |  |  | +---------------+  |  |  |   |
>   |   |  |  | |  Core Parser  |  |  |  |   |
>   |   |  |  | +---------------+  |  |  |   |
>   |   |  |  +--------------------+  |  |   |
>   |   |  |                          |  |   |
>   |   |  +--------------------------+  |   |
>   |   |     default, log4j, jdk1.4     |   |
>   |   +--------------------------------+   |
>   +----------------------------------------+
>  
> If we can perform this separation in the design and the packaging, it 
> might allow people to choose what they need. We don't have to follow 
> the "one size fits all" policy.
>  
> What are your thoughts? I am not sure how we'd achieve this separation 
> or whether it really makes sense - so please jump in with your two cents.
>  
> Regards,
> Somik
>  

<snip>


htmlparser-developer — The developer mailing list of the htmlparser project