htmlparser-developer Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Derrick O. <Der...@ro...> - 2003-05-03 14:33:51
SourceForge CVS services can be set up to send email notifications when commits occur. I've set this up for myself, but if people think it's useful I can set up an email list.
From: Derrick O. <Der...@ro...> - 2003-05-03 12:12:22
Oops, I should have said that parser_core.jar outputs a stream of undifferentiated *nodes*.

Derrick Oswald wrote:

> <snip>
> parser_core.jar - Read-only core parser, stream of undifferentiated tags
> <snip>
From: Derrick O. <Der...@ro...> - 2003-05-03 12:03:32
Since it's a library incorporated within other applications, size is always an issue. There are two aspects, though: disk footprint (jar size) and memory usage. Usually there is a speed/memory usage trade-off to be made, which is only sometimes reflected in the disk footprint. With current desktop hardware, people usually trade off memory for speed. It's only with embedded or mobile applications that you concentrate on disk size and memory consumption.

Regarding your picture, the layers won't necessarily follow the current package structure. For example, logging is integral to the core parser to report problems, and the beans layer removes all HTML tags so it can't be used by upper layers. In order to decide the breakdown in layers, a poll of users regarding typical use-cases might be in order.

Let's say there are two major groupings:

1) extraction of all or part of the information on a page to be consumed by another application.
2) rewriting URLs, content, specific tags, clean-up, reformatting or pretty printing HTML text

This would suggest three configuration items (jars):

parser_applications.jar - Sample applications, GUI tools, beans, tests
parser_edit.jar - Rewriting tools, DOM type hierarchical editing, visitors, smart tags
parser_core.jar - Read-only core parser, stream of undifferentiated tags

If a program's parser usage involves extraction, it need only use parser_core.jar and pass through the data in a stream-like fashion. But if rewriting is in order, it uses both parser_core.jar and parser_edit.jar, and the parser presents the full HTML document as a hierarchy of tag-specific nodes. All else goes into parser_applications.jar.

We could probably get parser_core.jar below 25KB, or in that range.

Derrick

Somik Raha wrote:

<snip>

> [1] I find the parser's differentiating factor is its size - time and
> time again the feedback I've received is that folks love its being
> below 100K. Size almost directly maps on to simplicity. And that
> impacts the other important area - performance.
>
> [2] I hate to pay for what I don't need - when folks get tons of stuff
> that they don't need, they are paying for the needs of a few.
>
> At the same time, I think it is a challenge to be able to accommodate
> new requests and still keep the parser light. I see a natural layer
> forming:
>
> ,----------------------------------------.
> | Sample Applications, GUI               |
> | ,--------------------------------.     |
> | | Logging Mechanism              |     |
> | | ,--------------------------.   |     |
> | | | Beans                    |   |     |
> | | | ,--------------------.   |   |     |
> | | | | Scanners           |   |   |     |
> | | | | ,---------------.  |   |   |     |
> | | | | | Core Parser   |  |   |   |     |
> | | | | `---------------'  |   |   |     |
> | | | `--------------------'   |   |     |
> | | `--------------------------'   |     |
> | | default, log4j, jdk1.4         |     |
> | `--------------------------------'     |
> `----------------------------------------'
>
> If we can perform this separation in the design and the packaging, it
> might allow people to choose what they need. We don't have to follow
> the "one size fits all" policy.
>
> What are your thoughts? I am not sure how we'd achieve this separation
> or whether it really makes sense - so please jump in with your two cents.
>
> Regards,
> Somik

<snip>
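
To make the proposed split concrete, here is a rough Java sketch of how the two lower jars might relate. Everything in it (LayerSketch, Node, Element, stream, extractText, buildTree) is a hypothetical name invented for illustration and is not the HTML Parser API. It assumes the core exposes a forward-only stream of undifferentiated nodes and that the edit layer builds its hierarchy on top of that stream; the tag matching is deliberately naive, so void tags, comments and malformed markup are not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch of the core/edit split; not the HTML Parser API.
class LayerSketch {

    // "Core" layer: an undifferentiated node is just a tag or a run of text.
    static final class Node {
        final boolean isTag;
        final String text; // tag contents without the angle brackets, or raw text
        Node(boolean isTag, String text) { this.isTag = isTag; this.text = text; }
    }

    // Core layer: a read-only, forward-only stream of nodes over the input.
    static Iterator<Node> stream(final String html) {
        return new Iterator<Node>() {
            private int pos = 0;
            @Override public boolean hasNext() { return pos < html.length(); }
            @Override public Node next() {
                if (!hasNext()) throw new NoSuchElementException();
                boolean isTag = html.charAt(pos) == '<';
                // A tag ends at the next '>', a text run at the next '<'.
                int end = html.indexOf(isTag ? '>' : '<', pos);
                if (end < 0) end = html.length(); // unterminated: take the rest
                Node node = new Node(isTag, html.substring(isTag ? pos + 1 : pos, end));
                pos = isTag ? Math.min(end + 1, html.length()) : end;
                return node;
            }
        };
    }

    // Use case 1 (extraction): needs only the core stream, no tree in memory.
    static String extractText(String html) {
        StringBuilder sb = new StringBuilder();
        for (Iterator<Node> it = stream(html); it.hasNext();) {
            Node node = it.next();
            if (!node.isTag) sb.append(node.text);
        }
        return sb.toString();
    }

    // "Edit" layer: an element with children (each child is an Element or a String).
    static final class Element {
        final String name;
        final List<Object> children = new ArrayList<>();
        Element(String name) { this.name = name; }
    }

    // Use case 2 (rewriting): the edit layer consumes the same stream and
    // assembles a hierarchy. Naive: every non-closing tag opens an element.
    static Element buildTree(String html) {
        Element root = new Element("#root");
        Deque<Element> open = new ArrayDeque<>();
        open.push(root);
        for (Iterator<Node> it = stream(html); it.hasNext();) {
            Node node = it.next();
            if (!node.isTag) {
                open.peek().children.add(node.text);
            } else if (node.text.startsWith("/")) {
                if (open.size() > 1) open.pop(); // never pop the synthetic root
            } else {
                // Element name is the first whitespace-separated token of the tag.
                Element element = new Element(node.text.trim().split("\\s+")[0]);
                open.peek().children.add(element);
                open.push(element);
            }
        }
        return root;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        System.out.println(extractText(html));
        System.out.println(buildTree(html).children.size());
    }
}
```

In this sketch the extraction path never allocates a tree, which is the memory argument for a small read-only core, while the rewriting path pays for the hierarchy only when the edit layer is actually used.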