htmlparser-developer Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Martin K. <Mar...@St...> - 2005-04-06 21:20:40
|
Hi Derrick,

> Hmmm, another project.
> Despite the large number of developers in htmlparser, it's really only me
> that is active. So in effect you are asking me if I would like to take on
> another project.
> I'm not sure.
> It's a bit heavy-weight for such a utility.

I don't think it is that difficult to form a new project. I would expect to see about 6 or 8 classes and a high-quality unit-test suite. Since the HTML 4.0 standard seems unlikely to change, I guess there would be only bug hunting left (and I would guess there isn't much chance of seeing more than a couple of bugs in that area).

> Are there other projects that would use it besides htmlparser and spring?

Maybe Jakarta projects would also be a target audience. Googling for character entity references, I found about 5 or 6 Java projects implementing their own solutions to this problem.

> Would spring even agree to use it?

I can ask those folks, if you wouldn't mind. They said that they would like to replace the current implementation with a 3rd-party library. But you are right, we should be sure that they would use such a small special-purpose library.

> While conceptually I agree with the concept of coalescence of disparate
> code streams, I wonder if it would gain the traction necessary in the open
> source bazaar given the ease with which a 'good enough' solution can be
> created. Is there another way?

What exactly do you mean by 'another way'?

> The encoding/decoding of URLs (%20 for space etc.) derives from a
> different origin, RFC 2396: http://www.ietf.org/rfc/rfc2396.txt
> Is that what you mean by encodeUrl() or am I missing something?

Right. The problem I mean is that currently you have two special classes to use, URLEncoder and URLDecoder. Also, every time you use them you end up specifying UTF-8 as the encoding scheme (since the single-argument encode(String)/decode(String) methods are deprecated).

Cheers,

Martin (Kersten)
|
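The encodeUrl()/decodeUrl() idea discussed above could look roughly like the following. This is only a minimal sketch: the class name UrlCoderSketch is hypothetical and not part of htmlparser or Spring. It assumes nothing beyond the JDK's java.net.URLEncoder and java.net.URLDecoder, and hard-wires "UTF-8" so that callers never touch the deprecated single-argument methods.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, named for illustration only.
final class UrlCoderSketch {
    // The default character set proposed in the thread.
    private static final String DEFAULT_CHARSET = "UTF-8";

    private UrlCoderSketch() {
    }

    // Encodes a plain string using the two-argument, non-deprecated JDK method.
    static String encodeUrl(String plain) {
        try {
            return URLEncoder.encode(plain, DEFAULT_CHARSET);
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    // Reverses encodeUrl(), again with UTF-8 as the fixed character set.
    static String decodeUrl(String encoded) {
        try {
            return URLDecoder.decode(encoded, DEFAULT_CHARSET);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }
}
```

Wrapping the checked UnsupportedEncodingException is a design choice made for this sketch; it keeps call sites down to a single method call, which seems to be the convenience being asked for.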
From: Derrick O. <Der...@Ro...> - 2005-04-06 11:19:43
|
Martin,

Hmmm, another project.
Despite the large number of developers in htmlparser, it's really only me that is active. So in effect you are asking me if I would like to take on another project.
I'm not sure.
It's a bit heavy-weight for such a utility.

Are there other projects that would use it besides htmlparser and spring?
Would spring even agree to use it?

While conceptually I agree with the concept of coalescence of disparate code streams, I wonder if it would gain the traction necessary in the open source bazaar given the ease with which a 'good enough' solution can be created. Is there another way?

The encoding/decoding of URLs (%20 for space etc.) derives from a different origin, RFC 2396: http://www.ietf.org/rfc/rfc2396.txt
Is that what you mean by encodeUrl() or am I missing something?

Derrick
|
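One detail behind the "%20 for space" remark may be worth illustrating: the JDK's URLEncoder produces application/x-www-form-urlencoded output, where a space becomes "+", while RFC 2396 style escaping of a path yields "%20". The sketch below contrasts the two using only JDK classes; the class and method names (SpaceEncodingSketch, formEncode, pathEncode) are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical demo class, named for illustration only.
final class SpaceEncodingSketch {
    private SpaceEncodingSketch() {
    }

    // Form-style encoding: URLEncoder turns a space into '+'.
    static String formEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    // RFC 2396 style escaping of a path: the multi-argument URI constructor
    // quotes characters that are illegal in a path, so a space becomes %20.
    static String pathEncode(String path) {
        try {
            return new URI(null, null, path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid path: " + path, e);
        }
    }
}
```

Which of the two behaviours an encodeUrl() method should have is not settled in the thread, so a library would presumably need to document its choice.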
From: Martin K. <Mar...@St...> - 2005-04-06 00:33:30
|
Hi Derrick,

> HTML Parser has a similar class as well...
> org.htmlparser.util.Translate.java:

That is the file I was originally thinking about, that's right.

> This file was arrived at via a similar mechanism to your own. It's not
> stand-alone, relying on a sort utility, to avoid two copies of each
> reference in the class, and two reference classes, for a total of about 5
> classes.

I use a class called CharacterEntityReferences storing a collection of the references (simply a map of String to Integer).

To determine whether a given character is a special character (that is, an entity reference exists for it), I use a bit field (int type, with each bit representing a character reference in the ranges [0...1000] and [8000...10000]). Works very well.

But these are implementation details we can discuss later. The goal is to provide a single solution which is well tested and reliable. I found a bug within the original Spring version (numeric references from � to 	 are not processed in the right way).

Also, looking around, I found some implementations that only support character references for characters <= 255. So these implementations are not complete in terms of the specification.

So I guess there is a need for that library.

> There was a patch provided by *Karsten Pawlik* that loaded the table from
> a resource:
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=897297&group_id=24399&atid=381401
> but this was never integrated.

I don't like my version using a secondary resource either. You know, I need to use a TokenStream, which makes it quite complex. (I use the DTD definitions provided by w3.org.) But it should give a great unit-test case: this resource contains every entity reference, so converting this file and checking that all entity references are converted would be a necessary test, which is the desired test situation.

I currently favour a version using something like sequences to reduce the lines of code needed to set up all entities. But which implementation is finally chosen does not matter much to me anyway, as long as the version is highly reliable and quite fast when it comes to the actual conversion.

> This could be broken out into a separate jar. Is that what you are
> suggesting?

Yes. I would like to have a special library only for encoding and decoding strings to HTML. I would also like to add encode/decodeURL methods to avoid using URLEncoder and URLDecoder directly, with "UTF-8" as the default encoding character set, since URLDecoder.decode(String) is deprecated.

So something like:

HtmlUtils.encode(String normalString) : htmlString
HtmlUtils.decode(String htmlString) : normal string
HtmlUtils.encodeUrl(String normalString) : URL (UTF-8)
HtmlUtils.decodeUrl(String url) : String normalString

Maybe renaming HtmlUtils to HtmlCoder or something similar (or HtmlConverter) would also be appreciated.

Is it possible to put this library under a special SourceForge project? Like an HtmlCoder project or whatever? I guess there is some more functionality to add in order to support streams and readers to simplify its use. If we do this library right, there is a chance that even more projects will use it instead of their own (possibly limited) solutions.

As for the ownership of the project: I would suggest that you guys handle it, if you can. You do well with the html-parser and I guess you have what it takes. Of course I would like to contribute all my code and knowledge. I would also like to do some testing of which implementation of the decoding/encoding algorithms is finally used, and review/develop a complete set of unit tests (I don't know your test coverage, so maybe your HtmlParser codebase already has everything it takes).

So you see this is actually quite a complex field :-)

Cheers,

Martin (Kersten)

PS: I would also like to use named entity references when encoding an HTML string (currently the Spring version only uses numeric formats, which aren't that handy if you need to read the encoded HTML).

> Derrick
>
> Martin Kersten wrote:
>
>> Dear Html-Parser developers,
>>
>> my name is Martin Kersten and I am looking for a library doing HTML
>> related conversions. Originally I started to refactor the HtmlUtils
>> class of the Spring framework. But thinking about it, it would be best
>> if such a capability (decode/encode strings from / to HTML) were
>> provided by a special and tiny library. Such a library would be a relief,
>> I guess. Also I wouldn't like to reinvent the wheel.
>>
>> Also, through my own refactoring efforts, I ended up with a special class
>> encapsulating the named entity references and loading them from a file
>> (I used the http://www.w3.org/TR/REC-html40/sgml/entities.html files).
>> I don't know if this effort is worth it, but from an OOP standpoint it
>> looks nice :-).
>>
>> Anyway, I would be happy if such a highly focused library were
>> out there.
>> So what do you think? Any chance that such a library can be created?
>>
>> Cheers,
>>
>> Martin (Kersten)
>>
>> PS: I am not a Spring developer, I am just a Spring user who cares.
>> :-)
|
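The design described in this message (a name-to-code map, a bit set for the "does this character have an entity?" check, and encoding with named references) can be sketched as follows. This is not the code from Spring, htmlparser or Martin's refactoring: EntityCoderSketch and its methods are hypothetical names, java.util.BitSet stands in for the hand-rolled int bit field, and only six of the HTML 4.0 entities are registered to keep the example short.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class, named for illustration only.
final class EntityCoderSketch {
    private static final Map<String, Integer> NAME_TO_CODE = new HashMap<>();
    private static final Map<Integer, String> CODE_TO_NAME = new HashMap<>();
    // One bit per code point that has a named entity reference.
    private static final BitSet HAS_ENTITY = new BitSet();

    static {
        // A small subset; a real table would cover all HTML 4.0 entities.
        register("quot", 34);
        register("amp", 38);
        register("lt", 60);
        register("gt", 62);
        register("nbsp", 160);
        register("euro", 8364);
    }

    private static void register(String name, int code) {
        NAME_TO_CODE.put(name, code);
        CODE_TO_NAME.put(code, name);
        HAS_ENTITY.set(code);
    }

    // Constant-time membership test backed by the bit set.
    static boolean hasEntity(int codePoint) {
        return HAS_ENTITY.get(codePoint);
    }

    // Replaces every character that has a named entity with "&name;".
    static String encode(String plain) {
        StringBuilder out = new StringBuilder(plain.length());
        for (int i = 0; i < plain.length(); i++) {
            char c = plain.charAt(i);
            if (hasEntity(c)) {
                out.append('&').append(CODE_TO_NAME.get((int) c)).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Resolves named, decimal and hexadecimal references; unknown ones stay as-is.
    static String decode(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int end = (c == '&') ? html.indexOf(';', i) : -1;
            Integer code = (end > i + 1) ? resolve(html.substring(i + 1, end)) : null;
            if (code != null) {
                out.appendCodePoint(code);
                i = end + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // Returns the code point for a reference body such as "amp", "#65" or "#x41".
    private static Integer resolve(String ref) {
        if (!ref.startsWith("#")) {
            return NAME_TO_CODE.get(ref);
        }
        try {
            boolean hex = ref.length() > 1 && (ref.charAt(1) == 'x' || ref.charAt(1) == 'X');
            int code = hex ? Integer.parseInt(ref.substring(2), 16)
                           : Integer.parseInt(ref.substring(1));
            return Character.isValidCodePoint(code) ? code : null;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

A test along the lines suggested in the message would load every entity from the w3.org definition file and check that decode() maps each name to its code point and that encode() round-trips it; with the six-entry table above that check would of course fail for the missing entities.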