htmlparser-developer Mailing List for HTML Parser
Brought to you by:
derrickoswald
You can subscribe to this list here.
From: Craig R. <cr...@qu...> - 2002-05-13 10:36:30
|
Wrong mail address, again :>
--------
Hi Somik,

I thought I'd brief you on how my investigation into the SwingParser was going. I took your CVS module and, with some changes, managed to integrate it into Swing's JEditorPane HTML renderer to make a simple HTML browser. It soon became apparent, however, that the renderer requires perfectly formed HTML.

After playing with the idea of trying to fix bad HTML myself, I realised the enormity of this task and looked for an existing implementation. JTidy (http://www.sourceforge.net/projects/jtidy), a port of a C library (HTML Tidy), is another SourceForge project which performs HTML validation and pretty-printing. It produces a DOM of the HTML page from an InputStream, and I performed the relevant callbacks from that DOM. The result is a good replacement for Sun's DocumentParser, and it produces a nice report of what was wrong/fixed during parsing. I am still trying to determine whether the 174 KB it adds to any project is worth it, though (and whether there are any performance implications).

I haven't checked my code back in since it no longer depends on htmlparser in any way, but I can send it to you if you're interested.

-craig |
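The DOM-to-callback step described in this message could look roughly like the sketch below. It is not Craig's code (which was never checked in), only an illustration under stated assumptions: the class name DomCallbackWalker and the event strings are made up, the demo DOM is built with the JDK's XML parser from well-formed markup so the example runs without JTidy, every position is reported as 0 because a DOM carries no source offsets, and an element without children is treated as a "simple" tag, which is a simplification.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomCallbackWalker {

    /** Recursively replays a DOM subtree as Swing parser callbacks. */
    static void walk(Node node, HTMLEditorKit.ParserCallback cb) {
        switch (node.getNodeType()) {
            case Node.TEXT_NODE: {
                // Assumption: no source offsets are available, so 0 is passed.
                cb.handleText(node.getNodeValue().toCharArray(), 0);
                break;
            }
            case Node.ELEMENT_NODE: {
                HTML.Tag tag = HTML.getTag(node.getNodeName().toLowerCase());
                if (tag == null) {
                    tag = new HTML.UnknownTag(node.getNodeName());
                }
                MutableAttributeSet attrs = new SimpleAttributeSet();
                NamedNodeMap map = node.getAttributes();
                for (int i = 0; i < map.getLength(); i++) {
                    attrs.addAttribute(map.item(i).getNodeName(), map.item(i).getNodeValue());
                }
                if (!node.hasChildNodes()) {
                    // Simplification: a childless element is reported as a simple tag.
                    cb.handleSimpleTag(tag, attrs, 0);
                } else {
                    cb.handleStartTag(tag, attrs, 0);
                    walkChildren(node, cb);
                    cb.handleEndTag(tag, 0);
                }
                break;
            }
            default:
                // Document node and anything else: just descend.
                walkChildren(node, cb);
        }
    }

    private static void walkChildren(Node node, HTMLEditorKit.ParserCallback cb) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, cb);
        }
    }

    /** Demo helper: parses well-formed markup and records the callbacks as strings. */
    static List<String> replay(String xhtml) {
        final List<String> events = new ArrayList<>();
        HTMLEditorKit.ParserCallback recorder = new HTMLEditorKit.ParserCallback() {
            @Override public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("start:" + t);
            }
            @Override public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("simple:" + t);
            }
            @Override public void handleEndTag(HTML.Tag t, int pos) {
                events.add("end:" + t);
            }
            @Override public void handleText(char[] data, int pos) {
                events.add("text:" + new String(data));
            }
        };
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            walk(dom, recorder);
        } catch (Exception e) {
            throw new IllegalStateException("could not build the demo DOM", e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(replay("<html><body><p>Hi<br/>there</p></body></html>"));
    }
}
```

With JTidy on the classpath, the Document would presumably come from its DOM-producing parse call instead of the JDK XML parser; walk() itself only depends on org.w3c.dom.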
From: Somik R. <so...@ya...> - 2002-05-12 09:07:49
|
Hi Raghav,

I went through yahoo.txt, and just like your previous one, this one too had very dirty html. The reason you got the OutOfMemoryError was that this kind of html sent the parser into an infinite loop (in HTMLLinkScanner). The tag which did this was:

<a href=s/8741><img src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16 width=16 border=0></img></td><td nowrap> <a href=s/7509><b>Yahoo! Movies</b></a>

As you can see, the first link tag does not have an end tag. I verified with the actual yahoo page, and there this link occurs quite decently, with the correct end tag. After looking closely at your supplied file, I also noticed the </img> tag, which is highly unusual in normal html.

So - I am guessing that this file is generated by a program and not by a human. You would definitely want to check the program that's doing it - it's surely buggy.

However, my yardstick for the robustness of this parser is Internet Explorer. If the stuff works in IE, then it's got to work here. And when I tried this particularly bad piece of html, I found IE does not crash. Hence, I had to go about empowering the parser to parse these erroneous tags <sigh> Took hours!! </sigh>

The good news is, it's done. We can parse these tags, and the correct end tag is inserted just before td. Of course, I have done a minimal adjustment for your purpose. As time goes on, robustness ought to increase further. All test cases are passing. The framework for handling dirty html is also slightly modified.

An integration release has been made (2002-05-12), and is under the integration builds package. You can download it from http://htmlparser.sourceforge.net.

The parser should not crash on your html now.
Regards,
Somik

----- Original Message -----
From: Raghavender Srimantula
To: htm...@li...
Sent: Saturday, May 11, 2002 4:32 AM
Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document

Hi Somik,
I have mentioned about the out of memory error problem earlier. last time for every iteration of for loop I was adding the whole page to my string buffer. so it was giving me the out of memory error. I removed that now. it was working fine till yesterday. now I find that error again. this time nothing to do with string buffer...and it looks like a real problem. I can send you the main class and the yahoo.txt I have. try running it.
Thanks,
Raghav

> From: "Somik Raha" <so...@ya...>
> Reply-To: htm...@li...
> To: <htm...@li...>
> Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> Date: Fri, 10 May 2002 00:43:19 +0900
>
> Hi Raghav,
> On analyzing yahoo.txt, I found that you have incorrect html. There is a script tag that has not been closed. So naturally the script scanner goes bonkers. Rename the extension to .html, and open this file in IE, and you will find that IE also cant handle this.
> I verified from www.yahoo.com, and found that they do have the correct </script> tag provided. So I guess your yahoo.txt file is faulty.
>
> Regards,
> Somik
>
> ----- Original Message -----
> From: Raghavender Srimantula
> To: htm...@li...
> Sent: Thursday, May 09, 2002 4:53 AM
> Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
>
> Hi Somik,
> I was using the 1.1 version of htmlparser. I save the www.yahoo.com content in a flat file yahoo.txt. and I run the parser against this. throws a nullpointerexception in HTMLScriptScanner. this seems to be a new addition for 1.1. I will send the stacktrace, the main program and the yahoo.txt. actually I cannot send the stacktrace. I made some changes and the line numbers dont match. but if you run this program you would see the nullpointerexception.
> Thanks,
> Raghav
>
> > From: "Somik Raha" <so...@ya...>
> > Reply-To: htm...@li...
> > To: <htm...@li...>
> > Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > Date: Mon, 6 May 2002 13:59:11 +0900
> >
> > Hi Raghav,
> > I sent another mail sometime back to you -
> >
> > "HTMLLinkTag.linkData() - this gives you an enumeration - and in the enumeration will be your HTMLImageTag."
> > HTMLNode node;
> > HTMLImageTag imageTag;
> > for (Enumeration e = linkTag.linkData(); e.hasMoreElements();) {
> >     node = (HTMLNode)e.nextElement();
> >     if (node instanceof HTMLImageTag) {
> >         imageTag = (HTMLImageTag)node;
> >         // your code here
> >     }
> > }
> >
> > Regards,
> > Somik
> >
> > ----- Original Message -----
> > From: "Raghavender Srimantula" <kin...@ho...>
> > To: <htm...@li...>
> > Sent: Monday, May 06, 2002 10:43 AM
> > Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> >
> > > Hi Somik,
> > > this question is regarding "not all images are being retrieved". I mean the images under <a tag. I did try to open the attachment you sent me. I could not find anything. but seeing the previous mails I could read that it is not a bug. but still if I do want to retrieve all the images how do I do it.
> > > Thanks,
> > > Raghav
> > >
> > > > From: "Somik Raha" <so...@ya...>
> > > > Reply-To: htm...@li...
> > > > To: <htm...@li...>
> > > > Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > Date: Tue, 30 Apr 2002 11:37:26 +0900
> > > >
> > > > Hi Raghav,
> > > > Ah - this was a question by Annette Doyle (titled "Not all image tags are returned"). I am attaching my reply.
> > > >
> > > > Regards
> > > > Somik
> > > >
> > > > ----- Original Message -----
> > > > From: "Raghavender Srimantula" <kin...@ho...>
> > > > To: <htm...@li...>
> > > > Sent: Tuesday, April 30, 2002 11:16 AM
> > > > Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > >
> > > > > hi Somik,
> > > > > I found one more interesting thing here. when I am trying to get all the images the image scanner would give me images
> > > > > <img src="http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif" width=296 height=27 border=0 usemap=#tm>
> > > > > so if I do a imagetag.getImageLocation(), I would get
> > > > > http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif
> > > > >
> > > > > but is the html content is like this
> > > > > <a href=s/6006><img src=http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif border=0 width=70 height=22></a>
> > > > > which starts with <a and ends with </a>, then the image scanner will not give me http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif when I do a imagetag.getImageLocation(). this is not even classified as an ImageTag. this is classified as LinkTag. how to get this image.
> > > > >
> > > > > the above content is from www.yahoo.com. on the netscape browser if you goto view-->pageinfo, you will see a bunch of images. but when you run the htmlparser you can get only one image.
> > > > >
> > > > > Thanks,
> > > > > Raghav
> > > > >
> > > > > > From: "Somik Raha" <so...@ya...>
> > > > > > Reply-To: htm...@li...
> > > > > > To: <htm...@li...>
> > > > > > Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > > Date: Tue, 30 Apr 2002 09:15:38 +0900
> > > > > >
> > > > > > Can you describe your application ? Was it parsing a single page when the problem occurred ?
> > > > > >
> > > > > > Regards,
> > > > > > Somik
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Raghavender Srimantula" <kin...@ho...>
> > > > > > To: <htm...@li...>
> > > > > > Cc: <htm...@li...>
> > > > > > Sent: Tuesday, April 30, 2002 8:36 AM
> > > > > > Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > >
> > > > > > > Hi Somik,
> > > > > > > I encountered a strange problem today. while I was running htmlparser...I got a java.lang.OutOfMemoryError. seems that lot of objects are being allocated. where exactly is this happening. I mean could you give me an idea where or in which file the potential problem could be.
> > > > > > > Raghav
> > > > > > >
> > > > > > > > From: "Somik Raha" <so...@ya...>
> > > > > > > > Reply-To: htm...@li...
> > > > > > > > To: <htm...@li...>
> > > > > > > > CC: <htm...@li...>
> > > > > > > > Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > > > > Date: Sat, 27 Apr 2002 18:22:34 +0900
> > > > > > > >
> > > > > > > > Hi Annette,
> > > > > > > > Pls find attached a program to get you started. This program will do what you want - you will need to modify the construct that checks for the image tag - and replace it with the location of your choice.
> > > > > > > > Also - I found one bug thanks to this requirement - image tags params were not being correctly put in. Though it needs a deeper look, I have done a quick fix for now, and all test cases are passing (with one test case in HTMLImageScannerTest trapping this bug).
> > > > > > > > Please check out the latest html parser source code from CVS.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Somik
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: Doyle, Annette
> > > > > > > > To: htm...@li...
> > > > > > > > Sent: Friday, April 26, 2002 10:08 PM
> > > > > > > > Subject: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > > > >
> > > > > > > > Could you please give me some hints as how to change only image tag locations and then, (or at the same time) write out the html document to file (with new image tag locations)?
> > > > > > > >
> > > > > > > > Thanks-
> > > > > > > >
> > > > > > > > Annette Doyle
> > > > > > > >
> > > > > > > > << ImageTagRetriever.java >>
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Htmlparser-user mailing list
> > > > > > > Htm...@li...
> > > > > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
> > > >
> > > > << [Htmlparser-developer]Re_[Htmlparser-user]Notallimagetagsarereturned[NotaBug].eml >>
|
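The repair described at the top of this message (closing an unterminated link just before the td boundary) can be illustrated with a small, self-contained sketch. This is not the HTMLLinkScanner change itself, which is not shown in the thread; the class name LinkRepair, the regular expression, and the shortened image name in the usage note are assumptions made for illustration, and the pattern would be fooled by attribute values that contain '>'.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkRepair {

    // Matches any tag; group 1 is "/" for an end tag, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Inserts a missing </a> before a new <a>, before a td boundary, or at end of input. */
    static String closeDanglingLinks(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        boolean linkOpen = false;
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous tag and this one unchanged.
            out.append(html, last, m.start());
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase();
            boolean isLink = name.equals("a");
            boolean isCell = name.equals("td");
            // A fresh <a>, or any <td>/</td>, while a link is still open
            // means the link's end tag was omitted: close it first.
            if (linkOpen && ((isLink && !closing) || isCell)) {
                out.append("</a>");
                linkOpen = false;
            }
            if (isLink) {
                linkOpen = !closing;
            }
            out.append(m.group());
            last = m.end();
        }
        out.append(html, last, html.length());
        if (linkOpen) {
            out.append("</a>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "<a href=s/8741><img src=\"mov_popc.gif\"></img></td>"
                + "<td nowrap><a href=s/7509><b>Yahoo! Movies</b></a>";
        System.out.println(closeDanglingLinks(dirty));
    }
}
```

Run on a shortened version of the yahoo.txt fragment, the first link gets its end tag immediately before </td>, while the second, well-formed link is left untouched.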
From: Somik R. <so...@ya...> - 2002-05-08 11:17:08
|
Hi Craig,

You are now on the developer team.

> For the handleSimpleTag, I'm thinking the only way to do this is to
> maintain an internal tag buffer and callback only once the entire
> document has been parsed and the end tags have been found. Its not
> ideal, but you have to be able to deal with <p> and <p> </p>.

Hmm - I am using a very similar approach - check the code and the explanations that I sent earlier. I dont know what the parser is doing with <p>, it will be interesting to find out.

Cheers,
Somik

----- Original Message -----
From: "Craig Raw" <cr...@qu...>
To: <htm...@li...>
Cc: <so...@ya...>
Sent: Wednesday, May 08, 2002 7:38 PM
Subject: [Htmlparser-developer] RE: [Htmlparser-user] Swing integration

> Thanks, Somik, I'm on the user and dev lists now, and its coming through
> fine. My SourceForge ID is 538740, username 'craigra'.
>
> For the handleSimpleTag, I'm thinking the only way to do this is to
> maintain an internal tag buffer and callback only once the entire
> document has been parsed and the end tags have been found. Its not
> ideal, but you have to be able to deal with <p> and <p> </p>. |
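The tag-buffer idea discussed in this message can be sketched as follows: buffer every token, pair each end tag with the nearest unmatched start tag of the same name, and only then decide whether a start tag becomes a handleStartTag or a handleSimpleTag callback. This is a minimal illustration, not the SwingParser code; the class name TagBuffer, the string token format ("<p>", "</p>", plain text, no attributes), and the event strings are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TagBuffer {

    /**
     * Classifies buffered tokens once the whole document has been seen.
     * Tokens are "<name>" (start), "</name>" (end), or plain text.
     */
    static List<String> classify(List<String> tokens) {
        boolean[] matched = new boolean[tokens.size()];
        List<Integer> open = new ArrayList<>(); // indexes of unmatched start tags

        // Pass 1: pair each end tag with the nearest unmatched start tag of the same name.
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.startsWith("</")) {
                String name = t.substring(2, t.length() - 1);
                for (int k = open.size() - 1; k >= 0; k--) {
                    int j = open.get(k);
                    if (tokens.get(j).substring(1, tokens.get(j).length() - 1).equals(name)) {
                        matched[j] = true;
                        matched[i] = true;
                        // Start tags opened after j never got an end tag; drop them.
                        open.subList(k, open.size()).clear();
                        break;
                    }
                }
            } else if (t.startsWith("<")) {
                open.add(i);
            }
        }

        // Pass 2: emit callbacks; an unmatched start tag is reported as a simple tag.
        List<String> events = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.startsWith("</")) {
                if (matched[i]) {
                    events.add("end:" + t.substring(2, t.length() - 1));
                } // stray end tags are silently ignored in this sketch
            } else if (t.startsWith("<")) {
                events.add((matched[i] ? "start:" : "simple:") + t.substring(1, t.length() - 1));
            } else {
                events.add("text:" + t);
            }
        }
        return events;
    }

    public static void main(String[] args) {
        // The "<p> and <p> </p>" case from the message above.
        System.out.println(classify(Arrays.asList("<p>", "one", "<p>", "two", "</p>")));
    }
}
```

The cost of this approach is the one already noted in the thread: nothing can be reported until the entire document has been buffered.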
From: Craig R. <cr...@qu...> - 2002-05-08 10:38:45
|
Thanks, Somik, I'm on the user and dev lists now, and its coming through fine. My SourceForge ID is 538740, username 'craigra'.

For the handleSimpleTag, I'm thinking the only way to do this is to maintain an internal tag buffer and callback only once the entire document has been parsed and the end tags have been found. Its not ideal, but you have to be able to deal with <p> and <p> </p>.

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Somik Raha
Sent: 08 May 2002 12:13 PM
To: htm...@li...
Cc: Craig Raw
Subject: Re: [Htmlparser-user] Swing integration

Hi Craig,
I actually replied to you on htmlparser-developer, your earlier mails went there. Are you on that list ?
Am attaching the relevant mails to this mail - hope it goes thru.
Regards
Somik

----- Original Message -----
From: "Craig Raw" <cr...@qu...>
To: <htm...@li...>
Cc: <so...@ya...>
Sent: Wednesday, May 08, 2002 6:49 PM
Subject: [Htmlparser-user] Swing integration

> Posted this earlier, seems to have got lost....
> ----
>
> Hi Somik,
>
> I'm looking into the HTMLParser-Swing integration again, and I have two questions:
>
> 1. The HTMLEditorKit.ParserCallback takes a position with most of its callback functions. Can this position be extracted from the HTMLTag's elementBegin()?
>
> 2. There is a need to differentiate between a callback to handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) and handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) when iterating through the HTMLTag elements Enumeration. How?
>
> You mentioned you have started an implementation - if you have a framework going, I'd be happy to continue with the donkey work. I really think this could make Swing's HTML rendering a lot more stable.
>
> Regards,
> Craig
>
> -----Original Message-----
> From: Somik Raha [mailto:so...@ya...]
> Sent: 16 April 2002 04:57 AM
> To: htm...@li...
> Cc: Craig Raw
> Subject: Re: [Htmlparser-user] Swing integration
>
> Hi Craig, Asgher
> I finally had the time to check Swing integration. Boy - the parser design in Swing sucks!! Theoretically its possible to do it - and I got started, but just realized that in order to be compatible with swing objects that do compile time type checking with a particular tag, I have to actually have 73 if statements to give the right tag to the callback.
> I have more important things to do at the moment, but probably will get back to this donkey work. *sigh*
>
> I am thinking we should make release 1.1 and then try this. Any suggestions ?
>
> Regards,
> Somik
>
> ----- Original Message -----
> From: "Somik Raha" <so...@ya...>
> To: <htm...@li...>
> Sent: Thursday, April 04, 2002 11:20 AM
> Subject: Re: [Htmlparser-user] Swing integration
>
> > Hi Craig,
> > Thanks a lot for the post. Pls go ahead with your analysis. I will try to catch up this weekend.
> > Regards,
> > Somik
> >
> > ----- Original Message -----
> > From: "Craig Raw" <cr...@qu...>
> > To: "'Somik Raha'" <so...@ya...>
> > Sent: Tuesday, April 02, 2002 3:32 PM
> > Subject: RE: [Htmlparser-user] Swing integration
> >
> > > Hi Somik,
> > >
> > > A quick excerpt from javax.swing.text.html.HTMLEditorKit javadoc - which is the driver behind JEditorPane's reading and writing HTML capabilities.
> > >
> > > ---
> > > Extendable/Scalable
> > >
> > > To maximize the usefulness of this kit, a great deal of effort has gone into making it extendable. These are some of the features.
> > > The parser is replaceable. The default parser is the Hot Java parser which is DTD based. A different DTD can be used, or an entirely different parser can be used. To change the parser, reimplement the getParser method. The default parser is dynamically loaded when first asked for, so the class files will never be loaded if an alternative parser is used. The default parser is in a separate package called parser below this package.
> > >
> > > The parser drives the ParserCallback, which is provided by HTMLDocument. To change the callback, subclass HTMLDocument and reimplement the createDefaultDocument method to return document that produces a different reader. The reader controls how the document is structured. Although the Document provides HTML support by default, there is nothing preventing support of non-HTML tags that result in alternative element structures.
> > > ---
> > >
> > > I may find some time to look into this as well, although I am not sure how much it would fix JEditorPane's somewhat buggy HTML rendering capabilities....
> > >
> > > -craig
> > >
> > > -----Original Message-----
> > > From: htm...@li... [mailto:htm...@li...] On Behalf Of Somik Raha
> > > Sent: 01 April 2002 05:28 PM
> > > To: HTMLParser User List
> > > Cc: HTMLParser Developer List
> > > Subject: Re: [Htmlparser-user] Swing integration
> > >
> > > Hi Craig
> > > Wow! Thats a great question.
> > > Actually, I doubt if I could replace Sun Microsystems' code with mine. I dont think Java is that open (or is it ?)
> > > However, we could think of writing our own adapter for the html parser that might plugin in some way...
> > > I have never used Sun's html parser (If I had, I might not have started this project).
> > > I will need to study Sun's parser before I can answer your question..
> > > But there does seem to be some interesting possibilities.
> > >
> > > Regards
> > > Somik
> > >
> > > ----- Original Message -----
> > > From: "Craig Raw" <cr...@qu...>
> > > To: <htm...@li...>
> > > Sent: Monday, April 01, 2002 10:20 PM
> > > Subject: [Htmlparser-user] Swing integration
> > >
> > > > Has the HTML Parser been integrated into Swing's HTMLEditorKit to provide a better implementation of JEditorPane's HTML viewing capabilities? HTML Parser would need to replace javax.swing.text.html.parser.Parser, which is currently somewhat buggy.
> > > > Anyone tried this?
> > > >
> > > > -craig
|
From: Somik R. <so...@ya...> - 2002-05-07 11:57:54
|
Hi Craig,

A brief description - you will probably want to use ParserTester. The code is quite dirty at the moment, but the basic idea is:

[1] HTMLParserAdapter adapts our HTMLParser into Parser (Swing's parser). The donkey work comes here, with all those if-then statements.
[2] HTMLParserProvider gives me the parser (makes the method public), so I can control it.
[3] TrialParser - the parser class that allows you to configure which parser you want. ParserTester uses this to create two different parsers, by using the c'tor params. Based on the params, at the point of invoking the parser (in MyParserDelegator), the decision is made as to which parser is to be used.
[4] MyParserCallBack - the same class is used for both parsers. For every callback method, an object of a certain type is created, which is collected in a vector and used later for comparison in the testcase. So, handleSimpleTag() will create a SimpleTagCallBack object. If this method is correctly called by our parser, then the two objects ought to match. (The equals method accomplishes this.)
[5] testTypes package - contains the various types like SimpleTagCallBack, which aid us in testing the callback objects returned by the two parsers.
[6] ParserTester - the main testing mechanism, where you get to create the two parsers, choose what html they have to parse, and then compare their respective callback objects. This one's a nightmare, because the Swing parser puts in tags that weren't there.

You can safely ignore the other classes (I ought to delete them). If you have any doubts, please let me know.

Regards, Somik |
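The comparison idea in [4] to [6] can be sketched roughly as follows. This is only an illustration, not the SwingParser module's code: RecordingCallback and recordWithSwing are made-up names, and events are recorded as plain strings rather than as the testTypes objects mentioned above. It does use Swing's real ParserDelegator and HTMLEditorKit.ParserCallback.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: records every parser callback as a comparable string. */
final class RecordingCallback extends HTMLEditorKit.ParserCallback {
    final List<String> events = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        events.add("start:" + t);
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        events.add("end:" + t);
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        events.add("simple:" + t);
    }

    @Override
    public void handleText(char[] data, int pos) {
        events.add("text:" + new String(data));
    }

    /** Runs Swing's stock parser over the HTML and returns the recorded events. */
    static List<String> recordWithSwing(String html) {
        RecordingCallback cb = new RecordingCallback();
        try {
            new ParserDelegator().parse(new StringReader(html), cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cb.events;
    }

    public static void main(String[] args) {
        List<String> swing = recordWithSwing("<html><body><p>Hello<br>World</p></body></html>");
        swing.forEach(System.out::println);
        // A second list recorded from the adapted parser could be compared
        // against this one with swing.equals(otherEvents).
    }
}
```

Printing the list also shows the tags that the Swing parser adds on its own, which is what makes a direct comparison awkward.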
From: Somik R. <so...@ya...> - 2002-05-07 10:34:24
|
Hi Craig,

You can get the latest code of the SwingParser from CVS. The module name is SwingParser.

cvs -z3 -d:ext:dev...@cv...:/cvsroot/htmlparser co SwingParser

By the way, if you give me your developer id, I can add you to the developer list. Then you can check in your work directly.

Regards, Somik |
From: Somik R. <so...@ya...> - 2002-05-07 10:06:12
|
Hi Craig,

> 1. The HTMLEditorKit.ParserCallback takes a position with most of its
> callback functions. Can this position be extracted from the HTMLTag's
> elementBegin()?

Yes - that's exactly what I'm doing at the moment.

> 2. There is a need to differentiate between a callback to
> handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) and
> handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) when
> iterating through the HTMLTag elements Enumeration. How?

Simple tags are those which don't come in XML-like pairs, e.g. <BR> and <META> will be simple tags, while <title> would be a start tag, as it has to have an end tag. Sadly, we will need to handle every different case manually.

> You mentioned you have started an implementation - if you have a
> framework going, I'd be happy to continue with the donkey work. I really
> think this could make Swing's HTML rendering a lot more stable.

OK - I can put out the code, maybe as a new module. Will let you know as soon as it's done.

Regards, Somik |
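The simple-tag versus start-tag decision discussed in the reply above might look something like this. It is a sketch under assumptions: TagDispatcher, its SIMPLE set and the describe helper are hypothetical, and the list of simple tags is hand-picked rather than taken from the project. HTML.getTag(String) is Swing's own lookup from a lower-case tag name to an HTML.Tag constant, which may reduce the number of per-tag if statements needed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

/** Hypothetical dispatcher from a parsed tag name to the matching Swing callback. */
final class TagDispatcher {
    // Assumption: a hand-picked set of tags that never take an end tag.
    private static final Set<String> SIMPLE = new HashSet<>(Arrays.asList(
            "br", "hr", "img", "meta", "link", "input", "base", "area", "param"));

    /** Forwards the tag to the callback; returns false if Swing does not know the tag. */
    static boolean dispatch(String tagName, int pos, HTMLEditorKit.ParserCallback cb) {
        String name = tagName.toLowerCase(Locale.ROOT);
        HTML.Tag tag = HTML.getTag(name); // null for names Swing has no constant for
        if (tag == null) {
            return false;
        }
        MutableAttributeSet attrs = new SimpleAttributeSet(); // attributes omitted in this sketch
        if (SIMPLE.contains(name)) {
            cb.handleSimpleTag(tag, attrs, pos);
        } else {
            cb.handleStartTag(tag, attrs, pos);
        }
        return true;
    }

    /** Returns e.g. "simple:br@12", "start:title@0" or "unknown" for demonstration. */
    static String describe(String tagName, int pos) {
        final StringBuilder log = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int p) {
                log.append("simple:").append(t).append('@').append(p);
            }

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int p) {
                log.append("start:").append(t).append('@').append(p);
            }
        };
        return dispatch(tagName, pos, cb) ? log.toString() : "unknown";
    }

    public static void main(String[] args) {
        System.out.println(describe("BR", 12));
        System.out.println(describe("title", 0));
    }
}
```

The position argument stands in for whatever elementBegin() returns for the tag being forwarded.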
From: Craig R. <cr...@qu...> - 2002-05-07 09:54:23
|
Hi Somik,

I'm looking into the HTMLParser-Swing integration again, and I have two questions:

1. The HTMLEditorKit.ParserCallback takes a position with most of its callback functions. Can this position be extracted from the HTMLTag's elementBegin()?

2. There is a need to differentiate between a callback to handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) and handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) when iterating through the HTMLTag elements Enumeration. How?

You mentioned you have started an implementation - if you have a framework going, I'd be happy to continue with the donkey work. I really think this could make Swing's HTML rendering a lot more stable.

Regards,
Craig

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: 16 April 2002 04:57 AM
To: htm...@li...
Cc: Craig Raw
Subject: Re: [Htmlparser-user] Swing integration

Hi Craig, Asgher
I finally had the time to check Swing integration. Boy - the parser design in Swing sucks!! Theoretically it's possible to do it - and I got started, but just realized that in order to be compatible with Swing objects that do compile-time type checking with a particular tag, I have to actually have 73 if statements to give the right tag to the callback.
I have more important things to do at the moment, but probably will get back to this donkey work. *sigh*

I am thinking we should make release 1.1 and then try this. Any suggestions?

Regards,
Somik

----- Original Message -----
From: "Somik Raha" <so...@ya...>
To: <htm...@li...>
Sent: Thursday, April 04, 2002 11:20 AM
Subject: Re: [Htmlparser-user] Swing integration

> Hi Craig,
> Thanks a lot for the post. Please go ahead with your analysis. I will try
> to catch up this weekend.
> Regards,
> Somik
>
> ----- Original Message -----
> From: "Craig Raw" <cr...@qu...>
> To: "'Somik Raha'" <so...@ya...>
> Sent: Tuesday, April 02, 2002 3:32 PM
> Subject: RE: [Htmlparser-user] Swing integration
>
> > Hi Somik,
> >
> > A quick excerpt from the javax.swing.text.html.HTMLEditorKit javadoc - which
> > is the driver behind JEditorPane's reading and writing HTML capabilities.
> >
> > ---
> > Extendable/Scalable
> >
> > To maximize the usefulness of this kit, a great deal of effort has gone
> > into making it extendable. These are some of the features.
> > The parser is replaceable. The default parser is the Hot Java parser
> > which is DTD based. A different DTD can be used, or an entirely
> > different parser can be used. To change the parser, reimplement the
> > getParser method. The default parser is dynamically loaded when first
> > asked for, so the class files will never be loaded if an alternative
> > parser is used. The default parser is in a separate package called
> > parser below this package.
> >
> > The parser drives the ParserCallback, which is provided by HTMLDocument.
> > To change the callback, subclass HTMLDocument and reimplement the
> > createDefaultDocument method to return document that produces a
> > different reader. The reader controls how the document is structured.
> > Although the Document provides HTML support by default, there is nothing
> > preventing support of non-HTML tags that result in alternative element
> > structures.
> > ---
> >
> > I may find some time to look into this as well, although I am not sure
> > how much it would fix JEditorPane's somewhat buggy HTML rendering
> > capabilities....
> >
> > -craig
> >
> > -----Original Message-----
> > From: htm...@li...
> > [mailto:htm...@li...] On Behalf Of Somik Raha
> > Sent: 01 April 2002 05:28 PM
> > To: HTMLParser User List
> > Cc: HTMLParser Developer List
> > Subject: Re: [Htmlparser-user] Swing integration
> >
> > Hi Craig
> > Wow! That's a great question.
> > Actually, I doubt if I could replace Sun Microsystems' code with mine. I
> > don't think Java is that open (or is it?)
> > However, we could think of writing our own adapter for the html parser
> > that might plug in in some way...
> > I have never used Sun's html parser (if I had, I might not have started
> > this project).
> > I will need to study Sun's parser before I can answer your question.
> > But there do seem to be some interesting possibilities.
> >
> > Regards
> > Somik
> >
> > ----- Original Message -----
> > From: "Craig Raw" <cr...@qu...>
> > To: <htm...@li...>
> > Sent: Monday, April 01, 2002 10:20 PM
> > Subject: [Htmlparser-user] Swing integration
> >
> > > Has the HTML Parser been integrated into Swing's HTMLEditorKit to
> > > provide a better implementation of JEditorPane's HTML viewing
> > > capabilities? HTML Parser would need to replace
> > > javax.swing.text.html.parser.Parser, which is currently somewhat buggy.
> > > Anyone tried this?
> > >
> > > -craig
> > >
> > > _______________________________________________
> > > Htmlparser-user mailing list
> > > Htm...@li...
> > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
|
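The javadoc excerpt quoted in the thread above says the parser can be replaced by reimplementing getParser. A minimal sketch of that hook, assuming a made-up AdapterEditorKit class: the replacement parser here simply delegates to Swing's own ParserDelegator, standing in for where an HTMLParser-backed adapter would go.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical editor kit that supplies its own HTMLEditorKit.Parser. */
class AdapterEditorKit extends HTMLEditorKit {

    @Override
    protected HTMLEditorKit.Parser getParser() {
        final HTMLEditorKit.Parser stock = new ParserDelegator();
        return new HTMLEditorKit.Parser() {
            @Override
            public void parse(Reader r, HTMLEditorKit.ParserCallback cb, boolean ignoreCharSet)
                    throws IOException {
                // Placeholder: an adapter around another parser would drive cb here.
                stock.parse(r, cb, ignoreCharSet);
            }
        };
    }

    /** Demonstration helper: collects the text that the kit's parser reports. */
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };
        try {
            new AdapterEditorKit().getParser().parse(new StringReader(html), cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<html><body><p>Hello</p></body></html>"));
    }
}
```

A JEditorPane could be pointed at such a kit with setEditorKit, so the rest of the rendering pipeline stays untouched while only the parser changes.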
From: Somik R. <so...@ya...> - 2002-05-07 06:25:34
|
Hi Folks,

Following some nice suggestions from Sam Joseph, I have just completed some design modifications to the basic HTMLNode API. The modifications are:

[1] HTMLNode is no longer an interface, but an abstract class. There were two reasons for this change. Firstly, I couldn't think of a scenario where an object would be an html tag AND something else. Secondly, I wanted to enforce the implementation of toString(), which is usually not done if you implement from an interface (as Object has a default toString()).
[2] abstract toString() method - children have to implement this.
[3] print() and print(PrintWriter) - final methods. They make a call to toString(), and print to standard output and the print writer respectively.
[4] toPlainTextString() - this method provides a string representation of a tag, if there is such a representation. If not, a blank string is returned. This has implications - our program to extract all strings from an html page is simplified to:

    HTMLNode node;
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        node = (HTMLNode) e.nextElement();
        // or whatever processing you want to do with the string
        System.out.println(node.toPlainTextString());
    }

[5] toRawString() - this method provides the complete html element (a reconstruction), thus allowing ripping programs to be really simple. So if you want to rip the html page to your local hard disk, your program would look like:

    HTMLNode node;
    PrintWriter pw = new PrintWriter(new FileWriter("..."));
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        node = (HTMLNode) e.nextElement();
        pw.println(node.toRawString());
    }
    pw.close();

[6] Lots of bug fixes done - HTMLImageScanner had a bug, HTMLStyleScanner also had one - all caught with more testcases.

We have 100 testcases as of now, all of them passing.

To-do list for Release 1.2
------------------------------------
[1] Integration of Raghavender Srimantula's contribution - HTMLFrameScanner and HTMLFormScanner - into the parser. This will be integrated as soon as I get the testcases from Raghav.
[2] Adding an HTML ripping program in the parserApplications package.
[3] Improving the Robot Crawler (??)
[4] Fixes for any bugs that get reported in this period.

You can check out the latest code from CVS. Or you can go to http://htmlparser.sourceforge.net, click on the download link, and choose htmlparser1_2_20020507.zip

Feedback is welcome.

Regards, Somik |
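The shape of the HTMLNode redesign announced above (abstract toString(), final print methods, plain-text and raw-HTML views) could be sketched like this. The names NodeSketch and LinkSketch are invented for illustration and are not the project's classes; only the method names toPlainTextString(), toRawString() and print() come from the announcement.

```java
import java.io.PrintWriter;

/** Hypothetical stand-in for the abstract node class described in the announcement. */
abstract class NodeSketch {
    /** Redeclaring toString() as abstract forces every subclass to implement it. */
    @Override
    public abstract String toString();

    /** Text a reader would see, or an empty string if the node has none. */
    public abstract String toPlainTextString();

    /** A reconstruction of the node's HTML. */
    public abstract String toRawString();

    /** Final: subclasses customise output only through toString(). */
    public final void print() {
        System.out.println(toString());
    }

    public final void print(PrintWriter pw) {
        pw.println(toString());
    }
}

/** Hypothetical concrete node representing a hyperlink. */
final class LinkSketch extends NodeSketch {
    private final String href;
    private final String text;

    LinkSketch(String href, String text) {
        this.href = href;
        this.text = text;
    }

    @Override
    public String toString() {
        return "Link to " + href + " [" + text + "]";
    }

    @Override
    public String toPlainTextString() {
        return text;
    }

    @Override
    public String toRawString() {
        return "<a href=\"" + href + "\">" + text + "</a>";
    }

    public static void main(String[] args) {
        NodeSketch node = new LinkSketch("http://example.com", "Home");
        node.print();
        System.out.println(node.toPlainTextString());
        System.out.println(node.toRawString());
    }
}
```

With an interface instead of an abstract class, the inherited Object.toString() would satisfy the compiler, so nothing would push implementers to write their own.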
From: Somik R. <so...@ya...> - 2002-05-03 09:26:01
|
Hi Folks,

A testing build is out - you can download it from http://htmlparser.sourceforge.net (choose the download link). This is a testing build with important bug fixes.

Regards, Somik |
From: Somik R. <so...@ya...> - 2002-05-03 08:15:23
|
Hi Folks,

We seem to have a heroic parser now... You can check out the latest code from CVS.

Here's the fix. As you know, if we have an additional erroneous inverted comma in a tag, the parser cannot judge whether to treat this as erroneous or valid. Now the parser has some amount of intelligence: if it encounters an inverted comma and a close tag character, it does a check to see whether it should treat this as an error or a valid character.

This decision-making process is facilitated with a strictVector, which holds the tags for which it should not make allowances. Currently, there is only one - "INPUT" (should we have any more?). If the tag being parsed is not a strict tag like INPUT, then it is assumed that this is an erroneous tag and needs to be corrected.

The correction process occurs (and is validated with some testcases in HTMLTag - particularly testStrictParsing). If you go through that testcase, you will see that the attributes are also correctly retrieved.

This solution doesn't break anything else - we have 82 testcases, all passing.

I'd be grateful if folks can test this version and let me know if this solution is acceptable.

Also, a general question - would you prefer something like nightly drop packages for downloading, or is a request to check out from CVS fine?

Thanks and Regards, Somik |
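One way such a strict-tag check could work is sketched below. The message above does not spell out the actual algorithm, so everything here is an assumption: LenientTagScanner, STRICT_TAGS and in particular the look-ahead rule (a '>' inside an open quote ends the tag when no closing quote follows before the next '<') are illustrative guesses, not the parser's real logic.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

/** Hypothetical quote-tolerant tag scanner with a list of strict tags. */
final class LenientTagScanner {
    // Mirrors the idea of the strictVector: tags for which no allowances are made.
    static final Set<String> STRICT_TAGS = Collections.singleton("INPUT");

    /** Returns the index of the '>' that ends the tag beginning at start, or -1. */
    static int findTagEnd(String html, int start) {
        boolean strict = STRICT_TAGS.contains(tagName(html, start));
        boolean inQuotes = false;
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == '>') {
                if (!inQuotes) {
                    return i;
                }
                // Assumed heuristic: for non-strict tags, an unterminated quote
                // means the quote itself was the error, so the tag ends here.
                if (!strict && !quoteClosesBeforeNextTag(html, i + 1)) {
                    return i;
                }
            }
        }
        return -1;
    }

    private static boolean quoteClosesBeforeNextTag(String html, int from) {
        for (int i = from; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '"') {
                return true;
            }
            if (c == '<') {
                return false;
            }
        }
        return false;
    }

    private static String tagName(String html, int start) {
        int begin = Math.min(start + 1, html.length()); // skip the '<'
        int end = begin;
        while (end < html.length() && Character.isLetterOrDigit(html.charAt(end))) {
            end++;
        }
        return html.substring(begin, end).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(findTagEnd("<font face=\"Arial,\"helvetica,\" size=\"2\">Home</font>", 0));
        System.out.println(findTagEnd("<a title=\"x>y\">link</a>", 0));
    }
}
```

A legitimately quoted '>' still passes through for non-strict tags, because its closing quote appears before the next tag opens.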
From: Somik R. <so...@ya...> - 2002-05-02 03:30:52
|
Hi Folks,

Thanks to an interesting bug report by Roger Sollberger, a bug in HTMLStringNode has been fixed. Links of the type:

<a href="http://asgard.ch">[> ASGARD <]</a>

would get messed up because of the tag symbols, when they should really be a part of HTMLStringNode.

This has been fixed (after the bug was reproduced in a testcase in HTMLStringNodeTest). CVS code base updated.

Roger --> Thanks a lot for the report.

Regards, Somik |
From: Somik R. <so...@ya...> - 2002-05-02 03:11:27
|
Hi Folks,

If you've been following the latest exchange on htmlparser-user, Annette has shown us a crazy example of dirty html, which works in the browser but crashes the parser. The site is http://www.cia.gov

Search for this string -

<font face="Arial,"helvetica,"

and you will find it in the html. Now this erroneous inverted comma in front of helvetica should not be there.

This has been captured in a test case in HTMLTagTest.java (you can get it from CVS), and this test fails (testParsing()).

The problem is that the core parsing mechanism ignores anything within inverted commas. This is critical so as to be able to accept angular brackets in inverted commas. If we remove this feature from the parser, other tests will break.

So I need some suggestions on how we might modify our parsing - how do we intelligently understand that this is an error (how easy it is for us humans to figure this out)? Looks like linear approaches won't work anymore... Maybe we need to associate some intelligence - if it's a font tag, then this kind of stuff is most definitely an error, whereas if it's a jsp tag, we can get more strict with our parsing. This will probably cause a fundamental shift in our core parsing technique.

Regards, Somik |
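The failure mode described above can be reproduced with a few lines. This is a simplified sketch, not the project's parsing code: QuoteAwareScanner and findTagEnd are hypothetical names for a scanner that skips anything between double quotes while looking for the end of a tag.

```java
/** Hypothetical scanner that ignores '>' characters inside double quotes. */
final class QuoteAwareScanner {

    /** Returns the index of the first unquoted '>' at or after start, or -1 if none. */
    static int findTagEnd(String html, int start) {
        boolean inQuotes = false;
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // every quote toggles the state
            } else if (c == '>' && !inQuotes) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Angle bracket inside a quoted attribute value: handled correctly.
        System.out.println(findTagEnd("<a title=\"x>y\">link", 0));
        // One stray quote leaves the scanner "inside quotes" at the real '>'.
        System.out.println(findTagEnd("<font face=\"Arial,\"helvetica,\" size=\"2\">Home", 0));
    }
}
```

The same toggle that lets the first tag through is what swallows the closing bracket of the second one, which is why removing the feature outright would break the other tests.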
From: Somik R. <so...@ya...> - 2002-05-02 02:59:22
|
Hi Annette,

Regarding your second problem, the parsing error occurs because of this:

<div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font

In the above - font face="Arial,"helvetica," - note the erroneous extra " in front of helvetica. Remove it and the parsing is fine. Now of course you can't remove it, because this site is not yours :). So, we do have to support this kind of dirty html. Thank you so much for bringing it to our notice. I have written a test case to reproduce this bug, and am working to resolve this.

Regards, Somik

<div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
| <a href="/cia/notices.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Notices</font></a>
| <a href="/cia/notices.html#priv" link="#000000" vlink="#000000"><font color="#FFFFFF">Privacy</font></a>
| <a href="/cia/notices.html#sec" link="#000000" vlink="#000000"><font color="#FFFFFF">Security</font></a>
| <a href="/cia/contact.htm" link="#000000" vlink="#000000"><font color="#FFFFFF">Contact Us</font></a>
| <a href="/cia/sitemap.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Site Map</font></a>
| <a href="/cia/siteindex.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Index</font></a>
| <a href="/search" link="#000000" vlink="#000000"><font color="#FFFFFF">Search</font></a>
</font></div>

Stops at

TAG LINE FOUND <div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
LINE is <div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
POSITION IS 26
TAGLINE 197
Process completed.

Annette Doyle |