htmlparser-developer Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2003-05-30 12:08:34
|
Dear Dhaval, Thank you for being a part of this project, and best wishes for your higher studies! Cheers Somik ----- Original Message ----- From: <dha...@po...> To: <htm...@li...>; <htm...@li...> Sent: Friday, May 30, 2003 2:50 AM Subject: [Htmlparser-developer] Bye bye Everyone, I have been associated with this project for a shade less than one year. During this period I have made some small contributions to this project and identified a few bugs. Most of all what I have enjoyed is the tremendous learning that I have received both from a technical viewpoint and a design perspective. Its altered my methodology of software development. For one it has instilled JUnit into my development methodology. It has also showed me that redesign is not such a bad thing. On the whole it has been quite a great experience working with some amazing people like Somik, Derrick and many more amongst u all. I thank u all for the support that I have recvd, the quick bug-fixes, the quick-fix solutions and the exhilarating discussions that I have been involved in within this group. I am moving on for higher studies in the field of management and I do not think I can keep so many things on my plate. So very sadly letting go of a few. One of them being the HTMLParser. I wish it all the best in future and hope that the tool continues for a long long time to come. Regards to all, Dhaval |
From: Derrick O. <Der...@ro...> - 2003-05-30 10:59:27
|
Dhaval, Your valuable input and extensive experience will be sorely missed. Best of luck in your new endeavors. Derrick dha...@po... wrote: >Everyone, > >I have been associated with this project for a shade less than one year. >During this period I have made some small contributions to this project >and identified a few bugs. Most of all what I have enjoyed is the >tremendous learning that I have received both from a technical viewpoint >and a design perspective. Its altered my methodology of software >development. For one it has instilled JUnit into my development >methodology. It has also showed me that redesign is not such a bad thing. >On the whole it has been quite a great experience working with some >amazing people like Somik, Derrick and many more amongst u all. I thank >u all for the support that I have recvd, the quick bug-fixes, the >quick-fix solutions and the exhilarating discussions that I have been >involved in within this group. > >I am moving on for higher studies in the field of management and I do >not think I can keep so many things on my plate. So very sadly letting >go of a few. One of them being the HTMLParser. > >I wish it all the best in future and hope that the tool continues for a >long long time to come. > >Regards to all, >Dhaval > > >------------------------------------------------------------------------ > >This e-Mail may contain proprietary and confidential information and is sent for the intended recipient(s) only. >If by an addressing or transmission error this mail has been misdirected to you, you are requested to delete this mail immediately. >You are also hereby notified that any use, any form of reproduction, dissemination, copying, disclosure, modification, >distribution and/or publication of this e-mail message, contents or its attachment other than by its intended recipient/s is strictly prohibited. > >Visit Us at http://www.polaris.co.in > > |
From: <dha...@po...> - 2003-05-30 07:13:00
|
Everyone, I have been associated with this project for a shade less than one year. During this period I have made some small contributions to this project and identified a few bugs. Most of all what I have enjoyed is the tremendous learning that I have received both from a technical viewpoint and a design perspective. Its altered my methodology of software development. For one it has instilled JUnit into my development methodology. It has also showed me that redesign is not such a bad thing. On the whole it has been quite a great experience working with some amazing people like Somik, Derrick and many more amongst u all. I thank u all for the support that I have recvd, the quick bug-fixes, the quick-fix solutions and the exhilarating discussions that I have been involved in within this group. I am moving on for higher studies in the field of management and I do not think I can keep so many things on my plate. So very sadly letting go of a few. One of them being the HTMLParser. I wish it all the best in future and hope that the tool continues for a long long time to come. Regards to all, Dhaval |
From: <dha...@po...> - 2003-05-29 13:06:43
|
Hi Terry,

I had also felt the need for a root tag which would allow me to drill down using CompositeTag functionality. However, as a workaround u could register the HtmlScanner, and then you would obtain the HTML tag as the base tag under which all tags would be present. Instead of using registerScanners, use registerDomScanners(). Apart from what registerScanners does, it also registers the HtmlScanner, HeadScanner and BodyScanner.

Dhaval

> -----Original Message-----
> From: htm...@li...
> [mailto:htm...@li...] On Behalf Of tez...@ya...
> Sent: Thursday, May 29, 2003 5:23 PM
> To: Htm...@li...
> Subject: [Htmlparser-developer] Composite Tags != Composite Pattern
>
> The CompositeTag is quite heavy on tasty [Australian
> for good] functionality.
>
> The way it seems to be implemented here is contrary to
> the 'Composite' Design Pattern. I'm having difficulty
> forming a composite of the whole document, say an
> abstract <PARSE_ROOT> object.
>
> Currently I'm doing a hack, knowing there is a table:
> See
>
> Node nodes [] = myParser.extractAllNodesThatAre(TableTag.class);
> TableTag table = (TableTag)nodes[0];
> TableTag htmlComposite = (TableTag) nodes[0];
>
> I need to do this to access the CompositeTag
> functionality. Is there a simpler way?
>
> Would it be useful to have a
>
> public CompositeTag getRootTag() {}
>
> in Parser?
>
> Terry.
>
> =====
> Terry Alexis Lurie | 'Something witty that doesn't
> Freelance Computer Engineer | look good with variable
> United Kingdom | width fonts' - Most nerds
>
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
From: <tez...@ya...> - 2003-05-29 11:53:13
|
The CompositeTag is quite heavy on tasty [Australian for good] functionality.

The way it seems to be implemented here is contrary to the 'Composite' Design Pattern. I'm having difficulty forming a composite of the whole document, say an abstract <PARSE_ROOT> object.

Currently I'm doing a hack, knowing there is a table:
See

    Node nodes [] = myParser.extractAllNodesThatAre(TableTag.class);
    TableTag table = (TableTag)nodes[0];
    TableTag htmlComposite = (TableTag) nodes[0];

I need to do this to access the CompositeTag functionality. Is there a simpler way?

Would it be useful to have a

    public CompositeTag getRootTag() {}

in Parser?

Terry.

=====
Terry Alexis Lurie | 'Something witty that doesn't
Freelance Computer Engineer | look good with variable
United Kingdom | width fonts' - Most nerds |
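For readers following the thread, here is a minimal sketch of what a document-level root in the Composite pattern could look like. It is an illustration only: SketchNode, TextNode, CompositeNode and RootSketch are hypothetical names invented for this example and are not htmlparser classes, and nothing here reflects how Parser or CompositeTag is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only; these are NOT htmlparser classes.
interface SketchNode {
    String toHtml();
}

// A leaf: plain text with no children.
class TextNode implements SketchNode {
    private final String text;

    TextNode(String text) { this.text = text; }

    public String toHtml() { return text; }
}

// A composite: holds children and is itself a node.
class CompositeNode implements SketchNode {
    private final String name; // null marks the synthetic document root
    private final List<SketchNode> children = new ArrayList<>();

    CompositeNode(String name) { this.name = name; }

    CompositeNode add(SketchNode child) {
        children.add(child);
        return this;
    }

    int childCount() { return children.size(); }

    // Depth-first search for every nested composite with the given name.
    List<CompositeNode> findAll(String tagName) {
        List<CompositeNode> found = new ArrayList<>();
        for (SketchNode child : children) {
            if (child instanceof CompositeNode) {
                CompositeNode composite = (CompositeNode) child;
                if (tagName.equalsIgnoreCase(composite.name)) {
                    found.add(composite);
                }
                found.addAll(composite.findAll(tagName));
            }
        }
        return found;
    }

    public String toHtml() {
        StringBuilder sb = new StringBuilder();
        if (name != null) sb.append('<').append(name).append('>');
        for (SketchNode child : children) sb.append(child.toHtml());
        if (name != null) sb.append("</").append(name).append('>');
        return sb.toString();
    }
}

class RootSketch {
    // Wraps whatever top-level nodes a parse produced in a nameless root,
    // so callers never need to know in advance that "there is a table".
    static CompositeNode rootOf(List<SketchNode> topLevel) {
        CompositeNode root = new CompositeNode(null);
        for (SketchNode node : topLevel) root.add(node);
        return root;
    }

    public static void main(String[] args) {
        List<SketchNode> topLevel = new ArrayList<>();
        topLevel.add(new TextNode("intro"));
        topLevel.add(new CompositeNode("table")
                .add(new CompositeNode("tr").add(new TextNode("cell"))));
        CompositeNode root = rootOf(topLevel);
        System.out.println(root.findAll("tr").size());
        System.out.println(root.toHtml());
    }
}
```

The design point is that the root is just another composite, so the same drill-down calls work on the whole document as on any single tag.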
From: <dha...@po...> - 2003-05-29 04:29:13
|
Marc,

Your requirement is quite common. Mostly, code inside a <SCRIPT> tag should be produced as it is. I think it's important that we have the test cases and appropriate fixes in the main codebase.

Dhaval

> -----Original Message-----
> From: htm...@li...
> [mailto:htm...@li...] On Behalf Of Marc Novakowski
> Sent: Wednesday, May 28, 2003 8:30 PM
> To: htm...@li...; htm...@li...
> Subject: RE: [Htmlparser-developer] RE: [Htmlparser-cvs]
> htmlparser/src/org/htmlparser/scanners
> CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
> Derrick, if it's anybody's fault that my code is failing because of your change, it's mine. I should have checked in specific test cases that exercise my usage of the library. I apologise for not doing that earlier...
>
> Here are the main things that the new ScriptScanner does that breaks my code:
> 1) acts very strangely when it encounters "\" at a newline (doesn't just get rid of the newline, but it starts repeating the entire line about 6 times)
> 2) uppercases and auto-closes tags that aren't in quotes
>
> I have some specific test cases that demonstrate these. I'll check them in if you'd like. I have to admit that after playing with the internals of NodeReader, TagScanner, etc. that I'm not 100% clear on how some of this low level scanning code works. Nor is it always clear from reading the code. That's why I am not confident that I will be able to refactor the existing code to handle my specific problems.
>
> I realize my usage of the parser may be quite different than 95% of the people who use the library, so if there isn't a solution that fits into the existing architecture I'll be happy to just make some local changes to fix things. I can always make my own scanner and not check it into the codeline (or just copy the old version of ScriptScanner into my code). However, if I'm running into this now, chances are somebody in the future will, also.
>
> Marc
>
> -----Original Message-----
> From: Derrick Oswald [mailto:Der...@ro...]
> Sent: Tue 5/27/2003 6:26 PM
> To: htm...@li...
> Cc:
> Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
> You may need to back out the change, or at a minimum get the old code by going back a version and putting it in your ScriptScanner base class.
>
> I guess I screwed up. I saw your drop that allowed all the lines to be accumulated in a tag and I thought the two scanners were very close then (apart from the tags in quotes thing). My only excuse is it passed all the unit tests. Well to be truthful I changed two of the tests, but it was only extraneous newline stuff at the start and end of text.
>
> The script scanner is breaking your code because of uppercasing tags (not just within comments) and removing newlines after \, right?
>
> Marc Novakowski wrote:
>
> >I just realized that it's more complicated than that (for me, at least). In my application that uses htmlparser, I am extending certain scanners and tags (such as ScriptScanner but mostly CompositeTagScanner) to allow for "custom" tags in an HTML page. When the "HTML + custom tags" are run through my custom parser, the custom tags are converted into an object model which is then turned into dynamic javascript code.
> >
> >Long story short: some of these custom tags (i.e. the ones that extend ScriptScanner) _absolutely_ need the inner contents of the tag to remain unchanged. Also, since it's not always Javascript that is inside of the tags, adding extra rules to ignore tags in comments or strings won't always work. For example, one tag allows for arbitrary XML innards. Currently, the scanner will UPPERCASE all tags inside unless they're in quotes (which messes up the XML).
> >
> >The old ScriptScanner did exactly what I needed -- that is, it didn't scan for tags at all. It just looked for the exact (case-insensitive) string match of the end tag. It didn't look for "<" and it didn't defer to scanners. I took a look at the current code and I can't see any easy way to do this.
> >
> >Marc
> >
> >-----Original Message-----
> >From: Derrick Oswald [mailto:Der...@ro...]
> >Sent: Tuesday, May 27, 2003 2:39 PM
> >To: htm...@li...
> >Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs]
> >htmlparser/src/org/htmlparser/scanners
> >CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
> >
> >Marc,
> >
> >The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text or remarks.
> >I guess the text scanner goes until it sees a <x... and then stops to defer to a tag scanner. I hadn't thought about those in comments, or about the \ end of lines.
> >
> >Perhaps, rather than write a new scanner, fix the StringScanner (the remark scanner should be OK), so that it does the correct behaviour when balance_quotes is true. Then the 'balance_quotes' flag could be called 'strict_script' or something.
> >
> >Derrick
> >
> >Marc Novakowski wrote: |
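The behaviour Marc attributes to the old ScriptScanner (no tag scanning inside the element, only a case-insensitive search for the end tag, contents returned untouched) can be sketched in a few lines. This is a guess at the idea and not the original ScriptScanner source: RawContentSketch and both method names are made up for this example, only java.lang.String methods are used, and the sketch deliberately ignores quoting, which is the situation bug #741769 was about.

```java
// Hypothetical helper for illustration only; NOT the htmlparser ScriptScanner.
class RawContentSketch {

    // Case-insensitive indexOf built on String.regionMatches, so that the
    // returned index always refers to the original, unmodified text.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns everything between the first <tagName ...> and the first
    // </tagName, exactly as written, or null if either part is missing.
    // Assumes the opening tag contains no '>' inside attribute values.
    static String rawContents(String html, String tagName) {
        int open = indexOfIgnoreCase(html, "<" + tagName, 0);
        if (open < 0) return null;
        int openEnd = html.indexOf('>', open);
        if (openEnd < 0) return null;
        int contentStart = openEnd + 1;
        int close = indexOfIgnoreCase(html, "</" + tagName, contentStart);
        if (close < 0) return null;
        return html.substring(contentStart, close);
    }

    public static void main(String[] args) {
        String html = "<SCRIPT>if (a<b) { x = '<p>'; } // <i>note</i>\n</script>tail";
        System.out.println(rawContents(html, "script"));
    }
}
```

Because nothing inside the element is interpreted, lowercase tags, comments and backslash-newline sequences all survive unchanged; the cost is that an end tag appearing inside a quoted string would end the element early.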
From: Marc N. <ma...@ke...> - 2003-05-28 22:44:45
|
Derrick,

I like your ideas, and I think that your suggested refactoring would make the lower-level code in htmlparser much less mysterious and (hopefully) easier to maintain and extend.

Marc

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Wednesday, May 28, 2003 3:25 PM
To: htm...@li...
Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs]
htmlparser/src/org/htmlparser/scanners
CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22

Marc,

I've been thinking about your problem and I think I have a solution.
I'll re-write the node reader.

OK, that's the bottom line, but I've said before that the lowest level should return a contiguous stream of nodes, that have the original characters (not case converted) and include the formatting like line endings and other whitespace so that toHtml() gives you the exact same page that you started with.

I should make a picture, but see if you can follow me here.

The lowest level is a byte stream, right off the wire. This needs to support mark and reset in case the character set changes.

The second level is a character stream, after applying the decoding for a particular charset.

The third level is a string, which is a char array. The chars are copied from the second level, so that can be discarded, but only after the entire stream has been drained. If we want to do threaded access to the socket to provide for parallel parsing while reading, the characters need to be kept around to create whole new strings.

The fourth level is a stream of tags. Instead of keeping substrings though, the tags just keep character position, start and end, within the entire page, like a cursor, and a pointer to a new 'Page' object. That way as the Page reads more bytes from the stream, it accumulates more characters, which make a bigger string that represents the page read so far, and there's nothing preventing the older strings from being garbage collected.

The upper case thing goes away since the tags point to the original characters via their offsets. The end of line thing goes away because the reader just treats a newline as any other whitespace.

So what you have after a parse is a single (very large) string with a parallel stream of tag objects with a whole bunch of cursors pointing into the string.

I've experimented with reading all the characters up front and that breaks 67 test cases. If you erroneously substitute "\n" for "\r\n" (or vice versa) there are only 47 failed cases left. The reset on character set change test case is one of them. If you erroneously consume newlines at the front of string nodes the number of failing tests is only 33. And if you erroneously return no string nodes if that consumption leaves nothing left in the string, there are only 15 failing cases. These would have to be examined in detail for correctness, according to the HTML spec.

So it's doable.
I just have to find the time.
For now just include the entire original ScriptScanner.scan() code in a base class for your script scanners so that the evil CompositeTagScanner.scan() is overridden.

Derrick

Marc wrote:

>Here are the main things that the new ScriptScanner does that breaks my code:
>1) acts very strangely when it encounters "\" at a newline (doesn't just get rid of the newline, but it starts repeating the entire line about 6 times)
>2) uppercases and auto-closes tags that aren't in quotes |
From: Derrick O. <Der...@ro...> - 2003-05-28 22:32:24
|
Marc,

I've been thinking about your problem and I think I have a solution.
I'll re-write the node reader.

OK, that's the bottom line, but I've said before that the lowest level should return a contiguous stream of nodes, that have the original characters (not case converted) and include the formatting like line endings and other whitespace so that toHtml() gives you the exact same page that you started with.

I should make a picture, but see if you can follow me here.

The lowest level is a byte stream, right off the wire. This needs to support mark and reset in case the character set changes.

The second level is a character stream, after applying the decoding for a particular charset.

The third level is a string, which is a char array. The chars are copied from the second level, so that can be discarded, but only after the entire stream has been drained. If we want to do threaded access to the socket to provide for parallel parsing while reading, the characters need to be kept around to create whole new strings.

The fourth level is a stream of tags. Instead of keeping substrings though, the tags just keep character position, start and end, within the entire page, like a cursor, and a pointer to a new 'Page' object. That way as the Page reads more bytes from the stream, it accumulates more characters, which make a bigger string that represents the page read so far, and there's nothing preventing the older strings from being garbage collected.

The upper case thing goes away since the tags point to the original characters via their offsets. The end of line thing goes away because the reader just treats a newline as any other whitespace.

So what you have after a parse is a single (very large) string with a parallel stream of tag objects with a whole bunch of cursors pointing into the string.

I've experimented with reading all the characters up front and that breaks 67 test cases. If you erroneously substitute "\n" for "\r\n" (or vice versa) there are only 47 failed cases left. The reset on character set change test case is one of them. If you erroneously consume newlines at the front of string nodes the number of failing tests is only 33. And if you erroneously return no string nodes if that consumption leaves nothing left in the string, there are only 15 failing cases. These would have to be examined in detail for correctness, according to the HTML spec.

So it's doable.
I just have to find the time.
For now just include the entire original ScriptScanner.scan() code in a base class for your script scanners so that the evil CompositeTagScanner.scan() is overridden.

Derrick

Marc wrote:

>Here are the main things that the new ScriptScanner does that breaks my code:
>1) acts very strangely when it encounters "\" at a newline (doesn't just get rid of the newline, but it starts repeating the entire line about 6 times)
>2) uppercases and auto-closes tags that aren't in quotes |
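The "fourth level" Derrick describes, where nodes hold only start and end offsets into a shared page buffer so that toHtml() returns the original characters, might look roughly like the sketch below. Everything here is an assumption made for illustration: PageSketch, CursorNode and CursorLexer are invented names, the lexer splits naively on '<' and '>' with no knowledge of quotes or comments, and the whole page is appended before lexing, whereas the proposal has the Page grow while the stream is still being read.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is NOT the htmlparser API.
class PageSketch {
    private final StringBuilder chars = new StringBuilder();

    // Stands in for "the Page reads more bytes and accumulates more characters".
    void append(String decoded) { chars.append(decoded); }

    int length() { return chars.length(); }

    String text(int start, int end) { return chars.substring(start, end); }
}

// A node is only a pair of offsets plus a pointer to the page it came from.
class CursorNode {
    final PageSketch page;
    final int start;     // inclusive offset into the page
    final int end;       // exclusive offset into the page
    final boolean isTag;

    CursorNode(PageSketch page, int start, int end, boolean isTag) {
        this.page = page;
        this.start = start;
        this.end = end;
        this.isTag = isTag;
    }

    // No case conversion and no newline handling: just the original characters.
    String toHtml() { return page.text(start, end); }
}

class CursorLexer {
    // Splits the page into a contiguous run of tag and text nodes.
    static List<CursorNode> split(PageSketch page) {
        List<CursorNode> nodes = new ArrayList<>();
        int n = page.length();
        String all = page.text(0, n);
        int pos = 0;
        while (pos < n) {
            boolean isTag = all.charAt(pos) == '<';
            int next;
            if (isTag) {
                int close = all.indexOf('>', pos);
                next = close < 0 ? n : close + 1;
            } else {
                int open = all.indexOf('<', pos);
                next = open < 0 ? n : open;
            }
            nodes.add(new CursorNode(page, pos, next, isTag));
            pos = next;
        }
        return nodes;
    }

    // Concatenating every node reproduces the page exactly.
    static String toHtml(List<CursorNode> nodes) {
        StringBuilder sb = new StringBuilder();
        for (CursorNode node : nodes) sb.append(node.toHtml());
        return sb.toString();
    }

    public static void main(String[] args) {
        PageSketch page = new PageSketch();
        page.append("<P Class=x>\r\nHello\n");
        page.append("<br>World</p>");
        List<CursorNode> nodes = split(page);
        System.out.println(nodes.size());
        System.out.println(toHtml(nodes).equals(page.text(0, page.length())));
    }
}
```

Since the nodes are contiguous and never copy or normalise text, mixed-case tag names and "\r\n" line endings come back exactly as they arrived, which is the property the uppercasing and end-of-line complaints in this thread depend on.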
From: Marc N. <ma...@ke...> - 2003-05-28 14:59:52
|
Derrick, if it's anybody's fault that my code is failing because of your change, it's mine. I should have checked in specific test cases that exercise my usage of the library. I apologise for not doing that earlier...

Here are the main things that the new ScriptScanner does that breaks my code:
1) acts very strangely when it encounters "\" at a newline (doesn't just get rid of the newline, but it starts repeating the entire line about 6 times)
2) uppercases and auto-closes tags that aren't in quotes

I have some specific test cases that demonstrate these. I'll check them in if you'd like. I have to admit that after playing with the internals of NodeReader, TagScanner, etc. that I'm not 100% clear on how some of this low level scanning code works. Nor is it always clear from reading the code. That's why I am not confident that I will be able to refactor the existing code to handle my specific problems.

I realize my usage of the parser may be quite different than 95% of the people who use the library, so if there isn't a solution that fits into the existing architecture I'll be happy to just make some local changes to fix things. I can always make my own scanner and not check it into the codeline (or just copy the old version of ScriptScanner into my code). However, if I'm running into this now, chances are somebody in the future will, also.

Marc

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Tue 5/27/2003 6:26 PM
To: htm...@li...
Cc:
Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22

You may need to back out the change, or at a minimum get the old code by going back a version and putting it in your ScriptScanner base class.

I guess I screwed up. I saw your drop that allowed all the lines to be accumulated in a tag and I thought the two scanners were very close then (apart from the tags in quotes thing). My only excuse is it passed all the unit tests. Well to be truthful I changed two of the tests, but it was only extraneous newline stuff at the start and end of text.

The script scanner is breaking your code because of uppercasing tags (not just within comments) and removing newlines after \, right?

Marc Novakowski wrote:

>I just realized that it's more complicated than that (for me, at least). In my application that uses htmlparser, I am extending certain scanners and tags (such as ScriptScanner but mostly CompositeTagScanner) to allow for "custom" tags in an HTML page. When the "HTML + custom tags" are run through my custom parser, the custom tags are converted into an object model which is then turned into dynamic javascript code.
>
>Long story short: some of these custom tags (i.e. the ones that extend ScriptScanner) _absolutely_ need the inner contents of the tag to remain unchanged. Also, since it's not always Javascript that is inside of the tags, adding extra rules to ignore tags in comments or strings won't always work. For example, one tag allows for arbitrary XML innards. Currently, the scanner will UPPERCASE all tags inside unless they're in quotes (which messes up the XML).
>
>The old ScriptScanner did exactly what I needed -- that is, it didn't scan for tags at all. It just looked for the exact (case-insensitive) string match of the end tag. It didn't look for "<" and it didn't defer to scanners. I took a look at the current code and I can't see any easy way to do this.
>
>Marc
>
>-----Original Message-----
>From: Derrick Oswald [mailto:Der...@ro...]
>Sent: Tuesday, May 27, 2003 2:39 PM
>To: htm...@li...
>Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs]
>htmlparser/src/org/htmlparser/scanners
>CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
>Marc,
>
>The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text or remarks.
>I guess the text scanner goes until it sees a <x... and then stops to defer to a tag scanner. I hadn't thought about those in comments, or about the \ end of lines.
>
>Perhaps, rather than write a new scanner, fix the StringScanner (the remark scanner should be OK), so that it does the correct behaviour when balance_quotes is true. Then the 'balance_quotes' flag could be called 'strict_script' or something.
>
>Derrick
>
>Marc Novakowski wrote: |
From: <dha...@po...> - 2003-05-28 05:23:17
|
Marc,

I agree with Derrick. Let's correct the existing scanner rather than write something new, since it typically gets confusing for users to know which scanner to use and how the two scanners differ. It takes a lot of experience with the parser to understand the subtle difference between the two.

> -----Original Message-----
> From: htm...@li...
> [mailto:htm...@li...] On Behalf Of Der...@ro...
> Sent: Wednesday, May 28, 2003 3:09 AM
> To: htm...@li...
> Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs]
> htmlparser/src/org/htmlparser/scanners
> CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
> Marc,
>
> The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text
> or remarks.
> I guess the text scanner goes until it sees a <x... and then stops to
> defer to a tag scanner. I hadn't thought about those in comments, or
> about the \ end of lines.
>
> Perhaps, rather than write a new scanner, fix the StringScanner (the
> remark scanner should be OK), so that it does the correct behaviour when
> balance_quotes is true. Then the 'balance_quotes' flag could be called
> 'strict_script' or something.
>
> Derrick |
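The quote-balancing behaviour discussed in this thread (the balance_quotes flag from the fix for bug #741769) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the actual StringScanner code: the class QuoteAwareEndTagSearch and the method indexOfOutsideQuotes are hypothetical names, and the handling of backslash escapes is a guess at what a script-aware scan would need.

```java
/** Hypothetical helper for illustration only; not part of htmlparser. */
final class QuoteAwareEndTagSearch {

    /**
     * Returns the index of the first case-insensitive occurrence of endTag
     * at or after 'from' that is not inside a single- or double-quoted
     * string, or -1 if there is none.
     */
    static int indexOfOutsideQuotes(String text, String endTag, int from) {
        char quote = 0; // 0 means "not inside a quoted string"
        for (int i = Math.max(from, 0); i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++; // skip the character escaped by the backslash
                } else if (c == quote) {
                    quote = 0; // closing quote found
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote found
            } else if (text.regionMatches(true, i, endTag, 0, endTag.length())) {
                return i; // end tag found outside any quoted string
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "document.write(\"</script>\");</SCRIPT>";
        // Prints the position of the unquoted end tag, not the quoted one.
        System.out.println(indexOfOutsideQuotes(text, "</script>", 0));
    }
}
```

A scan like this skips the quoted </script> inside document.write(), but it knows nothing about comments: an apostrophe in a script comment such as "// don't" opens a "string" that never closes, so the real end tag is missed. That is the kind of case raised in the thread about tag-like text inside comments.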
From: Derrick O. <Der...@ro...> - 2003-05-28 01:34:33
|
You may need to back out the change, or at a minimum get the old code by going back a version and putting it in your ScriptScanner base class.

I guess I screwed up. I saw your drop that allowed all the lines to be accumulated in a tag, and I thought the two scanners were very close then (apart from the tags-in-quotes thing). My only excuse is that it passed all the unit tests. Well, to be truthful, I changed two of the tests, but it was only extraneous newline stuff at the start and end of text.

The script scanner is breaking your code because of uppercasing tags (not just within comments) and removing newlines after \, right?

Marc Novakowski wrote:

>I just realized that it's more complicated than that (for me, at least). In my application that uses htmlparser, I am extending certain scanners and tags (such as ScriptScanner but mostly CompositeTagScanner) to allow for "custom" tags in an HTML page. When the "HTML + custom tags" are run through my custom parser, the custom tags are converted into an object model which is then turned into dynamic javascript code.
>
>Long story short: some of these custom tags (i.e. the ones that extend ScriptScanner) _absolutely_ need the inner contents of the tag to remain unchanged. Also, since it's not always Javascript that is inside of the tags, adding extra rules to ignore tags in comments or strings won't always work. For example, one tag allows for arbitrary XML innards. Currently, the scanner will UPPERCASE all tags inside unless they're in quotes (which messes up the XML).
>
>The old ScriptScanner did exactly what I needed -- that is, it didn't scan for tags at all. It just looked for the exact (case-insensitive) string match of the end tag. It didn't look for "<" and it didn't defer to scanners. I took a look at the current code and I can't see any easy way to do this.
>
>Marc |
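The "leave everything inside the tag as-is" behaviour that Marc describes for the old ScriptScanner can be sketched roughly as follows. This is a minimal illustration, assuming a single in-memory string rather than the line-by-line NodeReader the real scanner worked with; RawContentSearch and its methods are hypothetical names, not htmlparser API.

```java
/** Hypothetical helper for illustration only; not part of htmlparser. */
final class RawContentSearch {

    /** Case-insensitive indexOf that never changes the text being searched. */
    static int indexOfIgnoreCase(String text, String needle, int from) {
        int last = text.length() - needle.length();
        for (int i = Math.max(from, 0); i <= last; i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns the untouched text between 'from' and the first
     * case-insensitive match of endTag, or null if no end tag is found.
     * No "<" detection, no quote handling, no deferring to other scanners.
     */
    static String rawContents(String text, int from, String endTag) {
        int end = indexOfIgnoreCase(text, endTag, from);
        return end < 0 ? null : text.substring(from, end);
    }

    public static void main(String[] args) {
        String page = "<script>var s = '<b>x</b>'; // <br>\n</ScRiPt>rest";
        System.out.println(rawContents(page, "<script>".length(), "</script>"));
    }
}
```

Because nothing between the start position and the end tag is parsed, tag-like text in strings, comments, or embedded XML comes back exactly as written. The trade-off is the one that bug #741769 was about: a quoted end tag inside the contents ends the scan early.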
From: Marc N. <ma...@ke...> - 2003-05-28 00:30:59
|
I just realized that it's more complicated than that (for me, at least). In my application that uses htmlparser, I am extending certain scanners and tags (such as ScriptScanner but mostly CompositeTagScanner) to allow for "custom" tags in an HTML page. When the "HTML + custom tags" are run through my custom parser, the custom tags are converted into an object model which is then turned into dynamic javascript code.

Long story short: some of these custom tags (i.e. the ones that extend ScriptScanner) _absolutely_ need the inner contents of the tag to remain unchanged. Also, since it's not always Javascript that is inside of the tags, adding extra rules to ignore tags in comments or strings won't always work. For example, one tag allows for arbitrary XML innards. Currently, the scanner will UPPERCASE all tags inside unless they're in quotes (which messes up the XML).

The old ScriptScanner did exactly what I needed -- that is, it didn't scan for tags at all. It just looked for the exact (case-insensitive) string match of the end tag. It didn't look for "<" and it didn't defer to scanners. I took a look at the current code and I can't see any easy way to do this.

Marc

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Tuesday, May 27, 2003 2:39 PM
To: htm...@li...
Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs]
htmlparser/src/org/htmlparser/scanners
CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22

Marc,

The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text
or remarks.
I guess the text scanner goes until it sees a <x... and then stops to
defer to a tag scanner. I hadn't thought about those in comments, or
about the \ end of lines.

Perhaps, rather than write a new scanner, fix the StringScanner (the
remark scanner should be OK), so that it does the correct behaviour when
balance_quotes is true. Then the 'balance_quotes' flag could be called
'strict_script' or something.

Derrick

Marc Novakowski wrote:

>Derrick,
>
>I was relying on some of the old behavior of ScriptScanner, mostly the fact that its contents were not parsed as HTML. I'm still seeing cases where tags inside of <script> are recognised as "HTML" and modified (i.e. turned into uppercase, auto-closed, etc). For example, if there is an HTML tag in a Javascript comment. Also, using "\" to concatenate lines (which is valid in Javascript) is totally messed up now when I try to get the script code using "toHtml()".
>
>However, I think your change was valid and fixes the bug as requested. What I think I'm going to do, though, is make a new scanner class that does what the old ScriptScanner did. That is, do a bare-bones "leave everything inside that tag as-is" parse of the HTML, searching only for the end tag with no knowledge of quotes or anything. I think there are cases where Javascript is written such that any modification at all will break it.
>
>I'll send a note to the list when this class is done (today sometime). I'll call it StrictScriptScanner or something.
>
>Marc
>
>-----Original Message-----
>From: der...@us...
>[mailto:der...@us...]
>Sent: Saturday, May 24, 2003 2:05 PM
>To: htm...@li...
>Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners
>CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
>
>Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
>In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners
>
>Modified Files:
>    CompositeTagScanner.java ScriptScanner.java
>Log Message:
>Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
>Major overhaul of ScriptScanner.
>It now uses the scan() method of CompositeTagScanner (i.e. doesn't override).
>CompositeTagScanner now has a balance_quotes member field that dictates
>whether strings tags are scanned honouring single and double quotes.
>This affected the call chain through NodeReader and StringScanner which
>now have this parameter.
>StringScanner now correctly handles quotes if asked. The ignoreState stuff is removed,
>it didn't work anyway since a single StringScanner is used recursively by the NodeReader,
>and the member field would have been tromped.
>Sorry to all those who have broken code because of this, but it's for the better. Really.
>
>
>
>Index: CompositeTagScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
>retrieving revision 1.52
>retrieving revision 1.53
>diff -C2 -d -r1.52 -r1.53
>*** CompositeTagScanner.java 19 May 2003 02:49:57 -0000 1.52
>--- CompositeTagScanner.java 24 May 2003 21:04:44 -0000 1.53
>***************
>*** 97,100 ****
>--- 97,101 ----
>      private Set tagEnderSet;
>      private Set endTagEnderSet;
>+     private boolean balance_quotes;
>
>      public CompositeTagScanner(String [] nameOfTagToMatch) {
>***************
>*** 125,129 ****
>          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>      }
>!
>      public CompositeTagScanner(
>          String filter,
>--- 126,130 ----
>          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>      }
>!
>      public CompositeTagScanner(
>          String filter,
>***************
>*** 131,138 ****
>          String [] tagEnders,
>          String [] endTagEnders,
>!         boolean allowSelfChildren) {
>          super(filter);
>          this.nameOfTagToMatch = nameOfTagToMatch;
>          this.allowSelfChildren = allowSelfChildren;
>          this.tagEnderSet = new HashSet();
>          for (int i=0;i<tagEnders.length;i++)
>--- 132,172 ----
>          String [] tagEnders,
>          String [] endTagEnders,
>!         boolean allowSelfChildren)
>!     {
>!         this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false);
>!     }
>!
>!     /**
>!      * Constructor specifying all member fields.
>!      * @param filter A string that is used to match which tags are to be allowed
>!      * to pass through. This can be useful when one wishes to dynamically filter
>!      * out all tags except one type which may be programmed later than the parser.
>!      * @param nameOfTagToMatch The tag names recognized by this scanner.
>!      * @param tagEnders The non-endtag tag names which signal that no closing
>!      * end tag was found. For example, encountering <FORM> while
>!      * scanning a <A> link tag would mean that no </A> was found
>!      * and needs to be corrected.
>!      * @param endTagEnders The endtag names which signal that no closing end
>!      * tag was found. For example, encountering </HTML> while
>!      * scanning a <BODY> tag would mean that no </BODY> was found
>!      * and needs to be corrected. These items are not prefixed by a '/'.
>!      * @param allowSelfChildren If <code>true</code> a tag of the same name is
>!      * allowed within this tag. Used to determine when an endtag is missing.
>!      * @param balance_quotes <code>true</code> if scanning string nodes needs to
>!      * honour quotes. For example, ScriptScanner defines this <code>true</code>
>!      * so that text within <SCRIPT></SCRIPT> ignores tag-like text
>!      * within quotes.
>!      */
>!     public CompositeTagScanner(
>!         String filter,
>!         String [] nameOfTagToMatch,
>!         String [] tagEnders,
>!         String [] endTagEnders,
>!         boolean allowSelfChildren,
>!         boolean balance_quotes) {
>          super(filter);
>          this.nameOfTagToMatch = nameOfTagToMatch;
>          this.allowSelfChildren = allowSelfChildren;
>+         this.balance_quotes = balance_quotes;
>          this.tagEnderSet = new HashSet();
>          for (int i=0;i<tagEnders.length;i++)
>***************
>*** 145,149 ****
>      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>          CompositeTagScannerHelper helper =
>!             new CompositeTagScannerHelper(this,tag,url,reader,currLine);
>          return helper.scan();
>      }
>--- 179,183 ----
>      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>          CompositeTagScannerHelper helper =
>!             new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes);
>          return helper.scan();
>      }
>***************
>*** 193,196 ****
>          return false;
>      }
>-
>  }
>--- 227,229 ----
>
>Index: ScriptScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
>retrieving revision 1.21
>retrieving revision 1.22
>diff -C2 -d -r1.21 -r1.22
>*** ScriptScanner.java 19 May 2003 02:49:57 -0000 1.21
>--- ScriptScanner.java 24 May 2003 21:04:44 -0000 1.22
>***************
>*** 28,64 ****
>
>  package org.htmlparser.scanners;
>! /////////////////////////
>! // HTML Parser Imports //
>! /////////////////////////
>! import org.htmlparser.Node;
>! import org.htmlparser.NodeReader;
>! import org.htmlparser.StringNode;
>! import org.htmlparser.tags.EndTag;
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>! import org.htmlparser.util.NodeList;
>! import org.htmlparser.util.ParserException;
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>-
>  public class ScriptScanner extends CompositeTagScanner {
>-     private static final String SCRIPT_END_TAG = "</SCRIPT>";
>      private static final String MATCH_NAME [] = {"SCRIPT"};
>      private static final String ENDERS [] = {"BODY", "HTML"};
>      public ScriptScanner() {
>!         super("",MATCH_NAME,ENDERS);
>      }
>
>      public ScriptScanner(String filter) {
>!         super(filter,MATCH_NAME,ENDERS);
>      }
>
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch) {
>!         super(filter,nameOfTagToMatch,ENDERS);
>      }
>!
>      public String [] getID() {
>          return MATCH_NAME;
>--- 28,59 ----
>
>  package org.htmlparser.scanners;
>!
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>!
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>  public class ScriptScanner extends CompositeTagScanner {
>      private static final String MATCH_NAME [] = {"SCRIPT"};
>      private static final String ENDERS [] = {"BODY", "HTML"};
>      public ScriptScanner() {
>!         this("");
>      }
>
>      public ScriptScanner(String filter) {
>!         this(filter,MATCH_NAME,ENDERS);
>      }
>
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) {
>!         this(filter,nameOfTagToMatch,enders, new String[0], true, true);
>      }
>!
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) {
>!         super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes);
>!     }
>!
>      public String [] getID() {
>          return MATCH_NAME;
>***************
>*** 70,205 ****
>          return new ScriptTag(tagData,compositeTagData);
>      }
>-
>-     public Tag scan(Tag tag, String url, NodeReader reader, String currLine)
>-         throws ParserException {
>-         try {
>-             int startLine = reader.getLastLineNumber();
>-             String line = null;
>-             StringBuffer scriptContents =
>-                 new StringBuffer();
>-             boolean endTagFound = false;
>-             Tag startTag = tag;
>-             Tag endTag = null;
>-             line = currLine;
>-             boolean sameLine = true;
>-             int startingPos = startTag.elementEnd();
>-             do {
>-                 int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos);
>-                 while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) {
>-                     startingPos = endTagLoc+getEndTag().length();
>-                     endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos);
>-                 }
>-
>-                 if (endTagLoc!=-1) {
>-                     endTagFound = true;
>-                     endTag = (EndTag)EndTag.find(line,endTagLoc);
>-                     if (sameLine)
>-                         scriptContents.append(
>-                             getCodeBetweenStartAndEndTags(
>-                                 line,
>-                                 startTag,
>-                                 endTagLoc)
>-                         );
>-                     else {
>-                         scriptContents.append(Node.getLineSeparator());
>-                         scriptContents.append(line.substring(0,endTagLoc));
>-                     }
>-
>-                     reader.setPosInLine(endTag.elementEnd());
>-                 } else {
>-                     if (sameLine)
>-                         scriptContents.append(
>-                             line.substring(
>-                                 startTag.elementEnd()+1
>-                             )
>-                         );
>-                     else {
>-                         scriptContents.append(Node.getLineSeparator());
>-                         scriptContents.append(line);
>-                     }
>-                 }
>-                 if (!endTagFound) {
>-                     line = reader.getNextLine();
>-                     startingPos = 0;
>-                 }
>-                 if (sameLine)
>-                     sameLine = false;
>-             }
>-             while (line!=null && !endTagFound);
>-             if (endTag == null) {
>-                 // If end tag doesn't exist, create one
>-                 String endTagName = tag.getTagName();
>-                 int endTagBegin = reader.getLastReadPosition()+1 ;
>-                 int endTagEnd = endTagBegin + endTagName.length() + 2;
>-                 endTag = new EndTag(
>-                     new TagData(
>-                         endTagBegin,
>-                         endTagEnd,
>-                         endTagName,
>-                         currLine
>-                     )
>-                 );
>-             }
>-             NodeList childrenNodeList = new NodeList();
>-             childrenNodeList.add(
>-                 new StringNode(
>-                     scriptContents,
>-                     startTag.elementEnd(),
>-                     endTag.elementBegin()-1
>-                 )
>-             );
>-             return createTag(
>-                 new TagData(
>-                     startTag.elementBegin(),
>-                     endTag.elementEnd(),
>-                     startLine,
>-                     reader.getLastLineNumber(),
>-                     startTag.getText(),
>-                     currLine,
>-                     url,
>-                     false
>-                 ), new CompositeTagData(
>-                     startTag,endTag,childrenNodeList
>-                 )
>-             );
>-
>-         }
>-         catch (Exception e) {
>-             throw new ParserException("Error in ScriptScanner: ",e);
>-         }
>-     }
>-
>-     public String getCodeBetweenStartAndEndTags(
>-         String line,
>-         Tag startTag,
>-         int endTagLoc) throws ParserException {
>-         try {
>-
>-             return line.substring(
>-                 startTag.elementEnd()+1,
>-                 endTagLoc
>-             );
>-         }
>-         catch (Exception e) {
>-             StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n");
>-             msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n");
>-             msg.append("substring ends at: "+(endTagLoc));
>-             throw new ParserException(msg.toString(),e);
>-         }
>-     }
>-
>-     /**
>-      * Gets the end tag that the scanner uses to stop scanning. Subclasses of
>-      * <code>ScriptScanner</code> you should override this method.
>-      * @return String containing the end tag to search for, i.e. </SCRIPT>
>-      */
>-     public String getEndTag() {
>-         return SCRIPT_END_TAG;
>-     }
>-
>-     private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) {
>-         if (endTagLoc+getEndTag().length() > line.length()-1) return false;
>-         return line.charAt(endTagLoc+getEndTag().length())=='"';
>-     }
>-
>  }
>--- 65,67 ----
>

-------------------------------------------------------
This SF.net email is sponsored by: ObjectStore.
If flattening out C++ or Java code to make your application fit in a
relational database is painful, don't do it! Check out ObjectStore.
Now part of Progress Software. http://www.objectstore.net/sourceforge
_______________________________________________
Htmlparser-developer mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
From: Marc N. <ma...@ke...> - 2003-05-27 22:55:27
|
Sure, I'll see if I can fix it.

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Tuesday, May 27, 2003 2:39 PM
To: htm...@li...
Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22

Marc,

The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text or remarks.
I guess the text scanner goes until it sees a <x... and then stops to defer to a tag scanner. I hadn't thought about those in comments, or about the \ end of lines.

Perhaps, rather than write a new scanner, fix the StringScanner (the remark scanner should be OK), so that it does the correct behaviour when balance_quotes is true. Then the 'balance_quotes' flag could be called 'strict_script' or something.

Derrick

Marc Novakowski wrote:

>Derrick,
>
>I was relying on some of the old behavior of ScriptScanner, mostly the fact that its contents were not parsed as HTML. I'm still seeing cases where tags inside of <script> are recognised as "HTML" and modified (i.e. turned into uppercase, auto-closed, etc). For example, if there is an HTML tag in a Javascript comment. Also, using "\" to concatenate lines (which is valid in Javascript) is totally messed up now when I try to get the script code using "toHtml()".
>
>However, I think your change was valid and fixes the bug as requested. What I think I'm going to do, though, is make a new scanner class that does what the old ScriptScanner did. That is, do a bare-bones "leave everything inside that tag as-is" parse of the HTML, searching only for the end tag with no knowledge of quotes or anything. I think there are cases where Javascript is written such that any modification at all will break it.
>
>I'll send a note to the list when this class is done (today sometime). I'll call it StrictScriptScanner or something.
>
>Marc
>
>-----Original Message-----
>From: der...@us...
>[mailto:der...@us...]
>Sent: Saturday, May 24, 2003 2:05 PM
>To: htm...@li...
>Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
|
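For readers following the thread, here is a minimal sketch of the behaviour under discussion: locating the closing </script> while skipping one that sits inside a quoted string (the document.write case from bug #741769). It is not the htmlparser code. The class ScriptEndFinder and both method names are made up for illustration, and the sketch assumes the whole script is available as a single string rather than being read line by line through NodeReader.

```java
import java.util.Locale;

/**
 * Illustration only: ScriptEndFinder, naiveEnd and quoteAwareEnd are
 * hypothetical names, not part of org.htmlparser.
 */
final class ScriptEndFinder {
    private static final String END_TAG = "</script";

    private ScriptEndFinder() {
    }

    /** Naive search: the first "</script" at or after from, even inside a string literal. */
    static int naiveEnd(String text, int from) {
        return text.toLowerCase(Locale.ROOT).indexOf(END_TAG, from);
    }

    /**
     * Quote-aware search: ignores "</script" while inside a single- or
     * double-quoted literal. Returns -1 if no end tag is found outside quotes.
     */
    static int quoteAwareEnd(String text, int from) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++; // skip the escaped character, including a newline after a trailing backslash
                } else if (c == quote) {
                    quote = 0; // closing quote of the current literal
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote
            } else if (text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                return i; // case-insensitive match outside any literal
            }
        }
        return -1;
    }
}
```

On a page such as `<script>document.write("<script>x</script>");</script>`, naiveEnd stops at the quoted end tag, while quoteAwareEnd returns the position of the real one, and everything before it can be kept as raw text. An apostrophe inside a JavaScript comment would still throw the quote tracking off, which is the kind of case Marc raises about comments.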
From: Derrick O. <Der...@ro...> - 2003-05-27 21:46:44
|
Marc,

The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text or remarks.
I guess the text scanner goes until it sees a <x... and then stops to defer to a tag scanner. I hadn't thought about those in comments, or about the \ end of lines.

Perhaps, rather than write a new scanner, fix the StringScanner (the remark scanner should be OK), so that it does the correct behaviour when balance_quotes is true. Then the 'balance_quotes' flag could be called 'strict_script' or something.

Derrick

Marc Novakowski wrote:

>Derrick,
>
>I was relying on some of the old behavior of ScriptScanner, mostly the fact that its contents were not parsed as HTML. I'm still seeing cases where tags inside of <script> are recognised as "HTML" and modified (i.e. turned into uppercase, auto-closed, etc). For example, if there is an HTML tag in a Javascript comment. Also, using "\" to concatenate lines (which is valid in Javascript) is totally messed up now when I try to get the script code using "toHtml()".
>
>However, I think your change was valid and fixes the bug as requested. What I think I'm going to do, though, is make a new scanner class that does what the old ScriptScanner did. That is, do a bare-bones "leave everything inside that tag as-is" parse of the HTML, searching only for the end tag with no knowledge of quotes or anything. I think there are cases where Javascript is written such that any modification at all will break it.
>
>I'll send a note to the list when this class is done (today sometime). I'll call it StrictScriptScanner or something.
>
>Marc
>
>-----Original Message-----
>From: der...@us...
>[mailto:der...@us...]
>Sent: Saturday, May 24, 2003 2:05 PM
>To: htm...@li...
>Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
|
From: Marc N. <ma...@ke...> - 2003-05-27 18:23:03
|
Derrick,

I was relying on some of the old behavior of ScriptScanner, mostly the fact that its contents were not parsed as HTML. I'm still seeing cases where tags inside of <script> are recognised as "HTML" and modified (i.e. turned into uppercase, auto-closed, etc). For example, if there is an HTML tag in a Javascript comment. Also, using "\" to concatenate lines (which is valid in Javascript) is totally messed up now when I try to get the script code using "toHtml()".

However, I think your change was valid and fixes the bug as requested. What I think I'm going to do, though, is make a new scanner class that does what the old ScriptScanner did. That is, do a bare-bones "leave everything inside that tag as-is" parse of the HTML, searching only for the end tag with no knowledge of quotes or anything. I think there are cases where Javascript is written such that any modification at all will break it.

I'll send a note to the list when this class is done (today sometime). I'll call it StrictScriptScanner or something.

Marc

-----Original Message-----
From: der...@us... [mailto:der...@us...]
Sent: Saturday, May 24, 2003 2:05 PM
To: htm...@li...
Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22

Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners

Modified Files:
    CompositeTagScanner.java ScriptScanner.java
Log Message:
Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
Major overhaul of ScriptScanner.
It now uses the scan() method of CompositeTagScanner (i.e. doesn't override).
CompositeTagScanner now has a balance_quotes member field that dictates
whether strings tags are scanned honouring single and double quotes.
This affected the call chain through NodeReader and StringScanner which
now have this parameter.
StringScanner now correctly handles quotes if asked.
The ignoreState stuff is removed, it didn't work anyway since a single StringScanner is used recursively by the NodeReader, and the member field would have been tromped.
Sorry to all those who have broken code because of this, but it's for the better. Really.

Index: CompositeTagScanner.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
retrieving revision 1.52
retrieving revision 1.53
diff -C2 -d -r1.52 -r1.53
*** CompositeTagScanner.java 19 May 2003 02:49:57 -0000 1.52
--- CompositeTagScanner.java 24 May 2003 21:04:44 -0000 1.53
***************
*** 97,100 ****
--- 97,101 ----
      private Set tagEnderSet;
      private Set endTagEnderSet;
+     private boolean balance_quotes;
  
      public CompositeTagScanner(String [] nameOfTagToMatch) {
***************
*** 125,129 ****
          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
      }
! 
      public CompositeTagScanner(
          String filter,
--- 126,130 ----
          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
      }
! 
      public CompositeTagScanner(
          String filter,
***************
*** 131,138 ****
          String [] tagEnders,
          String [] endTagEnders,
!         boolean allowSelfChildren) {
          super(filter);
          this.nameOfTagToMatch = nameOfTagToMatch;
          this.allowSelfChildren = allowSelfChildren;
          this.tagEnderSet = new HashSet();
          for (int i=0;i<tagEnders.length;i++)
--- 132,172 ----
          String [] tagEnders,
          String [] endTagEnders,
!         boolean allowSelfChildren)
!     {
!         this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false);
!     }
! 
!     /**
!      * Constructor specifying all member fields.
!      * @param filter A string that is used to match which tags are to be allowed
!      * to pass through. This can be useful when one wishes to dynamically filter
!      * out all tags except one type which may be programmed later than the parser.
!      * @param nameOfTagToMatch The tag names recognized by this scanner.
!      * @param tagEnders The non-endtag tag names which signal that no closing
!      * end tag was found. For example, encountering <FORM> while
!      * scanning a <A> link tag would mean that no </A> was found
!      * and needs to be corrected.
!      * @param endTagEnders The endtag names which signal that no closing end
!      * tag was found. For example, encountering </HTML> while
!      * scanning a <BODY> tag would mean that no </BODY> was found
!      * and needs to be corrected. These items are not prefixed by a '/'.
!      * @param allowSelfChildren If <code>true</code> a tag of the same name is
!      * allowed within this tag. Used to determine when an endtag is missing.
!      * @param balance_quotes <code>true</code> if scanning string nodes needs to
!      * honour quotes. For example, ScriptScanner defines this <code>true</code>
!      * so that text within <SCRIPT></SCRIPT> ignores tag-like text
!      * within quotes.
!      */
!     public CompositeTagScanner(
!         String filter,
!         String [] nameOfTagToMatch,
!         String [] tagEnders,
!         String [] endTagEnders,
!         boolean allowSelfChildren,
!         boolean balance_quotes) {
          super(filter);
          this.nameOfTagToMatch = nameOfTagToMatch;
          this.allowSelfChildren = allowSelfChildren;
+         this.balance_quotes = balance_quotes;
          this.tagEnderSet = new HashSet();
          for (int i=0;i<tagEnders.length;i++)
***************
*** 145,149 ****
      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
          CompositeTagScannerHelper helper =
!             new CompositeTagScannerHelper(this,tag,url,reader,currLine);
          return helper.scan();
      }
--- 179,183 ----
      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
          CompositeTagScannerHelper helper =
!             new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes);
          return helper.scan();
      }
***************
*** 193,196 ****
          return false;
      }
- 
  }
--- 227,229 ----

Index: ScriptScanner.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** ScriptScanner.java 19 May 2003 02:49:57 -0000 1.21
--- ScriptScanner.java 24 May 2003 21:04:44 -0000 1.22
***************
*** 28,64 ****
  
  package org.htmlparser.scanners;
! /////////////////////////
! // HTML Parser Imports //
! /////////////////////////
! import org.htmlparser.Node;
! import org.htmlparser.NodeReader;
! import org.htmlparser.StringNode;
! import org.htmlparser.tags.EndTag;
  import org.htmlparser.tags.ScriptTag;
  import org.htmlparser.tags.Tag;
  import org.htmlparser.tags.data.CompositeTagData;
  import org.htmlparser.tags.data.TagData;
! import org.htmlparser.util.NodeList;
! import org.htmlparser.util.ParserException;
  /**
   * The HTMLScriptScanner identifies javascript code
   */
- 
  public class ScriptScanner extends CompositeTagScanner {
-     private static final String SCRIPT_END_TAG = "</SCRIPT>";
      private static final String MATCH_NAME [] = {"SCRIPT"};
      private static final String ENDERS [] = {"BODY", "HTML"};
      public ScriptScanner() {
!         super("",MATCH_NAME,ENDERS);
      }
  
      public ScriptScanner(String filter) {
!         super(filter,MATCH_NAME,ENDERS);
      }
  
!     public ScriptScanner(String filter, String[] nameOfTagToMatch) {
!         super(filter,nameOfTagToMatch,ENDERS);
      }
! 
      public String [] getID() {
          return MATCH_NAME;
--- 28,59 ----
  
  package org.htmlparser.scanners;
! 
  import org.htmlparser.tags.ScriptTag;
  import org.htmlparser.tags.Tag;
  import org.htmlparser.tags.data.CompositeTagData;
  import org.htmlparser.tags.data.TagData;
! 
  /**
   * The HTMLScriptScanner identifies javascript code
   */
  public class ScriptScanner extends CompositeTagScanner {
      private static final String MATCH_NAME [] = {"SCRIPT"};
      private static final String ENDERS [] = {"BODY", "HTML"};
      public ScriptScanner() {
!         this("");
      }
  
      public ScriptScanner(String filter) {
!         this(filter,MATCH_NAME,ENDERS);
      }
  
!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) {
!         this(filter,nameOfTagToMatch,enders, new String[0], true, true);
      }
! 
!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) {
!         super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes);
!     }
! 
      public String [] getID() {
          return MATCH_NAME;
***************
*** 70,205 ****
          return new ScriptTag(tagData,compositeTagData);
      }
- 
-     public Tag scan(Tag tag, String url, NodeReader reader, String currLine)
-         throws ParserException {
-         try {
-             int startLine = reader.getLastLineNumber();
-             String line = null;
-             StringBuffer scriptContents =
-                 new StringBuffer();
-             boolean endTagFound = false;
-             Tag startTag = tag;
-             Tag endTag = null;
-             line = currLine;
-             boolean sameLine = true;
-             int startingPos = startTag.elementEnd();
-             do {
-                 int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos);
-                 while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) {
-                     startingPos = endTagLoc+getEndTag().length();
-                     endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos);
-                 }
- 
-                 if (endTagLoc!=-1) {
-                     endTagFound = true;
-                     endTag = (EndTag)EndTag.find(line,endTagLoc);
-                     if (sameLine)
-                         scriptContents.append(
-                             getCodeBetweenStartAndEndTags(
-                                 line,
-                                 startTag,
-                                 endTagLoc)
-                         );
-                     else {
-                         scriptContents.append(Node.getLineSeparator());
-                         scriptContents.append(line.substring(0,endTagLoc));
-                     }
- 
-                     reader.setPosInLine(endTag.elementEnd());
-                 } else {
-                     if (sameLine)
-                         scriptContents.append(
-                             line.substring(
-                                 startTag.elementEnd()+1
-                             )
-                         );
-                     else {
-                         scriptContents.append(Node.getLineSeparator());
-                         scriptContents.append(line);
-                     }
-                 }
-                 if (!endTagFound) {
-                     line = reader.getNextLine();
-                     startingPos = 0;
-                 }
-                 if (sameLine)
-                     sameLine = false;
-             }
-             while (line!=null && !endTagFound);
-             if (endTag == null) {
-                 // If end tag doesn't exist, create one
-                 String endTagName = tag.getTagName();
-                 int endTagBegin = reader.getLastReadPosition()+1 ;
-                 int endTagEnd = endTagBegin + endTagName.length() + 2;
-                 endTag = new EndTag(
-                     new TagData(
-                         endTagBegin,
-                         endTagEnd,
-                         endTagName,
-                         currLine
-                     )
-                 );
-             }
-             NodeList childrenNodeList = new NodeList();
-             childrenNodeList.add(
-                 new StringNode(
-                     scriptContents,
-                     startTag.elementEnd(),
-                     endTag.elementBegin()-1
-                 )
-             );
-             return createTag(
-                 new TagData(
-                     startTag.elementBegin(),
-                     endTag.elementEnd(),
-                     startLine,
-                     reader.getLastLineNumber(),
-                     startTag.getText(),
-                     currLine,
-                     url,
-                     false
-                 ), new CompositeTagData(
-                     startTag,endTag,childrenNodeList
-                 )
-             );
- 
-         }
-         catch (Exception e) {
-             throw new ParserException("Error in ScriptScanner: ",e);
-         }
-     }
- 
-     public String getCodeBetweenStartAndEndTags(
-         String line,
-         Tag startTag,
-         int endTagLoc) throws ParserException {
-         try {
- 
-             return line.substring(
-                 startTag.elementEnd()+1,
-                 endTagLoc
-             );
-         }
-         catch (Exception e) {
-             StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n");
-             msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n");
-             msg.append("substring ends at: "+(endTagLoc));
-             throw new ParserException(msg.toString(),e);
-         }
-     }
- 
-     /**
-      * Gets the end tag that the scanner uses to stop scanning. Subclasses of
-      * <code>ScriptScanner</code> you should override this method.
-      * @return String containing the end tag to search for, i.e. </SCRIPT>
-      */
-     public String getEndTag() {
-         return SCRIPT_END_TAG;
-     }
- 
-     private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) {
-         if (endTagLoc+getEndTag().length() > line.length()-1) return false;
-         return line.charAt(endTagLoc+getEndTag().length())=='"';
-     }
- 
  }
--- 65,67 ----

-------------------------------------------------------
This SF.net email is sponsored by: ObjectStore.
If flattening out C++ or Java code to make your application fit in a
relational database is painful, don't do it! Check out ObjectStore.
Now part of Progress Software. http://www.objectstore.net/sourceforge
_______________________________________________
Htmlparser-cvs mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-cvs
|
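The two search strategies discussed in this message can be compared with a small standalone sketch. It is not the HTML Parser code: the class EndTagSearch and its methods findRaw and findBalanced are hypothetical names invented for illustration, and the quote handling is only an assumption about what "honouring single and double quotes" might look like in its simplest form.

```java
/** Illustrative only; EndTagSearch is a hypothetical helper, not part of HTML Parser. */
final class EndTagSearch {
    private EndTagSearch() {}

    /** Raw search: first case-insensitive occurrence of endTag, with no knowledge of quotes. */
    static int findRaw(String text, String endTag) {
        for (int i = 0; i + endTag.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, endTag, 0, endTag.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Quote-aware search: occurrences inside '...' or "..." are skipped. */
    static int findBalanced(String text, String endTag) {
        char quote = 0; // 0 means "not inside a quoted string"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++; // skip the character that follows a backslash
                } else if (c == quote) {
                    quote = 0; // closing quote found
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote
            } else if (text.regionMatches(true, i, endTag, 0, endTag.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String body = "document.write(\"</script>\");</script>";
        System.out.println(findRaw(body, "</script>"));      // stops at the quoted tag
        System.out.println(findBalanced(body, "</script>")); // stops at the real end tag
    }
}
```

In this sketch the raw search stops at the quoted tag inside document.write, which is the situation bug #741769 describes, while the quote-aware search reaches the real end tag. The quote-aware version has its own weakness: an apostrophe in a script comment such as "// don't" opens a quote that never closes, so the end tag is missed. That trade-off is one plausible reason to want both behaviours available.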
From: Derrick O. <Der...@ro...> - 2003-05-25 23:46:02
|
Version 1.3 of the most popular HTML parser on sourceforge is now available.

Four weeks of candidate testing have culminated in a very stable, production level product, with many new user requested features.

Features added since 1.2 include:
- constructor(URLConnection) for POST and exotic GET
- improved character set handling
- hierarchically nested tags, i.e. tables
- scanners for each type of tag
- java beans for easy integration of text and link fetching
- 'visitor' patterns
- Wiki page documentation
- improved script scanning
- improved whitespace handling

The developers of the HTML Parser hope you enjoy it.
|
From: <dha...@po...> - 2003-05-23 11:59:59
|
Hi,

I wrote the following test case:

    public void testUnClosed() throws ParserException {
        createParser("<TABLE><TR><TR></TR></TABLE>");
        parseAndAssertNodeCount(1);
        assertEquals("Unclosed","<TABLE><TR></TR><TR></TR></TABLE>",node[0].toHtml());
    }

I was expecting one node, since <TABLE> would be the main node and the <TR> tags would be its children, but I got 5! Obviously, because of that, the assert also failed.

Since nesting is allowed by default, if there is code as follows:

    ...
    <TR>
    <TD>blah blah</TD>
    <TR>
    <TD>blah blah</TD>
    </TR>
    ....

then the second <TR> is incorrectly considered a child of the first <TR>, whereas in reality a closing </TR> was left out and should have been put in. Hence, for <TR>, nesting should be disallowed through the scanner. The same holds for the <TD> tag.

I think this is a bug.

Dhaval
|
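The behaviour requested above, closing an open <TR> when another <TR> starts, can be illustrated with a toy stack-based pass. This is a sketch under stated assumptions, not how the HTML Parser scanners work internally: RowAutoCloser and normalize are hypothetical names, and the regular expression only recognises bare tags without attributes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only; RowAutoCloser is a hypothetical helper, not part of HTML Parser. */
final class RowAutoCloser {
    // Matches bare start and end tags such as <TR> or </TR>; attributes are not handled.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z]+)>");

    private RowAutoCloser() {}

    /**
     * Copies html to the output, inserting a missing end tag whenever the tag named
     * noSelfNesting is opened while the innermost open tag has the same name.
     */
    static String normalize(String html, String noSelfNesting) {
        Deque<String> open = new ArrayDeque<>();
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start()); // text between tags is copied unchanged
            last = m.end();
            boolean isEnd = !m.group(1).isEmpty();
            String name = m.group(2).toUpperCase();
            if (isEnd) {
                if (name.equals(open.peek())) {
                    open.pop();
                }
            } else {
                if (name.equals(noSelfNesting) && name.equals(open.peek())) {
                    out.append("</").append(name).append('>'); // synthesize the missing end tag
                    open.pop();
                }
                open.push(name);
            }
            out.append(m.group());
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("<TABLE><TR><TR></TR></TABLE>", "TR"));
    }
}
```

For the input in the test case, the sketch yields the string the assertion expects. It only looks at the innermost open tag, so an unclosed <TD> inside the row would still defeat it; a real scanner would need the "tag enders" idea to cover that case.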
From: Somik R. <so...@ya...> - 2003-05-21 23:44:19
|
You should not be using setParsed. Instead, all you have to do is use setAttribute on TableTag, like so:

    tableTag.setAttribute("BORDER","1");

Then, make a call to tableTag.toHtml(), and it should show up.

Regards,
Somik

----- Original Message -----
From: "Terry Alexis Lurie" <tez...@ya...>
To: <htm...@li...>
Sent: Wednesday, May 21, 2003 5:10 AM
Subject: Re: [Htmlparser-developer] HTMLTag patch

> Yes, I'd like to be able to programmatically set
> certain attributes. It's for a highlighted step-by-step
> through a web-rip, so the focus table is border=1 or
> whatever [very uncommon these days], the rest is as
> is.
>
> I've been doing this in Perl's HTML::Parse for a
> while, but now shifting to Java because of work.
>
> Terry.
|
From: <tez...@ya...> - 2003-05-21 12:15:28
|
Right. That was definitely the answer I was looking for. Hopefully I'll be able to use my talents for good rather than evil.

I'm just averse to using bleeding edge in production, but now that I'm sort of familiar with the scope of the project, I think it is worth the small risk.

Terry.

--- Derrick Oswald <Der...@ro...> wrote:
> Terry,
>
> You should really switch to the 1.3 codebase,
> version 1.2 is very long in the tooth and a final
> release of 1.3 is imminent.
> These problems you are encountering don't seem to be
> present any more and you would have a more
> sympathetic ear.
>
> Derrick

=====
------------------------------------------------------------
Terry Alexis Lurie          | 'Something witty that doesn't
Freelance Computer Engineer |  look good with variable
United Kingdom              |  width fonts' - Most nerds
|
From: <dha...@po...> - 2003-05-21 12:12:29
|
I was just going to say that, Derrick ;)

I too believe that the problems you mentioned no longer exist in 1.3.

> -----Original Message-----
> From: htm...@li... [mailto:htm...@li...] On Behalf Of Der...@ro...
> Sent: Wednesday, May 21, 2003 5:28 PM
> To: htm...@li...
> Subject: Re: [Htmlparser-developer] HTMLTag patch
>
> Terry,
>
> You should really switch to the 1.3 codebase, version 1.2 is very long
> in the tooth and a final release of 1.3 is imminent.
> These problems you are encountering don't seem to be present any more
> and you would have a more sympathetic ear.
>
> Derrick
|
From: Derrick O. <Der...@ro...> - 2003-05-21 12:07:24
|
Terry,

You should really switch to the 1.3 codebase, version 1.2 is very long in the tooth and a final release of 1.3 is imminent.
These problems you are encountering don't seem to be present any more and you would have a more sympathetic ear.

Derrick

Terry Alexis Lurie wrote:

>Yes, I'd like to be able to programmatically set
>certain attributes. Its for a highlighted step-by-step
>through a web-rip, so the focus table is border=1 or
>whatever [very uncommon these days], the rest is as
>is.
>
>I've been doing this in Perl's HTML::Parse for a
>while, but now shifting to Java because of work.
>
>Terry.
>
> --- Somik Raha <so...@ya...> wrote:
>> Hi Terry
>>    Just curious - why do you need to call
>> setParsed() ?
>>    Are you trying to take all tables and ensure
>> that they have a border "1" ?
>>
>>Regards,
>>Somik
>>----- Original Message -----
>>From: "Terry Alexis Lurie" <tez...@ya...>
>>To: <htm...@li...>
>>Sent: Tuesday, May 20, 2003 11:03 AM
>>Subject: [Htmlparser-developer] HTMLTag patch
|
From: <tez...@ya...> - 2003-05-21 10:32:08
|
A patch for HTMLTagTest.java.

When you call registerScanners, the attributes are not printed properly. In this test case you get

    <A EN="" =="" HREF="http://www.google.com/webhp?hl"></A>

from

    <a href=http://www.google.com/webhp?hl=en>

See how you get the bogus attributes EN="" and =="" ? This doesn't occur if you don't call registerScanners().

Terry

-------

    public void testHTMLOutputOfDifficultLinksWithRegisterScanners() throws HTMLParserException {
        createParser("<a href=http://www.google.com/webhp?hl=en>"); //Straight out of a real world example

        // assertTrue("Node should be a HTMLLinkTag",node[0] instanceof HTMLLinkTag);
        parser.registerScanners(); // Register standard scanners (Very Important)
        String stringTemp = "";
        for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
            HTMLNode newNode = e.nextHTMLNode(); // Get the next HTML Node
            stringTemp = newNode.toHTML();
            System.out.println(stringTemp);
        }
        assertEquals("Parsed text should be","<a href=http://www.google.com/webhp?hl=en>",stringTemp);
    }

=====
------------------------------------------------------------
Terry Alexis Lurie          | 'Something witty that doesn't
Freelance Computer Engineer |  look good with variable
United Kingdom              |  width fonts' - Most nerds
|
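The bogus EN and = attributes reported above look like the result of treating every '=' as a name/value separator. A minimal sketch of the alternative, splitting each whitespace-separated token at its first '=' only, is shown below. It is not the parser's attribute code: UnquotedAttributes and parse are hypothetical names, quoted values are not handled, and the example.com URL is a stand-in for the one in the report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only; UnquotedAttributes is a hypothetical helper, not part of HTML Parser. */
final class UnquotedAttributes {
    private UnquotedAttributes() {}

    /**
     * Parses whitespace-separated name=value tokens with unquoted values.
     * Only the first '=' in a token separates name from value, so a value
     * such as a URL with a query string keeps its own '=' characters.
     */
    static Map<String, String> parse(String attributeText) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (String token : attributeText.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input produces one empty token
            }
            int eq = token.indexOf('=');
            if (eq < 0) {
                attrs.put(token.toUpperCase(), ""); // attribute without a value
            } else {
                attrs.put(token.substring(0, eq).toUpperCase(), token.substring(eq + 1));
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(parse("href=http://example.com/webhp?hl=en"));
    }
}
```

With this splitting rule the sample link yields a single HREF attribute whose value still contains "?hl=en".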
From: <tez...@ya...> - 2003-05-21 09:10:46
|
Yes, I'd like to be able to programmatically set certain attributes. It's for a highlighted step-by-step through a web-rip, so the focus table is border=1 or whatever [very uncommon these days], the rest is as is.

I've been doing this in Perl's HTML::Parse for a while, but now shifting to Java because of work.

Terry.

--- Somik Raha <so...@ya...> wrote:
> Hi Terry
>    Just curious - why do you need to call
> setParsed() ?
>    Are you trying to take all tables and ensure
> that they have a border "1" ?
>
> Regards,
> Somik

=====
------------------------------------------------------------
Terry Alexis Lurie          | 'Something witty that doesn't
Freelance Computer Engineer |  look good with variable
United Kingdom              |  width fonts' - Most nerds
|
From: Somik R. <so...@ya...> - 2003-05-21 03:03:57
|
Hi Terry,

Just curious: why do you need to call setParsed()? Are you trying to take all tables and ensure that they have a border of "1"?

Regards,
Somik

----- Original Message -----
From: "Terry Alexis Lurie" <tez...@ya...>
To: <htm...@li...>
Sent: Tuesday, May 20, 2003 11:03 AM
Subject: [Htmlparser-developer] HTMLTag patch

> Hi, this is further to my bug report via the SF site.
>
> Basically, setParsed() wasn't affecting the actual
> output of the Node thereafter. This made it a real
> pain to highlight HTML, the example here being making
> tables have a border of 1 to show them.
>
> Patch attached. It has some debugging commented out;
> you'll want to get rid of this. I put a patch for the
> testing code on the SourceForge bug report.
>
> Cheers,
>
> Terry.
>
> --------------------
>
> *** HTMLTag.java 2003/05/20 12:33:42 1.1
> --- HTMLTag.java 2003/05/20 14:52:42
> ***************
> *** 273,283 ****
>   }
>   /**
>    * Sets the parsed.
> !  * @param parsed The parsed to set
>    */
>   public void setParsed(Hashtable parsed) {
>     this.parsed = parsed;
>   }
>   /**
>    * Sets the strictTags.
>    * @param strictTags The strictTags to set
> --- 273,306 ----
>   }
>   /**
>    * Sets the parsed.
> !  * Note: There is no guarantee that the attributes will be:
> !  * in the same order or case as originally.
> !  * This isn't expected to be a problem, but then again
> !  * it never is, is it?
> !  * Also: This currently makes no effort to place the attribute
> !  * in quotes if necessary. You have to take care of that
> !  * yourself
> !  * @param parsed The hash of (key,value) attribute pairs to set
>    */
>   public void setParsed(Hashtable parsed) {
>     this.parsed = parsed;
> +
> +   setText((String) parsed.get(this.TAGNAME)); //Set the tag first
> +   for(Enumeration e = parsed.keys(); e.hasMoreElements();) {
> +     String temp = (String) e.nextElement();
> +     if (!temp.equals(this.TAGNAME)) { //Don't add the tagname again
> +       append(" " + temp + '=' + ((String) parsed.get(temp)));
> +
> +       //Debug
> +       //System.out.println("setParsed appending key: " + temp + " to value: " + ((String) parsed.get(temp)));
> +     }
> +   }
> +
> +   //Debug
> +   //System.out.println("setParsed: completed, now text is:" + getText());
> +
>   }
> +
>   /**
>    * Sets the strictTags.
>    * @param strictTags The strictTags to set
|
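The Javadoc in the quoted patch flags two caveats: attribute order is not preserved, and values are not quoted. As a rough illustration only, the sketch below shows one way those two points could be handled when rebuilding the tag text. It is not code from the htmlparser project: the class `AttributeTextSketch`, its `quote` and `buildText` methods, and the placeholder value used for `TAGNAME` are all assumptions made for this example.

```java
import java.util.Hashtable;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone helper; not part of the htmlparser code base.
class AttributeTextSketch {
    // Placeholder for the key under which the tag name is stored (assumption).
    static final String TAGNAME = "$<TAGNAME>$";

    // Wraps a value in double quotes, escaping any embedded double quote.
    static String quote(String value) {
        return '"' + value.replace("\"", "&quot;") + '"';
    }

    // Rebuilds the tag text (without angle brackets) from the attribute table.
    // Expects the table to contain an entry for TAGNAME.
    static String buildText(Hashtable<String, String> parsed) {
        StringBuilder text = new StringBuilder(parsed.get(TAGNAME));
        // Sorting the keys gives a repeatable order; Hashtable guarantees none.
        for (Map.Entry<String, String> entry : new TreeMap<>(parsed).entrySet()) {
            if (!entry.getKey().equals(TAGNAME)) { // do not emit the tag name twice
                text.append(' ')
                    .append(entry.getKey())
                    .append('=')
                    .append(quote(entry.getValue()));
            }
        }
        return text.toString();
    }
}
```

Sorting does not restore the original attribute order; it only makes the output stable between runs, which tends to make tests easier to write. Always quoting is a simplification of "quote if necessary".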
From: <tez...@ya...> - 2003-05-20 16:14:38
|
Hmm, well, that breaks everything under the sun. I have corrected it on my side by moving this addition into a new method, resetParsed(), so it is more of a helper function than a major change. Obviously I've blundered in here half-cocked.

Should I submit further patches against CVS or against the 1.2 code base? I'm a bit loath to use CVS in production, so I'm inclined to base any patches on 1.2. Thoughts?

If you want the diff that implements resetParsed() and the accompanying test, just email me.

Cheers,

Terry.
|
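The resetParsed() diff itself was not posted to the list, so the following is only a guess at the shape of the idea: setParsed() keeps its original behaviour, and a separate opt-in helper regenerates the text. Everything here is hypothetical, including the stand-in class `SketchTag`, the signature of `resetParsed`, and the placeholder value of `TAGNAME`; the real HTMLTag class is not reproduced.

```java
import java.util.Enumeration;
import java.util.Hashtable;

// Hypothetical stand-in for the tag class; the real HTMLTag has many more members.
class SketchTag {
    // Placeholder key for the tag name entry (assumption).
    static final String TAGNAME = "$<TAGNAME>$";

    private Hashtable<String, String> parsed = new Hashtable<>();
    private final StringBuilder text = new StringBuilder();

    SketchTag(String initialText) {
        text.append(initialText);
    }

    String getText() {
        return text.toString();
    }

    // Original behaviour, left untouched: only swaps the attribute table.
    void setParsed(Hashtable<String, String> parsed) {
        this.parsed = parsed;
    }

    // Opt-in helper: swaps the table and then regenerates the tag text from it.
    void resetParsed(Hashtable<String, String> parsed) {
        setParsed(parsed);
        text.setLength(0);                 // discard the old text
        text.append(parsed.get(TAGNAME));  // tag name first
        for (Enumeration<String> e = parsed.keys(); e.hasMoreElements();) {
            String key = e.nextElement();
            if (!key.equals(TAGNAME)) {    // do not add the tag name again
                text.append(' ').append(key).append('=').append(parsed.get(key));
            }
        }
    }
}
```

Keeping the rebuild in its own method means existing callers of setParsed() see no change in behaviour, which is presumably why the helper avoided the breakage described above.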