Thread: RE: [Htmlparser-developer] Writing OPTION tag
From: <dha...@or...> - 2002-08-14 07:06:41
|
Hi guys,

I am still trying to solve my problem with the scanner for my OPTION tag, and I would really appreciate any help from the developers of the parsing engine. I think the solution may lie in knowing certain internals of the parser. Let me explain the problem in detail.

Assume the following two OPTION tags:

    <OPTION value="AltaVista Search">AltaVista
    <OPTION value="Lycos Search"></OPTION>

The OPTION tag does not require an explicit end tag, so the first line is valid. My parsing logic in scan() is as follows:

1. Disable the existing scanners.
2. Read elements from the reader.
3. Check whether the element is an end tag for OPTION or SELECT (since OPTION tags always appear under SELECT). If so, create an OptionTag object with the necessary values.
4. If it is not an end tag, check whether it is a StringNode (the text between the <OPTION> and </OPTION> tags). If so, it is the text of the OPTION tag; store it temporarily for later use in the constructor.
5. If it is neither, it could be an error or the beginning of another tag (possibly another <OPTION> tag, as above), so the loop must terminate and the option object must be constructed.

The problem with my input is that <OPTION value="AltaVista Search"> is read as an OptionTag, AltaVista is read as the StringNode, and then <OPTION value="Lycos Search"> is read. Since it is neither a StringNode nor an EndTag, an OptionTag is created from the first two values. However, because this second tag has already been read, it is not scanned as a new OptionTag, so I lose it in my parsing.

I hope I have explained my problem clearly. If not, I would be glad to clarify any points that are unclear.
A snippet of code from scan() of HTMLOptionTagScanner is given below:

    // Disable the other scanners while the OPTION body is read
    Vector lScannerVector = HTMLParserUtils.adjustScanners(pReader);
    do {
        lNode = pReader.readElement();
        System.out.println(lNode.toHTML());
        if (lNode instanceof HTMLEndTag) {
            lEndTag = (HTMLEndTag)lNode;
            String lEndTagString = lEndTag.getText().toUpperCase();
            if (lEndTagString.equals("OPTION") || lEndTagString.equals("SELECT")) {
                endTagFound = true;
            }
        } else if (lNode instanceof HTMLStringNode) {
            // Text between <OPTION> and the next tag
            lText.append(lNode.toHTML());
        } else if (lNode instanceof HTMLTag) {
            // Any other tag (e.g. the next <OPTION>) ends this option,
            // but the tag has already been consumed at this point
            endTagFound = true;
        }
    } while (!endTagFound);
    HTMLOptionTag lOptionTag = new HTMLOptionTag(0, lNode.elementEnd(), pTag.getText(), lText.toString(), pCurrLine);
    HTMLParserUtils.restoreScanners(pReader, lScannerVector);

Regards,

Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-8290019 Extn. 1457
|
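To make the lost-tag behaviour concrete, here is a minimal, self-contained Java sketch of the loop described in this message. It does not use the htmlparser classes: nodes are modelled as plain strings that are assumed to be already split, and LostOptionSketch, scanOption and parse are hypothetical names chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Toy reproduction of the lost-tag problem; none of these names come from htmlparser. */
class LostOptionSketch {
    static boolean isOpenOption(String node) {
        return node.toUpperCase().startsWith("<OPTION");
    }

    static boolean isEndTag(String node) {
        return node.startsWith("</");
    }

    /** Mirrors the loop in the message: stop on </OPTION>, </SELECT> or any opening tag. */
    static String scanOption(String openTag, Iterator<String> reader) {
        StringBuilder text = new StringBuilder();
        boolean endTagFound = false;
        while (!endTagFound && reader.hasNext()) {
            String node = reader.next();               // the node is consumed here
            if (isEndTag(node)) {
                String name = node.replaceAll("[</>\\s]", "").toUpperCase();
                endTagFound = name.equals("OPTION") || name.equals("SELECT");
            } else if (node.startsWith("<")) {
                endTagFound = true;                    // a new tag ends the option, but it is gone
            } else {
                text.append(node);                     // plain text between the tags
            }
        }
        return openTag + " -> \"" + text + "\"";
    }

    /** Top-level loop: hands every opening OPTION tag it reads to scanOption. */
    static List<String> parse(List<String> nodes) {
        List<String> options = new ArrayList<>();
        Iterator<String> reader = nodes.iterator();
        while (reader.hasNext()) {
            String node = reader.next();
            if (isOpenOption(node)) {
                options.add(scanOption(node, reader));
            }
        }
        return options;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList(
            "<OPTION value=\"AltaVista Search\">", "AltaVista",
            "<OPTION value=\"Lycos Search\">", "</OPTION>");
        // Reports a single option: the Lycos tag was swallowed by the first scan.
        parse(nodes).forEach(System.out::println);
    }
}
```

Because the first scan consumes the second opening tag in order to detect the end of the first option, the top-level loop never sees that tag and only one option is reported for the two-tag input.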
From: <dha...@or...> - 2002-08-14 08:16:58
|
Hi Somik,

That's exactly what happens. Everything inside <OPTION ...> is a tag, and everything outside it is an HTMLStringNode. However, when I have to read another <OPTION ...> tag and the previous OPTION tag had no closing </OPTION>, the later <OPTION ...> tag gets read, and once it has been read it is no longer available to be scanned again as a new OptionTag.

Anyway, I seem to have made my test cases work by storing the previous node value; when </OPTION> is not present, I handle it accordingly. I have just added some more test cases to validate its robustness. For the time being I think it's done.

Thanks for the response nevertheless.

Regards,

Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-8290019 Extn. 1457

-----Original Message-----
From: somik [mailto:so...@ya...]
Sent: Wednesday, August 14, 2002 1:14 PM
To: htmlparser-developer
Cc: somik
Subject: Re: [Htmlparser-developer] Writing OPTION tag

Hi Dhaval,

Sorry, I've been really swamped.

> The problem with my input is that <OPTION value="AltaVista Search">
> would be read as an OptionTag, AltaVista would be read as the StringNode
> and then <OPTION value="Lycos Search"> would be read and since it is
> neither a StringNode nor an EndTag an OptionTag would be created for the
> above 2 values. ..

This idea is incorrect. <OPTION ...> is a tag. Nothing inside the OPTION tag is a string node.

    <OPTION ...>                 (this is an HTMLTag)
    some text here sdjklsdjk     (this is an HTMLStringNode)
    </OPTION>                    (this is an HTMLEndTag)

HTH.

Cheers,
Somik

_______________________________________________
Htmlparser-developer mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
|
From: Somik R. <so...@ya...> - 2002-08-14 08:49:53
|
Hi Dhaval,

> That's exactly what happens. Everything inside <OPTION ..> will be tag
> and outside it will be HTMLStringNode however when I have to read
> another <OPTION ....tag> wherein the previous OPTION tag did not have a
> closing </OPTION> the later <OPTION....> tag gets read and since it is
> once read it is unavailable for scanning again as a new Option tag.
> Anyway I seem to have made my testcases work by storing the previous
> node value and in case </OPTION> is not present I take care of it
> accordingly. I have just added some more test cases to validate its
> robustness. For the time being I think its done.

Good question. I faced the same thing with several other tags. To counter this issue, you will find a variable in the evaluate() method: previousOpenScanner. Suppose you are searching for </OPTION> and encounter an <OPTION> instead; evaluate() then lets you do something about it. At that point, you must fool the open scanner into believing that the previous tag got closed.

This is exactly what is done in HTMLLinkScanner. On seeing that there was a previousOpenScanner, we accept it as true. Then, in scan(), the end tag (which wasn't there) is returned, with a correction applied (in the elementEnd() positioning) so that the next tag still gets parsed.

Let me know if you need more help. (You simply can't do this without test cases.)

Cheers,
Somik
|
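The idea of treating a new opening tag as an implied end tag, and then correcting the read position so that the tag is parsed again, can be sketched as follows. This is a simplified illustration under stated assumptions, not the HTMLLinkScanner code: it does not model evaluate() or previousOpenScanner, nodes are plain strings, and NodeReader, unread, scanOption and parse are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy illustration of the "implied end tag" correction; NOT the htmlparser implementation. */
class ImpliedEndTagSketch {
    /** Minimal reader with a movable cursor (hypothetical stand-in for the real reader). */
    static class NodeReader {
        final List<String> nodes;
        int pos = 0;

        NodeReader(List<String> nodes) { this.nodes = nodes; }

        boolean hasNext() { return pos < nodes.size(); }

        String next() { return nodes.get(pos++); }

        /** Steps the cursor back so the last node is delivered again. */
        void unread() { pos--; }
    }

    static boolean isOpenOption(String node) {
        return node.toUpperCase().startsWith("<OPTION");
    }

    static boolean isOptionEndTag(String node) {
        return node.toUpperCase().startsWith("</OPTION");
    }

    /** Reads the option text; any tag other than </OPTION> closes the option implicitly. */
    static String scanOption(String openTag, NodeReader reader) {
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            String node = reader.next();
            if (isOptionEndTag(node)) {
                break;                 // explicit end tag: consume it and stop
            }
            if (node.startsWith("<")) {
                reader.unread();       // implied end: leave this tag for the next scan
                break;
            }
            text.append(node);
        }
        return openTag + " -> \"" + text + "\"";
    }

    /** Top-level loop: hands every opening OPTION tag it reads to scanOption. */
    static List<String> parse(List<String> nodes) {
        List<String> options = new ArrayList<>();
        NodeReader reader = new NodeReader(nodes);
        while (reader.hasNext()) {
            String node = reader.next();
            if (isOpenOption(node)) {
                options.add(scanOption(node, reader));
            }
        }
        return options;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList(
            "<OPTION value=\"AltaVista Search\">", "AltaVista",
            "<OPTION value=\"Lycos Search\">", "</OPTION>");
        // Both options are reported, because the second tag is pushed back.
        parse(nodes).forEach(System.out::println);
    }
}
```

The only difference from a loop that loses the second tag is the unread() call: the scan still ends the first option when it meets a new tag, but it hands that tag back instead of discarding it.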
From: Somik R. <so...@ya...> - 2002-08-14 07:43:29
|
Hi Dhaval,

Sorry, I've been really swamped.

> The problem with my input is that <OPTION value="AltaVista Search">
> would be read as an OptionTag, AltaVista would be read as the StringNode
> and then <OPTION value="Lycos Search"> would be read and since it is
> neither a StringNode nor an EndTag an OptionTag would be created for the
> above 2 values. ..

This idea is incorrect. <OPTION ...> is a tag. Nothing inside the OPTION tag is a string node.

    <OPTION ...>                 (this is an HTMLTag)
    some text here sdjklsdjk     (this is an HTMLStringNode)
    </OPTION>                    (this is an HTMLEndTag)

HTH.

Cheers,
Somik
|
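The three node kinds named above can be illustrated with a small, self-contained Java sketch. It is a toy splitter, not the htmlparser lexer: NodeKindSketch, Kind, Node and split are hypothetical names, and the code assumes no '>' characters appear inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy node splitter; NOT the htmlparser API. All names here are hypothetical. */
class NodeKindSketch {
    enum Kind { TAG, END_TAG, STRING }

    /** A node is its kind plus the raw text it covers. */
    static class Node {
        final Kind kind;
        final String text;

        Node(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    /** Splits markup into tags, end tags and the text runs between them. */
    static List<Node> split(String html) {
        List<Node> nodes = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) {
                    close = html.length() - 1;         // unterminated tag: take the rest
                }
                String raw = html.substring(i, close + 1);
                nodes.add(new Node(raw.startsWith("</") ? Kind.END_TAG : Kind.TAG, raw));
                i = close + 1;
            } else {
                int next = html.indexOf('<', i);
                if (next < 0) {
                    next = html.length();
                }
                String text = html.substring(i, next).trim();
                if (!text.isEmpty()) {
                    nodes.add(new Node(Kind.STRING, text)); // text outside any tag
                }
                i = next;
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        String html = "<OPTION value=\"AltaVista Search\">AltaVista "
                    + "<OPTION value=\"Lycos Search\"></OPTION>";
        for (Node n : split(html)) {
            System.out.println(n.kind + " " + n.text);
        }
    }
}
```

For the two-option input from the original question, this yields a tag, a string node, a second tag and an end tag, in that order; the attribute text inside each <OPTION ...> stays part of the tag rather than becoming a string node.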