htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2002-05-12 09:07:49
|
Hi Raghav,

I went through yahoo.txt, and just like your previous file, this one too had very dirty HTML. The reason you got the OutOfMemoryError was that this kind of HTML sent the parser into an infinite loop (in HTMLLinkScanner).

The tag which did this was:

<a href=s/8741><img src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16 width=16 border=0></img></td><td nowrap>
<a href=s/7509><b>Yahoo! Movies</b></a>

As you can see, the first link tag does not have an end tag. I verified with the actual Yahoo page, and there this link is well formed, with the correct end tag. After looking closely at your supplied file, I also noticed the </img> tag, which is highly unusual in normal HTML.

So I am guessing that this file was generated by a program and not by a human. You would definitely want to check the program that's doing it; it's surely buggy.

However, my yardstick for the robustness of this parser is Internet Explorer. If the stuff works in IE, then it's got to work here. And when I tried this particularly bad piece of HTML, I found that IE does not crash. Hence, I had to go about empowering the parser to parse these erroneous tags <sigh> Took hours!! </sigh>

The good news is, it's done. We can parse these tags, and the correct end tag is inserted just before td. Of course, I have done only a minimal adjustment for your purpose. As time goes on, robustness ought to increase further. All test cases are passing. The framework for handling dirty HTML is also slightly modified.

An integration release has been made (2002-05-12), and is under the integration builds package. You can download it from http://htmlparser.sourceforge.net.

The parser should not crash on your HTML now.

Regards,
Somik
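
The repair described in this message (supplying the missing link end tag just before the table cell tag) can be sketched roughly as follows. This is only an illustration of the idea and not the actual HTMLLinkScanner change: the class DanglingLinkRepair and the method closeDanglingLinks are hypothetical names, and the sketch works on a plain string with a regular expression instead of on the parser's own node stream.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of htmlparser.
final class DanglingLinkRepair {

    // Matches an opening <a ...>, a closing </a>, or a table cell boundary (<td ...> or </td>).
    private static final Pattern TOKEN = Pattern.compile(
            "<a\\b[^>]*>|</a\\s*>|</?td\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the input with "</a>" inserted wherever a link is still open
    // when a new link or a table cell boundary starts, or when the input ends.
    static String closeDanglingLinks(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(html);
        boolean linkOpen = false;
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());      // copy text between tokens unchanged
            String tag = m.group();
            String lower = tag.toLowerCase(Locale.ROOT);
            if (lower.startsWith("</a")) {
                linkOpen = false;                   // link closed properly
            } else {
                if (linkOpen) {
                    out.append("</a>");             // supply the missing end tag
                }
                linkOpen = lower.startsWith("<a");  // true for <a ...>, false for <td>/</td>
            }
            out.append(tag);
            last = m.end();
        }
        out.append(html, last, html.length());
        if (linkOpen) {
            out.append("</a>");                     // link still open at end of input
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "<a href=s/8741><img src=\"x.gif\"></img></td><td nowrap>"
                + "<a href=s/7509><b>Yahoo! Movies</b></a>";
        System.out.println(closeDanglingLinks(dirty));
    }
}
```

A real scanner would make this decision while building tag nodes; the string-level version only shows where the end tag ends up.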
----- Original Message -----
From: Raghavender Srimantula
To: htm...@li...
Sent: Saturday, May 11, 2002 4:32 AM
Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document

Hi Somik,
I mentioned the out-of-memory error earlier. Last time, on every iteration of the for loop I was adding the whole page to my string buffer, so it was giving me the out-of-memory error. I have removed that now. It was working fine till yesterday. Now I get that error again. This time it has nothing to do with the string buffer, and it looks like a real problem. I can send you the main class and the yahoo.txt I have. Try running it.
Thanks,
Raghav
>From: "Somik Raha" <so...@ya...>
>Reply-To: htm...@li...
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
>Date: Fri, 10 May 2002 00:43:19 +0900
>
>Hi Raghav,
>On analyzing yahoo.txt, I found that you have incorrect HTML. There is a script tag that has not been closed, so naturally the script scanner goes bonkers. Rename the extension to .html and open this file in IE, and you will find that IE also can't handle this.
>I verified from www.yahoo.com, and found that they do have the correct </script> tag. So I guess your yahoo.txt file is faulty.
>
>Regards,
>Somik
> ----- Original Message -----
> From: Raghavender Srimantula
> To: htm...@li...
> Sent: Thursday, May 09, 2002 4:53 AM
> Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
>
> Hi Somik,
> I was using version 1.1 of htmlparser. I saved the www.yahoo.com content in a flat file, yahoo.txt, and ran the parser against it. It throws a NullPointerException in HTMLScriptScanner, which seems to be a new addition in 1.1. I will send the main program and yahoo.txt. I cannot send the stack trace, because I made some changes and the line numbers don't match, but if you run this program you will see the NullPointerException.
> Thanks,
> Raghav
>
> >From: "Somik Raha" <so...@ya...>
> >Reply-To: htm...@li...
> >To: <htm...@li...>
> >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> >Date: Mon, 6 May 2002 13:59:11 +0900
> >
> >Hi Raghav,
> >I sent another mail some time back to you:
> >
> >"HTMLLinkTag.linkData() - this gives you an enumeration - and in the enumeration will be your HTMLImageTag."
> >
> >HTMLNode node;
> >HTMLImageTag imageTag;
> >// Walk the nodes nested inside the link tag and pick out the images
> >for (Enumeration e = linkTag.linkData(); e.hasMoreElements();) {
> >    node = (HTMLNode) e.nextElement();
> >    if (node instanceof HTMLImageTag) {
> >        imageTag = (HTMLImageTag) node;
> >        // your code here
> >    }
> >}
> >
> >Regards,
> >Somik
> >----- Original Message -----
> >From: "Raghavender Srimantula" <kin...@ho...>
> >To: <htm...@li...>
> >Sent: Monday, May 06, 2002 10:43 AM
> >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> >
> > > Hi Somik,
> > > This question is regarding "not all images are being retrieved". I mean the images under an <a> tag. I did try to open the attachment you sent me, but I could not find anything. From the previous mails I gather that it is not a bug. But if I do want to retrieve all the images, how do I do it?
> > > Thanks,
> > > Raghav
> > >
> > > >From: "Somik Raha" <so...@ya...>
> > > >Reply-To: htm...@li...
> > > >To: <htm...@li...>
> > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > >Date: Tue, 30 Apr 2002 11:37:26 +0900
> > > >
> > > >Hi Raghav,
> > > >Ah - this was a question by Annette Doyle (titled "Not all image tags are returned"). I am attaching my reply.
> > > >
> > > >Regards
> > > >Somik
> > > >
> > > >----- Original Message -----
> > > >From: "Raghavender Srimantula" <kin...@ho...>
> > > >To: <htm...@li...>
> > > >Sent: Tuesday, April 30, 2002 11:16 AM
> > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > >
> > > > > Hi Somik,
> > > > > I found one more interesting thing here. When I try to get all the images, the image scanner gives me images such as
> > > > > <img src="http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif" width=296 height=27 border=0 usemap=#tm>
> > > > > so if I call imagetag.getImageLocation(), I get
> > > > > http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif
> > > > >
> > > > > But if the HTML content is like this
> > > > > <a href=s/6006><img src=http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif border=0 width=70 height=22></a>
> > > > > which starts with <a and ends with </a>, then the image scanner will not give me http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif when I call imagetag.getImageLocation(). It is not even classified as an ImageTag; it is classified as a LinkTag. How do I get this image?
> > > > >
> > > > > The above content is from www.yahoo.com. In the Netscape browser, if you go to View --> Page Info, you will see a bunch of images, but when you run htmlparser you get only one image.
> > > > >
> > > > > Thanks,
> > > > > Raghav
> > > > >
> > > > > >From: "Somik Raha" <so...@ya...>
> > > > > >Reply-To: htm...@li...
> > > > > >To: <htm...@li...>
> > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > >Date: Tue, 30 Apr 2002 09:15:38 +0900
> > > > > >
> > > > > >Can you describe your application? Was it parsing a single page when the problem occurred?
> > > > > >
> > > > > >Regards,
> > > > > >Somik
> > > > > >----- Original Message -----
> > > > > >From: "Raghavender Srimantula" <kin...@ho...>
> > > > > >To: <htm...@li...>
> > > > > >Cc: <htm...@li...>
> > > > > >Sent: Tuesday, April 30, 2002 8:36 AM
> > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > >
> > > > > > > Hi Somik,
> > > > > > > I encountered a strange problem today. While I was running htmlparser, I got a java.lang.OutOfMemoryError. It seems that a lot of objects are being allocated. Where exactly is this happening? Could you give me an idea where, or in which file, the potential problem could be?
> > > > > > > Raghav
> > > > > > >
> > > > > > > >From: "Somik Raha" <so...@ya...>
> > > > > > > >Reply-To: htm...@li...
> > > > > > > >To: <htm...@li...>
> > > > > > > >CC: <htm...@li...>
> > > > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > > > >Date: Sat, 27 Apr 2002 18:22:34 +0900
> > > > > > > >
> > > > > > > >Hi Annette,
> > > > > > > >Please find attached a program to get you started. This program will do what you want; you will need to modify the construct that checks for the image tag, and replace it with the location of your choice.
> > > > > > > >Also, I found one bug thanks to this requirement: image tag params were not being correctly put in. Though it needs a deeper look, I have done a quick fix for now, and all test cases are passing (with one test case in HTMLImageScannerTest trapping this bug).
> > > > > > > >Please check out the latest html parser source code from CVS.
> > > > > > > >
> > > > > > > >Regards,
> > > > > > > >Somik
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: Doyle, Annette
> > > > > > > > To: htm...@li...
> > > > > > > > Sent: Friday, April 26, 2002 10:08 PM
> > > > > > > > Subject: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > > > >
> > > > > > > > Could you please give me some hints as to how to change only the image tag locations and then (or at the same time) write out the HTML document to file (with the new image tag locations)?
> > > > > > > >
> > > > > > > > Thanks-
> > > > > > > > Annette Doyle
> > > > > > > >
> > > > > > > ><< ImageTagRetriever.java >>
> > > > > > > _______________________________________________
> > > > > > > Htmlparser-user mailing list
> > > > > > > Htm...@li...
> > > > > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
> > > ><< [Htmlparser-developer]Re_[Htmlparser-user]Notallimagetagsarereturned[NotaBug].eml >>
|
|
From: Raghavender S. <kin...@ho...> - 2002-05-10 19:32:47
|
Hi Somik,

I mentioned the out-of-memory error earlier. Last time, on every iteration of the for loop I was adding the whole page to my string buffer, so it was giving me the out-of-memory error. I have removed that now. It was working fine till yesterday. Now I get that error again. This time it has nothing to do with the string buffer, and it looks like a real problem. I can send you the main class and the yahoo.txt I have. Try running it.

Thanks,
Raghav
|
|
From: Somik R. <so...@ya...> - 2002-05-09 15:43:29
|
Hi Raghav,

On analyzing yahoo.txt, I found that you have incorrect HTML. There is a script tag that has not been closed, so naturally the script scanner goes bonkers. Rename the extension to .html and open this file in IE, and you will find that IE also can't handle this.

I verified from www.yahoo.com, and found that they do have the correct </script> tag. So I guess your yahoo.txt file is faulty.

Regards,
Somik
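
A saved page can be checked for the problem described in this message (a script tag that is never closed) before it is handed to the parser. The following is a minimal sketch under stated assumptions: ScriptTagCheck and hasUnclosedScript are hypothetical names, not htmlparser API, and the check is a simple regular-expression scan that does not try to handle the text "<script" appearing inside a script body or a comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-check for illustration only; it is not part of htmlparser.
final class ScriptTagCheck {

    // Group 1 is "/" for a closing </script> tag and empty for an opening <script ...> tag.
    private static final Pattern SCRIPT = Pattern.compile(
            "<(/?)script\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns true if the last script tag seen in the input is an opening tag,
    // i.e. the document ends while still inside a script element.
    static boolean hasUnclosedScript(String html) {
        boolean inScript = false;
        Matcher m = SCRIPT.matcher(html);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            inScript = !closing;
        }
        return inScript;
    }

    public static void main(String[] args) {
        System.out.println(hasUnclosedScript("<script>var a = 1;</script><p>ok</p>"));
        System.out.println(hasUnclosedScript("<script>var a = 1;<p>never closed</p>"));
    }
}
```

Running a check like this on yahoo.txt would confirm whether the file was damaged when it was saved, independently of the parser.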
----- Original Message -----
From: Raghavender Srimantula
To: htm...@li...
Sent: Thursday, May 09, 2002 4:53 AM
Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document

Hi Somik,
I was using version 1.1 of htmlparser. I saved the www.yahoo.com content in a flat file, yahoo.txt, and ran the parser against it. It throws a NullPointerException in HTMLScriptScanner, which seems to be a new addition in 1.1. I will send the main program and yahoo.txt. I cannot send the stack trace, because I made some changes and the line numbers don't match, but if you run this program you will see the NullPointerException.
Thanks,
Raghav
>From: "Somik Raha" <so...@ya...>
>Reply-To: htm...@li...
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
>Date: Mon, 6 May 2002 13:59:11 +0900
>
>Hi Raghav,
>I sent another mail some time back to you:
>
>"HTMLLinkTag.linkData() - this gives you an enumeration - and in the enumeration will be your HTMLImageTag."
>
>HTMLNode node;
>HTMLImageTag imageTag;
>// Walk the nodes nested inside the link tag and pick out the images
>for (Enumeration e = linkTag.linkData(); e.hasMoreElements();) {
>    node = (HTMLNode) e.nextElement();
>    if (node instanceof HTMLImageTag) {
>        imageTag = (HTMLImageTag) node;
>        // your code here
>    }
>}
>
>Regards,
>Somik
>----- Original Message -----
>From: "Raghavender Srimantula" <kin...@ho...>
>To: <htm...@li...>
>Sent: Monday, May 06, 2002 10:43 AM
>Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
>
> > Hi Somik,
> > This question is regarding "not all images are being retrieved". I mean the images under an <a> tag. I did try to open the attachment you sent me, but I could not find anything. From the previous mails I gather that it is not a bug. But if I do want to retrieve all the images, how do I do it?
> > Thanks,
> > Raghav
> >
> > >From: "Somik Raha" <so...@ya...>
> > >Reply-To: htm...@li...
> > >To: <htm...@li...>
> > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > >Date: Tue, 30 Apr 2002 11:37:26 +0900
> > >
> > >Hi Raghav,
> > >Ah - this was a question by Annette Doyle (titled "Not all image tags are returned"). I am attaching my reply.
> > >
> > >Regards
> > >Somik
> > >
> > >----- Original Message -----
> > >From: "Raghavender Srimantula" <kin...@ho...>
> > >To: <htm...@li...>
> > >Sent: Tuesday, April 30, 2002 11:16 AM
> > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > >
> > > > Hi Somik,
> > > > I found one more interesting thing here. When I try to get all the images, the image scanner gives me images such as
> > > > <img src="http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif" width=296 height=27 border=0 usemap=#tm>
> > > > so if I call imagetag.getImageLocation(), I get
> > > > http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif
> > > >
> > > > But if the HTML content is like this
> > > > <a href=s/6006><img src=http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif border=0 width=70 height=22></a>
> > > > which starts with <a and ends with </a>, then the image scanner will not give me http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif when I call imagetag.getImageLocation(). It is not even classified as an ImageTag; it is classified as a LinkTag. How do I get this image?
> > > >
> > > > The above content is from www.yahoo.com. In the Netscape browser, if you go to View --> Page Info, you will see a bunch of images, but when you run htmlparser you get only one image.
> > > >
> > > > Thanks,
> > > > Raghav
> > > >
> > > > >From: "Somik Raha" <so...@ya...>
> > > > >Reply-To: htm...@li...
> > > > >To: <htm...@li...>
> > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
> > > > >and write outdocument
> > > > >Date: Tue, 30 Apr 2002 09:15:38 +0900
> > > > >
> > > > >Can you describe your application ? Was it parsing a single page when
> > > > >the problem occurred ?
> > > > >
> > > > >Regards,
> > > > >Somik
> > > > >----- Original Message -----
> > > > >From: "Raghavender Srimantula" <kin...@ho...>
> > > > >To: <htm...@li...>
> > > > >Cc: <htm...@li...>
> > > > >Sent: Tuesday, April 30, 2002 8:36 AM
> > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
> > > > >and write outdocument
> > > > >
> > > > >
> > > > > > Hi Somik,
> > > > > > I encountered a strange problem today. while I was running
> > > > > > htmlparser...I got a java.lang.OutOfMemoryError. seems that lot of
> > > > > > objects are being allocated. where exactly is this happening. I mean
> > > > > > could you give me an idea where or in which file the potential
> > > > > > problem could be.
> > > > > > Raghav
> > > > > >
> > > > > >
> > > > > > >From: "Somik Raha" <so...@ya...>
> > > > > > >Reply-To: htm...@li...
> > > > > > >To: <htm...@li...>
> > > > > > >CC: <htm...@li...>
> > > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
> > > > > > >and write out document
> > > > > > >Date: Sat, 27 Apr 2002 18:22:34 +0900
> > > > > > >
> > > > > > >Hi Annette,
> > > > > > > Pls find attached a program to get you started. This program
> > > > > > >will do what you want - you will need to modify the construct that
> > > > > > >checks for the image tag - and replace it with the location of your
> > > > > > >choice.
> > > > > > > Also - I found one bug thanks to this requirement - image tags
> > > > > > >params were not being correctly put in. Though it needs a deeper
> > > > > > >look, I have done a quick fix for now, and all test cases are passing
> > > > > > >(with one test case in HTMLImageScannerTest trapping this bug).
> > > > > > > Please check out the latest html parser source code from CVS.
> > > > > > >
> > > > > > >Regards,
> > > > > > >Somik
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: Doyle, Annette
> > > > > > > To: htm...@li...
> > > > > > > Sent: Friday, April 26, 2002 10:08 PM
> > > > > > > Subject: [Htmlparser-user] Hints on how to change image tag locations
> > > > > > >and write out document
> > > > > > >
> > > > > > >
> > > > > > > Could you please give me some hints as how to change only image
> > > > > > >tag locations and then, (or at the same time) write out the html
> > > > > > >document to file (with new image tag locations)?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Thanks-
> > > > > > >
> > > > > > > Annette Doyle
> > > > > > >
> > > > > > ><< ImageTagRetriever.java >>
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >=20
>_________________________________________________________________
> > > > > > Join the world's largest e-mail service with MSN Hotmail.
> > > > > > http://www.hotmail.com
> > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > Htmlparser-user mailing list
> > > > > > Htm...@li...
> > > > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
> > > > >
> > > > >
> > ><< [Htmlparser-developer]Re_[Htmlparser-user]Notallimagetagsarereturned[NotaBug].eml >>
> >
> >
> >
> >
|
|
From: Raghavender S. <kin...@ho...> - 2002-05-08 19:54:09
|
Hi Somik,
I was using the 1.1 version of htmlparser. I save the www.yahoo.com content
in a flat file yahoo.txt. and I run the parser against this. throws a
nullpointerexception in HTMLScriptScanner. this seems to be a new addition
for 1.1. I will send the stacktrace, the main program and the yahoo.txt.
actually I cannot send the stacktrace. I made some changes and the line
numbers dont match. but if you run this program you would see the
nullpointerexception.
Thanks,
Raghav

>From: "Somik Raha" <so...@ya...>
>Reply-To: htm...@li...
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
>and writeoutdocument
>Date: Mon, 6 May 2002 13:59:11 +0900
>
>Hi Raghav,
> I sent another mail sometime back to you -
>
>"HTMLLinkTag.linkData() - this gives you an enumeration - and in the
>enumeration will be your HTMLImageTag."
>HTMLNode node;
>HTMLImageTag imageTag;
>for (Enumeration e = linkTag.linkData();e.hasMoreElements();) {
> node = (HTMLNode)e.nextElement();
> if (node instanceof HTMLImageTag) {
> imageTag = (HTMLImageTag)node;
> // your code here
> }
>}
>
>Regards,
>Somik
>----- Original Message -----
>From: "Raghavender Srimantula" <kin...@ho...>
>To: <htm...@li...>
>Sent: Monday, May 06, 2002 10:43 AM
>Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
>and writeoutdocument
>
>
> > Hi Somik,
> > this question is regarding "not all images are being retrieved". I mean
> > the images under <a tag. I did try to open the attachment you sent me. I
> > could not find anything. but seeing the previous mails I could read that
> > it is not a bug. but still if I do want to retrieve all the images how do
> > I do it.
> > Thanks,
> > Raghav
> >
> >
> > >From: "Somik Raha" <so...@ya...>
> > >Reply-To: htm...@li...
> > >To: <htm...@li...> > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag >locations > > >and write outdocument > > >Date: Tue, 30 Apr 2002 11:37:26 +0900 > > > > > >Hi Raghav, > > > Ah - this was a question by Annette Doyle (titled "Not all image >tags > > >are returned"). I am attaching my reply. > > > > > >Regards > > >Somik > > > > > >----- Original Message ----- > > >From: "Raghavender Srimantula" <kin...@ho...> > > >To: <htm...@li...> > > >Sent: Tuesday, April 30, 2002 11:16 AM > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag >locations > > >and write outdocument > > > > > > > > > > hi Somik, > > > > I found one more interesting thing here. when I am trying to get all >the > > > > images the image scanner would give me images > > > > <img >src="http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif" > > > > width=296 height=27 border=0 usemap=#tm> > > > > so if I do a imagetag.getImageLocation(), I would get > > > > http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif > > > > > > > > but is the html content is like this > > > > <a href=s/6006><img > > >src=http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif > > > > border=0 width=70 height=22></a> > > > > which starts with <a and ends with </a>, then the image scanner will >not > > > > give me http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif when I do >a > > > > imagetag.getImageLocation(). this is not even classified as an >ImageTag. > > > > this is classified as LinkTag. how to get this image. > > > > > > > > the above content is from www.yahoo.com. on the netscape browser if >you > > >goto > > > > view-->pageinfo, you will see a bunch of images. > > > > but when you run the htmlparser you can get only one image. 
> > > > > > > > Thanks, > > > > Raghav > > > > > > > > > > > > >From: "Somik Raha" <so...@ya...> > > > > >Reply-To: htm...@li... > > > > >To: <htm...@li...> > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag > > >locations > > > > >and write outdocument > > > > >Date: Tue, 30 Apr 2002 09:15:38 +0900 > > > > > > > > > >Can you describe your application ? Was it parsing a single page >when > > >the > > > > >problem occurred ? > > > > > > > > > >Regards, > > > > >Somik > > > > >----- Original Message ----- > > > > >From: "Raghavender Srimantula" <kin...@ho...> > > > > >To: <htm...@li...> > > > > >Cc: <htm...@li...> > > > > >Sent: Tuesday, April 30, 2002 8:36 AM > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag > > >locations > > > > >and write outdocument > > > > > > > > > > > > > > > > Hi Somik, > > > > > > I encountered a strange problem today. while I was running > > > > >htmlparser...I > > > > > > got a java.lang.OutOfMemoryError. seems that lot of objects are > > >being > > > > > > allocated. where exactly is this happening. I mean could you >give >me > > >an > > > > >idea > > > > > > where or in which file the potential problem could be. > > > > > > Raghav > > > > > > > > > > > > > > > > > > >From: "Somik Raha" <so...@ya...> > > > > > > >Reply-To: htm...@li... > > > > > > >To: <htm...@li...> > > > > > > >CC: <htm...@li...> > > > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag > > > > >locations > > > > > > >and write out document > > > > > > >Date: Sat, 27 Apr 2002 18:22:34 +0900 > > > > > > > > > > > > > >Hi Annette, > > > > > > > Pls find attached a program to get you started. This >program > > >will > > > > >do > > > > > > >what you want - you will need to modify the construct that >checks > > >for > > > > >the > > > > > > >image tag - and replace it with the location of your choice. 
> > > > > > > Also - I found one bug thanks to this requirement - image >tags > > > > >params > > > > > > >were not being correctly put in. Though it needs a deeper look, >I > > >have > > > > >done > > > > > > >a quick fix for now, and all test cases are passing (with one >test > > >case > > > > >in > > > > > > >HTMLImageScannerTest trapping this bug). > > > > > > > Please check out the latest html parser source code from >CVS. > > > > > > > > > > > > > >Regards, > > > > > > >Somik > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: Doyle, Annette > > > > > > > To: htm...@li... > > > > > > > Sent: Friday, April 26, 2002 10:08 PM > > > > > > > Subject: [Htmlparser-user] Hints on how to change image tag > > > > >locations > > > > > > >and write out document > > > > > > > > > > > > > > > > > > > > > Could you please give me some hints as how to change only >image > > >tag > > > > > > >locations and then, (or at the same time) write out the html > > >document > > > > >to > > > > > > >file (with new image tag locations)? > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks- > > > > > > > > > > > > > > Annette Doyle > > > > > > > > > > > > > ><< ImageTagRetriever.java >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >_________________________________________________________________ > > > > > > Join the world's largest e-mail service with MSN Hotmail. > > > > > > http://www.hotmail.com > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Htmlparser-user mailing list > > > > > > Htm...@li... > > > > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > > > > > > > > > > > > >_______________________________________________ > > > > >Htmlparser-user mailing list > > > > >Htm...@li... 
> > > > >https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > > > > > > > > > > > > > > > > > > _________________________________________________________________ > > > > Send and receive Hotmail on your mobile device: >http://mobile.msn.com > > > > > > > > > > > > _______________________________________________ > > > > Htmlparser-user mailing list > > > > Htm...@li... > > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > ><< > > > >[Htmlparser-developer]Re_[Htmlparser-user]Notallimagetagsarereturned[NotaBu >g].eml > > > >> > > > > > > > > > > _________________________________________________________________ > > MSN Photos is the easiest way to share and print your photos: > > http://photos.msn.com/support/worldwide.aspx > > > > > > _______________________________________________________________ > > > > Have big pipes? SourceForge.net is looking for download mirrors. We >supply > > the hardware. You get the recognition. Email Us: >ban...@so... > > _______________________________________________ > > Htmlparser-user mailing list > > Htm...@li... > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > >_______________________________________________ >Htmlparser-user mailing list >Htm...@li... >https://lists.sourceforge.net/lists/listinfo/htmlparser-user _________________________________________________________________ Get your FREE download of MSN Explorer at http://explorer.msn.com/intl.asp. |
|
From: Somik R. <so...@ya...> - 2002-05-08 10:16:49
|
Hi Craig,
I actually replied to you on htmlparser-developer, your earlier mails
went there. Are you on that list ?
Am attaching the relevant mails to this mail - hope it goes thru.
Regards
Somik
----- Original Message -----
From: "Craig Raw" <cr...@qu...>
To: <htm...@li...>
Cc: <so...@ya...>
Sent: Wednesday, May 08, 2002 6:49 PM
Subject: [Htmlparser-user] Swing integration
> Posted this earlier, seems to have got lost....
> ----
>
>
> Hi Somik,
>
> I'm looking into the HTMLParser-Swing integration again, and I have two
> questions:
>
> 1. The HTMLEditorKit.ParserCallback takes a position with most of its
> callback functions. Can this position be extracted from the HTMLTag's
> elementBegin()?
>
> 2. There is a need to differentiate between a callback to
> handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) and
> handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) when
> iterating through the HTMLTag elements Enumeration. How?
>
> You mentioned you have started an implementation - if you have a
> framework going, I'd be happy to continue with the donkey work. I really
> think this could make Swing's HTML rendering a lot more stable.
>
> Regards,
> Craig
>
>
>
>
>
> -----Original Message-----
> From: Somik Raha [mailto:so...@ya...]
> Sent: 16 April 2002 04:57 AM
> To: htm...@li...
> Cc: Craig Raw
> Subject: Re: [Htmlparser-user] Swing integration
>
> Hi Craig, Asgher
> I finally had the time to check Swing integration. Boy - the parser
> design in Swing sucks!! Theoretically it's possible to do it - and I got
> started, but just realized that in order to be compatible with swing
> objects that do compile time type checking with a particular tag, I have
> to actually have 73 if statements to give the right tag to the callback.
> I have more important things to do at the moment, but probably will get
> back to this donkey work. *sigh*
>
> I am thinking we should make release 1.1 and then try this. Any
> suggestions ?
>
> Regards,
> Somik
> ----- Original Message -----
> From: "Somik Raha" <so...@ya...>
> To: <htm...@li...>
> Sent: Thursday, April 04, 2002 11:20 AM
> Subject: Re: [Htmlparser-user] Swing integration
>
>
> > Hi Craig,
> > Thanks a lot for the post. Pls go ahead with your analysis. I will
> try
> > to catch up this weekend.
> > Regards,
> > Somik
> > ----- Original Message -----
> > From: "Craig Raw" <cr...@qu...>
> > To: "'Somik Raha'" <so...@ya...>
> > Sent: Tuesday, April 02, 2002 3:32 PM
> > Subject: RE: [Htmlparser-user] Swing integration
> >
> >
> > > Hi Somik,
> > >
> > > A quick excerpt from javax.swing.text.html.HTMLEditorKit javadoc -
> which
> > > is the driver behind JEditorPane's reading and writing HTML
> > > capabilities.
> > >
> > > ---
> > > Extendable/Scalable
> > >
> > > To maximize the usefulness of this kit, a great deal of effort has
> gone
> > > into making it extendable. These are some of the features.
> > > The parser is replaceable. The default parser is the Hot Java parser
> > > which is DTD based. A different DTD can be used, or an entirely
> > > different parser can be used. To change the parser, reimplement the
> > > getParser method. The default parser is dynamically loaded when
> first
> > > asked for, so the class files will never be loaded if an alternative
> > > parser is used. The default parser is in a separate package called
> > > parser below this package.
> > >
> > > The parser drives the ParserCallback, which is provided by
> HTMLDocument.
> > > To change the callback, subclass HTMLDocument and reimplement the
> > > createDefaultDocument method to return document that produces a
> > > different reader. The reader controls how the document is
> structured.
> > > Although the Document provides HTML support by default, there is
> nothing
> > > preventing support of non-HTML tags that result in alternative
> element
> > > structures.
> > > ---
> > >
> > > I may find some time to look into this as well, although I am not
> sure
> > > how much it would fix JEditorPane's somewhat buggy HTML rendering
> > > capabilities....
> > >
> > > -craig
> > >
> > >
> > > -----Original Message-----
> > > From: htm...@li...
> > > [mailto:htm...@li...] On Behalf Of
> Somik
> > > Raha
> > > Sent: 01 April 2002 05:28 PM
> > > To: HTMLParser User List
> > > Cc: HTMLParser Developer List
> > > Subject: Re: [Htmlparser-user] Swing integration
> > >
> > > Hi Craig
> > > Wow! Thats a great question.
> > > Actually, I doubt if I could replace Sun Microsystems' code with
> > > mine. I
> > > dont think Java is that open (or is it ?)
> > > However, we could think of writing our own adapter for the html
> parser
> > > that
> > > might plugin in some way...
> > > I have never used Sun's html parser (If I had, I might not have
> > > started
> > > this project).
> > > I will need to study Sun's parser before I can answer your
> > > question..
> > > But there does seem to be some interesting possibilities.
> > >
> > > Regards
> > > Somik
> > > ----- Original Message -----
> > > From: "Craig Raw" <cr...@qu...>
> > > To: <htm...@li...>
> > > Sent: Monday, April 01, 2002 10:20 PM
> > > Subject: [Htmlparser-user] Swing integration
> > >
> > >
> > > > Has the HTML Parser been integrated into Swing's HTMLEditorKit to
> > > > provide a better implementation of JEditorPane's HTML viewing
> > > > capabilities? HTML Parser would need to replace
> > > > javax.swing.text.html.parser.Parser, which is currently somewhat
> > > buggy.
> > > > Anyone tried this?
> > > >
> > > > -craig
> > > >
> > > >
> > > >
> > > >
|
|
From: Craig R. <cr...@qu...> - 2002-05-08 09:50:22
|
Posted this earlier, seems to have got lost....
----
Hi Somik,
I'm looking into the HTMLParser-Swing integration again, and I have two
questions:
1. The HTMLEditorKit.ParserCallback takes a position with most of its
callback functions. Can this position be extracted from the HTMLTag's
elementBegin()?
2. There is a need to differentiate between a callback to
handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) and
handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) when
iterating through the HTMLTag elements Enumeration. How?
You mentioned you have started an implementation - if you have a
framework going, I'd be happy to continue with the donkey work. I really
think this could make Swing's HTML rendering a lot more stable.
Regards,
Craig
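For reference, the distinction raised in question 2 can be observed by driving a ParserCallback with Swing's own parser. The sketch below is only an illustration of what an adapter would have to reproduce; it does not use HTMLParser at all. `SwingCallbackSketch` and `events` are hypothetical names, while `ParserDelegator` and `HTMLEditorKit.ParserCallback` are the standard JDK classes. The `pos` argument of each callback is the parser's own offset into the input, which the sketch does not record.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingCallbackSketch {

    // Parses the input with Swing's default parser and records which
    // callback fired for each tag, e.g. "start:p" or "simple:img".
    static List<String> events(String html) {
        final List<String> seen = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                seen.add("start:" + t);   // tags that get a matching end tag
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                seen.add("simple:" + t);  // tags without content, such as br or img
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                seen.add("end:" + t);
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        for (String event : events("<p>Hi<br><img src=\"a.gif\"></p>")) {
            System.out.println(event);
        }
    }
}
```

An adapter would presumably need the same decision for every tag it forwards: content-less tags go to handleSimpleTag, everything else to handleStartTag and later handleEndTag.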
-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: 16 April 2002 04:57 AM
To: htm...@li...
Cc: Craig Raw
Subject: Re: [Htmlparser-user] Swing integration
Hi Craig, Asgher
I finally had the time to check Swing integration. Boy - the parser
design in Swing sucks!! Theoretically it's possible to do it - and I got
started, but just realized that in order to be compatible with swing
objects that do compile time type checking with a particular tag, I have
to actually have 73 if statements to give the right tag to the callback.
I have more important things to do at the moment, but probably will get
back to this donkey work. *sigh*
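One possible way around a long chain of if statements, offered only as a sketch and not as what the project did: the JDK exposes `HTML.getTag(String)`, which returns the `HTML.Tag` constant for a tag name, or null when Swing has no such constant. `TagLookupSketch`, `toSwingTag` and `describe` are hypothetical names; the lowercase normalisation is an assumption about how the parser's tag names would arrive.

```java
import java.util.Locale;
import javax.swing.text.html.HTML;

class TagLookupSketch {

    // Returns Swing's constant for the given tag name, or null if Swing
    // does not define one. Swing registers its tags under lowercase names,
    // so the name is normalised before the lookup.
    static HTML.Tag toSwingTag(String parserTagName) {
        return HTML.getTag(parserTagName.toLowerCase(Locale.ROOT));
    }

    // Small helper for printing: the tag's name, or "unknown".
    static String describe(String parserTagName) {
        HTML.Tag tag = toSwingTag(parserTagName);
        return tag == null ? "unknown" : tag.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("IMG"));
        System.out.println(describe("blink"));
    }
}
```

Tags that Swing does not know would still need separate handling, so this only replaces the name-to-constant mapping, not the whole adapter.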
I am thinking we should make release 1.1 and then try this. Any
suggestions ?
Regards,
Somik
----- Original Message -----
From: "Somik Raha" <so...@ya...>
To: <htm...@li...>
Sent: Thursday, April 04, 2002 11:20 AM
Subject: Re: [Htmlparser-user] Swing integration
> Hi Craig,
> Thanks a lot for the post. Pls go ahead with your analysis. I will try
> to catch up this weekend.
> Regards,
> Somik
> ----- Original Message -----
> From: "Craig Raw" <cr...@qu...>
> To: "'Somik Raha'" <so...@ya...>
> Sent: Tuesday, April 02, 2002 3:32 PM
> Subject: RE: [Htmlparser-user] Swing integration
>
>
> > Hi Somik,
> >
> > A quick excerpt from javax.swing.text.html.HTMLEditorKit javadoc - which
> > is the driver behind JEditorPane's reading and writing HTML
> > capabilities.
> >
> > ---
> > Extendable/Scalable
> >
> > To maximize the usefulness of this kit, a great deal of effort has gone
> > into making it extendable. These are some of the features.
> > The parser is replaceable. The default parser is the Hot Java parser
> > which is DTD based. A different DTD can be used, or an entirely
> > different parser can be used. To change the parser, reimplement the
> > getParser method. The default parser is dynamically loaded when first
> > asked for, so the class files will never be loaded if an alternative
> > parser is used. The default parser is in a separate package called
> > parser below this package.
> >
> > The parser drives the ParserCallback, which is provided by HTMLDocument.
> > To change the callback, subclass HTMLDocument and reimplement the
> > createDefaultDocument method to return document that produces a
> > different reader. The reader controls how the document is structured.
> > Although the Document provides HTML support by default, there is nothing
> > preventing support of non-HTML tags that result in alternative element
> > structures.
> > ---
> >
> > I may find some time to look into this as well, although I am not sure
> > how much it would fix JEditorPane's somewhat buggy HTML rendering
> > capabilities....
> >
> > -craig
> >
> >
> > -----Original Message-----
> > From: htm...@li...
> > [mailto:htm...@li...] On Behalf Of
Somik
> > Raha
> > Sent: 01 April 2002 05:28 PM
> > To: HTMLParser User List
> > Cc: HTMLParser Developer List
> > Subject: Re: [Htmlparser-user] Swing integration
> >
> > Hi Craig
> > Wow! That's a great question.
> > Actually, I doubt if I could replace Sun Microsystems' code with
> > mine. I don't think Java is that open (or is it ?)
> > However, we could think of writing our own adapter for the html parser
> > that might plugin in some way...
> > I have never used Sun's html parser (If I had, I might not have
> > started this project).
> > I will need to study Sun's parser before I can answer your
> > question..
> > But there does seem to be some interesting possibilities.
> >
> > Regards
> > Somik
> > ----- Original Message -----
> > From: "Craig Raw" <cr...@qu...>
> > To: <htm...@li...>
> > Sent: Monday, April 01, 2002 10:20 PM
> > Subject: [Htmlparser-user] Swing integration
> >
> >
> > > Has the HTML Parser been integrated into Swing's HTMLEditorKit to
> > > provide a better implementation of JEditorPane's HTML viewing
> > > capabilities? HTML Parser would need to replace
> > > javax.swing.text.html.parser.Parser, which is currently somewhat buggy.
> > > Anyone tried this?
> > >
> > > -craig
> > >
> > >
> > >
> > >
|
|
From: Somik R. <so...@ya...> - 2002-05-07 06:29:11
|
Hi Folks,
Following some nice suggestions from Sam Joseph, I have just completed
some design modifications to the basic HTMLNode API.
The modifications are :
[1] HTMLNode is no longer an interface, but an abstract class. There were
two reasons for this change. Firstly, I couldn't think of a scenario where
an object would be an html tag AND something else. Secondly, I wanted to
enforce the implementation of toString(), which is usually not done if you
implement from the interface (as Object already has a default toString()).
[2] abstract toString() method - children have to implement this.
[3] print() and print(PrintWriter) - final methods. They will make a call
to toString(), and print to standard output and the print writer
respectively.
[4] toPlainTextString() - this method will provide a string representation
of a tag, if there is such a representation. If not, a blank string is
returned. This has implications - our program to extract all strings from
a html page will be simplified to:
HTMLNode node;
for (Enumeration e = parser.elements();e.hasMoreElements();) {
    node = (HTMLNode)e.nextElement();
    // or whatever processing you want to do with the string
    System.out.println(node.toPlainTextString());
}
[5] toRawString() - this method provides the complete html element (a
reconstruction), thus allowing ripping programs to be really simple. So if
you want to rip the html page to your local hard disk, your program would
look like,
HTMLNode node;
PrintWriter pw = new PrintWriter(new FileWriter("..."));
for (Enumeration e = parser.elements();e.hasMoreElements();) {
    node = (HTMLNode)e.nextElement();
    pw.println(node.toRawString());
}
pw.close();
[6] Lots of bug fixes done - HTMLImageScanner had a bug, HTMLStyleScanner
also had one - all caught with more testcases.
We have 100 testcases as of now, all of them passing.
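The design points [1] to [5] above can be illustrated in isolation. What follows is a minimal sketch under stated assumptions, not the library's source: `SketchNode`, `SketchText`, `SketchTag` and `NodeApiSketch` are hypothetical names, and the default blank `toPlainTextString()` is an assumption drawn from item [4].

```java
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the abstract HTMLNode described above.
abstract class SketchNode {
    // Re-declaring toString() as abstract forces every subclass to write one;
    // an interface could not enforce this because Object already supplies it.
    @Override
    public abstract String toString();

    // Final helpers that delegate to toString().
    public final void print() {
        System.out.println(toString());
    }

    public final void print(PrintWriter pw) {
        pw.println(toString());
    }

    // Assumption: nodes with no textual form return a blank string.
    public String toPlainTextString() {
        return "";
    }

    // The reconstructed html for this node.
    public abstract String toRawString();
}

final class SketchText extends SketchNode {
    private final String text;
    SketchText(String text) { this.text = text; }
    @Override public String toString() { return "Text: " + text; }
    @Override public String toPlainTextString() { return text; }
    @Override public String toRawString() { return text; }
}

final class SketchTag extends SketchNode {
    private final String contents;
    SketchTag(String contents) { this.contents = contents; }
    @Override public String toString() { return "Tag: " + contents; }
    @Override public String toRawString() { return "<" + contents + ">"; }
}

class NodeApiSketch {
    // Sample node sequence for the markup <b>Hello</b> world
    static List<SketchNode> sample() {
        return Arrays.<SketchNode>asList(
            new SketchTag("b"), new SketchText("Hello"),
            new SketchTag("/b"), new SketchText(" world"));
    }

    // String extraction: tags contribute nothing, text nodes their text.
    static String plainText(List<SketchNode> nodes) {
        StringBuilder out = new StringBuilder();
        for (SketchNode node : nodes) out.append(node.toPlainTextString());
        return out.toString();
    }

    // Ripping: every node contributes its reconstructed html.
    static String raw(List<SketchNode> nodes) {
        StringBuilder out = new StringBuilder();
        for (SketchNode node : nodes) out.append(node.toRawString());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(plainText(sample())); // prints "Hello world"
        System.out.println(raw(sample()));       // prints "<b>Hello</b> world"
    }
}
```

Because the blank default lives in the base class, the extraction loop needs no instanceof checks, which seems to be the simplification item [4] is after.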
To-do list for Release 1.2
------------------------------------
[1] Integration of Raghavender Srimantula's contribution - HTMLFrameScanner
and HTMLFormScanner, into the parser. This will be integrated as soon as I
get the testcases from Raghav.
[2] Adding an HTML Ripping program in the parserApplications package.
[3] Improving the Robot Crawler (??)
[4] Bug fixes to any bugs that get reported in this period.
You can check out the latest code from CVS. Or you can go to
http://htmlparser.sourceforge.net and click on the download
link, and choose htmlparser1_2_20020507.zip
Feedback is welcome.
Regards,
Somik
|
|
From: Somik R. <so...@ya...> - 2002-05-06 04:59:10
|
Hi Raghav,
I sent another mail sometime back to you -
"HTMLLinkTag.linkData() - this gives you an enumeration - and in the
enumeration will be your HTMLImageTag."
HTMLNode node;
HTMLImageTag imageTag;
for (Enumeration e = linkTag.linkData();e.hasMoreElements();) {
    node = (HTMLNode)e.nextElement();
    if (node instanceof HTMLImageTag) {
        imageTag = (HTMLImageTag)node;
        // your code here
    }
}
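To collect every image on a page, including those wrapped in links, the same idea can be applied while walking the top-level nodes. The sketch below is self-contained and does not use the real library classes: `PageNode`, `ImageNode`, `LinkNode` and `ImageCollectorSketch` are hypothetical stand-ins for HTMLNode, HTMLImageTag and HTMLLinkTag, and only `linkData()` and `getImageLocation()` mirror names from this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical stand-ins for HTMLNode, HTMLImageTag and HTMLLinkTag.
abstract class PageNode { }

final class ImageNode extends PageNode {
    private final String location;
    ImageNode(String location) { this.location = location; }
    String getImageLocation() { return location; }
}

final class LinkNode extends PageNode {
    private final List<PageNode> children;
    LinkNode(PageNode... children) { this.children = Arrays.asList(children); }
    // Mirrors the linkData() enumeration mentioned in the mail.
    Enumeration<PageNode> linkData() { return Collections.enumeration(children); }
}

class ImageCollectorSketch {
    // Returns the locations of all images, whether top-level or inside a link.
    static List<String> imageLocations(Enumeration<PageNode> nodes) {
        List<String> found = new ArrayList<String>();
        while (nodes.hasMoreElements()) {
            PageNode node = nodes.nextElement();
            if (node instanceof ImageNode) {
                found.add(((ImageNode) node).getImageLocation());
            } else if (node instanceof LinkNode) {
                // Images between <a> and </a> are only reachable through the link.
                found.addAll(imageLocations(((LinkNode) node).linkData()));
            }
        }
        return found;
    }

    // One top-level image and one image nested in a link.
    static Enumeration<PageNode> samplePage() {
        List<PageNode> page = Arrays.<PageNode>asList(
            new ImageNode("title4.gif"),
            new LinkNode(new ImageNode("hjys.gif")));
        return Collections.enumeration(page);
    }

    public static void main(String[] args) {
        System.out.println(imageLocations(samplePage())); // prints [title4.gif, hjys.gif]
    }
}
```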
Regards,
Somik
----- Original Message -----
From: "Raghavender Srimantula" <kin...@ho...>
To: <htm...@li...>
Sent: Monday, May 06, 2002 10:43 AM
Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
and writeoutdocument
> Hi Somik,
> this question is regarding "not all images are being retrieved". I mean
> the images under <a tag. I did try to open the attachment you sent me. I
> could not find anything. but seeing the previous mails I could read that
> it is not a bug. but still if I do want to retrieve all the images how do
> I do it.
> Thanks,
> Raghav
>
>
> >From: "Somik Raha" <so...@ya...>
> >Reply-To: htm...@li...
> >To: <htm...@li...>
> >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
> >and write outdocument
> >Date: Tue, 30 Apr 2002 11:37:26 +0900
> >
> >Hi Raghav,
> > Ah - this was a question by Annette Doyle (titled "Not all image tags
> >are returned"). I am attaching my reply.
> >
> >Regards
> >Somik
> >
> >----- Original Message -----
> >From: "Raghavender Srimantula" <kin...@ho...>
> >To: <htm...@li...>
> >Sent: Tuesday, April 30, 2002 11:16 AM
> >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
> >and write outdocument
> >
> >
> > > hi Somik,
> > > I found one more interesting thing here. when I am trying to get all the
> > > images the image scanner would give me images
> > > <img src="http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif"
> > > width=296 height=27 border=0 usemap=#tm>
> > > so if I do a imagetag.getImageLocation(), I would get
> > > http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif
> > >
> > > but if the html content is like this
> > > <a href=s/6006><img src=http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif
> > > border=0 width=70 height=22></a>
> > > which starts with <a and ends with </a>, then the image scanner will not
> > > give me http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif when I do a
> > > imagetag.getImageLocation(). this is not even classified as an ImageTag.
> > > this is classified as LinkTag. how to get this image.
> > >
> > > the above content is from www.yahoo.com. on the netscape browser if you
> > > goto view-->pageinfo, you will see a bunch of images.
> > > but when you run the htmlparser you can get only one image.
> > >
> > > Thanks,
> > > Raghav
> > >
> > >
> > > >From: "Somik Raha" <so...@ya...>
> > > >Reply-To: htm...@li...
> > > >To: <htm...@li...>
> > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag
> >locations
> > > >and write outdocument
> > > >Date: Tue, 30 Apr 2002 09:15:38 +0900
> > > >
> > > >Can you describe your application ? Was it parsing a single page when the
> > > >problem occurred ?
> > > >
> > > >Regards,
> > > >Somik
> > > >----- Original Message -----
> > > >From: "Raghavender Srimantula" <kin...@ho...>
> > > >To: <htm...@li...>
> > > >Cc: <htm...@li...>
> > > >Sent: Tuesday, April 30, 2002 8:36 AM
> > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag
> >locations
> > > >and write outdocument
> > > >
> > > >
> > > > > Hi Somik,
> > > > > I encountered a strange problem today. while I was running htmlparser...I
> > > > > got a java.lang.OutOfMemoryError. seems that lot of objects are being
> > > > > allocated. where exactly is this happening. I mean could you give me an idea
> > > > > where or in which file the potential problem could be.
> > > > > Raghav
> > > > >
> > > > >
> > > > > >From: "Somik Raha" <so...@ya...>
> > > > > >Reply-To: htm...@li...
> > > > > >To: <htm...@li...>
> > > > > >CC: <htm...@li...>
> > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag
> > > >locations
> > > > > >and write out document
> > > > > >Date: Sat, 27 Apr 2002 18:22:34 +0900
> > > > > >
> > > > > >Hi Annette,
> > > > > > Pls find attached a program to get you started. This program will do
> > > > > >what you want - you will need to modify the construct that checks for the
> > > > > >image tag - and replace it with the location of your choice.
> > > > > > Also - I found one bug thanks to this requirement - image tags params
> > > > > >were not being correctly put in. Though it needs a deeper look, I have done
> > > > > >a quick fix for now, and all test cases are passing (with one test case in
> > > > > >HTMLImageScannerTest trapping this bug).
> > > > > > Please check out the latest html parser source code from CVS.
> > > > > >
> > > > > >Regards,
> > > > > >Somik
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: Doyle, Annette
> > > > > > To: htm...@li...
> > > > > > Sent: Friday, April 26, 2002 10:08 PM
> > > > > > Subject: [Htmlparser-user] Hints on how to change image tag
> > > >locations
> > > > > >and write out document
> > > > > >
> > > > > >
> > > > > > Could you please give me some hints as how to change only image tag
> > > > > >locations and then, (or at the same time) write out the html document to
> > > > > >file (with new image tag locations)?
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks-
> > > > > >
> > > > > > Annette Doyle
> > > > > >
> > > > > ><< ImageTagRetriever.java >>
> > > > >
> > > > > _______________________________________________
> > > > > Htmlparser-user mailing list
> > > > > Htm...@li...
> > > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
> ><< [Htmlparser-developer]Re_[Htmlparser-user]Notallimagetagsarereturned[NotaBug].eml >>
|
|
From: Raghavender S. <kin...@ho...> - 2002-05-06 01:44:05
|
Hi Somik,
this question is regarding "not all images are being retrieved". I mean the
images under <a tag. I did try to open the attachment you sent me. I could
not find anything. but seeing the previous mails I could read that it is not
a bug. but still if I do want to retrieve all the images how do I do it.
Thanks,
Raghav
|
|
From: Somik R. <so...@ya...> - 2002-05-03 09:25:52
|
Hi Folks,
A testing build is out - you can download it from
http://htmlparser.sourceforge.net (choose the download link). This is a
testing build with important bug fixes.
Regards,
Somik
|
|
From: Somik R. <so...@ya...> - 2002-05-03 08:35:27
|
Hi Annette,
I went through the first problem you reported again, and I realized the
mistake in my testcase - this tag has two newlines instead of one after
each line. I could reproduce the bug after that. I have applied your fix
and updated CVS.
Thanks a lot.
Regards,
Somik
----- Original Message -----
From: Doyle, Annette
To: htm...@li...
Sent: Thursday, May 02, 2002 5:06 AM
Subject: [Htmlparser-user] fixed previous problem - (however, new problem)
|
|
From: Somik R. <so...@ya...> - 2002-05-03 08:15:23
|
Hi Folks,
We seem to have a heroic parser now...
You can check out the latest code from CVS.
Here's the fix. As you know - if we have an additional erroneous
inverted comma in a tag, the parser cannot judge whether to treat this
as erroneous or valid. Now the parser has some amount of intelligence -
if it encounters an inverted comma and a close tag character, then it
does a check to see whether it should treat this as an error or a valid
character.
This decision making process is facilitated with a strictVector -
which holds the tags for which it should not make allowances. Currently,
there is only one - "INPUT" (Should we have any more?). If the tag
being parsed is not a strict tag like INPUT, then it is assumed that
this is an erroneous tag and needs to be corrected.
The correction process occurs (and is validated with some testcases
in HTMLTag - particularly testStrictParsing). If you go through that
testcase - you will see that the attributes are also correctly
retrieved.
This solution doesn't break anything else - we have 82 testcases,
all passing.
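One possible reading of the rule described above, as a small self-contained sketch. This is not the code in HTMLTag: the class and method names are made up, and the exact check (a quote that would open a quoted value immediately before '>') is an assumption about what "an inverted comma and a close tag character" means. Only the strict list with the single entry INPUT comes from the description.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class LenientTagEnd {
    // Tags parsed strictly, mirroring the single "INPUT" entry mentioned above.
    static final Set<String> STRICT = new HashSet<String>(Arrays.asList("INPUT"));

    // Returns the index of the '>' that closes the tag starting at line.charAt(start) == '<',
    // or -1 if the tag does not end on this line.
    static int findTagEnd(String line, int start) {
        boolean strict = STRICT.contains(tagName(line, start));
        boolean inQuotes = false;
        for (int i = start + 1; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                boolean opening = !inQuotes;
                boolean closeFollows = i + 1 < line.length() && line.charAt(i + 1) == '>';
                if (opening && closeFollows && !strict) {
                    // Assumed heuristic: a quote that "opens" right before '>' is really a
                    // closing quote left unbalanced by a stray quote earlier in the tag.
                    return i + 1;
                }
                inQuotes = opening;
            } else if (c == '>' && !inQuotes) {
                return i;
            }
        }
        return -1;
    }

    // Extracts the upper-cased tag name that follows '<'.
    static String tagName(String line, int start) {
        int end = start + 1;
        while (end < line.length() && Character.isLetterOrDigit(line.charAt(end))) {
            end++;
        }
        return line.substring(start + 1, end).toUpperCase(Locale.ROOT);
    }
}
```

With this rule a well-formed <a href="x>y"> still keeps its quoted '>', while a non-strict tag such as <font x=">"> would be cut short at the first '>'. That trade-off is what the strict list is there to control.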
I'd be grateful if folks can test this version and let me know if
this solution is acceptable.
Also - a general question - would you prefer something like nightly
drop packages for downloading, or is a request to check out from CVS
fine?
Thanks and Regards,
Somik
|
|
From: Somik R. <so...@ya...> - 2002-05-02 03:30:50
|
Hi Folks,
Thanks to an interesting bug report by Roger Sollberger, a bug in
HTMLStringNode has been fixed.
Links of the type :
<a href="http://asgard.ch">[> ASGARD <]</a>
would get messed up because of the tag symbols, when they should really be
a part of HTMLStringNode.
This has been fixed (after the bug was reproduced in a testcase in
HTMLStringNodeTest).
CVS code base updated.
Roger --> Thanks a lot for the report.
Regards,
Somik
|
|
From: Somik R. <so...@ya...> - 2002-05-02 03:11:27
|
Hi Folks,
If you've been following the latest exchange on htmlparser-user,
Annette has shown us a crazy example of dirty html, which works in the
browser, but crashes the parser.
The site is http://www.cia.gov
Search for this string - <font face="Arial,"helvetica,"
and you will find it in the html. Now this erroneous inverted comma
in front of helvetica should not be there.
This has been captured in a test case in HTMLTagTest.java (you can
get it from CVS), and this test fails (testParsing()).
The problem is - the core parsing mechanism ignores anything within
inverted commas. This is critical so as to be able to accept angular
brackets in inverted commas. If we remove this feature from the parser
other tests will break.
So I need some suggestions on how we might modify our parsing - how
do we intelligently understand that this is an error (how easy it is for
us humans to figure this out)? Looks like linear approaches wouldn't
work anymore... Maybe we need to associate some intelligence - that if
it's a font tag, then this kind of stuff is most definitely an error.
Whereas if it's a jsp tag, we can get more strict with our parsing. This
will probably cause a fundamental shift in our core parsing technique.
Regards,
Somik
|
|
From: Somik R. <so...@ya...> - 2002-05-02 02:59:22
|
Hi Annette,
Regarding your second problem, the parsing error occurs because -
<div align="center"><font face="Arial,"helvetica,"
sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html"
link="#000000" vlink="#000000"><font
In the above - font face="Arial,"helvetica," -- note the erroneous
extra " in front of helvetica. Remove it and the parsing is fine. Now of
course you can't remove it, because this site is not yours :). So, we do
have to support this kind of dirty html. Thank you so much for bringing
it to our notice. I have written a test case to reproduce this bug, and
am working to resolve this.
Regards,
Somik
|
|
From: Somik R. <so...@ya...> - 2002-05-02 02:42:14
|
Hi Annette,
Regarding the first problem, I wrote a testcase, but was unable to
reproduce the error. Can you check out the latest code from CVS
(HTMLImageScanner), and take a look at the testcase
testImageTagOnThreeLines(). This test case passes. It ought to fail if
there is a problem in the parsing.
Meanwhile I am taking a look at the second issue.
Regards,
Somik
----- Original Message -----
From: Doyle, Annette
To: htm...@li...
Sent: Thursday, May 02, 2002 5:06 AM
Subject: [Htmlparser-user] fixed previous problem - (however, new problem)
|
|
From: Doyle, A. <Ann...@au...> - 2002-05-01 20:07:05
|
Fixed:
<td rowspan=3><img height=49

alt="Central Intelligence Agency, Director of Central Intelligence"

src="graphics/images_home2/cia_banners_template3_01.gif"

width=241></td>

by changing HTMLTag as follows:
public static int incrementCounter(HTMLReader reader, int state, int i, HTMLTag tag) {
    String strLine = null;
    if ((state==TAG_BEGIN_PARSING_STATE || state == TAG_IGNORE_DATA_STATE) && i==tag.getTagLine().length()-1)
    {
        // We need to continue parsing to the next line
        ;
        while ((strLine = reader.getNextLine()).length() == 0);
        //tag.setTagLine(reader.getNextLine());
        tag.setTagLine(strLine);
        // convert the end of line to a space
        // The following line masked by Somik Raha, 15 Apr 2002, to fix space bug in links
        tag.append('\n');
        i=-1;
    }
    return ++i;
}
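A self-contained sketch of the blank-line skipping in the fix above, for illustration only. It reads from a plain Iterator<String> instead of HTMLReader, the names are hypothetical, and it deviates from the patch in two labeled ways: it also skips whitespace-only lines, and it stops with null when the input runs out instead of calling length() on the next line.

```java
import java.util.Iterator;

class ContinuationLines {
    // Returns the next line that is not blank, or null when the input is exhausted.
    // Deviation from the patch: whitespace-only lines count as blank too.
    static String nextNonBlank(Iterator<String> lines) {
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.trim().length() > 0) {
                return line;
            }
        }
        return null;
    }

    // Joins a tag that is spread over several lines, skipping blank lines in between.
    static String joinTag(Iterator<String> lines) {
        StringBuilder tag = new StringBuilder();
        String line;
        while ((line = nextNonBlank(lines)) != null) {
            if (tag.length() > 0) {
                tag.append(' '); // the end of line becomes a single space
            }
            tag.append(line.trim());
            if (line.indexOf('>') >= 0) {
                break; // naive end-of-tag test: ignores '>' inside quotes
            }
        }
        return tag.toString();
    }
}
```

Fed the four attribute lines of the img tag above with blank lines between them, joinTag returns the tag on one line and leaves the iterator positioned just after it.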
NEW PROBLEM in following:

<div align="center"><font face="Arial,"helvetica,"
sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html"
link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
| <a href="/cia/notices.html" link="#000000"
vlink="#000000"><font color="#FFFFFF">Notices</font></a>
| <a href="/cia/notices.html#priv" link="#000000"
vlink="#000000"><font color="#FFFFFF">Privacy</font></a>
| <a href="/cia/notices.html#sec" link="#000000"
vlink="#000000"><font color="#FFFFFF">Security</font></a>
| <a href="/cia/contact.htm" link="#000000"
vlink="#000000"><font color="#FFFFFF">Contact Us</font></a>
| <a href="/cia/sitemap.html" link="#000000"
vlink="#000000"><font color="#FFFFFF">Site Map</font></a>
| <a href="/cia/siteindex.html" link="#000000"
vlink="#000000"><font color="#FFFFFF">Index</font></a>
| <a href="/search" link="#000000" vlink="#000000"><font
color="#FFFFFF">Search</font></a>
</font></div>

Stops at
TAG LINE FOUND <div align="center"><font face="Arial,"helvetica,"
sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html"
link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
LINE is <div align="center"><font face="Arial,"helvetica,"
sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html"
link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
POSITION IS 26
TAGLINE 197
Process completed.

Annette Doyle
|
|
From: Doyle, A. <Ann...@au...> - 2002-05-01 18:39:28
|
The following html is not parsed correctly. Try http://www.cia.gov <http://www.cia.gov/> .

<td rowspan=3><img height=49

alt="Central Intelligence Agency, Director of Central Intelligence"

src="graphics/images_home2/cia_banners_template3_01.gif"

width=241></td>

Annette Doyle
|
|
From: Somik R. <so...@ya...> - 2002-05-01 06:07:52
|
The second issue that you mentioned is already fixed. It is in release 1.1 -
have you got the latest release ?
Regards,
Somik
----- Original Message -----
From: "Raghavender Srimantula" <kin...@ho...>
To: <htm...@li...>
Sent: Wednesday, May 01, 2002 4:15 AM
Subject: Re: [Htmlparser-user] extracting only certain links
> hi Somik,
> I have tried urls www.nba.com, www.yahoo.com which seem to have lot of
> links. yahoo.com has 191 links when I tried. I wrote a small class
> Parser.java which I am mailing as an attachment. everytime I run my project
> using JBuilder after a series of parsing it throws a OutOfMemoryError. since
> I am using JBuilder I havent set any -ms or -mx parameters to run my
> Parser.java. so you might want to try it out.
> and the other thing I noticed while running the Parser.java was at the
> LinkScanner in the extractLink() method for a particular <a tag I get
> relativeLink as "null" and then when we do
> "return (new HTMLLinkProcessor()).extract(relativeLink,url);"
> it throws a NullPointerException in that method since relativeLink is null.
> The exact place it throws a NullPointerException is
> "if (link.indexOf("https://")==-1 && link.indexOf("mailto:")==-1 && url !=
> null)" in "checkIfLinkIsRelative" method of HTMLLinkProcessor. this could be
> fixed. I fixed it....but the OutOfMemoryError seems to be potentially
> dangerous.
>
> Thanks,
> Raghav
|
|
From: Raghavender S. <kin...@ho...> - 2002-05-01 00:15:32
|
hi Somik,
sorry about the false alarm. there was a bug in my code. but the second one --
the LinkScanner throwing a NullPointerException -- is still there.
Raghav
>From: "Somik Raha" <so...@ya...>
>Reply-To: htm...@li...
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] extracting only certain links
>Date: Tue, 30 Apr 2002 11:44:17 +0900
>
>Semantic analysis...
>Write a conditional to process the tag contents. You will have code like
>this :
>
>if (node instanceof HTMLLinkTag) {
> HTMLLinkTag linkTag = (HTMLLinkTag) node;
> if (linkTag.getLink().indexOf("http://rd.yahoo.com")==0) {
> // print the tag or display it however you want
> }
>}
>
>Regards
>Somik
>----- Original Message -----
>From: "Sodergren, M.G." <mg...@le...>
>To: <htm...@li...>
>Sent: Tuesday, April 30, 2002 2:19 AM
>Subject: [Htmlparser-user] extracting only certain links
>
>
>Hello.
>When i enter a url like http://search.yahoo.com/bin/search?p=SEARCHENTERED
>(yahoo result page for SEARCHENTERED),the program extracts all the links
>from the html page but i just want it to extract the links that are
>returned
>as the result of my search by yahoo, so for example (with yahoo), all the
>links beginning with <a href="http://srd.yahoo.com
>but not the links beginning with <a href="http://rd.yahoo.com/
>so in other words all the links with srd and not rd.
>How would i solve this problem? What code do i put and where?
>
>Thanks
>Mats
>
>_______________________________________________
>Htmlparser-user mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-user
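
The original question asks for the links that begin with http://srd.yahoo.com and not those that begin with http://rd.yahoo.com, whereas the quoted conditional tests for the rd prefix. Below is a minimal sketch of the requested check, written against plain strings so that it runs without the parser. The class name ResultLinkFilter, its method names, and the sample URLs are hypothetical; in parser code the strings would presumably come from linkTag.getLink(), as in the quoted snippet. The null guard is an assumption of this sketch and is not a statement about what getLink() returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HTML Parser: keeps only links that
// start with the search-result prefix named in the original question.
class ResultLinkFilter {
    // Prefix taken from the question; adjust it for other sites.
    static final String RESULT_PREFIX = "http://srd.yahoo.com";

    // True when the link starts with the result prefix.
    // A null link is treated as "no match" instead of throwing.
    static boolean isResultLink(String link) {
        return link != null && link.startsWith(RESULT_PREFIX);
    }

    // Returns the matching links in their original order.
    static List<String> filter(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (isResultLink(link)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://srd.yahoo.com/example-result",   // made-up sample
                "http://rd.yahoo.com/example-redirect",  // made-up sample
                null);
        for (String link : filter(links)) {
            System.out.println(link);
        }
    }
}
```

startsWith expresses the same test as indexOf(...)==0 in the quoted code, without scanning the rest of the string when the prefix is absent.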
|