htmlparser-user Mailing List for HTML Parser


Hi Raghav,
    I went through the yahoo.txt, and just like your previous one, this one
too had very dirty html. The reason you got the OutOfMemoryError was
that this kind of html sent the parser into an infinite loop (in
HTMLLinkScanner).
    The tag which did this was:
<a href=s/8741><img src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16 width=16 border=0></img></td><td nowrap> &nbsp;
<a href=s/7509><b>Yahoo! Movies</b></a>

As you can see, the first link tag does not have an end tag. I checked
the actual yahoo page, and there this link appears properly, with
the correct end tag. Looking closely at your supplied file, I also
noticed the </img> tag, which is highly unusual in normal html.

So - I am guessing that this file is generated by a program and not by a
human. You would definitely want to check the program that's doing it -
it's surely buggy.
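
If it helps to check a saved page before handing it to the parser, here is a minimal sketch of such a check. It uses only java.util.regex and is not part of htmlparser; the class name UnclosedLinkCheck and the method findUnclosedLinks are made up for illustration. It assumes that links never nest, so any <a ...> that is still open when the next <a ...> starts (or when the input ends) is reported as missing its end tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of the htmlparser library.
class UnclosedLinkCheck {
    // Matches an opening <a ...> tag or a closing </a> tag, ignoring case.
    private static final Pattern ANCHOR =
        Pattern.compile("<(/?)a(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /** Returns the text of every opening link tag that never gets its </a>. */
    static List<String> findUnclosedLinks(String html) {
        List<String> unclosed = new ArrayList<>();
        String open = null; // the currently open <a ...> tag, if any
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            boolean isEndTag = !m.group(1).isEmpty();
            if (isEndTag) {
                open = null; // link closed properly
            } else {
                if (open != null) {
                    unclosed.add(open); // previous link was never closed
                }
                open = m.group();
            }
        }
        if (open != null) {
            unclosed.add(open); // input ended inside a link
        }
        return unclosed;
    }

    public static void main(String[] args) {
        String html =
            "<a href=s/8741><img src=\"mov_popc.gif\" height=16 width=16 border=0>"
            + "</img></td><td nowrap> &nbsp;\n"
            + "<a href=s/7509><b>Yahoo! Movies</b></a>";
        System.out.println(findUnclosedLinks(html));
    }
}
```

Run against the fragment quoted above, this should flag only the first link (href=s/8741). It is a rough regex scan, so it will be fooled by link tags inside scripts or comments.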

    However, my yardstick for the robustness of this parser is Internet
Explorer. If the stuff works in IE, then it's got to work here. And when I
tried this particularly bad piece of html, I found IE does not crash.
Hence, I had to go about empowering the parser to parse these erroneous
tags <sigh> Took hours!! </sigh>

    The good news is, it's done. We can parse these tags, and the correct
end tag is inserted just before td. Of course, I have done a minimal
adjustment for your purpose. As time goes on, robustness ought to
increase further. All test cases are passing. The framework for handling
dirty html is also slightly modified.
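
To illustrate the idea of inserting the missing end tag just before td, here is a minimal string-level sketch. It is not the code that went into HTMLLinkScanner; the class LinkEndTagRepair and its method are hypothetical, and it assumes a link never legitimately spans the end of a table cell, so any link still open at </td> gets a </a> inserted right before it.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; the real fix lives inside the parser's scanners.
class LinkEndTagRepair {
    // Matches an opening <a ...>, a closing </a>, or a closing </td>, ignoring case.
    private static final Pattern TAG = Pattern.compile(
        "<a(\\s[^>]*)?>|</a\\s*>|</td\\s*>", Pattern.CASE_INSENSITIVE);

    /** Inserts </a> before any </td> that is reached while a link is still open. */
    static String closeLinksBeforeCellEnd(String html) {
        StringBuilder out = new StringBuilder();
        boolean linkOpen = false;
        int last = 0; // end of the previously copied region
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            out.append(html, last, m.start()); // copy text between tags unchanged
            String tag = m.group().toLowerCase(Locale.ROOT);
            if (tag.startsWith("</td")) {
                if (linkOpen) {
                    out.append("</a>"); // synthesize the missing end tag
                    linkOpen = false;
                }
            } else if (tag.startsWith("</a")) {
                linkOpen = false;
            } else {
                linkOpen = true; // an opening <a ...>
            }
            out.append(m.group());
            last = m.end();
        }
        out.append(html, last, html.length()); // copy the tail
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "<td><a href=s/8741><img src=x></img></td>"
            + "<td nowrap><a href=s/7509><b>Yahoo! Movies</b></a></td>";
        System.out.println(closeLinksBeforeCellEnd(dirty));
    }
}
```

The important property is that the repair happens in a single forward pass over the input, so a missing end tag can no longer keep the scan from terminating.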

    An integration release has been made (2002-05-12), and is under the
integration builds package. You can download it from
http://htmlparser.sourceforge.net.

    The parser should not crash on your html now.

Regards,
Somik
  ----- Original Message -----
  From: Raghavender Srimantula
  To: htm...@li...
  Sent: Saturday, May 11, 2002 4:32 AM
  Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document

  Hi Somik,
  I mentioned the out of memory error problem earlier. Last time, on every
  iteration of the for loop I was adding the whole page to my string buffer,
  so it was giving me the out of memory error. I removed that now. It was
  working fine till yesterday. Now I find that error again. This time it has
  nothing to do with the string buffer, and it looks like a real problem. I can
  send you the main class and the yahoo.txt I have. Try running it.
  Thanks,
  Raghav

  >From: "Somik Raha" <so...@ya...>
  >Reply-To: htm...@li...
  >To: <htm...@li...>
  >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
  >and write out document
  >Date: Fri, 10 May 2002 00:43:19 +0900
  >
  >Hi Raghav,
  >     On analyzing yahoo.txt, I found that you have incorrect html. There is
  >a script tag that has not been closed. So naturally the script scanner goes
  >bonkers. Rename the extension to .html, and open this file in IE, and you
  >will find that IE also can't handle this.
  >     I verified from www.yahoo.com, and found that they do have the correct
  ></script> tag provided. So I guess your yahoo.txt file is faulty.
  >
  >Regards,
  >Somik
  >   ----- Original Message -----
  >   From: Raghavender Srimantula
  >   To: htm...@li...
  >   Sent: Thursday, May 09, 2002 4:53 AM
  >   Subject: Re: [Htmlparser-user] Hints on how to change image tag
  >locations and write out document
  >
  >
  >   Hi Somik,
  >   I was using the 1.1 version of htmlparser. I saved the www.yahoo.com content
  >   in a flat file, yahoo.txt, and I ran the parser against it. It throws a
  >   NullPointerException in HTMLScriptScanner. This seems to be a new addition
  >   for 1.1. I will send the stack trace, the main program and the yahoo.txt.
  >   Actually I cannot send the stack trace. I made some changes and the line
  >   numbers don't match. But if you run this program you would see the
  >   NullPointerException.
  >   Thanks,
  >   Raghav
  >
  >
  >   >From: "Somik Raha" <so...@ya...>
  >   >Reply-To: htm...@li...
  >   >To: <htm...@li...>
  >   >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
  >   >and write out document
  >   >Date: Mon, 6 May 2002 13:59:11 +0900
  >   >
  >   >Hi Raghav,
  >   >     I sent another mail sometime back to you -
  >   >
  >   >"HTMLLinkTag.linkData() - this gives you an enumeration - and in the
  >   >enumeration will be your HTMLImageTag."
  >   >HTMLNode node;
  >   >HTMLImageTag imageTag;
  >   >for (Enumeration e = linkTag.linkData(); e.hasMoreElements();) {
  >   >     node = (HTMLNode)e.nextElement();
  >   >     if (node instanceof HTMLImageTag) {
  >   >         imageTag = (HTMLImageTag)node;
  >   >         // your code here
  >   >     }
  >   >}
  >   >
  >   >Regards,
  >   >Somik
  >   >----- Original Message -----
  >   >From: "Raghavender Srimantula" <kin...@ho...>
  >   >To: <htm...@li...>
  >   >Sent: Monday, May 06, 2002 10:43 AM
  >   >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
  >   >and write out document
  >   >
  >   >
  >   > > Hi Somik,
  >   > > this question is regarding "not all images are being retrieved". I mean the
  >   > > images under the <a tag. I did try to open the attachment you sent me, but I
  >   > > could not find anything. From the previous mails I gather that it is not
  >   > > a bug. But if I still want to retrieve all the images, how do I do it?
  >   > > Thanks,
  >   > > Raghav
  >   > >
  >   > >
  >   > > >From: "Somik Raha" <so...@ya...>
  >   > > >Reply-To: htm...@li...
  >   > > >To: <htm...@li...>
  >   > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
  >   > > >and write out document
  >   > > >Date: Tue, 30 Apr 2002 11:37:26 +0900
  >   > > >
  >   > > >Hi Raghav,
  >   > > >     Ah - this was a question by Annette Doyle (titled "Not all image tags
  >   > > >are returned"). I am attaching my reply.
  >   > > >
  >   > > >Regards
  >   > > >Somik
  >   > > >
  >   > > >----- Original Message -----
  >   > > >From: "Raghavender Srimantula" <kin...@ho...>
  >   > > >To: <htm...@li...>
  >   > > >Sent: Tuesday, April 30, 2002 11:16 AM
  >   > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
  >   > > >and write out document
  >   > > >
  >   > > >
  >   > > > > hi Somik,
  >   > > > > I found one more interesting thing here. When I try to get all the
  >   > > > > images, the image scanner gives me images like
  >   > > > > <img src="http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif"
  >   > > > > width=296 height=27 border=0 usemap=#tm>
  >   > > > > so if I do an imagetag.getImageLocation(), I get
  >   > > > > http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif
  >   > > > >
  >   > > > > but if the html content is like this
  >   > > > > <a href=s/6006><img src=http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif
  >   > > > > border=0 width=70 height=22></a>
  >   > > > > which starts with <a and ends with </a>, then the image scanner will not
  >   > > > > give me http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif when I do an
  >   > > > > imagetag.getImageLocation(). This is not even classified as an ImageTag.
  >   > > > > It is classified as a LinkTag. How do I get this image?
  >   > > > >
  >   > > > > The above content is from www.yahoo.com. In the Netscape browser, if you
  >   > > > > go to view-->pageinfo, you will see a bunch of images,
  >   > > > > but when you run the htmlparser you get only one image.
  >   > > > >
  >   > > > > Thanks,
  >   > > > > Raghav
  >   > > > >
  >   > > > >
  >   > > > > >From: "Somik Raha" <so...@ya...>
  >   > > > > >Reply-To: htm...@li...
  >   > > > > >To: <htm...@li...>
  >   > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
  >   > > > > >and write out document
  >   > > > > >Date: Tue, 30 Apr 2002 09:15:38 +0900
  >   > > > > >
  >   > > > > >Can you describe your application? Was it parsing a single page when the
  >   > > > > >problem occurred?
  >   > > > > >
  >   > > > > >Regards,
  >   > > > > >Somik
  >   > > > > >----- Original Message -----
  >   > > > > >From: "Raghavender Srimantula" <kin...@ho...>
  >   > > > > >To: <htm...@li...>
  >   > > > > >Cc: <htm...@li...>
  >   > > > > >Sent: Tuesday, April 30, 2002 8:36 AM
  >   > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
  >   > > > > >and write out document
  >   > > > > >
  >   > > > > >
  >   > > > > > > Hi Somik,
  >   > > > > > > I encountered a strange problem today. While I was running htmlparser, I
  >   > > > > > > got a java.lang.OutOfMemoryError. It seems that a lot of objects are being
  >   > > > > > > allocated. Where exactly is this happening? I mean, could you give me an
  >   > > > > > > idea where, or in which file, the potential problem could be?
  >   > > > > > > Raghav
  >   > > > > > >
  >   > > > > > >
  >   > > > > > > >From: "Somik Raha" <so...@ya...>
  >   > > > > > > >Reply-To: htm...@li...
  >   > > > > > > >To: <htm...@li...>
  >   > > > > > > >CC: <htm...@li...>
  >   > > > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
  >   > > > > > > >and write out document
  >   > > > > > > >Date: Sat, 27 Apr 2002 18:22:34 +0900
  >   > > > > > > >
  >   > > > > > > >Hi Annette,
  >   > > > > > > >     Pls find attached a program to get you started. This program will do
  >   > > > > > > >what you want - you will need to modify the construct that checks for the
  >   > > > > > > >image tag - and replace it with the location of your choice.
  >   > > > > > > >     Also - I found one bug thanks to this requirement - image tag params
  >   > > > > > > >were not being correctly put in. Though it needs a deeper look, I have done
  >   > > > > > > >a quick fix for now, and all test cases are passing (with one test case in
  >   > > > > > > >HTMLImageScannerTest trapping this bug).
  >   > > > > > > >     Please check out the latest html parser source code from CVS.
  >   > > > > > > >
  >   > > > > > > >Regards,
  >   > > > > > > >Somik
  >   > > > > > > >
  >   > > > > > > >   ----- Original Message -----
  >   > > > > > > >   From: Doyle, Annette
  >   > > > > > > >   To: htm...@li...
  >   > > > > > > >   Sent: Friday, April 26, 2002 10:08 PM
  >   > > > > > > >   Subject: [Htmlparser-user] Hints on how to change image tag locations
  >   > > > > > > >and write out document
  >   > > > > > > >
  >   > > > > > > >
  >   > > > > > > >   Could you please give me some hints as to how to change only image tag
  >   > > > > > > >locations and then (or at the same time) write out the html document to
  >   > > > > > > >file (with new image tag locations)?
  >   > > > > > > >
  >   > > > > > > >
  >   > > > > > > >
  >   > > > > > > >   Thanks-
  >   > > > > > > >
  >   > > > > > > >   Annette Doyle
  >   > > > > > > >
  >   > > > > > > ><< ImageTagRetriever.java >>
  >   > > > > > > _______________________________________________
  >   > > > > > > Htmlparser-user mailing list
  >   > > > > > > Htm...@li...
  >   > > > > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
  >   > > ><< [Htmlparser-developer]Re_[Htmlparser-user]Notallimagetagsarereturned[NotaBug].eml >>


htmlparser-user — The user mailing list for users of the htmlparser library