htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Rob E. <re...@ap...> - 2004-08-31 15:40:56

Thanks, Derrick. I'll download the latest version now.

Rob.

From: Derrick O. <Der...@Ro...> - 2004-08-30 22:24:40

Rob,

This may be a bug that was recently (July 28) fixed. Bug #995703 "Parser Crash" and bug #988846 "Linkbean getLinks() segmentation fault" were fixed by not testing for content type "text/XXX" in Page, but rather issuing a warning when this is discovered at the Parser level.

What's the exception message? Is it "...does not contain text"?
If so, either download a new version or remove the test in Page.java:

      type = getContentType ();
    - if (type != null && !type.startsWith ("text"))
    -     throw new ParserException (
    -         "URL "
    -         + connection.getURL ().toExternalForm ()
    -         + " does not contain text");
      charset = getCharset (type);
      try

Derrick

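The change described above (warn about a non-text content type instead of throwing) can be sketched without the library. This is only an illustration of the idea, not the actual Page.java code: the class name ContentTypeCheckSketch, the method warningFor and the example URLs are hypothetical.

```java
// Hypothetical sketch, not htmlparser code: report a non-text content type
// as a warning string instead of throwing an exception.
final class ContentTypeCheckSketch {

    /**
     * Returns a warning message when the content type is known and does not
     * start with "text"; returns null when there is nothing to warn about.
     */
    static String warningFor(String url, String contentType) {
        if (contentType != null && !contentType.startsWith("text")) {
            return "URL " + url + " does not contain text (content type " + contentType + ")";
        }
        return null;
    }

    public static void main(String[] args) {
        String warning = warningFor("http://example.com/feed.xml", "application/xml");
        if (warning != null) {
            // The caller logs the warning and carries on parsing.
            System.err.println("warning: " + warning);
        }
    }
}
```

With a check like this, a file served as application/xml would still be parsed; the caller decides what to do with the warning.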
From: Rob E. <re...@ap...> - 2004-08-30 18:09:36

Okay, I seem to have figured out how to make the parser do what I need, except for one small issue: if the files have a .xml extension, it throws an exception saying the file "does not contain text". If I append .html onto it, things work fine.

Is there a way to make the parser accept .xml as a valid file extension?

Thanks,
Rob.

From: Derrick O. <Der...@Ro...> - 2004-08-29 14:57:39

... sorry, that should be getTagName().

Derrick Oswald wrote:

> Curtney,
>
> The tag.getName () will return the matched name from your list of ids:

From: Derrick O. <Der...@Ro...> - 2004-08-29 09:18:26

Curtney,

The tag.getName () will return the matched name from your list of ids:

    for (int i = 0; i < list.size (); i++)
    {
        PublicationTag tag = (PublicationTag)list.elementAt (i);
        if (tag.getName ().equals ("TITLE"))
            ... do title processing...
        else if (tag.getName ().equals ("SUBTITLE"))
            ... do subtitle processing...
    }

I don't think the other way would work (but I didn't check).

Derrick

From: Curtney J. <c.c...@co...> - 2004-08-29 01:12:22

Yep, I tried a filter and that worked.

One more thing: is it possible to set up a general tag (i.e. PublicationTags) that would hold all tags of interest, then somehow iterate through each custom tag, extracting its content? The code would look something like the following:

    public class PublicationTags extends CompositeTag {

        public PublicationTags() {
            super();
        }

        public String[] getIds() {
            return new String[] {"TITLE", "SUBTITLE", "ABSTRACT"};
        }

    }

    NodeList list = parser.extractAllNodesThatMatch (new NodeClassFilter (PublicationTag.class));

How would I iterate through list to get the title, subtitle, and abstract content from the above?

OR

should I be thinking of something like the following instead?

    String [] publicationTags = {"TITLE", "SUBTITLE", "ABSTRACT"};
    TagFindingVisitor visitor = new TagFindingVisitor (publicationTags);
    parser.visitAllNodesWith (visitor);

    Node[] titleTags = visitor.getTags(0);
    Node[] subtitleTags = visitor.getTags(1);
    Node[] abstractTags = visitor.getTags(2);

Thanks,

_Curtney

From: Derrick O. <Der...@Ro...> - 2004-08-28 21:55:07

Curtney,

If you just want to discard the markup, I might be tempted to use a filter again...

    NodeList nl = tag.getChildren ();
    NodeList justText = nl.extractAllNodesThatMatch (new NodeClassFilter (TextNode.class));
    // you are using version 1.5 right? otherwise the class would be StringNode I think
    System.out.println (justText);

But if you want a newline wherever there's a <p>, you need to do it your way and check if it breaks flow, like <b></b> doesn't cause a newline:

    while (ni.hasMoreNodes ())
    {
        Node childNode = ni.nextNode();
        if (childNode instanceof TagNode)
        {
            if (((TagNode)childNode).breaksFlow ())
                System.out.println ();
        }
        else
            System.out.print (childNode.toPlainTextString());
    }

Derrick

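The "newline only where a tag breaks flow" logic above can be tried out without the library. The sketch below uses small stand-in types (FlowTextSketch.Text and FlowTextSketch.Tag, both hypothetical and much simpler than the htmlparser node classes) and models the children as a flat list; it only mirrors the shape of the loop, not the real API.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in model for illustration only; these are NOT the htmlparser classes.
final class FlowTextSketch {

    interface Node {}

    /** A run of plain text between tags. */
    static final class Text implements Node {
        final String text;
        Text(String text) { this.text = text; }
    }

    /** A tag; breaksFlow is true for block-level tags such as <p>, false for <b>. */
    static final class Tag implements Node {
        final String name;
        final boolean breaksFlow;
        Tag(String name, boolean breaksFlow) { this.name = name; this.breaksFlow = breaksFlow; }
    }

    /** Concatenates the text nodes, starting a new line at every flow-breaking tag. */
    static String toText(List<Node> children) {
        StringBuilder out = new StringBuilder();
        for (Node child : children) {
            if (child instanceof Tag) {
                if (((Tag) child).breaksFlow) {
                    out.append('\n');
                }
            } else if (child instanceof Text) {
                out.append(((Text) child).text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Node> children = Arrays.asList(
                new Text("First."), new Tag("b", false), new Text("bold"),
                new Tag("p", true), new Text("Second."));
        System.out.println(toText(children));
    }
}
```

In this model the <b> tag leaves the text on the same line, while the <p> tag starts a new one.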
From: Curtney J. <c.c...@co...> - 2004-08-28 21:28:15

Thanks Derrick, that worked nicely. I realize that my <abstract> tag will also contain one or more <p> tags. Currently, I am doing the following to extract the content from those tags.

Is the following an efficient way of doing this?

    if (tag instanceof CompositeTag) {
        // get a list of p tags, etc., that may contain content
        NodeList nl = tag.getChildren();
        if (nl != null) {
            NodeIterator ni = nl.elements();
            while (ni.hasMoreNodes()) {
                Node childNode = ni.nextNode();
                System.out.println ("text: " + childNode.toPlainTextString());
            }
        }
    }

From: Derrick O. <Der...@Ro...> - 2004-08-28 04:06:50

Curtney,

I think you want to create a custom composite tag and register it with the parser, since you want the contents *between* the <abstract> and the </abstract>. I would use a filter, but your TagFindingVisitor should work as well:

    class AbstractTag extends CompositeTag
    {
        public String[] getIds ()
        {
            return (new String[] { "ABSTRACT" });
        }
    }

    ...

    PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
    factory.registerTag (new AbstractTag ()); // add your custom tag
    parser.setNodeFactory (factory);
    NodeList list = parser.extractAllNodesThatMatch (new NodeClassFilter (AbstractTag.class));
    AbstractTag abstractTag = (AbstractTag)list.elementAt (0);
    TextNode text = (TextNode)abstractTag.getChildren ().elementAt (0); // might not be the first one if markup exists
    System.out.println (text.getText ());

Derrick

From: Curtney J. <c.c...@co...> - 2004-08-28 03:24:09

Greetings!

I am having problems parsing a custom tag, <abstract></abstract>. I have followed the wiki example and was still unable to parse the tag. Only the tag name is returned (i.e. abstract). The following is what my code looks like. Also, if there is a better way of writing the following code, please show me. Thanks.

    ...
    // extract the title content
    Parser parser = new Parser (f.getPath());
    Node[] nodes = parser.extractAllNodesThatAre (TitleTag.class);
    TitleTag node = (TitleTag) nodes[0];
    String title = node.getTitle();

    String[] abstractTag = {"abstract"};
    String summary = null;
    TagFindingVisitor visitor = new TagFindingVisitor (abstractTag, true);
    parser.reset();
    parser.visitAllNodesWith (visitor);
    Node [] abstractNodes = visitor.getTags(0);
    Node summaryNode = (Node)abstractNodes[0];
    summary = summaryNode.getText();
    ...

From: Derrick O. <Der...@Ro...> - 2004-08-27 22:46:36

Rob,

I haven't had any problems parsing XML with htmlparser.
An example is provided for parsing RSS feeds, which are XML:
http://htmlparser.sourceforge.net/wiki/index.php/RSSFeeds

Derrick

From: Neil A. <ne...@JA...> - 2004-08-27 21:36:23

Rob:

Your input is XML. You should use an XML parser like xerces
http://xml.apache.org/xerces2-j/index.html
to parse it.

Neil

--
Neil Aggarwal, JAMM Consulting, (972)612-6056, www.JAMMConsulting.com

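As a rough illustration of the XML-parser route suggested above, the sketch below reads the <line> elements from Rob's sample with the DOM parser that ships with the JDK (javax.xml.parsers) rather than a separately downloaded Xerces. The class name ListingXmlSketch and the method lineContents are hypothetical, and the sketch assumes the input is well-formed XML with a single root element; a file holding many <listing> blocks would need a wrapping root element first.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: pull "lineNum: text" pairs out of <line> elements.
final class ListingXmlSketch {

    static List<String> lineContents(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // getElementsByTagName returns matches in document order.
            NodeList lines = doc.getElementsByTagName("line");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < lines.getLength(); i++) {
                Element line = (Element) lines.item(i);
                result.add(line.getAttribute("lineNum") + ": " + line.getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Malformed input and I/O problems are reported unchecked to keep the sketch short.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<listing id='324' key='xyz'>"
                + "<name>random name</name>"
                + "<lineBlock heading='header1'>"
                + "<line lineNum='1'>contents of the line</line>"
                + "<line lineNum='2'>more line contents</line>"
                + "</lineBlock></listing>";
        for (String line : lineContents(xml)) {
            System.out.println(line);
        }
    }
}
```

A strict XML parser rejects input that is not well formed, which is the main trade-off against a forgiving HTML parser.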
From: Rob E. <re...@ap...> - 2004-08-27 20:51:30

I've been using the HTMLParser to parse html pages up until now (works great by the way), but I was just given a small new project to parse a set of marked up files. Basically info in tags.

The files contain blocks (many per file) like this:

    <listing id="324" key="xyz">
      <name>random name</name>
      <lineBlock heading="header1">
        <line lineNum="1">contents of the line</line>
        <line lineNum="2">more line contents</line>
      </lineBlock>
    </listing>

and so on...

I tried just re-using some of the code I was using for parsing html (added some custom tags to handle the specific tags I'm dealing with), but it didn't work at first pass. Not sure why, nothing obvious stands out.

Can I use the parser (or would the lexer be better) to do this at all? Or am I trying to fit a square peg in a round hole?

Thanks,
Rob.

From: Mihnea G. <mga...@fr...> - 2004-08-26 13:09:23

Thanks, that's all I needed :)

From: Derrick O. <Der...@Ro...> - 2004-08-25 22:17:47
|
Mihnea,

HTML parser is not really set up for synthesis, being primarily for data extraction and transformation. The answer you got from eastenchild is about as good as it gets, unless you have the nodes as a string, then the construction would be something like:

    // gather the contents of the page
    Parser parser = new Parser ("http://whatever.com");
    NodeList contents = new NodeList ();
    for (NodeIterator iterator = parser.elements (); iterator.hasMoreNodes (); )
        contents.add (iterator.nextNode ());
    // pull out the <SELECT> tag
    NodeFilter filter = new AndFilter (new TagNameFilter ("select"), new HasAttributeFilter ("name", "myselect"));
    Node select = contents.extractAllNodesThatMatch (filter).elementAt (0); // is there always exactly one?
    // create the node to be added
    String html = "<option yadda='foo' yabba='bar'>text</option>";
    Parser miniparser = new Parser (new Lexer (html));
    Node option = miniparser.elements ().nextNode ();
    // then add it to the child list with:
    select.getChildren ().add (option); // or select.getChildren ().prepend (option);
    // and print the page out with
    System.out.println (contents.toHtml ());

Derrick

Mihnea Galeteanu wrote:
>Hi,
>I was wondering if there is any way with the parser to add tags?
|
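The "parse the new tag from a string, then append it to the select's child list" idea in the message above can be tried without the htmlparser jar. The sketch below is not htmlparser code: it swaps in the JDK's own DOM parser (javax.xml.parsers), which only accepts well-formed markup, so real-world HTML would still need htmlparser or similar. The class name AppendOption and its three helper methods are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, JDK DOM only (assumes well-formed markup).
final class AppendOption {

    /** Parses well-formed markup into a DOM document. */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    /** Returns the first select element whose name attribute matches, or null. */
    static Element findSelect(Document doc, String name) {
        NodeList selects = doc.getElementsByTagName("select");
        for (int i = 0; i < selects.getLength(); i++) {
            Element candidate = (Element) selects.item(i);
            if (name.equals(candidate.getAttribute("name"))) {
                return candidate;
            }
        }
        return null;
    }

    /** Parses a one-element fragment, appends it to target, returns the option count. */
    static int appendFragment(Element target, String fragment) {
        Node parsed = parse(fragment).getDocumentElement();
        // A node from another document must be imported before it can be attached.
        Node imported = target.getOwnerDocument().importNode(parsed, true);
        target.appendChild(imported);
        return target.getElementsByTagName("option").getLength();
    }

    public static void main(String[] args) {
        Document page = parse(
            "<form><select name=\"myselect\"><option value=\"Mary\">Mary</option></select></form>");
        Element select = findSelect(page, "myselect");
        int count = appendFragment(select, "<option yadda='foo' yabba='bar'>text</option>");
        System.out.println("options under myselect: " + count); // prints: options under myselect: 2
    }
}
```

The import step plays the role of the "miniparser" in the message: the fragment is parsed on its own and only then attached to the existing tree.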
From: Mihnea G. <mga...@fr...> - 2004-08-25 13:41:39
|
Hi,
I was wondering if there is any way with the parser to add tags?

-----Original Message-----
From: htm...@li... [mailto:htm...@li...]On Behalf Of Derrick Oswald
Sent: Wednesday, August 25, 2004 5:47 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Relative link Breaks

Bikramjit,

Yes, the htmlparser has relative to absolute URL conversion. In the 1.42 version this logic is in the org.htmlparser.util.LinkProcessor class, while in the version 1.5 code stream it is encapsulated in the org.htmlparser.lexer.Page class. I recommend the 1.5 version.

If you don't need to convert from absolute to local, there is an example for transforming links on a page to local disk references in the org.htmlparser.parserapplications.SiteCapturer class that could be easily adapted to your requirements by adjusting

    String makeLocalLink (String link, String current)

to basically do nothing (i.e. simply return the link) and altering program flow to not recurse into linked pages. The META, HREF, FRAME, BASE and IMG tag handling are already included in that example.

Derrick

Bikramjit Naha wrote:
>
> Hi,
> I am a WebSphere Portal Developer and have a situation where I need to
> retrieve external web applications in small portlet windows.
> Everything would have been fine, but once the web page is retrieved
> into my portlet window and I click on a link (say an href), the relative
> link breaks. The problem is definitely because the links and images are
> set relative in the originating remote server.
> Does the html parser project have any API or method which would help
> to convert the relative URL to an absolute URL so that the link, image...
> problem is solved?
> I shall be extremely grateful if someone posts a code snippet (if the
> API supports it) along with an answer.
> (Please don't worry about the portal, as that's not important. What is
> important is that the links are breaking.)
>
> regards
>
> Bikramjit Naha
> Application Developer
> Employee Id:802662
> IBM Global Services India(Pvt) Ltd
> Work:91 33 23579120/91 33 23579110 Extn:3450
> Mobile:91 9830485394 TieLine:49964
> E-Mail:bik...@in...
> Lotus Notes:Bikramjit Naha/India/IBM@IBMIN
|
From: Derrick O. <Der...@Ro...> - 2004-08-25 09:47:15
|
Bikramjit,

Yes, the htmlparser has relative to absolute URL conversion. In the 1.42 version this logic is in the org.htmlparser.util.LinkProcessor class, while in the version 1.5 code stream it is encapsulated in the org.htmlparser.lexer.Page class. I recommend the 1.5 version.

If you don't need to convert from absolute to local, there is an example for transforming links on a page to local disk references in the org.htmlparser.parserapplications.SiteCapturer class that could be easily adapted to your requirements by adjusting

    String makeLocalLink (String link, String current)

to basically do nothing (i.e. simply return the link) and altering program flow to not recurse into linked pages. The META, HREF, FRAME, BASE and IMG tag handling are already included in that example.

Derrick

Bikramjit Naha wrote:
>
> Hi,
> I am a WebSphere Portal Developer and have a situation where I need to
> retrieve external web applications in small portlet windows.
> Everything would have been fine, but once the web page is retrieved
> into my portlet window and I click on a link (say an href), the relative
> link breaks. The problem is definitely because the links and images are
> set relative in the originating remote server.
> Does the html parser project have any API or method which would help
> to convert the relative URL to an absolute URL so that the link, image...
> problem is solved?
> I shall be extremely grateful if someone posts a code snippet (if the
> API supports it) along with an answer.
> (Please don't worry about the portal, as that's not important. What is
> important is that the links are breaking.)
>
> regards
>
> Bikramjit Naha
> Application Developer
> Employee Id:802662
> IBM Global Services India(Pvt) Ltd
> Work:91 33 23579120/91 33 23579110 Extn:3450
> Mobile:91 9830485394 TieLine:49964
> E-Mail:bik...@in...
> Lotus Notes:Bikramjit Naha/India/IBM@IBMIN
|
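For readers who only want to see what "relative to absolute" conversion does to a link, here is a minimal sketch using the JDK's java.net.URI rather than the htmlparser classes named above (LinkProcessor and Page are not used here, and their exact method signatures are not shown in this thread). The class name RelativeLinks, the method toAbsolute and the example.com URLs are made up for illustration.

```java
import java.net.URI;

// Hypothetical helper: resolves a link against the URL of the page it was found on.
final class RelativeLinks {

    /**
     * Returns the absolute form of link as a browser would interpret it on
     * the page at pageUrl. Links that are already absolute come back unchanged.
     * Throws IllegalArgumentException if either argument is not a valid URI.
     */
    static String toAbsolute(String pageUrl, String link) {
        URI base = URI.create(pageUrl);
        return base.resolve(link.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/app/index.html"; // assumed page URL
        System.out.println(toAbsolute(page, "images/logo.gif"));
        System.out.println(toAbsolute(page, "../help.html"));
        System.out.println(toAbsolute(page, "/top.html"));
        System.out.println(toAbsolute(page, "http://other.example.org/x"));
    }
}
```

In a portlet scenario like the one described, each href and src value pulled from the remote page would be passed through such a function before the markup is re-emitted. A BASE tag on the page, if present, should replace the page URL as the base.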
From: Bikramjit N. <bik...@in...> - 2004-08-25 04:15:00
|
Hi,

I am a WebSphere Portal Developer and have a situation where I need to retrieve external web applications in small portlet windows. Everything would have been fine, but once the web page is retrieved into my portlet window and I click on a link (say an href), the relative link breaks. The problem is definitely because the links and images are set relative in the originating remote server.

Does the html parser project have any API or method which would help to convert the relative URL to an absolute URL so that the link, image... problem is solved? I shall be extremely grateful if someone posts a code snippet (if the API supports it) along with an answer. (Please don't worry about the portal, as that's not important. What is important is that the links are breaking.)

regards

Bikramjit Naha
Application Developer
Employee Id:802662
IBM Global Services India(Pvt) Ltd
Work:91 33 23579120/91 33 23579110 Extn:3450
Mobile:91 9830485394 TieLine:49964
E-Mail:bik...@in...
Lotus Notes:Bikramjit Naha/India/IBM@IBMIN |
From: joseph m. <qt...@ya...> - 2004-08-21 10:32:09
|
Thanks.

--- Amol Deshmukh <Amo...@cc...> wrote:
> qt,
>
> IMO, the better approach to extracting text is using a visitor.
> Here's how you can solve your problem:
>
> 1. Use the class RegexMatchingVisitor that I have attached with this mail.
>
> 2. Then the following program demonstrates how you can extract the number you want.
> Note that for this problem I have created a regex matching visitor (which is my custom visitor),
> but depending on the exact nature of the problem you can create appropriate visitors.
>
> -------------------------- Class for testing RegexMatchingVisitor -----------------------
>
> import org.htmlparser.Parser;
> import test.util.*;
>
> /**
>  * @author Amol
>  */
> public class ExtractStringNode {
>
>     public static void main(String[] args) throws Exception {
>         Parser parser = new Parser();
>         String filename = "deleteme.html";
>
>         if (args.length > 0 && args[0] != null) {
>             filename = args[0];
>         }
>
>         // HTMLParserUtils is just a simple class that reads a file and returns the string contents.
>         // In your case you would be reading from a URL, so this statement may be irrelevant.
>         // Nevertheless, I have attached the HTMLParserUtils java file also.
>         parser.setInputHTML(HTMLParserUtils.fileToString(filename));
>
>         // create a RegexMatchingVisitor for the pattern ddd-ddd-ddd-ddd where d is a digit.
>         RegexMatchingVisitor regexMatchingVisitor = new RegexMatchingVisitor("[0-9]{3}[-][0-9]{3}[-][0-9]{3}[-][0-9]{3}", true);
>         parser.visitAllNodesWith(regexMatchingVisitor);
>
>         // iterate over all the matching ddd-ddd-ddd-ddd strings in the text nodes.
>         for (int i=0, iSize=regexMatchingVisitor.getExtractedTextList().size(); i<iSize; i++) {
>             System.out.println(regexMatchingVisitor.getExtractedTextList().get(i));
>         }
>     }
> }
>
> -------------------------- End Class for testing RegexMatchingVisitor -----------------------
>
> The class RegexMatchingVisitor is adapted from the TextExtractingVisitor class
> that is provided with the HTMLparser distribution.
>
> Hope that helps.
>
> Regards,
> ~ amol
>
> >>> qt...@ya... 8/19/2004 10:30:19 AM >>>
> how do i parse the string from this html code?
> <table>
> .
> .
> <TR align="left">
> <td width="5%"> </td>
> <TD class="notifyBody3" align="left">
> Your Taxpayer Identification Number is
> <b>300-184-335-000</b><BR><BR>
> </TD>
> </TR>
> .
> .</table>
>
> i wan to get :300-184-335-000
>
> ATTACHMENT part 2 application/octet-stream name=RegexMatchingVisitor.java
> ATTACHMENT part 3 application/octet-stream name=HTMLParserUtils.java
|
From: eastenchild <eas...@to...> - 2004-08-20 03:09:53
|
Hi,

I feel the parser is quite powerful for reading an HTML file to get the information we want, but it is not so simple to modify or add something we need. I'm not very experienced in using the parser; here is a relatively primitive approach.

    try {
        Parser parser = new Parser("test.htm"); // contains a select node
        // get all select nodes
        TagNameFilter filter = new TagNameFilter("select");
        NodeList selects = parser.extractAllNodesThatMatch(filter);
        // use a list to store the new items
        NodeList lstOptions = new NodeList();
        // the select node we want to modify
        TagNode node = null;
        for (int i = 0; i < selects.size(); i++) {
            node = (TagNode)selects.elementAt(i);
            // skip selects with another name (or with no name attribute at all)
            if (!"myselect".equalsIgnoreCase(node.getAttribute("name")))
                continue;
            // add a new option tag
            CompositeTag newNode = new CompositeTag();
            newNode.setTagName("option");
            // set the end tag
            LabelTag endTag = new LabelTag();
            endTag.setText("</option>");
            newNode.setEndTag(endTag);
            newNode.setAttribute("value", "testvalue");
            newNode.setAttribute(new Attribute("selected",null,null));
            TextNode txtNode = new TextNode("testvalue"); // testvalue or something
            // use a list to store the single text node
            NodeList lstChild = new NodeList();
            lstChild.add(txtNode);
            // now the option node is created with an option tag
            newNode.setChildren(lstChild);
            // add to the list
            // above we add one option tag; you can add more.
            // All of them should be added to the list
            lstOptions.add(newNode);
            // set as the select node's children (this replaces the existing options)
            node.setChildren(lstOptions);
            // stop here so that node still refers to the select we modified
            break;
        }
        // the following code shows the result
        Node theRoot = null;
        Node root = node.getParent();
        while (root != null) {
            theRoot = root;
            root = root.getParent();
        }
        // we can also save it to a htm file
        System.out.println(theRoot.toHtml());
    } catch (Exception ex) {
        System.out.println(ex.getMessage());
    }

The test.htm file is quite simple. It's like this:

    <body>
    <form name="form1" method="post" action="">
    <table width="200" border="1">
    <tr>
    <td>
    <select name="myselect">
    <option value="Mary" selected>Mary</option>
    <option value="Jack">Jack</option>
    </select>
    </td>
    <td> </td>
    </tr>
    <tr>
    <td> </td>
    <td> </td>
    </tr>
    </table>
    </form>
    </body>

I hope this is helpful.

Regards.

======= 2004-08-19 21:56:42 original message=======

>Let's say I have a <select name="myselect"></select> tag in my html. Based on the name "myselect" I would like to use the htmlparser to add some <option> tags under that preexisting select.
>Sorry for not being more clear.
>Thanks,
>
>-----Original Message-----
>From: htm...@li... [mailto:htm...@li...]On Behalf Of eastenchild
>Sent: Wednesday, August 18, 2004 8:53 PM
>To: htm...@li...urceforg
>Subject: Re: [Htmlparser-user] adding tags
>
>Hi, Mihnea Galeteanu
>
>I wish I could be helpful, but I don't understand your idea clearly. Perhaps you can explain it in a bit more detail.
>
>======= 2004-08-19 03:26:11 original message=======
>
>>Hi,
>>I was wondering if it is possible to add new tags to composite tags after a certain other tag?
>>Thanks,
>>
>>Mihnea Galeteanu
>>Software Developer
>>FreeBalance Inc.
>>Visit the new FreeBalance website @ www.FreeBalance.com
>>
>>Tel: (613) 236-5150 ext. 339
>>Fax: (613) 236-7785
>>mga...@Fr...
>>
>>This email message is for the sole use of the intended recipient(s) and may contain confidential and proprietary information. Any unauthorized review, use, disclosure, or distribution is prohibited. If you are not the intended recipient(s) please contact the sender by reply email and destroy all copies of the original message and any attachments.
>
>= = = = = = = = = = = = = = = = = = = =
>
>Regards.
>
>eastenchild
>eas...@to...
>2004-08-19

= = = = = = = = = = = = = = = = = = = =

Regards.

eastenchild
eas...@to...
2004-08-20 |
From: Amol D. <Amo...@cc...> - 2004-08-19 15:58:42
|
qt,

IMO, the better approach to extracting text is using a visitor. Here's how you can solve your problem:

1. Use the class RegexMatchingVisitor that I have attached with this mail.

2. Then the following program demonstrates how you can extract the number you want. Note that for this problem I have created a regex matching visitor (which is my custom visitor), but depending on the exact nature of the problem you can create appropriate visitors.

-------------------------- Class for testing RegexMatchingVisitor -----------------------

    import org.htmlparser.Parser;
    import test.util.*;

    /**
     * @author Amol
     */
    public class ExtractStringNode {

        public static void main(String[] args) throws Exception {
            Parser parser = new Parser();
            String filename = "deleteme.html";

            if (args.length > 0 && args[0] != null) {
                filename = args[0];
            }

            // HTMLParserUtils is just a simple class that reads a file and returns the string contents.
            // In your case you would be reading from a URL, so this statement may be irrelevant.
            // Nevertheless, I have attached the HTMLParserUtils java file also.
            parser.setInputHTML(HTMLParserUtils.fileToString(filename));

            // create a RegexMatchingVisitor for the pattern ddd-ddd-ddd-ddd where d is a digit.
            RegexMatchingVisitor regexMatchingVisitor = new RegexMatchingVisitor("[0-9]{3}[-][0-9]{3}[-][0-9]{3}[-][0-9]{3}", true);
            parser.visitAllNodesWith(regexMatchingVisitor);

            // iterate over all the matching ddd-ddd-ddd-ddd strings in the text nodes.
            for (int i=0, iSize=regexMatchingVisitor.getExtractedTextList().size(); i<iSize; i++) {
                System.out.println(regexMatchingVisitor.getExtractedTextList().get(i));
            }
        }
    }

-------------------------- End Class for testing RegexMatchingVisitor -----------------------

The class RegexMatchingVisitor is adapted from the TextExtractingVisitor class that is provided with the HTMLparser distribution.

Hope that helps.

Regards,
~ amol

>>> qt...@ya... 8/19/2004 10:30:19 AM >>>
how do i parse the string from this html code?

    <table>
    .
    .
    <TR align="left">
    <td width="5%"> </td>
    <TD class="notifyBody3" align="left">
    Your Taxpayer Identification Number is
    <b>300-184-335-000</b><BR><BR>
    </TD>
    </TR>
    .
    .</table>

i wan to get :300-184-335-000 |
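The attached RegexMatchingVisitor is not reproduced in the archive, so its internals are unknown. As a rough stand-in, the matching step alone can be sketched with java.util.regex on a plain string; the class name TinExtractor and the method extract are made up for illustration. Unlike a visitor that only sees text nodes, this version scans whatever string it is given, so a number split across tags would not be found.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the matching step; not the attached visitor.
final class TinExtractor {

    // ddd-ddd-ddd-ddd where d is a digit, the pattern used in the message above.
    private static final Pattern TIN =
        Pattern.compile("[0-9]{3}-[0-9]{3}-[0-9]{3}-[0-9]{3}");

    /** Returns every ddd-ddd-ddd-ddd match in text, in order of appearance. */
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = TIN.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<TD class=\"notifyBody3\" align=\"left\">"
            + "Your Taxpayer Identification Number is <b>300-184-335-000</b><BR><BR></TD>";
        System.out.println(extract(html)); // prints [300-184-335-000]
    }
}
```

With htmlparser in the picture, the same extract call would be applied to the text of each text node the visitor is handed, which keeps attribute values and tag names out of the search.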
From: joseph m. <qt...@ya...> - 2004-08-19 14:30:30
|
How do I parse the string from this HTML code?

    <table>
    .
    .
    <TR align="left">
    <td width="5%"> </td>
    <TD class="notifyBody3" align="left">
    Your Taxpayer Identification Number is
    <b>300-184-335-000</b><BR><BR>
    </TD>
    </TR>
    .
    .</table>

I want to get: 300-184-335-000 |
From: joseph m. <qt...@ya...> - 2004-08-19 14:01:18
|
What's missing in this code? I can't seem to find the StringNode class.

    package test;

    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.StringNodeFactory;
    import org.htmlparser.Text;
    import org.htmlparser.tags.CompositeTag;
    import org.htmlparser.tags.TableTag;

    public class Htmlget {
        public static void main(String[] args) {
            try {
                Parser parser = new Parser();
                Node nodes [] = parser.extractAllNodesThatAre(TableTag.class);
                // Get the first table found
                TableTag table = (TableTag)nodes[0];
                //Text[] test = table.digupStringNode("Name");
                // Find the position of Name.
                StringNode[] stringNodes = table.digupStringNode("Name");
                StringNode name = stringNodes[0];
                // We assume that the first node that matched is the one we want. We
                // navigate to its parent, the column tag <td>
                CompositeTag td = name.getParent();
                // From the parent, we shall find out the position of "Name"
                int posOfName = td.findPositionOf(name);
                // Its easy now to navigate to John Doe, as we know it is 3 positions away
                Node expectedName = td.childAt(posOfName + 3);
            } catch (Exception e) {
                System.out.println(e.toString());
            }
        }
    }
|
From: Mihnea G. <mga...@fr...> - 2004-08-19 13:56:54
|
Let's say I have a <select name="myselect"></select> tag in my html. Based on the name "myselect" I would like to use the htmlparser to add some <option> tags under that preexisting select.
Sorry for not being more clear.
Thanks,

-----Original Message-----
From: htm...@li... [mailto:htm...@li...]On Behalf Of eastenchild
Sent: Wednesday, August 18, 2004 8:53 PM
To: htm...@li...urceforg
Subject: Re: [Htmlparser-user] adding tags

Hi, Mihnea Galeteanu

I wish I could be helpful, but I don't understand your idea clearly. Perhaps you can explain it in a bit more detail.

======= 2004-08-19 03:26:11 original message=======

>Hi,
>I was wondering if it is possible to add new tags to composite tags after a certain other tag?
>Thanks,
>
>Mihnea Galeteanu
>Software Developer
>FreeBalance Inc.
>Visit the new FreeBalance website @ www.FreeBalance.com
>
>Tel: (613) 236-5150 ext. 339
>Fax: (613) 236-7785
>mga...@Fr...
>
>This email message is for the sole use of the intended recipient(s) and may contain confidential and proprietary information. Any unauthorized review, use, disclosure, or distribution is prohibited. If you are not the intended recipient(s) please contact the sender by reply email and destroy all copies of the original message and any attachments.

= = = = = = = = = = = = = = = = = = = =

Regards.

eastenchild
eas...@to...
2004-08-19 |
From: eastenchild <eas...@to...> - 2004-08-19 00:53:15
|
Hi, Mihnea Galeteanu

I wish I could be helpful, but I don't understand your idea clearly. Perhaps you can explain it in a bit more detail.

======= 2004-08-19 03:26:11 original message=======

>Hi,
>I was wondering if it is possible to add new tags to composite tags after a certain other tag?
>Thanks,
>
>Mihnea Galeteanu
>Software Developer
>FreeBalance Inc.
>Visit the new FreeBalance website @ www.FreeBalance.com
>
>Tel: (613) 236-5150 ext. 339
>Fax: (613) 236-7785
>mga...@Fr...
>
>This email message is for the sole use of the intended recipient(s) and may contain confidential and proprietary information. Any unauthorized review, use, disclosure, or distribution is prohibited. If you are not the intended recipient(s) please contact the sender by reply email and destroy all copies of the original message and any attachments.

= = = = = = = = = = = = = = = = = = = =

Regards.

eastenchild
eas...@to...
2004-08-19 |