Thread: [Htmlparser-developer] Tags
Brought to you by:
derrickoswald
From: Enrique E. <kik...@gm...> - 2010-09-10 10:27:47
|
Hello, can anybody tell me which HTML tags HtmlParser analyzes in order to extract text from a web page? Thank you! |
From: Elliot H. <ell...@gm...> - 2010-09-10 17:53:34
|
I don't know exactly what you mean by "analyzes," but I think the answer to your question is: all of them.

Here is an example that might help you get started. Make sure you understand the interfaces the API provides (e.g., Node, NodeFilter).

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.tags.Html;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class Example {
    public static void main(String... params) {
        // Alternative that passes the charset explicitly:
        // Parser parser = getParser(getHtml(), "UTF-8");
        Parser parser = getParser(getHtml());

        try {
            // Collect every <html> node; printing one dumps its whole subtree.
            NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter(Html.class));
            for (int i = 0; i < list.size(); i++) {
                Html html = (Html) list.elementAt(i);
                System.out.println(html.toString());
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }

    // Builds a parser over an in-memory page with an explicit charset.
    private static Parser getParser(String html, String charset) {
        return new Parser(new Lexer(new Page(html, charset)));
    }

    // Builds a parser from an HTML string.
    private static Parser getParser(String html) {
        Parser parser = new Parser();
        try {
            parser.setInputHTML(html);
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return parser;
    }

    // Sample document containing one tag that is not part of HTML.
    private static String getHtml() {
        return new StringBuilder()
            .append("\n<html>")
            .append("\n\t<head>")
            .append("\n\t\t<title>Html Parser Example</title>")
            .append("\n\t</head>")
            .append("\n\t<body>")
            .append("\n\t\t<p>Hello <span>World</span>!</p>")
            .append("\n\t\t<thisIsAMadeUpTag name=\"don't try this at home!\">but html parser still understands it</thisIsAMadeUpTag>")
            .append("\n\t</body>")
            .append("\n</html>")
            .toString();
    }
}

-- Elliot |
From: Elliot H. <ell...@gm...> - 2010-09-10 18:14:36
|
When I reviewed the output of the program a little more closely, I realized that although HtmlParser did recognize "thisIsAMadeUpTag" as a tag, it did not nest the tag's content as child nodes.

Is this expected because the tag is not a valid HTML tag, or is this a bug?

Enrique, maybe this is what you meant in your original email when you asked which tags are "analyzed" by the HTML parser?

Here is the output from running the program. Notice that the children of every valid HTML tag are nested one level deeper than their parent tag. The nodes that should be children of "thisIsAMadeUpTag" are not nested one level deeper. Is this a bug or a feature?

Tag (1[1,0],7[1,6]): html
  Txt (7[1,6],9[2,1]): \n\t
  Tag (9[2,1],15[2,7]): head
    Txt (15[2,7],18[3,2]): \n\t\t
    Tag (18[3,2],25[3,9]): title
      Txt (25[3,9],44[3,28]): Html Parser Example
      End (44[3,28],52[3,36]): /title
    Txt (52[3,36],54[4,1]): \n\t
    End (54[4,1],61[4,8]): /head
  Txt (61[4,8],63[5,1]): \n\t
  Tag (63[5,1],69[5,7]): body
    Txt (69[5,7],72[6,2]): \n\t\t
    Tag (72[6,2],75[6,5]): p
      Txt (75[6,5],81[6,11]): Hello
      Tag (81[6,11],87[6,17]): span
        Txt (87[6,17],92[6,22]): World
        End (92[6,22],99[6,29]): /span
      Txt (99[6,29],100[6,30]): !
      End (100[6,30],104[6,34]): /p
    Txt (104[6,34],107[7,2]): \n\t\t
    Tag (107[7,2],110[7,5]): p
      Tag (110[7,5],159[7,54]): thisIsAMadeUpTag name="don't try this at home!"
      Txt (159[7,54],195[7,90]): but html parser still understands it
      End (195[7,90],214[7,109]): /thisIsAMadeUpTag
      End (214[7,109],218[7,113]): /p
    Txt (218[7,113],220[8,1]): \n\t
    End (220[8,1],227[8,8]): /body
  Txt (227[8,8],228[9,0]): \n
  End (228[9,0],235[9,7]): /html

-- Elliot |
From: Derrick O. <der...@gm...> - 2010-09-11 20:09:24
|
Only composite tags are nested. See http://htmlparser.sourceforge.net/faq.html#composite

So you would need to create a tag class derived from CompositeTag and add it to the node factory, as outlined there. |
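The nesting rule in the reply above can be illustrated with a small toy model that does not use the HtmlParser library at all. Everything in it (CompositeNestingSketch, Node, parse, render) is a hypothetical name made up for this sketch, and the tokenizer is deliberately naive. It only mimics the behavior seen in the output earlier in the thread: a tag collects children only when its name has been registered as composite; otherwise its text and end tag stay siblings of it. In HtmlParser itself, the corresponding step is the CompositeTag subclass added to the node factory, as the FAQ describes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompositeNestingSketch {

    /** Toy tree node: kind is "Root", "Tag", "Txt" or "End". */
    static final class Node {
        final String kind;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Naive tokenizer: either a whole tag or a run of text (assumes no '>' inside attributes).
    private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^<]+");

    /** Extracts the lower-cased tag name from a token such as "<p class=x>" or "</p>". */
    static String tagName(String token) {
        String body = token.replaceAll("^</?|>$", "").trim();
        return body.split("\\s+")[0].toLowerCase(Locale.ROOT);
    }

    /** Builds a tree in which only tags named in composites receive children. */
    static Node parse(String html, Set<String> composites) {
        Node root = new Node("Root", "");
        Deque<Node> open = new ArrayDeque<>();
        open.push(root);
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String token = m.group();
            Node current = open.peek();
            if (!token.startsWith("<")) {
                current.children.add(new Node("Txt", token));
            } else if (token.startsWith("</")) {
                String name = tagName(token);
                current.children.add(new Node("End", "/" + name));
                // Close the enclosing tag only if this end tag matches it.
                if (current != root && current.text.equals(name)) {
                    open.pop();
                }
            } else {
                String name = tagName(token);
                Node tag = new Node("Tag", name);
                current.children.add(tag);
                if (composites.contains(name)) {
                    open.push(tag); // later nodes become children of this tag
                }
            }
        }
        return root;
    }

    /** Renders the tree with two spaces of indentation per nesting level. */
    static String render(Node node, int depth) {
        StringBuilder out = new StringBuilder();
        for (Node child : node.children) {
            out.append("  ".repeat(depth)).append(child.kind).append(": ")
               .append(child.text).append('\n');
            out.append(render(child, depth + 1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p><thisIsAMadeUpTag>text</thisIsAMadeUpTag></p>";
        // Made-up tag not registered: its text and end tag stay at the same level.
        System.out.print(render(parse(html, Set.of("p")), 0));
        System.out.println("--- made-up tag registered as composite ---");
        System.out.print(render(parse(html, Set.of("p", "thisisamadeuptag")), 0));
    }
}
```

In the first run the made-up tag, its text, and its end tag are all printed at the same depth under p, which is the shape Elliot observed. In the second run the text and end tag move one level deeper, under the made-up tag.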