htmlparser-developer Mailing List for HTML Parser
Brought to you by:
derrickoswald
You can subscribe to this list here.
From: <daf...@gm...> - 2016-07-20 06:37:34
|
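I have invited you to fill out the following form: Re: Have you contacted all of your target customers worldwide? To fill out this form, visit: https://docs.google.com/forms/d/e/1FAIpQLSfwXGKdDUHZ-qQjxJXk9mOibuEJPS-Bnx3-SWoB-6pRfrk-ZQ/viewform?c=0&w=1&usp=mail_form_link Google Forms: create surveys and analyze the results. |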
From: Marc P. <mar...@we...> - 2016-07-19 13:49:23
|
Hi!

I have a problem when parsing an "aside" tag. The source HTML has an aside tag with text inside, but when it is parsed, getChildren() returns null.

    String page = "<html><head></head><body><h1>Good Text 1, " + //
            "<aside class=\"test it\">Irrelevant Text A, </aside>" + //
            "<div class=\"news-footer\">Irrelevant Text B, </div>" + //
            "Good Text 2 </h1></body></html>";

    Page p = new Page(page, charset);
    Lexer l = new Lexer(p);
    Parser parser = new Parser(l);
    NodeList nodes = parser.parse(null);

    Node body = nodes.elementAt(0).getChildren().elementAt(1);
    Node h1 = body.getChildren().elementAt(0);

    assertNotNull(h1.getChildren().elementAt(1).getChildren());

The div tag, on the other hand, has children as expected. Is there anything wrong?

Thanks in advance.

Regards
--
Marc Poch |
From: <men...@gm...> - 2015-12-17 04:13:55
|
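Alibaba B2B has already worn you out, and you have lost your enthusiasm for finding customers. Do you know how much target-customer data exists worldwide for your industry? What shows up on Alibaba is probably just a drop in the bucket. Customers have to be sought out and contacted; Xiao Gao brings you a different way of developing customers, so that customers around the world notice you. Contact me on QQ to try it: 442772363 I have invited you to fill out a form. Wishing you abundant wealth and great fortune. To fill out this form, visit: https://docs.google.com/forms/d/1oFJ7zm4IiIBU7umLJu68rl2Ijz2lCE7UShU0DT71WI0/viewform?c=0&w=1&usp=mail_form_link |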
From: <ril...@gm...> - 2015-11-21 06:21:44
|
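Valid inquiries on Alibaba are becoming fewer and fewer, and trade show prices keep rising; do you also feel that the traditional B2B and trade show approach works less and less? We are the third mainstream way of developing customers, after B2B and trade shows. Receive one-to-one valid inquiries from potential customers around the world every day. We invite you to try, free of charge, a different result for your products. QQ number: 343086184 I've invited you to fill out the form Wishing you a prosperous business. To fill it out, visit: https://docs.google.com/forms/d/1Vpvg3LRy9nDm09IN8IvSFoRz5843xRAhQ5XrQ-VwS88/viewform?c=0&w=1&usp=mail_form_link |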
From: Muriel <sua...@16...> - 2015-06-11 03:09:46
|
R&B Sung1asses Just 22.10 More Than Cheap GIasses. Save Big On GIasses! Free Delivery On Order 3Pairs. www.kyogr.pw |
From: Keith F. <fog...@ya...> - 2014-05-04 12:34:07
|
Hi HTMLParser-developers,

I was hoping you may be able to help. I'm working on an MSc thesis and am looking to distinguish (Java) open source Test-First and Test-Last projects for a research experiment. To this end, I was hoping that you, as contributors to HTML Parser, would consider filling out the linked short survey (12 questions) for this project (or any other open source Java project you are involved in).

My research will compare design pattern usage in Test-First and Test-Last projects in relation to design quality and effectiveness. If you are interested in the results, you can leave your contact details with the survey and I will forward them to you.

Many thanks in advance; your help is very much appreciated.

Keith Fogarty

Link to the survey: Open Source Software Project Survey (docs.google.com) |
From: <nic...@gm...> - 2012-05-29 23:49:01
|
If you have trouble viewing or submitting this form, you can fill it out online: https://docs.google.com/spreadsheet/viewform?formkey=dDlpdS1Fb3pGU3Z5YTlVT28wcDZpd0E6MQ

xDD projects survey

Hi, my name is Nicolás Dascanio and I'm doing my software engineering thesis on TDD and ATDD. My native language is not English, so I apologize for any mistakes this survey may have. I'm looking for projects developed with xDD, and I'm asking if you could take a moment of your time to fill out this survey. If you have several projects, it would be ideal to fill out the survey once for each project. However, if the answers to all the questions are the same for every project, you can fill out the survey once.

What do I mean by TDD? The lifecycle starts with unit tests -> development -> green light -> refactor (this last step may be skipped in some cycles).

What do I mean by ATDD? It's like TDD, but the lifecycle starts with acceptance tests, and inside the cycle there are many TDD cycles.

This may be a big simplification, and the two are not mutually exclusive, but I'm interested to know whether only unit tests were used, in the sense of UTDD (unit-TDD), or whether acceptance tests guided the development.

The thesis will be written in Spanish, since that's my native language, but I'll do my best to write the conclusions (in a paper or something similar) in English, so everyone who helped me can read them. Thank you very much!

Name and email
It's not required to answer this question; the results will be completely anonymous and I won't give your personal information to anyone. I'm only asking so that I can thank you later and follow up on any question that may arise from the survey.

Project name and repository link *
Project name (or names, if there are several) and link to the repository (SVN, CVS, Git, etc.). If the project has a bug tracker, please include that link too.

In which language was it developed? *

Which methodology was used to develop it? *
I'll take into account only the core of the system, leaving out autogenerated code or UI. If you have any comment or clarification, please choose "Other" and explain.
- TDD
- ATDD
- UTDD
- BDD
- NDD
- STDD
- Other:

xDD experience *
- This was my first project with xDD
- I have already done other projects with xDD

How much did you use xDD in the development of your project? *
- Almost everything was done with it; 95 to 100% was developed with xDD
- Very much; most of it was developed with xDD
- Half of it, 50%
- Little or nothing

If you have any additional comments you can use this space
Here you can add any clarification about the previous questions. |
From: Marco Y. <yeu...@ho...> - 2011-05-24 12:02:46
|
Why don't you attach a memory profiler and see if you can identify where the leak comes from?

From: jia...@qq...
To: htm...@li...
Date: Tue, 24 May 2011 16:31:33 +0800
Subject: [Htmlparser-developer] Re: memory-leak

Is there anybody who can help me solve this memory-leak problem? Thank you very much!

------------------ Original message ------------------
From: "jiawangxi"<jia...@qq...>
Sent: Tuesday, May 24, 2011, 4:07 PM
To: "Htmlparser-developer"<Htm...@li...>
Subject: [Htmlparser-developer] memory-leak

I am using htmlparser to parse tens of thousands of web pages. After running for an hour, the program takes more than 100M, and this number keeps increasing. It seems that htmlparser is suffering from a memory leak. Has anybody encountered this problem?

_______________________________________________
Htmlparser-developer mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
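Before attaching a full profiler, a cruder first check is to log the JVM's used heap after each batch of pages and watch whether it keeps climbing. The sketch below is only that: it uses the standard `Runtime` API, the class name `HeapSampleSketch` and the three-batch loop are made up for illustration, and the place where the poster's own parsing code would run is marked as hypothetical. It does not locate a leak; a steadily rising figure only suggests that a profiler or heap dump is worth the effort.

```java
/** Crude heap sampling between batches; a rough check, not a substitute for a profiler. */
class HeapSampleSketch {

    /** Heap currently in use by this JVM, in mebibytes. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        for (int batch = 1; batch <= 3; batch++) {
            // Hypothetical: the application's own page-parsing work for this batch goes here.
            System.gc(); // only a hint to the JVM; the reading stays approximate
            System.out.println("after batch " + batch + ": " + usedHeapMiB() + " MiB in use");
        }
    }
}
```

If the number levels off after a few batches, the growth is more likely normal heap expansion than a leak.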
From: 贾. <jia...@qq...> - 2011-05-24 08:35:18
|
Is there anybody who can help me solve this memory-leak problem? Thank you very much!

------------------ Original message ------------------
From: "jiawangxi"<jia...@qq...>
Sent: Tuesday, May 24, 2011, 4:07 PM
To: "Htmlparser-developer"<Htm...@li...>
Subject: [Htmlparser-developer] memory-leak

I am using htmlparser to parse tens of thousands of web pages. After running for an hour, the program takes more than 100M, and this number keeps increasing. It seems that htmlparser is suffering from a memory leak. Has anybody encountered this problem? |
From: 贾. <jia...@qq...> - 2011-05-24 08:08:50
|
I am using htmlparser to parse tens of thousands of web pages. After running for an hour, the program takes more than 100M, and this number keeps increasing. It seems that htmlparser is suffering from a memory leak. Has anybody encountered this problem? |
From: Arafat R. <ami...@gm...> - 2010-11-12 15:15:07
|
How does the parser work? I want to write HTML tags in my language. Please help me.

--
Arafat Ur Rahman |
From: Derrick O. <der...@gm...> - 2010-09-21 17:02:39
|
This code snippet is only activated when redirection happens. The HTTP standard <http://www.w3.org/Protocols/rfc2616/rfc2616.html> defines several status codes <http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3> for redirection:

- 300 multiple choices (e.g. offer different languages)
- 301 moved permanently
- 302 found (originally temporary redirect, but now commonly used to specify redirection for an unspecified reason)
- 303 see other (e.g. for results of CGI scripts)
- 307 temporary redirect

When the limit is reached (repeated >= 20), the boolean repeat is not set and the outer loop exits.

I'm not sure what your problem is in recompiling a modified file.

On Tue, Sep 21, 2010 at 10:42 AM, john wu <wj...@gm...> wrote:

> Hello,
>
> I ran into the problem below while using htmlparser.
>
> Problem: it tries endlessly to connect to a web site to get the HTML page,
> but fails in openConnection in ConnectionManager.java.
>
> Code:
>     if ((3 == (code / 100)) && (repeated < 20))
>         if (null != (uri = getLocation (http)))
>         {
>             url = new URL (uri);
>             repeat = true;
>             repeated++;
>         }
>
> Here, if repeated >= 20, it will always use the old url, and it will stay
> stuck, unconnected, indefinitely.
>
> *Am I right?*
>
> So I tried to modify the code, but the compile failed in mvn install,
> reporting some unexpected characters \65535.
> I opened the file in Eclipse and modified it there.
> *Who can tell me what the problem is?*
>
> Thank you!
>
> Br,
>
> John Wu
 |
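The termination argument in the reply above can be tried out in isolation. The following is a toy model, not the ConnectionManager source: it assumes that `repeat` is reset to false at the top of every pass of the outer loop (the quoted snippet does not show that part), and it replaces real HTTP responses with a hypothetical lookup table of redirects. The class and method names are invented for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a redirect-following loop with a hop limit (not the HTML Parser source). */
class RedirectLimitSketch {
    static final int MAX_REDIRECTS = 20; // same limit as in the quoted snippet

    /**
     * Follows entries of the hypothetical redirects table, starting at start.
     * Returns how many redirects were followed before the loop stopped.
     */
    static int follow(String start, Map<String, String> redirects) {
        String url = start;
        int repeated = 0;
        boolean repeat;
        do {
            repeat = false; // assumed reset: the loop exits unless a redirect is taken
            String location = redirects.get(url); // stands in for a 3xx reply with a Location header
            if (location != null && repeated < MAX_REDIRECTS) {
                url = location;
                repeat = true;
                repeated++;
            }
        } while (repeat);
        return repeated;
    }

    public static void main(String[] args) {
        Map<String, String> cycle = new HashMap<>();
        cycle.put("http://a.example/", "http://b.example/");
        cycle.put("http://b.example/", "http://a.example/"); // endless redirect cycle
        System.out.println(follow("http://a.example/", cycle)); // prints 20
    }
}
```

Under that assumption, even a redirect cycle stops after 20 hops, because the guard `repeated < MAX_REDIRECTS` fails and `repeat` stays false.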
From: john wu <wj...@gm...> - 2010-09-21 08:42:14
|
Hello,

I ran into the problem below while using htmlparser.

Problem: it tries endlessly to connect to a web site to get the HTML page, but fails in openConnection in ConnectionManager.java.

Code:

    if ((3 == (code / 100)) && (repeated < 20))
        if (null != (uri = getLocation (http)))
        {
            url = new URL (uri);
            repeat = true;
            repeated++;
        }

Here, if repeated >= 20, it will always use the old url, and it will stay stuck, unconnected, indefinitely.

*Am I right?*

So I tried to modify the code, but the compile failed in mvn install, reporting some unexpected characters \65535. I opened the file in Eclipse and modified it there.
*Who can tell me what the problem is?*

Thank you!

Br,

John Wu |
From: Derrick O. <der...@gm...> - 2010-09-11 20:09:24
|
Only composite tags are nested... See http://htmlparser.sourceforge.net/faq.html#composite

So you would need to create a tag class derived from CompositeTag and add it to the node factory, as outlined.

On Fri, Sep 10, 2010 at 8:14 PM, Elliot Huntington <ell...@gm...> wrote:

> When I reviewed the output of the program a little closer I realized that
> although the HtmlParser did recognize the "thisIsAMadeUpTag" as a tag, it
> did not properly nest the tag's content as child nodes.
>
> Is this expected because the tag is not a valid html tag or is this a bug?
>
> Maybe this is what you meant in your original email, Enrique, when you asked
> which tags are "analyzed" by the html parser?
 |
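To make the "only composite tags are nested" rule concrete, here is a small stand-alone model. It is not HTML Parser code and does not use its API: the token format (`<p>`, `</p>`, plain text), the class `CompositeNestingSketch`, and the set of "composite" names are all assumptions made for illustration. It only shows the effect: content is pushed one level deeper under a tag whose name is registered as composite, and stays at the same level otherwise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model: only tags whose names are in compositeNames get nested content. */
class CompositeNestingSketch {

    /** Returns one indented line per token; tokens look like "<p>", "</p>", or plain text. */
    static List<String> indent(List<String> tokens, Set<String> compositeNames) {
        List<String> out = new ArrayList<>();
        int depth = 0;
        for (String token : tokens) {
            boolean isTag = token.startsWith("<");
            boolean isEnd = token.startsWith("</");
            String name = isTag ? token.replaceAll("[</>]", "") : "";
            boolean composite = compositeNames.contains(name);
            if (isEnd && composite && depth > 0) {
                depth--; // closing a composite tag returns to its parent's level
            }
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  ");
            }
            out.add(line.append(token).toString());
            if (isTag && !isEnd && composite) {
                depth++; // only composite tags push their content one level deeper
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<p>", "<madeup>", "text", "</madeup>", "</p>");
        Set<String> known = new HashSet<>(Arrays.asList("p"));
        for (String line : indent(tokens, known)) {
            System.out.println(line); // "text" lands beside <madeup>, not beneath it
        }
    }
}
```

Adding `"madeup"` to the set moves `text` one level deeper, which corresponds to what registering a CompositeTag subclass with the node factory is meant to achieve in the real library.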
From: Elliot H. <ell...@gm...> - 2010-09-10 18:14:36
|
When I reviewed the output of the program a little closer I realized that although the HtmlParser did recognize the "thisIsAMadeUpTag" as a tag, it did not properly nest the tag's content as child nodes.

Is this expected because the tag is not a valid html tag or is this a bug?

Maybe this is what you meant in your original email, Enrique, when you asked which tags are "analyzed" by the html parser?

Here is the output from running the program. Notice that the content of every valid html tag is nested one level deeper than its parent tag. The "thisIsAMadeUpTag" tag's "should be" children are not nested one level deeper. Is this a bug or a feature?

Tag (1[1,0],7[1,6]): html
  Txt (7[1,6],9[2,1]): \n\t
  Tag (9[2,1],15[2,7]): head
    Txt (15[2,7],18[3,2]): \n\t\t
    Tag (18[3,2],25[3,9]): title
      Txt (25[3,9],44[3,28]): Html Parser Example
      End (44[3,28],52[3,36]): /title
    Txt (52[3,36],54[4,1]): \n\t
    End (54[4,1],61[4,8]): /head
  Txt (61[4,8],63[5,1]): \n\t
  Tag (63[5,1],69[5,7]): body
    Txt (69[5,7],72[6,2]): \n\t\t
    Tag (72[6,2],75[6,5]): p
      Txt (75[6,5],81[6,11]): Hello
      Tag (81[6,11],87[6,17]): span
        Txt (87[6,17],92[6,22]): World
        End (92[6,22],99[6,29]): /span
      Txt (99[6,29],100[6,30]): !
      End (100[6,30],104[6,34]): /p
    Txt (104[6,34],107[7,2]): \n\t\t
    Tag (107[7,2],110[7,5]): p
      Tag (110[7,5],159[7,54]): thisIsAMadeUpTag name="don't try this at home!"
      Txt (159[7,54],195[7,90]): but html parser still understands it
      End (195[7,90],214[7,109]): /thisIsAMadeUpTag
      End (214[7,109],218[7,113]): /p
    Txt (218[7,113],220[8,1]): \n\t
    End (220[8,1],227[8,8]): /body
  Txt (227[8,8],228[9,0]): \n
  End (228[9,0],235[9,7]): /html

--
Elliot |
From: Elliot H. <ell...@gm...> - 2010-09-10 17:53:34
|
I don't know exactly what you mean by "analyzes." But I think the answer to your question is all of them.

Here is an example that might help you get started. You'll want to make sure you understand the various interfaces provided in the API (ie: Node, NodeFilter, etc...).

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.tags.Html;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class Example {
    public static void main(String... params) {
        // Parser parser = getParser(getHtml(), "UTF-8");
        Parser parser = getParser(getHtml());

        try {
            NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter(Html.class));
            for(int i = 0; i < list.size(); i++) {
                Html html = (Html) list.elementAt(i);
                System.out.println(html.toString());
            }
        } catch(ParserException e) {
            e.printStackTrace();
        }
    }

    private static Parser getParser(String html, String charset) {
        return new Parser(new Lexer(new Page(html, charset)));
    }

    private static Parser getParser(String html) {
        Parser parser = new Parser();
        try {
            parser.setInputHTML(html);
        } catch(ParserException e) {
            e.printStackTrace();
        }
        return parser;
    }

    private static String getHtml() {
        return new StringBuilder()
            .append("\n<html>")
            .append("\n\t<head>")
            .append("\n\t\t<title>Html Parser Example</title>")
            .append("\n\t</head>")
            .append("\n\t<body>")
            .append("\n\t\t<p>Hello <span>World</span>!</p>")
            .append("\n\t\t<thisIsAMadeUpTag name=\"don't try this at home!\">but html parser still understands it</thisIsAMadeUpTag>")
            .append("\n\t</body>")
            .append("\n</html>")
            .toString();
    }
}

On Fri, Sep 10, 2010 at 4:27 AM, Enrique Estelles <kik...@gm...> wrote:

> Hello,
>
> can anybody tell me which html tags HtmlParser analyzes in order to extract
> text from a web page???
>
> Thank you!!!

--
Elliot |
From: Enrique E. <kik...@gm...> - 2010-09-10 10:27:47
|
Hello, can anybody tell me which html tags HtmlParser analyzes in order to extract text from a web page??? Thank you!!! |
From: Sören G. <htm...@un...> - 2010-04-27 22:10:21
|
Hi,

I've written a small patch to FilterBuilder. It changes the generated Java source files to have a

    public static NodeFilter[] createFilter()

That way, you can simply drop that file into your source tree and use it without having to change anything (except the package declaration).

I hope you'll find this as useful as I do.

Greetings
Sören |
From: Derrick O. <der...@gm...> - 2010-04-18 05:23:36
|
Any NodeList can be output with toHtml(). If you've first caught the whole page in a NodeList before applying your changes, they should be reflected in the output from that list.

On Sun, Apr 18, 2010 at 3:41 AM, S Ahmed <sah...@gm...> wrote:

> Is there a way to put my HTML so I can see if the filter is applied
> correctly?
 |
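The selection discussed in this thread ("all TR nodes that have TD child nodes with the class="prod" attribute") can be checked on a small sample without the library. The sketch below deliberately swaps in a different technique: it uses the JDK's built-in XML parser and XPath rather than HTML Parser filters, so it only works on well-formed markup, and the class `ProdRowsSketch` and its sample input are made up for illustration. If memory serves, the library's own filter classes in org.htmlparser.filters (TagNameFilter, HasChildFilter, HasAttributeFilter, AndFilter) compose along the same lines, but that combination is not verified here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Stand-in for the filter logic using JDK XPath; requires well-formed markup. */
class ProdRowsSketch {

    /** Counts <tr> elements that have a direct <td class="prod"> child. */
    static int countProdRows(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            NodeList rows = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//tr[td[@class='prod']]", doc, XPathConstants.NODESET);
            return rows.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("markup could not be parsed or queried", e);
        }
    }

    public static void main(String[] args) {
        String table = "<table>"
                + "<tr><td class=\"prod\">a</td></tr>"
                + "<tr><td>b</td></tr>"
                + "<tr><td class=\"prod\">c</td></tr>"
                + "</table>";
        System.out.println(countProdRows(table)); // prints 2
    }
}
```

The XPath predicate reads the same way as the filter description: a `tr` element, restricted to those with a `td` child whose `class` attribute equals `prod`.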
From: S A. <sah...@gm...> - 2010-04-18 01:41:58
|
Is there a way to load my own HTML into FilterBuilder so I can see whether the filter is applied correctly?
|
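One way to check a filter against your own markup, independent of FilterBuilder, is to parse an HTML string and print whatever the filter accepts. The sketch below assumes the HTML Parser jar is on the classpath; the class name FilterCheck, the method matches, and the sample table are made up for illustration and do not come from this thread.

```java
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FilterCheck {

    /** Parses the given HTML string and returns every node the filter accepts. */
    public static NodeList matches(String html, NodeFilter filter) throws ParserException {
        // createParser builds a parser over an in-memory string instead of a URL.
        Parser parser = Parser.createParser(html, "UTF-8");
        return parser.extractAllNodesThatMatch(filter);
    }

    public static void main(String[] args) throws ParserException {
        // Hypothetical sample markup; paste your own page source here.
        String html = "<table>"
                + "<tr><td class=\"prod\">A</td></tr>"
                + "<tr><td>B</td></tr>"
                + "</table>";

        // Swap in the filter you want to inspect.
        NodeList found = matches(html, new TagNameFilter("tr"));

        System.out.println(found.size() + " node(s) matched");
        for (int i = 0; i < found.size(); i++) {
            System.out.println(found.elementAt(i).toHtml());
        }
    }
}
```

Printing toHtml() for each match makes it easy to see by eye whether the filter picked up the rows you expected.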
From: Derrick O. <der...@gm...> - 2010-04-14 17:06:23
|
FilterBuilder is a program that helps you build filters.
See http://htmlparser.sourceforge.net/samples.html
You can even run it online.
After playing with FilterBuilder (it has Help), you can save the filter as a small executable class.
Then include the generated code that creates the filter in your application.
|
From: S A. <sah...@gm...> - 2010-04-14 10:23:30
|
Sorry, I'm kind of new to Java; FilterBuilder seems to be a .jar?

I have this so far:

    nlRows = htmlParser.extractAllNodesThatMatch(
        new AndFilter(
            new HasChildFilter("class","prod")
        )
    );

How do I get all the <tr> elements?
|
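The call above is unlikely to compile as written: as far as I can tell, HasChildFilter wraps another NodeFilter rather than taking an attribute name and value, and AndFilter combines two filters (or an array of them). A minimal sketch of the filter described in this thread (TR nodes that have a TD child with class="prod") might look like the following. ProdRows, prodRowFilter, extract, and the sample HTML are hypothetical names for illustration, and the HTML Parser jar is assumed to be on the classpath.

```java
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class ProdRows {

    /** Builds a filter for <tr> nodes that have a direct <td class="prod"> child. */
    public static NodeFilter prodRowFilter() {
        // Inner filter: a TD tag whose class attribute is exactly "prod".
        NodeFilter prodCell = new AndFilter(
                new TagNameFilter("td"),
                new HasAttributeFilter("class", "prod"));

        // Outer filter: a TR tag with at least one child accepted by prodCell.
        return new AndFilter(
                new TagNameFilter("tr"),
                new HasChildFilter(prodCell));
    }

    /** Parses an HTML string and returns the matching rows. */
    public static NodeList extract(String html) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");
        return parser.extractAllNodesThatMatch(prodRowFilter());
    }

    public static void main(String[] args) throws ParserException {
        // Hypothetical sample: two product rows and one unrelated row.
        String html = "<table>"
                + "<tr><td class=\"prod\">First</td><td>1</td></tr>"
                + "<tr><td>Not a product row</td></tr>"
                + "<tr><td class=\"prod\">Second</td><td>2</td></tr>"
                + "</table>";

        NodeList rows = extract(html);
        for (int i = 0; i < rows.size(); i++) {
            System.out.println(rows.elementAt(i).toHtml());
        }
    }
}
```

Matching on the TR and testing its children keeps the result a list of whole rows, which is what the original question asks for; filtering on the TD alone would return only the cells.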
From: Derrick O. <der...@gm...> - 2010-04-13 05:38:45
|
You should be able to create a filter that finds all TR nodes that have TD child nodes with the class="prod" attribute.
See the FilterBuilder application.
|
From: S A. <sah...@gm...> - 2010-04-13 03:55:40
|
Hi,

I'm sending my question here because the user mailing list seems to be filled with spam.

I have an HTML page, and the part of the page that I want to focus on looks like:

    <table>
    <tr><td class="prod">....
    </tr>
    <tr><td class="prod">....
    </tr>
    <tr><td class="prod">....
    </tr>
    <tr><td class="prod">....
    </tr>
    </table>

So I want to extract all the <tr>.

I have used htmlParser.extractAllNodesThatMatch(...) in the past, but in this case it seems the only way to get a NodeList of all the <tr> groupings in this table is to use the value from the 1st <td> in each <tr> that has a class of "prod".

(class="prod" is unique to the entire HTML page).

Is this possible to do?
|
From: S A. <sah...@gm...> - 2010-04-13 03:51:11
|