htmlparser-developer Mailing List for HTML Parser

Brought to you by: derrickoswald

htmlparser-developer — The developer mailing list of the htmlparser project

You can subscribe to this list here.

2001	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct (4)	Nov (1)	Dec (4)
2002	Jan (12)	Feb	Mar (7)	Apr (27)	May (14)	Jun (16)	Jul (27)	Aug (74)	Sep (1)	Oct (23)	Nov (12)	Dec (119)
2003	Jan (31)	Feb (23)	Mar (28)	Apr (59)	May (119)	Jun (10)	Jul (3)	Aug (17)	Sep (8)	Oct (38)	Nov (6)	Dec (1)
2004	Jan (4)	Feb (4)	Mar (1)	Apr (2)	May	Jun (7)	Jul (6)	Aug (1)	Sep	Oct	Nov	Dec
2005	Jan	Feb (1)	Mar	Apr (8)	May	Jun	Jul	Aug (2)	Sep (10)	Oct (4)	Nov (15)	Dec
2006	Jan	Feb (1)	Mar	Apr (4)	May (11)	Jun	Jul	Aug	Sep (2)	Oct	Nov	Dec
2007	Jan (3)	Feb (2)	Mar	Apr (2)	May	Jun	Jul (1)	Aug	Sep	Oct	Nov	Dec
2008	Jan	Feb (1)	Mar	Apr	May	Jun	Jul	Aug	Sep (5)	Oct (1)	Nov	Dec
2009	Jan	Feb (1)	Mar	Apr (2)	May	Jun (4)	Jul	Aug (1)	Sep	Oct	Nov	Dec (2)
2010	Jan (1)	Feb	Mar	Apr (8)	May	Jun	Jul	Aug	Sep (6)	Oct	Nov (1)	Dec
2011	Jan	Feb	Mar	Apr	May (3)	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2012	Jan	Feb	Mar	Apr	May (1)	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2014	Jan	Feb	Mar	Apr	May (1)	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2015	Jan	Feb	Mar	Apr (1)	May	Jun (1)	Jul	Aug	Sep	Oct	Nov (2)	Dec (1)
2016	Jan	Feb	Mar	Apr	May	Jun	Jul (2)	Aug	Sep	Oct	Nov (2)	Dec (2)


[Htmlparser-developer] Re: Have you contacted all of your target customers worldwide?

From: <daf...@gm...> - 2016-07-20 06:37:34

I've invited you to fill out the following form:
Re: Have you contacted all of your target customers worldwide?

To fill it out, visit:
https://docs.google.com/forms/d/e/1FAIpQLSfwXGKdDUHZ-qQjxJXk9mOibuEJPS-Bnx3-SWoB-6pRfrk-ZQ/viewform?c=0&w=1&usp=mail_form_link

Google Forms: create surveys and analyze the results.

[Htmlparser-developer] aside tag problem

From: Marc P. <mar...@we...> - 2016-07-19 13:49:23

Hi!

I have a problem parsing an "aside" tag. The source HTML has an aside tag
with text inside, but after parsing, getChildren() on that tag returns null.

    String page = "<html><head></head><body><h1>Good Text 1, " + //
        "<aside class=\"test it\">Irrelevant Text A, </aside>" + //
        "<div class=\"news-footer\">Irrelevant Text B, </div>" + //
        "Good Text 2 </h1></body></html>";

    Page p = new Page(page, charset);
    Lexer l = new Lexer(p);
    Parser parser = new Parser(l);
    NodeList nodes = parser.parse(null);
    Node body = nodes.elementAt(0).getChildren().elementAt(1);
    Node h1 = body.getChildren().elementAt(0);
    assertNotNull(h1.getChildren().elementAt(1).getChildren());

On the other hand, the div tag has children as expected.

Is there anything wrong?

Thanks in advance

Regards
-- 
Marc Poch

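[Htmlparser-developer] If you don't know about it, that's my fault; if you don't try it, that's yours

From: <men...@gm...> - 2015-12-17 04:13:55

Alibaba B2B has worn you out, and you have lost your passion for finding customers.

Do you know how many target customers your industry has worldwide? Those on Alibaba are probably just a drop in the ocean.

Customer development means going out to find and contact customers. Xiao Gao brings you a different way to develop customers,
so that customers around the world notice you.

Message me on QQ to try it: 442772363

I've invited you to fill out the form "Wishing you abundant wealth and great fortune". To fill it out, visit: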
https://docs.google.com/forms/d/1oFJ7zm4IiIBU7umLJu68rl2Ijz2lCE7UShU0DT71WI0/viewform?c=0&w=1&usp=mail_form_link

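[Htmlparser-developer] Develop customers while avoiding low-quality inquiries

From: <ril...@gm...> - 2015-11-21 06:21:44

Valid inquiries on Alibaba are becoming scarcer and trade shows ever more expensive. Do you also feel that traditional
B2B platforms and trade shows work less and less well?
We are the third mainstream way of developing customers, after B2B platforms and trade shows. You will receive valid
one-to-one inquiries from potential customers around the world every day.

We invite you to try it for free and see different results for your product.
QQ: 343086184

I've invited you to fill out the form "Wishing you a prosperous business". To fill it out,
visit: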
https://docs.google.com/forms/d/1Vpvg3LRy9nDm09IN8IvSFoRz5843xRAhQ5XrQ-VwS88/viewform?c=0&w=1&usp=mail_form_link

[Htmlparser-developer] Top Selection! R&B Sung1asses Only 22.39

From: Muriel <sua...@16...> - 2015-06-11 03:09:46

R&B Sung1asses Just 22.10
More Than Cheap GIasses. Save Big On GIasses!
Free Delivery On Order 3Pairs.
www.kyogr.pw

[Htmlparser-developer] HTML Parser - survey request for MSc Project

From: Keith F. <fog...@ya...> - 2014-05-04 12:34:07

Hi HTMLParser-developers,

I was hoping you may be able to help. I'm working on an MSc thesis and am
looking to distinguish (Java) open source Test-First and Test-Last projects
for a research experiment.

To this end, I was hoping that you, as contributors to HTML Parser, would
consider filling out the linked short survey (12 questions) for this project
(or any other open source Java project you are involved in).

My research will look at and compare design pattern usage in Test-First and
Test-Last projects in relation to design quality and effectiveness. If you
are interested in the results, you can leave your contact details with the
survey and I will forward them to you.

Many thanks in advance; your help is very much appreciated.

Keith Fogarty

Link to the survey:

Open Source Software Project Survey

[Htmlparser-developer] xDD projects survey

From: <nic...@gm...> - 2012-05-29 23:49:01

If you have trouble viewing or submitting this form, you can fill it out
online:
https://docs.google.com/spreadsheet/viewform?formkey=dDlpdS1Fb3pGU3Z5YTlVT28wcDZpd0E6MQ

xDD projects survey

Hi, my name is Nicolás Dascanio and I'm doing my software engineering
thesis on TDD and ATDD. My native language is not English, so I apologize
for any mistakes this survey may have. I'm looking for projects developed
with xDD, and I'm asking if you could take a moment of your time to fill out
this survey. If you have several projects, it would be ideal to fill out
the survey once for each project. However, if the answers to all the
questions are the same for every project, you can fill out the survey once.

What do I mean by TDD? The lifecycle starts with unit tests ->
development -> green light -> refactor (this last step may be skipped in
some cycles).

What do I mean by ATDD? It's like TDD, but the lifecycle starts with
acceptance tests, and inside the cycle there are many TDD cycles.

It may be a big simplification, and they are not mutually exclusive, but
I'm interested to know whether only unit tests were used, in the sense of
UTDD (unit-TDD), or whether acceptance tests guided the development. The
thesis will be written in Spanish, since that's my native language, but
I'll do my best to write the conclusions (in a paper or something) in
English, so everyone who helped me can read them. Thank you very much!

Name and email
It's not required to answer this question, the results will be completely
anonymous and I won't give your personal information to anyone. I'm just
asking to thank you later and ask any question that may arise from the
survey.

Project Name and repository link *
Project name (or names if there are several) and link to the repository
(SVN, CVS, GIT, etc). If the project has a bug tracker, please put the
link too.

In which language was it developed? *

Which methodology was used to develop it? *
I'll take into account only the core of the system, leaving out
autogenerated code or UI. If you have any comment or clarification please
put "other" and explain

TDD
ATDD
UTDD
BDD
NDD
STDD
Other:

xDD experience *

This was my first project with xDD
I have already done other projects with xDD

How much did you use xDD in the development of your project? *

Almost everything was done with it, 95 to 100% was developed with xDD
Very much, most of it was developed with xDD
Half of it, 50%
Little or nothing

If you have any additional comments you can use this space
You can make here any clarification about the previous questions

Re: [Htmlparser-developer] Re: memory-leak

From: Marco Y. <yeu...@ho...> - 2011-05-24 12:02:46

Why don't you attach a memory profiler and see if you can identify where the leak comes from?
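
As a cheap first check before attaching a full profiler, the used heap can be logged between batches of pages. The sketch below is only an illustration: the class name HeapWatchSketch and the batch loop are hypothetical, and the parsing step is left as a comment. If the figure keeps climbing from batch to batch even after a garbage-collection hint, the program is probably still holding references somewhere, and a profiler can then show which objects are retained.

```java
// Illustrative only: names here are hypothetical, not part of htmlparser.
final class HeapWatchSketch {

    /** Returns the heap currently in use by this JVM, in bytes. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        for (int batch = 1; batch <= 3; batch++) {
            // Hypothetical step: parse one batch of pages here.

            // System.gc() is only a hint to the JVM, so treat the numbers
            // as a rough trend rather than an exact measurement.
            System.gc();
            System.out.printf("after batch %d: %d KiB in use%n",
                    batch, usedHeapBytes() / 1024);
        }
    }
}
```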
 


From: jia...@qq...
To: htm...@li...
Date: Tue, 24 May 2011 16:31:33 +0800
Subject: [Htmlparser-developer] Re: memory-leak


Is there anybody who can help me to solve this memory-leak problem? Thank you very much!

 
 

------------------ Original Message ------------------

From: "jiawangxi"<jia...@qq...>;
Sent: Tuesday, May 24, 2011, 4:07 PM
To: "Htmlparser-developer"<Htm...@li...>;

Subject: [Htmlparser-developer] memory-leak

I am using htmlparser to parse tens of thousands of web pages. After running for an hour, the program uses more than 100 MB of memory, and this number keeps increasing. It seems that htmlparser is suffering from a memory leak. Has anybody encountered this problem?
------------------------------------------------------------------------------
vRanger cuts backup time in half-while increasing security.
With the market-leading solution for virtual backup and recovery,
you get blazing-fast, flexible, and affordable data protection.
Download your free trial now. http://p.sf.net/sfu/quest-d2dcopy1
_______________________________________________
Htmlparser-developer mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-developer

[Htmlparser-developer] Re: memory-leak

From: 贾. <jia...@qq...> - 2011-05-24 08:35:18

Is there anybody who can help me to solve this memory-leak problem? Thank you very much!
   
  
------------------ Original Message ------------------
From: "jiawangxi"<jia...@qq...>;
Sent: Tuesday, May 24, 2011, 4:07 PM
To: "Htmlparser-developer"<Htm...@li...>;

Subject: [Htmlparser-developer] memory-leak

I am using htmlparser to parse tens of thousands of web pages. After running for an hour, the program uses more than 100 MB of memory, and this number keeps increasing. It seems that htmlparser is suffering from a memory leak. Has anybody encountered this problem?

[Htmlparser-developer] memory-leak

From: 贾. <jia...@qq...> - 2011-05-24 08:08:50

I am using htmlparser to parse tens of thousands of web pages. After running for an hour, the program uses more than 100 MB of memory, and this number keeps increasing. It seems that htmlparser is suffering from a memory leak. Has anybody encountered this problem?

[Htmlparser-developer] i want to know

From: Arafat R. <ami...@gm...> - 2010-11-12 15:15:07

I want to know how the parser works, and I want to write HTML tags in my
own language. Please help me.
-- 
Arafat   Ur   Rahman

Re: [Htmlparser-developer] unlimited unconnected status in ConnectionManager.java

From: Derrick O. <der...@gm...> - 2010-09-21 17:02:39

This code snippet is only activated when redirection happens.

The HTTP standard <http://www.w3.org/Protocols/rfc2616/rfc2616.html> defines
several status codes <http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3>
for redirection:

   - 300 multiple choices (e.g. offer different languages)
   - 301 moved permanently
   - 302 found (originally temporary redirect, but now commonly used to
   specify redirection for unspecified reason)
   - 303 see other (e.g. for results of cgi-scripts)
   - 307 temporary redirect

When the limit is reached (repeated >= 20) the boolean value of repeat is
not set and the outer loop exits.
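
The exit condition can be checked with a small stand-alone model of the loop. This is a sketch, not the ConnectionManager source: it assumes the outer loop is a do/while that clears repeat at the top of each pass (as the description above implies), it replaces the real connection with a hypothetical statusForAttempt function, and it leaves out the getLocation check from the quoted snippet.

```java
import java.util.function.IntUnaryOperator;

// Stand-alone model; RedirectLimitSketch and statusForAttempt are hypothetical names.
final class RedirectLimitSketch {

    static final int MAX_REDIRECTS = 20;

    /**
     * Simulates the connect loop and returns how many connection attempts
     * were made before it exited. statusForAttempt maps the zero-based
     * attempt number to the HTTP status code the fake server answers with.
     */
    static int attempts(IntUnaryOperator statusForAttempt) {
        int repeated = 0;
        int attempts = 0;
        boolean repeat;
        do {
            repeat = false; // assumed: cleared on every pass
            int code = statusForAttempt.applyAsInt(attempts);
            attempts++;
            // Same condition as in the quoted snippet: 3xx and under the cap.
            if ((3 == (code / 100)) && (repeated < MAX_REDIRECTS)) {
                repeat = true;
                repeated++;
            }
        } while (repeat);
        return attempts;
    }

    public static void main(String[] args) {
        // A server that redirects forever: 20 redirects are followed,
        // then the 21st response ends the loop.
        System.out.println(attempts(n -> 302)); // prints 21
        // A server that answers 200 immediately.
        System.out.println(attempts(n -> 200)); // prints 1
    }
}
```

Under these assumptions the loop always terminates, because once repeated reaches 20 the condition is false and repeat stays false.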

I'm not sure what your problem is in recompiling a modified file.

On Tue, Sep 21, 2010 at 10:42 AM, john wu <wj...@gm...> wrote:

> Hello,
>
> I ran into the following problem while using htmlparser.
>
> Problem: it tries endlessly to connect to a web site to get the HTML page,
> but fails in openConnection in ConnectionManager.java.
>
> Code:
>                          if ((3 == (code / 100)) && (repeated < 20))
>                                 if (null != (uri = getLocation (http)))
>                                 {
>                                     url = new URL (uri);
>                                     repeat = true;
>                                     repeated++;
>                                 }
>
> Here, if repeated >= 20, it will always use the old URL, and it will stay
> in an endless unconnected state.
>
> *Am I right?*
>
> So I tried to modify the code, but compilation failed during mvn install
> with a message about unexpected characters \65535.
> I had opened and modified the file in Eclipse.
> *Can anyone tell me what the problem is?*
>
> Thank you!
>
> Br,
>
> John Wu

[Htmlparser-developer] unlimited unconnected status in ConnectionManager.java

From: john wu <wj...@gm...> - 2010-09-21 08:42:14

Hello,

I ran into the following problem while using htmlparser.

Problem: it tries endlessly to connect to a web site to get the HTML page,
but fails in openConnection in ConnectionManager.java.

Code:
                         if ((3 == (code / 100)) && (repeated < 20))
                                if (null != (uri = getLocation (http)))
                                {
                                    url = new URL (uri);
                                    repeat = true;
                                    repeated++;
                                }

Here, if repeated >= 20, it will always use the old URL, and it will stay
in an endless unconnected state.

*Am I right?*

So I tried to modify the code, but compilation failed during mvn install
with a message about unexpected characters \65535.
I had opened and modified the file in Eclipse.
*Can anyone tell me what the problem is?*

Thank you!

Br,

John Wu

Re: [Htmlparser-developer] Tags

From: Derrick O. <der...@gm...> - 2010-09-11 20:09:24

Only composite tags are nested...
See http://htmlparser.sourceforge.net/faq.html#composite
So you would need to create a tag class derived from CompositeTag and add
it to the node factory, as outlined.
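
To make the difference visible, here is a toy model of the behaviour. It is not htmlparser code and none of its names come from the htmlparser API: the class CompositeNestingSketch, its Node type and the compositeNames set are all made up for illustration. The only assumption carried over is the one stated above, that a tag collects children only when it is registered as composite; any other tag stays a leaf, and its text and end tag become siblings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these names come from the htmlparser API.
final class CompositeNestingSketch {

    /** A tree node: a label plus whatever children were nested under it. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }
    }

    /**
     * Builds a tree from flat tokens ("<name>", "</name>", or plain text).
     * Only start tags whose name is in compositeNames collect children.
     */
    static Node build(List<String> tokens, Set<String> compositeNames) {
        Node root = new Node("root");
        Deque<Node> open = new ArrayDeque<>();
        open.push(root);
        for (String token : tokens) {
            Node current = open.peek();
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                current.children.add(new Node("End: /" + name));
                // Close the current tag only if this end tag matches it.
                if (current.label.equals("Tag: " + name)) {
                    open.pop();
                }
            } else if (token.startsWith("<")) {
                String name = token.substring(1, token.length() - 1);
                Node tag = new Node("Tag: " + name);
                current.children.add(tag);
                // Unregistered tags stay leaves; later nodes become siblings.
                if (compositeNames.contains(name)) {
                    open.push(tag);
                }
            } else {
                current.children.add(new Node("Txt: " + token));
            }
        }
        return root;
    }

    /** Prints the tree with two spaces of indentation per level. */
    static void print(Node node, String indent) {
        System.out.println(indent + node.label);
        for (Node child : node.children) {
            print(child, indent + "  ");
        }
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("<p>", "<madeUp>", "some text", "</madeUp>", "</p>");
        // "madeUp" not registered: its text and end tag are siblings under p.
        print(build(tokens, new HashSet<>(Arrays.asList("p"))), "");
        // "madeUp" registered: its text and end tag nest one level deeper.
        print(build(tokens, new HashSet<>(Arrays.asList("p", "madeUp"))), "");
    }
}
```

The first tree printed by main has the same shape as the program output quoted below; the second shows the nesting one would expect once the tag is registered.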

On Fri, Sep 10, 2010 at 8:14 PM, Elliot Huntington <
ell...@gm...> wrote:

> When I reviewed the output of the program a little more closely, I realized
> that although HtmlParser did recognize "thisIsAMadeUpTag" as a tag, it did
> not properly nest the tag's content as child nodes.
>
> Is this expected because the tag is not a valid HTML tag, or is this a bug?
>
> Maybe this is what you meant in your original email, Enrique, when you asked
> which tags are "analyzed" by the HTML parser?
>
> Here is the output from running the program. Notice that all the valid HTML
> tags are nested one level deeper than their parent tag. The children of the
> "thisIsAMadeUpTag" tag, however, are not nested one level deeper. Is this a
> bug or a feature?
>
> Tag (1[1,0],7[1,6]): html
>   Txt (7[1,6],9[2,1]): \n\t
>   Tag (9[2,1],15[2,7]): head
>     Txt (15[2,7],18[3,2]): \n\t\t
>     Tag (18[3,2],25[3,9]): title
>       Txt (25[3,9],44[3,28]): Html Parser Example
>       End (44[3,28],52[3,36]): /title
>     Txt (52[3,36],54[4,1]): \n\t
>     End (54[4,1],61[4,8]): /head
>   Txt (61[4,8],63[5,1]): \n\t
>   Tag (63[5,1],69[5,7]): body
>     Txt (69[5,7],72[6,2]): \n\t\t
>     Tag (72[6,2],75[6,5]): p
>       Txt (75[6,5],81[6,11]): Hello
>       Tag (81[6,11],87[6,17]): span
>         Txt (87[6,17],92[6,22]): World
>         End (92[6,22],99[6,29]): /span
>       Txt (99[6,29],100[6,30]): !
>       End (100[6,30],104[6,34]): /p
>     Txt (104[6,34],107[7,2]): \n\t\t
>     Tag (107[7,2],110[7,5]): p
>       Tag (110[7,5],159[7,54]): thisIsAMadeUpTag name="don't try this at home!"
>       Txt (159[7,54],195[7,90]): but html parser still understands it
>       End (195[7,90],214[7,109]): /thisIsAMadeUpTag
>       End (214[7,109],218[7,113]): /p
>     Txt (218[7,113],220[8,1]): \n\t
>     End (220[8,1],227[8,8]): /body
>   Txt (227[8,8],228[9,0]): \n
>   End (228[9,0],235[9,7]): /html

Re: [Htmlparser-developer] Tags

From: Elliot H. <ell...@gm...> - 2010-09-10 18:14:36

When I reviewed the output of the program a little more closely, I realized
that although HtmlParser did recognize "thisIsAMadeUpTag" as a tag, it did
not properly nest the tag's content as child nodes.

Is this expected because the tag is not a valid HTML tag, or is this a bug?

Maybe this is what you meant in your original email, Enrique, when you asked
which tags are "analyzed" by the HTML parser?

Here is the output from running the program. Notice that all the valid HTML
tags are nested one level deeper than their parent tag. The children of the
"thisIsAMadeUpTag" tag, however, are not nested one level deeper. Is this a
bug or a feature?

Tag (1[1,0],7[1,6]): html
  Txt (7[1,6],9[2,1]): \n\t
  Tag (9[2,1],15[2,7]): head
    Txt (15[2,7],18[3,2]): \n\t\t
    Tag (18[3,2],25[3,9]): title
      Txt (25[3,9],44[3,28]): Html Parser Example
      End (44[3,28],52[3,36]): /title
    Txt (52[3,36],54[4,1]): \n\t
    End (54[4,1],61[4,8]): /head
  Txt (61[4,8],63[5,1]): \n\t
  Tag (63[5,1],69[5,7]): body
    Txt (69[5,7],72[6,2]): \n\t\t
    Tag (72[6,2],75[6,5]): p
      Txt (75[6,5],81[6,11]): Hello
      Tag (81[6,11],87[6,17]): span
        Txt (87[6,17],92[6,22]): World
        End (92[6,22],99[6,29]): /span
      Txt (99[6,29],100[6,30]): !
      End (100[6,30],104[6,34]): /p
    Txt (104[6,34],107[7,2]): \n\t\t
    Tag (107[7,2],110[7,5]): p
      Tag (110[7,5],159[7,54]): thisIsAMadeUpTag name="don't try this at home!"
      Txt (159[7,54],195[7,90]): but html parser still understands it
      End (195[7,90],214[7,109]): /thisIsAMadeUpTag
      End (214[7,109],218[7,113]): /p
    Txt (218[7,113],220[8,1]): \n\t
    End (220[8,1],227[8,8]): /body
  Txt (227[8,8],228[9,0]): \n
  End (228[9,0],235[9,7]): /html





-- 
Elliot

Re: [Htmlparser-developer] Tags

From: Elliot H. <ell...@gm...> - 2010-09-10 17:53:34

I don't know exactly what you mean by "analyzes," but I think the answer to
your question is all of them.

Here is an example that might help you get started. You'll want to make sure
you understand the various interfaces provided in the API (e.g. Node,
NodeFilter, etc.).

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.tags.Html;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class Example {
    public static void main(String... params) {
//        Parser parser = getParser(getHtml(), "UTF-8");
        Parser parser = getParser(getHtml());

        try {
            NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter(Html.class));
            for(int i = 0; i < list.size(); i++) {
                Html html = (Html) list.elementAt(i);
                System.out.println(html.toString());
            }
        } catch(ParserException e) {
            e.printStackTrace();
        }

    }

    private static Parser getParser(String html, String charset) {
        return new Parser(new Lexer(new Page(html, charset)));
    }

    private static Parser getParser(String html) {
        Parser parser = new Parser();
        try {
            parser.setInputHTML(html);
        } catch(ParserException e) {
            e.printStackTrace();
        }
        return parser;
    }

    private static String getHtml() {
        return new StringBuilder()
            .append("\n<html>")
            .append("\n\t<head>")
            .append("\n\t\t<title>Html Parser Example</title>")
            .append("\n\t</head>")
            .append("\n\t<body>")
            .append("\n\t\t<p>Hello <span>World</span>!</p>")
            .append("\n\t\t<thisIsAMadeUpTag name=\"don't try this at home!\">but html parser still understands it</thisIsAMadeUpTag>")
            .append("\n\t</body>")
            .append("\n</html>")
            .toString();
    }
}




On Fri, Sep 10, 2010 at 4:27 AM, Enrique Estelles <kik...@gm...> wrote:

> Hello,
>
> can anybody tell me which html tags HtmlParser analyzes in order to extract
> text from a web page???
>
> Thank you!!!


-- 
Elliot

[Htmlparser-developer] Tags

From: Enrique E. <kik...@gm...> - 2010-09-10 10:27:47

Hello,

can anybody tell me which html tags HtmlParser analyzes in order to extract
text from a web page???

Thank you!!!

[Htmlparser-developer] [patch] Filterbuilder

From: Sören G. <htm...@un...> - 2010-04-27 22:10:21

Attachments: filterbuilder.patch

Hi,

I've written a small patch to Filterbuilder. It changes the generated
Java source files to have a
public static NodeFilter[] createFilter()
method. That way, you can simply drop that file into your source tree and
use it without having to change anything (except the package declaration).

I hope you'll find this as useful as I do.

Greetings

Sören