Thread: [Htmlparser-developer] Tags


Hello,

Can anybody tell me which HTML tags HtmlParser analyzes in order to extract
text from a web page?

Thank you!

I don't know exactly what you mean by "analyzes," but I think the answer to
your question is all of them.

Here is an example that might help you get started. You'll want to make sure
you understand the various interfaces provided in the API (e.g., Node,
NodeFilter, and so on).

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.tags.Html;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class Example {
    public static void main(String... params) {
//        Parser parser = getParser(getHtml(), "UTF-8");
        Parser parser = getParser(getHtml());

        try {
            NodeList list = parser.extractAllNodesThatMatch(
                    new NodeClassFilter(Html.class));
            for(int i = 0; i < list.size(); i++) {
                Html html = (Html) list.elementAt(i);
                System.out.println(html.toString());
            }
        } catch(ParserException e) {
            e.printStackTrace();
        }

    }

    private static Parser getParser(String html, String charset) {
        return new Parser(new Lexer(new Page(html, charset)));
    }

    private static Parser getParser(String html) {
        Parser parser = new Parser();
        try {
            parser.setInputHTML(html);
        } catch(ParserException e) {
            e.printStackTrace();
        }
        return parser;
    }

    private static String getHtml() {
        return new StringBuilder()
            .append("\n<html>")
            .append("\n\t<head>")
            .append("\n\t\t<title>Html Parser Example</title>")
            .append("\n\t</head>")
            .append("\n\t<body>")
            .append("\n\t\t<p>Hello <span>World</span>!</p>")
            .append("\n\t\t<thisIsAMadeUpTag name=\"don't try this at home!\">"
                + "but html parser still understands it</thisIsAMadeUpTag>")
            .append("\n\t</body>")
            .append("\n</html>")
            .toString();
    }
}

On Fri, Sep 10, 2010 at 4:27 AM, Enrique Estelles <kik...@gm...> wrote:

> Hello,
>
> can anybody tell me which html tags HtmlParser analyzes in order to extract
> text from a web page???
>
> Thank you!!!
>
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
>
>

-- 
Elliot

When I reviewed the output of the program more closely, I realized that
although HtmlParser did recognize "thisIsAMadeUpTag" as a tag, it did not
nest the tag's content as child nodes.

Is this expected because the tag is not a valid HTML tag, or is this a bug?

Maybe this is what you meant in your original email, Enrique, when you asked
which tags are "analyzed" by the HTML parser?

Here is the output from running the program. Notice that the contents of
every valid HTML tag are nested one level deeper than the tag itself, whereas
the would-be children of "thisIsAMadeUpTag" are not. Is this a bug or a
feature?

Tag (1[1,0],7[1,6]): html
  Txt (7[1,6],9[2,1]): \n\t
  Tag (9[2,1],15[2,7]): head
    Txt (15[2,7],18[3,2]): \n\t\t
    Tag (18[3,2],25[3,9]): title
      Txt (25[3,9],44[3,28]): Html Parser Example
      End (44[3,28],52[3,36]): /title
    Txt (52[3,36],54[4,1]): \n\t
    End (54[4,1],61[4,8]): /head
  Txt (61[4,8],63[5,1]): \n\t
  Tag (63[5,1],69[5,7]): body
    Txt (69[5,7],72[6,2]): \n\t\t
    Tag (72[6,2],75[6,5]): p
      Txt (75[6,5],81[6,11]): Hello
      Tag (81[6,11],87[6,17]): span
        Txt (87[6,17],92[6,22]): World
        End (92[6,22],99[6,29]): /span
      Txt (99[6,29],100[6,30]): !
      End (100[6,30],104[6,34]): /p
    Txt (104[6,34],107[7,2]): \n\t\t
    Tag (107[7,2],110[7,5]): p
      Tag (110[7,5],159[7,54]): thisIsAMadeUpTag name="don't try this at home!"
      Txt (159[7,54],195[7,90]): but html parser still understands it
      End (195[7,90],214[7,109]): /thisIsAMadeUpTag
      End (214[7,109],218[7,113]): /p
    Txt (218[7,113],220[8,1]): \n\t
    End (220[8,1],227[8,8]): /body
  Txt (227[8,8],228[9,0]): \n
  End (228[9,0],235[9,7]): /html

On Fri, Sep 10, 2010 at 11:53 AM, Elliot Huntington <
ell...@gm...> wrote:

> I don't know exactly what you mean by "analyzes." But I think the answer to
> your question is all of them.

-- 
Elliot

Only composite tags are nested.
See http://htmlparser.sourceforge.net/faq.html#composite
So you would need to create a tag class derived from CompositeTag and add
it to the node factory, as outlined there.
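
To illustrate why the made-up tag's text shows up as a sibling rather than a
child, here is a small self-contained toy model in Java. It is only a sketch
and not HtmlParser's implementation: the class CompositeNestingSketch, its
methods, and the regular-expression tokenizer are all hypothetical names made
up for this illustration. The only idea it borrows from the reply above is
that a tag collects children only if its name has been registered as
composite.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: this is NOT HtmlParser's code, just an illustration of
// "only registered composite tags collect children".
class CompositeNestingSketch {

    /** One entry in the toy tree: a start tag, an end tag, or a text run. */
    static final class Node {
        final String label;                       // printed form, e.g. "Tag: p"
        final String name;                        // upper-cased tag name, null for text
        final List<Node> children = new ArrayList<>();

        Node(String label, String name) {
            this.label = label;
            this.name = name;
        }
    }

    // Matches a start tag, an end tag, or a run of text.
    // Simplification: attribute values containing '>' are not handled.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|[^<]+");

    /** Builds the tree; only tags named in compositeNames (upper case) get children. */
    static Node parse(String html, Set<String> compositeNames) {
        Node root = new Node("root", null);
        Deque<Node> open = new ArrayDeque<>();    // currently open composite tags
        open.push(root);
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            Node current = open.peek();
            if (m.group(2) == null) {             // text run
                current.children.add(new Node("Txt: " + m.group(), null));
                continue;
            }
            String name = m.group(2).toUpperCase(Locale.ENGLISH);
            if (m.group(1).isEmpty()) {           // start tag
                Node tag = new Node("Tag: " + m.group(2), name);
                current.children.add(tag);
                if (compositeNames.contains(name)) {
                    open.push(tag);               // later nodes become its children
                }
            } else {                              // end tag
                current.children.add(new Node("End: /" + m.group(2), name));
                if (name.equals(current.name)) {
                    open.pop();                   // closes the open composite tag
                }
            }
        }
        return root;
    }

    /** Renders the tree with two spaces of indentation per nesting level. */
    static String render(String html, Set<String> compositeNames) {
        StringBuilder out = new StringBuilder();
        dump(parse(html, compositeNames), 0, out);
        return out.toString();
    }

    private static void dump(Node node, int depth, StringBuilder out) {
        for (Node child : node.children) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(child.label).append('\n');
            dump(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        String html = "<p><thisIsAMadeUpTag name=\"x\">some text</thisIsAMadeUpTag></p>";
        System.out.println("Only P registered as composite:");
        System.out.print(render(html, new HashSet<>(Arrays.asList("P"))));
        System.out.println("P and THISISAMADEUPTAG registered:");
        System.out.print(render(html, new HashSet<>(Arrays.asList("P", "THISISAMADEUPTAG"))));
    }
}
```

With only P registered, the text and the end tag of thisIsAMadeUpTag are
printed at the same depth as the tag itself, which is the shape seen in the
output earlier in this thread. Once the made-up name is registered as well,
they move one level deeper. In HtmlParser itself the equivalent of
"registering" is the step described in the FAQ entry linked above.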

On Fri, Sep 10, 2010 at 8:14 PM, Elliot Huntington <
ell...@gm...> wrote:

> When I reviewed the output of the program a little closer I realized that
> although the HtmlParser did recognize the "thisIsAMadeUpTag" as a tag, it
> did not properly nest the tags content as children nodes.