htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Miguel A. M. <mig...@gm...> - 2012-08-27 08:23:18
|
Hello Ernest,

This is the function I use in order to extract the text. I hope it helps you.

    public StringBuilder textExtractor(String URL) {
        StringBuilder textInPage = null;
        try {
            Parser parser = new Parser(URL);
            TextExtractingVisitor visitor = new TextExtractingVisitor();
            parser.visitAllNodesWith(visitor);
            textInPage = new StringBuilder(visitor.getExtractedText());
        } catch (ParserException ex) {
            Logger.getLogger(HTMLAnalizer.class.getName()).log(Level.SEVERE, null, ex);
        }
        return textInPage;
    }

Regards,
Miguel

On 24 August 2012 21:14, Ernest Cronin <ern...@gm...> wrote:
> Hi,
>
> I use the parser a lot for work. one thing i've noticed is that in many
> news articles there are comment sections, and in these sections, plain
> text. but the parser doesn't pick them up. what is about the comment
> sections that make it unreadable? is there a different class i should be
> using?
>
> Thank you,
> ernest
|
From: Ernest C. <ern...@gm...> - 2012-08-24 19:14:07
|
Hi,

I use the parser a lot for work. One thing I've noticed is that many news articles have comment sections containing plain text, but the parser doesn't pick that text up. What is it about the comment sections that makes them unreadable? Is there a different class I should be using?

Thank you,
Ernest

On Wed, Aug 17, 2011 at 4:25 PM, ernest cronin <ern...@gm...> wrote:
> Hi,
>
> I have been trying to use the parser for some time and I have been unable
> to get it to do exactly what I want, which is to gather only the plaintext
> without javascript or style stuff. Here is the code I've been running:
>
>     public class Test
>     {
>         public static void main (String[] args)
>         {
>             try
>             {
>                 Parser parser = new Parser (args[0]);
>                 TextExtractingVisitor visitor = new TextExtractingVisitor();
>                 parser.visitAllNodesWith(visitor);
>                 String textInPage = visitor.getExtractedText();
>                 System.out.println(textInPage);
>             }
>             catch (ParserException pe)
>             {
>                 pe.printStackTrace ();
>             }
>         }
>     }
>
> I could really use some help with this!
>
> Thanks,
> Ernest
|
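The original request quoted in this thread (plain text without JavaScript or style content) can be illustrated without the library. The following is only a rough sketch, assuming the page is already available as a String; it uses regular expressions from the Java standard library rather than htmlparser's own classes, and the class and method names (PlainTextSketch, toPlainText) are made up for this example. It does not decode entities such as &amp;amp; and it will not cope with every malformed page.

```java
import java.util.regex.Pattern;

final class PlainTextSketch {
    // Whole <script>...</script> and <style>...</style> elements, including their bodies.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // HTML comments (<!-- ... -->).
    private static final Pattern HTML_COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Any remaining tag.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Returns the visible text of html with script and style content removed. */
    static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = HTML_COMMENT.matcher(text).replaceAll(" ");
        text = ANY_TAG.matcher(text).replaceAll(" ");
        // Collapse the runs of whitespace left behind by the removed markup.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(toPlainText(html));
    }
}
```

Script and style elements are removed first, together with their bodies, so that markup-like strings inside JavaScript are not mistaken for page text.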
From: Aniket P <ani...@gm...> - 2012-08-09 17:15:57
|
Hello,

Can anyone help me with my work? I am stuck. I am parsing a page using htmlparser. The page may contain a call to a particular function. Say the function is f(a,b,c) {a+b+c;} and it is called somewhere in the page as f(p,q,r). How can I find out that the page contains a call to 'f'? Is there any facility for identifying a call and the function it targets?

Please help, this is urgent.
|
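One possible way to approach the question above, once the script text has been pulled out of the page, is a plain text search for call sites. This is a minimal sketch using only java.util.regex, not an htmlparser feature; the names ScriptCallFinder and findCalls are hypothetical. It assumes the function name is an ordinary identifier and that the arguments contain no nested parentheses, and it cannot tell whether a match sits inside a string or a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptCallFinder {

    /**
     * Returns the argument text of each call to functionName in scriptText.
     * The declaration "function name(...)" itself is not reported as a call.
     */
    static List<String> findCalls(String scriptText, String functionName) {
        // Group 1 captures an optional "function " prefix so declarations can be skipped.
        // \b stops "f(" from matching inside a longer name such as "buf(".
        Pattern call = Pattern.compile(
                "(function\\s+)?\\b" + Pattern.quote(functionName) + "\\s*\\(([^)]*)\\)");
        List<String> arguments = new ArrayList<>();
        Matcher m = call.matcher(scriptText);
        while (m.find()) {
            if (m.group(1) == null) {
                arguments.add(m.group(2).trim());
            }
        }
        return arguments;
    }

    public static void main(String[] args) {
        String script = "function f(a,b,c){ return a+b+c; }\nf(p, q, r);";
        System.out.println(findCalls(script, "f"));
    }
}
```

An empty result means no call to the function was found in that piece of script; a full answer would need a real JavaScript parser rather than a pattern match.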
From: Miguel A. M. <mig...@gm...> - 2012-08-08 15:42:16
|
Hello AniketP,

I had the same problem, but with the bold and italic tags (<b> and <i> respectively). Here is my solution for <i> tags.

Create a class for the tag you are interested in that extends CompositeTag:

    import org.htmlparser.Node;
    import org.htmlparser.NodeFilter;
    import org.htmlparser.Parser;
    import org.htmlparser.PrototypicalNodeFactory;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.CompositeTag;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;
    import org.htmlparser.util.SimpleNodeIterator;

    public class ItalicTag extends CompositeTag {
        private static final String[] mIds = new String[] {"I"}; // Change this as appropriate

        public ItalicTag() {
        }

        public String[] getIds() {
            return (mIds);
        }

        public String[] getEnders() {
            return (mIds);
        }

        public String[] getEndTagEnders() {
            return (new String[0]);
        }
    }

In your main class:

    PrototypicalNodeFactory factory = new PrototypicalNodeFactory(); // create a factory
    factory.registerTag(new ItalicTag()); // register your new tag
    try {
        Parser parser = new Parser(URL);
        parser.setNodeFactory(factory);
        NodeList list;
        NodeFilter tagfilter = new NodeClassFilter(ItalicTag.class);
        list = parser.extractAllNodesThatMatch(tagfilter);
        for (Node node : list.toNodeArray()) {
            // extractText returns the content between the tags (<i> </i>)
            String texto = extractText(node);
        }
    } catch (ParserException ex) {
        // do something
    }

The extractText method:

    /**
     * Gets the text that is enclosed between tags. To do that, it
     * walks the child nodes of the tag recursively.
     *
     * @param studiedNode the node whose enclosed text is wanted
     * @return Text between nested tags
     */
    public String extractText(Node studiedNode) {
        String text = "";
        NodeList children = studiedNode.getChildren();
        if (children == null) {
            return text; // a node without children has no enclosed text
        }
        boolean exit = false;
        for (SimpleNodeIterator e = children.elements(); e.hasMoreNodes() && !exit;) {
            Node node = e.nextNode();
            if (node instanceof CompositeTag) {
                // descend into nested tags
                text = extractText(node);
            } else {
                if (node != null) {
                    text = node.getText();
                }
                exit = true;
            }
        }
        return text.trim();
    }

I hope this helps.

On 8 August 2012 10:57, Aniket P <ani...@gm...> wrote:
> Hello all,
> Currently I am using the htmlparser in my work. I want to extract <script>
> </script> part, and more specifically I want to extract different functions
> in <script> </script>. After that I need to execute those functions. So can
> anyone please help me how to use that??
|
From: Aniket P <ani...@gm...> - 2012-08-08 08:57:17
|
Hello all,

I am currently using htmlparser in my work. I want to extract the <script> </script> part of a page and, more specifically, the individual functions defined inside it. After that I need to execute those functions. Can anyone please help me with how to do that?
|
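As a rough illustration of the first half of this request (finding the script blocks and the functions declared in them), here is a small sketch that relies only on java.util.regex. It is not htmlparser's API; ScriptFunctionLister, scriptBodies and functionSignatures are hypothetical names, and the patterns assume well-formed script elements and simple "function name(params)" declarations. Anonymous functions are ignored, and executing the functions would additionally need a JavaScript engine, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptFunctionLister {
    // Body of each <script ...>...</script> element (group 1).
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Named declarations: group 1 is the name, group 2 the parameter list.
    private static final Pattern FUNCTION_DECL = Pattern.compile(
            "\\bfunction\\s+([A-Za-z_$][A-Za-z0-9_$]*)\\s*\\(([^)]*)\\)");

    /** Returns the text inside every script element of html, in document order. */
    static List<String> scriptBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher m = SCRIPT_BLOCK.matcher(html);
        while (m.find()) {
            bodies.add(m.group(1));
        }
        return bodies;
    }

    /** Returns "name(params)" for every named function declared in script. */
    static List<String> functionSignatures(String script) {
        List<String> signatures = new ArrayList<>();
        Matcher m = FUNCTION_DECL.matcher(script);
        while (m.find()) {
            signatures.add(m.group(1) + "(" + m.group(2).trim() + ")");
        }
        return signatures;
    }

    public static void main(String[] args) {
        String html = "<script>function f(a,b,c){return a+b+c;}</script>";
        for (String body : scriptBodies(html)) {
            System.out.println(functionSignatures(body));
        }
    }
}
```

Keeping the two steps separate means the script bodies could just as well come from the parser (for example from its script tag nodes) and only the second method would be reused.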