htmlparser-developer Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2002-03-31 09:19:42
|
Hi Folks,

A major bug fix has been done. I had previously reported that the parser crashes when it encounters very dirty HTML of the form:

<A HREF="http://www.somelink.com">SomeText<A>

Instead of the end tag, a begin tag has been put in by mistake, and the parser promptly crashes.

This called for a modification to the evaluate() method, because until now the scanners had only local information about the parsing process. I've introduced a new parameter, which takes in the scanner. So if a tag is being parsed and, in the process, another tag starts being parsed, the second tag will now know that a scanner is already running.

This enables the HTMLLinkScanner to conclude that it is currently parsing a dirty HTML tag, and hence take the appropriate action (flag the scanner into a dirty mode, and return an HTMLEndTag, which is what the previous scanner expects).

This solves the bug, and finally we can handle some really crazy pages...

This fix and some others, along with some additions (META and TITLE), will make it into release 1.1 (coming soon). Currently, the latest code is available through CVS.

If any of you have written your own scanners, you will need to modify the evaluate method signature to be compatible with the new HTMLTagScanner.

Regards,
Somik |
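The mechanism described in the message above (passing the already-running scanner into evaluate() so that a second begin tag can be recognised as a misplaced end tag) can be sketched roughly as follows. This is only an illustration of the idea, not the HTMLParser source: TagScanner, LinkScanner, scan() and the "LINK_START"/"END_TAG" strings are hypothetical stand-ins, and the real evaluate() signature in HTMLTagScanner may differ.

```java
import java.util.ArrayList;
import java.util.List;

public class DirtyLinkDemo {
    /** Hypothetical stand-in for a tag scanner; not the real HTMLTagScanner API. */
    public interface TagScanner {
        // previousOpen is the scanner that is already mid-parse, or null if none is.
        String evaluate(String tagText, TagScanner previousOpen);
    }

    /** Hypothetical link scanner that tolerates <A ...>text<A>. */
    public static class LinkScanner implements TagScanner {
        public boolean dirty = false;

        @Override
        public String evaluate(String tagText, TagScanner previousOpen) {
            if (previousOpen == this) {
                // A link is already open, so this begin tag is treated as a misplaced end tag.
                dirty = true;
                return "END_TAG";
            }
            return "LINK_START";
        }
    }

    /** Feeds pre-split tags to the scanner, remembering whether a link is currently open. */
    public static List<String> scan(String[] tags, LinkScanner scanner) {
        List<String> events = new ArrayList<>();
        TagScanner open = null;
        for (String tag : tags) {
            if (tag.equalsIgnoreCase("</A>")) {
                events.add("END_TAG");
                open = null;
            } else if (tag.toUpperCase().startsWith("<A")) {
                String result = scanner.evaluate(tag, open);
                events.add(result);
                open = result.equals("LINK_START") ? scanner : null;
            }
        }
        return events;
    }

    public static void main(String[] args) {
        LinkScanner scanner = new LinkScanner();
        System.out.println(scan(new String[] {"<A HREF=\"x\">", "<A>"}, scanner));
        System.out.println("dirty mode: " + scanner.dirty);
    }
}
```

Without the extra parameter, the second call would have no way of knowing that a link is already open, which is the "local info only" limitation the message mentions.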
From: Somik R. <so...@ya...> - 2002-03-24 05:51:01
|
Dear Users,

Thanks for using HTMLParser. HTMLParser is getting some new features, namely:

[1] HTMLMetaTag scanner
[2] Support for non-".html" pages - I am planning to bring dynamic pages under the purview of the parser as well, though I might need a bit of help with this.

I would like some feedback from the user community - what features would you really like to see added to the parser (or are you quite happy with the parser as is)?

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-03-24 05:48:24
|
Hi Folks,

I am encountering a really strange scenario. Try creating a link like this in a web page:

<A HREF="...">something<A>

i.e. instead of putting a close tag </A>, put an open tag. I find that Internet Explorer renders it just fine. Now if IE renders it, then perhaps we ought to support it in HTML Parser. However, it's not so easy. Check out the latest source from CVS - I have put in a test case for this situation, which is failing (in HTMLLinkScannerTest - com.kizna.html.scannersTests).

The problem is in HTMLReader.find(), which goes into a sort of recursion: when it finds <A ...> the first time, the scanner asks it to find the remaining tags. If a second <A> is then encountered, it will try to keep parsing until the end tag is encountered, which won't happen. I need a clean, elegant way of telling the reader not to expand in exceptional situations like this one.

I can of course do it with some flags, but before I do, I was wondering if anyone has insights on this problem, and whether anyone thinks we should not support this dirty HTML even if IE does.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-03-22 16:40:58
|
Hi Folks,

Release 1.04 is out. It has the following bug fixes:

[1] Parsing JSP tags which had tags within inverted commas was causing problems.
[2] A link with no link URL would cause the parser to crash with a null pointer exception.

The above bugs were reported by Gordon Deudney and Robert Kausch. More test cases have been added.

Regards,
Somik |
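For readers wondering what goes wrong in fix [1] above: a scanner that ends a tag at the first '>' it sees will cut the tag short when a quoted string inside it contains markup. The sketch below shows one common way to avoid that, by ignoring '>' while inside quotes. It is an assumption about the general technique, not the code that went into release 1.04; the class and method names are made up for illustration.

```java
public class QuoteAwareTagEnd {
    /**
     * Returns the index of the '>' that closes the tag beginning at start,
     * skipping any '>' that appears inside single or double quotes.
     * Returns -1 if the tag is never closed.
     */
    public static int findTagEnd(String html, int start) {
        char quote = 0; // 0 means "not inside a quoted string"
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0; // closing quote found
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote; remember which kind
            } else if (c == '>') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<%= \"<b>\" + name %>rest";
        int naive = html.indexOf('>');       // stops inside the quoted "<b>"
        int aware = findTagEnd(html, 0);     // stops at the real end of the JSP tag
        System.out.println(html.substring(0, naive + 1));
        System.out.println(html.substring(0, aware + 1));
    }
}
```

The same guard does nothing for fix [2]; a link without a URL simply needs a null check before the URL is used.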
From: Somik R. <so...@ya...> - 2002-03-12 08:52:14
|
Hi Don,

It would be appreciated if you could post usage questions to the htmlparser-user mailing list (the link is at http://htmlparser.sourceforge.net).

To your query - the code you posted seems rather complex for a not-so-complex task :) Here's how you would do it in HTML Parser (in the attached code). The code I have given is the shortcut way. There is a way to get much shorter code than what I am providing you, but that requires getting into the design docs of the parser and writing a Table Scanner. Then your code could become something like this:

HTMLParser parser = new HTMLParser("http://www.nba.com");
HTMLNode node;
int tableCount = 0;
// Walk the nodes in document order and count the tables seen so far.
for (Enumeration e = parser.elements(); e.hasMoreElements();) {
    node = (HTMLNode) e.nextElement();
    // HTMLTableNode is what the Table Scanner you write would produce.
    if (node instanceof HTMLTableNode) {
        tableCount++;
        if (tableCount == 4) {
            HTMLTableNode tableNode = (HTMLTableNode) node;
            tableNode.print();
        }
    }
}

Regards,
Somik

----- Original Message -----
From: "Don Taggart" <dta...@e-...>
To: <Htm...@li...>
Sent: Tuesday, March 12, 2002 1:33 AM
Subject: [Htmlparser-developer] HTMLParser Sample App

> Hi,
> I am attempting to grab the content of a certain table on any website. For
> instance I'd like to get all of the text, tags, comments, etc contained in
> the 4rth table I run across. I've been able to do this successfully using
> the htmleditorkit in swing, but it has a few bugs.
>
> Would your HTML Parser be useful for this scenario, and If so, could you
> give me some guidance on how to start.
>
> Thanks,
> Don |
From: Don T. <dta...@e-...> - 2002-03-11 16:37:08
|
Hi,

I am attempting to grab the content of a certain table on any website. For instance, I'd like to get all of the text, tags, comments, etc. contained in the 4th table I run across. I've been able to do this successfully using the HTMLEditorKit in Swing, but it has a few bugs.

Would your HTML Parser be useful for this scenario, and if so, could you give me some guidance on how to start?

Thanks,
Don

Here's my code that goes and gets the contents of the 4th table at nba.com:

import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

/**
 * This small demo program shows how to use the
 * HTMLEditorKit.Parser and its implementing class
 * ParserDelegator in the Swing system.
 */
public class HtmlParseDemo2 {
    public static void main(String [] args) {
        Reader r;
        String host = "";
        String spec = "http://www.nba.com";
        long endTime;
        long endTime2;
        long startTime = System.currentTimeMillis();
        String snippet = "";

        try {
            if (spec.indexOf("://") > 0) {
                URL u = new URL(spec);
                host = u.getHost();
                Object content = u.getContent();

                if (content instanceof InputStream) {
                    r = new InputStreamReader((InputStream)content);
                }
                else if (content instanceof Reader) {
                    r = (Reader)content;
                }
                else {
                    throw new Exception("Bad URL content type.");
                }
            }
            else {
                r = new FileReader(spec);
            }

            endTime = System.currentTimeMillis();
            System.out.println("Time to complete connection: " + (endTime - startTime));

            HTMLEditorKit.Parser parser;
            System.out.println("About to parse " + spec);
            parser = new ParserDelegator();

            HTMLParseLister2 snippetCallback = new HTMLParseLister2(host);

            //Parse Away!
            parser.parse(r, snippetCallback, true);
            r.close();

            endTime2 = System.currentTimeMillis();
            System.out.println("Time to complete: " + (endTime2 - startTime));
        }
        catch (Exception e) {
            System.err.println("Error: " + e);
            e.printStackTrace(System.err);
        }
    }
}

/**
 * HTML parsing proceeds by calling a callback for
 * each and every piece of the HTML document. This
 * simple callback class simply prints an indented
 * structural listing of the HTML data.
 */
class HTMLParseLister2 extends HTMLEditorKit.ParserCallback
{
    int indentSize = 0;
    int tableNum = 0;
    String atts;
    String tabNum;
    String endTable;
    String tableLevel;
    Stack tableStack = new Stack();
    boolean finished = false;
    HTML.Tag selectedTag = HTML.Tag.TABLE;
    String selectedTable = Integer.toString(4);
    boolean inImportantTag = false;
    StringBuffer snippetString = new StringBuffer();

    private String host;

    public HTMLParseLister2(String host) {
        this.host = host;
    }

    public String getSnippet() {
        return snippetString.toString();
    }

    protected void indent() {
        indentSize += 4;
    }

    protected void unIndent() {
        indentSize -= 4;
        if (indentSize < 0) indentSize = 0;
    }

    protected void pIndent() {
        for(int i = 0; i < indentSize; i++) System.out.print(" ");
    }

    public void handleText(char[] data, int pos) {
        if (!tableStack.empty() && !finished)
        {
            tableLevel = (String)tableStack.peek();
            if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable)))
            {
                //pIndent();
                String str = new String(data);
                System.out.println(str);
            }
        }

        if (inImportantTag)
        {
            String str = new String(data);
            System.out.println(str);
        }
    }

    // ********************************************************
    public void handleComment(char[] data, int pos) {

        if (!tableStack.empty() && !finished)
        {
            tableLevel = (String)tableStack.peek();
            if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable)))
            {
                //pIndent();
                String str = new String(data);
                //System.out.println("<!--" + str + "-->");
                //indent();
                //pIndent();
            }
        }

        if (inImportantTag)
        {
            String str = new String(data);
            System.out.println("<!--" + str + "-->");
        }

    }
    // ********************************************************

    // ********************************************************
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // Is this Tag One of the few that we want to list outside the chosen component
        if (t == HTML.Tag.STYLE || t == HTML.Tag.LINK)
        {
            atts = listAttributes(a);
            inImportantTag = true;
            System.out.print("<" + t.toString() + " " + atts + ">");
            return;
        }

        if (t == selectedTag && !finished)
        {
            //pIndent();
            tableNum++;
            tabNum = Integer.toString(tableNum);
            tableStack.push(tabNum);
            atts = listAttributes(a);
            tableLevel = (String)tableStack.peek();
            if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable)))
            {
                //System.out.println("<Table#" + tableLevel + ">");
            }
        }

        if (!tableStack.empty() && !finished) {
            tableLevel = (String)tableStack.peek();
            if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable)))
            {
                atts = listAttributes(a);
                System.out.println("<" + t.toString() + " " + atts + ">");
            }
        }
    }
    // ********************************************************

    // ********************************************************
    public void handleEndTag(HTML.Tag t, int pos) {
        if (inImportantTag)
        {
            inImportantTag = false;
            System.out.println("</" + t.toString() + ">");
        }

        if (!tableStack.empty() && !finished)
        {
            if (t == selectedTag)
            {
                //unIndent();
                //pIndent();
                tableLevel = (String)tableStack.peek();
                if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable))){
                    System.out.println("</" + t.toString() + ">");
                }
                if (tableStack.peek().equals(selectedTable))
                    finished = true;
                endTable = (String) tableStack.pop();
            }
        }
        if (!tableStack.empty() && !finished) {
            tableLevel = (String)tableStack.peek();
            if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable)) && t != selectedTag) {
                //pIndent();
                System.out.println("</" + t.toString() + ">");
                //pIndent();
            }
        }
    }
    // ********************************************************

    // ********************************************************
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos)
    {
        if (t == HTML.Tag.LINK && !finished)
        {
            atts = listAttributes(a);
            System.out.println("<" + t.toString() + " " + atts + ">");
        }

        if (!tableStack.empty() && !finished)
        {
            atts = listAttributes(a);
            if(a.getAttribute(HTML.Attribute.ENDTAG) != null)
            {
                handleEndTag(t, pos);
                return;
            }
            //if (tableStack.peek() == selectedTable)
            //pIndent();

            tableLevel = (String)tableStack.peek();
            if (Integer.parseInt(tableLevel) >= (Integer.parseInt(selectedTable)))
                System.out.println("<" + t.toString() + " " + atts + ">");
        }
    }
    // ********************************************************

    // ********************************************************
    private String listAttributes(AttributeSet attributes) {
        Enumeration e = attributes.getAttributeNames();
        String attString = "";

        while (e.hasMoreElements()) {
            Object name = e.nextElement();
            Object value = attributes.getAttribute(name);

            if (name.toString().equals("href") || name.toString().equals("src")
                    || name.toString().equals("action"))
            {
                if (value.toString().charAt(0) == '/')
                    value = host + value;
            }
            attString = attString + name + "=\"" + value + "\" ";
        }
        return attString;
    }
    // ********************************************************

    // ********************************************************
    public void handleError(String errorMsg, int pos){
        //System.out.println("Parsing error: " + errorMsg + " at " + pos);
    }
} |
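The callback above tracks table numbers as strings on a Stack. For comparison, here is a smaller sketch of the same Swing-based idea that keeps only two counters and collects the text of the n-th table from an in-memory string. It is an illustration under assumptions, not a fix for Don's program: NthTableText, TableCallback and extract() are made-up names, it captures text only (no tags, comments or attributes), and it has not been checked against nba.com or other real-world pages.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class NthTableText {
    /** Collects text found inside the wanted table (1-based, counted by opening tag order). */
    public static class TableCallback extends HTMLEditorKit.ParserCallback {
        private final int wanted;
        private int seen = 0;   // tables opened so far outside the wanted table
        private int depth = 0;  // nesting depth while inside the wanted table, else 0
        public final StringBuilder text = new StringBuilder();

        public TableCallback(int wanted) {
            this.wanted = wanted;
        }

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t == HTML.Tag.TABLE) {
                if (depth > 0) {
                    depth++;                  // nested table inside the wanted one
                } else if (++seen == wanted) {
                    depth = 1;                // entering the wanted table
                }
            }
        }

        @Override
        public void handleEndTag(HTML.Tag t, int pos) {
            if (t == HTML.Tag.TABLE && depth > 0) {
                depth--;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (depth > 0) {
                text.append(data).append('\n');
            }
        }
    }

    /** Parses the given HTML and returns the text of its n-th table, one text run per line. */
    public static String extract(String html, int n) {
        TableCallback callback = new TableCallback(n);
        try (Reader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<table><tr><td>one</td></tr></table>"
                + "<table><tr><td>two</td></tr></table>"
                + "</body></html>";
        System.out.print(extract(html, 2));
    }
}
```

Using an integer depth instead of a stack means a nested table inside the wanted one is included in the output rather than counted as a separate table, which may or may not be what a given page needs.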
From: Somik R. <so...@ya...> - 2002-03-04 14:28:27
|
HTMLParser 1.03 has been released. It contains a bug fix in HTMLRemarkNode, which was causing the parser to crash on pages with remarks spanning more than one line. A test case for the bug has been added in HTMLRemarkNodeTest.

The release also contains the design documentation in the zip. Thanks to Serge Kruppa for pointing out the bug.

Regards,
Somik |
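To make the multi-line remark case concrete: when input is read a line at a time, a remark that opens on one line and closes on a later one has to be carried across reads. The sketch below shows one way to do that with a buffer that stays open between lines. It is an assumption about the general approach, not the HTMLRemarkNode code from release 1.03, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class MultiLineRemark {
    /** Returns the body of each <!-- ... --> remark, including remarks that span several lines. */
    public static List<String> remarks(String[] lines) {
        List<String> found = new ArrayList<>();
        StringBuilder current = null; // non-null while a remark is still open
        for (String line : lines) {
            int from = 0;
            while (true) {
                if (current == null) {
                    int open = line.indexOf("<!--", from);
                    if (open < 0) {
                        break; // no remark starts on the rest of this line
                    }
                    current = new StringBuilder();
                    from = open + 4;
                } else {
                    int close = line.indexOf("-->", from);
                    if (close < 0) {
                        // Remark continues on the next line; keep what we have so far.
                        current.append(line, from, line.length()).append('\n');
                        break;
                    }
                    current.append(line, from, close);
                    found.add(current.toString());
                    current = null;
                    from = close + 3;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String[] page = {"a <!-- first", "second --> b", "<!--x-->"};
        for (String remark : remarks(page)) {
            System.out.println("[" + remark + "]");
        }
    }
}
```

A parser that assumes the closing "-->" is always on the same line as the opening "<!--" would fail on the first two lines of this input.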