htmlparser-developer Mailing List for HTML Parser
Brought to you by:
derrickoswald
You can subscribe to this list here.
From: Derrick O. <der...@au...> - 2003-09-29 20:00:13
Peter,

Yes, you have permission. In fact we would be honoured and endeavor to assist you in any way necessary.

It's funny you should mention images and DOM. The latest version of htmlparser includes an example application that does a very similar task: getting the images behind thumbnails (see lib/thumbelina.jar or package org.htmlparser.lexerapplications.thumbelina). It uses the low level Lexer package to avoid having to form the entire document model. I would check to see if something like this meets your needs.

If you need more than that (i.e. table parsing, balancing end tags, etc.) you'll have to go with the full parser. Unfortunately, the Lexer hasn't been completely integrated into the parser yet and the current CVS snapshot is a bit of a mess. With a bit of patience, this too will come to pass.

As far as performance comparisons go, I've only heard anecdotal evidence that htmlparser is faster. I suppose this could be an area of investigation.

Derrick

-----Original Message-----
From: peter lin [mailto:jmw...@ya...]
Sent: September 29, 2003 8:53 AM
To: Derrick Oswald
Subject: question about using HTMLParser in Apache JMeter

Hi Derrick,

I am a committer on Apache's Jakarta JMeter project. I was wondering if we can get permission to use HTMLParser. Since the Apache foundation can't use LGPL code without permission, I'm hoping you're open to the idea.

Here is a quick description of how I want to use it. JMeter currently is a load testing tool for HTTP, FTP, JDBC and Java. The HTTP plugin uses JTidy to parse the HTML and extract the images for download.

Test plans with more than 20 clients perform poorly because of the high cost of DOM. JTidy generates DOM documents. One trick is to turn off image downloads in JMeter, but that doesn't solve the real problem. I want to replace JTidy with HTMLParser. I haven't done any performance comparison yet, but I'm guessing it should use less memory.

Has anyone done a performance comparison between JTidy and HTMLParser?

peter lin
From: Derrick O. <Der...@Ro...> - 2003-09-29 19:55:06
OK, it's started...

I've integrated the low level lexer code into the main parser code. Many things aren't working anymore. Of the 448 unit tests 213 of them fail and 14 show exception faults. But the upside is 211 of the tests pass. So I'm dropping my current snapshot, opening it up to those who may wish to assist. See the TODO section.

Big changes
===========

A lot of files have been removed
--------------------------------
htmlparser/NodeReader.java
    this is the primary class that's being replaced by Lexer; the method nextNode() replaces readElement()
htmlparser/RemarkNodeParser.java
    remark nodes are now parsed in the Lexer main loop
htmlparser/parserHelper/AttributeParser.java
    attributes are now parsed by the lexer before the tag is created, and manipulated as a Vector of Attribute objects
htmlparser/parserHelper/StringParser.java
    string nodes are now parsed by the lexer
htmlparser/parserHelper/TagParser.java
    tags are now parsed by the lexer
htmlparser/tags/EndTag.java
    this class was replaced by a call to the new isEndTag() method on the Tag class

I labeled the repository with tag "PriorToLexerIntegration" just in case you want to retrieve a file that's no longer there.

Class Derivations
-----------------
The StringNode, RemarkNode and tags.Tag classes now derive from their lexeme counterparts in lexer.nodes instead of the other way around.

NodeFactory
-----------
The beginnings of a node factory interface are included. This was added so the lexer could return 'visitable' nodes to the parser. The parser acts as its own node factory, as does the Lexer.

NodeCount
---------
The node count for parsing goes up in most cases because every whitespace (i.e. newline) now counts as a StringNode. This has whacked out a lot of the tests that were expecting fewer nodes or a certain type of node at a particular index.

Attributes
----------
Attributes now maintain their order and case. The count of attributes also went up because whitespace is maintained within tags too. The storage in a Vector means the element 0 Attribute is actually the name of the tag, rather than having the $TAGNAME entry in a HashTable.

TODO
=====

visitEndTag()
-----------------
The visitEndNode() method on the visitor interface should be put back. I shouldn't have removed it when EndTag was removed. Instead the accept() in Tag should dispatch to visitTag() or visitEndTag() based on isEndTag().

Serializable
--------------
The Parser needs to be made serializable again. This involves a transient field down on the Source, I think, rather than having the whole Lexer transient in the Parser.

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reduce the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if the Tag was more bean-like and had a zero args constructor with appropriate accessors.

Helpers
-------
I desperately want to get rid of these 'helper' classes. They are just obfuscating the code.

Node Factory
------------
The factory concept needs to be extended with a TagFactory (extending NodeFactory) that has the signatures for creating all the possible types of tags there are, and then this needs to be used by all the scanners to create their specific tags.

Scanners
--------
The scanners may not be working; it's hard to tell without the unit tests running. I'm not sure that CompositeTagScanner is completely all right yet. It probably needs to be reworked based on the lexer.

Unit Tests
----------
As mentioned, many of the unit tests expect toHtml() to produce capitalized and rearranged output. And parseAndAssertNodeCount() is expected not to include so many whitespace nodes. These need to be addressed.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the TODO list, committing and notifying the developer list after each item. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).

Derrick
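The visitEndTag() item in the TODO list describes accept() on Tag choosing between visitTag() and visitEndTag() based on isEndTag(). A minimal sketch of that dispatch follows. SketchVisitor, SketchTag and RecordingVisitor are hypothetical stand-ins written for this illustration, not the real org.htmlparser classes, and the leading-slash test is only an assumed way of recognising an end tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are NOT the real org.htmlparser types.
interface SketchVisitor {
    void visitTag(SketchTag tag);
    void visitEndTag(SketchTag tag);
}

class SketchTag {
    // Raw tag contents without the angle brackets, e.g. "a href=x" or "/a".
    private final String text;

    SketchTag(String text) {
        this.text = text;
    }

    // Assumption for this sketch: an end tag is recognised by its leading slash,
    // so no separate EndTag class is needed.
    boolean isEndTag() {
        return text.startsWith("/");
    }

    String getText() {
        return text;
    }

    // Dispatch to the matching visitor callback based on isEndTag().
    void accept(SketchVisitor visitor) {
        if (isEndTag()) {
            visitor.visitEndTag(this);
        } else {
            visitor.visitTag(this);
        }
    }
}

// Records which callback was invoked for each tag, in order.
class RecordingVisitor implements SketchVisitor {
    final List<String> events = new ArrayList<>();

    public void visitTag(SketchTag tag) {
        events.add("start:" + tag.getText());
    }

    public void visitEndTag(SketchTag tag) {
        events.add("end:" + tag.getText());
    }
}

class VisitorDispatchDemo {
    public static void main(String[] args) {
        RecordingVisitor visitor = new RecordingVisitor();
        new SketchTag("a href=x").accept(visitor);
        new SketchTag("/a").accept(visitor);
        System.out.println(visitor.events);
    }
}
```

Keeping the branch inside accept() means visitor implementations never have to test isEndTag() themselves.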
From: Derrick O. <Der...@Ro...> - 2003-09-29 17:38:09
Fixed up the serializability.

TODO
=====

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reduce the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if the Tag was more bean-like and had a zero args constructor with appropriate accessors.

Helpers
-------
I desperately want to get rid of these 'helper' classes. They are just obfuscating the code.

Node Factory
------------
The factory concept needs to be extended with a TagFactory (extending NodeFactory) that has the signatures for creating all the possible types of tags there are, and then this needs to be used by all the scanners to create their specific tags.

Scanners
--------
The scanners may not be working; it's hard to tell without the unit tests running. I'm not sure that CompositeTagScanner is completely all right yet. It probably needs to be reworked based on the lexer.

Unit Tests
----------
As mentioned, many of the unit tests expect toHtml() to produce capitalized and rearranged output. And parseAndAssertNodeCount() is expected not to include so many whitespace nodes. These need to be addressed.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the TODO list, committing and notifying the developer list after each item. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
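The TagData item above suggests a more bean-like Tag with a zero args constructor and accessors instead of a parameter object passed to constructors. A rough sketch of that shape is below. BeanStyleTag and its three properties (name, startPosition, endPosition) are hypothetical choices made for illustration; the message does not say which fields the real Tag would expose.

```java
// Hypothetical sketch; the property names are illustrative, not the real Tag's.
class BeanStyleTag {
    private String name = "";
    private int startPosition;
    private int endPosition;

    // Zero-argument constructor: callers configure the tag through setters
    // instead of passing a TagData-style parameter object.
    BeanStyleTag() {
    }

    String getName() {
        return name;
    }

    void setName(String name) {
        this.name = name;
    }

    int getStartPosition() {
        return startPosition;
    }

    void setStartPosition(int startPosition) {
        this.startPosition = startPosition;
    }

    int getEndPosition() {
        return endPosition;
    }

    void setEndPosition(int endPosition) {
        this.endPosition = endPosition;
    }
}

class BeanStyleTagDemo {
    // Builds a tag the way a scanner or factory might: create, then set properties.
    static BeanStyleTag build(String name, int start, int end) {
        BeanStyleTag tag = new BeanStyleTag();
        tag.setName(name);
        tag.setStartPosition(start);
        tag.setEndPosition(end);
        return tag;
    }

    public static void main(String[] args) {
        BeanStyleTag tag = build("img", 10, 42);
        System.out.println(tag.getName() + " " + tag.getStartPosition() + "-" + tag.getEndPosition());
    }
}
```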
From: Derrick O. <Der...@Ro...> - 2003-09-29 11:52:49
Fixed up the broken visitor logic. Added some docos on NodeVisitor.

TODO
=====

Serializable
--------------
The Parser needs to be made serializable again. This involves a transient field down on the Source, I think, rather than having the whole Lexer transient in the Parser.

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reduce the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if the Tag was more bean-like and had a zero args constructor with appropriate accessors.

Helpers
-------
I desperately want to get rid of these 'helper' classes. They are just obfuscating the code.

Node Factory
------------
The factory concept needs to be extended with a TagFactory (extending NodeFactory) that has the signatures for creating all the possible types of tags there are, and then this needs to be used by all the scanners to create their specific tags.

Scanners
--------
The scanners may not be working; it's hard to tell without the unit tests running. I'm not sure that CompositeTagScanner is completely all right yet. It probably needs to be reworked based on the lexer.

Unit Tests
----------
As mentioned, many of the unit tests expect toHtml() to produce capitalized and rearranged output. And parseAndAssertNodeCount() is expected not to include so many whitespace nodes. These need to be addressed.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the TODO list, committing and notifying the developer list after each item. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
From: du du <tel...@ya...> - 2003-09-20 06:48:33
I want to write a piece of code to implement auto-fill of web page forms. I tried to use NodeVisitor, but I am puzzled by the following:

1) With this code:

    String[] tagsToBeFound = {"FORM", "INPUT"};
    TagFindingVisitor visitor = new TagFindingVisitor(tagsToBeFound);
    parser.visitAllNodesWith(visitor);
    Node[] allformTags = visitor.getTags(0);
    FormTag formtag = (FormTag) allformTags[0];
    Node[] allinputTags = visitor.getTags(1);
    InputTag inputtag = (InputTag) allinputTags[0];

there is a java.lang.ClassCastException: org.htmlparser.tags.Tag. Why?

2) If I write a customized visitor, how do I write visitFormTag and visitInputTag so as to collect all the form tags and input tags together?

3) If I use a RemarkNode to mark a form tag and its related input tags together, how do I decide the tagContents parameter?

Thanks for any hints.
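The ClassCastException in question 1 reports that the element being cast is a plain org.htmlparser.tags.Tag rather than a FormTag, so the unconditional cast fails. Why the visitor returned a generic Tag is not established in this message. A minimal sketch of guarding such a cast with instanceof follows; GenericTag and FormLikeTag are hypothetical stand-ins defined here, not htmlparser's own classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for a generic tag and a specialised form tag.
class GenericTag {
    final String name;

    GenericTag(String name) {
        this.name = name;
    }
}

class FormLikeTag extends GenericTag {
    FormLikeTag() {
        super("FORM");
    }
}

class SafeCastDemo {
    // Keep only the elements that really are FormLikeTag instances,
    // skipping generic tags that merely carry the name "FORM".
    static List<FormLikeTag> formsOnly(GenericTag[] found) {
        List<FormLikeTag> forms = new ArrayList<>();
        for (GenericTag tag : found) {
            if (tag instanceof FormLikeTag) {
                forms.add((FormLikeTag) tag);
            }
        }
        return forms;
    }

    public static void main(String[] args) {
        GenericTag[] found = { new GenericTag("FORM"), new FormLikeTag() };
        System.out.println(formsOnly(found).size());
    }
}
```

The check avoids the exception but does not by itself turn generic tags into specialised ones.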
From: Derrick O. <Der...@ro...> - 2003-09-03 11:05:38
The LineCount property indicates the line being processed and will increase from 1 when lines are read as nodes are retrieved. It's not a count of the number of lines in a file or page. That should be available after reading all nodes.

Derrick

zheng zhen wrote:

>I'm a beginner htmlparser developer. It would be
>appreciated if somebody could give me some hints. Here is the
>code:
>NodeReader nodeR = new NodeReader(new FileReader(new
>File("C:/temp/b.html")),1000);
>
>System.out.println("nodeR.getLineCount():"+nodeR.getLineCount());
>
>
>The problem is: why is nodeR.getLineCount() always 1?
>
>thanks again
>
>zz
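The same "progress, not total" behaviour can be seen with the JDK's own java.io.LineNumberReader, used here only as an analogy and not as a description of NodeReader's internals. Note one difference: LineNumberReader starts counting at 0, whereas the reply above says LineCount starts at 1. The helper name lineNumberAfter is made up for this sketch.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineProgressDemo {
    // Returns the reader's line number after reading at most linesToRead lines.
    static int lineNumberAfter(String text, int linesToRead) {
        try (LineNumberReader reader = new LineNumberReader(new StringReader(text))) {
            int read = 0;
            // Each successful readLine() advances the reader's line number by one.
            while (read < linesToRead && reader.readLine() != null) {
                read++;
            }
            return reader.getLineNumber();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html>\n<body>\n</body>\n</html>\n";
        // Nothing has been read yet, so the number is still at its initial value.
        System.out.println(lineNumberAfter(page, 0));
        // Only after consuming the input does the number reflect the total.
        System.out.println(lineNumberAfter(page, 4));
    }
}
```

Querying the count immediately after constructing the reader, as in the quoted code, therefore returns the starting value.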
From: Derrick O. <Der...@ro...> - 2003-09-01 13:16:32
Please welcome Christopher Bird.

Chris has been programming since '67, using such languages as IBM 360 Assembler, Basic, PL/I, Pascal, C, Smalltalk, Java and, most recently, C#. Chris has taught database design and system development methods and practices. He makes his living doing IT work - matching technologies to business strategy - including b2b integration, IP telephony and business continuity. He is also retained by an investment bank to look at technology deals - from the technology's perspective. He uses HTMLParser in several projects, including one that sends HTML page text content to his cell phone via SMS. He is a member of the IEEE Computer Society with a special interest in Software Engineering and Model Driven Development.

Derrick
From: Derrick O. <Der...@ro...> - 2003-09-01 02:28:44
I've uploaded a draft Java coding standards document for your perusal:

http://htmlparser.sourceforge.net/articles/Java%20Coding%20Standards.doc
http://htmlparser.sourceforge.net/articles/Java%20Coding%20Standards.pdf
http://htmlparser.sourceforge.net/articles/Java%20Coding%20Standards.html

Comments?