[Toy Browser] HTML Parsing and CSS Computing
Introduction: [Toy Browser] Introduction
Previous article: [Toy Browser] HTTP Request and Response Parse
Keywords: HTML, CSS, DOM Tree, Finite-state machine
In previous article, toy browser can send a request to the server, receive its response, and parse the response. Now, HTML file is extracted. Next step, we need to create a DOM Tree by parsing HTML. It’s the most tricky and interesting part of toy browser. The codes of HTML Parser will help you understand it. In general, the HTML Parser does following things in the order:
- parse HTML character by character
- tokenize HTML tag by tag
- create the nodes of DOM Tree from the tokens, and mount the nodes on the tree
- compute CSS rules and apply them on the nodes of the DOM Tree
HTML Parse
As parsing response and chunked response body, our old friend, Finite-state machine will help us parsing HTML. Different from the previous tasks, HTML is much more complicated. Fortunately, whatwg.org has defined all the states for us (check here, tokenization). Unfortunately, there are 80 states ! To make the toy browser simple, I chose some of the states.
- tagOpen, tagName, endTagOpen
- beforeAttributeName, afterAttributeName, attributeName
- beforeAttributeValue, doubleQuotedAttributeValue, singleQuotedAttributeValue, UnquotedAttributeValue, afterQuotedAttributeValue
- selfClosingStartTag
- data, EOF
As we can see, although I have reduced the number of state, the finite-state machine is still very complicated. whatwg.org has given us all the details of the states. Please read it if you want to build a more robust machine.
Since the states are defined, the HTML file can be tokenized by finite-state machine.
Tokenization
The machine reads the file, character by character. But it tokenizes it, tag by tag. A global variable, currentToken, is created to save tag’s information. Inside this token, there are several properties:
- type: it defines the type of the token. The types are
startTag
,endTag
,text
,EOF
. - tagName: HTML tag name, like,
div, span, img, …
- html attributes: all the html attributes. for example,
<div id="container">
. The tag has an attribute,id
, and its value iscontainer
. Same as the token, the attribute is saved by a global variable, currentAttribute. Once currentAttribute is completed, the token will dynamically take it as a property.
Once the information of a tag is collected, the function emit
will be called. This function is the key part of creating DOM Tree.
The function emit
has 3 steps:
- startTag step: create the node of DOM Tree from the token
- text step: create text node from the token with the
text
type - endTag step: compute the CSS and create the layout (it will be discussed in the next article of toy browser series)
DOM Tree is composed by nodes. The root of the tree is the node with document
type. Be careful, the type of the node is different from the type of the token. There are only 3 types, document
, element
and text
. Only the root has a document
type. All the children of the root owns an element
type or a text
type.
To make a brief conclusion, until now, we knew that Finite-state machine is used to parse HTML. During parsing, the information of the tag is saved in a token. Once the information is completed, the function emit
receives the token and starts to build the DOM Tree by creating nodes. Next, let’s see how the DOM Tree is created.
Token emission
The function emit
uses a stack (an array following the rule: First In Last Out) to build the tree. All its items in it are nodes. The root of the DOM tree is set while it is initialised.
Once the function emit
is called, it will check the top item (be careful, it is NOT popped) in the stack.
Starting with the first step. If the token type is startTag, the function emit
will enter the startTag step, create a node. A node has:
- type:
element
- tagName: It’s the same as that owned by received token.
- attributes: They are the same as those owned by received token.
- children: the children of current node
As we can see, the node keeps the information contained by the token, and more importantly, it creates the pointers pointed to its children.
Once the node is created, the function will compute its CSS, which will be presented later in this article, and set it as a child of the top element in the stack. After that, this node is pushed into the stack, except that the current token is a self closing tag, like <img src="image.jpg">
.
This part is very important for building a DOM tree. First, the parent of current node is set, then, the node is pushed into stack for setting its children. Self closing tag hasn’t children, so we don’t push it.
Moving to the second step. If the token type is text, the function emit
will enter text step, and create an node, owning the properties as followed:
- type:
text
- content: some text content
Attention, only data state in finite-state machine emits the text token. That means the function emit
receives the text character by character (since the finite-state machine parses the HTML character by character). The text content need to be create, right here.
Once the text content is completed, the function emit
set the text node as a child of the top element in the stack. With the help of the other steps, startTag step and endTag step in the function emit
, it is easy to determinate whether the text content is completed or not. In order to make text node accessible by others steps, the text node need to be saved by a global variable.
Finally, the third step. If the token type is endTag, the function emit
will pop the node.
But before that, it should check if the tag name of the current token is corresponded to the one of the top element in the stack. If two names are different, that means it’s an unclosed tag, it should throw an error.
If these names are the same, congratulation, a HTML tag is mounted on the DOM Tree. One more thing, if the tag is a style
tag, then the text node inside the tag contains CSS rules. We need the register them.
Last but no least, the final layout is calculated in this step. More details about the layout will be discussed in the next article.
Conclusion of tokenization
The process of tokenization is separated it into 3 steps: startTag step, text step and endTag step.
In startTag step, the function emit
does 4 things:
- creates the node from received token
- mounts the node on the DOM Tree
- computes node’s CSS
- pushes the node into the stack.
In text step, the function emit
keep accepting the character sent by finite-state machine and writing the text content. Once the content is completed, we mount it on the top element of the stack.
In endTag step, we register the CSS rules, pop the top element in the stack, and calculate the layout according to CSS rules.
Our DOM Tree is almost done. The last thing to discuss is how to compute the CSS.
CSS computing
Parsing is troublesome. This time, I didn’t parse CSS by hand, instead, I used the package called css. The CSS rule can be parsed easily by calling the function css.parse
. It returns an Abstract Syntax Tree. The CSS rules can be found in ast.stylesheet.rules
. All the CSS rules are registered in a global variable, rules.
To make the toy browser simpler, I only consider 3 types of CSS selectors:
- id selector: for example,
#my-id
- class selector: for example,
.my-class
- element selector : for example,
div
The function match
is created to match the selectors.
Besides, the specificity among these 3 selector is:
- id selector > class selector > element selector
For example, there are 3 selectors:
div { width: 500px }
div .cls { width: 100px }
div #my-id { width: 200px }
And, there is a div
:
<div>
<div class="cls"> 1 </div> <!-- first div -->
<div class="cls" id="my-id"> 2 </div> <!-- second div -->
</div><div class="cls" id="my-id"> 2 </div>
</div>
As the result, The width of the first div
is 100 px, and the second is 200px. Because the specificity of id selector is higher than class selector.
The function specificity
and compare
are created to generate the specificities of CSS selectors and compare them.
With 3 functions I created, the CSS rules can be applied on element nodes in DOM Tree. There is a trick: traversing the DOM Tree from leaves to root. It improves the performance by avoiding unnecessary repeated searched.
Imagine that there is a deep DOM Tree with numerous leaves, and there is a CSS rule for a leaf. If it traverses the tree from root to leaves, it needs to search several full depth branches of the tree. While, if it traverses from leaves to root, only some leaves will be checked. That’s a huge improvement.
As the DOM Tree is traversed from leaves to root, the order in CSS selector need to reversed, too. Because, the parent selector is always placed before the children, like div .cls
.
Once an element node matches a CSS rule, the specificity will be calculated. If the specificity is higher than current CSS rule attached on the element, override it. The CSS rule and the specificity are written in the property, computedStyle, on the element.
Now, the DOM Tree is fully created !
This article presents how the toy browser parses a html, computes CSS rules, and create a DOM Tree applied CSS rules.
But, can the toy browser renders a div
having a CSS flex: 1
? In the next article, I will talk about how the toy browser renders a DOM Tree, and print out an image from it. See you soon.