Sitemap
The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +772K followers.

[Toy Browser] HTML Parsing and CSS Computing

--

Introduction: [Toy Browser] Introduction

Previous article: [Toy Browser] HTTP Request and Response Parse

Keywords: HTML, CSS, DOM Tree, Finite-state machine

In previous article, toy browser can send a request to the server, receive its response, and parse the response. Now, HTML file is extracted. Next step, we need to create a DOM Tree by parsing HTML. It’s the most tricky and interesting part of toy browser. The codes of HTML Parser will help you understand it. In general, the HTML Parser does following things in the order:

  1. parse HTML character by character
  2. tokenize HTML tag by tag
  3. create the nodes of DOM Tree from the tokens, and mount the nodes on the tree
  4. compute CSS rules and apply them on the nodes of the DOM Tree

HTML Parse

As parsing response and chunked response body, our old friend, Finite-state machine will help us parsing HTML. Different from the previous tasks, HTML is much more complicated. Fortunately, whatwg.org has defined all the states for us (check here, tokenization). Unfortunately, there are 80 states ! To make the toy browser simple, I chose some of the states.

  • tagOpen, tagName, endTagOpen
  • beforeAttributeName, afterAttributeName, attributeName
  • beforeAttributeValue, doubleQuotedAttributeValue, singleQuotedAttributeValue, UnquotedAttributeValue, afterQuotedAttributeValue
  • selfClosingStartTag
  • data, EOF
Finite-states machine for parsing HTML of toy browser

As we can see, although I have reduced the number of state, the finite-state machine is still very complicated. whatwg.org has given us all the details of the states. Please read it if you want to build a more robust machine.

Since the states are defined, the HTML file can be tokenized by finite-state machine.

Tokenization

The machine reads the file, character by character. But it tokenizes it, tag by tag. A global variable, currentToken, is created to save tag’s information. Inside this token, there are several properties:

  • type: it defines the type of the token. The types are startTag, endTag , text, EOF .
  • tagName: HTML tag name, like, div, span, img, …
  • html attributes: all the html attributes. for example, <div id="container"> . The tag has an attribute, id , and its value is container . Same as the token, the attribute is saved by a global variable, currentAttribute. Once currentAttribute is completed, the token will dynamically take it as a property.

Once the information of a tag is collected, the function emit will be called. This function is the key part of creating DOM Tree.

The function emit has 3 steps:

  • startTag step: create the node of DOM Tree from the token
  • text step: create text node from the token with the text type
  • endTag step: compute the CSS and create the layout (it will be discussed in the next article of toy browser series)
All the startTag tokens in the demo HTML

DOM Tree is composed by nodes. The root of the tree is the node with document type. Be careful, the type of the node is different from the type of the token. There are only 3 types, document , element and text . Only the root has a document type. All the children of the root owns an element type or a text type.

DOM Tree with the root and the html tag node

To make a brief conclusion, until now, we knew that Finite-state machine is used to parse HTML. During parsing, the information of the tag is saved in a token. Once the information is completed, the function emit receives the token and starts to build the DOM Tree by creating nodes. Next, let’s see how the DOM Tree is created.

Token emission

The function emit uses a stack (an array following the rule: First In Last Out) to build the tree. All its items in it are nodes. The root of the DOM tree is set while it is initialised.

Once the function emit is called, it will check the top item (be careful, it is NOT popped) in the stack.

Starting with the first step. If the token type is startTag, the function emit will enter the startTag step, create a node. A node has:

  • type: element
  • tagName: It’s the same as that owned by received token.
  • attributes: They are the same as those owned by received token.
  • children: the children of current node

As we can see, the node keeps the information contained by the token, and more importantly, it creates the pointers pointed to its children.

Once the node is created, the function will compute its CSS, which will be presented later in this article, and set it as a child of the top element in the stack. After that, this node is pushed into the stack, except that the current token is a self closing tag, like <img src="image.jpg">.

This part is very important for building a DOM tree. First, the parent of current node is set, then, the node is pushed into stack for setting its children. Self closing tag hasn’t children, so we don’t push it.

Moving to the second step. If the token type is text, the function emit will enter text step, and create an node, owning the properties as followed:

  • type: text
  • content: some text content
text node of style tag

Attention, only data state in finite-state machine emits the text token. That means the function emit receives the text character by character (since the finite-state machine parses the HTML character by character). The text content need to be create, right here.

Once the text content is completed, the function emit set the text node as a child of the top element in the stack. With the help of the other steps, startTag step and endTag step in the function emit , it is easy to determinate whether the text content is completed or not. In order to make text node accessible by others steps, the text node need to be saved by a global variable.

Finally, the third step. If the token type is endTag, the function emit will pop the node.

But before that, it should check if the tag name of the current token is corresponded to the one of the top element in the stack. If two names are different, that means it’s an unclosed tag, it should throw an error.

If these names are the same, congratulation, a HTML tag is mounted on the DOM Tree. One more thing, if the tag is a style tag, then the text node inside the tag contains CSS rules. We need the register them.

Last but no least, the final layout is calculated in this step. More details about the layout will be discussed in the next article.

Conclusion of tokenization

The process of tokenization is separated it into 3 steps: startTag step, text step and endTag step.

In startTag step, the function emit does 4 things:

  • creates the node from received token
  • mounts the node on the DOM Tree
  • computes node’s CSS
  • pushes the node into the stack.

In text step, the function emit keep accepting the character sent by finite-state machine and writing the text content. Once the content is completed, we mount it on the top element of the stack.

In endTag step, we register the CSS rules, pop the top element in the stack, and calculate the layout according to CSS rules.

Our DOM Tree is almost done. The last thing to discuss is how to compute the CSS.

CSS computing

Parsing is troublesome. This time, I didn’t parse CSS by hand, instead, I used the package called css. The CSS rule can be parsed easily by calling the function css.parse . It returns an Abstract Syntax Tree. The CSS rules can be found in ast.stylesheet.rules . All the CSS rules are registered in a global variable, rules.

A CSS rule parsed by the function css.parse

To make the toy browser simpler, I only consider 3 types of CSS selectors:

  • id selector: for example, #my-id
  • class selector: for example, .my-class
  • element selector : for example, div

The function match is created to match the selectors.

Besides, the specificity among these 3 selector is:

  • id selector > class selector > element selector

For example, there are 3 selectors:

div { width: 500px }
div .cls { width: 100px }
div #my-id { width: 200px }

And, there is a div :

<div>
<div class="cls"> 1 </div> <!-- first div -->
<div class="cls" id="my-id"> 2 </div> <!-- second div -->
</div>
<div class="cls" id="my-id"> 2 </div>
</div>

As the result, The width of the first div is 100 px, and the second is 200px. Because the specificity of id selector is higher than class selector.

The function specificity and compare are created to generate the specificities of CSS selectors and compare them.

With 3 functions I created, the CSS rules can be applied on element nodes in DOM Tree. There is a trick: traversing the DOM Tree from leaves to root. It improves the performance by avoiding unnecessary repeated searched.

Imagine that there is a deep DOM Tree with numerous leaves, and there is a CSS rule for a leaf. If it traverses the tree from root to leaves, it needs to search several full depth branches of the tree. While, if it traverses from leaves to root, only some leaves will be checked. That’s a huge improvement.

As the DOM Tree is traversed from leaves to root, the order in CSS selector need to reversed, too. Because, the parent selector is always placed before the children, like div .cls .

Once an element node matches a CSS rule, the specificity will be calculated. If the specificity is higher than current CSS rule attached on the element, override it. The CSS rule and the specificity are written in the property, computedStyle, on the element.

a div node with computed CSS style

Now, the DOM Tree is fully created !

This article presents how the toy browser parses a html, computes CSS rules, and create a DOM Tree applied CSS rules.

But, can the toy browser renders a div having a CSS flex: 1 ? In the next article, I will talk about how the toy browser renders a DOM Tree, and print out an image from it. See you soon.

--

--

The Startup
The Startup

Published in The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +772K followers.

Jijie Liu
Jijie Liu

Written by Jijie Liu

Developer in Paris, interested in building small cool stuffs

No responses yet