Harvester is a lightweight and highly optimized JavaScript library for extracting data from the DOM tree. It supports extraction of tag texts with specified types and attributes. It's tiny and has no dependencies. The library consists of two parts: the harvest() function, which runs in the browser to extract data, and various integrations with JavaScript frameworks such as Puppeteer and Playwright.
- Description
- Advantages
- Installation
- How to use in browser
- How to use with puppeteer
- Examples folder
- Main concepts
- API
- Run tests
- Run linter
- Contribution
The main idea of this library is that the data you need to extract from an HTML page can be described using a text-based template. You roughly specify which tags are inside others and what data they contain. It is not necessary to know the exact structure of the HTML, because harvester can find the data in a fuzzy way.
This library is also especially useful for HTML structures that lack clear identifiers, unique classes, or other stable anchors typically used in CSS selectors. It relies not on unique identifiers, but on the structure of the tree defined in the template.
To extract data, you describe a template, a special tree-like DSL for the data you want to pull from the DOM, and then call the harvest() function with two parameters: the template and the starting branch in the DOM where the search will begin. This library can be used both in the browser and in web scraping tools such as Puppeteer and Playwright.
Here is a minimal usage example, assuming you have access to the target DOM tree. For the HTML:
<div id="product">
<div class="title">Title: MacBook M4 Pro Max</div>
<div>AD section</div>
<span>4990.00</span>
</div>
You may harvest the title and price with this simple template:
var tpl = `
div{title:with:Title}
*{price:float}
`
var data = harvest(tpl, $('#product'))
console.log(data[0]) // 0th element is a harvested data map
Output will be:
{
title: 'Title: MacBook M4 Pro Max',
price: '4990.00'
}
More examples can be found in the examples folder.
- Declarative and readable – the ability to see all block data at once, along with easy-to-read templates. You don't need to define the structure of every field or tag manually. All the data will be returned in a JS object
- Can find DOM nodes without identifiable elements such as id, class name, or attributes
- Simplicity and minimalism – the implemented features are enough to extract data from most websites, and the library's single function greatly simplifies its usage
- Resistant to DOM structure changes – HTML may look very different for the same data, but it will still be correctly extracted
- Highly optimized and fast – a typical harvest() function call completes in 15ms
- Supports various data types such as int, float, func, empty, and more
- Compatible with Puppeteer
- Compatible with Playwright
If you're using Harvester as a complement to Puppeteer, all you need to do is install the npm package into your Puppeteer project like this:
npm i js-harvester --save
Next, you need to call one of the two available functions, harvestPage() or harvestPageAll(), in your Puppeteer code to extract one or all data blocks based on a given CSS query. Like this:
import { harvestPageAll } from 'js-harvester/puppeteer.js'
...
await harvestPageAll(page, TPL, NEWS_QUERY, { inject: true, dataOnly: true })
More examples using this approach can be found in the examples folder.
If you're using Harvester as a complement to Playwright, all you need to do is install the npm package into your Playwright project like this:
npm i js-harvester --save
Next, you need to call one of the two available functions, harvestPage() or harvestPageAll(), in your Playwright code to extract one or all data blocks based on a given CSS query. Like this:
import { harvestPageAll } from 'js-harvester/playwright.js'
...
await harvestPageAll(page, TPL, NEWS_QUERY, { inject: true, dataOnly: true })
More examples using this approach can be found in the examples folder.
For local testing and writing templates, it's convenient to use the library directly in the Chrome DevTools console. You can copy the library code directly from the GitHub repository and paste it into the browser console. After that, you can run the harvest() function on the actual page you're targeting. It might look like this:
// You are now in the browser console!
// here you insert js-harvester library code
var tpl = `
div{title}
span{description}
`
harvest(tpl, '#my_element')
First, copy the content of the js-harvester/index.js file to the clipboard and paste it into the browser's DevTools console. Let's imagine you want to extract product data, and the structure of that data is shown on the left in two variations. It may change depending on different factors, such as the user's role, time zone, etc. In the top-right corner, you can see a template that describes both data structures for the given HTML examples. This template is just a string, and you may put it into a JavaScript variable like this: var tpl = ' div{text}'. At the bottom-right, you can see the result that the user will get after calling the harvest(tpl, '#rootElement') function. Let's imagine we have HTML like this:
<div id="product">
<div>
<span>Some title</span>
</div>
<h1>Product Name</h1>
<span>99.99</span>
<img src="https://example.com/image.jpg"></img>
</div>
And you need to grab the product title (you know that it has "Name" in it), the price, and its image. For this, your pseudo-tree template may look like this:
const tpl = `
*
h1{name:with:Name}
span{price:float}
*[img=src]
`
const result = harvest(tpl, '#product')
console.log(result)
The output would be:
[
{
name: 'Product Name',
price: '99.99',
img: 'https://example.com/image.jpg'
}, // Extracted data
59, // Maximum possible score
59, // Matched score
[...] // Extracted tree with metadata
]
What if the HTML differs depending on external factors — for example, the time zone or the user's role?
<div id="product">
<div>Some title</div>
<span>Product Name</span>
<span>99.99</span>
<div>SKU: 4278654</div>
</div>
Let's modify our template so that it accommodates both cases.
const tpl = `
div
*{name:with:Name}
span{price:float}
img[img=src]
div{sku:with:SKU}
`
const result = harvest(tpl, '#product')
console.log(result)
The output would be:
[
{
name: 'Product Name',
price: '99.99',
sku: 'SKU: 4278654'
}, // Extracted data
83, // Maximum possible score
69, // Matched score
[...] // Extracted tree with metadata
]
Harvester works well in combination with Puppeteer. The library exports two main functions for data extraction: harvestPage() – for extracting a single data block using a CSS query, and harvestPageAll() – for extracting all data blocks matching a given CSS query. Here's a minimal example where we extract all of today's news from the Ukrainska Pravda website. Before running this code, go to the examples folder and run the npm i command. After that, you can run one of the example files using Node.js in that folder:
node news
Let's take a look at the code for getting news from the site:
// examples/news.js
import { harvestPageAll } from 'js-harvester/puppeteer.js'
import { open, goto } from './utils.js'
const NEWS_QUERY = '.container_sub_news_list div.article_news_list'
const TPL = `
div{time}
div
a{title}`
const page = await open()
await goto(page, async () => page.goto('https://www.pravda.com.ua/news/'))
const news = await harvestPageAll(page, TPL, NEWS_QUERY, { inject: true, dataOnly: true })
console.log(news, '\nPress Ctrl-C to stop...')
The output would be:
[
{
time: '13:04',
title: '"Не зможу жити, якщо не поїду разом з ними". Історія сімох рідних братів, які служать у 3 ОШБр'
},
{
time: '12:44',
title: 'Ракетний удар по Сумах: кількість поранених зросла до 83, з них 7 дітей'
},
{ time: '12:30', title: 'Партизани спалили танк на Донеччині' },
{
time: '12:23',
title: '"Б’є всі вікові рекорди": найстарішій у світі горилі виповнилося 68 років'
},
...
]
Press Ctrl-C to stop...
Note the parameters inject and dataOnly. They mean "inject the Harvester library code" and "do not return metadata," respectively. In other words, we're telling the library to insert the JavaScript code from harvester.js into the HTML of the website we're extracting data from, and asking it to return only the data map instead of the full array of 4 elements (data, maxScore, score, and the metadata tree). Also, keep in mind that calling harvestPageAll() multiple times with the inject: true option will re-insert the library code each time, which can clutter the HTML page. Alternatively, you can manually inject the library code into the page using this function: page.addScriptTag({ path: HARVESTER_PATH }).
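One way to avoid repeated injection is to add the script once per page and then call the harvesting functions with inject: false. The following is a minimal sketch, not part of js-harvester: the injectOnce helper is hypothetical, and the HARVESTER_PATH value is an assumption taken from the default path option described in the API section. It relies only on page.addScriptTag(), which both Puppeteer and Playwright provide.

```javascript
// Hypothetical helper (not part of js-harvester): injects the library at most once per page object.
// Assumed path: the default `path` option documented for harvestPage().
const HARVESTER_PATH = './node_modules/js-harvester/src/harvester.js'
const injectedPages = new WeakSet()

async function injectOnce (page, path = HARVESTER_PATH) {
  if (injectedPages.has(page)) return false // already injected, nothing to do
  await page.addScriptTag({ path })
  injectedPages.add(page)
  return true
}

// Possible usage inside a Puppeteer or Playwright script:
//   await injectOnce(page)
//   const news = await harvestPageAll(page, TPL, NEWS_QUERY, { inject: false, dataOnly: true })
```

Note that this sketch does not track navigations. A new page load discards the injected script, so the library would have to be injected again after page.goto().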
This project contains a few examples of using the harvester with Puppeteer, Playwright, and Node.js.
cd examples
npm i
cd examples
node src/puppeteer/rozetka
node src/playwright/news
The Harvester library implements the concept of "fuzzy tree matching." This means that when searching for all branches of our template in the DOM tree, some of them might not be found and/or might not be in their expected positions. Despite this, the library will attempt to find the best possible match.
Imagine we are looking for the text of two tags, title and price, with the following template:
var tpl = `
div{title}
  span{price:float}
`
The DOM tree does not have to match exactly. For example, if the DOM structure looks like this:
<div>
Some title
<a href="...">Link is here</a>
<div>
<span>123.46</span>
</div>
</div>
The values title and price will still be found, even though the structure is not an exact match. In the given HTML, only one div contains text, and only one span is inside this div. Moreover, the value inside the span is a floating-point number. This is exactly how the library identifies the data we need. You might ask: What about performance? If the DOM tree is large, the program could get stuck in deeply nested branches while trying to find the required data. To solve this problem, the score calculation algorithm was introduced. A DOM tree can indeed be very large, and if Harvester were to check every single branch, the process could take a long time.
The further we deviate from the tree structure defined in our template, the fewer points we accumulate. In other words, at a certain depth, continuing the search no longer makes sense because the potential score will be lower than what we achieved in the higher levels of the DOM tree. To put it simply:
- The deeper we search from the entry point, the fewer points we collect.
- The lower the score, the less likely the data will be found.
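The effect of depth can be illustrated with the depth term of the score formula given in the score section of this README, (node.score > 0 ? maxLevel - level : 0). This is a minimal sketch, not the library's code: levelBonus is a hypothetical helper name, and maxLevel = 5 is assumed, as in the worked score example.

```javascript
// Hypothetical helper: the depth bonus a node gets, following the (maxLevel - level) term.
// The bonus is granted only when the node itself scored something (ownScore > 0).
function levelBonus (level, maxLevel, ownScore) {
  return ownScore > 0 ? maxLevel - level : 0
}

const maxLevel = 5 // assumed value
for (let level = 0; level <= maxLevel; level++) {
  // The bonus shrinks by one point per level and reaches 0 at maxLevel.
  console.log(`level ${level}: bonus ${levelBonus(level, maxLevel, 1)}`)
}
```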
To extract the data you need, you must specify a so-called entry point. The entry point is a specific DOM element that will be considered the starting (or root) node for the search. This element is passed as the second argument to the harvest() function.
Usually, we use document.querySelector('#product') for this. It's important to note that the search is performed starting from that branch and downwards. That means if you pass the entire document (like the <body> or <html> tag) as the entry point, the data most likely won't be found, because the algorithm only searches "near" the entry point.
Based on that, the recommended approach for setting the entry point is:
- Locate the block that contains the data you want.
- Create a unique CSS selector for it.
- Call querySelector() with that selector.
- Pass the obtained DOM element as the second parameter to the harvest() function.
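The steps above can be wrapped in a small helper that fails loudly when the entry point is missing, instead of searching from the wrong place. This is a sketch under assumptions: harvestFrom is a hypothetical name and not part of js-harvester, and the document and the harvest() function are passed in explicitly so the helper does not depend on globals.

```javascript
// Hypothetical wrapper (not part of js-harvester).
// `doc` is a Document and `harvestFn` is the library's harvest() function.
function harvestFrom (doc, harvestFn, tpl, selector) {
  const entry = doc.querySelector(selector) // resolve the unique selector to an element
  if (!entry) throw new Error(`Entry point not found: ${selector}`)
  return harvestFn(tpl, entry) // the element goes in as the second argument
}

// Possible usage in the browser:
//   harvestFrom(document, harvest, tpl, '#product')
```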
The next concept to understand is score. Score is how this library determines how well the extracted data matches our template. Scores are calculated separately for each branch. The general formula for calculating the score looks like this: node.score += (subNodesScore + (node.score > 0 ? maxLevel - level : 0)), where maxLevel is the maximum available search depth, level is the current depth level, and subNodesScore is the sum of the scores of all nested nodes. Let's take a look at how the score is calculated:
- For each found tag in the DOM tree, the program assigns 1 point.
- If the required text is found, another 6 points are added.
- If the text type matches the specified type (e.g., int or float), an extra 12 points are given.
- Finally, 6 more points are awarded if the required attribute is found on the current tag.
Let's assume we have the following template:
var tpl = `
div{title}
  *{price:float}
`
Here, we are looking for a title inside the text of the div tag and a price inside the text of any tag within the div, where the value must be of type float. To calculate the maximum score (maxScore), we need to go through all the lines of the template and sum up the points for each element from the bottom up:
- The * tag always matches: +1 point
- If the * tag has text with type float: +12 points
- The * tag is located on level 1, so we add (maxLevel - level) = (5 - 1) = 4 points
- If there is a div tag above it: +1 point
- If this tag contains text: +6 points
- The div tag is located on level 0, so we add (maxLevel - level) = (5 - 0) = 5 points
- The div tag sums its own score with the score of its child *, so +17 points
In total, that gives 29 points. So, maxScore = 29.
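The arithmetic of this walkthrough can be reproduced with a few lines of code. This is only a sketch of the calculation as described here, not the library's implementation: maxScoreOf and the node objects are hypothetical, maxLevel = 5 is assumed, and a typed text is counted as 12 points in place of the 6 points for plain text, because that is how the walkthrough adds it up.

```javascript
// Weights as listed in the default options (tagScore, tagTextScore, tagTextTypeScore).
const TAG = 1
const TEXT = 6
const TEXT_TYPE = 12
const MAX_LEVEL = 5 // assumed, as in the walkthrough

// Hypothetical node shape: { level, text, typed, children }, one node per template line.
function maxScoreOf (node) {
  const childrenScore = node.children.reduce((sum, child) => sum + maxScoreOf(child), 0)
  let own = TAG
  if (node.typed) own += TEXT_TYPE // text with a matching type
  else if (node.text) own += TEXT // plain text
  return own + (MAX_LEVEL - node.level) + childrenScore
}

const star = { level: 1, text: true, typed: true, children: [] } // *{price:float}
const div = { level: 0, text: true, typed: false, children: [star] } // div{title}

console.log(maxScoreOf(star)) // 17 = 1 + 12 + 4
console.log(maxScoreOf(div)) // 29 = 1 + 6 + 5 + 17
```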
During the search, not everything may match exactly. For example, the tag might be different, but the text type matches, or the text might be present, but the required attribute is missing. Harvester will try to find the match that accumulates the highest score, meaning it is the closest match to our template. Essentially, maximizing the score increases the accuracy of finding the required information in the DOM tree.
How does this affect the harvest() function? harvest() returns two types of scores:
- maxScore – the maximum possible score calculated based on the given template (i.e., the best-case scenario if everything matches perfectly).
- score – the actual score obtained during data extraction.
If score matches maxScore, then we have found everything we were looking for. However, in real-world scenarios, score will almost always be lower than maxScore.
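Because score is rarely equal to maxScore, it can be convenient to look at their ratio when deciding whether a result is good enough. This is a minimal sketch: matchRatio is a hypothetical helper, and the sample array simply mirrors the shape and numbers of the second product example in this README.

```javascript
// Hypothetical helper: share of the template that was matched, from 0 to 1.
function matchRatio (result) {
  const [, maxScore, score] = result // [data, maxScore, score, metaTree]
  return maxScore > 0 ? score / maxScore : 0
}

// Sample result shaped like the second product example (maxScore 83, score 69).
const sample = [{ name: 'Product Name', price: '99.99', sku: 'SKU: 4278654' }, 83, 69, []]
console.log(matchRatio(sample).toFixed(2)) // 0.83
```

What ratio counts as acceptable depends on the template and the site, so any threshold would have to be chosen per use case.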
To describe the structure of a data template, we need to understand its format. A template consists of lines. For example:
var tpl = "  img{title}[attr=src]"
Each line corresponds to a branch in the DOM tree. Lines can be nested within each other using spaces at the beginning of the line. In our example, there are two spaces before the img tag. If this tag is the first one, the initial number of spaces is interpreted as "no spaces" and is ignored. If we need to describe a nested branch, we do it like this:
var tpl = `
  img{title}[attr=src]
    div{text}
`
In this example, we defined a template consisting of two tags, img and div, where div is inside img. It's important to note that if div is nested inside img, it doesn't necessarily mean it is its first child. This library implements a fuzzy data search, meaning that div may be deeper in the DOM tree. Also, note that we are using a multiline string here. The parser understands this type of string and gets only two lines instead of four.
As you may have noticed, every two spaces before a tag are interpreted as one level of nesting. This is how we determine that the div tag is inside the img tag: it has two more leading spaces. If you make a mistake and add more spaces than necessary, the parser will throw an error, and the current line will not be included in the final tree.
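The indentation rule can be sketched as a small function that turns template lines into nesting levels. This is not the library's parser, only an illustration under stated assumptions: templateLevels is a hypothetical name, blank lines are skipped, the indentation of the first tag is treated as the baseline, and a line indented more than one level deeper than the previous one is rejected with an error.

```javascript
// Hypothetical parser sketch (not the library's implementation).
function templateLevels (tpl, spaceAmount = 2) {
  const lines = tpl.split('\n').filter((line) => line.trim() !== '') // skip blank lines
  if (lines.length === 0) return []
  const indentOf = (line) => line.length - line.trimStart().length
  const base = indentOf(lines[0]) // indentation of the first tag is the baseline
  const result = []
  let prevLevel = -1
  for (const line of lines) {
    const spaces = indentOf(line) - base
    const level = spaces / spaceAmount
    // Reject partial indents and jumps of more than one level at a time.
    if (spaces < 0 || !Number.isInteger(level) || level > prevLevel + 1) {
      throw new Error(`Bad indentation: "${line}"`)
    }
    result.push({ level, node: line.trim() })
    prevLevel = level
  }
  return result
}

const tpl = `
  img{title}[attr=src]
    div{text}
`
// Two entries: level 0 for the img line and level 1 for the div line.
console.log(templateLevels(tpl))
```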
After the spaces, there is a required element: the tag. If we don't remember which specific tag should be in that place, we can use the special * symbol, which means "any tag". In this case, our template may look like this: *{title}[src=href].
Next, there is an optional block for extracting text from the current tag. In the example above, we specified {title}. Curly brackets indicate that we are looking for text inside the img tag, which will be stored under the key "title" in the object returned by the harvest() function. It's important to note that only the tag's own text will be returned. If a tag has several text nodes, an array will be returned.
Once again, this parameter is optional, and we can omit it if we don't need the text of the current tag. If we know the type of data contained within the current tag, we can specify it as well. harvester supports several basic types:
- int - an integer
- float - a floating-point number
- with - a substring search
- func - a custom function for text validation
- parent - checks if the specified node has a specified parent
- str - a string
- empty - an empty text
Let's look at an example of using these types. Suppose we want to find a title that contains the word "Big". The template line will look like this:
var tpl = " img{title:with:Big}"
Notice that we omitted the square brackets in this example.
If we want to find a title inside an img tag that is an integer, our template will be:
var tpl = " img{title:int}"
Finally, if we are looking for something specific and have a special function to validate the title, the template will be:
var tpl = " img{title:func:checkTitle}"
In this example, checkTitle must exist in the global scope before calling the harvest() function, like this: window.checkTitle = function checkTitle() {...}
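A validator might look like the sketch below. The checks themselves are invented for illustration, and the (text, element) signature is an assumption based on the price example shown for harvestPage() in this README. globalThis is used instead of window so the snippet also runs outside a browser; in a page the two refer to the same object.

```javascript
// Hypothetical validator for a template line such as img{title:func:checkTitle}.
// Assumed signature: (text, element) -> boolean. The element is not used in this sketch.
function checkTitle (text, el) {
  const t = (text || '').trim()
  return t.length > 0 && t.length <= 120 // accept non-empty titles of a reasonable length
}

// Register it globally so code running in the page can find it by name.
globalThis.checkTitle = checkTitle
```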
The next optional parameter is the search for a tag's attribute value. For example, the img tag may have an src attribute, and to extract its value, we need to create the following template:
var tpl = " img[attr=src]"
In this example, the value of the src attribute of the img tag will be extracted and stored under the key "attr" in the object returned by the harvest() function.
The harvester library uses just a single function to do all its work. It must be called in a browser context, not in the Node.js code that drives Puppeteer or a similar library.
harvest(tpl, element, options = {}): [data, maxScore, score, metaTree] | data
It finds matches between a pseudo-tree and the DOM structure, extracting text and attributes. The only function you will use to retrieve data is harvest(). It takes three parameters: a text-based template describing the structure of the data, a DOM element corresponding to the first specified tag (or a CSS query for that element), and options. The options have this format (with default values):
const options = {
spaceAmount: 2, // Default number of spaces in a template between parent and child
executionTime: 15000, // Maximum search time. The search is stopped automatically when the time runs out
minDepth: 3, // The minimum depth to which we will descend or ascend during the search. This value is not linear: it represents the number of recursive calls made by the search function while traversing the DOM tree
tagScore: 1, // Points added if the tag is equal to the one we are looking for
tagTextScore: 6, // Points added if the tag's text exists and we are looking for it
tagAttrScore: 6, // The same, but for an attribute
tagTextTypeScore: 12, // Points added if the tag and text exist and the text has the type we are looking for
dataOnly: false // true means that only a data map will be returned instead of an array of four elements
}
The function returns an array of four elements:
- data - A hash map containing all the extracted data, organized by the keys specified in the template.
- maxScore - The maximum possible score based on the given template.
- score - The actual score obtained based on the search for template elements.
- metaTree - A metadata tree for debugging.
Sometimes the search doesn't find the data you've specified in the template. In such cases, it makes sense to increase the search depth using the minDepth option like this:
harvest(tpl, '#product', { minDepth: 30 })
This parameter controls how deeply the algorithm will search for your data within the DOM tree. For websites with complex structures, increasing this value can help locate the desired elements, but it may also increase search time. Here is an example of parsing the Amazon site with this config.
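Since a larger minDepth costs search time, one option is to start small and deepen the search only when the match is incomplete. The following is a sketch, not a library feature: harvestDeepening is a hypothetical helper, the harvest() function is passed in explicitly, and the depth values are arbitrary apart from 3 (the default) and 30 (the value used above).

```javascript
// Hypothetical helper: tries increasing minDepth values and keeps the best-scoring result.
// `harvestFn` is the library's harvest() function; dataOnly must stay false so scores are returned.
function harvestDeepening (harvestFn, tpl, entry, depths = [3, 10, 30]) {
  let best = null
  for (const minDepth of depths) {
    const result = harvestFn(tpl, entry, { minDepth }) // [data, maxScore, score, metaTree]
    if (best === null || result[2] > best[2]) best = result
    if (result[2] === result[1]) break // perfect match, no need to search deeper
  }
  return best
}

// Possible usage in the browser:
//   const [data, maxScore, score] = harvestDeepening(harvest, tpl, '#product')
```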
This function is specifically designed for programs using Puppeteer and Playwright. It extracts a single data block based on the provided template, CSS query, and configuration, and returns a Promise. The arguments are the same as for the harvest() function, except for page, which is the Puppeteer or Playwright page instance we are working with.
harvestPage(page, tpl, query, opt = {}): Promise<[data, maxScore, score, metaTree]> | Promise<data>
harvestPage() supports a few configuration parameters of its own. Note that the options object is also passed on to the harvest() function. Here are the config parameters that only this function supports:
const options = {
inject: false, // true means: inject the harvester.js file into the destination HTML page
path: './node_modules/js-harvester/src/harvester.js', // path the library uses to find the harvester.js file in your node_modules folder
amount: undefined // number of elements to return. undefined means all elements
}
It returns the same result as the harvest() function, wrapped in a Promise: an array of four elements or a data map. Here is an example of usage:
import { harvestPage } from 'js-harvester/puppeteer.js'
...
const NEWS_QUERY = '.container_sub_news_list div.article_news_list'
const TPL = `
div{time}
div
a{title}`
const news = await harvestPage(page, TPL, NEWS_QUERY, { inject: true, dataOnly: true })
console.log(news)
The output:
{
time: '17:40',
title: 'В Латвії заявили про необхідність мати на озброєнні "великі системи"'
}
If your code uses custom functions inside the template, they must be present on the HTML page before you call harvestPage(). You can do it like this:
await page.evaluate(() => {
window.price = function price(t, el) { return el?.className === 'my-class' && t?.indexOf('$') > -1 }
})
const products = await harvestPageAll(page, TPL, PRODUCTS_QUERY, { inject: true, dataOnly: true, minDepth: 60 })
console.log(products)
This function is also specifically designed for programs using Puppeteer and Playwright. It extracts all found data blocks based on the provided template, CSS query, and configuration, and returns a Promise. The arguments are the same as for the harvest() function, except for page, which is the Puppeteer or Playwright page instance we are working with.
harvestPageAll(page, tpl, query, opt = {}): Promise<Array<[data, maxScore, score, metaTree]>> | Promise<Array<data>>
All other parameters and behavior of this function, except for the return value, are similar to harvestPage(). Here is a small example of usage:
import { harvestPageAll } from 'js-harvester/puppeteer.js'
...
const NEWS_QUERY = '.container_sub_news_list div.article_news_list'
const TPL = `
div{time}
div
a{title}`
const news = await harvestPageAll(page, TPL, NEWS_QUERY, { inject: true, dataOnly: true })
console.log(news)
The output:
[
{
time: '17:40',
title: 'В Латвії заявили про необхідність мати на озброєнні "великі системи"'
},
{
time: '17:07',
title: 'ЄС засудив Росію за відкриття прямого авіасполучення з невизнаною Абхазією'
},
{
time: '17:06',
title: 'Вчені знайшли нове джерело золота в космосі: воно могло існувати ще за раннього Всесвіту'
},
...
]
Custom functions work the same way as for the harvestPage() function.
npm i
npm test
Run the standardjs library to find possible issues in the code.
npm run lint
Just create an issue or PR and we will start a conversation...
MIT