Harvester 🚜

Harvester is a lightweight, highly optimized JavaScript library for extracting data from the DOM tree. It extracts tag texts (optionally checked against a specified type) and attribute values. It is tiny and has no dependencies. The library consists of two parts: the harvest() function, which runs in the browser to extract data, and integrations with browser automation libraries such as Puppeteer and Playwright.

Description
Advantages
Installation
How to use in browser
How to use with puppeteer
Examples folder
- Examples installation
- Running examples
Main concepts
API
Run tests
Run linter
Contribution

Description

The main idea of this library is that the data you need to extract from an HTML page can be described using a text-based template. You roughly specify which tags are inside others and what data they contain. It is not necessary to know the exact structure of the HTML, because harvester can find the data in a fuzzy way.

This library is also especially useful for HTML structures that lack clear identifiers, unique classes, or other stable anchors typically used in CSS selectors. It relies not on unique identifiers, but on the structure of the tree defined in the template.

To extract data, you describe a template (a small tree-like DSL) of the part of the DOM that holds the data, and then call the harvest() function with two parameters: the template and the DOM element where the search will begin. The library can be used both in the browser and in web scraping tools such as Puppeteer and Playwright.

Here is a minimal usage example for when you have access to the target DOM tree. For the HTML:

  <div id="product">
    <div class="title">Title: MacBook M4 Pro Max</div>
    <div>AD section</div>
    <span>4990.00</span>
  </div>

You may harvest the title and price with this simple template:

var tpl = `
  div{title:with:Title}
    *{price:float}
`
var data = harvest(tpl, $('#product'))
console.log(data[0]) // 0th element is a harvested data map

Output will be:

{
  title: 'Title: MacBook M4 Pro Max',
  price: '4990.00'
}

More examples can be found in the examples folder.

Advantages

Declarative and readable – you see all the data of a block at once in an easy-to-read template. You don't need to define the structure of every field or tag manually. All the data is returned in a JS object.
Can find DOM nodes without identifiable elements such as id, class name, or attributes
Simplicity and minimalism – the implemented features are enough to extract data from most websites, and the library's single function greatly simplifies its usage
Resistant to DOM structure changes – HTML may look very different for the same data, but it will still be correctly extracted
Highly optimized and fast - a typical harvest() function call completes in 15ms
Supports various data types such as int, float, func, empty, and more
Compatible with Puppeteer
Compatible with Playwright

Installation

For the Puppeteer

If you're using Harvester as a complement to Puppeteer, all you need to do is install the npm package into your Puppeteer project like this:

npm i js-harvester --save

Next, call one of the two available functions, harvestPage() or harvestPageAll(), in your Puppeteer code to extract one or all data blocks matching a given CSS query, like this:

import { harvestPageAll } from 'js-harvester/puppeteer.js'
...
await harvestPageAll(page, TPL, NEWS_QUERY, { inject: true, dataOnly: true })

More examples using this approach can be found in the examples folder.

For the Playwright

If you're using Harvester as a complement to Playwright, all you need to do is install the npm package into your Playwright project like this:

npm i js-harvester --save

Next, call one of the two available functions, harvestPage() or harvestPageAll(), in your Playwright code to extract one or all data blocks matching a given CSS query, like this:

import { harvestPageAll } from 'js-harvester/playwright.js'
...
await harvestPageAll(page, TPL, NEWS_QUERY, { inject: true, dataOnly: true })

More examples using this approach can be found in the examples folder.

For the browser

For local testing and writing templates, it's convenient to use the library directly in the Chrome DevTools console. You can copy the library code directly from the GitHub repository and paste it into the browser console. After that, you can run the harvest() function on the actual page you're targeting. It might look like this:

// You are now in the browser console!
// here you insert js-harvester library code
var tpl = `
  div{title}
    span{description}
`
harvest(tpl, '#my_element')

How to use in browser

First, copy the content of the js-harvester/index.js file to the clipboard and paste it into the browser's DevTools console. Suppose you want to extract product data whose HTML structure may vary depending on factors such as the user's role or time zone. A single template can describe several such variations. A template is just a string, so you can store it in a JavaScript variable like this: var tpl = 'div{text}'. Calling harvest(tpl, '#rootElement') then returns the extracted result. Let's imagine we have HTML like this:

<div id="product">
  <div>
    <span>Some title</span>
  </div>
  <h1>Product Name</h1>
  <span>99.99</span>
  <img src="https://example.com/image.jpg">
</div>

And you need to grab the product title (you know that it contains the word "Name"), the price, and the image. For this, your pseudo-tree template may look like this:

const tpl = `
*
  h1{name:with:Name}
  span{price:float}
  *[img=src]
`
const result = harvest(tpl, '#product')
console.log(result)

The output would be:

[
  {
    name: 'Product Name',
    price: '99.99',
    img: 'https://example.com/image.jpg'
  },    // Extracted data
  59,   // Maximum possible score
  59,   // Matched score
  [...] // Extracted tree with metadata
]

What if the HTML differs depending on external factors — for example, the time zone or the user's role?

<div id="product">
  <div>Some title</div>
  <span>Product Name</span>
  <span>99.99</span>
  <div>SKU: 4278654</div>
</div>

Let's modify our template so that it accommodates both cases.

const tpl = `
div
  *{name:with:Name}
  span{price:float}
  img[img=src]
  div{sku:with:SKU}
`
const result = harvest(tpl, '#product')
console.log(result)

The output would be:

[
  {
    name: 'Product Name',
    price: '99.99',
    sku: 'SKU: 4278654'
  },    // Extracted data
  83,   // Maximum possible score
  69,   // Matched score
  [...] // Extracted tree with metadata
]

How to use with puppeteer

Harvester works well in combination with Puppeteer. The library exports two main functions for data extraction: harvestPage(), which extracts a single data block using a CSS query, and harvestPageAll(), which extracts all data blocks matching a given CSS query. Here's a minimal example that extracts all of today's news from the Ukrainska Pravda website. Before running this code, go to the examples folder and run the npm i command. After that, you can run one of the example files with Node.js from that folder:

node news

Let's take a look at the code that gets the news from the site:

// examples/news.js
import { harvestPageAll } from 'js-harvester/puppeteer.js'
import { open, goto } from './utils.js'

const NEWS_QUERY = '.container_sub_news_list div.article_news_list'
const TPL = `
  div{time}
  div
    a{title}`
const page = await open()

await goto(page, async () => page.goto('https://www.pravda.com.ua/news/'))
const news = await harvestPageAll(page, TPL, NEWS_QUERY, { inject: true, dataOnly: true })
console.log(news, '\nPress Ctrl-C to stop...')

The output would be:

[
  {
    time: '13:04',
    title: '"Не зможу жити, якщо не поїду разом з ними". Історія сімох рідних братів, які служать у 3 ОШБр'
  },
  {
    time: '12:44',
    title: 'Ракетний удар по Сумах: кількість поранених зросла до 83, з них 7 дітей'
  },
  { time: '12:30', title: 'Партизани спалили танк на Донеччині' },
  {
    time: '12:23',
    title: '"Б’є всі вікові рекорди": найстарішій у світі горилі виповнилося 68 років'
  },
  ...
]
Press Ctrl-C to stop...

Note the parameters inject and dataOnly. They mean "inject the Harvester library code" and "do not return metadata," respectively. In other words, we're telling the library to insert the JavaScript code from harvester.js into the HTML of the website we're extracting data from, and to return only the data map for each block instead of the full four-element array that also contains maxScore, score, and the metadata tree. Also, keep in mind that calling harvestPageAll() multiple times with the inject: true option will re-insert the library code each time, which can clutter the HTML page. Alternatively, you can inject the library code into the page manually with page.addScriptTag({ path: HARVESTER_PATH }).
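
One way to avoid re-inserting the library on every call is to check first whether harvest() already exists in the page. The helper below is only a sketch: ensureHarvester is a hypothetical name, the path is the default value of the path option described in the API section, and the code assumes that the injected script defines a global harvest() function. page.evaluate() and page.addScriptTag() exist in both Puppeteer and Playwright.

```javascript
// Default location of the library file, as listed for the `path` option.
const HARVESTER_PATH = './node_modules/js-harvester/src/harvester.js'

// Hypothetical helper (not part of js-harvester): adds harvester.js to the page
// only if harvest() is not defined there yet.
// Assumption: the injected script exposes a global harvest() function.
async function ensureHarvester (page, path = HARVESTER_PATH) {
  const present = await page.evaluate(() => typeof globalThis.harvest === 'function')
  if (!present) await page.addScriptTag({ path })
  return !present // true when this call injected the script
}
```

After await ensureHarvester(page), harvestPage() and harvestPageAll() can be called without inject: true. The check has to be repeated after every navigation, because a new document no longer contains the injected script.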

Examples folder

This project contains a few examples of using Harvester with Puppeteer, Playwright, and Node.js.

Examples installation

cd examples
npm i

Running examples

cd examples
node src/puppeteer/rozetka
node src/playwright/news

Main concepts

1. Fuzzy search

The Harvester library implements the concept of "fuzzy tree matching." This means that when searching for all branches of our template in the DOM tree, some of them might not be found and/or might not be in their expected positions. Despite this, the library will attempt to find the best possible match.

Example Scenario

Imagine we are looking for the texts of two tags, title and price, with the following template:

var tpl = `
div{title}
  span{price:float}
`

The DOM tree does not have to match exactly. For example, if the DOM structure looks like this:

<div>
  Some title
  <a href="...">Link is here</a>
  <div>
    <span>123.46</span>
  </div>
</div>

The values title and price will still be found, even though the structure is not an exact match. In the given HTML, only one div contains text, and only one span is inside this div. Moreover, the value inside the span is a floating-point number. This is exactly how the library identifies the data we need.

You might ask: what about performance? A DOM tree can be very large, and if Harvester were to check every single branch, the search could take a long time or get stuck in deeply nested branches. To solve this problem, the score calculation algorithm was introduced.
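
The loose match in the example scenario can be imitated on a plain object tree. This is a deliberately simplified illustration, not the library's algorithm: dom, findDeep, isFloat, and looseMatch are hypothetical names, there is no scoring, and the first acceptable span wins.

```javascript
// Plain-object stand-in for the HTML above: each node has a tag, its own text, and children.
const dom = {
  tag: 'div',
  text: 'Some title',
  children: [
    { tag: 'a', text: 'Link is here', children: [] },
    { tag: 'div', text: '', children: [{ tag: 'span', text: '123.46', children: [] }] }
  ]
}

// Depth-first search for the first descendant, at any depth, accepted by `test`.
function findDeep (node, test) {
  for (const child of node.children) {
    if (test(child)) return child
    const hit = findDeep(child, test)
    if (hit) return hit
  }
  return null
}

const isFloat = (text) => /^-?\d+\.\d+$/.test(text.trim())

// div{title}: the entry point is a div with its own text.
// span{price:float}: a span with float text somewhere below it, not necessarily a direct child.
function looseMatch (root) {
  const found = {}
  if (root.tag === 'div' && root.text) found.title = root.text
  const span = findDeep(root, (n) => n.tag === 'span' && isFloat(n.text))
  if (span) found.price = span.text
  return found
}

console.log(looseMatch(dom)) // { title: 'Some title', price: '123.46' }
```

The span is found although it sits one level deeper than the template suggests, which is the behavior the scenario describes.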

The Key Idea

The further we deviate from the tree structure defined in our template, the fewer points we accumulate. In other words, at a certain depth, continuing the search no longer makes sense because the potential score will be lower than what we achieved in the higher levels of the DOM tree. To put it simply:

The deeper we search from the entry point, the fewer points we collect.
The lower the score, the less likely the data will be found.

2. Entry point DOM element

To extract the data you need, you must specify a so-called entry point. The entry point is a specific DOM element that will be considered the starting (or root) node for the search. This element is passed as the second argument to the harvest() function.

Usually, we use document.querySelector('#product') for this. It's important to note that the search starts from that element and goes downwards. That means if you pass the root of the whole document (the <body> or <html> tag) as the entry point, the data most likely won't be found, because the algorithm searches only "near" the entry point.

Based on that, the recommended approach for setting the entry point is:

Locate the block that contains the data you want.
Create a unique CSS selector for it.
Call querySelector() with that selector.
Pass the obtained DOM element as the second parameter to the harvest() function.
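
The last two steps can be wrapped in a small function. This is a sketch: harvestFrom is a hypothetical helper, and the document object and the harvest function are passed in as arguments so that the code does not depend on a browser being present.

```javascript
// Hypothetical wrapper (not part of js-harvester).
// In a page you would call: harvestFrom('#product', tpl, document, harvest)
function harvestFrom (selector, tpl, doc, harvestFn) {
  const entry = doc.querySelector(selector) // resolve the entry point
  if (!entry) throw new Error(`Entry point not found: ${selector}`)
  return harvestFn(tpl, entry) // the search starts at this element
}
```

Failing early when the selector matches nothing makes a wrong entry point easier to tell apart from a template that simply did not match.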

3. Score

The next concept to understand is the score. The score is how this library determines how well the extracted data matches our template. Scores are calculated separately for each branch. The general formula for calculating the score looks like this: node.score += (subNodesScore + (node.score > 0 ? maxLevel - level : 0)), where maxLevel is the maximum available search depth, level is the current depth level, and subNodesScore is the sum of the scores of all nested nodes. Let's take a look at how the score is calculated:

For each found tag in the DOM tree, the program assigns 1 point.
If the required text is found, another 6 points are added.
If the text type matches the specified type (e.g., int or float), an extra 12 points are given.
Finally, 6 more points are awarded if the required attribute is found on the current tag.

Example

Let's assume we have the following template:

var tpl = `
div{title}
  *{price:float}
`

Here, we are looking for a title inside the text of the div tag and a price inside the text of any tag within the div, where the value must be of type float. To calculate the maximum score (maxScore), we need to go through all the lines of the template and sum up the points for each element from the bottom up:

The * tag always matches: +1 point
If the * tag has text of type float: +12 points
The * tag is located on level 1, so we add (maxLevel - level) = (5 - 1) = 4 points
If there is a div tag above it: +1 point
If this div tag contains text: +6 points
The div tag is located on level 0, so we add (maxLevel - level) = (5 - 0) = 5 points
The div tag sums its own score and the score of its child *, so it gets another +17 points

In total, that gives 29 points. So, maxScore = 29.
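
The same arithmetic can be written as a short recursive function. It is a sketch under stated assumptions, not the library's code: maxScore here is a hypothetical function over a hand-built template tree, maxLevel = 5 is taken from the worked example, and a typed text such as {price:float} is counted as 12 points (not 6 + 12), because that is what the example adds up to.

```javascript
// Point values quoted in the text.
const POINTS = { tag: 1, text: 6, textType: 12, attr: 6 }

// Assumption: typed text counts POINTS.textType only, as in the worked example.
function maxScore (node, maxLevel = 5, level = 0) {
  let own = POINTS.tag
  if (node.textKey) own += node.textType ? POINTS.textType : POINTS.text
  if (node.attr) own += POINTS.attr
  own += maxLevel - level // depth bonus: closer to the entry point means more points
  const children = node.children || []
  // A node's total is its own score plus the totals of all nested nodes.
  return children.reduce((sum, child) => sum + maxScore(child, maxLevel, level + 1), own)
}

// Template: div{title} with a nested *{price:float}
const template = {
  tag: 'div',
  textKey: 'title',
  children: [{ tag: '*', textKey: 'price', textType: 'float' }]
}

console.log(maxScore(template)) // 29
```

The nested * contributes 1 + 12 + 4 = 17, the div contributes 1 + 6 + 5 = 12, and together they give 29.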

During the search, not everything may match exactly. For example, the tag might be different, but the text type matches, or the text might be present, but the required attribute is missing. Harvester will try to find the match that accumulates the highest score, meaning it is the closest match to our template. Essentially, maximizing the score increases the accuracy of finding the required information in the DOM tree.

How does this affect the harvest() function? harvest() returns two kinds of scores:

maxScore – the maximum possible score calculated based on the given template (i.e., the best-case scenario if everything matches perfectly).
score – the actual score obtained during data extraction.

If score matches maxScore, then we have found everything we were looking for. However, in real-world scenarios, score will almost always be lower than maxScore.
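
Because score rarely reaches maxScore, it can help to look at their ratio when deciding whether a result is good enough. The snippet below is a sketch: matchRatio and the 0.7 threshold are assumptions made for illustration, and the sample values (83 and 69) are taken from the second product example in the browser section.

```javascript
// Hypothetical helper: share of the maximum score that was actually matched.
// Expects the four-element array [data, maxScore, score, metaTree].
function matchRatio ([, maxScore, score]) {
  return maxScore > 0 ? score / maxScore : 0
}

const result = [{ name: 'Product Name', price: '99.99', sku: 'SKU: 4278654' }, 83, 69, []]
const ratio = matchRatio(result)
console.log(ratio.toFixed(2)) // 0.83

const MIN_RATIO = 0.7 // assumed threshold, tune it per site
console.log(ratio >= MIN_RATIO ? 'accept' : 'inspect the metadata tree')
```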

4. Template format

To describe the structure of a data template, we need to understand its format. A template consists of lines. For example:

var tpl = "  img{title}[attr=src]"

Each line corresponds to a branch in the DOM tree. Lines can be nested within each other using spaces at the beginning of the line. In our example, there are two spaces before the img tag. If this tag is the first one, the initial number of spaces is interpreted as "no spaces" and is ignored. If we need to describe a nested branch, we do it like this:

var tpl = `
  img{title}[attr=src]
    div{text}
`

In this example, we defined a template consisting of two tags, img and div, where div is inside img. It's important to note that if div is nested inside img, it doesn't necessarily mean it is its first child. This library implements a fuzzy data search, meaning that div may be deeper in the DOM tree. Also, note that we are using a multiline string here. The parser understands this type of string and ignores the empty first and last lines, so it gets only two lines instead of four.

5. Understanding the format of a single line

spaces

As you may have noticed, every two spaces before a tag are interpreted as a "nested element." This is how we determine that the div tag is inside the img tag - since it has two more leading spaces. If you make a mistake and add more spaces than necessary, the parser will throw an error, and the current line will not be included in the final tree.

tag

After the spaces, there is a required element - the tag. If we don't remember which specific tag should be in that place, we can use the special * symbol, which means "any tag". So in this case our template may look like this: *{title}[src=href].

tag's text

Next, there is an optional block for extracting text from the current tag. In the example above, we specified {title}. Curly brackets indicate that we are looking for text inside the img tag, which will be stored under the key "title" in the object returned by the harvest() function. It's important to note that only the specified tag's own text will be returned. If a tag has several text nodes, an array will be returned.

tag's text types

Once again, this parameter is optional, and we can omit it if we don't need the text of the current tag. If we know the type of data contained within the current tag, we can specify it. Harvester supports several basic types:

int - an integer
float - a floating-point number
with - a substring search
func - a custom function for text validation
parent - checks if specified node has a specified parent
str - a string
empty - an empty text
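
To make the types more concrete, here is one plausible reading of the simpler ones as predicate functions. These are hypothetical checks, not the library's implementation; its actual rules (for signs, separators, or whitespace) may differ. The func and parent types are left out because they depend on a global function and on the DOM.

```javascript
// Hypothetical checks, one per text type.
const typeChecks = {
  int: (text) => /^-?\d+$/.test(text.trim()),
  float: (text) => /^-?\d+\.\d+$/.test(text.trim()),
  with: (text, arg) => text.includes(arg), // substring search
  str: (text) => text.trim().length > 0,
  empty: (text) => text.trim() === ''
}

console.log(typeChecks.float('99.99')) // true
console.log(typeChecks.int('99.99')) // false
console.log(typeChecks.with('Product Name', 'Name')) // true
console.log(typeChecks.empty('   ')) // true
```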

Let's look at an example of using these types. Suppose we want to find a title that contains the word "Big". The template line will look like this:

var tpl = "  img{title:with:Big}"

Notice that we omitted the square brackets in this example.

If we want to find a title inside an img tag that is an integer, our template will be:

var tpl = "  img{title:int}"

Finally, if we are looking for something specific and have a special function to validate the title, the template will be:

var tpl = "  img{title:func:checkTitle}"

In this example, checkTitle must exist in the global scope before the harvest() function is called, like this: window.checkTitle = function checkTitle() {...}
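
A validator for this template line could look like the sketch below. The (text, element) argument order follows the custom-function example in the API section; the rule itself (non-empty and at most 80 characters) is invented for illustration.

```javascript
// Example validator for the template line img{title:func:checkTitle}.
// Returns true when the text is acceptable as a title.
function checkTitle (text, el) {
  const t = (text || '').trim()
  return t.length > 0 && t.length <= 80 // made-up rule, replace with your own
}

// In a browser, globalThis is window, so this is the same as window.checkTitle = checkTitle.
globalThis.checkTitle = checkTitle

console.log(checkTitle('Big sale')) // true
console.log(checkTitle('   ')) // false
```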

attributes

The next optional parameter is the search for a tag’s attribute value. For example, the img tag may have an src attribute, and to extract its value, we need to create the following template:

var tpl = "  img[attr=src]"

In this example, the value of the src attribute of the img tag will be extracted and stored under the key "attr" in the object returned by the harvest() function.
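
Putting the parts of a line together, a template could be parsed roughly as follows. This is a hypothetical parser written for illustration, not the library's implementation: parseTemplate, the LINE pattern, and the field names of the returned objects are all assumptions.

```javascript
// spaces, tag, optional {key[:type[:arg]]}, optional [key=attribute]
const LINE = /^( *)([\w*-]+)(?:\{([^}]*)\})?(?:\[([^\]=]+)=([^\]]+)\])?\s*$/

function parseTemplate (tpl, spaceAmount = 2) {
  // Empty lines (such as the first and last line of a multiline string) are dropped.
  const lines = tpl.split('\n').filter((line) => line.trim() !== '')
  let base = 0 // indentation of the first line, treated as "no spaces"
  return lines.map((line, i) => {
    const m = LINE.exec(line)
    if (!m) throw new Error(`Cannot parse template line: "${line}"`)
    const [, indent, tag, text, attrKey, attrName] = m
    if (i === 0) base = indent.length
    const level = (indent.length - base) / spaceAmount
    if (!Number.isInteger(level) || level < 0) throw new Error(`Bad indentation: "${line}"`)
    const node = { level, tag }
    if (text !== undefined) {
      const [key, type, ...arg] = text.split(':')
      node.textKey = key
      if (type) node.textType = type
      if (arg.length) node.textArg = arg.join(':')
    }
    if (attrKey) node.attr = { key: attrKey, name: attrName }
    return node
  })
}

const tpl = `
  img{title:with:Big}[attr=src]
    div{text}
`
console.log(parseTemplate(tpl))
```

For this template the sketch yields two nodes: img on level 0 with the text key "title", the type "with", the argument "Big", and the attribute src stored under "attr", followed by div on level 1 with the text key "text".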

API

harvest()

The Harvester library uses just a single function to do all its work. It must be called in the browser context (inside the page), not in the Node.js code that drives Puppeteer or a similar library.

harvest(tpl, element, options = {}): [data, maxScore, score, metaTree] | data

It finds matches between a pseudo-tree and the DOM structure, extracting text and attributes. It takes three parameters: a text-based template describing the structure of the data, a DOM element corresponding to the first specified tag (or a CSS query that selects such an element), and options. The options have this format (with default values):

const options = {
  spaceAmount: 2,       // Number of spaces in a template between a parent and its child
  executionTime: 15000, // Maximum search time. The search is stopped automatically when the time runs out
  minDepth: 3,          // The minimum depth to which we will descend or ascend during the search. This value is not linear: it represents the number of recursive calls made by the search function while traversing the DOM tree
  tagScore: 1,          // Points added if the tag equals the one we are looking for
  tagTextScore: 6,      // Points added if the tag's text exists and we are looking for it
  tagAttrScore: 6,      // The same, but for an attribute
  tagTextTypeScore: 12, // Points added if the tag and text exist and the text has the type we are looking for
  dataOnly: false       // true means that only a data map is returned instead of an array of four elements
}

By default, the function returns an array of four elements (with dataOnly: true, only data is returned):

data - A hash map containing all the extracted data, organized by the keys specified in the template.
maxScore - The maximum possible score based on the given template.
score - The actual score obtained based on the search for template elements.
metaTree - A metadata tree for debugging.
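
Since the return shape depends on dataOnly, calling code may want to handle both forms uniformly. A minimal sketch, assuming a hypothetical helper named normalizeResult:

```javascript
// Hypothetical helper: accepts either return shape of harvest() and always
// returns an object with named fields.
function normalizeResult (result) {
  if (Array.isArray(result)) {
    const [data, maxScore, score, metaTree] = result
    return { data, maxScore, score, metaTree }
  }
  return { data: result } // dataOnly: true, so scores and metadata are not available
}

console.log(normalizeResult([{ title: 'A' }, 29, 17, []]).score) // 17
console.log(normalizeResult({ title: 'A' }).data.title) // A
```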

minDepth config

Sometimes the search doesn't find the data you've specified in the template. In such cases, it makes sense to increase the search depth using the minDepth configuration like this:

harvest(tpl, '#product', { minDepth: 30 })

This parameter controls how deeply the algorithm will search for your data within the DOM tree. For websites with complex structures, increasing this value can help locate the desired elements, but it may also increase search time. Here is an example of parsing the Amazon site with this config.

harvestPage()

This function is specifically designed for programs using Puppeteer and Playwright. It extracts a single data block based on the provided template, CSS query, and configuration, and returns a Promise. The arguments are the same as for the harvest() function, except for page, which is the Puppeteer or Playwright page instance we are working on.

harvestPage(page, tpl, query, opt = {}): Promise<[data, maxScore, score, metaTree]> | Promise<data>

harvestPage() supports a few configuration parameters of its own. It's important to know that the options object is also passed on to the harvest() function. Here are the config parameters that only this function supports:

const options = {
  inject: false,       // true means: inject the harvester.js file into the destination HTML page
  path: './node_modules/js-harvester/src/harvester.js', // path the library uses to find the harvester.js file in your node_modules folder
  amount: undefined    // number of elements to return. undefined means all elements
}

It returns the same result as the harvest() function, wrapped in a Promise: an array of four elements or a data map. Here is a usage example:

import { harvestPage } from 'js-harvester/puppeteer.js'
...
const NEWS_QUERY = '.container_sub_news_list div.article_news_list'
const TPL = `
  div{time}
  div
    a{title}`
const news = await harvestPage(page, TPL, NEWS_QUERY, { inject: true, dataOnly: true })
console.log(news)

The output:

{
  time: '17:40',
  title: 'В Латвії заявили про необхідність мати на озброєнні "великі системи"'
}

custom functions

If your code uses custom functions inside the template, they must be present on the HTML page before you call harvestPage(). You can do it like this:

await page.evaluate(() => {
  window.price = function price(t, el) { return el?.className === 'my-class' && t?.indexOf('$') > -1 }
})
const products = await harvestPageAll(page, TPL, PRODUCTS_QUERY, { inject: true, dataOnly: true, minDepth: 60 })
console.log(products)

harvestPageAll()

This function is also specifically designed for programs using Puppeteer and Playwright. It extracts all found data blocks based on the provided template, CSS query, and configuration, and returns a Promise. The arguments are the same as for the harvest() function, except for page, which is the Puppeteer or Playwright page instance we are working on.

harvestPageAll(page, tpl, query, opt = {}): Promise<Array<[data, maxScore, score, metaTree]>> | Promise<Array<data>>

All other parameters and behavior of this function, except for the return value, are the same as for harvestPage(). Here is a small usage example:

import { harvestPageAll } from 'js-harvester/puppeteer.js'
...
const NEWS_QUERY = '.container_sub_news_list div.article_news_list'
const TPL = `
  div{time}
  div
    a{title}`
const news = await harvestPageAll(page, TPL, NEWS_QUERY, { inject: true, dataOnly: true })
console.log(news)

The output:

[
  {
    time: '17:40',
    title: 'В Латвії заявили про необхідність мати на озброєнні "великі системи"'
  },
  {
    time: '17:07',
    title: 'ЄС засудив Росію за відкриття прямого авіасполучення з невизнаною Абхазією'
  },
  {
    time: '17:06',
    title: 'Вчені знайшли нове джерело золота в космосі: воно могло існувати ще за раннього Всесвіту'
  },
  ...
]

Custom functions work in the same way as for the harvestPage() function.

Run tests

npm i
npm test

Run linter

Run the standardjs linter to find possible issues in the code.

npm run lint

Contribution

Just create an issue or PR and we will start a conversation...

License

MIT
