Search Crawl Job Data

By default, crawl job data is downloadable as a single JSON or CSV dump. If you need only a subset of the data, or a subset spanning several crawl jobs, you can query for it with DQL.

Searching crawl jobs over DQL uses the same endpoint as Knowledge Graph DQL Search. Feel free to consult the API Reference directly.

Using DQL to search over crawl collections does not consume any credits.

What is DQL?

DQL is short for Diffbot Query Language. It is a structured query language custom-built by Diffbot to query graph-structured databases such as the Diffbot Knowledge Graph or crawl jobs (which behave like small Knowledge Graphs). The syntax is designed to be minimal and resembles JSON.

The links below are helpful references for learning DQL. You can also skip them and follow the rest of this documentation, which shows some basic DQL queries over crawl jobs.

Quickstart

The following example queries a crawl job called diffbot, a hypothetical crawl of diffbot.com, for article entities. DQL caps the number of records returned at 50 by default. To return every matching record in the crawl job, set size to -1.

import requests

# Replace YOUR_DIFFBOT_TOKEN with your own token.
url = "https://kg.diffbot.com/kg/v3/dql?token=YOUR_DIFFBOT_TOKEN"

params = {
  "type": "crawl",          # search a crawl job
  "query": "type:Article",  # DQL query: article entities only
  "col": "diffbot",         # name of the crawl job
  "size": "-1",             # -1 lifts the default cap of 50 records
}

headers = {
  "Content-Type": "application/json",
  "Accept": "application/json"
}

response = requests.get(url, params=params, headers=headers)

print(response.text)

const headers = new Headers();
headers.append("Content-Type", "application/json");
headers.append("Accept", "application/json");

// This variant searches two hypothetical crawl jobs (winemore and bevmo)
// at once by passing a comma-separated list in "col".
const params = {
  "type": "crawl",
  "query": "title:'Riesling'",
  "col": "winemore,bevmo",
  "size": "-1",
};
const queryString = new URLSearchParams(params).toString();

const requestOptions = {
  method: "GET",
  headers: headers
};

fetch(`https://kg.diffbot.com/kg/v3/dql?token=YOUR_DIFFBOT_TOKEN&${queryString}`, requestOptions)
  .then((response) => response.text())
  .then((result) => console.log(result))
  .catch((error) => console.error(error));

# Same two-job query (winemore and bevmo), with the parameters URL-encoded
curl --request GET \
     --url 'https://kg.diffbot.com/kg/v3/dql?token=YOUR_DIFFBOT_TOKEN&type=crawl&query=title%3A%27Riesling%27&col=winemore%2Cbevmo&from=0&size=-1' \
     --header 'accept: application/json'

On success, the response will look something like this.

{
    "version": 3,
    "hits": 62,
    "results": 0,
    "kgversion": "428",
    "diffbot_type": "Article",
    "facet": false,
    "giQuery": "type:Article gbtype:json",
    "end": -1,
    "data": [{...}]
}
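
As a minimal sketch of reading such a response in Python, the snippet below parses a payload shaped like the one above and compares hits with the number of records in data. The sample payload and the summarize helper are illustrative assumptions, not part of the Diffbot API, and the placeholder record stands in for real entity data.

```python
import json

# Hypothetical payload mirroring the fields shown above; the single "data"
# record is a placeholder, not real crawl output.
SAMPLE_RESPONSE = """
{
    "version": 3,
    "hits": 62,
    "results": 0,
    "kgversion": "428",
    "diffbot_type": "Article",
    "facet": false,
    "giQuery": "type:Article gbtype:json",
    "end": -1,
    "data": [{"placeholder": "record"}]
}
"""


def summarize(payload):
    """Return (total matches reported, records included in this response)."""
    hits = payload.get("hits", 0)
    records = payload.get("data", [])
    return hits, len(records)


payload = json.loads(SAMPLE_RESPONSE)
hits, returned = summarize(payload)
print(f"{hits} matching records, {returned} included in this response")
```

With a live request, the same helper could be applied to response.json() instead of the sample string.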

For more helpful DQL over Crawl examples, see Search a Crawl/Bulk job using DQL.
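
The JavaScript and cURL requests above search two crawl jobs at once by passing a comma-separated list in col. A small helper along these lines could build the same parameters in Python. The name build_crawl_params is hypothetical, and the sketch assumes job names contain no commas.

```python
from urllib.parse import urlencode


def build_crawl_params(query, collections, size=-1):
    """Build DQL query parameters for one or more crawl jobs.

    `collections` is a list of crawl job names; they are joined with commas
    for the `col` parameter, as in the JavaScript and cURL requests above.
    """
    return {
        "type": "crawl",
        "query": query,
        "col": ",".join(collections),
        "size": str(size),
    }


params = build_crawl_params("title:'Riesling'", ["winemore", "bevmo"])

# Show the URL-encoded form that would be appended to the endpoint URL.
print(urlencode(params))
```

The resulting dictionary can be passed as params to requests.get, which handles the URL encoding itself.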

Error Responses

Attempting to query a crawl job that does not exist will return a 404 error along with the following message.

{
  "error": true,
  "message": "Collections not found: YOUR_DIFFBOT_TOKEN-name_of_nonexistent_job"
}

Querying an empty crawl job, or using a DQL query that does not match any data in the job, will return a 200 status code, a hits attribute set to 0, and an empty data array.

{
    "version": 3,
    "hits": 0,
    "results": 0,
    "kgversion": "428",
    "diffbot_type": "Article",
    "facet": false,
    "giQuery": "type:Article gbtype:json title:\"A Made Up Title That Should Not Exist\"",
    "end": -1,
    "data": []
}
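
The two cases above can be told apart in client code. The following is a rough sketch, assuming only the status codes and fields described in this section; interpret_response is a hypothetical helper, and treating any other non-200 status as a generic error is an assumption rather than documented behavior.

```python
def interpret_response(status_code, payload):
    """Classify a DQL crawl search result.

    Returns "missing_collection", "no_matches", "ok", or "error".
    """
    if status_code == 404:
        # The crawl job named in `col` does not exist.
        return "missing_collection"
    if status_code != 200 or payload.get("error"):
        # Assumption: anything else that is not a 200 is a generic failure.
        return "error"
    if payload.get("hits", 0) == 0 and not payload.get("data"):
        # Valid request, but the job is empty or nothing matched the query.
        return "no_matches"
    return "ok"


missing = {
    "error": True,
    "message": "Collections not found: YOUR_DIFFBOT_TOKEN-name_of_nonexistent_job",
}
empty = {"version": 3, "hits": 0, "results": 0, "data": []}

print(interpret_response(404, missing))
print(interpret_response(200, empty))
```

With the requests library, the arguments would typically be response.status_code and response.json().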

Pro Tips and Best Practices

Crawls can be searched as soon as they are live. This can be helpful in use cases where you wish to run a crawl job until a certain page or type of data is found (e.g. crawling a site for a privacy policy).
DQL can also be used to search and query Bulk Extract jobs.
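
Because a live crawl can be searched while it runs, one way to wait for a particular page is to poll the same DQL query until it reports a match. The sketch below illustrates the idea with simulated hit counts instead of real requests; wait_for_match, the polling interval, and the attempt limit are all illustrative assumptions.

```python
import time


def wait_for_match(fetch_hits, interval_seconds=30, max_attempts=20, sleep=time.sleep):
    """Poll until fetch_hits() reports at least one match.

    Returns the attempt number on which a match was found, or None if
    max_attempts is reached first.
    """
    for attempt in range(1, max_attempts + 1):
        if fetch_hits() > 0:
            return attempt
        if attempt < max_attempts:
            sleep(interval_seconds)
    return None


# Simulated hit counts standing in for successive DQL responses.
simulated = iter([0, 0, 3])
found_on = wait_for_match(lambda: next(simulated), interval_seconds=0)
print(f"Match found on attempt {found_on}")
```

In practice, fetch_hits would wrap the Quickstart request and return the hits value from the JSON response.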