v1.13.1-SNAPSHOT · Published: Dec 16, 2025 · License: Apache-2.0
Go-Trafilatura

Go-Trafilatura is a Go package and command-line tool which seamlessly downloads, parses, and scrapes web page data: it can extract metadata, main body text and comments while preserving parts of the text formatting and page structure.

As implied by its name, this package is based on Trafilatura which is a Python package created by Adrien Barbaresi. We decided to port this package because, based on the ScrapingHub benchmark available at the time of creation, Trafilatura was the most efficient open-source article extractor. This is especially impressive considering Trafilatura's code robustness: it achieves this performance with only about 4,000 lines of Python code across 26 files. In comparison, Dom Distiller requires approximately 17,000 lines of code in 148 files.

The package's structure closely mirrors the original Python code. This alignment not only simplifies the implementation of future improvements but also ensures that any web page parsable by the original Trafilatura should yield identical results with this package.

Status

This package is stable enough for use and up to date with the original Trafilatura v2.0.0 (commit c6e8340).

There are some differences between this port and the original Trafilatura:

  • In the original, metadata from JSON+LD is extracted using regular expressions, while in this port it's done using a JSON parser. Thanks to this, our metadata extraction is more accurate than the original's, but it will skip metadata stored in invalidly formatted JSON.
  • In the original, python-readability and justext are used as fallback extractors. In this port we use go-readability and go-domdistiller instead, so there will be some differences in extraction results between our port and the original.
  • In our port you can also specify custom fallback extractors, so you are not limited to the default ones.
  • The main output of the original Trafilatura is XML, while in our port the main output is HTML. Because of this, there are some differences in the handling of formatting tags (e.g. <b>, <i>) and paragraphs.

Usage as Go package

Run the following command inside your Go project:

go get -u -v github.com/markusmobius/go-trafilatura

Next, import it in your application:

import "github.com/markusmobius/go-trafilatura"

Now you can use Trafilatura to extract the content of a web page. For basic usage, check the examples.

Usage as CLI Application

To use the CLI, you need to build it from source. Make sure you use Go >= 1.16, then run the following command:

go get -u -v github.com/markusmobius/go-trafilatura/cmd/go-trafilatura

Once installed, you can use it from your terminal:

$ go-trafilatura -h
Extract readable content from a specified source, which can be either an HTML file or a URL.
It also supports batch downloading URLs from a file containing a list of URLs,
RSS feeds and sitemaps.

Usage:
  go-trafilatura [flags] [source]
  go-trafilatura [command]

Available Commands:
  batch       Download and extract pages from a list of URLs specified in a file
  feed        Download and extract pages from a feed
  help        Help about any command
  sitemap     Download and extract pages from a sitemap

Flags:
      --deduplicate         filter out duplicate segments and sections
  -f, --format string       output format for the extract result, either 'html' (default), 'txt' or 'json'
      --has-metadata        only output documents with title, URL and date
  -h, --help                help for go-trafilatura
      --images              include images in extraction result (experimental)
  -l, --language string     target language (ISO 639-1 codes)
      --links               keep links in extraction result (experimental)
      --no-comments         exclude comments from extraction result
      --no-fallback         disable fallback extraction using readability and dom-distiller
      --no-tables           exclude tables from extraction result
      --skip-tls            skip X.509 (TLS) certificate verification
  -t, --timeout int         timeout for downloading web page in seconds (default 30)
  -u, --user-agent string   set custom user agent (default "Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0")
  -v, --verbose             enable log message

Use "go-trafilatura [command] --help" for more information about a command

Here are some examples of common usage:

  • Fetch readable content from a specified URL

    go-trafilatura http://www.domain.com/some/path
    

    The output will be printed to stdout.

  • Use the batch command to fetch readable content from a file that contains a list of URLs. Say we have a file named input.txt with the following content:

    http://www.domain1.com/some/path
    http://www.domain2.com/some/path
    http://www.domain3.com/some/path
    

    We want to fetch them and save the results in the directory extract. To do so, we can run:

    go-trafilatura batch -o extract input.txt
    
  • Use the sitemap command to crawl a sitemap, then fetch all web pages listed under it. We can explicitly specify the sitemap:

    go-trafilatura sitemap -o extract http://www.domain.com/sitemap.xml
    

    Or you can just put the domain and let Trafilatura look for the sitemap:

    go-trafilatura sitemap -o extract http://www.domain.com
    
  • Use the feed command to crawl an RSS or Atom feed, then fetch all web pages listed under it. We can explicitly specify the feed URL:

    go-trafilatura feed -o extract http://www.domain.com/feed-rss.php
    

    Or you can just put the domain and let Trafilatura look for the feed URL:

    go-trafilatura feed -o extract http://www.domain.com
    

Performance

This package and its dependencies make heavy use of regular expressions for various purposes. Unfortunately, as is commonly known, Go's regular expression engine is comparatively slow. There are two main reasons:

  • The regex engines in other languages are usually implemented in C, while Go's is implemented from scratch in pure Go. As expected, the C implementations are still faster.
  • Since Go is commonly used for web services, its regex engine is designed to run in time linear to the length of the input, which protects servers from ReDoS attacks. However, this guarantee comes with a performance cost.

To work around this, we compile several important regexes into Go code using re2go. Thanks to this, we achieve greater speed without resorting to cgo or external regex packages.
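The cost of regex handling is easy to demonstrate with the standard regexp package alone: even just compiling a pattern once and reusing it avoids most of the overhead, and re2go takes the same idea further by generating the matcher at build time. The pattern below is only an illustration, not one of the package's real regexes:

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// benchmarkRegex matches a date pattern over the inputs twice: once compiling
// the pattern on every use, once reusing a precompiled matcher. It returns the
// number of matches and whether the precompiled run was faster.
func benchmarkRegex(inputs []string) (matches int, precompiledFaster bool) {
	pattern := `\d{4}-\d{2}-\d{2}`

	start := time.Now()
	for _, s := range inputs {
		regexp.MustCompile(pattern).MatchString(s) // recompiled every iteration
	}
	naive := time.Since(start)

	re := regexp.MustCompile(pattern) // compiled once, reused everywhere
	start = time.Now()
	for _, s := range inputs {
		if re.MatchString(s) {
			matches++
		}
	}
	return matches, time.Since(start) < naive
}

func main() {
	inputs := make([]string, 5000)
	for i := range inputs {
		inputs[i] = fmt.Sprintf("published on 2024-05-%02d", i%28+1)
	}
	n, faster := benchmarkRegex(inputs)
	fmt.Println(n, faster)
}
```

re2go removes even the one-time compilation and the generic matching loop, replacing both with specialized generated Go code.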

Comparison with Other Go Packages

As far as we know, there are currently three content extractors built for Go: go-readability, go-domdistiller and this package, go-trafilatura.

Since every extractor uses its own algorithm, their results differ a bit. In general they all give satisfactory results, but there are cases where one extractor does better than the others and vice versa. Here is a short summary of the pros and cons of each extractor:

Dom Distiller:

  • Very fast.
  • Good at extracting images from articles.
  • Able to find the next page on sites that split an article into several partial pages.
  • Since the original library was embedded in the Chromium browser, its tests are pretty thorough.
  • CON: has a huge codebase, mostly because it mimics the original Java code.
  • CON: the original library is no longer maintained and has been archived.

Readability:

  • Fast, although not as fast as Dom Distiller.
  • Better than Dom Distiller at extracting wiki and documentation pages.
  • The original library, Readability.js, is still actively used and maintained by Firefox.
  • The codebase is pretty small.
  • CON: the unit tests are not as thorough as those of the other extractors.

Trafilatura:

  • Has the best accuracy compared to the other extractors.
  • Better at extracting a web page's metadata, including its language and publish date.
  • Its unit tests are thorough, focused on removing noise while making sure the real content is still captured.
  • Designed to be used in academic domains, e.g. natural language processing.
  • Actively maintained, with new releases almost every month.
  • CON: slower than the other extractors, mostly because it also looks for the language and publish date.
  • CON: not very good at extracting images.

Here we compare the extraction results of go-trafilatura, go-readability and go-domdistiller, using each extractor to process the test documents in a single thread. To reproduce this test, clone this repository then run:

go run scripts/comparison/*.go content

For the test, we use 960 documents taken from various sources (2025-05-01). Here is the result when tested in my PC (AMD Ryzen 5 7535HS @ 4.6GHz, RAM 16 GB):

Package                        Precision  Recall  Accuracy  F-Score  Time (s)
go-readability                   0.871     0.891    0.880     0.881      2.87
go-domdistiller                  0.873     0.872    0.873     0.872      2.66
go-trafilatura                   0.912     0.897    0.906     0.904      4.25
go-trafilatura with fallback     0.909     0.921    0.914     0.915      8.39

Comparison with Original Trafilatura

Here is the result when compared with the original Trafilatura v1.12.2:

Package                                 Precision  Recall  Accuracy  F-Score  Time (s)
trafilatura                               0.918     0.898    0.909     0.908     10.38
trafilatura + fallback                    0.919     0.915    0.917     0.917     14.53
trafilatura + fallback + precision        0.932     0.889    0.912     0.910     19.34
trafilatura + fallback + recall           0.907     0.919    0.913     0.913     11.63
go-trafilatura                            0.912     0.897    0.906     0.904      4.25
go-trafilatura + fallback                 0.909     0.921    0.914     0.915      8.39
go-trafilatura + fallback + precision     0.921     0.900    0.912     0.910      7.68
go-trafilatura + fallback + recall        0.893     0.927    0.908     0.910      6.43

As the table demonstrates, the performance of our port is nearly identical to the original Trafilatura. This parity is achieved because the code was ported almost line-by-line from Python to Go (excluding minor, previously mentioned differences). We attribute the small remaining performance gap not to incorrect porting, but rather to our use of different fallback extractors than those in the original implementation.

Regarding speed, our Go port is significantly faster than the original. This is largely due to our use of re2go, which compiles several critical regular expressions ahead of time into native Go code. This approach allows us to avoid the typical performance overhead associated with standard Go regex libraries.

Furthermore, this package is thread-safe (based on our current testing). Depending on your application's needs, you can leverage this concurrency for substantial additional speed gains. For example, here are the results achieved on my PC when the comparison script was run concurrently across all available threads:

go run scripts/comparison/*.go content -j -1
Package                                 Time (s)
go-trafilatura                             0.931
go-trafilatura + fallback                  1.976
go-trafilatura + fallback + precision      1.856
go-trafilatura + fallback + recall         1.599

Acknowledgements

This package wouldn't exist without the efforts of Adrien Barbaresi, the author of the original Python package. He created Trafilatura as part of an effort to build text databases for research and to facilitate better text data collection, which leads to better corpus quality. For more information:

@inproceedings{barbaresi-2021-trafilatura,
  title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
  author = "Barbaresi, Adrien",
  booktitle = "Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
  pages = "122--131",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-demo.15",
  year = 2021,
}

License

Like the original, go-trafilatura is distributed under the Apache v2.0 license.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CreateReadableDocument

func CreateReadableDocument(extract *ExtractResult) *html.Node

CreateReadableDocument is a helper function that converts the extract result into a single HTML document, complete with its metadata and comments (if they exist).

Types

type Config

type Config struct {
	// Deduplication config
	CacheSize             int
	MaxDuplicateCount     int
	MinDuplicateCheckSize int

	// Extraction size setting
	MinExtractedSize        int
	MinExtractedCommentSize int
	MinOutputSize           int
	MinOutputCommentSize    int
}

Config is an advanced setting to fine-tune the extraction result. You can use it to specify the minimal size of the extracted content and how much duplicate text is allowed. However, most of the time the default config should be good enough.

func DefaultConfig

func DefaultConfig() *Config

DefaultConfig returns the default configuration value.

type ExtractResult

type ExtractResult struct {
	// ContentNode is the extracted content as a `html.Node`.
	ContentNode *html.Node

	// CommentsNode is the extracted comments as a `html.Node`.
	// Will be nil if `ExcludeComments` in `Options` is set to true.
	CommentsNode *html.Node

	// ContentText is the extracted content as a plain text.
	ContentText string

	// CommentsText is the extracted comments as a plain text.
	// Will be empty if `ExcludeComments` in `Options` is set to true.
	CommentsText string

	// Metadata is the extracted metadata, which is taken from several sources, i.e.
	// <meta> tags, JSON+LD and the OpenGraph scheme.
	Metadata Metadata

	// URLs are the extracted hyperlinks from the document (<a href> elements).
	URLs []nurl.URL
}

ExtractResult is the result of content extraction.

func Extract

func Extract(r io.Reader, opts Options) (*ExtractResult, error)

Extract parses a reader and finds the main readable content.

func ExtractDocument

func ExtractDocument(doc *html.Node, opts Options) (*ExtractResult, error)

ExtractDocument parses the specified document and finds the main readable content.

type ExtractionFocus

type ExtractionFocus uint8

ExtractionFocus specifies the focus of extraction.

const (
	// Balanced is the middle ground.
	Balanced ExtractionFocus = iota

	// FavorRecall makes the extractor extract more text, even when unsure.
	FavorRecall

	// FavorPrecision makes the extractor extract less text, but usually more precisely.
	FavorPrecision
)

type FallbackCandidates

type FallbackCandidates struct {
	// Readability is the user-specified extraction result from Go-Readability
	// that will be used as a fallback candidate.
	Readability *html.Node

	// Distiller is the user-specified extraction result from Go-DomDistiller
	// that will be used as a fallback candidate.
	Distiller *html.Node

	// Others is a list of user-specified extraction results that will be used as
	// candidates, generated manually by the user using methods other than
	// Go-Readability and Go-DomDistiller.
	//
	// This list will be prioritized before Readability and Distiller.
	//
	// Make sure not to put the output of Go-Readability or Go-DomDistiller here, to
	// prevent those two extractors from running twice.
	Others []*html.Node
}

FallbackCandidates allows specifying a list of fallback candidates, in particular Readability and Dom Distiller.

type HtmlDateMode

type HtmlDateMode uint8

HtmlDateMode specifies the mode of the publish date extractor that uses the HtmlDate package.

const (
	// In Default mode, HtmlDate will run based on whether fallback is enabled or not.
	// If fallback is enabled, HtmlDate runs in `Extensive` mode; if fallback is
	// disabled, HtmlDate runs in `Fast` mode.
	Default HtmlDateMode = iota

	// In Fast mode, the publish date will be extracted from the entire document using
	// HtmlDate, but without the external DateParser package. Thanks to this the date
	// extraction is quite fast, but it can't detect date strings in non-English languages.
	Fast

	// In Extensive mode, the publish date will be extracted from the entire document
	// using HtmlDate together with the external DateParser package. Thanks to this the
	// date extraction is pretty accurate and can detect foreign languages, but it uses
	// a lot of regexes, which are slow in Go.
	Extensive

	// If Disabled, the publish date will only be extracted from metadata and not scanned
	// from the entire document. Thanks to this, content extraction will be fast, but the
	// publish date might be missing or inaccurate. Use it if you only care about the
	// content and not the publish date.
	Disabled
)

type Metadata

type Metadata struct {
	Title       string
	Author      string
	URL         string
	Hostname    string
	Description string
	Sitename    string
	Date        time.Time
	Categories  []string
	Tags        []string
	ID          string
	Fingerprint string
	License     string
	Language    string
	Image       string
	PageType    string
}

Metadata is the metadata of the page.

type Options

type Options struct {
	// Config is the advanced configuration to fine-tune the
	// extraction result. Keep it nil to use the default config.
	Config *Config

	// OriginalURL is the original URL of the page. It might be overwritten by the URL in metadata.
	OriginalURL *nurl.URL

	// TargetLanguage is an ISO 639-1 language code; when specified, the extractor only
	// processes web pages that use that language.
	TargetLanguage string

	// If EnableFallback is true, then whenever Trafilatura fails to extract a document,
	// it will use algorithms from other packages, i.e. Readability and Dom Distiller.
	// This makes the extraction result more precise, but also a bit slower.
	EnableFallback bool

	// FallbackCandidates are user-specified candidates that will be checked by Trafilatura
	// when EnableFallback is set to true. This is useful if the user has already run
	// Readability or Dom Distiller, or wants to provide their own candidates. As mentioned
	// before, it is only used when `EnableFallback = true`.
	FallbackCandidates *FallbackCandidates

	// Focus specifies the extraction behavior of Trafilatura.
	Focus ExtractionFocus

	// ExcludeComments specifies whether to exclude comments from the extraction result.
	ExcludeComments bool

	// ExcludeTables specifies whether to exclude information within HTML <table> elements.
	ExcludeTables bool

	// IncludeImages specifies whether the extraction result will include images (experimental).
	IncludeImages bool

	// IncludeLinks specifies whether the extraction result will include links along with
	// their targets (experimental).
	IncludeLinks bool

	// IncludeLinksOnly specifies whether the extraction result will include only links along
	// with their targets (experimental) and no other content.
	IncludeLinksOnly bool

	// BlacklistedAuthors is a list of author names to be excluded from the extraction result.
	BlacklistedAuthors []string

	// Deduplicate specifies whether to remove duplicate segments and sections.
	Deduplicate bool

	// HasEssentialMetadata makes the extractor keep only documents featuring all essential
	// metadata (date, title, URL).
	HasEssentialMetadata bool

	// MaxTreeSize specifies the max number of elements inside a document.
	// Documents that surpass this value will be discarded.
	MaxTreeSize int

	// EnableLog specifies whether logging should be enabled.
	EnableLog bool

	// HtmlDateMode specifies the behaviour of the external HtmlDate package that is used
	// to extract the publish date from a web page.
	HtmlDateMode HtmlDateMode

	// HtmlDateOptions is a user-provided configuration for the external `go-htmldate`
	// package that is used to look for the publish date of a web page. If this property
	// is specified, `HtmlDateMode` will be ignored.
	HtmlDateOptions *htmldate.Options

	// HtmlDateOverride is a user-provided extraction result from the `go-htmldate` package.
	// If this property is specified, HtmlDate won't be run; this property will be used as
	// its result instead. In other words, `HtmlDateMode` and `HtmlDateOptions` will be ignored.
	HtmlDateOverride *htmldate.Result

	// PruneSelector is the CSS selector used to select nodes to be pruned before extraction.
	PruneSelector string
}

Options is configuration for the extractor.

type SchemaData

type SchemaData struct {
	Types      []string
	Data       map[string]any
	Importance float64
	Parent     *SchemaData
}

Directories

Path Synopsis
cmd
go-trafilatura command
examples
chained command
from-file command
from-url command
internal
lru
re2go
Code generated by re2c 3.1, DO NOT EDIT.
scripts
comparison command
