This project is not covered by Drupal’s security advisory policy.

Clean up messy HTML from any source — extract the content you want, strip ads and noise, fix links, and sanitize — before you store, convert, or index it.

Feed a Markdown converter or search index a raw web page and you get navigation, ads, and boilerplate in the result. HTML Processor removes that noise first, so downstream output stays clean and token-efficient. It runs a configurable pipeline from a single service call or a saved default, and is standalone: no other Drupal modules required, just a few Symfony/League libraries that Composer installs for you.

What it does

  • Extracts the content you want with CSS selectors (article, #main-content) and drops the rest.
  • Removes ads and boilerplate — built-in patterns for common networks, plus your own.
  • Strips unwanted fragments with admin-trusted regex (guarded against ReDoS).
  • Rewrites relative links and images to absolute URLs so they keep working.
  • Sanitizes elements and attributes via the Symfony HTML Sanitizer.
  • Shapes output — wrap as a full document, or minify.
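
To illustrate the link-rewriting step, here is a minimal sketch of how a relative URL can be resolved against a page's base URL. It is plain PHP written for this page, not the module's implementation (which relies on league/uri); the function name absolutize() is hypothetical, and the sketch deliberately skips ../ segments and query-only links.

```php
<?php
// Hypothetical helper: resolve $href against $base for the common cases.
function absolutize(string $href, string $base): string {
  // Leave empty values, fragment-only links, and anything that already
  // has a scheme (https:, mailto:, data:) untouched.
  if ($href === '' || $href[0] === '#' || preg_match('/^[a-z][a-z0-9+.\-]*:/i', $href)) {
    return $href;
  }
  $parts = parse_url($base);
  $origin = $parts['scheme'] . '://' . $parts['host']
    . (isset($parts['port']) ? ':' . $parts['port'] : '');
  // Protocol-relative: reuse the base scheme.
  if (str_starts_with($href, '//')) {
    return $parts['scheme'] . ':' . $href;
  }
  // Root-relative: attach to the origin.
  if ($href[0] === '/') {
    return $origin . $href;
  }
  // Path-relative: attach to the directory of the base path.
  $path = $parts['path'] ?? '/';
  $dir = substr($path, 0, strrpos($path, '/') + 1);
  return $origin . $dir . $href;
}
```

A production resolver should follow RFC 3986 in full, which is what a URI library is for.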

Pass options per call, or save a default pipeline in the admin form and have it applied automatically.

Use cases

Cleaning HTML before Markdown conversion, AI/RAG ingestion, migrations, or search indexing — anywhere you pull content from sources you don't control.

Requirements

  • Drupal core ^10.4 || ^11
  • Composer libraries (installed automatically): Symfony html-sanitizer, dom-crawler, css-selector; league/uri

No other Drupal modules are required.

Recommended modules

Optional integrations — HTML Processor works on any HTML string on its own:

  • An HTML-to-Markdown loader — convert the cleaned HTML to Markdown for documentation or AI/RAG pipelines.

Install and configure

composer require drupal/html_processor
drush en html_processor -y

Set defaults at Configuration › Content authoring › HTML Processor (permission: Administer HTML Processor settings). Defaults are opt-in, and explicit options passed in code always win.
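
The precedence rule can be pictured as a simple array merge. This is only a sketch of the idea, assuming saved defaults and per-call options are both flat arrays; the variable names are hypothetical and the module's actual merge logic may differ.

```php
<?php
// Hypothetical saved defaults, as an admin might store them in the form.
$savedDefaults = [
  'container'  => 'article',
  'remove_ads' => TRUE,
];

// Options passed explicitly by the calling code.
$explicit = [
  'content'   => '<article><p>Hello</p></article>',
  'container' => '#main-content',
];

// With the + operator the left-hand array wins on duplicate keys,
// so explicit options override defaults and defaults fill the gaps.
$effective = $explicit + $savedDefaults;
```

Here $effective keeps the caller's '#main-content' selector and inherits remove_ads from the defaults.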

In code, inject HtmlProcessorInterface and call process():

$clean = $this->htmlProcessor->process([
  'content'    => $rawHtml,
  'container'  => 'article, #main-content',
  'remove_ads' => TRUE,
]);
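
For context, a consuming class might look roughly like the sketch below. So that it runs on its own, it declares a stand-in HtmlProcessorInterface; the real interface ships with the module, and its namespace and exact signature are assumptions here (check README.md). ArticleCleaner and the fake processor are hypothetical names used only for illustration.

```php
<?php
// Stand-in for the module's interface so this sketch is self-contained.
// Assumption: process() takes an options array and returns a string.
interface HtmlProcessorInterface {
  public function process(array $options): string;
}

// Hypothetical consumer that receives the processor by constructor injection.
final class ArticleCleaner {

  public function __construct(private HtmlProcessorInterface $htmlProcessor) {}

  public function clean(string $rawHtml): string {
    return $this->htmlProcessor->process([
      'content'    => $rawHtml,
      'container'  => 'article, #main-content',
      'remove_ads' => TRUE,
    ]);
  }

}

// Fake processor for a quick local check: records the options it was
// given and returns the content unchanged.
$fake = new class implements HtmlProcessorInterface {
  public array $seen = [];

  public function process(array $options): string {
    $this->seen = $options;
    return $options['content'];
  }
};

$cleaner = new ArticleCleaner($fake);
$result = $cleaner->clean('<article><p>Hi</p></article>');
```

Depending on the interface rather than a concrete class keeps the consumer easy to test with a fake like this one.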

The full service API, the Drush command, autowiring setup, and security notes are in the module's README.md.

Good to know

  • Container extraction uses explicit CSS selectors — precise, but review them if a source site changes its markup.
  • Headed for Markdown? Turn minify off (it breaks code blocks) and keep sanitization light to preserve structure.
  • Regex stripping and ad patterns are admin-trusted only: never build them from anonymous or otherwise untrusted input.
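
As background on why regex patterns need guarding, the sketch below shows one common way to contain a runaway or invalid pattern in PHP: cap PCRE backtracking for the call and fall back to the untouched input if the match fails. This is an assumption about the general technique, not a description of how HTML Processor guards against ReDoS; stripWithGuard() is a hypothetical name, and the exact point at which the limit trips depends on the PCRE build.

```php
<?php
// Hypothetical helper: strip matches of $pattern from $html, but never
// return NULL or throw if the pattern is invalid or exceeds the limit.
function stripWithGuard(string $html, string $pattern, int $backtrackLimit = 100000): string {
  // Temporarily lower the backtracking limit for this call.
  $previous = ini_set('pcre.backtrack_limit', (string) $backtrackLimit);

  // preg_replace() returns NULL on error; @ silences the warning that
  // PHP raises for a pattern that fails to compile.
  $result = @preg_replace($pattern, '', $html);

  // Restore the previous limit if it was changed.
  if ($previous !== FALSE) {
    ini_set('pcre.backtrack_limit', $previous);
  }

  // On failure, leave the input unchanged rather than losing content.
  return $result ?? $html;
}
```

A limit like this bounds the damage from a bad pattern but does not make untrusted patterns safe.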

Similar projects

HTML Processor is the standalone successor to the cleaning services in Document Loader: HTML Processor (document_loader_html_processor). It pairs naturally with an HTML-to-Markdown loader downstream.


Project information

  • Project categories: Automation
  • Created by webbywe