This project is not covered by Drupal’s security advisory policy.
Clean up messy HTML from any source — extract the content you want, strip ads and noise, fix links, and sanitize — before you store, convert, or index it.
Feed a Markdown converter or search index a raw web page and you get navigation, ads, and boilerplate in the result. HTML Processor removes that noise first, so downstream output stays clean and token-efficient. It runs a configurable pipeline from a single service call or a saved default, and is standalone: no other Drupal modules required, just a few Symfony/League libraries that Composer installs for you.
What it does
- Extracts the content you want with CSS selectors (article, #main-content) and drops the rest.
- Removes ads and boilerplate: built-in patterns for common networks, plus your own.
- Strips unwanted fragments with admin-trusted regex (guarded against ReDoS).
- Rewrites relative links and images to absolute URLs so they keep working.
- Sanitizes elements and attributes via the Symfony HTML Sanitizer.
- Shapes output — wrap as a full document, or minify.
Pass options per call, or save a default pipeline in the admin form and have it applied automatically.
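The extraction and link-rewriting steps can be pictured with a small standalone sketch. It is not the module's implementation: it uses PHP's bundled DOM extension and XPath in place of the Symfony and League libraries so that it runs without Composer, its URL resolution covers only the common cases, and the names resolveUrl() and extractAndAbsolutize() are hypothetical.

```php
<?php
// Illustrative sketch only, not the module's code. Function names and the
// simplified URL resolution are assumptions made for this example.

/** Resolves $ref against $base; handles only the common cases. */
function resolveUrl(string $ref, string $base): string {
  // Leave absolute URLs (https:, mailto:, ...) and bare fragments alone.
  if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $ref) || str_starts_with($ref, '#')) {
    return $ref;
  }
  $parts = parse_url($base);
  $origin = $parts['scheme'] . '://' . $parts['host']
    . (isset($parts['port']) ? ':' . $parts['port'] : '');
  if (str_starts_with($ref, '//')) {
    // Protocol-relative: borrow the scheme of the base URL.
    return $parts['scheme'] . ':' . $ref;
  }
  if (str_starts_with($ref, '/')) {
    // Root-relative: attach to the origin.
    return $origin . $ref;
  }
  // Path-relative: replace everything after the last slash of the base path.
  $path = $parts['path'] ?? '/';
  $dir = substr($path, 0, strrpos($path, '/') + 1);
  return $origin . $dir . $ref;
}

/** Absolutizes href/src values, then keeps only nodes matching $xpathExpr. */
function extractAndAbsolutize(string $html, string $xpathExpr, string $baseUrl): string {
  $doc = new DOMDocument();
  // Silence libxml warnings about tags it does not recognise.
  $previous = libxml_use_internal_errors(TRUE);
  // The XML prolog is a common hint that makes libxml treat input as UTF-8.
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $xpath = new DOMXPath($doc);
  foreach ($xpath->query('//*[@href or @src]') as $element) {
    foreach (['href', 'src'] as $name) {
      if ($element->hasAttribute($name)) {
        $element->setAttribute($name, resolveUrl($element->getAttribute($name), $baseUrl));
      }
    }
  }

  // Note: nested matches would be emitted twice; a real pipeline would dedupe.
  $out = '';
  foreach ($xpath->query($xpathExpr) as $node) {
    $out .= $doc->saveHTML($node);
  }
  return $out;
}

$raw = '<html><body><nav><a href="/home">Home</a></nav>'
  . '<article><p>Read <a href="guide.html">the guide</a>.</p><img src="/img/a.png"></article>'
  . '<div class="ad">Buy now</div></body></html>';

// XPath equivalent of the CSS selector list "article, #main-content".
$clean = extractAndAbsolutize(
  $raw,
  '//article | //*[@id="main-content"]',
  'https://example.com/docs/index.html'
);
echo $clean, PHP_EOL;
```

Only the article survives, and its link and image now point at absolute URLs under https://example.com. The module itself takes CSS selectors directly, so the XPath expression here is purely a device for keeping the sketch dependency-free.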
Use cases
Cleaning HTML before Markdown conversion, AI/RAG ingestion, migrations, or search indexing — anywhere you pull content from sources you don't control.
Requirements
- Drupal core ^10.4 || ^11
- Composer libraries (installed automatically): Symfony html-sanitizer, dom-crawler, css-selector; league/uri
No other Drupal modules are required.
Recommended modules
Optional integrations — HTML Processor works on any HTML string on its own:
- An HTML-to-Markdown loader — convert the cleaned HTML to Markdown for documentation or AI/RAG pipelines.
Install and configure
composer require drupal/html_processor
drush en html_processor -y
Set defaults at Configuration › Content authoring › HTML Processor (permission: Administer HTML Processor settings). Defaults are opt-in, and explicit options passed in code always win.
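The precedence rule can be pictured as a plain array merge. This is a hypothetical sketch, not the module's code: effectiveOptions() and the $defaultsEnabled flag are names invented for this example, and the real merge logic may differ.

```php
<?php
// Hypothetical illustration of "defaults are opt-in, explicit options win".
// effectiveOptions() and $defaultsEnabled are assumptions for this sketch.
function effectiveOptions(array $explicit, array $savedDefaults, bool $defaultsEnabled): array {
  if (!$defaultsEnabled) {
    // Defaults are opt-in: when not enabled, only the explicit options apply.
    return $explicit;
  }
  // For string keys, array_merge() lets later arrays override earlier ones,
  // so anything passed in code replaces the saved value.
  return array_merge($savedDefaults, $explicit);
}

$savedDefaults = ['container' => 'article', 'remove_ads' => TRUE];
$perCall = ['content' => '<p>Hi</p>', 'container' => '#main-content'];

$withDefaults = effectiveOptions($perCall, $savedDefaults, TRUE);
$withoutDefaults = effectiveOptions($perCall, $savedDefaults, FALSE);
```

With defaults enabled, the per-call container replaces the saved one while remove_ads is inherited. With defaults off, only the two per-call keys remain.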
In code, inject HtmlProcessorInterface and call process():
$clean = $this->htmlProcessor->process([
  'content' => $rawHtml,
  'container' => 'article, #main-content',
  'remove_ads' => TRUE,
]);
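The call above assumes the service has already been injected. A minimal sketch of constructor injection follows. The stand-in interface (declared here only so the example runs on its own), its string return type, and the ArticleCleaner and RecordingProcessor classes are all assumptions for illustration; the real namespace and signature are documented in the module's README.md.

```php
<?php
// Stand-in for the module's interface so this sketch is self-contained.
// The real interface ships with the module; its namespace and exact
// signature are assumed here, not confirmed.
interface HtmlProcessorInterface {
  public function process(array $options): string;
}

// Hypothetical consumer that receives the processor through its constructor.
final class ArticleCleaner {
  public function __construct(private HtmlProcessorInterface $htmlProcessor) {}

  public function clean(string $rawHtml): string {
    return $this->htmlProcessor->process([
      'content' => $rawHtml,
      'container' => 'article, #main-content',
      'remove_ads' => TRUE,
    ]);
  }
}

// Fake processor for unit tests: records the options and echoes the content.
final class RecordingProcessor implements HtmlProcessorInterface {
  public array $lastOptions = [];

  public function process(array $options): string {
    $this->lastOptions = $options;
    return $options['content'];
  }
}

$fake = new RecordingProcessor();
$cleaner = new ArticleCleaner($fake);
$result = $cleaner->clean('<article>Hello</article>');
```

Depending on the interface rather than a concrete class keeps calling code testable: a fake like the one above can stand in for the real service.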
The full service API, the Drush command, autowiring setup, and security notes are in the module's README.md.
Good to know
- Container extraction uses explicit CSS selectors — precise, but review them if a source site changes its markup.
- Headed for Markdown? Turn minify off (it breaks code blocks) and keep sanitization light to preserve structure.
- Regex stripping and ad patterns are admin-trusted only — never safe for anonymous input.
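To show why regex stripping needs a guard even for trusted patterns, here is one common way to bound a runaway pattern in PHP. It is a sketch under stated assumptions, not the module's own guard: safeRegexStrip() is a hypothetical helper, the limit of 10000 is arbitrary, and disabling the JIT is a choice made only to keep the behaviour predictable across PHP builds.

```php
<?php
// Illustrative sketch only. safeRegexStrip() is a hypothetical helper; the
// module's own ReDoS guard may work differently.
function safeRegexStrip(string $pattern, string $html, int $backtrackLimit = 10000): ?string {
  // Lower PCRE's backtracking budget for this call. The JIT is switched off
  // so the limit is counted the same way on builds with and without it.
  $oldLimit = ini_set('pcre.backtrack_limit', (string) $backtrackLimit);
  $oldJit = ini_set('pcre.jit', '0');
  try {
    // preg_replace() returns NULL on failure (bad pattern or limit exceeded).
    $result = @preg_replace($pattern, '', $html);
    if ($result === NULL || preg_last_error() !== PREG_NO_ERROR) {
      return NULL;
    }
    return $result;
  }
  finally {
    // Restore the previous settings whether or not the match succeeded.
    if ($oldLimit !== FALSE) {
      ini_set('pcre.backtrack_limit', $oldLimit);
    }
    if ($oldJit !== FALSE) {
      ini_set('pcre.jit', $oldJit);
    }
  }
}

// A benign pattern: strip HTML comments.
$stripped = safeRegexStrip('/<!--.*?-->/s', 'a<!-- tracking -->b');

// A classic catastrophic pattern: nested quantifiers on a non-matching input.
$rejected = safeRegexStrip('/(a+)+$/', str_repeat('a', 40) . '!');
```

The benign pattern completes well within the budget, while the nested-quantifier pattern exhausts it and the helper returns NULL instead of stalling the request. A caller would then skip that pattern or report it to the administrator who configured it.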
Similar projects
HTML Processor is the standalone successor to the cleaning services in Document Loader: HTML Processor (document_loader_html_processor). It pairs naturally with an HTML-to-Markdown loader downstream.
Project information
- Project categories: Automation
- Created by webbywe

