Problem/Motivation

Discovered while working on #3253158: Add Alpha level Experimental Update Manager module. We have functional tests for our updates via the form that use the batch system. We test that if there are exception in the process they are shown.

There seems to be 2 problems going on that are demonstrated in the tests that ran on the patch on #9

  1. In 10.0.x both the Functional and FunctionalJavascript tests pass. So the functionality works with both non-js and js users
  2. In 10.1.x the Functional tests pass but the FunctionalJavascript does not pass.

    When testing this manually with a browser that has javascript enabled(unlike the functional test).
    There is javascript error

    Uncaught TypeError: Cannot read properties of null (reading 'removeAttribute')
    at Drupal.Message.defaultWrapper (message.js?v=11.0-dev:43:17)
    at new Drupal.Message (message.js?v=11.0-dev:26:46)
    at new Drupal.AjaxError (ajax.js?v=11.0-dev:184:35)
    at Object.error (progress.js?v=11.0-dev:157:25)
    at c (jquery.min.js?v=3.7.0:2:25266)
    at Object.fireWith [as rejectWith] (jquery.min.js?v=3.7.0:2:26015)
    at l (jquery.min.js?v=3.7.0:2:77746)
    at XMLHttpRequest. (jquery.min.js?v=3.7.0:2:80204)

    Here is the lines

    wrapper = document.querySelector('[data-drupal-messages-fallback]');
            wrapper.removeAttribute('data-drupal-messages-fallback');
            wrapper.setAttribute('data-drupal-messages', '');
            wrapper.classList.remove('hidden');
    

    so it appears document.querySelector('[data-drupal-messages-fallback]'); does not return a result.

  3. On 11.x both tests fail.

    When testing with Javascript disabled
    on it appears to briefly so the progress and then a white screen

Proposed resolution

The functional JavaScript test's failure has a different cause than the functional test's failure. This issue only deals with the functional test. The JavaScript test has its own issue; see #21 for details.

The cause of the functional test failure is a conflict between output buffering and the Content-Length header that gets added to every response as of #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration. The good news is, it's probably only breaking tests, not real sites, since the problem also involves a specific quirk of cURL and how it handles Content-Length headers. (Real-world browsers probably do not behave this way, although I'm just speculating here.)

A full explanation of the bug may be found in #17.

The proposed resolution is, as yet, uncertain. There are several possible paths forward, but framework manager review is probably needed to decide the right approach.

Remaining tasks

Test coverage is already written, so we just need to decide on the right fix here and implement it.

User interface changes

None.

API changes

TBD, but none expected.

Data model changes

None.

Release notes snippet

TBD

Issue fork drupal-3392196

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

tedbow created an issue. See original summary.

tedbow’s picture

StatusFileSize
new5.33 KB

uploading a patch version to demonstrate that this passes against 10.1.x 🤞🏻

tedbow’s picture

StatusFileSize
new5.34 KB
tedbow’s picture

Issue summary: View changes
tedbow’s picture

StatusFileSize
new6.61 KB
tedbow’s picture

Status: Active » Needs review
StatusFileSize
new6.68 KB

Status: Needs review » Needs work

The last submitted patch, 7: 3392196-exceptions-in-batch-7.patch, failed testing. View results

tedbow’s picture

StatusFileSize
new6.68 KB
xjm’s picture

Priority: Normal » Critical

This blocks AU in core and is therefore critical.

tedbow’s picture

Issue summary: View changes

@xjm ok thanks. I thought it might be critical

i updated the summary to explain the different test failures on the different branches

I started to debug this but so far I have not figured it out.

phenaproxima’s picture

@tedbow and I spent some time tracking this down with git bisect. We discovered that #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration broke the functional test, but not the functional JavaScript test. That leads us to believe there are actually two bugs here.

We don't know what's breaking the functional JS test, but we did discover that, in the functional test, the problem is that, once you start the batch job, the output from Drupal is just...cut off. You get an incomplete snippet of an HTML page. If you apply the patch in #9, then modify ProcessingTest so that it looks like this:

+    $edit = ['batch' => 'batch_8'];
+    $this->drupalGet('batch-test');
+    $this->submitForm($edit, 'Submit');
+    $this->assertSession()->assertNoEscaped('<');
+    print_r($this->getSession()->getPage()->getContent());

You'll see the first bit of an HTML page, just cut off somewhere, when you run the test. (You could also just say $this->assertSession()->responseContains('</html>'), and it should fail too.)

You can do one of two things to make the test pass:

  • Disable \Drupal\Core\EventSubscriber\FinishResponseSubscriber::setContentLengthHeader() (modify getSubscribedEvents() to do this).
  • In _batch_progress_page(), remove the call to ob_start().

In conclusion, we don't know what the actual bug is here...but it seems to have something to do with output buffering initiated by the batch system not playing nicely with what's happening in FinishResponseSubscriber::setContentLengthHeader()...if the batch operation callback raises an exception.

We're hoping someone who knows more than us can quickly identify the nature of the problem, based on what we've discovered here.

phenaproxima’s picture

I did some digging around through the core HTTP system and I discovered something. This gets right to the heart of how the kernel works.

I thought a bit about what I said in #12:

You'll see the first bit of an HTML page, just cut off somewhere

I also noticed what happens when there's an exception while producing a response -- other event subscribers are invoked, handling the \Symfony\Component\HttpKernel\KernelEvents::EXCEPTION event.

\Drupal\Core\EventSubscriber\FinalExceptionSubscriber::onException() is one of the handlers invoked. It does this:

    $content = $this->t('The website encountered an unexpected error. Try again later.');
    $content .= $message ? '<br><br>' . $message : '';
    $response = new Response($content, 500, ['Content-Type' => $content_type]);

    // ...some probably-irrelevant stuff

    $event->setResponse($response);

So it's replacing the response we were going to send with a new response containing the details of the exception.

That made me wonder: is it possible that the Content-Length header is being set using the exception response, rather than the actual response?? That would explain the discrepancy.

To test this, I did this in \Drupal\Core\EventSubscriber\FinishResponseSubscriber::setContentLengthHeader():

    if ($response instanceof StreamedResponse || $response->getStatusCode() >= 400) {
      return;
    }

...and lo, the functional test passed.

And also, this might help explain why removing the ob_start() call in _batch_progress_page() fixes it -- because output buffering doesn't delay the headers! PHP's ob_start documentation is subtle, but clear on this point (emphasis mine):

This function will turn output buffering on. While output buffering is active no output is sent from the script (other than headers), instead the output is stored in an internal buffer.

Content-Length is one of the headers. So maybe that's going out early, with a too-small value that it's deriving from the exception response.

I know these thoughts aren't very organized, but this might explain, at least partially, what's going on.

phenaproxima’s picture

Crediting @tedbow for writing the test coverage, and myself for the debugging efforts.

phenaproxima’s picture

Did a little more investigating:

  1. I modified \Drupal\Core\EventSubscriber\FinalExceptionSubscriber::onException() to add a X-Testing: yesh header to the response.
  2. Then I put this into ProcessingTest::testBatchForm():
    print_r($this->getSession()->getResponseHeaders());
    print_r($this->getSession()->getPage()->getContent());
    $this->assertSession()->linkExists('the error page');
    

Here's the test output:

Array
(
    [Server] => Array
        (
            [0] => nginx/1.25.2
        )

    [Content-Type] => Array
        (
            [0] => text/html; charset=UTF-8
        )

    [Transfer-Encoding] => Array
        (
            [0] => chunked
        )

    [Connection] => Array
        (
            [0] => keep-alive
        )

    [X-Powered-By] => Array
        (
            [0] => PHP/8.1.24
        )

    [Cache-Control] => Array
        (
            [0] => must-revalidate, no-cache, private
        )

    [Date] => Array
        (
            [0] => Fri, 20 Oct 2023 14:24:01 GMT
        )

    [X-Testing] => Array
        (
            [0] => yesh
        )

    [Content-language] => Array
        (
            [0] => en
        )

    [X-Content-Type-Options] => Array
        (
            [0] => nosniff
        )

    [X-Frame-Options] => Array
        (
            [0] => SAMEORIGIN
        )

    [Expires] => Array
        (
            [0] => Sun, 19 Nov 1978 05:00:00 GMT
        )

    [X-Generator] => Array
        (
            [0] => Drupal 11 (https://www.drupal.org)
        )

)
<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8" />
<meta name="Generator" content="Drupal 11 (https://www.drupal.org)" />
<meta name="MobileOptimized" content="width" />
<meta name="HandheldFriendly" content="true" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="icon" href="/https/www.drupal.org/core/misc/favicon.ico" type="image/vnd.microsoft.icon" />

    <title>Processing | Drupal</title>
    <link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/ajax-progress.module.css?s2tzz3" />
<link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/align.module.css?s2tzz3" />
<link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/autocomplete-loading.module.css?s2tzz3" />
<link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/fieldgroup.module.css?s2tzz3" />
<link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/container-inline.module.css?s2tzz3" />
<link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/clearfix.module.css?s2tzz3" />
<link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/details.module.css?s2tzz3" />
<link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/hidden.module.css?s2tzz3" />
<link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/item-list.module.css?s2tzz3" />
<link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/js.module.css?s2tzz3" />
<link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/nowrap.module.css?s2tzz3" />
<link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/position-container.module.css?s2tzz3" />
<link rel="stylesheet" media="all" href="/https/www.drupal.org/core/themes/stable9/css/system/components/progress.module.css?s2tzz3" />
<link r

(Yes, that's the end of the output from the server. It just stops there.)

So what I'm seeing here is the output of the regular batch page, but the headers from the exception response!

Is the output buffer getting flushed in some way before the exception response is handed to the kernel?

phenaproxima’s picture

I finally have a complete, and satisfying, explanation for what is breaking the functional test!

First of all, this may not actually be breaking real sites in the real world. I don't know if you'd be able to reproduce this in manual testing (I doubt it).

Why? Because we are running afoul of a quirk in cURL -- it enforces the Content-Length header. This means that if the server tells cURL "hey, I'm about to send you 500 bytes", but it actually sends 1000, cURL will only see the first 500 bytes, and throw out the rest! (There does not seem to be a way to override this behavior from PHP.)

Now, previously Drupal didn't send the Content-Length header with its responses, or at least not by default. That changed in #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration.

The problem is that the new \Drupal\Core\EventSubscriber\FinishResponseSubscriber::setContentLengthHeader() event handler doesn't account for the possibility that output buffering might be active, and that there might already be stuff in the output buffer!

Anything in the output buffer is effectively already part of the response (unless you explicitly clear the buffer by calling ob_clean()). That ol' rascally batch system activates output buffering in _batch_process_page(), specifically for error handling, and sends some output. It's a very old and clunky form of error handling. (Yes, very old - the lines concerned were written in #127539: batch - progressive operations (à la update.php), which was committed by Gábor in 2007. The year I started using Drupal. Sheesh.)

Thus, the sequence of events emerges:

  1. You start your batch job. _batch_process_page() turns on output buffering, and puts some output into it:
        ob_start();
        // ...some other stuff...
        print $fallback;
    
  2. The batch operation throws an exception!
  3. The exception bubbles up to the Drupal kernel, setting off a new chain of processing that eventually ends up in \Drupal\Core\EventSubscriber\FinalExceptionSubscriber::onException(), which in turn replaces the entire response with the standard HTTP 500 error ("the website has encountered an unexpected error...") that we know and love.
  4. The new response is handed off to the event subscribers. Eventually we enter \Drupal\Core\EventSubscriber\FinishResponseSubscriber::setContentLengthHeader(), which dutifully sets the Content-Length header...to the length of the exception response, not the original response.
  5. Which, okay...except that, 🚩because output buffering is active, and there's stuff in the buffer which we haven't cleared out, we've effectively already sent part of the original response! 🚩
  6. Therefore, we are actually sending a too-small Content-Length header. 🐛
  7. In the functional test, cURL -- which is the engine underlying all requests made to the test site -- dutifully receives the Content-Length header, and reads that many bytes from Drupal, disregarding everything after that. 🙈
  8. But since the Content-Length isn't accounting for the already-sent data (the stuff that was in the output buffer)...it's a lie. cURL cuts it off, the response is malformed, and the test fails hard. 🤯

This kind of shit is why they pay me the big bucks. (Kidding!)

I think the proper fix, at least for the functional test, is to make \Drupal\Core\EventSubscriber\FinishResponseSubscriber::setContentLengthHeader() account for the possibility that output buffering is active. You can do this:

    $length = strlen($response->getContent());
    if (ob_get_level()) {
      $length += strlen(ob_get_contents());
    }
    $response->headers->set('Content-Length', $length, TRUE);

...and the test passes.

A few other ways we can fix the problem:

  • We could refactor the batch system so we don't need to do the output buffering dance. (This is probably the toughest option, but damn that code is old and creaky.)
  • We could just never send the Content-Length header for an exception response.
  • We could refuse to send the Content-Length header if output buffering is enabled, knowing that there might already have been output we don't know about.
  • We could clean the output buffer before replacing the original response with the exception response. (I bet this would break some tests in core, though...)
  • ...we could do some other clever thing. Suggestions would be lovely.
znerol’s picture

Let's get that straight:

  • The _batch_process() function is called repeatedly in order to perform the work of a batch.
  • If the browser supports JavaScript (i.e., pretty much always), _batch_process() is called from _batch_do().
  • If the browser doesn't support JavaScript, _batch_process() is called from _batch_page()
  • In the non-JS case (i.e., the fallback case which virtually never happens on production), _batch_page() tries very much to present a prettier error page for fatal errors than the rest of Drupal.

Is that more or less correct? Couldn't we just remove that buggy fallback error mechanism from _batch_page()? Basically anything between (and including) ob_start() and ob_end_clean() and replace it with [$percentage, $message, $label] = _batch_process();.

phenaproxima’s picture

That seems correct to me, but I am absolutely not an authority on the batch system, so I don't know if it's appropriate to remove that error handling code. I'd like to see it go, personally, but a more knowledgeable person than I should make the final decision.

I guess the downside is that, if the batch job fails outright, you don't get any kind of error page - I believe you would just get a WSOD if error reporting was silenced. (But we could confirm or deny that with a test -- and the batch system's test coverage is lacking, which is why this problem was not caught before.)

The upshot here is that adjusting the batch system would not solve the root problem, which is that the Content-Length provided by FinishResponseSubscriber can be flat-out wrong if an uncaught exception is thrown while output buffering is activated and output has been sent. Granted...I can't think of anything else except the batch system which would trigger those kinds of conditions, but it would still be better, IMHO, to guarantee that they could never arise than to fix one possible way in which they do.

znerol’s picture

I agree @phenaproxima, FinishResponseSubscriber::setContentLengthHeader() was introduced to optimize the performance of pages where some significant workload is executing in the terminate event. For sure we do not lose too much if this optimization is only applied to successful page loads.

bnjmnm’s picture

Info on the FunctionalJavaScript test failure:

  • The test failure is in fact due to a change introduced #2987444: Ajax errors are not communicated through the UI.
  • Prior to that change, AJAX errors were only reported in the JS console. Nothing reported in the Drupal UI. This was addressed by adding core/drupal.message as a dependency to core/drupal.ajax
  • Apparently core/drupal.message assumes theres a '#type' => 'status_messages' render element on the page. \Drupal\Tests\system\FunctionalJavascript\Batch\ProcessingTest is failing because it expects the markup that would be provided by the status_messages render element.

Clearly something needs to be addressed in core/drupal.message since JS libraries can't declare dependencies on render elements. It's safe to say that isn't something that needs to be addressed in this issue's scope. Adding a '#type' => 'status_messages' element to the test page should get the FJS test working, and the underlying issue has its own issue now #3396099: The core/drupal.message library requires a status_messages render element

phenaproxima’s picture

Issue summary: View changes
Issue tags: +Needs framework manager review
tedbow’s picture

@bnjmnm thanks for the explanation!

Adding a '#type' => 'status_messages' element to the test page should get the FJS test working,

We actually aren't using a test page this if failing on the regular batch system. We just use a test form to start the batch process. But then it gets directed to the core batch controller. Thats is why this would cause a problem any core or contrib usage of the batch form system when JS enabled(also without JS but the other reasons described on this issue), any exception that happened in batch callbacks would no longer show up unless they specifically were caught and handled in the callbacks.

So this just needed '#type' => 'status_messages' in \Drupal\system\Controller\BatchController::batchPage see https://git.drupalcode.org/project/drupal/-/merge_requests/4957/diffs#9d... in the MR

This gets the JS test passing for locally. I pushed up this change to see if we have any core tests that will fail if the status message element. I wouldn't think so but you never know.

I think we should make fixing the JS version its own issue if it is really this simple

phenaproxima’s picture

Since it's not super clear how to fix this yet...this need not be a hard blocker for Automatic Updates.

Automatic Updates could replace the finish_response_subscriber with its own version that disables the Content-Length header if output buffering is enabled. Then when this issue is resolved, it could remove its special subscriber and that would be that.

catch’s picture

If we revert #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration the workaround shouldn't be necessary, but we could potentially keep this open specifically to change the batch error handling.

phenaproxima’s picture

tedbow’s picture

tedbow’s picture

Title: Exceptions in batch no longer are shown on the page » Exceptions in batch no longer are shown on the page when Javascript is disabled

Updating the title to reflect this issue only affects non-JS users

#3406612: Exceptions in batch no longer are shown on the page: Javascript error handles JS users

but this issue still would affect all non-JS test, which we have a lot of in #3253158: Add Alpha level Experimental Update Manager module

catch’s picture

There's a green MR on #3396559: Only set content-length header in specific situations now, is that issue sufficient to unblock automatic updates, or is there more here that's not related to the content length changs?

catch’s picture

Status: Needs work » Postponed

Postponing this on #3396559: Only set content-length header in specific situations but assuming that issue fixes this bug, we should still keep this issue open to add the additional test coverage.

tedbow’s picture

Status: Postponed » Needs work

Need to check if the tests now pass since #3396559: Only set content-length header in specific situations is fixed

damienmckenna’s picture

Is the test coverage the only thing not already committed from #3396559: Only set content-length header in specific situations?

catch’s picture

@tedbow has this fixed the automatic updates tests?

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.