Problem/Motivation
Discovered while working on #3253158: Add Alpha level Experimental Update Manager module. We have functional tests for our updates via the form that use the batch system. We test that if there are exception in the process they are shown.
There seems to be 2 problems going on that are demonstrated in the tests that ran on the patch on #9
- In 10.0.x both the Functional and FunctionalJavascript tests pass. So the functionality works with both non-js and js users
- In 10.1.x the Functional tests pass but the FunctionalJavascript does not pass.
When testing this manually with a browser that has javascript enabled(unlike the functional test).
There is javascript errorUncaught TypeError: Cannot read properties of null (reading 'removeAttribute')
at Drupal.Message.defaultWrapper (message.js?v=11.0-dev:43:17)
at new Drupal.Message (message.js?v=11.0-dev:26:46)
at new Drupal.AjaxError (ajax.js?v=11.0-dev:184:35)
at Object.error (progress.js?v=11.0-dev:157:25)
at c (jquery.min.js?v=3.7.0:2:25266)
at Object.fireWith [as rejectWith] (jquery.min.js?v=3.7.0:2:26015)
at l (jquery.min.js?v=3.7.0:2:77746)
at XMLHttpRequest. (jquery.min.js?v=3.7.0:2:80204)Here is the lines
wrapper = document.querySelector('[data-drupal-messages-fallback]'); wrapper.removeAttribute('data-drupal-messages-fallback'); wrapper.setAttribute('data-drupal-messages', ''); wrapper.classList.remove('hidden');so it appears
document.querySelector('[data-drupal-messages-fallback]');does not return a result. -
On 11.x both tests fail.
When testing with Javascript disabled
on it appears to briefly so the progress and then a white screen
Proposed resolution
The functional JavaScript test's failure has a different cause than the functional test's failure. This issue only deals with the functional test. The JavaScript test has its own issue; see #21 for details.
The cause of the functional test failure is a conflict between output buffering and the Content-Length header that gets added to every response as of #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration. The good news is, it's probably only breaking tests, not real sites, since the problem also involves a specific quirk of cURL and how it handles Content-Length headers. (Real-world browsers probably do not behave this way, although I'm just speculating here.)
A full explanation of the bug may be found in #17.
The proposed resolution is, as yet, uncertain. There are several possible paths forward, but framework manager review is probably needed to decide the right approach.
Remaining tasks
Test coverage is already written, so we just need to decide on the right fix here and implement it.
User interface changes
None.
API changes
TBD, but none expected.
Data model changes
None.
Release notes snippet
TBD
| Comment | File | Size | Author |
|---|
Issue fork drupal-3392196
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #3
tedbowuploading a patch version to demonstrate that this passes against 10.1.x 🤞🏻
Comment #4
tedbowComment #5
tedbowComment #6
tedbowComment #7
tedbowComment #9
tedbowComment #10
xjmThis blocks AU in core and is therefore critical.
Comment #11
tedbow@xjm ok thanks. I thought it might be critical
i updated the summary to explain the different test failures on the different branches
I started to debug this but so far I have not figured it out.
Comment #12
phenaproxima@tedbow and I spent some time tracking this down with
git bisect. We discovered that #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration broke the functional test, but not the functional JavaScript test. That leads us to believe there are actually two bugs here.We don't know what's breaking the functional JS test, but we did discover that, in the functional test, the problem is that, once you start the batch job, the output from Drupal is just...cut off. You get an incomplete snippet of an HTML page. If you apply the patch in #9, then modify ProcessingTest so that it looks like this:
You'll see the first bit of an HTML page, just cut off somewhere, when you run the test. (You could also just say
$this->assertSession()->responseContains('</html>'), and it should fail too.)You can do one of two things to make the test pass:
\Drupal\Core\EventSubscriber\FinishResponseSubscriber::setContentLengthHeader()(modifygetSubscribedEvents()to do this)._batch_progress_page(), remove the call toob_start().In conclusion, we don't know what the actual bug is here...but it seems to have something to do with output buffering initiated by the batch system not playing nicely with what's happening in
FinishResponseSubscriber::setContentLengthHeader()...if the batch operation callback raises an exception.We're hoping someone who knows more than us can quickly identify the nature of the problem, based on what we've discovered here.
Comment #13
phenaproximaComment #14
phenaproximaI did some digging around through the core HTTP system and I discovered something. This gets right to the heart of how the kernel works.
I thought a bit about what I said in #12:
I also noticed what happens when there's an exception while producing a response -- other event subscribers are invoked, handling the
\Symfony\Component\HttpKernel\KernelEvents::EXCEPTIONevent.\Drupal\Core\EventSubscriber\FinalExceptionSubscriber::onException()is one of the handlers invoked. It does this:So it's replacing the response we were going to send with a new response containing the details of the exception.
That made me wonder: is it possible that the Content-Length header is being set using the exception response, rather than the actual response?? That would explain the discrepancy.
To test this, I did this in
\Drupal\Core\EventSubscriber\FinishResponseSubscriber::setContentLengthHeader():...and lo, the functional test passed.
And also, this might help explain why removing the
ob_start()call in_batch_progress_page()fixes it -- because output buffering doesn't delay the headers! PHP'sob_startdocumentation is subtle, but clear on this point (emphasis mine):Content-Lengthis one of the headers. So maybe that's going out early, with a too-small value that it's deriving from the exception response.I know these thoughts aren't very organized, but this might explain, at least partially, what's going on.
Comment #15
phenaproximaCrediting @tedbow for writing the test coverage, and myself for the debugging efforts.
Comment #16
phenaproximaDid a little more investigating:
\Drupal\Core\EventSubscriber\FinalExceptionSubscriber::onException()to add aX-Testing: yeshheader to the response.Here's the test output:
(Yes, that's the end of the output from the server. It just stops there.)
So what I'm seeing here is the output of the regular batch page, but the headers from the exception response!
Is the output buffer getting flushed in some way before the exception response is handed to the kernel?
Comment #17
phenaproximaI finally have a complete, and satisfying, explanation for what is breaking the functional test!
First of all, this may not actually be breaking real sites in the real world. I don't know if you'd be able to reproduce this in manual testing (I doubt it).
Why? Because we are running afoul of a quirk in cURL -- it enforces the
Content-Lengthheader. This means that if the server tells cURL "hey, I'm about to send you 500 bytes", but it actually sends 1000, cURL will only see the first 500 bytes, and throw out the rest! (There does not seem to be a way to override this behavior from PHP.)Now, previously Drupal didn't send the
Content-Lengthheader with its responses, or at least not by default. That changed in #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration.The problem is that the new
\Drupal\Core\EventSubscriber\FinishResponseSubscriber::setContentLengthHeader()event handler doesn't account for the possibility that output buffering might be active, and that there might already be stuff in the output buffer!Anything in the output buffer is effectively already part of the response (unless you explicitly clear the buffer by calling
ob_clean()). That ol' rascally batch system activates output buffering in_batch_process_page(), specifically for error handling, and sends some output. It's a very old and clunky form of error handling. (Yes, very old - the lines concerned were written in #127539: batch - progressive operations (à la update.php), which was committed by Gábor in 2007. The year I started using Drupal. Sheesh.)Thus, the sequence of events emerges:
_batch_process_page()turns on output buffering, and puts some output into it:\Drupal\Core\EventSubscriber\FinalExceptionSubscriber::onException(), which in turn replaces the entire response with the standard HTTP 500 error ("the website has encountered an unexpected error...") that we know and love.\Drupal\Core\EventSubscriber\FinishResponseSubscriber::setContentLengthHeader(), which dutifully sets theContent-Lengthheader...to the length of the exception response, not the original response.Content-Lengthheader. 🐛Content-Lengthheader, and reads that many bytes from Drupal, disregarding everything after that. 🙈Content-Lengthisn't accounting for the already-sent data (the stuff that was in the output buffer)...it's a lie. cURL cuts it off, the response is malformed, and the test fails hard. 🤯This kind of shit is why they pay me the big bucks. (Kidding!)
I think the proper fix, at least for the functional test, is to make
\Drupal\Core\EventSubscriber\FinishResponseSubscriber::setContentLengthHeader()account for the possibility that output buffering is active. You can do this:...and the test passes.
A few other ways we can fix the problem:
Content-Lengthheader for an exception response.Content-Lengthheader if output buffering is enabled, knowing that there might already have been output we don't know about.Comment #18
znerol commentedLet's get that straight:
_batch_process()function is called repeatedly in order to perform the work of a batch._batch_process()is called from_batch_do()._batch_process()is called from_batch_page()_batch_page()tries very much to present a prettier error page for fatal errors than the rest of Drupal.Is that more or less correct? Couldn't we just remove that buggy fallback error mechanism from
_batch_page()? Basically anything between (and including)ob_start()andob_end_clean()and replace it with[$percentage, $message, $label] = _batch_process();.Comment #19
phenaproximaThat seems correct to me, but I am absolutely not an authority on the batch system, so I don't know if it's appropriate to remove that error handling code. I'd like to see it go, personally, but a more knowledgeable person than I should make the final decision.
I guess the downside is that, if the batch job fails outright, you don't get any kind of error page - I believe you would just get a WSOD if error reporting was silenced. (But we could confirm or deny that with a test -- and the batch system's test coverage is lacking, which is why this problem was not caught before.)
The upshot here is that adjusting the batch system would not solve the root problem, which is that the Content-Length provided by FinishResponseSubscriber can be flat-out wrong if an uncaught exception is thrown while output buffering is activated and output has been sent. Granted...I can't think of anything else except the batch system which would trigger those kinds of conditions, but it would still be better, IMHO, to guarantee that they could never arise than to fix one possible way in which they do.
Comment #20
znerol commentedI agree @phenaproxima,
FinishResponseSubscriber::setContentLengthHeader()was introduced to optimize the performance of pages where some significant workload is executing in the terminate event. For sure we do not lose too much if this optimization is only applied to successful page loads.Comment #21
bnjmnmInfo on the FunctionalJavaScript test failure:
core/drupal.messageas a dependency tocore/drupal.ajaxcore/drupal.messageassumes theres a'#type' => 'status_messages'render element on the page.\Drupal\Tests\system\FunctionalJavascript\Batch\ProcessingTestis failing because it expects the markup that would be provided by thestatus_messagesrender element.Clearly something needs to be addressed in
core/drupal.messagesince JS libraries can't declare dependencies on render elements. It's safe to say that isn't something that needs to be addressed in this issue's scope. Adding a'#type' => 'status_messages'element to the test page should get the FJS test working, and the underlying issue has its own issue now #3396099: The core/drupal.message library requires a status_messages render elementComment #22
phenaproximaComment #23
tedbow@bnjmnm thanks for the explanation!
We actually aren't using a test page this if failing on the regular batch system. We just use a test form to start the batch process. But then it gets directed to the core batch controller. Thats is why this would cause a problem any core or contrib usage of the batch form system when JS enabled(also without JS but the other reasons described on this issue), any exception that happened in batch callbacks would no longer show up unless they specifically were caught and handled in the callbacks.
So this just needed
'#type' => 'status_messages'in\Drupal\system\Controller\BatchController::batchPagesee https://git.drupalcode.org/project/drupal/-/merge_requests/4957/diffs#9d... in the MRThis gets the JS test passing for locally. I pushed up this change to see if we have any core tests that will fail if the status message element. I wouldn't think so but you never know.
I think we should make fixing the JS version its own issue if it is really this simple
Comment #24
phenaproximaSince it's not super clear how to fix this yet...this need not be a hard blocker for Automatic Updates.
Automatic Updates could replace the finish_response_subscriber with its own version that disables the Content-Length header if output buffering is enabled. Then when this issue is resolved, it could remove its special subscriber and that would be that.
Comment #25
catchIf we revert #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration the workaround shouldn't be necessary, but we could potentially keep this open specifically to change the batch error handling.
Comment #26
phenaproximaComment #27
tedbowCreated #3392196: Exceptions in batch no longer are shown on the page when Javascript is disabled for the javascript issue
Comment #28
tedbowwrong issue
Comment #29
tedbowUpdating the title to reflect this issue only affects non-JS users
#3406612: Exceptions in batch no longer are shown on the page: Javascript error handles JS users
but this issue still would affect all non-JS test, which we have a lot of in #3253158: Add Alpha level Experimental Update Manager module
Comment #30
catchThere's a green MR on #3396559: Only set content-length header in specific situations now, is that issue sufficient to unblock automatic updates, or is there more here that's not related to the content length changs?
Comment #31
catchPostponing this on #3396559: Only set content-length header in specific situations but assuming that issue fixes this bug, we should still keep this issue open to add the additional test coverage.
Comment #32
tedbowNeed to check if the tests now pass since #3396559: Only set content-length header in specific situations is fixed
Comment #33
damienmckennaIs the test coverage the only thing not already committed from #3396559: Only set content-length header in specific situations?
Comment #34
catch@tedbow has this fixed the automatic updates tests?