
Fixing a chunking issue in an Event Sourced system

Overview

On a website I worked on for a client, the expectation was that when a user made changes in a form and submitted it, they would be taken to the next page.

However, what was actually happening on rare occasions was that the submit would end in an XHR timeout and the subsequent page transition wouldn't occur.

For a quick bit of background on the system:

  • a JS frontend client in the browser talks to a backend that uses event sourcing with CQRS
  • page transitions happen by submitting a form to the backend and directing the browser to the next page on a successful XHR request (roughly the flow sketched below)
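
To make that second point concrete, the client-side flow looks roughly like this (a simplified sketch, using fetch where the real client used an XmlHttpRequest; the form selector and next-step URL here are made up):

// Simplified sketch of the submit flow; selector and URLs are illustrative
document.querySelector('#checkout-form').addEventListener('submit', async (event) => {
  event.preventDefault();
  const response = await fetch('/wp-admin/admin-ajax.php', {
    method: 'POST',
    body: new FormData(event.target),
  });
  if (response.ok) {
    // The transition only happens on a successful response,
    // so a timeout here leaves the user sitting on the form.
    window.location.assign('/checkout/next-step');
  }
});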

Onto debugging!

Before I get started: since this is a live application, I'm obfuscating some of the information like file paths for brevity and security, so some of the stack traces may sound generic.

What do I know so far?

So before I got started trying to debug this thing, I reviewed what the knowns of the problem were. I know the timeouts are occurring on any action going to the backend; it's just not known where they originate from.

I also want to be sure I understand the layout of the technology stack this request needs to traverse so I can methodically do my debugging in the order the request follows.

After some discovery, I find out that any request in this scenario goes through:

  1. The client application in the web browser makes an XmlHttpRequest
  2. The request is received by HAProxy
  3. HAProxy forwards the request to an Apache instance
  4. Apache then routes it to a PHP / WordPress application
  5. PHP code then makes a Http\Client\HttpClient request to a Consul address
  6. The Consul address resolves to a NodeJS process
  7. Finally, the NodeJS process makes a request to a Mono / ServiceStack process

PHP / WordPress logging

So the first place I started looking was the PHP layer.

The JS client is technically where the request originates, but I know it's not timing out there so I can just skip that part.

I start digging through the PHP logging config (both PHP-FPM and mod_php), as well as what WordPress offers, which isn't much.

I make sure all of the WordPress debug logging is enabled (WordPress Debugging):

sudo vim wp-config.php
...
define('WP_DEBUG', true);         // enable debug mode
define('WP_DEBUG_LOG', true);     // write errors to wp-content/debug.log
define('WP_DEBUG_DISPLAY', true); // also show errors in the response
...

I also make some settings changes for PHP:

sudo vim /etc/php/7.0/apache2/php.ini
...
error_reporting = E_ALL | E_STRICT
display_errors = On
log_errors = On
error_log = /var/log/php_errors.log
...

I didn't notice anything in the general PHP logs. However, when I looked into the more application-specific Apache logs, I saw a proxy FastCGI timeout:

[Tue Oct 02 12:20:11.337835 2018] [proxy_fcgi:error] [pid 15212] (70007)The timeout specified has expired: [client 172.32.25.24:33318] AH01075: Error dispatching request to : (polling), referer: https://www.bigrentz.com/checkout/rental-agreement
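
For reference, that AH01075 / "The timeout specified has expired" is mod_proxy_fcgi giving up on the backend it's proxying to, and the wait it enforces is governed by Apache's Timeout / ProxyTimeout settings. Something roughly like this in the vhost controls it (the values and FPM address here are illustrative, not the live config):

# Illustrative only, not the live vhost config
# how long mod_proxy waits on a backend before giving up
ProxyTimeout 60
<FilesMatch "\.php$">
    SetHandler "proxy:fcgi://127.0.0.1:9000"
</FilesMatch>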

Ok, so could the PHP slow log possibly be recording this?

I change the request timeout to 55s and the slowlog timeout to 5s. Who really knows how noisy that's going to be at this point, so just expire quickly to catch any low-hanging fruit.
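
For reference, the PHP-FPM pool settings involved look roughly like this (assuming the changes go in the FPM pool config; the file path here is a guess for this install):

sudo vim /etc/php/7.0/fpm/pool.d/www.conf
...
; kill requests running longer than this
request_terminate_timeout = 55s
; log a stack trace for anything slower than this
request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm-slow.log
...

The slow log soon turned up a trace with curl_exec() right at the top of the stack: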

[02-Oct-2018 12:19:46] [pool www] pid 1691
script_filename = //var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/wp-admin/admin-ajax.php
[0x00007f4aaea13fc0] curl_exec() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/vendor/php-http/curl-client/src/Client.php:122
[0x00007f4aaea13ef0] sendRequest() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/vendor/php-http/client-common/src/PluginClient.php:81
[0x00007f4aaea13dd0] Http\Client\Common\{closure}() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/vendor/php-http/client-common/src/Plugin/AddHostPlugin.php:64
[0x00007f4aaea13d10] handleRequest() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/vendor/php-http/client-common/src/PluginClient.php:156
[0x00007f4aaea13c50] Http\Client\Common\{closure}() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/vendor/php-http/client-common/src/PluginClient.php:170
[0x00007f4aaea13ba0] Http\Client\Common\{closure}() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/vendor/php-http/client-common/src/PluginClient.php:87
[0x00007f4aaea13b00] sendRequest() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/vendor/php-http/client-common/src/HttpMethodsClient.php:203
[0x00007f4aaea13a80] sendRequest() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/vendor/php-http/client-common/src/HttpMethodsClient.php:192
[0x00007f4aaea139d0] send() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/vendor/php-http/client-common/src/HttpMethodsClient.php:107
[0x00007f4aaea13930] post() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/wp-content/themes/wp-theme/inc/keystoneClient/Api/AbstractApi.php:177
[0x00007f4aaea13870] postRaw() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/wp-content/themes/wp-theme/inc/keystoneClient/Api/AbstractApi.php:139
[0x00007f4aaea137d0] post() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/wp-content/themes/wp-theme/inc/keystoneClient/Api/OrderingQuoteCommand.php:112
[0x00007f4aaea13750] submitCheckoutStep4() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/wp-content/themes/wp-theme/inc/class-bigrentz-shopping-cart-handler.php:615
[0x00007f4aaea13690] bigrentz_shopping_cart_submit_rental_agreement_action() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/wp-includes/class-wp-hook.php:286
[0x00007f4aaea135a0] apply_filters() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/wp-includes/class-wp-hook.php:310
[0x00007f4aaea13520] do_action() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/wp-includes/plugin.php:453
[0x00007f4aaea13400] do_action() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/wp-settings.php:471
[0x00007f4aaea13240] [INCLUDE_OR_EVAL]() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/wp-config.php:157
[0x00007f4aaea131a0] [INCLUDE_OR_EVAL]() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/wp-load.php:37
[0x00007f4aaea130f0] [INCLUDE_OR_EVAL]() /var/www/Application/2051.1a959a645403c5f26dad8066cdcf60aa0f4fa6c3/wp-admin/admin-ajax.php:22

Ok, so I'm feeling more and more like the timeout isn't happening in the PHP / WordPress layer.

When I checked the other application-specific logs, I saw similar messaging, consistent with what was in the slow log report.

/var/log/apache2/Application_error.log
/var/log/apache2/Application_access.log
/var/log/php_errors.log
[Fri Aug 17 08:49:24.363547 2018] [proxy_fcgi:error] [pid 1381] [client 192.168.2.10:34610] AH01071: Got error 'PHP message: PHP Fatal error: Uncaught Http\\Client\\Exception\\NetworkException: Could not resolve host: bigrentz-nodeapi.service.consul in /home/www/Application/vendor/php-http/curl-client/src/Client.php:134\nStack trace:\n#0 /home/www/Application/vendor/php-http/client-common/src/PluginClient.php(81): Http\\Client\\Curl\\Client->sendRequest(Object(GuzzleHttp\\Psr7\\Request))\n#1 /home/www/Application/vendor/php-http/client-common/src/Plugin/AddHostPlugin.php(64): Http\\Client\\Common\\PluginClient->Http\\Client\\Common\\{closure}(Object(GuzzleHttp\\Psr7\\Request))\n#2 /home/www/Application/vendor/php-http/client-common/src/PluginClient.php(156): Http\\Client\\Common\\Plugin\\AddHostPlugin->handleRequest(Object(GuzzleHttp\\Psr7\\Request), Object(Closure), Object(Closure))\n#3 /home/www/Application/vendor/php-http/client-common/src/PluginClient.php(170): Htt...\n', referer: http://www.bigrentz.com/equipment-rentals

So it looks like the timeout is occurring outside of the PHP stack. I can keep progressing down the layers.

Apache

I checked the Apache logs but didn't see anything indicative of an error any more alarming than what I saw in the PHP layer.

/var/log/apache2/error.log
/var/log/apache2/Application_error.log

I think I can confidently move on to the next layer at this point. PHP and Apache both just seem to be experiencing a timeout that's occurring further down the request chain.

NodeJS

Now I'm at the NodeJS application layer. Checking the logs here, I'm immediately presented with errors indicating that a request failed. The only kicker is that I don't have any more information about what exactly the error was.

Sounds like the app isn't properly handling some exception. This could potentially be where the problem is happening, but I can't be sure at this point.

(node:17345) UnhandledPromiseRejectionWarning: Error: next shouldn't be called more than once
    at f (/var/NodeJsApplication/7.caee182e87fc5af2ae0b9d2b658caabc743a5bbe/node_modules/once/once.js:34:13)
    at RequestHandler.getReq (/var/NodeJsApplication/7.caee182e87fc5af2ae0b9d2b658caabc743a5bbe/api/svc/RequestHandler.js:115:20)
    at <anonymous>
    at process._tickDomainCallback (internal/process/next_tick.js:229:7)
(node:17345) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 22)
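
As an aside, that first error is the signature of the once module's strict mode: a next callback wrapped with it is being invoked a second time somewhere. A minimal reproduction of the error shape (not the application's actual handler) looks like this:

// Minimal reproduction; the handler shape here is illustrative
const once = require('once');

function handler(req, res, next) {
  next = once.strict(next); // once.strict throws if the wrapped function runs twice
  next();                   // fine
  next();                   // throws: "next shouldn't be called more than once"
}

function next() {}          // the wrapped function's own name ends up in the message
handler({}, {}, next);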

To detail what's happening in the request here: the NodeJS process has made an HTTPS request to a Mono C# process serving as the command service in a CQRS system.

The payload returned to it contained something called a stream revision. In event sourcing, a stream revision is the revision number of a projection of data that is returned by a query service. The NodeJS process takes this stream revision and listens on what is essentially an event emitter wired up to an update notifier process in the Mono C# service, which sends along the current stream revision. When the stream revision the NodeJS process is waiting for is observed on the event emitter, it knows the command it issued has now fully been processed and it can return to the caller (in this case the PHP / WordPress application).
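
In code, that wait looks something like the sketch below (the notifier, event name, and issueCommand are illustrative stand-ins, not the application's real identifiers):

// Sketch of the wait-for-stream-revision pattern described above
const EventEmitter = require('events');

const notifier = new EventEmitter(); // fed by the update notifier connection

function waitForRevision(targetRevision, timeoutMs = 30000) {
  return new Promise((resolve, reject) => {
    const onUpdate = ({ streamRevision }) => {
      // The projection has caught up once its revision reaches
      // the one the command service returned
      if (streamRevision >= targetRevision) {
        cleanup();
        resolve(streamRevision);
      }
    };
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error(`timed out waiting for revision ${targetRevision}`));
    }, timeoutMs);
    const cleanup = () => {
      clearTimeout(timer);
      notifier.removeListener('ProjectionUpdated', onUpdate);
    };
    notifier.on('ProjectionUpdated', onUpdate);
  });
}

async function handleCommand(issueCommand) {
  const { streamRevision } = await issueCommand(); // HTTPS call to the command service
  await waitForRevision(streamRevision);           // wait for the projection to catch up
  // now it's safe to respond to the PHP / WordPress caller
}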

However, since the exception here seems to be triggered by what's coming back from further down the chain, it doesn't look like the problem is on the NodeJS side.

Mono / ServiceStack

The only thing left to check was the event sourcing connection from the Mono C# service to the NodeJS layer.

When I performed a tcpdump of the connection between the event source client and the Mono C# update notifier, this is what I was seeing:

7f
id: 83
data: ApplicationProjection@ProjectionUpdated {"id":"0c5599ab7a3c42f5b625cf3c141564b8","streamRevision":6}

80
id: 84
data: ApplicationProjection@ProjectionUpdated {"id":"0c5599ab7a3c42f5b625cf3c141564b8","streamRevision":15}

80
80
id: 87
data: ApplicationProjection@ProjectionUpdated {"id":"0c5599ab7a3c42f5b625cf3c141564b8","streamRevision":35}

There is an extra chunk octet header being sent for a message that should have been sent for the last one.

The extra one is being parsed as part of the message body, and that is what leads to the PARSE_ERROR seen by the request.
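
For context on those bare 7f / 80 lines: in HTTP chunked transfer encoding, each chunk is prefixed with its length in hex (7f is 127 bytes, 80 is 128), so a well-formed stretch of this stream should have exactly one length line per chunk of event data, roughly:

80
id: ...
data: ApplicationProjection@ProjectionUpdated {...}

80
id: ...
data: ApplicationProjection@ProjectionUpdated {...}

A bare length line with nothing behind it, like the doubled 80 in the capture, is what ends up being handed to the event source client as if it were message data.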

This is now beginning to look like a Mono or ServiceStack bug.

Work I did that ended up being a dead end

  • Fixed a memory leak bug in the stream revision listener
  • Noticed we had a memory leak from not closing event emitters after we had seen the event
  • Noticed we weren't correctly observing the stream revision if it had already passed
  • Added logging to direct calls through the node API