
Addition of a resend_attempts counter to prevent output worker blocking. #21


Open: wants to merge 2 commits into main

Conversation

@smford commented Jan 5, 2016

This change prevents output blocking when the Graphite server is offline by introducing a resend_attempts counter.

TL;DR: This patch adds a resend_attempts option that discards metrics after 3 failed attempts, or after whatever number of attempts the user defines in the Logstash configuration.

There is a problem with the existing logstash-output-graphite plugin: if resend_on_failure is enabled and the Graphite server goes offline, Logstash attempts to resend metrics indefinitely and ends up blocking Logstash's single output worker entirely (meaning Logstash-processed logs cannot reach Elasticsearch). Given that in some business environments Graphite metrics are not considered vital, discarding metrics after a few failed attempts is often acceptable compared to losing or delaying Logstash-processed logs.

This can even result in Logstash crashing, as both logs and metrics are held in memory waiting to be written by the now-blocked single output worker. The issue also causes back pressure on the logstash-forwarder clients sending logs to Logstash, and it can compound as Logstash's own logs fill with millions of back-pressure warnings until all local file storage is consumed.

I would also argue that, since the Logstash metrics are not timestamped when sent to Graphite, discarding stale metrics is far preferable to having a large backlog of stale metrics dumped into Graphite in a small window of time once the Graphite server comes back online. Because Graphite has a metric storage resolution/retention policy, a large surge or backlog of the same metric arriving from Logstash within a window smaller than that resolution will simply be discarded anyway.

We have tested this extensively on Logstash v1.5.4-1 and found Logstash to be considerably more stable.
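
The actual change is in the two commits attached to this PR. Purely as an illustration of the idea, and not the plugin's real internals (the method and variable names below are made up for this sketch), the counter boils down to a bounded retry that drops the metric once the limit is hit instead of looping forever:

require 'socket'

# Illustrative sketch only, not the actual patch: retry a failing write a
# bounded number of times, then drop the payload so the caller (standing in
# for Logstash's single output worker) never blocks forever.
def send_metric(host, port, line, max_attempts: 3, retry_interval: 1)
  attempts = 0
  begin
    attempts += 1
    TCPSocket.open(host, port) { |sock| sock.write(line) }
    true
  rescue SystemCallError, IOError => e
    if attempts < max_attempts
      sleep(retry_interval)  # plays the role of reconnect_interval
      retry
    else
      warn "dropping metric after #{attempts} failed attempts: #{e.message}"
      false                  # the resend_attempts-style limit has been exhausted
    end
  end
end

# With Graphite down, this returns false after max_attempts tries instead of
# retrying forever:
send_metric('127.0.0.1', 2003, "logstash.events.count 42 #{Time.now.to_i}\n")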

The resend_attempts option works well with a reduced reconnect_interval setting too, for example:
reconnect_interval => 1
resend_attempts => 3
resend_on_failure => true
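
For context, those settings sit inside the plugin's output block. A fuller example might look like the following; the host, port, and metrics values are placeholders for your own setup, and resend_attempts is the option added by this patch:

output {
  graphite {
    host => "graphite.example.com"        # placeholder Graphite host
    port => 2003                          # Graphite's default plaintext port
    metrics => { "logstash.events" => "%{count}" }  # placeholder metric mapping
    reconnect_interval => 1
    resend_on_failure => true
    resend_attempts => 3                  # give up and discard after 3 failed sends
  }
}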

Note: I will create another PR in the future to address the other logstash-output-graphite problem, where Logstash cannot start unless it can connect to a working Graphite server.

This prevents metrics from being spooled indefinitely and causing Logstash to crash when Graphite becomes unavailable.
@jordansissel (Contributor)

I haven't reviewed the code yet, partly because there's no CLA signed. Can you do https://github.com/elastic/logstash/blob/master/CONTRIBUTING.md#contribution-steps?

I'll do a review after CLA so I can look at the code :)

One comment on your description:

"I would also argue that, since the Logstash metrics are not timestamped when sent to Graphite"

I don't think this is true. Logstash does use the timestamp from the event when sending it to Graphite:

@smford (Author) commented Jan 8, 2016

Hello Jordan,

Thanks for the feedback.

Concerning the CLA: I am getting my workplace to sign the CLA, and we'll likely be raising some more contributions from a few other people. It'll take a couple of days to organise.

Graphite timestamps: Thanks for that clarification.

@elasticsearch-release

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run; then say 'jenkins, test it'.

@RickardCardell

Bump. How is it going with this PR?

@smford (Author) commented Jun 11, 2016

Just noticed this. I have signed the CLA, so this should move forward.

@smford (Author) commented Jun 11, 2016

jenkins, test it

@TommySedin

Any news on this? We're having a very similar problem.

@smford (Author) commented Mar 16, 2017

jenkins, test it.

@jordansissel, a few people seem interested in this. The CLA has been signed since July 2016; can this move forward?

@karmi commented Mar 16, 2017

Hi @smford, we have found your signature in our records, but it seems you signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails to your GitHub profile (they can be hidden), so we can match your e-mails to your GitHub profile?

@smford (Author) commented Mar 16, 2017

Signed this CLA 3 times! Go go go.

@smford (Author) commented Mar 17, 2017

@karmi Hello Karmi, I have signed the CLA a number of times, and based on your last comment I signed it again with the email address that matches this Git account's commits. How do we progress this?

@TommySedin commented Mar 17, 2017 via email

@smford (Author) commented Mar 21, 2017

jenkins, test it

@jsvd closed this Mar 21, 2017
@jsvd reopened this Mar 21, 2017
@jordansissel (Contributor)

@smford I do not find your email in the CLA database. I have found an email address for smford, but it is not the one you used in Git.

To find the one you used in git, see git log, or more directly for this patch, run this command:

% curl -s https://patch-diff.githubusercontent.com/raw/logstash-plugins/logstash-output-graphite/pull/21.patch | grep '^From:'

The above shows two lines with the same email (two commits, so far so good). When I search our CLA database for this email, it is not found.

@smford (Author) commented Jan 24, 2018

Hello, I have signed the CLA again. It should be go go go now.

I signed it with the email address from:
curl -s https://patch-diff.githubusercontent.com/raw/logstash-plugins/logstash-output-graphite/pull/21.patch | grep '^From:'

@synFK commented Sep 18, 2018

Any news here? Why can't this PR be merged?
