
Addition of a resend_attempts counter to prevent output worker blocking. #21


Open: wants to merge 2 commits into main

Conversation

@smford commented Jan 5, 2016

This change prevents output blocking when the Graphite server is offline by introducing a resend_attempts counter.

TL;DR: This patch adds a resend_attempts option that discards metrics after 3 failed attempts, or after whatever number of attempts the user defines in the Logstash configuration.

There is a problem with the existing logstash-output-graphite plugin: if resend_on_failure is enabled and the Graphite server goes offline, Logstash attempts to resend metrics indefinitely and ends up blocking Logstash's single output worker entirely (meaning Logstash-processed logs cannot reach Elasticsearch). Given that in some business environments Graphite metrics are not considered vital, discarding metrics after a few failed attempts is often acceptable compared to losing or delaying Logstash-processed logs.

This can even result in Logstash crashing, as both logs and metrics are held in memory waiting to be written by the now-blocked single output worker. The issue also causes back pressure on the logstash-forwarder clients sending logs to Logstash, and it can compound as Logstash's own logs fill with millions of back-pressure warnings until all local file storage is consumed.

I would also argue that, since the Logstash metrics are not timestamped when sent to Graphite, discarding stale metrics is far preferable to having a large backlog of stale metrics dumped into Graphite in a small window of time once the Graphite server comes back online. Because Graphite has a metric storage resolution/retention policy, a large surge or backlog of the same metric arriving from Logstash within a window smaller than that resolution will simply be discarded anyway.

We have tested this extensively on Logstash v1.5.4-1 and found Logstash to be considerably more stable.
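
The actual change is in the two commits attached to this PR. Purely as an illustration of the idea, and not the plugin's real internals (the method and variable names below are made up for this sketch), the counter boils down to a bounded retry that drops the metric once the limit is hit instead of looping forever:

require 'socket'

# Illustrative sketch only, not the actual patch: retry a failing write a
# bounded number of times, then drop the payload so the caller (standing in
# for Logstash's single output worker) never blocks forever.
def send_metric(host, port, line, max_attempts: 3, retry_interval: 1)
  attempts = 0
  begin
    attempts += 1
    TCPSocket.open(host, port) { |sock| sock.write(line) }
    true
  rescue SystemCallError, IOError => e
    if attempts < max_attempts
      sleep(retry_interval)  # plays the role of reconnect_interval
      retry
    else
      warn "dropping metric after #{attempts} failed attempts: #{e.message}"
      false                  # the resend_attempts-style limit has been exhausted
    end
  end
end

# With Graphite down, this returns false after max_attempts tries instead of
# retrying forever:
send_metric('127.0.0.1', 2003, "logstash.events.count 42 #{Time.now.to_i}\n")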

The resend_attempts option works well with a reduced reconnect_interval setting too, for example:
reconnect_interval => 1
resend_attempts => 3
resend_on_failure => true
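
For context, those settings sit inside the plugin's output block. A fuller example might look like the following; the host, port, and metrics values are placeholders for your own setup, and resend_attempts is the option added by this patch:

output {
  graphite {
    host => "graphite.example.com"        # placeholder Graphite host
    port => 2003                          # Graphite's default plaintext port
    metrics => { "logstash.events" => "%{count}" }  # placeholder metric mapping
    reconnect_interval => 1
    resend_on_failure => true
    resend_attempts => 3                  # give up and discard after 3 failed sends
  }
}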

Note: I will create another PR in the future to address the other logstash-output-graphite problem, where Logstash cannot start unless it can connect to a working Graphite server.

This prevents metrics from being spooled indefinitely and causing Logstash to crash when Graphite becomes unavailable.
@jordansissel (Contributor)

I haven't reviewed the code yet, partly because there's no CLA signed. Can you do https://github.com/elastic/logstash/blob/master/CONTRIBUTING.md#contribution-steps?

I'll do a review after CLA so I can look at the code :)

One comment on your description:

"I would also argue that, since the Logstash metrics are not timestamped when sent to Graphite"

I don't think this is true. Logstash does use the timestamp from the event when sending it to Graphite:

@smford (Author) commented Jan 8, 2016

Hello Jordan,

Thanks for the feedback.

Concerning the CLA: I am getting my workplace to sign the CLA, and we'll likely be raising some more contributions from a few other people. It'll take a couple of days to organise.

Graphite timestamps: Thanks for that clarification.

@elasticsearch-release

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run; then say 'jenkins, test it'.

@RickardCardell

Bump. How is it going with this PR?

@smford (Author) commented Jun 11, 2016

Just noticed this. I have signed the CLA, so this should move forward.

@smford (Author) commented Jun 11, 2016

jenkins, test it

@TommySedin

Any news on this? We're having a very similar problem.

@smford (Author) commented Mar 16, 2017

jenkins, test it.

@jordansissel, a few people seem interested in this. The CLA has been signed since July 2016; can this move forward?

@karmi commented Mar 16, 2017

Hi @smford, we have found your signature in our records, but it seems you signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails to your GitHub profile (they can be hidden), so we can match your e-mails to your GitHub profile?

@smford (Author) commented Mar 16, 2017

Signed this CLA 3 times! Go go go.

@smford (Author) commented Mar 17, 2017

@karmi Hello Karmi, I have signed the CLA a number of times, and based on your last comment I signed it again with the email address that matches this Git account's commits. How do we progress this?

@TommySedin commented Mar 17, 2017 via email

@smford (Author) commented Mar 21, 2017

jenkins, test it

@jsvd closed this Mar 21, 2017
@jsvd reopened this Mar 21, 2017
@jordansissel (Contributor)

@smford I do not find your email in the CLA database. I have found an email address for smford, but it is not the one you used in Git.

To find the one you used in git, see git log, or more directly for this patch, run this command:

% curl -s https://patch-diff.githubusercontent.com/raw/logstash-plugins/logstash-output-graphite/pull/21.patch | grep '^From:'

The above shows two lines with the same email (two commits, so far so good). When I search our CLA database for this email, it is not found.

@smford (Author) commented Jan 24, 2018

Hello, I have signed the CLA again. It should be go go go now.

I signed it with the email address from:
curl -s https://patch-diff.githubusercontent.com/raw/logstash-plugins/logstash-output-graphite/pull/21.patch | grep '^From:'

@synFK commented Sep 18, 2018

Any news here? Why can't this PR be merged?
