
CSE 398: System Administration

Debugging

Learn the customer's problem

Find the root cause and fix it

Have the right tools

Fixing things once

  Fix things once, rather than over and over

  Avoid the temporary fix trap

  Learning from carpenters

Spring 2004

2004 Brian D. Davison

Learn the customer's problem

Step one: understand (at a high level) what the user is trying to do, and what part is failing

The customer expects a particular result from some action, but is getting something else

Ex:

  My mail program is broken

  I can't reach the mail server

  My mailbox disappeared!

Any could be true, but the real problem could be DNS, a power failure, a network problem, etc.

When complete, make sure the customer agrees!
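
A complaint such as "I can't reach the mail server" can hide several different failures. The following is a minimal sketch, not part of the original lecture, of how a first check might separate a DNS failure from a network failure. The function name, the return labels, and the default port 25 are assumptions made for illustration; the `resolve` and `connect` parameters exist only so the logic can be exercised without a real network.

```python
import socket


def diagnose_mail_server(hostname, port=25, timeout=5.0,
                         resolve=socket.gethostbyname,
                         connect=socket.create_connection):
    """Classify a reachability problem as 'dns', 'network', or 'reachable'.

    Hypothetical helper: the labels are illustrative, not a standard.
    """
    # Step 1: can the name be resolved at all?
    try:
        address = resolve(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return "dns"

    # Step 2: can a TCP connection be opened to the resolved address?
    try:
        connection = connect((address, port), timeout)
    except OSError:  # refused, unreachable, or timed out
        return "network"

    connection.close()
    return "reachable"
```

A result of "reachable" only shows that the server accepts connections; the customer's actual problem (a missing mailbox, for example) may still lie elsewhere.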


Example #1: tape failures
As every System Administrator knows, reliable backups are a
must. Because of this my team became suitably concerned when
the operators handling our central database servers started to
report "tape failures." The failures soon became regular, and
required regular manual intervention to keep operational. In
investigating the cause of this problem, corporate security and
production floor rules forced us to depend on the operators for
information. The operations staff placed the blame on the
off-site tape storage service's jostling tapes during transport, and
requests for samples of failed tapes gave no indication as to the
cause.


Example #1 continued
The root cause of the problem didn't become obvious until this
had been going on for a couple of months. During a large
system upgrade, my team was able to observe the operators at
work. The operations staff had been outsourced to a low-cost
contracting firm that apparently contained a large percentage of
fans of the local professional hockey team. The operators were
skidding the 8mm tapes across the computer room floor like a
hockey puck instead of carrying them across the floor. Adding a
rule prohibiting throwing, skipping, and sliding of backup tapes
quickly restored backups to a reliable state.
Tape Hockey, by Allen Peckham


Find the problem's cause and fix it

Workarounds are good, but fixing the root cause is much better

Rebooting/restarting is a common workaround

E.g., solution for full disk problem is not to delete old log files

Improving the speed of reboots is not really the solution either!
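
For the full-disk case, a first step toward the root cause is finding out what is actually consuming the space. The sketch below is an illustration added here, not a tool named in the lecture; the function names are hypothetical. It totals the sizes of the files directly inside each directory under a starting point and lists the largest directories.

```python
import os


def directory_sizes(root):
    """Map each directory under root to the bytes used by files directly in it."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        total = 0
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so link targets are not counted
                total += os.lstat(path).st_size
            except OSError:
                # The file vanished or is unreadable; skip it
                continue
        sizes[dirpath] = total
    return sizes


def biggest_directories(root, top_n=5):
    """Return up to top_n (directory, bytes) pairs, largest first."""
    sizes = directory_sizes(root)
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

Knowing which directory keeps growing points toward the process that fills it, which is the thing to fix.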


Example #2: mail problems
An ISP noticed that email service was particularly slow one day
and they were getting complaints that it took up to 4 hours to
deliver messages that were sent through the SMTP server.
The quick (and easy) solution would be to restart the server and
flush out whatever was slowing it down. That would have
masked the problem, however.
Instead, they monitored the service and noticed that they were
getting repeated accesses from the same site. Hundreds, no,
thousands of emails flowing into us from one source. This
indicated that someone was spamming through the ISP.


Example #2 continued
With the knowledge of their IP address, they were able to track
down who they were and block them from the system, thereby
stopping them from spamming through it any more.
(Yes, they had spam blocking in place; this user was a customer
and therefore was allowed to use the SMTP server. Their
Acceptable Use Policy, however, forbade using it to send
unsolicited commercial bulk email, so they banned the user.)
Source: Lecture notes of Scott Heffner, Keene State College
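
The monitoring step in this example amounts to counting messages per source and looking for an outlier. The sketch below illustrates that idea only. The log format (timestamp, source IP, recipient, separated by spaces) is an assumption invented for the example and does not correspond to any particular mail server, and the function name is hypothetical.

```python
from collections import Counter


def top_senders(log_lines, threshold):
    """Return (source_ip, message_count) pairs with at least `threshold` messages.

    Assumes a hypothetical log format: "<timestamp> <source-ip> <recipient>".
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1
    # most_common() orders the sources from busiest to quietest
    return [(ip, n) for ip, n in counts.most_common() if n >= threshold]
```

A single source far above the rest, as in the story, is a lead to investigate, not proof of abuse on its own.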


Example #3: missing files
In the middle of the night, all the machines went down,
with varying amounts of stuff missing.
Nobody knew what was going on! The systems were
restored from backup, and things seemed to be going OK,
until the next night.
This time, Corporate Security was called in, and the
admin group's supervisor was called back from his
vacation (I think there's something in there about a
helicopter picking the guy up from a rafting trip in the
Grand Canyon).


Example #3 continued
By chance, somebody checked the cron scripts, and all
was well for the next night...
Why? What happened:
We have a home-grown admin system that controls
accounts on all of our machines. It has a remove-user
operation that removes the user from all machines at
the same time in the middle of the night.
Well, one night, the thing goes off and tries to remove
a user with the home directory '/'...
Organization: AT&T Bell Labs, Murray Hill, NJ, USA
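
A permanent fix for this kind of failure is a sanity check before anything is deleted. The sketch below is a hypothetical guard, not the system described in the story: the function name and the assumption that all user home directories live under `/home` are made up for illustration.

```python
import posixpath


def safe_to_remove_home(home_directory, home_root="/home"):
    """Return True only if home_directory lies strictly inside home_root.

    home_root="/home" is an assumed site convention, not a universal rule.
    """
    # Reject empty and relative paths outright
    if not home_directory or not posixpath.isabs(home_directory):
        return False

    # Collapse "..", "." and doubled slashes before comparing
    path = posixpath.normpath(home_directory)
    root = posixpath.normpath(home_root)

    # The path must share home_root as a prefix and must not be home_root itself
    inside_root = posixpath.commonpath([path, root]) == root
    return inside_root and path != root
```

A removal script would call this check first and refuse to continue (and alert someone) when it returns False.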

How to find the cause

Be systematic

  Form hypotheses, test them, note the results, make changes based on those results

Use

  Process of elimination

  Successive refinement

Real problem is most often associated with the most recent change made to the host, network, or whatever is broken

  From a lack of testing

Process of elimination

Remove different parts of the system until the problem disappears

  Problem was in last part removed

Common technique for hardware problems

  Swap or remove pieces until it works

Also works for software
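
The process of elimination can be written down as a short loop. This is a minimal sketch under simplifying assumptions: exactly one component is at fault, and `system_works` is a hypothetical check supplied by the caller that reports whether the system behaves with the given parts attached.

```python
def find_culprit(components, system_works):
    """Remove components one at a time until the system works.

    Returns the last component removed, or None if the problem never goes away.
    """
    remaining = list(components)  # work on a copy so the caller's list is untouched
    for part in components:
        remaining.remove(part)
        if system_works(remaining):
            # The problem disappeared right after removing this part
            return part
    return None
```

For example, with the parts `["mouse", "printer", "scanner"]` and a check that fails whenever the scanner is attached, the function returns `"scanner"`.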


Successive refinement

Add one new component at a time and verify that it works correctly

  traceroute works this way

May require examining intermediate stages of output

For systems/processes with many components, the process of elimination and successive refinement may take a while.

  Why? What is an alternative?
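
Both techniques test one component per step, so the work grows in proportion to the number of components. The slide leaves the alternative open. One common option, assumed here and not taken from the lecture, is to bisect: test the halfway point and discard the half that is known to be good. In the sketch below, `works_through(n)` is a hypothetical check that returns True when the first `n` stages of a pipeline produce correct output, and the search assumes that once a stage fails, every later checkpoint also looks wrong.

```python
def first_failing_stage(stage_count, works_through):
    """Return the 0-based index of the first failing stage, or None if all work.

    Assumes works_through(0) would be True (no stages, nothing to break).
    """
    if works_through(stage_count):
        return None

    # Invariant: the first `low` stages work; the first `high` stages do not.
    low, high = 0, stage_count
    while high - low > 1:
        mid = (low + high) // 2
        if works_through(mid):
            low = mid
        else:
            high = mid
    return high - 1
```

With 8 stages this needs about 4 checks where a one-at-a-time search could need 8, and the gap widens as the number of stages grows.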

Have the right tools

Diagnostic tools let you see into devices or systems to see inner workings

  Still need to interpret what you see

Packet sniffers are easy to use

  Understanding the protocols captured requires knowledge and training (e.g., networking courses)

Understand how the tool works

  It may draw the wrong conclusion

Simple tools are often best

  ping, traceroute, telnet

Take a scientific approach

Given an unusual, recurring behavior/problem

  Collect data

  Visualize [optional, but often helpful]

  Discern patterns

  Hypothesize source of patterns

  Test for such sources

  Apply solution

  Test solution
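
The first three steps (collect data, visualize, discern patterns) can be illustrated with a few lines of code. This is a sketch only: the idea of grouping incidents by hour of day, the function names, and the ISO 8601 timestamp format are assumptions chosen for the example.

```python
from collections import Counter
from datetime import datetime


def incidents_by_hour(timestamps):
    """Count incidents per hour of day (0-23) from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def text_histogram(counts):
    """Render one line per hour that has incidents, one '#' per incident."""
    return "\n".join(
        f"{hour:02d}:00 {'#' * counts[hour]}" for hour in sorted(counts)
    )
```

If most incidents cluster around the same hour, a scheduled job running at that time becomes a natural hypothesis to test next.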

End-to-end understanding helps
A customer reports that some of his files were disappearing:
he had about 100 MB in his home directory, and all but
2 MB had disappeared.
He restored his files. A couple of days later it happened
again.
This had been happening for a few weeks, but he was
embarrassed to tell the system admins.
Theory 1: Virus. Scans revealed nothing.
Theory 2: Prank, or bad cron job.
He was given pager numbers and told to call next time
Network sniffers were put into place


Example #4 continued
It happens again. He was asked what he had last done: he had used a lab
machine to surf the Web.
Extra knowledge helps: a sysadmin remembered that
Web browsers kept a cache and pruned it to stay under a
certain limit (such as 2 MB).
The lab workstation was misconfigured; the browser found an
invalid/missing cache directory and used the user's home
directory instead.


Fix things once, rather than over and over again

When something seems trivial or temporary, it is easy to ignore it, or use a quick fix

A little effort will often pay for itself

Rule: Fix it once

  Corollary A: Fix the problem permanently

  Corollary B: Leverage what others have done; don't reinvent the wheel

  Corollary C: Fix a problem for all hosts at the same time

Avoid the temporary fix trap

Sometimes a complete fix is impossible in that situation

It's important that a temporary fix be followed by a permanent one

  Record the actions taken for a temporary problem!

  Put the full solution on a trouble ticket!

Fixing the same small things is habit forming: we get good at the keystrokes needed!

We get used to the quick fix, and don't realize how much time we have lost as a result.


Mailing list example

Running a mailing list seems easy.

  E.g., automated subscribe and unsubscribe

Book author ran many mailing lists, and had to deal with bounced messages.

Dealing with bounces takes time. Wrote scripts to help manage it: collect bounces, figure out who was bouncing, delete subscriber if error persisted. Still took ~1 hour a day!

Better solution was other software that handled bounces or made list owners deal with them

He ignored bounces for a week; stayed late to install new software without interruption. Cost: 5 hours; savings: 4 hours per week.
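
The cost and savings figures on this slide make the payoff easy to work out. The sketch below uses only the two numbers given (5 hours once, 4 hours saved per week); the function names and the one-year horizon in the usage note are illustrative assumptions.

```python
def break_even_weeks(one_time_cost_hours, weekly_savings_hours):
    """Weeks until the time saved equals the time invested."""
    return one_time_cost_hours / weekly_savings_hours


def net_hours_saved(one_time_cost_hours, weekly_savings_hours, weeks):
    """Hours gained after `weeks`, net of the one-time cost."""
    return weekly_savings_hours * weeks - one_time_cost_hours
```

With the slide's figures, `break_even_weeks(5, 4)` gives 1.25 weeks, and `net_hours_saved(5, 4, 52)` gives 203 hours over a year.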

Learning from carpenters

Measure twice, cut once

  A little extra care is a small price compared to the potential damage of a mistake.

Carpenters copy a length by reusing the original piece over and over again

  Reuse working scripts rather than rewriting them

  Use command-line shell shortcuts rather than re-typing

Example #5: rm folly
My mistake on SunOS (with OpenWindows) was to try
and clean up all the '.*' directories in /tmp. Obviously "rm
-rf /tmp/*" missed these, so I was very careful and made
sure I was in /tmp and then executed
"rm -rf ./.*"
I will never do this again. If I am in any doubt as to how a
wildcard will expand I will echo it first.
Organization: DataCAD Ltd, Hamilton, Scotland
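
The advice to echo a wildcard first can be illustrated by previewing what a pattern would match. The sketch below models the traditional shell rule in which a pattern beginning with a dot can match the special entries `.` and `..`; some newer shells may skip those entries, so this is a model of the behavior in the story, not of every shell. The function name is hypothetical.

```python
import fnmatch
import os


def preview_shell_pattern(directory, pattern):
    """List the names a traditional shell would substitute for `pattern`.

    Unlike os.listdir, a shell also considers "." and ".." as candidates.
    """
    candidates = [".", ".."] + os.listdir(directory)
    matches = []
    for name in candidates:
        # Shell rule: names starting with "." match only if the pattern
        # itself starts with a literal "."
        if name.startswith(".") and not pattern.startswith("."):
            continue
        if fnmatch.fnmatchcase(name, pattern):
            matches.append(name)
    return sorted(matches)
```

Under this model, the pattern `.*` in `/tmp` includes `..`, which is the parent directory `/`, and that is why the command in the story reached far beyond `/tmp`.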


Summary

Understand the problem

Fixes should be permanent

Leverage others' fixes

Fixes should be global

Test your solution!
