PDF conversion fails #199

Open

adrianariton opened this issue Dec 22, 2024 · 1 comment

Comments

@adrianariton

For PDFs it only converts to plain text, and sometimes it doesn't get the words right (it joins them into one long string).
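For context, a minimal sketch of how this symptom typically shows up; it assumes the converter extracts PDF text via pdfminer.six (a common backend for this kind of tool, not confirmed here), and `paper.pdf` is a placeholder path:

```python
# Hypothetical repro: plain-text extraction from a PDF with pdfminer.six.
# Some PDFs encode no explicit space glyphs between words, so the
# extracted output can come back with words joined into one long string.
from pdfminer.high_level import extract_text

text = extract_text("paper.pdf")  # placeholder path
print(text[:300])
```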

@Viddesh1

Does the extracted text look something like the example below?

67
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics
Volume 1: Long Papers, pages 67–93
March 17-22, 2024 c(cid:13)2024 Association for Computational Linguistics

Leak,Cheat,Repeat:DataContaminationandEvaluationMalpracticesinClosed-SourceLLMsSimoneBalloccuPatríciaSchmidtováMateuszLangoOndˇrejDušekCharlesUniversity,FacultyofMathematicsandPhysicsInstituteofFormalandAppliedLinguisticsPrague,CzechRepublic{balloccu,schmidtova,lango,odusek}@ufal.mff.cuni.cz ...

The rest of the extracted paper continues the same way, with every word run together and no spaces recovered.

If so, please take a look at issue #120, which reports the same problem of long text being combined into a single string.
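
If the converter does use pdfminer.six under the hood, one workaround to try (a sketch under that assumption; `paper.pdf` is again a placeholder) is to lower `LAParams.word_margin`: pdfminer inserts a space whenever the horizontal gap between two characters exceeds `word_margin` times the character width, so a smaller value recovers word boundaries in tightly-set PDFs:

```python
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# Assumption: the converter's PDF backend is pdfminer.six. Lowering
# word_margin (default 0.1) makes the layout analyser insert spaces
# for smaller inter-character gaps, which can split joined words.
laparams = LAParams(word_margin=0.05)
text = extract_text("paper.pdf", laparams=laparams)  # placeholder path
print(text[:300])
```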

Thanks!
Viddesh
