PDF conversion fails #199

Open

adrianariton opened this issue Dec 22, 2024 · 1 comment

Comments

@adrianariton

For PDFs it only converts to plain text, and sometimes it doesn't get the words right (it joins them into one long string).
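For context, a minimal sketch of how this symptom typically shows up; it assumes the converter extracts PDF text via pdfminer.six (a common backend for this kind of tool, not confirmed here), and `paper.pdf` is a placeholder path:

```python
# Hypothetical repro: plain-text extraction from a PDF with pdfminer.six.
# Some PDFs encode no explicit space glyphs between words, so the
# extracted output can come back with words joined into one long string.
from pdfminer.high_level import extract_text

text = extract_text("paper.pdf")  # placeholder path
print(text[:300])
```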

@Viddesh1

Does the extracted text look something like the example below?

67
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics
Volume 1: Long Papers, pages 67–93
March 17-22, 2024 c(cid:13)2024 Association for Computational Linguistics

Leak,Cheat,Repeat:DataContaminationandEvaluationMalpracticesinClosed-SourceLLMsSimoneBalloccuPatríciaSchmidtováMateuszLangoOndˇrejDušekCharlesUniversity,FacultyofMathematicsandPhysicsInstituteofFormalandAppliedLinguisticsPrague,CzechRepublic{balloccu,schmidtova,lango,odusek}@ufal.mff.cuni.cz ...

The rest of the extracted paper continues the same way, with every word run together and no spaces recovered.

If so, please take a look at issue #120, which reports the same problem of long text being combined into a single string.
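
If the converter does use pdfminer.six under the hood, one workaround to try (a sketch under that assumption; `paper.pdf` is again a placeholder) is to lower `LAParams.word_margin`: pdfminer inserts a space whenever the horizontal gap between two characters exceeds `word_margin` times the character width, so a smaller value recovers word boundaries in tightly-set PDFs:

```python
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# Assumption: the converter's PDF backend is pdfminer.six. Lowering
# word_margin (default 0.1) makes the layout analyser insert spaces
# for smaller inter-character gaps, which can split joined words.
laparams = LAParams(word_margin=0.05)
text = extract_text("paper.pdf", laparams=laparams)  # placeholder path
print(text[:300])
```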

Thanks!
Viddesh
