PharmaSUG Tokyo 2019 PO02
[Figure: a DM CRF with annotations added to the Age and Race items.]
Add annotations. It takes time to do this manually!
Motivation
Overall Flow
Environment (1)
Materials
Name            Format   Short Description
Domain List     Excel    SDTM domain list. Consists of domain abbreviations and descriptions.
aCRF metadata   Excel    Lists of standard CRF texts and corresponding SDTM variables. 1 sheet per domain.
Existing aCRF   PDF      Used as test data for machine learning.
New CRF         PDF      CRF to be newly annotated.
Environment (2)
Structure of our directory
┌─scikit-learn                        Directory for Step 1.
│  ├─document                         Files to be loaded for Step 1.
│  │    domain_list.xlsx
│  │    existing_acrf1.pdf
│  │    existing_acrf2.pdf
│  └─output
│       df_crf.csv                    Training data generated in Step 1.3.
│       df_vct1.csv                   Word frequency data from Step 1.4.
│       df_vct2.csv                   Tf-idf from Step 1.5.
│       cvct.pkl                      Outputs from Step 1.4-1.6. Used for Step 1.7.
│       tftf.pkl
│       clf.pkl
│       df_clsres.csv                 Result of classifying new CRF in Step 1.7.
│
├─acrf                                Directory for Step 2.
│  ├─document                         Files to be loaded for Step 2.
│  │    aCRF_metadata.xlsx
│  │    new_crf1.pdf
│  │
│  └─output
│       new_crf1_ant.pdf              New aCRF from Step 2.4.
│
└─program
     1_1to6_create_classifier.py      Python program for Step 1.1 - 1.6.
     1_7_classify_crf.py              Python program for Step 1.7.
     2_annotate.py                    Python program for Step 2.
Step 1 – Classify CRF pages
[Flow diagram. Legend: file input/output; process. Labels: Existing aCRF, New CRF, Classification of each page of new CRF, Results from Step 1, New aCRF.]
1.1 Get all words in each page
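import pandas as pd

# One row per CRF page: page number, domain name, and the words on that page.
dfcrf = pd.DataFrame({"page": pgseqs, "domain": dnmseqs, "words": wrdseqs})
# Output CSV: append without a header when afl is "y", otherwise write a new file.
if afl.lower() == "y":
    dfcrf.to_csv(pth_csv, mode='a', header=False, index=False)
else:
    dfcrf.to_csv(pth_csv, index=False)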
1.4 Count frequency - 1.6 Create classifier
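Steps 1.4 to 1.6 might look roughly like the sketch below: count word frequencies, convert them to tf-idf weights, and fit a classifier that predicts the domain of a page. The training strings are made up for illustration, and the classifier type (MultinomialNB) is an assumption, since the slides do not name the estimator. The names cvct and clf only echo the pickle file names in the directory listing.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: words of existing aCRF pages and their domains.
train_words = [
    "adverse event start date ongoing severity",
    "adverse event end date serious outcome",
    "birth date sex race ethnicity age",
    "demographics birth date sex race",
]
train_domains = ["AE", "AE", "DM", "DM"]

# Step 1.4: count word frequency per page.
cvct = CountVectorizer()
counts = cvct.fit_transform(train_words)

# Step 1.5: convert the counts to tf-idf weights.
tfidf = TfidfTransformer()
weights = tfidf.fit_transform(counts)

# Step 1.6: create the classifier (MultinomialNB is an assumed choice).
clf = MultinomialNB()
clf.fit(weights, train_domains)

# Classify the words of a new CRF page with the same fitted objects.
new_page = ["date of birth sex race"]
predicted = clf.predict(tfidf.transform(cvct.transform(new_page)))
print(predicted[0])
```

The fitted vectorizer, transformer and classifier have to be reused unchanged when classifying new pages, which is presumably why they are stored as pickle files for Step 1.7.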
We have 2 things to do.
1. Get coordinates of CRF items to add annotation in appropriate location.
   • Left position: horizontal position of CRF item + xx pixel
   • Top position: equal to CRF item
   • Width of annotation: modified depending on the variable name's length
2. Choose spreadsheet of our standard list (= aCRF metadata) and find SDTM variable name which matches CRF item.

[Figure: CRF page with the items "AE ID:", "Start Date: MMDDYY" and "Is the adverse event still ongoing? □Yes □No", each receiving an annotation.]

Standards List of AE
AE ID              AESPID
Start Date         AESTDTC
Is the adverse…    AEENRTPT
. .

Standards List of DM
Birth Date         BRTHDTC
Gender             SEX
. .
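The two tasks above can be sketched in plain Python. Everything here is illustrative: the metadata dictionary stands in for one sheet of the aCRF metadata, and x_offset, char_width and padding are hypothetical parameters, because the slides give the offset only as "xx pixel" and do not say how the width is derived from the variable name's length.

```python
# Hypothetical excerpt of the AE sheet: CRF item text -> SDTM variable name.
AE_METADATA = {"AE ID": "AESPID", "Start Date": "AESTDTC"}


def find_variable(item_text, metadata):
    """Return the SDTM variable whose CRF text matches the item, or None."""
    key = item_text.strip().rstrip(":").strip()
    return metadata.get(key)


def annotation_rect(item_bbox, variable, x_offset=150.0, char_width=7.0, padding=6.0):
    """Return (x0, y0, x1, y1) of the annotation box for one CRF item."""
    x0, y0, x1, y1 = item_bbox
    left = x0 + x_offset                          # item position + offset
    top = y0                                      # same top as the CRF item
    width = len(variable) * char_width + padding  # grows with the name length
    return (left, top, left + width, top + (y1 - y0))


# Example CRF item with a made-up bounding box (in PDF points).
item_text, item_bbox = "Start Date:", (72.0, 100.0, 130.0, 112.0)
variable = find_variable(item_text, AE_METADATA)
rect = annotation_rect(item_bbox, variable)
print(variable, rect)
```

Exact matching on the item text is the simplest possible lookup; real CRF text may need normalisation or fuzzy matching before it lines up with the standard list.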
2.1 Get all text blocks in each page of new CRF,
2.2 Find SDTM variable names from aCRF metadata
doc = fitz.open(os.path.join(pthd2, fl1)) # open new crf.
2.3 Add annotations
doc.save("full file path")  # save new aCRF; the save method writes the current PDF to the given path.
Summary
Future prospects (1)