PharmaSUG Tokyo 2019 PO02

Automated aCRF Generation Using Python

PharmaSUG SDE Tokyo 2019, 24-Oct-2019

Mitsuhiro Isozaki, Hiroshi Nishioka, Takumi Koyama, Masayo Koike, Taku Uryu, Manabu Abe
Pfizer R&D Japan, Statistical Programming & Analysis Group
When we create aCRF...

Creating an aCRF involves two manual steps:
1. Detect specific CRF items and classify the page (e.g., a page containing Age and Race → DM domain).
2. Add annotations on the CRF (e.g., an Age annotation and a Race annotation on the DM page).

It takes time to do this manually!
Motivation

We have lists of standard CRF texts and their corresponding SDTM variables (e.g., "AE page text1" → AE***, "AE page text2" → AE***, ...).

If we automate CRF classification using a machine learning technique, can we automate the whole step?
Overall Flow

1. Machine Learning to Classify CRF Pages:
   Prepare Data → Create Classifier → Evaluate Classifier → Tune Classifier
2. Add Annotations on CRF → Generate aCRF.pdf
Environment (1)

Python: 3.7.4 (Windows 10)

Packages/Libraries to be imported:

  Name          Version   Short Description
  scikit-learn  0.21.3    Machine learning
  joblib        0.13.2    Output & load classifier
  PyMuPDF       1.14.20   Edit PDF
  pandas        0.25.0    Output & load Excel/CSV files
  xlrd          1.2.0     Read Excel files

Materials:

  Name           Format  Short Description
  Domain List    Excel   SDTM domain list. Consists of domain abbreviations and descriptions.
  aCRF metadata  Excel   Lists of standard CRF texts and corresponding SDTM variables. 1 sheet per domain.
  Existing aCRF  PDF     Used as training data for machine learning.
  New CRF        PDF     CRF to be newly annotated.
Environment (2)

Structure of our directory:

  ┌─scikit-learn                      Directory for Step 1.
  │ ├─document                       Files to be loaded for Step 1.
  │ │   domain_list.xlsx
  │ │   existing_acrf1.pdf
  │ │   existing_acrf2.pdf
  │ └─output
  │     df_crf.csv                   Training data generated in Step 1.3.
  │     df_vct1.csv                  Word frequency data from Step 1.4.
  │     df_vct2.csv                  Tf-idf from Step 1.5.
  │     cvct.pkl                     Outputs from Steps 1.4-1.6. Used for Step 1.7.
  │     tftf.pkl
  │     clf.pkl
  │     df_clsres.csv                Result of classifying new CRF in Step 1.7.
  ├─acrf                             Directory for Step 2.
  │ ├─document                       Files to be loaded for Step 2.
  │ │   aCRF_metadata.xlsx
  │ │   new_crf1.pdf
  │ └─output
  │     new_crf1_ant.pdf             New aCRF from Step 2.4.
  └─program
      1_1to6_create_classifier.py    Python program for Steps 1.1 - 1.6.
      1_7_classify_crf.py            Python program for Step 1.7.
      2_annotate.py                  Python program for Step 2.
Step 1 – Classify CRF pages

Prepare data, create the classifier, and classify the new CRF.

From the existing aCRF (and the Domain List):
  1.1 Get all words in each page
  1.2 Get all text blocks in each page
  1.3 Get domain name annotation
  1.4 Count frequency of words in each page
  1.5 Calculate tf-idf
  1.6 Create classifier using machine learning

From the new CRF:
  1.7 Classify each page to the appropriate domain
Step 2 – Create aCRF

Add annotations on the new CRF and generate the new aCRF, using the new CRF, the results from Step 1 (classification of each page of the new CRF), and the aCRF metadata:

  2.1 Get all text blocks in each page of the new CRF
  2.2 Find SDTM variable names from the aCRF metadata
  2.3 Add annotations
  2.4 Save aCRF (→ new aCRF)
1.1 Get all words in each page

Training data consists of:
1. word frequencies on the CRF, and
2. classified results (the domain of each page).

For #1, get all words using getTextWords and combine them into one text string per page. getTextWords returns a list of tuples:
• 1st-4th items: coordinates of each word
• 5th item: the word in the CRF

  wrd[0]  wrd[1]  wrd[2]  wrd[3]  wrd[4]
  55.3    100.8   80.9    107.3   Start
  85.3    100.8   110.9   107.3   Date:
  55.3    109.4   78.1    116.0   Ongoing:

Function to read the existing aCRF:

  import os
  from operator import itemgetter
  import fitz  # PyMuPDF
  import pandas as pd

  def read_crf1(pth, fl, plist, afl):
      # pth/fl: folder path and file name; plist: pages to be read;
      # afl: "y" to append to the output csv.
      # dfd (domain list DataFrame) and pth_csv (output csv path) are defined elsewhere.
      pgseqs = []   # initialize output variables.
      dnmseqs = []
      wrdseqs = []

      for pg in plist:
          doc = fitz.open(os.path.join(pth, fl))  # open pdf.
          page = doc[pg]  # page number in pdf.
          wrdlst = page.getTextWords()   # get words in a page.
          blklst_ = page.getTextBlocks() # get text blocks in a page.
          blklst = sorted(blklst_, key=itemgetter(1, 0))  # sort by coordinate.

          wrdseq = ""  # initialize per page.
          for col1, col2 in zip(dfd['Domain'], dfd['Description']):
              for blk in blklst:
                  if (col1 + "=" + col2).lower().replace(" ", "") in blk[4].lower().replace(" ", ""):
                      for wrd in wrdlst:  # combine words in a page with spaces.
                          if wrdseq == "":
                              wrdseq = wrd[4]
                          else:
                              wrdseq = wrdseq + " " + wrd[4]
                      pgseqs.append(pg)
                      dnmseqs.append(col1.lower())
                      wrdseqs.append(wrdseq)
                      break

      dfcrf = pd.DataFrame({"page": pgseqs, "domain": dnmseqs, "words": wrdseqs})

      # output csv.
      if afl.lower() == "y":
          dfcrf.to_csv(pth_csv, mode='a', header=False, index=False)
      else:
          dfcrf.to_csv(pth_csv, index=False)

This creates a list of text strings (wrdseqs):

  wrdseqs = ["birth date female ...", "date onset ...", ...]

Word frequency is derived in a later step.
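The word-combining loop above can be sketched standalone. The tuples below are made-up sample values mimicking getTextWords output (coordinates plus the word itself); only index 4, the word, is used:

```python
# Hypothetical getTextWords-style tuples: (x0, y0, x1, y1, word).
wrdlst = [
    (55.3, 100.8, 80.9, 107.3, "Start"),
    (85.3, 100.8, 110.9, 107.3, "Date:"),
    (55.3, 109.4, 78.1, 116.0, "Ongoing:"),
]

wrdseq = ""
for wrd in wrdlst:  # combine words in a page with a space.
    if wrdseq == "":
        wrdseq = wrd[4]
    else:
        wrdseq = wrdseq + " " + wrd[4]

print(wrdseq)  # -> Start Date: Ongoing:
```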
1.2 Get all text blocks in each page,
1.3 Get domain name annotation

(These steps are also covered by the read_crf1 function shown in 1.1.)

For #2, get all text blocks using getTextBlocks. getTextBlocks returns the coordinates and contents of each text block, similar to getTextWords; the 5th item (= blk[4]) is the block's text.

The existing aCRF has a domain name annotation on each page. This can be used as the classified result for the training data. To find it, match the text blocks against the domain list — the `(col1 + "=" + col2) ... in blk[4]` check in read_crf1.
1.4 Count frequency - 1.6 Create classifier

The previous step generates training data in CSV format (= dfcsv):

  source page  domain (classified result)  words
  5            dm                          birth date female ...
  55           ae                          date onset ...
  …            …                           …

Steps: count frequency, calculate tf-idf, and create the classifier.

  from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
  from sklearn.naive_bayes import MultinomialNB

  cvct = CountVectorizer()
  X_train_counts = cvct.fit_transform(dfcsv.words)

  tftf = TfidfTransformer()
  X_train_tfidf = tftf.fit_transform(X_train_counts)

  clf = MultinomialNB().fit(X_train_tfidf, dfcsv.domain)

CountVectorizer returns the count frequency of each word from a list of text strings:

  X_train_counts   source page  birth  date  female  onset
                   5            2      2     2       0
                   55           0      7     0       4
                   …            …      …     …       …

TfidfTransformer calculates tf-idf from the count frequencies:

  X_train_tfidf    source page  birth   date    female  onset
                   5            0.0952  0.0417  0.0952  0
                   55           0       0.0875  0       0.1329
                   …            …       …       …       …

MultinomialNB().fit creates a classifier from the tf-idf values and the classified results in the training data, based on the multinomial Naïve Bayes model.
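The three steps can be run end to end on toy data. This is a minimal sketch: the training strings and domain labels below are invented stand-ins for dfcsv, not real CRF text:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Invented toy training data standing in for dfcsv.
words = [
    "birth date female race ethnicity",      # a DM-like page
    "date onset severity outcome action",    # an AE-like page
]
domains = ["dm", "ae"]

cvct = CountVectorizer()
X_train_counts = cvct.fit_transform(words)          # word frequency per page.

tftf = TfidfTransformer()
X_train_tfidf = tftf.fit_transform(X_train_counts)  # tf-idf per page.

clf = MultinomialNB().fit(X_train_tfidf, domains)

# Classify a new page: use transform (not fit_transform) with the fitted objects.
new_words = ["birth date race"]
X_new = tftf.transform(cvct.transform(new_words))
pred = clf.predict(X_new)
print(pred)  # -> ['dm']
```

Keeping cvct and tftf fitted only on the training data (and reusing them via transform) is exactly why the paper pickles them as cvct.pkl and tftf.pkl for Step 1.7.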
What is multinomial Naïve Bayes and tf-idf?

Algorithm of the multinomial Naïve Bayes model:
1. Suppose the words "date" and "female" appear 2 times each on 1 CRF page, and we want to classify the page as "AE domain" or "DM domain".
2. Calculate 2 conditional probabilities given the word frequencies:
   - Prob1 = probability that the page is "AE domain"
   - Prob2 = probability that the page is "DM domain"
3. If Prob1 < Prob2, then classify the page as "DM domain".

"Naïve" = the independence assumption for all word frequencies. This is rarely true in the real world; however, multiple studies show the classifier works well in practice.

tf-idf:
• Term frequency (tf) × inverse document frequency (idf):

    tf  = (frequency of word X in page A) / (frequency of all words in page A)
    idf = log( (number of pages) / (number of pages which have word X) )

• This statistic is more useful than raw frequency for classification:
  specific word → high weight; common word → low weight.

Example:
• Frequency: "date" = "female" = 2 in page 5, but tf-idf: "date" < "female" in page 5.
• "date" is a common word for both demographic and adverse event pages (e.g., birth date and onset date). However, "female" is a word specific to demographics.

  Counts:                              tf-idf:
  page  birth  date  female  onset     page  birth   date    female  onset
  5     2      2     2       0         5     0.0952  0.0417  0.0952  0
  55    0      7     0       4         55    0       0.0875  0       0.1329
  …     …      …     …       …         …     …       …       …       …
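The tf and idf formulas above can be checked numerically. This sketch uses the plain textbook formulas from the slide on the example counts (note that scikit-learn's TfidfTransformer applies a smoothed idf plus normalization, so its numbers differ):

```python
import math

# Word counts per page, taken from the slide's example table.
pages = {
    5:  {"birth": 2, "date": 2, "female": 2, "onset": 0},
    55: {"birth": 0, "date": 7, "female": 0, "onset": 4},
}

def tf(word, page):
    # frequency of the word in the page / frequency of all words in the page.
    counts = pages[page]
    return counts[word] / sum(counts.values())

def idf(word):
    # log( number of pages / number of pages which have the word ).
    n_pages = len(pages)
    n_with_word = sum(1 for c in pages.values() if c[word] > 0)
    return math.log(n_pages / n_with_word)

# "date" appears on every page -> idf = log(2/2) = 0: low weight.
# "female" appears only on page 5 -> idf = log(2/1) > 0: higher weight.
print(idf("date"), idf("female"))
```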
1.7 Classify each page to appropriate domain

Get all words in each page of the new CRF using getTextWords and combine them, just as we did when creating the training data.

Function to read the new CRF:

  def read_crf2(pth, fl, plist):
      wrdseqs = []  # initialize output variable.
      for pg in plist:
          doc = fitz.open(os.path.join(pth, fl))  # open new crf.
          page = doc[pg]  # page number in pdf.
          wrdlst = page.getTextWords()  # get words in a page.

          wrdseq = ""  # initialize per page.
          for wrd in wrdlst:  # combine words in a page with spaces.
              if wrdseq == "":
                  wrdseq = wrd[4]
              else:
                  wrdseq = wrdseq + " " + wrd[4]
          wrdseqs.append(wrdseq)
      return wrdseqs

Call the above function:

  file1 = r"new_crf1.pdf"    # file name of new crf.
  pagelist = [1, 7, 23, 35]  # pages to be read.
  new_data = read_crf2(pthd2, file1, pagelist)  # pthd2 = folder path.

  new_data   source page  words
             7            birth date female ...
             1            date onset ...
             …            …

new_data is converted to a word frequency list, then to a tf-idf list, as we did for the training data:

  X_new_counts = cvct.transform(new_data)
  X_new_tfidf = tftf.transform(X_new_counts)

predict returns the classification results using the classifier we created from the training data. The domain with the highest probability is chosen (page 7 → DM domain, page 1 → AE domain):

  pred = clf.predict(X_new_tfidf)

  source page  Prob. for AE  Prob. for DM  Prob. for **  Predicted domain
  7            0.146         0.299         0.071         dm
  1            0.312         0.133         0.075         ae
  …            …             …             …             …
To add annotations…

We have 2 things to do:

1. Get the coordinates of each CRF item so the annotation can be added in the appropriate location:
   • Left position: horizontal position of the CRF item + xx pixels
   • Top position: equal to the CRF item
   • Width of annotation: adjusted depending on the variable name's length

2. Choose the spreadsheet from our standards list (= aCRF metadata) and find the SDTM variable name which matches the CRF item. For example, for an AE page (CRF items: "AE ID:", "Start Date: MMDDYY", "Is the adverse event still ongoing? □Yes □No"):

   Standards List of AE:
     AE ID           AESPID
     Start Date      AESTDTC
     Is the adverse… AEENRTPTE
     …               …

   Standards List of DM:
     Birth Date      BRTHDTC
     Gender          SEX
     …               …
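The placement rules above can be sketched as a small helper. This is a hedged sketch, not the paper's code: the fixed left position (500), the per-character width (3.0), and the box height (10) are assumed values mirroring the fitz.Rect call shown later, and the function name is hypothetical:

```python
# Compute an annotation box from a CRF item's top coordinate, following the
# rules above: fixed left position, top aligned with the item, width scaled
# to the SDTM variable name's length.
def annotation_box(item_top, varname, left=500, char_width=3.0, height=10):
    # char_width roughly stands in for fitz.getTextlength per character.
    width = len(varname) * char_width
    return (left, item_top, left + width, item_top + height)

box = annotation_box(100.0, "AESTDTC")
print(box)  # -> (500, 100.0, 521.0, 110.0)
```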
2.1 Get all text blocks in each page of new CRF,
2.2 Find SDTM variable names from aCRF metadata

Get all text blocks and their coordinates in the new CRF using getTextBlocks. Then find the SDTM variable names in the aCRF metadata which match the text blocks from the new CRF. The spreadsheet (sheet) of the aCRF metadata is chosen automatically according to the domain name which the classifier returned.

  doc = fitz.open(os.path.join(pthd2, fl1))  # open new crf.

  for col_a, col_b in zip(df_clsres['page'], df_clsres['pred_domain']):
      page = doc[col_a]  # page number in pdf.
      txtlst1 = page.getTextBlocks()  # get text blocks in a page.
      try:
          df_meta = pd.read_excel(os.path.join(pthd2, "aCRF_metadata.xlsx"), sheet_name=col_b)
      except XLRDError:
          break

      for txt1 in txtlst1:
          for col1, col2 in zip(df_meta['CRF_Text'], df_meta['SDTM']):
              if col1 in txt1[4]:
                  tboxwdth = sum([fitz.getTextlength(c) for c in col2])  # get text length to adjust text box width.
                  tbox = fitz.Rect(500, txt1[1], 500 + tboxwdth, txt1[1] + 10)  # define text box.

                  anno = page.addFreetextAnnot(tbox, col2, fontsize=5)  # put annotation in text box.
                  anno.setBorder(border)
                  anno.update(fill_color=yellow)
                  # this is necessary to overwrite the default flag 28,
                  # which does not allow moving the annotation.
                  anno.setFlags(0)

      # add annotation of domain name on left top.
      dfd2 = dfd[dfd.Domain == col_b.upper()]
      lefttop = str((dfd2.Domain.values)[0]) + "=" + str((dfd2.Description.values)[0])
      tboxwdth = sum([fitz.getTextlength(c) for c in lefttop])
      tbox = fitz.Rect(50, 40, 50 + tboxwdth, 50)
      anno = page.addFreetextAnnot(tbox, lefttop, fontsize=5)
      anno.setBorder(border)
      anno.update(fill_color=yellow)
      anno.setFlags(0)
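The `if col1 in txt1[4]` lookup at the heart of this step is a plain substring match. A minimal standalone sketch, where the metadata rows and text blocks are invented sample values:

```python
# Toy aCRF metadata rows (invented): standard CRF text -> SDTM variable.
meta = [
    ("Start Date", "AESTDTC"),
    ("AE ID", "AESPID"),
]

# Text blocks as returned by getTextBlocks: index 4 holds the block text.
txtlst1 = [
    (55.3, 100.8, 110.9, 107.3, "Start Date: MMDDYY"),
    (55.3, 120.0, 110.9, 127.3, "Is the adverse event still ongoing?"),
]

matches = []
for txt1 in txtlst1:
    for crf_text, sdtm_var in meta:
        if crf_text in txt1[4]:  # same substring test as the program above.
            matches.append((txt1[4], sdtm_var))

print(matches)  # -> [('Start Date: MMDDYY', 'AESTDTC')]
```

A substring match keeps the lookup robust to trailing text on the CRF (like ": MMDDYY"), at the cost of possible false hits when one standard text is a prefix of another.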
2.3 Add annotations

Within the same loop (the code shown in 2.1/2.2):

• The positions and widths of the annotation text boxes are defined with fitz.Rect. txt1[1] is the vertical position of the CRF item returned by getTextBlocks.
• addFreetextAnnot adds an annotation text box to the PDF; col2 holds the SDTM variable from the aCRF metadata. The surrounding calls also set the font size, border color, and background color.
• Similarly to the SDTM variable annotations, a domain name annotation is added at the left top of the page, according to the domain name which the classifier returned.
2.4 Save aCRF

  doc.save("full file path")  # save new acrf.

The save method saves the current PDF.

[Figure: actual output of the new aCRF]
Summary

What our Python program can do:
• Create training data for machine learning from existing aCRFs.
• Classify each page of a new CRF to the appropriate SDTM domain.
• Add SDTM variable/domain annotations on the new CRF using our aCRF metadata.
Future prospects (1)

Automatic adjustment of annotations
• Our Python program adds annotations based on CRF text coordinates. For a busy CRF, the position or size of an annotation needs to be changed; otherwise annotations overlap.

Additional annotation algorithm for multiple domains
• E.g., the informed consent date is included in both DM (RFICDTC) and DS (DSSTDTC). The classifier should return multiple candidate domains.
Future prospects (2)

More data to improve the classifier
• The more varied the training data, the more accurately the classifier can classify.

Function to add bookmarks
• The Study Data Tabulation Model metadata submission guidelines require 2 types of bookmarks. PyMuPDF's setToC can add bookmarks to a PDF. However:
  - By domain → relatively easy.
  - By timepoint → difficult to detect the appropriate CRF page.

Export annotations' page information
• The relation between annotations on the CRF and SDTM variables can be used to fill in Origin in define.xml.
