PharmaSUG Tokyo 2019 PO02

Automated aCRF Generation Using Python

PharmaSUG SDE Tokyo 2019, 24-Oct-2019

Mitsuhiro Isozaki, Hiroshi Nishioka, Takumi Koyama, Masayo Koike, Taku Uryu, Manabu Abe
Pfizer R&D Japan, Statistical Programming & Analysis Group
When we create aCRF...

Creating an aCRF involves two manual steps:
1. Detect specific CRF items and classify the page (e.g., a page containing Age and Race → DM domain).
2. Add annotations on the CRF (e.g., an Age annotation and a Race annotation on the DM page).

It takes time to do this manually!
Motivation

We have lists of standard CRF texts and their corresponding SDTM variables (e.g., "AE page text1" → AE***, "AE page text2" → AE***, ...).

If we automate CRF classification using a machine learning technique, can we automate the whole step?
Overall Flow

1. Machine Learning to Classify CRF Pages:
   Prepare Data → Create Classifier → Evaluate Classifier → Tune Classifier
2. Add Annotations on CRF → Generate aCRF.pdf
Environment (1)

Python: 3.7.4 (Windows 10)

Packages/Libraries to be imported:

  Name          Version   Short Description
  scikit-learn  0.21.3    Machine learning
  joblib        0.13.2    Output & load classifier
  PyMuPDF       1.14.20   Edit PDF
  pandas        0.25.0    Output & load Excel/CSV files
  xlrd          1.2.0     Read Excel files

Materials:

  Name           Format  Short Description
  Domain List    Excel   SDTM domain list. Consists of domain abbreviations and descriptions.
  aCRF metadata  Excel   Lists of standard CRF texts and corresponding SDTM variables. 1 sheet per domain.
  Existing aCRF  PDF     Used as training data for machine learning.
  New CRF        PDF     CRF to be newly annotated.
Environment (2)

Structure of our directory:

  ┌─scikit-learn                      Directory for Step 1.
  │ ├─document                       Files to be loaded for Step 1.
  │ │   domain_list.xlsx
  │ │   existing_acrf1.pdf
  │ │   existing_acrf2.pdf
  │ └─output
  │     df_crf.csv                   Training data generated in Step 1.3.
  │     df_vct1.csv                  Word frequency data from Step 1.4.
  │     df_vct2.csv                  Tf-idf from Step 1.5.
  │     cvct.pkl                     Outputs from Steps 1.4-1.6. Used for Step 1.7.
  │     tftf.pkl
  │     clf.pkl
  │     df_clsres.csv                Result of classifying new CRF in Step 1.7.
  ├─acrf                             Directory for Step 2.
  │ ├─document                       Files to be loaded for Step 2.
  │ │   aCRF_metadata.xlsx
  │ │   new_crf1.pdf
  │ └─output
  │     new_crf1_ant.pdf             New aCRF from Step 2.4.
  └─program
      1_1to6_create_classifier.py    Python program for Steps 1.1 - 1.6.
      1_7_classify_crf.py            Python program for Step 1.7.
      2_annotate.py                  Python program for Step 2.
Step 1 – Classify CRF pages

Prepare data, create the classifier, and classify the new CRF.

From the existing aCRF (and the Domain List):
  1.1 Get all words in each page
  1.2 Get all text blocks in each page
  1.3 Get domain name annotation
  1.4 Count frequency of words in each page
  1.5 Calculate tf-idf
  1.6 Create classifier using machine learning

From the new CRF:
  1.7 Classify each page to the appropriate domain
Step 2 – Create aCRF

Add annotations on the new CRF and generate the new aCRF, using the new CRF, the results from Step 1 (classification of each page of the new CRF), and the aCRF metadata:

  2.1 Get all text blocks in each page of the new CRF
  2.2 Find SDTM variable names from the aCRF metadata
  2.3 Add annotations
  2.4 Save aCRF (→ new aCRF)
1.1 Get all words in each page

Training data consists of:
1. word frequencies on the CRF, and
2. classified results (the domain of each page).

For #1, get all words using getTextWords and combine them into one text string per page. getTextWords returns a list of tuples:
• 1st-4th items: coordinates of each word
• 5th item: the word in the CRF

  wrd[0]  wrd[1]  wrd[2]  wrd[3]  wrd[4]
  55.3    100.8   80.9    107.3   Start
  85.3    100.8   110.9   107.3   Date:
  55.3    109.4   78.1    116.0   Ongoing:

Function to read the existing aCRF:

  import os
  from operator import itemgetter
  import fitz  # PyMuPDF
  import pandas as pd

  def read_crf1(pth, fl, plist, afl):
      # pth/fl: folder path and file name; plist: pages to be read;
      # afl: "y" to append to the output csv.
      # dfd (domain list DataFrame) and pth_csv (output csv path) are defined elsewhere.
      pgseqs = []   # initialize output variables.
      dnmseqs = []
      wrdseqs = []

      for pg in plist:
          doc = fitz.open(os.path.join(pth, fl))  # open pdf.
          page = doc[pg]  # page number in pdf.
          wrdlst = page.getTextWords()   # get words in a page.
          blklst_ = page.getTextBlocks() # get text blocks in a page.
          blklst = sorted(blklst_, key=itemgetter(1, 0))  # sort by coordinate.

          wrdseq = ""  # initialize per page.
          for col1, col2 in zip(dfd['Domain'], dfd['Description']):
              for blk in blklst:
                  if (col1 + "=" + col2).lower().replace(" ", "") in blk[4].lower().replace(" ", ""):
                      for wrd in wrdlst:  # combine words in a page with spaces.
                          if wrdseq == "":
                              wrdseq = wrd[4]
                          else:
                              wrdseq = wrdseq + " " + wrd[4]
                      pgseqs.append(pg)
                      dnmseqs.append(col1.lower())
                      wrdseqs.append(wrdseq)
                      break

      dfcrf = pd.DataFrame({"page": pgseqs, "domain": dnmseqs, "words": wrdseqs})

      # output csv.
      if afl.lower() == "y":
          dfcrf.to_csv(pth_csv, mode='a', header=False, index=False)
      else:
          dfcrf.to_csv(pth_csv, index=False)

This creates a list of text strings (wrdseqs):

  wrdseqs = ["birth date female ...", "date onset ...", ...]

Word frequency is derived in a later step.
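The word-combining loop above can be sketched standalone. The tuples below are made-up sample values mimicking getTextWords output (coordinates plus the word itself); only index 4, the word, is used:

```python
# Hypothetical getTextWords-style tuples: (x0, y0, x1, y1, word).
wrdlst = [
    (55.3, 100.8, 80.9, 107.3, "Start"),
    (85.3, 100.8, 110.9, 107.3, "Date:"),
    (55.3, 109.4, 78.1, 116.0, "Ongoing:"),
]

wrdseq = ""
for wrd in wrdlst:  # combine words in a page with a space.
    if wrdseq == "":
        wrdseq = wrd[4]
    else:
        wrdseq = wrdseq + " " + wrd[4]

print(wrdseq)  # -> Start Date: Ongoing:
```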
1.2 Get all text blocks in each page,
1.3 Get domain name annotation

(These steps are also covered by the read_crf1 function shown in 1.1.)

For #2, get all text blocks using getTextBlocks. getTextBlocks returns the coordinates and contents of each text block, similar to getTextWords; the 5th item (= blk[4]) is the block's text.

The existing aCRF has a domain name annotation on each page. This can be used as the classified result for the training data. To find it, match the text blocks against the domain list — the `(col1 + "=" + col2) ... in blk[4]` check in read_crf1.
1.4 Count frequency - 1.6 Create classifier

The previous step generates training data in CSV format (= dfcsv):

  source page  domain (classified result)  words
  5            dm                          birth date female ...
  55           ae                          date onset ...
  …            …                           …

Steps: count frequency, calculate tf-idf, and create the classifier.

  from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
  from sklearn.naive_bayes import MultinomialNB

  cvct = CountVectorizer()
  X_train_counts = cvct.fit_transform(dfcsv.words)

  tftf = TfidfTransformer()
  X_train_tfidf = tftf.fit_transform(X_train_counts)

  clf = MultinomialNB().fit(X_train_tfidf, dfcsv.domain)

CountVectorizer returns the count frequency of each word from a list of text strings:

  X_train_counts   source page  birth  date  female  onset
                   5            2      2     2       0
                   55           0      7     0       4
                   …            …      …     …       …

TfidfTransformer calculates tf-idf from the count frequencies:

  X_train_tfidf    source page  birth   date    female  onset
                   5            0.0952  0.0417  0.0952  0
                   55           0       0.0875  0       0.1329
                   …            …       …       …       …

MultinomialNB().fit creates a classifier from the tf-idf values and the classified results in the training data, based on the multinomial Naïve Bayes model.
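The three steps can be run end to end on toy data. This is a minimal sketch: the training strings and domain labels below are invented stand-ins for dfcsv, not real CRF text:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Invented toy training data standing in for dfcsv.
words = [
    "birth date female race ethnicity",      # a DM-like page
    "date onset severity outcome action",    # an AE-like page
]
domains = ["dm", "ae"]

cvct = CountVectorizer()
X_train_counts = cvct.fit_transform(words)          # word frequency per page.

tftf = TfidfTransformer()
X_train_tfidf = tftf.fit_transform(X_train_counts)  # tf-idf per page.

clf = MultinomialNB().fit(X_train_tfidf, domains)

# Classify a new page: use transform (not fit_transform) with the fitted objects.
new_words = ["birth date race"]
X_new = tftf.transform(cvct.transform(new_words))
pred = clf.predict(X_new)
print(pred)  # -> ['dm']
```

Keeping cvct and tftf fitted only on the training data (and reusing them via transform) is exactly why the paper pickles them as cvct.pkl and tftf.pkl for Step 1.7.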
What is multinomial Naïve Bayes and tf-idf?

Algorithm of the multinomial Naïve Bayes model:
1. Suppose the words "date" and "female" appear 2 times each on 1 CRF page, and we want to classify the page as "AE domain" or "DM domain".
2. Calculate 2 conditional probabilities given the word frequencies:
   - Prob1 = probability that the page is "AE domain"
   - Prob2 = probability that the page is "DM domain"
3. If Prob1 < Prob2, then classify the page as "DM domain".

"Naïve" = the independence assumption for all word frequencies. This is rarely true in the real world; however, multiple studies show the classifier works well in practice.

tf-idf:
• Term frequency (tf) × inverse document frequency (idf):

    tf  = (frequency of word X in page A) / (frequency of all words in page A)
    idf = log( (number of pages) / (number of pages which have word X) )

• This statistic is more useful than raw frequency for classification:
  specific word → high weight; common word → low weight.

Example:
• Frequency: "date" = "female" = 2 in page 5, but tf-idf: "date" < "female" in page 5.
• "date" is a common word for both demographic and adverse event pages (e.g., birth date and onset date). However, "female" is a word specific to demographics.

  Counts:                              tf-idf:
  page  birth  date  female  onset     page  birth   date    female  onset
  5     2      2     2       0         5     0.0952  0.0417  0.0952  0
  55    0      7     0       4         55    0       0.0875  0       0.1329
  …     …      …     …       …         …     …       …       …       …
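The tf and idf formulas above can be checked numerically. This sketch uses the plain textbook formulas from the slide on the example counts (note that scikit-learn's TfidfTransformer applies a smoothed idf plus normalization, so its numbers differ):

```python
import math

# Word counts per page, taken from the slide's example table.
pages = {
    5:  {"birth": 2, "date": 2, "female": 2, "onset": 0},
    55: {"birth": 0, "date": 7, "female": 0, "onset": 4},
}

def tf(word, page):
    # frequency of the word in the page / frequency of all words in the page.
    counts = pages[page]
    return counts[word] / sum(counts.values())

def idf(word):
    # log( number of pages / number of pages which have the word ).
    n_pages = len(pages)
    n_with_word = sum(1 for c in pages.values() if c[word] > 0)
    return math.log(n_pages / n_with_word)

# "date" appears on every page -> idf = log(2/2) = 0: low weight.
# "female" appears only on page 5 -> idf = log(2/1) > 0: higher weight.
print(idf("date"), idf("female"))
```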
1.7 Classify each page to appropriate domain

Get all words in each page of the new CRF using getTextWords and combine them, just as we did when creating the training data.

Function to read the new CRF:

  def read_crf2(pth, fl, plist):
      wrdseqs = []  # initialize output variable.
      for pg in plist:
          doc = fitz.open(os.path.join(pth, fl))  # open new crf.
          page = doc[pg]  # page number in pdf.
          wrdlst = page.getTextWords()  # get words in a page.

          wrdseq = ""  # initialize per page.
          for wrd in wrdlst:  # combine words in a page with spaces.
              if wrdseq == "":
                  wrdseq = wrd[4]
              else:
                  wrdseq = wrdseq + " " + wrd[4]
          wrdseqs.append(wrdseq)
      return wrdseqs

Call the above function:

  file1 = r"new_crf1.pdf"    # file name of new crf.
  pagelist = [1, 7, 23, 35]  # pages to be read.
  new_data = read_crf2(pthd2, file1, pagelist)  # pthd2 = folder path.

  new_data   source page  words
             7            birth date female ...
             1            date onset ...
             …            …

new_data is converted to a word frequency list, then to a tf-idf list, as we did for the training data:

  X_new_counts = cvct.transform(new_data)
  X_new_tfidf = tftf.transform(X_new_counts)

predict returns the classification results using the classifier we created from the training data. The domain with the highest probability is chosen (page 7 → DM domain, page 1 → AE domain):

  pred = clf.predict(X_new_tfidf)

  source page  Prob. for AE  Prob. for DM  Prob. for **  Predicted domain
  7            0.146         0.299         0.071         dm
  1            0.312         0.133         0.075         ae
  …            …             …             …             …
To add annotations…

We have 2 things to do:

1. Get the coordinates of each CRF item so the annotation can be added in the appropriate location:
   • Left position: horizontal position of the CRF item + xx pixels
   • Top position: equal to the CRF item
   • Width of annotation: adjusted depending on the variable name's length

2. Choose the spreadsheet from our standards list (= aCRF metadata) and find the SDTM variable name which matches the CRF item. For example, for an AE page (CRF items: "AE ID:", "Start Date: MMDDYY", "Is the adverse event still ongoing? □Yes □No"):

   Standards List of AE:
     AE ID           AESPID
     Start Date      AESTDTC
     Is the adverse… AEENRTPTE
     …               …

   Standards List of DM:
     Birth Date      BRTHDTC
     Gender          SEX
     …               …
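The placement rules above can be sketched as a small helper. This is a hedged sketch, not the paper's code: the fixed left position (500), the per-character width (3.0), and the box height (10) are assumed values mirroring the fitz.Rect call shown later, and the function name is hypothetical:

```python
# Compute an annotation box from a CRF item's top coordinate, following the
# rules above: fixed left position, top aligned with the item, width scaled
# to the SDTM variable name's length.
def annotation_box(item_top, varname, left=500, char_width=3.0, height=10):
    # char_width roughly stands in for fitz.getTextlength per character.
    width = len(varname) * char_width
    return (left, item_top, left + width, item_top + height)

box = annotation_box(100.0, "AESTDTC")
print(box)  # -> (500, 100.0, 521.0, 110.0)
```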
2.1 Get all text blocks in each page of new CRF,
2.2 Find SDTM variable names from aCRF metadata

Get all text blocks and their coordinates in the new CRF using getTextBlocks. Then find the SDTM variable names in the aCRF metadata which match the text blocks from the new CRF. The spreadsheet (sheet) of the aCRF metadata is chosen automatically according to the domain name which the classifier returned.

  doc = fitz.open(os.path.join(pthd2, fl1))  # open new crf.

  for col_a, col_b in zip(df_clsres['page'], df_clsres['pred_domain']):
      page = doc[col_a]  # page number in pdf.
      txtlst1 = page.getTextBlocks()  # get text blocks in a page.
      try:
          df_meta = pd.read_excel(os.path.join(pthd2, "aCRF_metadata.xlsx"), sheet_name=col_b)
      except XLRDError:
          break

      for txt1 in txtlst1:
          for col1, col2 in zip(df_meta['CRF_Text'], df_meta['SDTM']):
              if col1 in txt1[4]:
                  tboxwdth = sum([fitz.getTextlength(c) for c in col2])  # get text length to adjust text box width.
                  tbox = fitz.Rect(500, txt1[1], 500 + tboxwdth, txt1[1] + 10)  # define text box.

                  anno = page.addFreetextAnnot(tbox, col2, fontsize=5)  # put annotation in text box.
                  anno.setBorder(border)
                  anno.update(fill_color=yellow)
                  # this is necessary to overwrite the default flag 28,
                  # which does not allow moving the annotation.
                  anno.setFlags(0)

      # add annotation of domain name on left top.
      dfd2 = dfd[dfd.Domain == col_b.upper()]
      lefttop = str((dfd2.Domain.values)[0]) + "=" + str((dfd2.Description.values)[0])
      tboxwdth = sum([fitz.getTextlength(c) for c in lefttop])
      tbox = fitz.Rect(50, 40, 50 + tboxwdth, 50)
      anno = page.addFreetextAnnot(tbox, lefttop, fontsize=5)
      anno.setBorder(border)
      anno.update(fill_color=yellow)
      anno.setFlags(0)
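The `if col1 in txt1[4]` lookup at the heart of this step is a plain substring match. A minimal standalone sketch, where the metadata rows and text blocks are invented sample values:

```python
# Toy aCRF metadata rows (invented): standard CRF text -> SDTM variable.
meta = [
    ("Start Date", "AESTDTC"),
    ("AE ID", "AESPID"),
]

# Text blocks as returned by getTextBlocks: index 4 holds the block text.
txtlst1 = [
    (55.3, 100.8, 110.9, 107.3, "Start Date: MMDDYY"),
    (55.3, 120.0, 110.9, 127.3, "Is the adverse event still ongoing?"),
]

matches = []
for txt1 in txtlst1:
    for crf_text, sdtm_var in meta:
        if crf_text in txt1[4]:  # same substring test as the program above.
            matches.append((txt1[4], sdtm_var))

print(matches)  # -> [('Start Date: MMDDYY', 'AESTDTC')]
```

A substring match keeps the lookup robust to trailing text on the CRF (like ": MMDDYY"), at the cost of possible false hits when one standard text is a prefix of another.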
2.3 Add annotations

Within the same loop (the code shown in 2.1/2.2):

• The positions and widths of the annotation text boxes are defined with fitz.Rect. txt1[1] is the vertical position of the CRF item returned by getTextBlocks.
• addFreetextAnnot adds an annotation text box to the PDF; col2 holds the SDTM variable from the aCRF metadata. The surrounding calls also set the font size, border color, and background color.
• Similarly to the SDTM variable annotations, a domain name annotation is added at the left top of the page, according to the domain name which the classifier returned.
2.4 Save aCRF

  doc.save("full file path")  # save new acrf.

The save method saves the current PDF.

[Figure: actual output of the new aCRF]
Summary

What our Python program can do:
• Create training data for machine learning from existing aCRFs.
• Classify each page of a new CRF to the appropriate SDTM domain.
• Add SDTM variable/domain annotations on the new CRF using our aCRF metadata.
Future prospects (1)

Automatic adjustment of annotations
• Our Python program adds annotations based on CRF text coordinates. For a busy CRF, the position or size of an annotation needs to be changed; otherwise annotations overlap.

Additional annotation algorithm for multiple domains
• E.g., the informed consent date is included in both DM (RFICDTC) and DS (DSSTDTC). The classifier should return multiple candidate domains.
Future prospects (2)

More data to improve the classifier
• The more varied the training data, the more accurately the classifier can classify.

Function to add bookmarks
• The Study Data Tabulation Model metadata submission guidelines require 2 types of bookmarks. PyMuPDF's setToC can add bookmarks to a PDF. However:
  - By domain → relatively easy.
  - By timepoint → difficult to detect the appropriate CRF page.

Export annotations' page information
• The relation between annotations on the CRF and SDTM variables can be used to fill in Origin in define.xml.
