Parsley is a simple tool for parsing and scanning PDFs for given phrases (e.g. a date, a name, etc.). Under the hood, Parsley uses Docsplit to do the hard work of PDF text extraction.
- Install required gems:
bundle install
- Install Docsplit's dependencies (GraphicsMagick, Poppler, etc.).
Using Parsley is a two-step process. Before scanning documents for relevant information, you need to extract the raw text.
- Put your PDFs in data/pdfs. (Nested file structure not yet supported.)
- From the root directory of this project, run ./scripts/extract_text.
As the text is being extracted, extract_text will output progress:
- . for each successful extraction
- s for any files that were skipped because they've already been extracted
- F for any failed extraction
When completed, extract_text will print out a list of failed files so that you can check them manually, if desired.
Note: PDFs don't need to be OCR'd ahead of time, but extraction runs faster if they have been, since the OCR step can be skipped.
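The extraction loop described above could look roughly like the sketch below. It is not Parsley's actual code: the name extract_all, its parameters, and the assumption that each PDF produces a .txt file with the same basename are all hypothetical. The real extraction work is passed in as a block so the loop itself only depends on Ruby's standard library.

```ruby
require "fileutils"

# Hypothetical helper, not Parsley's actual code. Walks the top level of
# pdf_dir, calls the given block for each PDF that has no matching .txt
# file yet, and prints one progress character per file.
def extract_all(pdf_dir, text_dir, out: $stdout, &extractor)
  FileUtils.mkdir_p(text_dir)
  failed = []

  # Non-recursive glob: nested directories are not visited.
  Dir.glob(File.join(pdf_dir, "*.pdf")).sort.each do |pdf|
    # Assumption: the extracted text lands in <basename>.txt.
    target = File.join(text_dir, File.basename(pdf, ".pdf") + ".txt")

    if File.exist?(target)
      out.print "s" # already extracted, skip
      next
    end

    begin
      extractor.call(pdf, text_dir)
      out.print "." # success
    rescue StandardError
      failed << pdf
      out.print "F" # failure, listed at the end
    end
  end

  out.puts
  unless failed.empty?
    out.puts "Failed files:"
    failed.each { |path| out.puts "  #{path}" }
  end
  failed
end
```

With Docsplit installed, the block might be something like { |pdf, dir| Docsplit.extract_text(pdf, :output => dir) }; the output directory you pass is up to you.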
Once you've extracted the raw text from your PDFs, you can search the documents by running the search script from the root of the project:
./scripts/search "Urgent Matters"
./scripts/search "Jane Smith"
Note: The search term must be surrounded by quotes.
This will print a table of results sorted by number of matches:
+---------------------------------+
|    3 Pertinent File(s) Found    |
+-----------------------+---------+
| Filename              | Matches |
+-----------------------+---------+
| jan-meeting-notes.txt | 9       |
| apr-meeting-notes.txt | 4       |
| dec-meeting-notes.txt | 2       |
+-----------------------+---------+
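For readers curious how results like these can be produced, here is a minimal sketch of counting matches per file and sorting by count. It is an illustration only, not the search script's real implementation: search_texts is a hypothetical name, and case-insensitive matching and the omission of files with zero matches are assumptions.

```ruby
# Hypothetical helper, not Parsley's actual code. Counts occurrences of
# `phrase` in every .txt file directly under text_dir and returns
# [filename, count] pairs, highest count first.
def search_texts(text_dir, phrase)
  # Escape the phrase so characters like "(" are matched literally.
  # Assumption: matching is case-insensitive.
  pattern = Regexp.new(Regexp.escape(phrase), Regexp::IGNORECASE)

  results = Dir.glob(File.join(text_dir, "*.txt")).map do |path|
    # scrub replaces invalid byte sequences so the regex scan cannot raise.
    text = File.read(path).scrub
    [File.basename(path), text.scan(pattern).length]
  end

  # Assumption: files without any match are left out of the results.
  results.reject { |_, count| count.zero? }
         .sort_by { |name, count| [-count, name] }
end

# Example usage (the directory name is a placeholder):
# search_texts("path/to/extracted/text", "Jane Smith").each do |name, count|
#   puts "#{name}\t#{count}"
# end
```

Sorting on [-count, name] puts the highest counts first and breaks ties alphabetically, which keeps the output stable between runs.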