Parsley

Parsley is a simple tool to parse and scan PDFs for given phrases (e.g. a date, a name, etc.). Under the hood, Parsley uses Docsplit to do the hard work of PDF text extraction.

Installation

Install required gems: bundle install
Install Docsplit dependencies (graphicsmagick, poppler, etc).

Usage

Using Parsley is a two-step process. Before scanning documents for relevant information, you need to extract the raw text.

Step 1: Extract text from PDF files

Put your PDFs in data/pdfs. (Nested directories are not yet supported.)
From the root directory of this project, run ./scripts/extract_text.

As the text is being extracted, extract_text will output progress:

. for each successful extraction
s for any files that were skipped because they've already been extracted
F for any failed extraction

When completed, extract_text will print out a list of failed files so that you can check them manually, if desired.
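
The progress reporting described above can be sketched roughly as follows. This is not the actual extract_text script: the helper name extract_all, its parameters, and the rule that an existing .txt file means "already extracted" are assumptions made for illustration. The extraction itself is passed in as a block, so the sketch runs without Docsplit installed.

```ruby
require "fileutils"

# Hypothetical helper (not part of Parsley): runs the given extractor block on
# every PDF directly inside pdf_dir and returns the PDFs that failed.
# Assumption: a PDF counts as already extracted when text_dir holds a .txt
# file with the same base name.
def extract_all(pdf_dir, text_dir, out: $stdout)
  FileUtils.mkdir_p(text_dir)
  failed = []

  # Only top-level PDFs are considered; nested directories are ignored.
  Dir.glob(File.join(pdf_dir, "*.pdf")).sort.each do |pdf|
    txt = File.join(text_dir, File.basename(pdf, ".pdf") + ".txt")

    if File.exist?(txt)
      out.print "s" # skipped: text already extracted
      next
    end

    begin
      yield pdf, text_dir # the block performs the actual extraction
      out.print "."       # success
    rescue StandardError
      failed << pdf
      out.print "F"       # failure, collected for the final report
    end
  end

  out.puts
  failed
end
```

With the Docsplit gem installed, the block could plausibly be `{ |pdf, dir| Docsplit.extract_text(pdf, output: dir) }`, and the returned array printed at the end as the list of files to check manually.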

Note: PDFs don't need to be OCR'd ahead of time. If they already have been, extraction runs faster because the OCR step can be skipped.

Step 2: Search extracted text for relevant information

Once you've extracted the raw text from your PDFs, you can search the documents by running the search script from the root of the project:

./scripts/search "Urgent Matters"
./scripts/search "Jane Smith"

Note: The search term must be surrounded by quotes.

The search prints a table of results sorted by number of matches:

+---------------------------------+
| 3 Pertinent File(s) Found       |
+-----------------------+---------+
| Filename              | Matches |
+-----------------------+---------+
| jan-meeting-notes.txt | 9       |
| apr-meeting-notes.txt | 4       |
| dec-meeting-notes.txt | 2       |
+-----------------------+---------+
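
The counting and sorting behind a table like this can be sketched as below. The helper name count_matches, the case-insensitive matching, and the omission of files with zero matches are assumptions for illustration; the real search script may differ on each point.

```ruby
# Hypothetical helper (not part of Parsley): counts occurrences of phrase in
# every .txt file directly inside text_dir.
# Returns [filename, count] pairs, highest count first, ties ordered by name.
def count_matches(text_dir, phrase)
  # Escape the phrase so it is matched literally. Ignoring case is an assumption.
  pattern = Regexp.new(Regexp.escape(phrase), Regexp::IGNORECASE)

  counts = Dir.glob(File.join(text_dir, "*.txt")).map do |path|
    [File.basename(path), File.read(path).scan(pattern).length]
  end

  # Drop files without any match, then sort by descending match count.
  counts.reject { |_, n| n.zero? }.sort_by { |name, n| [-n, name] }
end
```

Calling `count_matches("data/text", "Jane Smith")` would return the rows of the table in display order; data/text is a placeholder for wherever the extracted text is stored.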
