This repository is a suite of tools to enable the civic code project.
doc-search is a tool for indexing PDF documents hosted by the City of Windsor. It builds on the web-scraping code developed during the scraping council meetings project.
- Filter documents by year
- Filter documents by a specific date or date range
- Search documents based on meeting types
- Filter documents by name or keywords
- Download matching PDFs concurrently (opt-in via CLI flag)
- Saved filenames follow the schema `YYYY_MM_DD-CODE-name.pdf`
The tool is prebuilt for the following platforms:
- Windows (amd64)
- Linux (amd64)
- macOS (arm64)
You can download the prebuilt binaries from the Releases section.
- Download the binary corresponding to your operating system.
- Copy the binary to your `PATH`
# 1. Clone the repository
git clone https://github.com/dntiontk/civic-code.git
cd civic-code/doc-search
# 2. Build the binary for your platform
GOOS=$(go env GOOS) GOARCH=$(go env GOARCH) go build -o doc-search main.go
# 3. Copy the binary to your `PATH`
The tool emits the document metadata (including checksums and normalized filenames) as JSON and can optionally download PDFs to disk. Available flags:
Usage of bin/doc-search:
  -after string
        filter documents after date
  -before string
        filter documents before date
  -docName string
        filter documents with string in name
  -concurrency int
        number of concurrent downloads (default 4)
  -downloadDir string
        directory to store downloaded PDFs (default "./downloads")
  -download
        download matching PDFs to disk
  -meetingType string
        filter documents by meeting type
  -year int
        filter documents by year (default -1)
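For readers unfamiliar with Go's standard `flag` package, the usage text above corresponds to flag declarations along the following lines. This is a sketch reconstructed from the usage output alone, not the tool's actual source. The `options` struct and `parseOptions` function are hypothetical names, and treating `-1` as "no year filter" is an assumption based on the default shown.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options mirrors the flags from the usage text (hypothetical struct).
type options struct {
	after       string
	before      string
	docName     string
	downloadDir string
	meetingType string
	concurrency int
	year        int
	download    bool
}

// parseOptions registers the flags on a fresh FlagSet and parses args.
// Names, defaults, and help strings are copied from the usage text.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("doc-search", flag.ContinueOnError)
	fs.StringVar(&o.after, "after", "", "filter documents after date")
	fs.StringVar(&o.before, "before", "", "filter documents before date")
	fs.StringVar(&o.docName, "docName", "", "filter documents with string in name")
	fs.IntVar(&o.concurrency, "concurrency", 4, "number of concurrent downloads")
	fs.StringVar(&o.downloadDir, "downloadDir", "./downloads", "directory to store downloaded PDFs")
	fs.BoolVar(&o.download, "download", false, "download matching PDFs to disk")
	fs.StringVar(&o.meetingType, "meetingType", "", "filter documents by meeting type")
	// Assumption: -1 is a sentinel meaning "do not filter by year".
	fs.IntVar(&o.year, "year", -1, "filter documents by year")
	err := fs.Parse(args)
	return o, err
}

func main() {
	opts, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", opts)
}
```

Because `-download` is a boolean flag, it takes no value: passing it alone switches downloading on.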
Pass -download to save files under downloadDir using normalized names such as 2024_03_15-CC-agenda.pdf, matching the fileName included in the JSON output.
Contributions are welcome. Please open an issue or submit a pull request for any enhancements or bug fixes.