Components
Django core component
The Django component of the Deed Machine can generally be thought of as the conductor or hub: it turns raw processing results into structured data, handles import from and export to the Zooniverse crowdsourcing platform, allows for manual GUI editing, and manages final data exports.
Standalone deed uploader
Deed images are often stored on a local machine or network drive, and moving them is not feasible or efficient. The standalone uploader lets the user upload images from that computer without doing a full install on it.
S3 bucket folders
Results of initial processing are stored in an S3 bucket with a specific folder structure:
| Folder name | Description |
|---|---|
| /raw | Original images, which may be in multiple image formats |
| /ocr/txt | A text blob of each line of words extracted from the Textract OCR JSON |
| /ocr/json | Original OCR JSON object created by Textract |
| /ocr/stats | JSON file with basic info about the document, including handwriting percentage. The web UUID is in the filename and is needed to identify the corresponding web image |
| /ocr/hits | NDJSON object containing info about which terms were found and on which line numbers. Only exists if a term is found. Basic literal search (no fuzziness) |
| /ocr/hits_fuzzy | NDJSON object containing info about which terms were found and on which line numbers. Only exists if a term is found. Fuzzy search, with the allowed fuzziness varying by term |
| /web | Web-optimized JPEG with watermark. Uses UUID instead of s3_lookup to prevent scraping, since these files are publicly readable |
| /web_highlighted | Web-optimized JPEG with watermark. Uses UUID instead of s3_lookup to prevent scraping, since these files are publicly readable |
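As an illustration of working with this layout, the following is a minimal Python sketch, not code from the Deed Machine itself. It assumes that each object key begins with one of the folder names listed above, which may not match the real key layout if, for example, a bucket-level prefix comes first. `classify_key` and `parse_ndjson` are hypothetical helper names, and the `term` and `line_num` fields in the usage example are invented for illustration, since the actual hits schema is not described here.

```python
import json
from typing import Optional

# Folder names from the table above, without the leading slash.
FOLDERS = (
    "raw",
    "ocr/txt",
    "ocr/json",
    "ocr/stats",
    "ocr/hits",
    "ocr/hits_fuzzy",
    "web",
    "web_highlighted",
)


def classify_key(key: str) -> Optional[str]:
    """Return the folder an object key belongs to, or None if none matches.

    Assumption: the folder name forms the leading path segments of the key.
    """
    parts = key.lstrip("/").split("/")
    for folder in FOLDERS:
        folder_parts = folder.split("/")
        # Compare whole path segments so that "ocr/hits_fuzzy/..." is not
        # mistaken for "ocr/hits/...", nor "web_highlighted/..." for "web/...".
        if len(parts) > len(folder_parts) and parts[: len(folder_parts)] == folder_parts:
            return folder
    return None


def parse_ndjson(text: str) -> list:
    """Parse NDJSON text: one JSON value per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(classify_key("ocr/hits_fuzzy/example.json"))
    # Hypothetical hit records; the real field names may differ.
    sample = '{"term": "example", "line_num": 12}\n{"term": "other", "line_num": 40}\n'
    for hit in parse_ndjson(sample):
        print(hit["term"], hit["line_num"])
```

Matching on whole path segments rather than with a plain string prefix check matters here because two pairs of folder names share a prefix (/ocr/hits with /ocr/hits_fuzzy, and /web with /web_highlighted).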
DeedPageProcessor step function components
The individual Lambda functions that make up the OCR, term search, and web image optimization processes are kept in separate repositories.
TermSearchRefresh step function components
This step function is triggered by the Django management command trigger_term_search_refresh. The Lambda function for term search is stored in a separate repository and is identical to the one used by the DeedPageProcessor step function.
DeedPageProcessorFAKEOCR step function components
This step function is triggered by the Django management command trigger_lambda_refresh. The individual Lambda functions that make up the OCR simulation, term search, and web image optimization processes are kept in separate repositories.