Starting a workflow
In the Deed Machine, each county or other jurisdiction that provides sets of records to be analyzed is represented as a ZooniverseWorkflow, or workflow for short.
For each new workflow, start by adding an entry to the Python config dictionary object in local_settings.py. local_settings.py is ignored by git, so if you have not previously made a local_settings.py file, do so now and save it to the racial_covenants_processor/settings/ folder. This file is imported at the end of the main settings file, common.py (which should generally not be edited by end users), so settings placed in local_settings.py will override those in common.py.
ZOONIVERSE_QUESTION_LOOKUP = {
    'WI Milwaukee County': {
    },
    'MN Olmsted County': {
        ...
    }
}
The folder structure and filenames of the records provided by records custodians can provide necessary and bonus information about each record. For example, folders and filenames can include the document date (doc_date), document number (doc_num), and book and page (book_id and page_num). For each county, you will need to write a regular expression to parse the folder and filenames after they have been uploaded to S3 during the initial processing phase. While it is not strictly necessary to write this regular expression before file upload, it is good practice to check whether the folder structure and filenames as delivered can be generalized into a regular expression. Doing so avoids the need for either exceptionally complex regular expressions or costly re-uploads.
The best way to build your regular expression is to experiment at Pythex.org with sample paths from the s3_path field of the CSV files produced by the standalone uploader, which are stored in the data folder of wherever you have installed the mp-upload-deed-images-standalone application.
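To collect sample paths for experimenting, a short script along these lines can pull a few values out of the s3_path column. This is a minimal sketch: the helper name sample_s3_paths and the inline CSV contents are made up for illustration, and only the s3_path column name comes from the uploader output described above.

```python
import csv
import io


def sample_s3_paths(csv_file, limit=5):
    """Return up to `limit` distinct values from the s3_path column."""
    seen = []
    for row in csv.DictReader(csv_file):
        path = row.get("s3_path")
        if path and path not in seen:
            seen.append(path)
        if len(seen) >= limit:
            break
    return seen


# Hypothetical stand-in for one of the uploader's CSV files. With a real
# file, pass open("path/to/file.csv", newline="") instead.
fake_csv = io.StringIO(
    "s3_path\n"
    "raw/mn-olmsted-county/OlmstedCountyAbstracts/OldDeedBooks/H/12345.tif\n"
    "raw/mn-olmsted-county/OlmstedCountyAbstracts/OldDeedBooks/H/12346.tif\n"
)

for path in sample_s3_paths(fake_csv):
    print(path)
```

The printed paths can then be pasted into Pythex.org as test strings.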
For example, during the process of ingesting results from S3 into the Deed Machine’s database, the following regular expression captures data including the workflow slug, as well as the doc_type, batch_id, book_id, doc_num, and split_page_num fields that will be saved to the database.
ZOONIVERSE_QUESTION_LOOKUP = {
    'WI Milwaukee County': {
        ...
    },
    'MN Olmsted County': {
        'deed_image_regex': r'/(?P<workflow_slug>[A-z\-]+)/OlmstedCounty(?P<doc_type>[A-Za-z]+)/(?P<batch_id>[A-Za-z]+)/?(?P<book_id>[A-Za-z\-\d]+)?/(?P<doc_num>[A-Z\d\.]+)(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?',
    }
}
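Before committing a pattern to local_settings.py, it can help to try it in a Python shell as well as on Pythex. The sketch below applies the Olmsted County pattern to two sample paths. The paths are hypothetical and simplified, the helper parse_path is illustrative only, and the use of re.search is an assumption rather than a description of how the Deed Machine applies the pattern internally.

```python
import re

# Pattern copied from the 'MN Olmsted County' entry above.
DEED_IMAGE_REGEX = (
    r'/(?P<workflow_slug>[A-z\-]+)/OlmstedCounty(?P<doc_type>[A-Za-z]+)'
    r'/(?P<batch_id>[A-Za-z]+)/?(?P<book_id>[A-Za-z\-\d]+)?'
    r'/(?P<doc_num>[A-Z\d\.]+)(?:_SPLITPAGE_)?'
    r'(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?'
)

# Hypothetical paths, invented for this example.
with_split = "raw/mn-olmsted-county/OlmstedCountyAbstracts/OldDeedBooks/H/12345_SPLITPAGE_2.jpg"
no_split = "raw/mn-olmsted-county/OlmstedCountyAbstracts/OldDeedBooks/H/12345"


def parse_path(path):
    """Return the named groups captured from a path, or None if no match."""
    match = re.search(DEED_IMAGE_REGEX, path)
    return match.groupdict() if match else None


print(parse_path(with_split))  # split_page_num is '2'
print(parse_path(no_split))    # split_page_num is None
```

Optional groups that do not participate in the match come back as None, which makes it easy to spot fields the pattern failed to capture.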
To facilitate correct pagination, for each image you will need to capture, at minimum:
- Either a doc_num, or both a book_id and page_num.
- The split_page_num generated by the initial processing stage when multi-page TIF files are processed. Note that while _SPLITPAGE_ suffixes will not show up in the list of s3_paths in the CSVs generated by the mp-upload-deed-images-standalone application, they should still be accounted for in your regular expression. This means that regular expressions will almost always need to end with (?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?, as in the Olmsted County example.
- (Optional) If the required doc_num or book_id/page_num combination is not parseable from the folder and filenames, then a supplemental CSV should be included at the time of ingestion after initial processing. This CSV will allow the Deed Machine to link additional information to each image by using a lookup table based on metadata pulled from the image's folder and pathname.
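The requirements above can be checked mechanically before any files are ingested. The following is a minimal sketch, not part of the Deed Machine itself: the function name has_pagination_groups is hypothetical, and it only inspects which named groups a pattern defines. A pattern that relies on a supplemental CSV for doc_num would fail this check by design.

```python
import re


def has_pagination_groups(pattern):
    """Check that a regex defines the named groups needed for pagination.

    Requires either doc_num, or both book_id and page_num, plus
    split_page_num.
    """
    groups = set(re.compile(pattern).groupindex)
    has_doc_id = "doc_num" in groups or {"book_id", "page_num"} <= groups
    return has_doc_id and "split_page_num" in groups


# Simplified illustrative patterns, not real county configurations.
good = r'/(?P<doc_num>[A-Z\d]+)(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?'
missing_page = r'/(?P<book_id>\d+)(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?'

print(has_pagination_groups(good))          # True
print(has_pagination_groups(missing_page))  # False
```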
To add data from one or more supplemental CSV files, add a deed_supplemental_info list to the ZOONIVERSE_QUESTION_LOOKUP config object:
ZOONIVERSE_QUESTION_LOOKUP = {
    'WI Milwaukee County': {
        ...
    },
    'MN Olmsted County': {
        'deed_image_regex': r'/(?P<workflow_slug>[A-z\-]+)/OlmstedCounty(?P<doc_type>[A-Za-z]+)/(?P<batch_id>[A-Za-z]+)/?(?P<book_id>[A-Za-z\-\d]+)?/(?P<doc_num>[A-Z\d\.]+)(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?',
        'deed_supplemental_info': [
            {
                'data_csv': '/Users/mcorey/Documents/Deed projects/mn/ramsey/ramsey_recorder_supplemental_info/Abstract_20191106_header.csv',  # Absolute path to supplemental CSV
                'join_field_deed': 'doc_alt_id',  # Join field drawn from imported image path
                'join_field_supp': 'itemnum',  # Join field in supplemental CSV
                'mapping': {
                    'doc_num': 'mp_doc_num',  # Deed Machine varname: CSV column name
                    'doc_type': 'landtype'  # Deed Machine varname: CSV column name
                }
            }
        ],
    }
}
In the example above, the ingestion process will expect the regular expression to capture a doc_alt_id value for each ingested file, and that value must match a value in the itemnum column of the supplemental spreadsheet. Based on the mapping section of the deed_supplemental_info dictionary, values in the CSV's mp_doc_num column will be ingested into the Deed Machine's doc_num field, and values in the CSV's landtype column will be ingested into the Deed Machine's doc_type field. The first rows of the supplemental CSV look like this:
| itemnum | pagecnt | itemname | docnum | landtype | instrumenttype | mp_doc_num |
|---|---|---|---|---|---|---|
| 12117219 | 1 | ABSTRACT - 1483219 - - R-CONVERSION | 1483219 | ABSTRACT | R-CONVERSION | A1483219 |
| 12117223 | 1 | ABSTRACT - 1483678 - - R-CONVERSION | 1483678 | ABSTRACT | R-CONVERSION | A1483678 |
| 12117224 | 1 | ABSTRACT - 1483679 - - R-CONVERSION | 1483679 | ABSTRACT | R-CONVERSION | A1483679 |
| 12117228 | 1 | ABSTRACT - 1485353 - - R-CONVERSION | 1485353 | ABSTRACT | R-CONVERSION | A1485353 |
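The join and mapping can be illustrated with a small standalone sketch. This is not the Deed Machine's ingestion code: the function apply_supplemental and the plain-dict deed record are hypothetical, and the inline CSV reuses two rows and three columns from the sample data above.

```python
import csv
import io

# Same keys as one deed_supplemental_info entry, minus data_csv.
supp_config = {
    'join_field_deed': 'doc_alt_id',
    'join_field_supp': 'itemnum',
    'mapping': {'doc_num': 'mp_doc_num', 'doc_type': 'landtype'},
}

# Two rows from the sample data, limited to the columns used here.
supp_csv = io.StringIO(
    "itemnum,landtype,mp_doc_num\n"
    "12117219,ABSTRACT,A1483219\n"
    "12117223,ABSTRACT,A1483678\n"
)


def apply_supplemental(deed, csv_file, config):
    """Copy mapped CSV columns onto a deed record when the join key matches."""
    lookup = {
        row[config['join_field_supp']]: row
        for row in csv.DictReader(csv_file)
    }
    supp_row = lookup.get(deed.get(config['join_field_deed']))
    if supp_row is None:
        return deed  # no matching row, so leave the record unchanged
    updated = dict(deed)
    for deed_field, csv_column in config['mapping'].items():
        updated[deed_field] = supp_row[csv_column]
    return updated


# doc_alt_id stands in for a value the regex captured from the image path.
deed = {'doc_alt_id': '12117219'}
print(apply_supplemental(deed, supp_csv, supp_config))
```

Here the record with doc_alt_id 12117219 picks up doc_num A1483219 and doc_type ABSTRACT from the matching CSV row.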
Create a Django ZooniverseWorkflow object
python manage.py create_workflow --workflow "MN Olmsted County"