Starting a workflow

In the Deed Machine, each county or other jurisdiction that provides sets of records to be analyzed is represented as a ZooniverWorkflow, or workflow for short.

  1. For each new workflow, start by adding an entry in the Python config dictionary object in local_settings.py. local_settings.py is ignored by git, so if you have not previously made a local_settings.py file, do so now, saved to the racial_covenants_processor/settings/ folder. This file is imported at the end of the main settings file, common.py (which should generally not be edited by end users), and settings placed in local_settings.py will override those settings.

ZOONIVERSE_QUESTION_LOOKUP = {
    'WI Milwauke County': {
    },
    'MN Olmsted County': {
    ...
    }
}
  1. The folder structure and filenames of the records provided by records custodians can provide necessary and bonus information about each record. For example, folders and filenames can include the document date (doc_date), document number (doc_num) book and page (book_id and page_num). For each county, you will need to write a regular expression to parse the folder and filenames after they have been uploaded to S3 during the initial processing phase. While it is not strictly necessary to write this regular expression before file upload, it is a good practice to think through whether the folder structure and filenames as delivered will be able to be successfully generalized into a regular expression in order to avoid the need for either exceptionally complex regular expressions or costly re-uploads.

The best way to build your regular expression is to experiment at Pythex.org with sample paths from the s3_path field of the CSV files produced by the standalone uploader, which are stored in the data folder of wherever you have installed the mp-upload-deed-images-standalone application.

For example, during the process of ingesting results from S3 into the Deed Machine’s database, the following regular expression captures data including the workflow slug, as well as the doc_type, batch_id, book_id, doc_num, and split_page_num fields that will be saved to the database.

ZOONIVERSE_QUESTION_LOOKUP = {
    'WI Milwauke County': {
        ...
    },
    'MN Olmsted County': {
        'deed_image_regex': r'/(?P<workflow_slug>[A-z\-]+)/OlmstedCounty(?P<doc_type>[A-Za-z]+)/(?P<batch_id>[A-Za-z]+)/?(?P<book_id>[A-Za-z\-\d]+)?/(?P<doc_num>[A-Z\d\.]+)(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?',
    }
}

In order to facilitate correct pagination, for each image you will need to capture, at minimum:

  • Either a doc_num, or both a book_id and page_num

  • The split_page_num generated by the initial processing stage when mult-page TIF files are processed. Note that while SPLITPAGE will not show up in the list of s3_paths in the CSVs generated by the mp-upload-deed-images-standalone application, they still should be accounted for in your regular expression. This means that regular expressions will almost always need to end with (?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?, as shown below.

  1. (Optional) If the required doc_num or book_id/page_num combination are not parseable from the folder/filenames, then a suppliemental CSV should be included at the time of ingestion after initial processing. This CSV will allow the Deed Machine to link additional information to each image by using a lookup table based on metadata pulled from the images folder and pathname.

To add data from one or more supplemental CSV files, add a deed_supplemental_info list to the ZOONIVERSE_QUESTION_LOOKUP config object:

ZOONIVERSE_QUESTION_LOOKUP = {
    'WI Milwauke County': {
        ...
    },
    'MN Olmsted County': {
        'deed_image_regex': r'/(?P<workflow_slug>[A-z\-]+)/OlmstedCounty(?P<doc_type>[A-Za-z]+)/(?P<batch_id>[A-Za-z]+)/?(?P<book_id>[A-Za-z\-\d]+)?/(?P<doc_num>[A-Z\d\.]+)(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?',
        'deed_supplemental_info': [
            {
                'data_csv': '/Users/mcorey/Documents/Deed projects/mn/ramsey/ramsey_recorder_supplemental_info/Abstract_20191106_header.csv',  # Absolute path to supplemental CSV
                'join_field_deed': 'doc_alt_id',  # Join field drawn from imported image path
                'join_field_supp': 'itemnum',  # Join field in supplemental CSV
                'mapping': {
                    'doc_num': 'mp_doc_num', # deed machine varname: CSV column name
                    'doc_type': 'landtype' # deed machine varname: CSV column name
                }
            }
        ],
    }
}

In the example above, the ingestion process will expect each ingested file to include a doc_alt_id field in the regular expression that matches the value itemnum in the supplemental spreadsheet. Based on the values in the mapping section of the deed_supplemental_info dictionary, the values in the CSV’s mp_doc_num column will be ingested into the Deed Machine’s doc_num field, and likewise values in the landtype CSV field will be ingested into the Deed Machine’s doc_type field.

Sample supplemental CSV with matching data

itemnum

pagecnt

itemname

docnum

landtype

instrumenttype

mp_doc_num

12117219

1

ABSTRACT - 1483219 - - R-CONVERSION

1483219

ABSTRACT

R-CONVERSION

A1483219

12117223

1

ABSTRACT - 1483678 - - R-CONVERSION

1483678

ABSTRACT

R-CONVERSION

A1483678

12117224

1

ABSTRACT - 1483679 - - R-CONVERSION

1483679

ABSTRACT

R-CONVERSION

A1483679

12117228

1

ABSTRACT - 1485353 - - R-CONVERSION

1485353

ABSTRACT

R-CONVERSION

A1485353

  1. Create a Django ZooniverseWorkflow object

python manage.py create_workflow --workflow "WI Olmsted County"