Identifying Patterns of Racial Discrimination through Natural Language Processing
Can scholars use Natural Language Processing (NLP) on racial covenants to track the path of racial segregation through American cities and across the country? Can a digitally enhanced public research project help a metropolitan community reckon with the racial segregation that is fundamental to its history?
This case study focuses on making analog historical documents accessible: how to digitize them, access their content, and then address questions of historical significance.
The Chicago Covenants Project (CCP) works to locate, map, and trace the history of every racially restrictive covenant in Cook County, Illinois, home of Chicago. A racial covenant is a clause in a property deed that restricts or prevents anyone who is not white from occupying or owning the property.
These documents of exclusion, created from about 1910 to 1950, are buried in the analog land record archives of the county clerk’s office. The CCP has located approximately 1,000 of these documents, and the collection already exceeds our ability to make meaningful distinctions using close reading alone.
The various patterns and forms of these racial covenants imply a set of document templates and creators responsible for the widespread use of covenants in the first half of the twentieth century. Covenants in Cook County, Illinois, number in the thousands, and likely in the millions across the country.
Hence, a central challenge in this project is to perform accurate optical character recognition (OCR) on thousands of historical documents. This challenge is threefold.
First, the scanned documents may need visual clean-up, as some are more than a century old.
Second, the typesetting of the documents may pose a challenge to OCR methods.
Third, we want to identify the different kinds of document formats, distinguished by content layout and organization. This case study shares lessons learned in addressing these challenges using the Visual NLP library by John Snow Labs.
Chreston Miller
PhD, Assistant Professor at Virginia Tech
Dr. Miller is currently an assistant professor and the Data and Informatics Consultant within Data Services of the University Libraries at Virginia Tech. He provides consulting services for researchers at Virginia Tech centered on data-related challenges. His focus has been Machine Learning/Deep Learning, but he supports a wide array of data-related areas such as data wrangling and data sensemaking. Dr. Miller’s current research interests are Applied Machine Learning, specifically Natural Language Processing, Human Behavior Analysis, and Human-Centered Computing.