Ethnographic Data Science: New Approaches to Comparative Research
Michael Fischer, Sridhar Ravula, Francine Barone, Human Relations Area Files, Yale University
We discuss issues arising from applying natural language processing and data science methods to search and analyse the collection of ethnography at the Human Relations Area Files. In particular, we examine how comparative research might be better enabled and pitfalls avoided.
Introduction
iKLEWS (Infrastructure for Knowledge Linkages from Ethnography of World Societies) is a Human Relations Area Files (HRAF) project underwritten by the National Science Foundation Human Networks and Data Science programme. iKLEWS is creating semantic infrastructure and ethnographic research services for a growing textual database (eHRAF World Cultures), which presently comprises roughly 800,000 pages from 7,000 ethnographic documents covering 361 world societies, each documented at several time points in the ethnographic present.
The basic goal is to greatly expand the value of eHRAF World Cultures to users who seek to understand the range of possibilities for human understanding, knowledge, belief and behaviour, whether addressing anthropological theory, exploring the relationship between human evolution and human behaviour, or informing real-world problems we face today, such as climate change, violence, disasters, epidemics, hunger and war. Understanding how and why cultures vary in the outcomes that arise under different circumstances is critical to improving policy, applied science, and basic scientific understanding of the human condition. Seeing how others have addressed issues can help us find solutions we might not otherwise find, which is extremely valuable in understanding an increasingly globalised world.
eHRAF World Cultures: Ethnography
Since its inception in 1949 (or 1929 for the ancestral files), the HRAF collection of ethnography has included manually applied topical metadata for each entry in each document. These entries roughly correspond to paragraphs, but may include images, figures, lists, tables, bibliographic entries, footnotes and endnotes. We refer to these entries as Search and Retrieval Elements, or SREs. Each SRE in each ethnographic work is classified by a professional anthropologist, who assigns to it one or more of 790 classificatory terms drawn from HRAF's extensions of the Outline of Cultural Materials (OCM) (Murdock 1937-1982; see Ford 1971). A given instance of a classificatory term is an OCM code. The OCM thesaurus presently takes the form of a classificatory tree, nominally three levels deep, of major and minor topics, with some asymmetry.
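By way of illustration only, the following Python sketch shows one way the structures just described might be represented in code; the class and field names, and the example codes, are hypothetical and do not reflect the actual eHRAF schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch only: names and codes are illustrative,
# not the actual eHRAF World Cultures schema.

@dataclass
class OCMTopic:
    code: int                      # a classificatory term from the OCM thesaurus
    label: str
    parent: Optional[int] = None   # parent topic in the (roughly) three-level tree

@dataclass
class SRE:
    """A Search and Retrieval Element: typically a paragraph, but possibly a
    figure, table, footnote, etc., from one ethnographic document."""
    document_id: str
    culture: str
    text: str
    ocm_codes: List[int] = field(default_factory=list)  # one or more assigned codes

# An SRE as a human indexer might classify it (codes here are placeholders)
example = SRE(
    document_id="doc-0001",
    culture="Example society",
    text="A description of a naming ceremony held shortly after birth...",
    ocm_codes=[111, 222],
)
```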
Improving eHRAF World Cultures
Although the current web version of eHRAF World Cultures is very fast at retrieving relevant ethnography, it fundamentally uses the same method as HRAF's original paper files of 1949, just very much faster and more conveniently. There are no aids for analysing the material once found; the user has to read the results of their search and apply their own methods. This project aims to fill that gap by applying modern methods of working with text through an extensible framework that deploys tools for analysis as well as greatly improved search capability. These tools will initially be available through a services framework, with interfaces for researchers ranging from beginner to advanced, accessible through web apps or Jupyter notebooks, either supplied by us or constructed by the researcher following the API guidelines for our services framework. Over time these capabilities will be added to the eHRAF web application, usually in a more specialised or restricted form.
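As an indication of how such a notebook-based call to the services framework might look, here is a minimal Python sketch; the base URL, endpoint, parameters and response shape are assumptions for illustration, not the published API.

```python
import requests

# Hypothetical illustration only: the endpoint, parameters and response
# shape are assumptions, not the actual iKLEWS/eHRAF services API.
BASE_URL = "https://services.example.org/ehraf/v1"   # placeholder URL

def search_sres(query, ocm=None, culture=None):
    """Sketch of a notebook-friendly search against the services framework."""
    params = {"q": query}
    if ocm is not None:
        params["ocm"] = ocm          # restrict results to one OCM subject code
    if culture is not None:
        params["culture"] = culture  # restrict results to one society
    response = requests.get(f"{BASE_URL}/search", params=params, timeout=30)
    response.raise_for_status()
    return response.json()           # assumed to be a list of SRE records

# Example call (hypothetical OCM code):
# results = search_sres("naming ceremony", ocm=553)
```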
Data Science Methods
New semantic and data-mining infrastructure developed by this project will assist in identifying universal and cross-cultural aspects of a wide range of user-selected topics, such as social emotion and empathy, economics, politics, use of space and time, morality, or music and songs, to use examples that have been investigated with prototype tools preceding this project. Some of the methods used can be applied in areas as far afield as AI and robotics, for example as the basis for a bridge between rather opaque (sometimes denoted 'dark') deep learning outcomes and more transparent, logic-driven narratives, making AI solutions more human and more useful through a greater capacity to generalise results. We are applying pattern extraction and linguistic analysis through deep learning, NLP and other tools to define a flexible logical framework for the contents of the documents. The goal is to store the outcomes of these procedures as new metadata and infrastructure so that researchers can operate in real time, and so that we can scale up using less processor-intensive algorithms than most ML and NLP methods require.
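As a rough illustration of turning NLP output into reusable metadata, the sketch below runs a standard spaCy pipeline over an SRE's text once and keeps lightweight derived fields that later, less processor-intensive queries could read directly; the pipeline and the chosen fields are our assumptions, not the project's actual processing chain.

```python
import spacy

# Minimal sketch of the precomputation idea: run the heavier NLP once per SRE
# and store the results as lightweight metadata, so interactive queries do not
# need to re-run the models. Pipeline and fields are illustrative assumptions.
nlp = spacy.load("en_core_web_sm")

def extract_metadata(sre_text):
    doc = nlp(sre_text)
    return {
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
        "noun_phrases": [chunk.text for chunk in doc.noun_chunks],
        "lemmas": [t.lemma_ for t in doc if t.is_alpha and not t.is_stop],
    }

# Metadata like this, computed offline and indexed, lets interactive searches
# read precomputed fields rather than invoking the full NLP stack in real time.
```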
All of the methods we are using employ machine learning (ML) in some form. ML emphasises that, while we are employing algorithms to identify regularities (patterns) in the text, it is the algorithm that is responsible for identifying these regularities rather than the researcher, who mainly assesses the outcomes of machine learning. For most of the history of computer-assisted text analysis, algorithms were used to restructure the text (counting words, indexing words, finding phrases), but it was largely the researcher who identified patterns based on the restructured text, albeit often with many algorithm-driven tools for doing so.
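To make the contrast concrete, the toy sketch below (in Python, using scikit-learn on a hypothetical handful of SRE texts, and not part of the project's actual pipeline) first restructures text into word counts for the researcher to interpret, then lets a topic model propose its own regularities for the researcher to assess.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder texts standing in for SREs retrieved from a search.
texts = ["...SRE text 1...", "...SRE text 2...", "...SRE text 3..."]

# Researcher-driven restructuring: the algorithm only counts words;
# interpreting any pattern is left to the researcher.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(texts)

# Machine-identified regularities: the topic model proposes latent topics
# that the researcher then assesses rather than defines in advance.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

vocab = vectorizer.get_feature_names_out()
for k, component in enumerate(lda.components_):
    top_terms = [vocab[i] for i in component.argsort()[-5:][::-1]]
    print(f"topic {k}: {top_terms}")
```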
Machine Learning