Documenting Data Science and Documentation in Data Science: an Ethnographic Exploration

Date: January 24, 2019

Abstract: The collection, curation, and analysis of data have always been as social as they are technical. Even in the most automated, data-driven systems, there are always humans who work behind the scenes, from the software developers and hardware operators who maintain invisible infrastructures to those who collect, label, annotate, clean, validate, merge, and manage data. These activities tend to get far less attention than the headline-grabbing technologies of machine learning and artificial intelligence, but it is crucial to keep them in view. In this talk, I discuss the central yet often overlooked role of documentation in data science, drawing on several recent and ongoing studies and projects about documentation in software packages, datasets, analysis code, research protocols, and research teams. Documentation is often seen as an unglamorous, low-status chore to be left for later, but it is a crucial form of communication, collaboration, and collective sensemaking. Documentation is difficult precisely because writing it well requires complex skills, and because it plays many different, sometimes even contradictory roles for various audiences and stakeholders. In examining the work of documentation as communication, we gain a broader view into many pressing issues in data science, including those around open science, reproducibility, and data ethics.

R. Stuart Geiger