Project details

Where it all began

During previous interactions with my colleagues and other data workers, I noticed that many of them have inefficient workflows: they frequently copy-paste source code and re-execute the same code multiple times. I wanted to learn more about this and see whether I could improve the situation.

Here is a presentation I gave for CHI ’20:

Observations to understand our users’ workflows

Data collection

We observed ten data workers performing hypothesis-driven data science tasks. We sampled these data workers from different backgrounds, such as Numerical Analysis, Applied Psychology, and HCI, in order to maximize the external validity of our research. We analyzed their coding, commenting, and analysis practices in RStudio, a common statistical programming IDE. Data workers were encouraged to work on their own data science tasks, but most (six out of ten) preferred to use fabricated datasets instead. In total, we collected approximately eight hours of video footage.

Analysis method

We watched the videos to extract clips that met one or more of the following criteria:

  1. The participant interacts with the RStudio IDE.
  2. The participant interacts with another app to perform a task related to their analysis, e.g., searching for a statistical procedure on the web.
  3. The participant thinks aloud about their analysis.

After initial analysis, I came up with three tiers of process codes:

  1. Domain-agnostic programming tasks, e.g., documentation, file creation, or cloning code.
  2. Analysis tasks, e.g., computing descriptive/summary statistics, visualizations, or building models.
  3. Steps in the exploratory programming workflow, e.g., creating alternatives, writing production code, or searching for previously executed code.

These codes were applied to the videos and informed subsequent levels of coding. Below is a screenshot of ELAN, the software I used for the video analysis, with one of the video annotation files open:

Key findings

Given below is a summary of our findings. For details, see the full paper at CHI ’20.

  • During exploration and while rewriting code, data workers have difficulty keeping track of the code that produced data insights and the states of code experiments.
  • Exploration involves a standard routine of finding base code, cloning, contextualizing, and evaluating it.
  • Hypotheses are the building blocks of analysis. Source code written to validate a hypothesis has syntactic signals that make it detectable. Data manipulations lead to alternative analyses, and data workers have to remember variable names to keep track of data versions.
  • Data workers organize their code into blocks when writing code; these are used as checkpoints for navigation later.
  • Data workers use (a) prior knowledge of statistical procedure, (b) text & graphic output of source code, and (c) external resources like webpages to rationalize their analysis.
  • Data workers do not capture data insights initially, but instead rely on their memory and sparse documentation.
  • It is hard for data workers to track the data dependencies in their code. This leads to faulty executions in production code.
  • Data workers rerun code frequently to recall rationale, insights, and the states of explorations.
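To make the dependency-tracking finding concrete: the bookkeeping that data workers do in their heads (which variable was derived from which) can, in principle, be recovered from the code itself. Below is a minimal, hypothetical Python sketch that scans R-style assignment lines and records which previously defined variables each new variable depends on. The regexes, the example script, and the variable names are all illustrative assumptions, not part of our study materials.

```python
import re

# Hypothetical sketch: recover coarse variable dependencies from R-style
# assignments, the bookkeeping data workers otherwise keep in memory.
ASSIGN = re.compile(r"^\s*(\w[\w.]*)\s*(?:<-|=)\s*(.+)$")   # target <- expression
IDENT = re.compile(r"\b[a-zA-Z][\w.]*\b")                    # R-style identifiers

def dependencies(script_lines):
    """Map each assigned variable to the known variables its expression mentions."""
    known, deps = set(), {}
    for line in script_lines:
        m = ASSIGN.match(line)
        if not m:
            continue
        target, expr = m.group(1), m.group(2)
        # Only count names we have already seen assigned; ignores functions etc.
        deps[target] = {name for name in IDENT.findall(expr) if name in known}
        known.add(target)
    return deps

# Illustrative R snippet (fabricated for this sketch)
script = [
    "scores <- read.csv('scores.csv')",
    "clean <- subset(scores, rt > 200)",
    "model <- lm(rt ~ condition, data = clean)",
]
print(dependencies(script))
# model depends on clean, which depends on scores
```

A real tool would need proper parsing rather than regexes (string literals and function names confuse this sketch), but even this coarse pass surfaces the chain of data versions that participants struggled to recall.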

Solution: Tractus

Based on the above findings, we felt we could improve data workers’ workflow in RStudio. The result is TRACTUS, an RStudio addin displayed next to the user’s source code. It groups source code by the hypothesis being tested. (The grouping is done automatically by parsing the source code and detecting the underlying statistical models.) TRACTUS also captures and visualizes key contextual information from the analysis, such as inline and block comments, execution output, and execution dependencies.
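The core grouping idea can be sketched in a few lines. Calls to statistical-test and model functions (t.test, lm, glm, and so on) act as syntactic signals of a hypothesis; each signal closes a group containing the code and comments leading up to it. The Python below is a simplified illustration of that idea only, not the actual TRACTUS implementation (which parses R properly); the function list and example script are assumptions made for this sketch.

```python
import re

# Hypothetical sketch: model/test calls signal a hypothesis; each signal
# closes a group holding the code and comments that led up to it.
SIGNALS = re.compile(r"\b(t\.test|wilcox\.test|aov|lm|glm|chisq\.test)\s*\(")

def group_by_hypothesis(script_lines):
    groups, current = [], []
    for line in script_lines:
        current.append(line)
        m = SIGNALS.search(line)
        if m:
            groups.append({"signal": m.group(1), "lines": current})
            current = []
    if current:  # trailing code with no model call yet
        groups.append({"signal": None, "lines": current})
    return groups

# Illustrative R snippet (fabricated for this sketch)
script = [
    "# H1: reaction time differs between conditions",
    "clean <- subset(scores, rt > 200)",
    "t.test(rt ~ condition, data = clean)",
    "# H2: accuracy depends on condition and age",
    "glm(correct ~ condition + age, data = clean, family = binomial)",
]
for g in group_by_hypothesis(script):
    print(g["signal"], len(g["lines"]))
```

Each resulting group bundles a hypothesis-signaling call with its preparatory code and comments, which is the unit TRACTUS visualizes alongside the script.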

Check out TRACTUS in action in this video.

For the details of our implementation, see our full paper or GitHub repository.


Validation with users

We validated TRACTUS via two user studies.

In the first user study, participants were given R script files of varying length and complexity and asked to use TRACTUS to explain the analysis in each script. This helped us understand how users read an analysis with TRACTUS. We compared this against a control condition: the RStudio IDE without the TRACTUS addin.

In the second study, we wanted to evaluate whether TRACTUS is beneficial across the stages of a data science task. Participants used RStudio with TRACTUS to conduct a complete analysis: identifying hypotheses, performing exploration, refactoring code where needed, and performing confirmation. This helped us identify several key benefits of TRACTUS, such as code curation.

For details, see the CHI ’20 paper.