Stats in the wild
This is an ongoing project. I’ve always wondered how data workers, i.e., non-professional data scientists, grapple with the sheer complexity involved in selecting an appropriate statistical procedure. This project aims to understand that process and to unearth problems in this space.
Study #1: How do data workers select statistical procedures?
We used purposive sampling to recruit twelve data workers (graduate students, doctoral candidates, and industry practitioners) through university mailing lists and word of mouth. We then conducted semi-structured interviews with each participant.
We performed two cycles of coding: an initial/open coding, followed by focused coding. The interviews covered a wide range of topics, such as participants’ resource usage, attitudes towards statistics, and tool usage.
- Selecting statistical procedures is an inherently complex task. It involves a great deal of uncertainty, which is only accentuated by the vast number of available resources.
- Data workers adopt strategies to cope with this complexity. For example, most simply use the statistical procedure that is widely accepted in their field, even if better methods are available.
- The main sources of information are books, research papers, interpersonal communication with experts and colleagues, and Q&A websites.
All of our participants reported using Q&A websites to look up information. Therefore, we investigated these Q&A websites in detail in our next study.
Study #2: How are Q&A websites used?
We selected a sample of 76 questions from two prominent Q&A websites: CrossValidated and ResearchGate. We used the following inclusion criteria:
- The question must have at least one relevant tag, e.g., “significance test”.
- The question must have at least one answer written by a respondent.
- The question must pertain to the selection of an appropriate statistical procedure, not to tool usage (i.e., how to perform a statistical procedure with a particular statistical tool or programming language).
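The three inclusion criteria above can be expressed as a simple filter. This is a minimal sketch, not our actual screening code; the `Question` structure, the tag set, and all field names are hypothetical illustrations.

```python
from dataclasses import dataclass

# Hypothetical set of tags we would treat as relevant (criterion 1).
RELEVANT_TAGS = {"significance test", "hypothesis-testing", "anova"}

@dataclass
class Question:
    title: str
    tags: set            # tags attached to the question on the Q&A site
    num_answers: int     # answers written by respondents
    about_tool_usage: bool  # True if the question is about how to run a procedure in a tool

def meets_inclusion_criteria(q: Question) -> bool:
    """Apply the three inclusion criteria from Study #2."""
    has_relevant_tag = bool(q.tags & RELEVANT_TAGS)   # criterion 1
    has_answer = q.num_answers >= 1                   # criterion 2
    about_selection = not q.about_tool_usage          # criterion 3
    return has_relevant_tag and has_answer and about_selection

# Example screening of a small candidate pool:
candidates = [
    Question("Which test compares two groups?", {"significance test"}, 2, False),
    Question("t.test syntax in R", {"r", "significance test"}, 3, True),
]
included = [q for q in candidates if meets_inclusion_criteria(q)]
```

In this sketch only the first candidate survives: the second has a relevant tag and an answer, but it asks about tool usage rather than procedure selection.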
We manually analyzed the questions and answers by identifying a) key attributes of the question, e.g., characteristics of the dataset, b) the questioner’s intent, e.g., confirmation of a stated procedure vs. an open-ended request, and c) the point in the questioner’s research at which the question was asked, e.g., before data collection or after some data collection. We also looked closely at the answers to gauge their acceptance among respondents, and sought to understand the discourse among the respondents, as well as between the questioner and the respondents.
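The coding scheme above can be pictured as one annotation record per question. This is only an illustrative data structure, not our real codebook; the class, enum values, and example attributes are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Intent(Enum):
    # b) questioner's intent
    CONFIRMATION = "confirmation of a stated procedure"
    OPEN_ENDED = "open-ended request"

class TimePoint(Enum):
    # c) point in the questioner's research
    BEFORE_DATA_COLLECTION = "before data collection"
    AFTER_SOME_DATA_COLLECTION = "after some data collection"

@dataclass
class Annotation:
    question_id: str
    dataset_attributes: list      # a) key attributes, e.g., dataset characteristics
    intent: Intent
    time_point: TimePoint
    answer_accepted: bool         # whether an answer was accepted by respondents

# One hypothetical coded question:
record = Annotation(
    question_id="cv-001",
    dataset_attributes=["small sample", "non-normal distribution"],
    intent=Intent.CONFIRMATION,
    time_point=TimePoint.BEFORE_DATA_COLLECTION,
    answer_accepted=True,
)
```

Structuring the codes this way makes it straightforward to tally, for instance, how many questions sought confirmation versus an open-ended recommendation.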
- Most questions sought a quick, evidence-based confirmation of a given statistical procedure. This suggests that many data workers are hesitant or uncertain about their chosen solution.
- Most questions specify three categories of information: the dataset, the experimental design, and the questioner’s intent.
- However, over 35% of questions had missing information and over 15% had unclear information. In several cases, this led respondents to give incorrect or inaccurate answers.
- Questioners have difficulty phrasing their questions comprehensibly and representing the experimental designs of their studies.