Anthropomorphizing data / by Emily Randall

There seem to be two camps within data science: those who believe a hypothesis is required for analysis, and those who believe it's not. I put myself in the "hypothesis required" camp. I wouldn't start collecting data or analyzing an existing data set without first formulating one or more hypotheses.

I define a hypothesis as a testable prediction, idea, or explanation. A hypothesis doesn't have to state cause and effect; it can simply be correlational. (A cause-and-effect hypothesis is ideal but not always possible when studying human behavior.)

If you are conducting an exploratory rather than confirmatory study, you should still have one or more questions that you want to answer before you start the analysis. For example, a drugstore chain might want to know the average amount of money spent on vitamins and supplements per customer per year. How does this spending differ by age group, gender, prescription refills, or other factors?
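To make the drugstore question concrete, here is a minimal sketch in Python of how it might be written down before any real data are touched. The records, the field layout, and the function names are all hypothetical, invented purely for illustration; a real chain's data would look different and would need cleaning first.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical purchase records (made up for illustration):
# (customer_id, age_group, year, amount_spent_on_vitamins_and_supplements)
purchases = [
    ("c1", "18-34", 2023, 20.0),
    ("c1", "18-34", 2023, 10.0),
    ("c2", "18-34", 2023, 50.0),
    ("c3", "65+", 2023, 100.0),
    ("c3", "65+", 2023, 40.0),
    ("c4", "65+", 2023, 60.0),
]


def yearly_spend_per_customer(records):
    """Sum spending for each (customer, age group, year) combination."""
    totals = defaultdict(float)
    for customer, age_group, year, amount in records:
        totals[(customer, age_group, year)] += amount
    return totals


def average_spend_by_age_group(records):
    """Average the per-customer yearly totals within each age group."""
    by_group = defaultdict(list)
    for (_customer, age_group, _year), total in yearly_spend_per_customer(records).items():
        by_group[age_group].append(total)
    return {group: mean(totals) for group, totals in by_group.items()}


# Question 1: average spend per customer per year, across all customers.
overall = mean(list(yearly_spend_per_customer(purchases).values()))

# Question 2: how does that average differ by age group?
by_age = average_spend_by_age_group(purchases)

print(overall)
print(by_age)
```

The point of the sketch is that both questions are stated in the code before the numbers come back. The human decides what "per customer per year" means and which factor to split on; the data only fill in the answer.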

This approach is in contrast to that of researchers who want to "let the data speak for themselves." Data do not speak. They are not human. To give data human qualities is to anthropomorphize the data. A human must guide the data collection, analysis, and interpretation process. (I have similar issues with the term "data-driven." Data cannot drive. They do not have a license or the skills to operate a motor vehicle.)

Data can yield powerful insights, and we have access to more of them than ever before. But putting the onus on the data means relinquishing your power as a researcher, and I'm not sure why anyone would want to do that. It also means straying from the "science" part of data science. It's not that hard to generate questions or predictions before an analysis; in fact, this can be the most enjoyable part of the whole process! And besides, how can you choose the most appropriate statistical test when you don't know what the independent and dependent variables are, or if you even have a dependent variable at all?

A hypothesis-first approach assumes, of course, that the researcher is objective in his or her conclusions and that the insights gained are accurate and practical.

I like this quote from Michael L. Anderson, in his article highlighting oversimplifications in a neuroscientific study of personality:

"Data never speak for themselves. They are always placed inside an interpretive frame, and when that frame is inadequate, no interpretation can be valid. That is to say, without an adequate interpretive framework, we can't know the meaning of even the most statistically and methodologically sound finding."