12 Preliminary Data Analysis
SLIDE DECKS
Some of the material presented in this chapter will be discussed in class. It is your responsibility to ensure you cover all the concepts presented both in class and in this textbook.
After you’ve described the meta-data about your data, you’re going to want to begin the process of playing with the data. That is, you’re going to want to look at it, ask basic questions about it, and identify anything weird within it. The goal here isn’t to fully analyze the data so that you can answer your overarching research question, but to become more familiar with the data. By doing this, you’ll be ready to identify potential outliers, missing data, and odd patterns that might be surprising to you. All of this will be important as we work towards a full analysis of the data.
Summary Statistics
One of the quickest and easiest ways to get to know your data is to apply basic summary statistics to each variable in your data set. Since we are considering each variable in your data set on its own, we employ univariate summary statistics. These include, but are not limited to:
- minimum, maximum, mean, mode, and median (one version of a five-number summary; the classic five-number summary replaces the mean and mode with the lower and upper quartiles)
- standard deviation or standard error (note, these are two different things)
- bar plots (typically used if the x-axis uses nominal or ordinal labels)
- levels (also known as factors for a nominal variable)
- number of NAs or missing data values
- variable types
- histograms
- box plots
- normality tests
- counts
- proportions
Of course, not all of these measures can or should be used on all data. We need to consider the type of data that we are summarizing (e.g., quantitative vs. qualitative, continuous vs. discrete, ordinal vs. nominal, etc.), and determine if the summary statistic is valid. For example, it wouldn’t make sense to compute a mean or median on qualitative data. Counts might be useful for nominal data, but not necessarily for continuous data (unless you applied some transformation that assigned specific values to a labelled bin).
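If you are working in Python, most of these univariate summaries are one-liners with pandas. The sketch below is a minimal example, not a prescription: the file name field_survey.csv, the quantitative column age, and the nominal column species are all hypothetical placeholders you would replace with your own data.

```python
# A minimal sketch of univariate summaries with pandas and scipy.
# The file name and column names are hypothetical placeholders.
import pandas as pd
from scipy.stats import shapiro

df = pd.read_csv("field_survey.csv")  # assumed example data set

# Variable types and number of missing (NA) values per variable
print(df.dtypes)
print(df.isna().sum())

# Five-number summary (min, quartiles, median, max) plus mean, standard
# deviation, and count for a quantitative variable
print(df["age"].describe())

# Mode (works for qualitative or quantitative variables)
print(df["age"].mode())

# Levels (factors), counts, and proportions for a nominal variable
print(df["species"].value_counts())
print(df["species"].value_counts(normalize=True))

# A simple normality test (Shapiro-Wilk) on the non-missing values
print(shapiro(df["age"].dropna()))

# Quick visual summaries
df["age"].plot(kind="hist")                    # histogram
df["age"].plot(kind="box")                     # box plot
df["species"].value_counts().plot(kind="bar")  # bar plot
```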
Once you’ve explored each variable in the data set, you might then consider how these data are related to each other. That is, are some of the variables correlated? In this case, we explore the data using multivariate measures. These include, but are not limited to:
- scatter plots
- cross tabs
- tables
- correlation coefficients
- stacked bar plots
- chi-square tests
- auto-correlation tests
- heat maps
- line plots (typically used when the x-axis values are ordered and quantitative, such as time)
- waffle plots
Similar to the case of univariate summary statistics, we need to consider our data when we choose which method will be used to summarize the potential relationships between our variables. Some of these summary methods aren’t appropriate for all types of data. For example, scatter plots aren’t really useful for the comparison of qualitative variables. In that case, you might want to consider stacked bar plots, or maybe a chi-square contingency table.
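Again, a short Python sketch may help make this concrete. It continues the hypothetical field_survey.csv example, assuming quantitative columns age and mass and nominal columns species and site; none of these names come from a real data set.

```python
# A minimal sketch of multivariate summaries: correlation, scatter plot,
# cross tabulation, stacked bar plot, and a chi-square test.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("field_survey.csv")  # assumed example data set

# Correlation coefficients between quantitative variables
print(df[["age", "mass"]].corr())

# Scatter plot of two quantitative variables
df.plot(kind="scatter", x="age", y="mass")

# Cross tabulation (contingency table) of two qualitative variables
table = pd.crosstab(df["species"], df["site"])
print(table)

# Stacked bar plot built from the cross tabulation
table.plot(kind="bar", stacked=True)

# Chi-square test of independence on the contingency table
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)
```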
Outlier Data
As you play with your data, you might discover data that just doesn’t match what you expected to see. These data points might be the result of faulty equipment, miscalibrated equipment, or human error, or they might be extreme observations that don’t fit our understanding of the world. In all cases, these would likely be labelled as outlier data. Sadly, most students assume that outlier data should simply be removed from a data set, but removal should be a last resort.
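One common way to flag candidates is the 1.5 × IQR rule that box plots use. The sketch below, using the same hypothetical columns as before, only flags rows for a closer look; it does not remove anything.

```python
# A minimal sketch of flagging candidate outliers with the 1.5 * IQR rule
# (the same rule a box plot uses); "mass" is a hypothetical column.
import pandas as pd

df = pd.read_csv("field_survey.csv")

q1 = df["mass"].quantile(0.25)
q3 = df["mass"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# These rows are candidates for closer investigation, not automatic removal
candidates = df[(df["mass"] < lower) | (df["mass"] > upper)]
print(candidates)
```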
Outlier data might be the result of a mistake during the data collection process, but it also might be the result of us simply not understanding a phenomenon in sufficient detail to be able to explain the occurrence of the outlier. In other words, an outlier might contain new knowledge, or it might indicate that our assumptions of the world are incorrect, or it might represent some poorly understood phenomenon. The bottom line is that throwing away an outlier could prevent us from having a better grasp of whatever we are studying.
So what should we do if we see a potential outlier?
- Begin by considering why the outlier doesn’t fit the pattern of the data that we expected. Is it because we have limited knowledge of what we are studying? Is it possible that it is the result of faulty equipment or human error?
- Consider if the outlier tells you something about your data or the assumptions you have about your data. Do you need to reconsider your assumptions or improve your understanding of the data set and the phenomenon it represents?
- Discuss the outlier with whoever created or provided you with the data set. Can they tell you whether the data were poorly transcribed, or whether the tool used to collect them wasn’t working properly?
Only if the person who provided the data confirms that it is a mistake should you completely remove it from your data set.
If the person who provided the data can’t tell you it’s a mistake or the result of faulty equipment, we really should consider what this means about our understanding of the phenomenon we are studying. How does this change what we are doing? How might it affect the research questions we are trying to answer?
You can also run your analysis on both the full data set (outlier included) and a reduced data set (outlier removed) to see how influential the outlier is. In some cases, the outlier will have very little impact on the results of your analysis; in that case, it is considered a non-influential data point. If removing it, however, leads to very different results, then the outlier is considered an influential data point.
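For example, if you were fitting a simple linear regression, you could refit it with and without the suspected outliers and compare the estimates. The sketch below is a hedged illustration only: the model (mass on age), the flagged row labels, and the use of statsmodels are all assumptions, not part of any particular analysis.

```python
# A minimal sketch of checking how influential suspected outliers are:
# fit the same simple model with and without the flagged rows and compare.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("field_survey.csv")
flagged = [17, 42]  # hypothetical row labels of suspected outliers

full_fit = smf.ols("mass ~ age", data=df).fit()
reduced_fit = smf.ols("mass ~ age", data=df.drop(index=flagged)).fit()

# Large changes in the coefficients suggest an influential data point
print(full_fit.params)
print(reduced_fit.params)
```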
In all cases, document the data that you believe are outliers. Also document what you have done with the outlier data, and why. You should also clearly state any assumptions. For example, if you have no concrete evidence that a data point was incorrectly entered, only your own intuition (and perhaps physical limitations that would prevent the recorded value from actually being observed), state that you have assumed the data point was incorrectly entered (and give your reasoning) and that you have therefore removed it from further analysis.
This doesn’t mean that everyone will agree with your decisions or assumptions. In fact, they will likely question every single decision you made and every single assumption you’ve documented. As a Data Scientist, you need to be able to defend your decisions.
Consider the following:
A Dataset that is 44% Outliers, by Robert W. Hayden (2017).
“Abstract: The data illustrate outliers that are not mistakes and not observations that are unusually high or low. The reasons for them are all interesting historically. They illustrate that ‘outliers’ need not be errors but may instead be particularly interesting cases. The data also illustrate that different data displays may differ in their ability to reveal interesting data structure.” (Hayden, 2017)
What Should I Document?
Whenever you describe the results of an analysis for a reader or a client, you should provide them with as much information about the data as possible. This helps them contextualize what you’ve done, and provides them with enough information to evaluate how the results might affect them or a general audience. With that in mind, any report or presentation you give should document:
- the meta-data
- relevant univariate and multivariate summaries
- any outliers you’ve identified and what you’ve done with them
- any assumptions you’ve made about the data