11 Exploring Data
SLIDE DECKS
Some of the material presented in this chapter will be discussed in class. It is your responsibility to ensure you cover all the concepts presented both in class and in this textbook.
Often, students and researchers want to dive into data analysis as soon as they get their hands on a data set. However, this is not good practice. It’s much better to take a step back and review the data set: what it contains, how it was collected, how much was collected, and so on. It’s common practice to describe the meta-data (or data about the data) when you first write up a Data Science study. Meta-data provide context to anyone who wants to review your work, and they also give you the information you need to decide how to analyze the data.
In the following paragraphs, we’ll look at some of the questions you should ask whenever you begin working with a data set (particularly for the first time), and then we’ll introduce a few specific terms that help describe the data.
Why is this important? As part of our responsibility, we must be able to describe the work we do in a manner that allows it to be reproduced by another researcher. If we can’t give other researchers sufficient information to reproduce the work we did, then our work isn’t reproducible, and reproducibility is a cornerstone of science!
Things To Ask About Your Data
The following questions can help you begin to contextualize the data you are working with:
- How much data do you have? How many records are in your data set? Is it one data set, or more? If more, how are they connected? If they are merged, how does this affect the size?
- Where did the data come from? Who collected it? What tool was used to gather it? What geographical locations were involved?
- When were the data collected? Is there a time-dependency to the responses? Is it longitudinal?
- What kinds of data are present? What are the relevant variables/columns?
- Are data missing? If so, are they missing for obvious reasons? Is there a pattern to the missingness? Are missing data the same as 0 (zero) data? Are “not applicable” data the same as “I choose not to answer”?
- Do the data have bounds? Are they limited to a specific range? Are the responses restricted to a set of pre-determined responses?
- Do the data have limitations?
- Is it possible that different columns are related?
- Is it possible that different rows are related?
- Are these possible relationships relevant?
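A few of the questions above (how many records, which columns, what is missing, what the bounds are) can be answered with a short script. The following is a minimal sketch using only the Python standard library. The records, the column names, and the `summarize` helper are all invented for illustration, and `None` is assumed to mark a missing value.

```python
# Toy records invented for illustration; None marks a missing value.
records = [
    {"country": "Canada", "age": 34, "visits": 0},
    {"country": "Malawi", "age": None, "visits": 3},
    {"country": "Indonesia", "age": 51, "visits": None},
    {"country": "Canada", "age": 27, "visits": 2},
]


def summarize(rows):
    """Return record count, column names, missing counts, and numeric ranges."""
    # Collect every column name that appears in any record.
    columns = sorted({key for row in rows for key in row})
    # Count missing values per column; a value of 0 is NOT counted as missing.
    missing = {
        col: sum(1 for row in rows if row.get(col) is None) for col in columns
    }
    # Record the observed (min, max) for columns that hold numbers.
    ranges = {}
    for col in columns:
        values = [
            row[col] for row in rows if isinstance(row.get(col), (int, float))
        ]
        if values:
            ranges[col] = (min(values), max(values))
    return {
        "n_records": len(rows),
        "columns": columns,
        "missing": missing,
        "ranges": ranges,
    }


summary = summarize(records)
print(summary)
```

Note that the `visits` value of 0 in the first record is kept separate from the missing `visits` value in the third record, which is the distinction the missing-data question asks about.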
How Do We Describe Data?
Fortunately, there is a standard language used to describe data.
- Quantitative vs. Qualitative: Quantitative data are numbers, things that can be measured and are often objective. Qualitative data are labels, stories, words, and descriptions, and are often subjective. Quantitative data can be converted to labels, that is, to qualitative data.
- Ordinal vs. Nominal: Ordinal data are those data that can be ordered, whereas nominal data can’t. For example, you can’t order the labels “Canada”, “Indonesia”, and “Malawi”. You can, however, order the numbers 1, 3, 6, 21, and 99.
- Countable vs. Uncountable: Countable data are those where each possible response can be labelled with its own unique integer. Uncountable data cannot be labelled this way, because there are too many possible responses to pair each one with an integer.
- Finite vs. Infinite: Finite data are those that consist of N distinct items. Infinite data have a list of potential responses that goes on forever. For example, the outcomes from rolling a pair of dice and adding the rolls up are finite: only so many outcomes are possible.
- Continuous vs. Discrete: Discrete data are those which have a countable set of labels that can be used to classify the data. Continuous data would require an uncountable set of labels to classify the data.
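The list above notes that quantitative data can be turned into labels. As a minimal sketch of that idea, the function below maps a measured height to a label. The cut-offs (160 cm and 180 cm), the label names, and the sample heights are assumptions made up for this example, not standard values.

```python
def label_height(height_cm):
    """Map a numeric height to a label using made-up cut-offs."""
    if height_cm < 160:
        return "short"
    if height_cm < 180:
        return "medium"
    return "tall"


# Hypothetical quantitative measurements (continuous, in centimetres).
heights_cm = [152.0, 171.5, 188.2, 160.0]

# The same observations expressed as qualitative labels.
labels = [label_height(h) for h in heights_cm]
print(labels)
```

The conversion only goes one way: once the heights are replaced by labels, the original measurements cannot be recovered from the labels alone.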
It’s important to note that qualitative data can be either ordinal or nominal. For example, an ordinal but qualitative data set might have responses of small, medium, and large. Additionally, countable data can be finite or infinite. For example, the outcomes of the roll of a pair of dice are finite and countable, but the integers themselves are infinite and countable.
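Both examples in this paragraph can be checked with a few lines of Python. The sketch below enumerates every sum of two dice (assumed here to be six-sided) to show that the set of outcomes is finite, and then sorts some invented size responses using an explicit label order.

```python
from itertools import product

# Every sum from rolling two six-sided dice: a finite, countable set.
dice_sums = sorted({a + b for a, b in product(range(1, 7), repeat=2)})
print(dice_sums, len(dice_sums))

# Ordinal qualitative data: labels that carry an agreed order.
size_order = ["small", "medium", "large"]

# Hypothetical survey responses, invented for illustration.
responses = ["large", "small", "medium", "small"]

# Sort by each label's position in the order, not alphabetically.
ranked = sorted(responses, key=size_order.index)
print(ranked)
```

Sorting the same responses alphabetically would put "large" first, which is why ordinal labels need their order stated explicitly rather than inferred from the text of the label.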