The Data Science Process

Daniel Gillis; Kurtis Sobkowich

10 The Data Science Process

SLIDE DECKS

Some of the material presented in this chapter will be discussed in class. It is your responsibility to ensure you cover all the concepts presented both in class and in this textbook.

The Data Science Process

Data Science employs techniques and theories drawn from many fields within the broad areas of Mathematics, Statistics, Computer Science, Information Science, and more! And if you were to review job postings, you might be surprised to learn how much is expected of a Data Scientist!

For example, a Data Scientist might be expected to know about signal processing, probability models, machine learning, statistical learning, data mining, database design and management, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modelling, data warehousing, data compression, computer programming, artificial intelligence, and high-performance computing.

They might also be expected to know correlation analysis, linear and non-linear regression, density estimation, confidence intervals, hypothesis testing, clustering and unsupervised learning, supervised learning, time series analysis, spatial analysis, Monte Carlo simulation, Bayesian statistics, principal components analysis, neural networks, association rules, segmentation, optimization, imputation, survival analysis, cross-validation, predictive modelling, filtering, linkage analysis, experimental design, visual design, and more!

It’s all a little overwhelming! And it is probably a bit silly to think that one person will have mastery over all of this. Instead, some Data Scientists might be really good at some of these things, while others have expertise focused elsewhere.

In that sense, I like to think of Data Science as a team sport – where many different folks get together and share their collective expertise to uncover the story of the data we are analyzing so that it can be properly evaluated and used effectively for building knowledge and making decisions.

The Data Science Process

Data Science is not just about analysis and interpretation. There are many other stages to the Data Science pipeline. In general, the pipeline includes:

the collection of data,
the storage of data,
the cleaning of data,
the visualization of data (and results),
the analysis of data,
the interpretation of analytical results, and
the mobilization of knowledge.
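
To make these stages a little more concrete, here is a minimal sketch in Python of how a few of them might chain together. Everything in it is an assumption made for illustration: the tiny temperature data set is invented, the function names (collect, clean, analyze, visualize, interpret) are hypothetical, and the storage stage is skipped entirely. A real project would look quite different, but the flow from raw data to a plain-language message is the same idea.

```python
import csv
import io
import statistics

# Hypothetical raw data, invented purely for illustration.
RAW_CSV = """site,temperature_c
A,21.5
A,22.1
B,
B,19.8
A,not recorded
B,20.4
"""


def collect(text):
    """Collection: read raw records (here, from an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(records):
    """Cleaning: keep only rows whose temperature parses as a number."""
    cleaned = []
    for row in records:
        try:
            value = float(row["temperature_c"])
        except ValueError:
            continue  # drop missing or non-numeric readings
        cleaned.append({"site": row["site"], "temperature_c": value})
    return cleaned


def analyze(records):
    """Analysis: compute the mean temperature for each site."""
    by_site = {}
    for row in records:
        by_site.setdefault(row["site"], []).append(row["temperature_c"])
    return {site: statistics.mean(vals) for site, vals in sorted(by_site.items())}


def visualize(summary):
    """Visualization: a crude text bar chart, one '#' per (rounded) degree."""
    return [f"{site} {'#' * round(mean)} {mean:.1f}" for site, mean in summary.items()]


def interpret(summary):
    """Interpretation: a plain-language statement for the audience."""
    warmest = max(summary, key=summary.get)
    return f"Site {warmest} was warmest on average ({summary[warmest]:.1f} C)."


records = collect(RAW_CSV)      # collection
cleaned = clean(records)        # cleaning
summary = analyze(cleaned)      # analysis
chart = visualize(summary)      # visualization
message = interpret(summary)    # interpretation

# Mobilization: deliver the results to whoever needs them (here, just print).
print("\n".join(chart))
print(message)
```

Notice that the cleaning step quietly drops two of the six rows. Whether dropping them is the right choice depends on the question being asked, which is one reason the stages should not be treated as purely mechanical.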

Of course, I also think that there are things missing from this pipeline. Specifically, we should (or someone on the team should):

identify the problem/challenge/question that needs to be answered,
determine if the work is exploratory in nature, or very specific,
determine if our goal is to predict or classify (or something else),
identify who is going to use the results of the work we are doing, and
identify how best to deliver the results to those who need them.

Before We Begin A Project

Prior to starting a project, you should consider the following:

The Client: What are their needs? What is the purpose of the work we are doing? What questions do we need to answer for them?
The Audience: Who needs this information? Are they different from the client? Do both require the information? How best can you communicate with your audience?
The Data: Where are they? What are they? How were they collected? How “good” are they?

After A Project Is Finished?

Once a project is finished, you’re going to need to report to someone. With this in mind, you should consider the following:

The Client: Did you report the findings to the client? Do they have additional questions you need to answer? If so, how are you going to accomplish this?
The Audience: Did you actually meet their needs? How do you know? Did they understand the findings? How do you know?
The Data: What do you do with the data once you are finished working with them? What about the findings?
