The 5 stages of data analysis

Any good data analysis starts with a well-thought-out process. According to Wikipedia, “data analysis is a process of inspecting, cleansing, transforming and modeling data to discover useful information, informing conclusions and supporting decision-making.”

These phases of data analysis have been summarized into 5 distinct stages, as described in The Art of Data Science by Roger D. Peng, Ph.D. and Elizabeth Matsui:

  1. Stating the question
  2. Exploratory Data Analysis
  3. Building formal statistical models
  4. Interpretation
  5. Communicating the results

Peng further explains that, within each of these five stages, data analysts should work through a three-step iterative process. This gives them an opportunity to re-evaluate the findings and refine the stage they just performed.

  1. Setting Expectations
  2. Collecting your data and comparing it to your expectations
  3. Revising your expectations or fixing the data (e.g. collecting more) so your data matches what you expected

The table below shows how the three steps apply at each stage.

| Stage | Set Expectations | Collect Information | Revise Expectations |
|---|---|---|---|
| Question | Question is of interest to the audience | Literature, Search, Data | Sharpen Question |
| Exploratory Data Analysis | Data are appropriate for the question | Make exploratory plots of data | Refine question or collect more data |
| Formal Modeling | Primary model answers question | Fit secondary models, sensitivity analysis | Revise a formal model to include more predictors |
| Interpretation | Interpretation of analyses provides a specific & meaningful answer to the question | Interpret totality of analyses with a focus on effect sizes & uncertainty | Revise EDA and/or models to provide specific & interpretable answer |
| Communication | Process & results of the analysis are understood, complete & meaningful to the audience | Seek feedback | Revise analyses or approach to presentation |

Table: Epicycles of Analysis. Source: The Art of Data Science by Roger D. Peng, Ph.D. & Elizabeth Matsui

Let’s take a look at this entire process using an example data set.

1. Stating and refining the question

More often than not, a question will be handed to you, or you will be given a rough one that you can refine into a better one. Let’s say we have a customer survey that asks customers to rate how likely they are to recommend our product to their friends; this rating is the basis of the Net Promoter Score (NPS). The marketing director would like to know whether there is any relationship between the survey results and the customers’ product interests.

Since we now know that we are looking for a relationship between two attributes of a customer, we can determine that we are asking an exploratory question. Exploratory questions require you to analyze data to find patterns and trends between features. But we stop short of forming a fully thought out hypothesis. The findings from your exploratory data analysis will provide supporting evidence for you to form a hypothesis.

Our refined question could be restated as:

Is there a relationship between a customer’s product interest and the Net Promoter Score they gave our product?

Now let’s quickly run through the three-step process.

  1. Setting expectations
    1. The question is of interest to our audience. In this case, it was our marketing department that would like to understand these data better.
    2. The question asks whether there is a relationship between NPS and product interest. It does not ask whether one influences the other.
  2. Collecting Information
    1. NPS is a common metric. We can do some additional research to see if our NPS is representative of other companies within our industry. For this exercise, we’ll make the assumption it is similar to external data.
    2. Our product interests are unique to our business so we’ll move forward with collecting our own data.
  3. Revise Expectations
    1. There is no need to revise our question any further.

To collect our data, we’re going to select the id, nps_score, and interests columns of all customers from the ‘customers’ table in our example database.

SELECT id, nps_score, interests
FROM customers;

This SQL query resulted in 1,000 records. We’ll export these results as a CSV file for further analysis using Python. From a cursory view, we can confirm that we have NPS and product interest data.
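
One way to do the export step from Python is sketched below. It assumes a SQLite database, which the article does not specify. The in-memory database and its three rows are hypothetical stand-ins for the real customers table, and export_query_to_csv is an illustrative helper name.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for the real database: an in-memory SQLite
# table with a few made-up rows (None marks a missing value).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, nps_score REAL, interests TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 3.0, "Fly Fishing"), (2, None, "Surfing"), (3, 9.0, None)],
)


def export_query_to_csv(connection, query, out_file):
    """Run `query` and write the header plus all rows to `out_file` as CSV."""
    cursor = connection.execute(query)
    writer = csv.writer(out_file)
    writer.writerow([col[0] for col in cursor.description])  # column names
    writer.writerows(cursor.fetchall())  # NULLs are written as empty fields


# An in-memory buffer keeps the sketch self-contained; a real export would
# pass a file opened with open(path, "w", newline="") instead.
buffer = io.StringIO()
export_query_to_csv(conn, "SELECT id, nps_score, interests FROM customers;", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Missing values end up as empty fields in the CSV, which pandas reads back as NaN by default.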

2. Exploratory Data Analysis

We’re finally at the point where we’ll start exploring the data, cleaning the data, and examining any relationships between features. There are three primary goals to EDA.

  1. Determine whether the data is useful and whether there is enough of it to work with
  2. Determine if we have the right data to answer our question
  3. Develop a primary model of the answer to our question

OK, here we go!

Read in our data

You can follow along with this Deepnote project:

import pandas as pd
# read in our data
filename = 'data/nps_scores.csv'
df = pd.read_csv(filename)
df.info()  # print column names, non-null counts, and dtypes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
id           1000 non-null int64
nps_score    391 non-null float64
interests    789 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 23.6+ KB

Let’s check our dataset and start cleaning.

We can see that our data includes 1,000 rows and three columns: id, nps_score, and interests. But there is already some concern: only 391 customers have an NPS value. We’ll continue with what we have, but we might want to send out more NPS surveys.
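
To put that concern in numbers, the non-null counts from the data frame info above can be turned into the share of missing values per column. This is a minimal sketch that uses only those three counts; the variable names are illustrative.

```python
# Non-null counts copied from the data frame info output.
total_rows = 1000
non_null = {"id": 1000, "nps_score": 391, "interests": 789}

# Share of rows with a missing value in each column.
missing_share = {col: 1 - count / total_rows for col, count in non_null.items()}

for col, share in missing_share.items():
    print(f"{col}: {share:.1%} missing")
```

On the loaded frame itself, df.isna().mean() should give the same shares.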

Looking at the top and bottom of the dataset, we can start to get a feel for what we are working with.

# Print the top and bottom 5 rows
print(df.head())
print(df.tail())
     id  nps_score    interests
0  1462        3.0  Fly Fishing
1  1491        2.0  Fly Fishing
2  1492        NaN  Fly Fishing
3  1551        NaN  Fly Fishing
4  1553        NaN  Fly Fishing
       id  nps_score         interests
995  1989       10.0           Surfing
996  1994        6.0  Ski/Snowboarding
997  2000        NaN           Surfing
998  1975        NaN       Fly Fishing
999  1978        4.0       Fly Fishing

We can see that the data frame has an id column that contains a unique customer id value. Let’s use that column as our index.

# Set the index to the id column; set_index also drops the original id column
df = df.set_index('id')
print(df.head())
      nps_score    interests
1462        3.0  Fly Fishing
1491        2.0  Fly Fishing
1492        NaN  Fly Fishing
1551        NaN  Fly Fishing
1553        NaN  Fly Fishing

From our data frame info above, we know we need to either drop our NaN values, ignore them, or fill in substitutes. Let’s drop any rows with NaN values and check the data frame info.
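
The three options behave differently, which a small made-up frame makes visible. The rows below are illustrative and not taken from the survey data, and filling with the median is just one possible substitute.

```python
import pandas as pd

# Illustrative rows only; None becomes NaN in the float column.
sample = pd.DataFrame(
    {
        "nps_score": [3.0, None, 9.0, None],
        "interests": ["Fly Fishing", "Surfing", None, None],
    }
)

# Option 1: drop every row that has a missing value in any column.
dropped = sample.dropna()

# Option 2: ignore NaNs; pandas aggregations skip them by default.
mean_ignoring_nans = sample["nps_score"].mean()

# Option 3: fill in a substitute, here the column median.
filled = sample["nps_score"].fillna(sample["nps_score"].median())

print(len(dropped))         # number of rows that survive dropna()
print(mean_ignoring_nans)   # mean of the observed scores only
print(filled.tolist())      # scores with NaNs replaced by the median
```

Dropping rows also discards customers who answered the survey but have no recorded interest, so the cleaned frame can end up smaller than the number of survey responses.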

# Drop rows with NaNs and check the data frame info
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 315 entries, 1462 to 1978
Data columns (total 2 columns):
nps_score    315 non-null float64
interests    315 non-null object
dtypes: float64(1), object(1)
memory usage: 7.4+ KB

We’re left with 315 customers that have both an NPS and interest value. Let’s view some descriptive statistics & counts.

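# Descriptive statistics for NPS, then the number of customers per interest
print(df['nps_score'].describe())
print(df['interests'].value_counts())
count    315.000000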
mean       5.219048
std        2.905017
min        1.000000
25%        3.000000
50%        5.000000
75%        8.000000
max       10.000000
Name: nps_score, dtype: float64
Surfing             60
Mountain Biking     57
Ski/Snowboarding    56
Trail Running       49
Fly Fishing         48
Climbing            45
Name: interests, dtype: int64

Hey, we have some summarized data to work with now. As you can see, this is a very simple example. In reality, a data set would likely involve many more columns and require more pre-processing.

The next step is to make a plot.

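import matplotlib.pyplot as plt
# create a boxplot of nps_score by interests (here with pandas' built-in boxplot)
df.boxplot(column='nps_score', by='interests')
plt.xticks(rotation=45) # rotate x-axis labels so we can read them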

We can see that customers who are interested in climbing have a slightly higher median NPS.

  • Can we start to answer our question about the relationships between interests and NPS?
  • Do we have enough data with the 45 responses within climbing?
  • Does this data get us thinking about what we should do next to further explore NPS and interests and develop a hypothesis?

3. Building formal statistical models

A key characteristic of a model is reduction. We want to explain our EDA findings in the simplest way possible. Our example is a simple one, but depending on your EDA you may be building a more involved model.

In our case, we are able to report descriptive statistics of NPS values for each interest category. Descriptive statistics are the simplest model that addresses our question.

# Create a DataFrame of descriptive statistics for each interests value.
# Each group's describe() output becomes one row, labeled with the interest name.
stats = pd.DataFrame(group.describe().rename(columns={'nps_score':name}).squeeze()
                         for name, group in df.groupby('interests'))
print(stats)
                  count      mean       std  min  25%  50%   75%   max
Climbing           45.0  5.711111  2.809256  1.0  3.0  6.0  8.00  10.0
Fly Fishing        48.0  4.562500  2.974403  1.0  2.0  4.0  6.25  10.0
Mountain Biking    57.0  5.368421  2.844993  1.0  3.0  5.0  8.00  10.0
Ski/Snowboarding   56.0  4.964286  2.898051  1.0  2.0  5.0  7.00  10.0
Surfing            60.0  5.433333  3.038547  1.0  3.0  5.0  8.00  10.0
Trail Running      49.0  5.265306  2.841475  1.0  3.0  5.0  7.00  10.0
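
The same kind of table can usually be produced more directly with groupby followed by describe. The sketch below uses a handful of made-up rows rather than the survey data, so its numbers are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration; not the article's data set.
sample = pd.DataFrame(
    {
        "interests": ["Climbing", "Climbing", "Surfing", "Surfing", "Surfing"],
        "nps_score": [6.0, 8.0, 2.0, 5.0, 8.0],
    }
)

# One row of descriptive statistics per interest category.
stats_by_interest = sample.groupby("interests")["nps_score"].describe()
print(stats_by_interest)
```

Selecting the nps_score column before calling describe keeps the result flat, with one column per statistic.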

4. Interpretation

Our EDA has revealed that Net Promoter Scores are low across product interests: the mean score in every category falls between roughly 4.6 and 5.7 out of 10. We may need to revise our question and ask if there is any relationship between NPS and products purchased. In many cases such as this example, EDA leads to further discussion and questions. We’ve also discovered that only 391 of our 1,000 customers (about 39%) have responded to our NPS survey, so there’s an opportunity to collect more data.
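
One rough way to weigh effect size against uncertainty is to put an approximate 95% interval around each group mean. The sketch below uses the counts, means, and standard deviations from the descriptive statistics table above. The normal approximation (mean plus or minus 1.96 standard errors) is an assumption added here, not part of the original analysis, and approx_ci is an illustrative helper name.

```python
import math

# (count, mean, std) per interest, copied from the descriptive statistics table.
group_stats = {
    "Climbing": (45, 5.711111, 2.809256),
    "Fly Fishing": (48, 4.562500, 2.974403),
    "Mountain Biking": (57, 5.368421, 2.844993),
    "Ski/Snowboarding": (56, 4.964286, 2.898051),
    "Surfing": (60, 5.433333, 3.038547),
    "Trail Running": (49, 5.265306, 2.841475),
}


def approx_ci(count, mean, std, z=1.96):
    """Normal-approximation interval for a group mean (95% when z=1.96)."""
    half_width = z * std / math.sqrt(count)
    return mean - half_width, mean + half_width


intervals = {name: approx_ci(*values) for name, values in group_stats.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.2f} to {high:.2f}")
```

With 45 to 60 responses per group, each interval extends roughly 0.7 to 0.8 points on either side of the mean, and all six intervals overlap. Under this approximation, the differences between interests are not clearly larger than the sampling noise, which supports collecting more responses before forming a hypothesis.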

5. Communicating the results

Our last stage of data analysis is communicating the results. This is often a report or presentation given to stakeholders. I’ve also found it beneficial to publish my findings on Slack along with the original data. That way you can start a conversation and have colleagues provide their feedback. Some of the best advice I’ve received has come from engineers. Being transparent about the data analysis process goes a long way to creating confidence in your reporting.