Descriptive Statistics

A Very Brief Introduction/Overview

October 12, 2023

Ethan Marzban (for DS Collab)

Goals for Today

  • Introduce the Data Science Lifecycle

  • Discuss the data matrix representation of data

  • Introduce the topics of Descriptive Statistics and Exploratory Data Analysis

Quick Disclaimer

  • Typically, when I teach these concepts in a course (like PSTAT 5A), I try to emphasize the background or details of various tools/methods.

    • I do this to try and foster a greater sense of intuition
  • In the interest of time, however, I have tried to keep this workshop fairly brief and have, at times, forgone details in favor of brevity.

  • For those who would like more information, I’ll also be uploading a series of Appendices to the website that contain further details and examples on some of the concepts I discuss today.

Part 0: The Data Science Lifecycle

Adapted from https://learningds.org/ch/01/lifecycle_cycle.html

Understanding Data

  • A key part of the Data Science Lifecycle is to understand our data.

  • This entails many different aspects.

  • One part of the process of understanding data is Exploratory Data Analysis (often abbreviated EDA).

  • We will cover different aspects of EDA over the course of the next few workshops.

Part 1: Data

What is Data?

  • According to Merriam-Webster (source), there are three definitions for data:
  1. factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation

  2. information in digital form that can be transmitted or processed

  3. information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful

  • I like the first definition, mainly because of the phrase “used as a basis for reasoning, discussion, or calculation.”
  • Data, though incredibly useful, is not the be-all and end-all; rather, it should be viewed as a stepping stone for further discussion and/or analysis!

  • As Data Scientists, data literacy (the ability to think critically about data, and to understand not only what it is saying but also the ways in which it can be manipulated to deceive) is key.

Example of Data

  • As a concrete example of a dataset, let’s explore the so-called palmerpenguins dataset.

  • Collected by Dr. Kristen Gorman at the Palmer Station in Antarctica, this dataset contains various measurements of 344 different penguins Dr. Gorman encountered.

   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
  • Notice that our data is formatted as a table. This table is what data scientists refer to as the data matrix.

    • The data matrix representation for data is one of the most popular ways to display data.

Terminology

  • Each row of the data matrix above corresponds to an individual penguin.

    • In general, we refer to a given row of the data matrix as an observational unit, or case.
  • For each penguin, we can see that there are observations on several different characteristics: species, island, bill length (in mm), bill depth (in mm), flipper length (in mm), body mass (in grams), sex, and year of observation.

    • Notice that these are the column names in our data matrix above. In general, the columns of the data matrix are referred to as variables.
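
As a minimal sketch of the row/column idea, the first four rows of the table above can be stored as a list of records. Python and the list-of-dictionaries layout are assumptions made purely for illustration (the output on these slides appears to come from R), and missing values (NA) are represented by None.

```python
# First four rows of the penguins table shown above (NA -> None).
# A list of dicts is only one of many ways to hold a data matrix;
# this layout is an illustrative assumption, not the deck's own code.
penguins = [
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": 39.1,
     "bill_depth_mm": 18.7, "flipper_length_mm": 181, "body_mass_g": 3750},
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": 39.5,
     "bill_depth_mm": 17.4, "flipper_length_mm": 186, "body_mass_g": 3800},
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": 40.3,
     "bill_depth_mm": 18.0, "flipper_length_mm": 195, "body_mass_g": 3250},
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": None,
     "bill_depth_mm": None, "flipper_length_mm": None, "body_mass_g": None},
]

# Each element of the list is one observational unit (one penguin).
n_units = len(penguins)

# The keys shared by every record are the variables (the columns).
variables = list(penguins[0].keys())

# Reading "down a column" collects every observation of one variable.
bill_lengths = [row["bill_length_mm"] for row in penguins]

print(n_units, variables)
print(bill_lengths)
```

Reading across a record gives one observational unit; reading the same key from every record gives one variable.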

Quick Summary

Important

Each row of the data matrix corresponds to a unique observational unit, and each column corresponds to a unique variable.

  • Because it is not always obvious what each variable in a given dataset actually represents, most datasets come equipped with a data dictionary

    • The data dictionary lists the variables included in the dataset as well as a brief description of each variable.

Data Dictionary Example

Variable           Description
species            The species of penguin (either Adelie, Chinstrap, or Gentoo)
island             The island on which the penguin was found (either Biscoe, Dream, or Torgersen)
bill_length_mm     The length (in millimeters) of the penguin’s bill
bill_depth_mm      The depth (in millimeters) of the penguin’s bill
flipper_length_mm  The length (in millimeters) of the penguin’s flipper
body_mass_g        The mass (in grams) of the penguin
sex                The sex of the penguin (either Male or Female)
year               The year in which the penguin was observed

Leadup

  • If we look at the different variables contained in the palmerpenguins dataset, we can see some qualitative differences.

    • For instance, the observations of species are all words/phrases whereas the observations of bill_length_mm are numbers.

    • This leads us to an important remark: there are different kinds of variables! Let’s talk about how to classify these different types.

Part 2: Classifying Variables

Numerical vs. Categorical

  • Numerical variables are variables whose observations consist of numbers.

    • Examples: heights, temperatures, number of free throws, etc.
  • Not all variables are numerical. For example, I could take a poll asking people’s opinions on the movie Barbie; the observations of this variable will most certainly not be numerical.

    • Rather, the observations of this variable will fall into one of a series of fixed categories (e.g. “Enjoyed the movie”, “Too much pink”, etc.).

    • As such, we describe non-numerical variables as categorical variables

Caution!

  • We actually need to amend our definition of numerical variables slightly.

  • For instance: suppose we have a variable that tracks the months of the year, but encoded using numbers (e.g. 01 for January, 02 for February, etc.)

  • This is a categorical variable, despite the fact that we are using numbers to encode the categories!

    • For instance, 1 plus 2 is 3, whereas January plus February is not March

Caution

The categories of a categorical variable may be encoded numerically.

  • This is much more common than you think, largely because computers are better at dealing with numbers than characters!

  • Also, we can modify our definition of numerical variables as follows: a variable is numerical if its values are numbers and it makes interpretive sense to consider adding (or subtracting) two observed values.
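
A small sketch of why numerically encoded categories are still categorical. The list of month codes below is made up for illustration, and Python is an assumed choice of language.

```python
from collections import Counter

# Hypothetical observations (not from any dataset in these slides):
# birth months encoded as 1 (January) through 12 (December).
birth_months = [1, 2, 2, 12, 7, 7, 7, 3]

# Arithmetic runs without error, but the result has no interpretation:
# an "average month" of 5.125 does not describe anything meaningful.
meaningless_mean = sum(birth_months) / len(birth_months)

# Counting how often each category occurs IS meaningful
# for a categorical variable.
month_counts = Counter(birth_months)

print(meaningless_mean)
print(month_counts.most_common(1))
```

The computer will happily add the codes, so the check of whether adding two values "makes interpretive sense" has to come from the analyst.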

Second Level of Classification

  • There is actually a second level of classification we can make, beyond just numerical vs categorical.

  • For example, even though both height and number of accidents are numerical, we know that height values can be decimals whereas number of accidents must always be an integer.

    • This leads to the division of numerical variables into discrete variables and continuous variables

  • Analogously, consider favorite color and letter grade. Both variables are categorical; however, letter grades are clearly ordered (A+ is “better” than A, which is “better” than A-, etc.) whereas favorite color values are not.

    • This leads to the division of categorical variables into ordinal variables (i.e. those that have an inherent ordering) and nominal variables.

Full Classification Scheme

  • I went through that fairly quickly, so feel free to ask me any follow-up questions!
  Variable
    ├── Numerical
    │     ├── Continuous
    │     └── Discrete
    └── Categorical
          ├── Nominal
          └── Ordinal
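
The two-level scheme can also be written down as a simple lookup table. The sketch below uses the four example variables mentioned on the preceding slides; the dictionary layout and the helper name `variables_of_kind` are hypothetical choices made for illustration.

```python
# Two-level classification: (first level, second level).
# The four examples are the ones used on the preceding slides.
CLASSIFICATION = {
    "height": ("numerical", "continuous"),
    "number of accidents": ("numerical", "discrete"),
    "favorite color": ("categorical", "nominal"),
    "letter grade": ("categorical", "ordinal"),
}


def variables_of_kind(first_level):
    """Return the example variables whose first-level type matches."""
    return sorted(
        name
        for name, (kind, _subkind) in CLASSIFICATION.items()
        if kind == first_level
    )


print(variables_of_kind("numerical"))
print(variables_of_kind("categorical"))
```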

Part 3: Introduction to Descriptive Statistics

Categorical Variables

  • Consider observations on a single categorical variable (e.g. species), \(\{x_i\}_{i=1}^{n}\).

  • Our observations \(x_i\) necessarily come from a set of \(k\) fixed categories (e.g. {Adelie, Chinstrap, Gentoo})

  • How might we produce a visual summary of our observations?

Bargraph

When summarizing observations on a categorical variable, use a barplot (aka a bargraph)

  • A barplot consists of \(k\) bars (one for each category), with the heights of the bars proportional to the frequency of occurrence for each category.

    • I.e. the height of bar 1 is proportional to the number of observations that fall into category 1, the height of bar 2 is proportional to the number of observations that fall into category 2, etc.

Example: Penguins

  • As a concrete example, here is the species variable from the palmerpenguins dataset:
  [1] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
  [8] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [15] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [22] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [29] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [36] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [43] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [50] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [57] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [64] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [71] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [78] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [85] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [92] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [99] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[106] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[113] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[120] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[127] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[134] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[141] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[148] Adelie    Adelie    Adelie    Adelie    Adelie    Gentoo    Gentoo   
[155] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[162] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[169] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[176] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[183] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[190] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[197] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[204] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[211] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[218] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[225] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[232] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[239] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[246] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[253] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[260] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[267] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[274] Gentoo    Gentoo    Gentoo    Chinstrap Chinstrap Chinstrap Chinstrap
[281] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[288] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[295] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[302] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[309] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[316] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[323] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[330] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[337] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[344] Chinstrap
Levels: Adelie Chinstrap Gentoo
  • We see there are three distinct species present in the dataset: Adelie, Chinstrap, and Gentoo.
  • Using statistical software, finding the counts within each category is relatively easy.

   Adelie Chinstrap    Gentoo 
      152        68       124 
  • So, our bargraph will have 3 bars (one for each of the 3 species), with heights proportional to 152, 68, and 124 respectively.
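
One way the counting step might look in Python (an assumed language; the output above appears to come from R). The species vector is rebuilt from the three counts shown above, and the text "barplot" drawn with `#` characters is only a stand-in for a real plotting library.

```python
from collections import Counter

# Rebuild the species variable from the counts shown above.
species = ["Adelie"] * 152 + ["Gentoo"] * 124 + ["Chinstrap"] * 68

# Frequency table: number of observations in each category.
counts = Counter(species)


def text_barplot(counts, scale=4):
    """Draw one bar per category; bar length is proportional to the count.

    `scale` (observations per '#') is an arbitrary illustrative choice.
    """
    lines = []
    for category in sorted(counts):
        bar = "#" * round(counts[category] / scale)
        lines.append(f"{category:<10} {bar} {counts[category]}")
    return "\n".join(lines)


print(text_barplot(counts))
```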

Let’s Hear From the Audience!

  • Given observations of 120 different PSTAT students’ favorite color (summarized below), approximately what proportion of the students in the sample reported either blue or gold as their favorite color?

Numerical Variables

  • Suppose now we have \(n\) observations \(\{x_i\}_{i=1}^{n}\) of a numerical variable.

  • How might we generate a summarizing plot for these observations?

  • First note that we no longer have the notion of “categories”, as we did with categorical variables.

    • For example, consider the bill_length_mm variable from the palmerpenguins dataset: it is extremely unlikely that there will be two observations that are both exactly equal to any particular value.

Discretization

  • We can, however, inject some categories into our observations.

  • For example, instead of asking how many penguins have bill lengths of exactly 33mm, we can ask how many penguins have bill lengths between 30mm and 35mm.

  • That is, our idea is to now consider intervals of values, and ask how many observations fall within each interval.

  • These intervals are often referred to as bins, and the width of each interval is called the binwidth.

    • The act of dividing our observations into these bins is called binning, or discretization.

Distribution Table

  • The resulting table of counts (i.e. counts within each bin) is called a distribution table.

  • For example, the distribution table of the bill_length_mm variable from the palmerpenguins dataset is, using a binwidth of 5mm,

[30, 35] (35, 40] (40, 45] (45, 50] (50, 55] (55, 60] 
      11       89       77      113       47        5 
  • Note the use of interval notation; if a datapoint falls on the boundary of a bin, we conventionally assign it to the left (lower) bin.
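
A minimal sketch of binning with this boundary convention, applied to the ten bill lengths visible in the table earlier (so the counts are for those ten rows only, not the full dataset). Python is an assumed language, the function names are hypothetical, and sending any value below the starting point to the first bin is a simplification.

```python
import math
from collections import Counter


def bin_index(x, start=30.0, width=5.0):
    """Index of the bin containing x, using bins [30, 35], (35, 40], ...

    A value exactly on a boundary goes to the lower bin. Values at or
    below `start` are placed in the first bin (a simplifying assumption).
    """
    return max(0, math.ceil((x - start) / width) - 1)


def distribution_table(values, start=30.0, width=5.0):
    """Count the non-missing values that fall in each bin."""
    counts = Counter(
        bin_index(x, start, width) for x in values if x is not None
    )
    table = {}
    for k in sorted(counts):
        lo = start + k * width
        hi = lo + width
        # The first bin is closed on both ends; the rest are left-open.
        label = f"[{lo:g}, {hi:g}]" if k == 0 else f"({lo:g}, {hi:g}]"
        table[label] = counts[k]
    return table


# bill_length_mm for the first ten penguins shown earlier (NA -> None).
bill_lengths = [39.1, 39.5, 40.3, None, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0]
table = distribution_table(bill_lengths)
print(table)
```

Changing `width` changes the table, which is the same sensitivity to binwidth that shows up in histograms.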

Histogram

  • We now use our distribution table in the same way we used our frequency table from before! I.e. we will draw as many rectangles as there are bins, and have the height of each rectangle proportional to the corresponding frequency in the distribution table.

  • This type of plot is called a histogram.

Caution: The Importance of Binwidth

  • Notice that our notion of a histogram is intimately tied with our choice of binwidth.

  • Different binwidths can produce wildly different histograms!

  • Here is a demo

  • In practice, it is a good idea to play around with different binwidths to find one that results in a histogram that displays a moderate amount of detail without becoming so detailed as to lose sight of the bigger picture.

Boxplots

  • It turns out there is another way to summarize numerical data visually: using what is known as a boxplot.

  • Boxplots can seem a bit peculiar at first, so let’s take a look at one together. Before diving back into the palmerpenguins dataset, let’s look at a slightly different dataset.

    • This dataset contains only one variable, which records the scores (out of 100 points) of 140 different students on a final exam.
  [1] 88.236 77.348 81.050 74.431 75.083 79.569 74.998 80.099 74.264 83.850
 [11] 89.857 81.427 79.439 84.260 78.565 77.570 78.224 73.780 88.085 79.341
 [21] 80.554 77.317 81.155 83.842 87.051 78.362 81.528 72.148 74.131 78.927
 [31] 75.446 79.791 78.199 90.769 85.640 78.420 83.484 79.045 97.909 86.736
 [41] 73.723 76.973 81.320 79.238 85.803 86.621 85.781 81.844 82.896 80.478
 [51] 75.903 84.565 76.302 83.432 85.448 69.695 81.049 85.575 84.791 82.525
 [61] 78.361 77.803 86.542 84.171 86.103 72.772 78.730 76.189 75.187 79.194
 [71] 77.159 82.048 82.661 84.021 76.008 79.474 79.015 86.992 72.524 76.094
 [81] 78.765 80.623 82.497 75.776 70.614 79.677 81.182 77.943 76.863 85.561
 [91] 89.569 96.695 73.680 77.770 81.584 81.965 78.373 76.295 73.212 79.229
[101] 87.273 87.364 82.706 83.843 75.864 82.791 82.637 78.685 72.626 69.302
[111] 93.408 73.189 83.764 77.832 82.803 80.278 94.962 79.616 85.667 82.710
[121] 86.823 76.656 74.623 71.508 91.131 78.318 81.058 86.239 76.585 85.652
[131] 77.122 86.036 83.127 83.234 80.746 83.878 75.544 73.780 81.106 85.523
  • Here is a histogram of these scores…
  • … and here is a boxplot

Anatomy of a Boxplot


Understanding Boxplots

  • Let’s discuss each of the quantities represented on the boxplot separately.

  • The first quartile is the value \(Q_1\) such that 25% of observations lie to the left of \(Q_1\).

  • The third quartile is the value \(Q_3\) such that 75% of observations lie to the left of \(Q_3\).

  • The second quartile is the value \(Q_2\) such that 50% of observations lie to the left of \(Q_2\). This is often called the median.

Whiskers

  • Finally, we discuss the role of the whiskers on the boxplot.

  • There are several different conventions for how far the whiskers extend. In some conventions, the whiskers extend to the minimum and maximum values of the data.

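  • The convention often used is the following: each whisker extends to the most extreme observation that lies within \(1.5 \times (Q_3 - Q_1)\) of the box, i.e. no lower than \(Q_1 - 1.5 \times (Q_3 - Q_1)\) and no higher than \(Q_3 + 1.5 \times (Q_3 - Q_1)\).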

  • What this means is that there may be points in our dataset that lie beyond the reach of the whiskers. These points are what we call outliers.

  • The rationale for constructing the whiskers in this way is to try and highlight any points that are unusually distant from the rest of the data.
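
The quantities on a boxplot can be computed directly. The sketch below uses the first ten exam scores listed earlier and Python's standard `statistics` module (an assumed choice of language). Software packages use slightly different rules for interpolating quartiles, so the values here may not match a given plotting tool exactly.

```python
import statistics

# First ten exam scores from the listing above.
scores = [88.236, 77.348, 81.050, 74.431, 75.083,
          79.569, 74.998, 80.099, 74.264, 83.850]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

# Whiskers may reach at most 1.5 * IQR beyond the edges of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are flagged as outliers.
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

# Each whisker ends at the most extreme observation inside the fences.
inside = [x for x in scores if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, q2, q3)
print(whisker_low, whisker_high, outliers)
```

For these ten scores no point falls outside the fences, so the whiskers end at the minimum and maximum.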

Quick Summary

  • To summarize observations of a single categorical variable, use bargraphs/barplots.

  • To summarize observations of a single numerical variable, use either histograms or boxplots.

Part 3.5: Visualizing the Relationship between Multiple Variables

Leadup

  • So, this takes care of how to summarize a single variable.

  • But, as we saw with the palmerpenguins dataset, it is entirely plausible to have a dataset that contains multiple variables and to want to visualize the relationship between some of these variables.

  • Perhaps unsurprisingly, visualizing the relationship between 3 or more variables can be a bit tricky.

  • As such, we will restrict ourselves to comparing only two variables.

Three Cases

  • Even if we compare only two variables, three cases arise:

    • Comparing two numerical variables
    • Comparing one numerical and one categorical variable
    • Comparing two categorical variables
  • We’ll discuss the first two; time-permitting, we’ll discuss the third.

Two Numerical Variables

  • Let’s say we have two variables, and we want to visualize their relationship.

  • As an example, let’s return to the palmerpenguins dataset and compare the bill_length_mm and bill_depth_mm variables. Let’s also restrict ourselves to Gentoo penguins.

   bill_length_mm bill_depth_mm
 1           46.1          13.2
 2           50            16.3
 3           48.7          14.1
 4           50            15.2
 5           47.6          14.5
 6           46.5          13.5
 7           45.4          14.6
 8           46.7          15.3
 9           43.3          13.4
10           46.8          15.4
# ℹ 114 more rows
  • Notice that each observational unit of this data matrix (consisting only of the bill_length_mm and bill_depth_mm variables) is a pair of numbers.
  • It is fairly natural, then, to imagine plotting these pairs of numbers on a Cartesian coordinate system.



Scatterplot

  • This type of visualization is called a scatterplot.

  • Specifically, when comparing two numerical variables of the same length, we generate a scatterplot by plotting each observational unit on a Cartesian coordinate system where the axes are prescribed by the variables in question.

Result

When comparing two numerical variables (of the same length), a scatterplot is the best visualization tool.

Interpreting Scatterplots

  • Let’s return to the scatterplot we generated before:

  • Notice how as the values of bill_length_mm increase, the corresponding values of bill_depth_mm also increase on average?

    • This makes intuitive sense: longer bills are probably deeper!

Trend

  • This is an example of what we call a trend; specifically, a positive linear trend.

  • Trends can be either positive or negative, and either linear or nonlinear.

    • “Positive” means a one-unit increase in x translates to an increase in y
    • “Negative” means a one-unit increase in x translates to a decrease in y
    • “Linear” means the rate of change is fixed (i.e. constant)
    • “Nonlinear” means the rate of change depends on x
  • Linear Negative Trend:

  • Nonlinear Negative Trend:

  • Linear Positive Trend:

  • Nonlinear Positive Trend:

Associations

  • By the way, another way to talk about trends is to phrase things in terms of the variables being compared.

    • For example, if the scatterplot of two variables displays a positive linear trend, we might say that the two variables have a positive linear association.

    • For example, bill length and bill depth appear to have a positive linear association, as seen in the scatterplot from a few slides ago.
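
One rough numerical check on the direction of a trend is the sign of the least-squares slope. Least squares is not covered in these slides; it is brought in here only as an illustration, using the ten Gentoo rows shown earlier and Python as an assumed language. The sign says nothing about whether the trend is linear or nonlinear.

```python
# The ten Gentoo observations shown in the table earlier.
bill_length_mm = [46.1, 50.0, 48.7, 50.0, 47.6, 46.5, 45.4, 46.7, 43.3, 46.8]
bill_depth_mm = [13.2, 16.3, 14.1, 15.2, 14.5, 13.5, 14.6, 15.3, 13.4, 15.4]


def least_squares_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


slope = least_squares_slope(bill_length_mm, bill_depth_mm)

# A positive slope corresponds to a positive trend, and vice versa.
direction = "positive" if slope > 0 else "negative" if slope < 0 else "flat"
print(direction)
```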

A Numerical and a Categorical Variable

  • The final case we will consider today is comparing a numerical variable to a categorical one.

  • As a concrete example, here is a (mock) dataset comprised of the following variables:

Variable Name  Description
stdy_hrs       average amount of time (in hrs) a student spent studying for a particular class each week
ltr_grd        the final letter grade (A+, A, A-, etc.) the student received in the class
   ltr_grd  stdy_hrs
1        A 17.758097
2        A 19.079290
3        A 26.234833
4        A 20.282034
5        A 20.517151
6        A 26.860260
7        A 21.843665
8        A 14.939755
9        A 17.252589
10       A 18.217352
11       A 24.896327
12       A 21.439255
13       A 21.603086
14       A 20.442731
15       A 17.776635
16       A 27.147653
17       A 21.991402
18       A 12.133531
19       A 22.805424
20       A 18.108834
21       B 11.728705
22       B 15.128100
23       B 11.895982
24       B 13.084435
25       B 13.499843
26       B  9.253227
27       B 19.351148
28       B 16.613492
29       B 11.447452
30       B 21.015260
31       B 17.705857
32       B 14.819714
33       B 19.580503
34       B 19.512534
35       B 19.286324
36       B 18.754561
37       B 18.215671
38       B 15.752353
39       B 14.776149
40       B 14.478116
41       C 15.873819
42       C 18.064372
43       C 13.305716
44       C 28.760302
45       C 24.435829
46       C 13.946011
47       C 17.187018
48       C 16.900051
49       C 22.509843
50       C 18.624839
51       C 20.139933
52       C 18.871540
53       C 18.807083
54       C 25.158710
55       C 17.984031
56       C 25.824118
57       C 12.030612
58       C 21.630762
59       C 19.557344
60       C 19.971737
61       C 20.708378
62       C 16.739544
63       C 17.500567
64       C 14.416411
65       C 14.176939
66       C 20.365879
67       C 21.016944
68       C 19.238519
69       C 23.150204
70       C 28.225381
71       D 12.526907
72       D  7.072493
73       D 17.017216
74       D 11.872398
75       D 11.935974
76       D 17.076714
77       D 13.145681
78       D 10.337847
79       D 14.543910
80       D 13.583326
81       D 14.017293
82       D 15.155841
83       D 12.888020
84       D 15.933130
85       D 13.338540
86       F  5.327128
87       F  8.387356
88       F  5.740726
89       F  2.696274
90       F  8.595230
  • Note that ltr_grd is categorical (in fact, it is ordinal) and stdy_hrs is numerical (specifically, continuous).

  • Now, even though the observational units of our data matrix are pairs of quantities, they are no longer pairs of numbers and it therefore no longer makes sense to plot them as points on a Cartesian coordinate system.

  • The way we get around this is, perhaps surprisingly… boxplots!

Side-by-Side Boxplots

  • This type of plot is called a side-by-side boxplot.

Result

When comparing one numerical and one categorical variable, it is best to visualize their relationship using a side-by-side boxplot.

  • Though the notion of trend is slightly different in the context of a side-by-side boxplot, we can still use them to determine relationships.

  • For example, from the plot on the previous slide, we can see that, on average, students who received lower grades tended to study less than those students who received higher grades.
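
Behind a side-by-side boxplot is a simple operation: split the numerical variable by category, then summarize each group. A minimal Python sketch (assumed language), using the first five rows of the A, D, and F groups from the mock dataset above and comparing group medians:

```python
import statistics
from collections import defaultdict

# (ltr_grd, stdy_hrs) pairs: the first five A, D, and F rows listed above.
rows = [
    ("A", 17.758097), ("A", 19.079290), ("A", 26.234833),
    ("A", 20.282034), ("A", 20.517151),
    ("D", 12.526907), ("D", 7.072493), ("D", 17.017216),
    ("D", 11.872398), ("D", 11.935974),
    ("F", 5.327128), ("F", 8.387356), ("F", 5.740726),
    ("F", 2.696274), ("F", 8.595230),
]

# Split the numerical variable according to the categorical one.
hours_by_grade = defaultdict(list)
for grade, hours in rows:
    hours_by_grade[grade].append(hours)

# One summary per category; a boxplot would draw one box per category.
medians = {g: statistics.median(h) for g, h in hours_by_grade.items()}
print(medians)
```

Each group would get its own box, with the median drawn as the line inside it.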

Causality

  • I should make a very important point: identifying trends is not the same thing as identifying causal relationships.

  • For example, the side-by-side boxplot from a few slides ago does not tell us that “studying less causes your grade to decrease”

    • There are a lot of other confounding variables that could contribute to the decrease in grade.
  • We won’t talk too much about causality right now, but it is an important thing to be aware of: association is not the same thing as causation!

Quick Summary

  • To summarize a single variable, use:
    • a barplot/bargraph if the variable is categorical
    • a boxplot or a histogram if the variable is numerical
  • To summarize the relationship between two variables, use:
    • a scatterplot if both variables are numerical
    • a side-by-side boxplot if one variable is numerical and the other is categorical

Line Plots

  • Suppose we have a set of observations on a variable x, and for each distinct value of x there is only one corresponding value y.

    • This often arises when x is time; for example, the population of a city at any given point in time is only a single value.
  • In such a case, we could generate a scatterplot of y vs x.

Example: Population

Line Plots

  • It is sometimes conventional to connect the points on such a graph with straight lines, to generate a line plot:

Part 4: Numerical Summaries

Leadup

  • Suppose we have a set of observations \(\{x_i\}_{i=1}^{n}\) of a single numerical variable \(X\).

    • Remember that observations of a single variable are just lists of numbers!
  • We have previously seen how to construct a visual summary of \(\{x_i\}_{i=1}^{n}\) through either a histogram or a boxplot.

  • We can also provide numerical summaries of \(\{x_i\}_{i=1}^{n}\) as well.

Measures of Central Tendency

  • One thing we can do is to describe the “center” of the list \(\{x_i\}_{i=1}^{n}\).

    • Quantities that describe the center of a list of numbers are often referred to as Measures of Central Tendency.
  • One widely used measure of central tendency is the arithmetic mean \[ \overline{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Interpretation of the Mean

  • If you’re curious, there is actually a very nice geometric interpretation of the arithmetic mean.

  • If we were to “plot” our numbers \(\{x_i\}_{i=1}^{n}\) on a number line by placing a marble of equal weight at each of the values \(x_1, \cdots, x_n\), then the arithmetic mean is actually the point at which a fulcrum can be placed to ensure the entire system remains balanced:

The Median

  • Another measure of central tendency is the median, which, as previously stated, divides the list of numbers \(\{x_i\}_{i=1}^{n}\) in half.

  • More concretely, here is the procedure to find the median of a set of numbers:

  1. Line up the numbers in ascending order
  2. Cross off the first and last numbers
  3. Cross off the first and last numbers that haven’t yet been crossed
  4. Continue until either there is only a single number left uncrossed (in which case this number is the median), or there are two numbers left uncrossed (in which case the mean of these two numbers is the median).
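
The crossing-off procedure translates almost line for line into code. The Python below (an assumed language; the function name is hypothetical) is checked against the standard library's `statistics.median`, using the first few runtimes from the movie table later in these slides.

```python
import statistics


def median_by_crossing_off(values):
    """Median via the cross-off procedure described above."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    remaining = sorted(values)           # Step 1: line up in ascending order
    while len(remaining) > 2:            # Steps 2-3: cross off both ends
        remaining = remaining[1:-1]
    # Step 4: one number left -> that number; two left -> their mean.
    return sum(remaining) / len(remaining)


odd_case = [94, 60, 118, 70, 122]
even_case = [94, 60, 118, 70, 122, 100]

print(median_by_crossing_off(odd_case), statistics.median(odd_case))
print(median_by_crossing_off(even_case), statistics.median(even_case))
```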

Mean vs. Median

  • One thing to note is that the median is more robust than the mean.

  • What this means, loosely speaking, is that it is less affected by outliers.

  • For a bit more understanding of what this means, I refer you to the Appendices I will post later.
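
A quick illustration of robustness in Python (an assumed language). The first list holds five runtimes from the movie table later in these slides; the second copy contains a deliberately made-up data-entry error (1220 in place of 122).

```python
import statistics

runtimes = [94, 60, 118, 70, 122]

# Hypothetical typo: 122 minutes entered as 1220 minutes.
runtimes_with_outlier = [94, 60, 118, 70, 1220]

mean_before = statistics.mean(runtimes)
mean_after = statistics.mean(runtimes_with_outlier)

median_before = statistics.median(runtimes)
median_after = statistics.median(runtimes_with_outlier)

# The mean is dragged far toward the outlier; the median does not move.
print(mean_before, mean_after)
print(median_before, median_after)
```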

Measures of Spread

  • Sometimes, it may be useful to try and describe how “spread out” a list of numbers is.

  • There are several different metrics used as measures of spread.

  • One is the range, which is simply computed as the largest number minus the smallest number: \[ \mathrm{range}\left(\{x_i\}_{i=1}^{n}\right) = \max_{1 \leq i \leq n} \{x_i\} - \min_{1 \leq i \leq n} \{x_i\} \]

The Variance

  • Another often-used measure of spread is the variance, which is computed using the formula \[\begin{align*} s_X^2 & = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \overline{x})^2 \\ & = \left(\frac{1}{n - 1} \sum_{i=1}^{n} x_i^2 \right) - \frac{n}{n - 1} \cdot (\overline{x})^2 \end{align*}\]
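
  • For those curious, the second line follows from expanding the square and using \(\sum_{i=1}^{n} x_i = n \overline{x}\): \[\begin{align*} \sum_{i=1}^{n} (x_i - \overline{x})^2 & = \sum_{i=1}^{n} x_i^2 - 2 \overline{x} \sum_{i=1}^{n} x_i + n (\overline{x})^2 \\ & = \sum_{i=1}^{n} x_i^2 - 2n (\overline{x})^2 + n (\overline{x})^2 \\ & = \sum_{i=1}^{n} x_i^2 - n (\overline{x})^2 \end{align*}\] Dividing both sides by \(n - 1\) gives the second form of \(s_X^2\).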

Interpretation of The Variance

  • The variance can be interpreted as a sort of average squared distance of the points from the mean.

Standard Deviation

  • Note, though, that the units of variance are necessarily the square of the original units of measurements.

    • For example, if \(\{x_i\}_{i=1}^{n}\) represents a series of measurements made in inches, then the units of the variance \(s_X^2\) are squared-inches.
  • This can make interpreting the variance a little tricky in some cases.

  • As such, data scientists and statisticians often take the square root of variance to obtain the standard deviation: \[ s_X = \sqrt{s_X^2} \]

    • Note, though, that mathematically both the variance and the standard deviation provide the same information.

Interquartile Range

  • Finally, another popular measure of spread is the Interquartile Range (or IQR for short), defined to be the third quartile minus the first quartile: \[ \mathrm{IQR} = Q_3 - Q_1 \]

  • The IQR is a more robust description of spread than variance.
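
The measures of spread from this part (range, variance, standard deviation, and IQR) can be sketched on a tiny list. The five numbers below are made up, and Python is an assumed language. The quartiles come from the standard library's `statistics.quantiles`, whose default interpolation rule may differ from other software.

```python
import math
import statistics

# Hypothetical observations, chosen so the arithmetic is easy to follow.
x = [1, 2, 3, 4, 5]
n = len(x)
x_bar = sum(x) / n

# Range: largest value minus smallest value.
data_range = max(x) - min(x)

# Variance, first from the definition and then from the shortcut form.
variance_definition = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
variance_shortcut = (
    sum(xi ** 2 for xi in x) / (n - 1) - n / (n - 1) * x_bar ** 2
)

# Standard deviation: square root of the variance (original units).
std_dev = math.sqrt(variance_definition)

# Interquartile range: third quartile minus first quartile.
q1, _median, q3 = statistics.quantiles(x, n=4)
iqr = q3 - q1

print(data_range, variance_definition, variance_shortcut, std_dev, iqr)
```

Both variance formulas agree, which is a useful sanity check when computing by hand.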

Part 5: A Trip To the Movies!

Leadup

Question

Have movies gotten longer or shorter over the years?

  • This is a question I found myself curious about.

    • As a teen, I remember movies being consistently around 90-ish minutes in length whereas nowadays I see movies being around 120-ish minutes in length.
  • As indicated by the Data Science Lifecycle, the next step is to procure data!

The Data

  • The Internet Movie Database (IMDb) provides a great repository of data on pieces of film media.

  • This data can be accessed at this link, if you’re curious.

  • For this particular mini-project, we will only need a subset of this dataset.

  • In fact, to make things even easier on ourselves, we will restrict our consideration only to films released since the start of the millennium.

    year runtime
1   2021      94
2   2000      60
3   2001     118
4   2020      70
5   2018     122
6   2005     100
7   2002     126
8   2009      58
9   2022      65
10  2017      80
11  2000      80
12  2006      90
13  2023      NA
14  2021     100
15  2004      96
16  2023      86
17  2002      95
18  2006     125
19  2015      NA
20  2017      NA
21  2002      97
22  2000      86
23  2000      NA
24  2000     100
25  2002      85
26  2001      90
27  2001     105
28  2008     100
29  2000      91
30  2000     167
31  2000      86
32  2000     180
33  2001     101
34  2009     101
35  2003      NA
36  2008      83
37  2005      72
38  2001      90
39  2001     145
40  2001     104
41  2001     100
42  2000      98
43  2000      96
44  2000      99
45  2000      94
46  2002     132
47  2001      89
48  2007      83
49  2000      88
50  2000      86
51  2001      NA
52  2003      90
53  2000      98
54  2022     101
55  2000     104
56  2000      75
57  2008      82
58  2002      91
59  2000     160
60  2001      86
61  2000     105
62  2000      98
63  2000     110
64  2000      87
65  2000      96
66  2022     108
67  2003     102
68  2001     100
69  2000     123
70  2000      84
71  2005     106
72  2000      78
73  2002     123
74  2001     122
75  2000      94
76  2000      74
77  2001     178
78  2000     122
79  2000     123
80  2002     100
81  2001     112
82  2001     111
83  2002      NA
84  2000     104
85  2002      88
86  2000      94
87  2000      78
88  2005      77
89  2002     142
90  2005     140
91  2000      NA
92  2000      99
93  2004      89
94  2000     115
95  2002      NA
96  2000     108
97  2000      90
98  2001      84
99  2001      92
100 2001     123
  • In total, the data matrix includes 166,958 observational units (only the first 100 are shown above).

Results

  • With a bit of work, we can compute the average (mean) runtime of films released every year since 2000 and use this to generate a line plot.

    • I’ll be going over the code to do this in a future workshop.
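
In the meantime, here is a rough idea of what the computation involves. This is not the code used to produce the plot; it is a Python sketch (assumed language, hypothetical function name) on the first thirteen rows of the table above, and it assumes that films with a missing runtime are simply left out of the average.

```python
from collections import defaultdict

# (year, runtime) for the first thirteen rows shown above (NA -> None).
movies = [
    (2021, 94), (2000, 60), (2001, 118), (2020, 70), (2018, 122),
    (2005, 100), (2002, 126), (2009, 58), (2022, 65), (2017, 80),
    (2000, 80), (2006, 90), (2023, None),
]


def mean_runtime_by_year(rows):
    """Average runtime per year, skipping missing runtimes (an assumption)."""
    runtimes = defaultdict(list)
    for year, runtime in rows:
        if runtime is not None:
            runtimes[year].append(runtime)
    # Sorting by year gives the points of the line plot in order.
    return {year: sum(v) / len(v) for year, v in sorted(runtimes.items())}


yearly_means = mean_runtime_by_year(movies)
print(yearly_means)
```

Plotting the years against these means, and connecting consecutive points, gives a line plot of the kind described above.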

Discussing the Plot

  • Notice that, based on the plot alone, we have a pretty good idea of how to answer our original question!

  • Of course, we could provide a more statistical answer using things like Hypothesis Testing (to formally test whether the average runtime of films has changed over the years).

  • But, for now, I hope you see how the act of simply generating an appropriate visualization can be incredibly illuminating!

Part 6: Good Graphics vs. Bad Graphics

Leadup

  • I would be remiss if I were to not impart to you some tips and tricks for making good visualizations.

  • Let’s start off by roasting some bad ones.

  • For each of the following graphs, let’s discuss some things that could be improved upon!

  • Source: Many of these plots come from https://www.businessinsider.com/the-27-worst-charts-of-all-time-2013-6

Graph 1

Graph 2

Graph 3

Graph 4

Scale

  • Speaking of scale, keep in mind the dangers of rescaling plots!

Plot; Scale 1

Plot; Scale 2

Plot; Scale 3

Colorblind Accessibility

  • Finally, I should also mention that there are certain steps we can (and should) take to make our plots more colorblind-accessible.

  • One piece of advice many people suggest is to avoid using complex background visuals.

  • Another piece of advice is to leverage packages and built-in color schemes in programming languages.

    • We will talk about this a bit more in a future workshop.

General Principles

  1. Make sure a plot is needed!

  2. Keep things simple.

  3. Be clear about your scale and axes.

Additional References

  1. OpenIntro Statistics by Diez, Çetinkaya-Rundel, and Barr (https://www.openintro.org/book/os/)

    • Free and high-quality PDF available online, courtesy of the publishers!


  2. Statistics, 4th ed. by Freedman, Pisani, and Purves

Thank You!