Statistics

Important Concepts of Statistics and Probability

ENV* K245 Water Resources Engineering home

Note: Precipitation homework, due 10/5: McCuen, Hydrologic Analysis and Design, Chapter 4, Problems 4.2, 4.6, 4.7, 4.9, 4.10, 4.15

Descriptive statistics

These summarize a set of observations

In general these might be:

Heights of a group of people
Test scores
Size variation on a manufactured part

In hydrology some examples are:

Annual maximum 24 hour rainfall
Lowest flow in a river within a ten day period of each year
Number of days between rainfall events

Commonly used statistics include:

Measures of central tendency—These say where the middle of a group of observations is.

mean = average (aka μ or x-bar)

Add up the values and divide by the number of values: x-bar = Σx/n

median = the middle value
mode = the most common value

Measures of variability—These tell us how spread out a set of numbers are.

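standard deviation = the square root of the average squared difference between each number and the mean (for a sample, the sum of the squared differences is divided by n − 1 rather than n)

This definition can be manipulated algebraically to give a form that is easier to calculate: s = sqrt[(Σx² − (Σx)²/n) / (n − 1)]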

range = the difference between the highest and lowest numbers

Example values: 4, 5, 6, 6, 6, 3, 4, 5, 6, 2

mean: 4.7
median: 5
mode: 6

sd: 1.4
range: 4
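
As a check on the example above, here is a minimal Python sketch that recomputes each statistic from the ten example values using only the standard library. It assumes the two columns are read together as one set of ten numbers and that the standard deviation is the sample form (dividing by n − 1), which is the form that reproduces the 1.4 shown. The variable names are illustrative.

```python
import math
import statistics

# The ten example values from the notes, read as a single set.
values = [4, 5, 6, 6, 6, 3, 4, 5, 6, 2]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
value_range = max(values) - min(values)

# Sample standard deviation (assumption: divide by n - 1, not n).
sd_sample = statistics.stdev(values)

# The same quantity from the easier-to-calculate form,
# which needs only the sum of x and the sum of x squared.
n = len(values)
sum_x = sum(values)
sum_x2 = sum(x * x for x in values)
sd_shortcut = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(f"mean:   {mean:.1f}")
print(f"median: {median}")
print(f"mode:   {mode}")
print(f"sd:     {sd_sample:.1f} (shortcut form: {sd_shortcut:.1f})")
print(f"range:  {value_range}")
```

Dividing by n instead of n − 1 would give a slightly smaller standard deviation, so it is worth checking which form a calculator or spreadsheet function uses.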

Inferential statistics

These allow us to make inferences beyond the observations

For example, suppose that within a particular group of 100 children, girls scored higher than boys on a particular test. Does that mean that it is likely that another group of girls will also score higher than another group of boys?

Hypothesis testing

From a given set of numbers (e.g., scores) we can determine the probability that a certain distribution occurred by chance.

If the probability is less than some value, say 5% or 1%, then we say that the finding is statistically significant.

I don’t think we will use inferential statistics in this class, but examples include t-value and F-value.

Samples vs populations

A population is a theoretical universe of all possible values for some variable.

Sometimes this actually exists, for example the height of every person in the US.
In other cases, it is something we can imagine but that doesn’t really exist, for example the set of all possible values for monthly rainfall in the month of September.

Whether the population is real or hypothetical, we usually do not have access to every single value.

But, and this is important, we can describe the probability of obtaining a certain value if we pick a sample at random.
For example, if the average height of an American man is 70 inches and heights are distributed symmetrically about that average, the probability is 50 percent that the next man to walk down the hall is at least 70 inches tall.

Just as samples can be described by statistics (e.g., mean and standard deviation), so can populations.

For populations we call these values parameters rather than statistics.
We often add the word "population" to the name and we use Greek letters to abbreviate parameters.
So:

population mean = μ (pronounced "mu")
population standard deviation = σ (pronounced "sigma")

There’s a very important difference between statistics and parameters

Statistics are numbers we calculate from a set of observations.

They will change depending on what sample we use.

Parameters are unchanging numbers that describe a population.

Here is an exercise on samples and populations.

We usually cannot study the entire population (i.e., all people or every possible rainstorm).

But we often assume that a set of observations, a sample, is drawn from a certain population.
If a sample is from a population, the statistics for that sample are estimates of the parameters of the population. (This is an important point that is not always easy to grasp the first time you see it.)
The course called "Statistics" teaches a set of techniques that use mathematics and probability theory to draw conclusions from observations.
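
The difference between statistics and parameters can be seen in a small simulation. The sketch below is only an illustration: the population is assumed to be normal with mean 70 inches (the figure used in the height example) and a standard deviation of 3 inches, which is an invented value. The seed, the sample size of 25, and the function name draw_sample are also arbitrary choices.

```python
import random
import statistics

# Hypothetical population parameters (assumed for illustration only).
MU = 70.0     # population mean height, inches
SIGMA = 3.0   # population standard deviation, inches (invented value)

rng = random.Random(245)  # fixed seed so the run is repeatable


def draw_sample(size):
    """Draw `size` heights at random from the assumed normal population."""
    return [rng.gauss(MU, SIGMA) for _ in range(size)]


sample_means = []
for i in range(5):
    sample = draw_sample(25)
    x_bar = statistics.mean(sample)   # sample statistic: changes each time
    s = statistics.stdev(sample)      # sample statistic: changes each time
    sample_means.append(x_bar)
    print(f"sample {i + 1}: mean = {x_bar:.2f}, sd = {s:.2f}")

# The parameters never change, whatever sample we happen to draw.
print(f"parameters: mu = {MU}, sigma = {SIGMA}")
```

Each sample gives a somewhat different mean and standard deviation, all scattered around the fixed parameters they estimate.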

Random Variables

A random variable is a quantity whose value is determined by chance, giving a series of numbers that we can consider one at a time.

For those of you who like symbols, note that the random variable is often abbreviated with a capital letter, X, while any particular value is abbreviated with a lowercase letter, often subscripted, x₁.
The set of all possible values for the random variable is a population.
A set of some particular values is a sample, (x₁, x₂, x₃, …, xₙ).

We’ve already considered some examples.

Height is a random variable and so is maximum annual 24-hour rainfall.

Distribution

A distribution is a mathematical description of how values are distributed in a population.
There are several well known distributions, including:

The uniform distribution, such as the distribution of values shown on the faces of a single die

All values are equally likely, e.g., 1 through 6

The binomial distribution, such as the number of heads in a fixed number of coin tosses
The normal distribution, also called the bell curve

Many random variables are assumed to be distributed according to the normal distribution.
The probability that a normally distributed random variable falls above, below, or between given values can be found using a table of z values, Excel, or other techniques.

From a lesson on using Excel to draw a normal curve, http://www.tushar-mehta.com/excel/charts/normal_distribution/
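
Besides a table of z values or Excel, normal-distribution probabilities can be computed with the NormalDist class in Python's standard library. The sketch below is a hedged illustration: the mean of 70 inches echoes the height example, while the standard deviation of 3 inches and the cutoff values are assumptions made up for the demonstration.

```python
from statistics import NormalDist

# Assumed example: heights with mean 70 in and standard deviation 3 in.
# The standard deviation is an invented value, not from the notes.
heights = NormalDist(mu=70.0, sigma=3.0)

x = 73.0
z = (x - heights.mean) / heights.stdev   # standardized (z) value

p_below = heights.cdf(x)                 # P(X <= 73)
p_above = 1.0 - p_below                  # P(X > 73)
p_between = heights.cdf(73.0) - heights.cdf(67.0)   # P(67 < X <= 73)

print(f"z = {z:.2f}")
print(f"P(X <= {x}) = {p_below:.4f}")
print(f"P(X >  {x}) = {p_above:.4f}")
print(f"P(67 < X <= 73) = {p_between:.4f}")
```

The z value is what you would look up in a printed table; the cdf call does the lookup directly.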

Frequency Analysis

we often speak of the 2-year storm, the 10-year storm, the 25-year flood, etc.

the T-year storm or flood is the storm or flood of an intensity that will, on average, be equaled or exceeded once in T years
we can also talk about a T-year drought or low Q_N, which would be the conditions equaled or fallen below, on average, once in T years
T is called the recurrence interval or return period

these things are surprisingly easy to calculate from a series of yearly data

generally speaking we want at least 10 years of data or T/2 years, whichever is greater (e.g., we can estimate the 100-year storm from 50 years of data)

the general idea is to draw up a plot of intensity versus frequency or probability (on probability paper) and read off the intensity that corresponds to P = 1/T
to do this:

rank the n items of data from highest to lowest (if we want a rare high event) or lowest to highest (if we want a rare low event), and assign each item a number m corresponding to its rank

calculate a probability P that the item at rank m will be exceeded:

Hazen’s formula gives:

P = (2m-1)/(2n); the third highest item of ten would have P = (2*3-1)/(2*10) = 0.25

[or sometimes we might use P = m/(n+1) so for example the 3rd highest item of 10 would be P = 3/(10+1) = 0.27]

we are assuming that we’re looking for rare high values

plot each item against its probability
sketch the line that fits the points and find the rainfall, etc, that corresponds to P=1/T
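
The two probability formulas in the steps above can be compared side by side. This is a minimal Python sketch in which m is the rank and n is the number of items, as in the ranking step. The function names hazen and weibull are illustrative; the second formula, rank divided by the number of items plus one, is commonly known as the Weibull plotting position, a name the notes themselves do not use.

```python
def hazen(m, n):
    """Hazen's formula: probability for the item at rank m out of n items."""
    return (2 * m - 1) / (2 * n)


def weibull(m, n):
    """Alternative formula, rank / (number of items + 1)."""
    return m / (n + 1)


n = 10  # ten items of data, as in the worked example
for m in range(1, n + 1):
    print(f"rank {m:2d}: Hazen P = {hazen(m, n):.3f}, "
          f"alternative P = {weibull(m, n):.3f}")
```

For the third highest item of ten, the two functions reproduce the values 0.25 and 0.27 worked out above. The two formulas differ most at the extreme ranks, which are the ones that matter when reading off rare events.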

Given these data for maximum flow at Babbling Brook during the month of August, what is the 20-year flood?

Year	QH (cfs)
1963	490
1964	440
1965	460
1966	550
1967	430
1968	360
1969	510
1970	410
1971	390
1972	470


a very similar thing could be done mathematically:

find the mean and standard deviation of the items:

mean = X = Σx/n

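look up in a table of z values (i.e., the standard normal distribution) the z value corresponding to an exceedance probability P = 1/T and call this value K (in a table that lists the area between the mean and z, this is the z value for the entry 0.5 - P)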
our T-year event is then given by:

x_T = X + Ks, where X is the mean and s is the standard deviation of the items
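
The mathematical method can be sketched in Python for the Babbling Brook flows given earlier. This is an illustration under stated assumptions: the flows are treated as normally distributed, s is taken as the sample standard deviation (dividing by n − 1), and K is obtained from NormalDist.inv_cdf in the standard library instead of a printed z table.

```python
from statistics import NormalDist, mean, stdev

# Annual August maximum flows at Babbling Brook, cfs (1963 to 1972).
flows = [490, 440, 460, 550, 430, 360, 510, 410, 390, 470]

T = 20        # return period, years
P = 1 / T     # probability of being equaled or exceeded in any one year

x_bar = mean(flows)   # X in the notes
s = stdev(flows)      # sample standard deviation (assumption: n - 1 form)

# K is the z value with probability P above it, i.e. 1 - P below it.
K = NormalDist().inv_cdf(1 - P)

x_T = x_bar + K * s

print(f"mean = {x_bar:.1f} cfs, s = {s:.1f} cfs")
print(f"K for T = {T} years: {K:.3f}")
print(f"estimated {T}-year flow: {x_T:.0f} cfs")
```

Under these assumptions the estimate comes out near 545 cfs, close to the highest observed flow, which is consistent with the graphical method. A different distribution, such as one fitted to the logs of the values, would give a different K and a different estimate.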

Homework exercise: What is the 50-year maximum precipitation for the month of September at the Norwich Public Utility Station? Use the monthly data in the link below. Plot either the original values on log-prob paper or the log of the values on arith-prob paper.


Anthony G Benoit abenoit@trcc.commnet.edu
(860) 885-2386

Revised