Tutorial (in progress)
Random Variables and Probability
Functional Notation
The expression y = f(x) indicates an algebraic relationship between the dependent variable, y, and the independent variable, x. Select a value of x and the function returns the corresponding value of y, wherever the function is defined. The variable y is said to be a function of the variable x.
Mathematical Equivalence
The expression, y = f(x), is also a statement of mathematical equivalence. Any mathematical operation that we can do on the right hand side can also be done on the left hand side. Thus, the equation, y = f(x), can be transformed to a new equation, which may be more useful. For example, taking the natural log of both sides gives us ln(y) = ln[f(x)].
Variation and Transformations
If two values of x, x1 and x2, are specified, then the functional relationship generates two values of y: y1, corresponding to the input value x1, and y2, corresponding to the input value x2. If we take the difference in the y values and divide by the difference in the x values, we have an estimate for the rate of change in y per unit change in x over the range [x1, x2]. In the limit as x2 approaches x1 we have the derivative of the function at x = x1. In calculus this is written dy/dx|x=x1 = f’(x1), and is read “the derivative of y with respect to x, evaluated at x1, equals the first derivative of f, f’(x), evaluated at x = x1.” It can be interpreted as the rate of change of y as the value of x varies around x = x1, or the slope of the y vs. x curve at x = x1.
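As a concrete illustration of the difference-quotient idea, here is a minimal Python sketch (the function f(x) = x² and the point x1 = 3.0 are arbitrary choices for the example) that shrinks the interval [x1, x2] and watches the ratio approach the exact derivative:

```python
# Sketch: approximate dy/dx at x1 by shrinking the interval [x1, x2].
# The function f(x) = x**2 and the point x1 = 3.0 are arbitrary choices.
def f(x):
    return x ** 2

x1 = 3.0
for dx in (1.0, 0.1, 0.01, 0.001):
    x2 = x1 + dx
    slope = (f(x2) - f(x1)) / (x2 - x1)   # difference quotient over [x1, x2]
    print(f"dx = {dx:6.3f}   slope = {slope:.4f}")

print("exact derivative at x1:", 2 * x1)  # d(x^2)/dx = 2x
```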
Types of Functional Dependence
A few of the many types of functional dependence are described here.
Linear: y = a + bx is the equation of a straight line in the (x, y) plane. The intercept on the y axis (the value of y for x = 0) is equal to the value of the constant “a”, and the slope of the line is constant, with the value of the coefficient “b”. In other words, if x changes by one unit, then y changes by b units. Notice that “a” and “b” are arbitrary labels and could be replaced with any convenient symbols to denote the constant and the coefficient.
Quadratic: y = a + bx + cx² is the equation of a parabola that opens upward or downward (not to either side). The intercept on the y axis (the value of y when x is zero) is again the value of the constant, “a”. The variable y has a maximum or a minimum value at x = -b/2c, corresponding to the point where the slope changes sign.
Polynomial: y = a + Σᵢ₌₁ⁿ bᵢxⁱ (that is, y = a + b₁x + b₂x² + … + bₙxⁿ) is the general polynomial form. Again the y intercept is the value of the constant. The linear and quadratic forms are special cases, with n = 1 for linear and n = 2 for quadratic. As long as bₙ is not zero the polynomial has degree n: its first n derivatives are non-zero, and every higher-order derivative vanishes.
Exponential: z = ceᵇˣ indicates that z varies exponentially with x. The constant “c” acts as a scale factor and is the value of z at x = 0 (the intercept on the vertical axis). The slope of z is cbeᵇˣ, which also varies exponentially with x. For c > 0 the equation can be transformed by taking the natural log of both sides, giving ln(z) = ln(c) + bx. If we change variables, letting y = ln(z) and a = ln(c), then we have a linear equation relating y and x, y = a + bx. Look familiar?
Power: z = cwᵇ indicates that z and w are related through a non-linear power relationship. Taking the natural log of both sides (for positive c and w) gives ln(z) = ln(c) + b·ln(w). Changing variables, letting y = ln(z), a = ln(c) and x = ln(w), results once more in a linear equation, y = a + bx.
Similar transformations will turn out to be quite useful later, when we attempt to use data to infer the slope(s) and intercepts of our function of x.
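As a quick numerical check of these transformations, the sketch below (the values of c and b are arbitrary choices for illustration) generates exponential and power-law data and verifies that, after taking logs, a straight-line fit recovers intercept ln(c) and slope b:

```python
import numpy as np

# Arbitrary example parameters; any c > 0 and b would do.
c, b = 2.5, 0.7

# Exponential: z = c * exp(b * x)  ->  ln(z) = ln(c) + b * x
x = np.linspace(0.5, 5.0, 10)
z_exp = c * np.exp(b * x)
slope, intercept = np.polyfit(x, np.log(z_exp), 1)   # fit a straight line
print("exponential: slope ~", slope, " intercept ~", intercept, " ln(c) =", np.log(c))

# Power: z = c * w**b  ->  ln(z) = ln(c) + b * ln(w)
w = np.linspace(0.5, 5.0, 10)
z_pow = c * w ** b
slope, intercept = np.polyfit(np.log(w), np.log(z_pow), 1)
print("power:       slope ~", slope, " intercept ~", intercept, " ln(c) =", np.log(c))
```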
The above types of functional dependence take a value of x on the real number line and produce a value for y on the real number line. The variable y is called a continuous function of x. Sometimes the values for x and y are restricted to discrete values. For example, if y = 2x and x is restricted to integer values, then y takes only even integer values. We say that y is defined over the set of integers (its domain) and that its range consists of the even integers.
Random Variation
Suppose that y takes a value that is close to f(x) but has an additional component that varies randomly. Our original representation of y in terms of x, y = f(x), is no longer complete. We need an added term to represent the random variation around f(x). Symbolically we say that y = f(x) + e, where e is a zero mean random variable. For any input value of x the value of y is determined by the value of f(x) and the value of e, selected at random from the probability distribution describing e. The next time we enter the same value of x we may not get the same value of y because of the randomness of e.
Notice that if the function, f(x), is constant then y has a mean value equal to the constant, but any particular value for y will include the constant plus the random variation. The variation in y is purely random with no x dependence at all.
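A minimal simulation of this idea, assuming for the sketch a linear f(x) = a + bx and normally distributed noise e (both assumptions are illustrative, not part of the definition):

```python
import numpy as np

rng = np.random.default_rng(0)

a, b = 1.0, 2.0                      # assumed constant and coefficient
sigma = 0.5                          # assumed noise standard deviation

def f(x):
    return a + b * x

x = 1.5
for trial in range(3):
    e = rng.normal(0.0, sigma)       # zero-mean random component
    y = f(x) + e
    print(f"trial {trial}: f(x) = {f(x):.2f}, e = {e:+.3f}, y = {y:.3f}")
# Same x each time, but y varies because of the random term e.
```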
Probability Distributions and Processes
Probability weighting is a way of describing the relative frequency of occurrence for all the possible outcomes of a process. For outcomes defined on the real number line or a line segment, the probability distribution takes the form of a probability density function. The density function describes, for each possible outcome, the rate of change in the cumulative probability. In other words, if x is a random variable that ranges from xmin to xmax, and the cumulative probability increases by an amount φ(x)dx as x is increased from x to x + dx, then φ(x) is the probability density function for the random variable x. The probability density function is normalized (scaled) so that adding up all the disjoint (separate) increments from xmin to xmax produces a total of 1.0.
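As a small numerical check of the normalization, consider a uniform density φ(x) = 1/(xmax - xmin) on a range [xmin, xmax] (the range below is an arbitrary choice); summing the increments φ(x)dx over the whole range gives 1.0:

```python
import numpy as np

x_min, x_max = 2.0, 7.0                  # arbitrary range for the example
density = 1.0 / (x_max - x_min)          # uniform density on [x_min, x_max]

dx = 0.001
x = np.arange(x_min, x_max, dx)
total = np.sum(density * dx * np.ones_like(x))   # add up phi(x) * dx increments
print("sum of increments ~", total)              # ~ 1.0
```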
For discrete variables, those whose outcomes form a countable set rather than a continuum, each possible outcome is weighted by a probability, and the entire set of probabilities over the outcome space is referred to as a probability mass function, or just a probability distribution.
A simple example of generating a discrete distribution to describe the random outcomes of a process is the following:
Consider the process of throwing a single cubical die with uniform faces and uniform weighting, the faces numbered 1 to 6. On any individual throw the outcome (the number on the top face) is equally likely to be any integer from 1 to 6. Assuming a fair die, we assign the probability 1/6 to each possible outcome. We have just created a probability mass function for the process of throwing a single die. This form of distribution has a name: the discrete uniform distribution. For any set of n equally likely outcomes the probability is 1/n for each outcome.
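A short simulation of this process (the number of throws is an arbitrary choice) shows the observed frequencies settling near 1/6 for each face:

```python
import numpy as np

rng = np.random.default_rng(1)
throws = rng.integers(1, 7, size=100_000)     # fair six-sided die: integers 1..6

for face in range(1, 7):
    freq = np.mean(throws == face)
    print(f"face {face}: observed frequency {freq:.4f}  (expected {1/6:.4f})")
```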
An even simpler example is the process of tossing a single “fair” coin. This process has only two outcomes, heads or tails. To make a probability distribution out of this we assign the value 1 to heads and the value 0 to tails, and assign each outcome a probability of ½. This is a special case of the discrete uniform distribution.
In this section a few of the many “named” probability distributions will be described. A named distribution is one that is based on a generic process and for which a parametric form of the probability distribution over outcomes has been defined. The power of these distributions is that a few parameters describe a complete distribution of probability weights or densities. Estimating the parameters becomes a matter of observing the process and analyzing the data.
Bernoulli Distribution: Many processes can be described as having binary output similar to the coin toss: success or failure, win or lose, accept or reject, and so forth. The two outcomes need not be equally likely. This type of process is called a Bernoulli process, and each trial is called a Bernoulli trial. The Bernoulli distribution has possible outcomes 0 and 1 and is completely defined by the probability of “success”, p, which is the probability that, on each individual trial, the event associated with the outcome “1” is the actual result. An important restriction here is that the individual trials in the process are independent. That is, the probability distribution over outcomes for any trial is not affected by the outcomes of previous trials.
Binomial Distribution: Imagine a process where a preset number of Bernoulli trials is run and the output of interest is the number of “successes” taken in any order. For example, ten tosses of a fair coin are made and the total number of heads is of interest. The outcome can range from 0 to 10 but the probability distribution over the outcome space is not the same for each outcome. There is only one way to get zero heads and that is if every toss results in a tail. Similarly there is only one way to get 10 heads and that occurs if every toss results in a head. However, there are 10 ways to get the result one head, since any one of the 10 tosses can result in a head and all the others tails. Similarly there are ten ways to get 9 heads, since this is equivalent to tossing only one tail. For two heads or two tails we find 45 paths each, for three heads or three tails we find 120 paths each, for four heads or four tails we find 210 paths and finally, for 5 heads and 5 tails we find 252 paths. The numbers of paths correspond to the coefficients in the binomial expansion of (1 + x)¹⁰. Hence, the name binomial distribution.
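The path counts quoted above are binomial coefficients, which can be checked directly; the short sketch below uses Python’s standard-library routine math.comb (“n choose k”):

```python
from math import comb

n = 10   # ten coin tosses
for k in range(n + 1):
    print(f"{k:2d} heads: {comb(n, k):3d} paths")   # 1, 10, 45, 120, 210, 252, ...

# With a fair coin, the probability of exactly k heads is comb(n, k) * 0.5**n.
```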
In many applications data are taken in order to estimate the probability of success for a process. In such endeavors we want to know two things: first, our estimate of the parameter p for the process, and second, our estimate of the range of values that p might take at different levels of confidence (the probability that the actual probability of success for the process is in the interval around our estimate).
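A minimal sketch of both steps, assuming the data are a run of independent Bernoulli trials and using the common normal-approximation confidence interval (other interval constructions exist; the trial count and “true” p are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
p_true = 0.3                                  # assumed here; unknown in practice
trials = rng.random(500) < p_true             # 500 Bernoulli trials (True = success)

n = trials.size
p_hat = trials.mean()                         # point estimate of p
se = np.sqrt(p_hat * (1 - p_hat) / n)         # standard error of the estimate
z = 1.96                                      # ~95% confidence multiplier
print(f"estimate p = {p_hat:.3f}, "
      f"95% interval [{p_hat - z*se:.3f}, {p_hat + z*se:.3f}]")
```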
Uniform Distribution: Any value of x within a fixed range, [x1, x2], is equally likely, with probability density 1/(x2 - x1).
Regression Analysis
Very often we have observations on a variable, call it y, and simultaneous observations on another variable, x. We would like to know whether a relationship exists between y and x, and the nature of that relationship. Using techniques of data analysis we seek a best representation of y in terms of x, in effect separating y into its functional and random components. Very often we assume an appropriate functional form and try to estimate the constant and the parameters (coefficients) that best describe the relationship, given the data. This process is called linear regression. It is linear in the coefficients, even if the functional dependence is non-linear. It is regressive because, rather than specifying values of x and computing y from known values of the constant and coefficients, we use observations of both x and y and attempt to discern the best representation of f(x) = a + bx.
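A minimal sketch of this idea, assuming a straight-line model with normally distributed noise (the “true” values of a and b and the noise level are invented for the example) and using ordinary least squares via np.polyfit to recover the constant and coefficient:

```python
import numpy as np

rng = np.random.default_rng(3)

a_true, b_true = 1.5, 0.8          # assumed "true" constant and coefficient
x = np.linspace(0, 10, 50)
y = a_true + b_true * x + rng.normal(0, 0.4, size=x.size)   # f(x) plus random noise

b_hat, a_hat = np.polyfit(x, y, 1)  # least-squares fit; returns slope, then intercept
print(f"estimated intercept a = {a_hat:.3f}, slope b = {b_hat:.3f}")
```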