The latest page is at itmetr.net.
last modified:
05:04 PM MDT, Sun 15 Jun 2008
You will not need to know much at all about the theory of statistics to complete this tutorial. If you desire, you could peruse:
IMO, in statistical studies 90% of the labor is in obtaining the data (e.g., building and maintaining a mesonet), 9% in archiving and formatting the data (e.g., the reanalysis data) and 1% in passing the data through statistical analysis software and interpreting the output. Here we will do part of the 9%: formatting the data using Python. And the 1%: using a Python function to find regression coefficients from one data set, making a prediction with those coefficients, and comparing the predictions with observations. Here is the online documentation for the "workhorse" of the statistical analysis, available from a Python interpreter:
>>> import LinearAlgebra
>>> help(LinearAlgebra.linear_least_squares)
linear_least_squares(a, b, rcond=1e-10)
solveLinearLeastSquares(a,b) returns x,resids,rank,s
where x minimizes 2-norm(|b - Ax|)
resids is the sum square residuals
rank is the rank of A
s is an rank of the singual values of A in desending order
If b is a matrix then x is also a matrix with corresponding columns.
If the rank of A is less than the number of columns of A or greater than
the numer of rows, then residuals will be returned as an empty array
otherwise resids = sum((b-dot(A,x)**2).
Singular values less than s[0]*rcond are treated as zero.
(Note the spelling errors and notational inconsistencies.
Hey, but how much did you pay for it?)
We will adopt another notational convention rather than A (a) and b.
Instead of A, we will call the matrix with rows consisting of m
records of simultaneous
observations of n input variables to be X. The column vector of m
observations of an output variable,
which are hoped to be significantly dependent on the input variables, will be
called y. The n coefficients we seek will be the vector
(a 1-d array in Python) c. Having found the coefficients, from an analysis
of "training data", we hope to either make predictions or otherwise test the
skill of the model with an entirely different set of observations,
with so-called "verification data".
The vector of predictions is denoted yp, which is desired to be
close to y in the verification data. The
function linear_least_squares chooses c that minimizes
the difference between y and yp in the training data.
Here is our forecast model, with notation appropriate for a single record of observations, using n=3 as an example: ypi= c1 Xi1 + c2 Xi2 + c3 Xi3 + c4Actually, we seek n+1 coefficients, because of the use of the last term in the above.
The task for Fall 2006 is
Statistical Forecast of Norman Temperature
. The task is easily completed by
modifying the scripts for the example nowcast project.
|