LeastSquares.gms : Demonstrate the use of numpy.linalg.lstsq on the diabetes test problem using linalg ols

Description

Use the tool linalg.ols to solve the least squares regression problem.
This model demonstrates how to get the same statistical values as with
the LS solver (dropped in GAMS 34). It also demonstrates how to use the
feature in a loop with subsets of the feature set p.


Category : GAMS Data Utilities library


Main file : LeastSquares.gms   includes :  LeastSquares.gms  ls.gdx  diabetes_data.gdx

$title Demonstrate the use of numpy.linalg.lstsq on the diabetes test problem using linalg ols (LeastSquares,SEQ=141)

$ontext
Use the tool linalg.ols to solve the least squares regression problem.
This model demonstrates how to get the same statistical values as with
the LS solver (dropped in GAMS 34). It also demonstrates how to use the
feature in a loop with subsets of the feature set p.
$offtext
Set i   'patients'
    p   'features';
    
Parameter
    A(i<,p<) 'features collected from a cohort of diabetes patients'
    y(i)     'quantitative measure of disease progression one year after baseline';

* Data from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
$gdxin diabetes_data.gdx
$load A y
$gdxin

$if not set EXPLICITINTERCEPT $set EXPLICITINTERCEPT 1
$ifThen %EXPLICITINTERCEPT%==1
$onMulti
set p / b0 Intercept /;
$offMulti
A(i,'b0') = 1;
$endIf

Scalars
    rss     'sum of squared errors'
    sigma   'Standard error'
    r2      'R-Squared'
    df      'Degrees of freedom'
    resvar  'Residual variance';
Parameters
    estimate(p) 'Estimated coefficients'
    se(p)       'Standard errors'
    tval(p)     't values'
    pval(p)     'p values'
    resid(i)    'residuals'
    fitted(i)   'fitted values for dependent variable'
    covar(p,p)  'variance-covariance matrix';

$FuncLibIn stolib stodclib
Functions
    cdfT     / stolib.CDFStudentT  / 
    icdfT    / stolib.iCDFstudentT /;

executeTool.checkErrorLevel 'linalg.ols i p A y estimate covar=covar r2=r2 df=df rss=rss resvar=resvar se=se tval=tval fitted=fitted resid=resid sigma=sigma';
* Symbols estimate, covar, r2, df, rss, resvar, se, tval, fitted, resid, and sigma) have been loaded
* implicitly by executeTool../modlib/maxcut.338. The compiler instruction in the next line supresses
* errors about presumably unassigned symbols
$onImplicitAssign

pval(p) = 2*(1-cdfT(abs(tval(p)),df));

Set alpha / "90%", "95%", "97.5%", "99%" /;
Parameter
    av(alpha) / "90%"   .9
                "95%"   .95
                "97.5%" .975
                "99%"   .99 /;

parameter confint(alpha,p,*) 'confidence intervals'
scalar q, t;
loop(alpha,
  t = -icdfT((1-av(alpha))/2, df);
  confint(alpha, p, 'LO') = estimate(p) - t*se(p);
  confint(alpha, p, 'UP') = estimate(p) + t*se(p);
);
execute_unload 'ls_np.gdx',confint,covar,df,estimate,fitted,pval,r2,resid,resvar,rss,se,sigma,tval;
execute.checkErrorLevel 'gdxdiff ls.gdx ls_np.gdx eps=1e-4 releps=1e-4 > %system.nullFile%'

* Now demonstrate dynamic use of OLS with subsets of features
Set pp(p)    'subset of features'
    iter     'iterations' / iter1*iter20 /
    bestp(p) 'best subset of features'
Parameter
    Ap(i,p)  'reduced A'
    bestrss  / inf /
    beste(p) 'best estimate';
    
loop(iter,
   option clear=pp, clear=Ap; 
   pp(p)$(uniform(0,1)<0.5) = yes;
$if %EXPLICITINTERCEPT%==1  pp('b0') = yes;
   Ap(i,pp) = A(i,pp);
   executeTool.checkErrorLevel 'linalg.ols i pp Ap y estimate rss=rss';
   put_utility 'log' / iter.tl:10 ': rss=' rss;
   if (bestrss>rss, bestp(p) = pp(p); bestrss = rss; beste(p) = estimate(p));
);
display bestrss, bestp, beste;