Description

This program takes raw data from an Excel sheet, transforms the data
and finds the optimal values for alpha and beta following a procedure
outlined by Stephen Wright. An integral is computed numerically
using a trapezoidal approximations. The optimization results are then
written back into the Excel file.

This procedure was originally coded by David O'Brochta in Mathematica
and recast in a GAMS optimization framework by Wilhelmine Meeraus.

The Excel file contains the data from the field and labwork described
in W Meeraus, 2004.

References

Stephen I Wright, Quang Hien Le, Daniel J Schoen and Thomas E Bureau,
Population Dynamics of an Ac-like Transposable Element in Self- and
Cross-Pollinating Arabidopsis, Genetics 158: 127-1288 (July 2001).

David O'Brochta, Private Communications, University of Maryland Biotechnology
Institute, 2004.

Wilhelmine H Meeraus, A Combined Fieldwork and Laboratory Investigation into
the Behaviour of the Herves Transposable Element, London School of Hygiene
& Tropical Medicine, Theses, 2004.

Keywords: nonlinear programming, discontinuous derivatives, genetics,
          transposable element activity, trapezoidal approximation

Small Model of Type : DNLP

Category : GAMS Model library

Main file : herves.gms includes : hervesio.xlsx

$title Herves (Transposable Element) Activity Calculations (HERVES,SEQ=305)

$onText
This program takes raw data from an Excel sheet, transforms the data
and finds the optimal values for alpha and beta following a procedure
outlined by Stephen Wright. An integral is computed numerically
using a trapezoidal approximations. The optimization results are then
written back into the Excel file.

This procedure was originally coded by David O'Brochta in Mathematica
and recast in a GAMS optimization framework by Wilhelmine Meeraus.

The Excel file contains the data from the field and labwork described
in W Meeraus, 2004.

References

Stephen I Wright, Quang Hien Le, Daniel J Schoen and Thomas E Bureau,
Population Dynamics of an Ac-like Transposable Element in Self- and
Cross-Pollinating Arabidopsis, Genetics 158: 127-1288 (July 2001).

David O'Brochta, Private Communications, University of Maryland Biotechnology
Institute, 2004.

Wilhelmine H Meeraus, A Combined Fieldwork and Laboratory Investigation into
the Behaviour of the Herves Transposable Element, London School of Hygiene
& Tropical Medicine, Theses, 2004.

Keywords: nonlinear programming, discontinuous derivatives, genetics,
          transposable element activity, trapezoidal approximation
$offText

$eolCom //

* you can use command line --xlsxFile=xxx and --sheet=xxx which
* will allow further automation

* the sheet names are Aa,Ag,Am,ZU,ZAg
$if not set xlsxFile  $set xlsxFile 'hervesio.xlsx' // work book name
$if not set sheet     $set sheet    'Ag'            // sheet name

$set rng      '%sheet%!A4'     // starting range to read data
$set rngout   '%sheet%_Opt!A1' // starting range to store results


* 1. Extract raw data from an Excel sheet into a GDX file
*    and load into GAMS. The result of this step are
*    the set of samples i and the counts cnt. Empty or all
*    zero rows and columns will be dropped automatically from
*    the Excel data.
Set
   imax    'max sample size' / 1*50 /
   i(imax) 'samples';

Parameter cnt(imax) 'counts';

Alias (*,code,LabId,Cut);

Parameter raw(code,LabId,Cut) 'raw observations';

Set rr(code,LabID) 'row labels including empty rows';

* empty rows will be reported
$onEmbeddedCode Connect:
- ExcelReader:
    file: %xlsxFile%
    symbols:
      - name: raw
        rowDimension: 2
        columnDimension: 1
        range: %rng%
- Projection:
    name: raw(code,LabId,Cut)
    newName: rr(code,LabId)
    asSet: True
- GAMSWriter:
    symbols: all
    duplicateRecords: first
$offEmbeddedCode

Set
   rid(code,labID) 'row identifier'
   cid(Cut)        'column identifier';

Parameter ct 'column totals';

option rid < raw, cid < raw, rid:0:0:1; // map domains and set print option
abort$(card(rid) > card(imax)) 'size of imax is to small', imax, rid;

ct(cid) = sum(rid, raw(rid,cid));
display raw, ct;

Set rrdiff 'all zero rows';
rrdiff(rr) = yes - rid(rr);
display rrdiff;

i(imax)   = ord(imax) <= card(rid);
cnt(imax) = sum(cid, ct(cid) = ord(imax));
display i, cnt;


* 2. Prepare model constants and set up numerical integration
*    data sets.
Parameter
   diploid   'sample size'
   nhat
   bin(imax) 'binomial for diploid';

diploid      = card(i);
bin(i(imax)) = fact(card(i))/fact(card(i) - ord(imax))/fact(ord(imax));

* numerical integration data
Set
   s     'integral discritization steps' / s0*s1000 /
   ss(s) 'excluding first element';

Parameter
   xx(s) 'values for steps s (0-1)'
   ws(s) 'weight for trapezoid approximation'
   deltaxx;

ss(s)   = ord(s) > 1;       // drop values for x = 0
deltaxx = 1/(card(s) - 1);
xx(s)   = (ord(s) - 1)*deltaxx;
ws(s)   = 1;
ws(s)$(ord(s) = 1)       = 0.5;
ws(s)$(ord(s) = card(s)) = 0.5;

Parameter
   t0(s)      'intermediate value one'
   t1(imax,s) 'intermediate value two';

t0(s)         = sqr(xx(s))+ 2*xx(s)*(1 - xx(s));
t1(i(imax),s) = ws(s)*power(t0(s),ord(imax))*power(1 - t0(s),card(i) - ord(imax));


* 3. Define optimization model and set relevant bounds.
*    Note that not all parameters are known at this point.
Variable
   chi       'sum of CHI squares'
   a         'alpha values'
   b         'beta values'
   est(imax) 'estimates';

Equation
   defchi       'definition of CHI'
   defest(imax) 'definition of estimates';

defchi..    chi    =e= sum(i, sqr(cnt(i) - est(i))/est(i));

defest(i).. est(i) =e= nhat*(a + b)/a/beta(a,b)*bin(i)*deltaxx
                    *  sum(s, t1(i,s)*xx(s)**(a - 1)*(1 - xx(s))**(b - 1));

Model m / all /;

* set bounds and initial values
a.lo = 0.01; a.up = 1.0;
b.lo = 1;    b.up =  10;
a.l  =  .1;  b.l  =   5;
est.l(i) = 1;

option domLim = 1000, limCol = 0, limRow = 0, solPrint = on;

Parameter rep(*,*) 'summary report';


* 4. First Optimization
nhat = 4/3*sum(i(imax), ord(imax)*cnt(i))/card(i);

solve m minimizing chi using dnlp;

display diploid, nhat, a.l, b.l, chi.l;

rep(i ,'w-obs')       = cnt(i);
rep(i ,'w-est')       = est.l(i);
rep(i ,'w-chi')       = sqr(cnt(i) - est.l(i))/est.l(i);
rep('total','w-chi')  = sum(i, rep(i,'w-chi'));
rep('total','w-obs')  = sum(i, cnt(i));
rep('alpha','w-est')  = a.l;
rep('beta' ,'w-est')  = b.l;
rep('nhat' ,'w-est')  = nhat;
rep('SStat' ,'w-est') = m.solveStat;
rep('MStat' ,'w-est') = m.modelStat;


* 5. Second solve with modified counts

// use a cutoff of 70% and solve again
cnt(i(imax))$(ord(imax) > 0.7*card(i)) = 0;
nhat = 4/3*sum(i(imax), ord(imax)*cnt(i))/card(i);
display nhat;

solve m min chi using dnlp;

display diploid, nhat, a.l, b.l, chi.l;

rep(i ,'wo-obs')       = cnt(i);
rep(i ,'wo-est')       = est.l(i);
rep(i ,'wo-chi')       = sqr(cnt(i) - est.l(i))/est.l(i);
rep('total','wo-obs')  = sum(i, cnt(i));
rep('total','wo-chi')  = sum(i, rep(i,'wo-chi'));
rep('alpha','wo-est')  = a.l;
rep('beta' ,'wo-est')  = b.l;
rep('nhat' ,'wo-est')  = nhat;
rep('SStat' ,'wo-est') = m.solveStat;
rep('MStat' ,'wo-est') = m.modelStat;

display rep;

* 6. Update Excel workbook with results in a new sheet
embeddedCode Connect:
- GAMSReader:
    symbols:
      - name: rep
- ExcelWriter:
    file: %xlsxFile%
    symbols:
      - name: rep
        range: %rngout%
endEmbeddedCode