seders.gms : ERS Data Manipulations with SED

Description

Data files are often not in the form that can be read by another
program and/or transformations have to be made. The example below
shows how to manipulate an input output matrix in comma delimited
form. A sample file is shown below:

BAS1,1 Dairyfarmprd,2 PoultryEggs,3 Meatanimals,4 livestckmisc, ... 512 Wat2,513 Air2,Total,
1 Dairyfarmprd,0,0,265.7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, ... 0,0,0,0,0,0,0,0,0,26452.6,
2 PoultryEggs,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,523.8,0,0, ... 0,0,0,0,0,0,,17623.5,
...
503 Air2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.2,0.7,1.7,0.3, ... 8,18859.6,470.4,0,0.5,5.4,33699.3,
Total,18728.2,14814.9,50839.5,2670.9,2693.7,3421.0,1831, ... ,3593.4,12052.2,6520145.5,

This file is the result of a routine process to take a raw file from a
web site and attach custom labels to rows and columns
( https://www.bea.gov/industry/io_benchmark.htm ).  The rows and
columns have the form '<number> space <text>'. Some of the rows have
been dropped and the <number> on the row labels does not match the
corresponding column label.  Note that the row named Air2 has number
503, whereas the column Air2 has number 513.  Since the aggregations
are done by column numbers, the row numbers need to be reassigned to
use the column order. Furthermore, there are three row labels that
have no matching column labels.

Finally, the matrix needs to be aggregated from a size of 513 to a
much smaller size.

We use the unix tool SED to manipulate the text files into a format
suitable to be read by GAMS. For more details on SED please an SED
book.


References

  • Somwaru, A, Problem contribution, ERS, US Department of Agriculture, Washington, DC.
  • Doufherty, D, and Robbins, A, SED and AWK. O'Reilly & Associates, Sebastobol, 1997.

Small Model of Type : GAMS


Category : GAMS Model library


Main file : seders.gms   includes :  f001.inc

$title ERS Data Manipulations with SED (SEDERS,SEQ=297)

$onText
Data files are often not in the form that can be read by another
program and/or transformations have to be made. The example below
shows how to manipulate an input output matrix in comma delimited
form. A sample file is shown below:

BAS1,1 Dairyfarmprd,2 PoultryEggs,3 Meatanimals,4 livestckmisc, ... 512 Wat2,513 Air2,Total,
1 Dairyfarmprd,0,0,265.7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, ... 0,0,0,0,0,0,0,0,0,26452.6,
2 PoultryEggs,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,523.8,0,0, ... 0,0,0,0,0,0,,17623.5,
...
503 Air2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.2,0.7,1.7,0.3, ... 8,18859.6,470.4,0,0.5,5.4,33699.3,
Total,18728.2,14814.9,50839.5,2670.9,2693.7,3421.0,1831, ... ,3593.4,12052.2,6520145.5,

This file is the result of a routine process to take a raw file from a
web site and attach custom labels to rows and columns
( https://www.bea.gov/industry/io_benchmark.htm ).  The rows and
columns have the form '<number> space <text>'. Some of the rows have
been dropped and the <number> on the row labels does not match the
corresponding column label.  Note that the row named Air2 has number
503, whereas the column Air2 has number 513.  Since the aggregations
are done by column numbers, the row numbers need to be reassigned to
use the column order. Furthermore, there are three row labels that
have no matching column labels.

Finally, the matrix needs to be aggregated from a size of 513 to a
much smaller size.

We use the unix tool SED to manipulate the text files into a format
suitable to be read by GAMS. For more details on SED please an SED
book.


Doufherty, D, and Robbins, A, SED and AWK. O'Reilly & Associates,
Sebastobol, 1997.

This problem is due to: Agapi Somwaru, ERS, US Department of
Agriculture, Washington, DC

Keywords: data manipulation, stream editor, comma-separated values
$offText

$set raw f001.inc
$set out product

* We need the Posix compatible sed
$set sed sed
$ifI %system.platform% == "sol" $set sed /usr/xpg4/bin/sed


* 1. Extract column headers and write to colnumber
$onEcho > sedscript
# insert before line 1
1i\
$offListing
# drop first token
s/^[^,]*,//
# drop last item followed by comma
s/,[^,]*,$//
# replace comma with newline
s/,/\
/g
# quit after first line
1q
$offEcho

$call %sed% -f sedscript %raw% > colnumber


* 2. Extract row names write to rowname
$onEcho > sedscript
# insert before line 1
1i\
$offListing
# drop first line
1d
# drop last line
$d
# remove row number
s/^[0-9]* //
# remove everything following the comma
s/,.*$//
$offEcho

$call %sed% -f sedscript %raw% > rowname


* 3. Prepare matrix and write to mat1 using row names and column numbers
$onEcho > sedscript
# insert before line 1
1i\
$offListing\
$onDelim
# drop trailing comma
s/,$//
# remove the column names following the numbers on first line
1s/ [^,]*,/,/g
# remove numbers on other rows
1!s/^[^ ]* //
#insert after last line
$a\
;
$offEcho

$call %sed% -f sedscript %raw% > mat1


* 4. Read transformed data and perform some basic checks
Set
   j 'column numbers' /
$  include colnumber
                      /
   row 'row names'    /
$  include rowname
                      /
   col 'column names' /
$  call %sed% -e "s/^[0-9]* //"  colnumber > temp
$  include temp
                      /
   jmap(j,col) 'number name map'
/
$  onDelim
$  include colnumber
$  offDelim
/;

Table mat1(*,*) 'data including row and column totals'
$include mat1
;

Parameter rowtot(row), rowdiff(row), coltot(j), coldiff(j);

rowtot(row)  = sum(j, mat1(row,j));
coltot(j)    = sum(row, mat1(row,j));
rowdiff(row) = rowtot(row) - mat1(row,'total');
coldiff(j)   = coltot(j) - mat1('total',j);

* display rowdiff, coldiff;

* check to make sure the totals are OK, note the rounding!
Parameter
   rowerr(row) 'relative row errors'
   colerr(j)   'relative column errors';

rowerr(row) = round(rowdiff(row)/(1 + rowtot(row)),1);
colerr(j)   = round(coldiff(j)/(1 + coltot(j)),1);

display rowerr, colerr;

abort$(card(rowerr) + card(colerr)) 'likely data errors';

* update row and column totals
mat1(row,'total') = rowtot(row);
mat1('total',j)   = coltot(j);

Set zerorows(row) 'rows with zero total';
zerorows(row) = rowtot(row) = 0;

display zerorows;


* 5. Find additional row names and add to row OthSLGentpr
Set rowx(*) 'additional row names';
rowx(row) = yes - col(row);
display rowx;

mat1('OthSLGentpr',j) = mat1('OthSLGentpr',j) + sum(rowx, mat1(rowx,j));
mat1(rowx,j) = 0;


* 6. Map the rownames into the proper column numbers and store in mat2
Alias (i,j);

Parameter mat2(i,j) 'matrix using column numbers';
mat2(i,j) = sum(jmap(i,col), mat1(col,j));


* 7. Final aggregation mapping and tests
Set
   Iagg 'aggregation of IC sectors'
        / ADRY 'DAIRYFARPRD',       AOMT 'POULTRY',                 ACTL 'MEATANIM'
          ACOT 'COTTON',            AFOD 'FOODGRAIN',               AFED 'FEEDGRAIN'
          ATOB 'TOBACCO',           AFRT 'FRUITS AND TREENUTS',     AVEG 'VEG AND GREENHOUSE'
          ASUG 'SUGARCROPS',        AOCR 'OTHERCROPS',              AOIS 'OILBEARCROPS'
          ACMP 'CATTLE PROD PROC',  AOMP 'OTHER MEAT PROC',         ADRP 'DAIRY PROD PROC'
          AFOP 'FOOD GRAIN PROC',   AFEP 'FEED GRAIN PROC',         ASUP 'SUGAR PROD PROC'
          AFSH 'FISH PRE PROC',     AOIP 'OIL PROD PROC',           AFRP 'FRUIT PROD PROC'
          ABEV 'BEVERAGES',         ATOP 'TOBACCO PROC',            AFDP 'OTHER FOOD PROC'
          ATEX 'TEXTILE',           AAPR 'APPAREL',                 ALTP 'LEATHER PROD PROC'
          AACH 'AGRCHEMICALS',      AFFS 'FOREST FISHING AGR SERV', AMNF 'MANUFACTURING'
          ASRV 'SERVICES' /
   IaggMap(Iagg,J) 'aggregation mapping'
        / ADRY. (1)
          AOMT. (2)
          ACTL. (3,4)
          ACOT. (5)
          AFOD. (6)
          AFED. (7)
          ATOB. (9)
          AFRT. (10,11)
          AVEG. (12,17)
          ASUG. (13)
          AOCR. (8,14)
          AOIS. (15)
          AFFS. (16,18*21)
          AACH. (30,179,180,181)
          ACMP. (53,54)
          AOMP. (55)
          ADRP. (56*60)
          AFSH. (61,66)
          AFRP. (62,64*65,67)
          AFOP. (69*71,76*78,96,97)
          AFEP. (72*75)
          ASUP. (79,82,80)
          ABEV. (83*88,93)
          AOIP. (89*92,94)
          AFDP. (63,68,81,95,98)
          ATOP. (99*102)
          ATEX. (103*114)
          AAPR. (115*127)
          ALTP. (209*217)
          AMNF.(22*29,31*52,128*178,182*208,218*407)
          ASRV.(408*513)                             /;

Set
   iaggTest(Iagg) 'unmapped aggregated sectors'
   JZero(j)       'sectors without mapping'
   JTest(j)       'sectors with multiple mappings';

iaggtest(iagg) = sum(j$IaggMap(Iagg,j), 1)    = 0;
JZero(j)       = sum(iagg$IaggMap(Iagg,j), 1) = 0;
JTest(j)       = sum(iagg$IaggMap(Iagg,j), 1) > 1;

display iaggtest, jzero, jtest;

abort$(card(iaggtest) + card(jzero) + card(jtest)) 'the aggregation mapping is incorrect';


* 8. Aggregate into amat and store all data in a gdx file
Alias (iagg,jagg), (j,jj);

Parameter amat(iagg,jagg) 'aggregate table';
amat(iagg,jagg) = sum((j,jj)$(iaggmap(iagg,j)*iaggmap(jagg,jj)), mat2(j,jj));
display amat;

execute_unload '%out%' i, amat;
$if %system.filesys% == UNIX $exit
$call msappavail -Excel
$ifE errorLevel<>0 $exit

execute "gdxxrw %out%.gdx trace=0 par=amat rng=prod!b4";