Graham Swallow
CV - Update - SITA

www.Information-Cascade.co.uk


1998 Dec 20 - 2000 Feb 05: SITA

Société Internationale de Télécommunications Aéronautiques (www.sita.int) was founded 50 years ago as a not-for-profit cooperative, by agreement between the world's airlines, to manage their networks and to run various services at airports (weather, lost baggage, ...). The name is not familiar to Joe Public, but if you've been through an international airport, you've depended on SITA services. SITA is now moving into a more commercial portfolio.

This particular data center focuses on the clean running of the machines, with the usual issues of new projects, new problems, sizing, upgrades, contingency, targets and migrations. The aim is to keep the systems under "mundane" control: pro-active tests to see that things are working to spec, reactive maintenance when a router fails, and follow-ups to accumulate what is learned and to extrapolate that knowledge to things that didn't happen (and therefore won't).

Dozen HP-UX boxes (10.20)

The boxes usually ran smoothly, but occasionally there were hiccups. I was involved in scheduled maintenance and prioritised troubleshooting.

There were a few problems with HP firmware levels, and with early fibre devices. These were diagnosed and fixed reactively on the affected machine, then fixed proactively on the other machines, with versions tracked across all of them.


5 Solaris Workstations

Someone found 5 Sparc-Ultra-1's in a cupboard (on support contract!), so I installed Solaris 2.7 on them, and taught the site how to use X11 workstations, rather than DOS telnet.


Open Source Toolkit

I bootstrapped and installed a rich library of cross-platform, Open Source tools, on both the HP-UX and the Solaris machines. This included gcc/g++, Tcl/Tk, Python, Gtk, Perl, tkined, jpeg, png, unzip, zip, zlib, lynx, SSL, graphics manipulation libraries (non gif!), and most of all: Midnight Commander (a seriously necessary text-mode file-browser).


Desktop GUI

I created a set of GUI menus for the operators and the sys-admins, which gave one-click enterprise access to the machines (with DISPLAY set), and also an event driven remote-run facility, so that a scripted command could be run remotely, and the results sent to ... a GUI, a file, or a collected string.
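
The remote-run idea above can be sketched as follows. This is a minimal illustration in Python, not the original implementation: the names run_and_deliver, remote_command and file_sink are hypothetical, and ssh is only an assumed transport (the original transport is not stated). Each destination (a GUI, a file, a collected string) is modelled as a "sink" that is called with the command's output.

```python
import subprocess
import sys

def remote_command(host, command, transport=("ssh",)):
    """Build the argument list that runs `command` on `host`.
    The ssh transport is an assumption, used here only for illustration."""
    return list(transport) + [host] + list(command)

def run_and_deliver(command, sink):
    """Run a command, pass its standard output to `sink`, return the exit status."""
    result = subprocess.run(command, capture_output=True, text=True)
    sink(result.stdout)
    return result.returncode

def file_sink(path):
    """Return a sink that appends each result to a file."""
    def write(text):
        with open(path, "a") as handle:
            handle.write(text)
    return write

# A "collected string" sink is just a list's append method.
collected = []

# Run a local command so that the sketch works without any remote host.
status = run_and_deliver([sys.executable, "-c", "print('ok')"], collected.append)
```

For a real remote run, the command list would come from remote_command("box1", ["uptime"]), where "box1" is a made-up host name. A GUI sink would be one more callable, for example one that inserts the text into a Tk text widget.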

These included the ability to manually switch routing between two routers, test the result and view the settings. They also included the ability to change the (randomly generated) passwords on a dial-in modem and server, so that an HP technician could be given access for diagnostics, then have that access turned off (manually or automatically).


No Y2K problems

Early in 1999, I encouraged the site to upgrade their HP-UX 10.20 boxes, one by one, to the latest patch levels. This was very successful, as later updates were totally straightforward, and we crossed the Y2K dateline without a hitch. I did not do any other Y2K work.

Statistics and Configuration Gathering

I created a set of scripts to run vmstat 60 1440 on all machines, and gather the results back into a central place, as well as a daily configuration gatherer. When other admins changed the configuration, or when a machine was rebooted (activating a new configuration, or deactivating a manual one), it gave a comparable reference to see how it used to work.
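
As an illustration of the "comparable reference" idea, here is a minimal Python sketch that compares two daily configuration snapshots. It is not the original scripts: the function name config_changes and the snapshot contents are invented for the example. It also spells out the arithmetic behind vmstat 60 1440, which is one sample every 60 seconds, 1440 times, or one full day.

```python
import difflib

def config_changes(old_text, new_text, old_label="yesterday", new_label="today"):
    """Return the unified-diff lines between two configuration snapshots."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_label, tofile=new_label, lineterm=""))

# Hypothetical snapshot contents, invented for illustration only.
yesterday = "hostname=box1\nroute default 10.0.0.1\nswap 512\n"
today = "hostname=box1\nroute default 10.0.0.2\nswap 512\n"

for line in config_changes(yesterday, today):
    print(line)

# vmstat 60 1440: a 60 second interval, repeated 1440 times.
SECONDS_COVERED = 60 * 1440  # 86400 seconds, i.e. 24 hours
```

An empty result means nothing changed between the two snapshots. A non-empty one shows exactly which lines differ, which is what is needed after an unexpected reboot.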

I also created some scripts to parse the output from ioscan, into an HTML table, with nested boxes (one backplane has several cards, each with several sub-devices, ...)
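
The nesting can be sketched like this, in Python. The sketch assumes the ioscan output has already been reduced to (hardware path, description) pairs, and that paths nest by their "/" and "." separated components. The sample devices and the names build_tree and render are invented for illustration, and this is not the original script.

```python
import html
import re

def build_tree(devices):
    """Nest (hw_path, description) pairs by the components of the hardware path."""
    root = {"desc": "system", "children": {}}
    for path, desc in devices:
        node = root
        for part in re.split(r"[/.]", path):
            node = node["children"].setdefault(part, {"desc": "", "children": {}})
        node["desc"] = desc
    return root

def render(node, label):
    """Render a node as an HTML table, with its children as nested tables."""
    heading = html.escape(label + ": " + node["desc"])
    rows = "<tr><th>" + heading + "</th></tr>"
    cells = "".join("<td>" + render(child, name) + "</td>"
                    for name, child in node["children"].items())
    if cells:
        rows += "<tr><td><table border=1><tr>" + cells + "</tr></table></td></tr>"
    return "<table border=1>" + rows + "</table>"

# Hypothetical devices, invented for illustration only.
devices = [
    ("8", "bus adapter"),
    ("8/0", "SCSI card"),
    ("8/0.5", "target"),
    ("8/0.5.0", "disk"),
    ("8/4", "LAN card"),
]

tree = build_tree(devices)
page = render(tree, "box1")
```

Each box in the page contains the boxes of its sub-devices, so a card and everything hanging off it can be seen at a glance.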


ISP Firewall Configuration

Whilst I was there, a few boxes moved their emphasis towards being an ISP, with POP3 accounts and RADIUS smartcards.

I was involved with the reconfiguration of the firewall, to open up new paths and diagnose routing problems, and with occasional details, like getting a web service moved so that it appeared to be on port-80 (so that remote firewalls would not have problems with port-8182).


Sybase and Oracle

I'm not a DBA, but I did work with a DBA to perform some interesting tasks, such as reformatting drives, setting up Sybase, moving databases and installing custom apps. Our combined experience gave a better configured machine, with disk loads split over several spindles (non-striped), spare capacity, repeatable scripted steps, etc.

I scripted (ksh) 'db-stop' (and start), so that the systems could be controlled more easily. I also scripted up a way of dumping the main database (OmniBack), and reloading it onto a different machine, from a different tape device (as well as calling the usual sybase edits).


Production Machines

The biggest problem with production machines, and machines actively reserved as development machines, is that there is little opportunity to test scripts in a range of circumstances.

In that respect, I think the year was very successful. We managed to identify many places where the system needed specific attention, put something into place, refine it, and keep checking across a span of time.


Tape Robot

I installed a StorageTek SCSI tape robot (pre SAN). This had several different intermittent teething problems, which required all the SCSI cards in the room to be upgraded, the floor to be lifted a few times, the SCSI-Multiplexors to be upgraded, and a number of system down times (external SCSI is not hot-pluggable).

Whilst the robot backups were unreliable, the main system backups were done on an external DLT. This exposed the fact that HP do not support two devices connected to the same SCSI card (never mind active), and diagnostic support tactics like: can you try swapping the two cards over, in case it's the card ...

The project was plagued by the VAR abandoning all involvement from day-1, and HP claiming that they didn't support that _particular_ model (so we eventually got it swapped for one they did support, remarkably similar, but more slots). Plus every wire change that touches a system, requires that system to be offline (and a centralised backup touches every system ...).

At some point, things came together, it became reliable and it went live. Since then the only problems have been week-08 and 09 not being valid octal numbers (all others are).
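
Why 08 and 09 in particular: C-style integer syntax, which Tcl and some shells also follow in arithmetic, reads a number with a leading zero as octal. The original scripts are not shown here, so the Python sketch below only models that rule (the function name parse_leading_zero_octal is hypothetical) and checks every zero-padded week number against it.

```python
def parse_leading_zero_octal(text):
    """Parse an integer the C-style way: a leading zero means base 8."""
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)   # raises ValueError if the digits include 8 or 9
    return int(text, 10)

# Zero-padded week numbers 01 .. 53.
weeks = [f"{n:02d}" for n in range(1, 54)]

failing_weeks = []
for week in weeks:
    try:
        parse_leading_zero_octal(week)
    except ValueError:
        failing_weeks.append(week)

print(failing_weeks)  # ['08', '09']

# Forcing base 10 (or stripping the leading zero) avoids the problem.
safe = int("08", 10)
```

Weeks 01 to 07 are valid octal, and weeks 10 onwards have no leading zero, so only 08 and 09 fail.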

OmniBack

The UNIX backups are done through OmniBack, which does individual tasks very well, but is not always ideal. For example, OmniBack sort of assumes that you will allocate many tapes into a pool, and let it manage which ones are used. However, the site had a specific requirement that each day's tapes be easily identified, and moved to the firesafe, each day.

OmniBack (in its own presumed configuration) uses loose allocation of tapes, but that uses the maximum number of tapes (each task gets its own tape). So if you have 4 drives you need 4 tapes, PLUS if any tape fills up, you need another to guarantee there is always one available, PLUS if OmniBack fails (for another reason), it blames the tape, and needs yet another tape. That makes 6 tapes, when 3 was more than enough.
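
The tape count above is simple arithmetic, written out here as a small Python sketch. The function name and parameter names are invented for the example, and the figures are the ones quoted in the paragraph.

```python
def loose_allocation_tapes(drives, overflow_spares=1, failure_spares=1):
    """Tapes needed per day under loose allocation: one per drive,
    plus a spare in case a tape fills, plus a spare in case a tape is blamed."""
    return drives + overflow_spares + failure_spares

tapes_needed = loose_allocation_tapes(drives=4)   # 4 + 1 + 1
tapes_sufficient = 3                              # figure quoted in the text
wasted = tapes_needed - tapes_sufficient
```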

The only real way to avoid this is to use Strict Allocation, but that causes problems with two machines wanting the same tape - and one fails immediately. That meant that I had to script up our own set of custom backup sequencers.
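
One way such a sequencer could work is sketched below in Python. This is an assumption about the approach, not a description of the original scripts: jobs that want the same tape are queued behind each other, so that no tape is requested twice at the same time. The machine and tape names are made up.

```python
from collections import defaultdict

def sequence_backups(jobs):
    """Arrange (machine, tape) jobs into rounds in which no tape is used twice."""
    queues = defaultdict(list)
    for machine, tape in jobs:
        queues[tape].append(machine)
    rounds = []
    while any(queues.values()):
        # Take at most one waiting machine per tape for this round.
        rounds.append([(queue.pop(0), tape)
                       for tape, queue in queues.items() if queue])
    return rounds

# Hypothetical jobs, invented for illustration only.
jobs = [("box1", "MON-1"), ("box2", "MON-1"), ("box3", "MON-2")]
rounds = sequence_backups(jobs)
```

Here box1 and box3 run together in the first round, and box2 waits for the second round because it needs the same tape as box1.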

I also scripted up a Tk GUI monitor, to show the results of all previous backups (lots of red/green boxes, with access to the messages), and some GUI screens to do tape management (list all tape pools, move a tape pool to the door, move tapes beyond OmniBack's reach, recycle an old pool, etc.)