ITS battles e-mail troubles, declares victory against bugs

BY MIKE WISER

In the basement of Langdell, behind two solid doors and encased in a metal cage, lies the heart of the Law School’s computer infrastructure — the e-mail server. To the left of the server, a bank of small black hard drives, some of them installed five years ago, sits mostly idle. To the right, three gray cases, a third of the size of the older unit, hold a new disk array with almost three times the capacity of the black drives.

Under the cover of one of the gray cases, green LEDs flash as the drives are accessed. The light show was better on the old drives, but system administrators say the new drives have cut the delivery of e-mail from a couple of hours to a few seconds.

A Non-Existent Problem

In September the school’s Information Technology Services (ITS) wasn’t planning on purchasing a new array of hard drives for the e-mail server. The capacity of the old system had been upgraded the previous winter, and, in the spring, an outside consultant gave the system a clean bill of health.

“They walked away with a good report,” said Jane Sulkin, Director of ITS.

The system held up through September, but as e-mail usage increased in October, e-mail slowed to a crawl. According to Eric Watson, the Supervisor of System Administration, it began to take 60 to 90 minutes for messages to be delivered.

The cause of the slowdown was a mystery to Watson, who manages the e-mail server. Watson spends all of his time working with the school servers. The computers even page him when something goes wrong.

“It’s the only thing I have a pager for,” he said. “No one ever calls me to say, ‘Call me back,’ but the computer text pages are very helpful. They will tell me which computer it is and a cryptic message that to me is very obvious.”

Still, Watson could not find the problem.

Since Hewlett-Packard (HP) provided the OpenMail system and hardware, Watson and Sulkin decided to call in HP engineers to diagnose the problem.

“They started it with very much of, ‘We’re going to find stuff,’” Watson said of HP’s initial attitude.

For three weeks the engineers examined the problem and, to their surprise, found almost nothing wrong with the e-mail configuration. The problem was just that the hardware was not fast enough to handle the number of messages flowing through the servers.

“We just hit a critical mass of numbers of messages per day, number of users and the total amount of data stored on the system,” Watson said.

While the disk array was designed to handle the number of users that the Law School has, it was not capable of handling them all at the same time.

“The pattern of usage was such that a good percentage of the messages that were being sent to the e-mail server were being sent between the hours of two and five,” Sulkin said.

Downtime

The decision to upgrade the hardware kicked off a process that would test the patience of the HLS community by necessitating downtime and creating unforeseen problems.

Watson and Sulkin decided to make the upgrade in the days just after Christmas.

“We have chosen Christmas week to perform these upgrades because we believe that this timeframe has the least impact on the community,” Watson wrote in an e-mail to students.

Still, ITS administrators recognized that every downtime inconvenienced and frustrated users.

“This was the first project I had where there was so much user impact over such a long period of time,” Watson said.

Sulkin said she recognized the damage that the process could do to ITS’s relationship with the rest of the school.

“I think you have a community that just wonders, ‘Why can’t you fix the damn thing?’” she said.

The Law School is also a community that is quick to complain. Asked whether ITS had received feedback about the problems, Sulkin replied: “Oh, sure. This is not a shy community.”

She was quick to add, however, that many of the messages expressed “tremendous support” for the difficulties ITS faced.

But that first period of downtime was only the beginning of ITS’s problems. Due to trouble during the Christmas week upgrade, Watson did not have enough time to copy all of the files from the old disk array to the new one.

The transfer was rescheduled for an early Sunday morning.

“At the risk of pushing my luck, I once again ask for your continued patience and understanding,” Sulkin wrote in an e-mail to the HLS community.

Watson says that he usually doesn’t mind the off-hours work, but recalls being a little nervous as he drove towards Harvard at 2:30 a.m.

“I was thinking, ‘Who are the only cars on the road at 2:30 in the morning on Sunday morning? It’s people who just left the bars.’ And I thought, ‘Is moving OpenMail really worth it?’” Watson said.

As it turned out, the seven-and-a-half hours reserved for the transfer were not enough, and the move had to be aborted.

Memory Leak

At the same time, another problem developed. On December 26, right before upgrading the e-mail servers, Watson applied a patch that HP had recommended for upgrading the software running the server. As it turned out, the patch did more harm than good. Soon, Watson’s pager warned him that the system’s memory was dangerously low. Once, from home, he rebooted the computer when it only had 1 percent of its memory left. On New Year’s Day, he was not so lucky and had to drive back to the school at seven o’clock to manually reset the server after it crashed.

Again, the problem stumped Watson. The system’s log files shed no light on what was going wrong.

“That was a very difficult problem, because there weren’t a lot of symptoms. There wasn’t a lot of evidence of what was going on,” he said.

Because so many things had been done during the week of Christmas, it was impossible to isolate the cause.

HP finally assigned a more senior engineer to investigate the problem. That engineer suggested allowing the server to go down in order to produce a more complete log file. The log file revealed the culprit: the December 26 patch.

The solution was frustratingly simple. In about 15 minutes, Watson was able to remove the tiny patch and fix the memory problem.

Success at Last?

Meanwhile, Sulkin wrote to the community outlining a new schedule for transferring the files from the old disk array to the new one. On the Saturday before Martin Luther King Day, the files were finally transferred and the new array was up and running.

“The move to the new hardware worked beautifully and the new hardware is working unbelievably well,” Watson said.

One way of testing an e-mail server is to send out a test message to all users. A few years ago, a test message managed to reach all of the users in less than 20 minutes. It was a record that was never beaten, until now. According to Watson, a message he sent out on Monday warning about an e-mail virus reached 2,000 users in less than three minutes. Logging on to the OpenMail web client, he says, has also gone from taking more than a minute to only a few seconds.

Still, Sulkin and Watson don’t expect the accolades to begin pouring in. Both said that they expect the community to be a little bit skeptical and admit that it will take time to convince users that the servers are now fast and reliable. Even then, they do not expect to hear from students congratulating them on their success in battling the old disk arrays and bad patches.

“With system administration, it is a thankless profession,” Watson said. “No one ever calls you up and says, ‘Good job.’ You only hear from people when things aren’t working, and that really is the nature of system administration.”

After thinking for a moment, he modified his answer somewhat: “Someone called me up about four years ago and said, ‘Everything is working great. You’re doing great.’”
