Thursday, May 17, 2012

Intermittent problems

Intermittent problems are the worst kind (except for total disaster, obviously).  I'd almost rather see a system not work at all than fail occasionally and randomly.  If it's completely down, the cause is generally fairly easy to spot, and when you see it start working again, you can be pretty sure it's working for everybody.  If it consistently fails for one person but works for everybody else, you can at least check that single account on our end, or walk the person through their settings on theirs (assuming you can get in touch with them, but that's a completely separate topic), and most of the time you'll find the cause.  Likewise, if it only fails at a certain time of day, or from a certain building, or on a specific kind of device, you can test it then, or there, or using that.

It's when you can't find a pattern that things get really annoying.  To troubleshoot something, you need to be able to see the trouble as it happens, and if you have no way to do that, you're going to have a hard time finding the answer.  You can't know how many people are seeing the problem, or how often.  Even if you can get it to fail where you can see it, when it starts working again, does that mean you really fixed it?  Or did it just start working again randomly?  And did it fix the system for everybody else?

Even though most problems we deal with are pretty straightforward, it's those rare intermittent problems that really stick in your mind (even if it's only because you've banged your head against the wall for too long).  I have to remind myself that they're rare, because right now I'm working on three of them at once.

  1. When a former student without an account wants to get their transcript online, they can go to Account Lookup and it will tell the system to create a temporary account.  It shows a message explaining what's happening, and asks the person to try again in ten minutes.  Usually, that second try at Account Lookup tells them the new account name, lets them set a password, and onward they go.
    Except right now, for a few people, it isn't working. Account Lookup tells them that an account will be created in ten minutes, and then blithely forgets to tell the system to actually go and do that. When they come back in ten minutes, Account Lookup can't find an account, so it gives them the "Wait ten minutes" message again, and then forgets to inform the system again, and around and around we go.

  2. A few people have reported that they get a "500 Server Error" when they click the Gmail button in the Portal. Until I can get in touch with one of them, I'm stuck, because it works fine for me no matter how I try. For all I know, it's just one of those once-in-a-blue-moon fluke problems that solve themselves. But I can't afford to ignore it, because the people who reported it might just be the tip of the iceberg. Going without email is not just an annoyance anymore.

  3. And finally, about ten people in the online faculty/staff directory are showing up without any contact information, not even email. At the start of the week they weren't showing up at all, so this is progress of a sort, but a name without contact information is pretty much useless in a contact directory.
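
To make the loop in the first item concrete, here is a rough sketch of how I picture it. Every name in it (`account_lookup`, `run_provisioning`, `request_is_sent`, the `temp-` account names) is made up for illustration; it is not how the real Account Lookup is written, just a toy model of the behavior described above.

```python
# Hypothetical sketch of the Account Lookup flow; none of these names
# come from the real system.
WAIT_MESSAGE = "Your account will be created. Please try again in ten minutes."


def account_lookup(person_id, accounts, pending, request_is_sent=True):
    """Return the message the person sees for one visit to Account Lookup."""
    if person_id in accounts:
        return f"Your account name is {accounts[person_id]}."
    if request_is_sent:
        # The step that seems to be skipped for a few people.
        pending.add(person_id)
    return WAIT_MESSAGE


def run_provisioning(accounts, pending):
    """Stand-in for the back-end job that creates the requested accounts."""
    for person_id in sorted(pending):
        accounts[person_id] = f"temp-{person_id}"
    pending.clear()


def visits_until_account(person_id, request_is_sent, max_visits=5):
    """Count visits until an account name is shown, or None if it never is."""
    accounts, pending = {}, set()
    for visit in range(1, max_visits + 1):
        message = account_lookup(person_id, accounts, pending, request_is_sent)
        if message != WAIT_MESSAGE:
            return visit
        run_provisioning(accounts, pending)  # the "ten minutes" go by
    return None
```

In this toy model, `visits_until_account("12345", True)` succeeds on the second visit, while `visits_until_account("12345", False)` never does: the person sees the same wait message every time. If the real system behaves anything like this, one thing worth checking is whether a creation request was actually recorded each time the wait message was shown.
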

So yeah, interesting times. I need to stop talking about these things and start digging into them again.
