Friday, April 04, 2008

Excel hell

I don't get most of the jokes on The Daily WTF, which is a humour/horror story site for IT professionals, but I go there for the stupid user stories and funny screenshots (examples here).

But there was this one story about Microsoft Excel that resonated quite well with my experience (or the experience of nearly anybody who's had to work with a lot of tabular data on a modern PC, come to that). One of the complaints in the comments thread particularly sounded familiar. Click here and scroll down to "2008-04-04 11:28 • by maldrich".

The root of all this: Excel makes things that look like tables, and tables are useful for data. There is no other program that is as widespread AND makes things that look like tables, so people use Excel to make tables of data. And it's in fact really, really bad at that. It was designed for ad-hoc numerical analysis and got appropriated as a database loading and reporting tool.

I think it's actually damaged the GNP of whole nations, this Excel program. It'd be interesting to know how badly.

Back in 2006 when I was back in Malaysia for a few months, my idiot cousin Jerng talked me into applying for a "data analyst" job with the company he was working for at the time, a multinational human resources consulting firm. Despite my protest that a) I didn't like office work and b) I didn't have any experience in data analysis except for running ELISAs, eventually I gave in, which was a big mistake. After less than two weeks, I decided that sitting in front of a computer all day, plus the corporate culture of "stay till 8pm even if you're not busy because everybody else is and you'll look lazy if you leave on time" were driving me nuts and I quit. I felt really bad about it since I'd told the guy who hired me that I was a serious worker, but I'd been prepared to actually WORK, not hang around half the day with nothing to do and then suddenly have stuff dumped on my desk shortly before knocking-off time.

The way they stored data was really weird though. We were working on a big fat report of the salaries and benefits offered by several hundred Malaysian clients. They would make up these forms in Excel and then email them to the clients. When the forms came back, first the "data analysts" had to "validate" them by calling up the clients' HR departments and bugging them about the fields that weren't filled out properly. Then the data were manually copy-pasted into more Excel tables.

Clients' employees job descriptions were assigned numerical codes - there were several hundred different numbers, categorized by industry sector. The problem was, that year they had switched over to a new set of codes...but some of the current year's data had been entered into worksheets that were using the old codes and therefore couldn't be put into the report yet... A lot of my work for the 3 weeks I was there ended up being converting entries to the new codes. At first I tried to make some lookup tables to do it. But then it turned out that some of the code-to-job-description info was hidden by macros. So my supervisor showed me how to unhide them. Then I got most of the old entries replaced but some still came up null because a few of the old codes weren't where they were supposed to be. Then my super dug them out of some old file that she'd forgotten about...

It was all extremely ad hoc and I can't help wondering now why a bloody fancy company like that couldn't get a few people who knew how to use proper database software.

Anyway, it's also slightly entertaining but very alarming to learn why you shouldn't use Excel to store microarray data.

A default date conversion feature in Excel (Microsoft Corp., Redmond, WA) was altering gene names that it considered to look like dates. For example, the tumor suppressor DEC1 [Deleted in Esophageal Cancer 1] [3] was being converted to '1-DEC.' Figure 1 lists 30 gene names that suffer an analogous fate.
There is another default conversion problem for RIKEN [4] clone identifiers of the form nnnnnnnEnn, where n denotes a digit. These identifiers are comprised of the serial number of the plate that contains the library, information on plate status, and the address of the clone [5]. ... For example, the RIKEN identifier "2310009E13" was converted irreversibly to the floating-point number "2.31E+13." A non-expert user might well fail to notice that approximately 3% of the identifiers on a microarray with tens of thousands of genes had been converted to an incorrect form, yet the potential for 2,000 identifiers to be transmogrified without notice is a considerable concern. Most important, these conversions to an internal date representation or floating-point number format are irreversible; the original gene name cannot be recovered.
Seriously, I would cry if that happened to my data.

Stupid frigging Microsoftware and its tendency to assume it knows what you want better than you do. Try typing the names of recombinant DNA constructs or viruses into Word. It thinks you hit Caps Lock by accident and converts "rNiV" to "Rniv" - unless you turn off the "Correct accidental use of Caps Lock" AutoCorrect function. Gah!

Labels: ,

Sunday, February 11, 2007

Free software plug

Free Download Manager is trying to get people to promote them. So I'll just say this - it's free, it doesn't take up a lot of space/memory, and I like it.

While I'm at it, I also have to say that I also like Windows Media Player. Yes, Microsoft is evil and all, but WMP 11 is pretty decent software. I find WinAmp annoyingly hard to work with.

Other free things I use:

  • ClustalW: for aligning nucleic and protein sequences. Only useful if you're a biologist.
  • Azureus: BitTorrent client. I'm only starting with this, so I don't really know how to use it well, but it seems to have a nice interface. The original BitTorrent client by the guy who invented it was a real pain in the neck.
  • Paint.NET: very cool image editing program, tons of features. It's got everything I normally use in Paint Shop Pro 9. Only problem I've run into so far is that it doesn't support pressure sensitivity for tablets.
  • Trillian: multi-network instant messaging client. Back in the day when my sisters and I first started instant messaging (late 90s; yeah, we were slow) ICQ was the most popular service in Malaysia. Now it's MSN. Most Americans use AIM, and I have the few oddball friends who are only on Yahoo. So, yeah, essential.

Labels: