Scanning the past

When I first started on my current project of entering old race results I was unaware of the tangle of problems I was leaping into. I expected that I would finish with it fairly quickly and move on to something else. I also did not expect that the scope of the project would expand as it has — first I was given more results to process than I anticipated, and now I am actively seeking out such results myself.

My intention was that I would scan each page of results, then use optical character recognition (OCR) software on that page and then process the textual output of the OCR and add it to my database.

Well, the first problem was with the OCR. When I bought my new scanner it came with OCR software. That made everything seem very simple. I’d just use it.

Only…

The software was very proud that it did multi-column processing. Great! I’ve got lots of columns.

Well… “multi-column” processing is designed for newspaper columns. Each column is processed as a separate entity. This plays havoc with tabular data, which is what I really had. I’d get a column of names, then a column of ages, then a column of cities… Worse, some column outputs had several things in them, while other columns seemed to be missing entirely. Unfortunately there did not seem to be any way to turn off multi-column processing.

So I had to search for some less sophisticated software. And I found some.

Now, sadly, a 50+ year old document is often smudged, coffee-stained, etc. This makes it difficult for even a human to read, to say nothing of OCR.

Sometimes text which seems perfectly legible to me (especially at the larger resolution you see when you click on the image) just comes out as garbage after OCR:
1 01 050100 0001 20 0005510 00 0v1c 0:53 05:02
So some races need to be entered manually.

Even at the best of times it can be hard to tell a capital O from a zero, a one from an eye or a lower-case ell, a five from an S, 8 and B, 7 and ?. But I was surprised to find that the software has its hardest time with “4” which gets variously rendered as 0, 5, 6, h, b, and a myriad of others. It also has a distinct aversion to “M” and “W”, preferring to render them as “H”.
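A rough Python sketch of how the letter-for-digit confusions might be handled in fields that are known to be numeric (times, bibs, ages). The lookup table and both function names are invented for illustration, not part of any OCR package. Note that it cannot help with a “4” that was read as 0, 5 or 6, since those still look like perfectly valid digits.

```python
# Hypothetical confusion table: letters the OCR tends to emit where a digit was printed.
DIGIT_LOOKALIKES = str.maketrans({
    "O": "0", "o": "0",
    "I": "1", "l": "1",
    "S": "5",
    "B": "8",
})


def normalize_numeric_field(text):
    """Replace look-alike letters with digits in a field known to be numeric."""
    return text.translate(DIGIT_LOOKALIKES)


def needs_manual_review(text):
    """True if the field still holds anything other than digits and ':' after normalizing."""
    cleaned = normalize_numeric_field(text)
    return not all(ch.isdigit() or ch == ":" for ch in cleaned)
```

Anything that still fails the second check (an “h” or “b” in the middle of a time, say) would go on the pile to be looked at by eye.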

So— I need to build up a list of people’s names, of city names and of team names and try to spell check them. I also need to check the times. Luckily times are entered in a steadily increasing pattern (fastest runners first, of course) so one simple check is to make sure they are in order. Bibs and ages are all higgledy-piggledy and there’s no way to recheck other than with my fallible eye.
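Both checks could be sketched in a few lines of Python using only the standard library. This assumes times are written as MM:SS or H:MM:SS and that a list of known names already exists; the function names and the 0.8 similarity cutoff are arbitrary choices for illustration.

```python
import difflib


def parse_time(text):
    """Convert 'MM:SS' or 'H:MM:SS' to a number of seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def out_of_order(times):
    """Return the indexes of times that are faster than the time just before them."""
    secs = [parse_time(t) for t in times]
    return [i for i in range(1, len(secs)) if secs[i] < secs[i - 1]]


def suggest_name(ocr_name, known_names, cutoff=0.8):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(ocr_name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A flagged index only says that something is wrong near that line; the misread time may be the flagged one or the one just before it. Likewise a suggested name is only a suggestion, since two different runners can have very similar names.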

And then there are more fundamental problems. Race results look quite different now than they did 50 years ago. Running is now (in the US) an individual sport, while 50 years ago it was a team sport. Age and sex are very important now because it isn’t fair to compare a 70 year old runner against a 20 year old, while 50 years ago everyone was a young male. This all means that data which now seem essential weren’t collected back then, and data which mattered then are now irrelevant.

Old results do not include age/sex or city/state or divisions or age grading, while they do include teams. What do I do with teams? There is nowhere in the modern results display for them… Well, I’ll just add a new field to the database and figure out how to display it later.
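As a sketch of what adding such a field might look like, here is a small Python example using SQLite. The table layout, column names and rows are all placeholders made up for illustration; the real database may be organized quite differently.

```python
import sqlite3

# In-memory database with a made-up 'results' table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (race_year INTEGER, place INTEGER, name TEXT, time TEXT)"
)
conn.execute("INSERT INTO results VALUES (1980, 1, 'Runner A', '31:05')")

# Add the new field; rows that already exist simply get NULL for it.
conn.execute("ALTER TABLE results ADD COLUMN team TEXT")
conn.execute("INSERT INTO results VALUES (1960, 1, 'Runner B', '32:10', 'Team X')")

rows = conn.execute(
    "SELECT race_year, name, team FROM results ORDER BY race_year"
).fetchall()
```

Modern results keep an empty team, so the existing display does not have to change until there is somewhere to show it.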

I don’t really need the results to contain divisions and age-grading, I can generate those myself — or I could if I had the runners’ ages. And before 1979 I don’t.

Well, if a runner from 1960 is still running (in Santa Barbara) in 1980 then I can find his (only men were running in 1960) age in 1980 and fill in an age in 1960. But most of the early runners weren’t still running (here) in 1980.
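The back-calculation is simple arithmetic. A minimal Python sketch follows, with a hypothetical function name. The result can be off by a year, depending on where the runner's birthday falls relative to the two race dates.

```python
def estimate_age(known_age, known_year, target_year):
    """Estimate a runner's age in target_year from an age recorded in known_year."""
    estimate = known_age - (known_year - target_year)
    if estimate < 0:
        raise ValueError("runner would not have been born yet in target_year")
    return estimate
```

For example, a runner listed as 45 in 1980 would be entered as about 25 in 1960.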

Also… there’s the problem of names… There’s at least one father/son pair with the same name, John McManus. How do I know runners from different years are the same person?

And, greatest horror of all, I find another “George Williams” was running here in the late 70s.

And then there are the in-jokes of the time which I don’t understand now… What was the “Infamous Centipede” and what was it doing in our races in the mid-1980s?

2 Responses to “Scanning the past”

  1. Ralph Says:

    I think your eventual old race postings should be interesting & fun to study.
