Yesterday, I wrote about our new private URL shortener at u.gleeson.us. In the subsequent comments to that post, I explained more about how it works. But there are fascinating theoretical implications revolving around the whole topic. For instance, just how many URLs can this application hold before it runs out of room?
Before answering such a question, one must first define precisely what “running out of room” might mean, but as you’ll see, by any definition, the number of URLs we can shorten is… rather a lot.
Here are the facts. The URLs are all stored in a MySQL database table with two fields: ‘id’ and ‘url’. The ‘url’ field holds the text of the URL and the ‘id’ field is a unique integer to identify each one. The “shortened” URLs are really just references to those ID numbers. They all take the form of
http://u.gleeson.us/[#]
where [#] stands for the ID number. But the shortened URLs aren’t decimal numbers. They are written in a custom base-52 numbering system, so each place can have 52 distinct values. Here are the 52 characters this system uses (in order, from 0 to 51):
23456789-bcdfghjklmnpqrstv
wxyz_BCDFGHJKLMNPQRSTVWXYZ
So, in this system, the character ‘2’ is really 0, the ‘Z’ is 51, and all the others are the values between: the ‘k’ is 16, the ‘G’ is 35, and so forth. So the shortened URL
http://u.gleeson.us/G
would really be a reference to the URL with ID number 35 in our database. (We don’t have that many yet, so don’t try it.) We could theoretically encode 52 URLs without even needing a second digit, but in practice, it’s really only 51, because the ID numbers in the database start at 1, not 0.
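For anyone who wants to play along at home, here is a minimal Python sketch of the conversion described above. It is only an illustration, not the code the shortener actually runs: the names `ALPHABET`, `encode`, and `decode` are made up for this example, and the only thing taken from the real system is the 52-character alphabet and its ordering.

```python
# The 52 characters from the post, in order from value 0 to value 51.
ALPHABET = "23456789-bcdfghjklmnpqrstvwxyz_BCDFGHJKLMNPQRSTVWXYZ"
BASE = len(ALPHABET)  # 52


def encode(number: int) -> str:
    """Turn a non-negative database ID into its base-52 short code (hypothetical helper)."""
    if number < 0:
        raise ValueError("IDs must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number > 0:
        # Peel off the least significant base-52 digit each time around.
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    # Digits were collected least significant first, so reverse them.
    return "".join(reversed(digits))


def decode(code: str) -> int:
    """Turn a base-52 short code back into a database ID (hypothetical helper)."""
    number = 0
    for character in code:
        # str.index raises ValueError for characters outside the alphabet.
        number = number * BASE + ALPHABET.index(character)
    return number


print(encode(35))   # G
print(decode("k"))  # 16
```

Under these assumptions, ID 52 is the first one to need a second digit: it encodes as "32", meaning one 52 plus zero.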
Of course, there’s no reason to limit ourselves to just one digit, but there must be some number of digits that we don’t want to exceed, and that number will be the key to answering the question of how many URLs we can hold before we’re full. The formula is
Q = 52 ^ D - 1
or, the maximum quantity Q equals 52 raised to the power of the number of digits D, minus one. As we’ve seen, when D = 1, Q = 51. But Q grows exponentially as you add digits.
2 digits => 2,703 URLs
3 digits => 140,607 URLs
4 digits => 7,311,615 URLs
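The figures above are easy to check. Here is a short Python sketch of the formula Q = 52 ^ D - 1; the function name `capacity` is just a label chosen for this example.

```python
def capacity(digits: int, base: int = 52) -> int:
    """Largest ID expressible in at most `digits` places, i.e. base ** digits - 1.

    The "minus one" reflects that ID 0 is never used, since IDs start at 1.
    """
    return base ** digits - 1


for d in range(1, 5):
    print(f"{d} digits => {capacity(d):,} URLs")
```

Run as written, it prints the one-digit figure of 51 followed by the three lines listed above.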
So we have already exceeded (by far) the number of URLs that Phoebe and I could ever possibly need to shorten, and the system is nowhere near full. There’s no theoretical or practical reason to stop at four digits. The lowest “limit” I can rationalize is 5 digits, because that would make a URL of 25 characters (the first 20 characters are used for the “http://u.gleeson.us/” part), and 25 seems like a pretty good cutoff point. (For instance, Twitter automatically truncates URLs longer than 25 characters.)
So, this will be my answer to the question, how many URLs can u.gleeson.us hold before it runs out of room? It’s at least 52 to the fifth power minus one, or 380,204,031. I doubt that we will ever reach this limit, but even if we do, I can always build us a new shortener at a different URL.
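The arithmetic behind that answer can be sketched in a few lines of Python. The 25-character cutoff is the one argued for above; the variable names are invented for this example.

```python
PREFIX = "http://u.gleeson.us/"
MAX_URL_LENGTH = 25  # the cutoff chosen in this post, not a hard technical limit
BASE = 52

# Characters left over for the ID once the fixed prefix is accounted for.
id_digits = MAX_URL_LENGTH - len(PREFIX)

# IDs start at 1, so the all-"zero" code is never used: subtract one.
max_urls = BASE ** id_digits - 1

print(f"prefix length: {len(PREFIX)}")
print(f"digits available: {id_digits}")
print(f"maximum URLs: {max_urls:,}")
```

Changing `MAX_URL_LENGTH` shows how quickly the ceiling rises: every extra character multiplies the capacity by roughly 52.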
10:34 PM
Sweet, looking good with recaptcha! The previous comments on this thread seem to have disappeared though.
10:37 PM
Never mind, I am dumb. I saw “URL shortener” in the headline and assumed this was the same thread as the earlier one.
10:40 PM
Wow, 380 million! A short URL in every pot in America!
10:46 PM
I got a 500 error while posting #3. It’s probably just gremlins, but you might want to check the logfile and see what it had to say. “grep ^208.78.97.96 /var/log/apache/error.log | grep 500” or something like that.
10:47 PM
I got a 500 error while posting #3, and another posting this comment the first time I tried. It’s probably just gremlins, but you might want to check Apache’s logfile and see what it had to say.
10:48 PM
Oops, I guess it worked after all. I didn’t get any error when I posted #5.
11:03 PM
You know, I was getting 500 errors myself earlier this evening when publishing template changes, then trying again and everything worked. When I checked the logs, it said something about CGI scripts missing their headers, so I suspect it was a Perl parsing thing. I think my server is having some issues tonight that have nothing to do with us. (I share it with other fine DreamHost customers, you know.)
11:06 PM
Ah, it must be DreamHost killing off MT’s Perl scripts when they run for too long, in order to preserve capacity for everyone else on the server.
11:11 PM
Ironic, given that the main reason I went with MT rather than WP was to put less strain on the server.
I reckon I’ll leave the blog alone until morning, and see if it’s feeling better.
11:16 PM
Well, at least MT only gets hammered when someone is doing something that causes a database write. WordPress gets hammered and falls down when people are doing nothing more than reading the page!
12:32 PM
Wow. For this comment, I didn’t sign in with my MT account, I signed in with my Facebook account! So now, anyone can sign in with a Facebook id, and bypass the CAPTCHA and everything.
I rock.
3:41 PM
Ev says he’s been trying to comment, but the reCAPTCHA isn’t working. So I’m testing it again. If you see this comment, it worked for me.
7:47 PM
Testing it out…
7:48 PM
Yay!
8:15 PM
Sean, I’m trying to make contact with you, but I can’t find a working e-mail address! Could you respond, please?
9:01 PM
Sorry, Anna. I still haven’t finished building all the pages on this site. There will definitely be a full complement of contact information soon. You may email me at sean@gleeson.us