Designing a URL shortener
Design a system to take user-provided URLs and transform them to a shortened URLs that redirect back to the original. Describe how the system works. How would you allocate the shorthand URLs? How would you store the shorthand to original URL mapping? How would you implement the redirect servers? How would you store the click stats?
Assumptions: I generally don't include these assumptions in the initial problem presentation. Good candidates will ask about scale when coming up with a design.
- Total number of unique domains registering redirect URLs is on the order of 10s of thousands
- New URL registrations are on the order of 10,000,000/day (100/sec)
- Redirect requests are on the order of 10B/day (100,000/sec)
- Remind candidates that those are average numbers - during peak traffic (either driven by time, such as 'as people come home from work' or by outside events, such as 'during the Superbowl') they may be much higher.
- Recent stats (within the current day) should be aggregated and available with a 5 minute lag time
- Long look-back stats can be computed daily
- Targeted use cases: The shortened URLs are to be copied-and-pasted only.
- The URLs won't be typed in via a keyboard. Therefore, don't worry about distinguishing between
0ando,land1, etc. - The URLs won't be spelled out verbally. Therefore, there's no need to make the shortened URLs phonetic.
- The URLs won't be typed in via a keyboard. Therefore, don't worry about distinguishing between
Assumptions
1B new URLs per day, 100B entries in total the shorter, the better show statics (real-time and daily/monthly/yearly)
Encode Url
http://blog.codinghorror.com/url-shortening-hashes-in-practice/
Choice 1. md5(128 bit, 16 hex numbers, collision, birthday paradox, 2^(n/2) = 2^64) truncate? (64bit, 8 hex number, collision 2^32), Base64.
- Pros: hashing is simple and horizontally scalable.
- Cons: too long, how to purify expired URLs?
Choice 2. Distributed Seq Id Generator. (Base62: az, AZ, 0~9, 62 chars, 62^7), sharding: each node maintains a section of ids.
- Pros: easy to outdate expired entries, shorter
- Cons: coordination (zookeeper)
KV store
MySQL(10k qps, slow, no relation), KV (100k qps, Redis, Memcached)
A great candidate will ask about the lifespan of the aliases and design a system that purges aliases past their expiration.