Skip to main content

Designing a KV store with external storage

· 2 min read

Requirements

  1. Data size: Data size of values is too large to be held in memory, and we should leverage the external storage for them. However, we can still keep the data keys in memory.
  2. Single-host solution. No distributed design.
  3. Optimize for write.

Solution

  • In-memory hashmap index + index hint file + data files
  • Append-only for write optimization. Have only one active data file for write. And compact active data to the older data file(s) for read.

Components

  1. In-memory HashMap<Key, <FildId, ValueOffset, ValueSize, Timestamp>>

  2. Data file layout

|crc|timestamp|key_size|value_size|key|value|
...
  1. (index) hint file that the in-memory hashmap can recover from

Operations

Delete: get the location by the in-memory hashmap, if it exists, then go to the location on the disk to set the value to a magic number.

Get: get the location by the in-memory hashmap, and then go to the location on the disk for the value.

Put: append to the active data file and update the in-memory hash map.

Periodical compaction strategies

  • Copy latest entries: In-memory hashmap is always up-to-date. Stop and copy into new files. Time complexity is O(n) n is the number of valid entries.

    • Pros: Efficient for lots of entries out-dated or deleted.
    • Cons: Consume storage if little entries are out-dated. May double the space. (can be resolved by having a secondary node do the compression work with GET/POST periodically. E.g., Hadoop secondary namenode).
  • Scan and move: foreach entry, if it is up-to-date, move to the tail of the validated section. Time complexity is O(n) n is the number of all the entries.

    • Pros:
      • shrink the size
      • no extra storage space needed
    • Cons:
      • Complex and need to sync hashmap and storage with transactions. May hurt performance.

Following up questions

  • How to detect records that can be compacted?
    • Use timestamp.
  • What if one hashmap cannot fit into a single machine’s memory?
    • Consistent hashing, chord DHT, query time complexity is O(logn) with the finger table, instead of O(1) here with a hashmap.

Exactly What to Say: Keywords for Impacts

· 4 min read

These keywords and templates of sentences help you influence people.

  • I’m not sure it’s for you, but ...

    • recommend in a non-intrusive way
  • Open-minded

    • "Are you open-minded to do something?" this encourages people to do something.
    • or if you are criticizing something or someone but still want to show the empathy, you can say "I am helping someone be open-minded."
  • What do you know about

  • How would you feel if?

    • How are people motivated?
      • Avoid a loss
      • Aquire a potential gain
    • Emotion comes first than logic
    • People make decisions based on what feels right first. ==Interestingly, when we make decisions for ourselves, we should avoid harmful emotions. (By Ray Dalio)==
  • Just imagine

    • Creating pictures in the minds of others is done by telling stories.
  • When would be a good time?

    • One of the biggest reasons your ideas fail to get heard is that others tell you they just don’t have the time to consider them.
    • The preface prompts the other person to assume that there will be a good time and that no is not an option.
  • I’m guessing you haven’t got around to

    • By pushing for the negative scenario, you get people to rise to the positive or to tell you how they are going to fix the thing they said they were going to do.
  • Simple Swaps

    • Do you have any questions? => What questions do you have for me? When emphasizing “questions”, instead of “you” (the audience), then the audience will ask less questions.
  • As I see it, you have three options ... Of those three options, what’s going to be easier for you?

  • There are two types of people, ...

    • This may help people make final decisions.
  • I bet you’re a bit like me

    • It often results in the other person comfortably agreeing with you.
  • If … then …

    • people like to hear something with logic behind, no matter if it really makes sense…
  • Don’t worry

    • it’s particularly useful in high-stress scenarios, when confronted with someone who is panicked - it puts people at ease.
  • Most people

    • When you tell people what most people would do, their brains says, “I’m most people, so perhaps that is what I should do too.“
  • The Good News

    • “The good news is ….” causes people to face forward with optimism and zap any negative energy out of the conversation
    • by bring more positivity to situations with “the good news is …” and responding with, “that’s great,” you soon start shifting the balance in people’s thoughts.
  • What happens next

    • finishing a process with a question that is effortless to answer is the key to gaining a rapid response and a positive outcome.
      • the easier the question is to answer, the easier you gain your decision.
  • What makes you say that

    • success in negotiating is all about maintaining control in a conversation, and ==the person in control is always the person who is asking the questions.==
    • so when we get objections like
      • I haven’t got the time
      • It’s the wrong time
      • I want to shop around
      • I haven’t got the money right now
      • I need to speak to somebody else before I make a decision about this.
    • by treating every objection you face as nothing more than a question, you can quickly regain control of the conversation by asking a question in return.
  • before you make your mind up

    • fight for the last chance before you say “no”.

Ryan Holiday: How User Growth Begins with PMF (Product-Market Fit)

· One min read

Four Steps of Growth Hacking Marketing

  1. Capture and expand PMF (Product-Market Fit) during the product development phase.
  2. Identify and nurture seed users.
  3. Embed viral growth factors.
  4. Support with data, aiming for product optimization, and repeat the above steps.

How User Growth Begins with PMF

  1. ==Product-Market Fit== refers to the degree to which a product meets strong market demand.

  2. Start with the simplest viable product and improve based on user feedback.

  3. Use the data and information obtained to support the enhancement of PMF.

  4. Understand customer needs as early as possible.

    1. For example, Amazon employees provide internal newsletters to gather feedback before a project begins.
    2. For example, Werner Vogels suggests writing FAQs/key user experiences/user manuals for the product you are developing = concept + operation + reference.
  5. Find answers using the Socratic method.

    1. Who is this product for? Why would they use it? And why would I use it?
    2. What made you fall in love with this product? What prevents you from introducing the product to others? What is missing from this product? And what are its highlights?

From Good to Great

· One min read

Leading a company from good to great is equivalent to driving a massive flywheel to achieve ==breakthroughs==

  1. Disciplined and well-trained people
    1. Level 5 Leadership: Great leaders > Effective leaders > Competent managers > Contributing team members > Capable individuals
    2. First Who, Then What
  2. Disciplined and well-trained thoughts
    1. Confront the brutal facts
    2. Be a hedgehog first, then a fox
  3. Disciplined and well-trained actions
    1. A culture of discipline
    2. Technology accelerates the engine of growth

Good to great

· One min read

Leading a company to leap from good to great = pushing a giant flywheel to ==breakthrough== with

  1. Disciplined People
    1. Level 5 leadership: executive > effective leader > competent manager > contributing team member > highly capable individual
    2. First who then what
  2. Disciplined Thought
    1. Confront the brutal facts
    2. Be a fox after being a hedgehog
  3. Disciplined Action
    1. Culture of discipline
    2. Technology accelerators

Ryan Holiday: Finding your growth hack

· 2 min read
  1. Target a few hundred or a thousand key people, not millions.

    1. e.g., Dropbox began with a fun demo video in the initial launch. People can register but in a waiting list to use it. Use something ==new and exciting== to attract users.
    2. e.g., eBay in 2012 partnered with Gogo to provide free wifi access to ebay.com during flights. The brilliant part is that it can track the data to see whether it is beneficial and thus they can continue the partnership.
  2. Do not target all people - target the right people

    1. e.g., Uber offered free rides for Austin’s SXSW conference for several years, which attracts thousands of tech-obsessed, high-income young adults.

    2. Hacks

      • Pitch media websites to write about us.
      • Post in Hacker News, Quora, Reddit.
      • Write blogs.
      • Kickstarter.
      • www.helpareporter.com to connect to reporters.
      • Invite users for free or with some incentives.
    3. ==Stunts==

      • Create the aurora of exclusivity with “invite-only”
      • Create fake users to make it more actively than it is. (Reddit did this)
      • Catering to a single platform exclusively (PayPal and eBay)
      • Launching for users group by group (Facebook and colleges)
      • Bringing on influential people for their audience and fame
      • Sub-domain on the e-commerce site to donate (Amazon)
  3. Focusing on new user sign-ups (acquisition) instead of awareness.

  4. Growth Techniques = marketing + engineering

    1. e.g., Airbnb made tools to make cross-posts to Craigslist.
    2. Sean Ellis: “Focusing on customer acquisition over ‘awareness’ takes discipline… At a certain scale, awareness/brand building makes sense. However, for the first year or two it’s a total waste of money.”
    3. In-effective actions
      1. Big blowout launch
      2. Build it and they will come. (Aaron Swartz: users have to be pulled in.)

Ryan Holiday: Attracting and Nurturing Seed Users

· 2 min read
  1. Target a few hundred or a thousand key individuals, rather than millions
    1. For example, Dropbox started its initial launch with an engaging demo video. People could sign up but had to wait to use it. Attract users with something ==novel and exciting==.
    2. Similarly, in 2012, eBay partnered with Gogo to provide free Wi-Fi access to ebay.com during flights. The clever part was tracking data to determine whether it was beneficial to continue the partnership.
  2. Don’t target everyone - focus on the right people
    1. For instance, Uber provided free rides for years during the South by Southwest conference in Austin, attracting thousands of young, high-income tech enthusiasts.
    2. Tips
      • Persuade media outlets to write about you
      • Post on Hacker News, Quora, and Reddit
      • Write blogs
      • Use Kickstarter for crowdfunding
      • Contact journalists through www.helpareporter.com
      • Invite users for free or with some incentives
    3. ==Big tricks==
      • Create exclusivity with “invitation-only” hunger marketing
      • Generate fake users to make it appear more active. (Reddit used this approach)
      • Focus on a single platform (PayPal and eBay)
      • Spread from one user group to another (Facebook and universities)
      • Attract influencers because they have a broad audience and good reputation
      • Make charitable donations on subdomains of e-commerce sites (Amazon)
  3. Focus on new user registrations (acquisition) rather than brand awareness
  4. Growth hacking = marketing + engineering
    1. For example, Airbnb created tools while cross-posting on Craigslist.
    2. Sean Ellis once said: “Staying focused on customer acquisition rather than 'building brand awareness' often requires restraint... Certainly, once a company reaches a certain scale, brand awareness/branding makes sense. But in the first year or two, it’s just a complete waste of money.”
    3. Ineffective actions
      1. Grand launches
      2. Wishfully thinking that “the best way to attract users is to let the product speak for itself” (while Aaron Swartz believed that users must be attracted to come).

Ryan Holiday: How to begin with PMF

· One min read

4 Steps of Growth Hacking

  1. Begin with the product-market fit (PMF)
  2. Finding your growth hack
  3. Going viral
  4. Close the loop

How to begin with PMF?

  1. ==Product/market fit== is the degree to which a product satisfies a strong market demand.

  2. Start with MVP and evolve with feedbacks

  3. Use data and information to back PMF.

  4. Understand the needs of the customers as early as possible

    1. e.g., Amazon employees give internal press release before developing the project to collect feedbacks.
    2. e.g., Werner Vogels suggests writing FAQs for the product you’re developing / critical UX / user manual = concepts + how-to + reference
  5. Develop answers with the Socrates method

    1. Who is this product for? Why would they use it? Why do I use it?
    2. What is it that brought you to this product? What is holding you back from referring other people to it? What’s missing? What’s golden?

Designing a URL Shortener System

· 5 min read

Design a system that can convert URLs provided by users into short URLs, allowing users to access their original URLs (hereinafter referred to as long URLs) using these short URLs. Describe how this system operates, including but not limited to the following questions: How are short URLs allocated? How is the mapping between short URLs and long URLs stored? How is the redirection service implemented? How is access data stored?

Assumptions: The initial problem description does not include these assumptions. An excellent candidate will ask about system scale when given a specific design.

  • There are approximately tens of thousands of long URL domains.
  • The traffic for new long URLs is about 10,000,000 per day (100 per second).
  • The traffic for the redirection service using short URLs to access long URLs is about 10 billion per day (100,000 per second).
  • Remind the candidate that these are average figures - during peak times, these numbers can be much higher (one type of peak time is time-related, such as when users return home from work, and another type is event-related, such as during the Spring Festival Gala).
  • Recent data (e.g., today's data) should be collected in advance and should be available within five minutes when users want to view it.
  • Historical data should be calculated daily.

Assumptions

1 billion new URLs per day, 100 billion short URL accesses. The shorter the short URL, the better. Data presentation (real-time/daily/monthly/yearly).

URL Encoding

http://blog.codinghorror.com/url-shortening-hashes-in-practice/

Method 1: md5 (128 bits, 16 hexadecimal digits, collisions, birthday paradox, 2^(n/2) = 2^64) Shorter? (64 bits, 8 hexadecimal digits, collisions 2^32), base 64.

  • Advantages: Hashing is relatively simple and easy to scale horizontally.
  • Disadvantages: Too long, how to handle expired URLs?

Method 2: Distributed ID generator. (Base 62: az, AZ, 0~9, 62 characters, 62^7), partitioning: each node contains some IDs.

  • Advantages: Easier to eliminate expired URLs, shorter URLs.
  • Disadvantages: Coordination between different partitions (e.g., ZooKeeper).

Key-Value (KV) Storage

MySQL (10k requests per second, slow, no need for a relational database), key-value (100k requests per second, Redis, Memcached).

An excellent candidate will ask about the expected lifespan of short URLs and design a system that can automatically clean up expired short URLs.

Follow-Up

Question: How to generate short URLs?

  • A poor candidate might suggest using a single ID generator (single point of failure) or require coordination between ID generators for each ID generation. For example, using an auto-increment primary key in a database.
  • An acceptable candidate might suggest using md5 or some UUID generators that can generate IDs independently on some nodes. These methods can generate non-colliding IDs in a distributed system, allowing for the production of a large number of short URLs.
  • An excellent candidate will design a method using several ID generators, where each generator first reserves a block of ID sequences from a central coordinator (e.g., ZooKeeper), and these ID generators can allocate IDs from their ID sequences independently, cleaning up their ID sequences when necessary.

Question: How to store the mapping between long URLs and short URLs?

  • A poor candidate might suggest using a single, non-distributed, non-relational database. It is merely a simple key-value database.
  • An excellent candidate will suggest using a simple distributed storage system, such as MongoDB/HBase/Voldemort, etc.
  • A more excellent candidate will ask about the expected usage cycle of short URLs and then design a system that can clean up expired short URLs.

Question: How to implement the redirection service?

  • A poor candidate will design the system from scratch to solve problems that have already been solved.
  • An excellent candidate will suggest using an existing HTTP server with a plugin to translate the short URL ID, look up this ID in the database, update access data, return a 303 status, and redirect to the long URL. Existing HTTP servers include Apache/Jetty/Netty/Tomcat, etc.

Question: How to store access data?

  • A poor candidate will suggest writing to the database on every access.
  • An excellent candidate will suggest having several different components handle this task: generating access stream data, collecting and organizing it, and writing it to a permanent database after a certain period.

Question: How to separate the different components of storing access data proposed by the excellent candidate?

  • An excellent candidate will suggest using a low-latency information system to temporarily store access data and then hand the data over to the collection and organization component.
  • The candidate may ask how often access data needs to be updated. If updated daily, a reasonable method would be to store it in HDFS and use MapReduce to compute the data. If near-real-time data is required, the collection and organization component must compute the necessary data.

Question: How to block access to restricted websites?

  • An excellent candidate will suggest maintaining a blacklist of domains in the key-value database.
  • A good candidate might propose some advanced technologies that can be used when the system scales significantly, such as bloom filters.

Improving System Availability through Failover

· 2 min read

Failover: Failover is a backup operational mode used to enhance system stability and availability. When the primary component fails or is scheduled for downtime, the functions of system components (such as processors, servers, networks, or databases) are transferred to secondary system components.

Cold Backup: Cold backup refers to copying critical files to another location, using features or metrics/alerts to track failures. The system provides a new standby node in the event of a failure; however, cold backup is only suitable for stateless services. For backing up Oracle databases, cold backup is the fastest and safest method.

Hot Backup: This involves maintaining two active systems that share the same task roles, meaning the system operates normally while providing backup. The data between the two systems is nearly mirrored in real-time and contains the same information.

Warm Backup: This keeps two active systems, where the secondary system does not consume traffic unless a failure occurs.

Checkpoint (or similar to Redis snapshots): The system uses write-ahead logging (WAL) to record requests before processing tasks. The standby node recovers from the log during failover.

  • Disadvantages
    • A large amount of log recovery can be time-consuming
    • Data loss since the last checkpoint
  • User Cases: Storm, WhillWheel, Samza

Dual-host (or all-host) mode: This keeps two active systems behind a load balancer. The hosts operate in parallel, and data replication is bidirectional.