Good Metrics by Lean Analytics

· 2 min read

Every aspiring entrepreneur should beware of the deadly pitfall of building something that nobody wants. That is why the right kind of analytics is so necessary. The book Lean Analytics introduces good metrics that help start-up founders navigate the unknown and assess their progress.

Data-driven in the right direction

Data is vital to business. Entrepreneurs need data to convince others that their ideas will work. Entrepreneurs tend to overestimate their own success, but data does not lie; it keeps founders grounded in reality. However, personal judgment about which data to pursue also matters. Don't become a mere slave to the numbers.

What are good metrics?

In order to stay data-informed, you need to find some metrics which can provide meaningful data. Good metrics have three characteristics:

  • Comparable: a good metric can be compared across time periods, groups of users, and so on
  • Understandable: a good metric is simple and easy to comprehend
  • A ratio or rate: ratios (e.g., revenue per customer) are inherently comparable and more actionable than absolute numbers

Five distinct stages by the Lean Analytics framework

The Lean Analytics framework suggests a start-up will go through five stages:

  • Empathy — identify a need that people have / identify your niche market
  • Stickiness — figure out how to satisfy the need with a product
  • Virality — add features that attract people and help the product spread
  • Revenue — the business starts to grow and generate revenue
  • Scale — expand or break into new markets

Focus on one metric

To achieve success, founders must focus on the One Metric That Matters, the single metric most critical at their current stage. Knowing which metric matters most prevents you from getting lost in a sea of data.

What’s the best metric?

There is no single best metric in general; it differs by industry. For e-commerce companies, the most important metric is revenue per customer, whereas for media sites it is the click-through rate.

From Uber Layoffs: To Build Wheels or Not?

· 4 min read

Between 2014 and 2018, Uber built several "wheels," such as the service discovery tool Hyperbahn, the task queue Cherami, the MySQL-based NoSQL Schemaless, the resource scheduler Peloton, and the service deployment platform uDeploy, among others. Now, with layoffs affecting even engineering teams and stock prices falling below 15-year valuations, were these "wheels" a success or a failure? Should startups hire people to build wheels, or should they adopt existing solutions?

Management is a pyramid: it is people lifting people, and individuals rise through the support of others. The first lesson in political acumen for any manager is to recruit as many people as possible. Headcount at VC-funded companies is also an interesting metric in its own right, because investors who lack technical knowledge often believe that companies with more heads naturally perform better.

Thus, many individuals have a motivation to hire more people. But how do we measure the legitimacy of this motivation? This is relatively straightforward for mechanical work, such as in factories, where output is directly measurable. However, it’s less clear for intellectual work, especially coding. Bill Gates once said that measuring development progress by lines of code is akin to measuring aircraft manufacturing progress by weight. I’ve even heard that Google has dedicated teams to calculate each group's contribution to the company.

Managers like to hire, while engineers enjoy building wheels. On one hand, creators inherently relish the joy of creation; on the other hand, engineers may develop an unhealthy ego, feeling that using someone else's technology implies their own skills are lacking. Managers supply the "what we want," engineers supply the "what we want to build," and the product is the resultant of these two forces.

For instance, a traditional retail company hired a new CTO from Silicon Valley, who began hiring a large number of engineers for projects and insisted that once good talent was found, useful projects would follow. He also wanted to package some internal software as a service, even though these services still ran on a mainframe. The CTO genuinely believed in this approach. The key question here is whether these efforts yield a positive ROI (return on investment).

If ROI cannot be known in advance, how can we effectively balance hiring on demand and avoiding resource waste? The answer lies in focusing on "building prototypes for proof of concept (POCs)." Test the waters with minimal investment; if it works, hire more people; if it doesn’t, don’t hire.

If ROI can be known in advance, then the answer becomes a simple arithmetic problem. For example, if HipChat charges $5 per person per month, and Uber has 60,000 full-time employees and contractors, then the monthly service cost would be $300,000. In contrast, hiring one engineer to maintain a fork of the open-source Mattermost would cost only about $30,000 per month. Thus, "building the wheel" could cut the cost to roughly one-fifth to one-tenth of the original.
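
A back-of-the-envelope sketch of this arithmetic in JavaScript (the per-seat price and engineer cost are the rough figures above, not audited numbers):

// Buy: a hosted chat service priced per seat.
const seats = 60000;                  // full-time employees and contractors
const pricePerSeat = 5;               // USD per person per month
const buyCost = seats * pricePerSeat; // => 300000 USD per month

// Build: one engineer maintaining a fork of open-source Mattermost.
const buildCost = 30000;              // rough fully loaded cost, USD per month

console.log(buyCost, buildCost, buildCost / buyCost);
// => 300000 30000 0.1 (building costs about one-tenth of buying)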

There are also companies that have built many wheels and thrived, where a strong management and engineering culture plays a crucial role. They advocate simplicity and technical responsibility: if an external wheel offers a specialized, mature solution, they adopt it; if it is overly complex and uncontrollable, they build their own. I recall that one significant reason Uber did not adopt Apache Cassandra immediately was the lack of internal experts, which made the technology unpredictable for them.

The principle of simplicity does not conflict with attention to detail. For example, you might first choose an expensive, cumbersome ERP from Microsoft, SAP, or Oracle, and then write some services yourself for areas that require special handling closely tied to your business, ensuring those services are concise, efficient, and easy to maintain. Conversely, many new-generation startups fail in ERP because they neglect details, even failing to implement the most critical "audit" functions, such as double-entry bookkeeping in accounting.

Charles Handy: The Second Curve

· 2 min read

By the time you know where you should go, it is often too late to get there; yet if you always keep to your original path, you will miss the road to the future.

Charles Handy illustrates this with the directions he was once given to Davy's Bar: half a mile before the bar, turn right and go up the hill. By the time he realized he had taken the wrong road, he had already arrived at Davy's Bar.

The growth curve is usually S-shaped, and we call it the S-curve or sigmoid curve. To keep overall growth high, you have to start developing your second S-curve while you still have the time and resources to invest.

Intel's CPUs, Netflix's video streaming, Nintendo's gaming, and Microsoft's cloud are all excellent examples of second-curve-driven businesses.

Finding and catching the second curve takes vision and execution. You have to take in more information and continuously sift through it to identify the best opportunities. Then, once a chance is identified, you need a reliable team to fight the battle and figure out whether it really works.

What made you succeed may not make you succeed again; there is always a limit to growth. The second-curve theory helps us reflect on why and how to embrace change and live a more thriving life.

Ten Reasons to Fail at Growth

· 3 min read

Facebook's VP of Growth, Alex Schultz, once discussed with Mark Zuckerberg why they succeeded. The answer is not that they are exceptionally smart or experienced, but that they work incredibly hard and execute effectively. Compared with execution, the ideas behind growth are secondary: everyone understands the reasoning; the difference lies in who can execute quickly.

Execution is challenging, and there are ten reasons why growth execution fails.

  1. Not starting with retention. Growth without retention is like a fire in a wheat field: it flares up and then burns out. Without retention, there is no Product-Market Fit (PMF). A sign of achieving PMF is that the retention curve in a cohort analysis flattens out.

  2. Believing the product is everything. Based on this misconception, people tend to mistakenly focus on "doing more" with the product rather than "doing better" with the existing product. Growth is a process of "doing better." Builders love to create new things, but as a leader, you need to ensure they are at least partially accountable for the results.

  3. Looking for a silver bullet. Great products are polished through time and effort spent on details, not conjured up like magic. Good ideas are a byproduct of having many ideas; you can't control the outcome of finding good ideas, but you can create a process that allows more good ideas to emerge.

  4. Lack of focus. Focus means felling one tree at a time, not chopping at everything in sight. How do you break through the threshold effect here? Remember two points: 1) most companies' growth comes primarily from a single channel, and 2) only a few methods scale; choose one.

  5. Insufficient data and analysis. The challenge here is that it's hard to quantify the output of data analysis, so you must firmly believe that this is very valuable, as it enables you to make the right choices.

  6. Not enough experimentation, far from it. HubSpot ran thousands of experiments in just six months.

  7. Not asking why. When an experiment ends poorly, teams often just move on to the next one without asking why.

  8. Not doubling down on successes. If you find a channel that works exceptionally well and hasn't been fully utilized, continue to invest in it. Zynga discovered that a virtual gift in one game was highly profitable and that viral marketing worked exceptionally well, so they immediately added this feature to all their games.

  9. Insufficient resource allocation. Growth requires dedicated teams to focus on it.

  10. Unable to embrace change. A company's growth typically goes through three stages: Traction, Transition, Growth. The reasons for success in one stage won't necessarily help you succeed in the next stage.

The Four Fits Model for a $100 Million Business

· 2 min read

Question: For Hubspot's freemium and fully automated (touchless) software business, how can one achieve the highest growth in the least amount of time while being VC-backed?

Solution: The Four Fits Model identifies four interrelated elements that drive company growth: product, market, channel, and model. The author believes these four factors are interconnected and must align with each other.

PMF: Product-Market Fit. There are two types of companies in the world: tailwinds companies and headwinds companies, with the distinction being PMF. Achieving PMF means your product has a group of users who repeatedly engage over time, generating enough profit to support continued growth; it means your product has sticky users, resulting in a retention curve that may decline over time but ultimately levels off.

PCF: Product-Channel Fit. The attributes of the product itself determine the best channels for promotion. Simple, universal products correspond to inexpensive, mass-market channels, while complex, niche products correspond to specialized channels.

CMF: Channel-Model Fit, i.e., fit between customer acquisition and the business model. On the ARPU ↔ CAC spectrum, high ARPU corresponds to high CAC, and low ARPU corresponds to low CAC. The concern is a mismatch: if ARPU is set too high, users acquired through low-CAC channels cannot afford it; if ARPU is set too low, there won't be enough revenue to sustain high CAC.

MMF: Model-Market Fit. The goal is ARPU × total consumers in the market × the share of the market you can capture ≥ $100 million. If this inequality does not hold, you need to adjust your business model to raise pricing or target a broader user base.
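
A quick numeric check of this inequality (all three inputs are made-up numbers for illustration):

const arpu = 100;            // assumed USD per user per year
const marketSize = 10000000; // assumed total consumers in the market
const captureRate = 0.1;     // assumed share of the market you can win

const revenue = arpu * marketSize * captureRate; // => 100000000
console.log(revenue >= 100000000); // => true: the model-market fit holds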

When using this growth model, there are several key points to consider:

  1. The premise is a target VC-backed $100 million business.
  2. If you want to change one element, you must adjust others accordingly.
  3. The elements themselves are constantly evolving.
  4. It is best not to change too many elements at once; if you are unfamiliar with the related fields, you may struggle to manage the changes.

Designing Human-Centric Internationalization (i18n) Engineering Solutions

· 9 min read

Requirement Analysis

If you ask what the biggest difference is between Silicon Valley companies and Chinese ones, the answer is likely what Wu Jun said: Silicon Valley companies primarily target the global market. As Chen Zhiwu put it, the ability to create wealth can be measured along three dimensions: depth, meaning productivity, the ability to provide better products or services in the same amount of time; length, meaning the ability to leverage finance to exchange value across time and space; and breadth, meaning market size, the ability to create markets or new industries that transcend geographical boundaries. Internationalization, the adaptation of products and services to different languages and cultures, is thus a strategic key for multinational companies competing in the global market.

Internationalization, abbreviated as i18n (with 18 letters between the 'i' and the 'n'), aims to solve the following issues in the development of websites and mobile apps:

  1. Language
  2. Time and Time Zones
  3. Numbers and Currency

Framework Design

Language

Logic and Details

The essence of language is a medium for delivering a message to an audience; different languages are different media targeting different audiences. For example, to show the user the message "Hello, Xiaoli!", the process is: determine the user's locale, look up the message code in the translation table, and interpolate the required values, such as the username, to render the final message (a lookup sketch follows the table):

Message Code | Locale | Translation
home.hello   | en     | Hello, ${username}!
home.hello   | zh-CN  | 你好, ${username}!
home.hello   | iw     | שלום, ${username}!
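
A minimal sketch of this lookup-and-interpolate flow in JavaScript (the message table mirrors the one above; the t() helper is illustrative, not any particular framework's API):

const messages = {
  "en":    { "home.hello": "Hello, ${username}!" },
  "zh-CN": { "home.hello": "你好, ${username}!" },
  "iw":    { "home.hello": "שלום, ${username}!" },
};

// Look up the message code in the user's locale, then fill in the variables.
function t(locale, code, vars) {
  const template = messages[locale][code];
  return template.replace(/\$\{(\w+)\}/g, (_, name) => vars[name]);
}

console.log(t("zh-CN", "home.hello", { username: "小丽" })); // => "你好, 小丽!"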

Different languages may have slight variations in details, such as the singular and plural forms of an item, or the distinction between male and female in third-person references.

These are issues that simple table lookups cannot handle; they require more logic. In code, you can use conditional statements to handle these cases. Additionally, some internationalization frameworks provide Domain-Specific Languages (DSLs) specifically for such situations; for example, Project Fluent:
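
For instance, a plural rule in Fluent syntax looks roughly like this (an illustrative snippet in the style of Fluent's documentation, not taken from the original post):

shared-photos =
    { $userName } { $photoCount ->
        [one] added a new photo
       *[other] added { $photoCount } new photos
    }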

Another issue that beginners often overlook is the direction of writing. Common languages like Chinese and English are written from left to right, while some languages, such as Hebrew and Arabic, are written from right to left.

The difference in writing direction affects not only the text itself but also the input method. A Chinese person would find it very strange to input text from right to left; conversely, a Jewish colleague of mine finds it easy to mix English and Hebrew input.

Layout is another consideration. The entire UI layout and visual elements, such as the direction of arrows, may flip based on the language's direction, and your HTML needs to set the appropriate dir attribute, e.g., <html lang="he" dir="rtl">.

How to Determine the User's Locale?

You may wonder how we know the user's current language settings. In the case of a browser, when a user requests a webpage, there is a header called Accept-Language that indicates the accepted languages. These settings come from the user's system language and browser settings. In mobile apps, there is usually an API to retrieve the locale variable or constant. Another method is to determine the user's location based on their IP or GPS information and then display the corresponding language. For multinational companies, users often indicate their language preferences and geographical regions during registration.
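
A rough sketch of that resolution order on a Node.js server (the supported-locale list and fallback chain are simplified assumptions):

const SUPPORTED = ["en", "zh-CN", "iw"];

// Resolve the locale: explicit user preference > Accept-Language header > default.
function resolveLocale(userPref, acceptLanguage) {
  if (SUPPORTED.includes(userPref)) return userPref;
  // "zh-CN,zh;q=0.9,en;q=0.8" => ["zh-CN", "zh", "en"]
  const candidates = (acceptLanguage || "")
    .split(",")
    .map((part) => part.split(";")[0].trim());
  return candidates.find((c) => SUPPORTED.includes(c)) || "en";
}

console.log(resolveLocale(null, "zh-CN,zh;q=0.9,en;q=0.8")); // => "zh-CN"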

If a user wants to change the language, websites have various approaches, while mobile apps tend to have more fixed APIs. Here are some methods for websites:

  1. Set a locale cookie
  2. Use different subdomains
  3. Use a dedicated domain. Pinterest has an article discussing how they utilize localized domains. Research shows that using local domain suffixes leads to higher click-through rates.
  4. Use different paths
  5. Use query parameters. While this method is feasible, it is not SEO-friendly.

Beginners often forget to set the lang attribute on the html element when building websites.

Translation Management Systems

Once you have carefully implemented the display of text languages, you will find that establishing and managing a translation library is also a cumbersome process.

Typically, developers do not have expertise in multiple languages. At this point, external translators or pre-existing translation libraries need to be introduced. The challenge here is that translators are often not technical personnel. Allowing them to directly modify code or communicate directly with developers can significantly increase translation costs. Therefore, in Silicon Valley companies, translation management systems (TMS) designed for translators are often managed by a dedicated team or involve purchasing existing solutions, such as the closed-source paid service lokalise.co or the open-source Mozilla Pontoon. A TMS can uniformly manage translation libraries, projects, reviews, and task assignments.

This way, the development process becomes: first, designers identify areas that need attention based on different languages and cultural habits during the design phase. For example, a button that is short in English may be very long in Russian, so care must be taken to avoid overflow. Then, the development team implements specific code logic based on the design requirements and provides message codes, contextual background, and examples written in a language familiar to developers in the translation management system. Subsequently, the translation team fills in translations for various languages in the management system. Finally, the development team pulls the translation library back into the codebase and releases it into the product.

Contextual background is an easily overlooked and challenging aspect. Where in the UI is the message that needs translation? What is its purpose? If the message is too short, further explanation may be needed. With this background knowledge, translators can provide more accurate translations in other languages. If translators cannot fully understand the intended message, they need a feedback channel to reach out to product designers and developers for clarification.

Given the multitude of languages and texts, it is rare for a single translator to handle everything; it typically requires a team of individuals with language expertise from various countries to contribute to the translation library. The entire process is time-consuming and labor-intensive, which is why translation teams are often established, such as outsourcing to Smartling.

Now that we have the code logic and translation library, the next question is: how do we integrate the content of the translation library into the product?

There are many different implementation methods; the most straightforward is a static approach where, each time an update occurs, a diff is submitted and merged into the code. This way, relevant translation materials are already included in the code during the build process.

Another approach is dynamic integration. On one hand, you can "pull" content from a remote translation library, which may lead to performance issues during high website traffic. However, the advantage is that translations are always up-to-date. On the other hand, for optimization, a "push" method can be employed, where any new changes in the translation library trigger a webhook to push the content to the server.
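
A minimal sketch of the "push" variant with Express (the /i18n-webhook route and payload shape are hypothetical, not any specific TMS's API):

const express = require("express");
const app = express();
app.use(express.json());

let translations = {}; // in-memory copy of the translation library

// The TMS calls this endpoint whenever translators publish changes.
app.post("/i18n-webhook", (req, res) => {
  translations = { ...translations, ...req.body }; // merge updated entries
  res.sendStatus(204);
});

app.listen(3000);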

In my view, maintaining translations is more cumbersome than adding them. I have seen large projects become chaotic because old translations were not promptly removed after updates, leading to an unwieldy translation library. A good tool that ensures data consistency would greatly assist in maintaining clean code.

Alibaba's Kiwi internationalization solution has implemented a linter and VS Code plugin to help you check and extract translations from the code.

Time and Time Zones

Having discussed language, the next topic is time and time zones. As a global company, much of the data comes from around the world and is displayed to users globally. For example, how do international flights ensure that start and end times are consistent globally and displayed appropriately across different time zones? This is crucial. The same situation applies to all time-related events, such as booking hotels, reserving restaurants, and scheduling meetings.

First, there are several typical representations of time:

  1. Natural language, such as 07:23:01 AM, Monday, October 28, 2019 CST
  2. Unix timestamp (an integer), such as 1572218668
  3. Datetime. Note that MySQL converts TIMESTAMP values from the connection's time zone to UTC for storage and back again on retrieval, while DATETIME is stored as-is with no time zone information; servers are therefore generally set to UTC so that stored values can be treated as UTC by convention.
  4. ISO 8601 date, such as 2019-10-27T23:24:28+00:00, which includes time zone information.

I have no strong preference for these formats; if you have relevant experience, feel free to discuss it.

When displaying time, two conversions typically occur: converting the stored server time zone to the viewer's local time zone, and converting machine-readable time to natural language. A popular approach for the latter is to use a library for handling times and dates, such as moment.js or dayjs.
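
The built-in Intl API handles both conversions in one step; for example (the locale and time zone here are chosen arbitrarily):

const t = new Date(1572218668 * 1000); // the Unix timestamp from above

// Machine time -> natural language, in the viewer's locale and time zone.
const formatted = new Intl.DateTimeFormat("de-DE", {
  dateStyle: "full",
  timeStyle: "long",
  timeZone: "Europe/Berlin",
}).format(t);
console.log(formatted); // => roughly "Montag, 28. Oktober 2019 um 00:24:28 MEZ"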

Numbers and Currency

The display of numbers varies significantly across different countries and regions. The meaning of commas and periods in numbers differs from one country to another.

(1000.1).toLocaleString("en")
// => "1,000.1"
(1000.1).toLocaleString("de")
// => "1.000,1"
(1000.1).toLocaleString("ru")
// => "1 000,1"

Arabic numerals are not universal; for instance, with an Arabic locale, Java's String.format renders the digits 1, 2, 3 as ١, ٢, ٣ (Eastern Arabic numerals).

Regarding pricing, should the same goods be displayed in local currency values in different countries? What is the currency symbol? How precise should the currency be? These questions must be addressed in advance.

Conclusion

The internationalization tools mentioned in this article include translation management systems such as the open-source Mozilla Pontoon and the closed-source paid services lokalise.co and POEditor.com. For code consistency, Alibaba's Kiwi internationalization solution is worth a look. For UI display, consider libraries such as moment.js and dayjs.

Like all software system development, there is no silver bullet for internationalization; great works are crafted through foundational skills honed over time.

Designing a Load Balancer or Dropbox Bandaid

· 3 min read

Requirements

Internet-scale web services deal with high-volume traffic from around the world. However, a single server can serve only a limited number of requests at a time, so there is usually a server farm, or a large cluster of servers, to absorb the traffic together. The question is: how do we route requests so that each host receives and processes them evenly?

Since there are many hops and layers of load balancers from the user to the server, our design requirements this time are:

  • a layer-7 load balancer located internally in the data center
  • leverage real-time load information from the backend servers
  • handle tens of millions of requests per second and a throughput of 10 TB per second

Note: If Service A depends on (consumes) Service B, then A is the downstream service of B, and B is the upstream service of A.

Challenges

Why is it hard to balance loads? The answer is that it is hard to collect accurate load distribution stats and act accordingly.

Distributing-by-requests ≠ distributing-by-load

Random and round-robin algorithms distribute traffic by request count. However, the actual load is not uniform per request: some requests are heavy on CPU or threads, while others are lightweight.

To measure load more accurately, load balancers have to maintain local state: the observed numbers of active requests, connection counts, or request-processing latencies for each backend server. Based on that state, we can use distribution algorithms such as least-connections, least-time, and random N choices:

Least-connections: a request is passed to the server with the least number of active connections.

latency-based (least-time): a request is passed to the server with the least average response time and least number of active connections, taking into account weights of servers.

However, these two algorithms work well only with a single load balancer. If there are multiple load balancers, a herd effect may occur: all the load balancers notice that one server is momentarily faster and then all send requests to it.

Random N choices (where N=2 in most cases, a.k.a. the power of two choices): pick two servers at random and choose the better of the two, thereby avoiding the worst choice.
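
A sketch of the power-of-two-choices selection in JavaScript (the server list and the active-connections metric are placeholders):

// Pick two distinct servers at random and route to the less loaded one.
function pickServer(servers) {
  const i = Math.floor(Math.random() * servers.length);
  let j = Math.floor(Math.random() * (servers.length - 1));
  if (j >= i) j += 1; // ensure the second pick differs from the first
  const a = servers[i];
  const b = servers[j];
  return a.activeConnections <= b.activeConnections ? a : b;
}

const servers = [
  { host: "10.0.0.1", activeConnections: 12 },
  { host: "10.0.0.2", activeConnections: 3 },
  { host: "10.0.0.3", activeConnections: 7 },
];
console.log(pickServer(servers).host); // most likely "10.0.0.2"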

Distributed environments.

Local LB is unaware of global downstream and upstream states, including

  • upstream service loads
  • upstream service may be super large, and thus it is hard to pick the right subset to cover with the load balancer
  • downstream service loads
  • the processing time of various requests are hard to predict

Solutions

There are three options for collecting load stats accurately and then acting accordingly:

  • centralized & dynamic controller
  • distributed but with shared states
  • piggybacking server-side information in response messages or active probing

Dropbox Bandaid team chose the third option because it fits into their existing random N choices approach well.

However, instead of using local state as the original random N choices does, they use real-time global information supplied by the backend servers via response headers.

Server utilization: backend servers are configured with a max capacity and count ongoing requests, from which a utilization percentage between 0.0 and 1.0 is calculated.

There are two problems to consider:

  1. Handling HTTP errors: if a server fails requests fast, it attracts more traffic and fails even more.
  2. Stats decay: if a server's load is too high, no requests will be routed to it, and the server gets stuck with a stale stat. Bandaid uses a decay function shaped like an inverted sigmoid so that stale utilization readings fade over time (see the sketch below).
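
One way to sketch such a decay (the half-life and steepness constants are arbitrary assumptions; Dropbox's exact function is not given in this note):

// Blend a stale utilization reading back toward neutral (0.5) as it ages.
// The inverted sigmoid keeps fresh readings nearly intact, then sharply
// discounts them once they get old, so an overloaded server gets retried.
function decayedUtilization(reported, ageSeconds, halfLife = 30) {
  const weight = 1 / (1 + Math.exp((ageSeconds - halfLife) / (halfLife / 4)));
  return reported * weight + 0.5 * (1 - weight);
}

console.log(decayedUtilization(0.95, 0));  // ≈ 0.94 (fresh, trusted)
console.log(decayedUtilization(0.95, 60)); // ≈ 0.51 (stale, mostly ignored)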

Results: requests are more balanced

Concurrency Models

· One min read

  • Single-threaded - Callbacks, Promises, Observables and async/await: vanilla JS (see the sketch after this list)
  • threading/multiprocessing, lock-based concurrency
    • protecting critical section vs. performance
  • Communicating Sequential Processes (CSP)
    • Golang or Clojure’s core.async.
    • process/thread passes data through channels.
  • Actor Model (AM): Elixir, Erlang, Scala
    • asynchronous by nature, and have location transparency that spans runtimes and machines - if you have a reference (Akka) or PID (Erlang) of an actor, you can message it via mailboxes.
    • powerful fault tolerance by organizing actors into a supervision hierarchy, and you can handle failures at its exact level of hierarchy.
  • Software Transactional Memory (STM): Clojure, Haskell
    • like MVCC or pure functions: commit / abort / retry
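
To ground the first bullet, a tiny vanilla JS sketch of single-threaded concurrency with async/await (the URLs are placeholders):

// Both requests are in flight at once; the single-threaded event loop
// interleaves their continuations while each await is pending.
async function fetchBoth() {
  const [a, b] = await Promise.all([
    fetch("https://example.com/a").then((r) => r.text()),
    fetch("https://example.com/b").then((r) => r.text()),
  ]);
  return a.length + b.length;
}

fetchBoth().then(console.log).catch(console.error);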