Skip to main content

2 posts tagged with "software engineering"

View All Tags

Google's Software Engineering: Project Management

· 3 min read

20% Time

Engineers are allowed to spend 20% of their work time on any project they want to contribute to, without needing approval from their managers or others. This is highly valuable because:

  1. As long as there are good ideas, no matter how bad they sound at first, there is ample time to develop them to a demo-ready state.
  2. It allows managers to see activities they might not otherwise notice; otherwise, engineers might engage in "skunkworks" and work secretly.
  3. It enables engineers to work on interesting projects, preventing burnout and motivating them to be happier. The output gap between motivated engineers and burnt-out engineers far exceeds 20%.
  4. It encourages innovation; if others around you are working on 20% projects, you will be inspired to do the same.

OKRs

Individuals and teams must publicly document their objectives and how they measure them.

  • Objectives
    • Set quarterly and annual goals.
    • Individual and team goals should align with the larger group’s goals.
  • Key Results: Measurable key results can quantify progress towards objectives, ranging from 0 to 1.
  • Set OKRs high; generally, achieving around 0.65 is a good standard. If your results are often below this, your goals may be set too high; if above, they may be too low.
  • Benefits
    • Everyone knows what others are working on, fostering mutual motivation.
    • Provides purpose to execution, making it easier to achieve goals.
  • OKRs are not directly related to performance evaluations.

Should the Project Continue or Be Terminated?

While the review process for major new releases is systematic, there is no definitive answer to whether a project should continue; some decisions are bottom-up, while others are top-down.

Reorganization

Splitting and merging teams is common, seemingly optimizing efficiency.

My Evaluation

The results of 20% time are positive, having incubated significant projects like Gmail and AdSense. In a competitive environment, encouraging talented engineers to spend time on new initiatives is highly beneficial. Promoting 20% time is also a unique strategy to attract talent when the company is small and needs to offer excellent benefits. I tend to view 20% time as a management style rather than a guaranteed path to success.

The distinction between OKRs and performance evaluations is crucial—this means separating vision from execution and goal management from performance management. For example, asking "Did you reach the destination?" compared to "Is the car you drove a good one?" are two different questions. Similarly, poor product sales and whether engineers produced a good product are two separate issues.

For regular engineers, maintaining good relationships with other teams in a large company, including those unrelated to your specific work, is important, as it increases your demand in the labor market. This way, in the event of a reorganization or other adverse events, you will have more options.

Google Software Engineering: Software Development

· 6 min read

It is widely recognized that Google is a company with exceptional engineering capabilities. What are its best engineering practices? What insights can we gain from them? What aspects have drawn criticism? We will discuss these details gradually, with this article primarily focusing on development.

Google Software Engineering - Software Development

Codebase

  • As of 2015, there are 2 billion lines of code in a small number of Monorepo single codebases, with the vast majority of the code visible to everyone. Google encourages engineers to make changes when they see issues, as long as all reviewers approve, the changes can be integrated.
  • Almost all development occurs at the head of the codebase, rather than on branches, to avoid issues during merging and to facilitate safe fixes.
  • Every change triggers tests, and any errors are reported to the author and reviewers within minutes.
  • Each subtree of the codebase has at least two owners; other developers can submit modifications, but approval from the owners is required for integration.

Build System

  • The distributed build system Bazel makes compilation, linking, and testing easy and fast.
  • Hundreds or thousands of machines are utilized.
  • High reliability, with deterministic input dependencies leading to predictable output results, avoiding strange, uncertain fluctuations.
  • Fast. Once a build result is cached, dependent builds will directly use the cache without needing to recompile. Only the changed parts are rebuilt.
  • Pre-submit checks. Some quick tests can be executed before submission.

Code Review

  • There are code review tools in place.
  • All changes must undergo review.
  • After discovering a bug, you can point out the issue in the previous review, and relevant personnel will be notified via email.
  • Experimental code does not require mandatory review, but code in production must be reviewed.
  • Each change is encouraged to be as small as possible. Fewer than 100 lines is "small," fewer than 300 lines is "medium," fewer than 1000 lines is "large," and more than 1000 lines is "extremely large."

Testing

  • Unit tests
  • Integration tests, regression tests
  • Pre-submit checks
  • Automatic generation of test coverage
  • Conduct stress tests before deployment, generating relevant key metrics, especially latency and error rates as load varies.

Bug Tracking Tools

Bugs, feature requests, customer issues, processes, etc., are all recorded and need to be regularly triaged to confirm priorities and then assigned to the appropriate engineers.

Programming Languages

  • There are five official languages: C++, Java, Python, Go, JavaScript, to facilitate code reuse and collaborative development. Each language has a style guide.
  • Engineers undergo training in code readability.
  • Domain-specific languages (DSLs) are also unavoidable in certain contexts.
  • Data interaction between these languages primarily occurs through protocol buffers.
  • A common workflow is essential, regardless of the language used.

Debugging and Analysis Tools

  • When a server crashes, the crash information is automatically recorded.
  • Memory leaks are accompanied by the current heap objects.
  • There are numerous web tools to help you monitor RPC requests, change settings, resource consumption, etc.

Release

  • Most release work is performed by regular engineers.
  • Timely releases are crucial, as a fast release cadence greatly motivates engineers to work harder and receive feedback more quickly.
  • A typical release process includes:
    1. Finding the latest stable build, creating a release branch, possibly cherry-picking some minor changes.
    2. Running tests, building, and packaging.
    3. Deploying to a staging server for internal testing, where you can shadow online traffic to check for issues.
    4. Releasing to a canary environment to handle a small amount of traffic for public testing.
    5. Gradually releasing to all users.

Review of Releases

User-visible or significant releases must undergo reviews related to legal, privacy, security, reliability, and business requirements, ensuring relevant personnel are notified. There are dedicated tools to assist with this process.

Postmortem Reports

After a significant outage incident, the responsible parties must write a postmortem report, which includes:

  1. Incident title
  2. Summary
  3. Impact: duration, affected traffic, and profit loss
  4. Timeline: documenting the occurrence, diagnosis, and resolution
  5. Root causes
  6. What went well and what did not: what lessons can help others find and resolve issues more quickly and accurately next time?
  7. Next actionable items: what can be done to prevent similar incidents in the future?

Focus on the issue, not the person; the key here is to understand the problem itself and how to avoid similar issues in the future.

Code Rewrite

Large software systems are often rewritten every few years. The downside is the high cost, but the benefits include:

  1. Maintaining agility. Markets change, software requirements evolve, and code must adapt accordingly.
  2. Reducing complexity.
  3. Transferring knowledge to newcomers, giving them a sense of ownership.
  4. Enhancing engineer mobility and promoting cross-domain innovation.
  5. Adopting the latest technology stacks and methodologies.

My Comments

Google's single codebase and powerful build system are not easily replicable by small companies, as they lack the resources and capabilities to make their build systems as fast and agile. Staying small, simple, and fast allows small companies to operate more smoothly and focus more on core business logic.

Build systems are often customized, and your knowledge may not transfer or scale. A powerful build system can even be detrimental to newcomers, as it raises the cost of gaining a holistic view.

The inability to transfer and scale knowledge is also an issue with well-developed in-house tools. Throughout my career, I have tried to avoid using non-open-source internal tools, such as Uber's Schemaless, which are tailored for specific scenarios and not made public for broader use; in contrast, LinkedIn's Kafka is a good product with openness and scalability of knowledge.

In the open market, there are excellent tools available for the entire development, testing, integration, and release process. For example, in the JS community:

ProcessTools
CodebaseGithub, Gitlab, Bitbucket, gitolite
Code ReviewGithub Pull Requests, Phabricator
Pre-submit checks, testing, and lintinghusky, ava, istanbul, eslint, prettier
Bug TrackingGithub Issues, Phabricator
Testing and Continuous IntegrationCircleCI, TravisCI, TeamCity
DeploymentOnline service deployment with Heroku, Netlify, mobile app deployment with Fastlane, library publication with NPM

Finally, I may have an insight: companies that do not focus on the automation of these engineering processes will lose significant competitive advantage. I even set up a JS full-stack development framework OneFx for good engineering practices. The difference between fast and slow, high quality and low quality is often exponential because — typically, speed allows you to do more and faster, while poor quality leads to less and worse outcomes.