[Software Architecture] Architecting Software for Leverage

Cavalcanti: What I'm here to talk about is architecting software for leverage. First, I'll define what leverage means here. This is Google's definition: leverage is the amount of value you can gain relative to the depth of your investment. We want to get more value than the investment we make. In a software context, the investment is the decisions you make, the choices you make, or the technical debt you acquire, relative to the amount of value you can create. I want to show some examples of architectural decisions we made throughout Nubank's evolution to get as much leverage as possible at the time. You may be in a similar position at your company, or reach a stage at a future company where you have to make these decisions. You can use us as an example, or at least as a mindset.

Background

I'm Lucas Cavalcanti. I have been the lead software engineer at Nubank since late 2013, a little over seven years. I live in São Paulo, Brazil.

Growing rapidly in a complex field

Nubank is the leading fintech company in Latin America and the largest digital bank in the world. Time magazine named us one of the 100 most influential companies in the world, and we were featured in the magazine. That's a huge achievement for a company that's a little over seven years old. Here's our growth curve; this graph plots the actual number of customers. We now have 35 million customers. We process billions of Kafka messages and HTTP requests every day, in a system with hundreds of microservices built by hundreds of engineers. It's pretty big, and it wasn't always this big.

Overview

I will walk through four stages of the company. The first is startup time, when we valued time to market and feedback. Moving into growth time, we shifted our focus to resilience and adaptability. Next comes consolidation time, where the most important aspects are reliability and observability. Finally there is expansion time, when we value flexibility and scalability. These are the values that mattered to us during each stage.

Startup time (2013-2015)

Startup time is a magical time when anything can happen, including failing and not having a company at all. In our case, it ran from late 2013 to early 2015. It was a time of incredible change, and a magical one: you have a greenfield project and can choose whatever technology you like, as long as you have a good reason for it. We worked in a small office, actually a small house, in a friendly neighborhood of São Paulo. Our first product was a digital credit card with no fees and a real-time experience, which was unheard of at the time. At least in Brazil, we were the first to do this. There were so many unknowns: we didn't know where the company was going or whether it would be successful. With limited resources and only a dozen people running the entire company, we needed to make it work. We also had a licensing deadline. If we were not operating by May 2014, we would have had to apply for a license that would take two years to be granted, which would basically have meant death for the company.

Technology choice

The first lever we pulled was technology selection. The value here is time to market: we needed to launch as soon as possible. The leverage is to maximize the amount of work that doesn't need to be done. Don't build anything more complex than you need at that stage. We chose Datomic as the database, a niche database that is an immutable ledger of facts. You get auditing for free: history is preserved with each update, so no previous values are lost. You can query the database as of any point in time, which is useful later for auditing and debugging. We chose Clojure, a functional programming language that runs on the JVM, so we can leverage the entire Java ecosystem. Everything written in Java, we can use from Clojure. We get immutability by default. Almost every design decision in the language makes things simpler for us. "Simple Made Easy" is a talk by Rich Hickey, and it holds true for us: we use Clojure in production. Functional programming is also a close fit for finance, which is another reason we chose Clojure. It is easier to map financial logic onto a functional programming language.
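To make the "immutable ledger of facts" idea concrete, here is a minimal Python sketch of an append-only fact log with as-of queries. It is not Datomic's API or data model; the names `Fact`, `FactLog`, `assert_fact`, `value_as_of`, and `history`, and the `card-1` example data, are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    """One immutable statement: entity had attribute = value as of transaction tx."""
    entity: str
    attribute: str
    value: Any
    tx: int


class FactLog:
    """Append-only log of facts. Updates add new facts; nothing is overwritten."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []

    def assert_fact(self, entity: str, attribute: str, value: Any) -> int:
        """Record a new fact and return its transaction number (1, 2, 3, ...)."""
        tx = len(self._facts) + 1
        self._facts.append(Fact(entity, attribute, value, tx))
        return tx

    def value_as_of(self, entity: str, attribute: str, tx: Optional[int] = None) -> Any:
        """Latest value recorded at or before tx (the current value if tx is None)."""
        result = None
        for fact in self._facts:  # facts are stored in ascending tx order
            if tx is not None and fact.tx > tx:
                break
            if fact.entity == entity and fact.attribute == attribute:
                result = fact.value
        return result

    def history(self, entity: str, attribute: str) -> List[Tuple[int, Any]]:
        """Every (tx, value) pair ever recorded, oldest first: the audit trail."""
        return [(f.tx, f.value) for f in self._facts
                if f.entity == entity and f.attribute == attribute]


log = FactLog()
first_tx = log.assert_fact("card-1", "credit-limit", 1000)
log.assert_fact("card-1", "credit-limit", 2500)

current = log.value_as_of("card-1", "credit-limit")            # 2500
earlier = log.value_as_of("card-1", "credit-limit", first_tx)  # 1000
```

Because the second write is a new fact rather than an in-place update, the earlier limit stays queryable, which is the property that makes auditing and debugging cheap.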

We chose a hexagonal architecture so that we had an organized way to structure the code. We chose Kafka as the messaging technology. It was very popular at the time, and it gave us a persistent message log with a TTL. It's not forever, but for a period of time you can inspect and review every message produced. We can reset the offset and reprocess old messages if necessary; we had to do that a few times at first. We also get partitioning by default. At the time, Kafka was also somewhat easier to operate. The debt we took on was that we picked some very niche technologies, some of them immature and not yet established. It's hard to find people with experience in them. In the early days of the company, we basically could not require that experience when hiring, so we taught people instead.
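The offset-reset behaviour described above can be illustrated with a toy in-memory log. This is a sketch of the concept only, not the Kafka client API: `RetainedLog`, `Consumer`, `poll`, and `seek` here are hypothetical names, and retention (the TTL) and partitions are left out for brevity.

```python
from typing import Any, Callable, List, Tuple


class RetainedLog:
    """Toy message log: reading a message does not remove it."""

    def __init__(self) -> None:
        self._messages: List[Any] = []

    def append(self, message: Any) -> int:
        """Store a message and return its offset (its position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset: int) -> List[Tuple[int, Any]]:
        """Return (offset, message) pairs starting at the given offset."""
        return list(enumerate(self._messages))[offset:]


class Consumer:
    """Tracks its own position in the log and can rewind it."""

    def __init__(self, log: RetainedLog, handler: Callable[[Any], None]) -> None:
        self.log = log
        self.handler = handler
        self.offset = 0  # next offset to process

    def poll(self) -> int:
        """Process everything from the current offset; return how many messages ran."""
        processed = 0
        for offset, message in self.log.read_from(self.offset):
            self.handler(message)
            self.offset = offset + 1
            processed += 1
        return processed

    def seek(self, offset: int) -> None:
        """Reset the position so older messages are processed again on the next poll."""
        self.offset = offset


seen: List[str] = []
log = RetainedLog()
for event in ["purchase-1", "purchase-2", "purchase-3"]:
    log.append(event)

consumer = Consumer(log, seen.append)
consumer.poll()    # handles all three messages
consumer.seek(1)   # e.g. after fixing a bug in the handler
consumer.poll()    # handles purchase-2 and purchase-3 again
```

Reprocessing is only safe if the handler is idempotent or the downstream state is rebuilt from scratch; that design choice is outside what the sketch shows.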

Vendors

The next lever is vendors. When you care about time to market, buying rather than building is often the best option. The first choice was to use the cloud. At this stage of a company, you don't want to manage your own machines. We have used AWS, with CloudFormation for deployment automation, from the beginning. We use DynamoDB as the storage backend for Datomic, which is also very easy to operate. We chose to buy an off-the-shelf credit card solution. Instead of building an entire credit card system, we started with a company that already processed credit card transactions. By integrating with that company instead of building, we could create the first product much faster. The debt here is that, by relying on these vendors, we are limited by their growth, their size, and their ability to deal with our problems, which is not always ideal.

Practices

The last lever at startup time is practices. The value here is getting fast and early feedback. To get it, we need a good foundation that we can build on faster over time. The catch is that building the foundation takes time, and you can't always afford that at the start. Fortunately, we could; we had some time. We took the opportunity to build a good CI/CD environment, because continuous deployment was very important to us, and we established continuous deployment practices early. We had fault tolerance, very rudimentary, but it was there. We have had immutable infrastructure from the beginning: every time we deploy, we create new instances on EC2 and destroy the old ones, so we avoid the complexity of changing infrastructure in place. We chose microservices from the beginning because we knew that the financial domain is very complex. Containing that complexity in small chunks, in smaller services, was very important to us at the time, so we started that way.

Growth time (2015-2016)

Going a little further, if you are lucky and successful, the company enters a growth phase. In our case this was between 2015 and 2016, when we grew faster than expected. We had expected to reach 1 million customers in 5 years, and we got there in about 18 months. We needed to respond to this. The office was small, so we had to move to a bigger place. Our vendors did not scale. The credit card processor didn't scale, so we needed to keep the system working even so. The technology decisions we made at the beginning also started not to scale. We started to see the first bottlenecks, which are hard to fix in a hypergrowth scenario.

Practices

The first lever at growth time is practices. The values are scalability and fault tolerance. We can, and should, avoid optimization as much as possible, or at least delay it, because optimized code is much more complex than regular code. In a complex domain, this can get out of hand very quickly. To do this, we use infrastructure sharding instead of sharding only a database or one part of the infrastructure. We run several copies of the Nubank system. Each shard is a copy of the entire infrastructure, and it is our unit of scalability. We place a limit on the number of customers served by a shard, move on to the next shard when that limit is reached, and keep creating shards as the customer base grows. If the shards are small enough, the code doesn't have to be optimized, or the optimization can be delayed as long as possible. For this to work, we had to improve CI/CD, because we need frequent automated deployments.
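The fill-then-move-on policy described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Nubank's implementation: `ShardAllocator`, its method names, and the capacity of 2 are hypothetical, and "creating a shard" here only increments a counter rather than provisioning a copy of the infrastructure.

```python
from typing import Dict, List


class ShardAllocator:
    """Assigns each new customer to the newest shard until it reaches its cap."""

    def __init__(self, capacity_per_shard: int) -> None:
        if capacity_per_shard <= 0:
            raise ValueError("capacity_per_shard must be positive")
        self.capacity = capacity_per_shard
        self.shard_of: Dict[str, int] = {}  # customer id -> shard index
        self.counts: List[int] = []         # customers per shard

    def assign(self, customer_id: str) -> int:
        """Return the customer's shard, creating a new shard when the last one is full."""
        if customer_id in self.shard_of:
            return self.shard_of[customer_id]  # customers never move between shards
        if not self.counts or self.counts[-1] >= self.capacity:
            self.counts.append(0)  # stand-in for provisioning a full copy of the system
        shard = len(self.counts) - 1
        self.counts[shard] += 1
        self.shard_of[customer_id] = shard
        return shard


allocator = ShardAllocator(capacity_per_shard=2)
placements = [allocator.assign(f"customer-{i}") for i in range(5)]
```

Keeping the cap small bounds the data volume any single copy of the code has to handle, which is what lets optimization be deferred.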

We had end-to-end tests from the beginning, but the test suite stopped scaling and took over an hour to run. We had to replace those tests with consumer-driven contract tests, which run faster with slightly weaker guarantees. It is better to keep deploying frequently than to wait too long for each deploy. We started migrating to Docker instead of running directly on EC2 instances. The investment here was a sharding project that ran for more than a year, which was a very large project for the company at the time. We had to design new tooling to support it. The debt is that the project took longer than expected while the customer base grew faster than expected, so we ended up with a first shard that is bigger than the others. For a long time, this first shard was a special one that was basically the canary for any performance issue in the system. Additionally, each shard has a minimum cost no matter how many customers it serves, so we started spending a lot of money running every copy of the system.
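A consumer-driven contract test checks a provider against the expectations its consumers have published, without deploying the consumers. The sketch below shows the idea in plain Python; it does not use any particular contract-testing library, and `BILLING_CONTRACT`, `get_account`, and `contract_violations` are hypothetical names invented for this example.

```python
from typing import Any, Dict, List

# Contract published by a hypothetical consumer (a billing service):
# the fields it reads from the provider's response, and their types.
BILLING_CONTRACT: Dict[str, type] = {"customer_id": str, "credit_limit": int}


def get_account(customer_id: str) -> Dict[str, Any]:
    """Hypothetical provider handler; a real test would call the provider's code."""
    return {"customer_id": customer_id, "credit_limit": 5000, "status": "active"}


def contract_violations(contract: Dict[str, type], response: Dict[str, Any]) -> List[str]:
    """List every way the response breaks the consumer's expectations."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems


# Runs in the provider's build, with no consumer service deployed.
ok = contract_violations(BILLING_CONTRACT, get_account("c-1"))
broken = contract_violations(BILLING_CONTRACT, {"customer_id": "c-1"})
```

Extra fields such as `status` are allowed, so the provider can evolve freely as long as it keeps what consumers declared. The weaker guarantee mentioned above is visible here too: the check covers the shape of the response, not the behaviour of the whole flow.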

In-Housing

The next lever is in-housing. Because our vendors didn't scale, we started to take control of our own destiny in the most important aspects of our business. We started processing credit cards in-house, bringing in feature after feature of the card so we could control our own scale. The same goes for customer support. Delighting customers is our greatest strength as a company, so we also brought our customer support tooling and customer support staff in-house, and we had to design for that as well. The biggest investment here is that it took more than 18 months to bring in these credit card features, and every little feature was a migration we had to make. It's a huge investment with a huge payoff: the vendors were not going to scale to 35 million customers, and we can. The debt is that we went a long time without any major product changes because we were busy bringing features in-house. That's kind of bad.

Consolidation time (2017-2018)

If we're lucky, we move to the next phase, which is consolidation time. Between 2017 and 2018 the company went into cruise mode: we could now scale in a steady fashion, and sharding helped a lot with that. We had reached a scale where every little corner case that affects 0.1% of customers happens to thousands of customers, so the product and the systems had to be more stable than we had anticipated. The office had also stopped scaling, so we moved to a bigger one with capacity for 1,000 people, near Avenida Paulista, one of the most famous streets in São Paulo. At this point we launched a second product, a checking account. By this stage we had generated a lot of data, so we started analyzing it. This is also very important to us.

Technology

The first lever is technology, where the values are scalability and adaptability. At this scale we needed to be able to make infrastructure changes more easily, so we migrated to Kubernetes, which was also thriving at the time. It comes with an ecosystem of infrastructure tools, and it scales better than AWS CloudFormation as the number of services grows. We also started building better monitoring, collecting real-time metrics with Prometheus plus Grafana. Those metrics are also consumed by other tools such as Opsgenie and Slack, and by CI/CD for canary deployments. This was very important for us to scale. The investment was another year-long project: we had to build the Kubernetes infrastructure and migrate the shards to it one by one, while the system was already running for millions of customers. These were quite complex operations, and we were able to pull them off. The debt is that, until the migration was complete, we started hitting AWS limits on creating resources and on the number of resources, and we spent a lot of money on duplicate infrastructure. This was a big deal.

Internal tools

We also had to invest heavily in internal tooling, for resilience and observability. We needed to make it easier for engineers to operate this system, especially with so many services and people. We created a command-line tool called NuCLI that bundles the most common operations, such as restarting a service or sending an HTTP request to a service with your own credentials, so that each can be run with a single command. There is also tooling for declarative infrastructure: a repository where you describe the resources your service needs, which tools then apply automatically. The investment is that we need a dedicated team to curate and maintain these tools and to make sure all of those changes get applied.
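The core of a declarative infrastructure tool is a comparison between the resources a service declares and the resources that currently exist. The following Python sketch shows that reconciliation step only. It is an assumption about how such tooling generally works, not a description of Nubank's tool; `plan` and the resource names and settings are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def plan(desired: Dict[str, Dict[str, Any]],
         actual: Dict[str, Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Compute the actions that turn the actual resources into the desired ones."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif desired[name] != actual[name]:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# What the service's definition file declares (hypothetical resources).
desired = {
    "topic/purchases": {"partitions": 12},
    "table/accounts": {"read_capacity": 50},
}
# What currently exists in the environment.
actual = {
    "topic/purchases": {"partitions": 6},
    "bucket/old-exports": {},
}

actions = plan(desired, actual)
```

Engineers edit only the declaration; the tool derives and applies the steps, which is why a dedicated team has to own the tool that performs them.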

Data

During consolidation, data becomes very important. The amount of data we have cannot be handled by conventional tools, and you need data to make almost every decision in the company. We use Scala plus Spark to process all of it, pulling data from the databases of every service across all shards. Our ETL is built on a repository of dataset definitions that almost everyone in the company contributes to, and it outputs everything to a data warehouse so everyone can access it later. It integrates with some artificial intelligence tools, and we use it to feed our machine learning models. We can also use it as a consistency tool. With so much data and a distributed architecture, failed distributed transactions become a significant concern, so we use the ETL to check consistency across the system. Creating the initial version of the ETL and starting to iterate on it was another big project, and we have a dedicated team to make sure it runs smoothly.
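Using batch data as a consistency tool amounts to joining datasets extracted from different services and looking for records that exist on one side only. The sketch below shows that check in plain Python rather than Scala and Spark, and on made-up data: the two services, the `purchase_id` key, and `find_inconsistencies` are hypothetical.

```python
from typing import Dict, Iterable, List


def find_inconsistencies(authorizations: Iterable[Dict[str, str]],
                         ledger_entries: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Compare two services' datasets and report purchases present in only one."""
    authorized_ids = {row["purchase_id"] for row in authorizations}
    ledger_ids = {row["purchase_id"] for row in ledger_entries}
    return {
        "missing_in_ledger": sorted(authorized_ids - ledger_ids),
        "missing_in_authorizations": sorted(ledger_ids - authorized_ids),
    }


# Rows as they might arrive from two services' databases (hypothetical data).
authorizations = [{"purchase_id": "p1"}, {"purchase_id": "p2"}, {"purchase_id": "p3"}]
ledger_entries = [{"purchase_id": "p1"}, {"purchase_id": "p3"}, {"purchase_id": "p4"}]

report = find_inconsistencies(authorizations, ledger_entries)
```

A purchase that appears in one dataset but not the other is the typical trace of a distributed transaction that failed halfway, so a non-empty report points at records to repair.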

Expansion time (2019- )

Finally, we get to expansion time, where we are now. From 2019 on, we started to provide products for everyone, which is why you see the inflection point on the growth curve. For example, instead of saying no to some customers who asked for a credit card, we started offering products that everyone can get. We are rolling out in many countries, with many offices, and we are building many other products. We also started acquiring companies, so the interfaces between those companies became important as well.

Technology platform

The first lever at expansion time is what I call horizontal platforms, where the values are scalability and productivity. The idea is to have dedicated technical teams that build abstractions and tools for other teams. For mobile and web, for example, we have a team building tooling in Flutter, a design system, and a component library, so that regular engineers who are not specialists can still develop for those platforms. The same is true for infrastructure: we create abstractions and tools so that every engineer can operate their services without everyone having to know how to operate Kubernetes. We also built an experimentation platform, because at this scale we want to run experiments to improve our products. Having a platform that makes this easy, including monitoring KPIs and everything else related to testing, is critical. We now need dedicated teams to create, maintain, and operate these platforms so that every engineer can be productive on whatever technology we use.

Business platform

Finally, there are business platforms, where teams of domain experts build abstract APIs that are used by all other product teams. This is where we can really innovate; the possibilities are endless with every platform product we can build. One example is banking as a service: platforms that run the basic business of a bank. Take the credit platform. A product team doesn't have to figure out how to originate a loan, how to report it, or how to account for it. We have a platform through which every product can originate a loan, and once that's done, the product can do whatever it wants. The same is true for the assets and payments platforms and for Open Banking. These are the building blocks we use to build our bank. On the credit card side, we have to roll out credit cards in other countries, so we have to make the system more generic. To do this, we split the system into the parts most relevant to credit cards, each handled separately: managing credit lines, generating the billing statement every month, processing card transactions, and handling debt, including renegotiating it when a customer does not pay the bill in full. It's also really important to have a flexible acquisition process when you have this many products, so we treat that as a platform too. The investment is that a domain platform requires a very deep understanding of the domain. We had long discussions about where to draw the right split points for the platform systems, because creating the wrong abstraction here also leads to failure. You have to make sure you are creating the right abstraction, because the cost of rebuilding an abstraction is very high.

Brief review

At startup time, we tried to delay writing code as much as possible while laying the groundwork for growth. At growth time, we brought the core functionality we needed in-house and started sharding so we could scale faster. Consolidation time was about maturing our infrastructure and creating a data environment so that everyone could use data to grow the company. Finally, expansion time is about building horizontal and business platforms, so that the possibilities multiply and so does everyone's productivity.

Question and answer

Bocelli: You started with microservices. Usually, when people choose a microservices architecture, it's more a consequence of Conway's law: when you're small you don't need it, and you switch to it later. Did you consider building a monolith? What was the thought process there?

Cavalcanti: I think the main reason is the complexity of the domain. We knew the financial world was complex, and the concerns in the system at the time were very different from one another: processing credit card transactions, handling customer data, and handling the acquisition process. We started with very few services, I think about four or five, but we started with services because we knew that if we were going to be successful and scale, we would have to adopt this kind of architecture eventually.

Bocelli: It is a challenge, because a distributed system is much more complex than a single monolith.

That also raises some questions about partitioning. You mentioned sharding when you started scaling. You have to balance the shards as you create them. Did you have to deal with outliers in this respect, customers that need more resources within this split?

Cavalcanti: We do have some people who make 10x or 100x more transactions than others. That only ends up affecting those specific customers; their bill may not open in time because there are too many transactions on it. It doesn't affect other operations, since most of our operations are already batched, so it doesn't affect us much. We did run into this issue on the first shard, because it took us longer than expected to build the sharding infrastructure. The first shard held the oldest customers for too long: it has the customers who have been with Nubank the longest, and the largest number of them. This presented us with some challenges. It still does sometimes, but now all the shards are about the same size, and we manage them better.

Bocelli: Related to that, what about the migration to Kubernetes? There are some questions about the Kubernetes family. Did you adopt a managed Kubernetes service like EKS, or did you run Kubernetes yourselves? How did the transition from your previous infrastructure take place? You mentioned it was immutable.

Cavalcanti: The main thing to consider is that this was seven or eight years ago. Kubernetes had only just been introduced, with its public release in 2014. We didn't have these tools when we started the company. We didn't even have Docker, so we had to migrate to Docker and then to Kubernetes. Eventually we migrated to EKS, because EKS didn't exist at the time either. I think it only became available in the São Paulo region this year. If we started the business today, we'd probably use EKS on Amazon, and that's it. We didn't have those tools back then.

Bocelli: Another question I see a lot here is about Clojure and the stack you mentioned. I'd say it's not that common. You take advantage of the ecosystem, and you also mentioned the JVM. Can you help us understand this dichotomy? You aimed for an easy start, but you chose something that very few people would.

Cavalcanti: The most important thing about Clojure for us is that the language pushes you toward simplicity from the very beginning, in the sense that the amount of language you have to learn is very small. You can become productive after learning Clojure for a few weeks. All of our services look similar. Once you get over the parentheses, switching from curly braces to parentheses, you're good to go. Everything else is simpler. We don't have the cognitive load that most languages impose with their syntax and language constructs. Clojure is a Lisp: parentheses and its notation. You don't have to learn many language features to use it the way we do, or to copy and adapt existing code.

Bocelli: Now a question more on the business side: how do you factor cost into architectural decisions? When you're in the financial business, you're subject to a lot of regulation. How do you handle the challenges of separating infrastructure, business code, and regulation?

Cavalcanti: We have specific teams. Once you get to a certain size, it pays to have a team that focuses on a specific aspect. We have a team that only manages operational risk, and a team that only provides the platform for the regulated part of the company. Originating loans is heavily regulated, at least in Brazil, so we created a platform around that. Every time a product needs to originate a loan, the platform handles it, and we have a number of teams that make it easy to issue loans. So the main way we've approached this is by having dedicated teams that know a lot about the rules and regulations and how to assess risk, plus regular reviews with the other teams that may be affected.

Bocelli: You mentioned several times that there are dedicated teams and specialists. What is the typical team size at Nubank?

Cavalcanti: Usually a team is not just engineers. Teams have business analysts, product managers, business product managers, and sometimes data scientists. On the engineering side, each team usually has a technical manager and two to six engineers. It varies a lot by context, but most teams are about that size.

Bocelli: The two-pizza rule.

Now let's switch to data. You mentioned ETL. What tools do you use for ETL?

Cavalcanti: We use Spark, and we built all the infrastructure to transform the data. We had to build a lot of tooling internally to extract the Datomic data and transform it into a form the ETL could use. I'm not a data expert, so I don't know many of the details. I know we use a Mesos cluster to run the ETL process. We use some BI tools, such as Databricks and Looker. The ecosystem as a whole uses several tools. We have our own repository of dataset definitions, contributed to by the whole company. It's Scala and Spark, plus some abstractions we created so it's easier for people to use.

Bocelli: You started with Kafka, and today a lot of things are about streaming, but you went the batch ETL route. Is there a reason you didn't go all in on streaming? It wasn't very popular back then; did you need to adjust?

Cavalcanti: The main reason is that Kafka Streams was not stable, or not even released, when we needed the ETL. By then we already had Datomic data from dozens of microservices. It would have been hard to migrate to Kafka Streams with the architecture we had chosen. If we started today, we would go with streaming. We do have some use cases that are better suited to streams; for example, we use Kafka Streams when collecting metrics in real time. For the regular database part, we don't.

This article: https://architect.pub/architecting-software-leverage
Origin blog.csdn.net/jiagoushipro/article/details/130788053