
Zero-cost Disaster Recovery Plan for Applications Running on AWS

Apr 12, 2022 | Santosh Teli


Statistics show that over 40% of businesses will not survive a major data loss event without adequate preparation and data protection. Though disasters don’t occur often, the effects can be devastating when they do.

A Disaster Recovery Plan (DRP) specifies the measures to minimize the damage of a major data loss event so businesses can respond quickly and resume operations as soon as possible. A well-designed DRP is imperative to ensure business continuity for any organization. If you are running an application, you must have a Disaster Recovery Plan in place, as it enables timely IT recovery and prevents data loss. While traditional disaster recovery solutions exist, there has been a shift to the cloud because of its affordability, stability, and scalability.

AWS lets you launch application infrastructure across multiple Availability Zones. Within an AWS Region, Availability Zones are clusters of discrete data centers with redundant power, networking, and connectivity. If a single Availability Zone experiences downtime, infrastructure configured across zones (such as Auto Scaling Groups and load balancers) shifts traffic and launches replacement capacity in the healthy zones.

Of course, downtime does occur occasionally. To handle it better, you should configure the Auto Scaling Groups (ASGs), load balancers, database clusters, and NAT gateways across at least three Availability Zones so the application can withstand n-1 failures, that is, the failure of two of the three Availability Zones (as depicted in the diagram below).

Diagram of disaster management in an AWS Region with failure of two of three Availability Zones.

Disaster Management within an AWS Region
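As an illustration, the sketch below uses the AWS CLI to spread an Auto Scaling Group across three Availability Zone subnets so that at least one instance survives the loss of two zones. The subnet IDs, launch template name, target group ARN, and group name are placeholders, not values from this article.

```bash
#!/usr/bin/env bash
# Placeholder values for this sketch: one subnet per Availability Zone.
SUBNETS="subnet-0aaa1111,subnet-0bbb2222,subnet-0ccc3333"
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app/abc123"

# Create an Auto Scaling Group that keeps at least one instance per AZ,
# so the application survives the loss of two of the three zones.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name app-asg \
  --launch-template "LaunchTemplateName=app-launch-template,Version=\$Latest" \
  --min-size 3 --max-size 9 --desired-capacity 3 \
  --vpc-zone-identifier "$SUBNETS" \
  --target-group-arns "$TARGET_GROUP_ARN" \
  --health-check-type ELB --health-check-grace-period 120
```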



Regional Disaster Recovery Plan Options

A regional disaster recovery plan is the precursor for a successful business continuity plan and addresses questions our customers often ask, such as:


  • What will be the recovery plan if the entire production AWS region goes down?
  • Do you have a provision to restore the application and database in any other region?
  • What is the recovery time of a regional disaster?
  • What is the anticipated data loss if a regional disaster occurs?

The regional disaster recovery plan options available with AWS range from low-cost, low-complexity approaches (taking backups) to more complex ones (running multiple active AWS Regions). Depending on your budget and uptime SLA, there are three options available:


  1. Zero-cost option
  2. Moderate-cost option
  3. High-cost option

While preparing for the regional disaster recovery plan, you need to define two important factors:


  • RTO (Recovery Time Objective): the time required to recover after a disaster
  • RPO (Recovery Point Objective): the maximum amount of data loss tolerated during a disaster

  1. Zero-cost Option:

In this approach, you begin with database and configuration backups in the recovery region. The next step is writing automation scripts that can launch the infrastructure in the recovery region in minimal time. In case of a disaster, the production environment is restored using these automation scripts and backups. Though this option increases the RTO, there is no need to launch any infrastructure in advance for disaster recovery.

Diagram of a zero-cost disaster recovery option with database and configuration backups in the recovery region.


  2. Moderate-cost Option:

This approach keeps a minimal infrastructure, namely the database and configuration servers, in sync in the recovery region. This arrangement reduces the DB backup restoration time, significantly lowering the RTO.

Diagram of a moderate-cost disaster recovery option with database and configuration servers in sync in the recovery region.
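One way to keep the database layer in sync, assuming an Amazon RDS-backed application, is a cross-region read replica that can be promoted during a disaster. The sketch below is illustrative; the instance identifiers, account ID, and regions are placeholders.

```bash
#!/usr/bin/env bash
# Keep a cross-region read replica of the production database (placeholder names).
aws rds create-db-instance-read-replica \
  --db-instance-identifier app-db-replica-dr \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:app-db-prod \
  --region us-west-2

# During a regional disaster, promote the replica to a standalone primary.
aws rds promote-read-replica \
  --db-instance-identifier app-db-replica-dr \
  --region us-west-2
```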


  3. High-cost Option:

This is a resource-heavy approach that involves running load balancers and the production environment across multiple regions. Though it's an expensive arrangement, with proper implementation and planning the application can be recovered with minimal downtime after a single-region disaster.

Diagram of a high-cost disaster recovery option with load balancers across multiple regions in the production environment.
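Assuming Route 53 is the DNS layer, multi-region traffic management is often implemented with DNS failover between the regional load balancers. The sketch below is illustrative only; the hosted zone ID, record name, and load balancer DNS name are placeholders, and a matching PRIMARY record with a health check would be configured the same way.

```bash
#!/usr/bin/env bash
# Secondary (failover) record pointing at the recovery region's load balancer.
# Route 53 serves this record when the primary region's health check fails.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "ResourceRecords": [{"Value": "app-lb-recovery.us-west-2.elb.amazonaws.com"}]
      }
    }]
  }'
```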


Zero-cost Option: The Steps

The zero-cost option does not require the advance launch of additional resources in the recovery region; the only cost incurred is for practicing the DR drills.

Step 1: Configure Backups

At this stage, reducing data loss is the top priority. The first step is configuring cross-region backups into the recovery region; with a proper backup configuration, you can reduce the RPO. It's essential to configure cross-region backups of the following (a minimal sketch follows this list):


  • S3 buckets
  • Database backups
  • DNS zone file backups
  • Configuration management (Chef/Puppet) server configuration
  • CI/CD (Jenkins/GoCD/ArgoCD) server configuration
  • Application configurations
  • Ansible playbooks
  • Bash scripts for deployments and cronjobs
  • Any other application dependencies required for restoring the application
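As a minimal sketch of the backup configuration above, the commands below copy an S3 bucket, an RDS snapshot, and a DNS zone export into the recovery region using the AWS CLI. The bucket names, snapshot ARN, hosted zone ID, and regions are placeholders.

```bash
#!/usr/bin/env bash
set -euo pipefail

PRIMARY_REGION="us-east-1"      # placeholder
RECOVERY_REGION="us-west-2"     # placeholder

# 1. One-off copy of an S3 bucket into the recovery region
#    (enable S3 Cross-Region Replication for continuous copies).
aws s3 sync s3://app-config-primary s3://app-config-recovery \
  --source-region "$PRIMARY_REGION" --region "$RECOVERY_REGION"

# 2. Copy the latest RDS snapshot into the recovery region.
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:us-east-1:123456789012:snapshot:app-db-daily \
  --target-db-snapshot-identifier app-db-daily-dr \
  --source-region "$PRIMARY_REGION" \
  --region "$RECOVERY_REGION"

# 3. Export the Route 53 zone so DNS records can be recreated if needed.
aws route53 list-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE > dns-zone-backup.json
```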

Step 2: Write Infrastructure-as-Code (IaC) Templates - Process Automation

Using IaC to launch the AWS infrastructure and configure the application reduces the RTO significantly, and automating the process lessens the likelihood of human error. Many automation tools are widely available; typical artifacts include the following (a sample driver script follows this list):


  • Terraform code to launch application infrastructure in AWS
  • Ansible playbooks to configure Application AMI, Chef server, CICD servers, MongoDB Replica Sets Clusters, and other standalone servers
  • Scripts to bootstrap the EKS cluster
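A minimal driver script, assuming the Terraform and Ansible code already lives in terraform/ and ansible/ directories of your repository, might tie these pieces together for the recovery region. The directory layout, variable name, and inventory path are placeholders.

```bash
#!/usr/bin/env bash
set -euo pipefail

RECOVERY_REGION="us-west-2"   # placeholder

# 1. Launch the application infrastructure in the recovery region.
cd terraform
terraform init
terraform apply -auto-approve -var "region=${RECOVERY_REGION}"

# 2. Configure the servers that Terraform just launched.
cd ../ansible
ansible-playbook -i inventories/recovery site.yml
```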

Step 3: Prepare for a DR Drill

The preparation for a DR drill should be done in advance through a defined process. The following is a sample checklist (a timing sketch follows this list):

  • Select an environment similar to production
  • Prepare a plan to launch the complete production infrastructure in the recovery region
  • Identify all the application dependencies in the recovery region
  • Configure cross-region backups of all databases and configurations
  • Prepare the automation scripts using Terraform, Ansible, and shell scripts
  • Identify the team members for the DR drill and assign their responsibilities
  • Test your automation scripts and backup restoration in the recovery region
  • Note the time taken for each task to get a rough estimate of the drill time
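To capture a rough RTO per task, the drill runner can simply time each restoration step. The sketch below times a database restore from the snapshot copied earlier; the identifiers, instance class, and region are placeholders.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Time one drill task: restoring the database from the snapshot copied earlier.
START=$(date +%s)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier app-db-drill \
  --db-snapshot-identifier app-db-daily-dr \
  --db-instance-class db.m5.large \
  --region us-west-2

# Block until the restored instance is ready to accept connections.
aws rds wait db-instance-available \
  --db-instance-identifier app-db-drill --region us-west-2

END=$(date +%s)
echo "DB restore took $(( (END - START) / 60 )) minutes" | tee -a drill-timings.log
```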

Step 4: Execute the DR Drill

The objective of the DR drill is to test the automation scripts and obtain the exact RTO. Once the plan is set, decide on a date and time to execute your DR drill. Regular practice is advisable to perfect your restoration capabilities.

Benefits of DR Drills

  • Practicing DR drills builds confidence that the production environment can be restored within the agreed timeline.
  • Drills help identify gaps and provide exact RTO and RPO timelines.
  • They provide your customers with documented evidence of your disaster readiness.

Conclusion

Though AWS Regions are very reliable, preparing for a disaster is a business-critical requirement for SaaS applications. Multi-region and multi-cloud deployments are complex, expensive architectures; choosing the appropriate DR option depends on your budget and the uptime SLA you must meet when recovering from such a disaster.



