How to become an infrastructure-as-code ninja, using AWS CDK - part 5

Now it is time to build a service, running in Elastic Container Service (ECS)!

In part 4 of our series, we set up an ECS cluster to run our containers using Fargate, so that we do not need to bother with underlying server infrastructure for the containers. We also added a Task Definition, so that we could manually start the container and get the Apache web server running.

We had a list of goals, which we could cover partially:

  • Expose an endpoint for a web server for HTTP traffic from internet.
  • Web server shall run in a container.
  • The container itself shall not be directly reachable from internet.
  • We should be able to have a service set up so that containers will automatically start when needed.
  • We should be able to build our custom solution for this web server.
  • We should be able to get container images from DockerHub.
  • We do not care about managing the underlying server infrastructure that runs the containers. I.e., we will use Fargate.

Let us now address more points in the goal list, one by one. We got the Apache web server running, but had to start it manually, so let us remove that manual step. We can accomplish this by setting up an ECS Service for the web server.


Update:

This article series uses TypeScript as an example language. However, there are repositories with example code for multiple languages.

The repositories contain all the code examples from the article series, implemented in different languages with the AWS CDK. You can view the code from specific parts and stages in the series by checking out the code tagged with a specific part and step in that part. See the README file in each repository for more details.


Setting up an ECS Service

An ECS Service will allow us to have a container running, and if the container fails, ECS will start up a new instance of a container automatically. Looking at the AWS CDK documentation, we can see that we have a FargateService class we can use. Let us think a bit about what we need to provide:

  1. The cluster the service should run in
  2. The task definition to run as a service
  3. The desired number of task instances we should run
  4. Any port openings to allow traffic to the container
  5. Optionally, a name for the service

To allow traffic to the service, we need a security group. In part 4, when we started a Fargate task, the AWS Console created a security group for us, with an opening for port 80 to the entire world. As a starting point, we aim at re-creating that experience.

So we have 5 pieces of information to include. On top of these, we also should provide a logical id for the service, and we add the scope which the service is added to. Let us create a function skeleton for this:

export const addService = 
function(scope: Construct, 
         id: string, 
         cluster: Cluster, 
         taskDef: FargateTaskDefinition, 
         port: number, 
         desiredCount: number, 
         serviceName?: string): FargateService {
}

In the function body, we add code to create a security group, and add an ingress rule for the port we provide.

We then create the service using FargateService, passing in our parameters. Finally, we return the service construct. The result looks like this:

export const addService = 
function(scope: Construct, 
         id: string, 
         cluster: Cluster, 
         taskDef: FargateTaskDefinition, 
         port: number, 
         desiredCount: number, 
         serviceName?: string): FargateService {
    const sg = new SecurityGroup(scope, `${id}-security-group`, {
        description: `Security group for service ${serviceName ?? ''}`,
        vpc: cluster.vpc,
    });
    sg.addIngressRule(Peer.anyIpv4(), Port.tcp(port));

    const service = new FargateService(scope, id, {
        cluster,
        taskDefinition: taskDef,
        desiredCount,
        serviceName,
        securityGroups: [sg],
    });

    return service;
};

The function code uses the id provided for the service construct to generate an id for the security group in the function. We make a call to add an ingress rule (incoming traffic) from anywhere using the specified TCP port. This is like what the AWS Console experience generated for us. We essentially replicate this for now, but this is not what we want to have in the end.
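
To make the naming concrete, here is a small sketch of how those two strings turn out. The helper securityGroupNaming and the service name 'web' are hypothetical and not part of the module in this article; the helper only mirrors the two template strings used in addService().

```typescript
// Hypothetical helper (not part of the article's module): it mirrors the two
// template strings that addService() uses for the security group.
interface SecurityGroupNaming {
    readonly sgId: string;
    readonly description: string;
}

function securityGroupNaming(id: string, serviceName?: string): SecurityGroupNaming {
    return {
        // Same pattern as in addService(): the service id plus a fixed suffix.
        sgId: `${id}-security-group`,
        // `??` falls back to the empty string only when serviceName is null or undefined.
        description: `Security group for service ${serviceName ?? ''}`,
    };
}

const unnamed = securityGroupNaming('service-webserver');
const named = securityGroupNaming('service-webserver', 'web'); // 'web' is a made-up name

console.log(unnamed.sgId);       // service-webserver-security-group
console.log(named.description);  // Security group for service web
```

Note that when no service name is provided, the description ends with a trailing space. If that bothers you, one option would be to fall back to the id instead of the empty string.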

Now we need to add a call to our new function in our main program, to add the service to the code we have had since part 4 with the ECS cluster and task definition. The call itself is a single line added to what we have already written. Notice here that I picked the family name from the task configuration to generate an id value for the service. The desired count for the service is a single task instance.

addService(stack, `service-${taskConfig.family}`, cluster, taskdef, 80, 1);

Now let us put this into its context and look at the code for the whole main program again. We retrieve data for the existing default VPC, which we use when we set up an ECS cluster. We define an ECS task definition using the standard httpd web service image, and we use this information to define an ECS service that should run with a single instance of that task.

The main program of our code now looks like this:

import { App, Stack } from 'aws-cdk-lib';
import { Vpc } from 'aws-cdk-lib/aws-ec2';
import { 
  addCluster, 
  addService,
  addTaskDefinitionWithContainer, 
  ContainerConfig, 
  TaskConfig 
} from '../lib/containers/container-management';

const app = new App();
const stack = new Stack(app, 'my-container-infrastructure', {
  env: {
    account: process.env.CDK_DEFAULT_ACCOUNT,
    region: process.env.CDK_DEFAULT_REGION,
  },
});

const vpc = Vpc.fromLookup(stack, 'vpc', {
  isDefault: true,
});

const id = 'my-test-cluster';
const cluster = addCluster(stack, id, vpc);

const taskConfig: TaskConfig = { cpu: 512, memoryLimitMB: 1024, family: 'webserver' };
const containerConfig: ContainerConfig = { dockerHubImage: 'httpd' };
const taskdef = addTaskDefinitionWithContainer(stack, `taskdef-${taskConfig.family}`, taskConfig, containerConfig);
addService(stack, `service-${taskConfig.family}`, cluster, taskdef, 80, 1);

We can also look at our container management module and see the functions and data structures we currently have in place there:

import { IVpc, Peer, Port, SecurityGroup } from 'aws-cdk-lib/aws-ec2';
import { Cluster, ContainerImage, FargateService, FargateTaskDefinition, TaskDefinition } from 'aws-cdk-lib/aws-ecs';
import { Construct } from 'constructs';

export const addCluster = function(scope: Construct, id: string, vpc: IVpc): Cluster {
    return new Cluster(scope, id, {
        vpc,
    });
}

export interface TaskConfig {
    readonly cpu: 256 | 512 | 1024 | 2048 | 4096;
    readonly memoryLimitMB: number;
    readonly family: string;
}

export interface ContainerConfig {
    readonly dockerHubImage: string;
}

export const addTaskDefinitionWithContainer = 
function(scope: Construct, id: string, taskConfig: TaskConfig, containerConfig: ContainerConfig): TaskDefinition {
    const taskdef = new FargateTaskDefinition(scope, id, {
        cpu: taskConfig.cpu,
        memoryLimitMiB: taskConfig.memoryLimitMB,
        family: taskConfig.family,
    });

    const image = ContainerImage.fromRegistry(containerConfig.dockerHubImage);
    taskdef.addContainer(`container-${containerConfig.dockerHubImage}`, { image });

    return taskdef;
};

export const addService = 
function(scope: Construct, 
         id: string, 
         cluster: Cluster, 
         taskDef: FargateTaskDefinition, 
         port: number, 
         desiredCount: number, 
         serviceName?: string): FargateService {
    const sg = new SecurityGroup(scope, `${id}-security-group`, {
        description: `Security group for service ${serviceName ?? ''}`,
        vpc: cluster.vpc,
    });
    sg.addIngressRule(Peer.anyIpv4(), Port.tcp(port));

    const service = new FargateService(scope, id, {
        cluster,
        taskDefinition: taskDef,
        desiredCount,
        serviceName,
        securityGroups: [sg],
    });

    return service;
};
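
TaskConfig restricts cpu to a fixed set of values, but memoryLimitMB accepts any number, so an invalid combination would only surface at deployment time. The following is a minimal sketch of an early check. It assumes the Fargate CPU and memory combinations documented by AWS for these five CPU sizes, so verify the ranges against the current AWS documentation before relying on them. The names FargateCpu, memoryRanges and isValidFargateMemory are hypothetical and not part of the module above.

```typescript
type FargateCpu = 256 | 512 | 1024 | 2048 | 4096;

// Allowed memory in MiB per CPU value, as [min, max].
// Assumption: these ranges follow the AWS Fargate documentation.
const memoryRanges: Record<FargateCpu, readonly [number, number]> = {
    256: [512, 2048],
    512: [1024, 4096],
    1024: [2048, 8192],
    2048: [4096, 16384],
    4096: [8192, 30720],
};

function isValidFargateMemory(cpu: FargateCpu, memoryMiB: number): boolean {
    const [min, max] = memoryRanges[cpu];
    if (memoryMiB < min || memoryMiB > max) {
        return false;
    }
    // The smallest CPU size allows exactly 512, 1024 and 2048 MiB.
    if (cpu === 256) {
        return memoryMiB === 512 || memoryMiB === 1024 || memoryMiB === 2048;
    }
    // The larger sizes use 1024 MiB increments within their range.
    return memoryMiB % 1024 === 0;
}

// The configuration used in this article: cpu 512 with 1024 MiB of memory.
console.log(isValidFargateMemory(512, 1024)); // true
```

A check like this could run before the task definition is created, so that a typo in the memory value fails fast on your machine instead of during deployment.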

Before we try to deploy this solution, we should have a few things in mind:

  • If you deploy an ECS Service with a desired count > 0, it will try to start the task during the deployment.
  • Deployment is only considered successful if the desired count has been reached and the tasks are considered healthy.
  • By default, deployment to ECS can get stuck if the service does not work.

To reduce potential waiting times if we run into trouble, there are two things we want to do here. First, let us set the desired count for the service to 0. This means that AWS CDK (and ECS) will provision the service, but will not try to start it. We can then first try to start the service manually. Second, we can tell ECS to use a circuit breaker pattern for the deployment. What this means is that it will try a few times to run the service, and if that does not work, it will roll back the deployment. Let us update the code to include both.

To set the desired count to 0, we simply change the count parameter in the addService() call:

addService(stack, `service-${taskConfig.family}`, cluster, taskdef, 80, 0);

To apply the circuit breaker pattern, we can use the circuitBreaker property on FargateService. When we specify this, we can also say that it should roll back the deployment in case there is an error. In this way, we can get back to the previous state of the service if the updated deployment fails. The updated addService() code now looks like this:

export const addService = 
function(scope: Construct, 
         id: string, 
         cluster: Cluster, 
         taskDef: FargateTaskDefinition, 
         port: number, 
         desiredCount: number, 
         serviceName?: string): FargateService {
    const sg = new SecurityGroup(scope, `${id}-security-group`, {
        description: `Security group for service ${serviceName ?? ''}`,
        vpc: cluster.vpc,
    });
    sg.addIngressRule(Peer.anyIpv4(), Port.tcp(port));

    const service = new FargateService(scope, id, {
        cluster,
        taskDefinition: taskDef,
        desiredCount,
        serviceName,
        securityGroups: [sg],
        circuitBreaker: {
            rollback: true,
        },
    });

    return service;
};
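
To make the circuit breaker idea more tangible, here is a toy model of the decision it makes. This is not how ECS implements it. The function name, the failureThreshold parameter and the value 3 are assumptions made purely for illustration, and ECS decides its own threshold.

```typescript
type DeploymentOutcome = 'COMPLETED' | 'ROLLED_BACK' | 'IN_PROGRESS';

// Toy model only: each entry in launchResults says whether one task launch
// attempt ended up healthy (true) or failed (false).
function evaluateDeployment(
    launchResults: boolean[],
    desiredCount: number,
    failureThreshold: number, // hypothetical value; ECS decides its own threshold
): DeploymentOutcome {
    // With a desired count of 0 there is nothing to start, so nothing can fail.
    if (desiredCount === 0) {
        return 'COMPLETED';
    }
    let healthy = 0;
    let failed = 0;
    for (const ok of launchResults) {
        if (ok) {
            healthy++;
        } else {
            failed++;
        }
        if (healthy >= desiredCount) {
            return 'COMPLETED';
        }
        if (failed >= failureThreshold) {
            // This is the point where the circuit breaker would trip.
            return 'ROLLED_BACK';
        }
    }
    // Neither condition reached yet: the deployment keeps waiting.
    return 'IN_PROGRESS';
}

console.log(evaluateDeployment([], 0, 3));                    // COMPLETED
console.log(evaluateDeployment([false, false, false], 1, 3)); // ROLLED_BACK
```

The first call corresponds to what we do in this part: with the desired count set to 0, the deployment has nothing to wait for and cannot get stuck.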

Let us now try to deploy this again, and see how this works out! As before, you can run the cdk deploy command to perform the deployment (do not forget to provide your AWS credentials). If you already have the solution from part 4 deployed, you can run the cdk diff command to see what the updates will be. You can also run cdk diff if nothing is deployed to see a list of the CloudFormation resources that will be deployed.

Note: At the time of this writing, the AWS Console provides an old and a new ECS Console experience. I have found that the new experience currently does not have all the features of the old experience, and hence I have used the old one here for screenshots. This may have changed by the time you read this. If you are using the new experience and find that you cannot do a particular task, try to switch to the old experience.

When the deployment is complete, we should see the service resource in our ECS cluster.

Deployed service

We can see that the desired count is 0, as well as the number of running tasks. This is what we expect with the deployment configuration we specified. So far, so good!

If we select the service and click on the Update button, we can change the service configuration to set another value for desired count. There are many entries, but we concern ourselves only with the desired count entry. Scroll down the page until you find the entry number of tasks and change the value to 1.

Service config

Click through the rest of the pages with the Next Step buttons with no additional changes, until you get to the Update Service button, and then click on that one. Now it will update the desired count and ECS will try to start the service. At first, the display should show status PROVISIONING, and then change to PENDING. We want it to reach status RUNNING, and then we can test that we can reach the web server, just as we did in part 4.

Service tasks

However, you may notice that we do not actually reach state RUNNING. It seems to be stuck in PENDING state, and eventually ECS will stop trying and remove the task. Something has gone wrong. How do we fix this?

Troubleshooting our deployment

First, let us check if we have any useful logs. We can click around in the ECS Console, and we can look in CloudWatch console under Logs, and we find nothing. This is not so good…

It seems we do not get any useful logs by default, so we need to fix this. Before updating our deployment, let us look further to see if we find anything else that can be useful.

In the Details tab for our service, there is a Network Access section:

Service details - network access

We have a default VPC, and all subnets in the default VPC have public internet access, so the subnet list should probably be ok. The fourth entry here says Auto-assign public IP DISABLED. When we started a task manually in part 4, we got a public IP address for our task, so this is different. Let us also check the security group that is listed, though:

Security group config


The security group looks as expected, port 80 open for everyone. So now we have two things to update:

  • Add logging to our deployment
  • Change so that a public IP address is assigned to the container.

Remember, this is not the ultimate solution, but we take it in small steps and first we want to have it working at all! So let us see what we should change now.

Changing the service deployment

If we look at the FargateService documentation again, there is actually an entry there called assignPublicIp. It is optional, and the default value is false. So let us add a parameter to the addService() function that allows us to set this property.

export const addService = 
function(scope: Construct, 
         id: string, 
         cluster: Cluster, 
         taskDef: FargateTaskDefinition, 
         port: number, 
         desiredCount: number, 
         assignPublicIp?: boolean,
         serviceName?: string): FargateService {
    const sg = new SecurityGroup(scope, `${id}-security-group`, {
        description: `Security group for service ${serviceName ?? ''}`,
        vpc: cluster.vpc,
    });
    sg.addIngressRule(Peer.anyIpv4(), Port.tcp(port));

    const service = new FargateService(scope, id, {
        cluster,
        taskDefinition: taskDef,
        desiredCount,
        serviceName,
        securityGroups: [sg],
        circuitBreaker: {
            rollback: true,
        },
        assignPublicIp,
    });

    return service;
};
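
With two optional positional parameters, a caller who wants to set serviceName now has to pass a value for assignPublicIp first. One possible alternative, shown here only as a sketch, is to collect the settings in a single object. The ServiceOptions and ResolvedServiceOptions interfaces, the resolveServiceOptions helper and the service name 'web' are hypothetical and not part of the code in this article. The default of false for assignPublicIp simply mirrors the FargateService default mentioned above.

```typescript
// Hypothetical options object for the settings addService() takes positionally.
interface ServiceOptions {
    readonly port: number;
    readonly desiredCount: number;
    readonly assignPublicIp?: boolean;
    readonly serviceName?: string;
}

// Same settings after defaults have been applied.
interface ResolvedServiceOptions {
    readonly port: number;
    readonly desiredCount: number;
    readonly assignPublicIp: boolean;
    readonly serviceName: string | undefined;
}

// Fills in defaults so the rest of the code does not have to deal with
// a missing assignPublicIp value.
function resolveServiceOptions(options: ServiceOptions): ResolvedServiceOptions {
    return {
        port: options.port,
        desiredCount: options.desiredCount,
        assignPublicIp: options.assignPublicIp ?? false,
        serviceName: options.serviceName,
    };
}

// A caller can now name the service without touching assignPublicIp.
const resolved = resolveServiceOptions({ port: 80, desiredCount: 0, serviceName: 'web' });
console.log(resolved.assignPublicIp); // false
```

Named fields also make the call site easier to read than a sequence like 80, 0, true. The article keeps the positional parameters, so treat this purely as a design alternative.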

How to add logging to our container is not as obvious. There isn’t anything suitable on FargateService, and there isn’t anything on the FargateTaskDefinition either. It turns out that there is a logging property we can set when we add the container definition. This property requires a LogDriver object, and we can get an object that handles logging to CloudWatch by using the LogDriver.awsLogs() function. There is one mandatory parameter here, and that is streamPrefix. This will set the first part of the name of the CloudWatch log stream.

For now, let us pick the family name in the task configuration. By default, the logs will be around forever, which is perhaps a bit too long. So we can also set the desired retention time for the logs. This is a quite temporary lab experiment, so let us keep the retention time short, just a day.

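This means that the addTaskDefinitionWithContainer() function gets an update. Note that the module now also needs to import LogDriver from aws-cdk-lib/aws-ecs and RetentionDays from aws-cdk-lib/aws-logs: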

export const addTaskDefinitionWithContainer = 
function(scope: Construct, id: string, taskConfig: TaskConfig, containerConfig: ContainerConfig): TaskDefinition {
    const taskdef = new FargateTaskDefinition(scope, id, {
        cpu: taskConfig.cpu,
        memoryLimitMiB: taskConfig.memoryLimitMB,
        family: taskConfig.family,
    });

    const image = ContainerImage.fromRegistry(containerConfig.dockerHubImage);
    const logdriver = LogDriver.awsLogs({ 
        streamPrefix: taskConfig.family,
        logRetention: RetentionDays.ONE_DAY,
    });
    taskdef.addContainer(`container-${containerConfig.dockerHubImage}`, { image, logging: logdriver });

    return taskdef;
};
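
The streamPrefix becomes the first segment of each log stream name. With the awslogs log driver, the stream name follows the pattern prefix/container-name/task-id. As a small sketch of what to look for in CloudWatch later on, the helper below builds that name. The helper name and the task id are made up for illustration, and the container name assumes that the id passed to addContainer() is used as the container name, which is the case when no explicit containerName is set.

```typescript
// Hypothetical helper: builds the log stream name the awslogs driver is
// expected to use, following the pattern prefix/container-name/task-id.
function expectedLogStreamName(streamPrefix: string, containerName: string, taskId: string): string {
    return [streamPrefix, containerName, taskId].join('/');
}

// Values taken from the configuration in this article.
const family = 'webserver';
const dockerHubImage = 'httpd';

// The task id is assigned by ECS at runtime; this one is a made-up placeholder.
const placeholderTaskId = '0123456789abcdef0123456789abcdef';

const streamName = expectedLogStreamName(family, `container-${dockerHubImage}`, placeholderTaskId);
console.log(streamName); // webserver/container-httpd/0123456789abcdef0123456789abcdef
```

Since every task gets its own id, each new task started by the service writes to its own log stream under the same prefix.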

We also need to update the code in our main program file to enable public IP assignment:

addService(stack, `service-${taskConfig.family}`, cluster, taskdef, 80, 0, true);

Now we should hopefully be ready! When you have done the changes, run cdk diff to see that you get changes that reflect the logging updates and the public IP assignment. If that looks ok, then try cdk deploy and deploy the updates.

Once the deployment is done, let us check how we did. The service is deployed, and we can see that our manual modification of the desired count is gone: it is set to 0 again. This is what we want.

Service config

In the Details tab of the service, we look at the Network Access section and see that the value of public IP assignment now is set to enabled:

Service details

We can then also analyze the task definition for the web server, and look at the specific container configuration, where we will find the log configuration:

Container log configuration

So now we are ready to test our update! Perform the same steps as before to update the number of tasks (i.e. desired count) to 1 for the service and check the status for the task started. It will go from PROVISIONING to PENDING, and in this case finally to RUNNING! Success!

Service tasks

Let us also double-check if we get any logs. In the CloudWatch Console, if we select the Logs section, we can actually see a log group in place:

Log group

In the log group, we can find a log stream. Opening up the stream itself, we can see a couple of log records:

Log records

This is good! The service is running, and we have logs from it!

We have successfully deployed and started the service. You can confirm this by browsing to the public IP address of the task, where you should see the default response from the Apache web server.

Public IP

Next, you can set the desired count to 1 in the addService() function call, deploy again, and verify that the service starts up and is accessible, with no manual steps needed.

You can also try to stop the running task. After you click the Stop button, it will stop within seconds, and ECS will start a new task to replace the old one. The new task will not get the same IP address as the old one though, so even though the web server is accessible, it is not accessible from the same address. This is something we need to improve on.

Final words

In this part, we added the ability to provision a service and get that started automatically. We have some shortcomings in this, in that we do not maintain a fixed address to reach our web server. This is something we need to address later. We also learned that we do not get any logs by default and need to add a configuration entry for this as well.

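In addition, we learned that a task needs working network access before it can reach the RUNNING state. In this case, we had to assign a public IP address to the task. The most likely reason is that a task in a public subnet without a public IP address has no route to the internet, so it cannot pull the httpd image from DockerHub.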

So from our goal list, we have kind of ticked off another entry:

  • Expose an endpoint for a web server for HTTP traffic from internet.
  • Web server shall run in a container.
  • The container itself shall not be directly reachable from internet.
  • We should be able to have a service set up so that containers will automatically be started if needed.
  • We should be able to build our custom solution for this web server.
  • We should be able to get container images from DockerHub.
  • We do not care about managing the underlying server infrastructure that runs the containers. I.e., we will use Fargate.

In the upcoming parts, we will address the remaining points in the goal list. We will also take a step back and think more about monitoring and structure, and how we can test that we get what we expect before we deploy. This will become more important once our solutions grow larger and more complex.

As always, you can also check out Tidy Cloud AWS for more content that may be of interest if you work with infrastructure-as-code with AWS.

Don’t forget to use cdk destroy to remove the infrastructure you deployed! AWS does not bill you for what you use, as much as what you forget to shut down.