Monitoring Managed Public Services on Public Cloud Providers

Banana Pancake

Managed Services

The increase in capability of many managed services such as Cloud Run and Cloud Functions on Google Cloud Platform (or similarly, Fargate and Lambda on AWS) has resulted in an explosion in usage of these technologies. They follow the “less to manage” mantra and let you simply manage application code without having to patch / maintain surfaces below your application (for the most part).

As usage of these services has grown, security tooling for them has, as is typical, lagged behind the many well-established options for monitoring bare metal or VM infrastructure.

While some teams invest in heavily templated platforms that allow only a single way to do something, these platforms commonly carry undue complexity and costs in maintenance and flexibility, and they fail to match the execution strategy of smaller teams, such as ours at Unit 410, who experiment regularly and quickly.

Permissions, Firewalls, Ingress, Oh my!

There is a large gap in how permissions and access are handled for managed services on Public Cloud Providers, and it can create an undesirable attack surface for your services.

Cloud Run and Cloud Functions each have Ingress settings and Authentication settings that are not easy to configure or monitor programmatically. Configuring VPC connectors to allow inter-region traffic, setting up the proper IAM roles for invocation, and adjusting other settings all require extensive work to do properly (and often simply to become functional).

The easiest way to configure and deploy these managed services is the most insecure (public access, all users), which is generally not acceptable to most businesses nor should it be. Public Cloud Providers have a lot more work to do to make it easy to do what is smart and necessary with managed services.

While these settings can mostly be configured with an Infrastructure as Code (IaC) tool such as Terraform (e.g. google_cloud_run_service, google_cloud_run_service_iam_binding, etc.) or through the gcloud CLI, there are extremely large gaps that you will likely have to fill.

There are also many behind-the-scenes complexities you need to be aware of to properly configure and manage these services, and they require keeping a lot of data loaded in your personal RAM.

For example:

What Service Account is the service running as?
What Service Account (or human) is invoking it?
What permissions are being granted or inherited?
Where is it being invoked from?

The list of gotchas starts small and then grows:

Oh, in development a user had the Editor role which included roles/cloudfunctions.invoker but now that doesn’t work?
Oh, the OIDC token is configured in Cloud Scheduler but the service account set has changed?
The Security Team asked for an audit of all the Authentication settings of our services but the API doesn’t expose this information in a readily accessible manner - we will have to manually audit in the Console?
The Cloud Scheduler cannot invoke our function in this project. Oh, is it because it was created prior to March, 2019 (reference) and we are missing a GCP-specific role?

All of these can be solved, but as with other large technical decisions (Should we move to a Monorepo? Should we move to Kubernetes? Should we use GraphQL?), you are buying initial speed and low-friction development at an Infrastructure & Security cost that comes due post-deployment.

Codification

You have to codify your infrastructure. There is simply no substitute. By doing so, you treat your infrastructure the same way you treat your application code.

This will actually prevent a lot of the issues and questions mentioned above. Need to configure a new Cloud Run service? Use the Terraform module / template. Need to configure a new Cloud Function? Use the Terraform module / template. Need to support a slightly different use case? Extend the module or ask if making a new one makes sense.

We have learned there is a balance to strike with a fully templated and structured IaC environment, and in many cases we have chosen a non-DRY approach, a WET approach if you will. This approach lets us move and adapt quickly without forcing a particular template or approach on every single service that we build and deploy. If a change to a base template would impact too many things, we prefer to duplicate so that folks can move quickly. Using Terraform in a DRY manner can be full of friction, especially with managed services, and we have decided it is completely acceptable to have duplication.

Once you codify your managed services, even with a WET approach, auditing permissions, targets, and invocation from your IaC becomes entirely doable and reasonable.

A small main.tf for us might look something like:

resource "google_project_iam_binding" "project" {
  project = "banana-pancakes-1234"
  role    = "roles/editor"

  members = [
    "group:[email protected]",
    # ...
  ]
}

resource "google_storage_bucket" "state" {
  name          = "tf-state-banana-pancakes-1234"
  project       = "banana-pancakes-1234"
  location      = "US-CENTRAL1"
  force_destroy = false
}

resource "google_service_account" "pancakes" {
  account_id   = "banana-pancakes-1234"
  display_name = "Banana Pancakes"
  description  = "Banana Pancakes invoking service account"
}


resource "google_cloud_run_service_iam_binding" "pancakes" {
  service = "banana-pancakes"
  role    = "roles/run.invoker"

  members = [
    "serviceAccount:${google_service_account.pancakes.email}",
    # ...
  ]

  lifecycle {
    ignore_changes = [
      etag,
    ]
  }
}

resource "google_cloud_scheduler_job" "pancakes" {
  name             = "banana-pancakes"
  description      = "Start the banana pancakes to monitor pancake production"
  region           = "us-central1"
  schedule         = "0 */12 * * *"
  time_zone        = "Etc/UTC"
  attempt_deadline = "60s"

  retry_config {
    max_backoff_duration = "3600s"
    max_doublings        = 16
    max_retry_duration   = "0s"
    min_backoff_duration = "5s"
    retry_count          = 0
  }

  http_target {
    http_method = "GET"
    uri         = "https://banana-pancakes-1234-uc.a.run.app/"

    oidc_token {
      audience              = "https://banana-pancakes-1234-uc.a.run.app"
      service_account_email = google_service_account.pancakes.email
    }
  }
}
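
One way to audit a configuration like the one above is to read the JSON that `terraform show -json` prints and list every IAM grant it contains. The sketch below is not our production tooling. It assumes the documented state layout (`values.root_module.resources` with nested `child_modules`), and the function names `iter_resources`, `iam_grants`, and `public_grants` are hypothetical.

```python
import json

# Members that make a grant effectively public.
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def iter_resources(module):
    """Yield every resource in a module and, recursively, in its child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from iter_resources(child)


def iam_grants(state):
    """Return (address, role, member) for every *_iam_binding / *_iam_member resource."""
    grants = []
    root = state.get("values", {}).get("root_module", {})
    for resource in iter_resources(root):
        if not resource.get("type", "").endswith(("_iam_binding", "_iam_member")):
            continue
        values = resource.get("values", {})
        # Bindings carry a `members` list; member resources carry a single `member`.
        members = values.get("members") or [values.get("member")]
        for member in members:
            if member:
                grants.append((resource["address"], values.get("role"), member))
    return grants


def public_grants(state):
    """Return only the grants whose member is open to everyone."""
    return [grant for grant in iam_grants(state) if grant[2] in PUBLIC_MEMBERS]


def load_state(path):
    """Load a file produced by e.g. `terraform show -json > state.json`."""
    with open(path) as handle:
        return json.load(handle)
```

Calling `public_grants(load_state("state.json"))` on an exported state file would then list any binding that exposes a service to `allUsers` or `allAuthenticatedUsers`. The file name is only an example.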

Auditing

Even if (nearly) every managed service is codified, auditing the production deployment is always valuable and, for some businesses, required.

Unfortunately, this is fairly difficult today with most Public Cloud Providers. While tools in this area are expanding, they remain vastly insufficient. For example, in GCP you can use some aspects of Security Command Center, but most of its features are only available for a minimum of $25k/yr, which is borderline unacceptable (reference). With AWS, you would have to use several products, such as GuardDuty. The cost of configuring these proprietary (and costly) tools and making sure they accomplish what you expect is non-trivial for a smaller team. You must learn not only the tools themselves but also their oddities, complexities, and gaps. It is sometimes very difficult to know whether they will even accomplish what you are seeking without fully implementing them.

Neither these products nor any others available on the market (at least from our research) can easily answer the following questions:

What is the Ingress Setting on all Managed Services by Service?
What is the Authentication Setting on all Managed Services by Service?
Are there any Managed Services that are not Authenticated (IAM/IAP) and Public?
Who can invoke / access each Managed Service by Service?

Because Public Cloud Providers do not answer these questions out of the box in a single API call or simple interface, we have often written our own service(s) to answer them.

Here are a couple of sample snippets from our GCP service in Python that can handle App Engine, Cloud Run, and Cloud Functions:

for project_id, project_number in projects_list(service=service_cloudresourcemanager).items():
    print('Checking project:', project_id)

    # App Engine
    for appengine_service in appengine_list(service=service_appengine, project=project_id):
        # App Engine has the notion of enabled / disabled through the 'serving' status
        if appengine_enabled(service_appengine, appengine_service.replace('apps/', '')):
            print('> App Engine Enabled:', appengine_service)
            if appengine_iapenabled(service_appengine, appengine_service.replace('apps/', '')):
                print('> 📗 App Engine IAP Enabled:', appengine_service)
            else:
                print('> 📙 App Engine IAP Disabled:', appengine_service)
                if 'default' in appengine_service:
                    print('>> 📕 Should this exist?  Do you want to disable / check if disabled via empty code (you cannot delete the `default` service)?')
                else:
                    print('>> 📕 Public non-`default` service with no IAP?  Is that intended?')
        else:
            print('> 📗 App Engine Disabled:', appengine_service)
        ...

    # Cloud Run
    for cloudrun_service, cloudrun_region in cloudrun_list(service=service_cloudrun, project=project_id).items():
        print('> Cloud Run Service:', cloudrun_service)

        ingress_setting = cloudrun_ingresssetting(service=service_cloudrun, project=project_number, app=cloudrun_service)
        is_public = ingress_setting == 'all'

        policy = cloudrun_getiampolicy(service=service_cloudrun, project=project_number, app=cloudrun_service, region=cloudrun_region)
        authenticated = is_authenticated(policy)

        if is_public and not authenticated:
            print('>> 📕 Service is NOT Authenticated and IS Public- Is that intended?')
        ...

    # Cloud Functions
    for cloudfunction, ingress_setting in cloudfunctions_list(service=service_cloudfunctions, project=project_id).items():
        print('> Cloud Function:', cloudfunction)
        is_public = ingress_setting == 'ALLOW_ALL'

        policy = cloudfunctions_getiampolicy(service=service_cloudfunctions, app=cloudfunction)
        authenticated = is_authenticated(policy)

        if is_public and not authenticated:
            print('>> 📕 Function is NOT Authenticated and IS Public- Is that intended?')
        ...
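
The snippets above call an `is_authenticated(policy)` helper that is not shown. A minimal sketch of what such a helper could look like follows. It assumes `policy` is the dict returned by a `getIamPolicy` call (a `bindings` list of `role` / `members` entries) and treats both `allUsers` and `allAuthenticatedUsers` as public. That second choice is an assumption, since `allAuthenticatedUsers` admits any Google account. The `public_bindings` name is hypothetical.

```python
# Members that IAM treats as "anyone": no identity at all, or any Google identity.
PUBLIC_MEMBERS = frozenset({"allUsers", "allAuthenticatedUsers"})


def public_bindings(policy):
    """Return the (role, member) pairs in an IAM policy that grant access to everyone."""
    found = []
    for binding in (policy or {}).get("bindings", []):
        for member in binding.get("members", []):
            if member in PUBLIC_MEMBERS:
                found.append((binding.get("role"), member))
    return found


def is_authenticated(policy):
    """True when no binding in the policy opens the service to public members."""
    return not public_bindings(policy)
```

Returning the offending pairs from `public_bindings` rather than only a boolean keeps the audit output actionable: the report can name the exact role that was granted publicly.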

What is particularly confusing about both the GCP and AWS APIs is that they are transpiled to various languages, which causes a variety of issues: access to this fairly basic data becomes unnecessarily difficult, and you face endless version upgrades and missing functionality.

A couple of examples of this unnecessary difficulty:

Listing App Engine services uses an appsId argument, Cloud Run uses parent for listing, name for get, and resource for getIamPolicy!
API constructors differ: Cloud Run requires specifying an endpoint whereas other APIs do not. Getting this wrong can produce completely valid responses containing incorrect data.
Multiple APIs for Managed Services have v1, v2, and either beta or alpha versions, and you need to combine information across them with no clear stable foundation to build upon.
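
One way to contain the argument-name inconsistency described in the first item above is to record it in a single lookup table, so the rest of the audit code never hard-codes `appsId`, `parent`, `name`, or `resource`. This is only a sketch: the table holds just the four names mentioned above, and `ARG_NAMES` and `request_kwargs` are hypothetical names.

```python
# (api, method) -> keyword argument expected by that call, as listed in the text above.
ARG_NAMES = {
    ("appengine", "list"): "appsId",
    ("run", "list"): "parent",
    ("run", "get"): "name",
    ("run", "getIamPolicy"): "resource",
}


def request_kwargs(api, method, value):
    """Build the single-keyword dict for a call, failing loudly on unknown pairs."""
    try:
        arg_name = ARG_NAMES[(api, method)]
    except KeyError:
        raise KeyError(f"no argument name recorded for {api}.{method}") from None
    return {arg_name: value}
```

A caller would then unpack the result into the client call, for example `...getIamPolicy(**request_kwargs("run", "getIamPolicy", service_name))`, where `service_name` is whatever resource path that API expects.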

Conclusion

Today, managed service configuration on Public Cloud Providers does not make the secure option the default or the easy one. These providers have a lot more work to do to make these services secure-by-default.

Their APIs for auditing are extremely difficult to use, and several are simply lacking. There are multiple Console-only functions for managed services (not available even in gcloud or awscli), which is generally unacceptable.

A lot of opportunity exists for 3rd-parties to fill this gap if the Public Cloud Providers choose not to prioritize fixing these areas with their managed service products.

What is your organization doing to manage the proliferation of managed services in your infrastructure? What tools are you writing and utilizing? Feel free to get in touch if you are interested in working on these types of problems.

We are currently hiring for a Cryptocurrency Security Engineer in / around our Austin, TX location.