"title"=>"Design your Landing Zone — Design Considerations Part 3 — Monitoring, Logging, Billing and…",
"summary"=>nil,
"content"=>"
Design your Landing Zone — Design Considerations Part 3 — Monitoring, Logging, Billing and Labelling (Google Cloud Adoption Series)
Welcome to Part 3 of Landing Zone Design Considerations. This is part of the Google Cloud Adoption and Migration: From Strategy to Operation series.
In this part I’ll cover:
- Monitoring strategy
- Logging strategy
- Billing management and billing exports, and labelling
9. Monitoring Strategy
Monitoring is the process of collecting, processing, aggregating and displaying real time and historical quantitative data about a system.
Metrics
Google Cloud provides out-of-the-box monitoring through the Cloud Monitoring component (formerly known as Stackdriver) of the Google Cloud Operations (GCO) suite. Cloud Monitoring automatically ingests over 1,500 different metrics from over 100 different Google Cloud resources, and there is no cost for ingesting these metrics.
In addition, we can ingest metrics such as:
- Custom metrics, where we programmatically create custom telemetry, e.g. with the Cloud Monitoring API, with OpenCensus, or (for GKE) with Prometheus.
- Log-based metrics, i.e. where real-time information is derived from logs.
- Additional VM process metrics, via the Cloud Ops Agent.
- Out-of-the-box application metrics, such as nginx, Apache web server, MongoDB, Tomcat.
- GKE workloads, using natively-integrated Prometheus and OpenTelemetry.
- Hybrid cloud monitoring — e.g. monitoring signals from on-prem, AWS and Azure — using Blue Medora BindPlane.
Be mindful that certain metrics — such as custom metrics and log-based metrics — have an ingestion cost. So for such metrics, only ingest what is truly valuable.
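As an illustration of the log-based variety, a user-defined log-based metric can be created with a single gcloud command. This is only a sketch — the metric name, service and filter below are purely illustrative:
gcloud logging metrics create checkout_errors \
    --description="Count of error-level entries from the payments service" \
    --log-filter='resource.type="k8s_container" AND severity>=ERROR AND jsonPayload.service="payments"'
The resulting counter metric is then available in Cloud Monitoring for charting and alerting, and — as noted above — counts towards chargeable metric ingestion.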
Visualisation and Analysis
Of course, it’s not much use collecting metrics if you don’t do anything with this data. Cloud Monitoring allows you to:
- Create charts
- Use predefined and custom dashboards
- Share charts and dashboards
- Create uptime checks
- Define services and SLOs
- Create alerting policies
So the end-to-end use of metrics runs from ingestion, through charts and dashboards, to services, SLOs and alerting.
Metrics Scope Strategy
Google Cloud Monitoring defines an object called a metrics scope (formerly called a workspace). Such a scope defines the set of Google Cloud projects whose metrics are visible from a “single pane of glass”. Each scope contains:
- Predefined and custom dashboards
- Alerting policies
- Uptime checks
- Notification channels
- (Resource) group definitions
Every project has its own metrics scope, and — by default — this metrics scope only has visibility of the resources in that project. However, we can extend this scope to include the metrics from other projects. And this is how we can centralise and aggregate monitoring across projects.
And consequently, it is important to design your metrics scope strategy. Broadly, you have three options:
- One metrics scope for many related projects, across different environments. Here we have a single pane of glass for all related projects. However, anyone with the monitoring.viewer IAM role for the monitoring project can see all metrics for all of the projects. Thus, we have no granular control over monitoring.
- Maximum isolation — every project is monitored by a separate scope. Viewers in one project can’t (by default) see any metrics from another. This provides the highest degree of separation, but also potentially the highest operational management overheads.
- In-between — i.e. one scope for a small number of related projects. For example, we might have different scopes for production versus non-production. We do need to think about our project groupings.
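To make this concrete, here is a minimal sketch of adding a monitored project to a scoping project’s metrics scope using the Metrics Scopes API (part of the Cloud Monitoring API). The project IDs are illustrative, and the same operation can be performed from the Monitoring settings page in the console:
ACCESS_TOKEN=$(gcloud auth print-access-token)

# Add project app-prod-1 to the metrics scope of project monitoring-host-prod
curl -X POST \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"name": "locations/global/metricsScopes/monitoring-host-prod/projects/app-prod-1"}' \
  "https://monitoring.googleapis.com/v1/locations/global/metricsScopes/monitoring-host-prod/projects"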
Decision Points
Here’s a recap of some of the key decisions that need to be made:
- Where is your monitoring single pane of glass? Where possible, I’d recommend using only GCO Cloud Monitoring for the monitoring of Google Cloud infrastructure, services and applications. I.e. align to the “Cloud native / Cloud managed” principle. However, in hybrid environments, organisations may want to make use of existing external monitoring tools, such as New Relic. Be advised that this can considerably increase monitoring costs, e.g. through licensing of the third party monitoring tool, as well as egress costs.
- Organisational granularity and visibility of monitoring? SRE approach? Metrics scope strategy? Here we should consider our organisational SRE model. And we should decide on the appropriate level of metrics scope granularity.
- Establish your organisational best practices for what will be monitored, e.g. golden signals (latency, traffic, errors, saturation), SLIs, SLOs, and alerting policies, defined at application level. You will want to make this easily consumable and repeatable.
- Decide which chargeable metrics you want and need.
- Establish your alerting channels and best practices. E.g. define what sorts of alerts are urgent and require human interaction, and will therefore generate pages. Other alerts should generate tickets, and use alternative notification approaches.
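On that last point, notification channels can be scripted as well as clicked together in the console. Here’s a sketch of creating a non-paging e-mail channel — the display name and address are placeholders:
gcloud beta monitoring channels create \
    --display-name="Platform team - tickets" \
    --type=email \
    --channel-labels=email_address=platform-team@example.com
Alerting policies can then reference this channel, reserving your paging channels for the genuinely urgent alerts.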
10. Logging Strategy
Also part of the Google Cloud Operations suite, Cloud Logging is a fully-managed serverless service for the ingestion, storing, viewing, searching and analysis of logs. Cloud Logging components are exposed through the Logging API.
Logs are ingested from various sources, such as Google Cloud resources, GKE, third party applications, and user applications running on Google Cloud.
At the heart of the overall Cloud Logging architecture is the Log Router.
Log Routing and Sinks
The Log Router is responsible for ensuring that logs are reliably and efficiently routed to their destinations, via log sinks. A log sink is the combination of a destination with optional inclusion and exclusion filters.
Possible sinks include:
- Cloud Logging buckets — special storage buckets optimised for storing logging data, and which can be interrogated directly from the Logs Explorer in the Google Cloud Console.
- Google Cloud Storage — ideal for cheap long-term and automated log archiving.
- BigQuery — which is ideal for long-term storage, SQL-based analytics, and dashboarding.
- Pub/Sub topics — perfect for sending log data to downstream systems, such as on-prem Splunk. But also, we can use this to stream logs via Dataflow. This can be useful if we want to process the logging data on the fly.
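Because a sink is just a destination plus a filter, creating one is a single command. Here’s a sketch of routing all WARNING-and-above logs to a BigQuery dataset — the project and dataset names are illustrative:
gcloud logging sinks create warnings-to-bq \
    bigquery.googleapis.com/projects/central-logging-prj/datasets/logs_analytics \
    --log-filter='severity>=WARNING'

# Remember to grant the sink's writer identity (shown in the command output)
# the BigQuery Data Editor role on the destination dataset.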
Two predefined sinks are created in each Google Cloud project:
- _Required — where all Admin Activity, System Event, and Access Transparency audit logs go. These logs are retained for 400 days. The _Required log bucket cannot be modified or deleted.
- _Default — all other logs go here, unless they are sent to _Required or explicitly excluded. By default, retention is 30 days, but it can be increased to 3650 days (10 years). The _Default bucket cannot be deleted, but its sink can be disabled.
Note that it is possible to export logs using a sink, whilst also excluding those logs from being ingested into Cloud Logging buckets. This is very useful for managing logging costs.
We can aggregate logging with folder-level, or even organisation-level, sinks. This logging aggregation works by including logs from child resources. We do this by creating an aggregated sink at folder or org level, and setting the includeChildren parameter to True. Then we select the destination for the sink, just like any other sink.
(When doing this, it is wise to add an exclusion filter to the _Default sink, so that logs are not retained in both places.)
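As a sketch, an organisation-level aggregated sink for audit logs, plus the corresponding _Default exclusion, might look like this. The org ID, project, bucket and region are all illustrative:
# Aggregated sink at org level, routing audit logs from all child projects
# to a central log bucket (--include-children sets includeChildren to true)
gcloud logging sinks create org-audit-sink \
    logging.googleapis.com/projects/central-logging-prj/locations/europe-west2/buckets/org-audit-bucket \
    --organization=123456789012 \
    --include-children \
    --log-filter='logName:"cloudaudit.googleapis.com"'

# Exclude the same logs from a project's _Default sink, so they are not stored twice
gcloud logging sinks update _Default \
    --add-exclusion=name=exclude-audit,filter='logName:"cloudaudit.googleapis.com"'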
It is a good idea to aggregate logs from across the organisation into a few logging buckets. Consider subdividing these buckets based on properties such as:
- Retention period
- User access requirements
- Data residency requirements
- Where you may need to use customer-managed encryption keys.
We can then use Log views and IAM bindings to determine who can see the logs.
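For example, a log view scoped to a single team’s projects within a shared bucket might be sketched like this (the bucket, location and filter are illustrative); users are then granted access to the view via an IAM binding:
gcloud logging views create payments-team-view \
    --bucket=org-audit-bucket \
    --location=europe-west2 \
    --log-filter='source("projects/payments-prod")' \
    --description="Payments team logs only"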
Audit Logs
Most Google Cloud services create audit logs. They are intended to answer the question of: “Who did what, where, and when?” These are incredibly important for compliance in regulated industries.
There are five categories of audit logs:
- The Admin Activity logs are always enabled, retained for 400 days, and have no charge. They record any API calls that modify the configuration of resources, such as creating networks, creating or modifying instances, etc.
- The System Event logs are always enabled, retained for 400 days, and have no charge. They record activities that modify the configuration of resources, but which occur due to Google-initiated events, rather than direct user interactions. For example, live migrations.
- The Data Access logs are disabled by default (BigQuery being a notable exception) and are chargeable once enabled. They record reads of resource configuration and metadata, as well as reads and writes of user data — useful, for example, for identifying costly data queries.
- The Policy Denied logs are enabled by default and are chargeable. They record when access is denied to a user or service account because of a security policy violation. They cannot be disabled, but they can be kept out of storage with exclusion filters.
- Access Transparency logs give visibility of any actions performed by Google staff. For example, as part of an open incident with Google.
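Since Data Access logs are off by default, here is a sketch of enabling them for one service by editing the project’s IAM policy. The project and service are illustrative, and the same change can be made from the console’s Audit Logs page:
# Download the current IAM policy
gcloud projects get-iam-policy my-project --format=yaml > policy.yaml

# Add (or extend) an auditConfigs stanza in policy.yaml, for example:
#   auditConfigs:
#   - service: storage.googleapis.com
#     auditLogConfigs:
#     - logType: DATA_READ
#     - logType: DATA_WRITE

# Apply the updated policy
gcloud projects set-iam-policy my-project policy.yaml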
Network Logs
VPC flow logs allow you to record samples of network flows sent or received by VM instances, including GKE nodes. These logs are useful for diagnosing network issues, network and security analytics, and forensics. They are disabled by default, but can be enabled on a per-VPC subnet basis.
Logging volumes can be significant, so it’s important to optimise the aggregation interval and sampling rate.
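As a sketch, here’s how you might enable flow logs on a single subnet with a reduced sampling rate and a coarser aggregation interval — the subnet, region and values are illustrative:
gcloud compute networks subnets update prod-subnet-1 \
    --region=europe-west2 \
    --enable-flow-logs \
    --logging-aggregation-interval=interval-5-min \
    --logging-flow-sampling=0.25 \
    --logging-metadata=exclude-all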
Cloud Firewall Rules Logging can be used to answer questions like:
“How many connections match this rule?”
“Did this rule cause my application outage?”
“Is this rule incorrectly stopping traffic?”
Like VPC Flow logs, Firewall Rule Logging is disabled by default. It can be enabled on a per-rule basis. E.g.
gcloud compute firewall-rules update <rule-name> --enable-logging
And as with VPC Flow logs, this can result in a lot of logging, and it can get expensive quickly. A common strategy is to NOT enable rules logging by default, but to only turn on rules logging when diagnosing issues. And you can always start by creating a low priority allow-all rule, to verify that the traffic is hitting the firewall in the first place.
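A sketch of such a temporary, low-priority catch-all rule with logging enabled — the network name and priority are illustrative, and you should delete the rule once you’ve finished diagnosing:
gcloud compute firewall-rules create debug-catch-all \
    --network=prod-vpc \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=all \
    --priority=65000 \
    --enable-logging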
Also worth mentioning: we can use Cloud NAT logs to capture connections created and packets dropped by Cloud NAT.
Summary of Cloud Logging Decision Points
- Will you use Google Cloud Logging only, or do you need to export logs to downstream systems, like Splunk? Some organisations use Splunk for SIEM, so you may have a requirement to export audit logs to Splunk. But be careful not to do this for ALL logs, since Splunk is extremely expensive when working with large volumes of logs.
- What are log compliance, data residency, and retention requirements?
- Which logs will you exclude?
- Will you enable VPC flow logs? If so, what sampling rate?
- Will you archive logs for long-term storage? Will you use GCS lifecycle management to automate this?
- Will you export logs to BigQuery for analytics?
- Which aggregated log sinks will you define?
- Who will be given access to logs? Who will be given access to aggregated logs?
- When writing logs from applications, what standards will you follow? A good practice is to write structured JSON log entries, since these can then be easily parsed and filtered (see the example after this list).
- Will we implement a mechanism to dynamically enable logging for troubleshooting?
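On the structured-logging point above, a quick way to see how a structured entry behaves in the Logs Explorer is to write a test entry with gcloud. The log name and payload fields here are purely illustrative:
gcloud logging write my-app-log \
    '{"message": "payment failed", "orderId": "12345"}' \
    --payload-type=json \
    --severity=ERROR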
11. Billing Management, Billing Exports, and Labelling
One of the big advantages of cloud is that it makes your costs visible and transparent. It’s easy to see how much you’re paying, what you’re paying for, and who’s using the resources that you’re paying for. But to do this effectively, a bit of planning is required.
Google Cloud accumulates costs at the project level. I.e. every resource belongs to a project. Furthermore, projects are associated with one and only one billing account. This is how your organisation actually gets billed by Google.
A project must be associated with a billing account, in order to consume any chargeable services. A billing account can be associated with many projects. Some organisations — typically Google resellers — will have billing sub-accounts, which can be used to aggregate billing for specific clients.
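Linking a project to a billing account is a one-line operation, which makes it easy to automate. A sketch, where the project ID and billing account ID are placeholders:
gcloud billing projects link my-app-project \
    --billing-account=000000-AAAAAA-111111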
Some Roles to Care About
The Billing Account Admin role is typically given to individuals with financial responsibility. They will be able to:
- Link and unlink billing accounts to projects.
- Enable billing exports.
- View spend and costs.
- Set budgets and alerts.
The Billing Account Viewer can view billing accounts, but can’t make any changes.
The Project Billing Manager can assign a billing account to a project, and disable billing of a project. (Though you’d normally automate this.)
Billing Visibility
The Google Cloud Console includes built-in Billing Reports.
This information can be filtered based on properties such as project, date, and product.
A really useful feature is the ability to view cost trends.
We can also export this billing data into BigQuery, which allows for much more sophisticated analysis. Not only can we then query the data using SQL, but we can then easily visualise and analyse the data using tools like Google Looker Studio.
Labelling
Labels (not to be confused with network tags) are incredibly important! They are key-value pairs which we can assign to Google Cloud projects and resources.
We can then use these labels to help us analyse our consumption and billing data. For example, we can run something like this in BigQuery:
SELECT
  project_labels.key AS label_key,
  project_labels.value AS label_value,
  SUM(cost) AS cost_total,
  SUM(usage.amount) AS usage_total
FROM
  billing_table  -- your Cloud Billing export table in BigQuery
LEFT JOIN
  UNNEST(project.labels) AS project_labels
WHERE
  usage_start_time >= TIMESTAMP('2024-02-01')
  AND usage_start_time < TIMESTAMP('2024-03-01')
GROUP BY
  label_key,
  label_value
Consequently, it’s a good idea to come up with a labelling strategy and label naming standards. Wherever possible, we should set labels in an automated fashion, when we create resources using infrastructure-as-code.
We can define any labels we want. But here are some suggestions:
- Team / cost centre — e.g. “team:research”
- Environment — e.g. “environment:staging”
- Component — e.g. “component:fe”
- State — e.g. “state:pending-deletion”
- Shared resource — e.g. “shared:true”. This can be useful for identifying projects that are used by many tenants, and for subsequently attributing costs.
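While labels are ideally set via IaC, they can also be applied with gcloud. A sketch, where the project, instance and label values are illustrative:
# Label a project
gcloud projects update my-research-project \
    --update-labels=team=research,environment=staging

# Label an individual resource, e.g. a Compute Engine instance
gcloud compute instances update my-frontend-vm \
    --zone=europe-west2-a \
    --update-labels=component=fe,shared=true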
Key Design Considerations
- What organisational changes do you need to make, in order to take advantage of consumption-based (pay-as-you-go) resources, and to drive cost-savvy decisions?
- How many billing accounts do we need?
- Who will be given the Billing Administrator role? (Note that Org Admins do not have Billing Admin by default, but they can grant this role to others, and to themselves.)
- Who will be Billing Viewers?
- How will projects be linked with billing accounts? Using IaC?
- Will we export billing data to BigQuery?
- What is our labelling strategy? What are our label naming standards?
By the way, I’ve intentionally steered clear of FinOps. I’ll be covering that in a later article in this series.
Wrap-Up
Here we’ve covered the key LZ design considerations for monitoring, logging, and billing. In the next part, we’ll cover the last major LZ design consideration topic: infra-as-code (IaC) and GitOps.
Before You Go
- Please share this with anyone that you think will be interested. It might help them, and it really helps me!
- Feel free to leave a comment 💬.
- Follow and subscribe, so you don’t miss my content. Go to my Profile Page, and click on these icons:
Links
- Landing Zones on Google Cloud: What It Is, Why You Need One, and How to Create One
- Landing zone design in Google Cloud
- Google Cloud Adoption: SRE and Best Practices for SLI / SLO / SLA
- Observability in Google Cloud
- Setup monitoring, alerting and logging
- Google Cloud Monitoring
- Blue Medora BindPlane
- Google Cloud Logging
- Cloud Billing Overview
- Export Cloud Billing Data to BigQuery
- Example queries for Cloud Billing data export
- Google Cloud Architecture Framework
- Enterprise Foundations Blueprint
Series Navigation
- Series overview and structure
- Previous: Design your Landing Zone — Design Considerations Part 2: Kubernetes
- Next: Design your Landing Zone — Design Considerations Part 4: IaC, GitOps and CI/CD
","author"=>"Dazbo (Darren Lester)",
"link"=>"https://medium.com/google-cloud/design-your-landing-zone-design-considerations-part-3-monitoring-logging-billing-and-7b40189a3c81?source=rss----e52cf94d98af---4",
"published_date"=>Thu, 28 Mar 2024 02:45:19.000000000 UTC +00:00,
"image_url"=>nil,
"feed_url"=>"https://medium.com/google-cloud/design-your-landing-zone-design-considerations-part-3-monitoring-logging-billing-and-7b40189a3c81?source=rss----e52cf94d98af---4",
"language"=>nil,
"active"=>true,
"ricc_source"=>"feedjira::v1",
"created_at"=>Sun, 31 Mar 2024 21:41:04.812312000 UTC +00:00,
"updated_at"=>Tue, 14 May 2024 05:27:07.348426000 UTC +00:00,
"newspaper"=>"Google Cloud - Medium",
"macro_region"=>"Blogs"}