"title"=>"Design your Landing Zone — Design Considerations Part 3 — Monitoring, Logging, Billing and…",
"summary"=>nil,
"content"=>"
Design your Landing Zone — Design Considerations Part 3 — Monitoring, Logging, Billing and Labelling (Google Cloud Adoption Series)
Welcome to Part 3 of Landing Zone Design Considerations. This is part of the Google Cloud Adoption and Migration: From Strategy to Operation series.
In this part I’ll cover:
- Monitoring strategy
- Logging strategy
- Billing management and billing exports, and labelling
9. Monitoring Strategy
Monitoring is the process of collecting, processing, aggregating and displaying real time and historical quantitative data about a system.
Metrics
Google Cloud provides out-of-the-box monitoring through the Cloud Monitoring component (formerly known as Stackdriver) of the Google Cloud Operations (GCO) suite. Cloud Monitoring automatically ingests over 1,500 different metrics from over 100 different Google Cloud resources, and there is no cost for ingesting these metrics.
In addition, we can ingest metrics such as:
- Custom metrics, where we programmatically create custom telemetry, e.g. with the Cloud Monitoring API, with OpenCensus, or (for GKE) with Prometheus.
- Log-based metrics, i.e. where real-time information is derived from logs.
- Additional VM process metrics, via the Cloud Ops Agent.
- Out-of-the-box application metrics, such as nginx, Apache web server, MongoDB, Tomcat.
- GKE workloads, using natively-integrated Prometheus and OpenTelemetry.
- Hybrid cloud monitoring — e.g. monitoring signals from on-prem, AWS and Azure — using Blue Medora BindPlane.
Be mindful that certain metrics — such as custom metrics and log-based metrics — have an ingestion cost. So for such metrics, only ingest what is truly valuable.
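As an illustration of the log-based variety, a user-defined log-based metric can be created with a single gcloud command. This is only a sketch — the metric name, service and filter below are purely illustrative:
gcloud logging metrics create checkout_errors \
    --description="Count of error-level entries from the payments service" \
    --log-filter='resource.type="k8s_container" AND severity>=ERROR AND jsonPayload.service="payments"'
The resulting counter metric is then available in Cloud Monitoring for charting and alerting, and — as noted above — counts towards chargeable metric ingestion.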
Visualisation and Analysis
Of course, it’s not much use collecting metrics if you don’t do anything with this data. Cloud Monitoring allows you to:
- Create charts
- Use predefined and custom dashboards
- Share charts and dashboards
- Create uptime checks
- Define services and SLOs
- Create alerting policies
So the end-to-end use of metrics runs from ingestion, through charts and dashboards, to services, SLOs and alerting.
Metrics Scope Strategy
Google Cloud Monitoring defines an object called a metrics scope (formerly called a workspace). Such a scope defines the set of Google Cloud projects whose metrics are visible from a “single pane of glass”. Each scope contains:
- Predefined and custom dashboards
- Alerting policies
- Uptime checks
- Notification channels
- (Resource) group definitions
Every project has its own metrics scope, and — by default — this metrics scope only has visibility of the resources in that project. However, we can extend this scope to include the metrics from other projects. And this is how we can centralise and aggregate monitoring across projects.
And consequently, it is important to design your metrics scope strategy. Broadly, you have three options:
- One metrics scope for many related projects, across different environments. Here we have a single pane of glass for all related projects. However, anyone with the monitoring.viewer IAM role for the monitoring project can see all metrics for all of the projects. Thus, we have no granular control over monitoring.
- Maximum isolation — every project is monitored by a separate scope. Viewers in one project can’t (by default) see any metrics from another. This provides the highest degree of separation, but also potentially the highest operational management overheads.
- In-between — i.e. one scope for a small number of related projects. For example, we might have different scopes for production versus non-production. We do need to think about our project groupings.
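To make this concrete, here is a minimal sketch of adding a monitored project to a scoping project’s metrics scope using the Metrics Scopes API (part of the Cloud Monitoring API). The project IDs are illustrative, and the same operation can be performed from the Monitoring settings page in the console:
ACCESS_TOKEN=$(gcloud auth print-access-token)

# Add project app-prod-1 to the metrics scope of project monitoring-host-prod
curl -X POST \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"name": "locations/global/metricsScopes/monitoring-host-prod/projects/app-prod-1"}' \
  "https://monitoring.googleapis.com/v1/locations/global/metricsScopes/monitoring-host-prod/projects"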
Decision Points
Here’s a recap of some of the key decisions that need to be made:
- Where is your monitoring single pane of glass? Where possible, I’d recommend using only GCO Cloud Monitoring for the monitoring of Google Cloud infrastructure, services and applications. I.e. align to the “Cloud native / Cloud managed” principle. However, in hybrid environments, organisations may want to make use of existing external monitoring tools, such as New Relic. Be advised that this can considerably increase monitoring costs, e.g. through licensing of the third party monitoring tool, as well as egress costs.
- Organisational granularity and visibility of monitoring? SRE approach? Metrics scope strategy? Here we should consider our organisational SRE model. And we should decide on the appropriate level of metrics scope granularity.
- Establish your organisational best practices for what will be monitored, e.g. golden signals (latency, traffic, errors, saturation), SLIs, SLOs, and alerting policies, defined at application level. You will want to make this easily consumable and repeatable.
- Decide which chargeable metrics you want and need.
- Establish your alerting channels and best practices. E.g. define what sorts of alerts are urgent and require human interaction, and will therefore generate pages. Other alerts should generate tickets, and use alternative notification approaches.
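On that last point, notification channels can be scripted as well as clicked together in the console. Here’s a sketch of creating a non-paging e-mail channel — the display name and address are placeholders:
gcloud beta monitoring channels create \
    --display-name="Platform team - tickets" \
    --type=email \
    --channel-labels=email_address=platform-team@example.com
Alerting policies can then reference this channel, reserving your paging channels for the genuinely urgent alerts.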
10. Logging Strategy
Also part of the Google Cloud Operations suite, Cloud Logging is a fully-managed serverless service for the ingestion, storing, viewing, searching and analysis of logs. Cloud Logging components are exposed through the Logging API.
Logs are ingested from various sources, such as Google Cloud resources, GKE, third party applications, and user applications running on Google Cloud.
At the heart of the overall Cloud Logging architecture is the Log Router.
Log Routing and Sinks
The Log Router is responsible for ensuring that logs are reliably and efficiently routed to their destinations, via log sinks. A log sink is the combination of a destination with optional inclusion and exclusion filters.
Possible sinks include:
- Cloud Logging buckets — special storage buckets optimised for storing logging data, and which can be interrogated directly from the Logs Explorer in the Google Cloud Console.
- Google Cloud Storage — ideal for cheap long-term and automated log archiving.
- BigQuery — which is ideal for long-term storage, SQL-based analytics, and dashboarding.
- Pub/Sub topics — perfect for sending log data to downstream systems, such as on-prem Splunk. But also, we can use this to stream logs via Dataflow. This can be useful if we want to process the logging data on the fly.
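Because a sink is just a destination plus a filter, creating one is a single command. Here’s a sketch of routing all WARNING-and-above logs to a BigQuery dataset — the project and dataset names are illustrative:
gcloud logging sinks create warnings-to-bq \
    bigquery.googleapis.com/projects/central-logging-prj/datasets/logs_analytics \
    --log-filter='severity>=WARNING'

# Remember to grant the sink's writer identity (shown in the command output)
# the BigQuery Data Editor role on the destination dataset.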
Two predefined sinks are created in each Google Cloud project:
- _Required — where all Admin Activity, System Event, and Access Transparency audit logs go. These logs are retained for 400 days. The _Required log bucket cannot be modified or deleted.
- _Default — all other logs go here, unless they are sent to _Required or explicitly excluded. By default, retention is 30 days, but it can be increased to 3650 days (10 years). The _Default bucket cannot be deleted, but its sink can be disabled.
Note that it is possible to export logs using a sink, whilst also excluding those logs from being ingested into Cloud Logging buckets. This is very useful for managing logging costs.
We can aggregate logging with folder-level, or even organisation-level, sinks. This logging aggregation works by including logs from child resources. We do this by creating an aggregated sink at folder or org level, and setting the includeChildren parameter to True. Then we select the destination for the sink, just like any other sink.
(When doing this, it is wise to add an exclusion filter to the _Default sink, so that logs are not retained in both places.)
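As a sketch, an organisation-level aggregated sink for audit logs, plus the corresponding _Default exclusion, might look like this. The org ID, project, bucket and region are all illustrative:
# Aggregated sink at org level, routing audit logs from all child projects
# to a central log bucket (--include-children sets includeChildren to true)
gcloud logging sinks create org-audit-sink \
    logging.googleapis.com/projects/central-logging-prj/locations/europe-west2/buckets/org-audit-bucket \
    --organization=123456789012 \
    --include-children \
    --log-filter='logName:"cloudaudit.googleapis.com"'

# Exclude the same logs from a project's _Default sink, so they are not stored twice
gcloud logging sinks update _Default \
    --add-exclusion=name=exclude-audit,filter='logName:"cloudaudit.googleapis.com"'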
It is a good idea to aggregate logs from across the organisation into a few logging buckets. Consider subdividing these buckets based on properties such as:
- Retention period
- User access requirements
- Data residency requirements
- Where you may need to use customer-managed encryption keys.
We can then use Log views and IAM bindings to determine who can see the logs.
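For example, a log view scoped to a single team’s projects within a shared bucket might be sketched like this (the bucket, location and filter are illustrative); users are then granted access to the view via an IAM binding:
gcloud logging views create payments-team-view \
    --bucket=org-audit-bucket \
    --location=europe-west2 \
    --log-filter='source("projects/payments-prod")' \
    --description="Payments team logs only"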
Audit Logs
Most Google Cloud services create audit logs. They are intended to answer the question of: “Who did what, where, and when?” These are incredibly important for compliance in regulated industries.
There are five categories of audit logs:
- The Admin Activity logs are always enabled, retained for 400 days, and have no charge. They record any API calls that modify the configuration of resources, such as creating networks, creating or modifying instances, etc.
- The System Event logs are always enabled, retained for 400 days, and have no charge. They record activities that modify the configuration of resources, but which occur due to Google-initiated events, rather than direct user interactions. For example, live migrations.
- The Data Access logs are disabled by default (BigQuery being a notable exception) and are chargeable once enabled. They record reads of resource configuration and metadata, as well as reads and writes of user data — useful, for example, for identifying costly data queries.
- The Policy Denied logs are enabled by default and are chargeable. They record when access is denied to a user or service account because of a security policy violation. They cannot be disabled, but they can be kept out of storage with exclusion filters.
- Access Transparency logs give visibility of any actions performed by Google staff. For example, as part of an open incident with Google.
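Since Data Access logs are off by default, here is a sketch of enabling them for one service by editing the project’s IAM policy. The project and service are illustrative, and the same change can be made from the console’s Audit Logs page:
# Download the current IAM policy
gcloud projects get-iam-policy my-project --format=yaml > policy.yaml

# Add (or extend) an auditConfigs stanza in policy.yaml, for example:
#   auditConfigs:
#   - service: storage.googleapis.com
#     auditLogConfigs:
#     - logType: DATA_READ
#     - logType: DATA_WRITE

# Apply the updated policy
gcloud projects set-iam-policy my-project policy.yaml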
Network Logs
VPC flow logs allow you to record samples of network flows sent or received by VM instances, including GKE nodes. These logs are useful for diagnosing network issues, network and security analytics, and forensics. They are disabled by default, but can be enabled on a per-VPC subnet basis.
Logging volumes can be significant, so it’s important to optimise the aggregation interval and sampling rate.
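As a sketch, here’s how you might enable flow logs on a single subnet with a reduced sampling rate and a coarser aggregation interval — the subnet, region and values are illustrative:
gcloud compute networks subnets update prod-subnet-1 \
    --region=europe-west2 \
    --enable-flow-logs \
    --logging-aggregation-interval=interval-5-min \
    --logging-flow-sampling=0.25 \
    --logging-metadata=exclude-all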
Cloud Firewall Rules Logging can be used to answer questions like:
“How many connections match this rule?”
“Did this rule cause my application outage?”
“Is this rule incorrectly stopping traffic?”
Like VPC Flow logs, Firewall Rule Logging is disabled by default. It can be enabled on a per-rule basis. E.g.
gcloud compute firewall-rules update <rule-name> --enable-logging
And as with VPC Flow logs, this can result in a lot of logging, and it can get expensive quickly. A common strategy is to NOT enable rules logging by default, but to only turn on rules logging when diagnosing issues. And you can always start by creating a low priority allow-all rule, to verify that the traffic is hitting the firewall in the first place.
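A sketch of such a temporary, low-priority catch-all rule with logging enabled — the network name and priority are illustrative, and you should delete the rule once you’ve finished diagnosing:
gcloud compute firewall-rules create debug-catch-all \
    --network=prod-vpc \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=all \
    --priority=65000 \
    --enable-logging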
Also worth mentioning: we can use Cloud NAT logs to capture connections created and packets dropped by Cloud NAT.
Summary of Cloud Logging Decision Points
- Will you use Google Cloud Logging only, or do you need to export logs to downstream systems, like Splunk? Some organisations use Splunk for SIEM, so you may have a requirement to export audit logs to Splunk. But be careful not to do this for ALL logs, since Splunk is extremely expensive when working with large volumes of logs.
- What are log compliance, data residency, and retention requirements?
- Which logs will you exclude?
- Will you enable VPC flow logs? If so, what sampling rate?
- Will you archive logs for long-term storage? Will you use GCS lifecycle management to automate this?
- Will you export logs to BigQuery for analytics?
- Which aggregated log sinks will you define?
- Who will be given access to logs? Who will be given access to aggregated logs?
- When writing logs from applications, what standards will you follow? A good practice is to write structured JSON log entries, since these can then be easily parsed and filtered (see the example after this list).
- Will we implement a mechanism to dynamically enable logging for troubleshooting?
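On the structured-logging point above, a quick way to see how a structured entry behaves in the Logs Explorer is to write a test entry with gcloud. The log name and payload fields here are purely illustrative:
gcloud logging write my-app-log \
    '{"message": "payment failed", "orderId": "12345"}' \
    --payload-type=json \
    --severity=ERROR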
11. Billing Management, Billing Exports, and Labelling
One of the big advantages of cloud is that it makes your costs visible and transparent. It’s easy to see how much you’re paying, what you’re paying for, and who’s using the resources that you’re paying for. But to do this effectively, a bit of planning is required.
Google Cloud accumulates costs at the project level. I.e. every resource belongs to a project. Furthermore, projects are associated with one and only one billing account. This is how your organisation actually gets billed by Google.
A project must be associated with a billing account, in order to consume any chargeable services. A billing account can be associated with many projects. Some organisations — typically Google resellers — will have billing sub-accounts, which can be used to aggregate billing for specific clients.
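Linking a project to a billing account is a one-line operation, which makes it easy to automate. A sketch, where the project ID and billing account ID are placeholders:
gcloud billing projects link my-app-project \
    --billing-account=000000-AAAAAA-111111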
Some Roles to Care About
The Billing Account Admin role is typically given to individuals with financial responsibility. They will be able to:
- Link and unlink billing accounts to projects.
- Enable billing exports.
- View spend and costs.
- Set budgets and alerts.
The Billing Account Viewer can view billing accounts, but can’t make any changes.
The Project Billing Manager can assign a billing account to a project, and disable billing of a project. (Though you’d normally automate this.)
Billing Visibility
The Google Cloud Console includes built-in Billing Reports.
This information can be filtered based on properties such as project, date, and product.
A really useful feature is the ability to view cost trends.
We can also export this billing data into BigQuery, which allows for much more sophisticated analysis. Not only can we then query the data using SQL, but we can then easily visualise and analyse the data using tools like Google Looker Studio.
Labelling
Labels (not to be confused with network tags) are incredibly important! They are key-value pairs which we can assign to Google Cloud projects and resources.
We can then use these labels to help us analyse our consumption and billing data. For example, we can run something like this in BigQuery:
SELECT
  project_labels.key AS label_key,
  project_labels.value AS label_value,
  SUM(cost) AS cost_total,
  SUM(usage.amount) AS usage_total
FROM
  billing_table  -- your Cloud Billing export table in BigQuery
LEFT JOIN
  UNNEST(project.labels) AS project_labels
WHERE
  usage_start_time >= TIMESTAMP('2024-02-01')
  AND usage_start_time < TIMESTAMP('2024-03-01')
GROUP BY
  label_key,
  label_value
Consequently, it’s a good idea to come up with a labelling strategy and label naming standards. Wherever possible, we should set labels in an automated fashion, when we create resources using infrastructure-as-code.
We can define any labels we want. But here are some suggestions:
- Team / cost centre — e.g. “team:research”
- Environment — e.g. “environment:staging”
- Component — e.g. “component:fe”
- State — e.g. “state:pending-deletion”
- Shared resource — e.g. “shared:true”. This can be useful for identifying projects that are used by many tenants, and for subsequently attributing costs.
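While labels are ideally set via IaC, they can also be applied with gcloud. A sketch, where the project, instance and label values are illustrative:
# Label a project
gcloud projects update my-research-project \
    --update-labels=team=research,environment=staging

# Label an individual resource, e.g. a Compute Engine instance
gcloud compute instances update my-frontend-vm \
    --zone=europe-west2-a \
    --update-labels=component=fe,shared=true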
Key Design Considerations
- What organisational changes do you need to make, in order to take advantage of consumption-based (pay-as-you-go) resources, and to drive cost-savvy decisions?
- How many billing accounts do we need?
- Who will be given the Billing Administrator role? (Note that Org Admins do not have Billing Admin by default, but they can grant this role to others, and to themselves.)
- Who will be Billing Viewers?
- How will projects be linked with billing accounts? Using IaC?
- Will we export billing data to BigQuery?
- What is our labelling strategy? What are our label naming standards?
By the way, I’ve intentionally steered clear of FinOps. I’ll be covering that in a later article in this series.
Wrap-Up
Here we’ve covered the key LZ design considerations for monitoring, logging, and billing. In the next part, we’ll cover the last major LZ design consideration topic: infra-as-code (IaC) and GitOps.
Before You Go
- Please share this with anyone that you think will be interested. It might help them, and it really helps me!
- Feel free to leave a comment 💬.
- Follow and subscribe, so you don’t miss my content. Go to my Profile Page, and click on these icons:
Links
- Landing Zones on Google Cloud: What It Is, Why You Need One, and How to Create One
- Landing zone design in Google Cloud
- Google Cloud Adoption: SRE and Best Practices for SLI / SLO / SLA
- Observability in Google Cloud
- Setup monitoring, alerting and logging
- Google Cloud Monitoring
- Blue Medora BindPlane
- Google Cloud Logging
- Cloud Billing Overview
- Export Cloud Billing Data to BigQuery
- Example queries for Cloud Billing data export
- Google Cloud Architecture Framework
- Enterprise Foundations Blueprint
Series Navigation
- Series overview and structure
- Previous: Design your Landing Zone — Design Considerations Part 2: Kubernetes
- Next: Design your Landing Zone — Design Considerations Part 4: IaC, GitOps and CI/CD
","author"=>"Dazbo (Darren Lester)",
"link"=>"https://medium.com/google-cloud/design-your-landing-zone-design-considerations-part-3-monitoring-logging-billing-and-7b40189a3c81?source=rss----e52cf94d98af---4",
"published_date"=>Thu, 28 Mar 2024 02:45:19.000000000 UTC +00:00,
"image_url"=>nil,
"feed_url"=>"https://medium.com/google-cloud/design-your-landing-zone-design-considerations-part-3-monitoring-logging-billing-and-7b40189a3c81?source=rss----e52cf94d98af---4",
"language"=>nil,
"active"=>true,
"ricc_source"=>"feedjira::v1",
"created_at"=>Sun, 31 Mar 2024 21:41:04.812312000 UTC +00:00,
"updated_at"=>Tue, 14 May 2024 05:27:07.348426000 UTC +00:00,
"newspaper"=>"Google Cloud - Medium",
"macro_region"=>"Blogs"}