♊️ GemiNews 🗞️
Editing article
Title
Summary
<div class="block-paragraph_advanced"><p><span style="vertical-align: baseline;">Data engineers know that eventing is all about speed, scale, and efficiency. Event streams (high-volume data feeds coming off of sources such as point-of-sale devices or websites logging stateless clickstream activity) carry lightweight event payloads that often lack the information to make each event actionable on its own. It is up to the consumers of the event stream to transform and enrich the events, followed by further processing as required for their particular use case. </span></p></div> <div class="block-image_full_width"> <div class="article-module h-c-page"> <div class="h-c-grid"> <figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 " > <img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Enrich_your_streaming_data_using_Bigtable_.max-1000x1000.jpg" alt="Enrich your streaming data using Bigtable and Dataflow"> </figure> </div> </div> </div> <div class="block-paragraph_advanced"><p><span style="vertical-align: baseline;">Key-value stores such as </span><a href="https://cloud.google.com/bigtable"><span style="text-decoration: underline; vertical-align: baseline;">Bigtable</span></a><span style="vertical-align: baseline;"> are the preferred choice for such workloads, with their ability to process hundreds of thousands of events per second at very low latencies. However, key-value lookups often require a lot of careful productionization and scaling code to ensure the processing happens with low latency and good operational performance. 
</span></p> <p><span style="vertical-align: baseline;">With the new </span><a href="https://cloud.google.com/dataflow/docs/guides/enrichment"><span style="text-decoration: underline; vertical-align: baseline;">Apache Beam Enrichment transform</span></a><span style="vertical-align: baseline;">, this process is now just a few lines of code, allowing you to take events from messaging systems like </span><a href="https://cloud.google.com/pubsub/docs/overview"><span style="text-decoration: underline; vertical-align: baseline;">Pub/Sub</span></a><span style="vertical-align: baseline;"> or Apache Kafka and enrich them with data in Bigtable before they are sent along for further processing.</span></p> <p><span style="vertical-align: baseline;">This is critical for streaming applications, because streaming joins enrich the data to give meaning to the streaming event. For example, knowing the contents of a user’s shopping cart, or whether they browsed similar items before, can bring valuable context to clickstream data that feeds into a recommendation model. Identifying a fraudulent in-store credit card transaction requires much more information than what’s in the current transaction, such as the location of the prior purchase, the count of recent transactions, or whether a travel notice is in place. 
Similarly, enriching telemetry data from factory-floor hardware with historical signals from the same device, or with overall fleet statistics, can help a machine learning (ML) model predict failures before they happen.</span></p> <p><span style="vertical-align: baseline;">The </span><a href="https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.enrichment.html" rel="noopener" target="_blank"><span style="text-decoration: underline; vertical-align: baseline;">Apache Beam enrichment transform</span></a><span style="vertical-align: baseline;"> takes care of client-side throttling, rate-limiting the number of requests sent to the Bigtable instance when necessary. It retries the requests with a configurable retry strategy, which by default is exponential backoff. Coupled with autoscaling, this allows Bigtable and </span><a href="https://cloud.google.com/dataflow"><span style="text-decoration: underline; vertical-align: baseline;">Dataflow</span></a><span style="vertical-align: baseline;"> to scale up and down in tandem and automatically reach an equilibrium. 
Beam 2.54.0 supports exponential backoff, which can be disabled or replaced with a custom implementation.</span></p> <p><span style="vertical-align: baseline;">Let's see this in action:</span></p></div> <div class="block-code"><dl> <dt>code_block</dt> <dd><pre>with beam.Pipeline() as p:
    output = (p
        | "Read from PubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
        | "Convert bytes to Row" >> beam.ParDo(DecodeBytes())
        | "Enrichment" >> Enrichment(bigtable_handler)
        | "Run Inference" >> RunInference(model_handler)
    )</pre></dd> </dl></div> <div class="block-paragraph_advanced"><p><span style="vertical-align: baseline;">The above code runs a Dataflow job that reads from a Pub/Sub subscription and performs data enrichment by doing a key-value lookup against the Bigtable cluster. The enriched data is then fed to the machine learning model via RunInference. </span></p> <p><span style="vertical-align: baseline;">The images below illustrate how Dataflow and Bigtable work in harmony to scale correctly based on the load. When the job starts, the Dataflow runner starts with one worker while the Bigtable cluster has three nodes, with autoscaling enabled for both Dataflow and Bigtable. We observe a spike in the input load for Dataflow at around 5:21 PM, which leads it to scale to 40 workers.</span></p></div> <div class="block-image_full_width"> <div class="article-module h-c-page"> <div class="h-c-grid"> <figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 " > <img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_yJOWQ1b.max-1000x1000.png" alt="Dataflow autoscaling to 40 workers in response to the input load spike"> </figure> </div> </div> </div> <div class="block-paragraph_advanced"><p><span style="vertical-align: baseline;">This increases the number of reads to the Bigtable cluster. 
Bigtable automatically responds to the increased read traffic by scaling to 10 nodes to maintain the user-defined CPU utilization target.</span></p></div> <div class="block-image_full_width"> <div class="article-module h-c-page"> <div class="h-c-grid"> <figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 " > <img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_je4LBFv.max-1000x1000.png" alt="Bigtable autoscaling to 10 nodes to maintain the CPU utilization target"> </figure> </div> </div> </div> <div class="block-paragraph_advanced"><p><span style="vertical-align: baseline;">The events can then be used for inference, either with embedded models in the Dataflow worker or with </span><a href="https://cloud.google.com/vertex-ai"><span style="text-decoration: underline; vertical-align: baseline;">Vertex AI</span></a><span style="vertical-align: baseline;">. </span></p> <p><span style="vertical-align: baseline;">This Apache Beam transform can also be useful for applications that serve mixed batch and real-time workloads from the same Bigtable database, for example multi-tenant SaaS products and interdepartmental line-of-business applications. These workloads often take advantage of built-in Bigtable mechanisms to minimize the impact of different workloads on one another. 
Latency-sensitive requests can be run at </span><a href="https://cloud.google.com/bigtable/docs/request-priorities"><span style="text-decoration: underline; vertical-align: baseline;">high priority</span></a><span style="vertical-align: baseline;"> on a cluster that is simultaneously serving large </span><a href="https://cloud.google.com/bigtable/docs/writes#flow-control"><span style="text-decoration: underline; vertical-align: baseline;">batch</span></a><span style="vertical-align: baseline;"> requests with </span><a href="https://cloud.google.com/bigtable/docs/request-priorities"><span style="text-decoration: underline; vertical-align: baseline;">low priority</span></a><span style="vertical-align: baseline;"> and </span><a href="https://cloud.google.com/bigtable/docs/writes#flow-control"><span style="text-decoration: underline; vertical-align: baseline;">throttling</span></a><span style="vertical-align: baseline;"> requests, while also </span><a href="https://cloud.google.com/bigtable/docs/autoscaling"><span style="text-decoration: underline; vertical-align: baseline;">automatically scaling</span></a><span style="vertical-align: baseline;"> the cluster up or down depending on demand. These capabilities come in handy when using Dataflow with Bigtable, whether it’s to bulk-ingest large amounts of data over many hours or to process streams in real time.</span></p> <h3><span style="vertical-align: baseline;">Conclusion</span></h3> <p><span style="vertical-align: baseline;">With a few lines of code, we are able to build a production pipeline that stands in for many thousands of lines of code under the covers, allowing Pub/Sub, Dataflow, and Bigtable to seamlessly scale the system to meet your business needs! And as </span><span style="vertical-align: baseline;">machine learning models evolve over time, it will be even more advantageous to use a NoSQL database like Bigtable, which offers a flexible schema. 
</span><span style="vertical-align: baseline;">With the upcoming Beam 2.55.0 release, the enrichment transform will also support caching with Redis, which you can configure for your specific cache. To get started, </span><a href="https://cloud.google.com/dataflow/docs/guides/enrichment"><span style="text-decoration: underline; vertical-align: baseline;">visit the documentation page</span></a><span style="vertical-align: baseline;">.</span></p></div>
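The pipeline above references `bigtable_handler` and `model_handler` without showing their setup. To make the transform's per-event, client-side behavior concrete, here is a minimal, self-contained Python sketch of what enrichment amounts to: a keyed lookup (a plain dict stands in for Bigtable), cache-aside caching (standing in for the upcoming Redis support), and retries with exponential backoff. The class and its names are illustrative only, not Beam APIs.

```python
import random
import time


class EnrichmentSketch:
    """Illustrative stand-in for the enrichment transform's client-side work:
    keyed lookup, cache-aside caching, and retries with exponential backoff."""

    def __init__(self, store, cache=None, max_retries=4, base_delay=0.1):
        self.store = store                # callable key -> row dict (stands in for Bigtable)
        self.cache = {} if cache is None else cache  # stands in for a Redis cache
        self.max_retries = max_retries
        self.base_delay = base_delay

    def _lookup(self, key):
        # Cache-aside: serve from the cache when possible, otherwise hit the store.
        if key in self.cache:
            return self.cache[key]
        value = self.store(key)           # may raise on a transient failure
        self.cache[key] = value
        return value

    def enrich(self, event, key_field="user_id"):
        key = event[key_field]
        for attempt in range(self.max_retries):
            try:
                # Merge the stored context into the lightweight event payload.
                return {**event, **self._lookup(key)}
            except ConnectionError:
                # Exponential backoff with jitter before the next attempt.
                time.sleep(self.base_delay * (2 ** attempt) * random.random())
        raise RuntimeError(f"lookup failed for key {key!r}")


# A clickstream event gains the shopper's stored context before inference.
table = {"u1": {"cart_items": 3, "recent_txns": 7}}
enricher = EnrichmentSketch(store=table.__getitem__)
print(enricher.enrich({"user_id": "u1", "page": "/checkout"}))
# → {'user_id': 'u1', 'page': '/checkout', 'cart_items': 3, 'recent_txns': 7}
```

In the real transform, the merged row would flow straight into RunInference; the sketch only shows why a transient Bigtable error does not fail the event, and why repeated lookups for a hot key get cheaper once cached.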
Content
Author
Link
Published date
Image url
Feed url
Guid
Hidden blurb
--- !ruby/object:Feedjira::Parser::RSSEntry published: 2024-03-27 16:00:00.000000000 Z entry_id: !ruby/object:Feedjira::Parser::GloballyUniqueIdentifier guid: https://cloud.google.com/blog/products/data-analytics/enrich-streaming-data-in-bigtable-with-dataflow/ title: Enrich your streaming data using Bigtable and Dataflow categories: - Databases - Data Analytics carlessian_info: news_filer_version: 2 newspaper: Google Cloud Blog macro_region: Technology rss_fields: - title - url - summary - author - categories - published - entry_id url: https://cloud.google.com/blog/products/data-analytics/enrich-streaming-data-in-bigtable-with-dataflow/ author: Reza Rokni
Language
Active
Ricc internal notes
Imported via /Users/ricc/git/gemini-news-crawler/webapp/db/seeds.d/import-feedjira.rb on 2024-03-31 23:42:33 +0200. Content is EMPTY here. Entried: title,url,summary,author,categories,published,entry_id. TODO add Newspaper: filename = /Users/ricc/git/gemini-news-crawler/webapp/db/seeds.d/../../../crawler/out/feedjira/Technology/Google Cloud Blog/2024-03-27-Enrich_your_streaming_data_using_Bigtable_and_Dataflow-v2.yaml
Ricc source