"title"=>"Insights on Medium articles with GenAI and Ruby!",
"summary"=>nil,
"content"=>"
Automated insights on Medium articles with GenAI and Ruby!
A few months ago my manager asked me to ramp up on GenAI. That was one of the best work days of my life! Google offers a lot of toys to play with, and it’s hard not to have fun. I’m fortunate enough to have a job I really love! But enough with my dramatic style, since the Vertex AI PaLM API seems to think I’m a bit too Italian, look:
This article will illustrate how to crawl someone’s Medium page, and use GenAI to build a JSON file which has a mix of the crawled information and some new information which Vertex AI’s text-bison was able to infer for us.
The code
TL;DR If you want to delve into the code: https://github.com/palladius/genai-googlecloud-scripts/tree/main/03-ruby-medium-article-slurper
A Ruby crawler to find insights on Medium Articles
My script is simple: you give it a Medium handle (mine is “palladiusbonton”) and it will do four things:
- Parse the XML RSS feed from Medium (thanks for making it curl-able!) which provides a list of the latest (10?) articles by that person. [example]
- Extract some significant fields (title, body, date, keywords, ..) [example]
- Paste those at the end of a convoluted GenAI prompt.
- Call the Vertex AI Palm API to get an answer. [sample]
What am I trying to prove? I’m trying to use an LLM for a few things:
- Data Scraping. I can do this also in a deterministic way (with nokogiri).
- Classification. I’m really playing with this a lot. I ask the system to rate articles from 1 to 10, and to tell whether they’re Google Cloud articles or not. I also ask it to infer the author’s nationality and favorite languages.
- Summarization. I get a small summary of each article, which is super useful.
Two prompts: Text and JSON
This morning, my prompt looked like this:
Prompt = <<-END_OF_PROMPT
Provide a summary for each of the following articles.
* Please write about the topics, the style, and rate the article from 1 to 10 in terms of accuracy or professionalism.
* Please also tell me, for each article, whether it talks about Google Cloud.
* Can you guess the nationality of the person (or geographic context of the article itself) writing all of these articles?
* If you can find any typos or visible mistakes, please write them down.
--
\#{Medium content will be pasted here}
END_OF_PROMPT
My colleague Marc C showed me that GenAI can do better: it can write JSON for you, with very little guidance! It’s actually probably better than a human at closing parentheses and double quotes :)
So after 3–4 hours of tinkering I got to this version:
### PROMPT HISTORY
# 1.6 16nov23 Removed typos from articles.
# 1.5 16nov23 Added movie.
# 1.4 16nov23 Moved from TXT to JSON!
PromptInJson = <<-END_OF_PROMPT
You are an avid article reader and summarizer. I'm going to provide a list of articles for a single person and ask you to do this:
1. For each article, I'm going to ask a number of per-article questions
2. Overall, I'm going to ask questions about the author.
I'm going to provide a JSON structure for the questions I ask. If you don't know some answer, feel free to leave NULL/empty values.
1. Per-article:
* Please write about the topics, the style, and rate the article from 1 to 10 in terms of accuracy or professionalism.
* Please also tell me, for each article, whether it talks about Google Cloud.
* For each article, capture the original title and please produce a short 300-500-character summary.
* What existing movie or book would this article remind you the most of? Try a guess, use your fantasy.
2. Overall (author):
* Extract name and surname
* Can you guess the nationality of the person (or geographic context of the article itself) writing all of these articles?
* Please describe this author style. Is it professional or more personal? Terse or verbose? ..
* Does this author prefer a certain language? In which language are their code snippets (if any)?
* If you can find any typos or recurring mistakes in any article, please write them here.
Please provide the output in a `JSON` file as an array of answer per article, like this:
{
"prompt_version": "1.6a", // do NOT change this, take verbatim
"author_name": "", // name and surname of the author
"author_nationality": "", // nationality here
"author_style": "", // overall author style: is it professional or more personal? Terse or verbose? ..
"author_favorite_languages": "", // which languages does the author use? Pascal? C++? Python? Java? Use comma separated for the list.
"typos": [{ // array of mistakes or typos
"current": "xxx", // typo or mistake
"correct": "yyy", //
}],
"articles_feedback": [
// article 1
{
"title": "", // This should be the ORIGINAL article title, you should be able to extract it from the TITLE XML part, like "<title><![CDATA[What is toilet papers right side?]]></title>"
"summary": "...", // This should be the article summary produced by you.
"publication_date": "", // This should be provided to you in input
"accuracy": XXX, // Integer 1 to 10
"is_gcp": false, // boolean, true or false
"movie_or_book": "" // string, a book or film this article content reminds you of.
},
// Article 2, and so on..
]
}
Make **ABSOLUTELY SURE** the result is valid JSON or I'll have to drop the result.
Here are the articles:
--
END_OF_PROMPT
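Since the template ends with the `--` separator, the final prompt is simply the template followed by the scraped articles. Here is a minimal sketch of that assembly step. The helper name `build_prompt` and the 22,000-character default are assumptions for illustration (loosely based on the input reduction mentioned in the lessons further down), not the code from the repo.

```ruby
# Hypothetical helper: append the scraped articles to a prompt template,
# trimming the articles so the whole prompt stays within max_chars characters.
def build_prompt(template, articles_text, max_chars: 22_000)
  room = max_chars - template.length
  raise ArgumentError, 'template alone exceeds max_chars' if room <= 0

  template + articles_text[0, room]
end

# A shortened stand-in for the real template.
template = <<~END_OF_PROMPT
  Provide a summary for each of the following articles.
  Here are the articles:
  --
END_OF_PROMPT

# 50,000 characters of fake article text get cut down to fit the budget.
prompt = build_prompt(template, 'x' * 50_000, max_chars: 1_000)
puts prompt.length # prints 1000
```

Trimming by characters is crude (the model counts tokens, not characters), but it is cheap and keeps the instructions at the top of the prompt intact.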
I’ve been playing around with the structure a lot: what belongs to each single article, and what belongs to the global part (author identity and style)?
Hey, we want some action!
Sure, here you are! Let’s look step by step at the output from the Ruby crawler and comment on it:
Phase 1 — download the XML.
First, curl this XML: https://medium.com/feed/@palladiusbonton . The result looks like this:
XML looks like this, so far so good.
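If you prefer to stay in Ruby rather than shelling out to curl, the download can be sketched with the standard library alone. The helper names below are hypothetical, and the sketch assumes Medium answers the feed URL directly (it does not follow redirects).

```ruby
require 'net/http'
require 'uri'

# Build the RSS feed URL for a Medium handle (with or without the leading '@').
def medium_feed_url(handle)
  'https://medium.com/feed/@' + handle.delete_prefix('@')
end

# Download the feed and return the raw XML string.
# Raises if the server answers with anything other than a 2xx status.
def fetch_medium_feed(handle)
  response = Net::HTTP.get_response(URI(medium_feed_url(handle)))
  unless response.is_a?(Net::HTTPSuccess)
    raise 'Feed download failed with HTTP ' + response.code
  end

  response.body
end

# Usage (needs network access):
#   xml = fetch_medium_feed('palladiusbonton')
#   File.write('feed.xml', xml)
```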
Phase 2 — parse the XML with Nokogiri
For each item, I scrape the things I care about. I could probably have the LLM infer them automatically, but it’s little effort, so why not! It’s just about finding the fields you care about:
- iterate through item
- Per item, extract what you want: title , dc:creator, link , pubDate , ..
The heavy lifting is done by nokogiri, which is a real 💎 gem among Ruby gems; no, it’s not a repetition, it’s just a double gem. Pythonists and Javascriptists, I wish you had it in your language too 😜
# Code: https://github.com/palladius/genai-googlecloud-scripts/blob/main/03-ruby-medium-article-slurper/main.rb#L120-L141
File.open(genai_input_filename, 'w') do |file|
  # Assumptions: docSM is the Nokogiri document parsed from the feed, and
  # writeln is a helper that is not part of Ruby's stdlib (file.puts is the stdlib equivalent).
  ## Version 2: Scrape more important metadata
  docSM.xpath("//item").each_with_index do |node,ix| # Article
    file.writeln "\\n====== Article \#{ix+1} ====="
    title = node.xpath("title").inner_text
    creator = node.xpath("dc:creator").inner_text
    url = node.xpath("link").inner_text
    pubDate = node.xpath("pubDate").inner_text
    # There are many categories, eg: ["cycling", "google-forms", "data-studio", "pivot", "google-sheets"]
    categories = node.xpath("category").map{|c| c.inner_text}
    # Strip the HTML tags from the article body
    article_content = ActionView::Base.full_sanitizer.sanitize(node.inner_text)
    file.writeln "* Title: '\#{title}'"
    file.writeln "* Author: '\#{creator}'"
    file.writeln "* URL: '\#{url}'"
    file.writeln "* PublicationDate: '\#{pubDate}'"
    file.writeln "* Categories: \#{categories.join(', ')}"
    file.writeln ""
    file.writeln article_content
  end
end
Result of this part of the code looks like this:
====== Article 3 =====
* Title: 'Migrate GCP projects across organizations, the gcloud way'
* Author: 'Riccardo Carlesso'
* URL: 'https://medium.com/google-cloud/how-to-migrate-projects-across-organizations-c7e254ab90af?source=rss-b5293b96912f------2'
* PublicationDate: 'Tue, 18 Apr 2023 13:16:26 GMT'
* Categories: gcp-security-operations, google-cloud-platform, migration
[...]
Nel mezzo del cammin di nostra vita, mi ritrovai per una selva oscura, ché la diritta via era smarrita”— Dante Alighieri(*), Divine Comedy(*) the Italian version of Shakespeare,
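To try the field extraction without installing any gem, here is a self-contained sketch of the same idea. Note that it swaps Nokogiri for REXML (which ships with Ruby) and runs on a tiny hand-written feed, so both the sample XML and the helper names are assumptions, not the real Medium feed or the repo code.

```ruby
require 'rexml/document'

# Hand-written sample mimicking one <item> of the feed (assumed structure).
SAMPLE_FEED = <<~XML
  <rss xmlns:dc='http://purl.org/dc/elements/1.1/' version='2.0'>
    <channel>
      <item>
        <title><![CDATA[Migrate GCP projects across organizations, the gcloud way]]></title>
        <dc:creator><![CDATA[Riccardo Carlesso]]></dc:creator>
        <link>https://medium.com/google-cloud/how-to-migrate-projects-across-organizations-c7e254ab90af?source=rss-b5293b96912f------2</link>
        <pubDate>Tue, 18 Apr 2023 13:16:26 GMT</pubDate>
        <category><![CDATA[gcp-security-operations]]></category>
        <category><![CDATA[google-cloud-platform]]></category>
        <category><![CDATA[migration]]></category>
      </item>
    </channel>
  </rss>
XML

# Return the text of every direct child of item whose qualified name matches.
def child_texts(item, qualified_name)
  item.elements.to_a.select { |el| el.expanded_name == qualified_name }.map(&:text)
end

# Turn the feed XML into an array of hashes, one per article.
def extract_articles(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//item').map do |item|
    {
      title: child_texts(item, 'title').first,
      author: child_texts(item, 'dc:creator').first,
      url: child_texts(item, 'link').first,
      publication_date: child_texts(item, 'pubDate').first,
      categories: child_texts(item, 'category')
    }
  end
end

articles = extract_articles(SAMPLE_FEED)
puts articles.first[:title]
puts articles.first[:categories].join(', ')
```

Returning hashes instead of writing straight to a file makes it easy to reuse the same data later, for instance to cross-check the titles the LLM hands back.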
Phase 3 — Get LLM to infer information and spark some humour
If this were a deterministic program we could just end here: I have all the info, I’m basically translating from XML to JSON. Not hugely useful.
I’ll use GenAI to add some color here:
- Infer author nationality, favorite languages and writing style.
- Per article, write a small summary, infer accuracy and if it’s a GCP article or not. There you go, we have a rudimentary QA tester and a classifier! wOOOt! 😮
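For reference, here is a sketch of what the request body for such a call might look like. The shape (an `instances` array carrying the prompt, plus a `parameters` object) follows the Vertex AI predict REST interface for text-bison as I understand it; the helper name and the `maxOutputTokens` default are assumptions, and authentication plus the HTTP call itself are left out.

```ruby
require 'json'

# Hypothetical helper: build the JSON body for a text-bison predict call.
# temperature 0.1 is the starting value mentioned in this article.
def palm_request_body(prompt, temperature: 0.1, max_output_tokens: 1024)
  {
    instances: [{ prompt: prompt }],
    parameters: {
      temperature: temperature,
      maxOutputTokens: max_output_tokens # assumed default, tune to taste
    }
  }
end

body = palm_request_body('Summarize these articles...', temperature: 0.3)
puts JSON.pretty_generate(body)
```

Keeping the temperature as an explicit argument makes it easy to rerun the same prompt with different values and compare the answers.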
For my buddy Romin I get this:
{ "prompt_version": "1.6b",
"author_name": "Romin Irani",
"author_nationality": null,
"author_style": "verbose, professional",
"author_favorite_languages": "Python, Java",
"typos": [
{
"current": "criterias",
"correct": "criteria"
}
],
"articles_feedback": [
{
"title": "Integrating langchain4j and PaLM 2 Chat Bison Model",
"summary": "The article describes how to integrate Langchain4j and PaLM 2 Chat Bison Model. It provides detailed instructions on how to set up the environment, create a Google Cloud Function, and deploy the model. The article also includes a code sample and a link to the GitHub repository.",
"url": "https://medium.com/google-cloud/integrating-langchain4j-and-palm-2-chat-bison-model-a684cefd67af?source=rss-802a4d428d95------2",
"publication_date": "Mon, 06 Nov 2023 11:01:02 GMT",
"accuracy": 9,
"is_gcp": true,
"movie_or_book": null
},
{
"title": "Google Cloud Platform Technology Nuggets \\u2014 October 15\\u201331, 2023 Edition",
"summary": "The article provides a summary of Google Cloud Platform Technology Nuggets for the period of October 15-31, 2023. It covers topics such as infrastructure, containers and Kubernetes, identity and security, networking, machine learning, storage, databases and data analytics, and developers and practitioners.",
"url": "https://medium.com/google-cloud/google-cloud-platform-technology-nuggets-october-15-31-2023-edition-4d5ea0689e30?source=rss-802a4d428d95------2",
"publication_date": "Tue, 31 Oct 2023 10:22:07 GMT",
"accuracy": 8,
"is_gcp": true,
"movie_or_book": null
},
{
"title": "Develop a FlutterFlow App powered by Vertex AI PaLM 2 Integration",
"summary": "The article describes how to integrate FlutterFlow with Vertex AI PaLM 2. It provides a step-by-step guide on how to set up the environment, create a Google Cloud Function, and deploy the model. The article also includes a code sample and a link to the GitHub repository.",
"url": "https://medium.com/google-cloud/flutterflow-and-vertex-ai-palm-2-integration-14c137e83053?source=rss-802a4d428d95------2",
"publication_date": "Thu, 26 Oct 2023 09:37:11 GMT",
"accuracy": 8,
"is_gcp": true,
"movie_or_book": null
}
]
}
For my buddy Guillaume I get this:
{ "prompt_version": "1.6b",
"author_name": "Guillaume Laforge",
"author_nationality": "French",
"author_style": "Guillaume Laforge's writing style can be described as professional, informative, and engaging. He often writes about technology, open-source software, and programming.",
"author_favorite_languages": "Java, Python",
"articles_feedback": [
{
"title": "Tech Watch #4 \\u2014 October, 27, 2023",
"summary": "The article provides a summary of the latest developments in the field of artificial intelligence (AI). It covers topics such as the use of LLMs in vector embeddings, the scheduling of PostgreSQL tasks with pg_cron, and the creation of maps with Protomaps. The article also includes links to the original sources for further reading.",
"url": "https://glaforge.medium.com/tech-watch-4-october-27-2023-d48a1449eeb0?source=rss-431147437aeb------2",
"publication_date": "Fri, 27 Oct 2023 15:04:58 GMT",
"accuracy": 8,
"is_gcp": false,
"movie_or_book": "The Matrix"
},
{
"title": "Tech Watch #3 \\u2014 October, 20, 2023",
"summary": "The article provides a summary of the latest developments in the field of technology. It covers topics such as the use of Groovy in 2023, the state of WebAssembly in 2023, and the use of large language models (LLMs) to solve logic problems. The article also includes links to the original sources for further reading.",
"url": "https://glaforge.medium.com/tech-watch-3-october-20-2023-11a70245017d?source=rss-431147437aeb------2",
"publication_date": "Fri, 20 Oct 2023 19:49:34 GMT",
"accuracy": 9,
"is_gcp": false,
"movie_or_book": "2001: A Space Odyssey"
},
{
"title": "Client-side consumption of a rate-limited API in Java",
"summary": "The article discusses different approaches for consuming rate-limited APIs in Java. It covers topics such as using exponential backoff and jitter, scheduled execution, and using the Bucket4J library. The article also includes code examples for each approach.",
"url": "https://glaforge.medium.com/client-side-consumption-of-a-rate-limited-api-in-java-9fbf08673791?source=rss-431147437aeb------2",
"publication_date": "Mon, 02 Oct 2023 00:00:53 GMT",
"accuracy": 10,
"is_gcp": false,
"movie_or_book": "The Imitation Game"
},
{
"title": "Tech Watch #1 \\u2014 Sept 29, 2023",
"summary": "The article provides a summary of the latest developments in the field of technology. It covers topics such as observability-driven development for LLMs, rebuilding LLM documentation chatbots, container security, and the future of databases. The article also includes links to the original sources for further reading.",
"url": "https://glaforge.medium.com/tech-watch-1-sept-29-2023-2ac0a3a5016c?source=rss-431147437aeb------2",
"publication_date": "Fri, 29 Sep 2023 00:00:11 GMT",
"accuracy": 7,
"is_gcp": false,
"movie_or_book": "Minority Report"
}
]
}
For myself I get something like this:
{ "prompt_version": "1.6b",
"author_name": "Riccardo Carlesso",
"author_nationality": "Italian",
"author_style": "Verbose, uses personal anecdotes.",
"author_favorite_languages": "Ruby",
"articles_feedback": [
{
"title": "What is toilet paper\\u2019s right side?",
"summary": "The author discusses the question of which side of toilet paper is the \\u201cright\\u201d side. They explore the different opinions on this topic and provide their own thoughts on the matter. Ultimately, they conclude that there is no definitive answer to this question and that it is up to each individual to decide which side they prefer.",
"url": "https://medium.com/@palladiusbonton/what-is-toilet-papers-right-side-8da0504d6d0b?source=rss-b5293b96912f------2",
"publication_date": "Tue, 08 Aug 2023 16:37:20 GMT",
"accuracy": 8,
"is_gcp": false,
"movie_or_book": null
},
{
"title": "Spaghetti Bolognese don\\u2019t exist!!1!",
"summary": "The author discusses the common misconception that spaghetti Bolognese is an Italian dish. They explain that this dish is actually not from Italy and that it is not considered to be a traditional Italian dish. They also provide some tips on how to make a more authentic Italian pasta dish.",
"url": "https://medium.com/@palladiusbonton/spaghetti-bolognese-dont-exist-1-2088d85909dd?source=rss-b5293b96912f------2",
"publication_date": "Fri, 21 Apr 2023 16:09:23 GMT",
"accuracy": 9,
"is_gcp": false,
"movie_or_book": null
},
{
"title": "Migrate GCP projects across organizations, the gcloud way",
"summary": "The author provides a detailed guide on how to migrate GCP projects across organizations using the gcloud command-line tool. They cover everything from setting up the necessary permissions to executing the migration. This article is a valuable resource for anyone who needs to migrate GCP projects across organizations.",
"url": "https://medium.com/google-cloud/how-to-migrate-projects-across-organizations-c7e254ab90af?source=rss-b5293b96912f------2",
"publication_date": "Tue, 18 Apr 2023 13:16:26 GMT",
"accuracy": 10,
"is_gcp": true,
"movie_or_book": null
}]}
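The prompt threatens to drop any answer that is not valid JSON, so the receiving side needs a check that does exactly that. Below is a minimal sketch using Ruby's standard JSON library; the helper name and the list of required keys are assumptions chosen to match the outputs shown above.

```ruby
require 'json'

# Keys every answer is expected to carry (assumed from the samples above).
REQUIRED_KEYS = %w[prompt_version author_name articles_feedback].freeze

# Hypothetical helper: return the parsed Hash, or nil when the answer must be dropped.
def parse_llm_answer(raw)
  # Keep only what sits between the first '{' and the last '}', in case the
  # model wrapped the JSON in extra text.
  first = raw.index('{')
  last = raw.rindex('}')
  return nil if first.nil? || last.nil? || last < first

  data = JSON.parse(raw[first..last])
  return nil unless data.is_a?(Hash)
  return nil unless REQUIRED_KEYS.all? { |key| data.key?(key) }
  return nil unless data['articles_feedback'].is_a?(Array)

  data
rescue JSON::ParserError
  nil
end

good = JSON.generate({ 'prompt_version' => '1.6b', 'author_name' => 'Romin Irani', 'articles_feedback' => [] })
p parse_llm_answer('Sure! Here is the JSON: ' + good).nil? # prints false
p parse_llm_answer('{ not json at all').nil?               # prints true
```

A nil result is the signal to retry the call or skip that author, which fits nicely with the idea of iterating until the JSON is good enough.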
Lessons learnt
Today I learnt a few things:
- For the first time the token limitation was visible to me. The PaLM API’s text-bison model has a 32k-token limit, and what I didn’t know is that it seems to be shared between input and output. If I increase the input size, this diminishes the output size (still to be confirmed, for the moment it’s just a hunch). For this reason I reduced my input from 32k to 22k. To see the token count, Google gives you a nice API to calculate it (thanks Guillaume). You can see this very well from the JSON returned by the API (note that the sum of the two totalTokens values here, 7643 + 549, is exactly the maximum, 8192):
"metadata": {
"tokenMetadata": {
"outputTokenCount": {
"totalTokens": 549,
"totalBillableCharacters": 1312
},
"inputTokenCount": {
"totalBillableCharacters": 20713,
"totalTokens": 7643
}
}
}
- Prompting is a (long) fine-tuning feedback loop: you try something out, and after a few answers you realize it doesn’t work, so you try to ‘bribe’ your model by saying “please do something as it’s very important to you”. Example: the movie or book was always empty; it’s probably a stretch for a 0.1 temperature API invocation. So I changed “What existing movie or book would this article remind you the most of? Try a guess, use your fantasy” by adding “Please do NOT leave this null! It’s just for fun. yet its very important to me”. Note that this doesn’t fix it: the output gets “None” instead of null, which is fun. But read on…
- Temperature is an important parameter. When tasked to infer a movie title for my articles, the model would refuse, until I raised the temperature from 0.1 to 0.3. Now I get a curious result: my films are finally there! Wait, Ratatouille, seriously?
{
"prompt_version": "1.6b",
"author_name": "Riccardo Carlesso",
"author_nationality": "Italian",
"author_style": "Verbose, uses humor and personal anecdotes. Seems to prefer Ruby on Rails.",
"author_favorite_languages": "Ruby",
"typos": [
{
"current": "cis-centralis /pendens",
"correct": "cis-centralis/pendens"
},
{
"current": "trans-centralis/mur\\u00e0lis",
"correct": "trans-centralis/muralis"
},
{
"current": "spaghetti Bolognese don\\u2019t",
"correct": "Spaghetti Bolognese doesn't"
}
],
"articles_feedback": [
{
"title": "What is toilet paper\\u2019s right side?",
"summary": "The author discusses the great \\"toilet paper orientation debate\\" and shares their own experiences and opinions on the matter, ultimately concluding that there is no one right answer.",
"url": "https://medium.com/@palladiusbonton/what-is-toilet-papers-right-side-8da0504d6d0b?source=rss-b5293b96912f------2",
"publication_date": "Tue, 08 Aug 2023 16:37:20 GMT",
"accuracy": 8,
"is_gcp": false,
"movie_or_book": "The Big Lebowski"
},
{
"title": "Spaghetti Bolognese don\\u2019t exist!!1!",
"summary": "The author argues that the popular dish \\"Spaghetti Bolognese\\" does not exist in Italy and is considered an \\"imaginary dish\\" by Italians. They explain that the traditional Italian dish is called \\"rag\\u00f9 alla bolognese\\" and is typically served with tagliatelle or other types of pasta, not spaghetti.",
"url": "https://medium.com/@palladiusbonton/spaghetti-bolognese-dont-exist-1-2088d85909dd?source=rss-b5293b96912f------2",
"publication_date": "Fri, 21 Apr 2023 16:09:23 GMT",
"accuracy": 9,
"is_gcp": false,
"movie_or_book": "Ratatouille"
},
{
"title": "Migrate GCP projects across organizations, the gcloud way",
"summary": "The author provides a detailed guide on how to migrate GCP projects across organizations using the gcloud command-line tool. They cover topics such as identifying the current state of the projects, managing IAM permissions, and handling special cases.",
"url": "https://medium.com/google-cloud/how-to-migrate-projects-across-organizations-c7e254ab90af?source=rss-b5293b96912f------2",
"publication_date": "Tue, 18 Apr 2023 13:16:26 GMT",
"accuracy": 10,
"is_gcp": true,
"movie_or_book": "The Matrix"
}
]
}
- Finally, asking an LLM to create JSON or YAML can really speed up your development time: you can create fixtures for your blog or app, or you can just use a computer to further process an imperfect, half-processed entity.
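Going back to the first lesson, the token accounting is easy to verify in code. This sketch takes the tokenMetadata values quoted above, sums input and output tokens, and also derives a rough characters-per-token ratio, which can help when sizing the input. The helper name is hypothetical.

```ruby
MAX_TOTAL_TOKENS = 8192 # the maximum quoted in the first lesson

# Values copied from the tokenMetadata sample above.
token_metadata = {
  'outputTokenCount' => { 'totalTokens' => 549, 'totalBillableCharacters' => 1312 },
  'inputTokenCount' => { 'totalTokens' => 7643, 'totalBillableCharacters' => 20_713 }
}

# Sum input and output tokens from a tokenMetadata Hash.
def total_tokens(token_metadata)
  token_metadata.fetch('inputTokenCount').fetch('totalTokens') +
    token_metadata.fetch('outputTokenCount').fetch('totalTokens')
end

used = total_tokens(token_metadata)
puts used                    # prints 8192
puts MAX_TOTAL_TOKENS - used # prints 0: nothing left for a longer answer

# Rough input density: billable characters per input token.
input = token_metadata.fetch('inputTokenCount')
ratio = input.fetch('totalBillableCharacters').to_f / input.fetch('totalTokens')
puts ratio.round(2)          # prints 2.71
```

With zero tokens left, the answer was most likely cut short by the limit, which would explain why a smaller input leaves more room for output.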
Conclusions
LLMs are a really powerful tool to read large quantities of text, summarize them and classify them based on your tastes.
They can provide structured output (eg JSON) which you can in turn parse and use to populate a DB and an app. This way, a recommendation engine for your favourite articles (eg GCP articles, sorted by date or accuracy) becomes easy and fun to build!
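As a taste of how little code such a recommendation engine needs, here is a sketch that filters and ranks a few of the article entries shown earlier. The `recommend` helper is hypothetical; it keeps the Google Cloud articles and sorts them by accuracy, breaking ties by date.

```ruby
require 'time'

# A few entries in the same shape as the articles_feedback items above.
articles = [
  { 'title' => 'Develop a FlutterFlow App powered by Vertex AI PaLM 2 Integration',
    'publication_date' => 'Thu, 26 Oct 2023 09:37:11 GMT', 'accuracy' => 8, 'is_gcp' => true },
  { 'title' => 'Client-side consumption of a rate-limited API in Java',
    'publication_date' => 'Mon, 02 Oct 2023 00:00:53 GMT', 'accuracy' => 10, 'is_gcp' => false },
  { 'title' => 'Migrate GCP projects across organizations, the gcloud way',
    'publication_date' => 'Tue, 18 Apr 2023 13:16:26 GMT', 'accuracy' => 10, 'is_gcp' => true },
  { 'title' => 'Integrating langchain4j and PaLM 2 Chat Bison Model',
    'publication_date' => 'Mon, 06 Nov 2023 11:01:02 GMT', 'accuracy' => 9, 'is_gcp' => true }
]

# Keep only Google Cloud articles, best accuracy first, newest first on ties.
def recommend(articles)
  articles
    .select { |a| a['is_gcp'] }
    .sort_by { |a| [-a['accuracy'], -Time.rfc2822(a['publication_date']).to_i] }
end

recommend(articles).each { |a| puts a['title'] }
# prints:
#   Migrate GCP projects across organizations, the gcloud way
#   Integrating langchain4j and PaLM 2 Chat Bison Model
#   Develop a FlutterFlow App powered by Vertex AI PaLM 2 Integration
```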
Next Steps
How could I extend this project?
- Add a workflow, possibly with Cloud Workflows. Iterate until I’m happy with the quality of the outputted JSON.
- Use pre-vetted JSON to populate an easy-peasy Node.js app, and run it on Cloud Run.
- Change the code to create the “Morning list of articles for Riccardo to read”, by pulling A LOT of articles and doing queries by keyword (in this demo it’s ‘GCP’ but it could also be ‘Pistacchio’ or ‘Politics’).
- sed s/keyword/embeddings/g to make it able to do semantic search.
Insights on Medium articles with GenAI and Ruby! was originally published in DevOps.dev on Medium, where people are continuing the conversation by highlighting and responding to this story.
","author"=>"Riccardo Carlesso",
"link"=>"https://blog.devops.dev/parse-medium-articles-with-genai-and-add-some-fun-02fe9d30475a?source=rss-b5293b96912f------2",
"published_date"=>Thu, 16 Nov 2023 13:54:40.000000000 UTC +00:00,
"image_url"=>nil,
"feed_url"=>"https://blog.devops.dev/parse-medium-articles-with-genai-and-add-some-fun-02fe9d30475a?source=rss-b5293b96912f------2",
"language"=>nil,
"active"=>true,
"ricc_source"=>"feedjira::v1",
"created_at"=>Sun, 31 Mar 2024 21:41:16.593983000 UTC +00:00,
"updated_at"=>Mon, 13 May 2024 18:38:17.379647000 UTC +00:00,
"newspaper"=>"Riccardo Carlesso - Medium",
"macro_region"=>"Blogs"}