Composing poésie concrète with AWS Step Functions
Concept
Observing signs of russo-Ukrainian war fatigue, I decided to come up with a poem called “Is the war over?”. I’ve labeled it poésie concrète, drawing an analogy with musique concrète, a genre where music is composed of non-musical pieces of sound.
In the same way, my poem is composed of search engine results for “russian war crimes” from the past week, which are then processed by a sentiment analysis model to extract the sentences that highlight russian atrocities most clearly. After the ML processing, the sentences are assembled into the poem. Each Sunday the poem is regenerated from newly reported war crimes.
The war will be over on the day when the search engine returns no results and the poem is blank.
One may argue that the search engine might return results long after hostilities end. This is exactly the point. Many people in Ukraine will have to live with the aftermath of war for their entire lives. Consider war veterans, traumatized children, and families who lost their loved ones.
You may access the page that leads to the poem here.
High-level architecture
I’ve decided to go serverless, since it lets me pay per execution and the execution count is low for this project. While my experience is mostly with the .NET stack, for this project I’ve decided to go with JavaScript since its web-related capabilities exceed those of any other language I’m familiar with.
The high-level architecture diagram looks as below.
The entire process is launched by EventBridge Scheduler, which starts a chain of Lambdas, each with its own responsibility:
- Crawl the search engine
- Extract article content from a web page
- Analyze the sentiment of each sentence inside the article
- Assemble the poem from the sentences with the strongest sentiment and put it in an S3 bucket that is served to the client via CloudFront
Since the source code is stored on GitHub, I decided to deploy it via GitHub Actions. In the rest of the article we’ll focus on some points of interest found in the code.
Crawling the search engine
The algorithm behind Google Crawler service is:
1. Make a request to https://www.google.com/search?q=russian+war+crimes&tbs=qdr:w
2. Traverse the HTML page for links to web pages.
When I approached this task, I was under the impression that I’d use the document API to traverse the HTML. However, that API relies on a browser, which is not available in Lambda. JSDom came to my rescue.
With its help, extracting the necessary values is as simple as:
const dom = new jsdom.JSDOM(data);
const anchors = dom.window.document.querySelectorAll('a[data-ved]');
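The two steps above can be sketched without any dependencies, assuming Node.js 18+ and its built-in URL class. The helper names buildSearchUrl, toTargetUrl and extractLinks are hypothetical and not taken from the project. The sketch also assumes that result links are either absolute or wrapped as /url?q=<target>; Google's real markup may differ and changes over time.

```javascript
// Hypothetical helpers; not taken from the project's source code.

// Builds the search URL from step 1. `tbs=qdr:w` limits results to the past week.
function buildSearchUrl(query) {
  const url = new URL('https://www.google.com/search');
  url.searchParams.set('q', query);
  url.searchParams.set('tbs', 'qdr:w');
  return url.toString();
}

// Turns an anchor's href into an absolute article URL, or null if it is not one.
// Assumption: result links are either absolute or wrapped as "/url?q=<target>".
function toTargetUrl(href) {
  let url;
  try {
    url = new URL(href, 'https://www.google.com');
  } catch {
    return null; // malformed href
  }
  if (url.hostname === 'www.google.com' && url.pathname === '/url') {
    const target = url.searchParams.get('q');
    return target && target.startsWith('http') ? target : null;
  }
  // Skip links that stay on Google itself (navigation, settings and so on).
  return url.hostname.endsWith('google.com') ? null : url.toString();
}

// Maps raw hrefs to article URLs, dropping nulls and duplicates while keeping order.
function extractLinks(hrefs) {
  return [...new Set(hrefs.map(toTargetUrl).filter(Boolean))];
}
```

With the anchors collected above, the hrefs could be passed in as extractLinks([...anchors].map(a => a.getAttribute('href') ?? '')).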
First deploy
I decided to design the project for deployment from the start so the next step was to introduce continuous build on GitHub.
name: Google Crawler Build
on:
  push:
    branches: [ "*" ]
  pull_request:
    branches: [ "master" ]
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [20.x]
        # See supported Node.js release schedule at https://nodejs.org/en/about/releases/
    steps:
    - uses: actions/checkout@v3
    - name: Use Node.js ${{ matrix.node-version }}
      uses: actions/setup-node@v3
      with:
        node-version: ${{ matrix.node-version }}
        cache: 'npm'
        cache-dependency-path: "./src/google-crawler/package-lock.json"
    - run: cd ./src/google-crawler && npm ci
    - run: cd ./src/google-crawler && npm test
    - run: cd ./src/google-crawler && npm run lint
I think the code is pretty self-explanatory; still, let’s look through some points.
Here we rely on the ubuntu-latest environment and node-version: [20.x].
First of all, we check out the source code with actions/checkout@v3. For npm caching to work in a monorepo, we have to point setup-node at the package-lock.json file with cache-dependency-path: "./src/google-crawler/package-lock.json". Apart from restoring packages with npm ci, we also run unit tests and the linter, which are necessary quality gates for our codebase.
The deployment looks as follows:
name: Google Crawler Deploy
on:
  push:
    branches: [ "master" ]
jobs:
  lambda:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [20.x]
        # See supported Node.js release schedule at https://nodejs.org/en/about/releases/
    steps:
    - uses: actions/checkout@v3
    - name: Use Node.js ${{ matrix.node-version }}
      uses: actions/setup-node@v3
      with:
        node-version: ${{ matrix.node-version }}
        cache: 'npm'
        cache-dependency-path: "./src/google-crawler/package-lock.json"
    - uses: aws-actions/configure-aws-credentials@v2
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: eu-central-1
    - run: cd ./src/google-crawler && npm ci
    - run: cd ./src/google-crawler && zip -r lambda1.zip ./
    - run: cd ./src/google-crawler && aws lambda update-function-code --function-name=google-crawler --zip-file=fileb://lambda1.zip
It looks pretty similar to the build job; however, we also zip the code and deploy it via the aws lambda update-function-code command.
Extracting article content
To extract article content from the web page, I used the Readability package. Here’s how I download the page and split the article content into sentences:
const res = await fetch(url);
const html = await res.text();
const doc = new jsdom.JSDOM(html);
const reader = new readability.Readability(doc.window.document);
const article = reader.parse();
const sentences = splitIntoSentences(article.textContent);
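The snippet above calls splitIntoSentences without showing it. Below is a hypothetical stand-in, assuming English prose and a simple punctuation-based rule; the project's real helper may work differently. It splits wrongly on abbreviations such as "Mr. Smith", so treat it as a sketch rather than a robust tokenizer.

```javascript
// Hypothetical stand-in for splitIntoSentences; the project's real helper is not shown.
function splitIntoSentences(text) {
  return text
    .replace(/\s+/g, ' ')               // collapse newlines and repeated spaces
    .split(/(?<=[.!?])\s+(?=[A-Z"“])/)  // break after ., ! or ? when a capital or a quote follows
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}
```

Readability's parse() can return null when it finds no article content, so a guard such as article ? splitIntoSentences(article.textContent) : [] may be worth adding before the call.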
Calling one lambda from another
There is plenty of advice on the internet on how to synchronously call one Lambda from another. The AWS documentation, however, discourages this, and for a good reason:
While this synchronous flow may work within a single application on a server, it introduces several avoidable problems in a distributed serverless architecture:
Cost: with Lambda, you pay for the duration of an invocation. In this example, while the Create invoice functions runs, two other functions are also running in a wait state, shown in red on the diagram.
Error handling: in nested invocations, error handling can become much more complex. Either errors are thrown to parent functions to handle at the top-level function, or functions require custom handling. For example, an error in Create invoice might require the Process payment function to reverse the charge, or it may instead retry the Create invoice process.
Tight coupling: processing a payment typically takes longer than creating an invoice. In this model, the availability of the entire workflow is limited by the slowest function.
Scaling: the concurrency of all three functions must be equal. In a busy system, this uses more concurrency than would otherwise be needed.
One of the alternatives is to use AWS Step Functions to orchestrate the execution of the Lambdas. And it turns out that my problem is a nice example of a distributed map. Consider:
- We extract all necessary links.
- We map each link in parallel by extracting its content and analyzing its sentiment.
- We reduce the results into a single S3 bucket.
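To make the map phase more tangible, here is a local sketch of the same idea in plain JavaScript: a worker runs over every link with a cap on how many calls are in flight, and a failing item does not stop the others. The name mapWithConcurrency is hypothetical; this only mirrors the shape of the workflow and is not how Step Functions is implemented.

```javascript
// Runs `worker` over `items` with at most `maxConcurrency` calls in flight,
// similar in spirit to a Map state with "MaxConcurrency".
async function mapWithConcurrency(items, maxConcurrency, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane keeps pulling the next unprocessed item until none are left.
  async function lane() {
    while (next < items.length) {
      const index = next++; // no await between check and increment, so no race
      try {
        results[index] = await worker(items[index]);
      } catch (error) {
        results[index] = { error: String(error) }; // record the failure and keep going
      }
    }
  }

  const laneCount = Math.min(maxConcurrency, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results; // same order as `items`
}
```

A reduce step would then filter out the entries carrying an error and merge the remaining results into one poem.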
Here’s the definition of the entire system.
{
  "Comment": "A Step Functions workflow that processes an array of strings concurrently",
  "StartAt": "Extract links from google",
  "States": {
    "Extract links from google": {
      "Type": "Task",
      "Resource": "<google crawler arn>",
      "ResultPath": "$",
      "Next": "ProcessArray",
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2
        }
      ]
    },
    "ProcessArray": {
      "Type": "Map",
      "ItemsPath": "$",
      "MaxConcurrency": 10,
      "Iterator": {
        "StartAt": "Extract article content",
        "States": {
          "Extract article content": {
            "Type": "Task",
            "Resource": "<article extractor arn>",
            "InputPath": "$",
            "Next": "Analyze sentiment",
            "Retry": [
              {
                "ErrorEquals": [
                  "States.ALL"
                ],
                "IntervalSeconds": 1,
                "MaxAttempts": 2,
                "BackoffRate": 2
              }
            ],
            "Catch": [
              {
                "ErrorEquals": [
                  "States.ALL"
                ],
                "Next": "Analyze sentiment"
              }
            ]
          },
          "Analyze sentiment": {
            "Type": "Task",
            "Resource": "<sentiment analyzer arn>",
            "InputPath": "$",
            "End": true,
            "Retry": [
              {
                "ErrorEquals": [
                  "States.ALL"
                ],
                "IntervalSeconds": 1,
                "MaxAttempts": 2,
                "BackoffRate": 2
              }
            ]
          }
        }
      },
      "Next": "Reducer",
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "Reducer"
        }
      ]
    },
    "Reducer": {
      "Type": "Task",
      "Resource": "<reducer arn>",
      "InputPath": "$",
      "ResultPath": "$",
      "End": true,
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2
        }
      ]
    }
  }
}
The map phase is declared with "Type": "Map", and both the article extractor and the sentiment analyzer serve as its Iterator. Once the map phase is done, we enter the reduce phase via "Next": "Reducer".
Another thing worth mentioning is increasing the reliability of our system by adding error handling. The most obvious way is adding retries via
"Retry": [
  {
    "ErrorEquals": [
      "States.ALL"
    ],
    "IntervalSeconds": 1,
    "MaxAttempts": 2,
    "BackoffRate": 2
  }
]
When the retries are exhausted, a Catch block routes the execution to the next state instead of failing the whole workflow:
"Catch": [
  {
    "ErrorEquals": [
      "States.ALL"
    ],
    "Next": "Analyze sentiment"
  }
]
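To see what the Retry settings mean in practice, the small sketch below computes how long the workflow waits before each retry: the first wait is IntervalSeconds, and every following wait is multiplied by BackoffRate. The function name retryDelays is hypothetical, and the sketch ignores the optional MaxDelaySeconds and JitterStrategy fields.

```javascript
// Seconds waited before each retry for a Step Functions "Retry" block.
// Defaults follow the Amazon States Language: 1 second, 3 attempts, rate 2.0.
function retryDelays({ IntervalSeconds = 1, MaxAttempts = 3, BackoffRate = 2.0 } = {}) {
  const delays = [];
  for (let attempt = 0; attempt < MaxAttempts; attempt++) {
    delays.push(IntervalSeconds * BackoffRate ** attempt);
  }
  return delays;
}

// The configuration used throughout this workflow: two retries, after 1 and 2 seconds.
console.log(retryDelays({ IntervalSeconds: 1, MaxAttempts: 2, BackoffRate: 2 })); // [ 1, 2 ]
```

So with the values above, a failing task is attempted three times in total before the Catch block, if present, takes over.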
Using layers to optimize monorepo structure
At this point, the structure of our repository looks suboptimal, with a package.json and a separate build step for each function. What’s more, a separate package.json means a separate node_modules folder, which wastes a lot of disk space since many modules are duplicated.
This won’t scale once we add more functions. There is, however, a way to build and package all the dependencies at once using Lambda layers. This approach lets us package all the dependencies into a separate layer and treat it as a common runtime for our functions.
We’ll reorganize our repository to look like this:
Let’s have a look at a separate action that deploys the layer:
name: Deploy Modules Layer
on:
  workflow_call:
    secrets:
      AWS_ACCESS_KEY_ID:
        required: true
      AWS_SECRET_ACCESS_KEY:
        required: true
jobs:
  layer:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [20.x]
        # See supported Node.js release schedule at https://nodejs.org/en/about/releases/
    steps:
    - uses: actions/checkout@v3
    - name: Use Node.js ${{ matrix.node-version }}
      uses: actions/setup-node@v3
      with:
        node-version: ${{ matrix.node-version }}
        cache: 'npm'
        cache-dependency-path: "./src/package-lock.json"
    - uses: aws-actions/configure-aws-credentials@v2
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: eu-central-1
    - run: cd ./src && npm ci
    - run: cd ./src && zip -r layer.zip node_modules
    - run: cd ./src && aws lambda publish-layer-version --layer-name poeme-concrete-modules --zip-file fileb://layer.zip
The layer itself is published with aws lambda publish-layer-version. Now let’s jump to consuming the deployed layer when we deploy our functions.
name: Article Extractor Deploy
on:
  push:
    branches: [ "master" ]
jobs:
  layer:
    uses: ./.github/workflows/modules-layer-deploy.yml
    secrets: inherit
  lambda:
    runs-on: ubuntu-latest
    needs: layer
    strategy:
      matrix:
        node-version: [20.x]
        # See supported Node.js release schedule at https://nodejs.org/en/about/releases/
    steps:
    - uses: actions/checkout@v3
    - name: Use Node.js ${{ matrix.node-version }}
      uses: actions/setup-node@v3
      with:
        node-version: ${{ matrix.node-version }}
        cache: 'npm'
        cache-dependency-path: "./src/package-lock.json"
    - uses: aws-actions/configure-aws-credentials@v2
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: eu-central-1
    - run: cd ./src && npm ci
    - run: cd ./src/article-extractor && zip -r lambda1.zip ./
    - run: cd ./src/article-extractor && aws lambda update-function-code --function-name=article-extractor --zip-file=fileb://lambda1.zip
    - run: echo "layer-arn=$(aws lambda list-layer-versions --layer-name poeme-concrete-modules --region eu-central-1 --query 'LayerVersions[0].LayerVersionArn')" >> $GITHUB_ENV
    - run: aws lambda update-function-configuration --function-name=article-extractor --layers="${{ env.layer-arn }}"
The first thing to notice is how we depend on the layer deployment job:
jobs:
  layer:
    uses: ./.github/workflows/modules-layer-deploy.yml
    secrets: inherit
Here we use secrets: inherit to pass secrets down to the layer deployment workflow. One might naturally assume that a called workflow reads secrets from the GitHub secret storage on its own. It does not: a called workflow only receives the secrets that the calling workflow passes to it.
Another important thing is forcing the newly deployed function to use the latest version of the published layer. We achieve this in two steps:
1. Querying for the latest layer version and storing it in an environment variable
echo "layer-arn=$(aws lambda list-layer-versions --layer-name poeme-concrete-modules --region eu-central-1 --query 'LayerVersions[0].LayerVersionArn')" >> $GITHUB_ENV
2. Updating the function configuration to reference that layer version
aws lambda update-function-configuration --function-name=article-extractor --layers="${{ env.layer-arn }}"
Accessing secrets
When it comes to choosing a sentiment analysis engine, the natural choice is Amazon Comprehend. Why didn’t I stick with it? I didn’t like the results.
Instead, I’ve chosen the text2data service. At the end of the day, it’s like calling any other third-party service via HTTP, so in this section I’ll only briefly cover retrieving the secret needed to call this API.
import { SecretsManagerClient, GetSecretValueCommand } from "@aws-sdk/client-secrets-manager";

async function getSentimentAnalysisApiKey() {
    const secret_name = "SENTIMENT_ANALYSIS_API_KEY";
    const client = new SecretsManagerClient({
        region: "eu-central-1",
    });
    let response;
    try {
        response = await client.send(
            new GetSecretValueCommand({
                SecretId: secret_name,
                VersionStage: "AWSCURRENT"
            })
        );
    } catch (error) {
        // Log for CloudWatch, then let the invocation fail.
        console.log(error);
        throw error;
    }
    return response.SecretString;
}
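Since module-level state survives between invocations of a warm Lambda, the key could be fetched once and reused instead of calling Secrets Manager on every run. Below is a minimal sketch of that idea; memoizeAsync is a hypothetical helper that is not part of the AWS SDK or of the project, and whether the saved call matters for a weekly job is a judgment call.

```javascript
// Wraps an async loader so that it runs at most once per successful result.
// Concurrent callers share the same in-flight promise.
function memoizeAsync(loader) {
  let pending = null;
  return () => {
    if (pending === null) {
      pending = loader().catch((error) => {
        pending = null; // do not cache failures, so the next call retries
        throw error;
      });
    }
    return pending;
  };
}
```

Usage would look like const getCachedApiKey = memoizeAsync(getSentimentAnalysisApiKey); declared at module level, followed by await getCachedApiKey() inside the handler.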
Writing down the result to S3
CloudFront serves HTML content from an S3 bucket. So in order for the poem to be published, we need to generate HTML and store it inside the bucket.
To generate the HTML, we insert the sentences into a Mustache template:
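import fs from 'fs';
import Mustache from 'mustache';

// Reads the HTML template and injects the pre-formatted poem markup into it.
// Defined before its first use: a const arrow function is not hoisted.
const renderTemplate = (poem) => {
    const template = fs.readFileSync('./template.html', 'utf8');
    return Mustache.render(template, {
        poem: poem
    });
}

// Wrap every sentence in its own paragraph.
const formatted =
    poem
        .map(p => `<p>${p}</p>`)
        .join("\n");
const html = renderTemplate(formatted);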
The point of interest in the template is that we have to use triple curly brackets in order for the inserted HTML not to be escaped:
<html>
<!-- omitted for brevity -->
<body>
    <article>
        {{{poem}}}
    </article>
</body>
</html>
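Because the triple curly brackets switch escaping off, any markup inside a crawled sentence would reach the page as-is. One option is to escape each sentence before wrapping it in a paragraph. The sketch below shows that idea; escapeHtml and formatPoem are hypothetical names that belong neither to Mustache nor to the project.

```javascript
// Characters that must not reach the page unescaped, with their HTML entities.
const HTML_ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };

// Replaces every special character with its entity.
function escapeHtml(text) {
  return text.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Escapes each sentence first, then wraps it in its own paragraph.
function formatPoem(sentences) {
  return sentences.map((s) => `<p>${escapeHtml(s)}</p>`).join('\n');
}
```

The paragraph tags themselves stay intact because only the sentence text is escaped, so the result can still be inserted through the triple curly brackets.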
Now we can store the HTML in S3 with the code below:
const putParams = {
    Bucket: 'poeme-concrete',
    Key: 'index.html',
    Body: html,
    ContentType: 'text/html',
};
await s3.putObject(putParams).promise();
Conclusion
Usually, at this point I write a summary of the technologies touched on in the article. This time, however, I encourage you to read through the poem and occasionally revisit it. You may brush it off as something too disturbing, but for many people in Ukraine it’s a grim reality. I am no exception, since I have spent the last two years battling to cure my now-4-year-old son’s PTSD and speech disorder. Still, I’m in a luckier position because I have a roof over my head and live in a relatively peaceful region of the country.
And if the course of the last 30 years has taught us anything it’s that you can never stop the aggressor by leaving him unpunished. russia didn’t stop after Transnistria, Ichkeria, Abkhazia, Crimea, and parts of Donetsk and Luhansk regions, and won’t stop now if it feels that it can wage this war unpunished.
This is why I encourage you to take a stance and support the Ukrainian military via these funds, which I’ve been supporting personally as well:
- Official NBU account
- Come Back Alive Foundation
- Hospitallers Battalion