Moving AWS Diagrams, aka Docker for cheapskates

Last year I created a diagram tool specifically for AWS diagrams. It was essentially a hack of the underlying code draw.io uses.

The reality is, if you’re going to be doing diagrams you might as well use the proper draw.io app, but I wanted to provide a way that didn’t need a signup or any personal data being handed over. (I did add analytics so I could see if it was being used.)

The Setup

I have been running the application (Java and HTML/JS/CSS) as a Docker container on AWS Fargate for the past 6 months or so, with an Application Load Balancer (ALB) sitting in front providing an HTTP endpoint. I got a load of AWS credits around the same time, so money was no object.

The Now Setup

Fast forward to 2020 and my credits have all gone/expired, so I am footing the bill for the ALB and the Fargate usage. The Docker container doesn’t cost too much, but the ALB is expensive enough that I wanted to find an alternative rather than kill off the project.

The Solution

Bearing in mind I already had a Dockerfile and a Docker Hub account, I decided the best way forward was to find a cheap and cheerful box to run Docker on and expose the service using Nginx sitting alongside.

While looking into this, I discovered I could run a DigitalOcean droplet for $5/month - given the amount of traffic I’m getting, one droplet will do for now. I can review if need be in the future.

I also discovered that, while I had planned to configure nginx myself, Jason Wilder has created an automated nginx-proxy image which uses docker-gen to detect starting containers and add them to the nginx config automagically.

The last thing I wanted was to be able to get an SSL/TLS certificate through Let’s Encrypt - for this I used the letsencrypt-nginx-proxy-companion from JrCs.

Digital Ocean Droplet

The droplet is a cheap and cheerful Ubuntu instance that I installed Docker and Docker Compose onto. I’ll cover that in a separate post.

The droplet has a firewall assigned which allows traffic on ports 80 and 443, but nothing else.

Nginx Docker Compose

I am using Docker Compose to create the containers on the box, so I need a compose file for nginx-proxy and the Let’s Encrypt companion.

version: '2'

services:
  nginx-proxy:
    restart: always
    image: jwilder/nginx-proxy
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - "/etc/nginx/vhost.d"
      - "/usr/share/nginx/html"
      - "/var/run/docker.sock:/tmp/docker.sock:ro"
      - "/etc/nginx/certs"

  letsencrypt-nginx-proxy-companion:
    restart: always
    image: jrcs/letsencrypt-nginx-proxy-companion
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    volumes_from:
      - "nginx-proxy"

This creates the services for the reverse proxy and the Let’s Encrypt companion. The mounted Docker socket allows the containers to see new containers starting, and the shared volumes allow the Let’s Encrypt container to create files for nginx-proxy to consume.

Nginx for AWS Backend

AWS Backend is the name I gave the backend for the AWS diagram tool. In hindsight, it sounds a lot loftier a name than it actually deserves.

The solution comprises a jar and some HTML. The Dockerfile is fairly basic too;

FROM amazoncorretto:8
# copy the editor assets and application JAR into the image
COPY www /editor
COPY index.html /index.html
COPY jars/mxPdf.jar /mxPdf.jar
COPY backend/target/backend-1.0-SNAPSHOT-jar-with-dependencies.jar /backend.jar

# expose port of the container
EXPOSE 8080

# run application with this command line 
CMD ["/usr/bin/java", "-jar", "backend.jar", "-cp", "/*"]

To build this locally;

  1. Log in to Docker Hub
docker login -u owenrumney
  2. Build the image
docker build -t owenrumney/awsbackend .
  3. Push to Docker Hub
docker push owenrumney/awsbackend

Now that the image is available for use, I can add another compose file.

version: '2'

services:
  awsbackend:
    restart: always
    image: owenrumney/awsbackend:latest
    environment:
      - VIRTUAL_HOST=aws-diagrams.owenrumney.co.uk
      - VIRTUAL_PORT=8080
      - LETSENCRYPT_HOST=aws-diagrams.owenrumney.co.uk

The environment variables are used by the proxy and the companion to create the certs and to configure the nginx routing.

You can see the finished result running at AWS Diagrams.


Adding Help to a Makefile

Sometimes you inherit or even create a huge Makefile which is unwieldy and difficult to understand. The longer it is, the more complicated it can be to find out which targets are available and what they do.

This post covers an effective way to add a help target to a Makefile which will give an overview of what targets are available.

Basic Makefile

I’m going to use a really basic but real Makefile as a starting example. It runs a suite of tests, creates a dockerised environment to develop against, or stops that environment.

test:
## test: Run the test suite then shut down
	docker-compose up --abort-on-container-exit --exit-code-from tests

dev:
## dev: Create an environment in docker to develop against
	docker-compose -f docker-compose-local.yml up -d

stop:
## stop: Stop the docker instances for dev
	docker-compose down

Let’s imagine that there are another 20 targets available, including dependency management, build, build with unit tests, package, publish etc. Getting this as a new joiner, I’d have to open the file and read it through to get an idea of what options were available and what each one did.

Adding Help

To add a help section, we can put a comment under each of the targets with details of the action, then use a simple handful of commands, including sed and column, in the help target.

test:
## test: Run the test suite then shut down
	docker-compose up --abort-on-container-exit --exit-code-from tests

dev:
## dev: Create an environment in docker to develop against
	docker-compose -f docker-compose-local.yml up -d

stop:
## stop: Stop the docker instances for dev
	docker-compose down

help:
## help: This helpful list of commands
	@echo "Usage: \n"
	@sed -n 's/^##//p' ${MAKEFILE_LIST} | column -t -s ':' | sed -e 's/^/-/'

What this target does is find all of the lines starting with ## in all of the Makefiles that have been loaded (these are stored in the MAKEFILE_LIST variable). Once the comment lines have been gathered, they are piped to column to format them into a table, splitting on the colon :. Finally, the front of each line is prefixed with a dash to get the breakdown of comments.

Running Make with help

Now we can run make help to get more information about the available targets.

This will give us the result;

$ make help
Usage:

- test   Run the test suite then shut down
- dev    Create an environment in docker to develop against
- stop   Stop the docker instances for dev
- help   This helpful list of commands

That’s it - the more targets there are, the more useful this becomes. Ultimately, it is only as useful as the quality and accuracy of the comments.


Combining rows into an array in pyspark

Overview

I’ve just spent a bit of time trying to work out how to group a Spark DataFrame by a given column and then aggregate the rows into a single ArrayType column.

Given the input;

transaction_id  item
1               a
1               b
1               c
1               d
2               a
2               d
3               c
4               b
4               c
4               d

I want to turn that into the following;

transaction_id  items
1               [a, b, c, d]
2               [a, d]
3               [c]
4               [b, c, d]

To achieve this, I can use the following query;

from pyspark.sql.functions import collect_list

df = spark.sql('select transaction_id, item from transaction_data')

grouped_transactions = df.groupBy('transaction_id').agg(collect_list('item').alias('items'))

Testing private methods with ScalaTest

Overview

As part of my journey into using Scala I have had to get used to ScalaTest and the wealth of functionality it offers.

One of the enduring headaches with unit testing is finding a clean way to test private methods without being left feeling that you’ve somehow compromised the design in order to test it fully.

Example

I’ve used an example which is reasonably common, so it’s easy to see the usefulness of the PrivateMethodTester trait.

The example is that of a file loader where the source might be local, or S3, or similar. In this case, I’m going to have a public method on my ObjectWithPrivate Scala object; this method will accept a String sourcePath for a file whose content I want to load as a BufferedSource.

The sourcePath may be local, or it may be S3, but as the consumer I don’t really want to care. The logical thing in this situation is to have the implementation details of loading the file hidden in private methods. These methods will attempt to load the file from their respective sources and throw a FileNotFoundException if it isn’t available.

import java.io.FileNotFoundException

import com.amazonaws.AmazonServiceException
import com.amazonaws.services.s3.model.S3Object
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder, AmazonS3URI}
import org.slf4j.{Logger, LoggerFactory}

import scala.io.{BufferedSource, Source}
import scala.reflect.io.File

object ObjectWithPrivate {

  val logger: Logger = LoggerFactory.getLogger("ObjectWithPrivate")

  def loadFromPath(sourcePath: String): BufferedSource = {
    sourcePath match {
      case s if s.startsWith("s3") => loadFromS3(sourcePath)
      case _                       => loadFromLocal(sourcePath)
    }
  }

  private def loadFromS3(sourcePath: String, s3Client: AmazonS3 
                                            = AmazonS3ClientBuilder.defaultClient()): BufferedSource = {
    val uri: AmazonS3URI = new AmazonS3URI(sourcePath)
    try {
      val s3Object: S3Object = s3Client.getObject(uri.getBucket, uri.getKey)
      Source.fromInputStream(s3Object.getObjectContent)
    } catch {
      case aex: AmazonServiceException => {
        if (aex.getStatusCode == 404) {
          throw new FileNotFoundException(s"file not found: $sourcePath")
        }
        throw aex
      }
    }
  }

  private def loadFromLocal(sourcePath: String) = {
    logger.info(s"Loading config from local File: $sourcePath")
    if (!File(sourcePath).exists) {
      throw new FileNotFoundException(s"Config file not found: $sourcePath")
    }
    val bufferedSource = Source.fromFile(sourcePath)
    bufferedSource
  }

}

The difficulty now comes in testing the private methods. Testing the local load can be done by calling the public loadFromPath method, but that won’t work for the loadFromS3 method, as this needs S3 mocking to test adequately without requiring connectivity to S3 and a known file guaranteed to be present.

This is where the PrivateMethodTester trait comes in. By mixing this trait into our ScalaTest class, we can invoke a private method on our object. I’ve included the whole test class because it has all the setup of the S3 mock (I see little point in creating an example that calls S3 and then not including the information required to replicate it).

import com.amazonaws.auth.{AWSStaticCredentialsProvider, AnonymousAWSCredentials}
import com.amazonaws.client.builder.AwsClientBuilder
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import io.findify.s3mock.S3Mock
import org.scalatest.Matchers._
import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach, FunSuite, PrivateMethodTester}

import scala.io.BufferedSource

class ObjectWithPrivateTest extends FunSuite with BeforeAndAfterEach with BeforeAndAfterAll with PrivateMethodTester {

  val endpoint: AwsClientBuilder.EndpointConfiguration = new AwsClientBuilder.EndpointConfiguration(
      "http://localhost:8001",
      "eu-west-1"
    )
  val credentials = new AWSStaticCredentialsProvider(new AnonymousAWSCredentials)
  val api: S3Mock = new S3Mock.Builder()
                        .withPort(8001)
                        .withInMemoryBackend.build
  api.start

  override def beforeEach() {
    val client = AmazonS3ClientBuilder.standard
      .withPathStyleAccessEnabled(true)
      .withEndpointConfiguration(endpoint)
      .withCredentials(credentials)
      .build
    client.createBucket("testbucket")
    client.putObject("testbucket", "files/file1", "file1_content")
  }

  override def afterAll() {
    api.stop
  }

  test("ObjectWithPrivate loads a test file from S3") {
    val client = AmazonS3ClientBuilder.standard
      .withPathStyleAccessEnabled(true)
      .withEndpointConfiguration(endpoint)
      .withCredentials(credentials)
      .build

    val loadFromS3 = PrivateMethod[BufferedSource]('loadFromS3)
    val content = ObjectWithPrivate invokePrivate loadFromS3(
      "s3://testbucket/files/file1",
      client
    )
    content.mkString shouldBe "file1_content"
  }
}

// further tests for local omitted

In the test, the key part is the following line;

val loadFromS3 = PrivateMethod[BufferedSource]('loadFromS3)

This creates a PrivateMethod object, typed with the BufferedSource return type, to which we pass the name of the method to be called as a Symbol. One of the features added by PrivateMethodTester is the invokePrivate method, which we can use to call the private method on a given object (or an instance of a class, for that matter).

val content = ObjectWithPrivate invokePrivate loadFromS3(
  "s3://testbucket/files/file1",
  client
)

This will call the private method, returning our BufferedSource, and I can test that the content of the mocked S3 object is in fact file1_content.

For interest, here is the build.sbt for this simple project

name := "PrivateMethodTester"

version := "0.1"

scalaVersion := "2.12.8"

// dependencies versions
val amazonSdkVersion = "1.11.540"
val logbackClassicVersion = "1.2.3"
val s3MockVersion = "0.2.4"
val scalaTestVersion = "3.0.5"
val slf4jVersion = "1.7.25"

libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk-core" % amazonSdkVersion,
  "com.amazonaws" % "aws-java-sdk-s3" % amazonSdkVersion,
  "org.slf4j" % "slf4j-api" % slf4jVersion,
  "ch.qos.logback" % "logback-classic" % logbackClassicVersion,
  "org.scalatest" %% "scalatest" % scalaTestVersion,
  "io.findify" %% "s3mock" % s3MockVersion % Test
)

Update - Implicit Parameters

One thing worth adding is what to do when you have a method that takes an implicit parameter which needs testing. Let’s use this contrived example;
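What follows is a minimal sketch of the approach rather than code from the project: it assumes a hypothetical ObjectWithImplicit object with a private greet method that takes an implicit String prefix. Because an implicit parameter is compiled as just another argument, it can be supplied explicitly to invokePrivate after the normal arguments;

import org.scalatest.Matchers._
import org.scalatest.{FunSuite, PrivateMethodTester}

// Hypothetical example object: the private method takes its
// implicit String prefix in a second parameter list.
object ObjectWithImplicit {
  private def greet(name: String)(implicit prefix: String): String =
    s"$prefix $name"
}

class ObjectWithImplicitTest extends FunSuite with PrivateMethodTester {

  test("greet uses the supplied prefix") {
    val greet = PrivateMethod[String]('greet)

    // The implicit parameter is appended to the normal parameter list
    // when compiled, so it is passed explicitly here rather than being
    // resolved from implicit scope.
    val result = ObjectWithImplicit invokePrivate greet("World", "Hello")

    result shouldBe "Hello World"
  }
}

The only real difference from the earlier test is that the value which would normally be resolved implicitly has to be chosen and passed by the test itself.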


Databricks Single SignOn with Azure Active Directory

Overview

At my current workplace we are using Databricks with much success. Having recently activated the Security Operations Package I was keen to implement the Single SignOn (SSO) functionality.

The documentation provided by Databricks doesn’t seem to cover integrating with Azure Active Directory as a SAML 2.0 Identity Provider, and it took some effort to work out how to do it.

Simple Steps

  1. Log into the Azure Portal and, from the menu on the left, select Azure Active Directory, then Enterprise applications from the secondary menu. (Screenshot: Azure Active Directory - Enterprise Apps)

  2. Select New Application to create a new Enterprise application. (Screenshot: Azure Active Directory - New App)

  3. Databricks isn’t one of the Gallery Applications at the time of writing, so select Non-Gallery Application from the available list. (Screenshot: Azure Active Directory - Non Gallery Application)

  4. This is where the Databricks instructions are unclear: you need to use your Databricks URL as the Identity Provider Entity ID. (Screenshot: Azure Active Directory - Basic SAML Settings)

  5. When you’ve completed and saved the basic settings, you’ll be able to download the x.509 certificate and have access to the Login URL to use in the Databricks Admin Console. Download the cert and open it with a text editor to extract the certificate content. (Screenshot: Azure Active Directory - Cert and Login)

  6. You can now take these details over to the Databricks admin console to configure SSO. Enter the details into the Single Sign On tab on the Admin Console page. Your Identity Provider Entity ID is the root of your Databricks cloud URL. (Screenshot: Databricks Admin Console - SSO)

  7. You can now log out, then log in using Single SignOn through Azure, which should get you straight back in. (Screenshot: Databricks Admin Console - SSO Login)

A Note on Allow User Creation

If you enable Allow auto user creation, a user account will be created automatically the first time someone logs in. This is fine if you’ve configured Azure Active Directory so that only users with an assigned Role can use the Enterprise Application. For our use case, I’ve gone with this option disabled and enabled open access at the Active Directory end. This means that users who are unknown (from a Databricks perspective) but otherwise authenticated don’t have access to the environment.