Unit testing HDFS code

I need to write a couple of unit tests for some code that adds a log entry to HDFS, but I don’t want to rely on having access to a full-blown HDFS cluster or a local install to do it.

The MiniDFSCluster in org.apache.hadoop:hadoop-hdfs can be used to create a quick clustered file system for testing.

The following dependencies are required for the test to work.

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>2.6.0</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <type>test-jar</type>
  <version>2.6.0</version>
  <scope>test</scope>
</dependency>

The code is reasonably simple: I’m creating the cluster in the test setup and tearing it down during the teardown phase of the tests.

private MiniDFSCluster cluster; private... read more

Writing a Flume Interceptor

Here we are in June, some five months since the last post, and I finally have some time and content to sit and write a post.

In April 2013 I started working with Hadoop. The plan was to suck in server application logs to determine who was using what data within the business, to make sure it was being correctly accounted for. At the time, Flume seemed like the obvious choice to ingest these files, until we realised the timing, format and frequency made Flume a little like overkill. As it happened, it was discounted before I could get my teeth into it.

Two years later and there is a reason to use Flume - high volumes of regularly generated XML files which need ingesting into HDFS for processing - clearly a use case for Flume.

There are two key requirements for this piece, one that the file name... read more

Quick introduction to pyspark

All the work I have been doing with AWS has been using Python, specifically boto3, the rework of boto.

One of the intentions is to limit bandwidth when transferring data to S3. The idea is to send periodic snapshots, then daily deltas to merge and form a latest folder, so a diff mechanism is needed. I originally implemented this in Scala as a Spark process, but in an effort to settle on one language I’m looking to redo it in Python using pyspark.
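The snapshot-plus-delta idea can be sketched in plain Python before bringing Spark into it. This is only a minimal sketch under assumptions, not the original Scala job: it assumes every record has a unique key and that a snapshot fits in a dict. The names `compute_delta` and `apply_delta` are hypothetical, made up for illustration.

```python
def compute_delta(previous, current):
    """Return the changes needed to turn `previous` into `current`.

    Both arguments are dicts mapping a record key to its value
    (an assumption of this sketch, not something from the original job).
    """
    # New or changed records need to be sent again.
    upserts = {k: v for k, v in current.items()
               if k not in previous or previous[k] != v}
    # Records that disappeared need to be removed from the merged view.
    deletes = sorted(k for k in previous if k not in current)
    return {"upserts": upserts, "deletes": deletes}


def apply_delta(snapshot, delta):
    """Merge a delta into a snapshot and return the new 'latest' view."""
    latest = dict(snapshot)
    latest.update(delta["upserts"])
    for key in delta["deletes"]:
        latest.pop(key, None)
    return latest


monday = {"a": 1, "b": 2, "c": 3}
tuesday = {"a": 1, "b": 20, "d": 4}

delta = compute_delta(monday, tuesday)   # only this needs to go over the wire
latest = apply_delta(monday, delta)      # rebuilt on the receiving side
```

Only the delta has to be transferred each day, which is where the bandwidth saving would come from. In pyspark the same shape would presumably be expressed as joins between keyed RDDs rather than dict lookups.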

I’m using my MacBook and, to keep things quick and easy, I’m going to download a package with Hadoop and Spark then dump it in /usr/share.

wget http://archive.apache.org/dist/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
tar -xvf spark-1.0.2-bin-hadoop2.tgz
mv spark-1.0.2-bin-hadoop2 /usr/share/spark-hadoop

I’m going to create a folder to do my dev in under my home folder. To keep things clean I like to use virtualenv.

... read more

Client side encryption using Boto3 and AWS KMS

Towards the end of 2014 Amazon released the KMS service to provide a cheaper, cut-down offering for Key Management Services than those provided with the CloudHSM solutions (although it still uses HSMs underneath).

The KMS service can be accessed through the IAM service: the bottom option on the left-side menu is Encryption Keys. Make sure you change the region filter to the correct region before creating or trying to view your customer keys.

To create the customer key, click the Create Key button and follow the instructions to create a new master key. Take a note of the Key ID and you’re ready to go.

You need a couple of libraries before you start; for testing I use virtualenv.

bin/pip install boto3
bin/pip install pycrypto


I’m using the PyCrypto library for no other reason than it appeared in... read more

Adventures with Spark, part two

Some time ago, back in September, I wrote a post on starting my adventures with Spark but didn’t progress things very far.

One thing that was holding me back was a reasonably real-world problem to use as a learning case. I recently came across a question which seemed like a good starting point, and for the last few evenings I have been working on a solution.

The problem

A credit card company is receiving transaction data from around the world and needs to be able to spot fraudulent usage from the transactions.

To simplify this use case, I’m going to pick one fabricated indicator of fraudulent usage and focus on that.

  • An alert must be raised if a credit card makes £10,000 of purchases within a 10-minute sliding window
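Before reaching for Spark, the rule above can be sketched in plain Python to pin down what "sliding window" means. This is a minimal sketch under assumptions of my own, not the Spark solution: transactions arrive in timestamp order, timestamps are in seconds, a purchase exactly 10 minutes old has already left the window, and the alert fires when the total reaches £10,000. The name `find_alerts` and the sample data are hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60
THRESHOLD = 10_000  # pounds


def find_alerts(transactions):
    """Yield (card, timestamp, window_total) whenever a card's spend in the
    trailing 10 minutes reaches the threshold.

    `transactions` is an iterable of (timestamp_seconds, card, amount)
    tuples, assumed to arrive in timestamp order.
    """
    windows = defaultdict(deque)  # card -> deque of (timestamp, amount)
    totals = defaultdict(int)     # card -> running total inside the window

    for ts, card, amount in transactions:
        window = windows[card]
        window.append((ts, amount))
        totals[card] += amount

        # Evict purchases that have fallen out of the trailing window.
        # The purchase just appended is never evicted, so the deque
        # cannot become empty inside this loop.
        while window[0][0] <= ts - WINDOW_SECONDS:
            _, old_amount = window.popleft()
            totals[card] -= old_amount

        if totals[card] >= THRESHOLD:
            yield card, ts, totals[card]


sample = [
    (0,   "card-1", 6000),
    (120, "card-2", 9000),
    (300, "card-1", 3000),
    (540, "card-1", 1500),  # card-1 reaches 10,500 within 10 minutes
    (900, "card-2", 2000),  # card-2's first purchase has expired by now
]

alerts = list(find_alerts(sample))
```

Keeping a running total per card avoids re-summing the window on every transaction. Out-of-order arrival and currency conversion are deliberately ignored here.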

For the purposes of this learning project I am going to assume the following:

... read more