I’m hoping this will be a reasonably accurate account of my play with the TfL Cycling DataSets.

I’m still forming my plan, however loosely I think I want to end up with a visualisation where the bike points are highlighted in over a time series as bikes are taken and returned.

Initially, I’m working on my Mac, but I have a Databricks community cluster that I’ve migrated some of the parts to.

Preparing my Local Env

As I said, I’m using my MacBook so I’m going to install a couple of things

Install Spark

To install spark, I use brew

brew install spark

Install Jupyter

Installing jupyter notebooks is done with pip

pip install jupyter

Getting some data

I took a single file from the S3 bucket to play with locally, for no particular reason I went with 01aJourneyDataExtract10Jan16-23Jan16.csv

aws s3 cp s3://cycling.data.tfl.gov.uk/usage-stats/01aJourneyDataExtract10Jan16-23Jan16.csv ~/datasets/cycling/.

Starting Up

Run the following commands to get your Jupyter Notebook up and running

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
pyspark

Quick Test

Finally a quick test to see how it looks. In the Jupyter notebook I can do

data = spark.read.csv('~/datasets/cycling/01aJourneyDataExtract10Jan16-23Jan16.csv', header=True, inferSchema=True)
data.show()

This should show you 20 rows from the data set and we’re off.

Are you confused about the ever growing number of services in AWS and Azure? Why not checkout these super useful glossaries that are always up to date!
There is one for AWS and one for Azure
Top

Share

Comments