For 18 months I’ve been working with Hadoop. Initially it was Hortonworks HDP on Windows, then Hortonworks HDP on CentOS, and for production we settled on Cloudera CDH5 on Red Hat. Recently we’ve been introduced to Spark and subsequently Scala, which I am now in the process of skilling up on. The plan is to blog as I learn.
For the first entries I imagine it won’t be much more than the basic tutorial you could read elsewhere; however, the plan is to get more detailed as I learn more.
I can’t introduce Scala better than Scala School, so it’s worth taking a look at that.
I am going to use JetBrains IntelliJ IDEA for developing fuller applications; however, for playing and learning you can download Spark for Hadoop in TAR format from the Spark Download Page and use the Spark shell.
For now I just extracted it to a folder in Downloads;
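As a rough sketch of the extraction step: the archive name in the example is a placeholder (the real file name depends on the Spark and Hadoop versions chosen on the download page), and extract_spark is a helper name made up for illustration.

```shell
# Unpack a downloaded Spark archive into a target folder (default: ~/Downloads).
# extract_spark is a hypothetical helper name, not part of Spark.
extract_spark() {
  archive="$1"
  dest="${2:-$HOME/Downloads}"
  mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Example (placeholder file name, substitute the one you downloaded):
# extract_spark "$HOME/Downloads/spark-<version>-bin-hadoop<version>.tgz"
```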
To start the Spark shell, run bin/spark-shell from the extracted folder.
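A minimal sketch of launching it, assuming SPARK_HOME points at the extracted folder. The variable and the start_spark_shell wrapper are my own naming for illustration; the launcher itself is the bin/spark-shell script that ships in the Spark download.

```shell
# SPARK_HOME is an assumed variable pointing at the extracted Spark folder.
# start_spark_shell is a hypothetical wrapper around the real bin/spark-shell.
start_spark_shell() {
  spark_home="${SPARK_HOME:?set SPARK_HOME to the extracted Spark folder}"
  "$spark_home/bin/spark-shell" "$@"
}

# Example: run locally with two worker threads.
# SPARK_HOME="$HOME/Downloads/spark-<version>" start_spark_shell --master "local[2]"
```

Once it starts you should end up at a scala> prompt.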
One of the key parts of Spark is the SparkContext, which, if you’ve done MapReduce, seems to be similar to the JobConf. The SparkContext has all the required information about where to run the work, plus the application details shown in the Spark UI web page.
In the Spark shell a SparkContext is already available as sc.
A simple first example is to create a string, split it into words and create a Spark RDD from them. We can use the action count() to find out how many words there are, and we can use filter() to create a new RDD with filtered results (in this case, filtered to the word ‘the’).
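The steps above can be sketched by piping a few Scala lines into the Spark shell. The sentence, the value names and the SPARK_HOME variable are assumptions for illustration; sc.parallelize, count() and filter() are the standard RDD calls.

```shell
# Pipe a few Scala lines into the Spark shell, where sc is already defined.
# SPARK_HOME is an assumed variable; the sentence and value names are made up.
run_word_example() {
  "${SPARK_HOME:?set SPARK_HOME to the extracted Spark folder}/bin/spark-shell" <<'EOF'
val sentence = "the quick brown fox jumps over the lazy dog"
val words = sc.parallelize(sentence.split(" "))   // RDD[String], one element per word
println(words.count())                            // action: number of words
val onlyThe = words.filter(word => word == "the") // transformation: a new RDD
println(onlyThe.count())                          // action: how often 'the' appears
EOF
}
```

Typing the same Scala lines at the scala> prompt works just as well. Nothing is computed until an action such as count() is called, because filter() only describes a new RDD.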
I’m not a huge fan of the Bing search engine. I’ve tried to use it, but I don’t like the format of the search results and I don’t think it’s particularly good at finding relevant results either.
I do like Bing wallpapers, though, and I use Bing Desktop on my Windows laptop to update my desktop to Bing’s daily wallpaper.
Now that I’ve moved to a Mac I still want to get the picture, but the application is Windows only, so a small script can do the job instead. I’ve set mine to download to the user’s Pictures folder, ~/Pictures/bing-wallpapers, using just the current date for the filename.
To run it on a schedule, set up a cron job that runs the script at 10am: open your crontab with crontab -e and add a line for it.
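Here is a hedged sketch of such a script rather than a definitive version. It assumes the unofficial HPImageArchive JSON feed that community scripts commonly use, and that the feed contains a relative "url" field for the image; both are assumptions and may change without notice. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: save today's Bing wallpaper as ~/Pictures/bing-wallpapers/YYYY-MM-DD.jpg.
# ASSUMPTION: the feed URL and its JSON layout are unofficial and may change.
BING_HOST="https://www.bing.com"
BING_FEED="$BING_HOST/HPImageArchive.aspx?format=js&idx=0&n=1"
WALLPAPER_DIR="$HOME/Pictures/bing-wallpapers"

# Print the target file path, named after the current date.
wallpaper_path() {
  printf '%s/%s.jpg\n' "$WALLPAPER_DIR" "$(date +%Y-%m-%d)"
}

# Read the feed JSON on stdin and print the image's relative URL.
extract_image_url() {
  sed -n 's/.*"url":"\([^"]*\)".*/\1/p' | head -n 1
}

# Fetch the feed, resolve the image URL and save the picture.
download_bing_wallpaper() {
  mkdir -p "$WALLPAPER_DIR" || return 1
  rel_url=$(curl -fsSL "$BING_FEED" | extract_image_url)
  [ -n "$rel_url" ] || { echo "no image URL found in the feed" >&2; return 1; }
  curl -fsSL -o "$(wallpaper_path)" "$BING_HOST$rel_url"
}

# Only touch the network when explicitly asked to.
if [ "${1:-}" = "--download" ]; then
  download_bing_wallpaper
fi
```

Saved as, say, bing-wallpaper.sh (a hypothetical name) and made executable, it would be run as bing-wallpaper.sh --download. Keeping the download behind a flag means the helper functions can be tried out without hitting the network.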
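As a sketch of what that crontab entry could look like: the script location below is a hypothetical path, so change it to wherever you saved yours. The snippet only prints the line, ready to paste into the editor that crontab -e opens.

```shell
# Hypothetical script location; change it to wherever you saved the script.
WALLPAPER_SCRIPT="$HOME/bin/bing-wallpaper.sh"

# Crontab fields: minute hour day-of-month month day-of-week command.
# "0 10 * * *" means 10:00 every day; add any arguments your script expects.
CRON_LINE="0 10 * * * $WALLPAPER_SCRIPT"

# Print the line so it can be pasted into the crontab editor.
echo "$CRON_LINE"
```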
I’ve just had an interesting problem with an SFTP account that suddenly stopped working from a cron job. When the account was used directly from the bash prompt, the only response was an immediate Connection closed.
Everything was set up correctly as far as authorized_keys went, and /etc/ssh/sshd_config looked fine, but the account wouldn’t connect.
As I don’t have the private key for the user that was connecting to the server, I created a new key pair and added the public key to the authorized_keys file in the user’s .ssh folder. Using the following command I got a response that was no more helpful;
psftp firstname.lastname@example.org -i logdrop_pk.ppk
This gave the response;
FATAL: Received unexpected end-of-file from SFTP Server
After a couple more checks on the server and no success, I decided to try
putty email@example.com -i logdrop_pk.ppk
This opened PuTTY, which connected but reported that the password for the user had expired and needed a new one. From here I was able to reset the password then go on the server and set