Introducing gitsearch-cli

The first version of gitsearch-cli is now available. This command-line interface lets you search GitHub repositories and users using keywords and (currently) a handful of additional criteria.

Installation

To install gitsearch-cli, use pip3 with the following command:

pip3 install gitsearch-cli

Usage

By default the search is scoped to repositories, but you can change the scope to search specifically for users.

For additional help, just use:

git-search --help

Searching for Users

git-search --scope=users owen rumney

or

git-search --scope=users owenrumney

This will yield the following results:

username     url
owenrumney   https://github.com/owenrumney

Searching for Repositories

When searching for repositories, you can do a general search by keyword or narrow the search by including a language and/or user.
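A general keyword search should just be the keyword on its own, something like:

git-search spark

To narrow things down, combine the language and user options: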

git-search -l=scala -u=apache spark

This will give the following results:

name           owner   url
fluo-muchos    apache  https://github.com/apache/fluo-muchos
predictionio   apache  https://github.com/apache/predictionio
spark          apache  https://github.com/apache/spark
spark-website  apache  https://github.com/apache/spark-website

If you only want to return results where the keyword appears in the repository name, you can use the --nameonly flag:

git-search -l=scala -u=apache spark --nameonly

This will give the following results:

name           owner   url
spark          apache  https://github.com/apache/spark
spark-website  apache  https://github.com/apache/spark-website

TODO

  • Add date-based options for the search criteria
  • Refactor the code to be more Pythonic

Allow connections to dockerised Elasticsearch from other than localhost

We need to access Elasticsearch running in a namespace within minikube, but the other Pods can’t connect to port 9200. It turns out that out of the box it’s limited to localhost, and the network.host property needs updating.

Setting network.host in the elasticsearch.yml configuration file on a Docker container puts the instance into "production" mode, which triggers a load of limit checks including, but not limited to, the number of threads allocated to a user.

To my knowledge, setting ulimits in Docker isn’t trivial, so another way to expose Elasticsearch to the other Pods is required.

The answer appears to be to set http.host: 0.0.0.0 so that it’s listening on all interfaces. This keeps the instance in development mode, so the ulimit issues don’t stop startup, while still making it accessible from outside the Pod.
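For a dockerised instance you don’t necessarily need to edit elasticsearch.yml at all; the official image will pick up settings passed as environment variables (the same values can go in a Pod spec’s env section). A rough sketch, where the image tag is just a placeholder for whatever you actually deploy:

docker run -d -p 9200:9200 \
  -e "http.host=0.0.0.0" \
  -e "transport.host=127.0.0.1" \
  docker.elastic.co/elasticsearch/elasticsearch:5.6.0

Keeping the transport bound to localhost is what keeps the node in development mode while the HTTP interface is exposed.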

Argument defaults in shell scripts

When writing a shell script I regularly find that I want to pass an argument into the script, but only sometimes. For example, I might want the script to write to the /tmp folder most of the time, but occasionally choose the output location myself.

Default arguments can be given in scripts using the following simple syntax:

#!/bin/sh

# example script to write to an output folder
# use the first argument as the output path, or default to /tmp/output
OUTPUT_PATH=${1:-/tmp/output}

# make sure the output folder exists, then write to it
mkdir -p "${OUTPUT_PATH}"
echo "some arbitrary process" > "${OUTPUT_PATH}/arbitrary_output.output"

This will either use the first parameter passed in as the output path, or fall back to the default of /tmp/output if none is provided.

sh example_script.sh # outputs to /tmp/output

sh example_script.sh /var/tmp/special # outputs to /var/tmp/special

Running Spark against HBase

It’s reasonably easy to run a Spark job against HBase using newAPIHadoopRDD, which is available on the SparkContext.

The general steps are:

  1. create an HBaseConfiguration
  2. create a SparkContext
  3. create a newAPIHadoopRDD
  4. perform the job action

To get this working, you’re going to need the HBase libraries in your build.sbt file. I’m using HBase 1.1.2 at the moment, so that’s the version I’m pulling in.

"org.apache.hbase" % "hbase-shaded-client" % "1.1.2"
"org.apache.hbase" % "hbase-server" % "1.1.2"

Creating the HBaseConfiguration

This requires, at a minimum, the ZooKeeper quorum. In my environment the test and production clusters have a different ZOOKEEPER_ZNODE_PARENT, so I’m passing that in to override the default.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration

def createConfig(zookeeper: String, hbaseParentNode: String, tableName: String): Configuration = {
  val config = HBaseConfiguration.create()
  config.set("zookeeper.znode.parent", hbaseParentNode)
  config.set("hbase.zookeeper.quorum", zookeeper)
  config.set("hbase.mapreduce.inputtable", tableName)
  config
}

Creating the SparkContext

The SparkContext is going to be the main engine of the job. At a minimum we just need to have the SparkConf with the job name.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName(jobname)
val spark = new SparkContext(conf)

Creating the newAPIHadoopRDD

We have an HBaseConfiguration and a SparkContext, so now we can create the newAPIHadoopRDD. It needs the config containing the table name and namespace, and it needs to be told to use TableInputFormat as the InputFormat. We expect the keys to be ImmutableBytesWritable and the values to be Result.

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val zookeeper = "hbase-box1:2181,hbase-box2:2181"
val hbaseParentNode = "/hbase"
val tableName = "credit_data:accounts"

val config = createConfig(zookeeper, hbaseParentNode, tableName)

val hBaseRDD = spark.newAPIHadoopRDD(config,
                classOf[TableInputFormat],
                classOf[ImmutableBytesWritable],
                classOf[Result])

Performing the Job Action

That’s all we need; we can now run our job. It’s contrived, but consider the following table.

key               d:creditLimit  d:currentBalance
1234678838472938  1000.00        432.00
9842897418374027  100.00         95.70
7880927412346013  600.00         523.30

In our table the row key is the credit card number, and the d column family holds the column qualifiers creditLimit and currentBalance.

For this job, I want to find all the accounts that are at more than 90% of their available credit.

import org.apache.hadoop.hbase.util.Bytes

case class Account(ccNumber: String, creditLimit: Double, balance: Double)

val accountsRDD = hBaseRDD.map { case (key, result) =>
    // the row key is the credit card number
    val ccNumber = Bytes.toStringBinary(key.get())
    // pull the column qualifiers out of the d column family
    val familyMap = result.getFamilyMap(Bytes.toBytes("d"))
    val creditLimit = Bytes.toDouble(familyMap.get(Bytes.toBytes("creditLimit")))
    val balance = Bytes.toDouble(familyMap.get(Bytes.toBytes("currentBalance")))
    Account(ccNumber, creditLimit, balance)
}

That gives us a nicely typed RDD of Accounts that we can do our filtering on.

val eligibleAccountsRDD = accountsRDD.filter(a => {
    (a.balance / a.creditLimit) > 0.9
})

That gives us the matching accounts, from which we can extract the account numbers and save them to disk.

val accountNoRDD = eligibleAccountsRDD.map(a => a.ccNumber)
accountNoRDD.saveAsTextFile("/save/location")

The save location will now be a folder containing the generated part-xxxxx files with the results. In our case, only one of the accounts is over the 90% threshold:

9842897418374027

Replacing an incorrect git commit message

If you have committed some code to Git (or, in the current case, Bitbucket) and you have made an error in the commit message (in the current case, referencing the wrong Jira ticket), all is not lost.

To replace the commit message, first amend the most recent commit:

git commit --amend

Change the commit message in the editor that opens; in my case, from:

FOO-1234 - fix the bar
 - add some stuff

to

FOO-1235 - fix the bar
 - add some stuff
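
If you’d rather skip the editor, git also lets you pass the replacement message inline:

git commit --amend -m "FOO-1235 - fix the bar
 - add some stuff"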

Because amending rewrites the commit, the remote will reject a normal push, so the final step is to push with --force.

git push --force