Untarring an Archive in Go

This took me a little while to figure out, so I’m posting it to save you some time.

Python Development and virtualenv

If you’re doing Python development, you will likely need more than one version of Python. virtualenv provides a way of running multiple Python versions, with multiple sets of Python packages (eggs). virtualenvwrapper adds some features that make virtualenv easier to use. It also collects all of your virtual environments into a single directory, like rbenv does.

One difference between virtualenv and rbenv is that rbenv allows you to download and install versions of Ruby that are entirely isolated from the rest of your system and from all other users. virtualenv, on the other hand, requires a global installation of each Python version you want to use.

That is, if you want to have a virtualenv running Python 2.7.5 and one running Python 3.4.0, you must have a system-wide installation of both 2.7.5 and 3.4.0. Each virtualenv uses one of your globally-installed Python interpreters, plus a local set of packages.

For example, you may have two projects that use the 2.7.5 interpreter but require a different or conflicting set of packages.
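To make the split between a global interpreter and a local set of packages a little more concrete, here is a small Python sketch that reports which interpreter is running and whether it appears to belong to a virtual environment. It relies on the standard sys.prefix and sys.base_prefix attributes (older virtualenv releases set sys.real_prefix instead). The function name describe_environment is made up for this example.

```python
import sys


def describe_environment():
    """Return a dict describing the running interpreter and its environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while the "base" prefix points at the global Python installation.
    base_prefix = getattr(sys, "real_prefix", None) or getattr(
        sys, "base_prefix", sys.prefix
    )
    return {
        "version": "%d.%d.%d" % sys.version_info[:3],
        "environment": sys.prefix,
        "global_install": base_prefix,
        "in_virtualenv": base_prefix != sys.prefix,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("%s: %s" % (key, value))
```

Run it once with your system Python and once inside an activated environment; the environment path should change while the global install stays the same.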

I’m assuming you have a version of Python already on your system. You can run python --version to check.

To get virtualenv, you will have to use pip. Run the following command to see if you have pip already:

If you get “command not found,” you can install pip with this:

Installing virtualenv and virtualenvwrapper

Run this to install virtualenv and virtualenvwrapper:

Now let’s create a directory to hold all of our Python installations:

Add these lines to your .bash_profile:

Reload your bash profile with this command:

If you’re running OSX Mavericks, you should have Python 2.7.5 pre-installed. You can verify this by running:

You can globally install Python 3.x alongside Python 2.x on a Mac and they will not interfere with each other. Let’s set up a virtualenv for a Python 3 project. The Python 2.x executable on a Mac is called python, and the Python 3.x executable on a Mac is called python3.

You can download the latest Python from https://www.python.org. The downloads page includes installers for Windows and Mac, and source releases for Linux and Unix.

Once you’ve downloaded and installed Python 3 on a Mac, you should see something like the following:

Now let’s say we need to work on two projects, called project1 and project2. Both require Python 3, but each requires a set of packages that conflicts with the other. We can do this:

This sets up a virtual environment in ~/.virtualenv/project1, and it loads that virtual environment, placing its name in parentheses at the beginning of your command prompt.

You can run the same command to create a second environment for project2. Just change the last parameter:

Now you can switch back and forth between project1 and project2 using the workon command, which always adds the name of your current virtualenv to the beginning of your command prompt.

Below, we switch to the project1 virtual environment and install a version of a Python package, then switch to project2 and install a different, conflicting version of the same package:

Now we can work on project1 and project2 separately, with each project loading the packages it needs.

If you want to delete the project1 environment, simply run:

More information on virtualenv is available at http://www.virtualenv.org/en/latest/virtualenv.html.

You’ll find more info on virtualenvwrapper at http://virtualenvwrapper.readthedocs.org/en/latest/.

Ruby Development and rbenv

You will likely need to run more than one version of Ruby if you are doing work on a Rails project and/or other Ruby code. You can use rvm or rbenv to manage multiple Ruby installations.

Installing rbenv

rbenv is a lightweight alternative to rvm: it allows you to run multiple versions of Ruby and multiple gem sets without conflict. rbenv manages the Ruby versions and their respective binaries (gem, rake, etc), while allowing bundler to manage gemsets.

If you currently have rvm installed, you’ll have to uninstall it with this command:

Run these four lines to install and set up rbenv and ruby-build:

Run source ~/.bash_profile to reload your updated bash_profile.

Now you can load a new Ruby version like this:

Then run this to make the new Ruby version available:

When this is done, you will have Ruby 2.1.1 installed in ~/.rbenv/versions/2.1.1.

Now let’s say you have a project in ~/projects/my_rails_app, and you want that to use the Ruby 2.1.1 version you just installed. Run these commands:

Now every time you change into the ~/projects/my_rails_app directory, rbenv sets Ruby 2.1.1 as the Ruby version. It does this by writing a .ruby-version file in that directory, describing which Ruby version should be loaded. If you run ruby --version in that directory, you will see that the version is 2.1.1.
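If you’re curious how a per-directory version file can work, here is a simplified Python sketch of the lookup. It assumes the file is named .ruby-version (the name used by recent rbenv releases; older releases used a different name) and that the search walks up from the current directory through its parents. Real rbenv also consults an environment variable and a global version file, which this sketch ignores. The function name find_ruby_version is hypothetical.

```python
import os
import tempfile


def find_ruby_version(start_dir, filename=".ruby-version"):
    """Walk up from start_dir and return the first version string found, or None."""
    current = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(current, filename)
        if os.path.isfile(candidate):
            with open(candidate) as handle:
                return handle.read().strip()
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent


if __name__ == "__main__":
    # Build a throwaway project tree instead of touching a real project.
    with tempfile.TemporaryDirectory() as project:
        with open(os.path.join(project, ".ruby-version"), "w") as handle:
            handle.write("2.1.1\n")
        subdir = os.path.join(project, "app", "models")
        os.makedirs(subdir)
        print(find_ruby_version(subdir))
```

Because the search walks upward, the version applies to every subdirectory of the project, not just the directory that holds the file.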

You can now start installing gems. Any gems you install here will go into ~/.rbenv/versions/2.1.1, so they will not conflict with gems you have installed for Ruby 1.9 or Ruby 2.0. You’ll probably want to install bundler first, so that you can manage gem sets:

For more information, see the rbenv documentation on GitHub:


Thinking Styles in Functional and Object-Oriented Languages

After several months of working in Clojure, I noticed that the deeper I thought about the problem I was solving, the shorter my code became. Working in object-oriented languages, I’ve generally found the opposite to be true: the more you analyze a problem, the more classes you wind up with. This is particularly true in statically-typed object-oriented languages.

Abstractions in Object-Oriented Languages

In object-oriented languages, classes are supposed to be abstractions. In a Human Resources application, an Employee class is essentially a template to hold information about an employee, such as the employee’s name, date of hire, department, manager, etc. The class is an abstraction to the extent that it can represent any employee in the organization, and to the extent that it allows various parts of the application to deal with employees in a uniform way.

Compared to basic data structures like lists and hash maps, however, classes are not particularly abstract. In fact, they are very specific. The Employee object has a name, a hire date, a department (which is likely an instance of class Department) and a manager (which is likely another instance of class Employee). Methods that interact with Employee objects must be written specifically for Employee objects, or for some base class that Employee descends from, or for some interface the class implements.

Of course, this is completely natural in the object-oriented world. This is by design. When you design your application, you model out the problem first. You can start by describing in plain English the problem you’re going to solve. The nouns become classes and verbs become methods.

You find commonalities among the nouns, and you build a class hierarchy based on those commonalities. For example, in the HR application, an Employee is a Person, and a JobCandidate is a Person. So you start with a base class of Person, and your application can do many of the same things with a JobCandidate that it can do with an Employee. This design also makes it easy for a JobCandidate to become an Employee.

In the design process, you also try to find commonalities in ways the system will need to manipulate different classes of objects. When you discover these commonalities, you design interfaces– sets of common verbs or methods– and make sure that each class implements the necessary interface.

Interfaces are in some ways a more flexible abstraction than classes: while a class must promise that an object adheres to an exact structure, an interface merely promises that an object will respond to a set of method calls, regardless of its structure.

The explicit definitions that classes and interfaces require provide some benefits:

  1. The programmer knows exactly what he is dealing with, because a class’s properties and methods are explicitly defined before it is ever used.
  2. In statically-typed languages in particular, an Integrated Development Environment (IDE) can provide tremendous assistance in writing, understanding and debugging code. Again, this is because every property and method is explicitly defined before it is ever used. The IDE can analyze the code before it even runs, and can help the developer with tools like code completion and the ability to jump directly to the definition of a class or method.
  3. In statically-typed languages, the compiler can produce highly-optimized code because all classes, methods, and data types are specified in the code itself. The runtime can skip most of the work of figuring out exactly which method needs to be called, or whether an object is a string or a number, because the compiler figured all that out ahead of time.

Each of these three properties of object-oriented languages (particularly statically-typed object-oriented languages) is particularly valuable to large organizations developing large well-defined systems with large teams of engineers.

A large organization with a large engineering team will naturally have a lot of turnover. If, for example, the average engineer on a team of 100 remains in his position for 5 years, then the team will be bringing on an average of 20 new members each year.

It takes a great deal of time and effort for each engineer to learn the project’s code base. A good IDE can greatly reduce the amount of time it takes each new engineer to come up to speed. And a good IDE can do this because the underlying statically-typed object-oriented language requires everything to be declared explicitly and in great detail.

If you’ve worked exclusively with IDEs for the past few years, you may take for granted how much assistance they provide. Try navigating an unfamiliar C# or Java project that has a few thousand files using Notepad or Nano. This will remind you of how much assistance the IDE provides.

The other great benefit of statically-typed object-oriented languages, runtime performance, is also invaluable to large organizations running large code bases. These organizations, and the applications they produce, tend to serve many users, processing large volumes of data. They have to be efficient.

Alongside these benefits, statically-typed object-oriented languages have some drawbacks, particularly in the realm of agility. By agility, I mean the ability for developers to change characteristics or behaviors of the system, to adapt the system to new requirements or apply it to new sets of problems.

The lack of agility in statically-typed object-oriented systems comes in part from static typing and in part from the very features that are supposed to provide abstraction and flexibility: classes and interfaces.

If you change some property of your Employee class– for instance, allowing an employee to work in two departments– you must change the department property of the class from a single instance of Department to a collection of Department. Then you must make corresponding changes to every part of your code base that touches the Employee.department property.
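Here is a rough Python sketch of that ripple effect. Python is not statically typed, so the type hints only stand in for what a Java or C# compiler would enforce. The class and function names (EmployeeV1, EmployeeV2, badge_text_v1, badge_text_v2) are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Department:
    name: str


@dataclass
class EmployeeV1:
    name: str
    department: Department  # exactly one department


@dataclass
class EmployeeV2:
    name: str
    departments: List[Department]  # the new requirement: several departments


def badge_text_v1(employee: EmployeeV1) -> str:
    # Written against the original shape of the class.
    return "%s / %s" % (employee.name, employee.department.name)


def badge_text_v2(employee: EmployeeV2) -> str:
    # Every caller like this one has to be found and rewritten.
    names = ", ".join(d.name for d in employee.departments)
    return "%s / %s" % (employee.name, names)


if __name__ == "__main__":
    ana = EmployeeV2("Ana", [Department("R&D"), Department("Support")])
    print(badge_text_v2(ana))
```

Passing an EmployeeV2 to badge_text_v1 fails, because the department property it reads no longer exists. Multiply that by every function that touches the property and you get a sense of the cost.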

Though the refactoring tools of most current IDEs can assist in this process, making the required changes can still be quite time-consuming, disruptive, and difficult. And as you make the changes, you will need to rewrite existing tests and write new regression tests to make sure the system continues to behave as expected.

This is the downside of using classes and interfaces as abstractions: they are not really that abstract. In fact, they are quite specific. They define explicitly every property and every method, along with the types of every property, and parameter types and return types of every method. And once the classes are defined and start interacting with each other, they create dependencies that are very hard to undo.

While the best object-oriented architects are careful to avoid object dependencies, even many experienced object-oriented developers don’t do enough to avoid dependencies. In extreme cases, you wind up with dependency graphs like this:

Object dependencies not only make code difficult to change, they make it difficult to understand. In many object-oriented code bases, you may need to understand a dozen different classes before you can understand how a single class is instantiated, or transformed, or serialized or saved.

Object dependencies also vastly complicate the testing process. You often need to instantiate mocks of several classes, and coerce each one into some particular state, just so you can test a single method in a single class.

Abstractions in Clojure

Abstractions in Clojure tend to be profoundly abstract, and therefore extremely flexible. This level of abstraction, which is generally unavailable in statically-typed object oriented languages, encourages a different way of thinking.

Unlike object-oriented languages, in which data structures and the methods that operate on those structures are bound up into classes, Clojure separates data structures (the nouns) from functions (the verbs). While object-oriented languages allow and encourage developers to model custom classes to represent specific elements of a problem domain, Clojure provides two basic collection types capable of representing any elements of any problem domain: the sequence and the map.

A sequence is basically similar to an array in JavaScript or Ruby: it contains a list of elements, each of any arbitrary type, and it provides methods for iterating through, adding to, and removing its contents. Clojure includes a few sequence types (lists, vectors, and lazy sequences, for example), each with its own strengths. Each of these sequence types can be treated like any other, which means that if you write a function to work with a list, it will work just as well with a vector or a lazy sequence.

Maps are special sequences that have keys and values (other sequences have only values). Maps are similar to Hashes in Ruby and Perl, or Dictionaries in Python.

In addition to these fundamental data types, Clojure is built upon a relatively small set of core functions. While John McCarthy’s original LISP may have theoretically required only five functions, Clojure’s core has a few more functions than that. Still, the number is small.

In an object-oriented design, you first model the problem domain, then create custom classes to represent concepts in the problem domain. Then you create custom components or sub-systems that contain specially-designed APIs to work specifically (and often exclusively) with your custom classes.

In Clojure, you write general functions to work with general data types. To accomplish specific tasks, you compose functions (that is, you write functions consisting of other functions), and this process is fairly straightforward because of the limited number of data types. Your specific function takes a list as its parameter. It in turn calls 5 other functions, each of which takes a list as a parameter and returns a list as its output.
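The same idea can be sketched in Python, using plain lists of dictionaries in place of Clojure sequences and maps. This is only an illustration of the style, not Clojure itself. The helper compose and the step functions are names I made up for the example.

```python
from functools import reduce


def compose(*functions):
    """Return a function that applies the given functions left to right."""
    return lambda value: reduce(lambda acc, fn: fn(acc), functions, value)


# Each step takes a list and returns a new list; none of them knows
# anything about "employees" beyond the keys it reads.
def only_active(records):
    return [r for r in records if r.get("active")]


def sorted_by_hire_date(records):
    return sorted(records, key=lambda r: r["hired"])


def names(records):
    return [r["name"] for r in records]


# A specific task built purely by composing the general steps.
active_names_by_seniority = compose(only_active, sorted_by_hire_date, names)

if __name__ == "__main__":
    staff = [
        {"name": "Ana", "hired": "2012-03-01", "active": True},
        {"name": "Ben", "hired": "2009-07-15", "active": False},
        {"name": "Cy", "hired": "2010-11-30", "active": True},
    ]
    print(active_names_by_seniority(staff))
```

Because every step has the same shape (list in, list out), you can reorder the steps, drop one, or reuse them on records that have nothing to do with employees.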

These two approaches are profoundly different. The object-oriented model is similar to ideographic languages like Chinese, in which every word requires a custom symbol. The functional model is similar to an alphabetic language, in which any new word can be represented by composing letters from the same alphabet.

The alphabetic model simplifies many things. It allows computers to have relatively small keyboards. It allows readers to at least know how to pronounce words they’ve never seen before. It allows writers (in languages other than English) to spell words they’ve never written before. It allows writers to create new words simply by combining the letters necessary to make the sounds.

And these new words can immediately have meaning, even when the reader has never seen them before. For example, everyone understood exactly what George W. Bush meant when he said people misunderestimate him.

But if the Chinese premier said the same thing, how would the newspapers report it? Who would be responsible for creating the ideogram to represent a word no one had ever seen before, and how would the readers know what the new ideogram represented?

Like ideograms, classes often try to model in some detail the things they represent. For example, take a look at some of these Chinese characters, a few of which appear below:



Up or Above:

Down or Below:

The symbols for sun and bird bear some resemblance to the sun and a bird. The symbol for up shows a line going up from a baseline. The symbol for down shows a line going down from a baseline.

Alphabetic representations are more abstract. They don’t try to represent the things in the material world. Instead, they merely represent the sound of the spoken language. The letters a-p-p-l-e bear no relation to any fruit; they merely represent the sound of the fruit’s name in spoken English.

Object-oriented languages have many of the pitfalls of ideographic languages. You’ll often find a class for every concept. Understanding a complex system often means understanding thousands of classes. And understanding the classes themselves often requires understanding the thought process of the person who architected the original system.

For example, the Chinese language includes an ideogram depicting two women under a single roof. What do you suppose this means?

This is the ideogram for “trouble.” It may be obvious to the inventor of the ideogram, whose wife did not get along with his mother, but it’s not at all obvious to a newcomer to the language.

Similarly, the Apache HTTP Client library includes a huge number of classes, many of which you must study and understand before you can even open a simple connection to a remote host.

Object-Oriented Complexity vs. Functional Simplicity

The fundamental differences in the abstraction mechanisms of object-oriented and functional languages lead to fundamental differences in the way programmers model the problems they must solve, and the solutions they invent.

Because the primary abstraction mechanism in object-oriented languages is the class, large systems tend to include large numbers of classes. And because each class requires explicitly defined properties and methods, the process of modeling an object-oriented system is a process in which the architect moves from the specific to the general, then back to the specific.

For example, in modeling an HR system, the architect first considers what types of data the system will be working with: Persons, Employees, JobCandidates, Departments, etc. This is a process of abstraction. Then the architect finds commonalities. Employee and JobCandidate are special instances of the more general concept of Person, so they will both derive from this base class.

Finally, the architect moves back to specifics: Employee will have these attributes and methods; JobCandidate will have this other set of attributes and methods. And at this point, before the application ever runs, what an Employee can do and what a JobCandidate can do are limited by the class definitions. What any part of the system can do with an Employee or a JobCandidate is limited by the definitions of those objects.

Contrast this to the process of modeling in a dynamically-typed functional language like Clojure. The nouns in your problem domain become maps or sequences. There is no need to fix the set of properties belonging to an Employee or a JobCandidate. You simply make them maps (a.k.a. hashes or dictionaries) and assign properties as needed.

The actions your system will need to implement– the verbs– are functions, and can work with virtually any objects, since they are all just maps. The limitations that appear in the object-oriented design as soon as it is written are not there in the functional design. The number of verbs that can operate on your nouns is not limited by the architect’s model.
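As a rough sketch of what this looks like in practice, here is the HR example again in Python, with plain dictionaries standing in for Clojure maps. The function names with_property and describe are invented for the example; the point is only that the verbs make no assumptions about a fixed set of properties.

```python
def with_property(record, key, value):
    """Return a copy of the record with one property added or replaced."""
    updated = dict(record)
    updated[key] = value
    return updated


def describe(record):
    """Works on any map that has a name; every other property is optional."""
    extras = sorted(k for k in record if k != "name")
    return "%s (%s)" % (record["name"], ", ".join(extras))


if __name__ == "__main__":
    candidate = {"name": "Dana", "applied_for": "Engineer"}
    # "Hiring" the candidate is just adding properties; no class changes.
    employee = with_property(candidate, "hire_date", "2014-05-01")
    # Working in two departments needs no schema change either.
    employee = with_property(employee, "departments", ["R&D", "Support"])
    print(describe(candidate))
    print(describe(employee))
```

Note that with_property returns a new map rather than modifying the old one, which loosely mirrors how Clojure’s immutable maps behave.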

This makes software written in functional languages like Clojure vastly more flexible than software written in statically-typed object-oriented languages like C++, Java and C#.

Unlike the object-oriented design process, in which the mind moves from specifics of functional requirements to generalities of the object model and back to specifics of class implementation, the fundamental abstractions available in Clojure encourage you to think in broad and general terms.

Because the language itself is based on a few fundamental data types and a small number of core functions, you start to think of all problems in terms of those few data types and those few functions. The amazing thing is that, once you get comfortable with Clojure, you find that those few data types and functions can indeed represent virtually every problem computers can solve.

And as you begin to conceive all problems in terms of a few fundamental structures and functions, you’ll find that as you re-think your design and refine your application, your code gets shorter. This process can be deeply gratifying.

Again, this process of simplification and reduction in code size is the opposite of my experience with object-oriented languages. While the occasional painful refactoring does simplify an object-oriented design, the general tendency seems to be toward writing more and more classes.

Code Size

Proponents of functional languages in general, and Clojure in particular, often proclaim the benefits of having to write and maintain less code. The great benefit of a smaller code base is that it is easier to reason about what the code is actually doing.

If you have read many legal documents, you’ve probably noticed that mediocre lawyers tend to think like mediocre programmers, first considering every conceivable scenario, and then treating each one as a special case. In programming, this tendency manifests in architectures that include huge numbers of classes. In legal documents, it appears in long-winded sentences that contain a clause to address every conceivable situation.

Back in the 1980’s, the monthly statement for Citibank cardholders included a 360-word statement describing the conditions under which card holders would be considered in default. The statement included numerous clauses, each describing a complex set of conditions that would constitute default.

Even intelligent, well-educated cardholders had trouble reasoning through the 360-word statement. Citibank got so tired of answering customer questions about what constituted default that they hired a more intelligent lawyer to distill the 360-word statement to its essence:

“You are in default if you have not made your minimum monthly payment in the past 90 days.”

The 18 words above express the same thing as the 360 words they replaced. It’s easy to reason about this short, direct statement.

When Clojure programmers tout the benefit of “less code,” they are not praising the virtue of laziness (as Larry Wall did) or expressing the relief they feel at not having to type so much. They are talking about the benefits of expressing clearly and concisely what their program is doing.

Such concise expression is not necessarily easy. When a programmer first approaches a problem, in Clojure or any other language, it’s common for the first iteration of code to look something like Citibank’s 360-word definition of default. Only after much consideration and revision can the programmer reduce the original solution to something as concise and simple as the sentence above.

Rich Hickey, who created the Clojure language, has an excellent presentation called Simple Made Easy in which he describes the amount of work required to arrive at a simple solution. One of the great advantages of Clojure over languages like Java and C# is that the Clojure language permits simple, elegant solutions, and the Clojure way of thinking leads you toward them.


If Clojure is so great, why isn’t everyone using it?

For one thing, it’s a difficult language to learn, particularly if you have an extensive background in object-oriented development. One of the most difficult adjustments in moving from an object-oriented language to Clojure is Clojure’s lack of mutable state. You can’t simply initialize a bunch of objects and leave them hanging around for convenient use at a later time.

Another issue

Creating a Custom AMI for Elastic Beanstalk

A couple of us here at Hotelicopter.com spent several hours trying to create an AMI with the HotSpot JVM that would run on Elastic Beanstalk. After creating an AMI, our Beanstalk application would not run.

In hopes of saving others some of the pain that we went through, here are the steps for creating a custom AMI for Elastic Beanstalk that replaces the IcedTea JVM with the Oracle HotSpot JVM. This information was collected from multiple posts in the AWS user forums.

  1. Create a new Beanstalk environment, or identify an existing Beanstalk environment that you want to modify.
  2. Identify one EC2 instance running in this Beanstalk environment. You can identify an instance by following these steps:
    – Click the EC2 tab in the AWS Management Console.
    – Click the Load Balancers link in the left nav bar.
    – In the list of load balancers, click the name of the Beanstalk environment you want to work with.
    – In the bottom pane, click the Instances tab.
    – Copy one of the instance IDs listed under Instances. It should look something like i-5ce58d3e
    – Click the Instances link in the left nav bar.
    – Paste the instance ID into the search box. You should now see only one instance listed.
    – Right click on that one instance and select Launch More Like This from the context menu.
    – Follow the steps in the Launch Instances Wizard to start the instance. In most cases, you will not change any of the values in the Wizard. Just make sure that you’ve selected a key pair for the instance you are about to launch; otherwise, you will not be able to open an SSH connection to it later.
    These steps are essential. You cannot create a viable Beanstalk AMI from a running Beanstalk instance, and you cannot create one from a standard EC2 instance. At this writing (January 5, 2012), you can only create a viable Beanstalk AMI from an EC2 instance running outside Beanstalk that was cloned from an EC2 instance running inside Beanstalk!
  3. Once the instance is up and running, open an SSH connection to the instance.
    – Find the public IP of the instance. If you click on the instance name in the EC2 tab’s list of instances, the public IP will appear in the lower pane. It usually looks something like this: ec2-50-22-133-61.compute-1.amazonaws.com
    – You’ll have to connect as ec2-user, so your command line will look something like this:
    ssh -i ~/.ssh/your_key ec2-user@ec2-50-22-133-61.compute-1.amazonaws.com
  4. Once you are connected, run the following commands to download the Oracle HotSpot JVM and set it as the default.

At this point, if you run java -version, you should see that your server is running the Sun/Oracle JVM.

Now you need to create an AMI from the instance you’ve just built.

  1. Find the instance in the Instances list under the EC2 tab.
  2. Right click on it, and choose Create Image. (Your SSH session will be terminated as part of the AMI creation process.)
  3. The AWS console will show you the ID of the image it is building. Copy this ID. You will need it in the next step.
  4. Once the image is complete, go back to the Elastic Beanstalk tab in the AWS Console.
  5. Find the environment where you want to use this new AMI. From that environment’s Action menu, choose Edit/Load Configuration.
  6. Paste the ID of your new AMI into the Custom AMI ID field.
  7. Click Apply.

You may have to restart the application or rebuild the environment to force the changes to take effect. Both options are available on the Action menu.

You can SSH into an EC2 instance within your Beanstalk environment and run java -version to see which JVM is running.

You can also create any number of new Beanstalk environments with this new AMI. Currently, when setting up a new Beanstalk environment, you can choose only the vanilla 32 or 64 bit Beanstalk AMI running either Tomcat 6 or Tomcat 7. You can then edit the environment’s configuration after it’s up and running, and follow steps 5-7 above to replace the default AMI with your custom AMI.

Debugging Clojure with JSwat

JSwat is a visual debugger for the JVM that lets you set break points in source files, step through code line by line, watch variables, and more.

This article describes some quick steps for setting up JSwat to work with Clojure.

  1. Download the latest JSwat from http://code.google.com/p/jswat/downloads/list . As of August 3, 2011, the latest version is 4.5. The full installer is jswat-4.5-installer.jar.
  2. Double-click the jar file and follow the instructions to install JSwat.
  3. If you are using leiningen, add the lines in red italics to your project.clj file.
    :jvm-opts [
    "-server" "-Xms128M" "-Xmx256M"
    ;; Use these options in development to allow debugging
    ;; with jswat on localhost port 9900
  4. Run your Clojure project. You should see a line like this as your project starts up:
    Listening for transport dt_socket at address: 9900
  5. Start JSwat. It should be in ~/jswat-4.5/bin/jswat
  6. From the Session menu, choose Attach.
  7. Set Connector to Attach by socket
  8. Set Transport to dt_socket
  9. Set Host to localhost
  10. Set Port to 9900
  11. Click OK.

JSwat should now be running and attached to your Clojure process.

Setting Breakpoints and Watching Variables

To set a breakpoint:

  1. From the File menu, choose Open File.
  2. Select a Clojure file and click OK.
  3. You should see the file in JSwat. However, the UI is buggy, at least on Mac OSX 10.6. If the text of the file does not appear, you should at least see a button with the name of the file you selected. If you don’t see the text of the file, double-click the button. This should cause the file editor to take up the entire window.
  4. Select Show Line Numbers from the View menu, so that line numbers appear in your source file. This feature is broken on Mac OSX 10.6, so read on below for a work-around.
  5. Right click on a line number and choose Set Breakpoint to set a break point on that line.
  6. If line numbers do not appear, move the cursor to the line where you want to set the break point, then choose Toggle Breakpoint from the Breakpoint menu.
  7. If the Breakpoints window is not showing, choose Window > Debugging > Breakpoints from the menu.
  8. You should see the Breakpoints Window beneath your file window. If you loaded the Clojure file your_file.clj and set a break point at line 214, you should see a line in the Breakpoints window that looks something like this:
    Line your_file.clj: 214

When you run your Clojure project, it will stop when it hits this break point. Unlike some other visual debuggers that highlight the source line when you hit a break point, the JSwat UI does not give much visual notice when you hit a break point (at least on OSX 10.6).

If you have the Call Stack window showing when you hit your break point, you will see the call stack spring into view. You’ll also see a brief message in the window’s status bar saying execution has stopped at the break point you set. This message disappears after a few seconds, and the editor shows your source file with a blinking cursor at the break point line.

At this point, you can step through code, and examine variables. To watch variables, choose Window > Debugging > Variables. The Variables window should appear at the bottom of the screen, near the Breakpoints window.

Right click any variable in the list and choose Add to Watches. If the Watches window is not showing, choose Window > Debugging > Watches.

From here, you can explore on your own.

Multiple Connections with clj-apache-http

I recently had to configure the Clojure clj-apache-http library to use multiple concurrent HTTP connections. clj-apache-http wraps the Apache HTTP client library. I’m working with version 2.2.0 of clj-apache-http, which wraps version 4.0.1 of the Apache library.

If you use the ThreadSafeClientConnManager, you get by default a pool of up to 20 connections, with a maximum of 2 connections per route. A route is essentially a single host. The Apache library adds connections to the pool only as needed, and will not exceed the maximum number of total connections or the maximum number of connections per route.

Our problem was that we needed to have more than two connections in the pool for a single route. For some routes, we need 20 connections in the pool. Fortunately, clj-apache-http allows you to specify your own connection manager. Here’s how you do it.

This gives you a pool of up to 24 connections, with a maximum of 8 connections per route. This does not guarantee that your pool will have 8 connections to any specific host, because the Apache library adds connections to the pool as needed. If you connect to 24 different hosts, each will have only one connection available in the pool.

In addition, the setDefaultMaxPerRoute method does not allow you to specify that you want 4 connections for one host and 10 connections for another host. The ConnPerRouteBean included in the next example addresses this issue.

Multiple Connections to a Single Host Using x509 Client Certificates

The following sample code shows how to create a connection pool with multiple connections to a single host using x509 client certificates.

I hope someone out there finds this useful.

JDBC MySQL Quirk Causes Out of Memory Exception

I was working on a project in which I had to populate some large SQL tables, when I ran into some unexpected behavior in the clojure.contrib.sql library. The behavior actually comes from the underlying MySQL JDBC driver.

This occurs on Mac OS X (10.6.7) using Clojure 1.2.0, clojure-contrib 1.2.0, mysql-connector-java 5.1.6, and MySQL Ver 14.14 Distrib 5.1.45, for apple-darwin10.2.0 (i386).

The task I was working on was to populate a SQL table with data from several other tables. The SQL statement to populate the table looks like this:

The important thing to note about this statement is that it does not return a result set. All results go directly into some_big_table. If you run an “insert into … select” statement in the MySQL console, you’ll see it just does the inserts and does not output rows of data to the console.

The Clojure function that executes this query looks like this:

The statement executed correctly, inserting about 8.5 million rows into the table. However, once the statement finished, the Clojure application’s memory usage went from 100 MB to over 1 GB, before finally crashing with an out of memory exception.

Obviously, the statement was returning some kind of data.

According to the documentation for the java.sql.Statement interface, executeBatch() returns an array of integers: “[executeBatch() submits] a batch of commands to the database for execution and if all commands execute successfully, returns an array of update counts.”

My guess is the JDBC driver for MySQL was returning an array of 8.5 million integers, with each integer having a value of 1. This batch consists of a single statement which actually performs 8.5 million inserts. It does not use a transaction (because of the large amount of data), so perhaps as each insert is committed individually, MySQL reports an update count of 1.

This was unexpected, because the MySQL console reports the result as a single insert of 8.5 million rows, rather than 8.5 million inserts of one row each.

The solution to the problem is to use the executeUpdate() method of the java.sql.Statement object. According to the documentation, this method “Executes the given SQL statement, which may be an INSERT, UPDATE, or DELETE statement or an SQL statement that returns nothing, such as an SQL DDL statement.”

Here’s the Clojure function that calls executeUpdate():

Calling sql-update-without-transaction fixes the problem. executeUpdate() returns a single update count instead of an array of counts, so running the SQL statement with this function does not use excessive memory.
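
If you are calling JDBC directly from Java rather than through clojure.contrib.sql, the same idea can be sketched roughly as follows. This is only an illustration: the class name UpdateWithoutBatch and the method runUpdate are hypothetical, and the code assumes you already have an open java.sql.Connection to your database.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper for illustration; not part of clojure.contrib.sql
// or of the MySQL JDBC driver.
final class UpdateWithoutBatch {

    // Runs one INSERT, UPDATE, DELETE, or DDL statement and returns the
    // single update count that Statement.executeUpdate() reports.
    static int runUpdate(Connection conn, String sql) throws SQLException {
        // try-with-resources closes the Statement even if the update fails.
        try (Statement stmt = conn.createStatement()) {
            return stmt.executeUpdate(sql);
        }
    }
}
```

Because executeUpdate() is declared to return a single int, the caller gets one number back for the whole “insert into … select” statement, whatever the driver does internally.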

Freeing Up Space from MySQL Logs

If you do a lot of application development with MySQL, you may find your hard disk frequently fills up. If MySQL logging is on (and it is by default), the MySQL logs are often the culprit. I find that these logs can grow by up to 20GB per month when I am working on Rails apps that have a lot of data.

You can free up that disk space by periodically running this command in the MySQL console:

purge binary logs before '2011-06-01';

Obviously, you should change the date to something current. Set it to any future date to get rid of all MySQL binary logs except the one currently in use.

Calculating Distance with SQL

The other day, I had to write a query to find hotels in a database that are more than a certain number of miles from a given location. The data was in a MySQL database that did not have GIS extensions installed, so I had to write SQL to calculate approximate distances.

I found a helpful post from Jaime Rios explaining the Haversine Formula. MySQL provides the necessary functions to translate Jaime’s work into SQL. The Haversine Formula in SQL looks like this:

In this query, hotels are in the hotels table, and locations are in the locations table. Locations tend to be cities or towns, with latitude and longitude coordinates fixed at some central landmark, such as Times Square in New York, or Dupont Circle in Washington, DC.

In this calculation, the constant 3960 is the approximate average radius of the earth, in miles. If you were calculating distance in kilometers, you would use 6371 instead of 3960.

Note that this calculation gives you a good approximate distance, not an exact distance.
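
If you want to sanity-check the numbers outside the database, here is a minimal sketch of the same Haversine calculation in Java. It is not the query from this post: the class name Haversine, the method miles, and the landmark coordinates in main are my own assumptions for illustration, and only the 3960-mile radius comes from the text above.

```java
// Hypothetical helper for illustration; not taken from the SQL query in this post.
final class Haversine {

    // Approximate average radius of the earth in miles (use 6371 for kilometers).
    static final double EARTH_RADIUS_MILES = 3960.0;

    // Great-circle distance in miles between two points given in decimal degrees.
    static double miles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.pow(Math.sin(dLon / 2), 2);
        // Clamp to 1.0 so rounding error cannot push asin() out of its domain.
        return EARTH_RADIUS_MILES * 2 * Math.asin(Math.sqrt(Math.min(1.0, a)));
    }

    public static void main(String[] args) {
        // Assumed, approximate coordinates for Times Square and Dupont Circle.
        double d = miles(40.7580, -73.9855, 38.9097, -77.0434);
        System.out.printf("%.1f miles%n", d);
    }
}
```

Comparing a few results from this helper against the SQL output for the same pairs of coordinates is a quick way to catch mistakes such as forgetting to convert degrees to radians.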