Comprehensible Code

Think of a code base that you hack on; one that changes frequently. Ask yourself:

  • How often does it change?
  • Are there particular types of change that recur?
  • What is the cost of making each change?

At uSwitch we deal with lots of energy suppliers. Each sends us a number of different reports, in various formats, containing data that we need to parse and analyse. We automate this process as much as we can, but the structure of the reports often changes, requiring frequent modifications to the parsing code.

After asking ourselves the three questions above, we accepted that having to change our parsing code often was inevitable, and we set out to create a parsing system that would make the cost of change as cheap as possible.

The parsing code that we started with typically looked like the following:

(def column-spec
  {:A :switch_date
   :B :uswitch_urn
   :C :fuel
   :D :mpan
   :E :mprn
   :F :status
   :G :dropout_reason
   :H :received_date
   :I :dropout_date
   :J :first_name
   :K :surname
   :L :postcode
   :M :email
   :N :contact_number})

(defn read-sheet
  [sheet]
  (let [[header & updates] (spr/row-seq sheet)
        column-map (excel/column-map column-spec header)]
    (->> (excel/select-columns column-map updates)
         (map replace-missing-columns)
         (number-rows 1)
         (filter has-data?)
         (map #(transform-vals % row-transforms))
         (map #(s/validate CommissionStatement %))
         (map-assoc :urn extract-urn)
         (map-assoc :table-name (constantly report-table)))))

Sure, it's functional, it's quite concise, and it's even fairly easy to work out what it's doing if you read it a couple of times. So what's the problem?

Getting to what matters

The problem is that this code tells us very little about the data that it's intended to parse.

  • Which columns are we interested in?
  • What data types do we expect?
  • Can a column contain blanks?
  • Are there any edge cases?

These are the questions that we have to answer every time we come to change this code. What we need is a way to express this information directly, so that making a change happens at the level of these questions.

Write what you mean; no more

To parse a field, for example, we might state the field name, the column heading under which the field can be found, and the data type that we expect to parse.

{:switch-date {:column "Switch Date"  :parse date}}

We can easily extend this scheme to support reports that use different data types within the same column by using an ordered collection of parsers. The order reflects our preference for which data type to try first.

{:switch-date {:column "Switch Date"  :parse [date string]}}

Parsers are tried in succession until a value is successfully parsed, or until all parsers have been exhausted, at which point we fail with an error.
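
The mechanics behind such a chain are small. As a rough sketch, assuming each parser is an ordinary function that returns a parsed value on success and throws on failure (attempt and parse-field are illustrative names, not necessarily our production code):

(defn- attempt [parser raw]
  ;; Run a single parser; yield {:ok value} on success, nil on failure.
  (try {:ok (parser raw)}
       (catch Exception _ nil)))

(defn parse-field [parsers raw]
  ;; Try each parser in turn and return the first successful value,
  ;; failing with an error once every parser has been exhausted.
  (if-let [result (some #(attempt % raw) parsers)]
    (:ok result)
    (throw (ex-info "No parser accepted value" {:value raw}))))

Wrapping successes in {:ok value} means a parser that legitimately produces nil, such as blank below, still counts as a success.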

This also provides us with an elegant way of expressing that a field can optionally be empty.

{:switch-date {:column "Switch Date"  :parse [date string blank]}}
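The individual parsers can be equally small. These are illustrative versions of date, string and blank under the same convention; the date format shown is an assumption, not something dictated by the reports:

(defn date [raw]
  ;; Succeeds for cells holding a dd/MM/yyyy date (format assumed for illustration).
  (.parse (doto (java.text.SimpleDateFormat. "dd/MM/yyyy")
            (.setLenient false))
          (str raw)))

(defn string [raw]
  ;; Succeeds for any non-empty cell, yielding its trimmed text.
  (let [s (.trim (str raw))]
    (if (seq s)
      s
      (throw (ex-info "Empty cell" {:value raw})))))

(defn blank [raw]
  ;; Succeeds only for an empty cell, yielding nil.
  (if (or (nil? raw) (empty? (.trim (str raw))))
    nil
    (throw (ex-info "Cell is not blank" {:value raw}))))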

Similarly, we can support changing column headings across reports by using a collection of column names. In this case, order is not significant, so we use a set.

{:switch-date {:column #{"Switch Date" "Switched At"}  :parse [date string blank]}}
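Matching a field against its column is then a question of scanning the header row for the first cell whose text appears in the set. A sketch, assuming the header row is a sequence of cell values (find-column is an illustrative name):

(defn find-column [header-row column-names]
  ;; Return the index of the first header cell whose trimmed text is one
  ;; of the accepted column names, or nil when no header matches.
  (first (keep-indexed (fn [idx cell]
                         (when (contains? column-names (.trim (str cell)))
                           idx))
                       header-row)))

;; (find-column ["URN" "Switched At" "Status"] #{"Switch Date" "Switched At"}) ;=> 1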

Combining these field declarations into a single structure gives us a recipe for parsing a whole report. This particular report is in spreadsheet format, so we use a regular expression to identify the worksheet by name.

{:name :commission-backing
 :worksheet #"(?i)Products"
 :fields {:urn          {:column #{"URN"}                 :parse [string]}
          :sales-date   {:column #{"Sales Date"}          :parse [required-date]}
          :title        {:column #{"Title"}               :parse [string blank]}
          :first-name   {:column #{"First Name"}          :parse [string blank]}
          :surname      {:column #{"Surname"}             :parse [string blank]}
          :service      {:column #{"Service"}             :parse [string]}
          :mpan         {:column #{"MPAN"}                :parse [numeric-string blank]}
          :mprn         {:column #{"MPRN"}                :parse [numeric-string blank]}
          :status       {:column #{"Status"}              :parse [string]}
          :reason       {:column #{"Cancellation Reason"} :parse [string blank]}
          :go-live-date {:column #{"Go Live Date"}        :parse [required-date]}}}
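
Parsing a whole row from the :fields map follows directly from the pieces above. A sketch, reusing the illustrative find-column and parse-field from earlier (parse-row is also an illustrative name):

(defn parse-row [fields header-row row]
  ;; Build a map of field name to parsed value by locating each field's
  ;; column in the header and running its parser chain over that cell.
  (let [cells (vec row)]
    (into {}
          (for [[field {:keys [column parse]}] fields
                :let [idx (find-column header-row column)]]
            [field (parse-field parse (when idx (nth cells idx nil)))]))))

A missing column simply yields a nil cell, which only a blank parser will accept, so required fields fail loudly rather than silently disappearing.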

Since we also have to sanitise some of the reports that we deal with, we added the ability to perform row filtering and arbitrary transforms in our parsing recipes. We also allow recipes to be specified in sets, with one entry per worksheet. This is an example of what our new report parsing code looks like.

(defn- report-spec [report-id report-table]
  #{{:name :commission-backing
     :worksheet #"(?i)Products"
     :fields {:urn          {:column #{"URN"}                 :parse [string]}
              :sales-date   {:column #{"Sales Date"}          :parse [required-date]}
              :title        {:column #{"Title"}               :parse [string blank]}
              :first-name   {:column #{"First Name"}          :parse [string blank]}
              :surname      {:column #{"Surname"}             :parse [string blank]}
              :service      {:column #{"Service"}             :parse [string]}
              :mpan         {:column #{"MPAN"}                :parse [numeric-string blank]}
              :mprn         {:column #{"MPRN"}                :parse [numeric-string blank]}
              :status       {:column #{"Status"}              :parse [string]}
              :reason       {:column #{"Cancellation Reason"} :parse [string blank]}
              :go-live-date {:column #{"Go Live Date"}        :parse [date]}}
     :filter remove-title-row
     :transform (comp
                 (add-fields {:report-id  report-id :table-name report-table})
                 add-id-from-hashed-row-contents)}})
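
The filter and transform helpers referenced above aren't shown here, but they are just ordinary functions over row maps. Rough sketches of what they might look like, assuming :filter is a per-row predicate and :transform receives one row map at a time (the hashing detail is an assumption):

(defn add-fields [extra]
  ;; Return a transform that merges a fixed set of keys into every row.
  (fn [row] (merge row extra)))

(defn add-id-from-hashed-row-contents [row]
  ;; Derive a deterministic :id by hashing the row's contents.
  (assoc row :id (hash row)))

(defn remove-title-row [row]
  ;; Keep only rows that look like data; drop decorative title rows.
  (some? (:urn row)))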

Implementing the code that takes these specifications and uses them to parse a spreadsheet was a fairly simple task that only had to be done once. It was also easy to extend the parsing to CSVs, simply by having an alternate implementation. All of the parsers, which extract the specific value types, are shared between implementations.
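
As an illustration of how the spreadsheet and CSV implementations might share everything beyond the reading of rows, here is a sketch building on the earlier parse-row, with clojure.data.csv as one possible CSV front end (parse-rows and parse-csv-report are illustrative names):

(require '[clojure.data.csv :as csv])  ; org.clojure/data.csv

(defn parse-rows [recipe [header-row & data-rows]]
  ;; Source-agnostic core: given a recipe and a header row followed by
  ;; data rows, apply the field specs, the row filter and the transform.
  (let [{fields :fields row-filter :filter transform :transform
         :or {row-filter (constantly true) transform identity}} recipe]
    (->> data-rows
         (map #(parse-row fields header-row %))
         (filter row-filter)
         (map transform))))

(defn parse-csv-report [recipe reader]
  ;; CSV front end: read the rows and hand them to the shared core; a
  ;; spreadsheet front end does the same with rows from the worksheet.
  (parse-rows recipe (csv/read-csv reader)))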

Profit

The result is that we now have descriptive and readable report parsers that can be understood at a glance and, most importantly, can be quickly and easily changed. This has transformed the tasks of supporting new types of reports, or tweaking existing reports, from a frustrating time sink to a simple undertaking. It has also greatly increased our confidence in our parsing code for reasons best described by Tony Hoare:

There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies, and the other is to make it so complicated that there are no obvious deficiencies.

Time, of course, is money, and we now have more of it to spend doing other, more exciting things, knowing that we can react to everyday report changes quickly and with confidence.

Remember

  • Find out what changes most frequently
  • Optimise to make those changes as trivial as possible
  • Spend your time working on things that bring real benefit