The Julia programming language

22 February 2019

Peeved with Python? Revolted by R? SAS make you sad? The Julia Language may be for you. Recently reaching version 1.0, Julia claims to be more than just another data science language.

In this post I’ll give a tour of some of the more interesting features of Julia, and its implementation.

Basics

Let’s start with the usual:

println("Greetings! 你好! 안녕하세요?")

So far, so ordinary, though it is worth noting the native Unicode support. Programs such as this can be placed in a script (standard extension .jl) or entered at the rather nice interactive prompt, or REPL. Julia seems to encourage use of the REPL, probably partly as a result of its Lisp influence. Jupyter notebooks also support Julia.

Here’s an example of a real Julia script I wrote last year. It processes a mixture of output from the Unix time command and some graph information, then outputs the data in CSV format.

using Glob

# Collect every benchmark output file in the current directory.
files = glob("*.out*")

# File names look like <dataset>.txt<threads>.out<extra>.
name_r = r"(.*)\.txt(\d+)\.out(.*)"
# Patterns for the graph statistics and the lines printed by time.
nodes_r = r"NODES: (\d+)"
edges_r = r"EDGES: (\d+)"
real_r = r"real\t(.*)"
user_r = r"user\t(.*)"

println("file,nodes,edges,threads,real,user,extra")
for fn in files
    (dir, name) = splitdir(fn)
    name_m = match(name_r, name)
    dataname = name_m[1]
    threads = name_m[2]
    extra = name_m[3]
    # Defaults, reported when a file lacks the corresponding line.
    real = 0.0
    user = 0.0
    nodes = 0
    edges = 0
    open(fn) do f
        for ln in eachline(f)
            real_m = match(real_r, ln)
            user_m = match(user_r, ln)
            nodes_m = match(nodes_r, ln)
            edges_m = match(edges_r, ln)

            # match returns nothing when the pattern does not apply.
            if real_m !== nothing
                real = real_m[1]
            elseif user_m !== nothing
                user = user_m[1]
            elseif nodes_m !== nothing
                nodes = nodes_m[1]
            elseif edges_m !== nothing
                edges = edges_m[1]
            end
        end
    end
    println("$(dataname),$(nodes),$(edges),$(threads),$(real),$(user),$(extra)")
end

Overall, the syntax is quite Pythonic, with a pinch of Ruby mixed in. There are no classes; functions/methods live separately from data.

There are a few interesting things to note here. Firstly, we import the Glob module using the using keyword. This puts into the global namespace all the names in Glob that have been exported with the export keyword. If this would create a namespace clash, import can be used instead:

import Glob
files = Glob.glob("*.out*")
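To try the two styles side by side without installing anything, here is a minimal sketch using Statistics, a standard library module that I have swapped in for Glob purely for illustration.

```julia
# Statistics is a standard library module, used here in place of Glob
# purely for illustration.
import Statistics

# With import, names must be qualified with the module name.
avg = Statistics.mean([1, 2, 3, 4])

# With using, exported names such as median are available unqualified.
using Statistics
med = median([1, 3, 2])

println("mean = $(avg), median = $(med)")
```

The qualified form is more verbose, but it makes it obvious where each function comes from.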

Packages

Glob is not part of the Julia standard library; it is a third-party package. Like every other modish programming language, Julia has its own packaging facility. Unlike other modish programming languages, packages are generally installed from within the REPL, using its built-in package mode. We can install the Glob package as follows.

pkg> add Glob # We can open the package REPL by pressing "]" in the ordinary Julia REPL.

There are many packages available, both from the official repository and installable through git. Some packages are considered to be state-of-the-art for machine learning and scientific computing.

Regex

No scripting language would be complete without support for regular expressions. Fortunately, Julia has nice regex support built into its base library. Using it felt a lot like doing the same thing in Python, although I found if real_m !== nothing upsetting to type. This is due to Julia’s rather strict truthiness, where only actual boolean values can be used in boolean expressions: only true is true; only false can be false. Note that the first capture group is at index 1. As we shall see, this is because Julia indexes from 1, not because index 0 refers to the entire matched string.
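The behaviour described above can be sketched in a few lines. The pattern is the one from the script; the input strings are made up for illustration.

```julia
nodes_r = r"NODES: (\d+)"

# A successful match returns a RegexMatch; capture groups start at index 1.
m = match(nodes_r, "NODES: 42")
whole = m.match            # the entire matched text is a field, not index 0
nodes = parse(Int, m[1])   # captures are strings, so convert explicitly

# A failed match returns nothing, which has to be tested for explicitly.
miss = match(nodes_r, "EDGES: 7")
missed = miss === nothing

# Using a non-Bool value as a condition is an error rather than "truthy".
strict = try
    m ? true : false
catch err
    err isa TypeError
end

println("whole = $(whole), nodes = $(nodes), missed = $(missed), strict = $(strict)")
```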

Arrays

Arrays are probably the most important part of any data science tool with high-performance pretensions. Julia has first-class support for one-dimensional and multidimensional arrays, without recourse to an external Fortran library for performance.

Beware that arrays are 1-indexed in Julia, as in other mathematical programming languages.
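A quick sketch of what 1-indexing looks like in practice; the values are arbitrary.

```julia
v = [10, 20, 30]
first_item = v[1]      # indexing starts at 1
last_item = v[end]     # end is the last valid index

# Spaces separate columns and semicolons separate rows.
m = [1 2 3; 4 5 6]
dims = size(m)         # (rows, columns)
corner = m[2, 3]       # row 2, column 3

# Index 0 is out of bounds rather than "the first element".
zero_is_error = try
    v[0]
    false
catch err
    err isa BoundsError
end

println("first = $(first_item), last = $(last_item), dims = $(dims), corner = $(corner)")
```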

Julia has vectorised ‘dot’ operators for operating on arrays, in addition to overloaded arithmetic operators:

julia> a / b
4×4 Array{Float64,2}:
 0.0196078  0.0588235  0.0784314  0.0980392
 0.0392157  0.117647   0.156863   0.196078
 0.0588235  0.176471   0.235294   0.294118
 0.0784314  0.235294   0.313725   0.392157

julia> a ./ b
4-element Array{Float64,1}:
 1.0
 0.6666666666666666
 0.75
 0.8
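The definitions of a and b are not shown above, so the sketch below assumes two four-element vectors whose values are consistent with that output. It illustrates the difference between the two operators: the dotted form works element by element, while the plain / on two vectors is a linear-algebra right division that yields a matrix.

```julia
using LinearAlgebra

# Assumed inputs: the original definitions of a and b are not shown,
# but these values are consistent with the output above.
a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 3.0, 4.0, 5.0]

elementwise = a ./ b   # divides matching elements: a[i] / b[i]
solved = a / b         # right division: a 4×4 matrix X with X * b ≈ a

# The dot syntax applies to any operator or function, not just division.
sums = a .+ b
roots = sqrt.(a)

println(elementwise)
```

Forgetting the dot is an easy mistake to make, and as the example shows it does not always produce an error.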

Arrays are strongly typed and specialised, allowing generation of fast code. The standard library contains many array and matrix operations.

Data science

The DataFrames.jl package provides something equivalent to pandas, data.frame or tibbles. The company behind Julia has created JuliaDB as an alternative with some extra features and baked-in distributed parallelism.

Several graphing libraries are available. My personal favourite is Gadfly.jl, essentially a port of R’s ggplot2.

Parallelism and Concurrency

SIMD vectorisation is supported and should usually happen automatically. Coroutines are available with a channel-based communications API, allowing lightweight pseudo-concurrency. There is fairly mature support for distributed memory parallelism.
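As a rough illustration of the coroutine and channel API, here is a minimal producer and consumer sketch. The buffer size and the values being sent are arbitrary choices of mine.

```julia
# A buffered channel that carries Ints between tasks.
ch = Channel{Int}(4)

# The producer runs as a coroutine (a Task) and yields whenever the
# channel buffer is full.
producer = @async begin
    for i in 1:5
        put!(ch, i^2)
    end
    close(ch)   # signals to the consumer that no more values will arrive
end

# Iterating over a channel takes values until it has been closed.
results = collect(ch)
println(results)
```

Both tasks are scheduled cooperatively on the same thread, which is why this is concurrency rather than parallelism.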

Traditional multithreading support is lacking. There is an experimental threading interface, including some nice macros.
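For a flavour of the threading macros, here is a minimal sketch assuming the experimental Base.Threads interface. Julia needs to be started with more than one thread (for example by setting the JULIA_NUM_THREADS environment variable) for the loop to actually run in parallel; with a single thread it still runs, just serially.

```julia
using Base.Threads

squares = zeros(Int, 8)

# Iterations are split between the available threads. Each iteration
# writes to its own slot, so no locking is needed.
@threads for i in 1:8
    squares[i] = i^2
end

println("threads: $(nthreads()), squares: $(squares)")
```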

Lispiness

Julia may be lacking in parentheses, but it does have a strong Lisp heritage. In particular, Julia has support for “Lisp-style” macros. These macros are functions executed at parse-time which take actual code as their arguments, rather than the values that the code evaluates to, and are expanded into code when called. In the right hands, the macro facility is a powerful tool for automating cumbersome code generation or extending the syntax of the language; for example, Julia’s implementation of printf is a macro. These macros are also hygienic, i.e. they don’t cause name collisions after expansion (unless you really want them to).
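To make this concrete, here is a small sketch of a macro. The name show_expr is hypothetical, invented for this example. It receives the code itself, so it can turn the expression into text as well as evaluate it, and hygiene keeps its internal variable from clashing with the caller's variable of the same name.

```julia
# show_expr is a hypothetical macro written for this example.
macro show_expr(ex)
    text = string(ex)   # runs at expansion time and sees the code, not its value
    quote
        # Hygiene renames this internal `value`, so it cannot clash with the
        # caller's variable; esc(ex) evaluates ex in the caller's scope.
        value = $(esc(ex))
        string($text, " = ", value)
    end
end

value = 10
result = @show_expr value + 1
println(result)
```

An ordinary function could not do this, because it would only ever see the number 11 and not the expression that produced it.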

However, powerful macro systems can be a double-edged sword. Making it too easy to extend the syntax of the language allows you to create programs which are written in a programming language essentially unique to you.

Another Lispy feature in Julia is multiple dispatch (known as multimethods in Lisp-land), where calls to functions with the same name but different argument types are resolved dynamically, without any hacking on the part of the programmer. As an example, consider the problem of calculating whether different shapes intersect:

intersect(a::Square, b::Square) = … # (1)
intersect(a::Square, b::Circle) = … # (2)
intersect(a::Rectangle, b::Circle) = … # (3)

rect = Rectangle()
circle = Circle()
intersect(rect, circle) # calls (3)
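A runnable version of the same idea might look like the sketch below. The type definitions, field names and geometry are my own assumptions for illustration, and the function is called intersects so that it does not collide with the intersect function that Julia already provides for sets.

```julia
# Hypothetical shape types: centre coordinates plus a size.
abstract type Shape end

struct Circle <: Shape
    x::Float64
    y::Float64
    r::Float64
end

struct Square <: Shape   # axis-aligned
    x::Float64
    y::Float64
    side::Float64
end

# One method per combination of argument types.
intersects(a::Circle, b::Circle) = hypot(a.x - b.x, a.y - b.y) <= a.r + b.r

function intersects(a::Square, b::Square)
    reach = (a.side + b.side) / 2
    return abs(a.x - b.x) <= reach && abs(a.y - b.y) <= reach
end

function intersects(s::Square, c::Circle)
    half = s.side / 2
    # Nearest point of the square to the circle's centre.
    nx = clamp(c.x, s.x - half, s.x + half)
    ny = clamp(c.y, s.y - half, s.y + half)
    return hypot(c.x - nx, c.y - ny) <= c.r
end

# The reversed argument order simply reuses the method above.
intersects(c::Circle, s::Square) = intersects(s, c)

# The method is chosen from the runtime types of both arguments.
println(intersects(Circle(1.5, 0, 1), Square(0, 0, 2)))
```

Adding a new shape means adding new methods, without touching the existing ones or writing any type-checking branches by hand.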

It is not always apparent how multiple dispatch is useful, but it is one of those things that you will miss once you know about it and encounter a fitting problem.

JIT Compilation

Julia claims to be implemented as a just-in-time (JIT) compiler, which is technically correct (the best kind). Typically, a JIT compiler will interpret a program (either as parsed source code or as bytecode in a VM) and then compile commonly executed functions or loops to machine code. This also allows optimisations based on runtime information and behaviour. Julia, on the other hand, compiles programs, and their dependencies, immediately before they are executed. There is no interpreter or virtual machine, and there are no runtime optimisations. The inevitable result of this is very noticeable latency when starting scripts and when entering REPL and Jupyter notebook commands. Programs need to be recompiled after the Julia process terminates.
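The latency is easy to observe for yourself. The sketch below uses a made-up function, sum_squares, and times repeated calls; a function is compiled the first time it is called with a given combination of argument types, so I would expect the first call for each type to be the slow one. The actual timings depend on the machine and Julia version, so none are quoted here.

```julia
# sum_squares is a hypothetical function used only to show compile latency.
sum_squares(xs) = sum(x -> x^2, xs)

ints = [1, 2, 3]
floats = [1.0, 2.0, 3.0]

first_call = @elapsed sum_squares(ints)     # includes compiling for Vector{Int}
second_call = @elapsed sum_squares(ints)    # reuses the compiled code
float_call = @elapsed sum_squares(floats)   # new argument type, compiled again

println("first: $(first_call)s, second: $(second_call)s, floats: $(float_call)s")
```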

Thankfully, it is possible to precompile Julia programs to binaries ahead of time, but a proper JIT-compiling interpreter would be very welcome.

Performance

If you put your trust in microbenchmarks, Julia is remarkably fast. According to the benchmarks on the Julia website, and the Computer Language Benchmarks Game, Julia is approaching C speed and is often faster than Java, except when shared-memory parallelism is warranted. Obviously, these benchmarks do not tell the whole story - the start-up cost for the “JIT” compiler is massive (it is not clear how or if this is accounted for in the benchmarks), and I would particularly like to see benchmarks of memory or I/O bound problems - but for a language like this to perform that well is very impressive.

If performance really isn’t good enough, Julia has support for calling C-style functions in shared libraries with the built-in ccall.
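As a minimal sketch of what a foreign call looks like, the following calls strlen from the C standard library. It assumes a Unix-like system where the C library's symbols are visible to the Julia process; the string is just an example.

```julia
# Arguments to ccall: the C function name, its return type, a tuple of
# argument types, and then the arguments themselves.
len = ccall(:strlen, Csize_t, (Cstring,), "Greetings!")

println(len)
```

No wrapper code or build step is needed; the type annotations tell Julia how to convert the arguments and the result.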

Summary

Julia is an interesting, fast language, built for mathematical programming and data science. If you need performance, but don’t want to write or interface with C or Fortran, or if you really like the idea of multiple dispatch and homoiconic macros, it could be a good choice. If you’re a satisfied R or Python programmer, there may not be obvious benefits.

It’s always difficult to tell if a new or fashionable programming language is going to succeed or not - Haskell has been avoiding success at all costs for a long time now. Julia has enough interesting features and performance to avoid being yet another data science language, but the real test will be how the ecosystem and community evolve, and if there are enough people who still haven’t found a Goldilocks programming language in R or Python.