Row Polymorphic Programming
Sometimes, especially in business logic, we have to deal with data that comes in from the real world, or elsewhere, naturally very messily typed, leading to nasty type signatures and tangles of macros or code generation to build data structures from a schema. Row polymorphism can help us by abstracting over the fields contained in a record type in the type signature, letting us define records based on data, concatenate records together, and define functions that are generic across any record containing the required fields, all without macros.
This article is inspired by my work on implementing the Brown Benchmark for Table Types1 for my Structures2 library, an update for which will be released soon, providing a much more comprehensive collection of types for row-polymorphic programming.
What is Row Polymorphism?
At time of writing, the Wikipedia page for row polymorphism starts with:
In programming language type theory, row polymorphism is a kind of polymorphism that allows one to write programs that are polymorphic on row types such as record types and polymorphic variants.
leading into some text dense with mathematical notation that doesn't do much to explain what any of this looks like. And what the hell does "polymorphic on row types" even mean?
Consider a basic `record` in Idris as an example:
```idris
record ImACat where
  constructor MkCat
  name : String
  age : Nat
  meows_a_lot : Bool
```
This certainly seems like it might fit the bill: it says `record` right there, and the Wikipedia snippet did mention records. But what are these "row types"?
Row Types
"Row Type" and "Record Type" are sometimes used interchangeably in this corner of type theory. While "record" may be a somewhat more intuitive name for someone approaching this from the perspective of many popular programming languages, we will mostly be using the name "Row Type", as it will make more sense in the tabular data context that will serve as framing for this article.
Imagine some tabular data, say a table in a database or a spreadsheet. Tabular data is composed of a 2D grid of rows and columns, typically with each column having a name and potentially a type constraint on the values in that column. Consider this example:
| name → String | age → Nat | meows_a_lot → Bool |
|---|---|---|
| Cricket | 11 | False |
| Cookie-Dough | 4 | True |
| Mango | 0 | False |
| Creampuff | 5 | True |
"Row Types" are constructions for describing individual rows in such a structure3. They are parameterized by a list of `(label, Type)` pairs, represented by the header of the table and referred to as a schema. We can see that the labels and types in our Cats table are identical to the fields of our `ImACat` record from earlier. As such, each row in the table can be thought of as an instance of the `ImACat` record, and the table itself as a list of such records.
That dependence on a schema, a piece of data describing the structure of the type, makes row types a bit more flexible, as well as a bit more complicated. We can perform the usual operations, like accessing a given column in a row by name, analogous to field access like calling `name` on a value of `ImACat`. More interestingly, we can also perform new operations that enable a sort of limited, type-safe reflection over row types, such as selecting a specific set of columns from a row to produce a new row type, or concatenating the schemas of two rows to create a new type expressing the result of joining the rows.
The Mysterious Row Polymorphism
Imagine you are helping your friend move and they need to figure out what furniture will fit in what rooms. The previous occupant of their new place has graciously provided a table containing the length and width of each room, and their furniture supplier has provided a table detailing the lengths, widths, and heights of the bounding cuboids for each piece of furniture:
| room → String | length → Double | width → Double |
|---|---|---|
| Bedroom | 3.65 | 3.35 |
| Living Room | 4.87 | 3.35 |
| Dining Room | 2.44 | 2.44 |
| Kitchen | 3.35 | 2.44 |
| Bathroom | 3.65 | 2.44 |
| item → String | length → Double | width → Double | height → Double |
|---|---|---|---|
| Dresser | 1.00 | 0.60 | 0.76 |
| Bed | 2.00 | 1.50 | 0.50 |
| Nightstand | 0.60 | 0.48 | 0.75 |
| Dinner Table | 1.80 | 0.90 | 0.80 |
| TV Stand | 1.93 | 0.42 | 0.90 |
| Coffee Table | 1.81 | 0.81 | 0.46 |
One thing that's going to be critical to determine is how much floor space is left in a room after the furniture's been placed in it. After all, a room isn't of much use if there's not even enough free floor space to stand in it. You notice that both of these tables contain the information needed to calculate the available area and the footprints of the furniture: the `length` and `width` columns.
It's tempting to define a `Room` data type and a `Furniture` data type containing each table's respective columns as fields, and describe the area/footprint via an `Area` interface. However, this approach has several drawbacks. It requires implementing the interface manually for each data type4, can hide certain categories of error, such as inadvertently using the wrong field to calculate the area, and is potentially quite fragile if the structure of the tables changes on us or we need to add more tables to our system.
We can do better. With our parameterized row types we can write an `area` function that is defined to work on any row type that has a `length` and a `width` column of type `Double`. This is the essence of row polymorphism: writing code that's generic not over what interfaces a type implements, but over what fields the type contains.
Pros and Cons
While row polymorphism does come with the relative downside of losing proof-of-intentionality5, so to speak, it also comes with some major upsides. Two columns in two different tables having the same names and types doesn't always mean that they are semantically equivalent, but it's often a pretty good bet, especially when they come from related sources. More importantly, thinking about data types in terms of what values they contain, rather than what behaviors they implement, is often a much better way to think about messy real-world data that was laid out with a dynamically typed language in mind on a good day, and not laid out with computers in mind at all on a bad day.
Like so many things in programming, it's a trade-off. Low-level systems programming, where almost all the data you are working with was generated by the computer, often by systems you also control, isn't going to get much benefit from row polymorphic programming. Other tasks, like frontend work, business logic, and video games, quite frequently deal with extremely messy data that has only a loosely defined shape and doesn't want to fit neatly into a type hierarchy, leaving you with a lot to gain from row polymorphism. In the former category, you are more likely to hurt from the loss of semantic guarantees and end up needlessly weakening your type system and tripping over yourself more than you would without it. In the messier situation, you are instead likely to save time fighting with the type system trying to fit your data to it, and also run into the downsides of row polymorphism much less frequently.
Row polymorphism is also a fantastic way to think about assigning types to code that was originally written for a dynamically typed programming language6. In my personal experience I have found that when an experienced programmer who really seems like they should know better is making some weird sounding arguments about how type systems decrease their productivity and don't prevent enough bugs to be worth it, they seem to usually be complaining about the lack of row polymorphism (or the closely related structural subtyping) in popular statically typed languages, just without having the technical vocabulary for it.
To Type a Table
One of the core features of row polymorphism is that the labels and types of the columns are determined by a piece of data: a list-like structure containing pairs of labels and types. In languages built specifically to be row polymorphic, this data describing the column structure is often implicit. Idris, however, isn't built to be row polymorphic. Its type system just happens to be powerful enough that we can build it on our own, so we will have to build an explicit schema composed of our own pairs of labels and types.
A more developed library would use bespoke types for most of these components, but for the sake of keeping things concise, we'll use regular ol' lists as much as possible.
Building the Schema
To make things just a little bit cleaner, we'll define our own data type for the pairs, which we will call `Header`, and an infix constructor for it, `:>`:
Note
Remember that the Idris concept called a `record` is a distinct concept; Idris `record`s are not the row types we are going to be building.
```idris
public export
record Header where
  constructor MkHeader
  label : String
  type : Type

export infixr 8 :>

public export
(:>) : String -> Type -> Header
(:>) str x = MkHeader str x
```
For this article, we will keep the schema simple and just use a basic list of headers:
```idris
public export
Schema : Type
Schema = List Header
```
We will also need a way to compute the column types from our schema, turning a list of headers into a list of types. This amounts to mapping the `type` accessor over our list of `Header`s, but we will implement this with a bespoke function to avoid negative interactions with proof search resulting from the standard library's implementation of `map`:
Tip
This function has an UpperCamelCase name because it is intended to be called in type signatures. This is both conventional and reduces ambiguity, as it can't be confused with a lowercase, implicitly defined argument name.
```idris
public export
ColumnTypes : Schema -> List Type
ColumnTypes [] = []
ColumnTypes (h :: hs) = h.type :: ColumnTypes hs
```
Making a Row
A row is a list of the same length as the schema where every list element has a type determined by the header in the same position of the schema:
Tip
`HList` is a standard library type with type constructor `List Type -> Type`. Each type in the list of types determines the type of value in that position of the list. Notice this also constrains the length of the list: it must be exactly the same as the length of the list of types. `HList` can be thought of as a more primitive version of the `Row` type we are building.
```idris
public export
Row : Schema -> Type
Row hs = HList (ColumnTypes hs)
```
We will also want a way of indexing the values in a row by their columns, instead of by their positions. We only want this function to be callable when the row actually has that column, so we also require an inclusion proof as an auto-implicit, letting the compiler generate it for us when possible:
Tip
A value of type `Elem x xs` is a proof that the value `x` is included in the list `xs`, which also encodes its position in the list. The type signature of the `prf` argument thus reads "`h` is an element of `hs` with known position in `hs`".
```idris
namespace RowIndex
  export
  index : {hs : Schema} -> (h : Header) -> (r : Row hs) -> {auto prf : h `Elem` hs} -> h.type
  index {hs = (_ :: hs)} h (x :: xs) {prf = Here} = x
  index {hs = (_ :: hs)} h (y :: ys) {prf = (There x)} = index h ys
```
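As a quick sanity check, we can translate the first row of the Cats table from earlier into a `Row` and index it by column. This is a sketch with names of my own choosing (`Cats`, `cricket`, `cricketName` don't appear elsewhere in this article); assuming proof search cooperates, the `Elem` proof for the lookup is found automatically:

```idris
-- The schema matching the Cats table's header
Cats : Schema
Cats = ["name" :> String, "age" :> Nat, "meows_a_lot" :> Bool]

-- Row Cats reduces to HList [String, Nat, Bool]
cricket : Row Cats
cricket = ["Cricket", 11, False]

-- Index by column rather than position; the inclusion proof
-- for "name" :> String is filled in by proof search
cricketName : String
cricketName = ("name" :> String) `index` cricket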
Making a Table
A table is a list of rows all sharing the same schema, and, with our new tools, can easily be defined as such:
```idris
public export
Table : Schema -> Type
Table hs = List (Row hs)
```
Using our Tables
Now that we have a way to describe the type signature and structure of a table, let's go ahead and translate our previous table examples into code:
```idris
Rooms : Schema
Rooms = ["room" :> String, "length" :> Double, "width" :> Double]

rooms : Table Rooms
rooms =
  [ ["Bedroom", 3.65, 3.35]
  , ["Living Room", 4.87, 3.35]
  , ["Dining Room", 2.44, 2.44]
  , ["Kitchen", 3.35, 2.44]
  , ["Bathroom", 3.65, 2.44]
  ]
```
```idris
Furniture : Schema
Furniture = ["item" :> String, "length" :> Double, "width" :> Double, "height" :> Double]

furniture : Table Furniture
furniture =
  [ ["Dresser", 1.00, 0.60, 0.76]
  , ["Bed", 2.00, 1.50, 0.50]
  , ["Nightstand", 0.60, 0.48, 0.75]
  , ["Dinner Table", 1.80, 0.90, 0.80]
  , ["TV Stand", 1.93, 0.42, 0.90]
  , ["Coffee Table", 1.81, 0.81, 0.46]
  ]
```
A simple row polymorphic function
With our tables defined, it's time to revisit our `area` function. The schema will be an implicit variable in the type signature, and we'll constrain it such that both the length and width fields are elements of the schema. These constraints allow us to simply call the `index` function with the given fields on the provided row, with the guarantee that if our call type checks, we'll get a value back; no need for `Maybe` here7:
Tip
In a type signature, `x =>` is equivalent to `{auto _ : x} ->`.
```idris
area : {hs : Schema} -> "length" :> Double `Elem` hs => "width" :> Double `Elem` hs =>
       Row hs -> Double
area x =
  let length = ("length" :> Double) `index` x
      width = ("width" :> Double) `index` x
  in length * width
```
Without any further implementation, we can now call this function on both our room and furniture tables and it works as expected:
Note
In this specific context, we do have to explicitly provide the schema to the `area` function to disambiguate, but this can be avoided with more careful design and should be considered more of an artifact of the limited implementation in this article than an intrinsic ergonomics concern with this approach to programming in Idris.
```idris
map (area {hs = Rooms}) rooms == [12.2275, 16.314500000000002, 5.9536, 8.174, 8.905999999999999]
map (area {hs = Furniture}) furniture == [0.6, 3.0, 0.288, 1.62, 0.8106, 1.4661000000000002]
```
Building a new column
Just being able to access the fields in a row polymorphic way is pretty useful, but it's also not really that impressive. We have much more interesting capabilities at our disposal, like adding new fields. As a first example, we can take a function that consumes a row and produces a new value, along with a name for the new value, and use that to add a new column to each row in the table:
```idris
buildColumn : {hs : Schema}
           -> (f : Row hs -> t) -> (name : String) -> Table hs -> Table (hs ++ [name :> t])
buildColumn f name table = map buildColumnRow table
  where
    addToEnd : {name' : String} -> {hs' : Schema}
            -> t -> Row hs' -> Row (hs' ++ [name' :> t])
    addToEnd {hs' = []} x [] = [x]
    addToEnd {hs' = (y :: ys)} x (x' :: xs) = x' :: addToEnd x xs

    buildColumnRow : {name' : String} -> Row hs -> Row (hs ++ [name' :> t])
    buildColumnRow row = addToEnd (f row) row
```
With our new tool, we can make new tables derived from our previous ones, such as adding an area column to our previous two tables:
```idris
roomsWithArea : Table (Rooms ++ ["footprint" :> Double])
roomsWithArea = buildColumn {hs = Rooms} (area {hs = Rooms}) "footprint" rooms

roomsWithArea ==
  [ ["Bedroom", 3.65, 3.35, 12.2275]
  , ["Living Room", 4.87, 3.35, 16.314500000000002]
  , ["Dining Room", 2.44, 2.44, 5.9536]
  , ["Kitchen", 3.35, 2.44, 8.174]
  , ["Bathroom", 3.65, 2.44, 8.905999999999999]
  ]
```
```idris
furnitureWithArea : Table (Furniture ++ ["footprint" :> Double])
furnitureWithArea = buildColumn {hs = Furniture} (area {hs = Furniture}) "footprint" furniture

furnitureWithArea ==
  [ ["Dresser", 1.0, 0.6, 0.76, 0.6]
  , ["Bed", 2.0, 1.5, 0.5, 3.0]
  , ["Nightstand", 0.6, 0.48, 0.75, 0.288]
  , ["Dinner Table", 1.8, 0.9, 0.8, 1.62]
  , ["TV Stand", 1.93, 0.42, 0.9, 0.8106]
  , ["Coffee Table", 1.81, 0.81, 0.46, 1.4661000000000002]
  ]
```
Select
We can also implement a function that takes only certain specified fields from a table, analogous to an SQL `SELECT` statement. This function only makes sense to apply when all the fields in the new schema are also in the old schema, so we take an auto-implicit argument containing a proof of inclusion for each element of the new schema, which we will use to index into the input table. `select` makes sense to define both for individual rows and whole tables, so we create two different versions of it in two different namespaces to allow for overloading:
Tip
`All` is a more general version of `HList`. `All p xs` can be thought of as being equivalent to `HList (map p xs)`, but in the Idris standard library, `HList xs` is actually an alias for `All id xs`.
```idris
namespace SelectRow
  public export
  select : {hs : Schema} -> Row hs -> (is : Schema) -> {auto allIn : All (`Elem` hs) is}
        -> Row is
  select r [] {allIn = []} = []
  select r (x :: xs) {allIn = (this :: rest)} =
    let this = x `index` r
    in this :: select r xs

namespace SelectTable
  public export
  select : {hs : Schema} -> Table hs -> (is : Schema) -> {auto allIn : All (`Elem` hs) is}
        -> Table is
  select xs is = map (\r => select r is) xs
```
Let's give this a test drive. We only really wanted to know how big the rooms are8, and now that we have the areas calculated, we can discard the length and width fields:
```idris
roomAreas : Table ["room" :> String, "footprint" :> Double]
roomAreas =
  select {hs = Rooms ++ ["footprint" :> Double]}
    roomsWithArea ["room" :> String, "footprint" :> Double]

roomAreas ==
  [ ["Bedroom", 12.2275]
  , ["Living Room", 16.314500000000002]
  , ["Dining Room", 5.9536]
  , ["Kitchen", 8.174]
  , ["Bathroom", 8.905999999999999]
  ]
```
One interesting thing to note about `select` is that you don't actually need to know the complete schema of the table it was called on for your call to type check. This property allows `select` calls to keep working correctly after columns have been added to tables, to be reused on tables that have had some sort of join operation applied to them, or to work across entire families of related tables.
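To illustrate, here's a sketch of a schema-agnostic helper (the name `roomNames` is my own, not something defined above): it should type check against any table whose schema contains a `"room"` column, regardless of what other columns are present:

```idris
-- Only requires a "room" column to be present somewhere in the schema;
-- every other column is irrelevant to type checking.
roomNames : {hs : Schema} -> All (`Elem` hs) ["room" :> String] =>
            Table hs -> Table ["room" :> String]
roomNames t = select t ["room" :> String]
```

Assuming proof search cooperates, both `roomNames rooms` and a call on the wider `roomsWithArea` table should produce the same single-column table, with no changes to `roomNames` needed when the schema grows.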
Implementing Regular Interfaces for Row Types
Implementing types as `Row`s doesn't lock us out of more traditional programming methods. Say, for instance, in interacting with someone else's code, it turns out that we actually do have to provide area information through a `HasArea` interface:
```idris
interface HasArea t where
  getArea : t -> Double
```
Not only can we implement this for a specific `Row` type, we can implement it generically for any `Row` that has the appropriate fields, in one shot:
```idris
[rowHasArea]
{hs : Schema} -> All (`Elem` hs) ["length" :> Double, "width" :> Double]
  => HasArea (Row hs) where
  getArea row =
    let [length, width] = select row ["length" :> Double, "width" :> Double]
    in length * width
```
Note
This code is a bit needlessly ugly; the implicit arguments can be made more amenable to proof search by using bespoke types instead of generalist standard library types, but that would have made this article a lot longer than it needs to be.
```idris
map (getArea @{rowHasArea {hs = Rooms}}) rooms ==
  map head
    (select {hs = Rooms ++ ["footprint" :> Double]} roomsWithArea ["footprint" :> Double])
```
Simplifications made for the sake of this article
I've made a number of simplifications compared to my implementation in Structures9 and to implementations in other languages, to keep the implementation side concise. If Structures hasn't been updated yet by the time you are reading this, you can check out its status at the time I was writing this article here. I also recommend checking out the Brown Benchmark for Table Types for a more complete treatment of the topic.
To summarize a few of the changes relative to Structures:
Label types are hard coded to `String`

While the label type is going to be `String` in a lot of cases in Idris code, really any type that has a decidable equality implementation, or even just plain equality, would work, depending on the level of guarantees you wish to provide.

Fields aren't nullable
Tabular data coming in from the real world quite often has fields that are missing their data: someone may not have answered a question on their survey, a particular row might be older than that field's addition, among a litany of other reasons. Many other treatments make all fields implicitly nullable, and Structures provides explicit control over nullability in the schema, but I've ignored nullability here to keep the examples much more concise.
The consistency of the schema isn't ensured
A lot of operations on row types stop making sense if the schema doesn't satisfy certain constraints. For instance, indexing (among other operations) gets weird if there are two columns with the same name in the schema. Structures has its functions accept auto-implicit proofs of the relevant properties for each function, but those proof types account for a large amount of the library's verbosity, and are not needed for conveying the basic idea of row polymorphism.
Lack of bespoke types
The auto-implicits demonstrated in this article would all interact better with proof search if they were bespoke types instead of compositions of standard library types. But those bespoke types aren't needed for conveying the basic idea, and using standard library types helps show the connection to more general ideas, making them the better choice for a purely pedagogical implementation like this.
Row types need not inhabit tables, though; they can also be used effectively on their own↩︎
At best, we might be able to get away with something like a derive macro, but that's still something we have to invoke manually↩︎
Generally speaking, with "traditional" data types, an interface typically requires some manual intervention from the programmer to be implemented for a given type. This provides a level of assurance that the implementation over that data type is intentional, and thus likely semantically valid.↩︎
For instance, TypeScript's structural subtyping is a very good approximation of row polymorphism, and this is something I believe has contributed just as much to its success as a JavaScript replacement as its backwards compatibility. A statically typed language that could run existing JavaScript code, but made it a battle to assign types to that existing JavaScript code, wouldn't be very useful as a transition language.↩︎
In other languages, fields in row types may be considered to be implicitly nullable. In Structures, I provide explicit control over nullability of fields in the schema.↩︎
Because that's totally all the information you could ever need to determine what all will fit in a room, right?↩︎