Bubble Foundry


Grammar-Based Event Logging

by Peter.

This is a big one, so hold tight!

One of the most common tasks in any application, on the web or otherwise, is logging. The list of uses for a good log of events occurring within your application ‘space’ is practically infinite, but at the simplest we can say that a log lets the developer store and then communicate to the user things that have happened and that it lets the developer better understand what is actually happening in his application, particularly when something unexpected or undesirable is occurring!

Theory

What do you log? It may seem obvious, but you’re logging events. Now, events are things that happen at a specific time. Time is a very slippery subject,1 but for simplicity’s sake let’s assume that an event’s time is a specific point in time (apologies for the circular logic) that can be fully represented by a Unix timestamp.2 Any event that would describe an occurrence over a span of time, such as a user session, can be described by two related events, one for the start moment and the other for the end moment. Yes, this is less succinct but it also requires less timespan juggling and shows the fundamental usefulness of the basic form.

So, at the most basic an event log is a set of ‘things’ which have Unix timestamps. Because time is linear,3 events will happen one after another and our log will probably list the events in sequential order. I say probably because irregular amounts of time may elapse between the time one event is created (I’m using the word event to refer to the discrete data representing the action we wish to log for posterity, not the action itself) and written to the log and the time another is created and written, such that the second may occur chronologically after the first but be written to the log before it.

This may be mitigated somewhat by offloading the task of creating the timestamps to the datastore which will store it, usually a database, such that the timestamp is only determined at the last possible moment, when writing the event to the log, which practically ensures the log entries will have sequential timestamps. A disadvantage of this strategy, of course, is that you ignore the difference between the action itself and its event object on one hand and its logged form on the other. Furthermore, you lose the ability to mark an event and only log it later, perhaps using the intervening non-negligible time to gather useful metadata to attach to the event or to send the data over the internet to a log residing on another machine. Even if these drawbacks are accepted, the potential for events to be written to the log out of chronological order remains. Since, even at the microsecond level, there is the possibility of events occurring simultaneously, we cannot treat event times as unique. In database terms, the timestamp is something we would (and do!) index but we cannot enforce a unique index upon it.

The Form and Use of an Event Log

How do you store an event in a log? Earlier I described events as being things happening at specific times. Depending on the nature of our application and the storage system we use, the nature of the thing an event represents (that is, the event type) can be included in the log entry or inferred from its storage context. This difference is the source of what I see to be the fundamental divide between various event logging systems: on one hand you have general purpose systems which require no context to identify an event’s type while on the other you have specialized ones which infer an event’s type from how the event is stored. This doesn’t mean that either one of the approaches makes it inherently easier to understand an event’s raw log form or to extract information about the event thing. For instance, different things may be represented by arbitrary, pre-determined integers in a generalized system while a specialized system may store events in verbose, descriptive files or database tables (e.g. user_signup_events or server_low_memory_warning_events).

There are two similar but distinct uses of event logging systems: to calculate values based upon events and to display events to a user. A system should be able to answer queries like last login < 5 days but, and this is something I see less emphasized, it should also be able to translate its efficient, indexable, queryable internal event format to human-friendly representations like “Peter Robinett created an account on July 25, 2010 at 3:06 AM CET.” These two goals are in opposition. Consider that the internal representation of the previous event might look like (using JSON to represent it): {'timestamp': 1280019960, 'eventType': 1, 'userID': 1}. Obviously this isn’t something you want an end user to see! Likewise the sentence form isn’t something you really want the computer (or, more accurately, the developer) to have to deal with – parsing that string the extract the relevant values is a pain.4 This disconnect is the basis for my attempt for a system to have the best of both worlds and will be explained in detail following a summary of other, existing approaches to event logging.

Existing Implementations: Text Files

The simplest solution to create a log of events is the oldest: write event data to a text file, almost always one event per line. You can search it later with tools like sed and awk. While as a programmer you need to write defensive code that can take whatever your users throw at you and prohibit incorrect or malicious input, as a library writer you also need to write code that prevents other programmers who will use your code from messing up your system. All it takes is one miss-formed entry and your log may come crashing down upon you. You shouldn’t give developers the noose to hang themselves with, particularly when doing a task that is both potentially vital to an application and whose internal implementation can be completely obscured. As Jaime Zawinski said, “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” and something of that criticism must surely apply here.

There are ways to avoid some of the pitfalls of an ad-hoc log file. First, there are structured log file formats and with a little bit of discipline you can ensure constituent event elements are listed predictably. The simplest example might be CSV: by separating the discrete elements of an event with a predictable separator (presumably a comma of course) we can now easily mark the components of an event and indicate if a specific component is empty or missing. Second, developers can provide intermediate tools which take input (whether from users or other developers) and properly structure it, ensuring any incorrect or malformed data is rejected or omitted and that only valid data is including when writing to the log file. This method is useful in any general event logging system and will be revisited later.

However, if you’re using a structured text file you’re already halfway (or perhaps all the way, depending on who you ask) to a database, as evidenced by the fact that CSV files are used by many as simple databases. If you are already using some sort of traditional database (relational or otherwise) with your application a structured log file would just seem like a poor man’s version of it, particularly when querying the event log, and would be introducing a separate data storage location you would need to manage and develop for. So, let us now consider how to log events in a database.

Existing Implementations: Row-Based Logging

There are many different ways to log events in databases. Several web application frameworks encourage the tying of basic time data to other records, giving you basic events for these records. This is probably the simplest way to store events in a database and is very common. For instance, Ruby on Rails and its ActiveRecord ORM (the inspiration for the others, I believe) will correctly manage created_at and updated_at columns for you, as will CakePHP. This approach is at the row, or item, level and gets its event type from its context of being in a specific table, so using the nomenclature developed earlier we can call it a specialized event logging system. While row created and row updated are the most common event types in this approach, there is no reason why other timestamps can’t be added to the table to allow for additional event types relevant to the row type to be logged.

However, there are many drawbacks to a row-based approach. Particularly limiting is the fact that (ignoring the database’s transaction log) a new event always overwrites the previous instance of it and there is no way for the database user to know of previous events, even whether there have been any or not. Some people couple a timestamp with a counter so that they can know the number of events (for example, last login and total logins) but this still does not give you the full details of the previous, overwritten events.

Furthermore, an updated column will tell you when a row last changed but not what within it changed. A digression: I think it would be awesome if there were a database that exposed their transaction logs at a level accessible to the average SQL or application programmer. Perhaps there is one? I actually began to explore such a system several months ago as a way to serialize, store, and synchronize an SQLite database via Git, but that’s a topic for another day. Now back to the problems with row-based events.

Additional columns can be added to a table in order to track additional events, to the point where you could have a timestamp column for every ‘normal’ column, but such an approach both bloats your database and is cumbersome and potentially dangerous (to your system!): every time someone wants to track a new event type you’ll have to add yet another column to the table and then ensure that all things that use this table are not negatively affected by the addition of a new column. Say I have a social security application and users can change their names. Accordingly I have name and name_updated columns, giving me a single name changed event. However, that doesn’t tell me why someone’s name changed: Did they get married and change their surname to that of their spouse? Or perhaps they didn’t like the name they were given and changed it upon becoming a legal adult? Are they an immigrant and decided that a name in the language of their new home would be better? Or maybe even it was an infant that didn’t have a name yet when they were first entered in the database? A simple name changed name changed event can never tell us these things. You may say that such distinctions are irrelevant – all a social security agency cares about is the person’s id number and that has not changed – but I disagree. For instance, maybe official correspondence is not reaching the person but knowing the type of name change the person did would help correctly reroute them to the person. Ok, that example is rather contrived, but I think you get the picture.

Existing Implementations: Specialized Log Tables

The solution most often chosen by those who begin to reach the limit of a ‘created-and-updated’ approach is to create specialized log tables. On one hand we can call these tables specialized because the events within get their general event type from the containing table but we can also call them specialized because there is no restriction. For instance, if previously we could only have a user updated event with a simple row-based approach, now a table exclusively for user events means that me can have specialized user event types such as name changed or even name changed because of marriage. Because all events relate somehow to a user we know that all events will make use of a user_id column and can even implement a foreign key constraint linking this column to its corresponding column in the users table, both ensuring fast queries and data fidelity. Because the close coupling with the it’s also easy to do things like database triggers. While the indexing benefits are lost, such an approach works exactly the same with structured log files – just have multiple log files – and retains its other benefits.

However, specialized event log tables silo event data by type and makes it difficult to examine the intersection of different event types and thus object states, prioritizing an event’s connection to the database rows it describes over its commonalities with other events. This may not seem like a bad thing but it discourages easier, deeper analysis of the logged system. For instance, answering the question of which users changed their names within a month of getting married, assuming the name changed and marriage created events (watch out for the marriage destroyed one!) live in different specialized log tables, becomes particularly tricky and requires a mess of joins. Joins are not impossible to work with but they do have performance penalties and developers are required to keep track of commonalities and differences between event log tables so as to know when and where to make them. On a separate note, while specialized log tables give you a good deal of structured data for little cost, you must still add additional columns if you want to support specialized types within the common event type or if you need to store additional data to explain the event, (e.g. to whose surname did the person change their surname to?).

Existing Implementations: Ad-Hoc Log Tables

Ad-hoc log tables are, as I see it, a reaction to specialized log tables.5 Why have tons of log tables cluttering your database when you can just have one? By only having two structured columns, timestamp and event_type, we can keep our table generic enough to log any type of event while still having some indexing. Any additional event metadata can be stored in a free-form text column, whether it’s serialized structured data describing the event or a human-friendly version of the event (the “Peter Robinett created an account on July 25, 2010 at 3:06 AM CET.” from much earlier). It is because of (the potential of) storing all event metadata in an ad-hoc fashion that I call this an ad-hoc log table approach. And, according our previous rubric, these tables are most definitely general event logs. So is this the solution to all our needs? Not quite. The problem with this approach is that any querying and filtering beyond that by event type must happen outside of the database and in application code – we’ve given up all the benefits of SQL for a gloried key-value system.6 We can do better.

Using a Grammar

It is now that we finally come to my grammar-based event logging system. The question is, can we make an event logging system that is both general enough to include any event type we can think of but, and this is where we go beyond the ad hoc system described above, still do (most of) our queries through the database, using SQL. Remember that we’re trying to express human friendly sentences like “Peter Robinett created an account on July 25, 2010 at 3:06 AM CET.” Is there something that can break the previous sentence into a regular structure we can use?

Well there is, grammar! According to Wikipedia grammar is “the set of structural rules that govern the composition of sentences, phrases, and words in any given natural language.” So, if grammars are structural rules for the composition of sentences, we can use a natural language grammar, in my case English,7 and break down any sentence that someone could use to describe an event occurring with our system. We can then log the event in the form of those grammatical elements, giving us a regular structure that will make subsequent querying and indexing easier. In addition, when pulling the event out of the log we can reassemble it into a coherent sentence, though perhaps not in the exact same form as the person who logged it gave it.

For those who didn’t do Reed-Kellogg sentence diagramming in middle school, here’re some grammar basics: Sentences are divided up into subjects and predicates. Since a predicate must always have a verb, let’s simplify and say sentences have subjects, verbs, and various details.8 Verbs, particularly action verbs, are great, they’re what our events are all about. They can describe an action taking place. So, our event type is our verb is is always required in a grammar-based event logging system: without it you don’t know what actually happened that they’re logging. Subjects are a little trickier, but not much: normally it is a specific user that preformed an action and so that user is the subject. However, sometimes the subject isn’t so clear or it might be the system itself: for example ‘the system’ in “The system sent Peter an email.” Because of this i will sometimes make sense to have special subjects like ‘the system’ or omit it entirely.

But wait, we don’t know when the event described by the verb happened! Correct, and now that we’re ready to describe the ‘various details’ I mentioned earlier, we’ll fix that. First, we need to describe when the verb happened – this is the job of an adverb or adverbial phrase. For example, “Peter Robinett created an account on July 25, 2010 at 3:06 AM CET” has the adverbial phrase ‘on July 25, 2010 at 3:06 AM CET’. Because this grammatical component is so important, perhaps even more important than the verb sometimes if you’re only interested about how many events happened when, I tend to use the term timestamp, not adverb, to highlight what it is and how it works.

subject, verb, timestamp a.k.a. adverb: we’ve got the basics down but it’s still just one more structured column than the ad hoc system. Luckily there’s more grammatical elements than just that. First, actions are often done to things or people, like: “Peter bought the ticket.” ‘the ticket’ is an object, specifically a direct object. In my grammar I think of a direct object as anything acted on directly by the verb. Wikipedia describes a separate type of object called a prepositional object9, but to simplify things, in my grammar it’s considered the same thing as a direct object. Now we’re starting to get a nice grammar: a subject can do something (verb) at a timestamp to a direct object. We could stop here and be pretty satisfied with your system: we can describe basic sentences in our grammar and in a reasonably relational database can use these points of information as jumping off points to get more. If a player attacks a spaceship in a game and the spaceships table has an owner column, you can infer which player the first one attacked.

If, as I hoped you realized, direct object is an optional element and doesn’t exist in all event sentences, the next two are even more so. First, something can be connected to the verb through the verb’s effect on the direct object. This is called an indirect object and it gives context to the verb and direct object. For example: “Peter gave John money.” Here ‘John’ is the indirect object and ‘money’ the direct object. As with direct objects I like to play fast and loose with indirect objects and consider anything that is connecting an owner or additional object to the direct object to be an indirect object. So, in “he gave the money to John” and “he took John’s money,” ‘John’ is always the indirect object in my grammar.

But how much money did I give to John? The money needs quantification, so let’s introduce a quantifier to our event grammar. You could have “he bought 10 spaceships” or “he gave bread to 10 people”: we have the same quantifier value but the first is quantifying the direct object while the second is qualifying the indirect object. In your event log table you could have a quantifying column for each grammatical element but that’s messy: I prefer to just have one general column and leave it up to my translation methods (more on that later) to make the proper relationships.

And that’s about it! Now we can log an event like “Peter added 10 tags to the blog post on event logging” broken down into our structured grammar as subject: Peter, verb: added, direct object: tags, indirect object: blog post on event logging, quantifier: 10. If we have a good translation function we might be able to later reassemble the event into the sentence “Peter added 10 tags to blog post on event logging.” Not bad! As you’ve seen I’ve made many simplifications along the way. There are many more grammatical elements out there (dive into Wikipedia if you’re interested) but I’ve tried to strike a balance between specificity and simplicity. If you think I’ve error too far in one direction or another, your system can always be different but I’d be interesting in hearing what you’d do differently, as I hope that the grammatically elements I’ve outlined in this sections are sufficient for all standard application logging.

Implementing a Grammar-Based Event Log – The Database

You didn’t think I was done, did you? I’ve given you all the theory behind grammar-based event logging, but how do you actually implement it? First, let’s start with the database table. We’re going to have one table with eventually LOTS of rows and, to be honest, things may slow down. However, we also want to preserve a fair amount of flexibility, which means we may want to go for more flexible, inefficient options. How you manage this tension will really depend on how much you constrain your grammar ahead of time. Let’s assume a fairly flexible setup.

We’ve got an eventlog table with the following columns: verb, timestamp, subject, directobject, indirectobject, and quantifier. Notice that we don’t have any sort of column here that immediately jumps out as a potential primary key. Depending on what database system you’re using, adding an auto-incrementing integer id column either is a needless complication or a vital addition. For instance, as I understand it, using one can be a big help with the InnoDB engine in MySQL. If you don’t have an integer primary key, making your primary key a compound key of, in order, the verb, timestamp, and subject columns could work.

However, this assumes that such set would be unique, when it is very possible to allow and anticipate a subject performing the same verb multiple times at the same timestamp. For instance, the event “Peter invited Joe and Katie to his birthday party” could be logically split into “Peter invited Joe to his birthday party” and “Peter invited Katie to his birthday party,” both having the same subject, verb, and timestamp. Granted, there is question of whether the two events represent two separate invitations that happened at the same time or a single compound invitation. Particularly if the event records direct user intervention I think you can make such an assumption. Using a finer grained, millisecond timestamp will also reduce the probability that separate events of the same subject and verb will happen at the same time. For the reasons I’ve outlined above I suggest a using an auto-incrementing integer primary key and leaving any compound keys non-unique.

Next let’s consider verb. You may want to make is an integer column and connect the integers to human-friendly names in an eventtype table, perhaps even using foreign key constraints to connect the two tables. Or, you could treat the integers as magic numbers and just have a list explaining them somewhere in your code. Using an integer column, particularly something like a SMALLINT or, if you’re using MySQL, TINYINT, will keep your column size to a minimum and perhaps make querying and sorting a bit more efficient. An alternative approach, and the one I have used in the two grammar-based event log systems I’ve build, is to use a VARCHAR, or string, column and use keys like USER_CREATED. I can then avoid any sort of translation table, and I believe that the readability advantage outweighs the increased column size, though ultimately I will have to stress the systems and see. I’d be interested in hearing if you have have any thoughts on this.

And now the rest, subject, directobject, indirectobject, and quantifier. I think for all of them you should use a string column type, even quantifier, for the reasons I use one for verb. The things you will refer to won’t always be simple rows in other tables, such as the system events we discussed earlier. In additional, sometimes you may just want to store arbitrary text in the column. Even quantifier can benefit from this: consider that sometimes you may want to say ‘many’ instead of ‘5’.

Finally, keys. You’ll want single keys on the most used columns – timestamp, verb, and subject – and it may make sense to have keys on directobject, indirectobject, and quantifier. Given that we’re using string column types your keys may get quite large, so it’s really up to you to balance your column types and lengths with the keys you will need. Composite keys, depending on your application and if chosen well, can act as covering keys and could be a real benefit. Again, the most used columns is the prime suspect for a composite key, though as we discussed a while earlier you probably can’t make it unique. If you can figure out a composite column where it will always be unique, all the better and you may see some performance benefits. It could even be your primary key, though many database engines don’t like large primary keys, so be warned.

Implementing a Grammar-Based Event Log – The Code

Ok, we’re getting there. We’ve got a general system to describe events and we have a database schema. How do we actually use it? Sure, you could do raw INSERTs and SELECTs but there’s a lot of room for error. We can do better. In the two grammar-based event logging systems I’ve taken similar, but slightly divergent, approaches. In both cases I made classes representing event types based upon common parent classes which convert event-specific variables to grammar elements. In my Scala project I made case classes for each event type and had them share some common conversion. For instance, I might log a user like NewUser(userid).save, which the save method internally hooking into the ORM I’m using. In contrast, my PHP implementation uses a shared singleton object (something that doesn’t really exist, but that’s another long blog post) and might look like: EventLog::log(EventLog::$NEW_USER, array('userid' => $userid)), where EventLog::$NEW_USER is a static event type class instance. To be honest, I’m not sure which approach is better. On one hand doing everything via instances of classes works well (and Scala’s case classes are awesome), but my approach in PHP with ligher event type classes and the user event instantiating instances themselves also worked well, though it means the code ends up relying heavily upon associative arrays (maybe this is an advantage for you). Regardless of the approach I tried to make it so that the developer using the system had easy to understand event objects to work with and assigned all event variables to names logical for the event in question, rather than having to remember their mappings to grammar elements. This enforces some of the consistency that we cannot have at the database level.

Conclusions

And here we are. Hopefully I’ve made a reasonably compelling argument that event logging is quite useful and that a general database-backed system using a simple grammar is the way to go. With a good event log you can create cool activity streams for users and useful monitoring tools for developers. However, there are tradeoffs, which I have tried to mention when appropriate, and a grammar-based event log isn’t a panacea. For instance, while event logs are great for activity streams, I wouldn’t use one for a system entirely around Twitter-like status updates: best to have specific database solutions designed for that specific, high-volume type of data and leave the general event log for everything else happening in the system.

It’s only fair here to revisit what I said earlier about text-based logging. Using an event grammar like the one I’ve outlined above you could easily use a flat file instead of a database table. You have a fixed number of fields, and you can easily express them in a structured format such as CSV. In fact, it is easier to add additional fields for specific log types, as you needn’t change the table schema. That being said, I believe that the ability to use SQL to query an event log is far superior to any sort of querying you would normally do over a structured text file. Fans of sed and awk may disagree. Furthermore, if performance is a concern, you should see better results with a database as it will keep some or all of its log in memory, avoiding expensive filesystem operations.

  1. Don’t believe me? Read up on some of the many issues that can crop up trying to represent the thing that is time. It’s a testament to the subject’s complexity that Joda Time, perhaps the best time library out there, has Instants, Partials, Intervals, Durations, Periods, and Chronologies. []
  2. If ‘Unicode, dammit!’ is what I consider to be Beautiful Soup’s catch-phrase, I think mine might be ‘Unix timestamps, dammit!’ Yes, they’ve got issues but they’re still one of the best options out there to represent time. Speaking of bad libraries, am I the only one that thinks that Python’s datetime class is ridiculously stupid and makes it incredibly impossible to get a simple Unix timestamp? I mean, someone says you should get a Unix timestamp with mktime(myDatetime.timetuple())+1e-6*myDatetime.microsecond! Jeeze, even Java has myDate.getTime()! []
  3. We’re application programmers, not quantum physicists, so Newtonian time is good enough for us. []
  4. That being said, it’s not impossible. The field of natural language processing is quite advanced today and doesn’t even require Google-scale supercomputing clusters: a friend of mine is doing quite impressive NLP in Javascript within the web browser. []
  5. Of course, this is probably not fair and they probably actually developed simultaneously or even the ad-hoc ones came first. []
  6. Of course, sometimes a key-value store is exactly what you need, but my whole thesis in this article is that we can do better for event logging. []
  7. And as irregular English is, it still has the bonuses of keeping grammatical elements pretty separate from each other, with relatively little inflection, and of kind of working whichever way you place the grammatical elements (this is very useful when composing human friendly sentences, trust me). You should probably use English too if you’re implementing such a system. We’ll see later how even using English as your grammar’s natural language when developing events doesn’t preclude easy internationalization/localization of human friendly event sentences. []
  8. While it may seem like some sentences are either missing a subject or a verb, let’s say it’s either incomplete or the missing thing is implied. For example, “Go there,” has the implied subject ‘you’. []
  9. In their example, ‘radio’ in “We listened to the radio” where ‘to the radio’ is a preposition. []