Friday 27 September 2013

Origin Destination Data Expansion Tool

If an origin-destination survey only collects a sample of the trips made across a cordon area, that data can be combined with total flow counts at each site to produce a complete dataset. The result can represent the true situation on the ground quite accurately, and sampling reduces the cost of carrying out the survey.

Using ANPR technology, detection rates in excess of 95% can often be achieved (albeit at quite a high cost), which leaves little for the tool to do. But if you have collected Bluetooth, WiFi or RFID signatures, or adverse weather has affected an ANPR survey, the sample rate can be significantly lower. In these cases it is especially useful to be able to carry out 'bi-proportional matrix balancing', sometimes known as the 'Furness method', with minimal effort.

I have developed a free online tool to carry out this process.

The same process is available in Buchanan Computing's MicroMatch, but that software costs money, and may not always be applicable to your data, since it is intended specifically for registration (number plate) data.

The process can similarly be applied to data collected through road-side interviews or questionnaires where only a small proportion of travellers provide information about their trips.

The software employs a process similar to that described here, and my implementation uses Python and NumPy.
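
The balancing step itself is short in NumPy. Below is a minimal sketch of bi-proportional fitting, assuming a single square matrix of sampled OD matches with one outbound and one inbound target count per site; the figures are made up, and the real tool additionally handles multiple classes and time intervals.

import numpy as np

def furness(seed, origin_targets, dest_targets, iterations=20):
    # Alternately scale rows and columns so that the row sums approach the
    # outbound (origin) targets and the column sums approach the inbound
    # (destination) targets.
    matrix = np.asarray(seed, dtype=float).copy()
    origin_targets = np.asarray(origin_targets, dtype=float)
    dest_targets = np.asarray(dest_targets, dtype=float)
    for _ in range(iterations):
        row_sums = matrix.sum(axis=1)
        matrix *= np.divide(origin_targets, row_sums,
                            out=np.zeros_like(origin_targets),
                            where=row_sums > 0)[:, None]
        col_sums = matrix.sum(axis=0)
        matrix *= np.divide(dest_targets, col_sums,
                            out=np.zeros_like(dest_targets),
                            where=col_sums > 0)[None, :]
    return matrix

seed = np.array([[5.0, 2.0],
                 [3.0, 4.0]])             # sampled matches between two sites
origin_targets = np.array([120.0, 80.0])  # total outbound counts per site
dest_targets = np.array([90.0, 110.0])    # total inbound counts per site
print(furness(seed, origin_targets, dest_targets))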

Another use for the tool is to 'factor up' existing OD data from a previous survey, to meet new total flows which have been collected more recently. There are arguably better approaches to this type of problem however, such as using a gravity model.

For full information about how to use it, please read on, as there is limited information on the webpage.


Two CSV (comma separated values) files are required - see the following examples:
LFData.csv
TCData.csv

The first is for the total flows into and out of the cordon at each site for each time interval. The software supports multiple vehicle classes, each with its own labelled column. 5-minute, 15-minute or hourly intervals should work fine, but there must be an entry for every time interval, even if all counts are 0. These values represent what the seed data is to be factored up to (the target values), and will often be ATC data or manually classified vehicle link counts by direction.

The second is for the seed data to be expanded. This could be ANPR, Bluetooth or RFID match data for the cordon area which constitutes a sample of the vehicles using the cordon.
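
Purely as an illustration of the kind of layout described (the column names here are hypothetical; the linked LFData.csv and TCData.csv show the actual expected format), the link-flow file might contain one row per site, direction and time interval with a column per class, and the seed file one row per matched entry/exit site pair:

Site,Direction,Time,Car,LGV,HGV
1,In,07:00,142,18,6
1,Out,07:00,95,12,4

EntrySite,ExitSite,Time,Car,LGV,HGV
1,2,07:00,12,2,1
2,1,07:00,7,1,0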

The seed data from this file may optionally be fixed so that it cannot be expanded. This allows the software to serve a different purpose: 'filling in' OD match pairs where that data is unavailable. This can be the case when large junctions are enumerated from video footage by manually following vehicles through the junction. Occasionally one or more turning movements cannot be seen sufficiently clearly, and this method may provide an acceptable alternative to an expensive resurvey. When using the tool like this, the missing turning movements are seeded with low values, which then hopefully find equilibrium. You may want to run the data through twice: once to fill a missing OD match pair, and then again to adjust for any inaccuracy in the recorded turning counts.
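
One way this fixing can work (my own assumption about the mechanics, not a description of the tool's internals) is to re-pin the observed cells to their recorded values after each scaling pass of the sketch above, so only the low-seeded missing movements are free to grow:

def furness_with_fixed(seed, origin_targets, dest_targets, fixed_mask, iterations=50):
    # fixed_mask is a boolean array marking the turning movements that were
    # actually observed; they are reset after each pass, and the remaining
    # low-seeded cells absorb the residual flow.
    matrix = np.asarray(seed, dtype=float).copy()
    observed = matrix.copy()
    origin_targets = np.asarray(origin_targets, dtype=float)
    dest_targets = np.asarray(dest_targets, dtype=float)
    for _ in range(iterations):
        row_sums = matrix.sum(axis=1)
        matrix *= np.divide(origin_targets, row_sums,
                            out=np.zeros_like(origin_targets),
                            where=row_sums > 0)[:, None]
        matrix[fixed_mask] = observed[fixed_mask]
        col_sums = matrix.sum(axis=0)
        matrix *= np.divide(dest_targets, col_sums,
                            out=np.zeros_like(dest_targets),
                            where=col_sums > 0)[None, :]
        matrix[fixed_mask] = observed[fixed_mask]
    return matrix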

In some cases it may also help to untick the option to expand return trips. This can be useful if you already have, say, 95% of the data and the tool is adding unwanted extra pairs which seem out of place (U-turns at a junction, for example). This may happen if there were problems with the raw match data.

The number of iterations you require depends on how well your dataset is converging, and what characteristics you require from the data. Experimenting with this value is probably a good idea.
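
If you want something more objective than a fixed iteration count, one simple check (my own suggestion rather than the tool's built-in criterion) is to measure how far the current row and column sums still are from the targets after each pass:

def max_target_error(matrix, origin_targets, dest_targets):
    # Largest absolute gap between the current row/column sums and the target
    # flows; stop iterating once this drops below a tolerance you are happy with.
    row_err = np.abs(matrix.sum(axis=1) - origin_targets).max()
    col_err = np.abs(matrix.sum(axis=0) - dest_targets).max()
    return max(row_err, col_err)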

Once you click 'Expand Data', processing should complete within a minute, and you will be presented with several files containing different views of the new dataset.

Processed Data - this is the expanded trip data in the format you supplied
Time Series Data - this is convenient for graphing the distribution of trips over time
Target Match - this shows how closely the seed values converged upon the target values
Processed Data Summary - this aggregates the Processed Data over the whole survey period


I would welcome any comments or suggestions regarding this tool. If you have data in a format that you cannot convert to the required one, please let me know; if time allows, I may add support for it.

If your requirements are more complicated than this, I would also be happy to discuss them with you.

Sunday 15 September 2013

Local Population Tool

Some surveys undertaken by my clients require knowing the resident population within certain distances of a survey site.

There are a number of ways of collating this data, but I decided that by combining the 2011 Census data for England and Wales with the postcode address location data published by the OS, I could make this very easy by offering it through my new website.

Here is the local population tool

It allows the user to input either a residential postcode or an OSX, OSY coordinate, and produces a Google chart showing the radial distance on the X axis and the residential population on the Y axis.

It works by using a matrix holding a population figure for every 100 x 100 m grid square of the area covered. A circular mask is overlaid at the required location on this grid, and the grid squares falling within each radius are summed to arrive at the population totals. A slight adjustment reduces the downsampling error.
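
As a rough sketch of that masking step (the grid, site location and radii below are made up, and this omits the downsampling adjustment):

import numpy as np

def population_within_radii(population, cy, cx, radii_m, cell_size_m=100):
    # population holds residents per 100 m x 100 m cell; (cy, cx) is the survey
    # site in grid indices. Each radius gets the sum of all cells whose centres
    # fall within that distance of the site.
    yy, xx = np.ogrid[:population.shape[0], :population.shape[1]]
    dist_m = np.hypot(yy - cy, xx - cx) * cell_size_m
    return {r: float(population[dist_m <= r].sum()) for r in radii_m}

grid = np.random.default_rng(0).poisson(3, size=(500, 500))  # fake 50 km x 50 km area
print(population_within_radii(grid, 250, 250, radii_m=[500, 1000, 2000]))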

This kind of work is easy using Python and NumPy, and providing access through a website is no problem using Django.


Because of the static (2011) source and the resolution of the input data, it is not going to be totally accurate, and my choice of grid resolution introduces further approximation. The quality of the data, however, should be comparable to other similar data sources out there, and more convenient for basic purposes.

In its raw form, the population data was aligned to individual postcodes, which themselves were aligned to specific OSX/OSY centre points rather than area polygons. Had time allowed, I would have created polygons and distributed the populations within those, but instead I applied a smoothing algorithm to the data using SciPy, my intuition being that this would slightly improve the data quality.
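
A minimal sketch of that step, assuming a Gaussian blur (the choice of filter and the sigma value are mine; the post does not pin down the exact algorithm):

import numpy as np
from scipy.ndimage import gaussian_filter

population = np.zeros((500, 500))
population[250, 250] = 120.0   # a whole postcode's residents at its centre point
# Spread each postcode's point total over the surrounding cells; away from the
# grid edges the overall total is preserved.
smoothed = gaussian_filter(population, sigma=2.0, mode='constant')
print(population.sum(), round(smoothed.sum(), 1))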

This is what the data looks like when plotted as a heat-map with the Python matplotlib library.
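
Something along these lines reproduces the idea (the colour map and log scaling are my own choices, not necessarily those used for the published image):

import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# Continuing from the smoothed array above; adding 1 keeps empty cells valid
# on a logarithmic colour scale.
plt.imshow(smoothed + 1, norm=LogNorm(), cmap='hot', origin='lower')
plt.colorbar(label='Residents per 100 m cell')
plt.savefig('population_heatmap.png', dpi=150)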



I could have used the matplotlib library for creating the charts in the tool too, but decided that Google Charts had the advantage when it came to interactivity and simplicity of implementation.

If there ever seems to be a requirement for it, I will add the other data from the census set: the male/female breakdown and number of households.

It would probably also be informative to combine this data with the data used for the BBC's Road Crash Deaths visualisation plotted last year, which looks similar. I may do so at some point if time allows.