Abstract. Although spatial interaction modeling is a fundamental technique to many geographic disciplines, relatively little software exists for spatial interaction modeling and for the analysis of flow data. This applies particularly to the realm of free and open source software. As a result, this primer introduces the recently developed spatial interaction modeling (SpInt) module of the python spatial analysis library (PySAL). The underlying conceptual framework of the module is first highlighted, followed by an overview of the main functionality, which will be illustrated using migration data. Finally, some future additions are discussed.
Spatial interaction modeling involves the analysis of flows from an origin to a destination, either over physical space (i.e., migration) or through abstract space (i.e., telecommunication). Despite being a fundamental technique to many geographic disciplines, there is relatively little software available to carry out spatial interaction modeling and the analysis of flow data, especially in the realm of free and open source software. Therefore, the purpose of this primer is to provide an overview of the recently developed spatial interaction modeling (SpInt) module1 of the python spatial analysis library (PySAL)2 . First, the current framework of the module will be highlighted. Next, the main functionality of the module will be illustrated using migration flows with a dataset previously used for spatial interaction modeling tutorials in the R programming environment (Dennett 2012). Finally, some future additions are discussed.
The core purpose of the SpInt module is to provide the functionality to calibrate spatial interaction models. Since the “family” of spatial interaction models put forth by Wilson (1971) are perhaps the most popular, they were chosen as the starting point of the module. Consider the basic gravity model (Fotheringham, O’Kelly 1989),
When data for , , , and are available we can estimate the model parameters (also called calibration), which summarize the effect that each model component contributes towards explaining the system of known flows (). In contrast, known parameters can be used to predict unknown flows when there are deviations in model components (, , and ) or the set of locations in the system are altered.
Using an entropy-maximizing framework, Wilson derives a more informative and flexible “family” of four spatial interaction models (Wilson 1971). This framework seeks to assign flows between a set of origins and destinations by finding the most probable configuration of flows out of all possible configurations, without making any additional assumptions. By using a common optimization problem and including information about the total inflows and outflows at each location (also called constraints), the following “family” of models can be obtained,
Each member of the family of models provides a different system structure, which can be chosen depending on the available data or the specific research question at hand. The so-called unconstrained model does not conserve the total inflows or outflows during parameter estimation. The production-constrained and attraction-constrained models conserve either the number of total inflows or outflows, respectively, and are therefore useful for building models that allocate flows either to a set of origins or to a set of destinations. Finally, the doubly-constrained model conserves both the inflows and the outflows at each location during model calibration. The quantity of explanatory information provided by each model is given by the number of parameters it provides. As such, the unconstrained model provides the most information, followed by the two singly-constrained models, with the doubly-constrained model providing the least information. Conversely, the model’s predictive power increases with higher quantities of built-in information (i.e. total in or out-flows) so that the doubly-constrained model usually provides the most accurate predictions, followed by the two singly-constrained models, and the unconstrained model supplying the weakest predictions (Fotheringham, O’Kelly 1989).
Spatial interaction models are often calibrated via linear programming, nonlinear optimization, or, increasingly more often, through linear regression. Given the flexibility and extendability of a regression framework it was chosen as the primary model calibration technique within the SpInt module. By taking the natural logarithm of both sides of a spatial interaction model, say the basic gravity model, it is possible to obtain the so-called log-linear or log-normal spatial interaction model,
12) and (13) is the functional distance-decay specification, which results from plugging either equation (10) for a power function or (11) for an exponential function into equation (2) before linearizing it. The only practical difference here is that the distance is logged in equation (12) whereas in equation (13) it is not. Constrained spatial interaction models can be achieved by including fixed effects for the origins (production-constrained), fixed effects for the destinations (attraction-constrained) or both (doubly-constrained). However, there are several limitations of the log-normal gravity model, which include,
Therefore, the Poisson log-linear regression specification for the family of spatial interaction models was proposed (Flowerdew, Aitkin 1982, Flowerdew, Lovett 1988). This specification assumes that the number of flows between and is drawn from a Poisson distribution with mean, , where is assumed to be logarithmically linked to the linear combination of variables,
and exponentiating both sides of the equation yields the unconstrained Poisson log-linear gravity model,
where equations (14) and (15) refer to the unconstrained model with a power function distance-decay. As previously mentioned, using fixed effects for the balancing factors in equations (4), (6), (8) and (9), the constrained variants of the family of spatial interaction models can be specified as,
Tiefelsdorf, Boots 1995). Notice that k is the estimated intercept and must be included in these log-linear models (equation 12–18) to ensure the total number of flows is conserved, despite not being included in the maximum entropy models where such conservation is implied. Similar to equation (13), the exponential function distance-decay can be specified in equation (15, 16–18) by omitting the logarithm associated with . Using Poisson regression is more representative of flows and satisfies limitations (1) and (2) and it also alleviates limitations (3) and (5) since we no longer need to take the logarithm of . Using fixed effects within Poisson regression to calibrate the doubly-constrained model also avoids the need for iterative computation of the balancing factors that exists in other calibration methods (Fotheringham, O’Kelly 1989).
Calibration of Poisson regression can be carried out within a generalized linear modeling framework (GLM) using iteratively weighted least squares (IWSL), which converges to the maximum likelihood estimates for the parameter estimates (Nelder, Wedderburn 1972). To maintain computational efficiency with increasingly larger spatial interaction datasets, SpInt is built upon a custom GLM/IWLS routine that leverages sparse data structures for the production-constrained, attraction-constrained, and doubly-constrained models. As the number of locations in these models increases, so the does the number of binary indicator variables needed to construct the fixed effects that enforce the constraints. Therefore, larger spatial interaction datasets become increasingly sparse and the utilization of sparse data structures takes advantage of this feature. As a metric, constrained models with locations, which implies observed flows when each location is an origin and destination, can be calibrated within minutes on a standard macbook pro notebook.
In order to evaluate the fit of spatial interaction models, it has been recommended that a variety of statistics be used (Knudsen, Fotheringham 1986), which is the approach taken in SpInt. For the log-normal regression specification, it is popular to utilize the coefficient of determination (), though this statistic is not available within the GLM framework used by SpInt. In replacement of the statistic, the SpInt framework provides a pseudo based on the likelihood function (McFadden 1974),
where is the likelihood of an estimated model, is the model including all explanatory variables of interest, and is the model with only an intercept (i.e., no covariates). Like the statistic, the pseudo version is at a maximum at a value of 1 with higher values denoting better model fit. To account for model complexity, there is also an adjusted version of this statistic,
where is the number of regressors. If model fit does not sufficiently improve, then it is possible for this measure to decrease as variables are added, signaling that the additional variables do not contribute towards a better model fit. Henceforth, these pseudo statistics are referred to solely as and adjusted . Another model fit statistic available in the SpInt module that also accounts for model complexity is the Akaike information criterion (AIC),
where lower AIC values indicate a better model fit (Akaike 1974). This statistic is grounded in information theory, whereby the AIC is an asymptotic estimate of the information that is lost by using the full model to represent a given theoretical process.
The and AIC are designed for model selection, which means they should not be used to compare between different spatial systems. One solution to this issue is the standardized root mean square error (SRMSE),
where the numerator is the root mean square error of the observed flows, , and the flows predicted by the model, , and the denominator is the mean of the observed flows and is responsible for standardization of the statistic. Here, is the number of origin-destination pairs that constitute the system of flows. A SRMSE value of 0 indicates perfect model fit, while higher values indicate decreasing model fit; however, the upper limit of the statistic is not necessarily 1 and will depend on the distribution of the observed values (Knudsen, Fotheringham 1986).
One final fit statistic, a modified Sorensen similarity index (SSI), is included within the SpInt module because it has become increasingly popular in some spatial interaction literature that deals with non-parametric models (Lenormand et al. 2012, Masucci et al. 2012, Yan et al. 2013). Using the same symbol definition from the SRMSE, the SSI is defined as,
which is bounded between values of 0 and 1 with values closer to 1 indicating a better model fit.
Despite being a small toy dataset, the following example is utilized for consistency since it was previously used to demonstrate spatial interaction modeling in the R programming language (Dennett 2012). The data are migration flows between Austrian NUTS level 2 regions in 2006. In order to use a regression-based calibration, the data has to be transformed from the matrices and vectors described in equations (1)–(9) to a table where each row represents a single origin-destination dyad, and any variables associated with locations and . Details on how to do this are outlined further in LeSage, Pace (2008), though this has already been done in the example data. Let’s have a look!
The Origin and Destination columns refer to the labels for origin locations, , and the labels for destination locations, , the Data column is the number of flows between i and j, the Oi and Dj columns are the number of total out-flows at i and total in-flows at j, respectively, and the Dij column is the Euclidian distance between the centroids of and . In this case we use the total out-flow and total in-flow as variables to describe how emissive an origin is and how attractive a destination is. If we want a more informative and interesting model we can replace these with application specific variables that pertain to different hypotheses. Next, lets format the data into arrays.
The Oi and Dj vectors need not be arrays. In fact, they can be where is the number of variables that are being used to describe either origin or destination attributes associated with flows. It should also be noted that intra-zonal flows have been excluded (the first line of code above). This is sometimes done because intra-zonal flows are large compared to inter-zonal flows and would therefore heavily influence the model or because it is not possible to adequately define a distance associated with intra-zonal flows. Some solutions to these issues have been proposed (Kordi et al. 2012, Tsutsumi, Tamesue 2011), though for simplicity, intra-zonal were removed for this example.
Now, lets load the main SpInt functionality and calibrate some models. The “family” of spatial interaction models are found within the gravity namespace of the SpInt module and the estimated parameters can be accessed via the params attribute of a successfully instantiated spatial interaction model.
Unconstrained (basic gravity) model
Note that for the above examples the print statement for the constrained models params attribute is limited to print only the main model variables (i.e., not fixed effects), though it is still possible to access the fixed effect parameters too.
The first parameter is always the overall intercept with the subsequent 8 parameters representing the fixed effects in this case. You might ask, “why not 9 fixed effects for the 9 different regions?”. Due to the coding scheme used in SpInt, and many popular statistical programming languages, you would use binary indicator variables in the design matrix to include the fixed effects for all 9 regions in the model. While the non-zero entries in these columns of the design matrix indicate which rows are associated with which region, where a row has all zero entries then implicitly refers to the region that has been left out. In SpInt, this is always the first origin or destination for the production-constrained and attraction-constrained models. For the doubly-constrained model, both the first origin and the first destination are left out (Tiefelsdorf, Boots 1995). In terms of interpreting the parameters, these dropped locations are assumed to be 0.
You can also access typical model diagnostics, such as standard errors (std_err), t-values (tvalues), p-values (pvalues), and confidence intervals (cont_int).
First, it will be demonstrated how to interpret the coefficients associated with the main model variables from a general Poisson regression. However, because the spatial interaction model is a log-linear Poisson regression (i.e., we take the log of the explanatory variables) the same interpretation often cannot be applied because we are working in logarithmic space. Therefore, it will also be demonstrated how to interpret the parameters when they are associated with a logged explanatory variable.
Recall from the previous section that the exponential distance-decay specification results in a model that does not take the logarithm of . Therefore, we can use an unconstrained gravity model with an exponential distance-decay specification to demonstrate a typical interpretation of coefficients from a Poisson regression.
-6.22938370e-03 is the coefficient for the distance variable in the above example. In Poisson regression, the coefficients are typically interpreted as the proportionate change in the predicted response, here , if we increase an explanatory variable by 1 unit (Cameron, Trivedi 2013). Technically, this is expressed as,
where is the new value of and is a coefficient, here the one typically associated with distance in a Poisson log-linear spatial interaction model with an exponential function distance-decay. For this example, this means from a 1 unit increase in distance, holding all other factors constant, if our model predicted 2,500 flows, then we can expect the number of flows to decrease to approximately 2,484.475. We can also identify the percent change expected from a one unit increase in distance using,
which serves as an alternative interpretation of . In this case, we could say that from a 1 unit increase in distance we could expect the number of predicted flows to decrease by approximately 0.621%.
However, neither equation (24) nor (25) is applicable when the coefficient is associated with a logged explanatory variable. This is important for Poisson log-linear spatial interaction models because this applies to the origin and destination variables when using an exponential function of distance-decay and to the origin, destination, and distance variables when using a power function of distance-decay. In these cases, the interpretation of the coefficients becomes the percent change in the predicted response, here , if we increase the associated explanatory variable by 1% (Cameron, Trivedi 2013). For example, 8.91445153e-01 is the coefficient associated with destination total in-flows (i.e., attractiveness) in the above example. Then if we increase the in-flows to location by 1%, say from to , and holding all other factors constant, we can expect the number of flows from to (i.e., ) to increase from to 3.
Finally, the fixed effects in the constrained models can be interpreted such that the mean predicted flows, , are () times larger if they originate (terminate) from location (location ) (Cameron, Trivedi 2013), where is equivalent notation for .
We can compare the different model fit statistics across the four types of spatial interaction models for this example. Let’s process the statistics into a tidy table and have a look.
From this table we can see that all of the fit statistics indicate a better model fit as constraints are introduced. That is, the weakest model fit is consistently related to the gravity model, with similarly increased model fit for the production-constrained and attraction-constrained models, and finally, the best model fit is associated with the doubly constrained model. We can also see that the and adjusted are very close, since these models have a very similar number of explanatory variables, thereby resulting in little or no penalization for model complexity.
We can also take a look at whether the power or exponential distance-decay specification results in a better model fit. For simplicity, lets just take a look at the SRMSE for a doubly constrained model.
For this example, it looks like the power distance-decay specification results in a better model fit.
The SpInt module also makes it possible to calibrate “local” models, which subset the data by specific origins or destinations in order to investigate how spatial interaction processes vary over space (Fotheringham, Brunsdon 1999). Below is an example of how to get local parameters and local diagnostics for a gravity model subset by its origins. The result is a dictionary of lists where the keys are the different sets of local values including parameters, hypothesis testing diagnostics, and the previously reviewed fit statistics.
Lets take a look at the local distance-decay parameters. The origin, destination and distance-decay parameters are indexed sequentially through the design matrix starting with 0 as you move through the origin attributes, through to the destination attributes, and finally the distance-decay attribute. Therefore, for variables, the distance-decay parameters are always the n - 1th parameter, in this case of 3 variables: param2.
We can also take a look at the local .
Both the local distance-decay and the show some variation. We can explore this spatially, by mapping the local values. First, lets join the local values to a shapefile and then plot the local distance-decay parameters
Next, lets map the local values. Above we can see a much stronger distance-decay for the most westerly region. Below we can see that the model fit is stronger in the north-west and decreases in the south-east. Using these patterns, we could then further postulate why they arise or how we might be able to improve model fit.
In addition to all of the features presented here, there are several other tools that exist in SpInt or could be added. First, there are dispersion tests available in the dispersion namespace of the SpInt module, which can be used to test whether or not the Poisson equidispersion assumption is met. That is, that the conditional mean and variance are equivalent, which can be unrealistic in many scenarios. If these tests indicate overdispersion or underdispersion, then it might be appropriate to use a Quasi-Poisson model, which relaxes the equidispersion assumption of the Poisson model. The resulting parameter estimates are equivalent to the Poisson model, but the standard errors are typically larger whenever equidispersion does not hold (Wedderburn 1974). The Quasi-Poisson model specification can be carried out by setting Quasi=True in any of the spatial interaction models introduced here. Alternatively, it might be more appropriate to change the underlying probability model from Poisson to that of negative binomial or a zero-inflated model. However, this has not yet been implemented in SpInt and therefore remains as future work.
Another area of potential expansion is to accommodate several paradigms for incorporating spatial effects into spatial interaction models, such as competing destinations (Fotheringham 1983), a spatial lag autoregressive model (LeSage, Pace 2008), or an eigenvector spatial filter model (Chun 2008). These paradigms require code that computes additional variables, more complex calibration techniques, and specialized representations of spatial relationships. Some solutions to the latter are available in the spintW namespace of the weights module of PySAL. While there is still much work to be done to develop a more robust set of open source spatial interaction modeling tools, SpInt provides a starting point for which to build upon.
Akaike H (1974) A new look at the statistical model identification. Automatic Control, IEEE Transactions on 19: 716–723. CrossRef.
Cameron AC, Trivedi PK (2013) Regression Analysis of Count Data. Cambridge University Press, New York. CrossRef.
Chun Y (2008) Modeling network autocorrelation within migration flows by eigenvector spatial filtering. Journal of Geographical Systems 10: 317–344. CrossRef.
Flowerdew R, Aitkin M (1982) A Method of Fitting the Gravity Model Based on the Poisson Distribution. Journal of Regional Science 22: 191–202. CrossRef.
Flowerdew R, Lovett A (1988) Fitting Constrained Poisson Regression Models to Interurban Migration Flows. Geographical Analysis 20: 297–307. CrossRef.
Fotheringham AS (1983) A new set of spatial-interaction models: the theory of competing destinations. Environment and Planning A 15: 15–36. CrossRef.
Fotheringham AS, Brunsdon C (1999) Local Forms of Spatial Analysis. Geographical Analysis 31: 340–358. CrossRef.
Knudsen D, Fotheringham A (1986) Matrix comparison, Goodness-of-fit, and spatial interaction modeling. International Regional Science Review 10: 127–147. CrossRef.
Lenormand M, Huet S, Gargiulo F, Deffuant G (2012) A Universal Model of Commuting Networks. PLoS ONE 7: e45985. CrossRef.
LeSage JP, Pace RK (2008) Spatial Econometric Modeling Of Origin-Destination Flows. Journal of Regional Science 48: 941–967. CrossRef.
Masucci AP, Serras J, Johansson A, Batty M (2012) Gravity vs radiation model: on the importance of scale and heterogeneity in commuting flows. arXiv:1206.5735 [physics]. CrossRef.
Nelder JA, Wedderburn RWM (1972) Generalized Linear Models. Journal of the Royal Statistical Society. Series A (General) 135: 370–384. CrossRef.
Tsutsumi M, Tamesue K (2011) Intraregional Flow Problem in Spatial Econometric Model for Origin-destination Flows. Procedia - Social and Behavioral Sciences 21: 184–192. CrossRef.
Wedderburn RWM (1974) Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika 61: 439–447. CrossRef.
Wilson AG (1971) A family of spatial interaction models, and associated developments. Environment and Planning A 3: 1–32. CrossRef.
Yan XY, Zhao C, Fan Y, Di Z, Wang WX (2013) Universal Predictability of Mobility Patterns in Cities. arXiv:1307.7502 [physics]. CrossRef.