Using data derived from cellular phone locations to estimate visitation to natural areas: An application to water recreation in New England, USA

How to cite this study

Merrill, N.H., Atkinson, S.F., Mulvaney, K.K., Mazzotta, M.J. and Bousquin, J. 2020. Using data derived from cellular phone locations to estimate visitation to natural areas: An application to water recreation in New England, USA. PLoS ONE 15(4): e0231863.

Overview

This study compares cell phone location data with onsite observations at 577 water access areas in New England, most of them on Cape Cod, Massachusetts. The authors built statistical models to estimate daily visitation from the cell data for the summer months of 2017. After calibration, the preferred model accurately predicted visitation to the 577 water access areas.

Relevance

This study is relevant to those interested in using large data sources, such as cell phone data or social media data, to estimate visitation. Although cell phone data can provide vast quantities of temporal and spatial information, the locational accuracy of a given record is not always known: some reported locations are accurate to within 1-10 meters (GPS signal), while others are accurate only to within 20-200 meters (Wi-Fi signal) or 100-2,000 meters (cell tower signal). This highlights the need for additional data, such as onsite counts, to verify that visitors were using the water access points rather than, for example, a nearby restaurant. Finally, commercially available cellular data that a third-party provider has anonymized and aggregated relies on many statistical modeling assumptions, and cell phone users may not accurately represent the demographics of visitors to water access areas.

Location

The study covers 464 water recreation areas on Cape Cod (Barnstable County, Massachusetts) and 113 beaches across New England.

Trail Type

Water access areas include all public beaches and public access points to water (fresh and saltwater beaches, parks, and boat ramps).

Purpose

The purpose of this paper was to compare cell phone location data with multiple types of onsite observations at water access areas in Massachusetts and elsewhere in New England. The authors received no funding for this work.

Findings

  • With calibration (matching the scale of cell phone data to observational data), the random forest model accurately predicted visitation to the 577 water-access sites.
  • Without calibration, the model overpredicted visitation by about a factor of four.
  • Cell phone data appears to be better at predicting visitation to larger areas with more daily visitation than smaller areas with less daily visitation.
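
The effect of calibration described above can be illustrated with a minimal sketch. It assumes the simplest possible calibration, a single ratio that rescales cell-derived counts to match observed counts, which is not necessarily how the authors calibrated their models. The function names and all numbers are hypothetical and chosen only to mirror the roughly fourfold overprediction.

```python
def calibration_factor(cell_counts, observed_counts):
    """Return one multiplicative factor that rescales cell-derived counts
    so that their total matches the total of the onsite observations."""
    total_cell = sum(cell_counts)
    if total_cell == 0:
        raise ValueError("cell counts sum to zero; cannot calibrate")
    return sum(observed_counts) / total_cell


def mean_abs_pct_error(predicted, observed):
    """Mean absolute error as a fraction of the observed value."""
    return sum(abs(p - o) / o for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical site-days (illustrative values, not data from the study).
cell = [400.0, 820.0, 1180.0, 2000.0]      # uncalibrated cell-derived counts
observed = [100.0, 200.0, 300.0, 500.0]    # onsite counts for the same site-days

factor = calibration_factor(cell, observed)
calibrated = [c * factor for c in cell]

error_raw = mean_abs_pct_error(cell, observed)
error_calibrated = mean_abs_pct_error(calibrated, observed)
print(f"factor={factor:.2f} raw error={error_raw:.2f} calibrated error={error_calibrated:.3f}")
```

A single ratio corrects only the overall scale. It cannot capture differences between large and small sites, which is one reason a model with additional predictors may do better.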

Methods

Cellular location data was collected from June through September 2017 at water recreation areas in Barnstable County, Massachusetts (464 sites) and at beaches elsewhere in New England (113 sites). The data was purchased from Airsage, Inc., in anonymized and aggregated form: summaries of visitation at recreation sites and estimates of the census block groups from which visitors originated. Buffers of 100 meters were added around the water access sites to improve counts. The cell phone dataset includes daily visitation estimates for 51,511 site-days across all 577 sites. To compare with the cell phone data, the authors also used onsite observations: counts at small access points to an estuary, a town's visitation estimates for its managed beaches, and entrance fees collected by a town at a major beach (Narragansett Beach).
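
The summary does not say how the 100-meter buffers were implemented. As a rough illustration of the idea only, the sketch below treats a site as a circle (an assumption; real sites are polygons) and tests whether a reported location falls within the site radius plus the buffer, using the standard haversine great-circle distance. The coordinates, the site radius, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_buffer(ping, site_center, site_radius_m, buffer_m=100.0):
    """True if the ping lies inside the circular site extended by buffer_m."""
    distance = haversine_m(ping[0], ping[1], site_center[0], site_center[1])
    return distance <= site_radius_m + buffer_m


# Hypothetical site and ping, about 111 m apart (0.001 degrees of latitude).
site = (41.6500, -70.3000)
ping = (41.6510, -70.3000)

distance_m = haversine_m(ping[0], ping[1], site[0], site[1])
counted_with_buffer = within_buffer(ping, site, site_radius_m=50.0)
counted_without_buffer = within_buffer(ping, site, site_radius_m=50.0, buffer_m=0.0)
print(round(distance_m, 1), counted_with_buffer, counted_without_buffer)
```

In this example the ping is counted only when the buffer is applied. The same buffer that recovers visitors near the site edge can also pick up people at neighboring locations, especially when the reported position is accurate only to hundreds of meters.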

Cell phone data overestimated the observed visitor counts by about a factor of four. This is likely because the cell phone data counted visitors by the hour rather than over the block of time covered by the onsite counts (9 am to 4 pm), so a visitor who stayed several hours was counted once for each hour. An assumption of a three-hour average stay was used to better calibrate the data. Three models (linear, log-linear, and random forest) were created to translate the cell data into visitation estimates across many sites and to test the transferability of the cell data. Variables including temperature, precipitation, month, day of the week, and size of the water access area were included as predictors, and the regressions were fit in R. The random forest model (a type of model commonly used in data science and machine learning) was chosen as the preferred model because of its better predictive performance.
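
The hourly double counting and the three-hour adjustment described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code (their models were fit in R): the function names and all numbers are hypothetical, the log-linear form shown here logs both variables (the summary does not give the exact specification), and the weather, calendar, and site-size predictors and the random forest model are omitted.

```python
import math


def daily_visits_from_hourly(hourly_devices, start_hour=9, end_hour=16, avg_stay_hours=3.0):
    """Estimate distinct daily visitors from hourly device counts.

    Sums the counts for hours start_hour through end_hour - 1 (device-hours on
    site) and divides by the assumed average length of stay in hours.
    """
    device_hours = sum(hourly_devices.get(h, 0) for h in range(start_hour, end_hour))
    return device_hours / avg_stay_hours


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical day: three groups of 10 visitors, each staying three hours
# (9-12, 11-14, 13-16), so 30 distinct visitors produce 90 device-hours.
hourly = {9: 10, 10: 10, 11: 20, 12: 10, 13: 20, 14: 10, 15: 10}
naive_total = sum(hourly.values())
adjusted = daily_visits_from_hourly(hourly)

# Hypothetical calibration data: adjusted cell counts vs. onsite counts.
cell_daily = [40.0, 120.0, 400.0, 1500.0]
observed = [12.0, 33.0, 95.0, 310.0]

lin_slope, lin_intercept = fit_line(cell_daily, observed)
log_slope, log_intercept = fit_line(
    [math.log(c) for c in cell_daily], [math.log(o) for o in observed]
)


def predict_log_linear(cell_count):
    """Back-transform the log-log fit to a visitation estimate."""
    return math.exp(log_intercept + log_slope * math.log(cell_count))


print(naive_total, adjusted, round(lin_slope, 3), round(log_slope, 3))
```

Dividing device-hours by an average stay recovers the visitor count exactly only when every visitor stays that long. With mixed stay lengths it remains an approximation, which is why a further model-based calibration against onsite counts is still useful.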


Added to library on November 27, 2023