This chapter summarizes the seventh session of the workshop, which addressed changes in social science data and methods and their impact on rural classification. Alan Murray (Drexel University) prepared a commissioned paper, Evolving Spatial Analytics and Rural Area Classification, for the workshop.
He summarized changing analytical possibilities, including Geographic Information Systems (GIS) and spatial statistics, and increasingly powerful computing. Sarah Low (Economic Research Service [ERS]) also described changing analytical possibilities, Richelle Winkler (Michigan Technological University) spoke about the availability and quality of data from the American Community Survey (ACS), and Michael Ratcliffe (Census Bureau) discussed more frequent availability of local-level data at lower levels of geographic scale. James Fitzsimmons was the moderator for this session.
STATEMENT BY ALAN MURRAY1
Murray stated that rural, or rurality, is a vague concept whose meaning depends on the context and purpose of the study. He pointed out that there are many different perspectives and rationales motivating why analysts look at rurality and what that means for people in rural areas. For example, air service
1This presentation is based on Murray (2015).
subsidies for rural airports define rural in a very different way than has been discussed at this workshop.
Categories of Data Sources
Murray said that there are many sources of data beyond the U.S. Census Bureau. Some are government agency generated, both national and local. National products, in addition to those from the Census Bureau, include those from ERS, other U.S. Department of Agriculture agencies, and others. Local products encompass parcel and structure-oriented data. Private vendors may also scrape various sources. For example, he noted, the National Establishment Time-Series is scraped from Dun & Bradstreet. Private vendors may take published data, perhaps from the Internet, and make it available in a digital source. Geolibraries and geoportals include the U.S. General Services Administration’s data.gov, as well as volunteered geographic information user-generated products such as Wikimapia, OpenStreetMap, and others. Data are available from sensing platforms such as GPS, satellite imagery, aircraft, drones, and red-light cameras/videos. Sensing platforms also include Google Street View in Google Maps, as well as cameras and traffic counters. User-generated sources include volunteered geographic information (VGI), where people add information that has a spatial orientation. There is also unintentional user-generated information, he noted. If the GPS in a person’s phone is turned on, it is tracking where the person is going. The person is generating data but may not realize it. At Drexel, clothes are made with embedded radio frequency identification devices (RFID). The wearers do not realize they are generating data about their activities, body temperature, and bio-characteristics.
It is useful, he said, to think about how these varied sources might be used to derive characteristics of rural areas, although some of the sources, such as VGI, may have issues related to data quality. That can be problematic in various ways, Murray said.
Murray defined spatial analytics as any of the quantitative methods to support analysis, policy, planning, and management involving geographic space. They support the systematic analysis of geographic data and are similar to and consistent with definitions of quantitative geography and geocomputation. They include Geographic Information System (GIS), remote sensing, measures and metrics, statistics, simulation, optimization, regional economics, and geovisualization. Spatial analytics could be used in a map-based product to summarize objects or in some
analytical environment where a map and other graphic and nongraphic methods are used to derive insights.
Over time, there has been not only increasing computational capabilities, but also richer spatio-temporal information than was available in the past. There are also different conceptualizations of geographic space. Simple abstractions of geographic space have been replaced by more explicit and detailed analyses, he noted. The digital environment supports these enhancements. Nevertheless, Murray pointed out, the information available in a digital environment is an abstraction of reality, with uncertainties and other issues in terms of data quality, position, attributes, and change over time.
Murray defined GIS as a particular form of information system. It collects geographically (spatially) referenced and nonspatial attribute data. It is a system of hardware, software, and procedures designed to support geographical decision making through the capture, management, manipulation, analysis, modeling, and display of spatially referenced data.
He referred to a few GIS components to highlight issues important for rural classification and rural analysis. He reminded the audience that in a digital environment, the real world has been simplified. Murray said that analysts can do a lot with digital information through GIS, but there is also uncertainty and potential error in the process of digitizing, whether using on-screen or other devices. Data can be converted from one source to another, but in that conversion process, spatial and other errors may be introduced.
Geocoding is the process of assigning geographic coordinates to an address to identify its location on the surface of the Earth. However, there are quality issues in geocoding as well, such as spatial accuracy and matching success.
Murray said that data in a digital environment comes from some source by some process. It has data quality issues in terms of the attributes derived and in terms of the positional accuracy. The mainstay of GIS is an ability to manipulate digital information by simplification, aggregation, disaggregation and interpolation, transformation, and projection. GIS supports many analyses, but users should be aware of the potential for error and uncertainty.
Unlike in the past, there are many ways to measure distance, including rectilinear distance or distance along a network, rather than only considering the proximity between two places. Further, it can be done with spatial objects. Murray suggested the population centroid of a county may not
make as much sense as in the past. The capabilities of GIS have implications for rural classification.
Murray noted that spatial autocorrelation, which was developed in the 1990s, reflects the notion of evolution. Ten years ago, analysts began to represent geographic space in terms of weights that specify who is a neighbor. There are many ways to do this, and the measure is very sensitive to that specification, he said, which leads to the question of what constitutes a neighborhood relationship. Work continues to focus on what that specification should be.
Spatial optimization and simulation are sensitive to data specification, error, and uncertainty. If a measure of spatial autocorrelation is computed, there is some uncertainty in the data in terms of the units used, the spatial scale, or some of the attributes. An analyst may or may not detect spatial autocorrelation. Alternatively, the areas themselves might change based upon this uncertainty. This is the so-called modifiable areal unit problem, or frame dependence; that is to say, the method is dependent upon the underlying spatial specification.
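Murray’s point about sensitivity to the neighborhood specification can be illustrated with a small numerical sketch (not from the workshop materials). Moran’s I, a standard measure of spatial autocorrelation, is computed below for the same five hypothetical attribute values under two different spatial weights specifications; the values and the weights are illustrative assumptions only.

```python
def morans_i(values, weights):
    """Moran's I: (n / S0) * sum_ij w_ij*(x_i - mean)*(x_j - mean) / sum_i (x_i - mean)^2.

    weights: dict mapping ordered pairs (i, j), i != j, to the weight w_ij.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(weights.values())  # total weight across all neighbor pairs
    num = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)

# Five hypothetical units arranged along a line.
values = [2.0, 3.0, 8.0, 9.0, 4.0]

# Specification A: only immediate neighbors count (contiguity).
adjacent = {(i, j): 1.0 for i in range(5) for j in range(5) if abs(i - j) == 1}

# Specification B: everyone within two steps counts as a neighbor.
two_step = {(i, j): 1.0 for i in range(5) for j in range(5) if 0 < abs(i - j) <= 2}

print(morans_i(values, adjacent))
print(morans_i(values, two_step))
```

With these toy numbers, the first specification yields a positive statistic and the second a negative one, underscoring Murray’s point that the conclusion about spatial autocorrelation can hinge on how the neighborhood relationship is specified.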
Murray closed by saying that sources of data, particularly spatial data and spatial analytics, have evolved over time and are very promising. However, many errors and uncertainty remain with little understanding of their implications for studying issues in rural areas or urban areas.
STATEMENT BY SARAH A. LOW
Low said she approached the topic of changing analytical possibilities as a representative of the next generation of regional scientists and rural researchers. She agreed with other presenters about the unlikelihood of developing a universal definition of rural, but said she would like to see a better definition that is more widely used. With the data and methods available, the challenge is getting people to adopt a slightly better definition in research. She said she would discuss what she termed the county trap or the “nonmetropolitan equals rural trap.” Adoption is the problem, in her view. ERS has a great reputation in the research community, especially the rural research community. If ERS adopted an improved definition, she said, it would go a long way toward encouraging wider adoption.
She said her generation has always had access to GIS, big data, and geocoding. The point is that spatial analysis methods and data are not new. The methods allow creative researchers to define rural or enterprise zones, but they usually use someone else’s definition. Many times, metropolitan-nonmetropolitan is used as a proxy for rural-urban because analysts are busy, she stated.
Low noted that Isserman (2005) talked about a trap in which metropolitan becomes the most widely used definition of urban. But, as
James Fitzsimmons pointed out earlier in the workshop (see Chapter 2), 40 percent of the U.S. land area is either metropolitan or micropolitan. Considering how concentrated the population is in this country, a lot of the land area is core. She referred to John Cromartie’s explanation that ERS codes are based on a metropolitan/nonmetropolitan breakdown. In Low’s view, it is a bit of an abuse to consider everything in a metropolitan area as urban.
As Low pointed out, Isserman (2005) lamented that researchers and policy makers refer to metropolitan counties as urban and nonmetropolitan as rural, which he said misleads the public and policy makers. In 2000, most counties were both rural and urban. Metropolitan counties contain over half of the U.S. rural population. Low briefly explained Isserman’s rural-urban density typology. He defined counties by their character. He defined rural counties as having more than 90 percent of the population in rural areas, with a population density of less than 500 people per square mile, and the size of the largest urban area less than 10,000 people. Urban counties were defined as counties with more than 90 percent of the population in urban areas, an urban population of at least 50,000, and a population density of more than 500 people per square mile. He also defined mixed urban and mixed rural counties as those that were in between. If the population density was less than 320 people per square mile, the county was mixed rural; if the population density was more than 320, the county was mixed urban.
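The decision rules of Isserman’s typology, as summarized above, can be written out as a short classifier. This is a sketch of the rules as described in this chapter only, not Isserman’s published implementation; the function name and parameter names are hypothetical, and edge cases in the original typology are not handled.

```python
def classify_county(pct_pop_rural, pct_pop_urban, density,
                    largest_urban_area, urban_pop):
    """Classify a county using the density typology summarized from Isserman (2005).

    pct_pop_rural / pct_pop_urban: share (0-100) of population in rural/urban areas
    density: people per square mile
    largest_urban_area: population of the county's largest urban area
    urban_pop: total county population living in urban areas
    """
    if pct_pop_rural > 90 and density < 500 and largest_urban_area < 10_000:
        return "rural"
    if pct_pop_urban > 90 and urban_pop >= 50_000 and density > 500:
        return "urban"
    # Counties in between are mixed, split at 320 people per square mile.
    return "mixed rural" if density < 320 else "mixed urban"

print(classify_county(95, 5, 30, 2_000, 1_000))         # a sparsely settled county
print(classify_county(2, 98, 1_200, 300_000, 400_000))  # a dense, fully urban county
```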
She said that if a density-based typology such as this were more widely adopted, it would have the potential to clarify definitions as well as an understanding of rural.
Data and Methods and the Future of Defining Rural
Low explained that big data, GIS, and spatial analysis allow analysts to make appropriate definitions for the task at hand. She asked if defining rural is more of a policy question than a research question. Spatial econometrics does not require a cutoff, and research increasingly operates on a rural-urban continuum. GIS allows for a proximity focus, such as driving time and distance, as well as useful aggregations, such as labor market areas or commuting zones. The results are more intuitive and make more sense than at the county level.
Low said that it is important for analysts to continue to advocate for better data. Isserman (2005) made a strong case for better data, she noted. Publicly available data are comprehensive only at the county level, and data suppression to preserve confidentiality limits the utility of data for small areas. The subcounty data are limited, but they are becoming more available. It would be helpful to increase researchers’ access to microdata with remote access, lower cost, and increased data-sharing between federal agencies, she said.
Low said that the methods and data to implement a better definition of rural are available, but that data and analytics are ahead of the concept. She said that although a universally accepted definition of rural is not likely, a more useful definition than defining rural as not metropolitan would be valuable. As she noted earlier, ERS is in the position to posit a new definition and encourage its use in research and policy.
Low also challenged workshop attendees to better prepare graduate students in areas like computational methods and GIS so they will become more able researchers or policy makers. Both groups, whether they are sociologists, economists, or planners, need to be comfortable with GIS and big data, and they need the resources to be able to do that, she stated.
STATEMENT BY RICHELLE WINKLER
Winkler spoke about the availability and quality of data from the American Community Survey (ACS). The ACS is a data instrument that people who work in more rural areas are starting to learn more about and rely on, despite its flaws. Winkler pointed out that there are other opportunities and alternative data sources beyond the ACS.
Winkler discussed information on the geographic units and the available variables of interest in the ACS, and the margins of error associated with ACS data and how they vary for different geographies and variables. Whether ACS provides high enough quality data for more rural areas depends on the variable of interest and the geographical unit of analysis, she explained.
ACS data are available at geographies down to the census block group level. Block groups nest within census tracts, which nest within counties. Another option is county subdivisions, which are appealing in some ways because in the 12 minor civil division (MCD) states, these political units of analysis reasonably represent neighborhoods. However, this is only true in those 12 MCD states; in other states, county subdivisions seem fairly arbitrary, although they do nest within counties.
Winkler noted the host of variables in the ACS and suggested the variables that might be of interest in rural analysis. The ACS provides population and housing unit estimates that are updated annually, not just at the time of the decennial census. However, they are demographic estimates with some error associated with them. There are data on industry and employment, extractive industries, natural resource-based industries, agriculture, and migration. A question in the ACS about where the respondent lived one year ago allows an analyst to calculate multiple different measures of migration. It is possible to compute the percentage
of the population in a geographic region who moved in within the last year, or who moved in from a metro area within the last year. One way an analyst might recognize whether a more rural area is urbanizing is if people are moving there from a more urban area. The ACS also provides county-to-county flow files associated with the migration question to see where people are moving to or coming from. To Winkler, those data are not useful given the errors associated with them.
She said there are other variables in the ACS to consider. Commuting is important, she said. With data on commuting, an analyst can look at travel time and at how many people or what proportion of people leave their unit of analysis to work somewhere else outside the state or other area.
There also are county-to-county worker flow files created from the ACS data. These are of better quality than the migration flow files, she said. For the migration question, the sample size is smaller. People are asked whether they lived in the same house one year ago and, if not, where they lived. First, a person who moved has to be sampled. In contrast, with commuting, most people work, and it is more likely that a number of these workers will be included in the sample; with a larger sample, the data are better.
Winkler also noted that the Census Transportation Planning Products (CTPP) files come out of an agreement between the transportation planning community and the Census Bureau.
Winkler identified two critical temporal issues to consider in using the ACS for rural area classification: the residence rule and the timing of counting people in the ACS, which differs from the decennial census. The ACS is an ongoing survey, while the decennial census counts people at their usual residence for the year as of April 1. This is different from the ACS residence rule of two months: the ACS asks if a person will be in the same household for two months and, if so, counts the person as a resident.
Winkler said this matters because much of the seasonal population that the Census did not count is included in the ACS estimates. For example, in places with many seasonal residents, population density could vary quite a bit from the decennial Census count. This raises an interesting question, she said, about whether to count seasonal populations when considering population thresholds for rural classification.
She said that multiyear estimates are another important feature of the ACS. ACS data for small areas and more rural counties are only available as five-year averages, which creates challenges in interpreting change over time. At the time of the workshop, she noted, the most recent data available were for 2009–2013, released in January 2015.
Winkler pointed out that the sample size for the decennial census long form was one in six households. With the ACS, it is closer
to one in 40 each year. The ACS samples about 2.5 percent of households annually. She observed that for some units of analysis, some data are suppressed if there are not enough people in the sample to protect confidentiality. ACS estimates are also accompanied by margins of error, and the margins of error are larger for smaller geographies, smaller populations, and more rural areas. She said that one way to assess whether a margin of error is too high or not is to use the coefficient of variation (CV), the standard error times 100 divided by the estimate. A rule of thumb, she explained, is that if a CV is less than 12 percent, then it might be considered reliable. If the CV is greater than 40, the estimate has low reliability and is probably not very useful.
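Winkler’s rule of thumb can be applied directly to a published ACS estimate and its margin of error. A minimal sketch follows, assuming the standard ACS convention that published margins of error correspond to a 90 percent confidence level (so the standard error is the margin of error divided by 1.645); the function names and example numbers are illustrative.

```python
def coefficient_of_variation(estimate, moe, z=1.645):
    """CV in percent: standard error * 100 / estimate.

    ACS margins of error are published at the 90 percent confidence
    level, so standard error = MOE / 1.645.
    """
    standard_error = moe / z
    return standard_error * 100 / estimate

def reliability(cv):
    """Apply the rule-of-thumb thresholds Winkler described."""
    if cv < 12:
        return "high reliability"
    if cv > 40:
        return "low reliability"
    return "medium reliability"

# Hypothetical tract: an estimate of 1,500 with a margin of error of +/- 250.
cv = coefficient_of_variation(1_500, 250)
print(cv, reliability(cv))
```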
Winkler illustrated some of the quality and margin-of-error issues associated with using ACS data for rural area classification by presenting a case study of the Upper Peninsula of Michigan, a remote, mostly rural area. Winkler summarized the CVs for the case study at various units of analysis (county, county subdivision, census tract, and census block group) to demonstrate data quality for three key variables: population estimates, percent in-migration from a metro county, and percent who commute to a metropolitan county or a micropolitan city for work.
She said that for the population estimates, no geographic units displayed low reliability (CV > 40). All of the census tracts displayed high reliability (CV < 12), as did 67 percent of county subdivisions and 53 percent of census block groups. Overall, the CVs for this variable were reasonable for any of these geographic units, but tracts performed better than either county subdivisions or block groups.
However, she said, looking at in-migration, or the percentage of the population who moves in from a metro area within the last year, the median CVs for all geographic units (even counties) are greater than 40 percent. Only 18 percent of counties displayed high reliability and 0 percent of county subdivisions, tracts, or block groups had high reliability. Low reliability was observed in 36 percent of counties, 86 percent of county subdivisions, 60 percent of census tracts, and 88 percent of census block groups. In other words, for the migration variable, ACS data are not reliable enough to make meaningful classifications, even at the county unit of analysis. The data are even less reliable for small geographic units, she said.
Data on commuting are much better than data on migration, she said, but not quite as good as the population estimates. Still, a significant proportion of the geographic units had low reliability, even at the county level (45 percent of counties). Census tracts performed better than counties, county subdivisions, or block groups, with 0 percent showing low reliability.
Winkler said the ACS is not an official population count or estimate, nor is it the basis for classifying urban or rural areas by the Census
Bureau, which is done with the decennial Census. Estimates vary in their accuracy or reliability based on both the variable considered and the geographic unit. She said the population estimates are quite good, but there is a question about seasonal residence. If an analyst does not want to include seasonal residents, she asked, then why not just use the decennial census? Winkler stated she would not trust using in-migration data even at the county level. Commuting data are mostly acceptable for overall patterns (not necessarily specific flows), and at the tract level, they are just as good as at the county level.
Winkler pointed to alternative data sources she has used. The Longitudinal Employer-Household Dynamics (LEHD)2 origin-destination files for commuting are based on administrative data built on employer filings for unemployment insurance. They cover about 90 percent of all workers and are available quarterly back to 2000. For migration, Internal Revenue Service data3 are available at the county level. These data are probably much more accurate than data from the ACS, she said. There is the National Land Cover Database (NLCD)4 to look at where urban infrastructure exists on the ground. For MCD states, it is possible to look at the tax base for county subdivision levels.
STATEMENT BY MICHAEL RATCLIFFE
Ratcliffe noted that Mark Perry, Census Bureau, collaborated with him to prepare this presentation about the more frequent availability of local-level data at lower levels of geography and geographic scale.
The history of urban-rural classifications and especially the Census Bureau’s urban-rural classifications since the late 19th century has been one of response to improvements in spatial resolution of data, increased amounts of data, and improved technology. Applying more data at lower levels of geography more frequently does not necessarily produce a better definition of urban and rural, he said.
Ratcliffe focused his presentation on the period from 1950, when urbanized areas of 50,000 or more population, based partly on population density, were introduced. They were used through 1990. That was a period of manual delineation, when urban was defined using planimeters and paper maps, calculating the population densities and the land area of the small enumeration districts and then drawing the boundaries by hand.
In 1990, interactive GIS-based delineation began. The Census Bureau’s
3See https://www.irs.gov/uac/SOI-Tax-Stats-Migration-Data [November 2015].
Geography Division built a GIS and interactively delineated about 660 potential urban areas, with 22 people working for six months looking at block-level densities and drawing boundaries around areas that qualified. When density-based urban clusters of 2,500–50,000 people were added, they moved to an automated, GIS-based delineation system to meet the time requirements. That is how they did it in 2010 as well.
After the 1990 delineation, one of the concerns was that nonresidential urban land uses on the fringes of urban areas were not being accounted for. There were rules within the criteria for accounting for low-density employment centers, downtowns, industrial parks, office parks, and other areas surrounded by high residential density areas. But if the office park or other urban use were on the edge of the urban area, there was not an enclave of low density surrounded by high density. There was an urban land use, perhaps with high densities on one side and low densities on the other. If only looking at population density, the profile of the industrial park looked like the rural land adjacent to it. The Census staff sought other data to help reach those decisions. For the 2010 delineation they used the National Land Cover Database (NLCD) impervious surface layer as a proxy to identify nonresidential urban land uses.
Other Datasets for Defining Rural and Urban
There are various other datasets, Ratcliffe pointed out, as discussed throughout the workshop. In addition to the NLCD, the Census Bureau’s Longitudinal Employer-Household Dynamics (LEHD) is another source. It is annually updated with block-level data on employed persons from the ES-202 files from the states, he explained. It is a synthesized dataset to avoid disclosure. There are some perturbations, but it warrants additional use.
He said that broadband maps are available at the census block level, as well as cellphone and parcel data. There is a lot of good information within the parcel data about the parcel and about the structures on the parcels. Zoning information shows what that parcel can be used for in the future.
He noted the abundance of data on small geographic areas—census blocks, block groups, tracts, grid cells, zip codes/zip code tabulation areas, pixelated data, latitude/longitude coordinates for structures, and GIS and database technologies to manage and manipulate large quantities of data. These provide the ability to measure distances between structures and other data points. It is possible to get down to the individual level with some data now available.
Ratcliffe next described the change in the urban population as a percentage of total from 1790 to 2010. In thinking about the rural definitions,
Ratcliffe and Perry broke the data into three eras: a relatively flat trend, then a steep trend, and then another flattening trend. In the early stages of the United States, rural was the norm, and urban consisted of cities and smaller towns of more than 2,500 people that served as market centers for a larger rural region. It made sense to start to think of what is urban as distinct from the rural landscape, he said. The industrialization period was characterized by rural-to-urban migration, increasing suburbanization, and a separation of urban from rural; an urban area had a distinctive footprint on the landscape. The third era is the post-industrial era: suburban, with exurban growth. Urban is the norm, he said. With 81 percent of the population urban, the question he posed is, what is rural?
Considering the Data “Landscape”
In closing, Ratcliffe raised several questions. He noted that it is possible to measure the landscape and define urban and rural with great precision, but to what purpose? Does the application of more data at lower levels of geography improve the ability to define rural? Is it meaningful for analysis and policy, he asked. He said he saw these things as integrated, and that good research is needed to inform policy.
He said that rural was once defined by physical isolation. But in an increasingly connected society, is rural really social isolation? He suggested it is time to rethink what is meant by rural, and perhaps to define rural with urban as the residual, not just in geographic and proximity terms, but perhaps sociologically and economically.
Michael Partridge referred to Isserman’s classification and said that for the kinds of analysis he does, it is conceptually correct to use nonmetropolitan as rural. In his studies, he said, rural is where people are not functionally integrated with an urban center. He raised a few concerns he had with the classification as related to his studies. Low responded by reiterating that one definition of rural and urban is not going to work. Part of the problem with the current metropolitan definition, she said, is that too many counties that are very rural in character are now classified as metropolitan, especially following the 2000 and 2010 censuses. If the definition of metropolitan and nonmetropolitan were a little different, she said she would be more content with metropolitan-nonmetropolitan as a proxy for rural-urban, which is why she would like to see an alternative.
Bruce Weber suggested a classification system that deals with proximity to urban places. Low suggested the ERS codes that take both density and proximity into consideration are a happy medium.
John Pender commented that in some research, density is of concern. He suggested separate measures if the purpose is to find out what effect commuting has versus what effect being in a dense area has. The two, he said, do not have to be compounded into one measure.
David Plane asked Murray about his research into rural air service. Air service is one of the critical functions of proximity and access, he observed. Murray responded that, in looking at the essential air service subsidy system, much of the evaluation rested on measures of success for this historical legacy program, and some of those measures have nothing to do with spatial proximity. The program focuses on performance: being within a rural area, and therefore eligible for the program, and then continuing to receive funding.
David McGranahan asked Ratcliffe about cellphone data. Ratcliffe responded that information collected through cellphones is being used to locate individuals in space and in time. Researchers are starting to work with data collected from the GPS units of cellphones, tracking people’s daily movements within and across urban-rural areas, within cities, and so on, looking at densities. He described it as an emerging area of interest among big data researchers. People generate an incredible amount of information every day that could be accessed from many different sources, although questions remain about the accuracy, volume, and velocity.
Ratcliffe noted that when he and his colleagues were looking at the redefinition of metropolitan areas in 2000, they asked about the frequency of updating. They concluded that every year is too fast because by the time of an update, they would be issuing a new definition. Five years seemed the right amount of time, he said, but he was not sure if OMB will choose to conduct updates every five years.
Murray expressed concern that many new data sources, such as from cellphones, collect individual data without a person’s knowledge. People do not have control over their data, and it is not clear whether people’s identities will be protected if the data are released.
Hardcastle commented that cellphone and even employment data have coverage issues. Some carriers want to provide the data or enter into agreements to provide the data, but not all carriers will, and there is also the issue of who uses cellphones. For employment data, there are differential coverage rates across time and jurisdictions between Bureau of Economic Analysis and Bureau of Labor Statistics data, and that coverage fluctuates.