Monday, July 21, 2014

Migration

A while back at Ordinary Times, there was an interesting comment thread on the subject of defining the Midwest region of the US.  One of the thoughts that occurred to me while reading that was whether it was possible to define regions based on inter-state migration patterns.  The idea grew, I suppose, out of my own experience.  I lived and worked in New Jersey for ten years, but never really felt like I fit in there.  Eventually my wife and I moved to Colorado, to the suburbs of Denver, where we immediately felt right at home.  Most people, I thought, might have been brighter than we were and not moved to someplace so "different."

I've also encountered a variety of nifty data visualization tools that look at inter-state migration in the US, like this one and this one from Forbes.  State-level data for recent years turns out to be readily available from the Census Bureau.  We can define a simple distance measure: two states are close if a relatively large fraction of the population of each moves between them each year.  "Relatively" because states with large populations have large absolute migration numbers in both directions.  For example, large numbers of people move between California and Texas -- in both directions -- because those states have lots of people who could move.  From Wyoming, not so many.  Given a distance measure, this becomes a statistical problem in cluster analysis: partition the states into groups so that states within a group are close to each other.  Since there's only a distance measure, hierarchical clustering seems like a reasonable choice.
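The post does not spell out the exact formula, so the following is only a sketch of one measure with the property described: it adds the share of each state's population that moves to the other and inverts the sum. The function name `migration_distance` and all of the numbers are made up for illustration; they are not Census figures.

```python
def migration_distance(flow_ab, flow_ba, pop_a, pop_b):
    """Distance that shrinks as a larger share of each state moves to the other.

    Assumed form (not taken from the post): 1 / (flow_ab/pop_a + flow_ba/pop_b).
    """
    share = flow_ab / pop_a + flow_ba / pop_b
    return float("inf") if share == 0 else 1.0 / share


# Invented round numbers: a pair of large states with large absolute flows,
# and a pair of small states with much smaller absolute flows.
big_pair = migration_distance(60_000, 40_000, 38_000_000, 26_000_000)
small_pair = migration_distance(3_000, 2_500, 600_000, 700_000)

# The small pair comes out closer even though far fewer people moved,
# because the movers are a larger fraction of each population.
print(big_pair, small_pair)
```

Dividing by population is what keeps the big states from looking close to everyone simply because they have more people who could move.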

The map to the left shows the results of partitioning the 48 contiguous states into seven clusters.  The first thing I noticed about the partition is that states are grouped into contiguous blocks, without exception.  While that might be expected as a tendency [1], I thought there would be at least a couple of exceptions.  The resulting regions are more than a little familiar: there's the Northeast, the Mid-Atlantic, the Southeast, the Midwest (in two parts), the West, and "Greater Texas".  There are a couple of other surprises after reading the discussion at Ordinary Times: Kentucky is grouped with the Midwest, and Missouri and Kansas with Greater Texas.  New Mexico clustered with Texas isn't surprising, but New Mexico with Louisiana and Arkansas?  Hierarchical clustering is subject to a chaining effect: New Mexico may be very close to Texas, and Louisiana also close to Texas, and they get put into the same cluster even though New Mexico and Louisiana aren't very close at all.

One way to test that possibility is to remove Texas from the set of states.  The result of doing that is shown to the left. As expected, New Mexico is now clustered with the other Rocky Mountain states and Louisiana with the Southeast.  Perhaps less expected is that the other four states -- Arkansas, Kansas, Missouri, and Oklahoma -- remain grouped together.  None of them is split off to go to other regions; the four are close to one another on the basis of the measure I'm using here.
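The chaining effect and the remove-the-hub test can be illustrated with a toy example. The sketch below assumes a single-linkage view of clustering, in which two states share a cluster whenever a chain of sufficiently close pairs connects them; the post does not say which linkage was used. The distances and the `components` helper are hypothetical.

```python
def components(names, dist, threshold):
    """Group names that are linked by any chain of pairs no farther than threshold."""
    groups = {n: {n} for n in names}
    for pair, d in dist.items():
        a, b = tuple(pair)
        # Skip pairs that are too far apart or involve a state that was left out.
        if d > threshold or a not in groups or b not in groups:
            continue
        if groups[a] is not groups[b]:
            merged = groups[a] | groups[b]
            for n in merged:
                groups[n] = merged
    return {frozenset(g) for g in groups.values()}


# Invented distances: NM and LA are each close to TX but far from each other.
dist = {
    frozenset(("NM", "TX")): 1.0,
    frozenset(("LA", "TX")): 1.0,
    frozenset(("NM", "LA")): 9.0,
}

with_tx = components(["NM", "TX", "LA"], dist, threshold=2.0)
without_tx = components(["NM", "LA"], dist, threshold=2.0)
print(with_tx)     # one group: NM and LA are chained together through TX
print(without_tx)  # two groups: with the hub removed, the chain is broken
```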

Answers to random anticipated questions... I used seven clusters because that was the largest number possible before there was some cluster with only a single state in it [2].  The Northeast region has the greatest distance between it and any of the other regions.  If the country is split into two regions, the dividing line runs down the Mississippi River.  If into three, the Northeast gets split off from the rest of the East.  There are undoubtedly states that should be split, e.g., western Missouri (dominated by Kansas City) and eastern Missouri (dominated by St. Louis); a future project might be to work with county-level data.
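The rule for picking the number of clusters can be written down in a few lines. This is a sketch only: it assumes the partitions for each candidate cluster count have already been computed, and the helper name and the toy partitions (placeholder letters, not states) are hypothetical.

```python
def largest_k_without_singletons(partitions):
    """Return the largest cluster count whose partition has no one-member cluster.

    `partitions` maps a cluster count k to a list of clusters (sets of names).
    """
    ok = [k for k, clusters in partitions.items()
          if all(len(c) > 1 for c in clusters)]
    return max(ok) if ok else None


# Toy partitions of six placeholder items at k = 2, 3, and 4.
partitions = {
    2: [{"A", "B", "C", "D"}, {"E", "F"}],
    3: [{"A", "B"}, {"C", "D"}, {"E", "F"}],
    4: [{"A", "B"}, {"C"}, {"D"}, {"E", "F"}],
}

best_k = largest_k_without_singletons(partitions)
print(best_k)
```

In a hierarchy built by successive merges, a cluster that is a singleton at some k stays a singleton at every larger k, so taking the maximum over the singleton-free counts matches stopping just before the first singleton appears.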


[1]  My implementation of hierarchical clustering works from the bottom up, starting with each state being its own cluster and merging clusters that are close.  Using the particular measure I defined, close pairs of states include Minnesota/North Dakota, California/Nevada, Massachusetts/New Hampshire, and Kansas/Missouri.  These agree with my perception of population flows.

[2]  The singleton when eight clusters are used is New Mexico.  When ten clusters are used, Michigan also becomes a singleton, and Ohio/Kentucky a stand-alone pair.
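The bottom-up procedure described in note [1] can be sketched as follows. This is not the implementation used for the maps: the linkage rule is an assumption (single linkage, chosen because it is the variant most prone to the chaining effect discussed above), and the names and distances are invented placeholders.

```python
from itertools import combinations


def agglomerate(names, dist, k):
    """Start with one cluster per name and merge the closest pair until k remain."""
    clusters = [frozenset([n]) for n in names]

    def gap(a, b):
        # Single linkage (assumed): distance between the closest pair of members.
        return min(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: gap(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return set(clusters)


# Invented symmetric distances between five placeholder items.
names = ["A", "B", "C", "D", "E"]
dist = {
    frozenset(("A", "B")): 1.0, frozenset(("A", "C")): 4.0,
    frozenset(("A", "D")): 8.0, frozenset(("A", "E")): 9.0,
    frozenset(("B", "C")): 3.0, frozenset(("B", "D")): 8.0,
    frozenset(("B", "E")): 9.0, frozenset(("C", "D")): 7.0,
    frozenset(("C", "E")): 8.0, frozenset(("D", "E")): 2.0,
}

two_clusters = agglomerate(names, dist, k=2)
print(two_clusters)
```

Here A and B merge first, then D and E, and C finally joins A and B because its nearest neighbor is B.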
