Showing posts with label population. Show all posts

April 09, 2010

Mapping Wikipedia Biographies

The map below is a visualisation of references to places within 423,846 biography articles in the English version of Wikipedia. The definitions of these terms and the methodology used to obtain these data are discussed in more detail below.


Now compare this map to the map of actual population density below.


The differences are quite astonishing. What one sees is that articles about people in Wikipedia are highly likely to reference particular parts of the world (the US and Western Europe). This is a geography of people that is in no way reflective of the actual distribution of population on our planet.

Of course, because the data only include biography articles in the English version of Wikipedia, they are biased towards English-speaking countries. This fact helps explain the concentration of articles that reference the US and the UK. However, language alone does not explain why countries where English is widely used (e.g. India) have a smaller presence than non-English-speaking countries in Western Europe.

Most importantly, it is clear that Wikipedia has not yet attained its goal of storing "the sum of all human knowledge." Wikipedia guidelines specify that biographies should only be about notable people, and this map suggests that there are more notable people in Europe and North America (at least in the eyes of Wikipedians). Not to knock our home continents, but it seems likely (especially after looking at some of the people deemed "notable") that Wikipedia is simply reflecting its user base, who are disproportionately from these places.

In any case it shows that there are likely still a lot of possibilities out there for new Wikipedia articles (despite claims that Wikipedians are running out of new topics to write about).

And in the big picture it again raises questions about who participates in online discussions and what is discussed and documented in these conversations.

The data used to create these maps were collected by Adrian Popescu and are available here for anyone interested in playing with them. The data were actually collected through a rather complicated process that we'll explain below.

First of all, we need to define biography articles: basically, any article about a person in Wikipedia (e.g. Angela Merkel, Ron Jeremy or Gary Brolsma). A list of biographies was created using data harvested from the list of occupations.

We then geolocated each biography article. This was done by counting the number of references to place names in each person's biography and then mapping only the most mentioned place in each article. Place names were ranked using not only the English version of the article but also its equivalents across up to seven languages (English, German, French, Dutch, Spanish, Italian and Portuguese). The thinking behind this method of ranking is simple: the more article versions mention a given location, the more relevant that location is to the concept. We have, however, also done some analysis with the 2nd, 3rd etc. most mentioned places in each article and will be publishing a post on this work soon, along with analysis of Wikipedia data by century and the geography of specific occupations (e.g. artists, politicians and footballers) within the encyclopaedia.
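The ranking idea can be sketched in a few lines of Python. This is only an illustration, not the code used to build the dataset: the function name `rank_places`, the input format and the tie-breaking rule (total mentions, then alphabetical order) are all assumptions, and the sample place names are made up.

```python
from collections import Counter


def rank_places(mentions_by_language):
    """Rank place names for one biography.

    mentions_by_language maps a language code to the list of place names
    found in that language version of the article (hypothetical format).
    Places mentioned in more language versions rank higher; ties are
    broken by total mentions, then alphabetically (an assumption).
    """
    versions = Counter()  # number of language versions mentioning each place
    totals = Counter()    # total mentions across all versions
    for places in mentions_by_language.values():
        totals.update(places)
        versions.update(set(places))
    return sorted(versions, key=lambda p: (-versions[p], -totals[p], p))


# Made-up example: three language versions of one biography.
mentions = {
    "en": ["Hamburg", "Berlin", "Berlin"],
    "de": ["Berlin", "Hamburg", "Hamburg"],
    "fr": ["Berlin", "Paris"],
}
ranking = rank_places(mentions)
print(ranking[0])  # the single place that would be mapped for this article
```

Returning the full ranking rather than only the top entry also makes the 2nd and 3rd most mentioned places available for the kind of follow-up analysis mentioned above.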

It is clear that this method favors European locations at the expense of places in the rest of the world. Japanese and Arabic Wikipedias, for example, probably have a very different geography (something we are also working on mapping). The fact remains though, that the English language Wikipedia offers us a very particular worldview rather than access to "the sum of all human knowledge" (for the time being at least).

Hmmm....that reminds us, we should start up a Floatingsheep page at Wikipedia some time soon.

See also:

Adrian's analysis of Wikipedia: Adrian Popescu and Gregory Grefenstette, "Spatiotemporal Mapping of Wikipedia Concepts", JCDL 2010, June 21-25, Brisbane, Australia

...and some of our previous work on mapping Wikipedia here.

February 24, 2010

The many guns of urban America

God and guns keep us strong
That's what this country was founded on
Well we might as well give up and run
If we let them take our God and guns
-Lynyrd Skynyrd, "God and Guns"
As we have shown in earlier maps (here and here) guns have become a central fixture of the American landscape.

And often proponents of the Second Amendment are associated with a predominantly rural, religious and conservative population as exemplified by the above song lyric. Whether or not this is because rural Americans are 'bitter', the stereotype remains pervasive. However, when we map the number of user-generated Google Maps placemarks mentioning the word "gun", a much different pattern emerges.


Absolute Number of Guns in User-Generated Placemarks



Although the smaller dots peppered throughout the rural United States certainly show that guns maintain a presence in the rural landscape, the highest concentrations of guns in user-generated placemarks are undoubtedly found in the nation's urban centers.

Relative Specialization in Guns in User-Generated Placemarks


By focusing instead on those places with a higher-than-average number of placemarks with the word "gun", the concentration in urban areas becomes more obvious - rural areas are all but wiped off the map of indexed values. A plausible explanation would simply say that the prevalence of guns is more a function of population (more references to guns because there are more people) than of a stylized cultural trait.
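One simple way to build this kind of index is to divide each place's count by the average count across all places and keep only the values above 1. The sketch below assumes that definition, which may differ from the indexing actually used for the map; the function names and the sample counts are hypothetical.

```python
def index_to_average(counts):
    """Express each place's count as a multiple of the mean count."""
    if not counts:
        return {}
    mean = sum(counts.values()) / len(counts)
    if mean == 0:
        return {place: 0.0 for place in counts}
    return {place: n / mean for place, n in counts.items()}


def above_average(counts):
    """Keep only places whose indexed value exceeds 1 (the average)."""
    return {p: v for p, v in index_to_average(counts).items() if v > 1.0}


# Made-up placemark counts for the word "gun" at three locations.
gun_counts = {"Big City": 30, "Small Town": 6, "Farmland": 0}
print(above_average(gun_counts))  # only "Big City" survives the cut
```

Because the index is based on raw counts, it rewards places with many people, which is exactly the population effect discussed above.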

Or could the differences in user-generated content be explained, at least in part, by a digital divide between urban and rural Americans? For example, rural Americans could simply be too busy actually using their guns to worry about adding user-generated placemarks to Google Maps? We should also note that the meaning of a reference to the word "gun" in a placemark is not straightforward. In other words, it could be a protest against guns or, alternatively, an affirmation of them.

Unfortunately, we end with an entirely new set of questions and are left clinging to conjecture, just as much of America remains clinging to its guns.

December 11, 2009

Finding a Restaurant

Finding a restaurant can be one of the most vexing tasks in modern life, and an extremely useful application of Google Maps is getting help locating nearby establishments. The map below shows the number of user-generated placemarks containing the word "restaurant". The density of restaurant references corresponds closely with the distribution of population in the United States and Canada. In particular, the densely populated Northeast is blanketed, with New York City containing the largest concentration.
When user-generated placemarks are compared to regular Google Maps directory listings, one sees essentially the same pattern of clusters, albeit at a higher density. For example, the largest number of directory listings of restaurants (again in New York City) is about 25 percent higher than the largest number of user-generated ones. Moreover, more rural areas (see the eastern U.S.) clearly have a high number of directory listings relative to user-generated ones.
This suggests that user-generated placemarks are biased towards urban areas, where early adopters of the technology are most likely to live and use it.

December 10, 2009

Swine flu: a user-generated pandemic?

In a recent post at 538.com, Nate Silver delves into mapping the spatio-temporal diffusion of swine flu in the US, via Google Flu Trends. Drawing from queries referencing swine flu, the map below shows the approximate date at which state-wide searches for "swine flu" crossed a particular threshold, potentially signifying the onset of what has become a swine flu pandemic. According to Silver, the date at which the relative number of searches reaches the indexed value of 5000 serves as a proxy for measuring the diffusion of the year's most talked about genetic mix-up.
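Finding that date for a single state amounts to scanning its time series for the first value at or above the threshold. The following is a minimal sketch under that assumption; the function name `first_crossing`, the `(date, value)` input format and the sample numbers are invented for illustration and are not Google Flu Trends data.

```python
from datetime import date


def first_crossing(series, threshold=5000):
    """Return the earliest date whose indexed value reaches the threshold.

    series is an iterable of (date, value) pairs in any order.
    Returns None if the threshold is never reached.
    """
    for day, value in sorted(series):
        if value >= threshold:
            return day
    return None


# Made-up indexed search values for one state.
state_series = [
    (date(2009, 4, 27), 1200),
    (date(2009, 4, 29), 5600),
    (date(2009, 4, 28), 4100),
]
print(first_crossing(state_series))  # 2009-04-29
```

Running this once per state gives one date per state, which is the quantity a map of this kind would shade.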
So we know when and where people were looking for information about swine flu, but what about geo-references to the virus? How does the geography of swine flu differ between Google Flu Trends and user-generated Google Maps placemarks? How do Google's multiple representations compare to the actual number of cases of swine flu in the United States?

Although the CDC has stopped collecting data on the outbreak of swine flu on a state-by-state basis, the regional-level data in the map above shows the concentration of swine flu cases. The upper Midwest, for example, which has the highest number of swine flu infections in the country, only recently surpassed the 5000 point mark on Google Flu Trends. Clearly the act of searching for information on swine flu need not closely correspond to the number of cases. And while this region shows significant clustering in user-generated Google Maps placemarks, the values fail to approach the maximums for the nation as a whole. The peer produced geography of swine flu also seems to support CDC statistics for the southeastern US (showing a relatively high infection rate), while the Flu Trends data fails to match accordingly both there and along the US-Mexico border.
The greatest number of mentions of swine flu in user-generated placemarks is located in Baltimore, Maryland, part of Region 3, which is home to the second-most cases of swine flu in the US. However, as one moves up the DC-Philadelphia-NYC-Boston metropolitan corridor there is an increasing disconnection between the online representations and material reality of swine flu. Although the absolute and population-adjusted numbers of actual swine flu cases in Regions 1 and 2 (home to Boston and New York respectively) are relatively low compared to other regions, these regions are highly visible in terms of user-generated placemark references to H1N1 or swine flu.
The population-adjusted map does, however, give a much clearer picture of the swine flu landscape in the US. Both the West Coast and upper Midwest, despite having the highest incidence of swine flu in the country, were previously overshadowed by the population centers of the East Coast. Normalized by population, the placemark density comes to mirror much more closely the actual diffusion of swine flu across the country.

December 08, 2009

Toronto and Cape Cod are the "funnest" places in North America

These maps illustrate the distribution of "fun" in North America as defined by user generated placemarks containing the term. Luckily for society, fun seems to be well dispersed and corresponds with the distribution of population. In other words, where there are people there is also fun. But one can also see concentrations and specializations in fun.


For example, Toronto has a massive (dare we say strategic?) reserve of fun clustered around it. Who knew? I have fond memories of my trips to Toronto but had no idea. The film festival is great, the neighborhoods are fantastic and the underground walkways keep you warm in the winter but how does it all come together to make this mother lode of fun? Jane Jacobs clearly had it right. Perhaps this will become the next invisible export for the region's economy.

Also the Northwest is suspiciously fun. How does that work with all the rain?

Clearly, some means of standardizing "fun" needs to be done to separate the large concentrations from the places that truly specialize in fun. When we use population, i.e., fun per capita, it turns out that Cape Cod; a place outside of Ogden, Utah; and Cancun, Mexico have the most fun per person in North America. But before you start planning a vacation to the Great Salt Lake, remember that the high showing outside of Ogden was largely due to a very small population figure.
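The small-population effect is easy to see in a quick per-capita calculation. The sketch below is illustrative only: the function name, the "per 1,000 residents" scaling, the optional minimum-population cutoff and all of the numbers are assumptions rather than the values behind the maps.

```python
def per_capita(counts, population, per=1000, min_population=0):
    """Placemark mentions per `per` residents for each place.

    Places with no known population, or with fewer residents than
    min_population, are skipped so tiny denominators cannot dominate.
    """
    rates = {}
    for place, n in counts.items():
        pop = population.get(place, 0)
        if pop <= 0 or pop < min_population:
            continue
        rates[place] = n * per / pop
    return rates


# Made-up counts of "fun" placemarks and made-up populations.
fun_counts = {"Big City": 5000, "Tiny Town": 20}
residents = {"Big City": 2_500_000, "Tiny Town": 400}

print(per_capita(fun_counts, residents))
# Tiny Town tops the list (50 per 1,000) despite only 20 placemarks.
print(per_capita(fun_counts, residents, min_population=10_000))
# With a cutoff, only Big City (2 per 1,000) remains.
```

A cutoff like this is one possible guard against outliers; whether to apply one, and at what level, is a judgment call.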

November 07, 2009

Where in the world is Barack Obama? (and John McCain, too!)

To follow up on our previous map showing the difference in the number of mentions between Barack Obama and John McCain in user-generated Google Maps content prior to the 2008 US Presidential Election, we figured an alternative visualization might be beneficial. The following maps represent the absolute number of mentions of Obama and McCain, respectively, in user-generated placemarks, a disaggregation of the map in our previous post.
This map, much like the previous iteration, shows the vast concentration of user-generated placemarks mentioning Obama in the nation's urban centers. The nation's largest cities - New York City, Los Angeles and Chicago - all appear prominently in this map. Although many of the notable points in both the Obama and McCain maps can be attributed to large populations (and thus, presumably, a greater level of connectedness), a number of other explanations remain necessary. Besides being the 3rd largest city in the United States, Chicago is also the home of Barack Obama, and it houses the highest concentration of placemarks that mention his name. Significant events also seem to assert their presence spatially, as Denver, Colorado, the site of the 2008 Democratic National Convention, is another relatively well-represented area, along with Portland, Oregon, where more than 70,000 people rallied for Obama in May 2008.
Mirroring the already established pattern of urban primacy, much of McCain's presence is concentrated in the nation's urban centers, again including both New York City and the Washington, DC metro area (where McCain has the highest concentration). Unlike Obama, the places McCain is best represented in Google Maps were not necessarily the places he fared the best during either the primary or general election. For example, both Iowa and Michigan, in which McCain receives a nearly uniform number of mentions across the state, voted against him in both the primary and general elections.

Despite some of these patterns of user-generated content merely confirming the primacy of urban areas in virtual representations of the material world, others depart significantly from the predicted spatial clustering. Some areas that voted for McCain feature more prominently in the user-generated representations for Barack Obama, and vice versa, with the number of mentions for Barack Obama being more than double the number of mentions for John McCain. Although not all of the patterns displayed can be easily attributed to a particular causal factor, they only further complicate the relational geographies of the virtual and material world.

June 15, 2009

Global Placemark Intensity

The following map shows the intensity of Google placemarks on a global scale. Using custom-designed software, a dataset was created based on a 1/4 degree grid of all the land mass in the world (roughly 250,000 points). For each point a search was run on the numbers “0” and “1” in order to create a baseline measure of the amount of online geo-referenced content in each place. In the map below, every place with more than 100 placemarks is highlighted with a yellow dot.
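For readers curious about the mechanics, here is a rough sketch of how such a grid and threshold might be set up. It is not the custom software mentioned above: the function names are hypothetical, the grid covers the whole globe (filtering down to land points would need a land mask, which is omitted), and the hit counts are made up rather than queried from Google.

```python
def quarter_degree_grid(step=0.25):
    """Yield (lat, lon) points covering the globe at the given spacing."""
    n_lat = int(180 / step) + 1  # includes both poles
    n_lon = int(360 / step)      # -180 up to, but not including, +180
    for i in range(n_lat):
        lat = -90 + i * step
        for j in range(n_lon):
            yield (lat, -180 + j * step)


def highlight(hits, threshold=100):
    """Return the grid points with more than `threshold` placemark hits."""
    return [point for point, n in hits.items() if n > threshold]


# Made-up hit counts for three grid points.
sample_hits = {(40.75, -74.0): 4200, (51.5, 0.0): 950, (-75.0, 10.0): 3}
print(highlight(sample_hits))                  # two points above 100 hits
print(highlight(sample_hits, threshold=1000))  # one point above 1000 hits
```

A full global grid at this spacing has just over a million points, so restricting it to land is what brings the total down to roughly the quarter of a million points described above.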


The same method was used to create a map that highlights every place on the globe containing more than 1000 placemark hits:


When compared to a map of population density (see the map below), the distinct geographies of placemarks become apparent.

Image source: NASA

These maps suggest that the GeoWeb is far from being a simple mirror of population density or human activity. Online representations of the physical world are highly concentrated in North America, Western Europe, and the more affluent parts of East Asia and Australasia. Maps displaying placemark density on regional and local scales will be explored in more detail in the next post.