How to access census data in R

This article shows how to access decennial census data in R using the USCensus20 package.

USCensus20 is an R package that gives you access to US Census data in the form of data packages. The census is a count of the population carried out once every decade, and the most recent US census took place in 2020. Census data can reveal patterns and valuable insights about the population. The aim of the package is to provide researchers with a ready-to-use data kit so that they can focus on the analysis itself rather than on collecting and preparing the data.

This package was developed as part of a Google Summer of Code project and can be downloaded from GitHub.

Instructions

  • Step 1: Install all the necessary prerequisites. Since this package is not available on CRAN (the official repository for R packages), it can be installed from GitHub using the remotes package.
#the remotes package in R allows us to download the R packages stored on github
install.packages("remotes")
remotes::install_github("shreshtha48/USCensus2020")

After installing the package, let us load all the necessary libraries in the console.

  • The sf library is an R package used for handling shapefiles. The data in this package includes shapefiles and redistricting files stored as simple features. ggplot2 is another R package, used for creating visualizations. You can also use the base R plotting functions if you prefer, but ggplot2 offers greater control over grids and layers, which allows for more customization.
library(USCensus20)
library(sf)
library(ggplot2)

Installation of data package

The dataset "alabamacounty20" contains the redistricting file data from the US Census Bureau. This data is available in a ready-to-use format on my GitHub, in a data package named "USCensuscounty20". Wrapper functions to access this data are provided in the "USCensus20" package.

remotes::install_github("shreshtha48/USCensuscounty20")
library(USCensuscounty20)

Alternatively, the USCensus20 package has a function called install_county that downloads the county-level data for you.

# load the data and store it in a variable called data
data <- alabamacounty20
# look at the structure of the data
str(data)

This data frame consists of 400 variables. The documentation explaining what each of these variables means is available on the census website.
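With this many columns, it helps to narrow the table down before doing anything else. The following is a minimal sketch in base R, using a small made-up data frame as a stand-in for the real data. The column names imitate the redistricting codes, but the toy object and all of its values are hypothetical.

```r
# Toy stand-in for a wide census table (all values are made up)
toy <- data.frame(
  NAME     = c("County A", "County B", "County C"),
  P0010001 = c(1200, 5400, 830),
  P0010002 = c(1100, 5000, 800),
  P0050005 = c(12, 60, 4),
  H0010003 = c(40, 150, 22)
)

# Find every column whose name starts with "P001"
p1_cols <- grep("^P001", names(toy), value = TRUE)

# Keep the name column plus the matching variables
subset_p1 <- toy[, c("NAME", p1_cols)]
print(names(subset_p1))

# Quick numeric summary of a single variable
print(summary(toy$P0050005))
```

The same pattern should carry over to the real data frame by swapping toy for the loaded object and adjusting the prefix to the table you are interested in.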

For the purposes of this tutorial, I will be using two variables: the population per county and the number of nursing facilities per county. I also want to analyze which factors from files 2 and 3, i.e. the housing and facilities data, have the most impact on the total population of a county.


Visualizing the census data

Visualization is a crucial aspect of comprehending and displaying data. How you present information is often more critical than the information itself. It's essential to ensure that visualizations are clear and engaging for the audience.

While statistical plots like histograms, pie charts, scatter plots, density plots, and box plots are commonly used to visualize discrete and continuous data, they may not always capture all the details of the data. Below, I propose two lesser-known visualization techniques that are highly effective for visualizing census data.

Creating choropleth maps using the census data

A choropleth map is a type of visualization that uses color shading to represent data aggregated over geographic areas. These maps provide a visual representation of spatial patterns and of how data values vary from one area to the next.

Using latitude and longitude coordinates, choropleth maps can accurately depict data on a map. The shapefiles in this package follow the WGS (World Geodetic System) coordinate reference system.
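If you are ever unsure which coordinate reference system a spatial object uses, sf lets you inspect and change it. The sketch below uses the North Carolina shapefile that ships with the sf package as a stand-in (it is not part of the census packages used in this article), checks its coordinate reference system and reprojects it to WGS 84 (EPSG:4326).

```r
library(sf)

# North Carolina counties shipped with sf, used here only as a stand-in
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Inspect the coordinate reference system the file was stored in
print(st_crs(nc)$input)

# Reproject to WGS 84 (EPSG:4326) so coordinates are longitude/latitude
nc_wgs84 <- st_transform(nc, 4326)
print(st_crs(nc_wgs84)$epsg)
```

The same two calls, st_crs and st_transform, should work on any sf object, including the county data loaded earlier.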

Choropleth maps are particularly useful for visualizing census data because they can effectively display population distribution, demographic trends, or any other statistic across different regions. The color gradients on the map help viewers quickly understand how the data is distributed and compare areas with one another.

  • Creating a basic choropleth map usually involves getting the data, getting the shapefiles, and joining the two, making sure that they share a key column. This is a great option if you are using custom shapefiles or want greater control over your plots. However, we often simply want to display one or two variables as a gradient over a fixed shapefile, and it is not worth the effort to write all of that code. Because the data in USCensus20 is stored as simple features, you can simply use the base R plot command to draw these maps.

  • Here, I am creating a map of the variable P0010001 (total population) with the plot function and storing the result in a variable called Alabama

Alabama <- plot(data["P0010001"])
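For comparison, here is a rough sketch of the longer, manual route described above: read a shapefile, build an attribute table, join the two on a shared key column, and draw the result with ggplot2. It uses the North Carolina shapefile bundled with sf as a stand-in, and the county_values table with its value column is entirely made up for illustration.

```r
library(sf)
library(ggplot2)

# Geometry: North Carolina counties shipped with sf (a stand-in shapefile)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Attributes: a hypothetical table keyed on the county FIPS code
set.seed(1)
county_values <- data.frame(
  FIPS  = nc$FIPS,
  value = runif(nrow(nc), min = 0, max = 100)  # made-up numbers
)

# Join attributes to geometry on the shared key column
nc_joined <- merge(nc, county_values, by = "FIPS")

# Draw the choropleth: one polygon per county, filled by the joined variable
choropleth <- ggplot(nc_joined) +
  geom_sf(aes(fill = value)) +
  scale_fill_viridis_c() +
  theme_minimal()
print(choropleth)
```

The extra code buys you full control over the color scale, theme and layers, which is the trade-off against the one-line plot call.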

Creating hexplots using the census data

Hexplots (hexagonal bin plots) divide the plotting area into uniform hexagonal tiles and shade each tile by the number of observations that fall inside it. They are useful for census and social science data because they show patterns and distributions clearly even when many points overlap, which makes trends and anomalies easy to spot.

  • ggplot2 is an excellent tool for creating hexplots. Here I call the ggplot function, passing in the data and mapping the x and y variables with aes, which tells ggplot2 which column goes on which axis. This creates a blank canvas, to which I add geom_hex to divide the plot area into hexagonal bins

  • To look for patterns, I am using two variables: the institutionalized population and household vacancy

# geom_hex() requires the hexbin package to be installed
d <- ggplot(data, aes(institutional, vacant))
d + geom_hex(bins = 25)
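If you want to try the technique without the census data, the following self-contained sketch builds a hexplot from simulated values. The data frame toy and its columns x_var and y_var are hypothetical and exist only for this illustration.

```r
library(ggplot2)  # geom_hex() also needs the hexbin package installed

# Simulated data: two loosely related variables (all values are made up)
set.seed(2020)
toy <- data.frame(x_var = rnorm(2000, mean = 50, sd = 10))
toy$y_var <- 0.5 * toy$x_var + rnorm(2000, sd = 5)

# Map the variables with aes(), then add hexagonal bins shaded by count
hex_plot <- ggplot(toy, aes(x = x_var, y = y_var)) +
  geom_hex(bins = 25) +
  scale_fill_viridis_c(name = "count") +
  labs(x = "x_var (made up)", y = "y_var (made up)")
print(hex_plot)
```

Lowering bins gives larger hexagons and a smoother picture, while raising it shows finer detail at the cost of more empty tiles.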


Simple Linear Regression using census data

Linear regression is a statistical technique that models the relationship between a dependent variable and one or more independent variables using a line. It is often used to predict trends and outcomes for a continuous variable. Simple linear regression uses a single independent variable and a straight line with the following equation:

$$ y = mx + b $$

where:

  • $y$ is the dependent variable

  • $x$ is the independent variable

  • $m$ is the slope of the line

  • $b$ is the intercept of the line

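The goal of simple linear regression is to find the best-fitting line, that is, the line that minimizes the cost function

$$ J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2 $$

where $n$ is the number of observations, $\theta = (m, b)$ are the parameters of the line, and $h_{\theta}(x^{(i)}) = m x^{(i)} + b$ is the predicted value for the $i$-th observation.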

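An iterative algorithm called gradient descent can be used to minimize the cost function (half the mean squared error between the actual and predicted values). It uses calculus: at each step it computes the gradient of the cost function and moves the parameters a small distance in the opposite direction. For linear regression the cost function is convex, so the minimum it converges to is the global minimum. Note that R's lm function, which is used later in this section, does not use gradient descent and instead solves the least squares problem directly.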
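To make the idea concrete, here is a minimal sketch of gradient descent for simple linear regression in base R. It runs on simulated data (a true slope of 3 and intercept of 2, both made up), and the learning rate and number of steps are arbitrary choices for this example rather than recommended settings. The result is compared against lm as a sanity check.

```r
# Simulated data with a known slope and intercept (values are made up)
set.seed(42)
x <- runif(100)
y <- 3 * x + 2 + rnorm(100, sd = 0.1)
n_obs <- length(x)

# Start from zero and repeatedly step against the gradient of the cost
slope <- 0
intercept <- 0
learning_rate <- 0.5

for (step in seq_len(5000)) {
  residual <- (slope * x + intercept) - y        # predicted minus actual
  grad_slope <- sum(residual * x) / n_obs        # dJ/d(slope)
  grad_intercept <- sum(residual) / n_obs        # dJ/d(intercept)
  slope <- slope - learning_rate * grad_slope
  intercept <- intercept - learning_rate * grad_intercept
}

gd_estimate <- c(intercept = intercept, slope = slope)
lm_estimate <- coef(lm(y ~ x))

print(gd_estimate)
print(lm_estimate)
```

Both approaches should land on practically the same line, because they minimize the same squared error.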

Now that we have covered the basics, let us implement simple linear regression on the data. I want to find out whether the nursing facility variable (P0050005) is related to the county population variable (P0010002). I will start by creating a data frame that contains only the variables I need.

# Data preparation: build a data frame with the two variables of interest
alabama <- data.frame(nurse = alabamacounty20$P0050005, pop = alabamacounty20$P0010002)

With that done, I will plot the regression line using ggplot2 and then fit the model:

# Plotting
ggplot(alabama, aes(x = nurse, y = pop)) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") +
  xlab("Nurse") +
  ylab("Population")

# Fit linear regression model
model <- lm(pop ~ nurse, data = alabama)
summary(model)
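The summary output contains a lot of information, and it is often handy to pull out individual pieces. The sketch below shows one way to do that on a simulated data frame that mimics the structure used above. The toy object and all of its values are made up, so the numbers say nothing about Alabama.

```r
# Simulated stand-in with the same column names (values are made up)
set.seed(123)
toy <- data.frame(nurse = round(runif(67, min = 50, max = 2000)))
toy$pop <- 40 * toy$nurse + rnorm(67, sd = 5000)

toy_model <- lm(pop ~ nurse, data = toy)

# Intercept and slope of the fitted line
estimates <- coef(toy_model)

# Share of the variance in pop explained by nurse
r_squared <- summary(toy_model)$r.squared

# p-value for the slope, taken from the coefficient table
p_value <- summary(toy_model)$coefficients["nurse", "Pr(>|t|)"]

# Predicted pop for a new value of nurse
predicted <- predict(toy_model, newdata = data.frame(nurse = 500))

print(estimates)
print(r_squared)
print(p_value)
print(predicted)
```

Replacing toy_model with the model fitted on the real data should give the corresponding values for the census variables.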


Exploring UScensus20 functions

  1. load_data

    This function loads data for the given parameters, such as the state and the geographic level. You provide the URL, the state and the geographic level you want, and it returns the corresponding data. One major limitation I faced while building this package was the lack of storage and compute. At finer geographic levels such as CDP, block group or block, it is virtually impossible to combine all the data in a single data package because of the size limits of CRAN and GitHub. load_data works around this by letting you fetch the data you need for one or more states at a variety of geographic levels.

load_data(state="Alabama", level="County")

Function parameters:

  • URL: the URL to download the data from. Please make sure that it points to files from the 2010 or 2020 census, since the function has only been verified for those years. By default it points to the 2020 redistricting file data

  • state: the state(s) for which you want the data

  • level: the geographic level for which you want the data. Supported levels are County, County Subdivision, City, Tract, Block Group and Block.


  2. Demographics

     data <- demographics(dem = "P0010001", state = "Alabama", level = "county")
    

The demographics function lets you grab one or more variables for a state.

Function parameters:

  • dem: the demographic variable(s) that you want in your data

  • state: the state(s) for which you want the data

  • statefips: (optional) the FIPS code of the state for which you want the data

  • level: the geographic level for which you want the data


  3. Get FIPS

     get_fips(state, countyname = NULL, level = c("state", "county"))
    

get_fips returns the FIPS code for the given state or, when level is "county", for the county passed in countyname.
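For readers who have not worked with FIPS codes before, the short base R sketch below shows how they are structured: a two-digit state code followed by a three-digit county code. It is an illustration of the code format only and does not call or reproduce get_fips. The example uses Alabama (state code 01) and Autauga County (county code 001).

```r
# Alabama's state FIPS code is 01; Autauga County's county code is 001
state_code <- 1L
county_code <- 1L

# Codes are zero-padded strings, so keep them as character, not numeric
state_fips <- sprintf("%02d", state_code)
county_geoid <- sprintf("%02d%03d", state_code, county_code)
print(state_fips)
print(county_geoid)

# Split a five-character county identifier back into its two parts
state_part <- substr(county_geoid, 1, 2)
county_part <- substr(county_geoid, 3, 5)
print(c(state = state_part, county = county_part))
```

Storing the codes as numbers would drop the leading zeros, which is a common source of failed joins between census tables.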


  4. Geohash

     geohash(state, level = c("state", "county", "tract", "block", "blockgroup"))
    

The geohash function in the USCensus20 package takes a state and a level and returns a geohash for that level. For reference, geohash is an encoding system that turns a geographic location into a short string of letters and digits.
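To give a feel for how the encoding works in general, here is a small base R sketch of the standard geohash algorithm for a single latitude/longitude pair. The helper geohash_encode is a hypothetical name written for this illustration. It is not the package's geohash function and may differ from how the package computes its values.

```r
# Hypothetical helper: standard geohash encoding of one coordinate pair
geohash_encode <- function(lat, lon, precision = 6) {
  alphabet <- strsplit("0123456789bcdefghjkmnpqrstuvwxyz", "")[[1]]
  lat_range <- c(-90, 90)
  lon_range <- c(-180, 180)
  hash <- character(0)
  bits <- 0L        # value of the 5-bit group being built
  n_bits <- 0L      # number of bits collected in the current group
  use_lon <- TRUE   # bits alternate, starting with longitude

  while (length(hash) < precision) {
    if (use_lon) {
      mid <- mean(lon_range)
      if (lon >= mid) {
        bits <- bits * 2L + 1L
        lon_range[1] <- mid
      } else {
        bits <- bits * 2L
        lon_range[2] <- mid
      }
    } else {
      mid <- mean(lat_range)
      if (lat >= mid) {
        bits <- bits * 2L + 1L
        lat_range[1] <- mid
      } else {
        bits <- bits * 2L
        lat_range[2] <- mid
      }
    }
    use_lon <- !use_lon
    n_bits <- n_bits + 1L

    # Every 5 bits become one base-32 character
    if (n_bits == 5L) {
      hash <- c(hash, alphabet[bits + 1L])
      bits <- 0L
      n_bits <- 0L
    }
  }
  paste(hash, collapse = "")
}

example_hash <- geohash_encode(lat = 57.64911, lon = 10.40744, precision = 5)
print(example_hash)
```

Each extra character narrows the location down to a smaller cell, so nearby places usually share a common prefix.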

  5. Metadata

     metadata(state, statefips = FALSE, name, level = c("city", "county", "tract"))

The metadata function pulls out the metadata for the given state.


  6. MSA

     MSA(msaname = "Greater Los Angeles", state = "California")
    

    MSA stands for metropolitan statistical area, a region made up of a densely populated urban core and the surrounding communities that are closely tied to it. This function pulls out the redistricting file data for the MSA(s) you request.

These are the main helper functions in the USCensus20 package. There are some other functions as well, such as check_state and the data dictionary. In case you have any questions about the package or suggestions for new features, please leave a comment below. Alternatively, feel free to open an issue on the GitHub repository at https://github.com/shreshtha48/USCensus2020