Untangling XML data with R

And parsing it into a tidy format

data

governance

Published

July 14, 2022

XML, or (e)Xtensible Markup Language, can be a pain to get into a tabular format for analysis. It is designed to contain any “arbitrary” structure¹ of data which makes it useful for storing things like financial transactions, webpage content and document formatting². Whenever I’ve worked with XML in the past, there seemed to be a dearth of resources describing how to actually transform the data into something usable. This post aims to remedy that by describing how to parse an XML file from GovInfo’s congressional bill data.

What you’ll learn:

Load and parse XML files in R using the {XML} and {xml2} packages
Explore the structure of XML documents
Transform your data from XML into a tabular format using {purrr}, and make use of {tidyr}’s list columns

TL;DR

Transforming XML into a tabular format can be tricky depending on the structure. Start by understanding how the data is organized, what you want it to look like, and what level of detail you are looking at. This post uses GovInfo’s congressional bill data to explore examples of XML structure.

Navigate XML

Look through any available documentation, always. When you read in the XML file, xmlParse(), xml_structure(), and xml_contents() are some of the functions that will help you navigate the structure you’re dealing with.

[Define the output]

Consider what you want the output to look like and think about how the data needs to be transformed to match this. Once you understand the existing structure, the data you need, and your desired output structure, it is time to start parsing.

[Parse a single element] (write a function if it gets too complicated)

Get one element into the form you want first. Writing functions can help you think through the data transformations being applied and make your code easier to read. You can see an example of this functional approach in parsing legislative actions. Once you’re reliably parsing multiple elements and producing a consistent output format, you can scale up.

Apply to all elements

Next, focus on processing a single file in full (or the subset of the file you’re interested in). You might want an XML file to return a single row, a single column, or a data frame of size \(n\times k\). Once you have a single file returned in the format you want, you can combine the outputs of all the files in your dataset.

What is XML?

At its core, XML is “information wrapped in tags”. You can think of the structure like branches of a tree, with each branch having their own leaves. Below is an example of a bookstore’s data in XML format from w3schools.

<bookstore>
  <book category="children">
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>

<bookstore> is the parent, or root element here. The two <book> elements are the bookstore’s children, each of which has individual data points in child elements (e.g. <title>) and attributes e.g. <book category="children"> store books’ category.

Note

Attributes are a useful way to store data related to a specific element for HTML, but using child elements is typically preferred in XML³. However, you’ll still likely run into data stored in attributes so we’ll touch on extracting data from both.

Navigate XML

If you’ve never used R before, this will be a bit of a rough point to start learning. Instead I would recommend checking RStudio’s beginners guide, so for the remainder of the post I will assume some familiarity with R.

Four packages I use throughout this post are listed below. If you don’t understand how a function works or want to see the function reference, click on the function in the code and it should bring you to the function’s help page.

{XML} and {xml2} provide functions for reading in and parsing XML⁴
{purrr} and {tidyr} (loaded as part of the {tidyverse} package) help us wrangle lists and transform XML to data frames.

Load libraries

# copy the comment below to install the packages
# install.packages(c("tidyverse", "XML", "xml2))
library(tidyverse)
library(XML)
library(xml2)

To illustrate complexities of XML structures I’ll be using congressional bill status data from the US GPO’s GovInfo service. In particular, you can download the sample bill here (chosen for its examples of challenges parsing XML).

First we need to read in the XML file. You can pass the URL or file path of the XML file to read_xml() to create an xml_document object⁵. To see the contents of the file you’ve just loaded you can use xml_contents() or enter the object’s name (here bill_xml) in the console.

Load and view XML file

# Read XML file
bill_xml = read_xml("https://www.govinfo.gov/bulkdata/BILLSTATUS/117/hr/BILLSTATUS-117hr391.xml")

# Look at xml_document's contents
xml_contents(bill_xml)

{xml_nodeset (2)}
[1] <bill>\n  <billNumber>391</billNumber>\n  <createDate>2021-01-22T08:12:10 ...
[2] <dublinCore xmlns:dc="http://purl.org/dc/elements/1.1/">\n  <dc:format>te ...

We have two top-level nodes - <bill> and <dublinCore>⁶. The latter is actually a metadata identifier and doesn’t contain bill-related information. xml_child() selects the <bill> node which returns an XML nodeset that looks like this:

Select first matching child element

(bill_node = xml_child(bill_xml, "bill"))

{xml_node}
<bill>
 [1] <billNumber>391</billNumber>
 [2] <createDate>2021-01-22T08:12:10Z</createDate>
 [3] <updateDate>2022-09-07T13:37:22Z</updateDate>
 [4] <originChamber>House</originChamber>
 [5] <billType>HR</billType>
 [6] <introducedDate>2021-01-21</introducedDate>
 [7] <congress>117</congress>
 [8] <constitutionalAuthorityStatementText><![CDATA[<pre>[Congressional Recor ...
 [9] <recordedVotes>\n  <recordedVote>\n    <rollNumber>188</rollNumber>\n    ...
[10] <committees>\n  <billCommittees>\n    <item>\n      <systemCode>ssfr00</ ...
[11] <committeeReports/>
[12] <relatedBills/>
[13] <actions>\n  <item>\n    <actionDate>2021-07-12</actionDate>\n    <commi ...
[14] <sponsors>\n  <item>\n    <bioguideId>C001078</bioguideId>\n    <fullNam ...
[15] <cosponsors>\n  <item>\n    <bioguideId>C000266</bioguideId>\n    <fullN ...
[16] <cboCostEstimates>\n  <item>\n    <pubDate>2021-05-11T15:11:11Z</pubDate ...
[17] <laws/>
[18] <notes/>
[19] <policyArea>\n  <name>International Affairs</name>\n</policyArea>
[20] <subjects>\n  <billSubjects>\n    <legislativeSubjects>\n      <item>\n  ...
...

Define the output

Now we have the bill XML loaded and we’ve already stripped away some data we don’t need but before diving deeper, it’s helpful to think about the how you want to the processed XML data to be structured. For example, in the bill status data I might want to collect party-level votes. This could take the form of one row per party per bill with a column for yeas, nays, and abstentions⁷. Perhaps I want to look at the policy areas of bills which became law. Instead of processing the entire file and then selecting the data I want, I can select only the nodes I need at the outset before transforming the data into a data frame. This will likely save you some time in cases where you have a lot of data.

With the congressional bills data, I defined my output as a bill-level dataset, i.e. one bill per row, storing nested elements like recordedVotes, committees, and actions inside list columns. While columns in a data frame are typically atomic vectors⁸ of equal length, list columns enable us to store data of varying length and type, including data frames themselves. Using {tidyr}’s unnest() function, we can expand this nested data into regular data frame columns. Here is what this output will look like:

Here’s a glimpse() of some of the nested data:

List of 1
 $ : tibble [2 × 6] (S3: tbl_df/tbl/data.frame)
  ..$ committee_system_code  : chr [1:2] "ssfr00" "hsfa00"
  ..$ committee_name         : chr [1:2] "Foreign Relations Committee" "Foreign Affairs Committee"
  ..$ committee_chamber      : chr [1:2] "Senate" "House"
  ..$ committee_type         : chr [1:2] "Standing" "Standing"
  ..$ committee_activities   :List of 2
  ..$ subcommittee_activities:List of 2

List of 1
 $ : tibble [1 × 8] (S3: tbl_df/tbl/data.frame)
  ..$ roll_number     : chr "188"
  ..$ url             : chr "https://clerk.house.gov/evs/2021/roll188.xml"
  ..$ full_action_name: chr "Final Passage Under Suspension of the Rules Results"
  ..$ chamber         : chr "House"
  ..$ date            : chr "2021-06-28T23:46:02Z"
  ..$ session_number  : chr "1"
  ..$ vote_roll       :List of 1
  ..$ roll_found      : logi TRUE

List of 1
 $ : tibble [14 × 9] (S3: tbl_df/tbl/data.frame)
  ..$ action_date                 : Date[1:14], format:  ...
  ..$ action_committee_system_code: chr [1:14] "ssfr00" NA NA NA ...
  ..$ action_committee_name       : chr [1:14] "Foreign Relations Committee" NA NA NA ...
  ..$ action_source_code          : chr [1:14] "0" "2" "2" "9" ...
  ..$ action_source_name          : chr [1:14] "Senate" "House floor actions" "House floor actions" "Library of Congress" ...
  ..$ action_text                 : chr [1:14] "Received in the Senate and Read twice and referred to the Committee on Foreign Relations." "Motion to reconsider laid on the table Agreed to without objection." "On motion to suspend the rules and pass the bill, as amended Agreed to by the Yeas and Nays: (2/3 required): 30"| __truncated__ "Passed/agreed to in House: On motion to suspend the rules and pass the bill, as amended Agreed to by the Yeas a"| __truncated__ ...
  ..$ action_type                 : Ord.factor w/ 6 levels "IntroReferral"<..: 1 3 3 3 3 3 3 3 3 2 ...
  ..$ action_time                 : 'hms' num [1:14] NA 19:46:03 19:46:02 19:46:02 ...
  .. ..- attr(*, "units")= chr "secs"
  ..$ action_code                 : chr [1:14] NA "H38310" "H37300" "8000" ...

List of 1
 $ : tibble [1 × 11] (S3: tbl_df/tbl/data.frame)
  ..$ sponsor_bioguide_id            : chr "C001078"
  ..$ sponsor_full_name              : chr "Rep. Connolly, Gerald E. [D-VA-11]"
  ..$ sponsor_first_name             : chr "Gerald"
  ..$ sponsor_middle_name            : chr "E."
  ..$ sponsor_last_name              : chr "Connolly"
  ..$ sponsor_party                  : chr "D"
  ..$ sponsor_state                  : chr "VA"
  ..$ sponsor_identifiers_lis_id     : chr "1959"
  ..$ sponsor_identifiers_bioguide_id: chr "C001078"
  ..$ sponsor_identifiers_gpo_id     : chr "8202"
  ..$ sponsor_district               : chr "11"

How your XML is structured, and how you want to structure your tabular output will determine how you need to process the XML file. In general, this will involve the following steps:

Select the XML nodes you want to include in your dataset.
Convert the XML to a list.
flatten() the list into a data frame.

Write Functions

If you find yourself doing similar transformations in multiple cases or copy-pasting some piece of code more than once, I would implore you to write functions because they:

make your code more reliable
make you think critically about how you’re modifying the data
are easier to debug

A good resource to start learning about writing functions is How to Write a Function in R.

The simple case

When converting XML into a data frame, one of the simplest cases to deal with is a set of nodes which each have a single value. You can think of singular nodes as the contents of a single cell in a data frame. When your data are entirely made up of nodes like this, you can convert XML to a data frame using two functions from {XML}:

xmlParse() parses an XML file into an R structure⁹ representing the XML tree
xmlToDataFrame() extracts data from an XML document directly into a data frame, but with limited flexibility

Note

In the example below, we need to use getNodeSet(., [path =] "//bill") to select only nodes in <bill>. If we just passed bill_xml_parse, we would get a dataframe with two rows because the top-level nodes are bill and the metadata identifier node <dublinCore>.

Use {XML} functions to read XML data

# First we need to use xmlParse function to create XMLDocument object
bill_xml_parse = xmlParse(bill_xml)

xmlToDataFrame(getNodeSet(bill_xml_parse, "//bill")) %>% 
  glimpse()

Output of xmlToDataFrame

Rows: 1
Columns: 28
$ billNumber                           <chr> "391"
$ createDate                           <chr> "2021-01-22T08:12:10Z"
$ updateDate                           <chr> "2022-09-07T13:37:22Z"
$ originChamber                        <chr> "House"
$ billType                             <chr> "HR"
$ introducedDate                       <chr> "2021-01-21"
$ congress                             <chr> "117"
$ constitutionalAuthorityStatementText <chr> "<pre>[Congressional Record Volum…
$ recordedVotes                        <chr> "188https://clerk.house.gov/evs/2…
$ committees                           <chr> "ssfr00Foreign Relations Committe…
$ committeeReports                     <chr> ""
$ relatedBills                         <chr> ""
$ actions                              <chr> "2021-07-12ssfr00Foreign Relation…
$ sponsors                             <chr> "C001078Rep. Connolly, Gerald E. …
$ cosponsors                           <chr> "C000266Rep. Chabot, Steve [R-OH-…
$ cboCostEstimates                     <chr> "2021-05-11T15:11:11ZH.R. 391, Gl…
$ laws                                 <chr> ""
$ notes                                <chr> ""
$ policyArea                           <chr> "International Affairs"
$ subjects                             <chr> "Advisory bodiesAnimal and plant …
$ summaries                            <chr> "002021-01-21Introduced in House2…
$ title                                <chr> "Global Health Security Act of 20…
$ titles                               <chr> "Display TitleGlobal Health Secur…
$ amendments                           <chr> ""
$ textVersions                         <chr> "Referred in Senate2021-07-12T04:…
$ latestAction                         <chr> "2021-07-12Received in the Senate…
$ calendarNumbers                      <chr> ""
$ version                              <chr> "1.0.0"

Because the congressional data has elements like committees, actions, and sponsors which contain multiple child elements of their own, the contents are combined into single columns. When parsing nested XML like this, it might require multiple steps to get the data into a tidy data frame.

In this case, the first step is to isolate singular elements like billNumber and billType, and deal with nested elements individually afterwards.

Singular elements are nodes which meet two criteria:

the node does not have any child nodes of its own, and
the node is not an empty string.

xml_find_all() finds nodes which match an XPath (XML Path Language) expression¹⁰ in a similar way as regular expressions is used to match strings. As the function’s help file says, “XPath is like regular expressions for trees”.

In the code below the [ expressions within the square brackets ] are called predicates which I use to find nodes which have no children with the XPath count function and nodes which are not empty strings¹¹. This returns an {xml_nodeset} which can be coerced to a list using as_list().

Select non-empty child-less elements using XPath

singular_nodes = xml_find_all(bill_xml, 
                              # //bill/* selects the child nodes in a bill
                              "//bill/*[count(./*) = 0 and not(string-length(.) = 0)]")

(singular_list = as_list(singular_nodes)) %>% 
  glimpse()

List of 10
 $ :List of 1
  ..$ : chr "391"
 $ :List of 1
  ..$ : chr "2021-01-22T08:12:10Z"
 $ :List of 1
  ..$ : chr "2022-09-07T13:37:22Z"
 $ :List of 1
  ..$ : chr "House"
 $ :List of 1
  ..$ : chr "HR"
 $ :List of 1
  ..$ : chr "2021-01-21"
 $ :List of 1
  ..$ : chr "117"
 $ :List of 1
  ..$ : chr "<pre>[Congressional Record Volume 167, Number 17 (Thursday, January 28, 2021)][House]From the Congressional Rec"| __truncated__
 $ :List of 1
  ..$ : chr "Global Health Security Act of 2021"
 $ :List of 1
  ..$ : chr "1.0.0"

Now that we have the data we want in a list, we can flatten each element in the list into a data frame column using purrr’s flatten_dfc(). However, notice the elements in a singular_list are not named. If we flatten this list to a data frame, our columns also won’t have any column names. xml_name() creates a character vector of the nodes’ names. When used in combination with setNames() we can create a named list. Just make sure singular_list and singular_nodes are the same length and in the same order.

Name the list elements

singular_list_named = setNames(singular_list, 
                                xml_name(singular_nodes))

When flattened, a named list becomes a data frame with column names in tow.

Flatten singular nodes into data frame

(bill_df = flatten_dfc(singular_list_named)) %>% 
  glimpse()

Rows: 1
Columns: 10
$ billNumber                           <chr> "391"
$ createDate                           <chr> "2021-01-22T08:12:10Z"
$ updateDate                           <chr> "2022-09-07T13:37:22Z"
$ originChamber                        <chr> "House"
$ billType                             <chr> "HR"
$ introducedDate                       <chr> "2021-01-21"
$ congress                             <chr> "117"
$ constitutionalAuthorityStatementText <chr> "<pre>[Congressional Record Volum…
$ title                                <chr> "Global Health Security Act of 20…
$ version                              <chr> "1.0.0"

With the singular nodes transformed, we have the base of our data frame. The next step is to choose a nested element, define its structure in the output, and transform the XML accordingly.

The more complicated (and realistic) case

In congressional XML data, the actions and votes on a bill are stored in containers. To collect this information and keep our dataset at the bill-level, we can transform each container into data frames using the same steps as before and then assign them as a list columns in our bill_df data frame.

Parse a single element

Actions are stored in a container node where each <item> represents a congressional action taken for a bill, such as being introduced, sent to a committee, debated on the floor, etc. Just as before, we use as_list() to convert the {xml_nodeset} to a list.

Select all action elements and convert them to a list

actions_xml = xml_find_all(bill_node, "actions/item")

actions_list = as_list(actions_xml)
# Look at first action
glimpse(actions_list[[1]], width = 50)

Structure of actions list

List of 6
 $ actionDate  :List of 1
  ..$ : chr "2021-07-12"
 $ committees  :List of 1
  ..$ item:List of 2
  .. ..$ systemCode:List of 1
  .. ..$ name      :List of 1
 $ links       : list()
 $ sourceSystem:List of 2
  ..$ code:List of 1
  .. ..$ : chr "0"
  ..$ name:List of 1
  .. ..$ : chr "Senate"
 $ text        :List of 1
  ..$ : chr "Received in the Senate and Read twice and referred to the Committee on Foreign Relations."
 $ type        :List of 1
  ..$ : chr "IntroReferral"

For an individual action, we have the date, a list of committees related to the action, some administrative information, the text and type of action. To deal with this, we can write a function (or set of functions) like the ones below to transform actions from a list to a data frame with their own nested elements.

View helper functions

#' Helper function: flatten_dfc_rename
#' Flatten a list into a data frame and apply a prefix to the column names
#'
#' @param list_to_flatten A list to flatten
#' @param name_prefix A string to apply as a prefix to columns
#'
#' @return A data frame of columns that begin with the prefix
flatten_dfc_rename = function(list_to_flatten, 
                              name_prefix = "prefix"){
  rename_with(
    .data = flatten_dfc(list_to_flatten), 
    .fn = ~str_c(name_prefix, "_", .),
    # Exclude columns which already start with the prefix
    .cols = -starts_with(name_prefix)
  )
}

#' Function: parse_action
#' Parse actions from list to data frame
#' 
#' Separately flatten and rename source system and committee elements,
#' then flatten and rename remaining elements. 
#' Committee elements are flattened into a nested data frame 
#' because an action can relate to multiple committees.
#'
#' @param action An action element in list form
#'
#' @return A data frame with a single row containing action information
parse_action = function(action){
  action %>% 
    # Flatten + rename sourceSystem elements
    map_at("sourceSystem", ~flatten_dfc_rename(.x, "source")) %>% 
    # Flatten + rename committees
    map_at("committees", function(committee){
      # using map_at and map_dfr to create a data frame row for each committee
      map_dfr(committee, ~flatten_dfc_rename(.x, "committee"))
    }) %>% 
    # Flatten object to data frame
    flatten_dfc_rename(., "action") %>% 
    # Lastly, clean the names
    janitor::clean_names()
}

Apply it to all elements

Using the {purrr} library’s map_dfr(), we apply my parse_action() function to each element of our actions list and combine the results into data frame rows.

Parse actions and combine into data frame

(actions_df = map_dfr(actions_list, parse_action)) %>% 
  glimpse()

Rows: 14
Columns: 9
$ action_date                  <chr> "2021-07-12", "2021-06-28", "2021-06-28",…
$ action_committee_system_code <chr> "ssfr00", NA, NA, NA, NA, NA, NA, NA, NA,…
$ action_committee_name        <chr> "Foreign Relations Committee", NA, NA, NA…
$ action_source_code           <chr> "0", "2", "2", "9", "2", "2", "2", "2", "…
$ action_source_name           <chr> "Senate", "House floor actions", "House f…
$ action_text                  <chr> "Received in the Senate and Read twice an…
$ action_type                  <chr> "IntroReferral", "Floor", "Floor", "Floor…
$ action_time                  <chr> NA, "19:46:03", "19:46:02", "19:46:02", "…
$ action_code                  <chr> NA, "H38310", "H37300", "8000", "H30000",…

When you plan to combine rows into a data frame or unnest() the data in the future, it is useful to be explicit about the expected data types of each column. This can be done using type_convert().

Specify column types

# Specify column data types
actions_col_types = cols(
  action_date = col_date(),
  action_time = col_time(),
  action_committee_systemCode = col_character(), 
  action_committee_name = col_character(), 
  action_source_code = col_character(),
  action_source_name = col_character(),
  action_text = col_character(), 
  action_type = col_character(), 
  action_code = col_character()
  )

# Convert column types
actions_df = type_convert(actions_df,
                          col_types = actions_col_types)

Because we are only adding a single row of data to bill_df we can encase the actions in a list and use dollar assignment. If you were adding multiple rows of data, you would need to make sure that you are adding a column of the same length as bill_df.

Assign actions as list column in our data frame

bill_df$actions = list(actions_df)

When they put XML in your XML

So we’ve dealt with cases with multiple child elements, but XML allows for all sorts of weird and wonderful storage options. The congressional data includes information about votes taken on a bill, the total vote count, and the votes of individual legislators. However, only the high level information is included in the <recordedVotes> element. the more detailed information on individual legislators’ votes are contained in an XML file referenced by the <url> element as you can see below.¹²

Assign actions as list column in our data frame

# Find recordedVote nodes inside recordedVote container
bill_recorded_vote_nodes = xml_find_all(
  bill_node, 
  "recordedVotes/recordedVote"
  )

# Look at the nodeset contents
xml_contents(bill_recorded_vote_nodes)

{xml_nodeset (7)}
[1] <rollNumber>188</rollNumber>
[2] <url>https://clerk.house.gov/evs/2021/roll188.xml</url>
[3] <fullActionName>Final Passage Under Suspension of the Rules Results</full ...
[4] <chamber>House</chamber>
[5] <congress>117</congress>
[6] <date>2021-06-28T23:46:02Z</date>
[7] <sessionNumber>1</sessionNumber>

Before we dive one level deeper, we can convert the top-level nodes to a list and flatten it into columns to start building our vote-level data frame.

Convert <recordedVotes> to a list and flatten into data frame

# Coerce nodes to list
recorded_votes_list = as_list(bill_recorded_vote_nodes)

(recorded_votes_df = map_dfr(recorded_votes_list, flatten_dfc)) %>% 
  glimpse()

Rows: 1
Columns: 7
$ rollNumber     <chr> "188"
$ url            <chr> "https://clerk.house.gov/evs/2021/roll188.xml"
$ fullActionName <chr> "Final Passage Under Suspension of the Rules Results"
$ chamber        <chr> "House"
$ congress       <chr> "117"
$ date           <chr> "2021-06-28T23:46:02Z"
$ sessionNumber  <chr> "1"

Note

We use flatten_dfc() to flatten each vote element into a data frame as columns and then map_dfr() rolls them all up as rows into a single data frame.

Now to get the vote roll XML file we’ll use read_xml() . There are two main nodes - <vote-metadata> and <vote-data>. One node contains the aggregated vote information, while <vote-data> contains the legislator-level vote data. We’ll focus on <vote-data> here.

Read vote roll data and convert to list

vote_roll_xml = read_xml(recorded_votes_df$url)

# Find all vote-data elements
vote_data = xml_find_all(vote_roll_xml, "vote-data")

# Find all recorded-vote elements
vote_legislators = xml_find_all(vote_data, "recorded-vote")

# Convert to list and show first element
(legislators_list = as_list(vote_legislators))[1] %>% 
  glimpse()

List of 1
 $ :List of 2
  ..$ legislator:List of 1
  .. ..$ : chr "Adams"
  .. ..- attr(*, "name-id")= chr "A000370"
  .. ..- attr(*, "sort-field")= chr "Adams"
  .. ..- attr(*, "unaccented-name")= chr "Adams"
  .. ..- attr(*, "party")= chr "D"
  .. ..- attr(*, "state")= chr "NC"
  .. ..- attr(*, "role")= chr "legislator"
  ..$ vote      :List of 1
  .. ..$ : chr "Yea"

If we use the same combination of as_list() , map_dfr(), and flatten_dfc() we can flatten the XML into one row per legislator but we lose all the attributes.

Convert list of votes to data frame and LOSE attributes

(vote_roll_flattened = as_list(vote_legislators) %>% 
  map_dfr(flatten_dfc))

# A tibble: 430 × 2
   legislator  vote 
   <chr>       <chr>
 1 Adams       Yea  
 2 Aderholt    Nay  
 3 Aguilar     Yea  
 4 Allen       Nay  
 5 Allred      Yea  
 6 Amodei      Yea  
 7 Armstrong   Yea  
 8 Arrington   Nay  
 9 Auchincloss Yea  
10 Axne        Yea  
# … with 420 more rows

To extract the attributes before we flatten the data, we can apply map_at() to each legislator element and extract the attributes from each while retaining the value in vote.¹³

Extract attributes to list elements

(legislator_attr = legislators_list %>% 
    # Modify one level deeper using map_at to target legislator elements
    map(map_at, "legislator", attributes))[[1]] %>% 
  glimpse()

List of 2
 $ legislator:List of 6
  ..$ name-id        : chr "A000370"
  ..$ sort-field     : chr "Adams"
  ..$ unaccented-name: chr "Adams"
  ..$ party          : chr "D"
  ..$ state          : chr "NC"
  ..$ role           : chr "legislator"
 $ vote      :List of 1
  ..$ : chr "Yea"

Convert list of votes to data frame and KEEP attributes

(legislator_vote_df = map_dfr(legislator_attr, flatten_dfc) %>% 
    # Use {janitor} to clean up the column names
    janitor::clean_names()) %>% 
  glimpse()

Rows: 430
Columns: 7
$ name_id         <chr> "A000370", "A000055", "A000371", "A000372", "A000376",…
$ sort_field      <chr> "Adams", "Aderholt", "Aguilar", "Allen", "Allred", "Am…
$ unaccented_name <chr> "Adams", "Aderholt", "Aguilar", "Allen", "Allred", "Am…
$ party           <chr> "D", "R", "D", "R", "D", "R", "R", "R", "D", "D", "R",…
$ state           <chr> "NC", "AL", "CA", "GA", "TX", "NV", "ND", "TX", "MA", …
$ role            <chr> "legislator", "legislator", "legislator", "legislator"…
$ vote            <chr> "Yea", "Nay", "Yea", "Nay", "Yea", "Yea", "Yea", "Nay"…

It can often feel like you’re getting lost in a list of lists, but with some experimentation you’ll be able to find your way back to the surface.

Now we have a table of legislator voting data, and the last thing to take care of is the <vote-metadata>. Everything other than the <vote-totals> element is singular so we can get that out of the way the same way as before:

View <vote-metadata> element

(vote_metadata = xml_find_all(vote_roll_xml, "vote-metadata")) %>% 
  xml_contents()

{xml_nodeset (13)}
 [1] <majority>D</majority>
 [2] <congress>117</congress>
 [3] <session>1st</session>
 [4] <chamber>U.S. House of Representatives</chamber>
 [5] <rollcall-num>188</rollcall-num>
 [6] <legis-num>H R 391</legis-num>
 [7] <vote-question>On Motion to Suspend the Rules and Pass, as Amended</vote ...
 [8] <vote-type>2/3 YEA-AND-NAY</vote-type>
 [9] <vote-result>Passed</vote-result>
[10] <action-date>28-Jun-2021</action-date>
[11] <action-time time-etz="19:42">7:42 PM</action-time>
[12] <vote-desc>Global Health Security Act</vote-desc>
[13] <vote-totals>\n  <totals-by-party-header>\n    <party-header>Party</part ...

Because we have the individual votes, we don’t need to keep the <vote-totals> element, but it is useful to note because it could serve as a check for data consistency in the future. Instead, we’ll just keep the singular nodes in <vote-metadata>

Convert singular vote nodes to data frame

vote_singular_nodes = xml_singular_nodes(vote_metadata)

(vote_df = as_list(vote_singular_nodes) %>% 
  # as_list() doesn't retain element names so we set names ourselves
  setNames(xml_name(vote_singular_nodes)) %>% 
  flatten_dfc() %>% 
    janitor::clean_names()) %>% 
  glimpse()

Rows: 1
Columns: 12
$ majority      <chr> "D"
$ congress      <chr> "117"
$ session       <chr> "1st"
$ chamber       <chr> "U.S. House of Representatives"
$ rollcall_num  <chr> "188"
$ legis_num     <chr> "H R 391"
$ vote_question <chr> "On Motion to Suspend the Rules and Pass, as Amended"
$ vote_type     <chr> "2/3 YEA-AND-NAY"
$ vote_result   <chr> "Passed"
$ action_date   <chr> "28-Jun-2021"
$ action_time   <chr> "7:42 PM"
$ vote_desc     <chr> "Global Health Security Act"

Once we have our nodeset (which at last are all singular), we use the same listing, mapping, and flattening…or lappening as absolutely no one calls it.

Now we can put all the pieces together and nest each element in its parent element until we’ve gotten back to the bill-level i.e. the vote roll data frame goes in the votes data frame, which goes in the bill data frame.

Nest each element in their parent data frame

# Nest legislator votes in the vote roll data frame
vote_roll_df = vote_df %>% 
  mutate(legislator_votes = list(legislator_vote_df)) %>% 
    janitor::clean_names()

# Nest vote rolls in the votes data frame
(recorded_votes_df = recorded_votes_df %>% 
  mutate(vote = list(vote_roll_df))) %>% 
  glimpse()

Rows: 1
Columns: 8
$ rollNumber     <chr> "188"
$ url            <chr> "https://clerk.house.gov/evs/2021/roll188.xml"
$ fullActionName <chr> "Final Passage Under Suspension of the Rules Results"
$ chamber        <chr> "House"
$ congress       <chr> "117"
$ date           <chr> "2021-06-28T23:46:02Z"
$ sessionNumber  <chr> "1"
$ vote           <list> [<tbl_df[1 x 13]>]

Nest each element in their parent data frame

# Nest votes in the bill data frame
bill_df$votes = list(recorded_votes_df)

Now we have the bill-level characteristics with action and vote information nested in list columns. We can keep the data at the bill level or unnest() the elements we want to use.

Code

glimpse(bill_df)

Rows: 1
Columns: 12
$ billNumber                           <chr> "391"
$ createDate                           <chr> "2021-01-22T08:12:10Z"
$ updateDate                           <chr> "2022-09-07T13:37:22Z"
$ originChamber                        <chr> "House"
$ billType                             <chr> "HR"
$ introducedDate                       <chr> "2021-01-21"
$ congress                             <chr> "117"
$ constitutionalAuthorityStatementText <chr> "<pre>[Congressional Record Volum…
$ title                                <chr> "Global Health Security Act of 20…
$ version                              <chr> "1.0.0"
$ actions                              <list> [<tbl_df[14 x 9]>]
$ votes                                <list> [<tbl_df[1 x 8]>]

Code

select(bill_df, billType, billNumber, actions) %>% 
  unnest(actions) %>% 
  glimpse()

Rows: 14
Columns: 11
$ billType                     <chr> "HR", "HR", "HR", "HR", "HR", "HR", "HR",…
$ billNumber                   <chr> "391", "391", "391", "391", "391", "391",…
$ action_date                  <date> 2021-07-12, 2021-06-28, 2021-06-28, 2021…
$ action_committee_system_code <chr> "ssfr00", NA, NA, NA, NA, NA, NA, NA, NA,…
$ action_committee_name        <chr> "Foreign Relations Committee", NA, NA, NA…
$ action_source_code           <chr> "0", "2", "2", "9", "2", "2", "2", "2", "…
$ action_source_name           <chr> "Senate", "House floor actions", "House f…
$ action_text                  <chr> "Received in the Senate and Read twice an…
$ action_type                  <chr> "IntroReferral", "Floor", "Floor", "Floor…
$ action_time                  <time>       NA, 19:46:03, 19:46:02, 19:46:02, …
$ action_code                  <chr> NA, "H38310", "H37300", "8000", "H30000",…

Code

# Select bill ID and votes
select(bill_df, billType, billNumber, votes) %>% 
  unnest(votes) %>% 
  # Unnest further vote info and add prefix to cols
  unnest(vote, names_sep = "_") %>% 
  glimpse()

Rows: 1
Columns: 22
$ billType              <chr> "HR"
$ billNumber            <chr> "391"
$ rollNumber            <chr> "188"
$ url                   <chr> "https://clerk.house.gov/evs/2021/roll188.xml"
$ fullActionName        <chr> "Final Passage Under Suspension of the Rules Res…
$ chamber               <chr> "House"
$ congress              <chr> "117"
$ date                  <chr> "2021-06-28T23:46:02Z"
$ sessionNumber         <chr> "1"
$ vote_majority         <chr> "D"
$ vote_congress         <chr> "117"
$ vote_session          <chr> "1st"
$ vote_chamber          <chr> "U.S. House of Representatives"
$ vote_rollcall_num     <chr> "188"
$ vote_legis_num        <chr> "H R 391"
$ vote_vote_question    <chr> "On Motion to Suspend the Rules and Pass, as Ame…
$ vote_vote_type        <chr> "2/3 YEA-AND-NAY"
$ vote_vote_result      <chr> "Passed"
$ vote_action_date      <chr> "28-Jun-2021"
$ vote_action_time      <chr> "7:42 PM"
$ vote_vote_desc        <chr> "Global Health Security Act"
$ vote_legislator_votes <list> [<tbl_df[430 x 7]>]

Code

# Select bill ID and votes
select(bill_df, billType, billNumber, votes) %>% 
  unnest(votes) %>% 
  # Select bill ID, roll number, and vote list col
  select(billType:rollNumber, vote) %>% 
  unnest(vote) %>% 
  # Select bill ID, roll number, and legislators list col
  select(billType:rollNumber, legislator_votes) %>% 
  unnest(legislator_votes) %>% 
  glimpse()

Rows: 430
Columns: 10
$ billType        <chr> "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", …
$ billNumber      <chr> "391", "391", "391", "391", "391", "391", "391", "391"…
$ rollNumber      <chr> "188", "188", "188", "188", "188", "188", "188", "188"…
$ name_id         <chr> "A000370", "A000055", "A000371", "A000372", "A000376",…
$ sort_field      <chr> "Adams", "Aderholt", "Aguilar", "Allen", "Allred", "Am…
$ unaccented_name <chr> "Adams", "Aderholt", "Aguilar", "Allen", "Allred", "Am…
$ party           <chr> "D", "R", "D", "R", "D", "R", "R", "R", "D", "D", "R",…
$ state           <chr> "NC", "AL", "CA", "GA", "TX", "NV", "ND", "TX", "MA", …
$ role            <chr> "legislator", "legislator", "legislator", "legislator"…
$ vote            <chr> "Yea", "Nay", "Yea", "Nay", "Yea", "Yea", "Yea", "Nay"…

I’ll stop there for brevity’s sake, but you can find the code for extracting the full XML file here ¹⁴. Please reach out with any feedback, leave a comment below, or share this with someone who’s struggling with XML. There may even be a reward for anyone who finds mistakes in my code 🤷.

Other helpful articles

Here are some of the helpful articles I came across in the course of writing this:

Session Info

Version information about R, OS, and loaded packages.

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23 ucrt)
 os       Windows 10 x64 (build 19045)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_Ireland.utf8
 ctype    English_Ireland.utf8
 tz       America/Los_Angeles
 date     2022-11-09
 pandoc   2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package        * version   date (UTC) lib source
 P assertthat       0.2.1     2019-03-21 [?] CRAN (R 4.2.0)
 P backports        1.4.1     2021-12-13 [?] CRAN (R 4.2.0)
 P bit              4.0.4     2020-08-04 [?] CRAN (R 4.2.0)
 P bit64            4.0.5     2020-08-30 [?] CRAN (R 4.2.0)
 P bitops           1.0-7     2021-04-24 [?] CRAN (R 4.2.0)
 P broom            1.0.0     2022-07-01 [?] CRAN (R 4.2.1)
 P cellranger       1.1.0     2016-07-27 [?] CRAN (R 4.2.0)
 P cli              3.4.0     2022-09-08 [?] CRAN (R 4.2.1)
 P colorspace       2.0-3     2022-02-21 [?] CRAN (R 4.2.0)
 P crayon           1.5.1     2022-03-26 [?] CRAN (R 4.2.0)
 P curl             4.3.2     2021-06-23 [?] CRAN (R 4.2.0)
 P DBI              1.1.3     2022-06-18 [?] CRAN (R 4.2.0)
 P dbplyr           2.2.1     2022-06-27 [?] CRAN (R 4.2.1)
 P digest           0.6.29    2021-12-01 [?] CRAN (R 4.2.0)
 P dplyr          * 1.0.10    2022-09-01 [?] CRAN (R 4.2.1)
 P ellipsis         0.3.2     2021-04-29 [?] CRAN (R 4.2.0)
 P evaluate         0.15      2022-02-18 [?] CRAN (R 4.2.0)
 P extrafont      * 0.18      2022-04-12 [?] CRAN (R 4.2.0)
 P extrafontdb      1.0       2012-06-11 [?] CRAN (R 4.2.0)
 P fansi            1.0.3     2022-03-24 [?] CRAN (R 4.2.0)
 P fastmap          1.1.0     2021-01-25 [?] CRAN (R 4.2.0)
 P forcats        * 0.5.1     2021-01-27 [?] CRAN (R 4.2.0)
 P fs               1.5.2     2021-12-08 [?] CRAN (R 4.2.0)
 P gargle           1.2.0     2021-07-02 [?] CRAN (R 4.2.0)
 P generics         0.1.3     2022-07-05 [?] CRAN (R 4.2.1)
 P ggplot2        * 3.3.6     2022-05-03 [?] CRAN (R 4.2.0)
 P glue             1.6.2     2022-02-24 [?] CRAN (R 4.2.0)
 P googledrive      2.0.0     2021-07-08 [?] CRAN (R 4.2.0)
 P googlesheets4    1.0.0     2021-07-21 [?] CRAN (R 4.2.0)
 P gtable           0.3.1     2022-09-01 [?] CRAN (R 4.2.1)
 P haven            2.5.0     2022-04-15 [?] CRAN (R 4.2.0)
 P hms              1.1.1     2021-09-26 [?] CRAN (R 4.2.0)
 P htmltools        0.5.3     2022-07-18 [?] CRAN (R 4.2.1)
 P htmlwidgets      1.5.4     2021-09-08 [?] CRAN (R 4.2.0)
 P httr             1.4.3     2022-05-04 [?] CRAN (R 4.2.0)
 P janitor          2.1.0     2021-01-05 [?] CRAN (R 4.2.1)
 P jsonlite       * 1.8.0     2022-02-22 [?] CRAN (R 4.2.0)
 P knitr            1.40      2022-08-24 [?] CRAN (R 4.2.1)
 P lifecycle        1.0.1     2021-09-24 [?] CRAN (R 4.2.0)
 P log4r          * 0.4.2     2021-11-04 [?] CRAN (R 4.2.0)
 P lubridate      * 1.8.0     2021-10-07 [?] CRAN (R 4.2.0)
 P magrittr         2.0.3     2022-03-30 [?] CRAN (R 4.2.0)
 P microbenchmark   1.4.9     2021-11-09 [?] CRAN (R 4.2.0)
 P modelr           0.1.8     2020-05-19 [?] CRAN (R 4.2.0)
   mokeR          * 0.1.0     2022-10-03 [1] local (C:/Users/Mark/Documents/Projects/mokeR_0.1.0.tar.gz)
 P munsell          0.5.0     2018-06-12 [?] CRAN (R 4.2.0)
 P pillar           1.8.1     2022-08-19 [?] CRAN (R 4.2.1)
 P pkgconfig        2.0.3     2019-09-22 [?] CRAN (R 4.2.0)
 P purrr          * 0.3.4     2020-04-17 [?] CRAN (R 4.2.0)
 P R6               2.5.1     2021-08-19 [?] CRAN (R 4.2.0)
 P RCurl          * 1.98-1.9  2022-10-03 [?] CRAN (R 4.2.1)
 P reactable      * 0.3.0     2022-05-26 [?] CRAN (R 4.2.0)
 P reactR           0.4.4     2021-02-22 [?] CRAN (R 4.2.0)
 P readr          * 2.1.2     2022-01-30 [?] CRAN (R 4.2.0)
 P readxl           1.4.0     2022-03-28 [?] CRAN (R 4.2.0)
   renv             0.15.5    2022-05-26 [1] CRAN (R 4.2.1)
 P reprex           2.0.1     2021-08-05 [?] CRAN (R 4.2.0)
 P rjson          * 0.2.21    2022-01-09 [?] CRAN (R 4.2.0)
 P rlang            1.0.5     2022-08-31 [?] CRAN (R 4.2.1)
 P rmarkdown        2.18      2022-11-09 [?] CRAN (R 4.2.1)
 P rstudioapi       0.13      2020-11-12 [?] CRAN (R 4.2.0)
   Rttf2pt1         1.3.8     2020-01-10 [1] CRAN (R 4.2.1)
 P rvest            1.0.2     2021-10-16 [?] CRAN (R 4.2.0)
 P scales           1.2.1     2022-08-20 [?] CRAN (R 4.2.1)
 P sessioninfo      1.2.2     2021-12-06 [?] CRAN (R 4.2.0)
 P snakecase        0.11.0    2019-05-25 [?] CRAN (R 4.2.1)
 P stringi          1.7.8     2022-07-11 [?] CRAN (R 4.2.1)
 P stringr        * 1.4.1     2022-08-20 [?] CRAN (R 4.2.1)
 P tibble         * 3.1.8     2022-07-22 [?] CRAN (R 4.2.1)
 P tidyjson       * 0.3.1     2020-05-31 [?] CRAN (R 4.2.0)
 P tidyr          * 1.2.1     2022-09-08 [?] CRAN (R 4.2.1)
 P tidyselect       1.1.2     2022-02-21 [?] CRAN (R 4.2.0)
 P tidyverse      * 1.3.2     2022-07-18 [?] CRAN (R 4.2.1)
 P tzdb             0.3.0     2022-03-28 [?] CRAN (R 4.2.0)
 P utf8             1.2.2     2021-07-24 [?] CRAN (R 4.2.0)
 P vctrs            0.4.1     2022-04-13 [?] CRAN (R 4.2.0)
 P vroom            1.5.7     2021-11-30 [?] CRAN (R 4.2.0)
 P withr            2.5.0     2022-03-03 [?] CRAN (R 4.2.0)
 P xfun             0.32      2022-08-10 [?] CRAN (R 4.2.1)
 P XML            * 3.99-0.11 2022-10-03 [?] CRAN (R 4.2.1)
 P xml2           * 1.3.3     2021-11-30 [?] CRAN (R 4.2.0)
 P yaml             2.3.5     2022-02-21 [?] CRAN (R 4.2.0)

 [1] C:/Users/Mark/Documents/Projects/Blog/renv/library/R-4.2/x86_64-w64-mingw32
 [2] C:/Program Files/R/R-4.2.1/library

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

Footnotes

Wikipedia for the curious. Don’t miss the drama about Microsoft’s “vociferous protests” as one of the co-editors started working with Netscape.↩︎
The x at the end of Microsoft file types (e.g. .docx, .xlsx, .pptx) stands for XML.↩︎
To learn more about what XML is and how to use it, I recommend going through w3school’s XML tutorial.↩︎
Between {XML} and {xml2} there are numerous ways to load XML data with R. I generally prefer {xml2} but I suggest looking through the packages’ documentation to find the right function for your task↩︎
xml_document is one of the “key classes” used in the {xml2} library, the others being xml_node (a single node), and xml_nodeset (a set of nodes).↩︎
Which is a document metadata identifier, not an Irish metal music genre. While very useful in its own right, it isn’t relevant for our task here.↩︎

For example

Potential structure for party-level votes
Bill	Party	\(\cdots\)	Yeas	Nays	Abstentions
HR-1	R	\(\cdots\)	221	0	1
HR-1	D	\(\cdots\)	0	212	1
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)

↩︎

Atomic vectors are vectors with elements of the same type (e.g. numeric, character, etc.) as opposed to general vectors, or lists, which can contain a variety of types.↩︎
When you’re working with both the {XML} and {xml2} libraries, it is important to note that their functions often rely on different object types. In the case of xmlToDataFrame(), it does not take {xml_document} or {xml_node} objects. If you try to use them, you’ll see an error like this:
Code
```
xmlToDataFrame(bill_xml)
```
```
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'xmlToDataFrame' for signature '"xml_document", "missing", "missing", "missing", "missing"'
```
↩︎
A very useful table of XPath syntax ↩︎

In the process of writing this, I discovered that using XPath made this ~5x faster than an equivalent function in R.

Code

# Function to select singular child nodes from XML node
xml_singular_nodes = function(xml_node){
  # Return child nodes of current node
  child_nodes = xml_children(xml_node)
  # Select child nodes with 0 children
  zero_length_child_nodes = child_nodes[xml_length(child_nodes) == 0]

  # Keep the nodes which are not empty strings
  keep(zero_length_child_nodes, ~(xml_text(.) != ""))
}

Code

# Using XPath:
# singular_nodes1 = xml_find_all(bill_xml, 
#                                 "//bill/*[count(./*) = 0 and not(string-length(.) = 0)]")

# Using R function:
# singular_nodes2 = xml_singular_nodes(bill_node)

# Check they are the same
# all.equal(singular_nodes1, singular_nodes2)

# benchmark the two different ways of selecting nodes
microbenchmark::microbenchmark(
  xml_singular_nodes = xml_singular_nodes(bill_node),
  xml_find_all = xml_find_all(bill_xml, "//bill/*[count(./*) = 0 and not(string-length(.) = 0)]")
)

Unit: microseconds
               expr    min      lq     mean  median      uq    max neval
 xml_singular_nodes 1040.2 1087.15 1182.637 1101.25 1131.15 7016.6   100
       xml_find_all  200.5  211.45  231.863  220.85  242.45  463.2   100

↩︎

↩︎
We do lose the top-level element in the legislator list. This does not matter in this case because the name is repeated in the attributes, but be aware of this in case you want to combine values in a list with its attributes.↩︎
Ctrl/Cmd+F: extract_bill_status↩︎

What is XML?

Navigate XML

Define the output

The simple case

The more complicated (and realistic) case

Parse a single element

Apply it to all elements

When they put XML in your XML

Other helpful articles

Footnotes

Share