Load libraries
# copy the comment below to install the packages
# install.packages(c("tidyverse", "XML", "xml2))
library(tidyverse)
library(XML)
library(xml2)And parsing it into a tidy format
July 14, 2022
XML, or (e)Xtensible Markup Language, can be a pain to get into a tabular format for analysis. It is designed to contain any “arbitrary” structure1 of data which makes it useful for storing things like financial transactions, webpage content and document formatting2. Whenever I’ve worked with XML in the past, there seemed to be a dearth of resources describing how to actually transform the data into something usable. This post aims to remedy that by describing how to parse an XML file from GovInfo’s congressional bill data.
{XML} and {xml2} packages{purrr}, and make use of {tidyr}’s list columnsTransforming XML into a tabular format can be tricky depending on the structure. Start by understanding how the data is organized, what you want it to look like, and what level of detail you are looking at. This post uses GovInfo’s congressional bill data to explore examples of XML structure.
Look through any available documentation, always. When you read in the XML file,
xmlParse(),xml_structure(), andxml_contents()are some of the functions that will help you navigate the structure you’re dealing with.
Consider what you want the output to look like and think about how the data needs to be transformed to match this. Once you understand the existing structure, the data you need, and your desired output structure, it is time to start parsing.
Get one element into the form you want first. Writing functions can help you think through the data transformations being applied and make your code easier to read. You can see an example of this functional approach in parsing legislative actions. Once you’re reliably parsing multiple elements and producing a consistent output format, you can scale up.
Next, focus on processing a single file in full (or the subset of the file you’re interested in). You might want an XML file to return a single row, a single column, or a data frame of size \(n\times k\). Once you have a single file returned in the format you want, you can combine the outputs of all the files in your dataset.
At its core, XML is “information wrapped in tags”. You can think of the structure like branches of a tree, with each branch having their own leaves. Below is an example of a bookstore’s data in XML format from w3schools.
<bookstore>
<book category="children">
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title>Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore><bookstore> is the parent, or root element here. The two <book> elements are the bookstore’s children, each of which has individual data points in child elements (e.g. <title>) and attributes e.g. <book category="children"> store books’ category.
Attributes are a useful way to store data related to a specific element for HTML, but using child elements is typically preferred in XML3. However, you’ll still likely run into data stored in attributes so we’ll touch on extracting data from both.
If you’ve never used R before, this will be a bit of a rough point to start learning. Instead I would recommend checking RStudio’s beginners guide, so for the remainder of the post I will assume some familiarity with R.
Four packages I use throughout this post are listed below. If you don’t understand how a function works or want to see the function reference, click on the function in the code and it should bring you to the function’s help page.
{XML} and {xml2} provide functions for reading in and parsing XML4
{purrr} and {tidyr} (loaded as part of the {tidyverse} package) help us wrangle lists and transform XML to data frames.
To illustrate complexities of XML structures I’ll be using congressional bill status data from the US GPO’s GovInfo service. In particular, you can download the sample bill here (chosen for its examples of challenges parsing XML).
First we need to read in the XML file. You can pass the URL or file path of the XML file to read_xml() to create an xml_document object5. To see the contents of the file you’ve just loaded you can use xml_contents() or enter the object’s name (here bill_xml) in the console.
{xml_nodeset (2)}
[1] <bill>\n <billNumber>391</billNumber>\n <createDate>2021-01-22T08:12:10 ...
[2] <dublinCore xmlns:dc="http://purl.org/dc/elements/1.1/">\n <dc:format>te ...
We have two top-level nodes - <bill> and <dublinCore>6. The latter is actually a metadata identifier and doesn’t contain bill-related information. xml_child() selects the <bill> node which returns an XML nodeset that looks like this:
{xml_node}
<bill>
[1] <billNumber>391</billNumber>
[2] <createDate>2021-01-22T08:12:10Z</createDate>
[3] <updateDate>2022-09-07T13:37:22Z</updateDate>
[4] <originChamber>House</originChamber>
[5] <billType>HR</billType>
[6] <introducedDate>2021-01-21</introducedDate>
[7] <congress>117</congress>
[8] <constitutionalAuthorityStatementText><![CDATA[<pre>[Congressional Recor ...
[9] <recordedVotes>\n <recordedVote>\n <rollNumber>188</rollNumber>\n ...
[10] <committees>\n <billCommittees>\n <item>\n <systemCode>ssfr00</ ...
[11] <committeeReports/>
[12] <relatedBills/>
[13] <actions>\n <item>\n <actionDate>2021-07-12</actionDate>\n <commi ...
[14] <sponsors>\n <item>\n <bioguideId>C001078</bioguideId>\n <fullNam ...
[15] <cosponsors>\n <item>\n <bioguideId>C000266</bioguideId>\n <fullN ...
[16] <cboCostEstimates>\n <item>\n <pubDate>2021-05-11T15:11:11Z</pubDate ...
[17] <laws/>
[18] <notes/>
[19] <policyArea>\n <name>International Affairs</name>\n</policyArea>
[20] <subjects>\n <billSubjects>\n <legislativeSubjects>\n <item>\n ...
...
Now we have the bill XML loaded and we’ve already stripped away some data we don’t need but before diving deeper, it’s helpful to think about the how you want to the processed XML data to be structured. For example, in the bill status data I might want to collect party-level votes. This could take the form of one row per party per bill with a column for yeas, nays, and abstentions7. Perhaps I want to look at the policy areas of bills which became law. Instead of processing the entire file and then selecting the data I want, I can select only the nodes I need at the outset before transforming the data into a data frame. This will likely save you some time in cases where you have a lot of data.
With the congressional bills data, I defined my output as a bill-level dataset, i.e. one bill per row, storing nested elements like recordedVotes, committees, and actions inside list columns. While columns in a data frame are typically atomic vectors8 of equal length, list columns enable us to store data of varying length and type, including data frames themselves. Using {tidyr}’s unnest() function, we can expand this nested data into regular data frame columns. Here is what this output will look like:
Here’s a glimpse() of some of the nested data:
List of 1
$ : tibble [2 × 6] (S3: tbl_df/tbl/data.frame)
..$ committee_system_code : chr [1:2] "ssfr00" "hsfa00"
..$ committee_name : chr [1:2] "Foreign Relations Committee" "Foreign Affairs Committee"
..$ committee_chamber : chr [1:2] "Senate" "House"
..$ committee_type : chr [1:2] "Standing" "Standing"
..$ committee_activities :List of 2
..$ subcommittee_activities:List of 2
List of 1
$ : tibble [1 × 8] (S3: tbl_df/tbl/data.frame)
..$ roll_number : chr "188"
..$ url : chr "https://clerk.house.gov/evs/2021/roll188.xml"
..$ full_action_name: chr "Final Passage Under Suspension of the Rules Results"
..$ chamber : chr "House"
..$ date : chr "2021-06-28T23:46:02Z"
..$ session_number : chr "1"
..$ vote_roll :List of 1
..$ roll_found : logi TRUE
List of 1
$ : tibble [14 × 9] (S3: tbl_df/tbl/data.frame)
..$ action_date : Date[1:14], format: ...
..$ action_committee_system_code: chr [1:14] "ssfr00" NA NA NA ...
..$ action_committee_name : chr [1:14] "Foreign Relations Committee" NA NA NA ...
..$ action_source_code : chr [1:14] "0" "2" "2" "9" ...
..$ action_source_name : chr [1:14] "Senate" "House floor actions" "House floor actions" "Library of Congress" ...
..$ action_text : chr [1:14] "Received in the Senate and Read twice and referred to the Committee on Foreign Relations." "Motion to reconsider laid on the table Agreed to without objection." "On motion to suspend the rules and pass the bill, as amended Agreed to by the Yeas and Nays: (2/3 required): 30"| __truncated__ "Passed/agreed to in House: On motion to suspend the rules and pass the bill, as amended Agreed to by the Yeas a"| __truncated__ ...
..$ action_type : Ord.factor w/ 6 levels "IntroReferral"<..: 1 3 3 3 3 3 3 3 3 2 ...
..$ action_time : 'hms' num [1:14] NA 19:46:03 19:46:02 19:46:02 ...
.. ..- attr(*, "units")= chr "secs"
..$ action_code : chr [1:14] NA "H38310" "H37300" "8000" ...
List of 1
$ : tibble [1 × 11] (S3: tbl_df/tbl/data.frame)
..$ sponsor_bioguide_id : chr "C001078"
..$ sponsor_full_name : chr "Rep. Connolly, Gerald E. [D-VA-11]"
..$ sponsor_first_name : chr "Gerald"
..$ sponsor_middle_name : chr "E."
..$ sponsor_last_name : chr "Connolly"
..$ sponsor_party : chr "D"
..$ sponsor_state : chr "VA"
..$ sponsor_identifiers_lis_id : chr "1959"
..$ sponsor_identifiers_bioguide_id: chr "C001078"
..$ sponsor_identifiers_gpo_id : chr "8202"
..$ sponsor_district : chr "11"
How your XML is structured, and how you want to structure your tabular output will determine how you need to process the XML file. In general, this will involve the following steps:
flatten() the list into a data frame.If you find yourself doing similar transformations in multiple cases or copy-pasting some piece of code more than once, I would implore you to write functions because they:
A good resource to start learning about writing functions is How to Write a Function in R.
When converting XML into a data frame, one of the simplest cases to deal with is a set of nodes which each have a single value. You can think of singular nodes as the contents of a single cell in a data frame. When your data are entirely made up of nodes like this, you can convert XML to a data frame using two functions from {XML}:
xmlParse() parses an XML file into an R structure9 representing the XML tree
xmlToDataFrame() extracts data from an XML document directly into a data frame, but with limited flexibility
In the example below, we need to use getNodeSet(., [path =] "//bill") to select only nodes in <bill>. If we just passed bill_xml_parse, we would get a dataframe with two rows because the top-level nodes are bill and the metadata identifier node <dublinCore>.
Rows: 1
Columns: 28
$ billNumber <chr> "391"
$ createDate <chr> "2021-01-22T08:12:10Z"
$ updateDate <chr> "2022-09-07T13:37:22Z"
$ originChamber <chr> "House"
$ billType <chr> "HR"
$ introducedDate <chr> "2021-01-21"
$ congress <chr> "117"
$ constitutionalAuthorityStatementText <chr> "<pre>[Congressional Record Volum…
$ recordedVotes <chr> "188https://clerk.house.gov/evs/2…
$ committees <chr> "ssfr00Foreign Relations Committe…
$ committeeReports <chr> ""
$ relatedBills <chr> ""
$ actions <chr> "2021-07-12ssfr00Foreign Relation…
$ sponsors <chr> "C001078Rep. Connolly, Gerald E. …
$ cosponsors <chr> "C000266Rep. Chabot, Steve [R-OH-…
$ cboCostEstimates <chr> "2021-05-11T15:11:11ZH.R. 391, Gl…
$ laws <chr> ""
$ notes <chr> ""
$ policyArea <chr> "International Affairs"
$ subjects <chr> "Advisory bodiesAnimal and plant …
$ summaries <chr> "002021-01-21Introduced in House2…
$ title <chr> "Global Health Security Act of 20…
$ titles <chr> "Display TitleGlobal Health Secur…
$ amendments <chr> ""
$ textVersions <chr> "Referred in Senate2021-07-12T04:…
$ latestAction <chr> "2021-07-12Received in the Senate…
$ calendarNumbers <chr> ""
$ version <chr> "1.0.0"
Because the congressional data has elements like committees, actions, and sponsors which contain multiple child elements of their own, the contents are combined into single columns. When parsing nested XML like this, it might require multiple steps to get the data into a tidy data frame.
In this case, the first step is to isolate singular elements like billNumber and billType, and deal with nested elements individually afterwards.
Singular elements are nodes which meet two criteria:
xml_find_all() finds nodes which match an XPath (XML Path Language) expression10 in a similar way as regular expressions is used to match strings. As the function’s help file says, “XPath is like regular expressions for trees”.
In the code below the [ expressions within the square brackets ] are called predicates which I use to find nodes which have no children with the XPath count function and nodes which are not empty strings11. This returns an {xml_nodeset} which can be coerced to a list using as_list().
List of 10
$ :List of 1
..$ : chr "391"
$ :List of 1
..$ : chr "2021-01-22T08:12:10Z"
$ :List of 1
..$ : chr "2022-09-07T13:37:22Z"
$ :List of 1
..$ : chr "House"
$ :List of 1
..$ : chr "HR"
$ :List of 1
..$ : chr "2021-01-21"
$ :List of 1
..$ : chr "117"
$ :List of 1
..$ : chr "<pre>[Congressional Record Volume 167, Number 17 (Thursday, January 28, 2021)][House]From the Congressional Rec"| __truncated__
$ :List of 1
..$ : chr "Global Health Security Act of 2021"
$ :List of 1
..$ : chr "1.0.0"
Now that we have the data we want in a list, we can flatten each element in the list into a data frame column using purrr’s flatten_dfc(). However, notice the elements in a singular_list are not named. If we flatten this list to a data frame, our columns also won’t have any column names. xml_name() creates a character vector of the nodes’ names. When used in combination with setNames() we can create a named list. Just make sure singular_list and singular_nodes are the same length and in the same order.
When flattened, a named list becomes a data frame with column names in tow.
Rows: 1
Columns: 10
$ billNumber <chr> "391"
$ createDate <chr> "2021-01-22T08:12:10Z"
$ updateDate <chr> "2022-09-07T13:37:22Z"
$ originChamber <chr> "House"
$ billType <chr> "HR"
$ introducedDate <chr> "2021-01-21"
$ congress <chr> "117"
$ constitutionalAuthorityStatementText <chr> "<pre>[Congressional Record Volum…
$ title <chr> "Global Health Security Act of 20…
$ version <chr> "1.0.0"
With the singular nodes transformed, we have the base of our data frame. The next step is to choose a nested element, define its structure in the output, and transform the XML accordingly.
In congressional XML data, the actions and votes on a bill are stored in containers. To collect this information and keep our dataset at the bill-level, we can transform each container into data frames using the same steps as before and then assign them as a list columns in our bill_df data frame.
Actions are stored in a container node where each <item> represents a congressional action taken for a bill, such as being introduced, sent to a committee, debated on the floor, etc. Just as before, we use as_list() to convert the {xml_nodeset} to a list.
List of 6
$ actionDate :List of 1
..$ : chr "2021-07-12"
$ committees :List of 1
..$ item:List of 2
.. ..$ systemCode:List of 1
.. ..$ name :List of 1
$ links : list()
$ sourceSystem:List of 2
..$ code:List of 1
.. ..$ : chr "0"
..$ name:List of 1
.. ..$ : chr "Senate"
$ text :List of 1
..$ : chr "Received in the Senate and Read twice and referred to the Committee on Foreign Relations."
$ type :List of 1
..$ : chr "IntroReferral"
For an individual action, we have the date, a list of committees related to the action, some administrative information, the text and type of action. To deal with this, we can write a function (or set of functions) like the ones below to transform actions from a list to a data frame with their own nested elements.
#' Helper function: flatten_dfc_rename
#' Flatten a list into a data frame and apply a prefix to the column names
#'
#' @param list_to_flatten A list to flatten
#' @param name_prefix A string to apply as a prefix to columns
#'
#' @return A data frame of columns that begin with the prefix
flatten_dfc_rename = function(list_to_flatten,
name_prefix = "prefix"){
rename_with(
.data = flatten_dfc(list_to_flatten),
.fn = ~str_c(name_prefix, "_", .),
# Exclude columns which already start with the prefix
.cols = -starts_with(name_prefix)
)
}
#' Function: parse_action
#' Parse actions from list to data frame
#'
#' Separately flatten and rename source system and committee elements,
#' then flatten and rename remaining elements.
#' Committee elements are flattened into a nested data frame
#' because an action can relate to multiple committees.
#'
#' @param action An action element in list form
#'
#' @return A data frame with a single row containing action information
parse_action = function(action){
action %>%
# Flatten + rename sourceSystem elements
map_at("sourceSystem", ~flatten_dfc_rename(.x, "source")) %>%
# Flatten + rename committees
map_at("committees", function(committee){
# using map_at and map_dfr to create a data frame row for each committee
map_dfr(committee, ~flatten_dfc_rename(.x, "committee"))
}) %>%
# Flatten object to data frame
flatten_dfc_rename(., "action") %>%
# Lastly, clean the names
janitor::clean_names()
}Using the {purrr} library’s map_dfr(), we apply my parse_action() function to each element of our actions list and combine the results into data frame rows.
Rows: 14
Columns: 9
$ action_date <chr> "2021-07-12", "2021-06-28", "2021-06-28",…
$ action_committee_system_code <chr> "ssfr00", NA, NA, NA, NA, NA, NA, NA, NA,…
$ action_committee_name <chr> "Foreign Relations Committee", NA, NA, NA…
$ action_source_code <chr> "0", "2", "2", "9", "2", "2", "2", "2", "…
$ action_source_name <chr> "Senate", "House floor actions", "House f…
$ action_text <chr> "Received in the Senate and Read twice an…
$ action_type <chr> "IntroReferral", "Floor", "Floor", "Floor…
$ action_time <chr> NA, "19:46:03", "19:46:02", "19:46:02", "…
$ action_code <chr> NA, "H38310", "H37300", "8000", "H30000",…
When you plan to combine rows into a data frame or unnest() the data in the future, it is useful to be explicit about the expected data types of each column. This can be done using type_convert().
# Specify column data types
actions_col_types = cols(
action_date = col_date(),
action_time = col_time(),
action_committee_systemCode = col_character(),
action_committee_name = col_character(),
action_source_code = col_character(),
action_source_name = col_character(),
action_text = col_character(),
action_type = col_character(),
action_code = col_character()
)
# Convert column types
actions_df = type_convert(actions_df,
col_types = actions_col_types)Because we are only adding a single row of data to bill_df we can encase the actions in a list and use dollar assignment. If you were adding multiple rows of data, you would need to make sure that you are adding a column of the same length as bill_df.
So we’ve dealt with cases with multiple child elements, but XML allows for all sorts of weird and wonderful storage options. The congressional data includes information about votes taken on a bill, the total vote count, and the votes of individual legislators. However, only the high level information is included in the <recordedVotes> element. the more detailed information on individual legislators’ votes are contained in an XML file referenced by the <url> element as you can see below.12
{xml_nodeset (7)}
[1] <rollNumber>188</rollNumber>
[2] <url>https://clerk.house.gov/evs/2021/roll188.xml</url>
[3] <fullActionName>Final Passage Under Suspension of the Rules Results</full ...
[4] <chamber>House</chamber>
[5] <congress>117</congress>
[6] <date>2021-06-28T23:46:02Z</date>
[7] <sessionNumber>1</sessionNumber>
Before we dive one level deeper, we can convert the top-level nodes to a list and flatten it into columns to start building our vote-level data frame.
<recordedVotes> to a list and flatten into data frameRows: 1
Columns: 7
$ rollNumber <chr> "188"
$ url <chr> "https://clerk.house.gov/evs/2021/roll188.xml"
$ fullActionName <chr> "Final Passage Under Suspension of the Rules Results"
$ chamber <chr> "House"
$ congress <chr> "117"
$ date <chr> "2021-06-28T23:46:02Z"
$ sessionNumber <chr> "1"
We use flatten_dfc() to flatten each vote element into a data frame as columns and then map_dfr() rolls them all up as rows into a single data frame.
Now to get the vote roll XML file we’ll use read_xml() . There are two main nodes - <vote-metadata> and <vote-data>. One node contains the aggregated vote information, while <vote-data> contains the legislator-level vote data. We’ll focus on <vote-data> here.
vote_roll_xml = read_xml(recorded_votes_df$url)
# Find all vote-data elements
vote_data = xml_find_all(vote_roll_xml, "vote-data")
# Find all recorded-vote elements
vote_legislators = xml_find_all(vote_data, "recorded-vote")
# Convert to list and show first element
(legislators_list = as_list(vote_legislators))[1] %>%
glimpse()List of 1
$ :List of 2
..$ legislator:List of 1
.. ..$ : chr "Adams"
.. ..- attr(*, "name-id")= chr "A000370"
.. ..- attr(*, "sort-field")= chr "Adams"
.. ..- attr(*, "unaccented-name")= chr "Adams"
.. ..- attr(*, "party")= chr "D"
.. ..- attr(*, "state")= chr "NC"
.. ..- attr(*, "role")= chr "legislator"
..$ vote :List of 1
.. ..$ : chr "Yea"
If we use the same combination of as_list() , map_dfr(), and flatten_dfc() we can flatten the XML into one row per legislator but we lose all the attributes.
# A tibble: 430 × 2
legislator vote
<chr> <chr>
1 Adams Yea
2 Aderholt Nay
3 Aguilar Yea
4 Allen Nay
5 Allred Yea
6 Amodei Yea
7 Armstrong Yea
8 Arrington Nay
9 Auchincloss Yea
10 Axne Yea
# … with 420 more rows
To extract the attributes before we flatten the data, we can apply map_at() to each legislator element and extract the attributes from each while retaining the value in vote.13
List of 2
$ legislator:List of 6
..$ name-id : chr "A000370"
..$ sort-field : chr "Adams"
..$ unaccented-name: chr "Adams"
..$ party : chr "D"
..$ state : chr "NC"
..$ role : chr "legislator"
$ vote :List of 1
..$ : chr "Yea"
Rows: 430
Columns: 7
$ name_id <chr> "A000370", "A000055", "A000371", "A000372", "A000376",…
$ sort_field <chr> "Adams", "Aderholt", "Aguilar", "Allen", "Allred", "Am…
$ unaccented_name <chr> "Adams", "Aderholt", "Aguilar", "Allen", "Allred", "Am…
$ party <chr> "D", "R", "D", "R", "D", "R", "R", "R", "D", "D", "R",…
$ state <chr> "NC", "AL", "CA", "GA", "TX", "NV", "ND", "TX", "MA", …
$ role <chr> "legislator", "legislator", "legislator", "legislator"…
$ vote <chr> "Yea", "Nay", "Yea", "Nay", "Yea", "Yea", "Yea", "Nay"…
It can often feel like you’re getting lost in a list of lists, but with some experimentation you’ll be able to find your way back to the surface.
Now we have a table of legislator voting data, and the last thing to take care of is the <vote-metadata>. Everything other than the <vote-totals> element is singular so we can get that out of the way the same way as before:
<vote-metadata> element{xml_nodeset (13)}
[1] <majority>D</majority>
[2] <congress>117</congress>
[3] <session>1st</session>
[4] <chamber>U.S. House of Representatives</chamber>
[5] <rollcall-num>188</rollcall-num>
[6] <legis-num>H R 391</legis-num>
[7] <vote-question>On Motion to Suspend the Rules and Pass, as Amended</vote ...
[8] <vote-type>2/3 YEA-AND-NAY</vote-type>
[9] <vote-result>Passed</vote-result>
[10] <action-date>28-Jun-2021</action-date>
[11] <action-time time-etz="19:42">7:42 PM</action-time>
[12] <vote-desc>Global Health Security Act</vote-desc>
[13] <vote-totals>\n <totals-by-party-header>\n <party-header>Party</part ...
Because we have the individual votes, we don’t need to keep the <vote-totals> element, but it is useful to note because it could serve as a check for data consistency in the future. Instead, we’ll just keep the singular nodes in <vote-metadata>
Rows: 1
Columns: 12
$ majority <chr> "D"
$ congress <chr> "117"
$ session <chr> "1st"
$ chamber <chr> "U.S. House of Representatives"
$ rollcall_num <chr> "188"
$ legis_num <chr> "H R 391"
$ vote_question <chr> "On Motion to Suspend the Rules and Pass, as Amended"
$ vote_type <chr> "2/3 YEA-AND-NAY"
$ vote_result <chr> "Passed"
$ action_date <chr> "28-Jun-2021"
$ action_time <chr> "7:42 PM"
$ vote_desc <chr> "Global Health Security Act"
Once we have our nodeset (which at last are all singular), we use the same listing, mapping, and flattening…or lappening as absolutely no one calls it.
Now we can put all the pieces together and nest each element in its parent element until we’ve gotten back to the bill-level i.e. the vote roll data frame goes in the votes data frame, which goes in the bill data frame.
Rows: 1
Columns: 8
$ rollNumber <chr> "188"
$ url <chr> "https://clerk.house.gov/evs/2021/roll188.xml"
$ fullActionName <chr> "Final Passage Under Suspension of the Rules Results"
$ chamber <chr> "House"
$ congress <chr> "117"
$ date <chr> "2021-06-28T23:46:02Z"
$ sessionNumber <chr> "1"
$ vote <list> [<tbl_df[1 x 13]>]
Now we have the bill-level characteristics with action and vote information nested in list columns. We can keep the data at the bill level or unnest() the elements we want to use.
Rows: 1
Columns: 12
$ billNumber <chr> "391"
$ createDate <chr> "2021-01-22T08:12:10Z"
$ updateDate <chr> "2022-09-07T13:37:22Z"
$ originChamber <chr> "House"
$ billType <chr> "HR"
$ introducedDate <chr> "2021-01-21"
$ congress <chr> "117"
$ constitutionalAuthorityStatementText <chr> "<pre>[Congressional Record Volum…
$ title <chr> "Global Health Security Act of 20…
$ version <chr> "1.0.0"
$ actions <list> [<tbl_df[14 x 9]>]
$ votes <list> [<tbl_df[1 x 8]>]
Rows: 14
Columns: 11
$ billType <chr> "HR", "HR", "HR", "HR", "HR", "HR", "HR",…
$ billNumber <chr> "391", "391", "391", "391", "391", "391",…
$ action_date <date> 2021-07-12, 2021-06-28, 2021-06-28, 2021…
$ action_committee_system_code <chr> "ssfr00", NA, NA, NA, NA, NA, NA, NA, NA,…
$ action_committee_name <chr> "Foreign Relations Committee", NA, NA, NA…
$ action_source_code <chr> "0", "2", "2", "9", "2", "2", "2", "2", "…
$ action_source_name <chr> "Senate", "House floor actions", "House f…
$ action_text <chr> "Received in the Senate and Read twice an…
$ action_type <chr> "IntroReferral", "Floor", "Floor", "Floor…
$ action_time <time> NA, 19:46:03, 19:46:02, 19:46:02, …
$ action_code <chr> NA, "H38310", "H37300", "8000", "H30000",…
Rows: 1
Columns: 22
$ billType <chr> "HR"
$ billNumber <chr> "391"
$ rollNumber <chr> "188"
$ url <chr> "https://clerk.house.gov/evs/2021/roll188.xml"
$ fullActionName <chr> "Final Passage Under Suspension of the Rules Res…
$ chamber <chr> "House"
$ congress <chr> "117"
$ date <chr> "2021-06-28T23:46:02Z"
$ sessionNumber <chr> "1"
$ vote_majority <chr> "D"
$ vote_congress <chr> "117"
$ vote_session <chr> "1st"
$ vote_chamber <chr> "U.S. House of Representatives"
$ vote_rollcall_num <chr> "188"
$ vote_legis_num <chr> "H R 391"
$ vote_vote_question <chr> "On Motion to Suspend the Rules and Pass, as Ame…
$ vote_vote_type <chr> "2/3 YEA-AND-NAY"
$ vote_vote_result <chr> "Passed"
$ vote_action_date <chr> "28-Jun-2021"
$ vote_action_time <chr> "7:42 PM"
$ vote_vote_desc <chr> "Global Health Security Act"
$ vote_legislator_votes <list> [<tbl_df[430 x 7]>]
# Select bill ID and votes
select(bill_df, billType, billNumber, votes) %>%
unnest(votes) %>%
# Select bill ID, roll number, and vote list col
select(billType:rollNumber, vote) %>%
unnest(vote) %>%
# Select bill ID, roll number, and legislators list col
select(billType:rollNumber, legislator_votes) %>%
unnest(legislator_votes) %>%
glimpse()Rows: 430
Columns: 10
$ billType <chr> "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", …
$ billNumber <chr> "391", "391", "391", "391", "391", "391", "391", "391"…
$ rollNumber <chr> "188", "188", "188", "188", "188", "188", "188", "188"…
$ name_id <chr> "A000370", "A000055", "A000371", "A000372", "A000376",…
$ sort_field <chr> "Adams", "Aderholt", "Aguilar", "Allen", "Allred", "Am…
$ unaccented_name <chr> "Adams", "Aderholt", "Aguilar", "Allen", "Allred", "Am…
$ party <chr> "D", "R", "D", "R", "D", "R", "R", "R", "D", "D", "R",…
$ state <chr> "NC", "AL", "CA", "GA", "TX", "NV", "ND", "TX", "MA", …
$ role <chr> "legislator", "legislator", "legislator", "legislator"…
$ vote <chr> "Yea", "Nay", "Yea", "Nay", "Yea", "Yea", "Yea", "Nay"…
I’ll stop there for brevity’s sake, but you can find the code for extracting the full XML file here14. Please reach out with any feedback, leave a comment below, or share this with someone who’s struggling with XML. There may even be a reward for anyone who finds mistakes in my code 🤷.
Here are some of the helpful articles I came across in the course of writing this:
Version information about R, OS, and loaded packages.
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23 ucrt)
os Windows 10 x64 (build 19045)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_Ireland.utf8
ctype English_Ireland.utf8
tz America/Los_Angeles
date 2022-11-09
pandoc 2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P assertthat 0.2.1 2019-03-21 [?] CRAN (R 4.2.0)
P backports 1.4.1 2021-12-13 [?] CRAN (R 4.2.0)
P bit 4.0.4 2020-08-04 [?] CRAN (R 4.2.0)
P bit64 4.0.5 2020-08-30 [?] CRAN (R 4.2.0)
P bitops 1.0-7 2021-04-24 [?] CRAN (R 4.2.0)
P broom 1.0.0 2022-07-01 [?] CRAN (R 4.2.1)
P cellranger 1.1.0 2016-07-27 [?] CRAN (R 4.2.0)
P cli 3.4.0 2022-09-08 [?] CRAN (R 4.2.1)
P colorspace 2.0-3 2022-02-21 [?] CRAN (R 4.2.0)
P crayon 1.5.1 2022-03-26 [?] CRAN (R 4.2.0)
P curl 4.3.2 2021-06-23 [?] CRAN (R 4.2.0)
P DBI 1.1.3 2022-06-18 [?] CRAN (R 4.2.0)
P dbplyr 2.2.1 2022-06-27 [?] CRAN (R 4.2.1)
P digest 0.6.29 2021-12-01 [?] CRAN (R 4.2.0)
P dplyr * 1.0.10 2022-09-01 [?] CRAN (R 4.2.1)
P ellipsis 0.3.2 2021-04-29 [?] CRAN (R 4.2.0)
P evaluate 0.15 2022-02-18 [?] CRAN (R 4.2.0)
P extrafont * 0.18 2022-04-12 [?] CRAN (R 4.2.0)
P extrafontdb 1.0 2012-06-11 [?] CRAN (R 4.2.0)
P fansi 1.0.3 2022-03-24 [?] CRAN (R 4.2.0)
P fastmap 1.1.0 2021-01-25 [?] CRAN (R 4.2.0)
P forcats * 0.5.1 2021-01-27 [?] CRAN (R 4.2.0)
P fs 1.5.2 2021-12-08 [?] CRAN (R 4.2.0)
P gargle 1.2.0 2021-07-02 [?] CRAN (R 4.2.0)
P generics 0.1.3 2022-07-05 [?] CRAN (R 4.2.1)
P ggplot2 * 3.3.6 2022-05-03 [?] CRAN (R 4.2.0)
P glue 1.6.2 2022-02-24 [?] CRAN (R 4.2.0)
P googledrive 2.0.0 2021-07-08 [?] CRAN (R 4.2.0)
P googlesheets4 1.0.0 2021-07-21 [?] CRAN (R 4.2.0)
P gtable 0.3.1 2022-09-01 [?] CRAN (R 4.2.1)
P haven 2.5.0 2022-04-15 [?] CRAN (R 4.2.0)
P hms 1.1.1 2021-09-26 [?] CRAN (R 4.2.0)
P htmltools 0.5.3 2022-07-18 [?] CRAN (R 4.2.1)
P htmlwidgets 1.5.4 2021-09-08 [?] CRAN (R 4.2.0)
P httr 1.4.3 2022-05-04 [?] CRAN (R 4.2.0)
P janitor 2.1.0 2021-01-05 [?] CRAN (R 4.2.1)
P jsonlite * 1.8.0 2022-02-22 [?] CRAN (R 4.2.0)
P knitr 1.40 2022-08-24 [?] CRAN (R 4.2.1)
P lifecycle 1.0.1 2021-09-24 [?] CRAN (R 4.2.0)
P log4r * 0.4.2 2021-11-04 [?] CRAN (R 4.2.0)
P lubridate * 1.8.0 2021-10-07 [?] CRAN (R 4.2.0)
P magrittr 2.0.3 2022-03-30 [?] CRAN (R 4.2.0)
P microbenchmark 1.4.9 2021-11-09 [?] CRAN (R 4.2.0)
P modelr 0.1.8 2020-05-19 [?] CRAN (R 4.2.0)
mokeR * 0.1.0 2022-10-03 [1] local (C:/Users/Mark/Documents/Projects/mokeR_0.1.0.tar.gz)
P munsell 0.5.0 2018-06-12 [?] CRAN (R 4.2.0)
P pillar 1.8.1 2022-08-19 [?] CRAN (R 4.2.1)
P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.2.0)
P purrr * 0.3.4 2020-04-17 [?] CRAN (R 4.2.0)
P R6 2.5.1 2021-08-19 [?] CRAN (R 4.2.0)
P RCurl * 1.98-1.9 2022-10-03 [?] CRAN (R 4.2.1)
P reactable * 0.3.0 2022-05-26 [?] CRAN (R 4.2.0)
P reactR 0.4.4 2021-02-22 [?] CRAN (R 4.2.0)
P readr * 2.1.2 2022-01-30 [?] CRAN (R 4.2.0)
P readxl 1.4.0 2022-03-28 [?] CRAN (R 4.2.0)
renv 0.15.5 2022-05-26 [1] CRAN (R 4.2.1)
P reprex 2.0.1 2021-08-05 [?] CRAN (R 4.2.0)
P rjson * 0.2.21 2022-01-09 [?] CRAN (R 4.2.0)
P rlang 1.0.5 2022-08-31 [?] CRAN (R 4.2.1)
P rmarkdown 2.18 2022-11-09 [?] CRAN (R 4.2.1)
P rstudioapi 0.13 2020-11-12 [?] CRAN (R 4.2.0)
Rttf2pt1 1.3.8 2020-01-10 [1] CRAN (R 4.2.1)
P rvest 1.0.2 2021-10-16 [?] CRAN (R 4.2.0)
P scales 1.2.1 2022-08-20 [?] CRAN (R 4.2.1)
P sessioninfo 1.2.2 2021-12-06 [?] CRAN (R 4.2.0)
P snakecase 0.11.0 2019-05-25 [?] CRAN (R 4.2.1)
P stringi 1.7.8 2022-07-11 [?] CRAN (R 4.2.1)
P stringr * 1.4.1 2022-08-20 [?] CRAN (R 4.2.1)
P tibble * 3.1.8 2022-07-22 [?] CRAN (R 4.2.1)
P tidyjson * 0.3.1 2020-05-31 [?] CRAN (R 4.2.0)
P tidyr * 1.2.1 2022-09-08 [?] CRAN (R 4.2.1)
P tidyselect 1.1.2 2022-02-21 [?] CRAN (R 4.2.0)
P tidyverse * 1.3.2 2022-07-18 [?] CRAN (R 4.2.1)
P tzdb 0.3.0 2022-03-28 [?] CRAN (R 4.2.0)
P utf8 1.2.2 2021-07-24 [?] CRAN (R 4.2.0)
P vctrs 0.4.1 2022-04-13 [?] CRAN (R 4.2.0)
P vroom 1.5.7 2021-11-30 [?] CRAN (R 4.2.0)
P withr 2.5.0 2022-03-03 [?] CRAN (R 4.2.0)
P xfun 0.32 2022-08-10 [?] CRAN (R 4.2.1)
P XML * 3.99-0.11 2022-10-03 [?] CRAN (R 4.2.1)
P xml2 * 1.3.3 2021-11-30 [?] CRAN (R 4.2.0)
P yaml 2.3.5 2022-02-21 [?] CRAN (R 4.2.0)
[1] C:/Users/Mark/Documents/Projects/Blog/renv/library/R-4.2/x86_64-w64-mingw32
[2] C:/Program Files/R/R-4.2.1/library
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────
Wikipedia for the curious. Don’t miss the drama about Microsoft’s “vociferous protests” as one of the co-editors started working with Netscape.↩︎
The x at the end of Microsoft file types (e.g. .docx, .xlsx, .pptx) stands for XML.↩︎
To learn more about what XML is and how to use it, I recommend going through w3school’s XML tutorial.↩︎
Between {XML} and {xml2} there are numerous ways to load XML data with R. I generally prefer {xml2} but I suggest looking through the packages’ documentation to find the right function for your task↩︎
xml_document is one of the “key classes” used in the {xml2} library, the others being xml_node (a single node), and xml_nodeset (a set of nodes).↩︎
Which is a document metadata identifier, not an Irish metal music genre. While very useful in its own right, it isn’t relevant for our task here.↩︎
For example
| Bill | Party | \(\cdots\) | Yeas | Nays | Abstentions |
|---|---|---|---|---|---|
| HR-1 | R | \(\cdots\) | 221 | 0 | 1 |
| HR-1 | D | \(\cdots\) | 0 | 212 | 1 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
Atomic vectors are vectors with elements of the same type (e.g. numeric, character, etc.) as opposed to general vectors, or lists, which can contain a variety of types.↩︎
When you’re working with both the {XML} and {xml2} libraries, it is important to note that their functions often rely on different object types. In the case of xmlToDataFrame(), it does not take {xml_document} or {xml_node} objects. If you try to use them, you’ll see an error like this:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'xmlToDataFrame' for signature '"xml_document", "missing", "missing", "missing", "missing"'
In the process of writing this, I discovered that using XPath made this ~5x faster than an equivalent function in R.
# Function to select singular child nodes from XML node
xml_singular_nodes = function(xml_node){
# Return child nodes of current node
child_nodes = xml_children(xml_node)
# Select child nodes with 0 children
zero_length_child_nodes = child_nodes[xml_length(child_nodes) == 0]
# Keep the nodes which are not empty strings
keep(zero_length_child_nodes, ~(xml_text(.) != ""))
}# Using XPath:
# singular_nodes1 = xml_find_all(bill_xml,
# "//bill/*[count(./*) = 0 and not(string-length(.) = 0)]")
# Using R function:
# singular_nodes2 = xml_singular_nodes(bill_node)
# Check they are the same
# all.equal(singular_nodes1, singular_nodes2)
# benchmark the two different ways of selecting nodes
microbenchmark::microbenchmark(
xml_singular_nodes = xml_singular_nodes(bill_node),
xml_find_all = xml_find_all(bill_xml, "//bill/*[count(./*) = 0 and not(string-length(.) = 0)]")
)Unit: microseconds
expr min lq mean median uq max neval
xml_singular_nodes 1040.2 1087.15 1182.637 1101.25 1131.15 7016.6 100
xml_find_all 200.5 211.45 231.863 220.85 242.45 463.2 100
↩︎We do lose the top-level element in the legislator list. This does not matter in this case because the name is repeated in the attributes, but be aware of this in case you want to combine values in a list with its attributes.↩︎
Ctrl/Cmd+F: extract_bill_status↩︎
---
title: "Untangling XML data with R"
subtitle: "And parsing it into a tidy format"
date: "2022-07-14"
categories:
- data
- governance
- R
format:
html:
toc: true
code-link: true
code-fold: show
code-tools: true
comments:
hypothesis: true
---
```{r}
#| label: setup
#| message: false
#| warning: false
#| echo: false
library(extrafont)
library(reactable)
source("https://github.com/MokeEire/my-reps/raw/master/R/parsing_functions.R")
bill_xml_url = "https://www.govinfo.gov/bulkdata/BILLSTATUS/117/hr/BILLSTATUS-117hr391.xml"
bill_xml_file = read_xml(bill_xml_url)
```
XML, or (e)Xtensible Markup Language, is a versatile data format designed to store any "arbitrary" data structure[^1] but it can be a pain to transform into a typical tabular format. It is used to store things like financial transactions, webpage content and document formatting[^2]. When I've worked with XML in the past, there never seems to be many resources describing how to actually *use* the data. This post aims to remedy that by describing how to parse XML using [GovInfo's congressional bill data](https://www.govinfo.gov/bulkdata/BILLSTATUS) as an example.
[^1]: [Wikipedia](https://en.wikipedia.org/wiki/XML) for the curious. Don't miss the drama about Microsoft's "vociferous protests" as one of the co-editors started working with Netscape.
[^2]: The *x* at the end of Microsoft file types (e.g. .docx, .xlsx, .pptx) stands for XML.
::: {.callout-tip appearance="simple"}
## What you'll learn:
1. Load and parse XML files in R using the `{XML}` and `{xml2}` packages
2. Explore the structure of XML documents
3. Transform your data from XML into a tabular format using `{purrr}`, and make use of `{tidyr}`'s list columns
:::
::: {.callout-note collapse="true" appearance="simple"}
## TL;DR
Transforming XML into a tabular format can be tricky depending on the structure. Start by understanding how the data is organized, what you want it to look like, and what level of detail you are looking at. This post uses [GovInfo's congressional bill data](https://www.govinfo.gov/bulkdata/BILLSTATUS) to explore examples of XML structure.
1. [Navigate XML](#sec-navigate-xml)
> Look through any available documentation, always. When you read in the XML file, `xmlParse()`, `xml_structure()`, and `xml_contents()` are some of the functions that will help you navigate the structure you're dealing with.
2. [**Define the output**]
> Consider what you want the output to look like and think about how the data needs to be transformed to match this. Once you understand the existing structure, the data you need, and your desired output structure, it is time to start parsing.
3. [**Parse a single element**] (write a function if it gets too complicated)
> Get one element into the form you want first. Writing functions can help you think through the data transformations being applied and make your code easier to read. You can see an example of this functional approach in [parsing legislative actions](#sec-nested-data). Once you're reliably parsing multiple elements and producing a consistent output format, you can scale up.
4. **Apply to all elements**
> Next, focus on processing a single file in full (or the subset of the file you're interested in). You might want an XML file to return a single row, a single column, or a data frame of size $n\times k$. Once you have a single file returned in the format you want, you can combine the outputs of all the files in your dataset.
:::
# What is XML?
At its core, XML is "information wrapped in tags". You can think of the structure like branches of a tree, with each branch having their own leaves. Below is an example of a bookstore's data in XML format from [w3schools](https://www.w3schools.com/XML/xml_usedfor.asp).
``` xml
<bookstore>
<book category="children">
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title>Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
```
`<bookstore>` is the parent, or *root* element here. The two `<book>` elements are the bookstore's *children*, each of which has individual data points in *child* elements (e.g. `<title>`) and *attributes* e.g. `<book category="children">` store books' category.
::: {.callout-note appearance="simple"}
## Note
Attributes are a useful way to store data related to a specific element for HTML, but using child elements is typically preferred in XML[^3]. However, you'll still likely run into data stored in attributes so we'll touch on extracting data from both.
:::
[^3]: To learn more about what XML is and how to use it, I recommend going through [w3school's XML tutorial](https://www.w3schools.com/XML/default.asp).
# Navigate XML {#sec-navigate-xml}
If you've never used R before, this will be a bit of a rough point to start learning. Instead I would recommend checking [RStudio's beginners guide](https://education.rstudio.com/learn/beginner/), so for the remainder of the post I will assume some familiarity with R.
Four packages I use throughout this post are listed below. If you don't understand how a function works or want to see the function reference, click on the function in the code and it *should* bring you to the function's help page.
- `{XML}` and `{xml2}` provide functions for reading in and parsing XML[^4]
- `{purrr}` and `{tidyr}` (loaded as part of the `{tidyverse}` package) help us wrangle lists and transform XML to data frames.
[^4]: Between `{XML}` and `{xml2}` there are numerous ways to load XML data with R. I generally prefer `{xml2}` but I suggest looking through the packages' documentation to find the right function for your task
```{r}
#| label: load-libs
#| code-fold: show
#| code-summary: "Load libraries"
# copy the comment below to install the packages
# install.packages(c("tidyverse", "XML", "xml2))
library(tidyverse)
library(XML)
library(xml2)
```
To illustrate complexities of XML structures I'll be using [congressional bill status data](https://www.govinfo.gov/bulkdata/BILLSTATUS) from the US GPO's [GovInfo service](https://www.govinfo.gov/). In particular, you can download the sample bill [here](https://www.govinfo.gov/bulkdata/BILLSTATUS/117/hr/BILLSTATUS-117hr391.xml "HR-391: Global Health Security Act of 2021") (chosen for its examples of challenges parsing XML).
First we need to read in the XML file. You can pass the URL or file path of the XML file to `read_xml()` to create an `xml_document` object[^5]. To see the contents of the file you've just loaded you can use `xml_contents()` or enter the object's name (here `bill_xml`) in the console.
[^5]: `xml_document` is one of the "key classes" used in the `{xml2}` library, the others being `xml_node` (a single node), and `xml_nodeset` (a set of nodes).
```{r}
#| label: read-xml
#| code-fold: show
#| code-summary: "Load and view XML file"
# Read XML file
bill_xml = read_xml("https://www.govinfo.gov/bulkdata/BILLSTATUS/117/hr/BILLSTATUS-117hr391.xml")
# Look at xml_document's contents
xml_contents(bill_xml)
```
We have two top-level nodes - `<bill>` and `<dublinCore>`[^6]. The latter is actually a metadata identifier and doesn't contain bill-related information. `xml_child()` selects the `<bill>` node which returns an XML nodeset that looks like this:
[^6]: Which is a document metadata identifier, not an Irish metal music genre. While very useful in its own right, it isn't relevant for our task here.
```{r}
#| label: xml-bill-child
#| code-fold: show
#| code-summary: "Select first matching child element"
(bill_node = xml_child(bill_xml, "bill"))
```
# Define the output {#sec-define-output}
Now we have the bill XML loaded and we've already stripped away some data we don't need but before diving deeper, it's helpful to think about the how you want to the processed XML data to be structured. For example, in the bill status data I might want to collect party-level votes. This could take the form of one row per party per bill with a column for yeas, nays, and abstentions[^7]. Perhaps I want to look at the policy areas of bills which became law. Instead of processing the entire file and then selecting the data I want, I can select only the nodes I need at the outset before transforming the data into a data frame. This will likely save you some time in cases where you have a lot of data.
[^7]: For example
| Bill | Party | $\cdots$ | Yeas | Nays | Abstentions |
|----------|----------|----------|----------|----------|-------------|
| HR-1 | R | $\cdots$ | 221 | 0 | 1 |
| HR-1 | D | $\cdots$ | 0 | 212 | 1 |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
: Potential structure for party-level votes
```{r}
#| label: example-bill
#| echo: false
#| message: false
#| column: page
example_bill_df = extract_bill_status(bill_xml_url, log_types = NULL)
```
With the congressional bills data, I defined my output as a bill-level dataset, i.e. one bill per row, storing *nested* elements like `recordedVotes`, `committees`, and `actions` inside [list columns](https://jennybc.github.io/purrr-tutorial/ls13_list-columns.html). While columns in a data frame are typically *atomic vectors*[^8] of equal length, list columns enable us to store data of varying length and type, including data frames themselves. Using `{tidyr}`'s `unnest()` function, we can expand this nested data into regular data frame columns. Here is what this output will look like:
[^8]: Atomic vectors are vectors with elements of the same type (e.g. numeric, character, etc.) as opposed to *general vectors*, or lists, which can contain a variety of types.
```{r}
#| label: example-bill-rt
#| echo: false
#| eval: true
#| column: page
example_bill_df %>% #glimpse() # here to check the data
reactable::reactable(
theme = moke_rt(),
compact = T,
resizable = T,
wrap = F,
columns = list(
"bill_id" = colDef(show=F),
"bill_number" = colDef(name = "Bill No.", width = 80),
"create_date" = colDef(format = colFormat(date = T), width = 100),
"update_date" = colDef(format = colFormat(date = T), width = 100),
"origin_chamber" = colDef(show=F),
"bill_type" = colDef(name = "Bill type", width = 60),
"introduced_date" = colDef(format = colFormat(date = T), width = 130),
"congress" = colDef(width = 80),
"constitutional_authority_statement_text" = colDef(show=F),
"title" = colDef(minWidth = 180),
"version" = colDef(show = F, width = 80),
"policy_areas" = colDef(width = 120),
"legislative_subjects" = colDef(width = 200),
"bill_summaries" = colDef(show=F),
"bill_titles" = colDef(show=F),
"bill_text_versions" = colDef(show=F),
"latest_action_action_date" = colDef(name = "action_date",
width = 110,
format = colFormat(date = T)),
"latest_action_text" = colDef(name = "action_text",
minWidth = 200),
"committees" = colDef(minWidth = 80),
"house_votes" = colDef(width = 80),
"senate_votes" = colDef(width = 80),
"actions" = colDef(width = 80),
"action_counts" = colDef(show=F),
"sponsors" = colDef(width = 80),
"cosponsors" = colDef(width = 80)
),
columnGroups = list(
colGroup(name = "Latest Action",
columns = c("latest_action_action_date",
"latest_action_text")),
colGroup(name = "Nested Data",
columns = c("committees", "house_votes",
"senate_votes", "actions",
"sponsors", "cosponsors"))
)
)
```
Here's a `glimpse()` of some of the nested data:
::: panel-tabset
## Committees
```{r}
#| label: glimpse-committees
#| echo: false
glimpse(example_bill_df[["committees"]], width = 100)
```
## Votes
```{r}
#| label: glimpse-votes
#| echo: false
glimpse(example_bill_df[["house_votes"]], width = 90)
```
## Actions
```{r}
#| label: glimpse-actio
#| echo: false
glimpse(example_bill_df[["actions"]], width = 40)
```
## Sponsors
```{r}
#| label: glimpse-sponsors
#| echo: false
glimpse(example_bill_df[["sponsors"]], width = 70)
```
:::
How your XML is structured, and how you want to structure your tabular output will determine how you need to process the XML file. In general, this will involve the following steps:
1. Select the XML nodes you want to include in your dataset.
2. Convert the XML to a list.
3. `flatten()` the list into a data frame.
::: callout-tip
## Write Functions
If you find yourself doing similar transformations in multiple cases or copy-pasting some piece of code more than once, I would implore you to write functions because they:
1. make your code more reliable
2. make you think critically about how you're modifying the data
3. are easier to debug
A good resource to start learning about writing functions is [How to Write a Function in R](https://www.earthdatascience.org/courses/earth-analytics/automate-science-workflows/write-function-r-programming/).
:::
# The simple case
When converting XML into a data frame, one of the simplest cases to deal with is a set of nodes which each have a single value. You can think of singular nodes as the contents of a single cell in a data frame. When your data are entirely made up of nodes like this, you can convert XML to a data frame using two functions from `{XML}`:
- `xmlParse()` parses an XML file into an R structure[^9] representing the XML tree
- `xmlToDataFrame()` extracts data from an XML document directly into a data frame, but with limited flexibility
[^9]: When you're working with both the `{XML}` and `{xml2}` libraries, it is important to note that their functions often rely on different object types. In the case of `xmlToDataFrame()`, it does not take `{xml_document}` or `{xml_node}` objects. If you try to use them, you'll see an error like this:
```{r}
#| label: xml-df-error
#| error: true
xmlToDataFrame(bill_xml)
```
::: callout-note
## Note
In the example below, we need to use `getNodeSet(., [path =] "//bill")` to select only nodes in `<bill>`. If we just passed `bill_xml_parse`, we would get a dataframe with two rows because the top-level nodes are bill and the metadata identifier node `<dublinCore>`.
:::
```{r}
#| label: xml-to-df
#| code-summary: "Use {XML} functions to read XML data"
#| attr-output: '.details summary="Output of xmlToDataFrame"'
# First we need to use xmlParse function to create XMLDocument object
bill_xml_parse = xmlParse(bill_xml)
xmlToDataFrame(getNodeSet(bill_xml_parse, "//bill")) %>%
glimpse()
```
Because the congressional data has elements like `committees`, `actions`, and `sponsors` which contain multiple child elements of their own, the contents are combined into single columns. When parsing nested XML like this, it might require multiple steps to get the data into a [tidy data](https://vita.had.co.nz/papers/tidy-data.html) frame.
In this case, the first step is to isolate singular elements like `billNumber` and `billType`, and deal with nested elements individually afterwards.
Singular elements are nodes which meet two criteria:
1. the node does not have any child nodes of its own, and
2. the node is not an empty string.
`xml_find_all()` finds nodes which match an [XPath](https://www.w3schools.com/xml/xml_parser.asp) (XML Path Language) expression[^10] in a similar way as regular expressions is used to match strings. As the function's help file says, "XPath is like regular expressions for trees".
[^10]: [A very useful table of XPath syntax](https://www.w3schools.com/XML/xpath_syntax.asp){target="_blank"}
In the code below the `[` expressions within the square brackets `]` are called predicates which I use to find nodes which have no children with the XPath `count` function and nodes which are not empty strings[^11]. This returns an `{xml_nodeset}` which can be coerced to a list using `as_list()`.
[^11]: In the process of writing this, I discovered that using XPath made this \~5x faster than an equivalent function in R.
```{r}
#| label: singular-nodes
#| code-fold: show
# Function to select singular child nodes from XML node
xml_singular_nodes = function(xml_node){
# Return child nodes of current node
child_nodes = xml_children(xml_node)
# Select child nodes with 0 children
zero_length_child_nodes = child_nodes[xml_length(child_nodes) == 0]
# Keep the nodes which are not empty strings
keep(zero_length_child_nodes, ~(xml_text(.) != ""))
}
```
```{r}
#| label: bench-node-selection
# Using XPath:
# singular_nodes1 = xml_find_all(bill_xml,
# "//bill/*[count(./*) = 0 and not(string-length(.) = 0)]")
# Using R function:
# singular_nodes2 = xml_singular_nodes(bill_node)
# Check they are the same
# all.equal(singular_nodes1, singular_nodes2)
# benchmark the two different ways of selecting nodes
microbenchmark::microbenchmark(
xml_singular_nodes = xml_singular_nodes(bill_node),
xml_find_all = xml_find_all(bill_xml, "//bill/*[count(./*) = 0 and not(string-length(.) = 0)]")
)
```
```{r}
#| label: singular-nodes-list
#| code-summary: "Select non-empty child-less elements using XPath"
#| code-fold: show
singular_nodes = xml_find_all(bill_xml,
# //bill/* selects the child nodes in a bill
"//bill/*[count(./*) = 0 and not(string-length(.) = 0)]")
(singular_list = as_list(singular_nodes)) %>%
glimpse()
```
Now that we have the data we want in a list, we can flatten each element in the list into a **d**ata **f**rame **c**olumn using purrr's `flatten_dfc()`. However, notice the elements in a `singular_list` are not named. If we flatten this list to a data frame, our columns also won't have any column names. `xml_name()` creates a character vector of the nodes' names. When used in combination with `setNames()` we can create a named list. Just make sure `singular_list` and `singular_nodes` are the same length and in the same order.
```{r}
#| label: singular-nodes-list-named
#| code-fold: show
#| code-summary: "Name the list elements"
singular_list_named = setNames(singular_list,
xml_name(singular_nodes))
```
When flattened, a named list becomes a data frame with column names in tow.
```{r}
#| label: flatten-singular-nodes
#| code-summary: "Flatten singular nodes into data frame"
(bill_df = flatten_dfc(singular_list_named)) %>%
glimpse()
```
With the singular nodes transformed, we have the base of our data frame. The next step is to choose a nested element, define its structure in the output, and transform the XML accordingly.
```{r}
#| label: singular-nodes-rt
#| column: screen-inset
#| echo: false
#| eval: false
reactable(bill_df,
theme = moke_rt(),
columns = list(
"constitutional_authority_statement_text" = colDef(show=F),
"create_date" = colDef(format = colFormat(date = T), width = 100),
"update_date" = colDef(format = colFormat(date = T), width = 100),
"introduced_date" = colDef(format = colFormat(date = T), width = 120),
"congress" = colDef(width = 90),
"origin_chamber" = colDef(show=F),
"version" = colDef(show=F, width = 80),
"bill_number" = colDef(width = 105),
"bill_type" = colDef(width = 85),
"title" = colDef(minWidth = 180)
))
```
# The more complicated (and realistic) case {#sec-nested-data}
In congressional XML data, the actions and votes on a bill are stored in containers. To collect this information and keep our dataset at the bill-level, we can transform each container into data frames using the same steps as before and then assign them as a list columns in our `bill_df` data frame.
## Parse a single element
Actions are stored in a container node where each `<item>` represents a congressional action taken for a bill, such as being introduced, sent to a committee, debated on the floor, etc. Just as before, we use `as_list()` to convert the `{xml_nodeset}` to a list.
```{r}
#| label: actions-list
#| code-fold: show
#| code-summary: "Select all action elements and convert them to a list"
#| attr-output: '.details summary="Structure of actions list"'
actions_xml = xml_find_all(bill_node, "actions/item")
actions_list = as_list(actions_xml)
# Look at first action
glimpse(actions_list[[1]], width = 50)
```
For an individual action, we have the date, a list of committees related to the action, some administrative information, the text and type of action. To deal with this, we can write a function (or set of functions) like the ones below to transform actions from a list to a data frame with their own nested elements.
```{r}
#| label: actions-functions
#| code-fold: true
#| code-summary: "View helper functions"
#' Helper function: flatten_dfc_rename
#' Flatten a list into a data frame and apply a prefix to the column names
#'
#' @param list_to_flatten A list to flatten
#' @param name_prefix A string to apply as a prefix to columns
#'
#' @return A data frame of columns that begin with the prefix
flatten_dfc_rename = function(list_to_flatten,
name_prefix = "prefix"){
rename_with(
.data = flatten_dfc(list_to_flatten),
.fn = ~str_c(name_prefix, "_", .),
# Exclude columns which already start with the prefix
.cols = -starts_with(name_prefix)
)
}
#' Function: parse_action
#' Parse actions from list to data frame
#'
#' Separately flatten and rename source system and committee elements,
#' then flatten and rename remaining elements.
#' Committee elements are flattened into a nested data frame
#' because an action can relate to multiple committees.
#'
#' @param action An action element in list form
#'
#' @return A data frame with a single row containing action information
parse_action = function(action){
action %>%
# Flatten + rename sourceSystem elements
map_at("sourceSystem", ~flatten_dfc_rename(.x, "source")) %>%
# Flatten + rename committees
map_at("committees", function(committee){
# using map_at and map_dfr to create a data frame row for each committee
map_dfr(committee, ~flatten_dfc_rename(.x, "committee"))
}) %>%
# Flatten object to data frame
flatten_dfc_rename(., "action") %>%
# Lastly, clean the names
janitor::clean_names()
}
```
```{r}
#| label: benchmark-parse-action
#| eval: false
#| include: false
# Test modify_at vs map_at
parse_action2 = function(action){
action %>%
# Flatten+rename sourceSystem elements
modify_at("sourceSystem", ~flatten_dfc_rename(.x, "source")) %>%
# Flatten+rename committees
modify_at("committees", function(committee){
map_dfr(committee, ~flatten_dfc_rename(.x, "committee"))
}) %>%
# Flatten object to data frame
flatten_dfc_rename(., "action") %>%
janitor::clean_names()
}
# Parse the first action
parse_action2(actions_list[[1]])
# What about speed?
microbenchmark::microbenchmark(
parse_action = parse_action(actions_list[[1]]),
parse_action2 = parse_action2(actions_list[[1]]), times = 1000
)
```
## Apply it to all elements
Using the `{purrr}` library's `map_dfr()`, we apply my `parse_action()` function to each element of our actions list and combine the results into **d**ata **f**rame **r**ows.
```{r}
#| label: parse-actions
#| code-summary: "Parse actions and combine into data frame"
(actions_df = map_dfr(actions_list, parse_action)) %>%
glimpse()
```
When you plan to combine rows into a data frame or `unnest()` the data in the future, it is useful to be explicit about the expected data types of each column. This can be done using `type_convert()`.
```{r}
#| label: actions-type-convert
#| code-fold: show
#| code-summary: "Specify column types"
# Specify column data types
actions_col_types = cols(
action_date = col_date(),
action_time = col_time(),
action_committee_systemCode = col_character(),
action_committee_name = col_character(),
action_source_code = col_character(),
action_source_name = col_character(),
action_text = col_character(),
action_type = col_character(),
action_code = col_character()
)
# Convert column types
actions_df = type_convert(actions_df,
col_types = actions_col_types)
```
Because we are only adding a single row of data to `bill_df` we can encase the actions in a list and use dollar assignment. If you were adding multiple rows of data, you would need to make sure that you are adding a column of the same length as `bill_df`.
```{r}
#| label: assign-actions-to-df
#| code-fold: show
#| code-summary: "Assign actions as list column in our data frame"
bill_df$actions = list(actions_df)
```
```{r}
#| eval: false
#| include: false
#| echo: false
bill_actions = xml_find_all(actions_node, "item")
bill_action_counts = as_list(xml_find_all(actions_node, "./*[not(self::item)]")) %>%
map_dfc(flatten_dfc) %>%
rename_with(.cols = everything(), ~str_c("actions_", .)) %>%
pivot_longer(everything(), names_to = "action", names_prefix = "actions_", values_to = "count")
# Coerce nodes to list
actions_df = as_list(bill_actions) %>%
map_dfr(parse_action) %>%
type_convert(col_types = col_specs$actions)
bill_df$actions = list(actions_df)
bill_df$action_counts = list(type_convert(bill_action_counts,
col_types = cols(action = col_character(), count = col_integer())))
```
# When they put XML in your XML
So we've dealt with cases with multiple child elements, but XML allows for all sorts of weird and wonderful storage options. The congressional data includes information about votes taken on a bill, the total vote count, and the votes of individual legislators. However, only the high level information is included in the `<recordedVotes>` element. the more detailed information on individual legislators' votes are contained in an XML file referenced by the `<url>` element as you can see below.[^12]
[^12]: {width="133"}
```{r}
#| label: vote-nodes-contents
#| code-summary: "Assign actions as list column in our data frame"
# Find recordedVote nodes inside recordedVote container
bill_recorded_vote_nodes = xml_find_all(
bill_node,
"recordedVotes/recordedVote"
)
# Look at the nodeset contents
xml_contents(bill_recorded_vote_nodes)
```
Before we dive one level deeper, we can convert the top-level nodes to a list and flatten it into columns to start building our vote-level data frame.
```{r}
#| label: vote-nodes-df
#| code-summary: "Convert `<recordedVotes>` to a list and flatten into data frame"
# Coerce nodes to list
recorded_votes_list = as_list(bill_recorded_vote_nodes)
(recorded_votes_df = map_dfr(recorded_votes_list, flatten_dfc)) %>%
glimpse()
```
::: {.callout-note appearance="simple"}
## Note
We use `flatten_dfc()` to flatten each vote element into a data frame as columns and then `map_dfr()` rolls them all up as rows into a single data frame.
:::
Now to get the vote roll XML file we'll use `read_xml()` . There are two main nodes - `<vote-metadata>` and `<vote-data>`. One node contains the aggregated vote information, while `<vote-data>` contains the legislator-level vote data. We'll focus on `<vote-data>` here.
```{r}
#| label: vote-roll-xml
#| code-summary: "Read vote roll data and convert to list"
vote_roll_xml = read_xml(recorded_votes_df$url)
# Find all vote-data elements
vote_data = xml_find_all(vote_roll_xml, "vote-data")
# Find all recorded-vote elements
vote_legislators = xml_find_all(vote_data, "recorded-vote")
# Convert to list and show first element
(legislators_list = as_list(vote_legislators))[1] %>%
glimpse()
```
If we use the same combination of `as_list()` , `map_dfr()`, and `flatten_dfc()` we can flatten the XML into one row per legislator but we lose all the attributes.
```{r}
#| label: vote-roll-data-flat
#| code-summary: "Convert list of votes to data frame and LOSE attributes"
(vote_roll_flattened = as_list(vote_legislators) %>%
map_dfr(flatten_dfc))
```
To extract the attributes before we flatten the data, we can apply `map_at()` to each legislator element and extract the attributes from each while retaining the value in `vote`.[^13]
[^13]: We do lose the top-level element in the legislator list. This does not matter in this case because the name is repeated in the attributes, but be aware of this in case you want to combine values in a list with its attributes.
```{r}
#| label: extract-attr
#| code-summary: "Extract attributes to list elements"
(legislator_attr = legislators_list %>%
# Modify one level deeper using map_at to target legislator elements
map(map_at, "legislator", attributes))[[1]] %>%
glimpse()
```
```{r}
#| label: vote-roll-leg-df
#| code-summary: "Convert list of votes to data frame and KEEP attributes"
(legislator_vote_df = map_dfr(legislator_attr, flatten_dfc) %>%
# Use {janitor} to clean up the column names
janitor::clean_names()) %>%
glimpse()
```
It can often feel like you're getting lost in a list of lists, but with some experimentation you'll be able to find your way back to the surface.
Now we have a table of legislator voting data, and the last thing to take care of is the `<vote-metadata>`. Everything other than the `<vote-totals>` element is singular so we can get that out of the way the same way as before:
```{r}
#| label: vote-roll-metadata-xml
#| code-summary: "View `<vote-metadata>` element"
(vote_metadata = xml_find_all(vote_roll_xml, "vote-metadata")) %>%
xml_contents()
```
Because we have the individual votes, we don't need to keep the `<vote-totals>` element, but it is useful to note because it could serve as a check for data consistency in the future. Instead, we'll just keep the singular nodes in `<vote-metadata>`
```{r}
#| label: combine-votes-df
#| code-summary: "Convert singular vote nodes to data frame"
vote_singular_nodes = xml_singular_nodes(vote_metadata)
(vote_df = as_list(vote_singular_nodes) %>%
# as_list() doesn't retain element names so we set names ourselves
setNames(xml_name(vote_singular_nodes)) %>%
flatten_dfc() %>%
janitor::clean_names()) %>%
glimpse()
```
Once we have our nodeset (which at last are all singular), we use the same listing, mapping, and flattening...or *lappening* as absolutely no one calls it.
Now we can put all the pieces together and nest each element in its parent element until we've gotten back to the bill-level i.e. the vote roll data frame goes in the votes data frame, which goes in the bill data frame.
```{r}
#| label: vote-roll-build
#| code-summary: "Nest each element in their parent data frame"
# Nest legislator votes in the vote roll data frame
vote_roll_df = vote_df %>%
mutate(legislator_votes = list(legislator_vote_df)) %>%
janitor::clean_names()
# Nest vote rolls in the votes data frame
(recorded_votes_df = recorded_votes_df %>%
mutate(vote = list(vote_roll_df))) %>%
glimpse()
# Nest votes in the bill data frame
bill_df$votes = list(recorded_votes_df)
```
Now we have the bill-level characteristics with action and vote information nested in list columns. We can keep the data at the bill level or `unnest()` the elements we want to use.
::: panel-tabset
## Bill
```{r}
glimpse(bill_df)
```
## Actions
```{r}
select(bill_df, billType, billNumber, actions) %>%
unnest(actions) %>%
glimpse()
```
## Votes (Vote-level)
```{r}
# Select bill ID and votes
select(bill_df, billType, billNumber, votes) %>%
unnest(votes) %>%
# Unnest further vote info and add prefix to cols
unnest(vote, names_sep = "_") %>%
glimpse()
```
## Votes (Legislator-level)
```{r}
# Select bill ID and votes
select(bill_df, billType, billNumber, votes) %>%
unnest(votes) %>%
# Select bill ID, roll number, and vote list col
select(billType:rollNumber, vote) %>%
unnest(vote) %>%
# Select bill ID, roll number, and legislators list col
select(billType:rollNumber, legislator_votes) %>%
unnest(legislator_votes) %>%
glimpse()
```
:::
I'll stop there for brevity's sake, but you can find the code for extracting the full XML file [here](https://github.com/MokeEire/my-reps/blob/master/R/parsing_functions.R)[^14]. Please reach out with any feedback, leave a comment below, or share this with someone who's struggling with XML. There may even be a reward for anyone who finds mistakes in my code 🤷.
[^14]: Ctrl/Cmd+F: `extract_bill_status`
```{r}
#| eval: false
#| include: false
parse_vote_roll = function(vote, logger, bill_type, bill_num){
tryCatch(
{
vote_xml = read_xml(vote, options = "RECOVER")
vote_data = xml_find_all(vote_xml, "vote-data")
vote_roll_children = xml_children(vote_roll_xml)
vote_data = xml_find_all(vote_roll_xml, "vote-data")
# Vote data
vote_legislators = vote_data %>%
xml_find_all("recorded-vote")
legislators_list = as_list(vote_legislators)
legislator_vote_df = legislators_list %>%
# Modify one level deeper using map_at to target legislator elements
map(map_at, "legislator", attributes) %>%
map_dfr(flatten_dfc)
# Vote metadata
vote_metadata = xml_find_all(vote_roll_xml, "vote-metadata")
vote_singular_nodes = xml_singular_nodes(vote_metadata)
(vote_df = as_list(vote_singular_nodes) %>%
# as_list() doesn't retain element names so we set names ourselves
setNames(xml_name(vote_singular_nodes)) %>%
flatten_dfc())
# Vote totals
vote_totals = xml_find_all(vote_metadata, "vote-totals")
vote_totals_by_party = xml_find_all(vote_totals, "totals-by-party")
party_vote_totals_df = vote_totals_by_party %>%
as_list() %>%
map_dfr(flatten_dfc) %>%
type_convert()
vote_roll_df = vote_df %>%
mutate(legislator_votes = list(legislator_vote_df),
party_votes = list(party_vote_totals_df)) %>%
janitor::clean_names()
vote_list = as_list(vote_data)
flatten_dfr(vote_list) %>%
unnest(everything())
},
error=function(cond) {
log_info(logger,
bill_type = bill_type,
bill_num = bill_num,
"ERROR: Vote roll could not be parsed")
# Choose a return value in case of error
return(tibble())
}
)
}
```
# Other helpful articles
Here are some of the helpful articles I came across in the course of writing this:
- [From XML to Excel for Data Analysis](https://towardsdatascience.com/from-xml-to-excel-for-data-analysis-ac0c0c765b7d "Introduction to Processing XML In Python")
- [Reading XML files in R](https://medium.com/geekculture/reading-xml-files-in-r-3122c3a2a8d9)
- [Converting nested XML to dataframe in R](https://urbandatapalette.com/post/2021-03-xml-dataframe-r/)
- [Parse and process XML (and HTML) with xml2](https://www.rstudio.com/blog/xml2/)
::: {.callout-note collapse="true"}
## Session Info
Version information about R, OS, and loaded packages.
```{r}
#| label: session-info
#| echo: false
sessioninfo::session_info("loaded")
```
:::