Open Iterations Improve COVID-19 Data Quality
Posted by Jack Bastian on March 26th, 2021
By: Jack Bastian, Data Engineer, HHS Protect, Office of the Chief Data Officer (OCDO), U.S. Department of Health & Human Services (HHS); Greg Singleton, Director, Health Sector Cybersecurity Coordination Center, Office of the Chief Information Officer (OCIO), HHS; Kristen Honey, Chief Data Scientist and Senior Advisor to Assistant Secretary for Health (ASH), HHS
Summary: Data for COVID-19 is being monitored and improved constantly. In this blog, learn about the techniques that HHS uses to identify issues with input from the American public.
The COVID-19 pandemic has created an unprecedented need for rapid, high-volume, high-quality healthcare data collection and analysis. Data is a strategic asset that businesses have leveraged for decades to drive value. Government data is taxpayer funded and a public good that we are leveraging to help our country recover from the COVID-19 pandemic. Through open and interoperable data, we are following the Executive Order on Ensuring a Data-Driven Response to COVID-19 and Future High-Consequence Public Health Threats to build trust in government through scientific integrity and evidence-based policies. We are also following the National Strategy for the COVID-19 Response and Pandemic Preparedness.
As of March 2021, the U.S. Department of Health and Human Services (HHS) is collecting data from roughly 6,700 hospitals across the United States every day. Each hospital reports about 70 unique data fields, which means HHS receives close to 500,000 data points every single day. This wealth of data makes up one of the most comprehensive and widespread real-time hospital disease-monitoring systems ever created.
Importantly, HHS is not only focused on gathering essential hospital data; we are also constantly improving the quality of that data. One of the many ways HHS addresses data quality is through active engagement with researchers and the public. Thanks to feedback from the public, HHS unlocked over 200 datasets related to COVID-19 and the U.S. pandemic response through HHS Protect Public and HealthData.gov, the home of HHS open data. Recent examples of the department’s efforts to share more data publicly include the release of the COVID-19 Community Profile Report, which provides a comprehensive view of testing and hospitalization trends. In addition, HHS started publishing weekly COVID-19 hospital data for each U.S. hospital, including Frequently Asked Questions (FAQs) and a data dictionary. Making more data publicly available means anyone can perform their own analyses and potentially uncover data quality issues.
How can you help digital analytics drive improvement?
Every American can help the U.S. pandemic response. Your near-real-time input on COVID-19 datasets is extremely valuable, as it is helping HHS to identify — and quickly resolve — information gaps and quality issues.
Currently, teams at HHS employ a variety of techniques for identifying and monitoring erroneous values coming into our hospital data reporting system. For example, a team at HHS applies programmatic logic to incoming hospital data that flags dramatic one-day increases in certain priority data fields. The screenshot at the end of this post shows a dashboard that gathers these alert flags and displays a time series graph of the data field where a specific alert was triggered. This time series graph gives context to flagged data points (highlighted in red) and informs decisions that remedy data quality issues.
While this dashboard often catches the most glaring spikes and increases, there can be some anomalous values that fall under the radar.
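To make the idea of flagging one-day increases concrete, here is a minimal sketch in Python. It is an illustration only, not the logic HHS runs: the `Report` type, the `flag_one_day_spikes` function, the `ratio` and `min_jump` thresholds, and all of the sample values are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Report:
    day: str    # label for the reporting day (hypothetical)
    value: int  # reported value for one data field, e.g. total beds


def flag_one_day_spikes(series, ratio=3.0, min_jump=50):
    """Return the reports whose value jumped sharply versus the previous day.

    A report is flagged only if it rose by at least `min_jump` AND is at
    least `ratio` times the previous day's value. Both thresholds are
    illustrative assumptions, not HHS settings.
    """
    flagged = []
    for prev, curr in zip(series, series[1:]):
        jump = curr.value - prev.value
        # max(prev.value, 1) avoids treating a previous value of 0 specially
        if jump >= min_jump and curr.value >= ratio * max(prev.value, 1):
            flagged.append(curr)
    return flagged


# Made-up series with one glaring one-day spike.
spiky = [
    Report("day 1", 120),
    Report("day 2", 124),
    Report("day 3", 119),
    Report("day 4", 480),
    Report("day 5", 122),
]

# Made-up series that drifts upward without any single dramatic jump.
gradual = [
    Report("day 1", 120),
    Report("day 2", 150),
    Report("day 3", 185),
    Report("day 4", 230),
]

flagged_spiky = flag_one_day_spikes(spiky)
flagged_gradual = flag_one_day_spikes(gradual)

for report in flagged_spiky:
    print("flagged:", report.day, report.value)
```

In this sketch the jump on day 4 of the first series is flagged, while the second series nearly doubles over four days without triggering anything. That is the kind of anomaly a one-day threshold can miss, and where outside review of the published data helps.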
What are specific examples of ways that the public has helped?
Example 1 – Nevada Total Beds: In late December 2020, a Nevada hospital with between 150 and 200 beds accidentally entered a value of 7,518 total beds. This error resulted in a large spike in Nevada’s overall bed count, which the public immediately flagged for HHS review. Our data quality teams took note of the error and reached out to the facility to validate the correct value. The facility acknowledged the error, and the value was corrected within one business day.
Example 2 – Maryland and New York Hospitalizations: Also in December 2020, COVID-19 hospitalizations were tracking lower than expected in New York and Maryland. Members of the public brought this issue to the attention of HHS, and data quality teams traced it to 55 hospitals across both states. It turned out that these hospitals needed to be re-mapped to different unique identifiers in the HHS hospital reporting system. Once this change was made, the data became more accurate, reflecting a 14% increase in reported COVID-19 hospitalizations in New York and a 3% increase in Maryland.
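The effect of an identifier re-mapping on state totals can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of the HHS reporting system: it assumes that reports filed under an unrecognized identifier were left out of state totals, and the function names, identifiers, and counts below are all hypothetical.

```python
from collections import defaultdict


def state_totals(reports, id_remap, facility_state):
    """Sum hospitalizations per state after re-mapping reporting IDs.

    reports        -- list of (reported_id, hospitalized_count) pairs
    id_remap       -- dict from a reported ID to the correct facility ID
    facility_state -- dict from a known facility ID to its state
    Reports that still cannot be matched to a known facility are skipped
    (an assumption made for this sketch).
    """
    totals = defaultdict(int)
    for reported_id, hospitalized in reports:
        facility_id = id_remap.get(reported_id, reported_id)
        state = facility_state.get(facility_id)
        if state is not None:
            totals[state] += hospitalized
    return dict(totals)


def percent_change(before, after):
    """Percent change from `before` to `after`."""
    return 100.0 * (after - before) / before


# Hypothetical facilities and reports; "OLD-2" is an ID the system cannot match.
facility_state = {"H1": "NY", "H2": "NY", "H3": "MD"}
reports = [("H1", 200), ("OLD-2", 50), ("H3", 50)]

before = state_totals(reports, {}, facility_state)
after = state_totals(reports, {"OLD-2": "H2"}, facility_state)
ny_change = percent_change(before["NY"], after["NY"])

print("before:", before)
print("after:", after)
print("NY change (%):", ny_change)
```

With these made-up numbers, the New York total rises once the unmatched report is mapped to the right facility, while Maryland is unchanged. The hospitals were reporting all along; only the mapping determined whether their numbers were counted.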
Example 3 – Feedback from Researchers on Upcoming Data Releases: Over the last few months, select researchers and journalists have given invaluable feedback on the content, structure, and formatting of public datasets before they are published. Through these interactions, HHS is able to better understand the needs of its data users and make improvements that allow for more efficient and effective analysis and visualization. In one recent example, a group of researchers gave feedback on the facility-level hospital data, focusing on the clarity of the variable names and on the documentation of which facilities are included in the dataset. By implementing these suggestions, HHS made this dataset much easier to understand and more suitable for analysts of all levels.
What do transparency and open data accomplish?
The pandemic has touched every aspect of American life, and we all have a vested interest in today’s data-driven pandemic response. One critical role of HHS is to publish high-quality, accessible, and machine-readable COVID-19 datasets in a consistent and timely manner. By opening datasets to the public, HHS is democratizing access to information so that everyone can help to:
- Reveal disease patterns and trends, including local hotspots or outbreaks
- Provide insights into the optimal distribution of limited resources, based on local need for diagnostic testing supplies, vaccines, personal protective equipment (PPE), medical supplies, and trained staff
- Inform decisions about lockdowns, closures, and/or re-openings, so that public health safety is optimally balanced with other societal needs including economic
- Improve science communication through data visualization
- Share methods with transparency, which can accelerate scientific discoveries and identify data gaps or opportunities to improve the quantity and quality of
We welcome your continued feedback and ongoing collaboration and, together, we will continue to steadily improve COVID-19 data quality and reporting consistency over time.
Disclaimer: Please visit CDC’s website on COVID-19 for the most up-to-date information and COVID-19 guidance.