Taking our own medicine: Using Pillbox Open Data

The Department of Health and Human Services (HHS) is committed to making health data more accessible and usable.  Oftentimes, the gap between open data and usable data is very wide.  I'm the project manager for Pillbox at the National Library of Medicine at the National Institutes of Health.  Pillbox's goal is a free and easy-to-use dataset where any pill you can hold in your hand has a single, accurate record, is connected to other data sets, and has a high-resolution image of the pill.  You might be thinking, "Wait, this doesn't already exist?!"  To create Pillbox we have to be both a data consumer, pulling from sources like the Food and Drug Administration (FDA), and a data owner, restructuring, combining, and even correcting data to create a unique resource.  As for the images, we had to photograph the pills ourselves.

I want to share with you two stories from Pillbox that illustrate the benefits of being both a consumer and an owner of data.  For this first post, I'll talk about how we worked with two difficult-to-use data sources, adding our knowledge of how to use them, to build something new and valuable.  We also created open source code to share our knowledge of how to build this system and to serve as a platform for digging even deeper into those sources.  In the second post, I'll talk about how we partnered with FDA and the Department of Veterans Affairs (VA), which has one of the largest pharmacy systems in the world, to take pictures of pills.  In particular, the solution we found with the VA to help them create and share thousands of pill images was unexpectedly simple and effective.  It's helping us achieve one of the critical goals of the project and is also supporting the care of our nation's veterans.

The primary data source for Pillbox is what most of us know as the drug labels, information drug companies are required to send to FDA about their products.  Part of FDA's mission is collecting this data.  Because it's a regulatory process, the data is structured in a way that isn't based on individual pills.  The system is so complex that very few groups have the multiple areas of expertise necessary to actually use this data.  The other source is a vocabulary of normalized names for clinical drugs.  The expertise required to use it is balanced by its value in connecting Pillbox to other drug information systems.  If you're wondering what powers the drug information systems at hospitals, pharmacies, websites, and many apps, the answer is expensive third-party resources.  That cost puts them out of reach of most small businesses, startups, researchers, and even public health departments.  Working directly with the source data is like searching for a needle in a haystack made of haystacks.  There are other barriers to using the source data: errors in the data supplied by the drug companies and inconsistencies in the data formatting.  Also, when we started Pillbox, there were almost no pill images in the labels.

Substantial effort went into creating a process to parse the drug labels, identify errors, and complement the data with other high-value drug information.  We spent months working with FDA, pharmacists, regulatory experts, and computer scientists just figuring out how to use our own HHS data.  The dataset we created now powers the Pillbox website and the application programming interface (API) that developers use to create apps that identify a drug based on a picture taken with a phone or a person's description of what the pills look like.  It is also available for download.  The problem was that developers were dependent on us to expand the project and provide them with updated data, and each update took weeks to create as tasks were handed off between team members.  Through HHS Ignite, an innovation program from the HHS IdeaLab, we created a set of open source programs that perform the same data process we developed for Pillbox, running 20 times faster than before and saving staff time and money.  Because the code is open source, developers can not only run the process themselves, they can also see how it works and improve it.  This code went public on GitHub in late May.  We've now started a new project to expand this code to automatically detect errors in the data and report them to FDA and the pharmaceutical industry.

As part of our HHS Ignite project, we've brought all of our developer resources into a new site, Pillbox for Developers, as an invitation to participate and collaborate.  The goal here is to open the project and transfer knowledge: what we're doing and how we're doing it.  This site launched the same day as the code.  The code for this site is also open source and posted on GitHub.  The Department of Labor, which has one of the most advanced developer engagement programs in the federal government, used this code to create a new developer portal.  Other government developer groups are also using this code to support their websites.

Bridging the gap from open data to data that can be easily used to build innovative solutions to health challenges takes substantial effort.  The secret sauce to open data delivering on its promise of being useful and creating impact is that it often has to 1) be restructured in a way that meets the needs of downstream users, 2) be connected to other datasets to create interoperability, and 3) address data quality and integrity issues.  For Pillbox, all three required working closely with FDA.  There's a fourth component: creating new data.  In the next post I'll talk about how we partnered with FDA and VA to create the dataset everyone thought already existed: images of the pills.