Gartner Blog Network

Send the Questions to the Data

by Wes Rishel  |  July 7, 2011  |  10 Comments

One of the insights of the PCAST report was the utility of sending the question to the data vs. centralizing the data to answer questions. It may take a long time to achieve this by defining a new universal data language and using digital content management to manage consumer privacy preferences, but what happens if we uncouple the basic “question to the data” notion from the more elaborate vision of the Report?

The utility to public health and some kinds of research could be substantial and it may be possible to get started with an intense, voluntary effort similar to The Direct Project.

Public health covers a lot of territory. Some concepts for gathering data are enshrined in law and regulation and, therefore, are stable. “Reportable diseases” are typically enumerated and it is conceivable to build filters into lab systems to “send the data to the question (asker).”

But what of the requirements for situational awareness during an epidemic? Would public health officials want to ask about presenting symptoms, ED diagnoses, lab or micro results, pathological findings? As their understanding of the disease process grows, how often would they want to change their inquiries to add data or refine the selection criteria? Pretty darn frequently, I’d wager.

How might it work?
In general, the sources of clinical data (EHRs, stand-alone labs, ePrescribing networks, etc.) would “sign up” to receive queries from public health departments, researchers or other carefully identified requesters of information. It is critical that the identities of both ends of the transaction are established and mutually authenticated each time data is transmitted. The institutional askers of questions may need to be identified at a higher assurance level than the sources.

Once a relationship has been established, the sources of data might receive requests for data from time to time, such as “tell me about encounters you have had with high fevers and vomiting as a ratio to the total number of encounters you have had for the same period.” The request would also include (or imply) information on where to send the data and how to associate it with a specific request.
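
To make the shape of such a request concrete, here is a minimal sketch of how a source might evaluate it locally and return only counts. Everything in it is an assumption for illustration: the names `AggregateQuery`, `Encounter` and `answer`, the field names, the symptom labels and the reply address are hypothetical and are not drawn from any existing standard or product.

```python
from dataclasses import dataclass

# Hypothetical structures for illustration only; not part of any standard.

@dataclass(frozen=True)
class Encounter:
    date: str              # ISO date string, e.g. "2011-07-01"
    symptoms: frozenset    # normalized symptom labels recorded for the encounter

@dataclass(frozen=True)
class AggregateQuery:
    request_id: str        # lets the requester tie an answer to its question
    reply_to: str          # where the source should send the answer
    required_symptoms: frozenset
    start: str             # first day of the period (inclusive)
    end: str               # last day of the period (inclusive)

def answer(query, encounters):
    """Evaluate the query at the source and return counts only, never patient rows."""
    # ISO date strings compare correctly as plain strings.
    in_period = [e for e in encounters if query.start <= e.date <= query.end]
    matching = [e for e in in_period if query.required_symptoms <= e.symptoms]
    return {
        "request_id": query.request_id,
        "reply_to": query.reply_to,
        "matching": len(matching),
        "total": len(in_period),
    }

query = AggregateQuery(
    request_id="q-001",
    reply_to="queries@health-dept.example",
    required_symptoms=frozenset({"high fever", "vomiting"}),
    start="2011-07-01",
    end="2011-07-07",
)
encounters = [
    Encounter("2011-07-02", frozenset({"high fever", "vomiting"})),
    Encounter("2011-07-03", frozenset({"cough"})),
    Encounter("2011-07-05", frozenset({"high fever", "vomiting", "rash"})),
    Encounter("2011-06-20", frozenset({"high fever", "vomiting"})),  # outside the period
]
result = answer(query, encounters)
```

The requester receives the numerator and denominator separately rather than a precomputed ratio, which keeps the option of pooling answers from several sources.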

Requesting aggregates rather than individual data is a two-edged sword. On the one hand, it avoids policy issues around sending individual health information. On the other hand, there is no way for the recipient to weed out duplicates from multiple sources, so the statistical inaccuracy will rule out fine-grained measurements. Furthermore, because the statistics are generated at the source, fine-grained pattern discovery based on sophisticated analyses of millions of records does not seem possible. This is an area where the perfect should not be an effective enemy of the good. Any pragmatic scheme to get even crude data into the hands of public health officials is better than a more precise scheme that would take years to get going.
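
The duplicate problem is easy to see in a small sketch of what the recipient can actually do with aggregate answers. The `pool` function, the source names and the counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
def pool(responses):
    """Combine per-source counts into one crude overall ratio.

    The recipient sees only counts, so a patient seen at two sources is
    counted twice and cannot be removed here.
    """
    matching = sum(r["matching"] for r in responses)
    total = sum(r["total"] for r in responses)
    ratio = matching / total if total else None
    return {"matching": matching, "total": total, "ratio": ratio}

# Hypothetical answers from two sources for the same request and period.
responses = [
    {"source": "hospital-a", "matching": 12, "total": 400},
    {"source": "clinic-b", "matching": 3, "total": 100},
]
pooled = pool(responses)
```

Nothing in the inputs identifies an individual, which is the policy advantage, and for the same reason nothing lets the recipient detect that one of the clinic's encounters was also seen at the hospital.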

In the un-PCAST world that we occupy today, it is unlikely that the data-source organizations would respond automatically. Some officer of the organization will likely approve each individual request. However, it would be ideal if that officer didn’t have to schedule the time of a programmer to determine if the request is feasible and, if so, code it up. In other words, it would be ideal if fulfilling the request were automatic and there were a workflow for the officer of the source organization to learn about requests and decide whether to authorize the release.
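
One way to picture that split between automatic fulfillment and human authorization is a small queue at the source. This is a sketch under stated assumptions, not a description of any existing system: `RequestQueue`, `Status` and the `compute` callback are hypothetical names.

```python
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    RELEASED = "released"
    DECLINED = "declined"

class RequestQueue:
    """Holds incoming queries until an officer of the source organization decides."""

    def __init__(self, compute):
        self._compute = compute   # answers a query automatically, no programmer needed
        self._items = {}          # request_id -> [query, status]

    def receive(self, request_id, query):
        self._items[request_id] = [query, Status.PENDING]

    def pending(self):
        """Request ids still waiting for the officer's decision."""
        return [rid for rid, (_, status) in self._items.items()
                if status is Status.PENDING]

    def decide(self, request_id, approve):
        """Record the decision; compute and return the answer only if approved."""
        item = self._items[request_id]
        if item[1] is not Status.PENDING:
            raise ValueError("request already decided")
        if not approve:
            item[1] = Status.DECLINED
            return None
        item[1] = Status.RELEASED
        return self._compute(item[0])

# Stand-in for the real query engine, which is assumed to exist at the source.
queue = RequestQueue(compute=lambda q: {"query": q, "matching": 0, "total": 0})
queue.receive("q-001", "high fever and vomiting")
queue.receive("q-002", "rash")
released = queue.decide("q-001", approve=True)
declined = queue.decide("q-002", approve=False)
```

The point of the design is that the officer's only job is the yes-or-no decision; nothing leaves the organization until that decision is recorded.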

“Same old stuff” or a new day?
We all know the reasons that this can’t work. The policy issues are difficult but the biggest issue is that EHRs and other clinical data systems have different internal data schemas, so it is hard to frame a request and know that it is compatible with source systems and it is hard to fulfill the request without custom programming.

As that legendary scientist and explorer, Jean-Luc Picard, once said, “things are only impossible until they’re not.” It’s time to decide if we are approaching a cusp of possibility with regard to this issue. We have new resources (more EHRs, a fundamental secure communications infrastructure and standard coding systems in support of meaningful use requirements).

It is time to assemble a group of users and vendors willing to look at this issue intensely, follow the principles of building on available standards and, if pragmatic solutions exist, support them with an approach similar to the Direct Project, which is, “rough consensus, open source code implementations, final specifications.”

Category: healthcare-providers  interoperability  vertical-industries  

Tags: direct-project  distributed-query  ehr  health-information-exchange  healthcare-interoperability  healthcare-providers  open-source  pcast-report  public-health  

Wes Rishel
VP Distinguished Analyst
12 years at Gartner
45 years IT industry

Wes Rishel is a vice president and distinguished analyst in Gartner's healthcare provider research practice. He covers electronic medical records, interoperability, health information exchanges and the underlying technologies of healthcare IT, including application integration and standards.

Thoughts on Send the Questions to the Data

  1. Trevor Kerr says:

    Wes, as a preliminary to these considerations, it would be worth looking at the recent HUS outbreak in Europe.
    How did the different systems store information about the infecting bacterium?

  2. Wes, Great ideas – we are trying to set up a cyberinfrastructure with security and privacy controls that would allow this type of thing to be done (iDASH – integrating data for anonymization, analysis, and sharing, the newest NCBC). EMR data is a major target for hosting on iDASH, as is public health data, imaging data, and biological data.

  3. Thomson Kuhn says:

    While the task and the subject matter are different, an article in the current issue of JAMIA describes the real-world difficulties involved in aligning data from different patient care systems.

    The effort required to align relatively straightforward demographics among some of the leading healthcare systems in this country was enormous.

  4. Lee Jones says:

    Wes, what you describe largely exists in NY, and it is called the Universal Public Health Node (UPHN). It basically allows the querying entity to describe the query (patients exhibiting flu-like symptoms defined as …), and the frequency it wishes to get this information. This query is sent to RHIOs or other nodes who in turn scatter the request out to all potential suppliers of information (who have opted in to be publishers), and the data sources fulfil the subscriptions on the prescribed schedule. The results roll back up and are aggregated with results from other publishers and are returned to the original requester, again, on the prescribed schedule.

    It supports both a true federated model where the question is passed to the data source, and also a centralized model where those publishers that prefer to live in the world you are eschewing of sending all data somewhere to wait for the question can in fact do just that. In the latter case, the aggregator fulfils the subscription.

    It allows you to ask for all different kinds of data from HL7 2.x streams, and also CCDs, or summaries through its analytic query (e.g. – counts of results in the query). There is a query encoding that is used which is similar to SQL paradigm in terms of specifying the query like a “where clause”. It has an option to traffic in identified or de-identified data (really pseudonymized data), and if the requester wants to identify a pseudonymized case, it can be requested and the identifiable records are sent (right now, only the State requests, so they have authorities to get identified data).

    The trick is that UPHN has a CDR and open source code to fulfil the subscription, take care of security, etc. The CDR can be populated through popular data streams (HL7 2.x, CCD, etc), so this is an option for implementation for those that don’t want to implement the subscription fulfillment natively. It at least keeps the data at the same location as the source, even if in some cases there is duplication by populating a CDR external to the native system.

    This is deployed and working in NY, so maybe we are closer to your vision than you think.

  5. Hans Buitendijk says:

    Wes, exploring distributed queries is certainly one of the ways to operationalize various suggestions to the PCAST report to walk before we run and try it out. Quite commendable and worthwhile.

    However, as we explore those opportunities we should not require or assume a need for consistency across EHR/HIT systems on their “internal data schemas”. That would be too limiting and unnecessary to achieve the underlying goals you’ve articulated.

    Rather the requirement for the EHR/HIT systems should just be to recognize the data set request in the query and provide the matching data, both using an agreed to standards-based data set/format. The EHR/HIT systems can map their internal data representations to those standards-based data sets/formats if their internal data schema is not the same as the standard interchange format.

    Any attempt to prescribe internal data schemas requiring common implementations across EHR/HIT systems seems foolhardy and I assume that is not really what you intend. Let systems determine what their best internal data schemas are to support their specific and highly variable customer needs, while still being able to transform various data subsets to standards-based data sets/formats, and vice versa, to exchange data with other systems (national, regional, or down the hall).

  6. Wes Rishel says:

    Thanks, Hans, I agree 100%, maybe even more 🙂

    The questions that this effort must address include

    (a) what is the reference model that serves as a basis for expressing detailed questions and responses?
    (b) how are criteria and results expressed using codes (part of the reference model, really)?
    (c) are there pragmatic ways to make this concept operational while minimizing the impact on vendors and self-developers of EHRs, given that they are busy keeping up with meaningful use requirements?

    Several others who have commented on this blog, including Lee Jones and Wendy Chapman, have suggested working models that presumably meet these goals. If a group is formed it will start with plenty of practical ideas.

  7. Wes Rishel says:

    Thom, good point. My sense is that in the early stages (a few years at least) this will be an issue that limits the use cases (or user stories) to situations where rough, aggregate data obtained in a timely way is adequate and important. Attempting to align all the EHRs would be much better, but in this case the “better” is the enemy of the good.

    It will be important that a group working on this include experienced epidemiologists who can predict flaws in data alignment and identify user stories where the value outweighs the difficulties.

  8. Wes Rishel says:

    Thanks, Trevor, you raise an important point. The questions are: was the bug identified in a coded fashion at all and, if so, was the coding standard?

    However, I recommend that a group that would take this on be careful NOT to ask the question in terms of the systems’ internal representations of the data. We don’t have the time or the economic clout to standardize systems internals. Even if we could, I would be against it because of the tradeoff between standardization and innovation.

    Instead we should ask the questions in terms of how queries and results can be expressed. If a query recipient can parse the query, will it be able to translate its internal representations of the data into meaningful answers? It may use ad hoc code, a vocabulary server, or even controlled natural language processing to express data in a coded form, capable of being aggregated to respond to a query.

  9. CG Chute says:

    This vision is not new, promoted in my experience by Scott Blois in the 1980’s. The scope of the challenge really subsumes all modes of secondary data use, not just public health. Best evidence discovery, outcomes research, and related translational research could be similarly promoted. Chuck Friedman had promoted his notion of the “third space” which includes the notion of federated query service. Indeed, I2B2 promised this goal, though it leaves the challenge of data normalization as an exercise to the implementer. Our SHARPn grant seeks to develop open-source data normalization pipelines including NLP, clinical phenotype (mostly disease) algorithms, and associated infrastructure that could make this more practical.

  10. Keith Boone says:

    Interesting thoughts that deserve something more than a brief response. See Where should the data go?

Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.