The need for data integrity assertion

There’s a lot of energy these days focused on data interoperability, within and across industries. Generally speaking, interoperability is a laudable and worthwhile goal, but with greater access to data from broader and more diverse sources comes a need for greater attention to establishing and maintaining data integrity. Richard Bejtlich’s January 16 post on the (highly recommended) TaoSecurity blog brought this issue into clear focus: when the recipient of a message relies on the contents of that message to make a decision or take specific action, the importance of the data’s accuracy cannot be overstated.

The problem of executing processes (or making decisions) based on information received over unreliable communication channels has long been studied in the context of preventing “Byzantine” failures. The Byzantine Generals’ Problem, from which this class of fault tolerance takes its name, concerns protecting against bad actors interfering with the transmission of messages whose content determines a course of action (in the Byzantine Generals’ Problem, the action is whether to attack an enemy target). There are many technical treatments of this scenario, in which a system may crash or otherwise behave unpredictably when the information it receives as input is erroneous. This class of errors is distinct from typical syntactic or technical errors: in a Byzantine failure, both the format and the method of delivery of the message are valid, but the content of the message is erroneous or invalid in some way. Notably, even in the mathematical solutions to the Byzantine Generals’ Problem, the best outcome that can be achieved is that all the loyal generals take the same action; neither the problem nor its solutions addresses the case where the original order (i.e., the message contents) is itself invalid.
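The intuition behind tolerating forged messages can be sketched in a few lines of code. This is only the single-round majority-voting idea, not a full Byzantine agreement protocol (the classical solutions require more than 3f participants to tolerate f traitors, plus multiple rounds of message exchange); the function name and inputs here are illustrative.

```python
from collections import Counter

def decide(messages):
    """Take the action reported by a strict majority of messengers.

    A simple majority vote tolerates fewer than n/2 forged copies of
    the same order. Note what it cannot do: if the original order was
    wrong to begin with, every loyal general faithfully agrees on a
    bad action.
    """
    tally = Counter(messages)
    action, votes = tally.most_common(1)[0]
    if votes > len(messages) / 2:
        return action
    return None  # no strict majority -- cannot decide safely

# Three loyal copies of the order outvote one forged copy.
print(decide(["attack", "attack", "forged-retreat", "attack"]))  # -> attack
```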

Most approaches to mitigating Byzantine failure provide fault tolerance through the use of multiple sources for the same information. In this model, even if some of the messages are unreliable, having sufficient redundant sources allows a system to distinguish valid messages from forged ones and make a correct decision. Whether or not multiple message sources are available, technical anti-forgery mechanisms such as digital signatures can also provide the means to validate messages, at least in terms of their origin. All of these approaches focus on ensuring that the content of a message is the same when received as when it was sent. However, even when the sender of messages is trusted by the recipient, to date there has not been much attention focused on the integrity of the data itself. This gives rise to any number of scenarios where data is received from a trusted sender with the intent of taking action based on the data, but the data sent is invalid.
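To make the distinction concrete, here is a minimal sketch of message authentication using an HMAC from the Python standard library (standing in for a full digital signature; a real deployment would more likely use asymmetric signatures rather than the shared secret assumed here). It demonstrates exactly what such mechanisms do and do not cover: tampering in transit is caught, but a message that was wrong when it was signed verifies perfectly.

```python
import hashlib
import hmac

# Shared secret between sender and recipient (an assumption for this
# sketch; asymmetric signatures avoid sharing a secret).
SECRET = b"shared-secret-key"

def sign(message: bytes) -> bytes:
    """Compute an authentication tag over the message."""
    return hmac.new(SECRET, message, hashlib.sha256).digest()

def verify(message: bytes, tag: bytes) -> bool:
    """Check the tag; compare_digest avoids timing side channels."""
    return hmac.compare_digest(sign(message), tag)

order = b"administer 5 mg"
tag = sign(order)

print(verify(order, tag))                # True: origin and transit intact
print(verify(b"administer 50 mg", tag))  # False: content altered in transit
```

Note that if the sender had signed `b"administer 50 mg"` by mistake, verification would still succeed: the signature asserts origin, not validity.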

The example of incorrect medication dosages cited in Richard Bejtlich’s recent blog post is an eye-opening illustration of the risk involved here. A more familiar example to many would be the appearance of erroneous information on an individual’s credit report. The major credit reporting agencies receive input from many sources and use the data they receive to produce a composite credit rating. The calculation of a credit score assumes the input data to be accurate; if a creditor reports incorrect data to the credit reporting agency, the credit score calculation will also be incorrect. Unfortunately for consumers, when this sort of error occurs, the burden usually falls on the individual to pursue a correction. The credit reporting agencies take no responsibility for the accuracy of the data used to produce their reports, so that responsibility would ideally rest with the companies supplying the data. What would help from a technical standpoint is some way to assert the accuracy or validity of data when it is provided. This would give the receiving entity a greater degree of confidence that calculations based on multiple inputs were in fact accurate, and would also reduce the risk of choosing the wrong course of action when using such a calculation in support of decision making.
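The credit-score scenario can be illustrated with a toy composite calculation. Real scoring models are proprietary and far more complex than an average; the sources and numbers below are entirely hypothetical. The point is only that the output silently inherits any error in the inputs, and nothing downstream reveals which input was wrong.

```python
def composite_score(reports: dict) -> float:
    """Hypothetical composite: the mean of per-source scores.

    The calculation assumes every input is accurate; a single bad
    report corrupts the result with no visible indication of error.
    """
    return sum(reports.values()) / len(reports)

accurate = {"bank_a": 720, "bank_b": 710, "card_issuer": 715}
erroneous = {**accurate, "card_issuer": 400}  # one creditor reports bad data

print(composite_score(accurate))   # 715.0
print(composite_score(erroneous))  # 610.0 -- indistinguishable from a valid score
```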

It would seem there is a small but growing awareness of this particular data integrity problem. It’s a significant risk even when considering mistakes or inadvertent corruption of data, but adding the dimension of intentional malicious modification of data – which may then be used as the basis for decisions – raises the threat to a critical level. Conventional approaches to data integrity protection – e.g., access controls, encryption, host-based IDS – could in theory be complemented by regularly executed processes to validate data held by a steward, and by assigning a tagging or scoring scheme to data when transmitted to provide an integrity assertion to receiving entities. The concept of a consistent, usable, enforceable integrity assertion mechanism is one of several areas we think warrants further investigation and research.
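One possible shape for the tagging scheme described above is an envelope that carries the data together with an assertion from the steward: who validated it, when, and a digest binding the assertion to the payload. The field names and structure here are purely illustrative, not an existing standard; this is a sketch of the concept, not a proposed mechanism.

```python
import hashlib
import json
import time

def attach_integrity_assertion(payload: dict, source: str, validated: bool) -> dict:
    """Wrap outgoing data with a hypothetical integrity assertion."""
    body = json.dumps(payload, sort_keys=True).encode()
    return {
        "payload": payload,
        "assertion": {
            "source": source,              # the asserting data steward
            "validated": validated,        # steward's own checks passed
            "validated_at": time.time(),
            "digest": hashlib.sha256(body).hexdigest(),
        },
    }

def check(envelope: dict) -> bool:
    """Accept only data whose assertion is intact and affirmative."""
    body = json.dumps(envelope["payload"], sort_keys=True).encode()
    return (hashlib.sha256(body).hexdigest() == envelope["assertion"]["digest"]
            and envelope["assertion"]["validated"])

msg = attach_integrity_assertion({"dosage_mg": 5}, "pharmacy-system", validated=True)
print(check(msg))   # True: assertion intact

msg["payload"]["dosage_mg"] = 50  # modified after the assertion was made
print(check(msg))   # False: digest no longer matches
```

In a real scheme the assertion would itself need to be signed by the steward, so that the receiving entity can trust both the data and the claim made about it.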