Seeing Through the Fog of Data Bias

"Future conflicts will be fought not just with bullets and bombs but also with bytes and big data,” NATO Secretary General Jens Stoltenberg said earlier this month, as the alliance prepared to adopt its first strategy on artificial intelligence. These thoughts reminded us of the enduring relevance of the following article from Proceedings December 2020 magazine by CNA’s Eileen Chollet, reprinted here with permission of the U.S. Naval Institute.

According to one database, over the past ten years, cruiser deployments have averaged eight months. The shortest deployment was two weeks, and the longest was more than two years. Every entry in that database is complete, correct, and authoritative. And yet, any budgeteer or analyst who used those numbers would end up wildly off target, because those statistics lump together forward-deployed and rotational cruisers, which use markedly different definitions of “deployment.”
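A small sketch can make the arithmetic concrete. The figures below are invented for illustration and are not drawn from any Navy database; the labels `group_a` and `group_b` are hypothetical stand-ins for two populations that each record "deployment" under a different definition. Every record is individually correct, yet the pooled average describes neither group.

```python
from statistics import mean

# Hypothetical deployment lengths in months (invented for illustration).
# group_a and group_b are assumed labels for two populations that each
# use their own definition of "deployment".
deployments = [
    ("group_a", 0.5),
    ("group_a", 1.0),
    ("group_a", 2.0),
    ("group_b", 7.0),
    ("group_b", 8.0),
    ("group_b", 9.5),
]


def pooled_mean(records):
    """Average every record together, ignoring which definition produced it."""
    return mean(length for _, length in records)


def mean_by_group(records):
    """Average each group separately so each definition is kept intact."""
    groups = {}
    for group, length in records:
        groups.setdefault(group, []).append(length)
    return {group: mean(lengths) for group, lengths in groups.items()}


pooled = pooled_mean(deployments)
by_group = mean_by_group(deployments)

print(f"pooled mean: {pooled:.2f} months")
for group, value in by_group.items():
    print(f"{group} mean: {value:.2f} months")
```

In this toy data the pooled figure sits between the two group averages and far from both, so a planner who adopted it would overestimate one population and underestimate the other.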

If a distorted average becomes an incorrect planning assumption, future leaders could find themselves without the resources they need.

Truth in data depends on definition, interpretation, and approximation. When data approximations differ from the “truth” according to some specified definition, statisticians call that “data bias.” If the future digital Navy fails to understand and actively manage data bias, it will find itself off course.

Over the past few years, Navy leaders have rightly focused on the competitive advantages that data offer. In an April 2019 article in Defense One, Admiral William Moran warned of adversary attempts to “dominate the data domain” and urged the Navy to focus on “high-quality data input at any entry point.” The Secretary of the Navy’s 2019 Cybersecurity Readiness Review lays out the risks the Navy faces when it can no longer trust the confidentiality, integrity, and availability of data. Meanwhile, former Chief of Naval Operations Admiral John Richardson’s A Design for Maintaining Maritime Superiority 2.0 emphasized the foundational role high-quality data plays in decision-making. But even if the data foundation appears perfect, with every entry complete and correct, bias can imperceptibly erode the competitive edge.

Some data bias will always occur when operating in real time and in the real world, rather than under laboratory conditions. Location bias, the overrepresentation of information physically closer to the data collector, is a concern when forces are distributed. “Big data” is great when you can acquire it, but time and funding constraints almost always limit sample sizes, leading to predictions from too few points. And it is human nature to empathize with the planners and participants of an experiment, exercise, or wargame, leading data collectors to leave out points they feel should not “count,” so that the event succeeds.

The bias problem only grows when there is no right answer. “How many ships does the Navy have?” may be quantitative, but it is also squishy. Is a submarine a ship? How about the USS Constitution? Include the National Defense Reserve Fleet? The definition of “ship” will bias the data one way or the other, producing a smaller or larger number depending on the choices the data analyst makes—consciously or unconsciously. In 2014, the Navy began counting hospital ships, some patrol craft, and cruisers in reduced status as part of the battle force. Congress, seeing politics where perhaps there was none, forced a return to the old counting rules less than a year later. Any model of the battle force data that does not take these definition changes into account could see trends where there are none.
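The effect of a counting-rule change can be sketched in a few lines. Everything here is an assumption made for illustration: the inventory numbers are invented, the category names are simplified, and the year-to-rule mapping is a rough rendering of the sequence described above. The inventory is held constant on purpose, so any movement in the count comes from the definition alone.

```python
# Hypothetical inventory, identical in every year (all numbers invented).
INVENTORY = {
    "other_battle_force_ships": 100,
    "hospital_ships": 2,
    "patrol_craft": 5,
    "reduced_status_cruisers": 3,
}

# Two assumed definitions of "ship" for counting purposes.
OLD_RULE = {"other_battle_force_ships"}
NEW_RULE = OLD_RULE | {"hospital_ships", "patrol_craft", "reduced_status_cruisers"}

# Assumed, simplified timeline: the broader rule applies only in 2014.
RULE_IN_FORCE = {2013: OLD_RULE, 2014: NEW_RULE, 2015: OLD_RULE}


def count_ships(inventory, counted_categories):
    """Sum only the categories that the chosen definition counts."""
    return sum(n for category, n in inventory.items() if category in counted_categories)


# Counts as they would be reported under whichever rule applied that year.
as_reported = {year: count_ships(INVENTORY, rule) for year, rule in RULE_IN_FORCE.items()}

# Counts recomputed under a single definition for every year.
consistent = {year: count_ships(INVENTORY, OLD_RULE) for year in RULE_IN_FORCE}

print("as reported:", as_reported)
print("one definition:", consistent)
```

The as-reported series rises and falls even though the underlying inventory never changes, while the series recomputed under one definition stays flat. A model fed the first series without the rule history would read that movement as a trend.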

Managing data bias does not require the most sophisticated algorithms, machine learning, or supercomputers; to keep data bias from undermining the digital Navy, we need to cultivate people. The Navy’s data stewards (the people responsible for the collection, management, and administration of data sets) need specialized training to understand how data bias occurs, and to be instinctively skeptical of analytic results presented with only a hand wave at methodology. Data stewards need operational experience and longevity with a particular data set to understand the choices made when the data were collected and curated. And data stewards need to maintain their objectivity and independence, so their funding or fitness reports cannot depend on the data telling only the stories that leaders want to hear. The digital Navy that the nation needs depends on it.

Copyright U.S. Naval Institute.

Eileen Chollet is a senior research scientist in CNA’s Fleet Operations and Assessments Program.
