As the financial industry deals with scaling and consistency problems driven by an ever-growing and more demanding customer base, data that can be easily accessed and interpreted is becoming less of a benefit and more of a requirement. A clear example of this trend has been the move from legacy systems to automated, cloud-based solutions. This revolution, however, does not come without its challenges, the most notable being dirty data. "Dirty data" often refers to deleted data, corrupted data, or the replication and reformatting of data across different company systems, rendering it unusable, inconsistent, or slow to access.
Much of what makes cleaning data so challenging lies in how differently the data is stored and labeled across a company. It is no wonder data management can feel overwhelming when the data to be cleaned spans millions or even billions of records that may be continually growing.
In many cases, the data itself is not dirty but inconsistent. For example, a client's legal first name for tax purposes may be Joseph; on his client statements, he prefers to be addressed as Joe; and on the phone, he prefers to be called Joey by his account rep. A data management team looking to reconcile, replicate, or reformat data must not only make these connections but also identify every instance of the pattern for every client. To make matters worse, finding errors in this process is almost as hard as the reconciliation itself, since it requires a person to review data management's decisions and notice subtle differences between the cleaned data and the correct source data.
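To make the reconciliation problem concrete, here is a minimal sketch in Python of grouping name variants under a shared client identifier. The record layout and field names are hypothetical, invented purely for illustration, not taken from any real system:

```python
# Each system stores a different preferred form of the same client's name.
# (Hypothetical records; "client_id" is assumed to be a shared key.)
records = [
    {"system": "tax",        "client_id": 1042, "first_name": "Joseph"},
    {"system": "statements", "client_id": 1042, "first_name": "Joe"},
    {"system": "phone_crm",  "client_id": 1042, "first_name": "Joey"},
]

def build_name_map(records):
    """Group every name variant under the shared client_id."""
    name_map = {}
    for rec in records:
        name_map.setdefault(rec["client_id"], set()).add(rec["first_name"])
    return name_map

variants = build_name_map(records)
# variants[1042] now holds {"Joseph", "Joe", "Joey"}: one client, three labels
```

The hard part in practice, as the paragraph above notes, is that a shared key often does not exist, so these connections must first be discovered and then verified for every client.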
Some asset management firms respond, "I would rather have some dirty data than risk losing precious information or paying to get it cleaned." In a vacuum, this may make sense to some managers; after all, data cleaning is expensive and labor-intensive. However, according to MIT Sloan, companies that don't clean their data lose, on average, 15% to 20% of their revenue as a result.
Does this mean all financial firms are doomed either to live with the risk of inconsistent data or to pay the dirty data tax? Not necessarily. Many of these problems have already been addressed; all that is left is to apply the solutions.
Data-oriented programming languages (e.g., SQL and Python) can be used to develop models for future cleanings by training them on only a handful of mappings done by hand. Consider reconciling phone numbers across the business: one group may use the format (xxx) xxx-xxxx, another xxx.xxx.xxxx, and the end product needs to look like xxx-xxx-xxxx. A model can be trained to recognize the punctuation and reformat the data correctly, just as a human eye would. In the case of removed or corrupted data, similar training can teach a computer to follow pre-made rules, such as filling empty fields or removing data before a certain date. User Acceptance Testing also becomes much easier, because the model can be told what the result is supposed to look and act like, and its actual output can then be compared against that specification.
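A rule-based version of the phone number example above can be sketched in a few lines of Python. This is one simple way to implement the reformatting rule, not a full cleaning pipeline; the empty-field handling follows the pre-made-rules idea described in the paragraph:

```python
import re

def normalize_phone(raw):
    """Reformat any 10-digit US number to the target xxx-xxx-xxxx format."""
    digits = re.sub(r"\D", "", raw or "")  # strip punctuation; treat None/"" as dirty
    if len(digits) != 10:
        return None                        # rule: flag non-conforming fields for review
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"

normalize_phone("(212) 555-0147")  # -> "212-555-0147"
normalize_phone("212.555.0147")    # -> "212-555-0147"
normalize_phone("")                # -> None (empty-field rule)
```

Because the expected output format is fully specified, User Acceptance Testing reduces to comparing each normalized value against that specification.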
For more complicated cleaning procedures, the definitions of individual terms can be interpreted with the natural language processing techniques mentioned earlier in this paper. Natural language processing is already being used to map electronic health records and could readily be applied to data with clear, accessible definitions, such as financial data.
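As a toy illustration of the definition-matching idea, the sketch below uses simple string similarity from Python's standard library as a stand-in for full natural language processing. The field names and definitions are hypothetical:

```python
from difflib import SequenceMatcher

# Hypothetical glossary of financial terms with plain-language definitions.
definitions = {
    "net_asset_value": "total value of a fund's assets minus its liabilities",
    "expense_ratio":   "annual fee charged as a percentage of assets",
}

def best_match(description, definitions):
    """Map a free-text field description to the closest known definition."""
    def score(key):
        return SequenceMatcher(None, description.lower(), definitions[key]).ratio()
    return max(definitions, key=score)

best_match("fund assets minus liabilities", definitions)  # -> "net_asset_value"
```

A production system would use proper NLP models rather than character-level similarity, but the workflow is the same: compare each incoming field against the accessible definitions and attach it to the best match.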
Finally, because the rules developed during an initial cleaning can be applied to new data as it arrives, the full cleaning process only needs to happen once.