In order to build predictive models, data scientists need accurate data for training and validation. While a lot of work usually goes into cleaning up data sources for modeling, such as dealing with missing attributes, there are often larger issues with the underlying data set that need to be corrected in order for the trained models to actually be representative. One of the goals of data governance is data integrity, which involves validating that your underlying assumptions about the data set match reality. An example showing why this aspect of data science is so important is the recent FiveThirtyEight article, where they identified that previous conclusions published about broadband access were invalid due to using a flawed data set.
Governance roles for data science and analytics teams are becoming more common, because companies are using large and complex data sets from a variety of internal and external sources. One of the key functions of this role is to analyze and validate data sets in order to build confidence in them. We want to build trust in our data sets before we use them as input to our models, where the outputs are visible to customers. At Windfall, we use a variety of different public and proprietary data sources as input to our net worth models. We’re hiring for a governance data scientist role focused on aspects such as data integrity, to ensure that we are using validated data sets in our modeling processes.
Since this is a newer role, I wanted to identify the key functions that a data scientist in this role should perform:
- Question underlying assumptions about the data
- Identify how to resolve discrepancies in data sources
- Evaluate whether new data sources are valuable
One of the key challenges when using data sets is determining the validity of the data. Often data is stale or sampled in a way that is not representative of the overall population. If you’re using a data source that is several years old, many conclusions that could be drawn from the data may no longer hold true. For example, using data about broadband connectivity in 2010 would be problematic when determining the impact of repealing net neutrality on US households today. In the case of the FiveThirtyEight article, a sampled data set was used where the distribution of broadband subscribers significantly varied from other data sources analyzed.
In order to question underlying assumptions about data, it’s often necessary to audit the data against different sources. For example, transaction-level data provided by the FEC about political contributions can be compared with aggregate amounts reported from campaigns, and estimates of housing values can be compared to estimates from Zillow and Redfin. A governance role will prioritize which data points to manually inspect, in order to build more confidence in the data sets, and make sure that conclusions reached from a sample data set can be applied to a wider population.
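As a rough illustration of this kind of audit, the sketch below sums transaction-level amounts per entity and compares them with independently reported totals, flagging anything that differs by more than a tolerance. The function name, the 1% tolerance, and the sample figures are all made up for the example. They are not taken from FEC data or from Windfall's pipeline.

```python
from collections import defaultdict


def reconcile_totals(transactions, reported_totals, tolerance=0.01):
    """Compare summed transaction amounts with independently reported totals.

    transactions: iterable of (key, amount) pairs (hypothetical layout)
    reported_totals: dict mapping key -> aggregate reported by another source
    tolerance: relative difference allowed before a key is flagged
    Returns a list of (key, summed, reported, relative_diff) for flagged keys.
    """
    summed = defaultdict(float)
    for key, amount in transactions:
        summed[key] += amount

    flagged = []
    for key in sorted(set(summed) | set(reported_totals)):
        ours = summed.get(key, 0.0)
        theirs = reported_totals.get(key)
        if theirs is None or theirs == 0:
            # No usable independent figure: always flag for manual review.
            flagged.append((key, ours, theirs, None))
            continue
        rel_diff = abs(ours - theirs) / abs(theirs)
        if rel_diff > tolerance:
            flagged.append((key, ours, theirs, rel_diff))
    return flagged


# Made-up example data, for illustration only.
transactions = [
    ("campaign_a", 250.0),
    ("campaign_a", 750.0),
    ("campaign_b", 500.0),
    ("campaign_c", 100.0),
]
reported = {"campaign_a": 1000.0, "campaign_b": 650.0}

flagged = reconcile_totals(transactions, reported)
for key, ours, theirs, rel_diff in flagged:
    print(key, ours, theirs, rel_diff)
```

The flagged keys are a starting list for manual inspection, not a verdict. A mismatch may reflect timing differences or reporting thresholds rather than bad data.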
Another aspect of this role is determining how to resolve issues with data sets when they are discovered. In the case of incorrect findings being published, a postmortem should be published explaining how the findings change based on the newly discovered information, and the FiveThirtyEight article is a great example of this. But if the input data is instead used for modeling, then the role should work with an engineering team to resolve these issues in the data pipeline.
One of the non-trivial situations we encountered at Windfall is handling multiple-property transactions, where properties at multiple addresses are purchased as part of the same transaction. Handling these types of transactions required adding new rules to our automated valuation model (AVM) calculations. Much like productizing a model, a governance data scientist should be capable of putting data quality fixes into production. This can involve handing off a script, or submitting PRs with code changes.
Evaluating New Sources
An additional function that we are defining for a governance role is to evaluate whether new data sources are worth using for modeling purposes. At Windfall, this means determining if adding a new data source will improve the accuracy of our net worth models. A data scientist in this role should be able to work with third-party data in a variety of formats and source types, and perform exploratory analysis on the data. Often the goal of exploring a new data set is to test for correlations between attributes in different data sets, and data scientists need to be able to work effectively with disparate data sources.
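A first pass at testing for correlations across two sources might look like the sketch below. It joins an existing attribute and a candidate attribute on a shared record ID, then reports the Pearson correlation along with how much of the existing data the candidate source covers. The function names, record IDs, and values are hypothetical, and Pearson correlation is just one reasonable choice here. It only captures linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_x * var_y)


def correlate_sources(existing, candidate):
    """Join two sources on record ID and correlate their attributes.

    existing, candidate: dicts mapping record ID -> numeric attribute
    Returns (correlation, coverage), where coverage is the share of
    existing records that also appear in the candidate source.
    """
    shared = sorted(set(existing) & set(candidate))
    coverage = len(shared) / len(existing) if existing else 0.0
    r = pearson([existing[k] for k in shared], [candidate[k] for k in shared])
    return r, coverage


# Made-up example data, for illustration only.
existing = {"h1": 1.0, "h2": 2.0, "h3": 3.0, "h4": 4.0}
candidate = {"h1": 10.0, "h2": 20.0, "h3": 30.0, "h5": 99.0}

r, coverage = correlate_sources(existing, candidate)
print(f"correlation={r:.3f}, coverage={coverage:.0%}")
```

Reporting coverage alongside the correlation matters because a strong relationship on a small overlap says little about how useful the source would be across the full population.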
Governance Role Profile
What are companies looking for in the governance role? At Windfall, we’re looking for data scientists with the following skill set:
- EDA: Demonstrated experience of exploratory data analysis (EDA) across large and messy data sets. For example, working with a third-party API and testing core assumptions about the data.
- Scripting: As mentioned above, a data scientist in this role should be capable of productizing their findings. R and Python are a good starting point for setting up reproducible research, but we also want findings from scripting projects to be translatable to our data pipeline.
- Writing: Written and verbal communication is critical for this role, because a governance data scientist needs to be able to share findings with technical teams, business leaders, and third-party data vendors. This includes writing long-form reports, creating compelling visualizations, and documenting new data sources.
This role differs from a machine learning role, because the focus is not on predictive modeling, but on improving data quality and integrity. It also differs from product analytics roles, because the goal is to identify discrepancies in the underlying data rather than in business metrics. Despite these differences, the role still requires the statistical knowledge, domain expertise, and hacking skills commonly associated with data science.