It’s time we began to “fixate on data” to solve our problems, says one of the world’s leading experts in data science.
In 2006, Jeannette Wing, then the head of the computer science department at Carnegie Mellon University, published an influential essay titled “Computational Thinking,” arguing that everyone would benefit from using the conceptual tools of computer science to solve problems in all areas of human endeavor.
Wing herself never intended to study computer science. In the mid-1970s, she entered MIT to pursue electrical engineering, inspired by her father, a professor in that field. When she discovered her interest in computer science, she called him up to ask if it was a passing fad. After all, the field didn’t even have textbooks. He assured her that it wasn’t. Wing switched majors and never looked back.
Formerly corporate vice president of Microsoft Research and now executive vice president for research at Columbia University, Wing is a leader in promoting data science in multiple disciplines.
Anil Ananthaswamy recently asked Wing about her ambitious agenda to promote “trustworthy AI,” one of 10 research challenges she’s identified in her attempt to make AI systems more fair and less biased.
Q: Would you say that there’s a transformation afoot in the way computation is done?
A: Absolutely. Moore’s Law carried us a long way. We knew we were going to hit the ceiling for Moore’s Law, [so] parallel computing came into prominence. But the phase shift was cloud computing. Original distributed file systems were a kind of baby cloud computing, where your files weren’t local to your machine; they were somewhere else on the server. Cloud computing takes that and amplifies it even more, where the data is not near you; the compute is not near you.
The next shift is about data. For the longest time, we fixated on cycles, making things work faster—the processors, CPUs, GPUs, and more parallel servers. We ignored the data part. Now we have to fixate on data.
Q: That’s the domain of data science. How would you define it? What are the challenges of using the data?
A: I have a very succinct definition. Data science is the study of extracting value from data.
You can’t just give me a bunch of raw data and I push a button and the value comes out. It starts with collecting, processing, storing, managing, analyzing, and visualizing the data, and then interpreting the results. I call it the data life cycle. Every step in that cycle is a lot of work.
Q: When you’re using big data, concerns often crop up about privacy, security, fairness, and bias. How does one address these problems, especially in AI?
A: I have this new research agenda I’m promoting. I call it trustworthy AI, inspired by the decades of progress we made in trustworthy computing. By trustworthiness, we usually mean security, reliability, availability, privacy, and usability. Over the past two decades, we’ve made a lot of progress. We have formal methods that can assure the correctness of a piece of code; we have security protocols that increase the security of a particular system. And we have certain notions of privacy that are formalized.
Trustworthy AI ups the ante in two ways. All of a sudden, we’re talking about robustness and fairness—robustness meaning if you perturb the input, the output is not perturbed by very much. And we’re talking about interpretability. These are things we never used to talk about when we talked about computing.
[Also,] AI systems are probabilistic in nature. The computing systems of the past are basically deterministic machines: they’re on or off, true or false, yes or no, 0 or 1. The outputs of our AI systems are basically probabilities. If I tell you that your x-ray says you have cancer, it’s with, say, 0.75 probability that that little white spot I saw is malignant.
So now we have to live in this world of probabilities. From a mathematical point of view, it’s using probabilistic logic and bringing in a lot of statistics and stochastic reasoning and so on. As a computer scientist, you’re not trained to think in those ways. So AI systems really have complicated our formal reasoning about these systems.
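To make the two ideas in this answer concrete, here is a minimal Python sketch of a probabilistic output and a simple robustness check. Everything in it is a hypothetical illustration rather than anything Wing describes: the toy `malignancy_probability` model, its coefficients, and the `max_output_change` helper are assumptions chosen only so that the model returns roughly 0.75 for one input, echoing the x-ray example. The check nudges the input by a small amount and measures how far the output probability moves.

```python
import math


def malignancy_probability(spot_size_mm: float) -> float:
    """Hypothetical toy model: a logistic curve mapping spot size to a probability.

    The coefficients 0.4 and 2.0 are made up for illustration only.
    """
    return 1.0 / (1.0 + math.exp(-(0.4 * spot_size_mm - 2.0)))


def max_output_change(model, x: float, epsilon: float, steps: int = 100) -> float:
    """Return the largest change in the model's output for inputs within epsilon of x.

    This samples evenly spaced points in [x - epsilon, x + epsilon]; it is an
    empirical check, not a formal guarantee of robustness.
    """
    base = model(x)
    worst = 0.0
    for i in range(steps + 1):
        perturbed = x - epsilon + (2 * epsilon) * i / steps
        worst = max(worst, abs(model(perturbed) - base))
    return worst


# The output is a probability, not a yes-or-no answer.
p = malignancy_probability(7.75)

# Robustness in the sense used above: a small perturbation of the input
# should produce only a small perturbation of the output.
change = max_output_change(malignancy_probability, 7.75, epsilon=0.1)

print(f"probability of malignancy: {p:.2f}")
print(f"largest output change for a 0.1 mm perturbation: {change:.4f}")
```

A sampling check like this can only show that no large change was found at the points tried. Assuring robustness for every possible perturbation is the harder, formal problem this answer points to.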
Q: Trustworthy AI is one of the 10 research challenges you identified for data scientists. Causality seems to be another big one.
A: Causality, I think, is the next frontier for AI and machine learning. Right now, machine-learning algorithms and models are good at finding patterns and correlations and associations. But they can’t tell us: Did this cause that? Or if I were to do this, then what would happen? And so there’s another whole area of activity on causal inference and causal reasoning in computer science. The statistics community has been looking at causality for decades. They sometimes get a little miffed at the computer science community for thinking that “Oh, this is a brand-new idea.” So I do want to credit the statistics community for their fundamental contributions to causality. The combination of big data and causal reasoning can really move the field forward.
Q: Are you excited about what data science can achieve?
A: Everyone’s going gaga over data science, because they are seeing their fields being transformed by the use of data science methods on the digital data that they are now generating, producing, collecting, and so on. It’s a very exciting time.