Ellen Friedman
Ellen is principal technologist at HPE Ezmeral, focused on large-scale analytics and artificial intelligence. She is also a committer for Apache Drill and Apache Mahout. With a Ph.D. in biochemistry, Ellen is an international speaker and co-author of multiple books.

You understand the importance of data. It shapes machine learning (ML) models and drives decisions based on large-scale analytics applications. But what is your data really telling you?

Let me explain. I originally trained as a research biochemist. Many years ago, on a flight to a scientific conference, I had an interesting conversation with my seatmate who was heading for the same event. He worked for a pet food company, specifically on a project to develop a “new and improved” flavoring for dog food. He entertained me with an account of how difficult it is to determine if a new flavor really is better. To find out, his team did a simple test: They put down two bowls of dog food, one on the left with the new flavor and one on the right with the original version. To their disappointment, the test dog went to the original flavor, with the same result in each of multiple trials.

The team went back to the lab and weeks later tried again with a new flavor. In comes the dog, and again, he goes to the original flavor. Multiple trials yield the same disappointing result. But then, it occurred to one of the team to reverse the position of the bowls, now with the new flavor on the right. Turns out, the dog just preferred the bowl on the right, regardless of which food it contained.


AI and Analytics: What Is the Data Telling You?

When you mine data for insights or to automate a decision, whether through machine learning or data analytics, it pays to be careful about how you design each question. You need to know how the data was collected and to keep asking yourself, “Does the data represent what I think it does?”

Take this real-world example from machine learning: A data scientist was tasked with building a recommendation system for an online video streaming service. This data scientist was experienced with developing recommenders and knew to look at what people do rather than what they say they like (behavior over reported ratings) to discover preferences. In this case, the data to be used for training the recommender was the videos people clicked on — this was the behavior used to reveal preferences. Surprisingly, however, the results were poor; the recommendation system did not perform well although it used approaches that had been successful in the past.

The solution was to re-examine a broader group of viewer behavior data through direct inspection and re-think the assumptions about what the data represented. It turned out that using clicks on video titles as the indicator of preference wasn’t a good idea. In many cases, people selected a title but quickly clicked away, often because the title did not match the content, either through error or spamming. But using a different target for training (watching the first 30 seconds of a video rather than just clicking on it) resulted in a video recommendation system that worked beautifully.
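To make the change of training target concrete, here is a minimal Python sketch. The `ViewEvent` record, its field names and the sample events are hypothetical, invented for illustration; only the 30-second threshold comes from the example above. It is not the code the data scientist used, just one way the two label definitions could differ.

```python
from dataclasses import dataclass


@dataclass
class ViewEvent:
    # Hypothetical log record: one row per click on a video title.
    user_id: str
    video_id: str
    seconds_watched: float


MIN_WATCH_SECONDS = 30.0  # threshold taken from the example in the text


def click_labels(events):
    """Original target: every click counts as a sign of preference."""
    return {(e.user_id, e.video_id) for e in events}


def engaged_labels(events, min_seconds=MIN_WATCH_SECONDS):
    """Revised target: only clicks followed by enough viewing time count."""
    return {
        (e.user_id, e.video_id)
        for e in events
        if e.seconds_watched >= min_seconds
    }


# Made-up sample data: two viewers click a misleading title and leave quickly.
events = [
    ViewEvent("u1", "cat_video", 95.0),
    ViewEvent("u1", "misleading_title", 4.0),
    ViewEvent("u2", "misleading_title", 3.0),
    ViewEvent("u2", "cooking_show", 240.0),
]

clicks = click_labels(events)
engaged = engaged_labels(events)
dropped = clicks - engaged  # clicks that did not reflect real interest

print(f"click-based positives:   {len(clicks)}")
print(f"watch-based positives:   {len(engaged)}")
print(f"clicks dropped as noise: {sorted(dropped)}")
```

Comparing the two label sets side by side, before any model is trained, is a cheap way to see how much of the "preference" signal was really just a response to the title.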

The lesson here is not about video but about the importance of keeping an open mind to what your data tells you, trying different approaches and continually questioning your assumptions. It’s also an example of what newly trained data scientists discover: Data in the real world is not as clean and straightforward as the carefully selected data sets often used in machine learning classes.


Clearly, potential pitfalls exist in data selection and in framing the question you are addressing. So, what can you do about that?

Avoiding the Pitfalls: Tips for Better Data Science

No specific set of steps is guaranteed to avoid these problems. Much of the ability to avoid pitfalls comes through experience and being generally suspicious about your own assumptions. Just being alert to the potential for data to be misleading is already a step in the right direction. And a number of practices can help you better develop your skills and instincts on how to approach these issues. In addition to working on a system with efficient data management and data engineering, keep in mind these tips about data and decisions:

  • Plan time for data exploration, and talk to domain experts to find out more about how data was collected, known defects, what the labels mean, and what other related data may be available or could be collected.
  • Look at the issue in more than one way. If different types of data lead you to the same conclusions, your confidence level should increase. Similarly, try predicting some variables based on others. This helps you understand if the data is self-consistent.
  • Ask yourself, or others who have tried similar approaches, if the results are roughly what you expect. A model that behaves much better or much worse than expected should be a warning flag to go back and re-examine data as well as how the question is framed. It isn’t always the case that outlier results are bogus — you might have built an extraordinary system! But it is a good idea to recheck the process if models behave in particularly surprising ways.
  • Consider injecting synthetic data as a test of your system. Physicists working on particle accelerators and large-scale astronomical studies do something similar. They inject sample signals or known kinds of noise into their data to verify their analysis methods can robustly detect the injected samples.
  • Try randomizing a data source you use for training. If this doesn’t change your results, then modeling is not working the way you think it is.
  • If possible, shadow real users as they go about the behaviors of interest. Now that you know what they actually do, verify their actions are reflected in the data you plan to use. This is a great way to reveal faulty assumptions or misleading aspects of data collection.
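
Two of the tips above, injecting synthetic data and randomizing a data source, can be sketched together in a few lines of Python. This is a toy illustration under stated assumptions, not a recipe from any particular project: the "analysis" is a plain least-squares slope, the injected signal is a made-up linear relationship with slope 2, and `fit_slope` is a hypothetical helper written for this sketch.

```python
import random


def fit_slope(xs, ys):
    """Least-squares slope of ys on xs (stands in for 'the analysis')."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Step 1: inject a known synthetic signal (assumed slope of 2 plus noise).
TRUE_SLOPE = 2.0
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
ys = [TRUE_SLOPE * x + rng.gauss(0.0, 1.0) for x in xs]

# The analysis should recover something close to the injected slope.
recovered = fit_slope(xs, ys)

# Step 2: randomize the input by shuffling it relative to the target.
shuffled_xs = xs[:]
rng.shuffle(shuffled_xs)

# With the link between input and target destroyed, the slope should collapse
# toward zero. If it does not, the analysis is picking up something else.
shuffled = fit_slope(shuffled_xs, ys)

print(f"slope with injected signal: {recovered:.3f}")
print(f"slope after shuffling:      {shuffled:.3f}")
```

The same pattern applies to a real model: check that it finds a signal you planted, then check that its performance falls apart when the training input is scrambled.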

Whichever approaches you choose, it is helpful to have a comprehensive data strategy across an enterprise rather than isolating data science teams and the data they use. This shared data strategy makes it easier to explore different types of data for feature extraction and to use a wide range of machine learning or large-scale analytics tools without having to set up separate systems. A comprehensive data strategy, along with a unifying data infrastructure to support it, also encourages collaboration between data scientists and non-data scientists who hold valuable domain expertise. All of this helps you to keep questioning what your data is telling you and to keep testing your conclusions.

A Final Data Science Example

A few years ago, I was at a conference, signing and giving away books with my co-author, Ted Dunning. We gave people a choice between “Practical Machine Learning: Innovations in Recommendation” and “A New Look at Anomaly Detection,” but each person could take only one book. We were surprised that over 80% chose the book on recommendations over the one on anomaly detection. Then a thought occurred to me. I leaned over and whispered “dog food” to Ted. He swapped the positions of the books.

Turns out, data scientists prefer the book on the left.

If you’d like to read our latest short book, download a free PDF courtesy of HPE: AI and Analytics at Scale: Lessons from Real World Production Systems.

