It’s likely that at some point, you’ve been asked to include a certain phrase in the subject line of an e-mail so that it wouldn’t be mistakenly routed to the recipient’s spam folder. This practice will become less and less common as time goes by, thanks to advances in the machine learning technology that tells your e-mail program how to recognize spam. Machine learning is changing not only the way your inbox ‘thinks’ about your incoming mail, but also the way that data scientists think about data analysis. Machine learning can either work with or replace human intuition when making predictions based on data; this increases the accuracy of the predictions and makes data more valuable.
A machine is said to learn when it improves its performance at a given task with experience. In order to do this, it must make generalizations based on that experience and use them to accurately handle situations it has not encountered before. The e-mail program discussed above, for example, uses data from messages previously classified as spam to build a predictive model about which future messages should be classified as spam.
A number of approaches to machine learning have emerged as data scientists have devised more and more advanced algorithms for analyzing and drawing inferences from data. These approaches can generally be categorized as using either supervised learning or unsupervised learning. In supervised learning, frequently used in spam detection, information retrieval, and database marketing, machines are “taught” to recognize patterns using labeled training data.
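To make the idea of labeled training data concrete, here is a minimal sketch of a supervised spam classifier in Python, using a bare-bones naive Bayes approach. The messages and labels are invented for illustration; a real spam filter would train on thousands of real examples.

```python
from collections import Counter
import math

# Toy labeled training data: (message, label) pairs.
# These examples are invented purely for illustration.
training = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("claim your free prize", "spam"),
    ("meeting at noon", "ham"),
    ("project status report", "ham"),
    ("lunch with the team", "ham"),
]

def train(examples):
    """Count word frequencies per class (a bare-bones naive Bayes fit)."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Pick the class with the higher log-posterior, using add-one smoothing."""
    vocab = set()
    for counter in word_counts.values():
        vocab.update(counter)
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        log_prob = math.log(class_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in text.split():
            log_prob += math.log((word_counts[label][word] + 1)
                                 / (n_words + len(vocab)))
        scores[label] = log_prob
    return max(scores, key=scores.get)

word_counts, class_counts = train(training)
print(classify("free money", word_counts, class_counts))      # → spam
print(classify("status meeting", word_counts, class_counts))  # → ham
```

The key point is that the labels do the teaching: the model never sees a rule for what makes a message spam, only examples of each class, and it generalizes from their word statistics.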
In unsupervised learning, machines are programmed to discover patterns in unlabeled sets of data. Unsupervised learning is frequently the foundation of several kinds of data analysis. It is used to discover clusters of similar data points, helping us define the patterns in large sets of data. Dimensionality reduction, the process of reducing the number of variables under consideration, is another important use of unsupervised learning. These algorithms allow us to gain an advanced understanding of quantities of data too vast for humans to process.
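Clustering is the simplest way to see the difference from supervised learning: the algorithm below receives no labels at all, yet still discovers the grouping. This is a minimal k-means sketch in Python over invented one-dimensional points.

```python
import random

random.seed(0)

# Unlabeled 1-D points drawn around two centers (values invented for
# illustration); k-means must discover the grouping on its own.
points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]

def kmeans(points, k, iterations=10):
    """Minimal k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# With this toy data the centers converge to roughly 1.0 and 9.0.
print(kmeans(points, 2))
```

No point was ever labeled, yet the two centers land on the two natural groups; this is the sense in which unsupervised learning “discovers” structure rather than being taught it.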
In order to teach machines to learn like humans do, we have to be able to codify human experience into mathematical language. While philosophers may have something to say on whether or not this is truly possible, today’s data scientists are developing increasingly accurate mathematical simulations of learning from experience. Machine learning gives us a powerful tool for seeing and understanding huge amounts of data in new ways.
One of the underlying tenets of making machine learning a valuable tool in your arsenal is that the data must, by and large, be accurate. Statistical tests can remove outliers while learning algorithms run, but there must still be a population of data worth modeling for inference. This brings back the age-old problem: garbage in, garbage out. In a big data environment, a lack of data governance practices simply produces bigger data problems. It is critical to begin shaping your organization’s data governance practices so that you can take advantage of new ways to isolate and identify business opportunities through techniques like machine learning. Today, organizations create a competitive advantage through such capabilities; tomorrow, these capabilities will be commonplace, and they will be required not just to create a competitive advantage, but to sustain one.
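As a sketch of the kind of statistical test mentioned above, a simple z-score filter can flag values that sit far from the mean. The sensor readings here are hypothetical, and a two-standard-deviation threshold is only one common rule of thumb.

```python
import statistics

# Hypothetical sensor readings with one obvious outlier (42.0).
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 42.0]

def remove_outliers(values, z_threshold=2.0):
    """Keep only values within z_threshold standard deviations of the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev <= z_threshold]

print(remove_outliers(readings))  # → [10.1, 9.8, 10.3, 9.9, 10.0]
```

A filter like this can clean up stray measurement errors, but it cannot rescue a data set that is systematically wrong; that is exactly why governance of the underlying data still matters.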
Data Clairvoyance is excited to help you prepare for and leverage data science capabilities; contact us for more information about our services.