17 analytics to transform your organisation, Part 1

28 November 2019
A complete overview of the most commonly used techniques for analysing and deriving value from data, by Pierre-Sylvain Roos

“Analytics” is the new keyword that you have undoubtedly seen again and again in many publications in recent years. It must be a magical concept since you are told everywhere that it is the key to deriving value from your data.

If we were to summarise, we could compare an analytics system to a super-powerful engine that consumes data in bulk to answer four main questions:

  1. What happened? (descriptive analytics)
  2. Why did this happen? (diagnostic analytics)
  3. What will happen? (predictive analytics)
  4. How to make this happen? (prescriptive analytics)

Today, there are many possible solutions and technologies for setting up analytics. And, of course, they are not all the same, nor do they have the same ability to solve the various problems with which you are confronted. On the contrary, we can see that there is a specialisation: each of these techniques is focused on a very specific goal or function.

Let me give you an overview of those that are the most widespread today. By asking the experts and reading the technical literature as well as the many online articles that discuss the subject, I have counted seventeen.     

 

1 – Regression and Correlation Analysis

This is probably the type of analytics to which we are most accustomed, in spite of ourselves, through the many publications delivered to us on the websites we visit. Not a day goes by without an article telling us of the dangers or benefits of a practice, behaviour, food product, or what have you. All these articles follow the same, unchanging logic: a study of the presence of, or change in, some factor, establishing that it causes or influences something else. This generally results in simple, easy-to-understand pronouncements such as “Eating too much red meat causes colon cancer” or “Doctors who receive fewer gifts from pharmaceutical companies write cheaper prescriptions”.

This is precisely what we call a regression, a relationship between two or more variables that we can observe directly. The term comes from the first statistical study of its kind (in the late 19th century), which sought to explain why height “regressed” towards an average size over generations among the descendants of tall people.

Regression analysis is a statistical method performed with an iterative approach based on a known variable. It is generally used to find an explanation for an identified and measured phenomenon.

Take climate change, for example: we explore the changes in the various variables we suspect of causing it (solar activity, the reflectivity of the Earth’s surface, the greenhouse effect, etc.) to identify any possible match with the change in temperatures.

This is how we were able to consider the dominant role of greenhouse gas emissions, especially emissions caused by human activities.

Regression can deal with quantitative or qualitative variables. In the latter case, we also talk about classification.

The concept of correlation is complementary to regression. It measures the level of dependency between two variables and expresses it as a correlation coefficient, which ranges from -1 to +1 depending on whether the correlation is negative or positive. A correlation coefficient of 0 means there is no linear relationship between the two variables. Of course, correlation only applies to quantitative variables.

Applied to global warming, measuring the correlation can help identify which factors have the greatest influence.
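
To make these two notions concrete, here is a minimal sketch in Python using scipy, run on a handful of made-up values standing in for CO2 concentration and temperature anomaly (the numbers are purely illustrative, not real measurements):

    import numpy as np
    from scipy import stats

    # Made-up, illustrative data: CO2 concentration (ppm) and temperature anomaly (°C).
    co2 = np.array([315, 325, 338, 354, 369, 389, 410], dtype=float)
    temp_anomaly = np.array([0.02, 0.05, 0.20, 0.32, 0.40, 0.55, 0.75])

    # Simple linear regression: fit the temperature anomaly as a linear function of CO2.
    slope, intercept, r_value, p_value, std_err = stats.linregress(co2, temp_anomaly)
    print(f"regression: anomaly = {slope:.4f} * CO2 + {intercept:.2f}")

    # Correlation coefficient between the two quantitative variables (ranges from -1 to +1).
    r, p = stats.pearsonr(co2, temp_anomaly)
    print(f"Pearson correlation: r = {r:.3f} (p = {p:.3g})")

A coefficient close to +1, as here, signals a strong positive relationship, but, as discussed below, it says nothing by itself about causation.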

 


A strong correlation between the change in the concentration of carbon dioxide in the atmosphere and the temperature at the South Pole. Both lines can practically be placed on top of each other (source: NASA). 

 

A common misconception is to assume that there is a causal relationship between two correlated variables, especially if the correlation is strong. However, this is not always the case; they may simply share a common cause. For example, in a city during a heatwave, we observe an increase in cases of dehydration and, at the same time, an increase in sales of fans. However, we cannot conclude that the increase in cases of dehydration is caused by the increase in fan sales (or vice versa). Both have the same cause: a prolonged increase in temperatures.

Today, regression and correlation analyses are heavily automated and supported by extensive tooling. They involve increasingly complex statistical models that can include up to several thousand variables, and they make up the traditional, essential core of data analysis. Further on, we will see that regression studies can also be conducted with an entirely different approach that uses neural networks.

 

2 - Data Mining

In the strictest sense of the term, data mining is a technique for discovering information that is buried and not immediately available by exploring large data sets. Its best-known application is predictive analysis, where the information found allows us to make predictions. The main method of exploration is “drilling”, i.e. working down through the layers of data to reach what is buried most deeply.

Data mining originated and grew in the 90s when specialist players like Teradata and SAS became established. An almost iconic result ensured this practice’s promotion worldwide: the connection between the sales of diapers and beer on Saturday afternoons.

This legend of data mining has been restated many, many times, sometimes “improved” by adding cakes. In reality, the facts occurred in 1992, when Teradata analysed sales in OSCO stores. The analysis showed a peak in the sale of diapers between 5 and 7 PM alongside a peak in the sale of beer. This helped stores cleverly reorganise their sections, requiring shoppers to pass through the beer section to reach the diapers and significantly increasing beer sales.

But predicting is not explaining; aside from a hypothetical father out to buy diapers while mum is at home watching the baby, this phenomenon remains a mystery.    
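
As a rough sketch of how such a co-occurrence pattern can be surfaced from raw transactions, here is a small Python example that computes the support and lift of item pairs over a handful of made-up baskets (both the data and the output are purely illustrative):

    from itertools import combinations
    from collections import Counter

    # Made-up Saturday-afternoon baskets; a real analysis would run over millions of receipts.
    baskets = [
        {"diapers", "beer", "chips"},
        {"diapers", "beer"},
        {"milk", "bread"},
        {"diapers", "beer", "wipes"},
        {"beer", "chips"},
        {"diapers", "wipes"},
    ]

    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(pair for basket in baskets for pair in combinations(sorted(basket), 2))

    # Support: how often a pair appears together; lift: how much more often than if the
    # two items were bought independently of each other.
    for (a, b), count in pair_counts.most_common(3):
        support = count / n
        lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
        print(f"{a} + {b}: support={support:.2f}, lift={lift:.2f}")

A lift well above 1 for the diapers-and-beer pair is exactly the kind of signal the analysts spotted, although, as noted above, it explains nothing about why shoppers behave this way.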

With the new generation of technology, data mining’s power has grown even more, offering the possibility of analysing and crossing completely heterogeneous sources of data.

This is how Walmart made a surprising discovery in the early 2000s. Using sales data from all its shops, Walmart identifies which products to showcase in the aisles every day. For a while, its analysts had suspected shoppers of deviating from their usual purchasing behaviour whenever inclement weather was approaching, particularly if a hurricane was on the way. Thus, when Hurricane Frances was approaching in 2004, an analysis was performed on all the sales made during past hurricanes. Surprise! Before each hurricane, they observed a significant increase in sales, and shortages, of... strawberry Pop-Tarts!

Armed with this discovery, Walmart sent entire lorries filled with strawberry Pop-Tarts to all its shops along Highway 95 in the path of Frances. The result: a sevenfold increase in sales!


An aisle of Pop-Tarts already well emptied...

 

Why Pop-Tarts? Probably because it is a food product that requires neither refrigeration nor cooking. Why strawberry? To this day, no one can say.    

As an aside, in good weather, American consumers tend to prefer Rice Krispies.

With the increase in the volume of data stored and available for analysis, it is clear that data mining still has many opportunities ahead with doubtless many more surprising discoveries in store.

 

3 – Text Mining / Text Analytics

Words are data: this has not escaped computer scientists, who began developing techniques to analyse and automatically process text with a computer in the 50s. The first application of this kind was created in 1957 at IBM. It ran on an IBM 704 and was used to produce a summary of a text.

The idea was simple: first, a statistical method measured the frequency of words and how they were distributed, which then made it possible to evaluate the relative significance of words and phrases and to extract the most significant ones (this frequency-based approach is sketched just after the list below). This basic procedure has been refined and extended ever since, always following the same overall process:

  • a recognition phase (words, phrases, grammatical rules, relationships, etc.),
  • an interpretation/composition phase (depending on the objective sought).
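
A minimal sketch of that original frequency-based idea might look like this in Python (a toy scorer written for illustration, not a reconstruction of the 1957 IBM program): each sentence is scored by the average frequency of its words, and the highest-scoring sentences are kept as the summary.

    import re
    from collections import Counter

    def summarise(text: str, n_sentences: int = 2) -> str:
        """Toy extractive summary: keep the sentences whose words are most frequent overall."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        words = re.findall(r"[a-z']+", text.lower())
        freq = Counter(words)

        # Recognition phase: score each sentence by the average frequency of its words.
        def score(sentence: str) -> float:
            tokens = re.findall(r"[a-z']+", sentence.lower())
            return sum(freq[t] for t in tokens) / max(len(tokens), 1)

        # Composition phase: keep the top sentences, in their original order.
        top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
        return " ".join(s for s in sentences if s in top)

    sample = ("Analytics turns raw data into decisions. Data is collected from many sources. "
              "Decisions based on data tend to be better decisions. The weather was nice today.")
    print(summarise(sample))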

What has changed, however, is the type of text and, above all, its volume. We now talk about “text as data”.

Today, the average company stores 100 terabytes or more of unstructured data, mainly in the form of text.

To this we can add all sorts of sources outside the company, all making up a mass of textual data that is available and can be exploited immediately: emails, reviews, online critiques, tweets, call centre notes, results of surveys and polls, bodies of text (grey literature), answers to open questions on a questionnaire, the text fields of a business application, a post on social networks, articles, reports, etc.

Systematised text mining is a powerful means of extracting meaning, discovering non-trivial relationships, and revealing new features that are still just weak signals.

And the applications are numerous in a variety of fields:

  • for electronic communications: to filter and classify incoming messages (spam/not spam, by topic/subject, etc.),
  • in scientific and technical domains or in everyday life: to search for information or queries in a search engine, or to identify inconsistencies or anomalies, 
  • in journalism: to monitor news feeds or assess changes in opinion,
  • in security organisations: to intercept private or public communications (NSA Echelon or Europol’s system),
  • in business intelligence: to map business relationships and stakeholder networks,
  • in marketing: to analyse the behaviour of consumers or clients,
  • in social sciences: to track the evolution of a lexical field over time, identify the appearance of new words, evaluate trends (sociology, psychology, marketing, etc.),  
  • for individual privacy (GDPR): detecting and protecting sensitive data, i.e. data relating to political opinions, religious beliefs, healthcare, sexual orientation, etc.

Among the successful examples of text mining, we note Google Ngram for its fun and accessible side (whilst a bit out of date now). It is a service to search for specific words or phrases in millions of digitised books. We can instantly obtain the number of times they appeared between 1800 and 2010.


Google Ngram allows you to mine for words in a body of digitised books. Here, we can see the presence of various types of analytics since 1950: the height of Linear Programming in the 1970s, the boom of Data Mining from 1990, as well as the emergence of the Neural Network in the 1960s and its rise from 1986 onwards are clearly visible.  

 

We also have PubGene, created in 2001 in the biomedical domain. This is a search engine that relies on a semantic map of biomedical terms and graphical networks. It is focused on medical expertise.

More worrying, however, is the way Facebook scans all the text content its users publish or send, including private messages. Mark Zuckerberg himself publicly admitted to the practice in 2018. He justifies it by the need to identify and block content that breaches Facebook’s policies, violates respect for individuals, or incites hatred.  

Whatever the intentions or reasons for their use, we can note that text mining technologies are perfectly operational and effective. 

Today, we can count at least 40 text mining solutions available on the market. They offer features to evaluate content, including its semantics and context. They usually combine analysis algorithms with machine learning, alongside the possibility of including custom dictionaries. We can cite a few: Mozenda, DiscoverText, IBM AlchemyLanguage, MonkeyLearn, SAS Text Miner, Keatext, Clarabridge Text Analytics, etc.

These solutions have all evolved beyond simple text mining by including sentiment analysis features, which we will discuss later.  

 

4 – Voice Analytics

Voice Analytics is the logical follow-up to Text Mining, since the spoken word can be converted into text. All that remains is to be able to recognise the language and to transcribe it automatically. That is why voice analysis technologies required much more time to be refined.

The first wide-scale applications strictly concern the analysis of what is said. This is what we call Speech Analytics. It is mainly used in customer relations centres, where its use has led to a significant improvement in quality of service. First, it helps identify recurring problems, emergencies, or the topics that consumers often raise about a product or service.

This is how the insurance company New York Life was able, a few years ago, to reduce the volume of incoming calls by 400,000 per year, along with a 40% saving in its quality assurance teams.

Voice Analytics is an extension of Speech Analytics; it also focuses on the way we speak. The pitch, amplitude, and frequency of phonemes and words, as well as the Mel-Frequency Cepstral Coefficients* (see illustration), are measured dynamically to identify the speaker’s emotional state: joy, sadness, anger, or fear.


*MFCC: Mel-Frequency Cepstral Coefficients describe the short-term power spectrum of a sound, roughly modelling its onset and intensity, which can be matched to a state of stress or excitation (in simplified terms).
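
For readers who want to see what these coefficients look like in practice, here is a minimal sketch using the open-source librosa library (assuming it is installed and that recording.wav is a short audio file of your own):

    import numpy as np
    import librosa

    # Load a (hypothetical) recording; sr=None keeps the file's original sampling rate.
    signal, sample_rate = librosa.load("recording.wav", sr=None)

    # Compute 13 Mel-Frequency Cepstral Coefficients for each short analysis frame.
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)

    print(mfcc.shape)             # (13, number_of_frames)
    print(np.mean(mfcc, axis=1))  # average value of each coefficient over the recording

Averaged or frame by frame, these coefficients are the kind of acoustic features that emotion-detection models are trained on.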

 

Importantly, speech recognition must be distinguished from speaker recognition, which aims to identify the speaker and is mainly based on deep learning models.

Voice Analytics software therefore necessarily includes an Automatic Speech Recognition (ASR) component as well as extensive functions for analysing audio patterns. There are two competing approaches:

  • one relies on a comprehension and a direct interpretation of phonemes,
  • the other consists in first transcribing the speech into text to then analyse it.

The most efficient technology to date is LVCSR (Large-Vocabulary Continuous Speech Recognition), which allows speech-to-text to be used in full transcription mode as opposed to purely phonetic approaches, which are much less powerful.
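
As an illustration of the second, transcription-based approach, here is a minimal sketch using the open-source SpeechRecognition package and its free web recogniser (call.wav is a hypothetical recording; a production system would rely on a full LVCSR engine rather than this shortcut):

    import speech_recognition as sr

    recognizer = sr.Recognizer()

    # Step 1: transcribe the (hypothetical) recorded call into text.
    with sr.AudioFile("call.wav") as source:
        audio = recognizer.record(source)
    text = recognizer.recognize_google(audio)

    # Step 2: analyse the transcript, with a crude keyword check standing in for real text mining.
    alert_terms = {"cancel", "complaint", "refund", "unacceptable"}
    found = [term for term in alert_terms if term in text.lower()]
    print(f"Transcript: {text}")
    print(f"Alert terms found: {found or 'none'}")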

For analysis, the combination of vocabulary, referenced expressions, and voice intonations gives us the possibility of identifying tense or aggressive situations or, on the other hand, satisfaction and understanding between two or more speakers. Specialist software components can be combined to maximise results (e.g. kNN, C4.5, or an SVM with an RBF kernel).
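
To give an idea of what such a component looks like in practice, here is a hedged sketch with scikit-learn: an SVM with an RBF kernel trained to label calls as “tense” or “calm” from acoustic feature vectors (the features and labels below are randomly generated stand-ins for averaged MFCCs and human annotations):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Hypothetical training set: 200 calls, each summarised by 13 averaged acoustic features,
    # labelled 1 for "tense" and 0 for "calm". A real system would use annotated recordings.
    X = rng.normal(size=(200, 13))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy rule standing in for human labels

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # SVM with an RBF kernel, one of the classifiers cited above.
    clf = SVC(kernel="rbf", gamma="scale")
    clf.fit(X_train, y_train)
    print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")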

Since Voice Analytics is specifically dedicated to analysing voice conversations, it is mainly used in the context of customer or partner relationships. It uses recorded telephone conversations to highlight:

  • factors that increase costs,
  • trends in the issues that arise,
  • the strengths and weaknesses of the products and services offered,
  • elements to understand how a product or service is perceived.

Voice Analytics is also very useful in the domain of security to identify threats or, more generally, intercept suspicious discussions or comments.

Finally, of course it is an essential component to making the voice assistants that have become a major part of our everyday lives over the past few years work (Alexa, Siri, Watson, Google Assistant, etc.), with all the limits and reservations that go with them.        

As a preview of what the future holds in store, Amazon’s Lab126 gives us a taste with Dylan, a connected bracelet combined with an artificial intelligence that can read and decipher human emotions. When voice commands are being used, the wearer’s emotional state is identified from the sound of their voice. For now, the ambition is rather commendable, since the goal is to provide instant moral support or advice on how to deal with loved ones in situations that we may find difficult. In a way, Dylan is a cyber companion for moments of distress when human support may be lacking.

 

Conclusion

Thank you for reading this introduction to the topic. The four types of analytics presented here are probably the best known. Each has reached a high level of maturity and gives rise to many applications. However, they are just some of the technologies that exist and are commonly employed worldwide. To continue our grand tour, the next instalment will look at the most immediate evolutions of these four founding pillars.

 

You may also be interested in our Big Data offer.

Let’s have a chat about your projects.
