International call traffic may tell you more than first thought

Wednesday, November 30, 2016

Originally by Pedro de Alarcón and Javier Carro, Data Scientists at LUCA.

This post discusses the value of international phone calls for understanding society.

Telefónica has a wide global network infrastructure that other service providers can use to carry their international call and data traffic. Telefónica Business Solutions sells this service, among others, through wholesale business deals. Our role in this process is to collect call and data traffic in one country (provided by a telecommunications operator) and to transport it and hand it over to another operator in a different country.

International call patterns
Figure 1: We have analysed international call patterns to observe how different countries interact.

We recently had the chance to process and analyse a few months’ worth of data relating to this service. The aim was to understand the data and to discover some interesting facts with a fairly simple analysis. All the information stored and processed by our global Big Data team here at LUCA is anonymised, ensuring a secure working environment.

When dealing with voice calls, each record (event) in the dataset can be summed up as an origin phone number, a destination phone number, a timestamp and the call duration. Further parameters could support a deeper analysis, but we won’t use them on this occasion.
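To make the record structure concrete, here is a minimal sketch of how one such event could be represented. The class name, field names and sample values are our own assumptions for illustration, not the actual schema used at LUCA.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallEvent:
    """One anonymised voice-call event (field names are illustrative)."""
    origin_id: str        # anonymised caller identifier
    origin_country: str   # country code of the caller, e.g. "ES"
    dest_id: str          # anonymised recipient identifier
    dest_country: str     # country code of the recipient, e.g. "IT"
    start: datetime       # timestamp of the call
    duration_s: int       # call duration in seconds


# Two made-up events, purely for illustration.
events = [
    CallEvent("a1", "ES", "b7", "IT", datetime(2016, 8, 24, 9, 30), 185),
    CallEvent("c3", "AR", "b9", "IT", datetime(2016, 8, 24, 10, 5), 420),
]

total_minutes = sum(e.duration_s for e in events) / 60
```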

Whilst the phone numbers we deal with are anonymous, we can access the country code and, in some cases, the region or province of both the person making the call and the person receiving it. This dataset is limited in terms of variety but it is expansive in terms of volume. In structure it is very similar to some popular open data relating to air traffic (see this example). In fact, this resemblance has allowed us to easily reuse some representations of the data previously built with the tool Carto.

Let’s listen to what the data tells us:

Given the basic information in this dataset, the first thing we explored was the evolution of the total number of calls. The following graph shows the daily traffic over the period studied.

Amount of calls managed by Telefónica
Figure 2: The number of calls managed by Telefónica over the period studied. We can see a clear weekly pattern and curious changes from week to week.
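The daily totals and the weekly pattern can be computed with a few lines of code. The sketch below uses synthetic timestamps (100 calls on weekdays, 60 at weekends); the function names and the data are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime, timedelta


def daily_counts(timestamps):
    """Count calls per calendar day from an iterable of datetimes."""
    return Counter(ts.date() for ts in timestamps)


def weekday_profile(counts):
    """Average number of calls for each weekday (0 = Monday ... 6 = Sunday)."""
    totals, days = [0] * 7, [0] * 7
    for day, n in counts.items():
        totals[day.weekday()] += n
        days[day.weekday()] += 1
    return [t / d if d else 0.0 for t, d in zip(totals, days)]


# Synthetic data: 100 calls on weekdays, 60 at weekends, over two weeks.
start = datetime(2016, 8, 1)  # a Monday
timestamps = []
for offset in range(14):
    day = start + timedelta(days=offset)
    n_calls = 60 if day.weekday() >= 5 else 100
    timestamps.extend(day + timedelta(minutes=i) for i in range(n_calls))

counts = daily_counts(timestamps)
profile = weekday_profile(counts)
```

Comparing the entries of `profile` for Monday to Friday with those for Saturday and Sunday makes the weekend dip explicit.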


We find a weekly pattern with dips during the weekend. What is most notable is the variation from week to week, especially where it is very pronounced. The data is starting to show something worth discussing, so what is it trying to tell us?

The answer becomes clearer when we look at individual countries. For example, the following graph shows the daily number of calls to Italy from various countries. The biggest peaks, on the right-hand side of the graph, fall on the 24th of August 2016, when a large earthquake struck Italy.

Calls made to Italy
Figure 3: Number of calls made to Italy from different countries worldwide around the earthquake that took place on the 24th of August in Italy.


The data may therefore let us analyse international events through the ties between countries that it reveals. But let’s hold that thought, because subtler information also starts to appear: why was the response of Ecuador or Argentina more notable than that of other countries? We could try to explain the situation with a few well-reasoned arguments, but we want the data to do the talking.
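One simple way to compare how strongly different countries react is to look at the relative rather than the absolute increase in calls. The sketch below is an illustration under our own assumptions: the baseline (median of the preceding days), the function name and the figures are all made up, and this is not necessarily the method used for the graph above.

```python
from statistics import median


def relative_uplift(series, event_index, baseline_days=14):
    """Ratio of the event-day count to the median of the preceding days."""
    baseline = series[max(0, event_index - baseline_days):event_index]
    if not baseline:
        raise ValueError("need at least one day before the event")
    base = median(baseline)
    return series[event_index] / base if base else float("inf")


# Made-up daily call counts to one destination; the last day is the event.
calls_by_origin = {
    "origin_A": [1000, 1020, 980, 1010, 990, 1500],  # large absolute volume
    "origin_B": [40, 42, 38, 41, 39, 160],           # small volume, big jump
}

uplift = {
    origin: relative_uplift(series, event_index=len(series) - 1)
    for origin, series in calls_by_origin.items()
}
ranked = sorted(uplift, key=uplift.get, reverse=True)
```

With a measure like this, a country with a small absolute volume can still stand out if its traffic multiplies on the day of the event.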

Google gives us a very useful tool to help us interpret what we are finding. The platform is called GDELT; it monitors in real time what is happening in the world and the impact it is having, taking into account the language and where in the world each event happened. This means that we can enrich the information we already have by combining the local and the global. The tool can be used through Google’s BigQuery platform: you can set your own parameters, in which case the results will vary accordingly, or simply stick to the preconfigured analytical tools.

As an example, in June 2016 the United Kingdom voted on whether to remain in the European Union. Can our data reflect this? It certainly can, and not just the immediate effect but also the impact in the following weeks. The following graph shows the number of calls between the United Kingdom and Belgium (headquarters of the European Commission). The first marked date (in red) is the day of the vote (Thursday, June 23). The second marked date, exactly a month later, coincides with the first published economic index that pointed to an economic contraction in the United Kingdom.

Calls between UK and Belgium
Figure 4: Number of calls between the UK and Belgium around the time of the Brexit vote in the UK.

These initial investigations help to create a more formal model. The entities in this model can be the anonymous phone numbers or the geographical regions of origin and destination. They can be treated separately to build a series of indicators (minutes received and minutes made), or they can be paired up, in which case we are talking about a network or graph in which the nodes are the entities and the edges connect those nodes between which there has been traffic.
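Both views of the data can be built from the same records. Here is a minimal sketch, with made-up records and variable names of our own choosing, that accumulates per-entity indicators and a weighted directed graph side by side.

```python
from collections import defaultdict

# Made-up (origin, destination, minutes) records between regions.
records = [
    ("ES", "IT", 12.0),
    ("ES", "DE", 5.5),
    ("AR", "IT", 30.0),
    ("ES", "IT", 8.0),
]

# Indicator view: minutes made and received per entity.
minutes_made = defaultdict(float)
minutes_received = defaultdict(float)
# Graph view: weighted directed edges between entities.
edges = defaultdict(float)

for origin, dest, minutes in records:
    minutes_made[origin] += minutes
    minutes_received[dest] += minutes
    edges[(origin, dest)] += minutes

# The nodes are all entities that appear at either end of an edge.
nodes = sorted({node for edge in edges for node in edge})
```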



Table: Type of data considered and suggested analysis.


The next figure shows an example of such a graph drawn on a map, representing the connections between Spain and the rest of the countries noted. Specifically, the data corresponds to July 7, 2016, and highlights the connections with Islamic countries, as it was the last day of Ramadan. It also shows links to countries that contribute to tourism in the summer. The video below shows the daily changes in the map.

Connection graphic
Figure 5: Graph showing the connections between Spain and the rest of the world on the 7th of July 2016 (the end of Ramadan)

Time Series

The sequential, temporal nature of the data allows us to model it as time series. Time series analysis is a popular statistical discipline, so library functions have been developed in almost all the programming languages regularly used in the world of data analysis (R, Python and Matlab). There are even free tools such as iNZight which allow us to do more basic analyses without writing a single line of code.

As a first step before any analysis, it is important to verify that our data series is stationary (the mean, variance and autocovariance of its values do not depend on time) and, if it is not, to transform it so that it is. A series taken from the call dataset will usually not be stationary, so we need to work on that.
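A common way to move towards stationarity is differencing. The sketch below applies a lag-7 difference to a synthetic daily series and uses a very crude check (comparing the means of the two halves); in practice a formal test such as the augmented Dickey-Fuller test would be used instead. The data and function names are assumptions for illustration.

```python
from statistics import mean


def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def half_means(series):
    """Mean of the first and second half, a crude check for a drifting mean."""
    mid = len(series) // 2
    return mean(series[:mid]), mean(series[mid:])


# Synthetic daily series: linear growth plus a weekly pattern (period 7).
weekly = [0, 5, 5, 5, 5, -10, -10]
series = [100 + 2 * t + weekly[t % 7] for t in range(56)]

before = half_means(series)             # the two means differ: a trend is present
deseasoned = difference(series, lag=7)  # removes the weekly cycle and the linear trend
after = half_means(deseasoned)          # the two means now agree
```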

Put simply, a time series like the ones we have identified in the call traffic data can be divided into three components that can be added together or multiplied to reproduce the original series.

Trend: In our case it depends on the volume of traffic that Telefónica processes with a particular country, i.e., it is mainly linked to the growth or contraction of the business.

Seasonality: There are notable weekly cycles, in which call volume changes significantly at the weekend (the overall daily traffic shows weekend dips).

Remainder: This is the difference between the values of the original series and the values generated by trend and seasonality. This part of the data is the most interesting, as its peaks and troughs can be related to technical issues, international events and public holidays. Ultimately the remainder is where we should look if we want to analyse what happened outside the normal patterns.

R packages such as zoo, xts or timeSeries make it easy to extract these three components.

Graphic
Figure 6: Decomposition into trend, seasonality and remainder of the number of minutes directed to a single country.
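To illustrate what such a decomposition does, here is a minimal pure-Python sketch of a classical additive decomposition with a weekly period. It is a simplified stand-in for what the library functions do, run on synthetic data with one simulated event; the function name and the data are our own assumptions.

```python
from statistics import mean


def decompose_additive(series, period=7):
    """Classical additive decomposition for an odd period.

    Returns (trend, seasonal, remainder); trend and remainder are None at
    the edges, where the centred moving average is undefined.
    """
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    half = period // 2
    n = len(series)
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = mean(series[t - half:t + half + 1])

    # Average the detrended values for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    raw = [mean(b) for b in buckets]
    offset = mean(raw)  # centre the seasonal effects on zero
    seasonal = [raw[t % period] - offset for t in range(n)]

    remainder = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, remainder


# Synthetic daily minutes: growth + weekly cycle + one anomalous day.
weekly = [4, 4, 4, 4, 4, -10, -10]
series = [200 + t + weekly[t % 7] for t in range(42)]
series[20] += 50  # simulated event on day 20

trend, seasonal, remainder = decompose_additive(series)
peak_day = max(
    (t for t in range(len(series)) if remainder[t] is not None),
    key=lambda t: remainder[t],
)
```

The largest value of the remainder falls on the simulated event day, which is the kind of signal we look for in the real data.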



The usual aim of a time series analysis is to build a predictive model (such as exponential smoothing or ARIMA) that allows us to anticipate, for example, how much traffic we will have in the next few days. It can also help us to find the true outliers in the series (values that fall outside the prediction intervals that the model gives with a chosen level of confidence), which is quite simple to do in R.
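The outlier idea can be sketched without any library. In the example below the forecast for each day is simply the value seven days earlier (a seasonal-naive forecast), used here as a stand-in for a fitted exponential smoothing or ARIMA model, and a day is flagged when its forecast error is far from the errors seen so far. The thresholds, names and data are assumptions for illustration.

```python
from statistics import mean, stdev


def flag_outliers(series, period=7, z=3.0, warmup=14):
    """Flag days whose forecast error is more than z stdevs from the norm."""
    clean = list(series)  # copy in which flagged values get replaced
    errors, flagged = [], []
    for t in range(period, len(series)):
        forecast = clean[t - period]  # seasonal-naive forecast
        error = series[t] - forecast
        if len(errors) >= warmup:
            centre, spread = mean(errors), stdev(errors)
            if abs(error - centre) > z * spread:
                flagged.append(t)
                clean[t] = forecast  # keep the anomaly out of later forecasts
                continue
        errors.append(error)  # only unflagged days update the error scale
    return flagged


# Synthetic series: weekly pattern, small deterministic noise, one spike.
weekly = [0, 0, 0, 0, 0, -150, -150]
series = [500 + weekly[t % 7] + (4 * t) % 11 - 5 for t in range(49)]
series[30] += 200  # simulated event

flagged = flag_outliers(series)
```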

Thanks to their reliability for short-term predictions, the Holt-Winters family of exponential smoothing methods has become popular and is available in tools like Tableau or TIBCO Spotfire.

ARIMA models are more complex to apply but in most cases improve on the predictions of the previous method, because they explicitly model the correlation between successive values, giving the model more context from earlier values.

Traffic generated
Figure 7: Prediction of traffic generated with exponential smoothing (Holt-Winters)
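For readers curious about what happens behind a Holt-Winters forecast, here is a minimal sketch of the additive variant with fixed smoothing parameters. Real tools also fit the parameters to the data; the initialisation used here, the parameter values and the synthetic series are our own assumptions.

```python
def holt_winters_additive(series, period, alpha, beta, gamma, horizon):
    """Additive Holt-Winters smoothing with fixed parameters.

    Needs at least two full periods of data. Returns `horizon` forecasts.
    """
    if len(series) < 2 * period:
        raise ValueError("need at least two full periods of data")
    first, second = series[:period], series[period:2 * period]

    # Initial trend from the change between the first two periods.
    trend = (sum(second) - sum(first)) / period ** 2
    # Initial level, shifted from the middle to the end of the first period.
    level = sum(first) / period + (period - 1) / 2 * trend
    # Initial seasonal effects: first period minus its fitted trend line.
    seasonal = [
        first[i] - (level + (i - (period - 1)) * trend) for i in range(period)
    ]

    for t in range(period, len(series)):
        s = seasonal[t % period]
        prev_level = level
        level = alpha * (series[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (series[t] - level) + (1 - gamma) * s

    n = len(series)
    return [
        level + (h + 1) * trend + seasonal[(n + h) % period]
        for h in range(horizon)
    ]


# Synthetic daily traffic: steady growth plus a weekly cycle, eight weeks.
weekly = [4, 4, 4, 4, 4, -10, -10]
series = [100 + 0.5 * t + weekly[t % 7] for t in range(56)]

forecast = holt_winters_additive(
    series, period=7, alpha=0.3, beta=0.1, gamma=0.2, horizon=7
)
```

On this noise-free series the forecast simply continues the growth and the weekly cycle; on real traffic the smoothing parameters control how quickly the model adapts to changes.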

Multi-country social networks:

Information about social networks is an asset used extensively by businesses. The main reason businesses track this information is to segment their customers so that they can communicate with them effectively and increase the chances that they buy their products. However, the main obstacles for businesses trying to exploit these sources are complexity and cost.

Telefónica brings experience and differential knowledge in building social network models, or SNA (Social Network Analysis), from calling patterns. This time we want to understand the international relationships that form these social networks and how telecommunications data can explain their relevance. We have been inspired by social initiatives like Combatting global epidemics with big mobile data and also Behavioural insights for the 2030 agenda.

The next figure gives a first look at the data from this perspective. It only takes into account the volume of calls that were actually answered, and we combine it with common sense and with global socio-economic data.

Amount of calls
Figure 8: The number of calls between pairs of countries, by origin and destination, throughout the month of August 2016. Only the countries generating the largest call volumes are shown, together with the main destinations for each of them.

There are good sources of international socio-economic data with which to contrast and complement what we observe in our data. For example, large amounts of economic data can be found in the economic observatory of MIT, in the Databank of the World Bank, or at Eurostat, and more social (and also economic) data in the United Nations or UNICEF databases. This type of data can be very useful even if its temporal granularity, its spatial granularity or the frequency with which it is published is not ideal.
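Contrasting call data with an external indicator usually means first bringing both to the same granularity and then measuring how they move together. The sketch below aggregates daily minutes to monthly totals and computes a Pearson correlation; every figure in it is made up, and the indicator is a hypothetical one, not taken from any of the sources above.

```python
from collections import defaultdict
from datetime import date
from math import sqrt


def monthly_totals(daily):
    """Sum a {date: value} mapping into {(year, month): total}."""
    totals = defaultdict(float)
    for day, value in daily.items():
        totals[(day.year, day.month)] += value
    return dict(totals)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up daily call minutes and a made-up monthly external indicator.
daily_minutes = {
    date(2016, 6, 1): 100.0, date(2016, 6, 15): 120.0,
    date(2016, 7, 1): 150.0, date(2016, 7, 20): 170.0,
    date(2016, 8, 5): 200.0, date(2016, 8, 24): 260.0,
}
indicator = {(2016, 6): 1.0, (2016, 7): 1.7, (2016, 8): 2.2}

calls = monthly_totals(daily_minutes)
months = sorted(calls)
r = pearson([calls[m] for m in months], [indicator[m] for m in months])
```

A correlation on three points proves nothing, of course; the aim is only to show the mechanics of aligning the two sources.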

Before continuing to explore how countries interact, we need to stop for a moment and think about how people behave when it comes to making a call.

In Figure 9 we have divided the calls that are made daily into four different user groups: those who tend to call during working hours (green), those who call in their free time (blue), those who call during the weekend (red), and finally those who call at night time (purple). Although this first division may seem simple, it allows us to distinguish the users who normally call for personal reasons from those who call because of work-related activity. We can highlight how, for example, the level of calls during the weekend easily exceeds that of calls made from Monday to Friday, and furthermore that once you are in one of these groups you tend to stay there. It’s easy to say this was expected, but it’s the data that has been able to confirm and quantify these statements.

Daily evolution
Figure 9: The daily evolution of calls made by users who normally call during office hours (green), those who call during the afternoon Monday-Friday (blue), weekend callers (red) and night time callers (purple)
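A grouping like this can be sketched by assigning each call to a time slot and each user to the slot where most of their calls fall. The hour boundaries, slot names and sample users below are our own assumptions; the exact definitions behind Figure 9 are not given here.

```python
from collections import Counter
from datetime import datetime


def time_slot(ts):
    """Map a call timestamp to one of four slots (boundaries are assumed)."""
    if ts.hour >= 22 or ts.hour < 7:
        return "night"
    if ts.weekday() >= 5:
        return "weekend"
    if 9 <= ts.hour < 18:
        return "office"
    return "free_time"


def dominant_slot(timestamps):
    """Slot in which a user places most of their calls."""
    return Counter(time_slot(ts) for ts in timestamps).most_common(1)[0][0]


# Made-up call timestamps for three anonymised users.
users = {
    "u1": [datetime(2016, 8, 22, 10, 0), datetime(2016, 8, 23, 15, 30),
           datetime(2016, 8, 27, 12, 0)],
    "u2": [datetime(2016, 8, 27, 11, 0), datetime(2016, 8, 28, 17, 0),
           datetime(2016, 8, 24, 19, 0)],
    "u3": [datetime(2016, 8, 24, 23, 15), datetime(2016, 8, 25, 2, 0)],
}

groups = {user: dominant_slot(calls) for user, calls in users.items()}
```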

Coming back to the inter-country perspective, in the following graph we can see a geographical representation that helps us to better understand the flows in communication. The original data has been simplified and scaled for convenience and ease of reading. We can follow the changes in the data from the end of Ramadan (7/7/16) to the earthquake in Italy (24/8/16).


Video 1: Animated representation of the connections made between Spain and the rest of the listed countries throughout the months of June, July and August 2016. We can see these connections in relation to the end of Ramadan and to the earthquake in Italy.

These representations have allowed us to confirm the personal (social) and professional (economic) links which we mentioned before when referring to socio-economic data. In the next graphic we go a little deeper and focus on a more specific central European zone; the video gives us a closer look at the analysis of the data.

Geographical representation
Figure 10: Geographical representation of communications through a defined zone in Europe.

Recapping what we have learnt from the data analysis: having separated the callers based on their habits, and having precise information about their location, we can start to understand the fundamental relationship between call data and other socio-economic indicators, not forgetting the link with global events, commercial relations between regions and even the simple interaction between people in their local communities.

For example: 

We could analyze communications between eminently industrial zones and compare them with those of the commercial seaports to which they are connected by transport links.
Combining knowledge of the communications between the caller’s country and the destination country with historic immigration patterns has allowed us to give the data a deeper meaning. We could see that this was consistent with the data analysed for Argentina and Italy during the earthquake crisis in Italy. For this reason, we expect to find the same patterns between Spain and Germany. Does this mean that call information could become a true indicator of modern-day immigration? It’s possible that one day we might be able to predict the flow of people through data.

Clearly our digital footprint goes a long way in describing us and our behaviour. 
