UNIVERSITA’ DEGLI STUDI DI TRENTO
Department of Information Engineering and Computer Science
Bachelor degree in Computer Science
Thesis
Monitoring of human mobility by
utilising Call Detail Records
Supervisor: Prof. Stefano Bonaccorsi
Co-Supervisor: Roberto Larcher
Student: Ledio Gjoni
Academic year 2014-2015
Abstract
Mobile phones have become very popular over the last 20 years or so, making
communication between people all around the world a trivial matter. Almost
everyone owns one. In developed countries such as Italy, mobile phone
penetration reaches 100% of the population (Blondel et al. [2]). As a
consequence, mobile phone operators gather a massive amount of Call Detail
Records (CDR) for billing purposes. Besides information on how, when and
with whom we communicate, this data also contains geospatial information.
Since mobile phones are portable devices, the mobility traces of their
users are recorded. In this thesis, the spatio-temporal information in CDR
is processed in order to classify users by the transportation mode they
use for traveling between two major Italian cities: Rome and Milan. The
three main transportation modes taken into consideration are railway,
highway and air transport.
Contents
1 Introduction
1.1 Context
1.2 Contributions
1.3 Motivation
1.4 Summary
2 Literature review
2.1 A similar paper
2.2 Differences from this work
2.3 Other research
3 Methods
3.1 Premise
3.2 Data
3.2.1 CDR
3.2.2 Spatial data
3.2.3 Data cleansing
3.3 Manually identifying travel modes
3.3.1 Extracting the full journeys
3.3.2 Extracting the airplane users
3.3.3 Extracting samples of train and car users
3.4 Building the journey trajectories
3.5 The cell probabilities method
3.6 The journey compatibility method
3.7 Summary
4 Results
5 Conclusions
5.1 Limitations
5.2 Future work
6 Acknowledgements
Chapter 1
Introduction
1.1 Context
The Semantics and Knowledge Innovation Lab (SKIL), part of the Joint
Open Labs network of Telecom Italia, is a laboratory focused on Big Data and
Data Mining. The SKIL team explores and develops data-driven solutions
that exploit the enormous amount of data (Call Detail Records) generated
in the network of Telecom Italia.
One of its main initiatives is the CitySensing project, a platform for
the management of large events in urban areas based on social media and
mobile network Big Data streams. Another relevant project is the Mobile
Territorial Lab, in collaboration with Telefonica and MIT. The project
aims at creating an experimental environment to push forward the research
on human behavior analysis and on the interactions of people on the move.
In collaboration with the team, I contributed to SKIL by analysing CDR
within the TIM cellular network in order to identify people travelling
between big cities and infer their transportation mode. Roberto Larcher,
an internal SKIL researcher and my tutor during my internship, helped and
guided me through my work.
1.2 Contributions
In this thesis I use CDR to analyse people’s mobility habits.
• The travel routes between Rome and Milan are taken into consideration.
• The location and spatial coverage of tower cells are used to construct
journey trajectories for each user, based on their mobile phone activity.
Then, using a "compatibility" approach, I classify the journey
trajectories into train, car and plane.
1.3 Motivation
Understanding human mobility is indispensable for solving several social
problems. Asgari et al. [1] review the possible applications of human
mobility studies to these issues. One such application is analysing
mobility data in order to model the traffic flow in road networks and
public transportation networks. Furthermore, it helps to understand the
spreading of infectious diseases, which propagate through the population
because people travel and interact. Another example is marketing and
advertising: knowledge about traffic flows and population movements is
essential for placing advertisements in the right locations.
Telecom Italia in particular wants to exploit the potential of CDR for
developing services outside the telephony scope. Information regarding
human mobility can prove very valuable for companies like Trenitalia and
Alitalia, which can use it for updating their infrastructure and for
marketing strategies.
In the past, this kind of study has been carried out through surveys. A
national census is performed every 10 years, drawing conclusions regarding
millions of people [1]. The number of participants ranges around 1000, with
results strongly depending on the subjectivity of the participants’ answers.
On the other hand, the data collected from a mobile phone network is
cheap and very frequent. Millions of people use their phones every day,
thereby producing massive amounts of CDR. These records contain
information about the time and location of each communication, and are
therefore well suited to a study on human mobility such as the work
presented in this thesis.
1.4 Summary
In chapter 3 I describe the methodology used for carrying out the
analysis. I compare unclassified journey trajectories with manually
identified train and car users in order to make a classification. In
chapter 4 I show the results, while in chapter 5 I discuss the limitations
I encountered and possible future work.
Chapter 2
Literature review
2.1 A similar paper
The premise of this work is very similar to that of Doyle et al. [4]. In
that paper the authors try to classify users traveling between two regions
$R_A$ and $R_B$ (namely Dublin and Cork in Ireland) by transportation
mode. They use what they call the Virtual Cell Path (VCP) approach.
Knowing the location of the towers collecting CDR, they apply Voronoi
tessellation to derive the coverage area of each tower cell.
Figure 2.1: Sample of idealized Voronoi tessellation used to calculate cell
network coverage map.
User event trajectories (those of users traveling between $R_A$ and $R_B$)
are formed by constructing a temporal sequence of the cells generating
events.
To identify the journeys made with the major modes of transport, cells
need to be associated with a route of interest (rail-line or motorway).
Such a collection of cells is a virtual cell path (VCP), defined as a
representation of the path through a mobile telephony network along which a
user may travel while on a specific route. A cell is part of a VCP if its
area of coverage overlaps with a transportation route. The VCP can also be
improved by using cells belonging to manually classified journey trajectories.
Figure 2.2: Virtual Cell Path of rail-line and road.
Given a journey trajectory $J_i$ and a transportation route $T$
(rail-line), $J_i$ is deemed to have taken place on $T$ if the probability
$P(T|J_i)$ is sufficiently high in comparison with $P(R|J_i)$, $R$ being
the road.
Some conditions are necessary for VCP-based travel path identification to
be feasible:
• A minimum number of diverging cells that cover different travel routes.
• A minimum weight of difference in measures of similarity among all
travel routes.
2.2 Differences from this work
I don’t need to apply the Voronoi tessellation, since in my case every cell
is already associated with its area of coverage. The cell coverage data
had already been separated into cells that cover the railroad and cells
that cover the highway. These could represent rudimentary VCPs like those
seen in the previous section, but they are not enough: they would need to
be heavily refined to be acceptable for an approach like that of Doyle
et al. [4], so this aspect is used only partially here. For the approach
followed in this thesis, however, the use of manually classified journey
trajectories is crucial.
The calculation of $P(T|J_i)$, the probability that associates a journey
with a travel route, is only hinted at by the authors of the paper and
never shown. Professor Stefano Bonaccorsi helped me elaborate a method to
calculate these probabilities, shown in chapter 3.
2.3 Other research
Wang et al. [5] infer transportation mode in an urban context. Using CDR,
they try to divide users living in the same city into three groups: those
traveling by car, by public transport and on foot. They group users with
the same origin and destination and compare the travel times for the same
route with those from Google Maps. This study relies solely on travel
times, since in such a setting the difference in traveling speed between
transportation modes is significant enough to allow a classification. I
can use this approach only partially, for identifying people using air
transit. On one hand, plane travel times are much shorter than car and
train ones; on the other hand, the latter two are quite similar and a
different strategy is needed, much like that of Doyle et al. [4].
A simpler approach is used by Dixon et al. [3] in their study of mobile
phone data generated in the Ivory Coast. To determine on which segment
of transport infrastructure a user is traveling, they associate each
antenna with the closest route segment. Thus, a set $A_K$ of antennas is
determined for each route segment $S_K$. The chronological sequence $A$ of
antennas through which a user calls on a given day represents the user’s
journey trajectory.
A user is deemed to travel on $S_K$ if at least one pair of antennas in
$A$ also belongs to $A_K$. A minimum distance is required between the
antennas of such a pair.
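The rule of Dixon et al. can be illustrated with a minimal Python sketch. The antenna identifiers, the planar coordinates and the 10 km threshold are hypothetical values chosen for illustration; the threshold actually used in [3] is not reported here.

```python
from itertools import combinations
from math import hypot

def travels_on_segment(trajectory, segment_antennas, positions, min_distance_km):
    """Return True if at least two antennas of the trajectory belong to the
    segment's antenna set and lie at least min_distance_km apart."""
    # Antennas of the user's chronological sequence A that also belong to A_K.
    shared = [a for a in dict.fromkeys(trajectory) if a in segment_antennas]
    for a, b in combinations(shared, 2):
        (xa, ya), (xb, yb) = positions[a], positions[b]
        if hypot(xa - xb, ya - yb) >= min_distance_km:
            return True
    return False

# Hypothetical planar coordinates, in km.
positions = {"a1": (0.0, 0.0), "a2": (3.0, 4.0), "a3": (30.0, 40.0), "a4": (100.0, 0.0)}
segment_antennas = {"a1", "a2", "a3"}

# a1 and a3 are both on the segment and 50 km apart.
far_pair = travels_on_segment(["a1", "a4", "a3"], segment_antennas, positions, 10.0)
# a1 and a2 are both on the segment but only 5 km apart.
near_pair = travels_on_segment(["a1", "a2", "a4"], segment_antennas, positions, 10.0)
```

The minimum distance keeps a user who merely stays near one point of the route from being counted as a traveler on it.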
Chapter 3
Methods
3.1 Premise
For explanation purposes, I sometimes show samples of tables from the
dataset I have been working with. While the table structures are genuine,
the values have been changed due to privacy concerns.
From this point forward the words user and journey are to be treated
as synonyms. A journey represents a user’s itinerary, and both refer to
the same field user_id in the dataset.
All of the queries have been performed in the MySQL Workbench 6.3
environment.
3.2 Data
3.2.1 CDR
The main data type used in this work is the CDR (Call Detail Record).
Telecom Italia, one of the biggest mobile phone operators in Italy with 34%
of the market share, provided SKIL with the CDR. A CDR is produced by a
receiving tower cell every time a user makes use of mobile services (such
as making a phone call, texting or surfing the internet). Every user
activity is recorded by the tower with the strongest receiving capacity,
which is not always the one nearest to the user. The CDR analysed here
were generated during a 24-hour period in the year 2015.
The CDR dataset has three fields:
• user_id - Every user has their own ID number. To ensure anonymity
this ID changes once a day for every user, although this is not a
problem for this thesis, since I take into account only a 24-hour
period.
• cgi - Cell Global Identity: identifies the cell which recorded the
user’s activity.
• date - The exact time the CDR was generated. This gives us the
moment a moving user is picked up by a receiving tower.
Table 3.1: Sample of the CDR table in the database
date                 cgi                 user_id
16/02/2015 00:19:00  222-01-00102-16951  b1c06d58ff46a91af9312be341bccfa2
16/02/2015 00:07:34  222-01-61212-00576  38eb7933d8065d669bb71e7e171ccf0a
16/02/2015 00:07:02  222-01-61509-03904  ce7f842ebc51356d327c5310b277a01f
3.2.2 Spatial data
SKIL was also given the spatial data for the coverage area of each tower
cell in the form of a shapefile. The shapefile format is a popular
geospatial vector data format for geographic information system (GIS)
software.
Figure 3.1: Cell coverage for Rome’s urban area
There is also a .dbf file, where the attributes of a shapefile are stored
in dBase format. Among other attributes, the CGI field is present, so it
is possible to join the spatial data with the CDR in queries.
Table 3.2: Cell distribution table
Area                 Number of cells
Rome urban area      6332
Milan urban area     4290
Rome airports        55
Milan airports       103
Rome-Milan Highway   3329
Rome-Milan Railway   6064
There is a total of 16763 cells with distinct CGI, but many of them
overlap more than one area, so the sum of the values in the "Number of
cells" column of table 3.2 is greater than 16763.
I processed the different shapefiles in order to create a CGI table for
my database with these fields:
• cgi
– Cell Global Identity: serves to identify the cell which recorded
the user’s activity.
• city
– If the cell belongs to Rome’s territory this field has value ’RM’.
– If the cell belongs to Milan’s territory this field has value ’MI’.
– If the cell belongs to neither of the cities this field has value
’TRIP’.
• type
– If the cell belongs to a city’s urban area this field has value ’1’.
– If the cell belongs to a city’s airport this field has value ’2’.
– If the cell overlaps the railway this field has value ’3’.
– If the cell overlaps the highway this field has value ’4’.
Table 3.3: Sample of the CGI table in the database
cgi city type
222-01-24641-00799 RM 2
222-01-24641-05193 RM 2
222-01-24641-05195 RM 2
3.2.3 Data cleansing
Before I could start my work, the CDR needed some data cleansing.
• Telecom Italia provided SKIL with a day’s worth of records covering
all of Italy. Only the activities generated by users who appeared in
both Rome and Milan during the 24-hour period were kept. This left
only those people who had been in both cities, and thus had traveled
between the two locations.
• For some reason many of the rows were replicated. By eliminating the
duplicates I reduced the number of rows from 6121090 to 188343.
• I had to convert the date field, which came as STRING, into DATETIME.
• There was some noise to be eliminated, such as users generating
activities from both cities within a time window shorter than 50
minutes, even though Rome and Milan are more than 500 km apart.
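The cleansing steps listed above can be sketched in Python as follows. This is only an illustration under assumptions: the function name, the in-memory row format and the toy cell identifiers are hypothetical, and the noise rule is read here as "any Rome event and any Milan event of the same user less than 50 minutes apart".

```python
from datetime import datetime, timedelta

def clean_cdr(rows, cell_city, min_gap=timedelta(minutes=50)):
    """rows: iterable of (date_string, cgi, user_id).
    cell_city: maps a cgi to 'RM', 'MI' or 'TRIP'."""
    # 1. Drop exact duplicates and convert the date strings to datetimes.
    records = {(datetime.strptime(d, "%d/%m/%Y %H:%M:%S"), cgi, uid)
               for d, cgi, uid in rows}
    # 2. Collect, per user, the event times seen in each city.
    seen = {}
    for date, cgi, uid in records:
        city = cell_city.get(cgi)
        if city in ("RM", "MI"):
            seen.setdefault(uid, {"RM": [], "MI": []})[city].append(date)
    # 3. Keep users seen in both cities whose Rome and Milan events are
    #    never closer in time than min_gap (closer pairs are noise).
    keep = set()
    for uid, by_city in seen.items():
        if by_city["RM"] and by_city["MI"]:
            gap = min(abs(r - m) for r in by_city["RM"] for m in by_city["MI"])
            if gap >= min_gap:
                keep.add(uid)
    return sorted(r for r in records if r[2] in keep)

cell_city = {"c_rm": "RM", "c_mi": "MI", "c_trip": "TRIP"}
rows = [
    ("16/02/2015 08:00:00", "c_mi", "u1"),
    ("16/02/2015 08:00:00", "c_mi", "u1"),    # replicated row
    ("16/02/2015 09:30:00", "c_trip", "u1"),
    ("16/02/2015 11:10:00", "c_rm", "u1"),
    ("16/02/2015 08:00:00", "c_mi", "u2"),    # noise: both cities within 20 minutes
    ("16/02/2015 08:20:00", "c_rm", "u2"),
    ("16/02/2015 10:00:00", "c_rm", "u3"),    # never seen in Milan
]
cleaned = clean_cdr(rows, cell_city)
```

Only the three distinct records of user u1 survive in this toy input.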
3.3 Manually identifying travel modes
3.3.1 Extracting the full journeys
There are only three ways to travel between Rome and Milan:
• Airplane
• Train
• Car
Each of these transportation modes has a more or less different travel
time. This is why, first of all, I must find for each user the time it took
him/her to get from one location to the other. In order to do this I
created a table called FULL_JOURNEYS with three fields: user_id,
departure_time, arrival_time.
Since the CDR cover an entire day’s activities, most of them are records
generated before or after the journey took place. If I took into account
all the activities, the results would not be reliable. Therefore I must
associate a departure and an arrival time with each user. Every journey
trajectory will then consist of the CDR generated after the user’s
departure time and before the arrival time.
This is the query I performed to get the departure and arrival times for
people going from Milan to Rome:
-- Departure: last event in Milan; arrival: first event in Rome.
insert into FULL_JOURNEYS
select user_id, departure_time, arrival_time
from
    (select user_id, max(date) as departure_time
     from CDR natural join (select cgi from CGI where city = 'MI') as mi
     group by user_id) as departures_milano
natural join
    (select user_id, min(date) as arrival_time
     from CDR natural join (select cgi from CGI where city = 'RM') as rm
     group by user_id) as arrivals_roma
-- keep only users whose last Milan event precedes their first Rome event
where departure_time < arrival_time;
I consider as a user’s departure time the last moment he/she generated an
event from a cell in the Milan area. The arrival time is the first moment
the user generated an event from a cell in the Rome area. I performed the
same query, with the cities swapped, to get the Rome-Milan journeys.
People who made a round trip (RM-MI-RM or MI-RM-MI) were left out by both
queries, because their last event in the origin city comes after their
return; I added them separately, each represented by the return trip. At
first sight there were 8510 users to classify, but after filtering out
noise (travel times of less than 50 minutes) the number dropped to 8356.
In the end the table looked like this:
Table 3.4: Sample of the FULL_JOURNEYS table
user_id                           departure_time       arrival_time
00124343a94e897362b88a663df81615  16/02/2015 15:16:58  16/02/2015 19:29:04
001247253c166a3c350d989954d2435d  16/02/2015 01:24:15  16/02/2015 07:31:51
0017ff5a12cb684df55c5f28af6b5ed3  16/02/2015 19:38:56  16/02/2015 22:43:16
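The departure and arrival rule used to fill FULL_JOURNEYS can be sketched in Python for a single user. The function name, the helper `at` and the toy events are hypothetical; the logic follows the query (last event in the origin city, first event in the destination city, departure before arrival).

```python
from datetime import datetime

def full_journey(events, origin, destination):
    """events: list of (datetime, city) for one user.
    Returns (departure_time, arrival_time), or None when the user does not
    qualify: departure is the last event in the origin city, arrival is the
    first event in the destination city, and departure must precede arrival."""
    origin_times = [t for t, city in events if city == origin]
    destination_times = [t for t, city in events if city == destination]
    if not origin_times or not destination_times:
        return None
    departure, arrival = max(origin_times), min(destination_times)
    return (departure, arrival) if departure < arrival else None

def at(hhmm):
    # Helper for building toy timestamps on a single day.
    return datetime.strptime("16/02/2015 " + hhmm, "%d/%m/%Y %H:%M")

one_way = [(at("07:00"), "MI"), (at("07:40"), "MI"),
           (at("09:00"), "TRIP"), (at("11:05"), "RM")]
round_trip = one_way + [(at("20:30"), "MI")]   # the user is back in Milan in the evening

one_way_result = full_journey(one_way, "MI", "RM")
round_trip_result = full_journey(round_trip, "MI", "RM")
```

In this toy input the round trip is rejected in both directions, because the last Milan event (20:30) follows the first Rome event (11:05), while the last Rome event follows the first Milan event. This is consistent with round-trip travelers needing separate handling.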
3.3.2 Extracting the airplane users
Before proceeding with the building of the journey trajectories, I must
first extract the airplane users. This is done for two main reasons:
• It takes only 1 hour and 10 minutes to cover the distance between
Rome and Milan by plane (google), while the fastest train, the
Frecciarossa, takes 3 hours (trenitalia.it).
• People are not supposed to make calls or generate any kind of phone
traffic during a flight, so there’s no point in building journey
trajectories for plane users.
So I classify people with a travel time of less than 170 minutes (just
under 3 hours) as airplane users, which yields 2800 users out of 8356:
insert into AIRPLANE_USERS
select user_id
from FULL_JOURNEYS
where timestampdiff(minute,departure_time,arrival_time) < 170
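The same threshold test can be written as a short Python sketch. The function names are hypothetical, and the rounding to whole elapsed minutes is an assumption about how timestampdiff(minute, ...) behaves; the timestamps are the sample values of one row of table 3.4.

```python
from datetime import datetime

AIRPLANE_MAX_MINUTES = 170  # threshold taken from the query above

def travel_minutes(departure, arrival):
    # Whole elapsed minutes between departure and arrival.
    return int((arrival - departure).total_seconds() // 60)

def is_airplane_user(departure, arrival):
    return travel_minutes(departure, arrival) < AIRPLANE_MAX_MINUTES

fmt = "%d/%m/%Y %H:%M:%S"
dep = datetime.strptime("16/02/2015 19:38:56", fmt)
arr = datetime.strptime("16/02/2015 22:43:16", fmt)
minutes = travel_minutes(dep, arr)       # 3 h 4 min 20 s of travel
airplane = is_airplane_user(dep, arr)
```

A journey of 184 minutes falls above the threshold, so this sample row would be left for the train and car analysis.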
3.3.3 Extracting samples of train and car users
With the airplane users out of the way, only those who traveled by land
remain. In this case, discrimination by travel time alone might not be
enough. It is true that a Frecciarossa takes only 3 hours to cover the
500 km distance, but other trains take longer. On the other hand, it takes
roughly 5 to 6 hours to cover the same distance by car. Thus, a journey
with a travel time greater than 5 hours is not immediately classifiable.
In order to proceed with my work I need journey samples for both train
and car users. People with a difference of around 3 hours between
departure and arrival times have surely traveled with a Frecciarossa.
-- Sample train users: travel time of 170-190 minutes and 3 or 4 records
-- generated from cells outside the two cities.
insert into TRAIN_USERS_SAMPLE
select user_id
from
    (select * from CDR natural join
        (select distinct cgi from CGI where city = 'TRIP') as journey_cells) as journey_cdr
natural join
    (select *
     from FULL_JOURNEYS
     where timestampdiff(minute, departure_time, arrival_time) >= 170
       and timestampdiff(minute, departure_time, arrival_time) <= 190) as train_users
where date >= (departure_time - interval 10 minute)
  and date <= (arrival_time + interval 10 minute)
group by user_id
having count(*) < 5 and count(*) >= 3;
Those leaving a city during the night hours have surely traveled by car,
since the earliest train doesn’t leave before 5:42 am.
-- Sample car users: land users who left before the first train of the day.
insert into CAR_USERS_SAMPLE
select user_id
from
    (select * from CDR natural join
        (select distinct cgi from CGI where city = 'TRIP') as journey_cells) as journey_cdr
natural join
    (select *
     from (select * from FULL_JOURNEYS
           where user_id not in (select user_id from AIRPLANE_USERS)) as land_users
     where departure_time <= '2015-03-23 05:42:00') as night_journeys
where date >= (departure_time - interval 10 minute)
  and date <= (arrival_time + interval 10 minute)
group by user_id;
Since at first the train users sample was bigger than the car one, I had
to modify the first query in order to get somewhat balanced samples. In the
end I extracted 296 train users and 225 car users, with a total of 1114 and
835 CDR generated respectively. Having a balance between these numbers
is very important for the outcome of the research, because the method I
plan to use depends heavily on the samples.
Before I can proceed, I need to add three more columns to the
FULL_JOURNEYS table: p_airplane, p_train, p_car. These columns will
contain, for each user, the probability of having traveled by plane, train
or car.
I can already update some of the rows, since I have determined the
transportation mode for some of the users. Now the table looks like this:
Table 3.5: New sample of the FULL_JOURNEYS table
user_id                           ...  p_airplane  p_train  p_car
00101cc717dc49803a390754d5568f9   ...  1           0        0
001247253c166a3c350d989954d2435d  ...  1           0        0
00277c1cbfc3c00ea62a0159e48da6b2  ...  1           0        0
3.4 Building the journey trajectories
According to the data I have, people were not very active while
traveling. Only 315 of them generated any records at all, most of them
with fewer than 5 cells activated per person. The rest of the
communications took place in the city areas. I don’t know the reason
behind this, but with so little to work with I decided to also consider
the records created in the 10 minutes before departure and the 10 minutes
after arrival.
I created a new table TRAIN_CAR_JOURNEY_TRAJECTORIES with three fields:
user_id, cgi, date, which has the same structure as the CDR table.
insert into TRAIN_CAR_JOURNEY_TRAJECTORIES
select distinct user_id, cgi, date
from
CDR
natural join
(select * from FULL_JOURNEYS where user_id not in (select
user_id from AIRPLANE_USERS)) as land_journeys
where date >= (departure_time - interval 10 minute) and date <=
(arrival_time + interval 10 minute)
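For a single land user, the selection performed by this query can be sketched in Python. The function name, the helper `at` and the toy cell identifiers are hypothetical; the 10-minute margin is the one stated above.

```python
from datetime import datetime, timedelta

MARGIN = timedelta(minutes=10)

def journey_trajectory(user_events, departure, arrival, margin=MARGIN):
    """user_events: iterable of (datetime, cgi) for one land user.
    Returns the distinct events that fall between departure - margin and
    arrival + margin (both included), in chronological order."""
    start, end = departure - margin, arrival + margin
    return sorted({(t, cgi) for t, cgi in user_events if start <= t <= end})

def at(hhmm):
    # Helper for building toy timestamps on a single day.
    return datetime.strptime("16/02/2015 " + hhmm, "%d/%m/%Y %H:%M")

events = [(at("06:00"), "cell_a"), (at("07:55"), "cell_b"), (at("09:30"), "cell_c"),
          (at("09:30"), "cell_c"), (at("11:08"), "cell_d"), (at("13:00"), "cell_e")]
trajectory = journey_trajectory(events, departure=at("08:00"), arrival=at("11:00"))
cells = [cgi for _, cgi in trajectory]
```

With a departure at 08:00 and an arrival at 11:00, the events at 07:55 and 11:08 are included thanks to the margin, while those at 06:00 and 13:00 are not.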
3.5 The cell probabilities method
Given the events:
• C - Activating a cell C
• T - Being on a train
• A - Being in a car
• J - Activating the cells that compose the journey
the goal is to get $P(T \mid J)$.
Given $J_i = \{C_1, C_2, \dots, C_N\}$ where $J_i \in \{J_1, J_2, \dots, J_N\}$,
$\forall i$ s.t. $1 \le i \le N$, according to Bayes’ theorem:
\[ P(T \mid J_i) = \frac{P(J_i \mid T)\, P(T)}{P(J_i)} \]
First, a probability $P(C \mid T)$, $P(C \mid A)$ and $P(C)$ must be
associated with each of the cells involved in the train and car journey
samples.
\[ P(C \mid T) = \frac{\text{number of sample train journeys which feature } C}{\text{total number of sample train journeys}} \]
\[ P(C \mid A) = \frac{\text{number of sample car journeys which feature } C}{\text{total number of sample car journeys}} \]
\[ P(C) = \frac{\text{number of sample journeys which feature } C}{\text{total number of sample journeys}} \]
For all the other cells, which don’t appear in any of the samples,
\[ P(C) = P(C \mid T) = P(C \mid A) = 0.00000001 \]
If we assume the events $C_1, C_2, \dots, C_N$ to be independent, then
\[ P(J_i) = P(C_1, C_2, \dots, C_N) = P(C_1)\, P(C_2) \cdots P(C_N) \]
and
\[ P(J_i \mid T) = P(C_1, C_2, \dots, C_N \mid T) = P(C_1 \mid T)\, P(C_2 \mid T) \cdots P(C_N \mid T) \]
I also need to find a value for $P(T)$, which should be near 0.5. Since
\[ P(C) = P(C \mid T)\, P(T) + P(C \mid A)\, P(A) \quad \text{and} \quad P(A) = 1 - P(T) \]
then
\[ P(T) = \frac{P(C) - P(C \mid A)}{P(C \mid T) - P(C \mid A)} \]
I repeat this calculation for each of the cells visited by both train and
car sample journeys, since I need both $P(C \mid A)$ and $P(C \mid T)$, and
then take the average of the results as the final $P(T)$.
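The per-cell probabilities and the averaged estimate of P(T) can be sketched in Python on a toy sample. The function names and the toy journeys are hypothetical, and P(C) is assumed to be computed over the pooled train and car samples; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def cell_probabilities(train_journeys, car_journeys):
    """Each journey is a set of cell ids.
    Returns {cell: (P(C|T), P(C|A), P(C))}, with P(C) over the pooled sample."""
    n_train, n_car = len(train_journeys), len(car_journeys)
    cells = set().union(*train_journeys, *car_journeys)
    probs = {}
    for c in cells:
        in_train = sum(c in j for j in train_journeys)
        in_car = sum(c in j for j in car_journeys)
        probs[c] = (Fraction(in_train, n_train),
                    Fraction(in_car, n_car),
                    Fraction(in_train + in_car, n_train + n_car))
    return probs

def estimate_p_train(probs):
    """Average of (P(C) - P(C|A)) / (P(C|T) - P(C|A)) over the cells visited
    by both samples (cells with P(C|T) = P(C|A) are skipped: division by zero)."""
    estimates = [(p_c - p_c_a) / (p_c_t - p_c_a)
                 for p_c_t, p_c_a, p_c in probs.values()
                 if p_c_t > 0 and p_c_a > 0 and p_c_t != p_c_a]
    return sum(estimates) / len(estimates)

# Toy samples: 3 train journeys and 2 car journeys.
train = [{"c1", "c2"}, {"c1", "c3"}, {"c1", "c2"}]
car = [{"c1", "c4"}, {"c2", "c4"}]
p_train = estimate_p_train(cell_probabilities(train, car))
```

Under the pooled-sample assumption every per-cell estimate equals the share of train journeys in the sample, here 3/5. With the samples of 296 train and 225 car journeys this would give 296/521, roughly 0.57, in line with the expectation that P(T) should be near 0.5.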
Now that we have all we need, $P(J_i \mid T)$, $P(J_i)$ and $P(T)$, we can
calculate $P(T \mid J_i)$:
\[ P(T \mid J_i) = P(T \mid C_1, C_2, \dots, C_N) = \frac{P(C_1 \mid T)\, P(C_2 \mid T) \cdots P(C_N \mid T)\, P(T)}{P(C_1)\, P(C_2) \cdots P(C_N)} \]
As it turns out, there’s something off with the numbers. I often get
probabilities greater than 1, which is of course a problem. The most
plausible cause is the assumption that the events $C_1, C_2, \dots, C_N$
are independent. But if this assumption goes out of the window then the
factorisations of $P(J_i)$ and $P(J_i \mid T)$ are no longer available,
since in general
\[ P(J_i \mid T) = P(C_1, C_2, \dots, C_N \mid T) = \frac{P(C_1, C_2, \dots, C_N, T)}{P(T)} \]
If the events $C_1, C_2, \dots, C_N, T$ are dependent, there’s no way of
directly calculating $P(C_1, C_2, \dots, C_N, T)$ with what I have. It
appears that I need to find another way...
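A small worked example shows how the independence assumption can push the ratio above 1. The Python sketch below uses hypothetical values: two cells activated by every sample train journey and by no car journey, with equally sized samples, so that P(C|T) = 1, P(C) = 1/2 and P(T) = 1/2.

```python
from fractions import Fraction

def naive_posterior(cell_probs, p_train):
    """cell_probs: list of (P(C|T), P(C)) for the cells of one journey.
    Evaluates prod P(C_k|T) * P(T) / prod P(C_k), i.e. the formula obtained
    when the cells are assumed independent both given T and unconditionally."""
    numerator, denominator = p_train, Fraction(1)
    for p_c_given_t, p_c in cell_probs:
        numerator *= p_c_given_t
        denominator *= p_c
    return numerator / denominator

# Hypothetical rail-only cell: P(C|T) = 1 and P(C) = 1/2.
rail_only = (Fraction(1), Fraction(1, 2))
one_cell = naive_posterior([rail_only], Fraction(1, 2))
two_cells = naive_posterior([rail_only, rail_only], Fraction(1, 2))
```

With one such cell the result is exactly 1, but with two it is 2. The two cells are activated by the same travelers, so the true joint probability P(C1, C2) is 1/2, while the product P(C1) P(C2) = 1/4 underestimates it and inflates the ratio.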
3.6 The journey compatibility method
Associating probabilities with the single cells didn’t give usable
results, so I need to work with the journey trajectories as a whole.
Although the size of the samples is only around 250, the cells activated
by these journeys are also activated by most of the users. I take all the
unclassified journey trajectories and compare them with the sample ones,
creating a new table called JOURNEY_COMPATIBILITY composed of three
columns: non_classified_user, sample_user, cells_in_common.
insert into JOURNEY_COMPATIBILITY
select non_classified_users.user_id as non_classified_user,
sample_users.user_id as sample_user, count(*) as
cells_in_common
from
(select * from TRAIN_CAR_JOURNEY_TRAJECTORIES natural join
FULL_JOURNEYS
where p_airplane = 0 and p_train = 0 and p_car = 0) as
non_classified_users
join
(select * from TRAIN_CAR_JOURNEY_TRAJECTORIES natural join
FULL_JOURNEYS
where p_train = 1 or p_car = 1) sample_users
on non_classified_users.cgi = sample_users.cgi
group by non_classified_users.user_id, sample_users.user_id
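The counting done by this query can be sketched in Python. The function names and toy trajectories are hypothetical. Like count(*) over the join on cgi, the sketch counts matching pairs of records, so a cell that appears twice in one trajectory counts twice.

```python
from collections import Counter

def cells_in_common(trajectory_a, trajectory_b):
    """Trajectories are lists of cell ids, one entry per record.
    Counts the pairs of records that share the same cell."""
    count_a, count_b = Counter(trajectory_a), Counter(trajectory_b)
    return sum(count_a[c] * count_b[c] for c in count_a.keys() & count_b.keys())

def journey_compatibility(unclassified, samples):
    """Both arguments map user_id -> trajectory.
    Returns {(non_classified_user, sample_user): cells_in_common}, omitting
    pairs with nothing in common, as an inner join would."""
    table = {}
    for user, trajectory in unclassified.items():
        for sample_user, sample_trajectory in samples.items():
            common = cells_in_common(trajectory, sample_trajectory)
            if common:
                table[(user, sample_user)] = common
    return table

unclassified = {"u1": ["c1", "c2", "c2"], "u2": ["c9"]}
samples = {"t1": ["c1", "c2"], "a1": ["c2", "c5"]}
table = journey_compatibility(unclassified, samples)
```

User u2 shares no cell with any sample journey and therefore does not appear in the resulting table at all.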
The greater the number of cells in common, the greater the compatibility
between a given pair of journeys. Let $J_T = \{J_{T_1}, J_{T_2}, \dots, J_{T_N}\}$
be the sample of train journeys, where $N$ is the size of the sample, and
$J_A = \{J_{A_1}, J_{A_2}, \dots, J_{A_M}\}$ the sample of car journeys,
where $M$ is the size of the sample. Then
\[ P(T \mid J) = \frac{\sum_{i=1}^{N} (\text{cells in common with } J_{T_i})}{\sum_{i=1}^{N} (\text{cells in common with } J_{T_i}) + \sum_{j=1}^{M} (\text{cells in common with } J_{A_j})} \]
Translated into a query the formula becomes:
update FULL_JOURNEYS table_to_update
join
    (select user_id, train_cells_affected / all_cells_affected as p_train
     from
        -- denominator: cells in common with every sample journey
        (select non_classified_user as user_id,
                sum(cells_in_common) as all_cells_affected
         from JOURNEY_COMPATIBILITY
         group by non_classified_user) as compatibility_with_both_samples
     natural join
        -- numerator: cells in common with the train sample journeys only
        (select non_classified_user as user_id,
                sum(cells_in_common) as train_cells_affected
         from JOURNEY_COMPATIBILITY
         where sample_user in (select user_id from FULL_JOURNEYS where p_train = 1)
         group by non_classified_user) as compatibility_with_train_samples
    ) as new_values
on table_to_update.user_id = new_values.user_id
set table_to_update.p_train = new_values.p_train;
Of course, $P(A \mid J)$ is complementary to $P(T \mid J)$, so
$P(A \mid J) = 1 - P(T \mid J)$. If a journey trajectory doesn’t have cells
in common with any of the sample journeys then it is classified as
NON DETERMINED, since the formula results in
\[ P(T \mid J) = \frac{0}{0} \]
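The compatibility formula, including the undetermined case, can be sketched in Python. The function name and the toy counts are hypothetical, and returning None for the 0/0 case is a choice made for this sketch.

```python
def p_train(common_with_train, common_with_car):
    """common_with_train / common_with_car: lists with the number of cells a
    journey has in common with each train / car sample journey.
    Returns P(T|J), or None when the journey shares no cell with any sample
    (the 0/0 case, classified as NON DETERMINED)."""
    train_total, car_total = sum(common_with_train), sum(common_with_car)
    if train_total + car_total == 0:
        return None
    return train_total / (train_total + car_total)

mostly_train = p_train([3, 2, 1], [2])   # 6 cells in common with trains, 2 with cars
undetermined = p_train([0, 0], [0])
p_car = 1 - mostly_train                 # P(A|J) is the complement of P(T|J)
```

Handling the empty case explicitly avoids a division by zero and keeps undetermined journeys distinguishable from journeys with a genuine probability of 0.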
3.7 Summary
After having extracted the airplane users and two balanced samples of
train and car users, I built the journey trajectories. A journey
trajectory consists of all the records a user generated from shortly
before the journey began until shortly after it ended. The goal was to
calculate, for any given journey $J$, the probability of having traveled
by train $P(T \mid J)$ and the probability of having traveled by car
$P(A \mid J)$.
Since a journey trajectory consists of a series of visited cells, I tried
to associate probabilities with the single cells and then obtain
$P(T \mid J)$ from them. This method gave nonsense results because of the
faulty assumption that the events of activating different cells were
independent.
Therefore I changed strategy and took the trajectories into account as a
whole. By comparing the unclassified journeys with the samples, I
associated a compatibility with each pair. The greater the number of cells
two journeys had in common, the greater their compatibility. If a journey
was more compatible with the train sample journeys than with the car ones,
then it was more likely that the user had traveled by train.
Chapter 4
Results
We can say a user has traveled by train if $P(T \mid J) \ge 0.75$. The same
threshold applies to car and airplane. All the other journeys are
considered unknown.
Table 4.1: Results
AIRPLANE  TRAIN  CAR  UNKNOWN  TOTAL
2800      2930   759  1867     8356
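The threshold rule can be sketched in Python as follows. The function name and the example probabilities are hypothetical; the sketch assumes that p_airplane is 1 for users already extracted as airplane users and that P(A|J) = 1 - P(T|J).

```python
THRESHOLD = 0.75

def classify(p_airplane, p_train):
    """p_train is P(T|J), or None when it is undetermined (0/0)."""
    if p_airplane == 1:
        return "AIRPLANE"
    if p_train is None:
        return "UNKNOWN"
    if p_train >= THRESHOLD:
        return "TRAIN"
    if 1 - p_train >= THRESHOLD:   # P(A|J) = 1 - P(T|J)
        return "CAR"
    return "UNKNOWN"

labels = [classify(1, 0), classify(0, 0.8), classify(0, 0.2),
          classify(0, 0.5), classify(0, None)]

# Consistency check on the counts reported in Table 4.1.
table_total = 2800 + 2930 + 759 + 1867
```

The four class counts of Table 4.1 add up to the 8356 users of the FULL_JOURNEYS table.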
Figure 4.1: Distribution of the train probabilities
Of the 1867 unknown users, 1335 had probabilities between 0.25 and 0.75,
not sufficient for a final classification. The other 532 were undetermined:
their journey trajectories had no cells in common with any of the journey
samples, thus resulting in 0/0 probabilities.
Figure 4.1 shows the distribution of $P(T \mid J)$. With a heavy
concentration around the 0.8 value, it reflects the prevalence of the rail
transportation mode over the road.
Chapter 5
Conclusions
5.1 Limitations
As I have mentioned in section 3.4, there is a lack of activity in the
cells that cover the territory between Rome and Milan. With very little
data to cover the actual traveling, I had to work with records created
mostly in the urban areas, before and after the journey had taken place.
Furthermore, the CDR I use in this work have been generated by TIM
customers. Telecom Italia has a 34% share of the market, so these results
do not aim to represent all mobile phone users.
For simplification purposes, this work was conducted on the assumption
that a user utilised only one kind of transportation for the whole trip,
which isn’t always the case. This might be true enough for an airplane
flight, but it is possible for people to switch from car to train at any
given moment during a trip.
5.2 Future work
SKIL would like to extend the research by also taking into account the age
of the users. The analysis could also be applied to other major Italian
cities as origin and destination, thus painting a more general picture
of the transportation preferences for all of Italy.
Chapter 6
Acknowledgements
First of all, I would like to thank the SKIL research team: Cristiana Chitic,
Steven Tait and Roberto Larcher for their help and contribution. Thank you
Cristiana for providing me with the CGI, and thank you Steven for helping
me set up the database and for being always available. My gratitude goes
especially to Roberto, who has guided me during my work, giving advice
and motivating me to find solutions to my problems. I would also like to
thank Stefano Bonaccorsi, who has lent me his expertise everytime I needed
it. And finally a very big thanks to my family and friends, who have always
been there and supported me.
Bibliography
[1] Fereshteh Asgari, Vincent Gauthier, and Monique Becker. A survey on
human mobility and its applications. arXiv preprint arXiv:1307.0814,
2013.
[2] Vincent D. Blondel, Adeline Decuyper, and Gautier Krings. A survey
of results on mobile phone datasets analysis. arXiv preprint
arXiv:1502.03406, 2015.
[3] Matthew F. Dixon, Spencer P. Aiello, Funmi Fapohunda, and William
Goldstein. Detecting mobility patterns in mobile phone data from the
Ivory Coast. NetMob D4D Challenge, 2013.
[4] John Doyle, Peter Hung, Damien Kelly, Seán McLoone, and Ronan
Farrell. Utilising mobile phone billing records for travel mode
discovery. 2011.
[5] Huayong Wang, Francesco Calabrese, Giusy Di Lorenzo, and Carlo
Ratti. Transportation mode inference from anonymized and aggregated
mobile phone call detail records. In Intelligent Transportation Systems
(ITSC), 2010 13th International IEEE Conference on, pages 318–323.
IEEE, 2010.
22

