ledio_gjoni_tesi

UNIVERSITA’ DEGLI STUDI DI TRENTO
Department of Information Engineering and Computer Science
Bachelor degree in Computer Science
Thesis
Monitoring of human mobility by
utilising Call Detail Records
Supervisor: Prof. Stefano Bonaccorsi
Co-Supervisor: Roberto Larcher
Student: Ledio Gjoni
Academic year 2014-2015

Abstract
Mobile phones have become very popular over the last 20 years or so, making
communication between people all around the world a trivial matter. Almost
everyone owns one. In developed countries, such as Italy, mobile phone
coverage reaches 100% of the population (Blondel et al. [2]). As a
consequence, mobile phone operators gather a massive amount of Call Detail
Records (CDR) for billing purposes. Besides information on how, when and
with whom we communicate, this data also contains geospatial information.
Since mobile phones are portable devices, the mobility traces of their
users are recorded. In this thesis, the spatio-temporal information in CDR
is processed in order to classify users by the transportation mode they use
to travel between two major Italian cities: Rome and Milan. The three main
transportation modes taken into consideration are railway, highway and air
transport.

Contents
1 Introduction
1.1 Context
1.2 Contributions
1.3 Motivation
1.4 Summary
2 Literature review
2.1 A similar paper
2.2 Differences from this work
2.3 Other research
3 Methods
3.1 Premise
3.2 Data
3.2.1 CDR
3.2.2 Spatial data
3.2.3 Data cleansing
3.3 Manually identifying travel modes
3.3.1 Extracting the full journeys
3.3.2 Extracting the airplane users
3.3.3 Extracting samples of train and car users
3.4 Building the journey trajectories
3.5 The cell probabilities method
3.6 The journey compatibility method
3.7 Summary
4 Results
5 Conclusions
5.1 Limitations
5.2 Future work
6 Acknowledgements

Chapter 1
Introduction
1.1 Context
The Semantics and Knowledge Innovation Lab (SKIL), part of the Joint
Open Labs network of Telecom Italia, is a laboratory focused on Big Data and
Data Mining. The SKIL team explores and develops data-driven solutions
that exploit the enormous amount of data (Call Detail Records) generated
in the network of Telecom Italia.
One of its main initiatives is the CitySensing project, a platform for the
management of large events in urban areas based on social media and mobile
network Big Data streams. Another relevant project is the Mobile
Territorial Lab, in collaboration with Telefonica and MIT. The project aims
at creating an experimental environment to push forward research on human
behavior analysis and on the interactions of people on the move.
In collaboration with the team, I contributed to SKIL by analysing CDR
from the TIM cellular network in order to identify people travelling
between big cities and infer their transportation mode. Roberto Larcher, an
internal SKIL researcher and my tutor during my internship, helped and
guided me through my work.
1.2 Contributions
In this thesis I use CDR to analyse people's mobility habits.
• The travel routes between Rome and Milan are taken into consideration.
• The location and spatial coverage of tower cells are used to construct
journey trajectories for each user, based on their mobile phone activity.
Then, using a "compatibility" approach, I classify the journey
trajectories into train, car and plane.

1.3 Motivation
Understanding human mobility is indispensable for solving several social
problems. Asgari et al. [1] review the possible applications of human
mobility studies in addressing these issues. One such application is the
analysis of mobility data for modeling the traffic flow in road networks
and public transportation networks. Furthermore, it helps to understand the
spreading of infectious diseases, which propagate through the population
because people travel and interact. Another example is marketing and
advertising: knowledge about traffic flows and population movements is
essential for placing advertisements in the right locations.
Telecom Italia in particular wants to exploit the potential of CDR to
develop services outside the scope of telephony. Information regarding
human mobility can prove very valuable for companies like Trenitalia and
Alitalia, which could use it to update their infrastructure and to plan
marketing strategies.
In the past, this kind of study was carried out through surveys. A national
census is performed every 10 years, drawing conclusions regarding millions
of people [1]. The number of participants is around 1000, and the results
depend strongly on the subjectivity of the participants' answers.
On the other hand, the data collected from a mobile phone network is
cheap and very frequent. Millions of people use their phones every day,
thereby producing massive amounts of CDR. These records contain information
about the time and location of each communication that occurred, and thus
they are well suited to a study on human mobility such as the work
presented in this thesis.
1.4 Summary
In chapter 3 I describe the methodology used for carrying out the
analysis. I compare unclassified journey trajectories with manually
identified train and car users in order to make a classification. In
chapter 4 I show the results, while in chapter 5 I discuss the limitations
I encountered and possible future work.

Chapter 2
Literature review
2.1 A similar paper
The premise of this work is very similar to the one in Doyle et al. [4]. In
that paper the authors try to classify users traveling between two regions
RA and RB (namely Dublin and Cork in Ireland) by transportation mode. They
use what they call the Virtual Cell Path (VCP) approach. Knowing the
location of the towers collecting CDR, they apply Voronoi tessellation to
derive the coverage area of each tower cell.
Figure 2.1: Sample of idealized Voronoi tessellation used to calculate cell
network coverage map.
User event trajectories (for users traveling between RA and RB) are formed
by constructing a temporal sequence of the cells generating events.
In order to identify the journeys made using the major modes of transport,
cells need to be associated with a certain route of interest (rail line or
motorway).
Such a collection of cells is a virtual cell path (VCP), defined as a
representation of the path through a mobile telephony network along which a
user may travel while on a specific route. A cell is part of a VCP if its
area of coverage overlaps with a transportation route. The VCP can also be
improved by using cells belonging to manually classified journey trajectories.
Figure 2.2: Virtual Cell Path of rail-line and road.
Given a journey trajectory Ji and a transportation route T, Ji is deemed to
have taken place on T (the rail line) if the probability P(T|Ji) is
sufficiently high in comparison with P(R|Ji), with R being the road.
Some conditions are necessary for VCP-based travel path identification to
be feasible:
• A minimum number of diverging cells that cover different travel routes.
• A minimum weight of difference in the measures of similarity among all
travel routes.

2.2 Differences from this work
I don't need to apply the Voronoi tessellation since in my case every cell
is already associated with its area of coverage. The cell coverage data had
already been separated into cells that cover the railroad and cells that
cover the highway. These could serve as rudimentary VCPs, like those
described in the previous section, but they are not enough: they would need
to be heavily refined to be acceptable for an approach like that of Doyle
et al. [4], so this aspect is used only partially here. However, for the
approach followed in this thesis, the use of manually classified journey
trajectories is crucial.
The calculation of P(T|Ji), the probability that associates a journey with
a travel route, is only hinted at by the authors of the paper, but never
shown. Professor Stefano Bonaccorsi helped me develop a method to calculate
these probabilities, shown in chapter 3.
2.3 Other research
Wang et al. [5] infer transportation mode in an urban context. Using CDR,
they try to divide users living in the same city into three groups: those
traveling by car, by public transport and on foot. They group users with
the same origin and destination and compare the travel times for the same
route with those from Google Maps. This study relies solely on travel
times, since in such a case the difference in traveling speed between
transportation modes is significant enough to allow a classification. I can
use this approach only partially, to identify people using air transit. On
the one hand, plane travel times are much shorter than car and train ones;
on the other hand, the latter two are quite similar, so a different
strategy is needed, much like that of Doyle et al. [4].
A simpler approach is used by Dixon et al. [3] in their study of mobile
phone data generated in the Ivory Coast. To determine on which segment
of transport infrastructure a user is traveling, they associate each
antenna with the closest route segment. Thus, a set AK of antennas is
determined for each route segment SK. The chronological sequence A of
antennas through which a user calls on a given day represents the user's
journey trajectory.
A user is deemed to travel on SK if at least one pair of antennas in A also
belongs to AK. A minimum distance is required between the antennas of such
a pair.
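As an illustration, the rule of Dixon et al. can be sketched in a few lines
of Python. This is only a sketch under stated assumptions, not the authors'
code: the antenna identifiers, the coordinates and the 10 km minimum
distance are hypothetical, and the great-circle (haversine) distance is my
own choice for measuring how far apart two antennas are.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def travels_on_segment(trajectory, segment_antennas, positions, min_km):
    """True if at least one pair of antennas in the trajectory belongs to the
    segment's antenna set and the two antennas are at least min_km apart."""
    # Distinct antennas of the trajectory that are associated with the segment.
    hits = [a for a in dict.fromkeys(trajectory) if a in segment_antennas]
    return any(
        haversine_km(positions[a], positions[b]) >= min_km
        for i, a in enumerate(hits)
        for b in hits[i + 1:]
    )


# Hypothetical antennas (A_K) of one route segment S_K and their coordinates.
positions = {"a1": (45.00, 9.00), "a2": (45.00, 9.50), "a3": (45.00, 9.01)}
segment = {"a1", "a2", "a3"}

on_route = travels_on_segment(["a1", "x9", "a2"], segment, positions, min_km=10)
too_close = travels_on_segment(["a1", "a3"], segment, positions, min_km=10)
```

In this example the first trajectory qualifies because a1 and a2 are
roughly 39 km apart, while the second does not because a1 and a3 are less
than 1 km apart.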

Chapter 3
Methods
3.1 Premise
For explanation purposes, I sometimes show samples of tables from the
dataset I have been working with. While the table structures are genuine,
the values have been changed due to privacy concerns.
From this point forward the words user and journey are used as synonyms. A
journey represents a user's itinerary, and both refer to the same field,
user_id, in the dataset.
All of the queries were run in the MySQL Workbench 6.3 environment.
3.2 Data
3.2.1 CDR
The main data type used in this work is CDR (Call Detail Records), which
Telecom Italia provided to SKIL. Telecom Italia is one of the biggest
mobile phone operators in Italy, with 34% of the market share. CDR are
records produced by a receiving tower cell every time a user makes use of
mobile services (such as making a phone call, texting or surfing the
internet). Every user activity is recorded by the tower with the strongest
receiving capacity, which is not always the one nearest to the user. The
CDR analysed here were generated during a 24-hour period in 2015.
The CDR dataset has three fields:
• user_id - Every user has his or her own ID number. To ensure anonymity
this ID changes once a day for every user, although this is not a problem
for this thesis, since I take into account only a 24-hour period.
• cgi - Cell Global Identity: identifies the cell which recorded the
user's activity.
• date - The exact time the CDR was generated. This gives us the
moment a moving user is picked up by a receiving tower.
Table 3.1: Sample of the CDR table in the database
date                 cgi                 user_id
16/02/2015 00:19:00  222-01-00102-16951  b1c06d58ff46a91af9312be341bccfa2
16/02/2015 00:07:34  222-01-61212-00576  38eb7933d8065d669bb71e7e171ccf0a
16/02/2015 00:07:02  222-01-61509-03904  ce7f842ebc51356d327c5310b277a01f
3.2.2 Spatial data
SKIL was also given the spatial data for the coverage area of each tower
cell, in the form of a shapefile. The shapefile format is a popular
geospatial vector data format for geographic information system (GIS)
software.
Figure 3.1: Cell coverage for Rome’s urban area
There is also a .dbf file, where the attributes of the shapefile are stored
in dBase format. Among other attributes, the CGI field is present, so the
spatial data can be joined with the CDR in queries.

Table 3.2: Cell distribution table
Area                 Number of cells
Rome urban area      6332
Milan urban area     4290
Rome airports        55
Milan airports       103
Rome-Milan Highway   3329
Rome-Milan Railway   6064
There is a total of 16763 cells with distinct CGI, but many of them belong
to more than one area, so the sum of the values in the "Number of cells"
column in table 3.2 (20173) is greater than 16763.
I processed the different shapefiles in order to create a CGI table for
my database with these fields:
• cgi
– Cell Global Identity: serves to identify the cell which recorded
the user’s activity.
• city
– If the cell belongs to Rome’s territory this field has value ’RM’.
– If the cell belongs to Milan’s territory this field has value ’MI’.
– If the cell belongs to neither of the cities this field has value
’TRIP’.
• type
– If the cell belongs to a city’s urban area this field has value ’1’.
– If the cell belongs to a city’s airport this field has value ’2’.
– If the cell overlaps the railway this field has value ’3’.
– If the cell overlaps the highway this field has value ’4’.
Table 3.3: Sample of the CGI table in the database
cgi city type
222-01-24641-00799 RM 2
222-01-24641-05193 RM 2
222-01-24641-05195 RM 2

3.2.3 Data cleansing
Before I could start with my work, the CDR needed some data cleansing.
• Telecom Italia provided SKIL with a day's worth of records covering
all of Italy. Only the activities of users who generated records in both
Rome and Milan during the 24-hour period were kept. This left only those
people who had been in both cities, and had therefore traveled between the
two locations.
• For some reason many of the rows were duplicated. By eliminating the
duplicates I reduced the number of rows from 6121090 to 188343.
• I had to convert the date field, which came as STRING, into DATETIME.
• Some noise had to be eliminated, such as users generating activities
from both cities within a time window shorter than 50 minutes, which is
not plausible since Rome and Milan are more than 500 km apart.
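The cleansing steps listed above can be sketched in Python as follows. This
is a minimal illustration, not the procedure actually used (the work was
done in SQL): the function name, the cell identifiers and the sample rows
are hypothetical, and the noise rule is my reading of the last bullet, i.e.
a user is dropped when any Rome event and any Milan event are less than 50
minutes apart.

```python
from datetime import datetime, timedelta


def clean_cdr(rows, cell_city, min_gap=timedelta(minutes=50)):
    """rows: iterable of (date_string, cgi, user_id);
    cell_city: dict mapping a cgi to 'RM', 'MI' or 'TRIP'."""
    # Steps 2 and 3: drop exact duplicates and parse the date strings.
    parsed = {
        (datetime.strptime(d, "%d/%m/%Y %H:%M:%S"), cgi, uid)
        for d, cgi, uid in rows
    }

    # Collect, per user, the timestamps recorded in each of the two cities.
    seen = {}
    for when, cgi, uid in parsed:
        city = cell_city.get(cgi)
        if city in ("RM", "MI"):
            seen.setdefault(uid, {"RM": [], "MI": []})[city].append(when)

    # Steps 1 and 4: keep users seen in both cities whose closest
    # Rome/Milan pair of events is at least min_gap apart.
    keep = set()
    for uid, by_city in seen.items():
        if by_city["RM"] and by_city["MI"]:
            closest = min(abs(r - m) for r in by_city["RM"] for m in by_city["MI"])
            if closest >= min_gap:
                keep.add(uid)

    return sorted(row for row in parsed if row[2] in keep)


# Hypothetical data: u1 travels, u2 is only seen in Rome, u3 is noise.
cell_city = {"c_mi": "MI", "c_rm": "RM"}
rows = [
    ("16/02/2015 08:00:00", "c_mi", "u1"),
    ("16/02/2015 08:00:00", "c_mi", "u1"),  # duplicated row
    ("16/02/2015 12:00:00", "c_rm", "u1"),
    ("16/02/2015 09:00:00", "c_rm", "u2"),
    ("16/02/2015 08:00:00", "c_mi", "u3"),
    ("16/02/2015 08:20:00", "c_rm", "u3"),
]
cleaned = clean_cdr(rows, cell_city)
```

Only the two distinct records of u1 survive in this example.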
3.3 Manually identifying travel modes
3.3.1 Extracting the full journeys
There are only three ways to travel between Rome and Milan:
• Airplane
• Train
• Car
Each of these transportation modes has a more or less different travel
time. This is why, first of all, I must find for each user the time it took
him/her to get from one location to the other. In order to do this I
created a table called FULL_JOURNEYS with three fields: user_id,
departure_time, arrival_time.
Since the CDR cover an entire day's activities, most of them are records
generated before or after the journey took place. If I took into account
all the activities the results would not be reliable. Therefore I must
associate a departure and an arrival time to each user. Every journey
trajectory will then consist of the CDR generated after the user's
departure time and before the arrival time.

This is the query I performed to get the departure and arrival times for
people going from Milan to Rome:
-- Journeys from Milan to Rome: one row per user with both times.
insert into FULL_JOURNEYS
select user_id, departure_time, arrival_time
from
  -- departure: last event recorded by a cell in the Milan area
  (select user_id, max(date) as departure_time
   from CDR natural join (select cgi from CGI where city = 'MI') as mi
   group by user_id) as departures_milano
natural join
  -- arrival: first event recorded by a cell in the Rome area
  (select user_id, min(date) as arrival_time
   from CDR natural join (select cgi from CGI where city = 'RM') as rm
   group by user_id) as arrivals_roma
where departure_time < arrival_time;
I consider as a user's departure time the last moment he/she generated an
event from a cell in the Milan area. The arrival time is the first moment
the user generated an event from a cell in the Rome area. I ran the
analogous query to get the Rome-Milan journeys. People who made a round
trip (RM-MI-RM or MI-RM-MI) were left out by both queries, because for
such users the condition departure_time < arrival_time fails in both
directions, so I added them back, using the return trip to represent them.
At first glance there appeared to be a total of 8510 users to classify, but
after filtering out noise (travel time less than 50 minutes) the number
dropped to 8356. In the end the table looked like this:
Table 3.4: Sample of the FULL_JOURNEYS table
user_id                           departure_time       arrival_time
00124343a94e897362b88a663df81615  16/02/2015 15:16:58  16/02/2015 19:29:04
001247253c166a3c350d989954d2435d  16/02/2015 01:24:15  16/02/2015 07:31:51
0017ff5a12cb684df55c5f28af6b5ed3  16/02/2015 19:38:56  16/02/2015 22:43:16
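To make the departure and arrival rule concrete, here is a small Python
sketch of the same logic. It is an illustration only (the thesis uses SQL):
the function name, the event tuples and the two example users are
hypothetical. It also shows why a round-trip user is returned by neither
direction of the extraction.

```python
from datetime import datetime


def extract_journeys(events, origin, destination):
    """events: iterable of (user_id, city, timestamp).
    Returns {user_id: (departure_time, arrival_time)} for origin -> destination."""
    last_in_origin, first_in_destination = {}, {}
    for uid, city, when in events:
        if city == origin:
            # departure: last event recorded in the origin city
            last_in_origin[uid] = max(when, last_in_origin.get(uid, when))
        elif city == destination:
            # arrival: first event recorded in the destination city
            first_in_destination[uid] = min(when, first_in_destination.get(uid, when))
    return {
        uid: (dep, first_in_destination[uid])
        for uid, dep in last_in_origin.items()
        if uid in first_in_destination and dep < first_in_destination[uid]
    }


def t(hhmm):
    """Helper for the example: a timestamp on a fixed, made-up day."""
    return datetime.strptime("16/02/2015 " + hhmm, "%d/%m/%Y %H:%M")


events = [
    # u1: one-way trip Milan -> Rome
    ("u1", "MI", t("08:00")), ("u1", "MI", t("09:00")),
    ("u1", "RM", t("12:30")), ("u1", "RM", t("13:00")),
    # u2: round trip Milan -> Rome -> Milan
    ("u2", "MI", t("07:00")), ("u2", "RM", t("10:30")), ("u2", "MI", t("18:00")),
]
mi_to_rm = extract_journeys(events, "MI", "RM")
rm_to_mi = extract_journeys(events, "RM", "MI")
```

Here u1 is found only in the Milan-Rome direction, while u2 appears in
neither result and would have to be handled separately.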
3.3.2 Extracting the airplane users
Before proceeding with the building of the journey trajectories, I must
first extract the airplane users. This is done for two main reasons:
• It takes only 1 hour and 10 minutes to cover the distance between
Rome and Milan by plane (google), while the fastest train, the
Frecciarossa, takes 3 hours (trenitalia.it).
• People are not supposed to make calls or generate any kind of phone
traffic during a flight, so there is no point in building journey
trajectories for plane users.
So I classify people with a travel time of less than 3 hours (in practice,
under 170 minutes) as airplane users, which yields 2800 users out of 8356:
-- Journeys shorter than 170 minutes are attributed to air travel.
insert into AIRPLANE_USERS
select user_id
from FULL_JOURNEYS
where timestampdiff(minute, departure_time, arrival_time) < 170;
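The threshold test can be checked by hand on a couple of journeys. The
Python sketch below mirrors the query under the assumption that travel time
is measured in whole minutes; the function names and the two example
journeys are hypothetical.

```python
from datetime import datetime

AIRPLANE_MAX_MINUTES = 170  # threshold used in the query


def travel_minutes(departure, arrival):
    """Whole minutes elapsed between departure and arrival (arrival >= departure)."""
    return int((arrival - departure).total_seconds() // 60)


def is_airplane(departure, arrival):
    return travel_minutes(departure, arrival) < AIRPLANE_MAX_MINUTES


fmt = "%d/%m/%Y %H:%M:%S"
# Made-up journeys: one of 75 minutes, one of 185 minutes.
short_trip = (datetime.strptime("16/02/2015 09:00:00", fmt),
              datetime.strptime("16/02/2015 10:15:30", fmt))
long_trip = (datetime.strptime("16/02/2015 09:00:00", fmt),
             datetime.strptime("16/02/2015 12:05:00", fmt))

short_is_flight = is_airplane(*short_trip)
long_is_flight = is_airplane(*long_trip)
```

The first journey falls under the threshold and would be labelled as air
travel; the second, at about 3 hours, would be left for the train and car
analysis.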
3.3.3 Extracting samples of train and car users
With the airplane users out of the way, only those who traveled by land
remain. In this case, discrimination by travel time alone might not be
enough. It is true that a Frecciarossa takes only 3 hours to cover the
500 km distance, but this is not so for the other trains. On the other
hand, it takes roughly 5 to 6 hours to cover the same distance by car.
Thus, a journey with a travel time greater than 5 hours is not immediately
classifiable.
In order to proceed with my work I need journey samples for both train
and car users. People with a difference of around 3 hours between the
departure and arrival times have almost certainly traveled with a
Frecciarossa.
insert into TRAIN_USERS_SAMPLE
select user_id
from
  -- records generated by cells outside the two cities
  (select *
   from CDR natural join
        (select distinct cgi from CGI where city = 'TRIP') as journey_cells
  ) as journey_cdr
natural join
  -- journeys lasting about 3 hours (170 to 190 minutes)
  (select *
   from FULL_JOURNEYS
   where timestampdiff(minute, departure_time, arrival_time) >= 170
     and timestampdiff(minute, departure_time, arrival_time) <= 190
  ) as train_users
where date >= (departure_time - interval 10 minute)
  and date <= (arrival_time + interval 10 minute)
group by user_id
-- keep users with 3 or 4 such records
having count(*) >= 3 and count(*) < 5;

Those leaving a city during the night hours have surely traveled by car,
since the earliest train doesn’t leave before 5:42 am.
insert into CAR_USERS_SAMPLE
select user_id
from
  -- records generated by cells outside the two cities
  (select *
   from CDR natural join
        (select distinct cgi from CGI where city = 'TRIP') as journey_cells
  ) as journey_cdr
natural join
  -- land journeys that departed before the first train of the day
  (select *
   from (select *
         from FULL_JOURNEYS
         where user_id not in (select user_id from AIRPLANE_USERS)
        ) as land_users
   where departure_time <= '2015-03-23 05:42:00'
  ) as night_journeys
group by user_id;
Since at first the train users sample was bigger than the car one, I had to
modify the first query in order to get somewhat balanced samples. In the
end I extracted 296 train users and 225 car users, with a total of 1114 and
835 CDR generated respectively. Having a balance between these numbers
is very important for the outcome of the research, because the method I
plan to use depends heavily on the samples.
Before I can proceed, I need to add three more columns to the
FULL_JOURNEYS table: p_airplane, p_train, p_car. These columns will
contain, for each user, the probability of having traveled by plane, train
or car.
I can already update some of the rows, since I have determined the
transportation mode for some of the users. Now the table looks like this:
Table 3.5: New sample of the FULL_JOURNEYS table
user_id                           ...  p_airplane  p_train  p_car
00101cc717dc49803a390754d5568f9   ...  1           0        0
001247253c166a3c350d989954d2435d  ...  1           0        0
00277c1cbfc3c00ea62a0159e48da6b2  ...  1           0        0

3.4 Building the journey trajectories
For some reason, according to the data I have, people were not very active
while traveling. Only 315 of them generated any records at all during the
journey, most of them with fewer than 5 cells activated each. The rest of
the communications took place in the city areas. I don't know the reason
behind this, but with so little to work with I decided to also consider the
records created 10 minutes before departure and 10 minutes after arrival.
I created a new table, TRAIN_CAR_JOURNEY_TRAJECTORIES, with three fields:
user_id, cgi, date. It has the same structure as the CDR table.
-- Keeps every record of users who did not travel by plane (joined on user_id).
insert into TRAIN_CAR_JOURNEY_TRAJECTORIES
select distinct user_id, cgi, date
from CDR
natural join
  (select *
   from FULL_JOURNEYS
   where user_id not in (select user_id from AIRPLANE_USERS)
  ) as land_journeys;
3.5 The cell probabilities method
Given the events:
• C - Activating a cell C
• T - Being on a train
• A - Being in a car
• J - Activating the cells that compose the journey
the goal is to get P(T|J).
Given $J_i = \{C_1, C_2, \dots, C_N\}$, where $J_i \in \{J_1, J_2, \dots, J_N\}$
for all $i$ such that $1 \le i \le N$, according to Bayes' theorem:
$$P(T \mid J_i) = \frac{P(J_i \mid T)\,P(T)}{P(J_i)}$$

First there is the need to associate the probabilities P(C|T), P(C|A) and
P(C) to each of the cells involved in the train and car journey samples.
$$P(C \mid T) = \frac{\text{number of sample train journeys which feature } C}{\text{total number of sample train journeys}}$$
$$P(C \mid A) = \frac{\text{number of sample car journeys which feature } C}{\text{total number of sample car journeys}}$$
$$P(C) = \frac{\text{number of sample journeys which feature } C}{\text{total number of sample journeys}}$$
For all the other cells that don't appear in any of the samples
$$P(C) = P(C \mid T) = P(C \mid A) = 0.00000001$$
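The three ratios defined above amount to simple counting, which the
following Python sketch makes explicit. It is an illustration with
hypothetical cell names and toy samples, where each journey is represented
as the set of cells it features; it is not the code used in the thesis.

```python
SMOOTHING = 0.00000001  # value assigned to cells that appear in no sample


def cell_probabilities(train_journeys, car_journeys):
    """Each journey is a set of cell IDs; both samples must be non-empty.
    Returns {cell: (P(C|T), P(C|A), P(C))} for every cell seen in a sample."""
    all_journeys = train_journeys + car_journeys
    cells = set().union(*all_journeys)

    def share(cell, journeys):
        # fraction of the given journeys that feature the cell
        return sum(cell in j for j in journeys) / len(journeys)

    return {
        c: (share(c, train_journeys), share(c, car_journeys), share(c, all_journeys))
        for c in cells
    }


# Toy samples: two train journeys and two car journeys.
train_sample = [{"a", "b"}, {"a"}]
car_sample = [{"b"}, {"c"}]
probs = cell_probabilities(train_sample, car_sample)

# A cell that appears in no sample falls back to the smoothing value.
unseen = probs.get("z", (SMOOTHING, SMOOTHING, SMOOTHING))
```

For instance, cell "a" is featured by both train journeys and by no car
journey, so it gets P(C|T) = 1, P(C|A) = 0 and P(C) = 0.5.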
If we assume the events C1, C2, ..., CN to be independent, then
$$P(J_i) = P(C_1, C_2, \dots, C_N) = P(C_1)\,P(C_2) \cdots P(C_N)$$
and
$$P(J_i \mid T) = P(C_1, C_2, \dots, C_N \mid T) = P(C_1 \mid T)\,P(C_2 \mid T) \cdots P(C_N \mid T)$$
I also need to find a value for P(T), which should be near 0.5.
Since
$$P(C) = P(C \mid T)\,P(T) + P(C \mid A)\,P(A) \quad \text{and} \quad P(A) = 1 - P(T)$$
then
$$P(T) = \frac{P(C) - P(C \mid A)}{P(C \mid T) - P(C \mid A)}$$
I repeat this calculation for each of the cells visited by both train and
car sample journeys, since I need both P(C|A) and P(C|T), and then take the
average of the results as the final P(T).
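For completeness, the step from the law of total probability to the
expression for P(T) can be written out as below. The second part is a side
remark resting on an assumption: if P(C), P(C|T) and P(C|A) are all
estimated from the same pooled samples with the ratios defined earlier, the
formula gives the same value for every cell. The symbols $k_T$, $k_A$
(number of sample train and car journeys featuring C) and $n_T$, $n_A$
(sample sizes) are notation introduced only for this sketch.

```latex
\begin{align*}
P(C) &= P(C \mid T)\,P(T) + P(C \mid A)\,\bigl(1 - P(T)\bigr) \\
P(C) - P(C \mid A) &= P(T)\,\bigl(P(C \mid T) - P(C \mid A)\bigr) \\
P(T) &= \frac{P(C) - P(C \mid A)}{P(C \mid T) - P(C \mid A)}
  \qquad \text{provided } P(C \mid T) \neq P(C \mid A)
\end{align*}

% Assuming P(C|T) = k_T / n_T, P(C|A) = k_A / n_A, P(C) = (k_T + k_A) / (n_T + n_A):
\begin{align*}
P(C) - P(C \mid A) &= \frac{n_A k_T - n_T k_A}{n_A\,(n_T + n_A)} \\
P(C \mid T) - P(C \mid A) &= \frac{n_A k_T - n_T k_A}{n_T\,n_A} \\
P(T) &= \frac{n_T}{n_T + n_A}
\end{align*}
```

Under that assumption, the sample sizes reported in section 3.3.3 (296
train and 225 car journeys) would give P(T) = 296/521 ≈ 0.57, which is
consistent with the expectation that P(T) should be near 0.5.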

Now that we have all we need, P(Ji|T), P(Ji) and P(T), we can calculate
P(T|Ji):
$$P(T \mid J_i) = P(T \mid C_1, C_2, \dots, C_N) = \frac{P(C_1 \mid T)\,P(C_2 \mid T) \cdots P(C_N \mid T)\,P(T)}{P(C_1)\,P(C_2) \cdots P(C_N)}$$
As it turns out, there is something off with the numbers. I often get
probabilities greater than 1, which is of course a problem. The most likely
explanation is that the issue lies in the assumption that the events
C1, C2, ..., CN are independent. But if this assumption is dropped, then
P(Ji|T) can no longer be written as a product, since
$$P(J_i \mid T) = P(C_1, C_2, \dots, C_N \mid T) = \frac{P(C_1, C_2, \dots, C_N, T)}{P(T)}$$
If the events C1, C2, ..., CN, T are dependent, there is no way of directly
calculating P(C1, C2, ..., CN, T) with what I have, so another approach is
needed.
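A small numerical example shows how the product formula can exceed 1. The
Python sketch below uses made-up probabilities (not values from the
dataset) that are consistent cell by cell with the law of total
probability: P(C|T) = 0.6, P(C|A) = 0.2 and P(T) = 0.5 give P(C) = 0.4.

```python
from math import prod


def naive_posterior(cells, p_cell_given_train, p_cell, p_train):
    """P(T|J) under the independence assumption:
    product of P(C|T) times P(T), divided by the product of P(C)."""
    likelihood = prod(p_cell_given_train[c] for c in cells)
    evidence = prod(p_cell[c] for c in cells)
    return likelihood * p_train / evidence


# Hypothetical values for two cells.
p_cell_given_train = {"c1": 0.6, "c2": 0.6}
p_cell = {"c1": 0.4, "c2": 0.4}
p_train = 0.5

one_cell = naive_posterior(["c1"], p_cell_given_train, p_cell, p_train)
two_cells = naive_posterior(["c1", "c2"], p_cell_given_train, p_cell, p_train)
```

With a single cell the result is a valid probability (0.75), but with two
cells it becomes 0.36 * 0.5 / 0.16 = 1.125. The cells cannot in general be
independent both unconditionally and given T, so the numerator and the
denominator are not consistent with each other. A common alternative, not
used in this thesis, is to assume independence only given the travel mode
and to compute the denominator as P(J|T)P(T) + P(J|A)P(A), which always
yields a value between 0 and 1.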
3.6 The journey compatibility method
Associating probabilities to the single cells didn't give usable results,
so I need to work with the journey trajectories as a whole. Although the
size of the samples is only around 250, the cells activated by the sample
journeys are also activated by most of the users. I take all the
unclassified journey trajectories and compare them with the sample ones,
creating a new table called JOURNEY_COMPATIBILITY composed of three
columns: non_classified_user, sample_user, cells_in_common.
insert into JOURNEY_COMPATIBILITY
select non_classified_users.user_id as non_classified_user,
       sample_users.user_id as sample_user,
       count(*) as cells_in_common
from
  -- trajectories of users not yet assigned to any mode
  (select *
   from TRAIN_CAR_JOURNEY_TRAJECTORIES natural join FULL_JOURNEYS
   where p_airplane = 0 and p_train = 0 and p_car = 0
  ) as non_classified_users
join
  -- trajectories of the manually identified train and car samples
  (select *
   from TRAIN_CAR_JOURNEY_TRAJECTORIES natural join FULL_JOURNEYS
   where p_train = 1 or p_car = 1
  ) as sample_users
on non_classified_users.cgi = sample_users.cgi
group by non_classified_users.user_id, sample_users.user_id;

The greater the number of cells in common, the greater the compatibility
between a given pair of journeys. Let the sample of train journeys be
$J_T = \{J_{T_1}, J_{T_2}, \dots, J_{T_N}\}$, where N is the size of the sample,
and the sample of car journeys be $J_A = \{J_{A_1}, J_{A_2}, \dots, J_{A_M}\}$,
where M is the size of the sample. Then
$$P(T \mid J) = \frac{\sum_{i=1}^{N} \text{cells in common with } J_{T_i}}{\sum_{i=1}^{N} \text{cells in common with } J_{T_i} + \sum_{j=1}^{M} \text{cells in common with } J_{A_j}}$$
Translated into a query the formula becomes:
update FULL_JOURNEYS table_to_update
join
  (select user_id, train_cells_affected / all_cells_affected as p_train
   from
     -- denominator: cells in common with all sample journeys
     (select non_classified_user as user_id,
             sum(cells_in_common) as all_cells_affected
      from JOURNEY_COMPATIBILITY
      group by non_classified_user
     ) as compatibility_with_both_samples
   natural join
     -- numerator: cells in common with the train sample journeys only
     (select non_classified_user as user_id,
             sum(cells_in_common) as train_cells_affected
      from JOURNEY_COMPATIBILITY
      where sample_user in (select user_id from FULL_JOURNEYS where p_train = 1)
      group by non_classified_user
     ) as compatibility_with_train_samples
  ) as new_values
on table_to_update.user_id = new_values.user_id
set table_to_update.p_train = new_values.p_train;
Of course, P(A|J) is complementary to P(T|J), so P(A|J) = 1 − P(T|J).
If a journey trajectory doesn't have cells in common with any of the sample
journeys then it is classified as NON DETERMINED, since the formula
results in
$$P(T \mid J) = \frac{0}{0}$$
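The compatibility formula, including the undetermined 0/0 case, can be
sketched in Python as follows. This is an illustration rather than the
thesis implementation: journeys are modelled as sets of distinct cell IDs
(an assumption of this sketch), and the cell names and samples are
hypothetical.

```python
def train_probability(journey, train_samples, car_samples):
    """journey and every sample journey are sets of cell IDs.
    Returns P(T|J), or None when the journey shares no cell with any sample."""
    # Sum of the cells in common with each train / car sample journey.
    train_score = sum(len(journey & sample) for sample in train_samples)
    car_score = sum(len(journey & sample) for sample in car_samples)
    total = train_score + car_score
    if total == 0:
        return None  # the 0/0 case: NON DETERMINED
    return train_score / total


# Toy samples with made-up cell names.
train_samples = [{"a", "b"}, {"a"}]
car_samples = [{"c"}, {"x"}]

p_train = train_probability({"a", "b", "c"}, train_samples, car_samples)
p_car = 1 - p_train                      # P(A|J) = 1 - P(T|J)
undetermined = train_probability({"z"}, train_samples, car_samples)
```

In this example the journey shares 2 + 1 = 3 cells with the train sample
and 1 cell with the car sample, so P(T|J) = 3/4.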

3.7 Summary
After having extracted the airplane users and two balanced samples of train
and car users, I built the journey trajectories. A journey trajectory
consists of all the records a user generated right before his/her journey
began and right after it ended. The goal was to calculate the probability
of having traveled by train, P(T|J), and the probability of having
traveled by car, P(A|J), for any given journey J.
Since a journey trajectory consists of a series of visited cells, I tried
to associate probabilities to the single cells, and then obtain P(T|J)
based on them. This method gave nonsensical results because of the faulty
assumption that the events of activating different cells were independent.
Therefore I changed strategy and took into account the trajectories as a
whole. By comparing the unclassified journeys with the samples, I
associated a compatibility to each pair. The greater the number of cells
two journeys had in common, the greater their compatibility. If a journey
was more compatible with the train sample journeys than with the car ones,
then it was more likely that the user had traveled by train.

Chapter 4
Results
We can say a user has traveled by train if P(T|J) ≥ 0.75. The same
threshold applies to car and airplane. All the remaining users are
considered unknown.
Table 4.1: Results
AIRPLANE  TRAIN  CAR  UNKNOWN  TOTAL
2800      2930   759  1867     8356
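The decision rule behind table 4.1 can be summarised with the short Python
sketch below. The function and constant names are hypothetical and the
sketch is only meant to illustrate the 0.75 threshold; the last lines
simply check that the counts in the table add up to the total.

```python
THRESHOLD = 0.75


def classify(p_airplane, p_train):
    """p_train is P(T|J), or None when the journey shares no cell with any sample."""
    if p_airplane == 1:
        return "AIRPLANE"
    if p_train is None:
        return "UNKNOWN"
    if p_train >= THRESHOLD:
        return "TRAIN"
    if 1 - p_train >= THRESHOLD:  # P(A|J) = 1 - P(T|J)
        return "CAR"
    return "UNKNOWN"


# Counts reported in table 4.1.
counts = {"AIRPLANE": 2800, "TRAIN": 2930, "CAR": 759, "UNKNOWN": 1867}
total = sum(counts.values())
```

For example, a journey with P(T|J) = 0.8 is labelled TRAIN, one with
P(T|J) = 0.2 is labelled CAR, and one with P(T|J) = 0.5 stays UNKNOWN.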
Figure 4.1: Distribution of the train probabilities

Of the 1867 unknown users, 1335 had probabilities between 0.25 and 0.75,
not sufficient for a final classification. The other 532 were undetermined:
their journey trajectories had no cells in common with any of the journey
samples, thus resulting in 0/0 probabilities.
Figure 4.1 shows the distribution of P(T|J). With a heavy concentration
around the 0.8 value, it reflects the prevalence of rail transportation mode
over the road.

Chapter 5
Conclusions
5.1 Limitations
As I have mentioned in section 3.4, there is a lack of activity in the
cells that cover the land between Rome and Milan. With very little data
covering the actual traveling, I had to work with records created mostly
in the urban areas, before and after the journey had taken place.
Furthermore, the CDR I use in this work were generated by TIM customers.
Telecom Italia has a 34% share of the market, so these results do not aim
to represent all mobile phone users.
For simplification purposes, this work was conducted on the assumption
that a user utilised only one kind of transportation for the whole trip,
which isn't always the case. This might be true enough for an airplane
flight, but it is possible for people to switch from car to train at any
given moment during a trip.
5.2 Future work
SKIL would like to extend the research by also taking into account the age
of the users. The analysis could also be applied to other major Italian
cities as origin and destination, thus painting a more general picture
of the transportation preferences across Italy.

Chapter 6
Acknowledgements
First of all, I would like to thank the SKIL research team: Cristiana Chitic,
Steven Tait and Roberto Larcher for their help and contribution. Thank you
Cristiana for providing me with the CGI, and thank you Steven for helping
me set up the database and for always being available. My gratitude goes
especially to Roberto, who guided me during my work, giving advice
and motivating me to find solutions to my problems. I would also like to
thank Stefano Bonaccorsi, who lent me his expertise every time I needed
it. And finally, a very big thanks to my family and friends, who have always
been there and supported me.

Bibliography
[1] Fereshteh Asgari, Vincent Gauthier, and Monique Becker. A survey on
human mobility and its applications. arXiv preprint arXiv:1307.0814,
2013.
[2] Vincent D Blondel, Adeline Decuyper, and Gautier Krings. A survey
of results on mobile phone datasets analysis. arXiv preprint
arXiv:1502.03406, 2015.
[3] Matthew F Dixon, Spencer P Aiello, Funmi Fapohunda, and William
Goldstein. Detecting mobility patterns in mobile phone data from the
Ivory Coast. NetMob D4D Challenge, 2013.
[4] John Doyle, Peter Hung, Damien Kelly, Seán McLoone, and Ronan
Farrell. Utilising mobile phone billing records for travel mode discovery.
2011.
[5] Huayong Wang, Francesco Calabrese, Giusy Di Lorenzo, and Carlo
Ratti. Transportation mode inference from anonymized and aggregated
mobile phone call detail records. In Intelligent Transportation Systems
(ITSC), 2010 13th International IEEE Conference on, pages 318–323.
IEEE, 2010.