From ICT survey data to
experimental statistics:
using IaD source for
website functionalities
ALESSANDRA NURRA
ISTAT – Researcher
① The ICT Survey
② 3 target variables and official European statistics
③ Importance of web ordering
④ Main goals
⑤ 5 phases of alternative estimate procedure
⑥ Results: estimates comparison and additional information
⑦ Production point of view: conclusions and perspectives
ALESSANDRA NURRA
Researcher, Istat – DIPS-DCSE-SEC
Outline
MAIN INDICATORS FROM THE ICT SURVEY
o The principal aim of this survey is to supply users with indicators on:
Internet activities (website, social media, cloud computing) and
connections used (fixed and mobile broadband), e-Business (use of
software such as ERP and CRM), e-Commerce, ICT skills, e-Invoice, etc.
MAIN PURPOSES OF THE ICT SURVEY INDICATORS
o The ICT survey is also one of the major sources of data for the Digital
Agenda Scoreboard and the Digital Economy and Society Index (DESI),
which measure the progress of the European digital economy and track the
evolution of EU member states in digital competitiveness.
The survey is part of the European Community statistics on the information
society: Community Survey on ICT usage and e-commerce in enterprises
Data for year 2017:
- Population of enterprises with 10+ persons employed (from the Business Register (BR), updated to 2015): 184,865
- Sampling frame: 32,361
- Respondents: 21,410 (66%)
The ICT Survey
o Rate of enterprises whose website provides online ordering or
booking, e.g. a shopping cart (percentage of total population 10+)
o Rate of enterprises whose website provides advertisement of
open job positions or job application (percentage of total population 10+)
o Rate of enterprises whose website has links or references to the
enterprise's social media profiles (percentage of total population 10+)
• The phenomena are slowly growing
• Italy is below the European values
3 target variables and official European statistics
geo \ time   2012   2013   2014   2015   2016   2017
EU28           15     16     17     17     18     20
IT             11     12     11     13     14     15

geo \ time   2012   2013   2014   2015   2016   2017
EU28           19     21     24   n.d.     27   n.d.
IT              8      8     10     10     10     11

geo \ time   2012   2013   2014   2015   2016   2017
EU28         n.d.   n.d.     22     28     33     35
IT           n.d.   n.d.     21     26     28     31

Rate of enterprises with website (percentage of total population 10+)
geo \ time   2012   2013   2014   2015   2016   2017
EU28           71     73     74     75     77     77
IT             65     67     69     71     71     72
Importance of web ordering
E-commerce
  E-sales
    Web sales
      Web site (web ordering)
      App
      E-marketplace
    Electronic automatic sales (EDI)
  E-purchases
Internet data on web ordering would help us to have more control on the evolution of:
• WEB SALES
• E-SALES
• E-COMMERCE
Competitiveness drivers
WHAT / HOW / WHY
To replicate a subset of estimates currently produced by the survey:
• investigating new IT solutions
• improving / developing new skills
• evaluating and comparing the quality of alternative estimates with traditional ones
To produce additional information:
• increasing the offer of statistical information
To integrate the information collected by the survey with that collected via the Internet:
• improving the accuracy of traditional estimates
Main goals
5 phases of alternative estimate procedure
[Flow diagram: Big Data, Internet as Data Source]
Population frame: Asia, 185,000 enterprises 10+
Phase I, list of URLs: 130,000 websites, of which 100,000 websites retrieved
Phase II, web scraping, and phase III, text processing: document-term matrix with predictors for 85,000 websites
Phase IV, model fitting: predictors and survey data on a sample of 12,000, giving predictions
Phase V, estimation: survey data give the estimates; predictions give the alternative estimates
1 - LIST OF URLs
1. Integration and validation of the list of available URLs
2. Retrieval of URLs:
• search query on the enterprise denomination (first 10 websites returned)
• processing the 10 results to choose 1:
  - matching other information with the web content
  - using an ML approach to associate URLs with enterprises
From potentially 130,000 to about 100,000 websites
2 - WEB SCRAPING
1. Reading the homepage and all the other reachable pages
2. Running Optical Character Recognition (OCR) on all types of images (including a screenshot of the homepage)
From 100,000 to 85,000 websites
First 4 phases of alternative estimate procedure
3 - TEXT MINING
Converting each website into a data record with relevant information:
1. Processing the text (NLP)
2. Computing a Term Evaluation (TE) function that gives each term a measure (score) of relevance (using ML)
3. Selecting the best-scoring terms to encode the enterprise website in terms of the target variables
4 - MODEL FITTING
1. Fitting a supervised ML classifier on a subset of 12,000 enterprises (observed and Internet data)
2. Choosing the classifier on the basis of performance measures
3. Applying the chosen classifier to the whole document-term matrix of 85,000 websites; the best results were obtained with Random Forest (RF) (information retrieval for target variable on link to SM profiles)
4. Obtaining predictions for 85,000 enterprises
Survey estimates
1. Obtained with the usual design-based / model-assisted approach:
weights are obtained by calibrating the basic weights
(inverses of the inclusion probabilities) to known totals in the target
population (𝑈 = 185,000), in order to reduce the bias due to
non-response and the variability due to sampling errors
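The calibration idea can be illustrated with its simplest special case, post-stratification: within each stratum the basic weights of the respondents are rescaled so that they sum to the known population total. This is only a sketch; the strata, totals and records are invented for illustration, and the calibration actually used in the survey may rely on more auxiliary variables.

```python
from collections import defaultdict


def poststratify(respondents, known_totals):
    """Rescale basic weights (1 / inclusion probability) so that, within
    each stratum, they sum to the known population total.

    respondents:  list of dicts with keys 'stratum', 'pi', 'y' (0/1)
    known_totals: dict stratum -> number of enterprises in the population
    """
    weight_sum = defaultdict(float)
    for r in respondents:
        weight_sum[r["stratum"]] += 1.0 / r["pi"]
    calibrated = []
    for r in respondents:
        factor = known_totals[r["stratum"]] / weight_sum[r["stratum"]]
        calibrated.append((1.0 / r["pi"]) * factor)
    return calibrated


def estimate_rate(respondents, weights):
    """Weighted share of enterprises with y = 1."""
    total_y = sum(w * r["y"] for r, w in zip(respondents, weights))
    return total_y / sum(weights)


# Hypothetical toy data, not survey figures.
respondents = [
    {"stratum": "small", "pi": 0.1, "y": 0},
    {"stratum": "small", "pi": 0.1, "y": 1},
    {"stratum": "large", "pi": 0.5, "y": 1},
]
known_totals = {"small": 30, "large": 4}

weights = poststratify(respondents, known_totals)
rate = estimate_rate(respondents, weights)
```

After calibration the weights sum to the known totals by construction, which is what compensates for non-respondents within each stratum.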
Alternative estimates
COMPARISON: three different sets of estimates
Alternative estimates
have been calculated by adopting two different estimators:
2. Full model-based estimator: the estimate of the total number of enterprises offering the target functionality on their websites is given by the count of the predicted values for all units whose websites could be reached (𝑈₂ = 85,000), calibrated to make them representative of the whole population having websites (𝑈₁ = 130,000);
3. Combined estimator, produced by summing three components: the count of predicted values in the subpopulation 𝑈₂; an adjustment based on the differences between the reported values and the predicted values, expanded to the subpopulation 𝑈₂; and the count of observed values for respondents that declared a website that was neither found nor scraped, expanded to the subpopulation 𝑈₁ − 𝑈₂.
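The two alternative estimators can be sketched as follows. This is an illustrative reading of the description above, not the production code: a single ratio N1/N2 stands in for the calibration step of the full model-based estimator, and all names, weights and values are hypothetical.

```python
def model_based_total(predictions_u2, n_u1):
    """Count of predicted positives among scraped websites (U2), scaled up
    to the population with websites (U1).
    Assumption: a plain ratio n_u1 / n_u2 replaces the real calibration."""
    n_u2 = len(predictions_u2)
    return sum(predictions_u2) * n_u1 / n_u2


def combined_total(predictions_u2, sample_u2, sample_rest):
    """Sum of the three components of the combined estimator.

    predictions_u2: 0/1 predictions for every unit in U2
    sample_u2:      (weight, observed, predicted) for respondents in U2,
                    weights expanding to U2
    sample_rest:    (weight, observed) for respondents with a declared
                    website that was not found or scraped, weights
                    expanding to U1 - U2
    """
    count_predicted = sum(predictions_u2)
    adjustment = sum(w * (obs - pred) for w, obs, pred in sample_u2)
    not_scraped = sum(w * obs for w, obs in sample_rest)
    return count_predicted + adjustment + not_scraped


# Hypothetical toy data: 8 scraped websites out of 12 with a website.
predictions_u2 = [1, 0, 0, 1, 0, 0, 0, 1]
sample_u2 = [(4.0, 1, 1), (4.0, 1, 0)]   # second unit: model missed a "yes"
sample_rest = [(2.0, 1), (2.0, 0)]

full_model = model_based_total(predictions_u2, n_u1=12)
combined = combined_total(predictions_u2, sample_u2, sample_rest)
```

The adjustment term is what ties the combined estimator to the survey: wherever a respondent's reported value differs from the prediction, the weighted difference corrects the simple count.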
RATE OF ENTERPRISES WITH WEB ORDERING, JOB ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR
WEBSITES BY SIZE CLASSES - Year 2017
Results: estimates comparison (1/3)
The three sets of estimates are coherent with one another. In many cases, though not all, the alternative estimates fall well inside the
confidence interval of the survey estimate; this holds for many values in the different domains for all three
target variables.
Results: estimates comparison (2/3)
RATE OF ENTERPRISES WITH WEB ORDERING, JOB
ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR
WEBSITES BY NACE (24 groups of economic activities
considered in the survey) - Year 2017
A simulation study with 1,000 iterations, comparing the
accuracy of the 3 estimates in terms of the
components of the MSE (bias and variance), shows that
the accuracy of the new estimates is not lower than
that of the estimates already produced by the ICT survey.
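The decomposition behind such a study, MSE = bias² + variance, can be illustrated with a small simulation. The population, sample size and sampling scheme below are assumptions chosen for illustration only; they do not reproduce the simulation actually carried out.

```python
import random
import statistics


def simulate(population, sample_size, iterations=1000, seed=0):
    """Draw repeated simple random samples and estimate the rate each time."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(iterations):
        sample = rng.sample(population, sample_size)
        estimates.append(sum(sample) / sample_size)
    return estimates


def mse_components(estimates, true_value):
    """Return (bias, variance, mse) of an estimator over the iterations."""
    bias = statistics.fmean(estimates) - true_value
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return bias, variance, mse


# Hypothetical population of 1,000 enterprises with a true rate of 15%.
population = [1] * 150 + [0] * 850
estimates = simulate(population, sample_size=100)
bias, variance, mse = mse_components(estimates, true_value=0.15)
```

Running the same function on the estimates produced by each of the three estimators would allow their bias and variance to be compared side by side.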
Results: additional information (3/3)
MODEL-BASED ESTIMATES - RATE OF ENTERPRISES WITH WEB ORDERING, JOB
ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR WEBSITES BY NACE
REV. 2 LEVEL 2 (62 divisions) - Year 2017
All data and metadata were published on June 8 on the Istat website dedicated to
experimental statistics, in the subsection on results of experiments on big data.
For reasons of respondent burden, in the ICT survey for year 2019 the entire website section will be ‘optional’, so it will not be possible
to use the combined estimator (which also uses the observed survey values); full model-based estimators
will therefore remain the only alternative for time series.
o Role of ICT survey data: they have been used to fit the models that predict the values; furthermore,
prediction errors have a direct impact on one of the three components of the combined estimator.
Production point of view: conclusions and perspectives
Full model-based (and combined) estimates can be considered acceptable, but we need time
series analysis to verify the stability of the procedure and of the results.
o Open question: are the differences due to respondent errors, website URL errors, or other reasons (for example, web ordering
available only in a private area of the website)? Urgent need (time consuming): re-contact respondents, ask for the URL
inside the question on web functionalities rather than at the end of the questionnaire, and improve definitions. A
strong effort is therefore required to assure good-quality answers and reduce response errors in the training
set (from the survey), even if in the future this should be necessary only every ‘n’ years; one solution
could be to use a (small) subset of data as the training set, not necessarily obtained through costly repeated official
sample surveys.
o Where predicted values differed from those reported by respondents, manual checks showed
that in about half of the cases the difference was due not to a model fault but to response errors.
o Other open issues for the future: European comparability; with predicted values it will not be possible to
combine observed variables and predicted ones for the same respondents
Production point of view: conclusions and perspectives
The work done could be extended and adapted in multiple ways.
Considering the 3 target variables:
• for web ordering, the possibility could be evaluated of using IaD to find other functionalities
related to website sales, such as web payment and web delivery tracking;
• for job advertisement (it will not be included in the European ICT survey for the next 3 years), the
possibility could be evaluated of searching for additional job details, such as the characteristics
of each single job in terms of skills required;
• for social media presence, detection could be extended to scrape not only the enterprise
website but also the social media directly, in order to investigate in more detail how
social media are being used.
Considering other aspects linked to ICT usage and
eCommerce:
• to investigate web sales of enterprises via other
means: e-marketplaces, apps, social shopping
(new Instagram ‘shopping’ feature, launched in
March 2018; Facebook Marketplace);
• to investigate further the enterprises operating in
specific economic activities (for example, NACE
47.91, Retail sale via mail order houses or via Internet,
including retail sale of any kind of product over the
Internet), in order to obtain information about
products/services;
• to reuse and adapt the procedure described for e-government
websites or for websites of enterprises
with fewer than 10 persons employed.
Working team
Istat: G.Barcaroli, G.Bianchi, N.Golini, A. Nurra,
P.Righi, S.Salamone, F.Scalfati, M.Scannapieco,
D.Summa, D.Zardetto
CINECA: M.Scarnò
Univ.Roma Sapienza: R.Bruni
Link to Istat experimental statistics and metadata on website functionalities:
https://www.istat.it/en/archivio/216641
Thanks for your attention
Use of e-commerce marketplaces for web sales most popular in Italy, Germany, Austria and Poland
Regarding the number of enterprises that sold their goods or services through an e-commerce
marketplace, the highest shares were recorded in Italy (54%) and Germany (52%), followed by Austria
and Poland (both 47%).
Enterprises using social networks, 2017 and 2013 (% of enterprises)
E-sales broken down by web and EDI-type sales, 2016 (% enterprises with e-sales)
The percentage of enterprises receiving orders via websites
or apps was high in almost all Member
States (Italy: 70% of enterprises with e-sales receive orders
via web sales, 20% via EDI, 10% via both).
In Italy, 8 out of 10 enterprises with web sales sell through their
own website or app, and almost 5 out of 10 (EU28:
39%) sell via an e-marketplace.
Summarising performance of first URLs retrieval procedure
Importance of web ordering
YES - WEB ORDERING (referred to year t)
• YES WEB SALES (referred to year t-1) of the respondent
• NO WEB SALES (referred to year t-1) of the respondent [due to a new website
or functionality used in year t and not in t-1; or due to web
sales recorded in the turnover of another enterprise of the group, of foreign
enterprises, or of enterprises with fewer than 10 persons employed (out of
scope)]
NO - WEB ORDERING (referred to year t)
• NO WEB SALES (referred to year t-1) of the respondent
• YES WEB SALES (referred to year t-1) of the respondent (via
e-marketplaces, via app)
SEEMINGLY CONTRADICTORY ANSWERS
5 phases of alternative estimates procedure (2/5)
[Flow diagram: Big Data, Internet as Data Source] 132,000 websites → 100,000 websites
• Data integration to obtain a list of URLs to validate
Sources: ICT survey, Consodata
• URL retrieval
• For unavailable URLs, an automated procedure uses the enterprise
denomination as a search string and collects the first 10 links returned by
the query
• The first ten URLs are processed in order to choose the right one for the given enterprise of the
population of interest:
  - matching the enterprise information (denomination, fiscal code, VAT number, telephone,
  address, etc., available from administrative data) against the content of the first ten URLs
  retrieved;
  - using an ML approach to associate URLs with enterprises: for each link its probability of
  correctness is evaluated, and the links whose probability exceeds a given threshold
  are accepted as valid.
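The selection step can be sketched as below. The scoring function is a deliberately naive stand-in for the ML model (it only counts how many known identifiers appear in the page text), and the enterprise record, URLs and threshold are hypothetical.

```python
def match_score(enterprise, page_text):
    """Naive stand-in for the ML probability of correctness: share of the
    enterprise's known identifiers that appear in the page text."""
    fields = [value.lower() for value in enterprise.values() if value]
    text = page_text.lower()
    return sum(field in text for field in fields) / len(fields)


def select_url(candidates, threshold):
    """candidates: list of (url, probability). Return the most probable
    link among those exceeding the threshold, or None if none qualifies."""
    accepted = [(p, url) for url, p in candidates if p > threshold]
    if not accepted:
        return None
    return max(accepted)[1]


# Hypothetical enterprise record and search results.
enterprise = {"name": "Rossi Srl", "vat": "01234567890", "phone": "06 1234567"}
pages = {
    "https://rossi.example": "Rossi Srl - P.IVA 01234567890 - Tel 06 1234567",
    "https://directory.example/rossi": "Rossi Srl listing",
}
candidates = [(url, match_score(enterprise, text)) for url, text in pages.items()]
chosen = select_url(candidates, threshold=0.6)
```

Raising the threshold trades coverage for correctness: fewer enterprises get a URL, but the accepted links are more likely to be the right ones.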
URL: Uniform Resource Locator corresponding to a statistical unit
5 phases of alternative estimates procedure (3/5)
[Flow diagram] 100,000 websites → web scraping + text processing → document-term matrix (85,000)
Website scraping
• reads the homepage and all
the other reachable pages
(max of 20 pages, the depth
can be selected)
• does Optical Character
Recognition (OCR) on all types
of images (also screen-shot of
homepage)
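The page-limited traversal can be sketched as a breadth-first visit with a cap on pages and on depth. To keep the example self-contained, the site is a hypothetical in-memory link graph rather than real HTTP requests, and OCR is left out.

```python
from collections import deque


def crawl(links, homepage, max_pages=20, max_depth=2):
    """Visit the homepage and the pages reachable from it, breadth-first,
    stopping at max_pages pages and not following links beyond max_depth.

    links: dict page -> list of linked pages (hypothetical site map)
    """
    seen = {homepage}
    visited = []
    queue = deque([(homepage, 0)])
    while queue and len(visited) < max_pages:
        page, depth = queue.popleft()
        visited.append(page)
        if depth == max_depth:
            continue  # do not follow links from the deepest level
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical site structure.
site = {
    "/": ["/shop", "/jobs"],
    "/shop": ["/cart", "/"],
    "/cart": ["/checkout"],
}
pages_read = crawl(site, "/", max_pages=20, max_depth=2)
```

With a depth of 2 the checkout page is never reached, which shows how a shallow setting can miss functionality that sits several clicks away from the homepage.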
The text mining phase converts
each website into a data record:
• processing text using Natural Language
Processing techniques
• computing a Term Evaluation (TE) function that
gives a measure (score) of relevance to
each term (using supervised ML: given a
term of the training set, its frequency in a class
is compared with its overall frequency; the term is relevant if
it occurs primarily in positive documents and
rarely in negative ones, or vice versa)
• summarizing each website into a number
of relevant terms and applying
dimensionality reduction techniques to
obtain a set of data records describing the
websites (document-term matrix for 85,000 websites)
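One simple way to express this idea is to score each term by the absolute difference between its document frequency in the positive class and in the negative class. This is only an illustrative choice, not necessarily the TE function used in the project, and the toy documents are invented.

```python
def term_scores(documents, labels):
    """Score each term by |share of positive documents containing it
    minus share of negative documents containing it|.

    documents: list of sets of terms; labels: list of 0/1 class labels.
    Assumes at least one positive and one negative document.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    vocabulary = set().union(*documents)
    scores = {}
    for term in vocabulary:
        pos = sum(1 for d, y in zip(documents, labels) if y == 1 and term in d)
        neg = sum(1 for d, y in zip(documents, labels) if y == 0 and term in d)
        scores[term] = abs(pos / n_pos - neg / n_neg)
    return scores


def top_terms(scores, k):
    """The k best-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical training documents: 1 = website offers web ordering.
documents = [
    {"cart", "checkout", "contact"},
    {"cart", "contact"},
    {"contact", "history"},
    {"contact"},
]
labels = [1, 1, 0, 0]

scores = term_scores(documents, labels)
selected = top_terms(scores, k=2)
```

A term such as "contact", which appears on every website, scores zero and is dropped, while "cart" appears only on positive websites and receives the maximum score.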
5 phases of alternative estimates procedure (4/5)
Model fitting: supervised ML classifier
using a training set (data-driven, not a
deterministic choice of keywords)
• Fitting the model (machine learning) on the subset
of enterprises for which both Internet data and
survey data were available (12,000),
taking the survey data as the true values
(several classification approaches were
applied);
• Applying the classifier to all 85,000 websites,
predicting the values of the target variables for all
the enterprises whose websites were successfully
retrieved and scraped.
Performance evaluation of classification algorithms -
performance measures for classifiers:
1. Accuracy: share of correct predictions out of all cases
2. Sensitivity (or recall): share of true positives out of all actual positives
3. Precision: share of true positives out of all positive predictions
4. F1-score: harmonic mean of Sensitivity and Precision
The best results were obtained with Random Forest (RF)
(information retrieval for target variable on link to SM profiles)
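The four measures can be computed directly from the counts of true and false positives and negatives. The labels below are invented for illustration; in the procedure described, the survey answers play the role of the true values.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall), precision and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denominator = precision + sensitivity
    f1 = 2 * precision * sensitivity / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical survey answers (true values) and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Because the target functionalities are relatively rare, accuracy alone can look good even for a classifier that misses most positives, which is why sensitivity, precision and F1 are reported alongside it.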
IV
Predictors
Doc
Terms
Matrix
Predictions
12,000
Predictors
85,000
ALESSANDRA NURRA
Researcher, Istat – DIPS-DCSE-SEC

More Related Content

PPTX
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
PPTX
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
PPTX
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
PPTX
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
PPTX
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
PPTX
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
PPTX
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
PPTX
14a Conferenza Nazionale di Statistica
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
Verso le trusted smart statistics - prospettive di sviluppo e risultati del e...
14a Conferenza Nazionale di Statistica

What's hot (9)

PDF
2010.080 1226
PPTX
EDF2014: Talk of Vassileios Tsetsos, Chief Technical Officer, Mobics Ltd: Pre...
PPTX
EDF2014: Talk of Axel Polleres, Full Professor, WU - Vienna University of Eco...
PPTX
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
PDF
Wireless network planning solutions
PDF
DISCOVERY DAY 2017: MAKE IT HAPPEN!
 
PPTX
14a Conferenza Nazionale di Statistica
PDF
Pasquale Persico, Nuovi strumenti di analisi del turismo
PPTX
Site suitability analysis for constructing New ATM in Margao , Goa
2010.080 1226
EDF2014: Talk of Vassileios Tsetsos, Chief Technical Officer, Mobics Ltd: Pre...
EDF2014: Talk of Axel Polleres, Full Professor, WU - Vienna University of Eco...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
Wireless network planning solutions
DISCOVERY DAY 2017: MAKE IT HAPPEN!
 
14a Conferenza Nazionale di Statistica
Pasquale Persico, Nuovi strumenti di analisi del turismo
Site suitability analysis for constructing New ATM in Margao , Goa
Ad

Similar to A. Nurra, From ICT survey data to experimental statistics; using IaD source for website functionalities (20)

PPTX
Application of DEA in IT & Communication
PPTX
EW-Shopp: Interoperability Challenges and Solutions
PDF
Analyzing the Impact of Visitors on Page Views with Google Analytics
PDF
Analyzing the Impact of Visitors on Page Views with Google Analytics
PDF
Time Series ANN Approach for Weather Forecasting
PDF
Data analytics to improve home broadband cx & network insight
PDF
RESEARCH CHALLENGES IN WEB ANALYTICS – A STUDY
PDF
Semantically Enriched Knowledge Extraction With Data Mining
PDF
P. Struijs, Toward the Use of Big Data for European Statistics
DOCX
Chapter 3 • Nature of Data, Statistical Modeling, and Visuali.docx
PDF
Transport for London - London's Operations Digital Twin
PDF
Data Science Salon: Adopting Machine Learning to Drive Revenue and Market Share
PDF
Open Data Infrastructures Evaluation Framework using Value Modelling
PDF
Search Engine Scrapper
PDF
Effort Estimation Development Model for Web-based Mobile Application Using Fu...
PDF
"Unlocking Insights: A Comprehensive Guide to Big Data Analytics for Transfor...
PPTX
Determination and visualization of density210409
PDF
IRJET- Popularity based Recommender Sytsem for Google Maps
PDF
CV-Grace-DataAnalytics-UCL
PDF
IRJET- Logistics Network Superintendence Based on Knowledge Engineering
Application of DEA in IT & Communication
EW-Shopp: Interoperability Challenges and Solutions
Analyzing the Impact of Visitors on Page Views with Google Analytics
Analyzing the Impact of Visitors on Page Views with Google Analytics
Time Series ANN Approach for Weather Forecasting
Data analytics to improve home broadband cx & network insight
RESEARCH CHALLENGES IN WEB ANALYTICS – A STUDY
Semantically Enriched Knowledge Extraction With Data Mining
P. Struijs, Toward the Use of Big Data for European Statistics
Chapter 3 • Nature of Data, Statistical Modeling, and Visuali.docx
Transport for London - London's Operations Digital Twin
Data Science Salon: Adopting Machine Learning to Drive Revenue and Market Share
Open Data Infrastructures Evaluation Framework using Value Modelling
Search Engine Scrapper
Effort Estimation Development Model for Web-based Mobile Application Using Fu...
"Unlocking Insights: A Comprehensive Guide to Big Data Analytics for Transfor...
Determination and visualization of density210409
IRJET- Popularity based Recommender Sytsem for Google Maps
CV-Grace-DataAnalytics-UCL
IRJET- Logistics Network Superintendence Based on Knowledge Engineering
Ad

More from Istituto nazionale di statistica (20)

PPTX
Censimenti Permanenti Istituzioni non profit
PPTX
Censimenti Permanenti Istituzioni non profit
PPTX
Censimenti Permanenti Istituzioni non profit
PPTX
Censimenti Permanenti Istituzioni non profit
PPTX
Censimenti Permanenti Istituzioni non profit
PPTX
Censimenti Permanenti Istituzioni non profit
PPTX
Censimento Permanente Istituzioni Pubbliche
PDF
Censimento Permanente Istituzioni Pubbliche
PPTX
Censimento Permanente Istituzioni Pubbliche
PPTX
Censimento Permanente Istituzioni Pubbliche
PPTX
14a Conferenza Nazionale di Statisticacnstatistica14
PPTX
14a Conferenza Nazionale di Statistica
PPSX
14a Conferenza Nazionale di Statistica
PPTX
14a Conferenza Nazionale di Statistica
PDF
14a Conferenza Nazionale di Statistica
PDF
14a Conferenza Nazionale di Statistica
PPTX
14a Conferenza Nazionale di Statistica
PPTX
14a Conferenza Nazionale di Statistica
PPTX
14a Conferenza Nazionale di Statistica
PDF
14a Conferenza Nazionale di Statistica
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
14a Conferenza Nazionale di Statisticacnstatistica14
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica

Recently uploaded (20)

PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
Classroom Observation Tools for Teachers
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PPTX
Cell Types and Its function , kingdom of life
PPTX
Pharma ospi slides which help in ospi learning
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
Insiders guide to clinical Medicine.pdf
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
Computing-Curriculum for Schools in Ghana
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
Complications of Minimal Access Surgery at WLH
PPTX
Institutional Correction lecture only . . .
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Renaissance Architecture: A Journey from Faith to Humanism
Abdominal Access Techniques with Prof. Dr. R K Mishra
Classroom Observation Tools for Teachers
Module 4: Burden of Disease Tutorial Slides S2 2025
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Cell Types and Its function , kingdom of life
Pharma ospi slides which help in ospi learning
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Insiders guide to clinical Medicine.pdf
Supply Chain Operations Speaking Notes -ICLT Program
Computing-Curriculum for Schools in Ghana
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Complications of Minimal Access Surgery at WLH
Institutional Correction lecture only . . .
O7-L3 Supply Chain Operations - ICLT Program
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
2.FourierTransform-ShortQuestionswithAnswers.pdf
STATICS OF THE RIGID BODIES Hibbelers.pdf

A. Nurra, From ICT survey data to experimental statistics; using IaD source for website functionalities

  • 1. From ICT survey data to experimental statistics: using IaD source for website functionalities ALESSANDRA NURRA ISTAT – Researcher 0
  • 2. ① The ICT Survey ② 3 target variables and official European statistics ③ Importance of web ordering ④ Main goals ⑤ 5 phases of alternative estimate procedure ⑥ Results: estimates comparison and additional information ⑦ Production point of view: conclusions and perspectives ALESSANDRA NURRA Researcher, Istat – DIPS-DCSE-SEC 1 Outlines
  • 3. MAIN INDICATORS FROM THE ICT SURVEY o The principal aim of this survey is to supply users with indicators on: Internet activities (web site, social media, cloud computing) and connection used (fixed and mobile broadband), e-Business (use of software as ERP, CRM), e-Commerce, ICT skills, e-Invoice, etc. MAIN PURPOSES OF THE ICT SURVEY INDICATORS o ICT survey is also one of the major sources of data for the Digital Agenda Scoreboard and Digital Economy and Society Index (DESI) measuring progress of the European digital economy and to track the evolution of EU member states in digital competitiveness. The survey is part of the European Community statistics on the information society Community Survey on ICT usage and e-commerce in enterprises Data for year 2017: - Pop ent 10+ (from BR updated to 2015): 184,865 - Sampling frame: 32,361 - Respondents: 21,410 (66%) 2 2 The ICT Survey ALESSANDRA NURRA Researcher, Istat – DIPS-DCSE-SEC
  • 4. o Rate of enterprises where the website provides online ordering or booking, e.g. shopping cart (percentage out of tot pop 10+) o Rate of enterprises where the website provides advertisement of open job position or job application (percentage out of tot pop 10+) o Rate of enterprises where the website has links or references to the enterprise's social media profiles (percentage out of tot pop 10+)  Phenomena are slowly growing  Italy is below European values 3 3 3 target variables and official European statistics ALESSANDRA NURRA Researcher, Istat – DIPS-DCSE-SEC geo time 2012 2013 2014 2015 2016 2017 EU28 15 16 17 17 18 20 IT 11 12 11 13 14 15 geo time 2012 2013 2014 2015 2016 2017 EU28 19 21 24 n.d. 27 n.d. IT 8 8 10 10 10 11 geo time 2012 2013 2014 2015 2016 2017 EU28 n.d. n.d. 22 28 33 35 IT n.d. n.d. 21 26 28 31 geo time 2012 2013 2014 2015 2016 2017 EU28 71 73 74 75 77 77 IT 65 67 69 71 71 72 Rate of enterprises with website (percentage out of tot pop 10+)
  • 5. 4 4 Importance of web ordering ALESSANDRA NURRA Researcher, Istat – DIPS-DCSE-SEC E-commerce E-sales Web sales Web site (web ordering) App E-marketplace Electronic automatic sales (EDI)E-purchases Internet data on web ordering would help us to have more control on the evolution of • WEB SALES • E-SALES • E-COMMERCE Competitiveness drivers
  • 6. 5 WHAT HOW WHY To replicate a subset of estimates currently produced by the survey • investigating new IT solutions • Improving /developing new skills • evaluating and comparing quality of alternative estimates with traditional ones To produce additional information • increasing the offer of statistical information To integrate the information collected with survey with those collected via Internet • improving accuracy of traditional estimates 5 Main goals ALESSANDRA NURRA Researcher, Istat – DIPS-DCSE-SEC
  • 7. IV model fitting III V II I phase list of URLs 6 6 5 phases of alternative estimate procedure Predictors 130,000 Websites Big Data: Internet as Data Source Doc Terms Matrix 100,000 websites Predictions Survey data Estimation 12,000 Sample Alternative Estimates Estimation Estimates Population Frame: Asia 185,000 ent 10+ Predictors 85,000 ALESSANDRA NURRA Researcher, Istat – DIPS-DCSE-SEC Web scraping Text processing
  • 8. 1 - LIST OF URLs 1. integration, valid. list of available URLs 2. Retrieval URLs: • enterprise denom (10 website from query) • processing 10 to choose 1  matching other info with web content  using a ML approach to associate URLs to enterprises from potentially 130,000 to about 100,000 2 - WEB SCRAPING 1. reading the homepage and all the other reachable pages 2. doing Optical Character Recognition (OCR) on all types of images (also screen-shot of homepage) from 100,000 to 85,000 7 First 4 phases of alternative estimate procedure 3 - TEXT MINING to convert each website in a data record with relevant information: 1. processing text (NLP) 2. computing Term Evaluation (TE) function to give a measure (score) of relevance to each term (using ML) 3. select the first best terms to codify the enterprise website in terms of target variables 4 - MODEL FITTING 1. fitting model with supervised ML classifier on a subset of 12,000 (observed and Internet) 2. choosing classifier on basis of performance measures 3. applying classifier chosen to all 85,000 document text matrix; the best results have been obtained with Random Forest (RF) (information retrieval for target variable on link to SM profiles) 4. obtaining predictions on 85,000 enterprises ALESSANDRA NURRA Researcher, Istat – DIPS-DCSE-SEC
  • 9. Survey estimates 1. obtained by using the usual design based / model assisted approach where weights are obtained by calibration procedure of basic weights (inverse of inclusion probabilities) making use of known totals in the target population (𝑈 = 185,000) in order to reduce the bias due to non-response and the variability due to sampling errors 8 8 Alternative estimates Survey data Estimation Estimates ALESSANDRA NURRA Researcher, Istat – DIPS-DCSE-SEC COMPARISON three different sets of estimates V Predictions Estimation Alternative Estimates Alternative estimates have been calculated by adopting two different estimators: 2. full model based estimator where the estimate of the total number of enterprises offering target variable on their websites is given by the count of the predicted values for all units for which it was possible reach their websites (𝑈2 = 85,000 ), calibrated in order to make them representative of all the population having websites (𝑈1 = 130,000); 3. combined estimator produced by summing three components: the counting of predicted values in the sub pop 𝑈2 ; an adjustment based on the differences between the reported values and the predicted values expanded to sub pop 𝑈2 ; the counting of observed values for respondents that declared a website, that was not found nor scraped expanded to sub pop 𝑈1 − 𝑈2 .
  • 10. 9 RATE OF ENTERPRISES WITH WEB ORDERING, JOB ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR WEBSITES BY SIZE CLASSES - Year 2017 9 Results: estimates comparison (1/3) The three different sets are not incoherent. In many cases, but not for all, the alternative estimates are well inside the confidence interval of the survey estimate, and this is the same for many values in the different domains for all three target variables. ALESSANDRA NURRA Researcher, Istat – DIPS-DCSE-SEC
  • 11. Results: estimates comparison (2/3)
    RATE OF ENTERPRISES WITH WEB ORDERING, JOB ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR WEBSITES BY NACE (24 economic activity groups considered in the survey) - Year 2017
    A simulation study carried out on 1000 iterations to compare the accuracy of the 3 estimates in terms of the components of the MSE (bias and variance) shows that the accuracy of the new estimates is not lower than that of the estimates already produced by the ICT survey.
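The MSE components mentioned on this slide can be illustrated with a minimal sketch. It assumes a list of estimates obtained over simulation iterations and a known true value; the function name and the toy numbers are hypothetical and are not taken from the study.

```python
def mse_components(estimates, true_value):
    """Return (bias, variance, mse) of simulated estimates around a true value."""
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    bias = mean_estimate - true_value
    # variance of the estimates around their own mean (divisor n)
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / n
    # mean squared error around the true value
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return bias, variance, mse


# Toy example: four simulated estimates of a true value of 10
bias, variance, mse = mse_components([9.0, 10.0, 11.0, 12.0], 10.0)
```

With the divisor `n` used for the variance, the identity MSE = bias² + variance holds exactly, which is why the two components are enough to compare estimators.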
  • 12. Results: additional information (3/3)
    MODEL BASED ESTIMATES - RATE OF ENTERPRISES WITH WEB ORDERING, JOB ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR WEBSITES BY NACE REV. 2 LEVEL 2 (62 divisions) - Year 2017
    All data and metadata were published on June 8 on the Istat website dedicated to experimental statistics, in the subsection on Results of experiments on big data.
  • 13. Production point of view: conclusions and perspectives
    Full model based (and combined) estimates can be considered acceptable, but we need time series analysis to verify the stability of the procedure and of the results.
    For burden reasons, the entire website section of the ICT survey for year 2019 will be 'optional', so it will not be possible to use the combined estimator (which also uses observed survey values); full model based estimators will therefore remain the only alternative for time series.
    o Role of ICT survey data: they have been used for fitting the models that predict values; furthermore, prediction errors have a direct impact on one of the three components of the combined estimator.
    o Open question: respondent or website URL errors, or other reasons (for example web ordering made in a private area of the website)? Urgent need (time consuming): re-contact respondents, ask for the URL inside the question on web functionalities and not at the end of the questionnaire, improve definitions. A strong effort is therefore required to ensure good-quality answers and reduce response errors in the training set (from the survey), even if in the future this should be necessary only every 'n' years; one solution could be to use a (small) subset of data as training set, not necessarily obtained by costly repeated official sample surveys.
    o In cases where predicted values differed from those reported by respondents, manual checks showed that in about half of the cases the difference was due not to a model fault but to response errors.
    o Other open issues for the future: European comparability; with predicted values it will not be possible to combine observed variables and predicted ones for the same respondents.
  • 14. Production point of view: conclusion and perspectives
    The work done could be extended and adapted in multiple ways.
    Considering the 3 target variables:
       for web ordering, the possibility of using IaD to find other functionalities related to website sales, such as web payment and web delivery tracking, could be evaluated;
       for job advertisement (which will not be included in the European ICT survey for the next 3 years), the possibility of searching for additional job details, such as the characteristics of each single job in terms of skills required, could be evaluated;
       for social media presence, detection could be extended to scrape not only the enterprise website but also the social media directly, in order to investigate in more detail what kind of use is being made of social media.
    Considering other aspects linked to ICT usage and eCommerce:
      • to investigate web sales of enterprises via other means: e-marketplaces, apps, social shopping (the new Instagram 'shopping' feature, launched in March 2018, Facebook marketplace);
      • to investigate further the enterprises operating in specific economic activities (for example Nace 47.91, Retail sale via mail order houses or via Internet, including retail sale of any kind of product over the Internet) in order to have information about products/services;
      • to reuse and adapt the procedure described for e-government websites or for websites of enterprises with fewer than 10 persons employed.
  • 15. Working team
    Istat: G.Barcaroli, G.Bianchi, N.Golini, A.Nurra, P.Righi, S.Salamone, F.Scalfati, M.Scannapieco, D.Summa, D.Zardetto
    CINECA: M.Scarnò
    Univ. Roma Sapienza: R.Bruni
    Link to Istat experimental statistics and metadata on website functionalities: https://www.istat.it/en/archivio/216641
    Thanks for your attention
  • 16. Use of e-commerce marketplaces for web sales most popular in Italy, Germany, Austria and Poland
    Regarding the number of enterprises that sold their goods or services through an e-commerce marketplace, the highest shares were recorded in Italy (54%) and Germany (52%), followed by Austria and Poland (both 47%).
  • 17. Enterprises using social networks, 2017 and 2013 (% of enterprises)
  • 18. E-sales broken down by web and EDI-type sales, 2016 (% enterprises with e-sales)
    The percentage of enterprises receiving orders over websites or via apps was high for almost all Member States (in Italy, 70% of enterprises with e-sales receive orders via web sales, 20% via EDI, 10% via both).
    In Italy, 8 out of 10 enterprises with web sales sell through their own website or app, and almost 5 out of 10 (EU28: 39%) sold via an e-marketplace.
  • 19. Summarising performance of first URLs retrieval procedure
  • 20. Importance of web ordering
    YES - WEB ORDERING (referred to year t)
      • YES WEB SALES (referred to year t-1) of the respondent
      • NO WEB SALES (referred to year t-1) of the respondent [due to a new website or functionality used in year t and not in t-1; due to cases of web sales computed on the turnover of another enterprise of the group, foreign enterprises, enterprises with fewer than 10 persons employed (out of scope)]
    NO - WEB ORDERING (referred to year t)
      • NO WEB SALES (referred to year t-1) of the respondent
      • YES WEB SALES (referred to year t-1) of the respondent (via e-marketplaces, via app)
    SEEMINGLY CONTRADICTORY ANSWERS
  • 21. 5 phases of alternative estimates procedure (2/5)
    Phase I: Uniform Resource Locator corresponding to a statistical unit
    Big Data: Internet as Data Source (from 132,000 websites to 100,000 websites)
    • Data integration to obtain a list of URLs to validate. Sources: ICT survey, Consodata
    • URL retrieval
      • For unavailable URLs, an automated procedure has been set up that uses the enterprise denomination as a search string to run a query and collect the first 10 links returned as the result of the query
      • The first ten URLs are processed in order to choose the right one for the given enterprise of the population of interest:
         matching the enterprise information (denomination, fiscal code, VAT, telephone, address, etc. available from administrative data) with the content of the first ten URLs retrieved;
         using a ML approach to associate URLs to enterprises: for each link its probability of correctness is evaluated, and those links whose probability exceeds a given threshold are accepted as valid.
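The threshold rule at the end of this slide can be sketched as below. The slide does not say how the probabilities are computed or how a single link is chosen when several pass the threshold; this sketch assumes the probabilities are already available and, as a hypothetical tie-breaking choice, keeps the most probable candidate. The function name and the example URLs are made up.

```python
def select_url(candidates, threshold):
    """Pick one URL for an enterprise from (url, probability) candidates.

    Returns the most probable candidate if its probability of correctness
    exceeds the threshold, otherwise None (no URL associated).
    """
    if not candidates:
        return None
    best_url, best_probability = max(candidates, key=lambda pair: pair[1])
    return best_url if best_probability > threshold else None


# Hypothetical candidates returned by a search on the enterprise denomination
toy_candidates = [
    ("https://example.com/directory/acme", 0.20),
    ("https://acme.example", 0.90),
    ("https://example.org/news/acme", 0.35),
]
accepted = select_url(toy_candidates, threshold=0.5)
rejected = select_url(toy_candidates, threshold=0.95)
```

Raising the threshold trades coverage for correctness: fewer enterprises get a URL, but fewer wrong associations enter the scraping phase.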
  • 22. 5 phases of alternative estimates procedure (3/5)
    Phases II + III: web scraping + text processing (from 100,000 websites to an 85,000 doc terms matrix)
    Phase II, website scraping:
      • reads the homepage and all the other reachable pages (max of 20 pages; the depth can be selected)
      • does Optical Character Recognition (OCR) on all types of images (also a screenshot of the homepage)
    Phase III, text mining to convert each website into a data record:
      • processing text using Natural Language Processing techniques
      • computing a Term Evaluation (TE) function that gives a measure (score) of relevance to each term (using supervised ML: given a term of the training set, its frequency in a class is compared to its overall frequency; the term is relevant if it occurs primarily in positive documents and rarely in negative ones, or vice versa)
      • summarizing each website into a number of relevant terms and applying dimensionality reduction techniques to obtain a set of data records describing websites (85,000 doc terms matrix)
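The idea behind the Term Evaluation step can be illustrated with a simple stand-in score. The slide does not give the actual TE function, so the score below (absolute difference between the share of positive and of negative documents containing a term) is an assumption chosen only because it is high when a term occurs mainly in one class; the function name and the toy documents are hypothetical.

```python
def term_scores(documents, labels):
    """Score each term by how unevenly it is spread across the two classes.

    documents: list of token lists, one per website
    labels: list of 0/1 values from the training set (1 = positive)
    """
    positive = [set(doc) for doc, y in zip(documents, labels) if y == 1]
    negative = [set(doc) for doc, y in zip(documents, labels) if y == 0]
    vocabulary = set().union(*positive, *negative)
    scores = {}
    for term in vocabulary:
        # share of documents of each class that contain the term
        in_pos = sum(term in doc for doc in positive) / len(positive) if positive else 0.0
        in_neg = sum(term in doc for doc in negative) / len(negative) if negative else 0.0
        scores[term] = abs(in_pos - in_neg)
    return scores


# Toy training set for the web ordering variable
toy_documents = [
    ["cart", "buy"],
    ["cart", "contact"],
    ["contact"],
    ["about", "contact"],
]
toy_labels = [1, 1, 0, 0]
toy_scores = term_scores(toy_documents, toy_labels)
best_term = max(toy_scores, key=toy_scores.get)
```

Keeping only the highest-scoring terms is one way to reduce each website to a short record of relevant terms before model fitting.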
  • 23. 5 phases of alternative estimates procedure (4/5)
    Phase IV, model fitting: supervised ML classifier using a training set (data driven, not a deterministic choice of keywords)
      • To fit the model (machine learning) on the subset of enterprises where both Internet data and survey data were available (12,000), considering survey data as the true values (several classification approaches have been applied);
      • To apply the classifier to all 85,000 websites, predicting the values of the target variables for all the enterprises for which the retrieval and scraping of their websites was successful.
    Performance evaluation of classification algorithms - performance measures for classifiers:
      1. Accuracy: rate of correct predictions on the total of cases
      2. Sensitivity (or recall): rate of true positives on the total number of positives
      3. Precision: rate of true positives on the total number of positive predictions
      4. F1-score: harmonic mean of Sensitivity and Precision
    The best results have been obtained with Random Forest (RF) (information retrieval for target variable on link to SM profiles)
    [Diagram: doc terms matrix → predictors (12,000 for fitting, 85,000 for prediction) → predictions]
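The four performance measures listed on this slide can be computed from observed and predicted 0/1 values as in the sketch below. The function name and the toy vectors are hypothetical; in the procedure described, the observed values would be the survey answers treated as true values.

```python
def classifier_measures(observed, predicted):
    """Return accuracy, sensitivity (recall), precision and F1 for 0/1 labels."""
    pairs = list(zip(observed, predicted))
    true_pos = sum(1 for o, p in pairs if o == 1 and p == 1)
    true_neg = sum(1 for o, p in pairs if o == 0 and p == 0)
    false_pos = sum(1 for o, p in pairs if o == 0 and p == 1)
    false_neg = sum(1 for o, p in pairs if o == 1 and p == 0)

    accuracy = (true_pos + true_neg) / len(pairs)
    # guard against empty denominators when a class is never observed/predicted
    sensitivity = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    if sensitivity + precision:
        f1 = 2 * sensitivity * precision / (sensitivity + precision)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# Toy example: 8 enterprises, survey answers vs classifier predictions
toy_observed = [1, 1, 1, 0, 0, 0, 0, 1]
toy_predicted = [1, 1, 0, 0, 0, 1, 0, 1]
toy_measures = classifier_measures(toy_observed, toy_predicted)
```

When positives are rare, as for web ordering, accuracy alone can look good for a classifier that rarely predicts a positive, which is one reason to look at sensitivity, precision and F1 as well.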