Homepage: http://ReMaDDersoft.wix.com/ReMaDDer
ReMaDDer Software Tutorial
How to use ReMaDDer software for successful records matching, data
cleansing and data deduplication projects
11/20/2016
Revision 2.0.
Table of Contents
Introduction
  What Is ReMaDDer Software
  Fuzzy Match
  Records Linkage
  Data Deduplication
  ReMaDDer Software Advantages
  Prerequisites
  Revision History
Projects
  Projects Page
  Concept of “Left” and “Right” Dataset
  Record Matching Project vs. Data Deduplication Projects
  Copy A Project
Raw Data Import
  “Left” and “Right” datasets
  Import Raw Data
    Browse And Choose CSV files
    Register CSV Files
    Determine And Convert CSV File To UTF-8
    Edit Raw Datasource Schema Information
    Pre-process Raw Datasource
    Import Data From Raw Datasources
Solution Definition
  How ReMaDDer performs record linkage and data deduplication
  Solution Definition Header
    Solution Basic Information
    Machine Learning Strictness
    Join Type
    Return Only Best Matching Records
  Solution Definition Details
    Fields Picker
    Solution Constraints
Solution Execution
  Solution Execution In One Step
  Solution Execution In Two Major Steps
  Solution Execution In Several Minor Steps
Data Retrieving And Storing
  Execute Resultset Retrieval SQL Query
  Solution Status Info
  Save And Load Resultset
  Review And Edit Resultset
    Resultset Browsing
    Resultset Edit And Review
    Exporting Resultset
Customize Data Grids
Customize Splitters
ReMaDDer Software Trial
Commercial Release Code Purchase And Activation
Introduction
What Is ReMaDDer Software
ReMaDDer is record linkage and data cleansing software with powerful fuzzy record matching and data
deduplication capabilities, based on state-of-the-art machine learning and data processing techniques.
As a client-server application, ReMaDDer consists of two parts: a client front-end and a server-side part.
The client front-end provides a user-friendly graphical interface with intuitive means for project creation, raw
data import and solution definition, while the server-side part provides a powerful data processing engine that can
solve even the most complex fuzzy match analysis in reasonable time.
By combining advanced artificial intelligence with clever blocking techniques and multiple string similarity
metrics, ReMaDDer provides a unique solution for fully automatic record matching and data deduplication
projects.
Traditionally, fuzzy record matching software requires substantial human intervention, either to provide
various parameters and threshold values, or to perform extensive clerical review and supervised
machine learning training. A unique property of the ReMaDDer software is that it does not require any such
human assistance beyond project definition. There are no thresholds or other input parameters that the
user must provide to enable the software to distinguish between matches and non-matches; the
ReMaDDer software is able to infer and learn everything by itself.
As far as we are aware, ReMaDDer might be the only software currently available that is able to perform
fully automatic fuzzy record matching without human expert intervention, while attaining the accuracy of
human clerical review. This is accomplished by utilizing various advanced machine learning techniques and
approaches.
The name “ReMaDDer” is an acronym for “Records Matching and Data Deduplication Software”.
Homepage: http://ReMaDDersoft.wix.com/ReMaDDer
Fuzzy Match
The term “fuzzy match” refers to methods of identifying related records by measuring how similar they are. It
is used in cases where no unique identifier or exact match relation exists between two sets of data.
Fuzzy matching uses weights to calculate the probability that two given records refer to the same entity.
Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with
probabilities below the threshold are considered to be non-matches.
Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold
matching percentage set by the application.
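The threshold idea described above can be sketched in a few lines of Python. This is only an illustration, not ReMaDDer's algorithm: the standard-library `difflib.SequenceMatcher` ratio stands in for a similarity score, and the 0.8 threshold, the function names and the sample names are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Classify a pair as a match when its score reaches the threshold."""
    return similarity(a, b) >= threshold


# Hypothetical record pairs: one near-duplicate, one unrelated.
pairs = [
    ("Jonathan Smith", "Jonathon Smith"),
    ("Jonathan Smith", "Maria Gonzales"),
]
for left, right in pairs:
    print(left, "|", right, "|", round(similarity(left, right), 2), is_match(left, right))
```

Choosing the threshold is the hard part of this classic approach: too low and unrelated records are linked, too high and true matches with typos are missed.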
Records Linkage
Record linkage refers to the task of finding records that refer to the same entity across different
data sources, i.e. identifying related records in two separate data sets.
Record linkage is necessary when joining data sets is based on entities that may or may not share a common
identifier, as may be the case due to differences in record shape, storage location, and/or curator style or
preference.
There are many business cases where record linkage has to be performed. Some typical examples are
product price lists, partner lists, book and movie catalogs, customer loyalty databases, medical records etc.
Data Deduplication
Data deduplication refers to identifying duplicate records in a dataset and cleansing the dataset of
redundant information.
ReMaDDer Software Advantages
Due to its inherent complexity, fuzzy match analysis is a popular subject of scientific research and academic
papers. Some researchers even build their own software, but such programs tend to be complex and
require an understanding of advanced mathematics and algorithms in order to use
them. This cannot be expected from an average user who faces a data linkage problem and needs to
solve it in a matter of hours or days.
On the other hand, there are huge corporate entity resolution framework solutions, produced by big
software companies, oriented towards huge corporate customers. These solutions are often very complex
and affordable only to big companies and corporate users.
ReMaDDer places itself in the middle and provides a powerful fuzzy match record linkage solution for mere
mortals and regular office users.
By allowing users to define exact matching constraints, fuzzy matching constraints and all other constraints
in a visual and intuitive way, ReMaDDer hides the complexity of fuzzy match analysis from the user, who
can focus on the business case rather than on technical issues. That is where ReMaDDer software really shines
and clearly distinguishes itself from the competition.
Traditionally, fuzzy record matching software has required immense user involvement in project
parameterization and clerical review. The user is required either to provide various input parameters and
threshold values, or to perform machine learning training and provide examples of
matches and non-matches. In both cases, considerable user involvement and expertise are prerequisites for
successful analysis.
By contrast, the ReMaDDer software does not require such heavy user involvement, since it can figure
out optimal parameter values automatically, all by itself. This is accomplished by advanced artificial intelligence
utilizing various state-of-the-art machine learning techniques.
To summarize: advanced artificial intelligence, accompanied by an intuitive graphical user
interface and low pricing, is what makes ReMaDDer a superb fuzzy match record linkage solution.
Prerequisites
The major prerequisite for using ReMaDDer is an active internet connection, since the raw data is imported to a remote
server where it is processed. After the trial period expires, you are required to purchase a commercial release
code in order to continue using the remote server.
However, project and solution creation and editing can be performed even without an established connection
or a purchased release code, since these data are stored locally on your computer.
The ReMaDDer front-end client is available as an executable for Windows and Linux systems. Executables for
other systems can be provided on demand.
ReMaDDer does not operate directly on original data sources, but requires data to be imported from CSV
(comma separated values) flat files to the server, where the corresponding “left” and “right” database tables are
then created and processed. Therefore, you will have to provide source datasets as flat CSV files, encoded in
UTF-8, preferably with comma (“,”) or semicolon (“;”) field separators.
Revision History
Revision 1.0. (3/20/2016)
Initial release. Tutorial covers ReMaDDer version 1.0.
Revision 1.1. (5/10/2016)
Document is updated to reflect changes and improvements brought by ReMaDDer version 1.1.
The new version brings many improvements and simplifies solution definition. Instead of separately
choosing and defining thresholds for the trigram similarity and Levenshtein distance functions, a new
common similarity function (ReMaDDer_similarity) is introduced that combines both trigram and
Levenshtein similarity properties. This reduces complexity and uncertainty in solution definition creation,
while retaining ReMaDDer's strengths and advantages.
The previous ReMaDDer version output all columns from the left and right dataset into the resultset. Now
you can choose which fields are to be included in the resultset.
The raw data import process is also much improved, especially when importing data from Excel files (in CSV
format) where column names contain non-ASCII characters and blanks.
There are many small performance improvements and several bugfixes that improve the user experience
when using the ReMaDDer software for data match analysis.
Revision 2.0. (11/20/2016)
Document is updated to reflect major changes and improvements brought by ReMaDDer version 2.0.
The main changes are:
- Instead of using only the Levenshtein and trigram similarity functions, multiple other similarity metrics
  are added to the server engine.
- Matches and non-matches are no longer based on similarity thresholds. Instead, ReMaDDer now utilizes
  machine learning techniques. Advanced algorithms infer and automatically detect duplicates and record
  matches.
- Threshold parameters are removed as obsolete.
- The “Use composite field” parameter is removed as obsolete.
- The “Use inclusive OR” parameter is removed as obsolete.
- A new parameter “Machine Learning Strictness” is introduced. The parameter defines how strictly
  artificial intelligence will distinguish between matches and non-matches. The options are: match, strict
  match and potential match.
- A new parameter “Join Type” is introduced. The Join Type attribute determines how SQL joins between
  the left and right tables will be established, via the solution base table. There are three join options:
  a) inner join, b) left outer join, c) right outer join.
  The "inner join" option is the default behavior, meaning that the resultset will contain all rows from the
  left and right datasets which meet the matching criteria.
  With the "left outer join" option, the resultset will contain all rows from the left dataset and only those
  rows from the right dataset that satisfy the matching criteria.
  With the "right outer join" option, the resultset will contain all rows from the right dataset and only those
  rows from the left dataset that satisfy the matching criteria.
- A new parameter “Return Only Best Match” is introduced. The parameter can be True or False and
  determines whether the SQL query will return only the best matching record or multiple records
  satisfying the similarity criteria.
  Check this option if you wish to return only the best matching record for each left or right record, when
  using the corresponding left or right outer join.
  If this option is unchecked (default), multiple matching rows will be returned.
Projects
Projects Page
A project is the basic entity in ReMaDDer software. Each project contains the definition of two source datasets
to be imported and analyzed (the so-called "left dataset" and "right dataset"), as well as a variable number of
corresponding solutions, which are stored definitions of how to perform fuzzy match analysis.
On creation, each project is assigned a unique project tag. When raw data is imported to the server, the
corresponding input tables get that tag appended to their names. This way, imported tables always carry
the tag of their project, which ensures their uniqueness.
The “Projects” page consists of two sections separated by a movable splitter. The upper section contains
a datagrid view where you can browse and edit projects, while the lower section contains a form view
of the currently selected project. The same concept of datagrids and form views is implemented throughout the
application.
You can easily create new projects and edit or browse existing ones by using the navigator buttons.
Concept of “Left” and “Right” Dataset
Throughout the ReMaDDer application and this manual, we use the terms “left” and “right” dataset or table.
In every fuzzy match project, we always compare two tables, i.e. two datasets, inspecting the similarity of
their rows. For convenience, we call them the “left” and “right” table.
The purpose of entity resolution software, such as ReMaDDer, is to identify which records from the “left”
dataset correspond to which records from the “right” dataset.
ReMaDDer does not operate on original data sources directly, but requires data to be imported from source
CSV (comma separated values) flat files to the server, where the corresponding left and right database tables are
then created and processed.
Record Matching Project vs. Data Deduplication Projects
In ReMaDDer software, there is no fundamental difference between data deduplication and record
matching projects. In both cases we compare two datasets, trying to infer which records from the “left” dataset
correspond to which records in the “right” dataset.
The only difference between the two is that in a record matching project we have two different input
datasets to be compared, while in a data deduplication project we have to compare a dataset with
itself, in order to identify duplicate records in the dataset.
Since ReMaDDer software always compares two datasets, left and right, a data
deduplication project requires importing the same original CSV file twice: first as the left dataset and then as
the right dataset. The ReMaDDer software will thus create two identical tables with different names in the
underlying database.
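To illustrate why deduplication is just matching a dataset against itself, here is a minimal Python sketch. It is not ReMaDDer's implementation: the sample records, the 0.9 threshold and the `difflib`-based score are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical dataset that plays both the "left" and the "right" role.
records = [
    "Acme Corporation",
    "ACME Corporation ",
    "Globex Ltd",
]


def score(a: str, b: str) -> float:
    """Similarity in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def find_duplicates(rows, threshold=0.9):
    """Compare every row with every later row of the same list."""
    # combinations() yields each unordered pair once, so a row is never
    # compared with itself and (i, j) / (j, i) are not both reported.
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(rows), 2)
        if score(a, b) >= threshold
    ]


duplicates = find_duplicates(records)
print(duplicates)
```

When a dataset is compared with itself, every record trivially matches its own copy, so identity pairs have to be excluded in some way; the sketch does this by only pairing each row with later rows.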
Copy A Project
Instead of manually entering all the parameters for a new project, ReMaDDer allows you to copy an existing
project into another project. This action copies raw data import specifications as well as solution definitions.
Raw Data Import
Datasets to be analyzed are called "left" and "right" datasets and can be easily imported from source CSV
files, encoded in UTF-8.
The CSV ("Comma Separated Values") file format was chosen due to its ubiquity: practically all databases,
spreadsheet editors and other data sources can export to a CSV file.
The source data CSV files, however, must be UTF-8 encoded. Otherwise, the import will most likely fail.
Therefore, you must first ensure that the source data CSV files are properly UTF-8 encoded. ReMaDDer has
embedded tools for charset encoding detection and conversion, but you can also use Notepad++
(https://notepad-plus-plus.org/), CudaText (http://uvviewsoft.com/cudatext/) and other text
editors that are capable of detecting and converting file encodings.
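If you prefer to verify the encoding with a script before importing, a strict decode is a simple check. The sketch below is an independent illustration in Python, not one of ReMaDDer's embedded tools; the function name and sample files are hypothetical.

```python
import tempfile
from pathlib import Path


def is_utf8(path) -> bool:
    """Return True if the file's bytes form valid UTF-8 text."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Demonstration with two throwaway files (hypothetical sample data).
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "left.csv"
    bad = Path(tmp) / "right.csv"
    good.write_bytes("name;city\nJosé;Zürich\n".encode("utf-8"))
    bad.write_bytes("name;city\nJosé;Zürich\n".encode("latin-1"))
    good_is_utf8 = is_utf8(good)
    bad_is_utf8 = is_utf8(bad)

print(good_is_utf8, bad_is_utf8)
```

Note that this only validates UTF-8; it does not tell you which encoding a failing file actually uses. A pure ASCII file also passes, since ASCII is a subset of UTF-8.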
ReMaDDer provides a simple and intuitive tool for importing CSV files. It automatically detects
the field delimiter and column schema information. You can then edit the retrieved schema and
finally import the files to the server for further processing.
“Left” and “Right” datasets
In each data deduplication or record matching project, we always compare two datasets for matching of
records. In case of record matching projects, these two datasets correspond to two different input CSV files,
while in case of data deduplication projects, these two datasets are imported from the same input CSV file.
Nevertheless, we always have a so-called “left dataset” and “right dataset” to be compared. Think of this as
comparing the fingers of the left and right hand. You can easily identify the thumb on the left hand as related to
the thumb on the right hand, since they share a similar shape. It is obvious due to their physical similarity.
It is the same with fuzzy match analysis, where we compare fields from the left and right dataset in order to identify
string similarities. ReMaDDer internally uses various functions to measure string similarities, the results of
which are then processed by artificial intelligence to infer whether two records represent the same entity or not.
Import Raw Data
The process of importing raw data into the server database consists of several logical phases. First, we need to
identify the source CSV files for the “left” and “right” dataset. After the source files are identified, we need to ensure
that the CSV files are properly UTF-8 encoded. Once proper encoding is ensured, we need to retrieve
and specify schema information about the CSV files. In the last phase we actually perform the import from the source
files, according to the previously defined schema. As a result of the last step, the source files are imported into the
server-side database, where they can be processed according to various solution definitions.
On the “Data Import” page, there are two sub-pages: “Left Dataset Specification” and “Right Dataset
Specification”, in which we separately define input dataset specifications for the “left” and “right” dataset.
The import can be executed separately for the left and right dataset, or both can be imported at once, as a batch.
Browse And Choose CSV files
The first step in importing input CSV files is to choose the CSV files to be imported.
In the upper part of the “Left Dataset Specification” or “Right Dataset Specification” sub-page, there is a CSV file
browser dialog box.
You can browse CSV files on your computer by clicking the browse button. This opens a file
browser in which you can choose a CSV file. The absolute file path is then copied to the edit box.
Register CSV Files
The next step is to define the CSV file schema specification. We call this process “registering a CSV file”.
When you click the “Register CSV file” button near the file browser, the browsed CSV
file is examined for its columns and its schema information is then inserted into the corresponding list of
fields (columns).
ReMaDDer determines the field delimiter of the CSV file (normally either “;” or “,”) and
retrieves information about its columns.
If a column name has upper case characters, it is converted to lower case.
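The registration step can be pictured with a short Python sketch. How ReMaDDer actually detects the delimiter is not documented here, so the counting heuristic below is an assumption, as are the function names and the sample data; only the two candidate delimiters and the lower-casing of column names come from the description above.

```python
import csv
import io


def detect_delimiter(header_line: str, candidates=(";", ",")) -> str:
    """Pick the candidate that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


def read_schema(csv_text: str):
    """Return the detected delimiter and the lower-cased column names."""
    header_line = csv_text.splitlines()[0]
    delimiter = detect_delimiter(header_line)
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    columns = [name.strip().lower() for name in next(reader)]
    return delimiter, columns


# Hypothetical CSV content with a semicolon delimiter.
sample = "Customer Name;City;ZIP\nJohn Doe;Paris;75001\n"
delimiter, columns = read_schema(sample)
print(delimiter, columns)
```

A heuristic like this can be fooled by headers that contain the other delimiter inside quoted names, which is one reason to review the retrieved schema before importing.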
Currently, ReMaDDer treats all columns as text fields of various lengths. This is due to the fact that the comparison
is performed by string comparison functions, so other data types (e.g. datetime, integer, real etc.)
would not make sense for string comparisons.
Determine And Convert CSV File To UTF-8
In the previous ReMaDDer version, the program detected the encoding and converted the file to UTF-8
automatically, during CSV file registration. Although very convenient, this might have led to wrong results,
since the encoding detection function is not 100% reliable and sometimes guesses the encoding wrongly. This is
because charset detection is an inherently difficult task and there is no 100% reliable method. It is always
a kind of educated guess based on content inspection.
Therefore, we decided to remove automatic charset detection and conversion to UTF-8. You will have to do
it yourself and ensure that the source files are properly UTF-8 encoded. Charset detection, as well as file
encoding conversion to UTF-8, is still present as a ReMaDDer feature (and even improved), but you will have
to trigger it manually with the respective buttons, or by choosing it from the menu.
Another option is to use the embedded spreadsheet editor “Spready” to open and convert source files.
Alternatively, you can use various established tools, such as the Notepad++ text editor, that are able to
recognize the file encoding and perform the required conversion to UTF-8.
Determine And Convert CSV File Encoding, with embedded tool
After a CSV file is registered as the left or right dataset source, it can be analyzed with the embedded tool for
detecting charset encoding.
When you click the button “Determine Encoding of Left Dataset CSV File” or the button “Determine
Encoding of Right Dataset CSV File”, the respective CSV file will be analyzed for its encoding type by
two different embedded procedures. The result of the encoding analysis will be displayed in a corresponding pop-up
window.
If both functions agree that the encoding is UTF-8 (utf8), as in the example above, then the CSV file is in
the appropriate format for import.
If the result is not UTF-8, however, the CSV file must be converted to UTF-8 before importing!
You can convert the CSV file encoding to UTF-8 by clicking the button “Convert Encoding Of Left Dataset
CSV File” or “Convert Encoding Of Right Dataset CSV File”.
When the conversion action is triggered, ReMaDDer will first back up the original CSV file and then convert
the file encoding to UTF-8.
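The same back-up-then-convert sequence can be reproduced outside ReMaDDer once you know the source encoding. The following Python sketch is an assumption-laden illustration, not the embedded tool: the `.bak` suffix, the function name and the Latin-1 sample file are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def convert_to_utf8(path, source_encoding: str) -> Path:
    """Back up the file next to the original, then rewrite it as UTF-8."""
    src = Path(path)
    backup = src.with_name(src.name + ".bak")  # hypothetical backup naming
    shutil.copy2(src, backup)
    # Decode and re-encode the raw bytes so line endings stay untouched.
    text = backup.read_bytes().decode(source_encoding)
    src.write_bytes(text.encode("utf-8"))
    return backup


# Demonstration on a throwaway Latin-1 file (hypothetical sample data).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "left.csv"
    sample.write_bytes("name;city\nJosé;Zürich\n".encode("latin-1"))
    backup_path = convert_to_utf8(sample, "latin-1")
    converted_text = sample.read_bytes().decode("utf-8")
    backup_bytes = backup_path.read_bytes()

print(converted_text)
```

The conversion is only correct if `source_encoding` is the file's real encoding; decoding with the wrong one usually succeeds silently and produces garbled characters, which is why the backup matters.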
Determine And Convert CSV File Encoding, with embedded spreadsheet editor “Spready”
Besides the above-mentioned embedded encoding detection and conversion tool, ReMaDDer has an embedded
“Spready” spreadsheet editor (http://wiki.lazarus.freepascal.org/FPSpreadsheet), which can also be used
for file encoding conversion.
Determine And Convert CSV File Encoding, with external tools
Charset detection with the embedded tool is not 100% reliable, which is also true of any tool that
infers charsets.
If you encounter difficulties with the embedded charset detection and conversion tools, or you already know the
file encoding, you might try various external tools, of which I would recommend the well-established
Notepad++ text editor (https://notepad-plus-plus.org/).
Another interesting alternative is the CudaText text editor (http://uvviewsoft.com/cudatext/), which is
also capable of charset detection and conversion.
Edit Raw Datasource Schema Information
Once you have retrieved schema information from a CSV file, you might conclude that you don’t want to import
all columns, but only a subset of fields.
You can edit the schema by using the corresponding data grid navigator buttons.
If you wish to delete the currently selected field from the schema, just click the delete button.
If you wish to restore the original column schema, just click the “Get Fields Schema”
button and the column list will be repopulated from the CSV file.
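Conceptually, trimming the schema means that only the listed columns are carried over from the CSV file. A minimal Python sketch of that idea follows; the function name, the column names and the sample rows are hypothetical and do not describe ReMaDDer's importer.

```python
import csv
import io


def select_columns(csv_text: str, keep, delimiter: str = ";"):
    """Return rows as dicts restricted to the columns listed in `keep`."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [{name: row[name] for name in keep} for row in reader]


# Hypothetical source file with one column we do not want to import.
sample = "name;city;internal_note\nJohn Doe;Paris;vip\nAnna Roe;Rome;\n"
rows = select_columns(sample, keep=["name", "city"])
print(rows)
```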
Pre-process Raw Datasource
While defining the import schema specification, you might realize that the input data needs some pre-processing
before being imported to the server for further analysis.
Of course, you can edit input CSV files with any spreadsheet editor (such as LibreOffice or OpenOffice
Calc, Gnumeric or Microsoft Excel) or text editor (such as Notepad, Notepad++, ConText, Gedit,
CudaText, Geany or Leafpad), but you can also use the embedded spreadsheet editor “Spready”.
You can launch the external default spreadsheet editor by clicking the button “Open CSV File in Ext.
Editor”.
You can launch the embedded spreadsheet editor by clicking the button “Open CSV File In Int. Editor”.
This will open the embedded spreadsheet editor “Spready”
(http://wiki.lazarus.freepascal.org/FPSpreadsheet).
Import Data From Raw Datasources
The final step in source data import is executing the import procedure, by clicking the appropriate button or
triggering the action from the respective menu.
We can execute the import separately for the left and right datasets, by clicking the corresponding buttons “Import
left dataset CSV file” or “Import right dataset CSV file”, or we can import them both at once by
clicking the button “Import both CSV files to server”.
When you click the import button, ReMaDDer will automatically open the “Import Log” page, where you
can watch the progress of the import.
Import speed depends on the file size and, most importantly, on internet connection quality.
Solution Definition
A solution definition is a set of parameters for performing record linkage or data
deduplication analysis. Each project can have many solutions with different specifications, so you can test
which combination of parameters leads to the best results.
Each solution definition consists of a solution header specification and a solution constraints
specification.
The solution header specification contains general info about the solution and defines important parameters
that determine how record matching analysis will be performed. These parameters are: “machine
learning strictness”, “join type” and “return only best match”.
The solution constraints specification consists of an exact match relations section, a fuzzy match relations
section and an other constraints section.
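The structure of a solution definition can be modelled as a small set of Python data classes. This is only a reading aid: the class and field names are invented for the sketch and do not reflect ReMaDDer's internal storage format. The option values and defaults are the ones named in this tutorial.

```python
from dataclasses import dataclass, field
from enum import Enum


class Strictness(Enum):
    MATCH = "match"  # default option
    STRICT_MATCH = "strict match"
    POTENTIAL_MATCH = "potential match"


class JoinType(Enum):
    INNER = "inner join"  # default option
    LEFT_OUTER = "left outer join"
    RIGHT_OUTER = "right outer join"


@dataclass
class SolutionHeader:
    name: str
    strictness: Strictness = Strictness.MATCH
    join_type: JoinType = JoinType.INNER
    return_only_best_match: bool = False


@dataclass
class SolutionConstraints:
    # Each exact/fuzzy constraint pairs a left field with a right field.
    exact_match: list = field(default_factory=list)
    fuzzy_match: list = field(default_factory=list)
    other: list = field(default_factory=list)


@dataclass
class SolutionDefinition:
    header: SolutionHeader
    constraints: SolutionConstraints = field(default_factory=SolutionConstraints)


# Hypothetical solution: exact match on country, fuzzy match on company name.
solution = SolutionDefinition(header=SolutionHeader(name="customers_v1"))
solution.constraints.exact_match.append(("country", "country"))
solution.constraints.fuzzy_match.append(("company_name", "name"))
print(solution)
```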
Solution definition page (page “Record Matching Analysis”, sub-page “Solution Definition”):
As with other pages, the “Solution” page is divided into two sections: a datagrid view and a form view.
For a better user experience, the form view is additionally divided into several tabs and sub-tabs. The main tabs are:
“Solution Definition” and “Solution Result”.
The “Solution Definition” tab is further divided into: “Solution Header”, “Solution Fields Picker” and “Solution
Constraints”.
The “Solution Header” tab is divided into several sub-tabs: “Common”, “Solution Base Table Creation Query
Info” and “Solution Resultset Retrieval Query Info”.
The “Solution Constraints” tab is divided into sub-tabs: “Exact Match Constraints”, “Fuzzy Match Constraints”
and “Other Constraints”.
How ReMaDDer performs record linkage and data deduplication
For each project we can define one or more solutions. A solution consists of a solution definition and a solution
resultset.
A solution definition is a specification that instructs ReMaDDer how to perform record linkage or data
deduplication analysis in order to retrieve the resultset.
We can define three types of solution constraints: exact match constraints, fuzzy match constraints and other
constraints.
Fuzzy match constraints define field pairs from the left and right dataset to be compared for fuzzy string
similarity. In order to infer record similarity, ReMaDDer utilizes various string similarity metrics, along
with powerful machine learning algorithms.
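Two of the string similarity metrics named in this tutorial's revision history are trigram similarity and Levenshtein distance. The Python sketch below shows textbook versions of both; it is not ReMaDDer's server code. The trigram padding (two leading spaces, one trailing space) follows the convention of PostgreSQL's pg_trgm module and is an assumption here.

```python
def trigrams(text: str) -> set:
    """Set of 3-character substrings of the padded, lower-cased text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (Jaccard index)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(round(trigram_similarity("Jonathan", "Jonathon"), 2))
```

The two metrics react differently to errors: Levenshtein distance counts individual edits, while trigram similarity is more tolerant of reordered or partially shared words, which is why combining several metrics can be useful.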
Advanced artificial intelligence automatically infers record links or duplicates and creates the solution base
table.
The final step is resultset retrieval, in which the database engine creates and executes an SQL query that joins the left
and right dataset with the solution base table, outputting the resultset. The retrieved resultset can be exported
to a spreadsheet or flat file.
Solution Definition Header
Solution definition header contains general solution definition parameters and info about solution
execution status.
Solution definition header (whole page):
Solution definition header (datagrid view):
Solution definition header (form view):
The solution definition header can be entered either through the datagrid or through the form view, which
shows the currently selected solution.
Solution Basic Information
Basic information about a solution is shown in the fields “Solution Name”, “Solution Tag”, “Solution Base Table
Name”, “Tag Assigned”, “Solution Status” and “Solution Comment”.
The Solution Tag is an automatically generated designation which is appended to each solution name by default
and is also used to form the Solution Base Table name.
The Solution Base Table Name is formed automatically from the Solution Name and the Solution Tag. The Solution
Tag ensures that the name of the created solution base table is unique on the server.
Solution Status and Solution Comment are fields in which the user can enter arbitrary additional information.
Machine Learning Strictness
The parameter “Machine Learning Strictness” defines how strictly the artificial intelligence distinguishes
between matches and non-matches during fuzzy matching. The possible values are: a) match, b) strict match,
c) potential match.
"Match" is the default option. The resultset contains a balanced ratio of true positives to false
positives: it tends to include all true positives, along with some false positives and very few
false negatives.
"Strict match" is the strictest option. The resultset tends to contain only true positives, but because
false negatives are more frequent, some genuine matches may be missed.
"Potential match" is the most lenient option. The resultset tends to contain all true positives, but many
false positives as well.
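The trade-off between the three options can be illustrated with the standard precision and recall measures. The sketch below is only an illustration: the function name and all counts are made-up assumptions chosen to mirror the descriptions above, not figures produced by ReMaDDer.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) for a set of reported matches."""
    precision = true_pos / (true_pos + false_pos)  # share of reported matches that are real
    recall = true_pos / (true_pos + false_neg)     # share of real matches that were reported
    return precision, recall


# Hypothetical counts for the same data analysed with each strictness option.
illustrative_counts = {
    "strict match": {"true_pos": 80, "false_pos": 1, "false_neg": 20},
    "match": {"true_pos": 97, "false_pos": 8, "false_neg": 3},
    "potential match": {"true_pos": 100, "false_pos": 40, "false_neg": 0},
}

for option, counts in illustrative_counts.items():
    precision, recall = precision_recall(**counts)
    print(f"{option:16s} precision={precision:.2f} recall={recall:.2f}")
```

With numbers like these, "strict match" gives the highest precision and the lowest recall, "potential match" the opposite, and "match" sits in between.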
Join Type
The “Join Type” attribute determines how the SQL joins between the left and right tables are established via
the solution base table. There are three join options: a) inner join, b) left outer join, c) right outer join.
The "inner join" option is the default. The resultset contains only those rows from the left and right
datasets that meet the matching criteria.
With the "left outer join" option, the resultset contains all rows from the left dataset and only those rows
from the right dataset that satisfy the matching criteria.
With the "right outer join" option, the resultset contains all rows from the right dataset and only those rows
from the left dataset that satisfy the matching criteria.
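The effect of the three join types can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the solution base table is represented as a list of (left id, right id) pairs, which is an assumption about its shape, and the function name and sample data are hypothetical.

```python
# Hypothetical sample data: record id -> company name.
left = {1: "Acme Corp", 2: "Globex Inc", 3: "Initech"}
right = {10: "ACME Corporation", 20: "Globex Incorporated", 30: "Umbrella"}

# Assumed shape of the solution base table: pairs of matched record ids.
base_table = [(1, 10), (2, 20)]


def join_via_base_table(left, right, base_table, join_type="inner"):
    """Join left and right records through the matched id pairs."""
    rows = [(left[l_id], right[r_id]) for l_id, r_id in base_table]
    if join_type == "left outer":
        # Add left records that have no match, padded with None.
        matched = {l_id for l_id, _ in base_table}
        rows += [(name, None) for l_id, name in left.items() if l_id not in matched]
    elif join_type == "right outer":
        # Add right records that have no match, padded with None.
        matched = {r_id for _, r_id in base_table}
        rows += [(None, name) for r_id, name in right.items() if r_id not in matched]
    elif join_type != "inner":
        raise ValueError(f"unknown join type: {join_type}")
    return rows


for join_type in ("inner", "left outer", "right outer"):
    print(join_type, join_via_base_table(left, right, base_table, join_type))
```

The inner join returns only the two matched pairs, while each outer join additionally keeps the unmatched record from its own side.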
Return Only Best Matching Records
The parameter “Return Only Best Match” can be True or False. It determines whether the SQL query returns
only the best matching record or all records that satisfy the similarity criteria. It is used as a
modifier to a left outer join or right outer join.
If this option is unchecked (default), multiple matching rows are returned. If it is checked, only the best
matching item from the slave dataset is joined to the corresponding record in the master dataset.
Check this option if you want only the best matching record for each left or right record, when
using a left or right outer join and the datasets are in a master/slave relation.
With the “inner join” join type, this parameter has no meaning and is ignored.
A typical use case for a left or right outer join with the “return only best match” option is matching
two product price lists, one of which is the master list.
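The idea of keeping only the best match per master record can be sketched as follows. The sketch assumes that every candidate pair carries a numeric similarity score; the tutorial does not say how ReMaDDer ranks matches, so the score values, record ids and function name are all hypothetical.

```python
def best_matches(scored_pairs):
    """Keep only the highest-scoring slave record for each master record."""
    best = {}
    for master_id, slave_id, score in scored_pairs:
        # Replace the stored candidate only if this one scores strictly higher.
        if master_id not in best or score > best[master_id][1]:
            best[master_id] = (slave_id, score)
    return {master_id: slave_id for master_id, (slave_id, _) in best.items()}


# Hypothetical candidate pairs: (master id, slave id, similarity score).
scored_pairs = [
    ("M1", "S1", 0.95),
    ("M1", "S2", 0.80),
    ("M2", "S3", 0.70),
]

print(best_matches(scored_pairs))
```

With the option unchecked, master record M1 would be joined to both S1 and S2; with it checked, only the higher-scoring S1 remains.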
Solution Definition Details
While the solution definition header defines general parameters for the fuzzy match analysis, the solution
definition details are set in the Field Picker sub-page and in the Solution Constraints sub-page. The latter
has three sections that define solution constraints: the Exact Match Relations section, the Fuzzy Match
Relations section and the Other Constraints section.
Fields Picker
ReMaDDer provides a simple visual tool for adding field pairs to the exact match relations section,
the fuzzy match relations section or the other constraints section.
With the fields of the input datasets ("left" and "right") listed side by side, you can browse the two
lists, visually establish field pairs and send them to the appropriate constraints section by clicking
the appropriate button.
You can add the selected field pair to the exact match section by clicking the button “Add Fields Pair To Exact
Match Relations Section”.
You can add the selected field pair to the fuzzy match section by clicking the button “Add Fields Pair To Fuzzy
Match Relations Section”.
You can add a left or right dataset field to the other constraints section by clicking the respective button.
By letting you build solution constraints visually instead of typing them in manually,
ReMaDDer simplifies the creation of a solution definition.
Starting from ReMaDDer version 1.1, the checkbox column “Output Field to Resultset?” is available in the
Field Picker datagrid. It is used to include fields in, or exclude them from, the resultset. By default,
all fields are included in the resultset.
Solution Constraints
There are three types of constraints that you can define for a solution: exact match relations, fuzzy match
relations and other constraints.
Exact Match Relations
In the exact match relations section, we can add field pairs from the "left" and "right" imported datasets and
define whether they must be equal (=) or not equal (<>).
If we can define an exact match relation on one or more field pairs, we can greatly increase the speed of the
analysis, because it narrows down the number of record pair combinations to be analyzed for fuzzy match.
Therefore, it is recommended to use exact match relations whenever possible.
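The following minimal sketch shows why an exact match relation reduces the work. It is not ReMaDDer's implementation: the record layout, field names and function name are hypothetical, and the grouping technique shown is ordinary key-based blocking.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample records.
left_records = [
    {"id": 1, "city": "Zagreb", "name": "Ivan Horvat"},
    {"id": 2, "city": "Zagreb", "name": "Ana Kovac"},
    {"id": 3, "city": "Split", "name": "Marko Babic"},
]
right_records = [
    {"id": 10, "city": "Zagreb", "name": "I. Horvat"},
    {"id": 20, "city": "Split", "name": "M. Babic"},
    {"id": 30, "city": "Rijeka", "name": "Petra Novak"},
]


def candidate_pairs(left, right, exact_key=None):
    """Return the record pairs that would still need fuzzy comparison."""
    if exact_key is None:
        # No exact match relation: every left/right combination is a candidate.
        return list(product(left, right))
    # Group right records by the exact-match field, then pair each left record
    # only with right records that share the same value.
    by_key = defaultdict(list)
    for record in right:
        by_key[record[exact_key]].append(record)
    return [(l, r) for l in left for r in by_key.get(l[exact_key], [])]


print(len(candidate_pairs(left_records, right_records)))          # all combinations
print(len(candidate_pairs(left_records, right_records, "city")))  # equal city only
```

On this tiny sample the exact relation on "city" cuts the candidates from 9 pairs to 3; on real datasets the reduction is usually far larger.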
Fuzzy Match Relations
In the solution header section we can set various general parameters that determine how the fuzzy match
analysis is performed: we can choose the machine learning strictness, the join type and whether all matches
or just the best matches are returned.
In the fuzzy match relations section we provide the details of the fuzzy match comparison analysis. We list
the field pairs to be compared and further define how the fuzzy match analysis is performed.
Relative Field Weight
For each field pair that is compared for similarity, we have to define its relative weight. The bigger
the weight, the more the similarity of that field pair counts in the final decision whether two
records match.
The weight for a particular field pair is entered as an arbitrary integer value in the field “Field Weight
(integer)”, and ReMaDDer then calculates its relative weight. The sum of the relative weights is always 1.
When a new field pair is added to the fuzzy match relations section, it gets the default integer weight
of one (1). You can change this value to any bigger integer and ReMaDDer will
recalculate the relative weights so that their sum remains 1.
There is also a graphical indicator that shows the relative weight
of the currently selected field pair.
There are two buttons provided: “Recalculate Weights” and “Reset Weights”.
The button “Reset Weights” resets all integer field weights to 1, which is the same as if relative weights
were not used at all. In that case, all field pairs are treated as equally important.
The button “Recalculate Weights” recalculates the relative weights according to the integer
values entered in the field “Field Weight (integer)”. You don’t need to trigger this action manually, since
it is triggered automatically on each change of an integer value and on each field pair addition.
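The relation between integer weights and relative weights can be sketched as a simple normalization. The tutorial only states that the relative weights sum to 1; dividing each integer weight by the total is the assumption made here, and the field names and function name are hypothetical.

```python
def relative_weights(integer_weights):
    """Normalize integer field weights so that the relative weights sum to 1."""
    total = sum(integer_weights.values())
    return {field_pair: weight / total for field_pair, weight in integer_weights.items()}


# Hypothetical field pairs with user-entered integer weights.
weights = {"name": 3, "address": 2, "phone": 1}
print(relative_weights(weights))

# After a reset every integer weight is 1, so all field pairs count equally.
reset = {field_pair: 1 for field_pair in weights}
print(relative_weights(reset))
```

In this example the "name" pair contributes half of the total weight, while after a reset each of the three pairs contributes one third.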
Other Constraints
As with exact match relations, it is desirable to limit the analysis to a particular subset of the data. Such
constraints can greatly increase the speed of the record linkage or data deduplication analysis.
We can define any custom constraint to be applied to a particular field from the "left" or "right" dataset.
Normally, the condition has the form sometable.somefield = ‘some string’, but other operators, such as LIKE,
can be used as well.
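The two kinds of condition can be tried out with Python's built-in sqlite3 module. This is only an illustration of the SQL operators: the in-memory SQLite database stands in for the server database, and the table name, field names and data are hypothetical.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE left_dataset (id INTEGER, country TEXT, company TEXT)")
conn.executemany(
    "INSERT INTO left_dataset VALUES (?, ?, ?)",
    [
        (1, "Croatia", "Adria Trade"),
        (2, "Croatia", "Zagreb Motors"),
        (3, "Austria", "Adria Holding"),
    ],
)

# Equality constraint: keep only records from one country.
equals_rows = conn.execute(
    "SELECT id FROM left_dataset WHERE left_dataset.country = 'Croatia' ORDER BY id"
).fetchall()

# LIKE constraint: keep only companies whose name starts with 'Adria'.
like_rows = conn.execute(
    "SELECT id FROM left_dataset WHERE left_dataset.company LIKE 'Adria%' ORDER BY id"
).fetchall()

print(equals_rows)
print(like_rows)
conn.close()
```

Either condition restricts the analysis to a subset of one dataset before any fuzzy comparison takes place.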
Solution Execution
Once a solution definition is prepared by setting the global parameters and the exact match, fuzzy match and
other constraints, you can execute the solution on the remote server and retrieve the resultset. Solution
execution has two outcomes: the solution base table is created on the server and the resultset is retrieved to
the client.
Solution execution is a sequence of two steps, which can be executed in batch or separately.
The first step is solution base table creation on the server, which is a prerequisite for the next step,
resultset retrieval on the client.
The first step, in which the Solution Base Table is created, is the most critical point in the ReMaDDer
application (and the most resource and time demanding, too). In this step, a sequence of several
underlying procedures is triggered that determines the solution space from which the final resultset is
retrieved.
This step is composed of several discrete sub-steps.
The first sub-step is the so-called “blocking” procedure, a method that reduces the space of combinations
to be further analyzed for string similarity. This sub-step is of great importance, since fuzzy match
analysis is inherently time-consuming and analyzing all possible combinations would take an extremely
long time.
In the next sub-step, string similarity is calculated between the left and right dataset records.
ReMaDDer uses multiple string similarity functions, some of which are quite resource demanding.
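The tutorial does not name the similarity functions ReMaDDer uses. As an illustration only, the sketch below shows two widely known measures: a Levenshtein edit distance scaled to the range 0 to 1, and the ratio computed by Python's standard difflib.SequenceMatcher. The function names are hypothetical and the choice of measures is an assumption.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Edit-distance similarity: 1.0 for identical strings, 0.0 for nothing in common."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def ratio_similarity(a: str, b: str) -> float:
    """Similarity ratio from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


for left_value, right_value in [("Acme Corp", "ACME Corporation"), ("kitten", "sitting")]:
    print(left_value, "|", right_value,
          round(normalized_similarity(left_value, right_value), 2),
          round(ratio_similarity(left_value, right_value), 2))
```

The inner loop of the edit distance runs once per character pair, which gives a sense of why comparing many long fields across many record pairs becomes expensive.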
After string similarity is established for all combinations in the solution space, machine learning
algorithms take the results from the previous sub-step and infer record linkage or detect duplicates. This is
the core of the approach that ReMaDDer uses for entity resolution.
Unlike competing software, ReMaDDer does not require any user involvement in this step. There is
no need to provide examples of matches and non-matches, nor to provide any threshold value that
would distinguish matches from non-matches. ReMaDDer determines record
linkage automatically, without the need for a human domain expert or clerical review.
As far as we are aware, no other software currently available on the market is capable of performing
such automatic record linkage inference by artificial intelligence, with accuracy reaching that of human
clerical review.
Technically, a solution can be executed in three different ways:
A) In one step
In this scenario, both major steps (solution base table creation and resultset retrieval) are executed at once.
B) In two major steps
In this scenario, the major steps (solution base table creation and resultset retrieval) are executed one by
one, in consecutive order.
C) In several minor steps
In this scenario, both major steps are executed as a sequence of several distinct minor steps.
In the simplest scenario, you execute the solution in one step. On the Solution definition header, as well as
in the corresponding “Solution Header” menu entry, there are two buttons. The button “Execute Solution”
executes both steps at once, in batch, while the button “Prepare And Execute Result SQL Query”
executes only the last step, i.e. resultset retrieval.
You must click the button “EXECUTE SOLUTION” at least once in order to create the
underlying Solution Base Table on the server, which is a prerequisite for the second step, resultset retrieval.
The first step, solution base table creation, can be extremely resource and time demanding. Depending
on the record counts of the left and right datasets, the number of field pairs compared for string similarity
etc., it can take anything from 30 seconds to 24 hours or even more. The time required grows much faster than
linearly with the record count: in the worst case, the number of record combinations to analyze is the
product of the record counts of the left and right datasets.
The number of field pairs matters as well. Analyzing only one field pair is not the same as comparing 9 field
pairs for fuzzy match. Fuzzy match analysis is inherently complex and time consuming.
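A rough worst-case count makes the growth concrete. Without blocking or exact match relations, every left record is paired with every right record, so the number of pairs is the product of the two record counts. The sketch below assumes one string comparison per record pair and field pair; the function names are hypothetical and the figures are upper bounds, not measured ReMaDDer workloads.

```python
def record_pair_count(left_records: int, right_records: int) -> int:
    """Worst-case number of left/right record combinations (no blocking)."""
    return left_records * right_records


def comparison_count(left_records: int, right_records: int, field_pairs: int) -> int:
    """Worst-case number of string comparisons: one per record pair and field pair."""
    return record_pair_count(left_records, right_records) * field_pairs


for size in (1_000, 10_000, 100_000):
    one_pair = comparison_count(size, size, 1)
    nine_pairs = comparison_count(size, size, 9)
    print(f"{size:>7} x {size:>7} records: {one_pair:>15,} comparisons with 1 field pair, "
          f"{nine_pairs:>15,} with 9 field pairs")
```

Ten times more records on each side means a hundred times more record pairs, which is why blocking and exact match relations matter so much for execution time.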
Solution Execution In One Step
The simplest way to execute a solution is to run the analysis in one step, by clicking the button
“EXECUTE SOLUTION” or by choosing the corresponding menu item.
This action forces (re)creation of the solution base table on the server, from scratch, and then prepares and
executes the resultset retrieval SQL query.
Be aware that solution base table (re)creation is a costly action that can take considerable time to
complete. If the left or right dataset contains millions of records, it might take an extremely long time.
Therefore, it is preferable to (re)create the solution base table only when necessary.
Solution Execution In Two Major Steps
Besides simple execution in one step, it is possible to execute a solution in two major steps.
In this scenario, the first step is solution base table creation on the server, which is a prerequisite for
the next step, resultset retrieval on the client.
Once the solution base table is created, you can easily change the machine learning strictness or join
type, or choose whether to return only the best match. For these changes you don’t need to re-trigger the
time-consuming solution base table recreation; it is enough to re-trigger only the second step. That is
exactly why the button “Prepare And Execute Solution Result SQL Query” is provided.
On the Solution definition header, as well as in the corresponding “Solution Header” menu entry, there is the
button “Prepare And Execute Result SQL Query”, which executes only the last step, i.e. resultset
retrieval. You can use it if a proper solution base table has already been created on the server.
Solution Execution In Several Minor Steps
If an appropriate solution base table has not yet been created, or the solution definition has changed so
that it needs to be recreated, you have to (re)create the solution base table first and then execute the
resultset retrieval query.
Besides executing everything in one step, there is also a fine-grained differentiation of these sub-steps,
available in the “Solution Definition” menu entry.
In the fine-grained differentiation of steps, we distinguish the following separate actions:
• “Prepare Solution Base Table SQL Query” --> prepares the SQL query for the solution
base table, but does not execute it.
• “Execute Solution Base Table SQL Query (Create Solution Base Table)” --> executes
solution base table creation.
• “Prepare Solution Result SQL Query With Forced Base Table (Re)creation” --> forces
recreation of the solution base table on the server and then retrieves the resultset.
• “Prepare Solution Result SQL Query With Check Whether To Create Base Table” --> checks
whether the solution base table has to be recreated and recreates it only if necessary. Then the
resultset is retrieved.
• “Prepare Solution Result SQL Query” --> only prepares the SQL query that will retrieve the resultset,
but does not execute it.
• “Prepare And Execute Solution Result SQL Query” --> prepares and executes the SQL query that
retrieves the resultset.
• “Execute Solution Result SQL Query (Retrive Resultset)” --> executes the already prepared SQL
query that retrieves the resultset.
These fine-grained actions are accessible only from the menu, because a casual user will rarely need them. For
a regular user, it is enough to remember that the solution base table must be created before the
resultset can be retrieved.
If the solution base table has already been created, you don’t need to recreate it for a different
combination of the “machine learning strictness”, “join type” and “return only best match” parameters. It is
enough to use the “Prepare And Execute Solution Result SQL Query” button.
If you’re in doubt, the simplest and safest way to execute a solution and retrieve the
resultset is to click the “EXECUTE SOLUTION” button.
Data Retrieving And Storing
You can launch previously prepared solution SQL queries and retrieve resultsets by clicking the button
“Prepare And Execute Result SQL Query”.
Alternatively, you can execute the solution in one step, which includes both solution base table creation and
execution of the resultset retrieval SQL query, with the button “EXECUTE SOLUTION”.
In both cases, once the resultset is retrieved, it is stored locally on your computer and you can load it
again at any time.
You can easily browse, edit and analyze results in many different ways, including datasheet forms with
sophisticated data searching, filtering and navigation capabilities.
Execute Resultset Retrieval SQL Query
The resultset retrieval query is executed by clicking the button “Execute Solution” or by clicking the button
“Prepare And Execute Solution”, which can be used if the solution base table has already been created.
The difference is that the “Execute Solution” action (re)creates the underlying solution base table and then
executes the SQL query that joins the left and right datasets with the solution base table, while the action
“Prepare And Execute Results SQL Query” performs only the last step. The prerequisite for using the latter is
that the solution base table has already been created.
When the action is triggered, the previously prepared SQL query text is sent to the server for execution. The
progress of query execution can be monitored on the “Solution Log” page.
The retrieved resultset is automatically opened in a separate form.
Solution Status Info
ReMaDDer automatically updates the solution status when the solution base table creation query and the
resultset retrieval query are prepared and executed. This status information is shown both in the
solution header data grid and in the form view, in the respective tabs.
You get various information about the solution base table creation process, such as: whether the solution
base table has been created, whether the creation query has already been executed, whether the solution base
table is empty, and the query execution times.
You also get various information about the resultset retrieval process, such as: whether the
resultset retrieval SQL query has been generated (prepared), whether the SQL query has already been executed,
whether the resultset has been retrieved and, if so, whether it was empty. It is also shown whether
the resultset is stored locally and in which file. Execution times and the number of
executions performed are shown as well.
Save And Load Resultset
Once a solution is executed and the results are retrieved, the resultset is automatically saved as a local file
in the ReMaDDer installation folder, subfolder “/data/results”.
The resultset can be loaded into the subpage “Solution Result” of the main form by clicking the button “Load
Solution Resultset”, or into a separate form by clicking the button “Load Solution Resultset In
Separate Window”.
Review And Edit Resultset
There are various ways you can post-process and review the retrieved resultset.
Resultset Browsing
You can easily browse, edit and analyze the loaded resultset in a data grid form. The datasheet offers
sophisticated data searching, filtering and navigation capabilities.
You can scroll by using mouse, vertical and horizontal sliders and arrows.
You can also browse records by using navigation buttons.
Resultset Searching
You can easily search for any particular value in any column. In the upper left corner of the datagrid
there is a small button represented by an orange double arrow. This button opens a pop-up dialog
with various search, filter and customization options, one of which is “Find data”.
When you click the “Find data” button, a search dialog box appears. You can search for any value in
any column.
Resultset Filtration
You can easily filter by any column. In the upper left corner of the datagrid there is a small button
represented by an orange double arrow.
This button opens a pop-up dialog with various search, filter and customization options, including “Filter
data” and “Filter in table”, which are two different ways of filtering a datagrid.
Filter Data
When you click the button “Filter data”, a dialog box appears on which you can build your filtering
conditions. This way you can define complex multicolumn filters.
The filtering is then applied by clicking “Apply” button.
Filter In Table
Another option for filtration is to use the button “Filter in table”, which activates a filtration
combobox, which is placed just below each column’s title. When you click on the filtration combobox cell,
a combobox list appears, listing all possible values for respective column. When you select a value, the
respective column is automatically filtered by the chosen value.
Resultset Sorting
You can sort ascending or descending on any column by clicking column title.
Resultset Edit And Review
You can easily edit the resultset in the datagrid. You can delete a row by using the delete button,
or edit a record by clicking the edit button.
Exporting Resultset
Besides using the datagrid controls, another option for resultset post-processing is to export the resultset
into a spreadsheet and then review and edit it in a spreadsheet editor.
ReMaDDer offers several ways of exporting a resultset to a spreadsheet.
Exporting Resultset To Spreadsheet
The resultset can be exported to a CSV file by clicking the button “Export To CSV File”.
The resultset can be exported to an XLSX file by clicking the button “Export To XLSX File”.
The resultset can be exported to an XLS file by clicking the button “Export To XLS File”.
The resultset can be loaded directly into your default spreadsheet editor, e.g. LibreOffice Calc or Microsoft
Excel, by clicking the button “Load In Ext. Spreadsheet Editor”.
ReMaDDer also has its own embedded spreadsheet editor which can be used for resultset post-processing.
Resultset can be loaded into the embedded spreadsheet editor by clicking the button “Load As
Spreadsheet”.
Exporting Datagrid To Spreadsheet
Another possibility for exporting resultset into a spreadsheet file is to use datagrid’s exporting feature.
You have to browse to the destination folder for the export and enter the exported file name and extension,
as well as the page name (sheet name). If you forget to specify the “page name”, you will get an error.
Customize Data Grids
ReMaDDer enables you to customize the user interface to a certain extent. You can shrink or stretch
columns, rearrange their order and hide/unhide columns.
Resize columns by dragging the vertical splitters between columns.
Rearrange columns by pressing the left mouse button on a column’s title and dragging the column while the
mouse button is held down. After the column is moved to its new position, release the mouse button.
You can define which columns are shown and which are hidden by clicking the button “Select visible
columns”.
When you close the application, your customization is saved (remadder_props.xml file) and when you
open the application again, your customizations will be loaded as well.
Customize Splitters
You will notice that various sections are divided by splitters, which you can drag to resize
the corresponding sections.
The customization you make is saved when the application closes and reloaded when it starts.
ReMaDDer Software Trial
The ReMaDDer client application is distributed as shareware with a 15-day trial period.
The trial period is initialized the first time the application is started on your computer.
Commercial Release Code Purchase And Activation
After the trial period expires, you are required to purchase a commercial release code in order
to continue using server features, such as raw data import and query execution.
You can, however, continue creating and editing projects and solution definitions, as well as loading and
editing previously acquired resultsets.
When purchasing a release code, you are required to enter your MachineID in the purchase form. The MachineID is
a tag generated by ReMaDDer software and is unique to your hardware. The purchased commercial release
code is thus machine-specific and valid only for your hardware.
Once you have purchased a release code, activate it by clicking the button “Activate Commercial Release
Code”.
You are then asked to enter the release code.
The entered release code is then validated and, if it is correct, the server-side features are unlocked for
you.

More Related Content

PDF
XA Secure | Whitepaper on data security within Hadoop
PDF
Protecting Global Records Sharing with Identity Based Access Control List
PDF
Opentext Decisiv
PDF
A Survey on Access Control Scheme for Data in Cloud with Anonymous Authentica...
PDF
Brief introduction to NoSQL by fas mosleh
PDF
PROVABLE DATA PROCESSING (PDP) A MODEL FOR CLIENT'S SECURED DATA ON CLOUD
DOCX
facilitating document annotation using content and querying value
DOCX
JPJ1421 Facilitating Document Annotation Using Content and Querying Value
XA Secure | Whitepaper on data security within Hadoop
Protecting Global Records Sharing with Identity Based Access Control List
Opentext Decisiv
A Survey on Access Control Scheme for Data in Cloud with Anonymous Authentica...
Brief introduction to NoSQL by fas mosleh
PROVABLE DATA PROCESSING (PDP) A MODEL FOR CLIENT'S SECURED DATA ON CLOUD
facilitating document annotation using content and querying value
JPJ1421 Facilitating Document Annotation Using Content and Querying Value

What's hot (14)

PDF
E FFICIENT D ATA R ETRIEVAL F ROM C LOUD S TORAGE U SING D ATA M ININ...
DOCX
Remadder tutorial

  • 1. ReMaDDer Software Tutorial
How to use ReMaDDer software for successful records matching, data cleansing and data deduplication projects
Homepage: http://ReMaDDersoft.wix.com/ReMaDDer
Revision 2.0., 11/20/2016
  • 2. Table of Contents
Introduction
  What Is ReMaDDer Software
  Fuzzy Match
  Records Linkage
  Data Deduplication
  ReMaDDer Software Advantages
  Prerequisites
  Revision History
Projects
  Projects Page
  Concept of “Left” and “Right” Dataset
  Record Matching Project vs. Data Deduplication Projects
  Copy A Project
Raw Data Import
  “Left” and “Right” datasets
  Import Raw Data
    Browse And Choose CSV files
    Register CSV Files
    Determine And Convert CSV File To UTF-8
    Edit Raw Datasource Schema Information
    Pre-process Raw Datasource
    Import Data From Raw Datasources
Solution Definition
  • 3. Table of Contents (continued)
  How ReMaDDer performs record linkage and data deduplication
  Solution Definition Header
    Solution Basic Information
    Machine Learning Strictness
    Join Type
    Return Only Best Matching Records
  Solution Definition Details
    Fields Picker
    Solution Constraints
Solution Execution
  Solution Execution In One Step
  Solution Execution In Two Major Steps
  Solution Execution In Several Minor Steps
Data Retrieving And Storing
  Execute Resultset Retrieval SQL Query
  Solution Status Info
  Save And Load Resultset
  Review And Edit Resultset
    Resultset Browsing
    Resultset Edit And Review
    Exporting Resultset
Customize Data Grids
Customize Splitters
ReMaDDer Software Trial
Commercial Release Code Purchase And Activation
  • 4. Introduction

What Is ReMaDDer Software

ReMaDDer is record linkage and data cleansing software with powerful fuzzy record matching and data deduplication capabilities, based on state-of-the-art machine learning and data processing techniques.

As a client-server application, ReMaDDer consists of two parts: a client front-end and a server-side part. The client front-end provides a user-friendly graphical interface with intuitive means for creating projects, importing raw data and defining solutions, while the server-side part provides a data processing engine that can solve even the most complex fuzzy match analysis in reasonable time.

By combining advanced artificial intelligence with clever blocking techniques and multiple string similarity metrics, ReMaDDer provides a unique solution for fully automatic records matching and data deduplication projects. Traditionally, fuzzy record matching software requires substantial human intervention, either to provide various parameters and threshold values or to perform extensive clerical review and supervised machine learning training. A unique property of ReMaDDer is that it does not require any such human assistance beyond project definition. There are no thresholds or other input parameters that the user must provide to enable the software to distinguish between matches and non-matches; ReMaDDer is capable of inferring and learning everything by itself.

As far as we are aware, ReMaDDer might be the only software currently available that is capable of performing fully automatic fuzzy record matching without human expert intervention, while attaining the accuracy of human clerical review. This is accomplished by utilizing various advanced machine learning techniques and approaches.

The name “ReMaDDer” is an acronym for “Records Matching and Data Deduplication Software”.

Homepage: http://ReMaDDersoft.wix.com/ReMaDDer

Fuzzy Match

The term “fuzzy match” refers to methods of identifying related records by measuring how similar they are. It is used in cases where no unique identifier or exact match relation exists between two sets of data. Fuzzy matching uses weights to calculate the probability that two given records refer to the same entity. Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with probabilities below the threshold are considered to be non-matches.
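The classic threshold-based idea described above can be illustrated with a minimal sketch. It assumes Python's standard `difflib.SequenceMatcher` as the similarity measure and an arbitrary threshold of 0.8; both are illustrative choices, not ReMaDDer's actual metrics or parameters (ReMaDDer version 2.0 does not ask the user for thresholds at all).

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_pairs(left, right, threshold=0.8):
    """Compare every left value with every right value.

    A pair is reported as a match when its score reaches the threshold.
    The threshold value is an assumption chosen for illustration only.
    """
    matches = []
    for left_value in left:
        for right_value in right:
            score = similarity(left_value, right_value)
            if score >= threshold:
                matches.append((left_value, right_value, round(score, 2)))
    return matches


left = ["Jon Smith", "Acme Corporation"]
right = ["John Smith", "ACME Corp.", "Zenith Ltd"]
matches = classify_pairs(left, right)
for left_value, right_value, score in matches:
    print(left_value, "<->", right_value, score)
```

Comparing every pair, as this sketch does, grows quadratically with dataset size, which is why record linkage tools usually combine similarity scoring with blocking techniques.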
  • 5. Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold matching percentage set by the application.

Records Linkage

Record linkage refers to the task of finding records that refer to the same entity across different data sources, i.e. identifying related records in two separate datasets. Record linkage is necessary when datasets must be joined on entities that may or may not share a common identifier, as may be the case due to differences in record shape, storage location, or curator style and preference.

There are many business cases where record linkage has to be performed. Some typical examples are product price lists, partner lists, book and movie catalogs, customer loyalty databases and medical records.

Data Deduplication

Data deduplication refers to identifying duplicate records in a dataset and cleansing the dataset of redundant information.

ReMaDDer Software Advantages

Due to its inherent complexity, fuzzy match analysis is a popular subject of scientific research and academic papers. Some researchers even build their own software, but such programs tend to be complex and require an understanding of advanced mathematics and algorithms in order to be used. This cannot be expected from an average user who faces a data linkage problem and needs to solve it in a matter of hours or days.

On the other hand, there are large corporate entity resolution frameworks, produced by big software companies and oriented towards large corporate customers. These solutions are often very complex and affordable only to big companies and corporate users.

ReMaDDer places itself in the middle and provides a powerful fuzzy match records linkage solution for regular office users. Because users define exact matching constraints, fuzzy matching constraints and all other constraints in a visual and intuitive way, the complexity of fuzzy match analysis is hidden from the user, who can focus on the business case rather than on technical issues. That is where ReMaDDer really shines and clearly distinguishes itself from the competition.

Traditionally, fuzzy record matching software requires immense user involvement in project parameterization and clerical review. The user is required either to provide various input parameters and threshold values, or to perform machine learning training and provide examples of matches and non-matches. In both cases, considerable user involvement and expertise is a prerequisite for successful analysis. ReMaDDer, in contrast, does not require such heavy user involvement, since it can determine optimal parameter values automatically. This is accomplished by advanced artificial intelligence utilizing various state-of-the-art machine learning techniques.
  • 6. To summarize: advanced artificial intelligence, accompanied by an intuitive graphical user interface and low pricing, is what makes ReMaDDer a superb fuzzy match records linkage solution.

Prerequisites

The major prerequisite for using ReMaDDer is an active internet connection, since the raw data is imported to a remote server where it is processed. After the trial period expires, you are required to purchase a commercial release code in order to continue using the remote server. However, projects and solutions can be created and edited even without an established connection or a purchased release code, since these data are stored locally on your computer.

The ReMaDDer front-end client is available as an executable for Windows and Linux systems. Executables for other systems can be provided on demand.

ReMaDDer does not operate directly on original data sources. Instead, data must be imported from CSV (comma separated values) flat files to the server, where the corresponding “left” and “right” database tables are then created and processed. Therefore, you have to provide source datasets as flat CSV files, encoded in UTF-8, preferably with comma (“,”) or semicolon (“;”) field separators.

Revision History

Revision 1.0. (3/20/2016): Initial release. Tutorial covers ReMaDDer version 1.0.

Revision 1.1. (5/10/2016): Document is updated to reflect changes and improvements brought by ReMaDDer version 1.1. The new version brings many improvements and simplifies solution definition. Instead of separately choosing and defining thresholds for the trigram similarity and Levenshtein distance functions, a new combined similarity function (ReMaDDer_similarity) is introduced that combines both trigram and Levenshtein similarity properties. This reduces complexity and uncertainty in solution definition, while retaining ReMaDDer's strengths and advantages. The previous ReMaDDer version output all columns from the left and right datasets into the resultset. Now you can choose which fields are to be included in the resultset. The raw data import process is also much improved, especially when importing data from Excel files (in CSV format) where column names contain non-ASCII characters and blanks. There are many small performance improvements and several bugfixes that improve the user experience when using ReMaDDer for data match analysis.

Revision 2.0. (11/20/2016): Document is updated to reflect major changes and improvements brought by ReMaDDer version 2.0. The main changes are:
  - Instead of using only the Levenshtein and trigram similarity functions, multiple other similarity metrics are added to the server engine.
  • 7. Revision 2.0. changes (continued):
  - Matches and non-matches are no longer based on similarity thresholds. Instead, ReMaDDer now utilizes machine learning techniques. Advanced algorithms automatically infer and detect duplicates and record matches.
  - Threshold parameters are removed as obsolete.
  - The “Use composite field” parameter is removed as obsolete.
  - The “Use inclusive OR” parameter is removed as obsolete.
  - A new parameter, “Machine Learning Strictness”, is introduced. It defines how strictly the artificial intelligence will distinguish between matches and non-matches. The options are: match, strict match and potential match.
  - A new parameter, “Join Type”, is introduced. It determines how SQL joins between the left and right tables are established via the solution base table. There are three options: a) inner join, b) left outer join, c) right outer join. The “inner join” option is the default, meaning that the resultset will contain only those rows from the left and right datasets that meet the matching criteria. With the “left outer join” option, the resultset will contain all rows from the left dataset and only those rows from the right dataset that satisfy the matching criteria. With the “right outer join” option, the resultset will contain all rows from the right dataset and only those rows from the left dataset that satisfy the matching criteria.
  - A new parameter, “Return Only Best Match”, is introduced. It can be True or False and determines whether the SQL query returns only the best matching record or multiple records satisfying the similarity criteria. Check this option if you wish to return only the best matching record for each left or right record, when using the corresponding left or right outer join. If this option is unchecked (default), multiple matching rows will be returned.
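The effect of the “Join Type” and “Return Only Best Match” parameters can be illustrated with a small sketch in plain Python. The function name `join_matches`, the record ids and the scores are hypothetical and exist only for this illustration; ReMaDDer itself performs these joins in SQL on the server.

```python
def join_matches(left_ids, right_ids, scored_pairs, join_type="inner", best_only=False):
    """Combine matched pairs according to a join type.

    scored_pairs is a list of (left_id, right_id, score) tuples that were
    already judged to be matches. All names here are illustrative assumptions.
    """
    pairs = list(scored_pairs)
    if best_only:
        # Keep the highest-scoring pair per record of the driving side:
        # right records for a right outer join, left records otherwise
        # (using the left side for an inner join is an assumption).
        key_index = 1 if join_type == "right outer" else 0
        best = {}
        for pair in pairs:
            key = pair[key_index]
            if key not in best or pair[2] > best[key][2]:
                best[key] = pair
        pairs = list(best.values())

    rows = [(left_id, right_id) for left_id, right_id, _ in pairs]
    if join_type == "left outer":
        matched = {left_id for left_id, _ in rows}
        rows += [(left_id, None) for left_id in left_ids if left_id not in matched]
    elif join_type == "right outer":
        matched = {right_id for _, right_id in rows}
        rows += [(None, right_id) for right_id in right_ids if right_id not in matched]
    return rows


left_ids = [1, 2, 3]
right_ids = [10, 20]
scored_pairs = [(1, 10, 0.95), (1, 20, 0.80), (2, 20, 0.90)]

inner_rows = join_matches(left_ids, right_ids, scored_pairs)
left_outer_rows = join_matches(left_ids, right_ids, scored_pairs, "left outer")
best_rows = join_matches(left_ids, right_ids, scored_pairs, "left outer", best_only=True)
print(inner_rows)
print(left_outer_rows)
print(best_rows)
```

In this toy data, left record 3 has no match, so it appears only in the left outer join results, paired with `None`.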
  • 8. Projects

Projects Page

A project is the basic entity in ReMaDDer. Each project contains the definition of two source datasets to be imported and analyzed (the so-called “left dataset” and “right dataset”), as well as a variable number of corresponding solutions, which are stored definitions of how to perform fuzzy match analysis.

On creation, each project is assigned a unique project tag. When raw data is imported to the server, the corresponding input tables get that tag appended to their names. This way, imported tables are always tagged with the project tag, which ensures their uniqueness.

The “Projects” page consists of two sections separated by a movable splitter. The upper section contains a datagrid view where you can browse and edit projects, while the lower section contains a form view of the currently selected project. The same concept of datagrids and form views is implemented throughout the application.
  • 9. You can easily create new projects and edit and browse existing projects by using the navigator buttons.

Concept of “Left” and “Right” Dataset

Throughout the ReMaDDer application and this manual, we use the terms “left” and “right” dataset or table. In every fuzzy match project we compare two tables, i.e. two datasets, inspecting the similarity of their rows. For convenience, we call them the “left” and “right” table. The purpose of entity resolution software such as ReMaDDer is to identify which records from the “left” dataset correspond to which records from the “right” dataset.

ReMaDDer does not operate on original data sources directly. Instead, data must be imported from source CSV (comma separated values) flat files to the server, where the corresponding left and right database tables are then created and processed.

Record Matching Project vs. Data Deduplication Projects

In ReMaDDer, there is no fundamental difference between data deduplication and records matching projects. In both cases we compare two datasets, trying to infer which records from the “left” dataset correspond to which records in the “right” dataset. The only difference is that in a records matching project we have two different input datasets to compare, while in a data deduplication project we compare a dataset with itself in order to identify duplicate records.
  • 10. Since ReMaDDer always compares two datasets, left and right, in a data deduplication project we need to import the same original CSV file twice: first as the left dataset and then as the right dataset. ReMaDDer will thus create two identical tables with different names in the underlying database.

Copy A Project

Instead of manually entering all the parameters for a new project, ReMaDDer allows you to copy an existing project into another project. This action copies the raw data import specifications as well as the solution definitions.

Raw Data Import

Datasets to be analyzed are called “left” and “right” datasets and can be easily imported from source CSV files encoded in UTF-8. The CSV (“Comma Separated Values”) file format was chosen because of its ubiquity: databases, spreadsheet editors and most other data sources can easily export to a CSV file.

The source CSV files, however, must be UTF-8 encoded. Otherwise, the import will most likely fail. Therefore, you must first ensure that the source CSV files are properly UTF-8 encoded. ReMaDDer has embedded tools for charset encoding detection and conversion, but you can also use Notepad++ (https://notepad-plus-plus.org/), CudaText (http://uvviewsoft.com/cudatext/) and other text editors that are capable of detecting and converting file encoding.

ReMaDDer provides a simple and intuitive tool for importing CSV files. It automatically detects the field delimiter and the column schema information. You can then edit the retrieved schema and finally import the files to the server for further processing.
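Automatic detection of the field delimiter and column names can be sketched with Python's standard `csv` module. This is only an illustration of the general technique, not ReMaDDer's implementation; the function name `register_csv` and the sample data are hypothetical. Column names are lower-cased here because the tutorial states that ReMaDDer converts upper case column names to lower case.

```python
import csv
import io


def register_csv(text: str):
    """Detect the delimiter of CSV text and return it with the column names.

    Only "," and ";" are considered as candidate delimiters, matching the
    separators this tutorial recommends.
    """
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=",;")
    reader = csv.reader(io.StringIO(text), dialect)
    header = next(reader)
    # Lower-case the column names, as described for CSV registration.
    columns = [name.strip().lower() for name in header]
    return dialect.delimiter, columns


sample = (
    "Customer Name;City;Phone\n"
    "John Smith;Zagreb;555-0101\n"
    "Ana Kovac;Split;555-0102\n"
)
delimiter, columns = register_csv(sample)
print(delimiter, columns)
```

Delimiter sniffing is a heuristic, so the detected schema should always be reviewed before import.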
  • 11. “Left” and “Right” datasets

In each data deduplication or record matching project, we always compare two datasets for matching records. In record matching projects, these two datasets correspond to two different input CSV files, while in data deduplication projects both datasets are imported from the same input CSV file.
  • 12. Nevertheless, we always have a so-called “left dataset” and “right dataset” to be compared. Think of this as comparing the fingers of the left and right hand. You can easily identify the thumb on the left hand as related to the thumb on the right hand, since they share a similar shape. The same applies to fuzzy match analysis, where we compare fields from the left and right dataset in order to identify string similarities. ReMaDDer internally uses various functions to measure string similarity, the results of which are then processed by artificial intelligence to infer whether two records represent the same entity.

Import Raw Data

Importing raw data into the server database consists of several logical phases:
  1. Identify the source CSV files for the “left” and “right” dataset.
  2. Ensure that the CSV files are properly UTF-8 encoded.
  3. Retrieve and specify schema information about the CSV files.
  4. Perform the import from the source files, according to the previously defined schema.

The result of the last step is that the source files are imported into the server-side database, where they can be processed according to various solution definitions.

The “Data Import” page has two sub-pages, “Left Dataset Specification” and “Right Dataset Specification”, in which we separately define the input dataset specifications for the “left” and “right” dataset. The import can be executed separately for the left and right dataset, or both can be imported at once as a batch.

Browse And Choose CSV files

The first step in importing input CSV files is to choose the files to be imported. In the upper part of the “Left Dataset Specification” or “Right Dataset Specification” sub-page there is a CSV file browser dialog box. You can browse CSV files on your computer by clicking the browse button. This opens a file browser in which you can choose a CSV file. The absolute file path is then copied to the edit box.

Register CSV Files

The next step is to define the CSV file schema specification. We call this process “registering a CSV file”.
  • 13. When you click the “Register CSV file” button near the file browser, the chosen CSV file is examined for its columns and its schema information is inserted into the corresponding list of fields (columns).

ReMaDDer determines the field delimiter in the CSV file (normally either “;” or “,”) and retrieves information about the columns. If a column name contains upper case characters, it is converted to lower case. Currently, ReMaDDer treats all columns as text fields of various lengths. This is because comparison is performed with string comparison functions, so other data types (e.g. datetime, integer, real) would not make sense.

Determine And Convert CSV File To UTF-8

In the previous ReMaDDer version, the program detected the encoding and converted it to UTF-8 automatically during CSV file registration. Although very convenient, this could lead to wrong results, since the encoding detection function is not 100% reliable and sometimes guesses the encoding wrongly. Charset detection is an inherently difficult task and there is no fully reliable method; it is always an educated guess based on content inspection. Therefore, we decided to remove automatic charset detection and conversion to UTF-8. You will have to ensure yourself that the source files are properly UTF-8 encoded.

  • 14. Charset detection, as well as file encoding conversion to UTF-8, is still present as a ReMaDDer feature (and even improved), but you have to trigger it manually with the respective buttons or by choosing it from the menu. Another option is to use the embedded spreadsheet editor “Spready” to open and convert source files. Alternatively, you can use various established tools, such as the Notepad++ text editor, that are capable of recognizing file encoding and performing the required conversion to UTF-8.

Determine And Convert CSV File Encoding, with embedded tool

After a CSV file is registered as the left or right dataset source, it can be analyzed with the embedded tool for detecting charset encoding. When you click the button “Determine Encoding of Left Dataset CSV File” or “Determine Encoding of Right Dataset CSV File”, the respective CSV file is analyzed for its encoding type by two different embedded procedures. The result of the encoding analysis is displayed in a pop-up window.
  • 15. If both functions agree that the encoding is UTF-8 (utf8), then the CSV file is in the appropriate format for import. If the result is not UTF-8, the CSV file must be converted to UTF-8 before importing.

You can convert the CSV file encoding to UTF-8 by clicking the button “Convert Encoding Of Left Dataset CSV File” or “Convert Encoding Of Right Dataset CSV File”. When the conversion action is triggered, ReMaDDer first backs up the original CSV file and then converts the file encoding to UTF-8.

Determine And Convert CSV File Encoding, with embedded spreadsheet editor “Spready”

Besides the embedded encoding detection and conversion tool mentioned above, ReMaDDer includes the embedded “Spready” spreadsheet editor (http://wiki.lazarus.freepascal.org/FPSpreadsheet), which can also be used for file encoding conversion.
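The check-then-convert workflow with a backup of the original file, as described above, can be sketched in Python. This is a minimal illustration, not ReMaDDer's embedded tool: the function names and the `.bak` suffix are hypothetical, and the source encoding (`cp1250` in the demo) is assumed to be known rather than detected.

```python
import shutil
import tempfile
from pathlib import Path


def is_utf8(path: Path) -> bool:
    """Return True if the whole file decodes cleanly as UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(path: Path, source_encoding: str) -> Path:
    """Back up the file, then rewrite it in place as UTF-8.

    source_encoding must be supplied by the caller; guessing it
    automatically is exactly the unreliable step the tutorial warns about.
    """
    backup = path.with_name(path.name + ".bak")  # hypothetical backup naming
    shutil.copy2(path, backup)
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
    return backup


# Demo on a temporary file written in a non-UTF-8 encoding.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "left.csv"
    csv_path.write_bytes("grad;ime\nČakovec;Željko\n".encode("cp1250"))
    was_utf8 = is_utf8(csv_path)
    backup_path = convert_to_utf8(csv_path, "cp1250")
    now_utf8 = is_utf8(csv_path)
    backup_kept = backup_path.exists()
    converted_text = csv_path.read_text(encoding="utf-8")

print(was_utf8, now_utf8, backup_kept)
```

Note that a successful UTF-8 decode is a strong hint but not proof: a file containing only ASCII characters decodes as UTF-8 regardless of the encoding it was saved with.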
  • 16. Determine And Convert CSV File Encoding, with external tools

Charset detection with the embedded tool is not 100% reliable, which is true of any tool that infers charsets. If you encounter difficulties with the embedded charset detection and conversion tools, or if you already know the file encoding, you can try various external tools, of which we recommend the well-established Notepad++ text editor (https://notepad-plus-plus.org/).
  • 17. Another interesting alternative is the CudaText text editor (http://uvviewsoft.com/cudatext/), which is also capable of charset detection and conversion.
  • 18. Edit Raw Datasource Schema Information

Once you have retrieved schema information from a CSV file, you might decide that you do not want to import all columns, but only a subset of fields. You can edit the schema by using the corresponding data grid navigator buttons. To delete the currently selected field from the schema, click the delete button. To restore the original column schema, click the “Get Fields Schema” button and the column list will be repopulated from the CSV file.

Pre-process Raw Datasource
  • 19. While defining the import schema specification, you might realize that the input data needs some pre-processing before it is imported to the server for further analysis. You can edit input CSV files with any spreadsheet editor (such as LibreOffice or OpenOffice Calc, Gnumeric or Microsoft Excel) or text editor (such as Notepad, Notepad++, ConTEXT, Gedit, CudaText, Geany or Leafpad), but you can also use the embedded spreadsheet editor “Spready”.

You can launch the external default spreadsheet editor by clicking the button “Open CSV File in Ext. Editor”.

You can launch the embedded spreadsheet editor by clicking the button “Open CSV File In Int. Editor”. This opens the embedded spreadsheet editor “Spready” (http://wiki.lazarus.freepascal.org/FPSpreadsheet).
  • 20. Import Data From Raw Datasources

The final step in source data import is to execute the import procedure, by clicking the appropriate button or triggering the action from the respective menu.
  • 21. We can execute the import separately for the left and right datasets by clicking the corresponding buttons “Import left dataset CSV file” or “Import right dataset CSV file”, or we can import both at once by clicking the button “Import both CSV files to server”.

When you click an import button, ReMaDDer automatically opens the “Import Log” page, where you can watch the progress of the import.
  • 22. Import speed depends on the file size and, most importantly, on the quality of the internet connection.

Solution Definition

A solution definition is a set of parameters for performing record linkage or data deduplication analysis. Each project can have many solutions with different specifications, so you can test which combination of parameters leads to the best results.

Each solution definition consists of a solution header specification and a solution constraints specification. The solution header specification contains general information about the solution and defines important parameters that determine how record matching analysis will be performed. These parameters are: “machine learning strictness”, “join type” and “return only best match”. The solution constraints specification consists of an exact match relations section, a fuzzy match relations section and an other constraints section.

Solution definition page (page “Record Matching Analysis”, sub-page “Solution Definition”):

As with other pages, the “Solution” page is divided into two sections: datagrid view and form view. For a better user experience, the form view is additionally divided into several tabs and sub-tabs. The main tabs are “Solution Definition” and “Solution Result”. The “Solution Definition” tab is further divided into “Solution Header”, “Solution Fields Picker” and “Solution Constraints”.
  • 23. The “Solution Header” tab is divided into several sub-tabs: “Common”, “Solution Base Table Creation Query Info” and “Solution Resultset Retrieval Query Info”. The “Solution Constraints” tab is divided into sub-tabs: “Exact Match Constraints”, “Fuzzy Match Constraints” and “Other Constraints”.
How ReMaDDer performs record linkage and data deduplication
For each project we can define one or more solutions. A solution consists of a solution definition and a solution resultset. The solution definition is a specification that instructs ReMaDDer how to perform record linkage or data deduplication analysis in order to retrieve the resultset.
We can define three types of solution constraints: exact match constraints, fuzzy match constraints and other constraints. Fuzzy match constraints define the field pairs from the left and right datasets to be compared for fuzzy string similarity.
To infer record similarity, ReMaDDer utilizes various string similarity metrics along with machine learning algorithms. The artificial intelligence automatically infers record linkage or duplicates and creates the solution base table.
The final step is resultset retrieval, in which the database engine creates and executes an SQL query that joins the left and right datasets with the solution base table and outputs the resultset. The retrieved resultset can be exported to a spreadsheet or flat file.
Solution Definition Header
The solution definition header contains general solution definition parameters and information about solution execution status.
Solution definition header (whole page):
  • 24. Solution definition header (datagrid view):
Solution definition header (form view):
  • 25. The solution definition header can be entered either through the datagrid or through the form view, which shows the currently selected solution.
Solution Basic Information
Basic information about a solution is shown in the fields “Solution Name”, “Solution Tag”, “Solution Base Table Name”, “Tag Assigned”, “Solution Status” and “Solution Comment”.
Solution Tag is an automatically generated designation that is appended to each solution name by default and is also used to form the Solution Base Table name. Solution Base Table Name is automatically formed from the Solution Name and the Solution Tag. The Solution Tag ensures that the created solution base table has a unique name on the server.
Solution Status and Solution Comment are fields in which the user can enter additional arbitrary information.
  • 26. Machine Learning Strictness
The parameter “Machine Learning Strictness” defines how strictly the artificial intelligence distinguishes between matches and non-matches during fuzzy matching. Possible values are: a) match, b) strict match, c) potential match.
"Match" is the default option. The retrieved resultset will contain a balanced ratio of true positives to false positives. It tends to include all true positives, with some false positives and very few false negatives.
"Strict match" is the strictest option. The resultset will tend to contain only true positives, but due to a higher incidence of false negatives it might fail to recognize some matches.
"Potential match" is the weakest option. The resultset will tend to contain all true positives, but many false positives as well.
Join Type
The “Join Type” attribute determines how SQL joins between the left and right tables are established via the solution base table. There are three join options: a) inner join, b) left outer join, c) right outer join.
  • 27. The "inner join" option is the default, meaning that the resultset will contain only those rows from the left and right datasets that meet the matching criteria. With the "left outer join" option, the resultset will contain all rows from the left dataset and only those rows from the right dataset that satisfy the matching criteria. With the "right outer join" option, the resultset will contain all rows from the right dataset and only those rows from the left dataset that satisfy the matching criteria.
Return Only Best Matching Records
The parameter “Return Only Best Match” can be True or False and determines whether the SQL query returns only the best matching record or multiple records satisfying the similarity criteria. It is used as a modifier to a left outer join or right outer join.
If this option is unchecked (default), multiple matching rows will be returned. If it is checked, only the best matching item from the slave dataset will be joined to the corresponding record in the master dataset.
Check this option if you wish to return only the best matching record for each left or right record when using left or right outer joins and the datasets are in a master/slave relation. With the “inner join” join type, this parameter has no meaning and is ignored.
A typical use case for a left or right outer join with the “return only best match” option is matching two product price lists, one of which is the master list.
Solution Definition Details
While the solution definition header defines general parameters for performing fuzzy match analysis, the solution definition details are set in the Fields Picker sub-page and the Solution Constraints sub-page, which has three sections defining solution constraints: the Exact Match Relations section, the Fuzzy Match Relations section and the Other Constraints section.
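The effect of the join type and “return only best match” options described above can be illustrated with a small sketch. It is not ReMaDDer's actual SQL: the table names `left_ds`, `right_ds` and `base`, the `score` column and the sample rows are all hypothetical, and an in-memory SQLite database stands in for the remote server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE left_ds  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE right_ds (id INTEGER PRIMARY KEY, name TEXT);
-- Hypothetical stand-in for a solution base table: one row per matched pair.
CREATE TABLE base (left_id INTEGER, right_id INTEGER, score REAL);
INSERT INTO left_ds  VALUES (1, 'Acme Corp'), (2, 'Globex'), (3, 'Initech');
INSERT INTO right_ds VALUES (10, 'ACME Corporation'), (11, 'Acme Co.'), (12, 'Globex Inc');
INSERT INTO base VALUES (1, 10, 0.91), (1, 11, 0.84), (2, 12, 0.88);
""")

# Inner join: only rows that take part in a match.
INNER = """
SELECT l.name, r.name FROM left_ds l
JOIN base b ON b.left_id = l.id
JOIN right_ds r ON r.id = b.right_id
ORDER BY l.id, r.id"""

# Left outer join: every left row, matched or not.
LEFT_OUTER = """
SELECT l.name, r.name FROM left_ds l
LEFT JOIN base b ON b.left_id = l.id
LEFT JOIN right_ds r ON r.id = b.right_id
ORDER BY l.id, r.id"""

# Left outer join keeping only the highest-scoring match per left row.
LEFT_OUTER_BEST = """
SELECT l.name, r.name FROM left_ds l
LEFT JOIN base b ON b.left_id = l.id
  AND b.score = (SELECT MAX(b2.score) FROM base b2 WHERE b2.left_id = l.id)
LEFT JOIN right_ds r ON r.id = b.right_id
ORDER BY l.id, r.id"""

inner_rows = cur.execute(INNER).fetchall()
left_rows = cur.execute(LEFT_OUTER).fetchall()
best_rows = cur.execute(LEFT_OUTER_BEST).fetchall()

for label, rows in [("inner", inner_rows), ("left outer", left_rows),
                    ("left outer, best match", best_rows)]:
    print(label, rows)
```

In this sketch the inner join drops the unmatched left record, the left outer join keeps it with an empty right side, and the best-match variant additionally reduces the two candidate matches for the first left record to a single row.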
  • 28. Fields Picker
ReMaDDer provides a simple yet powerful visual tool for adding field pairs to the exact match relations section, the fuzzy match relations section or the other constraints section. Because the fields of the input datasets ("left" and "right") are listed side by side, you can easily browse the two lists, visually establish field pairs and send them to the appropriate constraints definition section by clicking the appropriate button.
  • 29. You can add the selected field pair to the exact match section by clicking the button “Add Fields Pair To Exact Match Relations Section”. You can add the selected field pair to the fuzzy match section by clicking the button “Add Fields Pair To Fuzzy Match Relations Section”. You can add a left or right dataset field to the other constraints section by clicking the respective button.
By eliminating the need for tedious manual input and letting you build solution constraints visually, ReMaDDer simplifies the creation of solution definitions and speeds up your work.
Starting from ReMaDDer version 1.1, the checkbox column “Output Field to Resultset?” is available in the Field Picker datagrid. It is used to include fields in, or exclude them from, the resultset output. By default, all fields are included in the resultset.
  • 30. Solution Constraints
There are three types of constraints that you can define for a solution: exact match relations, fuzzy match relations and other constraints.
Exact Match Relations
In the exact match relations section, we can add field pairs from the "left" and "right" imported datasets and define their equality (=) or inequality (<>).
  • 31. If we can define an exact match relation on one or more field pairs, we can tremendously increase the speed of analysis by narrowing down the number of record pair combinations to be analyzed for fuzzy match. Therefore, it is recommended to use exact match relations whenever possible.
Fuzzy Match Relations
In the solution header section we can set various general parameters that determine how fuzzy match analysis will be performed: we can choose the machine learning strictness, the join type and whether all matches or just the best matches will be returned.
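To see why an exact match relation narrows the work so much, consider the following minimal sketch. It is only an illustration under assumed data: the field names `zip` and `name` and the helper functions are hypothetical and do not reflect how ReMaDDer implements this internally.

```python
from collections import defaultdict
from itertools import product

left = [
    {"id": 1, "zip": "10001", "name": "Jon Smith"},
    {"id": 2, "zip": "10001", "name": "Mary Jones"},
    {"id": 3, "zip": "94105", "name": "Ann Lee"},
]
right = [
    {"id": 10, "zip": "10001", "name": "John Smith"},
    {"id": 11, "zip": "94105", "name": "Anne Lee"},
    {"id": 12, "zip": "60601", "name": "Bob Ray"},
]

def all_pairs(left_rows, right_rows):
    """Every left record combined with every right record."""
    return [(l["id"], r["id"]) for l, r in product(left_rows, right_rows)]

def pairs_with_exact_match(left_rows, right_rows, field):
    """Only the pairs whose values in `field` are equal."""
    index = defaultdict(list)
    for r in right_rows:
        index[r[field]].append(r)
    return [(l["id"], r["id"]) for l in left_rows for r in index.get(l[field], [])]

unrestricted = all_pairs(left, right)
restricted = pairs_with_exact_match(left, right, "zip")

print(len(unrestricted), "pairs without an exact match relation")
print(len(restricted), "pairs with zip = zip:", restricted)
```

Without the relation, the number of pairs is the product of the two record counts; with it, only records that share the same value in the chosen field are left to be compared for fuzzy similarity.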
  • 32. In the fuzzy match relations section we provide the details for fuzzy match comparison analysis. We can list the field pairs to be compared and further define how fuzzy match analysis will be performed.
Relative Field Weight
  • 33. For each field pair that will be compared for similarity, we have to define its relative weight. The bigger the weight, the greater the importance of that field pair's similarity in the final decision whether two records match or not.
The weight for a particular field pair is entered as an arbitrary integer value in the field “Field Weight (integer)”, and ReMaDDer then calculates its relative weight. The sum of relative weights is always 1.
When a new field pair is added to the fuzzy match relations section, it gets the default integer weight, which is one (1). You can change this value to any bigger integer and ReMaDDer will recalculate the relative weights so that their sum remains 1.
Notice that there is an additional graphical indicator, which shows the relative weight of the currently selected field pair.
There are two buttons provided: “Recalculate Weights” and “Reset Weights”.
  • 34. The button “Reset Weights” resets all integer field weights to 1, which is the same as if relative weights were not used at all. In that case, all field pairs are treated as equally important.
The button “Recalculate Weights” recalculates the relative weights according to the integer values entered in the field “Field Weight (integer)”. You don’t need to trigger this action manually, since it is triggered automatically on each change of an integer value and on each field pair addition.
Other Constraints
As with exact match relations, it is desirable to limit the analysis to a particular subset of data. Such constraints can greatly increase the speed of record linkage or data deduplication analysis. We can define any custom constraint to be applied to a particular field from the "left" or "right" dataset.
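The relative weight calculation from the Relative Field Weight section can be sketched as a simple normalization. This is an assumption about the arithmetic, based only on the statement that relative weights always sum to 1; the function name and the field pair labels are hypothetical.

```python
def relative_weights(integer_weights):
    """Turn integer field weights into relative weights that sum to 1."""
    total = sum(integer_weights.values())
    if total <= 0:
        raise ValueError("at least one positive integer weight is required")
    return {pair: weight / total for pair, weight in integer_weights.items()}

# Three field pairs; the name pair is considered three times as important.
weighted = relative_weights({"name": 3, "address": 1, "city": 1})

# All integer weights left at the default of 1: every pair counts equally.
default = relative_weights({"name": 1, "address": 1, "city": 1})

print(weighted)
print(default)
```

With all integer weights equal, every field pair receives the same share, which corresponds to the "Reset Weights" situation described above.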
  • 35. Normally, the condition has the form sometable.somefield = ‘some string’, but other operators, such as LIKE, can be used as well.
Solution Execution
Once a solution definition is prepared by setting the global parameters and the exact match, fuzzy match and other constraints, you can execute the solution on the remote server and retrieve the resultset. Solution execution has two consequences: the solution base table is created on the server and the resultset is retrieved to the client.
Solution execution is actually a sequence of two different steps, which can be executed in batch or separately. The first step is solution base table creation on the server, which is a prerequisite for the next step, resultset retrieval on the client.
The first step, in which the Solution Base Table is created, is the most critical point in the ReMaDDer application (and the most resource and time demanding, too). In this step, a sequence of several underlying procedures is triggered that determines the solution space from which the final resultset is retrieved. This step is composed of several discrete sub-steps.
The first sub-step is the so-called “blocking” procedure, a method for reducing the space of combinations that will be further analyzed for string similarity. This sub-step is of great importance, since fuzzy match analysis is an inherently time-consuming job and analyzing all possible combinations would take an extremely long time to complete.
The next sub-step calculates string similarity between left and right dataset records. ReMaDDer utilizes multiple string similarity functions. Some of them are quite resource demanding.
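The string similarity sub-step can be pictured with a minimal sketch. The tutorial does not say which similarity functions ReMaDDer uses, so this example substitutes Python's standard `difflib.SequenceMatcher` purely for illustration; the record layout, the field pairs and the weights are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity of two strings in the range 0..1 (1 means identical)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def record_similarity(left_rec, right_rec, weights):
    """Weighted sum of per-field similarities.

    `weights` maps (left_field, right_field) pairs to relative weights
    that are assumed to sum to 1.
    """
    return sum(
        weight * similarity(left_rec[lf], right_rec[rf])
        for (lf, rf), weight in weights.items()
    )

left_rec = {"name": "Jon Smith", "city": "New York"}
candidates = [
    {"full_name": "John Smith", "town": "New York"},
    {"full_name": "Bob Ray", "town": "Chicago"},
]
weights = {("name", "full_name"): 0.75, ("city", "town"): 0.25}

scores = [record_similarity(left_rec, c, weights) for c in candidates]
for candidate, score in zip(candidates, scores):
    print(candidate["full_name"], round(score, 3))
```

A single, cheap metric is used here only to keep the example short. In ReMaDDer, the results of several similarity functions are passed on to the machine learning step instead of being compared against a fixed threshold.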
  • 36. After string similarity is established for all combinations in the solution space, machine learning algorithms take the results from the previous step and infer record linkage or detect duplicates. This is the heart of the approach that ReMaDDer software uses to perform the entity resolution job.
Unlike competing software, ReMaDDer does not require any user involvement in this step. There is no need to provide examples of matches and non-matches, nor any threshold value that would distinguish matches from non-matches. ReMaDDer acquires the knowledge and determines record linkage automatically, without the need for a human domain expert or clerical review.
As far as we are aware, there is no other software currently available on the market that is capable of performing such automatic record linkage inference by artificial intelligence, with accuracy reaching that of human clerical review.
Technically, a solution can be executed in three different ways:
A) In one step. In this scenario, both major steps (solution base table creation and resultset retrieval) are executed at once.
B) In two major steps. In this scenario, the major steps (solution base table creation and resultset retrieval) are executed one by one in consecutive order.
C) In several minor steps. In this scenario, both major steps are executed as a sequence of several distinct minor steps.
In the simplest scenario, you can execute the solution in one step. On the Solution definition header, as well as in the corresponding “Solution Header” menu entry, there are two buttons. The button “Execute Solution” executes both steps at once, in batch, while the button “Prepare And Execute Result SQL Query” executes only the last step, i.e. resultset retrieval.
  • 37. You must click the button “EXECUTE SOLUTION” at least once in order to create the underlying Solution Base Table on the server, which is a prerequisite for the second step, resultset retrieval.
The first step, solution base table creation, can be extremely resource and time demanding. Depending on the record counts in the left and right datasets, the number of field pairs to be compared for string similarity and so on, it can take anything from 30 seconds to 24 hours or even more (!).
Be aware that the solution complexity, and with it the time required for solution base table creation, grows much faster than linearly with the record counts in the left and right datasets, because every additional record adds many new record pair combinations to be analyzed. The number of field pairs to be compared matters as well: analyzing only one field pair is not the same as comparing 9 field pairs for fuzzy match. Fuzzy match analysis is inherently complex and time consuming.
Once the solution base table has been created, you can easily change the machine learning strictness or the join type, or choose whether to return only the best match. For these changes you don’t need to re-trigger the tedious and time-consuming solution base table recreation; it is enough to re-trigger only the second step. That is exactly the reason why the button “Prepare And Execute Solution Result SQL Query” is provided.
Besides the default division into major steps, there is also a fine-grained division into sub-steps, which is available in the “Solution Definition” menu entry.
  • 38. In the fine-grained division of solution execution steps, we distinguish the following separate actions:
    - “Prepare Solution Base Table SQL Query” --> this action will prepare the SQL query for the solution base table, but will not execute it.
    - “Execute Solution Base Table SQL Query (Create Solution Base Table)” --> this action will execute solution base table creation.
    - “Prepare Solution Result SQL Query With Forced Base Table (Re)creation” --> this will trigger recreation of the SQL query for recreation of the solution base table on the server and then retrieve the resultset.
    - “Prepare Solution Result SQL Query With Check Whether To Create Base Table” --> this will trigger an action that checks whether the solution base table has to be recreated. The solution base table will be recreated only if necessary. Then the resultset will be retrieved.
    - “Prepare Solution Result SQL Query” --> just prepare the SQL query that will retrieve the resultset, but don’t actually trigger its execution.
    - “Prepare And Execute Solution Result SQL Query” --> prepare and execute the SQL query that will retrieve the resultset.
    - “Execute Solution Result SQL Query (Retrive Resultset)” --> execute the already prepared SQL query that will retrieve the resultset.
  • 39. These fine-grained actions are accessible only from the menu, because a casual user will rarely need them. For a regular user, it is only relevant to remember that the solution base table must be created first in order to be able to retrieve the resultset.
If the solution base table has already been created, you don’t need to recreate it for a different combination of the “machine learning strictness”, “join type” and “return only best match” parameters. It is enough to use the “Prepare And Execute Solution Result SQL Query” button.
If you’re in doubt and don’t know what to do, the simplest and safest way to execute the solution and retrieve the resultset is to click the “EXECUTE SOLUTION” button.
Solution Execution In One Step
The simplest way to execute a solution is to execute the analysis in one step, by clicking the button “EXECUTE SOLUTION” or by choosing the corresponding menu item. This action forces (re)creation of the solution base table on the server, from scratch, and then prepares and executes the resultset retrieval SQL query.
Be aware that solution base table (re)creation is a costly action and might take considerable time to complete! If the left or right dataset contains millions of records, it might take an extremely long time. Therefore, it is preferable to execute solution base table (re)creation only when necessary.
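The rule of thumb above (recreate the base table only when something other than the three header parameters has changed) can be expressed as a small decision sketch. This is not how ReMaDDer itself performs the check; the dictionary layout, the key names and the fingerprint approach are assumptions made for illustration.

```python
import hashlib
import json

# Per the tutorial, changing these parameters only requires re-running
# resultset retrieval, not base table recreation. Key names are hypothetical.
RETRIEVAL_ONLY = {"ml_strictness", "join_type", "return_only_best_match"}

def base_table_fingerprint(solution):
    """Hash of every setting that affects the solution base table."""
    relevant = {k: v for k, v in solution.items() if k not in RETRIEVAL_ONLY}
    blob = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def needs_base_table_recreation(solution, stored_fingerprint):
    """True if no base table exists yet or its defining settings changed."""
    if stored_fingerprint is None:
        return True
    return stored_fingerprint != base_table_fingerprint(solution)

solution = {
    "exact_match": [["zip", "zip", "="]],
    "fuzzy_match": [["name", "full_name", 3], ["city", "town", 1]],
    "other_constraints": [],
    "ml_strictness": "match",
    "join_type": "inner",
    "return_only_best_match": False,
}
stored = base_table_fingerprint(solution)  # remembered after the first run

# Changing only a header parameter: retrieval alone is enough.
header_change = dict(solution, join_type="left outer", return_only_best_match=True)

# Changing a constraint: the base table must be recreated.
constraint_change = dict(solution, exact_match=[])

print(needs_base_table_recreation(header_change, stored))
print(needs_base_table_recreation(constraint_change, stored))
```

The same idea is what the menu action “Prepare Solution Result SQL Query With Check Whether To Create Base Table” offers in the application: the costly step runs only when it is actually needed.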
  • 40. Solution Execution In Two Major Steps
Besides simple solution execution in one step, it is possible to execute a solution in two major steps. In this scenario, the first step is solution base table creation on the server, which is a prerequisite for the next step, resultset retrieval on the client.
Once the solution base table has been created, you can easily change the machine learning strictness or the join type, or choose whether to return only the best match. For these changes you don’t need to re-trigger the tedious and time-consuming solution base table recreation; it is enough to re-trigger only the second step. That is exactly the reason why the button “Prepare And Execute Solution Result SQL Query” is provided.
On the Solution definition header, as well as in the corresponding “Solution Header” menu entry, there is the button “Prepare And Execute Result SQL Query”, which executes only the last step, i.e. resultset retrieval. You can use it if a proper solution base table has already been created on the server.
Solution Execution In Several Minor Steps
If an appropriate solution base table has not yet been created, or the solution definition has changed so that it needs to be recreated, you have to (re)create the solution base table first and then execute the resultset retrieval query.
Besides executing everything in one step, there is also a fine-grained division into sub-steps, presented in the “Solution Definition” menu entry.
  • 41. In the fine-grained division of steps, we distinguish the following separate actions:
    - “Prepare Solution Base Table SQL Query” --> this action will prepare the SQL query for the solution base table, but will not execute it.
    - “Execute Solution Base Table SQL Query (Create Solution Base Table)” --> this action will execute solution base table creation.
    - “Prepare Solution Result SQL Query With Forced Base Table (Re)creation” --> this will trigger recreation of the SQL query for recreation of the solution base table on the server.
    - “Prepare Solution Result SQL Query With Check Whether To Create Base Table” --> this will trigger an action that checks whether the solution base table has to be recreated. The solution base table will be recreated only if necessary.
    - “Prepare Solution Result SQL Query” --> just prepare the SQL query that will retrieve the resultset, but don’t actually trigger its execution.
    - “Prepare And Execute Solution Result SQL Query” --> prepare and execute the SQL query that will retrieve the resultset.
    - “Execute Solution Result SQL Query (Retrive Resultset)” --> execute the already prepared SQL query that will retrieve the resultset.
These fine-grained actions are accessible only from the menu, because a casual user will rarely need them. For a regular user, it is only relevant to remember that the solution base table must be created first in order to be able to retrieve the resultset.
  • 42. If the solution base table has already been created, you don’t need to recreate it for a different combination of the “machine learning strictness”, “join type” and “return only best match” parameters. It is enough to use the “Prepare And Execute Solution Result SQL Query” button.
If you’re in doubt and don’t know what to do, the simplest and safest way to execute the solution and retrieve the resultset is to click the “EXECUTE SOLUTION” button.
Data Retrieving And Storing
You can launch previously prepared solution SQL queries and return resultsets by clicking the button “Prepare And Execute Result SQL Query”. Alternatively, you can execute the solution in one step, which includes both solution base table creation and resultset retrieval SQL query execution, with the button “EXECUTE SOLUTION”. In both cases, once the resultset is retrieved, it is stored locally on your computer and you can load it again at any time.
  • 43. You can easily browse, edit and analyze results in many different ways, including datasheet forms with sophisticated data searching, filtering and navigation capabilities.
Execute Resultset Retrieval SQL Query
The resultset retrieval query is executed by clicking the button “Execute Solution” or by clicking the button “Prepare And Execute Solution”, which can be used if the solution base table has already been created. The difference is that the “Execute solution” action (re)creates the underlying solution base table and then executes the SQL query that joins the left and right datasets with the solution base table, while the action “Prepare And Execute Results SQL Query” performs only the last step. The prerequisite for using the latter is that the solution base table has already been created.
When the action is triggered, the previously prepared SQL query text is sent to the server for execution. The progress of query execution can be monitored on the “Solution Log” page. The retrieved resultset is automatically opened in a separate form.
  • 44. Solution Status Info
ReMaDDer automatically updates the solution status whenever the solution base table creation query or the resultset retrieval query is prepared or executed. This solution status information is shown in both the solution header datagrid and the form view, in the respective tabs.
  • 45. You get various information about the solution base table creation process, such as whether the solution base table has been created, whether the solution creation query has already been executed, whether the solution base table is empty, and what the query execution times were.
You also get various information about the resultset retrieval query execution process, such as whether the resultset retrieval SQL query has been generated (prepared), whether the SQL query has already been executed, whether the resultset has been retrieved and, if so, whether it was empty. It is also shown whether the resultset is stored locally and in which file. There is information about execution times and the number of executions performed.
  • 46. Save And Load Resultset
Once a solution is executed and the results are retrieved, the resultset is automatically saved as a locally stored file in the ReMaDDer installation folder, subfolder “/data/results”.
The resultset can be loaded into the subpage “Solution Result” of the main form by clicking the button “Load Solution Resultset”, or into a separate form by clicking the button “Load Solution Resultset In Separate Window”.
  • 47. Review And Edit Resultset
There are various ways you can post-process and review the retrieved resultset.
Resultset Browsing
You can easily browse, edit and analyze the loaded resultset in the datagrid form. The datasheet offers sophisticated data searching, filtering and navigation capabilities.
  • 48. You can scroll by using the mouse, the vertical and horizontal sliders and the arrows. You can also browse records by using the navigation buttons.
Resultset Searching
You can easily search for any particular value in any column. In the upper left corner of the datagrid there is a small button represented by an orange double arrow. This button opens a pop-up dialog with various search, filter and customization options, one of which is “Find data”.
  • 49. When you click the “Find data” button, a search dialog box appears. You can search for any value in any column.
Resultset Filtration
  • 50. You can easily filter by any column. In the upper left corner of the datagrid there is a small button represented by an orange double arrow. This button opens a pop-up dialog with various search, filter and customization options, including “Filter data” and “Filter in table”, which are two different ways to filter a datagrid.
Filter Data
When you click the button “Filter data”, a dialog box appears in which you can build your filtering conditions. This way you can define complex multicolumn filters.
  • 51. The filter is then applied by clicking the “Apply” button.
Filter In Table
Another option for filtering is the button “Filter in table”, which activates a filtration combobox placed just below each column’s title. When you click the filtration combobox cell, a list appears showing all possible values for the respective column. When you select a value, the column is automatically filtered by the chosen value.
  • 52. Resultset Sorting
You can sort any column in ascending or descending order by clicking the column title.
Resultset Edit And Review
You can easily edit the resultset in the datagrid. You can delete a row by using the delete button, or edit a record by clicking the edit button.
  • 53. Exporting Resultset
Besides using the datagrid controls, another option for resultset post-processing is to export the resultset into a spreadsheet and then review and edit it in a spreadsheet editor. ReMaDDer offers several ways of exporting a resultset to spreadsheets.
Exporting Resultset To Spreadsheet
The resultset can be exported to a CSV file by clicking the button “Export To CSV File”.
The resultset can be exported to an XLSX file by clicking the button “Export To XLSX File”.
The resultset can be exported to an XLS file by clicking the button “Export To XLS File”.
The resultset can be loaded directly into your default spreadsheet editor, e.g. LibreOffice Calc or Microsoft Excel, by clicking the button “Load In Ext. Spreadsheet Editor”.
  • 54. ReMaDDer also has its own embedded spreadsheet editor, which can be used for resultset post-processing. The resultset can be loaded into the embedded spreadsheet editor by clicking the button “Load As Spreadsheet”.
Exporting Datagrid To Spreadsheet
Another possibility for exporting a resultset into a spreadsheet file is to use the datagrid’s exporting feature.
  • 55. You have to browse to the destination folder for the export and enter the exported file name and extension, as well as the page name (sheet name). If you forget to specify the “page name”, you will get an error.
  • 56. Customize Data Grids
ReMaDDer enables you to customize the user interface to a certain extent. You can shrink or stretch columns, rearrange their order and hide or unhide columns.
Resize columns by dragging the vertical splitters between columns.
Rearrange columns by pressing the left mouse button on a column’s title and dragging the column while the mouse button is held down. After the column is moved to its new position, release the mouse button.
You can define which columns are shown and which are hidden by clicking the button “Select visible columns”.
  • 57. When you close the application, your customization is saved (in the remadder_props.xml file), and when you open the application again, your customizations are loaded as well.
Customize Splitters
You will notice that various sections are divided by splitters, which you can drag to resize the corresponding split sections. The customization you make is saved on application close and reloaded on application start.
ReMaDDer Software Trial
The ReMaDDer client application is distributed as shareware with a 15-day trial period. The trial period is initialized on the first application start on your computer.
  • 58. Commercial Release Code Purchase And Activation
After the trial period expires, you are required to purchase a commercial release code in order to continue using server features, such as raw data import and query execution. You can, however, continue creating and editing projects and solution definitions, as well as loading and editing previously acquired resultsets.
When purchasing a release code, you are required to enter the MachineID in the purchase form. The MachineID is a tag generated by ReMaDDer software and is unique to your hardware. The purchased commercial release code is thus machine-specific and valid only for your hardware.
Once you have purchased a release code, activate it by clicking the button “Activate Commercial Release Code”. You will be asked to enter the release code.
  • 59. The entered release code is then validated and, if it is correct, the server-side features are unlocked for you.