ML using minimal to no coding-KNIME.pptx

Step 1
• Read Data from File
– File Reader Node
– Table Reader Node
– Excel Reader Node
– Absolute and Relative Paths: the knime:// Protocol
• Accessing REST Services

Step 2
• ETL and Data Manipulation
– 2.1 Row and Column Filtering
– 2.2 Aggregations
– 2.3 Join and Concatenation
– 2.4 Transformation: Conversion, Replacement,
Standardization, and New Feature Generation
– 2.5 Data Preparation for Time Series Analysis

2.1 Row and Column Filtering
• Basic Row Filter
• Advanced Row Filter
• Column Filter

2.2 Aggregations
• Classic Aggregations with GroupBy node: A
classic aggregation operation consists of two
steps:
– identifying data groups and
– applying the aggregation method to each of the selected
groups.
• Basic groupby aggregation
• Advanced groupby aggregation
• Pivoting
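
For readers who want to see the two steps of a classic aggregation outside KNIME, the following is a minimal Python sketch. It is an analogy for the idea, not a description of how the GroupBy node is implemented. The sample rows and their values are invented; only the column names (sex, income, age) follow the adult.csv exercise, and the helper names `group_by` and `aggregate` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; column names mirror adult.csv, values are invented.
rows = [
    {"sex": "Female", "income": ">50K", "age": 40},
    {"sex": "Female", "income": ">50K", "age": 50},
    {"sex": "Female", "income": "<=50K", "age": 30},
    {"sex": "Male", "income": ">50K", "age": 45},
    {"sex": "Male", "income": "<=50K", "age": 25},
]


def group_by(rows, keys):
    """Step 1: identify the data groups by collecting rows under a key tuple."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return groups


def aggregate(groups, column):
    """Step 2: compute the row count and the mean of `column` for each group."""
    return {
        key: {"count": len(members), "mean": mean(r[column] for r in members)}
        for key, members in groups.items()
    }


summary = aggregate(group_by(rows, ["sex", "income"]), "age")
for key, stats in sorted(summary.items()):
    print(key, stats)
```

Grouping on a single column, or on more than two, only changes the list of keys passed to `group_by`.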

Read the adult.csv data set. Then:
1. calculate the total number of rows and the average age for all
females with income >50K per year
2. for each of the 4 groups defined by sex and
income values, calculate the average of all numerical
columns
3. on the full input table, count:
a. rows with missing values in column “occupation”
b. all rows in column “occupation”
c. rows with no missing value in column “occupation”
d. all rows in another column (e.g. marital-status). Notice that
this number should be the same as the number of all
rows in column “occupation”.

Pivoting
• The pivoting function requires one or more
grouping columns to define the rows, and one or
more pivoting columns to define the columns of
the pivot table.
• The rows and columns define unique sub-groups
of the data. These sub-groups can then be
summarized by aggregated measures.
• The possible aggregations range from listing and
counting values, to calculations on date & time,
and to statistical measures.
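
As a rough illustration of the pivoting idea, the sketch below counts rows for every combination of a grouping column and a pivoting column in plain Python. The rows are invented, the column name `age_bin` is an assumption (the exercise speaks of an age bin without naming a column), and `pivot_count` is a hypothetical helper, not part of KNIME.

```python
from collections import Counter

# Hypothetical rows; "workclass" mirrors adult.csv, "age_bin" is an assumed name.
rows = [
    {"age_bin": "20-39", "workclass": "Private"},
    {"age_bin": "20-39", "workclass": "Private"},
    {"age_bin": "20-39", "workclass": "State-gov"},
    {"age_bin": "40-59", "workclass": "Private"},
    {"age_bin": "40-59", "workclass": "Self-emp"},
]


def pivot_count(rows, group_col, pivot_col):
    """Count rows per (group value, pivot value) and lay them out as a table."""
    counts = Counter((r[group_col], r[pivot_col]) for r in rows)
    row_labels = sorted({r[group_col] for r in rows})
    col_labels = sorted({r[pivot_col] for r in rows})
    # Combinations that never occur in the data get a count of 0.
    table = {
        g: {p: counts.get((g, p), 0) for p in col_labels} for g in row_labels
    }
    return table, counts


table, counts = pivot_count(rows, "age_bin", "workclass")
for label, cells in table.items():
    print(label, cells)

# The most frequent (group, pivot) combination and the number of rows in it.
top_combo, top_count = counts.most_common(1)[0]
print(top_combo, top_count)
```

Here the grouping values become the rows of the table and the pivoting values become its columns; counting is only one of the possible aggregations.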

Question 1. Using the “age” column as the
grouping column and “workclass” column as
the pivoting column, calculate the number of
people in groups according to their work class
and age.
• 1a. What is the most common combination
of age bin and work class?
• 1b. How many people belong to this group?

2.3 Join and Concatenation
• Join:
– inner join,
– right outer join,
– left outer join,
– full outer join
• Concatenation
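
The four join modes differ only in what happens to rows without a partner in the other table. The following is a minimal sketch in plain Python, assuming both tables are lists of dictionaries that share one key column and have no other column names in common. The tables, the `avg_age` values, and the `join` helper are all hypothetical.

```python
# Hypothetical tables: `left` holds per-row data, `right` holds per-group values.
left = [
    {"id": 1, "sex": "Female"},
    {"id": 2, "sex": "Male"},
    {"id": 3, "sex": "Other"},  # has no match in `right`
]
right = [
    {"sex": "Female", "avg_age": 37.0},
    {"sex": "Male", "avg_age": 39.5},
    {"sex": "Unknown", "avg_age": 41.0},  # has no match in `left`
]


def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; `how` is inner, left, right, or full."""
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    result = []
    matched_right = set()
    for l_row in left:
        matches = [i for i, r_row in enumerate(right) if r_row[key] == l_row[key]]
        for i in matches:
            matched_right.add(i)
            result.append({**l_row, **right[i]})
        if not matches and how in ("left", "full"):
            # Unmatched left row: pad the right-hand columns with None.
            result.append({**l_row, **{c: None for c in right_cols}})
    if how in ("right", "full"):
        for i, r_row in enumerate(right):
            if i not in matched_right:
                # Unmatched right row: pad the left-hand columns with None.
                result.append({**{c: None for c in left_cols}, **r_row})
    return result


for how in ("inner", "left", "right", "full"):
    print(how, len(join(left, right, "sex", how)))
```

An inner join keeps only matched rows, the left and right outer joins additionally keep the unmatched rows of one side, and the full outer join keeps the unmatched rows of both.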

• Read adult.csv data set. Then calculate the
average age and number of rows for the 4
groups defined by (sex, income) and join the
corresponding 2 aggregated values to each
row in the group.

• Differentiate joining from concatenation
• Read the adult.csv data set. Then extract people
with age between 20 and 40 whose workclass
starts with "S", and people with
age between 40 and 60 working in the
Private sector (workclass starts with "P"). Put
both groups in a single data table.
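
To contrast concatenation with joining, the sketch below filters two groups and stacks them into one table in plain Python. The rows are invented, and treating the age bounds as inclusive is an assumption, since the exercise does not say whether 20, 40, and 60 are included.

```python
# Hypothetical rows; column names mirror adult.csv, values are invented.
rows = [
    {"age": 25, "workclass": "State-gov"},
    {"age": 35, "workclass": "Private"},
    {"age": 45, "workclass": "Private"},
    {"age": 55, "workclass": "Self-emp-inc"},
    {"age": 65, "workclass": "Private"},
]

# Group A: age between 20 and 40 (assumed inclusive), workclass starting with "S".
group_a = [
    r for r in rows if 20 <= r["age"] <= 40 and r["workclass"].startswith("S")
]
# Group B: age between 40 and 60 (assumed inclusive), workclass starting with "P".
group_b = [
    r for r in rows if 40 <= r["age"] <= 60 and r["workclass"].startswith("P")
]

# Concatenation stacks the rows of both tables; no key column is matched.
combined = group_a + group_b
print(combined)
```

A join widens rows by matching on a key, whereas concatenation only appends rows that already share the same columns.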

2.4 Transformation: Conversion, Replacement,
Standardization, and New Feature Generation
• Data are usually standardized before being stored, analyzed, or
reported: string and date & time values are converted to a
consistent style and format, numbers are normalized, and new
features are derived from existing ones.
• Typical string manipulation operations include extracting substrings,
converting text to lower or upper case, and adding a
prefix or suffix to string values.
• Numbers can be transformed mathematically, for example by
normalization or a logarithmic transformation.
• More generally, data can be transformed to generate new and,
ideally, more informative input features.
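
The transformations listed above can be sketched in a few lines of Python. This is an illustration under assumptions: min-max scaling is just one common form of normalization, the helper names are hypothetical, and the sample values are invented.

```python
import math


def standardize_text(value, prefix=""):
    """Trim whitespace, convert to lower case, and optionally add a prefix."""
    return prefix + value.strip().lower()


def min_max_normalize(values):
    """Rescale numbers linearly to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(values):
    """Apply the natural logarithm; every value must be greater than 0."""
    return [math.log(v) for v in values]


# Invented sample values for illustration.
print(standardize_text("  Germany "))
print("Germany"[:3])  # substring extraction: the first three characters
print(min_max_normalize([10.0, 20.0, 30.0]))
print(log_transform([1.0, 10.0, 100.0]))
```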

Data Manipulation: Numbers, Strings, and
Rules
• String Manipulation node
• Math Formula node
• Rule Engine node

• Read the sales.csv dataset.
• Using the Rule Engine node, create a new column
“currency” with value “USD” for the orders from the USA,
and “EUR” for the orders from Germany.
• Using the Rule Engine node, create a new column
“conversion” with value 1 if currency is “EUR”, and 0.88 if
currency is “USD” (we refer to the exchange rate of
Nov-04-2018).
• Using the Math Formula node, calculate values in a new
column named “amount-in-EUR” by multiplying the value
in column “amount” by the value in column “conversion”.
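
The logic of the three steps above can be written out in plain Python to make the rules explicit. This is only a sketch of the logic, not KNIME Rule Engine syntax. The column name `country` and the sample orders are assumptions; the country values and the rates 1 and 0.88 come from the exercise. `None` stands in for a missing value when no rule matches.

```python
def currency_rule(country):
    """First rule set: map the order's country to a currency."""
    if country == "USA":
        return "USD"
    if country == "Germany":
        return "EUR"
    return None  # no rule matched


def conversion_rule(currency):
    """Second rule set: map the currency to the conversion rate to EUR."""
    if currency == "EUR":
        return 1.0
    if currency == "USD":
        return 0.88
    return None  # no rule matched


# Invented sample orders; "country" is an assumed column name.
orders = [
    {"country": "USA", "amount": 100.0},
    {"country": "Germany", "amount": 50.0},
]

for order in orders:
    order["currency"] = currency_rule(order["country"])
    order["conversion"] = conversion_rule(order["currency"])
    # Math formula step: amount multiplied by the conversion rate.
    order["amount-in-EUR"] = order["amount"] * order["conversion"]
    print(order)
```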

Column Expressions for Data Manipulation
• The Column Expressions node is useful
because it can perform multiple data
manipulation tasks at once.
• A single Column Expressions node can replace
a combination of other data manipulation
nodes, such as the String Manipulation,
Math Formula, and Rule Engine nodes.

Exercise
• Read the sales.csv dataset.
• Write an expression that extracts the first three letters of
country names and converts them to upper case letters.
Append a new column and name it “Country_Code”.
• Write an expression that multiplies the sales amount by
the conversion rate. Replace the “amount” column, but
change its type to double.
• Write an expression that assigns the value “N” to the
missing values in the “card” column. Replace the “card”
column.
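
The idea of handling several manipulations in one place can be illustrated with a single Python function that applies all three expressions to a row. This is not Column Expressions syntax. The column names `country` and `conversion`, the card value "Y", and the sample rows are assumptions made for the sketch.

```python
def transform_row(row):
    """Apply the three expressions of the exercise to one row in a single pass."""
    out = dict(row)
    # Expression 1: first three letters of the country name, in upper case.
    out["Country_Code"] = row["country"][:3].upper()
    # Expression 2: amount times conversion rate, stored as a double (float).
    out["amount"] = float(row["amount"]) * float(row["conversion"])
    # Expression 3: replace a missing card value (None here) with "N".
    out["card"] = row["card"] if row["card"] is not None else "N"
    return out


# Invented sample rows; None stands in for a missing value.
rows = [
    {"country": "Germany", "amount": 50, "conversion": 1, "card": None},
    {"country": "USA", "amount": 100, "conversion": 0.88, "card": "Y"},
]

transformed = [transform_row(r) for r in rows]
for row in transformed:
    print(row)
```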
