Improvisation Ensemble Support Systems
for Music Beginners
based on Body Motion Tracking
Shugo ICHINOSE*, Souta MIZUNO*, Shun SHIRAMATSU*, Tetsuro KITAHARA**
Department of Computer Science, Graduate School of Engineering, Nagoya Institute of Technology*
Department of Information Science, College of Humanities and Sciences, Nihon University**
Introduction
• There are three cognitive elements in melody recognition:
  1. Rhythm (easy)
  2. Pitch contour (easy)
  3. Tonality (difficult)
• Tonality makes it difficult for music beginners to attempt musical improvisation
Purpose
Developing an improvisation ensemble support system
• The input is the user's body motion, representing pitch contour and rhythm (the easy elements, e.g., clapping)
• The output is harmonic sound (tone and chord) satisfying tonality, the difficult element
Approach
Two approaches were considered for developing the system:
• Approach with a 3D motion sensor camera (Intel RealSense 3D camera)
  – Pro: high motion recognition accuracy
  – Con: such cameras are not yet widespread
• Approach with smartphone sensors
  – Pro: smartphones are widely used
  – Con: it is difficult to recognize body motion with high accuracy
Approach with 3D Motion Sensor Camera
[System diagram: the RealSense camera captures the user's hand; the RealSense SDK performs finger detection and gesture recognition and hands the recognition result to the ensemble support system; the system determines the coordinates of the hand, controls the performance sound with gestures, and its pitch determiner applies the tonality constraints obtained from Songle; the performance sound is output together with the background music.]
How to Input Body Motion
• The up-and-down movement of the fingertip represents the pitch contour (the time change of fingertip height)
• The fingertip height is converted into a pitch satisfying the tonality (e.g., 392, 440, 494, and 523 Hz in the figure), as in the sketch below
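Below is a minimal sketch, not the authors' code, of this mapping: the fingertip height is normalized and snapped to the nearest pitch allowed by the tonality constraints. The vertical range (0–400 mm) and the helper name height_to_pitch are illustrative assumptions.

```python
def height_to_pitch(y_mm: float, allowed_hz: list[float],
                    y_min: float = 0.0, y_max: float = 400.0) -> float:
    """Map a fingertip height to the nearest pitch permitted by the tonality constraints."""
    # Normalize the height into [0, 1], clamping values outside the tracked range.
    ratio = max(0.0, min(1.0, (y_mm - y_min) / (y_max - y_min)))
    # Interpolate a raw target frequency between the lowest and highest allowed pitch.
    target = min(allowed_hz) + ratio * (max(allowed_hz) - min(allowed_hz))
    # Snap to the nearest allowed pitch.
    return min(allowed_hz, key=lambda f: abs(f - target))

# Example with the frequencies shown on the slide (G4, A4, B4, C5).
print(height_to_pitch(250.0, [392.0, 440.0, 494.0, 523.0]))  # -> 494.0
```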
Sustained Sound and Decaying Sound
This system outputs two types of sounds:
• Sustained sound: the volume is held until the offset gesture (e.g., violin, flute)
• Decaying sound: the volume decays over time after the onset (e.g., piano, metallophone, xylophone)
How to Operate the System
The user can operate the system with gestures:
• Thumb up: switch the sound type (sustained/decaying)
• Spread fingers: onset of a sustained sound
• Fist: offset of a sustained sound
• Tap: onset of a decaying sound
Improving the Accuracy of Gesture Recognition
• Delay and false recognition were noticeable in the default gesture recognition functions of the RealSense SDK
• The accuracy of gesture recognition was improved by optimizing thresholds
• A motion is recognized as a tap only if all of the following thresholds are satisfied (see the sketch below):
  – Fingertip speed > 1.8 m/s
  – Palm speed > 0.6 m/s
  – Moving distance > 20 mm
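A minimal sketch of this tap test follows; the function name and the per-frame measurement of the three quantities are assumptions, while the thresholds are the ones given above.

```python
FINGERTIP_SPEED_MIN = 1.8   # m/s
PALM_SPEED_MIN = 0.6        # m/s
MOVING_DISTANCE_MIN = 20.0  # mm

def is_tap(fingertip_speed: float, palm_speed: float, moving_distance: float) -> bool:
    """Recognize a tap only when all three threshold tests pass."""
    return (fingertip_speed > FINGERTIP_SPEED_MIN
            and palm_speed > PALM_SPEED_MIN
            and moving_distance > MOVING_DISTANCE_MIN)

print(is_tap(2.1, 0.8, 25.0))  # True
print(is_tap(2.1, 0.4, 25.0))  # False: palm speed below threshold
```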
Determination of Output Sound
• The output sound is restricted by tonality constraints
• Tonality constraints are a list of (millisecond, pitch frequencies) entries specifying the pitches that can be output under a particular chord
  – They change according to the chord progression of the background music
  – They were prepared in advance

Example tonality constraints derived from the background music, as a list of (ms, allowed frequencies in Hz); the slide labels the three chord regions C#M7, B♭m7, and F#M7:

  3033–5283    261.626  329.628  349.228  391.955  440
  6033–8283    261.626  293.665  349.228  440
  9033–11283   261.626  293.665  349.228  440  466.164

(rows are given every 750 ms on the slide)
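One plausible way to store and query such constraints, sketched below under the assumption that each row holds until the next one starts: a time-sorted list of (millisecond, frequencies) rows looked up by binary search during playback.

```python
import bisect

# (start time in ms, frequencies in Hz that may be output), from the slide.
constraints = [
    (3033, [261.626, 329.628, 349.228, 391.955, 440.0]),
    (6033, [261.626, 293.665, 349.228, 440.0]),
    (9033, [261.626, 293.665, 349.228, 440.0, 466.164]),
]
start_times = [t for t, _ in constraints]

def allowed_pitches(now_ms: int) -> list[float]:
    """Return the frequencies permitted at the given playback time."""
    i = bisect.bisect_right(start_times, now_ms) - 1
    return constraints[max(i, 0)][1]  # before the first row, fall back to it

print(allowed_pitches(7000))  # -> the row starting at 6033 ms
```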
Constituents of Tonality Constraints
• Tonality constraints contain the constituent notes of a chord plus the notes that frequently co-occur with that chord (FCNs)
  – Tonality constraints of CM = { C, E, G } + { FCNs }
  – Tonality constraints of Cm = { C, E♭, G } + { FCNs }
• FCNs are determined by statistical analysis of 100 songs
• Song data were obtained using the Songle API [Goto 11]
  – Songle analyzes the chord progressions of song data on the web

[Goto 11] Goto et al. (2011). Songle: A Web Service for Active Music Listening Improved by User Contributions. Proc. of ISMIR 2011, pp. 311–316.
Preparation for Selecting FCNs
• Chords and melody notes are represented by their position relative to the key or the chord root (scale degree)
  – If the key is C, the chord A is represented as chord VI
  – If the root note of a chord is A, the melody note B is represented as melody II

Scale degree of A against the tonic C:
  Ⅰ  #Ⅰ  Ⅱ  #Ⅱ  Ⅲ  Ⅳ  #Ⅳ  Ⅴ  #Ⅴ  Ⅵ  #Ⅵ  Ⅶ
  C  C#  D  D#  E  F  F#  G  G#  A  A#  B

Scale degree of B against the chord root A:
  Ⅰ  #Ⅰ  Ⅱ  #Ⅱ  Ⅲ  Ⅳ  #Ⅳ  Ⅴ  #Ⅴ  Ⅵ  #Ⅵ  Ⅶ
  A  A#  B  C  C#  D  D#  E  F  F#  G  G#
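A minimal sketch of this relative representation: the scale degree is the semitone interval from the tonic (for chords) or from the chord root (for melody notes), written with the Roman numerals used above.

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
DEGREES = ["I", "#I", "II", "#II", "III", "IV", "#IV", "V", "#V", "VI", "#VI", "VII"]

def scale_degree(note: str, reference: str) -> str:
    """Scale degree of `note` relative to `reference` (a tonic or a chord root)."""
    interval = (NOTES.index(note) - NOTES.index(reference)) % 12
    return DEGREES[interval]

print(scale_degree("A", "C"))  # VI: chord A in the key of C
print(scale_degree("B", "A"))  # II: melody note B over the chord root A
```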
Preparation for Selecting FCNs
• FCNs are determined on the basis of scale degree length
  – Scale degree length refers to the total duration for which each scale degree sounds while each chord is played

[Histogram of scale degree length on the chord IM, with one bar per scale degree]
Counting Scale Degree Length
There are rules on how to count scale degree length.
1. Categorize by scale degree: even if the note names differ, notes that have the same scale degree are counted as the same one.
   – Example: the note D# on chord E with tonic A and the note F# on chord G with tonic C are both regarded as melody Ⅶ on the chord ⅤM, so they are counted together.
Counting Scale Degree Length
2. Major and minor: even if the scale degrees are the same, they are counted separately when one occurs in a major key and the other in a minor key.
   – Example: D# on chord E in the key of A (major) and F# on chord G in the key of Cm (minor) have the same scale degree, but they are counted separately because the keys differ in mode.

A sketch applying both counting rules follows.
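The sketch below shows how the two counting rules could be applied when accumulating scale degree lengths; the note-event format (note name, chord root, key mode, duration) is an assumption.

```python
from collections import defaultdict

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# (scale degree as a semitone interval, key mode) -> accumulated duration (s)
length = defaultdict(float)

def count(note: str, chord_root: str, mode: str, duration: float) -> None:
    # Rule 1: pool notes by scale degree against the chord root,
    # regardless of their note names.
    degree = (NOTES.index(note) - NOTES.index(chord_root)) % 12
    # Rule 2: keep separate bins for major and minor keys.
    length[(degree, mode)] += duration

# D# over chord E in A major and F# over chord G in C minor share the
# scale degree VII (interval 11) but land in separate major/minor bins.
count("D#", "E", "major", 0.5)
count("F#", "G", "minor", 0.5)
print(dict(length))  # {(11, 'major'): 0.5, (11, 'minor'): 0.5}
```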
Selecting FCNs
• A scale degree is selected as an FCN when its relative scale degree length f exceeds a threshold proportional to the maximum value:

  f > f_max × α

[Histogram of relative scale degree length (vertical axis 0–0.3) per scale degree, with bars marked as chord constituents or FCNs according to whether they exceed f_max × α]
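A minimal sketch of this selection rule follows. The alpha value, the histogram numbers, and the exclusion of chord constituents from the FCN set are illustrative assumptions.

```python
def select_fcns(hist: dict[str, float], chord_tones: set[str],
                alpha: float = 0.5) -> set[str]:
    """Return the non-chord-tone degrees whose weight f exceeds f_max * alpha."""
    f_max = max(hist.values())
    return {deg for deg, f in hist.items()
            if f > f_max * alpha and deg not in chord_tones}

hist = {"I": 0.28, "II": 0.12, "III": 0.25, "IV": 0.05, "V": 0.20, "VI": 0.15}
print(select_fcns(hist, chord_tones={"I", "III", "V"}))  # -> {'VI'}
```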
Evaluation Experiment for the 3D Motion Sensor Camera
We asked the experiment participants to operate the system and to answer the following questions:
1. Comparing the proposed tap recognition method with the RealSense SDK default method:
   – Q1-1: Whether there was a delay
   – Q1-2: Whether there was false recognition
2. Comparing the statistically generated tonality constraints with tonality constraints consisting only of chord constituents:
   – Q2-1: Whether there was dissonance
   – Q2-2: Whether the user could perform as intended
Result of Experiment
Tap experiment: the proposed method was rated better than the default method.

  Question (0–6 scale, higher is better)    Proposed method   Default method
  Q1-1: absence of delay                    4.75              3.92
  Q1-2: absence of false recognition        5.00              2.92

Tonality constraints experiment: constraints including FCNs allowed some dissonance, but let users perform slightly more as intended.

  Question (0–6 scale, higher is better)    With FCNs   Constituent notes only
  Q2-1: absence of dissonance               4.33        5.25
  Q2-2: performed as intended               5.58        5.50
Additional Evaluation Experiment
• Comparing the delay of the proposed method and the default one
• The experiment participants tapped along with the beat of a song
• The start of each beat is regarded as the ideal timing of tap recognition
• The delay between this ideal timing and the time when the performance sound is heard was recorded
Result of Additional Evaluation Experiment
• Average delay: proposed tap function −34.62 ms, default tap function 69.63 ms
• The average delay of the default method lies in the area of serious delay [Tanaka 13]
• The proposed method is closer to the ideal timing

[Histograms of delay (−375 to 375 ms) against frequency (times) for the proposed and default tap functions, with the ideal timing at 0 ms and the area of serious delay marked]

[Tanaka 13] Tanaka et al. (2013). The Effect of Sound Delay Conditions on Electronic Drum Performance. Technical Committee of Musical Acoustics, Acoustical Society of Japan. (in Japanese)
Approach with Smartphone Sensors
• Smartphones are widely used
• The pitch contour is input by moving the smartphone up and down
• There are three options for specifying rhythm (see the sketch below):
  1. Shake: acceleration and gyro sensors
  2. Clap: ambient light sensor
  3. Tap: a button located on the screen
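As a rough illustration of the three options, the sketch below tests each input channel against a threshold; the threshold values and the idea of detecting a clap as a brief shadow over the light sensor are assumptions, since sensor access itself is platform specific.

```python
SHAKE_ACCEL_MIN = 15.0   # m/s^2: acceleration magnitude treated as a shake
CLAP_LIGHT_DROP = 0.5    # fraction of baseline treated as a covered sensor

def shake_detected(accel_magnitude: float) -> bool:
    return accel_magnitude > SHAKE_ACCEL_MIN

def clap_detected(light_lux: float, baseline_lux: float) -> bool:
    # Clapping in front of the phone briefly shadows the ambient light sensor.
    return light_lux < baseline_lux * CLAP_LIGHT_DROP

def tap_detected(button_pressed: bool) -> bool:
    # The on-screen button already yields a discrete event.
    return button_pressed
```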
Estimating the Note Name with a Bayesian Network
• The note names to output are estimated by a Bayesian network from the value of each sensor and from context such as tonality

Sensor inputs:
  a: acceleration in the y-axis direction
  v: speed
  vc: variation in speed
  g: gravitational acceleration
  p: distance traveled
  t: attack timing
  rm: the most frequent prediction result of the last m predictions

Context such as tonality:
  c: chord of the background music
  ni-1: previous note name

Output:
  ni: note name to output
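The slide specifies a Bayesian network over these variables; as a simplified stand-in, the sketch below estimates the note name with a naive Bayes over discretized features, which keeps the same inputs and output but assumes all features are conditionally independent.

```python
import math
from collections import Counter, defaultdict

class NaiveNoteEstimator:
    """Simplified stand-in for the note-name network (naive Bayes)."""

    def __init__(self):
        self.class_counts = Counter()            # note name -> count
        self.feat_counts = defaultdict(Counter)  # (feature index, value) -> note counts

    def fit(self, samples):
        """samples: iterable of (tuple of discretized features, note name)."""
        for feats, note in samples:
            self.class_counts[note] += 1
            for i, val in enumerate(feats):
                self.feat_counts[(i, val)][note] += 1

    def predict(self, feats):
        def log_posterior(note):
            logp = math.log(self.class_counts[note])
            for i, val in enumerate(feats):
                cnt = self.feat_counts[(i, val)][note]
                # Laplace smoothing so unseen feature values do not zero out a class.
                logp += math.log((cnt + 1) / (self.class_counts[note] + 2))
            return logp
        return max(self.class_counts, key=log_posterior)
```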
Estimating Attack Timing with a Bayesian Network
• Attack timings are estimated by a second Bayesian network
  – Used only for clap and shake, because the touchscreen of a smartphone (tap) is reliable enough on its own

Inputs:
  ax: acceleration in the x-axis direction
  ay: acceleration in the y-axis direction
  v: speed
  vc: variation in speed
  g: gravitational acceleration

Output:
  t: attack timing (0 or 1)
Collecting Training Data by Experiment
• Training data for the Bayesian networks were collected in a participant experiment
  – Sensor data were obtained from five participants, who raised and lowered their smartphones in accordance with the pitch contour of the melody of an existing tune
• The models were trained to estimate the note name and attack timing from the smartphone sensor data
Evaluation of Prediction Accuracy
1. Evaluation of the prediction accuracy of the note name
   – The test data and the training data are the same
   – Two types of accuracy were calculated, per note and per sample (one sample = 5 ms); a sketch of both measures follows
   – Example: a passage with the note names C, D, E spanning 24 samples counts as 3 notes and 24 samples
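A minimal sketch of the two measures; scoring a note as correct when the majority of its 5 ms samples match is one reasonable reading of the per-note unit and therefore an assumption.

```python
def sample_accuracy(pred: list[str], truth: list[str]) -> float:
    """Fraction of 5 ms samples whose estimated note name matches the original."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def note_accuracy(pred: list[str], truth: list[str],
                  note_spans: list[tuple[int, int]]) -> float:
    """Fraction of notes (half-open sample ranges) scored correct by majority vote."""
    correct = 0
    for start, end in note_spans:
        hits = sum(pred[i] == truth[i] for i in range(start, end))
        correct += hits * 2 > (end - start)
    return correct / len(note_spans)
```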
Evaluation of Prediction Accuracy
2. Evaluation of the prediction accuracy of attack timing
   – The recall and precision against the original song were examined, calculated per note:

  Recall = (number of estimated attack timings that match the original ones) / (number of attack timings in the original)

  Precision = (number of estimated attack timings that match the original ones) / (number of estimated attack timings)
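The sketch below computes these two measures; matching an estimated attack to an original one requires a tolerance window, whose width (75 ms here) is an illustrative assumption.

```python
def match_count(estimated_ms, original_ms, tol_ms=75):
    """Count original attacks matched by a distinct estimate within the tolerance."""
    remaining = sorted(estimated_ms)
    matched = 0
    for t in sorted(original_ms):
        hit = next((e for e in remaining if abs(e - t) <= tol_ms), None)
        if hit is not None:
            remaining.remove(hit)  # each estimate may match only one attack
            matched += 1
    return matched

def recall_precision(estimated_ms, original_ms, tol_ms=75):
    m = match_count(estimated_ms, original_ms, tol_ms)
    return m / len(original_ms), m / len(estimated_ms)

print(recall_precision([0, 480, 1020, 1490], [0, 500, 1000, 1500, 2000]))
# -> (0.8, 1.0): 4 of 5 original attacks found, and every estimate matched one
```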
Result of Evaluation
• Prediction accuracy of the note name
  – Per-sample accuracy is higher than per-note accuracy
  – In both cases, touch gives the highest accuracy
• Prediction accuracy of attack timing
  – Precision of shake is low: small movements are misrecognized as shakes
  – Recall of clap is low: the ambient light sensor is not reliable

Accuracy of note name (per note): ratio of notes estimated with the same note name as in the original tune
  Shake  0.49 (131/270)
  Clap   0.49 (133/270)
  Touch  0.56 (152/270)

Accuracy of note name (per sample): ratio of samples estimated with the same note name as in the original tune
  Shake  0.66 (42805/65278)
  Clap   0.73 (47627/65218)
  Touch  0.75 (49155/65251)

Accuracy of attack timing (per note):
  Operation   Recall            Precision
  Shake       0.63 (171/270)    0.26 (171/661)
  Clap        0.14 (39/270)     0.31 (39/151)
Social Reuse of Improvisational Melody Data
• Our system can gather two types of users' performance data:
  – Pitch contour data without conversion, together with the tonality constraints
  – Tonal melody data converted from the pitch contour data
Social Reuse of Improvisational Melody Data
• If users publish their performance data as open data, the data can be used for collaborative music composing or remixing

[Diagram: performance data shared and remixed among multiple users]
Social Reuse of Improvisational Melody Data
• In particular, pitch contour data without tonality constraints can be applied to various chord progressions
  – For example, pitch contour data from our system can be used in a loop sequencer based on pitch contour (melodic outline) [Kitahara 16]

[Kitahara 16] T. Kitahara et al. (2015). A loop sequencer that selects music loops based on the degree of excitement. Proc. of the 12th Sound and Music Computing Conference (SMC 2015), pp. 435–438.
Conclusion
• Two types of ensemble support systems were developed
  – One uses a 3D motion sensor camera and the other uses smartphone sensors
  – Both automatically adjust note pitches to satisfy the tonality of the background music
  – The systems enable music novices to participate in improvisational ensembles
• As future work, we are considering
  – the social reuse of improvisational melody data shared as open data
  – ensembles among multiple users, not only with background music
Improvisational Ensemble
We define an improvisational ensemble as either of the following:
• Several people play instruments together with no plan
• One person plays an instrument with no plan in accordance with background music
Available Gestures in RealSense
Thumb up, fist, V sign, spread fingers, thumb down, full pinch, tap, wave, swipe left, swipe right, and two-finger pinch open
Improvement of Other Gestures
• The default function is used as it is, with two additional conditions based on openness
  – Openness is a value indicating the degree to which the fingers are open
• Condition 1: the openness is 90 or more (openness ≥ 90)
• Condition 2: the openness decreases by 10 or more from the previous frame (old openness − openness ≥ 10)
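A minimal sketch of the two tests; the 0–100 openness value follows the RealSense SDK hand module, and writing the second test as a decrease (to match the prose above) is an editorial assumption.

```python
def openness_conditions(openness: int, old_openness: int) -> tuple[bool, bool]:
    """Evaluate the two openness thresholds described above."""
    widely_open = openness >= 90                   # fingers spread wide
    sharp_close = old_openness - openness >= 10    # openness dropped quickly
    return widely_open, sharp_close
```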
Changing the Range of Pitch
• The user can change the range of pitch using the depth value
  – The depth value is the distance between the hand and the camera
• To change the range of pitch (①), the user extends the hand toward the camera and then moves it up or down to change the pitch (②)
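A minimal sketch of this idea: the depth value selects an octave shift for the playable range. The depth bands and the base range are illustrative assumptions, not values from the slide.

```python
def pitch_range(depth_mm: float, base_low_hz: float = 262.0,
                base_high_hz: float = 523.0) -> tuple[float, float]:
    """Choose the playable pitch range from the hand-camera distance."""
    if depth_mm < 300:      # hand extended toward the camera
        factor = 2.0        # shift the range one octave up
    elif depth_mm > 600:    # hand pulled back
        factor = 0.5        # shift the range one octave down
    else:
        factor = 1.0        # default range
    return base_low_hz * factor, base_high_hz * factor

print(pitch_range(250.0))  # -> (524.0, 1046.0)
```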