Seungchul Lee, sclee@bistel.com
BISTel Inc.
Daeyoung Kim, dykim3@bistel.com
BISTel Inc.
Analyzing 2TB of Raw Trace Data
from a Manufacturing Process:
A First Use Case of Apache Spark for
Semiconductor Wafers from Real Industry
#UnifiedAnalytics #SparkAISummit
Contents
2#UnifiedAnalytics #SparkAISummit
• Introduction to BISTel
– BISTel’s business and solutions
– Big Data for BISTel’s smart manufacturing
• Use cases of Apache Spark in manufacturing industry
– Trace Analyzer (TA)
– Map Analyzer (MA)
Introduction to BISTel
3#UnifiedAnalytics #SparkAISummit
BISTel’s business areas
• Providing analytic solutions based on Artificial Intelligence (AI)
and Big Data to the customers for Smart Factory
4#UnifiedAnalytics #SparkAISummit
BISTel’s solution areas
• World-Class Manufacturing Intelligence through innovation
5#UnifiedAnalytics #SparkAISummit
BISTel’s analytic solution: eDataLyzer
6#UnifiedAnalytics #SparkAISummit
BISTel’s analytic solutions (MA)
7#UnifiedAnalytics #SparkAISummit
• Map Pattern Clustering
– Automatically detect and classify map patterns with/without libraries
– Processes thousands of wafers and gives results in a few minutes
(Figure: defective wafers grouped into clusters)
BISTel’s analytic solutions (TA)
• Specialized Application for Trace Raw Data
– Extracts the vital signs out of equipment trace data
– Provides in-depth analysis that traditional methods cannot reach
8#UnifiedAnalytics #SparkAISummit
(Figure: abnormal vs. normal traces)
BISTel’s big data experiences
9#UnifiedAnalytics #SparkAISummit
BISTel’s big data experiences
10#UnifiedAnalytics #SparkAISummit
- YMA test using Spark -
- Big data platforms comparison -
Trace Analyzer (TA)
11#UnifiedAnalytics #SparkAISummit
Trace Data
• Trace Data is sensor data collected from processing equipment
within a semiconductor fab during a process run.
12#UnifiedAnalytics #SparkAISummit
- Semiconductor industry -
- Wafer -
Logical Hierarchy of the trace data
13#UnifiedAnalytics #SparkAISummit
Wafer → Lot → Recipe Step → Recipe → Process
(Figure: trace visualizations of the whole process, one process, one recipe, and one wafer)
An example of the trace data
14#UnifiedAnalytics #SparkAISummit
Process | Recipe  | Recipe step | Lot     | Wafer | Param1 | Param2 | Time
021_LIT | RecipeA | 1           | 1501001 | 1     | 32.5   | 45.4   | 2015-01-20 09:00:00
Data attributes
• Base unit : one process and one parameter
• 1000 wafers
• Each wafer has 1000~2000 data points in a recipe step
• Some factors that make the trace data volume huge
• # of parameters
• # of processes
• # of wafers
• # of recipe steps
• duration of the recipe step
15#UnifiedAnalytics #SparkAISummit
An example of the trace data – (2)
16#UnifiedAnalytics #SparkAISummit
No. | Fab   | # of processes | # of recipe steps | Avg. recipe process time | Data frequency | # of units | Parameters per unit (max)
1   | Array | 109            | 10                | 16 mins                  | 1Hz            | 288        | 185
2   | CF    | 25             | 5                 | 1 min                    | 1Hz            | 154        | 340
3   | CELL  | 12             | 7                 | 1 min                    | 1Hz            | 213        | 326
4   | MDL   | 5              | 12                | 2 mins                   | 1Hz            | 32         | 154
• Some calculations
• For one process, one parameter and one wafer
• 16 * 10 * 60 sec * 1Hz = 9600 points
• Multiple parameters, multiple processes and multiple wafers
• 9600 * 288 * 185 * 109 * (# of wafers)
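The back-of-envelope sizing above can be written out as a small helper. This is a hypothetical illustration, not part of the TA code base; it only reproduces the arithmetic on the slide (16 mins × 10 steps × 60 s × 1Hz per wafer, scaled by units, parameters and processes for the Array fab row).

```java
import java.util.*;

// Hypothetical helper reproducing the slide's volume calculation.
public class TraceVolumeEstimate {

    // Points for one process, one parameter, one wafer:
    // avg recipe time (min) * recipe steps * 60 s * sampling rate (Hz)
    static long pointsPerWafer(int avgRecipeMinutes, int recipeSteps, int hz) {
        return (long) avgRecipeMinutes * recipeSteps * 60 * hz;
    }

    // Total points across units, parameters per unit, processes and wafers.
    static long totalPoints(long perWafer, int units, int paramsPerUnit,
                            int processes, long wafers) {
        return perWafer * units * paramsPerUnit * processes * wafers;
    }

    public static void main(String[] args) {
        long perWafer = pointsPerWafer(16, 10, 1); // = 9600, as on the slide
        System.out.println(perWafer);
        // Array fab: 288 units, 185 parameters per unit, 109 processes
        System.out.println(totalPoints(perWafer, 288, 185, 109, 1000));
    }
}
```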
Spark : Smart manufacturing
• Spark is one of the best ways to process big data in batch analytics
• Distributing the data by parameter fits Apache Spark's model well
• Easy deployment and scalability when it comes to providing the
solutions to our customers
17#UnifiedAnalytics #SparkAISummit
18#UnifiedAnalytics #SparkAISummit
Naïve way: applying Spark to TA
How to apply Spark to TA?
traceDataSet = config.getTraceRDDs().mapToPair(t -> {
    String recipeStepKey = TAUtil.getRecipeStepKey(t); // use recipe step as key
    return new Tuple2<String, String>(recipeStepKey, t);
}).groupByKey();

traceDataSet.flatMap(t -> {
    Map<String, TraceDataSet> allTraceData = TAUtil.getTraceDataSet(t);
    ...
    TAUtil.seperateFocusNonFocus(allTraceData, focus, nonFocus); // separate data
    ta.runTraceAnalytic(focus, nonFocus, config); // calling the TA core
    ...
});
Most cases in manufacturing industry
• In real industry, most parameters have a small number of data points
(most commonly sampled at 1Hz).
• In addition, the number of wafers to be analyzed is not massive
(up to 1,000 wafers).
• Therefore, the total number of data points in a process can easily be
processed on a single core
Issues in manufacturing industry
21#UnifiedAnalytics #SparkAISummit
• Last year, we received an email indicating that..
Big parameter
22#UnifiedAnalytics #SparkAISummit
• Tools with a high sampling frequency or a long recipe time can produce
a huge volume for a single parameter
• Requirements in industry
• For one parameter
• 400,000 wafers
• 20,000 data points.
Limitations of the Naïve TA
23#UnifiedAnalytics #SparkAISummit
for (Tuple2<String, Iterable<String>> recipeTrace : allTraceData) {
    TraceDataSet ftds = new TraceDataSet();
    Iterable<String> oneRecipe = recipeTrace._2();
    for (String tr : oneRecipe) {
        TraceData td = TAUtil.convertToTraceData(tr);
        ftds.add(td);
    }
}

traceDataSet = config.getTraceRDDs().mapToPair(t -> {
    String recipeStepKey = TAUtil.getRecipeStepKey(t); // use recipe step as key
    return new Tuple2<String, String>(recipeStepKey, t);
}).groupByKey();
• All the data points for a key are pushed onto one core by shuffling
• A single Java object holds too many data points
Need for a new TA Spark
24#UnifiedAnalytics #SparkAISummit
• The naïve TA Spark version cannot process massive numbers of data points.
• Nowadays, new technology enhancements enable data capture at much
higher frequencies.
• A TA version for the “big parameter” case is necessary.
Our idea
25#UnifiedAnalytics #SparkAISummit
• Extracting the TA core logic
– Batch mode
– Key-based processing
– Using .collect() to broadcast variables
– Caching the object
• Preprocessing trace data
• Key-based processing
• Base unit : process key or recipe step key
Batch
26#UnifiedAnalytics #SparkAISummit
JavaPairRDD<String, List<String>> traceDataRDD
    = TAImpl.generateBatch(traceData);
• First element : process, recipe step, parameter and batch ID
• Second element : lot, wafer and trace values
(Figure: per-batch summary statistics for Param A)
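As a rough sketch of the batch layout described above: the pair key combines process, recipe, recipe step, parameter and a batch ID, and wafers are split into fixed-size batches so no single key holds every data point of a big parameter. The class, key format and `generateBatch` signature here are illustrative assumptions, not the actual `TAImpl` API.

```java
import java.util.*;

// Illustrative sketch of the batch key-value layout (not BISTel's code).
public class BatchKeySketch {

    // Composite key: process | recipe | recipe step | parameter | batch ID.
    static String batchKey(String process, String recipe, int step,
                           String param, int batchId) {
        return String.join("|", process, recipe,
                String.valueOf(step), param, String.valueOf(batchId));
    }

    // Split wafers into fixed-size batches so that one key never ends up
    // holding all data points for a big parameter.
    static Map<String, List<String>> generateBatch(List<String> wafers, int batchSize) {
        Map<String, List<String>> batches = new LinkedHashMap<>();
        for (int i = 0; i < wafers.size(); i++) {
            String key = batchKey("021_LIT", "RecipeA", 1, "Param1", i / batchSize);
            batches.computeIfAbsent(key, k -> new ArrayList<>()).add(wafers.get(i));
        }
        return batches;
    }
}
```

With five wafers and a batch size of two, three batch keys are produced, so the load spreads across three partitions instead of one.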
Collect() : TA Cleaner
27#UnifiedAnalytics #SparkAISummit
• Filtering out traces that have an unusual process-time duration.
• Use the three main Spark APIs
– mapToPair : extracts the relevant information
– reduceByKey : aggregates values based on the key
– collect : sends the data to the driver
Collect() : TA Cleaner – (2)
28#UnifiedAnalytics #SparkAISummit
(Figure: four workers, each holding a local wafer → trace-length table,
e.g. wafer 1 → 65, wafer 2 → 54, …)
• traceData.mapToPair()
• Returns
– key : process
– value : wafer and its trace length
Collect() : TA Cleaner – (3)
29#UnifiedAnalytics #SparkAISummit
• reduceByKey()
• Aggregating contexts into one based on the process key
(Figure: the per-worker wafer/length tables are shuffled and summed by
process key into a single table, e.g. wafer 1: 65 + 88 = 153,
wafer 2: 54 + 92 = 146)
Collect() : TA Cleaner – (4)
30#UnifiedAnalytics #SparkAISummit
• Applying filtering method in each worker
mapToPair(t -> {
    String pk = t._1();
    Double[] values = toArray(t._2());
    FilterThresholds ft = CleanerFilter.filterByLength(values);
    return new Tuple2<>(pk, ft);
}).collect();
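The mapToPair → reduceByKey → collect flow of the cleaner can be simulated with plain JDK collections, so the logic can be followed without a Spark cluster. The concrete filter rule here (flag wafers whose trace length deviates from the per-process median by more than a tolerance) is an illustrative assumption, not BISTel's actual `CleanerFilter` logic.

```java
import java.util.*;
import java.util.stream.*;

// Plain-JDK simulation of the TA Cleaner flow (illustrative only).
public class CleanerSketch {

    // "mapToPair" + "reduceByKey": group wafer trace lengths by process key.
    // Each record: { processKey, waferId, traceLength }.
    static Map<String, List<Integer>> groupLengths(List<String[]> traces) {
        return traces.stream().collect(Collectors.groupingBy(
                t -> t[0],
                Collectors.mapping(t -> Integer.parseInt(t[2]), Collectors.toList())));
    }

    // "collect" side: flag wafers whose length deviates from the per-process
    // median by more than tol (assumed filter rule).
    static Set<String> unusualWafers(List<String[]> traces, int tol) {
        Map<String, List<Integer>> byProc = groupLengths(traces);
        Set<String> unusual = new HashSet<>();
        for (String[] t : traces) {
            List<Integer> ls = new ArrayList<>(byProc.get(t[0]));
            Collections.sort(ls);
            int median = ls.get(ls.size() / 2);
            if (Math.abs(Integer.parseInt(t[2]) - median) > tol) unusual.add(t[1]);
        }
        return unusual;
    }
}
```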
Example 2 : Computing outliers
31#UnifiedAnalytics #SparkAISummit
• To detect outliers in a process, the median statistic is required.
• To compute the median value, the values need to be sorted:
• Sort(values)
Example 2 : Computing outliers – (2)
32#UnifiedAnalytics #SparkAISummit
• Computed an approximate median value for big data processing
• Applied a histogram for the median (mapToPair → reduceByKey → collect)
• Collecting the merged histogram to the driver
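A minimal sketch of the histogram trick, assuming fixed-width bins over a known value range: each partition builds a histogram (the mapToPair step), histograms are merged (the reduceByKey step), and the driver reads an approximate median off the collected counts, with error bounded by the bin width. The bin layout is an assumption for illustration.

```java
import java.util.*;

// Histogram-based approximate median (illustrative sketch).
public class ApproxMedian {

    // Per-partition step: bucket values into fixed-width bins.
    static long[] histogram(double[] values, double min, double width, int bins) {
        long[] h = new long[bins];
        for (double v : values) {
            int b = (int) ((v - min) / width);
            h[Math.min(Math.max(b, 0), bins - 1)]++; // clamp to edge bins
        }
        return h;
    }

    // Merge step, analogous to reduceByKey on per-partition histograms.
    static long[] merge(long[] a, long[] b) {
        long[] out = a.clone();
        for (int i = 0; i < b.length; i++) out[i] += b[i];
        return out;
    }

    // Driver side: walk the merged histogram to the bin holding the median
    // and return that bin's midpoint (error bounded by the bin width).
    static double approxMedian(long[] h, double min, double width) {
        long total = Arrays.stream(h).sum();
        long seen = 0;
        for (int i = 0; i < h.length; i++) {
            seen += h[i];
            if (seen * 2 >= total) return min + (i + 0.5) * width;
        }
        return min;
    }
}
```

Only the small, fixed-size histograms cross the network and reach the driver, instead of every raw data point, which is what makes this feasible for a "big parameter".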
Caching the trace data
33#UnifiedAnalytics #SparkAISummit
• Persist the trace data before applying the TA algorithm
• Prevents reloading the data each time an action is performed
Focus = Focus.persist(StorageLevel.MEMORY_AND_DISK());
NonFocus = NonFocus.persist(StorageLevel.MEMORY_AND_DISK());
RDD vs. DataSet (DataFrame)
34#UnifiedAnalytics #SparkAISummit
• RDD
– All the data points in a process must be scanned,
so the advantage of the DataSet is weakened
• DataSet (DataFrame)
– Hard to manipulate trace data using SQL
– SQL fits basic statistics (Min, Max, Avg, Count, …) but not
advanced algorithms (Fast Fourier Transform and segmentation)
Demo : Running the TA algorithm
35#UnifiedAnalytics #SparkAISummit
• Analyzed 2TB of trace data using TA
TA results in eDataLyzer
36#UnifiedAnalytics #SparkAISummit
Results of the Naïve TA
37#UnifiedAnalytics #SparkAISummit
Results of the big parameter TA Spark
38#UnifiedAnalytics #SparkAISummit
Two different TA Spark versions
39#UnifiedAnalytics #SparkAISummit
             | Data size | # of parameters | # of wafers | # of data points | Running time
Naïve TA     | 2TB       | 270,000         | 250         | 1,000            | 1.1h
Big Param TA | 1TB       | 4               | 400,000     | 20,000           | 54min
Map Analyzer (MA)
40#UnifiedAnalytics #SparkAISummit
Map Analytics (MA)
41#UnifiedAnalytics #SparkAISummit
• Hierarchical clustering is used to find a defect pattern
S.-C. Hsu, C.-F. Chien / Int. J. Production Economics 107 (2007) 88–103
MA datasets
42#UnifiedAnalytics #SparkAISummit
Process: FPP | Process step: Fall_bin | Parameter: P01 | Lot: 8152767 | Wafer: 23
Defective chips: -02,04|-01,22|+00,25|+08,33|+04,05
waferDataSetRDD.mapToPair(...)   // generating a key-value pair
               .groupBy()
               .mapToPair(...);  // calling hierarchical clustering
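Assuming the defective-chips column above is a '|'-separated list of signed "x,y" chip coordinates, a parser for it might look like this (hypothetical helper, not BISTel's actual code):

```java
import java.util.*;

// Hypothetical parser for the encoded defective-chip column.
public class DefectiveChipParser {

    // "-02,04|-01,22|..." -> list of {x, y} chip coordinates.
    static List<int[]> parse(String encoded) {
        List<int[]> chips = new ArrayList<>();
        for (String pair : encoded.split("\\|")) {
            String[] xy = pair.split(",");
            // Integer.parseInt accepts the leading '+'/'-' sign directly.
            chips.add(new int[] { Integer.parseInt(xy[0]), Integer.parseInt(xy[1]) });
        }
        return chips;
    }
}
```

The sample value from the table parses into five coordinate pairs, starting with (-2, 4).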
BISTel’s first approach for MA
43#UnifiedAnalytics #SparkAISummit
• Uses the batch mode for clustering massive numbers of wafers.
Demo : Running the MA algorithm
44#UnifiedAnalytics #SparkAISummit
• The dataset consists of 26 parameters covering 120,000 wafers
Problems in batch for clustering
45#UnifiedAnalytics #SparkAISummit
• In the manufacturing industry, some issues exist

         | # of wafers | Time                     | Detecting a pattern
DataSet1 | 15          | 2017-02-01 09:00 ~ 09:30 | Yes
DataSet2 | 7,000       | 2017-02-01 ~ 2017-02-08  | No
Spark Summit: SHCA algorithm
46#UnifiedAnalytics #SparkAISummit
• At Spark Summit 2017, Chen Jin presented a scalable hierarchical
clustering algorithm (SHCA) using Spark.
A SHCA algorithm using Spark
47#UnifiedAnalytics #SparkAISummit
Jin, Chen, et al. "A scalable hierarchical clustering algorithm using
spark." 2015 IEEE First International Conference on Big Data Computing
Service and Applications. IEEE, 2015.
Applying SHCA to wafer datasets
48#UnifiedAnalytics #SparkAISummit
Wafer map ID | Coordinates of defective chips
A            | (13,22), (13,23), (13,24), (13,25)…
B            | (5,15), (6,12), (6,17), (8,25)…
C            | (9,29), (16,33), (19,39), (22,25)…
D            | (19,9), (20,2), (23,21), (25,4)…
E            | (5,5), (5,8), (5,15), (5,25)…
• Designed the key-value pairs
• Minimum spanning tree (MST)
– Vertex : wafer
– Edge : distance(w1, w2) between wafers
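The MST construction above can be sketched with a concrete distance and Prim's algorithm over the complete wafer graph. Jaccard distance between defect-coordinate sets is an illustrative choice here; the talk does not specify the actual metric.

```java
import java.util.*;

// MST over wafers (vertices) weighted by an assumed Jaccard distance.
public class WaferMst {

    // 1 - |intersection| / |union| of the two defect-coordinate sets.
    static double distance(Set<String> w1, Set<String> w2) {
        Set<String> inter = new HashSet<>(w1);
        inter.retainAll(w2);
        Set<String> union = new HashSet<>(w1);
        union.addAll(w2);
        return union.isEmpty() ? 0.0 : 1.0 - (double) inter.size() / union.size();
    }

    // Prim's algorithm on the complete graph; returns the total MST weight.
    static double mstWeight(List<Set<String>> wafers) {
        int n = wafers.size();
        boolean[] inTree = new boolean[n];
        double[] best = new double[n]; // cheapest edge into the tree
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0;
        double total = 0;
        for (int it = 0; it < n; it++) {
            int u = -1;
            for (int v = 0; v < n; v++)
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            inTree[u] = true;
            total += best[u];
            for (int v = 0; v < n; v++)
                if (!inTree[v])
                    best[v] = Math.min(best[v], distance(wafers.get(u), wafers.get(v)));
        }
        return total;
    }
}
```

Cutting the longest MST edges then yields the hierarchical clusters, which is the core idea behind the SHCA approach cited above.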
Comparison between two versions
49#UnifiedAnalytics #SparkAISummit
Comparison between two versions - (2)
50#UnifiedAnalytics #SparkAISummit
Spark stage results of MA
51#UnifiedAnalytics #SparkAISummit
• Approximately 100,000 wafers are analyzed for clustering
Comparison of the results
52#UnifiedAnalytics #SparkAISummit
(Chart: running time (y-axis, 0–2500) of the batch MA vs. the new MA
for 5,000, 50,000, 100k, 160k and 320k wafers)
Summary
53#UnifiedAnalytics #SparkAISummit
• MA using SHCA is more accurate than the batch MA.
• However, the batch MA runs faster than the new MA.
• In the manufacturing industry, we suggest using both MAs.
Conclusions
54#UnifiedAnalytics #SparkAISummit
• A first use case of Apache Spark in the semiconductor industry
– Terabytes of trace data were processed
– Achieved hierarchical clustering of semiconductor wafers on
distributed machines
Acknowledgements
55#UnifiedAnalytics #SparkAISummit
• BISTel Korea (BK)
– Andrew An
• BISTel America (BA)
– James Na
– WeiDong Wang
– Rachel Choi
– Taeseok Choi
– Mingyu Lu
* This work was supported by the World Class 300 Project (R&D) (S2641209, "Development of next generation intelligent Smart
manufacturing solution based on AI & Big data to improve manufacturing yield and productivity") of the MOTIE, MSS(Korea).
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
