Texera: A Scalable Cloud Computing Platform for Sharing Data and Workflow-Based Analyses
Prof. Chen Li
Computer Science, UC Irvine
April 26, 2024

Example: sequence analysis in biology
Alice: Biologist (PI)
Sally: Bioinformatician
Bob: Bioinformatician
Chen Li, UCI

[Figure: sequence analysis pipeline, Steps 1 to 7. Labels: FASTQ Sequences, Cellranger, Count Matrix, Quality Control, Normalization (Step 3), Data Integration, Dimensionality Reduction, Clustering, Cluster Annotation, Trajectory analysis, Differential "gene" expression analysis, AI/ML analysis]

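The pipeline above can be thought of as a dependency graph between steps. The following is a minimal sketch using only Python's standard library, not Texera's own workflow format. The step names come from the slide, but the edges between them (a linear chain ending in three downstream analyses) are an assumption, since the extracted slide text does not preserve the arrows. The names `PIPELINE`, `execution_order`, and `upstream` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each step maps to the steps it needs first.
# The edges are assumed for illustration; only the labels come from the slide.
PIPELINE = {
    "Cellranger": {"FASTQ Sequences"},
    "Count Matrix": {"Cellranger"},
    "Quality Control": {"Count Matrix"},
    "Normalization": {"Quality Control"},
    "Data Integration": {"Normalization"},
    "Dimensionality Reduction": {"Data Integration"},
    "Clustering": {"Dimensionality Reduction"},
    "Cluster Annotation": {"Clustering"},
    "Trajectory analysis": {"Cluster Annotation"},
    "Differential gene expression analysis": {"Cluster Annotation"},
    "AI/ML analysis": {"Cluster Annotation"},
}


def execution_order(pipeline):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(pipeline).static_order())


def upstream(pipeline, step):
    """Return every step that must finish before `step` can start."""
    seen = set()
    stack = list(pipeline.get(step, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(pipeline.get(current, ()))
    return seen


if __name__ == "__main__":
    for position, step in enumerate(execution_order(PIPELINE), start=1):
        print(position, step)
```

Writing the steps as a graph rather than a script makes it explicit which steps depend on which, so a change to one step only requires rerunning the steps downstream of it.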
Coding challenges
Sally: Bioinformatician
[Figure labels: Data preparation, Data analytics, Visualization]
- Coding is hard!
- Version control of libraries
- Needs servers
- Slow on large data
- Not every lab can afford a bioinformatician

Collaboration challenges
● Collaborators of different backgrounds:
○ Biologists
○ Bioinformaticians
○ Computer scientists
● Collaborators from different organizations
○ Same lab: senior students vs new students
○ Other labs

Collaboration: existing tools
Limitations
● Only file management, no run-time environment
● Inefficient!

AI/ML opportunities
● How to utilize state-of-the-art AI/ML technologies?
● Require advanced coding skills
● Not easily available

Our solution
Cloud-computing services for sharing data and workflow-based analyses
Benefits:
- Cloud services (no installation or software patches)
- Version control
- Shared editing/execution
- Sharing data and workflows
- Parallel engine, scalable
- …

Open source

Texera example workflow

Demo!
Alice: Biologist (PI)
Sally: Bioinformatician
Bob: Bioinformatician

Figures on the entire dataset
[Figure panels: Quality Control, Elbow plot, Clustered UMAP, Annotated UMAP]

Texera Statistics
# of user accounts: 332
# of projects: 86
# of workflows: 2,257
# of executions: 31,000
# of workflow versions: 357,000
# of publications: 23
# of deployed servers: 7
# of CPU cores in the largest deployment: 400
# of files on GitHub: 1,291
# of lines of code on GitHub: 101,690
# of pull requests on GitHub: 2,096
# of current PhD students: 7
# of collaborating professors: 17
# of involved undergraduates: 80+
# of completed PhD theses: 3
# of development years: 7

Example: analyzing brain images, 256 GB

Teaching non-STEM students AI/ML using Texera

High-school students using Texera

New NIH award (dkNET)
Mission: to serve the diabetes, endocrinology, and metabolic diseases research communities through the FAIR sharing of data and knowledge.

Pilot project: inviting users

Open research problems
- Support a ChatGPT-like interface
- Provide more operators and workflows related to sequencing
- Make analysis parameters configurable
- Parallelize bottleneck steps
- Make more AI/ML techniques available
- Migrate existing programs to workflows
- Support public clouds (e.g., AWS, GCP)
- …

Summary
- Cloud-computing platform
- GUI-based workflows (no coding needed)
- Collaboration and sharing of data/analyses
- Parallel computing: for big data
- Supporting multiple languages: Python, R, Java, …
- Supporting AI/ML (training, inference, …)

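As an illustration of the "parallel computing" point, a step such as normalization can be applied to independent chunks of cells at the same time. This is a minimal sketch using only Python's standard library and is not Texera's engine. The toy count rows, the scale factor of 10,000, the log1p transform, and all function names are assumptions chosen for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def normalize_cell(counts, scale=10_000):
    """Scale one cell's counts to a fixed total, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(count / total * scale) for count in counts]


def normalize_chunk(chunk):
    """Normalize every cell in a chunk; chunks are independent of each other."""
    return [normalize_cell(cell) for cell in chunk]


def split(rows, n_chunks):
    """Split rows into at most n_chunks contiguous chunks."""
    size = max(1, math.ceil(len(rows) / n_chunks))
    return [rows[start:start + size] for start in range(0, len(rows), size)]


def parallel_normalize(rows, workers=2):
    """Normalize chunks concurrently and return cells in their original order."""
    chunks = split(rows, workers)
    # Threads keep the sketch portable; CPU-bound work on real data would need
    # a process pool or a distributed engine to see an actual speedup.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(normalize_chunk, chunks))
    return [cell for chunk in results for cell in chunk]


if __name__ == "__main__":
    toy_counts = [[1, 1], [0, 0], [3, 1]]  # rows are cells, columns are genes
    print(parallel_normalize(toy_counts))
```

Because each cell is normalized independently, the chunks share no state and the result is the same as the sequential version, which is what makes this kind of step a natural candidate for parallel execution.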
Texera: A Scalable Cloud Computing Platform for Sharing Data and Workflow-Based Analyses
Prof. Chen Li
Computer Science, UC Irvine
Acknowledgements: Yicong Huang, Sally Lee, Xinyuan Lin, Xiaozhen Liu, Kun Woo (Chris) Park, Kevin Wu, and the Texera team

dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data and Workflow-Based Analyses" 04/26/2024
