Data Dissected — Breaking Down the Building Blocks of Information

Data is not one thing. It comes in many forms, each with its own structure, complexity, and best use cases. Some data is organized into predictable rows and columns. Some lives in flexible, evolving formats. Other data captures rich media like images or audio, or turns the physical world into digital readings. In scientific and industrial fields, data can be highly specialized. And across all of these formats and types, sizes can range from a few bytes to trillions of bytes.

In this discussion, we will not cover information in the sense of higher-level conceptual datasets such as real estate listings, population statistics, or business metrics. Nor will we focus on the low-level details of bits and bytes that describe how data is physically stored on a computer. Instead, we’ll start at the logical constructs that sit between those two extremes — the formats, schemas, and structures that define how data is represented and understood.

Data Dissected takes a tour through these categories — from structured data’s neat grids to the sprawling complexity of scientific datasets — explaining what each type looks like, where it comes from, and how it’s typically used. By understanding the landscape of data types, we can choose the right tools, storage, and analysis methods for each type of data, and use it to its fullest extent.

1. Structured Data — The Grid of Information

Structured data is the most familiar form of data — the kind you find in databases and spreadsheets. It’s organized into rows and columns, with each column holding a specific type of information such as a number, date, or short text field. For example, a train ticket table might have columns for train number, passenger first name, passenger last name, train time, train date, departure city, and arrival city. These distinct fields of the train ticket entity give the data its structure.

The Basic Building Blocks of Structured Data

  • Integers — whole numbers without decimal places
  • Decimals/Floats — numbers with decimal points for precision
  • Text/Strings — alphanumeric characters for names, descriptions, etc.
  • Dates and Times — temporal values for events and scheduling
  • Booleans — true/false values for binary conditions
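These building blocks combine into typed records. As a minimal sketch, the train ticket example above can be modeled as a hypothetical record whose fields use each of the basic types (all names here are illustrative, not from any real schema):

```python
from dataclasses import dataclass
from datetime import date, time

@dataclass
class TrainTicket:
    train_number: int          # integer
    fare: float                # decimal/float
    passenger_first_name: str  # text/string
    passenger_last_name: str   # text/string
    travel_date: date          # date
    departure_time: time       # time
    departure_city: str
    arrival_city: str
    checked_in: bool           # boolean

# One row of the train ticket table
ticket = TrainTicket(4021, 59.90, "Ada", "Lovelace",
                     date(2024, 6, 1), time(9, 15),
                     "Boston", "New York", False)
```

Every record in the table shares this exact shape — that fixed, predictable layout is what makes structured data easy to query and index.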

Tables, Spreadsheets, and the Relational Model

Structured data is most often stored in database tables or spreadsheets — formats that are straightforward to query, filter, and analyze. In relational databases, these tables are organized into rows (records) and columns (fields), and can be linked together through keys to combine information from different sources. 
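The key-based linking described above can be sketched in a few lines of plain Python, with two hypothetical tables represented as lists of rows and joined through a shared `passenger_id` key — conceptually the same operation as a SQL inner join:

```python
# Two relational "tables" as lists of rows (dicts); names are illustrative
passengers = [
    {"passenger_id": 1, "name": "Ada Lovelace"},
    {"passenger_id": 2, "name": "Alan Turing"},
]
tickets = [
    {"ticket_id": 100, "passenger_id": 1, "train": "IC-4021"},
    {"ticket_id": 101, "passenger_id": 2, "train": "IC-8833"},
]

# Index one table by its key, then link each ticket to its passenger
by_id = {p["passenger_id"]: p for p in passengers}
joined = [{**t, "name": by_id[t["passenger_id"]]["name"]} for t in tickets]
```

In a real database the engine performs this join for you, using indexes on the key columns to avoid scanning entire tables.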

In real-world business settings, structured data might capture every trade on a stock exchange with timestamps and pricing details, manage airline bookings complete with seat assignments and fare classes, track department store purchases alongside product SKUs and payment methods, or store loyalty program customer records enriched with tier status and points balances.

For many enterprises, structured data still accounts for a substantial share of the information they manage.

2. Semi-Structured Data — Structure with Flexibility

Semi-structured data has an internal organization, but it doesn’t require a strict, fixed schema for every record. It often contains complex, nested structures with varying topologies, where different records may hold different fields, hierarchies, or embedded arrays. This makes it possible to model richer, more detailed information without redesigning an entire database schema.

Formats like JSON, XML, Avro, Parquet, and ORC are common examples. A JSON file, for instance, might store simple customer details in one record and a deeply nested purchase history in another, while both remain valid JSON.
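A tiny illustration of that flexibility, using made-up customer records: both records below are valid JSON in the same file, yet one is flat while the other embeds a nested array of purchases.

```python
import json

# Two records with different shapes, parsed from one JSON document
records = json.loads("""
[
  {"customer_id": 1, "name": "Ada"},
  {"customer_id": 2, "name": "Alan",
   "purchases": [{"sku": "B-7", "qty": 2}, {"sku": "C-1", "qty": 1}]}
]
""")

flat, nested = records
```

A relational table would force both records into one fixed set of columns; here each record carries only the fields it needs.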

Many modern databases can query semi-structured formats directly, but there is typically a performance trade-off — they are generally not as fast to process, filter, or join as purely structured data in fixed columns. They are often chosen when the flexibility of the data model is more important than the absolute speed of complex joins and aggregations, such as in scenarios where the data changes frequently, the structure cannot be fully anticipated, or the variety of fields is too great for a single rigid schema.

Typical Examples of Semi-Structured Data

  • JSON — A lightweight, text-based format representing data as nested key-value pairs and arrays. Its self-describing structure allows flexible, hierarchical data modeling without predefined schemas, making it easy for programs to parse and manipulate.
  • XML — A markup language that uses nested tags to represent hierarchical relationships, with support for attributes and strict validation through schemas like XSD or DTD. Its verbose but explicit structure makes it robust for complex, well-defined data models.
  • Parquet — A binary, columnar storage format optimized for analytical queries. It stores data by columns rather than rows, enabling high compression ratios and efficient retrieval of only the fields required for a query.

Common Data Characteristics

Semi-structured data often includes key-value pairs (name/value combinations), arrays/lists (ordered collections of values), and nested objects (hierarchical groupings of related values). These structures allow for complex, varied topologies without requiring every record to share the exact same set of fields.

Schema-on-Read vs. Schema-on-Write

In structured systems, the schema (column definitions) is applied before data is stored — schema-on-write. Semi-structured systems often use schema-on-read, meaning the schema is applied only when the data is accessed. This allows for evolving formats without constant schema changes and makes it easier to integrate new data sources without lengthy redesigns.
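As a minimal sketch of schema-on-read (field names and defaults here are hypothetical): raw records are stored exactly as they arrive, and a schema — type coercions plus defaults for missing fields — is applied only at access time.

```python
# Raw records stored as-is; note the second record has an extra field
raw_records = [
    {"id": "1", "amount": "19.99"},
    {"id": "2", "amount": "5.00", "currency": "EUR"},
]

def read_with_schema(record):
    # Schema applied on read: coerce types, default any missing fields
    return {
        "id": int(record["id"]),
        "amount": float(record["amount"]),
        "currency": record.get("currency", "USD"),
    }

rows = [read_with_schema(r) for r in raw_records]
```

When a new field like `currency` appears in the source, nothing stored has to change — only the read-side schema is updated.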

Open Table Formats

Open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi bring structure and transactional integrity to collections of files in a data lake. They group related data files—often stored in formats like Parquet, Avro, or ORC—into logical tables that can be queried with SQL while preserving data integrity across the set of files. For example, a stock exchange might store historical trade data in S3 buckets or a data lake for various experiments, using an open table format to ensure the files are grouped into logical tables and modified transactionally.

Business Context for Semi-Structured Data

In general, semi-structured data is chosen when the flexibility of the data model, and the ability of multiple tools to read its open formats, outweigh the need for the fastest possible query performance. It is particularly useful for datasets with varying attributes and optional fields.

3. Graph Data — Mapping Relationships

Graph data represents entities (nodes) and the connections between them (edges). It’s ideal for datasets where relationships are as important as the data points themselves — such as social networks, supply chains, or fraud detection systems.

You’ll see graph data powering social media friend-suggestion webs, fraud detection maps linking suspicious transactions, knowledge graphs used by search engines to connect facts and concepts, global airline route maps, drug discovery molecule networks, and recommendation systems that suggest products, movies, or music based on interconnected patterns of user behavior.

The Basic Building Blocks of Graph Data

  • Nodes (Vertices) — entities such as people, products, or locations
  • Edges — relationships or connections between nodes
  • Properties — attributes attached to nodes or edges

Querying and Traversal

Graph databases use specialized query languages such as Cypher (popular in Neo4j) or Gremlin (used in Apache TinkerPop) to navigate nodes and edges. Instead of writing complex multi-table joins, you describe the pattern of relationships you want to find, and the database efficiently “traverses” the graph to match it. This makes it straightforward to answer questions like “Who are the friends-of-friends of this person who also live in the same city?” or “What is the shortest route connecting two distribution centers?”. Developers can also compute metrics like shortest path, degree centrality (which nodes are most connected), or betweenness centrality (which nodes act as key bridges in the network). Because the traversal logic is built into the database engine, these queries often run far faster and with less code than trying to model the same relationships in a relational database.
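The shortest-path traversal mentioned above can be sketched with a breadth-first search over a tiny adjacency list — the node names below are hypothetical, standing in for, say, distribution centers:

```python
from collections import deque

# A small undirected graph as an adjacency list
edges = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def shortest_path(graph, start, goal):
    """Breadth-first traversal: returns the fewest-hop path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

A graph database runs this kind of traversal natively inside the engine, with indexes on nodes and edges, which is why pattern queries like Cypher's `MATCH` stay fast even on large networks.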

4. Unstructured Data — Rich Formats and Multi-Media

Unstructured data is not without definition — every file format, from a PDF to a JPEG to an MP4, follows a precise specification for how bytes are arranged. What makes it “unstructured” is that its organization does not map directly into the fixed rows and columns of structured databases. Instead, these formats represent higher-level constructs such as documents, images, audio, and video, storing data in ways that allow them to be opened, viewed, listened to, or otherwise experienced through a document reader, image viewer, audio player, or video player.

You’ll encounter unstructured and media data in formats like PDF, Word, JPEG, PNG, MP4, MKV, MP3, and WAV. These files are often stored in file systems, object stores, or content management systems. While they can be tagged with metadata, the primary content typically requires specialized parsing, decoding, or processing to be searchable or analyzable.

Typical Examples of Unstructured Data

  • PDF — Portable Document Format combining text, vector graphics, raster images, and layout instructions in a platform-independent container.
  • JPEG/PNG — Raster image formats using compression (lossy for JPEG, lossless for PNG) to efficiently store visual data at defined resolutions and color depths.
  • Word (DOCX) — XML-based document format that encodes styled text, embedded objects, images, and metadata.
  • MP4 — Container format for video and audio streams, supporting multiple codecs and metadata tracks for playback, editing, and streaming.
  • MP3/WAV — Audio formats; MP3 uses lossy compression to reduce file size, WAV stores uncompressed PCM audio for maximum fidelity.
  • Embeddings — Numeric vectors derived from unstructured content that enable semantic search, similarity matching, and integration with structured datasets.

How to Use Unstructured Data?

Making unstructured data useful usually starts with extracting information from it. This might mean using OCR to turn scanned pages into searchable text, transcribing audio recordings into written form, or applying image recognition to identify people or objects in photos and videos. Once this information is extracted, it can be stored as metadata, making it possible to search, filter, and link the content with other business data. This enables practical use cases like finding all legal documents that mention a specific term, searching support call recordings by topic, or retrieving product images similar to a given example.

AI Embeddings and Cross-Modal Search

AI embedding models are designed to convert unstructured data — such as text, images, or audio — into fixed-length numerical arrays called vectors. These vectors capture the semantic meaning of the content rather than just its literal form, enabling systems to compare items by meaning instead of exact matches. Because each vector exists in a high-dimensional mathematical space, similarity search can be performed by finding the nearest vectors to a query vector, often using cosine similarity or Euclidean distance. This works within a single modality (e.g., finding similar text documents) or across different modalities (e.g., finding images that best match a written description), making embeddings a key technology for semantic search, recommendation engines, and multimodal AI applications.
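A minimal sketch of the similarity step, using toy 3-dimensional vectors (real embedding models emit hundreds or thousands of dimensions, and the vectors below are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" for a query and two documents
query = [0.9, 0.1, 0.0]
doc_a = [0.8, 0.2, 0.1]   # semantically close to the query
doc_b = [0.0, 0.1, 0.9]   # unrelated content
```

Ranking documents by this score against a query vector is the core of semantic search: `doc_a` scores far higher than `doc_b`, so it is returned first.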

5. Sensor & Measurement Data — Capturing the Physical World

Sensor and measurement data originates from hardware devices that detect, quantify, and record aspects of the physical world. These systems can measure distance, velocity, shape, temperature, vibration, and countless other attributes — often with high precision and at rapid sampling rates. Unlike datasets derived from manual entry or human observation, sensor data is typically generated automatically and can be continuous, event-driven, or periodic.

You’ll encounter sensor and measurement data in formats produced by radar, lidar, sonar, infrared imagers, accelerometers, and other specialized devices. These outputs may be raw waveforms, digitized point clouds, pixel grids, or structured numerical streams, depending on the sensing technology and its purpose.

Typical Examples

  • Radar — Uses radio waves to detect the distance, speed, and movement of objects; widely used in aviation, weather monitoring, and autonomous systems.
  • Lidar — Employs laser pulses to produce precise 3D maps of surfaces and objects, commonly used in surveying, robotics, and self-driving vehicles.
  • Sonar — Uses sound waves for underwater mapping and object detection; essential in maritime navigation and subsea exploration.
  • Infrared Sensors — Detect heat signatures and thermal variations, enabling night vision, energy audits, and temperature monitoring.
  • Ultrasonic Sensors — Emit high-frequency sound for proximity sensing, fluid level detection, and short-range measurement.
  • Accelerometers — Measure acceleration, tilt, and vibration, used in motion tracking, equipment diagnostics, and structural monitoring.

How to Use Sensor & Measurement Data?

Before sensor readings can be acted upon, they often need preprocessing — such as noise filtering, calibration for environmental factors, signal transformation, or synchronization with other data sources. In operational systems, sensor data can be fused from multiple modalities to produce a more complete view of a situation, such as combining radar and lidar for obstacle detection.
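The noise-filtering step can be as simple as a moving average. A minimal sketch, using an invented stream of temperature readings with one spurious spike:

```python
def moving_average(samples, window=3):
    """Simple noise filter: each output is the mean of a sliding window."""
    out = []
    for i in range(len(samples) - window + 1):
        out.append(sum(samples[i:i + window]) / window)
    return out

# Hypothetical temperature stream with a noise spike at the third sample
readings = [20.1, 20.0, 24.8, 20.2, 20.1, 20.0]
smoothed = moving_average(readings)
```

The spike is damped rather than eliminated; production pipelines typically combine such filters with calibration offsets and outlier rejection before fusing data from multiple sensors.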

For datasets with many variables or channels, techniques like Principal Component Analysis (PCA) can help distill the data into its most informative dimensions, removing redundancy and highlighting the strongest patterns. This not only reduces storage and computation costs but also improves the clarity of the signals used for analysis or decision-making.
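A compact PCA sketch using NumPy, on synthetic data where two of three sensor channels are nearly redundant (the data and channel layout are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Channels 1 and 2 carry almost the same signal; channel 3 is independent
data = np.column_stack([t, t + 0.01 * rng.normal(size=200),
                        rng.normal(size=200)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # strongest components first
components = eigvecs[:, order[:2]]       # keep the top 2 of 3 dimensions
reduced = centered @ components          # 200 x 2 projection
```

Because two channels are redundant, the top two components capture nearly all the variance, so dropping the third dimension loses almost no information.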

When integrated into real-time decision-making systems, sensor data powers critical applications including collision avoidance in vehicles, predictive maintenance in industrial machinery, environmental hazard alerts, and precision control in robotics.

6. Geographic Data — Representing the World in Coordinates

Geographic data captures the location, shape, and attributes of features on Earth’s surface. It can describe anything from the outline of a city block to the path of a shipping lane, and from the footprint of a building to the flow of a river. This type of data underpins mapping, navigation, spatial analysis, and location-aware decision-making across industries.

Formats and structures vary. Vector data represents locations as points, lines, and polygons with precise coordinates, ideal for boundaries, roads, and infrastructure. Raster data stores continuous surfaces like satellite imagery or aerial photos as pixel grids, each cell containing a value such as elevation or temperature. Coordinate systems define how these positions are referenced globally or locally, ensuring datasets align correctly. Attributes add descriptive details — such as a road’s name, a land parcel’s owner, or a river’s flow rate — to each geographic feature.
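As a small example of working with vector point data, the great-circle distance between two coordinates can be computed with the haversine formula (the city coordinates below are approximate):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Two point features: New York and London (approximate coordinates)
nyc = (40.7128, -74.0060)
london = (51.5074, -0.1278)
distance = haversine_km(*nyc, *london)
```

This assumes a spherical Earth and both points in the same coordinate system (WGS84 lat/lon); spatial databases handle projections and more accurate ellipsoidal models for you.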

Geographic data becomes actionable when large-scale database and data platform environments perform spatial analysis — calculating distances and travel times, detecting land use changes, identifying optimal routes, and modeling environmental risks such as flooding. By combining coordinates, imagery, and attributes, these platforms power high-value outcomes like precision agriculture, real-time fleet routing, wildfire spread prediction, infrastructure resilience planning, and targeted retail site selection.

7. Scientific & Industrial Data — From Biology to Astronomy and More

Scientific and industrial data encompasses specialized datasets used in research, engineering, and manufacturing. Built for precision and often governed by strict standards, these formats require deep domain expertise to interpret and apply effectively.

Typical Examples

  • Genomic Sequences — FASTA, FASTQ, BAM
  • Chemical Structure Files — SDF, MOL
  • Simulation Outputs — CFD, FEA results
  • Microscopy Images — electron and optical formats
  • Astronomical Datasets — FITS telescope data
  • Network Packet Captures (PCAP) — electronic and network testing

From Complex Readings to Breakthrough Outcomes

Scientific and industrial datasets often demand specialized workflows — from genome assembly and protein modeling to chemical compound analysis, high-fidelity simulations, and telescope calibration. They are frequently processed on high-performance computing clusters, integrated with precision instruments, and stored alongside meticulous metadata to ensure reproducibility. At scale, these datasets drive outcomes such as accelerating drug discovery, designing more efficient aircraft, predicting material fatigue in infrastructure, mapping deep space phenomena, and diagnosing microscopic defects in advanced manufacturing.

Conclusion — Navigating the Full Data Landscape

You may need to work with multiple data formats simultaneously — a potential advantage rather than a limitation. Leveraging diverse formats can yield richer insights when you clearly understand the data you hold, the available options for transformation or reformatting, and your intended uses. With this awareness, you can make informed data architecture decisions that balance trade-offs, maintain efficiency, and stay flexible for new requirements as they emerge.

