The challenge of scanning source repositories

I wrote an article a few weeks ago about transitive dependencies in open source projects (it's dependencies all the way down). One thing that came up during that discussion was how to scan source repositories for dependencies. Rather than try to explain that in a comment, I decided it would be better to write a whole article about it, because it's a complicated topic.

Let's start with why a source repository is so different than a container image.

Container images, by definition, contain everything you need to run something. That's the point. Everything gets shoved inside the container. All the operating system things, all the dependencies, random binaries, configuration files, everything. This also makes it easy to scan. Everything you find ... is a finding (that doesn't really make sense, but will later).

You can think of a source code repository as sort of the opposite of a container image. Instead of copies of everything you need, a source repository is just the source code for your project or application. A source code repository generally won't contain copies of your dependencies (I say generally, there are times it does, this is partially why this is so complicated). Ideally your source code repository will list the dependencies you want to use. There aren't any operating system packages or runtime binaries.

Since there aren't any dependencies or runtime installed, how can we scan these and get back usable details? An example should help make this easier to understand.

Here's an example using Python. I have a test Python project that needs requests. Let's just install that using pip.

(.venv) ➜  python-test pip list
Package Version
------- -------
pip     24.3.1


(.venv) ➜  python-test pip install requests
Collecting requests
<lots of output>


(.venv) ➜  python-test pip list
Package            Version
------------------ ---------
certifi            2025.6.15
charset-normalizer 3.4.2
idna               3.10
pip                24.3.1
requests           2.32.4
urllib3            2.5.0        

As we can see, we start with an empty environment, we install requests, then we get some other transitive dependencies installed also. It's pretty rare to install one package and get one thing installed. It's very common to install A LOT of transitive dependencies.
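One way to see those transitive dependencies without reading pip's output is to walk the metadata of what is already installed. The following is a rough sketch, not how pip itself works: it uses the standard library's importlib.metadata, and the helper names (transitive_deps, installed_requires) are made up for this illustration. It skips optional extras and ignores environment markers, so treat the result as an approximation.

```python
import re
from importlib import metadata


def _name(requirement):
    """Extract and normalize the project name from a requirement string."""
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", requirement.strip())
    if match is None:
        return None
    # Normalize so that charset_normalizer and charset-normalizer compare equal.
    return re.sub(r"[-_.]+", "-", match.group(0)).lower()


def installed_requires(name):
    """Return the declared requirements of an installed package, or []."""
    try:
        return metadata.requires(name) or []
    except metadata.PackageNotFoundError:
        return []


def transitive_deps(root, requires=installed_requires):
    """Collect the names of everything reachable from root."""
    seen = set()
    stack = [root]
    while stack:
        package = stack.pop()
        for requirement in requires(package):
            if "extra ==" in requirement:  # optional extras are not installed by default
                continue
            dep = _name(requirement)
            if dep and dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


if __name__ == "__main__":
    # Only meaningful in an environment where requests is installed.
    print(sorted(transitive_deps("requests")))
```

Note that this only tells you the names. It says nothing about which versions a fresh install would pick.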

But there's an interesting angle to how package manager dependencies are resolved. It's not as easy as trying to install all the packages then recording what happens. The requests package specifies the dependencies it needs in its setup.cfg file. Here are the important bits

requires-dist =
    certifi>=2017.4.17
    charset_normalizer>=2,<4
    idna>=2.5,<4
    urllib3>=1.21.1,<3        

While our pip list shows specific versions of each package, what the requests package specifies is a range. This is why it's very hard to figure out what the transitive dependencies of a source repository will be. If all we have is a range to go on, what ends up installed could be any version that matches those ranges.
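To make the range problem concrete, here is a minimal sketch of checking versions against a specifier like the idna one above. It is deliberately simplified: it only handles dotted numeric versions and the basic comparison operators, not the full Python version specifier rules. The function names and the candidate version list are assumptions for illustration.

```python
import operator
import re

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       "!=": operator.ne, ">": operator.gt, "<": operator.lt}


def parse_version(text):
    """Turn "3.10" into (3, 10, 0). Dotted numeric versions only."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def satisfies(version, spec):
    """Return True if version matches every clause in a spec like ">=2.5,<4"."""
    candidate = parse_version(version)
    for clause in spec.split(","):
        match = re.fullmatch(
            r"\s*(>=|<=|==|!=|>|<)\s*([0-9]+(?:\.[0-9]+)*)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported specifier clause: {clause!r}")
        if not OPS[match.group(1)](candidate, parse_version(match.group(2))):
            return False
    return True


# Hypothetical list of candidate versions, just for the demonstration.
candidates = ["2.4", "2.5", "3.0", "3.10", "4.0"]
print([v for v in candidates if satisfies(v, ">=2.5,<4")])
# prints ['2.5', '3.0', '3.10']
```

Every version in that output is a valid answer to "what version of idna does requests need", which is exactly the problem for a scanner.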

Let's use the idna package as our example dependency. We ended up with version 3.10 installed, which is the latest version at the moment. But what if I already have a version installed that's greater than 2.5 and less than 4? Let's try it out

(.venv) ➜  python-test pip list
Package Version
------- -------
idna    3.0
pip     24.3.1        

We have version 3.0 installed. Now if we install requests, we get this

(.venv) ➜  python-test pip list
Package            Version
------------------ ---------
certifi            2025.6.15
charset-normalizer 3.4.2
idna               3.0
pip                24.3.1
requests           2.32.4
urllib3            2.5.0        

The version of idna is still 3.0, because it already satisfies the range requests asks for. So if I'm scanning a source repository that depends on requests, what should I show as the version of idna that will be installed? I might be able to tell you what packages will get installed, but I probably can't tell you which versions will get installed.
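The behavior in the two pip runs above can be imitated with a toy picker: keep whatever is installed if it fits the range, otherwise take the newest matching version. This is a sketch of the observed outcome, not pip's actual resolver, and the version handling is simplified to dotted numeric versions. The function names and the list of available versions are assumptions for illustration.

```python
import operator
import re

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       "!=": operator.ne, ">": operator.gt, "<": operator.lt}


def parse_version(text):
    """Turn "3.10" into (3, 10, 0). Dotted numeric versions only."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def satisfies(version, spec):
    """Return True if version matches every clause in a spec like ">=2.5,<4"."""
    for clause in spec.split(","):
        match = re.fullmatch(
            r"\s*(>=|<=|==|!=|>|<)\s*([0-9]+(?:\.[0-9]+)*)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported specifier clause: {clause!r}")
        bound = parse_version(match.group(2))
        if not OPS[match.group(1)](parse_version(version), bound):
            return False
    return True


def pick_version(spec, available, installed=None):
    """Keep the installed version if it fits, else pick the newest match."""
    if installed is not None and satisfies(installed, spec):
        return installed
    matching = [v for v in available if satisfies(v, spec)]
    if not matching:
        raise LookupError(f"no available version satisfies {spec!r}")
    return max(matching, key=parse_version)


# Hypothetical list of published versions, just for the demonstration.
available = ["2.5", "3.0", "3.10", "4.0"]
print(pick_version(">=2.5,<4", available))                   # prints 3.10
print(pick_version(">=2.5,<4", available, installed="3.0"))  # prints 3.0
```

Same requirement, same package index, two different answers. The only thing that changed is the state of the environment, which a source repository scan can't see.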

Right about now there's someone who has started to write a comment about pinning your dependencies and using lock files. It's a fair point, but from what I've seen, few projects pin their dependencies on main. They might pin on a release branch, so figuring out the dependencies for a release branch that has pinned dependencies would be simple. For a main branch that doesn't have pins, it's tough. Also, if you pin main, I'll put money on your dependencies getting stuck in limbo forever, but that's a topic for another day.
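A scanner can at least tell you which situation it's in. Here is a small sketch that splits the lines of a requirements-style file into exact pins and everything else. It assumes a plain requirements.txt layout, treats only == (or ===) without wildcards as a pin, and ignores option lines such as -r or -e. The function name and regular expression are made up for this illustration.

```python
import re

# name, optional [extras], then == or === and a version with no wildcard.
PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*===?\s*[^\s;,*]+\s*(;.*)?$")


def classify_requirements(text):
    """Split requirement lines into (pinned, unpinned) lists."""
    pinned, unpinned = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplified)
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        (pinned if PINNED.match(line) else unpinned).append(line)
    return pinned, unpinned


example = """\
requests==2.32.4
idna>=2.5,<4
# a comment
urllib3
certifi==2025.6.15  # pinned
"""
pinned, unpinned = classify_requirements(example)
print(pinned)    # prints ['requests==2.32.4', 'certifi==2025.6.15']
print(unpinned)  # prints ['idna>=2.5,<4', 'urllib3']
```

Anything that lands in the unpinned list is a version the scanner would have to guess at.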

The takeaway here shouldn't be that scanning source repositories is a bad idea. It's just one part of a much larger picture. There is no single thing that can solve our problems when building software. Scanning your source code is just as important as scanning your built artifacts and container images.

Paula Prinz

Prinz Legal Consulting - Prozess- und Rechtsberatung Open Source Software | Vom Code zur Compliance, vom Problem zur Lösung

1mo

From my work in legal consulting on license compliance, I’ve essentially seen two common approaches when it comes to analyzing dependencies and their licensing obligations: “We’ve documented the direct dependency — that should be enough.” → This is the superficial check. Spoiler: From a legal perspective, that’s unfortunately not sufficient, because all indirect (transitive) dependencies are also relevant for proper legal assessment. “Let’s take a closer look and break down the transitive dependencies as well.” → This is where things get complex — and very big, very quickly. The outcome of a deep dive can be quite sobering, especially if there was a prior assumption that all obligations had already been met. But from a compliance standpoint, this is the safer path to truly understand which components and licenses are present in a product.

Inga S.

Cybersecurity Leader | 15+ Years Driving Compliance, Strategy & Board Trust | From Findings to Fixes, I Lead Security That Performs

1mo

Yes, this happens a lot. It looks easy, but different versions show up in different places. Python can be tricky like that. Excited to see your example! Josh Bressers

Prabhu S.

AppSec Tools Builder | Founder, AppThreat

1mo

Finally! tbh, Python is a solved problem in many BOM tools including cdxgen. You can run cdxgen with special Python types such as `-t python310`, `-t python311`, etc., and make it generate an SBOM (with precise dependency tree!) for the given Python version regardless of the installed version. With "--profile research" argument, cdxgen can even plot reachable call-stacks with atom for most projects including python. Of course, most AppSec users simply will not know what project needs what version of build tools, SDKs and OS libraries. This is another problem that is being solved in tools such as cdxgen, where the formulation and build tooling needed for an accurate build sbom generation will get auto-detected.

Philippe Ombredanne

On a mission to make open source easy and safe to use with open source code, open data, and open standards like PURL. Lead maintainer of ScanCode, AboutCode and Package URL. CycloneDX, SPDX and ClearlyDefined. nexB CTO.

1mo

source, schmource! The source of the truth lives in the #binaries! ... And for when you do not have them, there is a weird small utility that can emulate the pip dependency resolution for a whole Python package tree, where you can pass args for the OS, arch and #Python version you want the resolution for: python-inspector. It can also help with what-if scenarios, like finding a resolution for a tree of non-vulnerable versions. 🤓 Tushar Goel ^

Oleg Barenboim

Co-Founder & CTO at Dfinitiv | Serial Entrepreneur | Innovator | Ex-Red Hat | Ex-ManageIQ | Ex-Hewlett Packard | Ex-Novadigm | Ex-SpectrumConcepts

1mo

Thanks for sharing, Josh! For the reasons you state, I prefer lock files for dependencies on the main branch. It is a bit onerous to keep bumping dependencies every so often, but, then again, that forces the repo to hop from one consistent state to another.
