Recent computer science graduate Charlotte Gayton shared her journey of implementing the OpenChain standard during her Year in Industry (ISO/IEC 5230) and her dissertation project (ISO/IEC 18974). She discussed the challenges she faced and the solutions she developed to achieve compliance. The session provided a unique perspective on navigating OpenChain from the viewpoint of someone early in their career. Her work led to the detailed case study recently published regarding OpenChain ISO/IEC 5230 adoption by endjin.
Watch the Recording:
View the Slides:
More About Our Webinars:
This event is part of the overarching OpenChain Project Webinar Series. Our series highlights knowledge from throughout the global OpenChain eco-system. Participants are discussing approaches, processes and activities from their experience, providing a free service to increase shared knowledge in the supply chain. Our goal, as always, is to increase trust and therefore efficiency. No registration or costs involved. This is user companies producing great informative content for their peers.
The Education Work Group held a hybrid in-person and virtual event on the 7th of August. There has been a lot of work around improving and expanding reference material to support companies adopting OpenChain standards or improving their open source business process management in general. One of the interesting topics recently has been momentum around providing reference capability maturity modeling material, with official partners such as Orcro and now Deloitte providing commitments towards a CC-0 licensed beta document for the Open Compliance Summit in Japan at the end of October.
Watch the Recording:
Be part of this:
You can get involved with the OpenChain Education Work Group through their dedicated mailing list. At this link, you will also find connections to other working groups around the world:
Samsung SDS has adopted OpenChain ISO/IEC 18974, the international standard for open source security assurance. This builds on their previous adoption of OpenChain ISO/IEC 5230 in July 2022.
The adoption of ISO/IEC 18974 by Samsung SDS was supported by resources provided by the Linux Foundation’s OpenChain Project, founded in 2016, which maintains self-certification checklists and other materials to help global companies develop enhanced open source management processes.
“We are delighted to continue our relationship with the Samsung SDS team around the adoption and use of OpenChain standards for open source process management,” says Shane Coughlan, OpenChain General Manager. “The Samsung SDS team have long been involved with the OpenChain Korea Work Group, and provide an excellent example of how a company can have a measured, effective approach to community engagement, best practice adoption, and excellence in customer support.”
About Samsung SDS
Samsung SDS is the digital arm of the Samsung group and a global provider of cloud and digital transformation innovations. Samsung SDS delivers enterprise-grade solutions and services in cloud, secure mobility, analytics / AI, digital marketing and digital workspace. They enable customers in government, financial services, healthcare, and other industries to drive business in a hyper-connected economy, helping them to increase productivity, safeguard assets, and make smarter decisions.
About the OpenChain Project
The OpenChain Project has an extensive global community of over 1,000 companies collaborating to make the supply chain quicker, more effective and more efficient. It maintains OpenChain ISO/IEC 5230, the international standard for open source license compliance programs, and OpenChain ISO/IEC 18974, the industry standard for open source security assurance programs.
About The Linux Foundation
The Linux Foundation is the world’s leading home for collaboration on open source software, hardware, standards, and data. Linux Foundation projects are critical to the world’s infrastructure, including Linux, Kubernetes, Node.js, ONAP, PyTorch, RISC-V, SPDX, OpenChain, and more. The Linux Foundation focuses on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org.
As my industrial placement year comes to an end, so does my time with the OpenChain project. Over the past year I have worked through the full lifecycle of a project, from ideation and envisioning, to planning, implementation, testing, deployment, roll-out and maintenance, to ensure that we are open-source license compliant and therefore meet the specifications of the OpenChain project. This started with discovering what open-source licenses are, moved on to collecting and processing data, and finished with displaying it in a way that is useful to the company. In this blog I am going to take you through how we created and adapted our processes for checking the components across our whole codebase.
You can also read the Case Study as a Slide Deck:
What is OpenChain
The OpenChain Project was set up by the Linux Foundation to create an ISO standard for creating, using and modifying open-source software. Creating an ISO standard like this is important as more software is being created and used extensively without having a clear baseline of quality or structure. During the course of the year, the aim of the OpenChain project has expanded. Initially it was about license compliance, then it grew to include security threats, and now it's looking at open source contribution process management.
Organisations that have OpenChain have greater control and knowledge about how their open-source software is being created or used, meaning they have a reduced risk of misuse and therefore less legal risk.
If you would like some extra information on what OpenChain is, the risks behind open-source software, and a more in-depth explanation of the specification, I have written a series of blogs:
The first few weeks of the project I worked on getting familiar with the different terminology involved around licensing. This included the different types of open-source licensing:
Permissive – a free-software license that involves minimal restrictions on using the software, e.g. MIT or Apache 2.0
Weak Copyleft – allows the code under the license to be combined with code under more permissive licenses without imposing the full copyleft requirements on the entire combined work, e.g. LGPL
Strong Copyleft – any derivative of the software must also be distributed under the same strong copyleft terms, e.g. AGPL 3.0
I learnt how compliance means you are satisfying the rules of the licenses you are using. I also started to understand the risks involved:
Legal issues
Ethical issues impacting a company's reputation
Data breaches due to software vulnerabilities
If you want to know more about licensing and compliance I recommend taking these two courses:
A software bill of materials (SBOM) is a list of all software components that make up a piece of software. Having an SBOM means that you can track all the components required to run your code and ensure they satisfy the license rules. I wrote about SBOMs in greater depth in part 3 of my OpenChain blog series.
We wanted to generate SBOMs for all our repositories across all our codebases so we used a tool called Covenant, created by Patrik Svensson, which generates an SBOM from either a directory or a specific file from .NET 5, .NET 6, .NET Core or NPM projects. To get the SBOM you simply have to run one command:
dotnet covenant generate
We have CodeOps processes to manage our build scripts across our GitHub Organisations and repositories meaning we can easily modify the tasks that run within the build. This meant we could easily add in Covenant to generate our SBOMs.
There are a growing number of SBOM formats; Covenant generates the two most popular, CycloneDX and SPDX. The current versions don’t add extensibility, but Patrik (creator of Covenant) added a custom format which allowed us to capture more custom metadata, for example the branch that was being reviewed and the GitHub organisation it was being taken from.
The SBOMs are stored as JSON files which contain both a Components and a Dependencies section. The Components section contains each component that has been identified along with information about its license. The Dependencies section brings together all the components and lists, for each component, which dependencies it has. When mapped out, this creates a big tree-like structure of dependencies.
Below is a cut-down version of what an SBOM could look like. You can see there is basic metadata at the top, then the components and dependencies sections:
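As a rough illustration of that shape, here is a minimal sketch in Python. The field names, component names and values are hypothetical and are not Covenant's exact output schema; the point is only to show metadata, a Components section, a Dependencies section, and how the dependencies map out into a tree.

```python
import json

# Hypothetical, simplified SBOM. Field names and values are illustrative
# assumptions, not Covenant's exact output schema.
sbom_json = """
{
  "Name": "Example.Repo",
  "Metadata": {"branch": "main", "organisation": "example-org"},
  "Components": [
    {"Id": "app", "Name": "Example.App", "License": "Apache-2.0"},
    {"Id": "lib-a", "Name": "Lib.A", "License": "MIT"},
    {"Id": "lib-b", "Name": "Lib.B", "License": null}
  ],
  "Dependencies": [
    {"Id": "app", "Dependencies": ["lib-a"]},
    {"Id": "lib-a", "Dependencies": ["lib-b"]}
  ]
}
"""


def dependency_tree(sbom, root, depth=0, seen=None):
    """Return indented lines describing the dependency tree under `root`."""
    seen = set() if seen is None else seen
    names = {c["Id"]: c["Name"] for c in sbom["Components"]}
    children = {d["Id"]: d["Dependencies"] for d in sbom["Dependencies"]}
    lines = ["  " * depth + names[root]]
    if root in seen:  # already expanded once; also guards against cycles
        return lines
    seen.add(root)
    for child in children.get(root, []):
        lines.extend(dependency_tree(sbom, child, depth + 1, seen))
    return lines


sbom = json.loads(sbom_json)
tree_lines = dependency_tree(sbom, "app")
print("\n".join(tree_lines))
```

Each level of indentation in the printed output is one step further down the dependency chain.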
Now that we were able to generate our SBOMs, we stored them in our CodeOps Azure Data Lake Storage Gen2 instance in Azure. We are keeping all historic versions so that if we ever need to look back we have the original SBOMs. As JSON is structured data but doesn’t have a schema, we needed to be able to process and clean it so we could extract the parts that were most important to us.
This is when we created the SBOM Wrangler, a PySpark notebook in Azure Synapse Analytics that would load in the data, transform it and then save it back to the datalake. The main benefit of using PySpark is that the work can be horizontally partitioned, meaning we can process large amounts of data really quickly. To do this we had to design our code in a certain way (e.g. not using for loops) so that it could easily be partitioned.
As we had nested data in our JSON file, it made it more difficult to convert the data to be more structured. This was because we had to follow the schema it had, meaning if we imported a different format of SBOM, our wrangler wouldn’t work. In order to make use of the partitioning we used a function called explode which explodes our nested array into individual rows:
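In PySpark this is `pyspark.sql.functions.explode`. The plain-Python sketch below is not the distributed implementation; it only illustrates the change in data shape, using hypothetical repository and component names: one row per SBOM with a nested array becomes one row per component.

```python
# Hypothetical input: one row per SBOM, with a nested array of components.
sbom_rows = [
    {
        "repo": "repo-a",
        "Components": [
            {"Name": "Lib.A", "License": "MIT"},
            {"Name": "Lib.B", "License": "Apache-2.0"},
        ],
    },
    {
        "repo": "repo-b",
        "Components": [{"Name": "Lib.C", "License": "LGPL-3.0"}],
    },
]


def explode_components(rows):
    """Return one flat row per (repo, component) pair."""
    return [
        {"repo": row["repo"], "Name": comp["Name"], "License": comp["License"]}
        for row in rows
        for comp in row["Components"]
    ]


exploded = explode_components(sbom_rows)
for row in exploded:
    print(row)
```

Once the data is flat like this, each row can be processed independently, which is what makes the work easy to partition.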
Now that we had collected the data we needed about each repository, we were going to apply some business logic to it which would give us a score for each SBOM. Our business logic included which licenses we were going to allow to be used, which ones we weren’t, and for the case in which they didn’t have a license, would we allow the given copyright notice.
Policy Hierarchy
From here we created a policy hierarchy. Policies may need to be tailored for individual repositories, but this would be onerous to manage on a per-repository basis. So we defined policies at three levels: repository, organisation and company. If a policy exists at a lower level, it takes precedence over the higher level. For example, if there is a repo-specific policy, that will override the org-specific policy.
company
org
repo
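The precedence rule can be sketched as a simple lookup. This is a minimal illustration, assuming the policies have already been loaded into a dictionary keyed by level; the organisation names, repository names and license lists are hypothetical.

```python
# Hypothetical policy store keyed by level; names and licenses are illustrative.
POLICIES = {
    "company": {"accepted": ["MIT", "Apache-2.0"], "rejected": ["AGPL-3.0"]},
    "organisation": {
        "example-org": {"accepted": ["MIT"], "rejected": ["GPL-3.0"]},
    },
    "repository": {
        "example-repo": {"accepted": ["MIT", "AGPL-3.0"], "rejected": []},
    },
}


def resolve_policy(policies, organisation, repository):
    """Return (level, policy) for the most specific policy that exists."""
    repo_policy = policies.get("repository", {}).get(repository)
    if repo_policy is not None:
        return "repository", repo_policy
    org_policy = policies.get("organisation", {}).get(organisation)
    if org_policy is not None:
        return "organisation", org_policy
    # The company policy is the default when nothing more specific is defined.
    return "company", policies["company"]


level_a, policy_a = resolve_policy(POLICIES, "example-org", "example-repo")
level_b, policy_b = resolve_policy(POLICIES, "example-org", "other-repo")
level_c, policy_c = resolve_policy(POLICIES, "unknown-org", "other-repo")
print(level_a, level_b, level_c)
```

Only the exceptions need their own entry; every other repository falls through to its organisation or to the company default.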
We stored this information in our OpenChain repository under a set of different files. This allowed us to access it from any of our other repositories and create a workflow to change it into a JSON file. Here is the structure of the file system:
.
├── company
│   └── company-level.yaml
├── organisation
│   ├── org-level-endjin.yaml
│   └── org-level-corvus-dotnet.yaml
└── repository
    └── repo-level-corvus-extensions.yaml
Each yaml file followed the same structure. Here is an example file:
Each different category contributes to creating a score:
accepted: lists the licenses for this policy that are alright to use
rejected: lists the licenses for this policy that would break compliance and are not allowed
copyright: lists the key words identified in a copyright notice that would allow the component to be used
overrideAccept: lists the components that are always allowed, no matter their license
overrideReject: lists the components that are never allowed, no matter their license
We decided to add functionality to override the license policy for certain components, as we found we were catching out some licenses that were our own. The Marain.Net open-source repositories are licensed with the AGPL 3.0 license, which is a strong copyleft license. It enforces open source on all derivative work that uses the component. Generally, we don’t want to use copyleft licenses as they may require us to open source some code that we wish to keep proprietary. However, as it’s our own code, we are fine using it, meaning we want to override our current rules and accept those specific Marain repositories that have the AGPL 3.0 license.
To use this data later on in our SBOM Analyser we needed it to be in JSON form, so we created a workflow to run a script that would pull together all the yaml files and save them as a single JSON file to the datalake.
SBOM Analyser
We wanted to be able to produce scores for our SBOMs. The scores represent how many accepted, rejected and unknown components there are for each repository (SBOM). To determine whether a component is accepted or not, it is checked against the policy for that specific repository. If a policy exists for the repository, that is used; otherwise, if there is a GitHub organisation policy, that is used; otherwise the company-level policy acts as the default when no other has been defined. If a component's license exists in the accepted list, it’s accepted, and the same goes for rejected. If it’s not on either list, the copyright notice is checked for a match against the copyright section. If none of those match, it gets assigned unknown.
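The decision for a single component can be sketched in plain Python. This is a minimal illustration rather than the packaged tool: the policy contents and component names are hypothetical, the override lists described earlier are checked first, and the copyright check assumes the notice is split into words and matched against the policy's keywords.

```python
from collections import Counter

# Hypothetical policy and components, for illustration only.
policy = {
    "accepted": ["MIT", "Apache-2.0"],
    "rejected": ["GPL-3.0"],
    "copyright": ["Microsoft"],
    "overrideAccept": ["Example.Internal.Lib"],
    "overrideReject": ["Example.Banned.Lib"],
}

components = [
    {"Name": "Lib.A", "License": "MIT", "Copyright": None},
    {"Name": "Lib.B", "License": "GPL-3.0", "Copyright": None},
    {"Name": "Lib.C", "License": None, "Copyright": "Copyright (c) Microsoft Corporation"},
    {"Name": "Lib.D", "License": None, "Copyright": None},
    {"Name": "Example.Internal.Lib", "License": "AGPL-3.0", "Copyright": None},
]


def classify(component, policy):
    """Return 'Accepted', 'Rejected' or 'Unknown' for one component."""
    name = component["Name"]
    license_id = component.get("License")
    notice_words = (component.get("Copyright") or "").split()
    if name in policy.get("overrideAccept", []):
        return "Accepted"
    if name in policy.get("overrideReject", []):
        return "Rejected"
    if license_id in policy.get("accepted", []):
        return "Accepted"
    if license_id in policy.get("rejected", []):
        return "Rejected"
    if any(word in policy.get("copyright", []) for word in notice_words):
        return "Accepted"
    return "Unknown"


scores = Counter(classify(c, policy) for c in components)
print(dict(scores))
```

Counting the results per repository gives the accepted, rejected and unknown totals that make up the score.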
We want to be able to produce scores for our SBOMs in two different places:
Synapse: generating scores for all the data we have produced at once
GitHub: generating a score for an SBOM in a PR when the build runs, to see if there are any breaking changes
Pandas is now supported by PySpark, so we were able to design a package that would run in both basic Python code and in PySpark (still making use of the horizontal partitioning). To do this, again similar to the SBOM Wrangler, we had to ensure we didn’t use for loops and instead used other functions that could be partitioned out, whilst still working in normal Python.
We packaged the code up using poetry which allowed us to list the dependencies we needed and bundle it up into a .whl file which could be used in both our build workflows and in Synapse.
The tool consists of three separate parts:
SBOM Wrangler:
When using the tool in the build to instantly generate a score, we need to have the SBOM in the right format, so this SBOM Wrangler is the pure Python version of the SBOM Wrangler we have in Synapse. This part of the tool will only get used in GitHub, as we already have our SBOMs wrangled in Synapse.
Ruleset Formatter
The ruleset formatter takes the JSON ruleset that is generated by the policy hierarchy manager and restructures the data into dataframes so it can be used by the SBOM Scorer.
SBOM Scorer
The SBOM Scorer is the main ‘brain’ of the tool. It uses the information generated by the other parts of the tool to score either singular or multiple SBOMs depending on how many accepted or rejected licenses they have.
First, the SBOM data is read in as a Pandas DataFrame. We merge the different policy hierarchy level’s data using left merging, first with the repository level. If the name of the repository matches any of the policy names, it gets merged in. Then the same with organisation level, filling the gaps. And finally if there are any empty spaces, the company level table is merged in.
We then create an extra column for each of the categories in the policies ‘accepted’, ‘rejected’, ‘copyright’, ‘overrideAccept’ and ‘overrideReject’ which checks for each component whether it can find a match in each category or not.
# Each check produces a boolean column from the policy lists merged in above.
raw_data['licenseAccepted'] = raw_data.apply(lambda x: x['License'] in x['accepted'], axis=1)
raw_data['licenseRejected'] = raw_data.apply(lambda x: x['License'] in x['rejected'], axis=1)
raw_data['copyrightAccepted'] = raw_data.apply(lambda x: any(item in x['copyright'] for item in x['CopyrightSplit']) if isinstance(x['CopyrightSplit'], (list, tuple)) else False, axis=1)
# Assumption: the merged policy columns keep the YAML key names overrideAccept / overrideReject.
# The results go into separate columns so the policy lists are not overwritten.
raw_data['override-accepted'] = raw_data.apply(lambda x: x['Name'] in x['overrideAccept'] if x['overrideAccept'] is not None else False, axis=1)
raw_data['override-rejected'] = raw_data.apply(lambda x: x['Name'] in x['overrideReject'] if x['overrideReject'] is not None else False, axis=1)
These columns then get sorted logically into one column called ‘scoring’ which, for each component, lists whether it is accepted, rejected or unknown:
raw_data['scoring'] = raw_data.apply(
    lambda x: 'Accepted' if x['override-accepted'] else
              'Rejected' if x['override-rejected'] else
              'Accepted' if x['licenseAccepted'] else
              'Rejected' if x['licenseRejected'] else
              'Accepted' if x['copyrightAccepted'] else
              'Unknown',
    axis=1
)
The ‘override’ checks go first, having the highest weighting.
The scores are then summarised into a CSV table similar to below:
| Repo Name | Accepted | Rejected | Unknown |
| --- | --- | --- | --- |
| Example 1 | 153 | 2 | 32 |
| Example 2 | 234 | 7 | 12 |
Outputted alongside the scores are the unknown and rejected tables:
| Component | License | Copyright |
| --- | --- | --- |
| Example Component 1 | AGPL | |
| Example Component 2 | GNU General Public License | |
If there are any rejected components in our GitHub build, the build fails. If there are unknowns, warnings are raised.
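A minimal sketch of that gate is shown here. It assumes the scores arrive as a dictionary of counts and that the GitHub Actions `::warning::` and `::error::` workflow commands are used to surface messages; the function name and message wording are hypothetical.

```python
def build_outcome(scores):
    """Map SBOM scores to an exit code and GitHub Actions log annotations."""
    messages = []
    unknown = scores.get("Unknown", 0)
    rejected = scores.get("Rejected", 0)
    if unknown:
        messages.append(f"::warning::{unknown} component(s) have an unknown license status")
    if rejected:
        messages.append(f"::error::{rejected} component(s) are rejected by the license policy")
    # A non-zero exit code is what makes the build step fail.
    exit_code = 1 if rejected else 0
    return exit_code, messages


exit_code, messages = build_outcome({"Accepted": 153, "Rejected": 2, "Unknown": 32})
for message in messages:
    print(message)
# In a real build step, sys.exit(exit_code) would then fail the job when needed.
```

Unknown components only produce a warning, so the build keeps passing while someone reviews them.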
Using the Scores
Now that we had our scores, we had to find a way to either display them or create an alert saying that something wasn’t right.
Backstage
Backstage is an open source framework for building developer portals which was created by Spotify. Endjin uses it as it provides a unified view over all of the separate GitHub organisations and repositories, so we decided it would be a good place to output our scores.
Backstage uses a plugin-based architecture. It allows you to create custom plugins to add new features and integrations. These plugins can integrate with current internal tools you use, or with third-party services. We brought in data from our Azure Data Lake Storage Gen2 account, and passed the data into tables to be displayed on both of the pages.
Backstage includes UI customisation that you can align with your organisation’s brand preferences, whilst still making it easy to keep consistency between different people working on different pages.
Overview Page
The first page we have in Backstage is the ‘SBOM Analysis’ summary page. I created this page using components that Backstage provides as templates; these are like building blocks. Backstage offers a Storybook page that demonstrates how the different components are used. The image below shows the tab to the left with the ‘SBOM Analysis’ page. This shows, from a top-level view, all the repositories that are being checked, and how many rejected, accepted or unknown components they are using.
This makes it easy to spot the rejected components if scanning across the repos, as you can sort the rejected column from highest to lowest amount of rejected components.
Repository specific page
The second type of page we have in Backstage is the repository-specific page, which gives more information about each repository. I also built this page using the components from the Backstage Storybook.
Below is an example of one page in Backstage. We have the summary section, similar to the overview page, which contains the ‘accepted’, ‘rejected’ and ‘unknown’ counts. Below are the unknown and rejected components tables which display the components, their license and copyright. These are the files generated with the scores from the ‘SBOM Analysis’ package.
Slack message
In order to communicate changes to repositories, and any errors that might be flagged up, we set up a Slack app which will send daily messages to a channel with updates on the current state of the codebase.
I built this into my Synapse notebook ‘SBOM Analyser’, which generates the scores using the SBOM Analyser Python package.
We use the Python package ‘requests’ to post the response to Slack. You can build up your message in the body variable using information you’ve generated previously in the notebook. This runs last in the SBOM Analyser notebook so can access all the information generated by the SBOM Analyser package.
import requests
from azure.keyvault.secrets import SecretClient

if rejectedSlack == 0:
    icon = ':white_check_mark:'
else:
    icon = ':x:'

url = webhookurl
body = {
    "blocks": [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"OpenChain SBOM status summary:\n{icon} There are {rejectedSlack} rejected components {icon}\n\n• *Accepted*: {acceptedSlack}\n• *Unknown*: {unknownSlack}\n\n<[URL to Backstage]| More details in Backstage>"
            }
        }
    ]
}
response = requests.post(
    url,
    json=body
)
print(response.text)
Below is a screenshot of our Slack channel #openchain-notifications which gives a daily update on the status. If there aren’t any rejected components then the message will have green ticks.
Alternatively, if there are some rejected components the message will present itself with red crosses. I made the decision to structure the message like this as it is short and concise, with obvious icons showing whether the check has passed or failed. This means if there is a problem, it’s less likely to be missed as it’s eye-catching.
IMM (IP Maturity Matrix)
Endjin has developed a framework called the IP Maturity Matrix (IMM) that measures the engineering practices ‘quality’ of a repository, for example how much of it is documented or how much code is covered by tests. As part of the OpenChain project, we wanted to link in OpenChain to the IMM as a new category, so that it would show whether a repository generates an SBOM.
IMM scores for each repository are stored in an imm.yaml file, with these scores being manually created and updated. However, as we wanted to display whether an SBOM is being created and checked for each repository, and we were generating that data automatically, we had to create a process which would update these files for us. This was especially important because it isn’t a one-time update: our data is dynamic and can change from day to day.
Having already got processes from our build script manager that can update files across all our repositories, we decided to repurpose this for the new functionality. I rewrote the logic to instead target the imm.yaml files, if that repo had them, and update the SBOM scores we were accessing from our Azure Data Lake Storage Gen2. I then set up the workflow so that it can run daily on its own as a GitHub Actions workflow. Now it updates all the yaml files to give insight into whether each repository has an SBOM or not.
These IMM scores get added to the README for each repository, so if you’re looking to use the open-source library, or need some information about it, you can get a quick overview of each one. This includes our SBOM score, meaning it can increase confidence in the code, knowing that the repository is being checked.
100% means that the repository is generating an SBOM, and it is being checked. 0% means that an SBOM isn’t being generated.
Below is an image of what a repository could look like with an SBOM IMM score:
Wrapping up
Finishing my Year in Industry, I am wrapping up my work with the OpenChain project. We covered the ETL (Extract Transform Load) process: extracted and ingested our data, visualised our data, and used it against business rules to gain insight on the whole of our codebase. We added additional output locations, such as Slack messages and displaying as IMM scores, so we get instant feedback without having to look for it. We will get notifications through Slack when compliance is breached, meaning we are significantly reducing the risk of missing a threat to our license compliance.
Now we are able to fully track and manage our open-source license usage, satisfying one of the main points in the OpenChain specification. I hope to build on this work by implementing OpenChain ISO/IEC DIS 18974, the industry standard for open source security assurance programs, for my final year project at the University of York, where I will be completing my Bachelor of Engineering in Computer Science.
This webinar featured Stefano Maffulli, Executive Director of the Open Source Initiative (OSI), on the current status of the OSI Definition for Open Source AI. It covered their efforts to build community consensus around the topic, and included insights around both progress and challenges.
Watch the Webinar:
The OpenChain India Work Group held its first meeting in a while to discuss a soft “reboot” and what activities can be useful in the local market. Led by Arun Azhakesan of Siemens Healthineers, our new chair of the India Work Group, the focus was on exploring practical outcomes for the Indian market and open source business process management.
Watch the Meeting:
Join Future Meetings:
The OpenChain India Work Group has a mailing list to coordinate discussion and arrange meetings. Everyone is welcome to join it.
This week we have the following international meetings:
Wednesday 7th August:
– OpenChain Automation Work Group Meeting (European Morning) @ 08:00 UTC
– OpenChain Education Work Group Monthly Meeting @ 16:00 UTC
Thursday 8th August:
– OpenChain Webinar: Implementing OpenChain ISO/IEC 5230 at endjin + Further Research on OpenChain ISO/IEC 18974 @ 07:00 UTC
Get dial-in details and see all our international meetings here:
Save the date! Mark your calendars for the 7th of August at 8:00 UTC (10:00 am CEST), for an insightful presentation by Agustin Benito and Jeronimo Ortiz from SCANOSS.
This session will feature a hands-on walk-through of the SCANOSS toolset, which is fully open source. Agustin and Jeronimo will showcase what’s new, how these tools improve the current landscape, and, most importantly, how you can use and integrate them with your existing tools to enhance functionality. It sounds like something you don’t want to miss, right?
During the presentation, they will dive into the various elements of the Software Composition Analysis (SCA) toolset developed and published by SCANOSS. This will be a fantastic opportunity for tooling developers, compliance specialists, and all developers to get a clear understanding of the different components that make up an SCA solution.
Additionally, they will explore the different aspects of “the data” in relation to both the toolset and the service. This discussion will be valuable for anyone looking to leverage these tools to improve their workflows and ensure compliance.
We look forward to seeing you all on the 7th of August at 08:00 UTC (10:00 am CEST). Let’s come together to learn and grow as a community!
In this forthcoming OpenChain Webinar, recent computer science graduate Charlotte Gayton shares her journey of implementing the OpenChain standard during her Year in Industry (ISO/IEC 5230) and her dissertation project (ISO/IEC 18974).
She will discuss the challenges she faced and the solutions she developed to achieve compliance. The session will provide a unique perspective on navigating OpenChain from the viewpoint of someone early in their career.