It is now almost eight years since we started work on the International Standard Content Code (ISCC). In a few weeks, it will become a global ISO standard. The ISCC is a unique identifier for digital media content. It uses specialised algorithms to create a code that can identify, track, and manage digital content like documents, images, audio, and video. Unlike traditional identifiers, the ISCC can also detect whether different pieces of content are similar to one another, making it useful for managing copyrights and finding related digital materials.
Over the years, I have found that it can be quite difficult to explain the concept of content-based identifiers in general, and ISCC in particular, to an audience in the cultural and creative industries. So instead of trying to explain ISCC all over again, in this post I want to explore why content-based identifiers and ISCC are so difficult to grasp.
Content-derived Identifiers (CDIs)
Cryptographic algorithms — Content-derived identifiers rely on complex algorithms and cryptographic techniques to generate unique identifiers from the content itself. The process is fully deterministic: identical content always yields the same identifier, while even a minor change in the content results in a completely different, entirely uncorrelated identifier. Both properties are crucial for the integrity and uniqueness of identifiers. Understanding these processes requires a basic knowledge of cryptographic hash functions and how they are applied to generate identifiers. Although this understanding is crucial, we often omit this topic from initial discussions with cultural and creative sector stakeholders due to time constraints or a desire not to go into too much technical detail.
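To make the two properties concrete, here is a minimal Python sketch. It assumes SHA-256 as the cryptographic hash function purely for illustration; the helper names `content_id` and `differing_bits` are hypothetical and are not part of any ISCC library.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Hypothetical helper: derive an identifier from the content bytes.

    SHA-256 is an assumption made for this illustration only.
    """
    return hashlib.sha256(data).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy dog."  # one extra character

id_a = content_id(original)
id_b = content_id(original)  # same content, hashed a second time
id_c = content_id(modified)

print("same content, same id:     ", id_a == id_b)
print("changed content, new id:   ", id_a != id_c)
print("bits that differ (of 256): ", differing_bits(id_a, id_c))
```

Hashing the same bytes twice gives the same identifier, while adding a single full stop changes roughly half of the 256 bits, so the two identifiers reveal nothing about how close the two inputs were.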
From Content to Cryptographic Identifier — Unlike established identifiers, such as barcodes, product or serial numbers, which are often relatively short, the encoding of CDIs can be complex and counter-intuitive. CDIs are designed primarily for machine-to-machine data exchange, emphasising efficiency and security over human readability. This abstraction complicates understanding for those without expertise in digital technology or cryptography, as the direct relationship between the input (content) and the generated output (hash) is not immediately apparent. It seems to be easier to understand an identifier as being distinct and uncorrelated from the entity it identifies (“surrogate IDs”).
Comparison with Traditional Identifiers — Creators are more familiar with traditional identification methods where strings of characters and/or numbers are used to name files, for example. Moving to a system where identifiers are not manually assigned and managed, but derived from the content itself, requires a paradigm shift in terms of understanding how digital assets are uniquely identified and digitally managed.
No embedding required — It’s a common misconception that a content-derived identifier must be embedded in, or otherwise attached to, the digital content it identifies in order to link the two securely. In fact, embedding a CDI in the content would change the very digital asset it is supposed to identify, so the modified content would no longer match its originally generated identifier. CDIs are designed to function without any changes to the content, metadata or file names, ensuring that the content’s integrity remains untouched.
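The embedding paradox can be shown in a few lines of Python. This is a sketch under stated assumptions: SHA-256 stands in for the identifier algorithm, and `content_id`, `registry` and `verify` are hypothetical names invented for this example.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Hypothetical helper: SHA-256 stands in for a content-derived identifier."""
    return hashlib.sha256(data).hexdigest()


asset = b"example bytes of a digital asset"
cdi = content_id(asset)

# Attempt to "embed" the identifier by appending it to the asset itself.
embedded = asset + b"\nID:" + cdi.encode("ascii")
cdi_after_embedding = content_id(embedded)

# Alternative: keep the identifier outside the asset, e.g. in a separate record.
# The dictionary below is a stand-in for any external registry or database.
registry = {cdi: {"title": "Example asset"}}


def verify(data: bytes) -> bool:
    """Recompute the identifier from the bytes and look it up externally."""
    return content_id(data) in registry


print("identifier survives embedding:", cdi == cdi_after_embedding)
print("untouched asset verifies:     ", verify(asset))
print("modified asset verifies:      ", verify(embedded))
```

Because the identifier can always be recomputed from the untouched file, the link between content and identifier does not depend on anything being written into the file.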
The International Standard Content Code (ISCC)
The International Standard Content Code (ISCC) adds a further layer of complexity to the topic of content-derived identifiers, because it combines cryptographic hashing with similarity-preserving hashes (SimHashes):
Technical Complexity of Multi-Composite Hashing — Both CDIs and the ISCC involve intricate algorithms and cryptographic techniques. CDIs use cryptographic hash functions to generate unique identifiers (checksums) from the content. The ISCC code is a composite identifier that uses a combination of both cryptographic hashes and SimHashes. Similarity hashing is a technique used in computer science to efficiently approximate the similarity between sets of data. Unlike traditional cryptographic hashing, it is designed to produce similar hash values for inputs that are similar to each other. This property makes SimHash particularly useful for detecting duplicate or near-duplicate content, for example web pages, documents, or images, in large datasets.
This novel combination ensures both the integrity of the content and the recognition of similar or near-duplicate content. A minor change in content can result in a completely different identifier in one component of the ISCC (the Instance-Code) and, at the same time, in similar identifiers in other components of the ISCC. Grasping this requires an understanding of how both kinds of components of the composite ISCC code work.
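The contrast between the two kinds of hashes can be sketched in Python. This is an illustration only: it uses a generic word-level SimHash and SHA-256, which are assumptions for this example and not the algorithms specified for the ISCC. All function names are hypothetical.

```python
import hashlib

BITS = 64  # length of the similarity hash in this sketch


def exact_hash(text: str) -> str:
    """Cryptographic hash: any change yields an unrelated value."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def feature_hash(token: str) -> int:
    """Map one feature (here: a word) to a 64-bit integer."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str) -> int:
    """Generic SimHash: each word votes +1 or -1 on every bit position."""
    votes = [0] * BITS
    for token in text.lower().split():
        h = feature_hash(token)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


original = (
    "content derived identifiers are computed directly from the bytes of a "
    "digital asset so that anyone holding the same file can generate the same "
    "code without asking a central registry to assign a number first"
)
edited = original.replace("number", "value")  # a one-word edit
unrelated = (
    "heavy rain is expected across northern valleys tonight while coastal "
    "towns should prepare for strong winds and possible flooding before "
    "calmer weather returns early next week"
)

print("exact hashes equal after edit:", exact_hash(original) == exact_hash(edited))
print("SimHash distance to edited:   ", hamming(simhash(original), simhash(edited)))
print("SimHash distance to unrelated:", hamming(simhash(original), simhash(unrelated)))
```

The exact hash treats the one-word edit as entirely new content, while the similarity hash of the edited text stays much closer to the original than the hash of the unrelated text does. A composite code that carries both kinds of component can therefore answer two different questions: "is this exactly the same file?" and "is this roughly the same content?".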
Vectors in the ISCC — SimHashes can be considered as vectors in a multi-dimensional space. Each SimHash is essentially a fixed-size string of bits (for example, 64 bits long) that represents a point in this space. The process of generating a SimHash involves mapping the features of the input data (like words in a document or pixels in an image) to this multi-dimensional vector space. The resulting SimHash vector is a compact representation that approximates the original data’s distribution of features. The similarity between two pieces of data can then be estimated by calculating the distance (typically, the Hamming distance) between their corresponding SimHash vectors, with closer vectors indicating more similar content.
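The distance calculation itself is simple. The following Python sketch treats 64-bit values as bit vectors and compares them by Hamming distance. The three sample values are made up for illustration and are not real ISCC components; the function names are hypothetical.

```python
BITS = 64


def to_bits(value: int) -> list[int]:
    """View a 64-bit hash as a vector of 64 coordinates, each 0 or 1."""
    return [(value >> i) & 1 for i in range(BITS)]


def hamming(a: int, b: int) -> int:
    """Count the coordinates in which the two bit vectors differ."""
    return sum(x != y for x, y in zip(to_bits(a), to_bits(b)))


def similarity(a: int, b: int) -> float:
    """Scale the distance to a score between 0.0 (opposite) and 1.0 (identical)."""
    return 1 - hamming(a, b) / BITS


# Made-up sample hashes, chosen so the distances are easy to check by hand.
hash_a = 0xFFFF0000FFFF0000
hash_b = 0xFFFF0000FFFF00FF  # differs from hash_a in the lowest 8 bits
hash_c = 0x0000FFFF0000FFFF  # every bit is the opposite of hash_a

print(hamming(hash_a, hash_b), similarity(hash_a, hash_b))
print(hamming(hash_a, hash_c), similarity(hash_a, hash_c))
```

In practice, a threshold on this distance decides whether two items count as near-duplicates. Where to set that threshold depends on the media type and the use case.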
The ISCC’s use of vectors to represent digital content is a powerful feature, but it also makes the ISCC considerably harder to explain. With this vector-based approach, digital content is not just given a unique identifier but is also represented in a way that allows its ‘digital DNA’ to be compared with that of other content in a multi-dimensional space. Using vectors goes beyond traditional identification methods and brings abstract mathematics into practical digital content management.
Diverse Applications — CDIs are well established and used in various technical contexts, from cryptography and blockchain technology to file storage and digital asset management. ISCC will continue to expand this versatility, particularly in the media industries, adding a wide range of industry-specific applications. However, the adaptability of ISCC across industries – from news and music to publishing and broadcasting – presents unique challenges. Each industry segment can use ISCC to solve different problems, resulting in a wide range of use cases. For example, while the news industry may use ISCC for content verification to combat fake news and disinformation, the music industry may use it for copyright and metadata management, and accounting purposes. Similarly, book publishers and TV broadcasters may use ISCC for completely different purposes, such as managing digital rights, controlling content distribution or discovering counterfeit products, piracy or other forms of misappropriation. This diversity of use cases complicates the task of summarising the benefits and functions of ISCC to different sectors and stakeholders. The challenge is not only to emphasise the versatility of ISCC, but also to design the implementation to meet the specific needs and challenges of different sectors and stakeholders.
***
In explaining ISCC, the challenge is essentially to bridge the gap between the complexity and abstractness of the technical concepts and the audience’s familiarity with digital technologies within their own domain. The combination of cryptographic and similarity hashing in ISCC, and the nuanced implications of CDIs and ISCC, call for a careful explanation that demystifies these technologies while highlighting their benefits for specific, hands-on use cases in the cultural and creative industries.
We want to encourage you to use ISCC today. If you have any questions, don’t hesitate to reach out!
Useful Links:
ISCC Codec & Algorithms, https://core.iscc.codes
ISO/TS 22943:2022, Principles of identification, https://www.iso.org/standard/83121.html
Credits:
DeepL and ChatGPT have been used for copy-editing.
ISCC graphic created by Titusz Pan.