Module extract:compiler
Summary
The extract:compiler module is responsible for managing and processing compilation-related data during the extraction pipeline. Its core public data structures, CompileEntry and CompilationDatabase, represent an individual translation unit’s compilation parameters and the overall set of entries loaded from a compilation database, respectively. The module owns a set of public functions that load a compilation database (load_compdb), look up entries by file path (lookup), normalize file paths and arguments (normalize_entry_file, normalize_argument_path), sanitize driver and tool arguments (sanitize_driver_arguments, sanitize_tool_arguments), build content-based compile signatures (build_compile_signature), create compiler instances (create_compiler_instance), strip compiler paths (strip_compiler_path), ensure cache keys on entries (ensure_cache_key), and query toolchain information with caching (query_toolchain_cached). Together, these provide a consistent interface for preparing, normalizing, and caching compilation data before further extraction steps.
Internally, the module implements utility functions for parsing compiler invocations, hashing source files, normalizing entry files, and building cache keys, all of which support the public API. The CompDbError struct encapsulates database-related errors, while the CompilationDatabase maintains a toolchain cache and a list of entries. By depending on the support module for file I/O, path handling, and logging, extract:compiler abstracts away the low-level details of compilation database reading and argument normalization, offering a focused public scope for managing compile entries and their associated compiler instances.
Imports
- std
- support
Imported By
Types
clore::extract::CompDbError
Declaration: extract/compiler.cppm:38
Definition: extract/compiler.cppm:38
Declaration: Namespace clore::extract
The struct clore::extract::CompDbError is implemented as a thin wrapper around a single std::string member named message. There are no user-defined constructors, destructors, or assignment operators; the struct relies entirely on the compiler-generated defaults. The sole invariant is that message holds a human-readable description of the error. Because the implementation adds no custom resource management or validation logic, its correctness depends entirely on the guarantees of std::string, namely exception-safe allocation and copy/move semantics. The struct is copyable and movable via the compiler-provided operations, so it can be passed by value in error-handling paths. Because it holds a std::string, it is not trivially copyable.
Invariants
- The message member holds a valid std::string (default-constructed or assigned).
Key Members
- message: a std::string representing the error description.
Usage Patterns
- Returned as a result type from extraction operations to indicate failure.
- Inspected by callers to retrieve detailed error text.
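The usage pattern above can be sketched as follows. This is a minimal illustration and not the module's actual signature: the `Database` struct, the `LoadResult` alias built on `std::variant`, and the `load` and `describe` helpers are all hypothetical names introduced here, and the real API may use a different result type.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>
#include <variant>
#include <vector>

namespace sketch {

// Mirrors the documented shape: a single human-readable message.
struct CompDbError {
    std::string message;
};

// Hypothetical stand-in for the real database type.
struct Database {
    std::vector<std::string> files;
};

// Assumed result type: either a database or an error.
using LoadResult = std::variant<Database, CompDbError>;

// Hypothetical loader that only performs the existence check.
inline LoadResult load(const std::string& path) {
    std::error_code ec;
    if (!std::filesystem::exists(path, ec)) {
        return CompDbError{"compilation database not found: " + path};
    }
    return Database{};
}

// Callers inspect the error text when the load fails.
inline std::string describe(const LoadResult& result) {
    if (const auto* err = std::get_if<CompDbError>(&result)) {
        return err->message;
    }
    return "ok";
}

// Small usage example: a missing file yields the error alternative.
inline void example() {
    const LoadResult result = load("/no/such/dir/compile_commands.json");
    assert(std::holds_alternative<CompDbError>(result));
}

}  // namespace sketch
```

Returning the error by value is cheap here because the struct only moves a single string.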
clore::extract::CompilationDatabase
Declaration: extract/compiler.cppm:31
Definition: extract/compiler.cppm:31
Declaration: Namespace clore::extract
The struct clore::extract::CompilationDatabase is a container that aggregates a flat list of CompileEntry objects in entries and maintains a toolchain_cache mapping toolchain identifiers to their resolved argument sequences. Both fields are default-initialised to empty; entries stores all parsed compilation commands, while toolchain_cache acts as a memoization table for toolchain-related lookups, preventing redundant re-computation. The member function has_cached_toolchain checks whether the cache is non-empty, which lets callers test whether any toolchain data has been populated without examining the cache contents. This design keeps the database self-contained and allows quick short-circuit checks during extraction.
Invariants
- entries may be empty
- toolchain_cache maps toolchain identifiers to flag lists
- has_cached_toolchain() returns true if toolchain_cache contains any keys
Key Members
- entries
- toolchain_cache
- has_cached_toolchain
Usage Patterns
- Used to store and pass around parsed compilation database data
- toolchain_cache can be populated and queried to avoid repeated toolchain resolution
Member Functions
clore::extract::CompilationDatabase::has_cached_toolchain
Declaration: extract/compiler.cppm:35
Definition: extract/compiler.cppm:229
Declaration: Namespace clore::extract
Implementation
auto CompilationDatabase::has_cached_toolchain() const -> bool {
    return !toolchain_cache.empty();
}
clore::extract::CompileEntry
Declaration: extract/compiler.cppm:21
Definition: extract/compiler.cppm:21
Declaration: Namespace clore::extract
The struct clore::extract::CompileEntry aggregates the data describing a single compilation unit extracted from a build system. The fundamental raw inputs are file (the source file path), directory (the working directory), and arguments (the compiler command‑line tokens). The normalized_file member stores a canonicalized version of the source path, while compile_signature is a hash computed from the combination of file, directory, and arguments, serving as an efficient discriminator for duplicate or equivalent compilations. An optional source_hash may hold a content hash of the source file; its presence is not guaranteed. The cache_key string is a derived, deterministic identifier used for caching or result lookup. An invariant of the structure is that once populated, normalized_file and compile_signature remain consistent with the raw fields, and cache_key is uniquely determined by the compilation signature (and possibly the source hash). The struct is an aggregate with all fields default‑initialized, allowing straightforward brace‑initialization and copying.
Invariants
- All string fields may be empty
- compile_signature is zero-initialized if not set
- source_hash is std::nullopt if not available
Key Members
- file
- directory
- arguments
- normalized_file
- compile_signature
- source_hash
- cache_key
Usage Patterns
- Used to store compilation entries from clore::extract::Extractor
- Populated from build system output like compile_commands.json
- Accessed by caching and reproducibility systems
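As an illustration of the aggregate described above, the following sketch declares a struct with the documented field names. The concrete field types (in particular `std::uint64_t` for the signature and `std::optional<std::uint64_t>` for the source hash) are assumptions, as is the `make_example_entry` helper; the real declaration may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

namespace sketch {

// Field names follow the text; the concrete types are assumptions.
struct CompileEntry {
    std::string file;                          // raw source path
    std::string directory;                     // working directory
    std::vector<std::string> arguments;        // compiler command-line tokens
    std::string normalized_file;               // derived: canonical source path
    std::uint64_t compile_signature = 0;       // derived: zero until computed
    std::optional<std::uint64_t> source_hash;  // derived: empty if unavailable
    std::string cache_key;                     // derived: cache identifier
};

// Hypothetical helper: fill only the raw inputs, as a loader would.
inline CompileEntry make_example_entry() {
    CompileEntry entry;
    entry.file = "src/main.cpp";
    entry.directory = "/work/project";
    entry.arguments = {"clang++", "-std=c++23", "-c", "src/main.cpp"};
    return entry;
}

// The derived fields keep their documented defaults until populated.
inline void example() {
    const CompileEntry entry = make_example_entry();
    assert(entry.compile_signature == 0);
    assert(!entry.source_hash.has_value());
    assert(entry.cache_key.empty());
}

}  // namespace sketch
```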
Functions
clore::extract::build_compile_signature
Declaration: extract/compiler.cppm:58
Definition: extract/compiler.cppm:110
Declaration: Namespace clore::extract
The function first checks whether the CompileEntry already holds a non-zero compile_signature and a non-empty normalized_file; if so, it short‑circuits by returning that cached value. Otherwise, it determines the normalized file path: if entry.normalized_file is empty, it calls clore::extract::normalize_entry_file to compute it; otherwise it reuses the existing one. It then delegates to the anonymous‑namespace helper clore::extract::(anonymous namespace)::build_compile_signature_impl, passing both the entry and the normalized file path, and returns the resulting std::uint64_t signature. This design centralizes the actual signature computation in the private helper while providing a caching layer that avoids redundant work when the signature and normalized file have already been established.
Side Effects
No observable side effects are evident from the extracted code.
Reads From
- entry.normalized_file
- entry.compile_signature
- entry (via normalize_entry_file and build_compile_signature_impl)
Usage Patterns
- computing a unique hash for compile entries
- caching compile signatures to avoid redundant computation
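The short-circuit and fallback logic described for build_compile_signature can be sketched as below. The hashing scheme is not documented, so `signature_impl` uses FNV-1a purely as a stand-in for build_compile_signature_impl, and `normalize_file` is a simplified, purely lexical stand-in for normalize_entry_file. Both helper names and the struct layout are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <string>
#include <string_view>
#include <vector>

namespace sketch {

struct CompileEntry {
    std::string file;
    std::string directory;
    std::vector<std::string> arguments;
    std::string normalized_file;
    std::uint64_t compile_signature = 0;
};

// Stand-in hash (64-bit FNV-1a); the real hashing scheme is not documented.
inline std::uint64_t fnv1a(std::string_view text,
                           std::uint64_t hash = 14695981039346656037ull) {
    for (unsigned char c : text) {
        hash ^= c;
        hash *= 1099511628211ull;
    }
    return hash;
}

// Hypothetical replacement for build_compile_signature_impl.
inline std::uint64_t signature_impl(const CompileEntry& entry,
                                    const std::string& normalized_file) {
    std::uint64_t hash = fnv1a(normalized_file);
    hash = fnv1a(entry.directory, hash);
    for (const auto& arg : entry.arguments) {
        hash = fnv1a(arg, hash);
    }
    return hash;
}

// Simplified, purely lexical stand-in for normalize_entry_file.
inline std::string normalize_file(const CompileEntry& entry) {
    std::filesystem::path p(entry.file);
    if (p.is_relative()) {
        p = std::filesystem::path(entry.directory) / p;
    }
    return p.lexically_normal().generic_string();
}

inline std::uint64_t build_compile_signature(const CompileEntry& entry) {
    // Short-circuit: both cached values are already present.
    if (entry.compile_signature != 0 && !entry.normalized_file.empty()) {
        return entry.compile_signature;
    }
    // Reuse the normalized path when available, otherwise compute it.
    const std::string normalized = entry.normalized_file.empty()
                                       ? normalize_file(entry)
                                       : entry.normalized_file;
    return signature_impl(entry, normalized);
}

// Usage example: a populated entry returns its stored signature unchanged.
inline void example() {
    CompileEntry cached;
    cached.normalized_file = "/work/src/a.cpp";
    cached.compile_signature = 42;
    assert(build_compile_signature(cached) == 42);
}

}  // namespace sketch
```

Note that the wrapper takes the entry by const reference and therefore does not store the computed signature, which matches the "no observable side effects" statement above.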
clore::extract::create_compiler_instance
Declaration: extract/compiler.cppm:65
Definition: extract/compiler.cppm:297
Declaration: Namespace clore::extract
The function first sanitizes the command-line arguments via sanitize_driver_arguments; an empty result causes an immediate nullptr return. A physical file system is created to serve as the virtual file system, and a diagnostic engine is constructed using clang::CompilerInstance::createDiagnostics with an ignoring consumer. If diagnostics creation fails, the function returns nullptr. The sanitized arguments, together with entry.file, the virtual file system, and the diagnostics, are passed to parse_compiler_invocation to produce a clang::CompilerInvocation. If parsing fails, nullptr is returned. The invocation's options are then adjusted: the frontend option DisableFree is set to false, and WorkingDir (a file system option in Clang) is set to entry.directory. If the invocation has no frontend inputs, the function returns nullptr. Finally, a clang::CompilerInstance is constructed from the invocation, its virtual file system is set, diagnostics are recreated for the instance, a file manager is created, and target creation is attempted. Failure at any of these steps yields nullptr; otherwise, the fully initialized instance is returned. Key dependencies include sanitize_driver_arguments, parse_compiler_invocation, llvm::vfs::FileSystem, clang::DiagnosticsEngine, clang::CompilerInvocation, and clang::CompilerInstance.
Side Effects
- Allocates a clang::CompilerInstance and related objects (VFS, diagnostics, invocation).
- Writes to the diagnostics object via IgnoringDiagConsumer during invocation parsing.
- Potentially performs I/O when parsing the compiler invocation (e.g., reading source files or configuration).
Reads From
- entry (the CompileEntry parameter), specifically entry.file and entry.directory
- driver_args returned from sanitize_driver_arguments(entry)
Writes To
- The returned std::unique_ptr<clang::CompilerInstance>
- Diagnostics object created internally (written by parse_compiler_invocation)
Usage Patterns
- Called in extraction pipeline to obtain a Clang compiler instance for a compilation unit.
- Used as part of the process to analyze source code and extract symbol information.
clore::extract::ensure_cache_key
Declaration: extract/compiler.cppm:60
Definition: extract/compiler.cppm:225
Declaration: Namespace clore::extract
The function clore::extract::ensure_cache_key serves as a thin entry point that delegates entirely to the implementation function ensure_cache_key_impl. Its purpose is to separate the public interface from the actual cache-key computation, allowing the implementation details (file normalization, signature computation, source hashing, and key assembly) to evolve without affecting callers. The single call to ensure_cache_key_impl performs all of the work, ending with the assignment of the CompileEntry::cache_key field.
Side Effects
- modifies the cache key field of the CompileEntry
Reads From
- the CompileEntry object (fields may be read to compute the cache key)
Writes To
- the CompileEntry object (specifically the cache key field)
Usage Patterns
- called by query_toolchain_cached before caching or querying toolchain for a compile entry
clore::extract::ensure_cache_key_impl
Declaration: extract/compiler.cppm:119
Definition: extract/compiler.cppm:119
Declaration: Namespace clore::extract
The function clore::extract::ensure_cache_key_impl computes and stores the caching metadata for a given CompileEntry. It begins by obtaining a normalized file path via clore::extract::normalize_entry_file and storing the result in the normalized_file field. Next, it derives a compile signature by calling clore::extract::(anonymous namespace)::build_compile_signature_impl with the entry and the normalized path, saving the result in compile_signature. An optional source file hash is produced by clore::extract::(anonymous namespace)::try_hash_source_file using the normalized file, and that result (which may be empty) is written to source_hash. Finally, the function assembles the cache_key by invoking clore::support::build_cache_key with the normalized file and the compile signature. The control flow is strictly sequential: every later step uses the normalized path produced by the first, and all of the stored values are later used by the caching and lookup mechanisms.
Side Effects
- mutates entry fields: normalized_file, compile_signature, source_hash, cache_key
- potentially reads source file via try_hash_source_file
Reads From
- entry (CompileEntry& parameter)
Writes To
- entry.normalized_file
- entry.compile_signature
- entry.source_hash
- entry.cache_key
Usage Patterns
- called by ensure_cache_key to populate cache key for a compile entry
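The four sequential steps of ensure_cache_key_impl can be sketched as follows. Every helper here is a hypothetical stand-in: `normalize_file` is purely lexical, `signature_of` and `try_hash_file` use `std::hash` rather than the module's real hashing, and the `file#signature` key format is an assumption, since the output of clore::support::build_cache_key is not documented.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <functional>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

namespace sketch {

struct CompileEntry {
    std::string file;
    std::string directory;
    std::vector<std::string> arguments;
    std::string normalized_file;
    std::uint64_t compile_signature = 0;
    std::optional<std::uint64_t> source_hash;
    std::string cache_key;
};

// Simplified, purely lexical stand-in for normalize_entry_file.
inline std::string normalize_file(const CompileEntry& entry) {
    std::filesystem::path p(entry.file);
    if (p.is_relative()) {
        p = std::filesystem::path(entry.directory) / p;
    }
    return p.lexically_normal().generic_string();
}

// Stand-in signature: hashes the normalized path and all arguments.
inline std::uint64_t signature_of(const CompileEntry& entry,
                                  const std::string& normalized_file) {
    std::string joined = normalized_file;
    for (const auto& arg : entry.arguments) {
        joined += '\n';
        joined += arg;
    }
    return std::hash<std::string>{}(joined);
}

// Stand-in for try_hash_source_file: empty when the file cannot be opened.
inline std::optional<std::uint64_t> try_hash_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return std::nullopt;
    }
    std::ostringstream contents;
    contents << in.rdbuf();
    return std::hash<std::string>{}(contents.str());
}

// Populates the derived fields in the order given in the text.
inline void ensure_cache_key(CompileEntry& entry) {
    entry.normalized_file = normalize_file(entry);
    entry.compile_signature = signature_of(entry, entry.normalized_file);
    entry.source_hash = try_hash_file(entry.normalized_file);
    // Assumed key format; the real build_cache_key output is not documented.
    entry.cache_key =
        entry.normalized_file + "#" + std::to_string(entry.compile_signature);
}

// Usage example: a missing source file leaves source_hash empty.
inline void example() {
    CompileEntry entry;
    entry.file = "missing.cpp";
    entry.directory = "/nonexistent_dir_for_sketch";
    ensure_cache_key(entry);
    assert(!entry.source_hash.has_value());
    assert(!entry.cache_key.empty());
}

}  // namespace sketch
```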
clore::extract::load_compdb
Declaration: extract/compiler.cppm:42
Definition: extract/compiler.cppm:127
Declaration: Namespace clore::extract
The function clore::extract::load_compdb begins by checking that the given path exists as a filesystem path; if not, it returns an error via CompDbError. It then loads the JSON compilation database using clang::tooling::JSONCompilationDatabase::loadFromFile, passing the raw path and an error_message output parameter. On failure, it returns an error describing the load failure. On success, it iterates over all compile commands from the database. For each command, it constructs a CompileEntry, populating file, directory, and arguments from the command’s Filename, Directory, and CommandLine fields. The helper ensure_cache_key is called on each entry to compute and store a cache_key (which later participates in deduplication and toolchain caching). The entries are moved into CompilationDatabase::entries. Finally, the function logs the number of loaded commands and returns the populated CompilationDatabase. The algorithm depends on std::filesystem for existence checks, clang::tooling::JSONCompilationDatabase for parsing, and internal routines such as ensure_cache_key for key derivation.
Side Effects
- Reads file system to check existence of compilation database file
- Opens and parses JSON compilation database file
- Allocates memory for CompileEntry objects
- Calls ensure_cache_key for each entry, mutating the entry's internal state
- Logs an informational message with the number of loaded commands
Reads From
- File system via the path parameter
- Contents of the JSON file at path
- Global or local logging facility (for logging::info)
Writes To
- Local CompilationDatabase object (db) and its entries vector
- Each CompileEntry within db (mutated by ensure_cache_key)
- Log output (via logging::info)
Usage Patterns
- Loading a compilation database from a path to compile_commands.json
- Initializing extraction processes for a project
- Providing compile commands to build dependency graphs or symbol indexes
clore::extract::lookup
Declaration: extract/compiler.cppm:44
Definition: extract/compiler.cppm:164
Declaration: Namespace clore::extract
The function clore::extract::lookup iterates over all entries in the provided clore::extract::CompilationDatabase and identifies those whose compiled source file matches the given file path. For each entry, it calls normalize_argument_path to resolve the input file relative to the entry’s directory, producing a canonical absolute path. It then compares this candidate against the entry’s normalized file, which is either pre‑computed in entry.normalized_file or obtained on‑the‑fly via normalize_entry_file. If the two paths are equal (using std::filesystem::path equality), a pointer to the entry is appended to the result vector.
The algorithm relies entirely on normalize_argument_path and normalize_entry_file for path normalization, and on std::filesystem for path comparison. The function returns a possibly empty std::vector<const CompileEntry*>, allowing callers to handle multiple entries that map to the same source file. No error handling is performed; if no match is found, an empty vector is returned.
Side Effects
No observable side effects are evident from the extracted code.
Reads From
- db.entries
- file (parameter)
- entry.directory
- entry.normalized_file
Usage Patterns
- Used to find compile entries corresponding to a source file path
- Called during extraction to associate a source file with its build configuration
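The matching loop described for lookup can be sketched as below. This is an illustration under stated assumptions: the parameter types are guesses, and the `normalize` helper is a purely lexical simplification that stands in for both normalize_argument_path and normalize_entry_file, so it does not resolve symlinks.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <vector>

namespace sketch {

namespace fs = std::filesystem;

struct CompileEntry {
    std::string file;
    std::string directory;
    std::string normalized_file;
};

struct CompilationDatabase {
    std::vector<CompileEntry> entries;
};

// Lexical-only stand-in for the module's path normalization helpers.
inline fs::path normalize(const std::string& path, const std::string& directory) {
    fs::path p(path);
    if (p.is_relative()) {
        p = fs::path(directory) / p;
    }
    return p.lexically_normal();
}

// Returns every entry whose source file matches `file`; may be empty.
inline std::vector<const CompileEntry*> lookup(const CompilationDatabase& db,
                                               const std::string& file) {
    std::vector<const CompileEntry*> matches;
    for (const auto& entry : db.entries) {
        // The query is resolved relative to each entry's directory.
        const fs::path candidate = normalize(file, entry.directory);
        // Prefer the precomputed path, otherwise normalize on the fly.
        const fs::path entry_file =
            entry.normalized_file.empty()
                ? normalize(entry.file, entry.directory)
                : fs::path(entry.normalized_file);
        if (candidate == entry_file) {
            matches.push_back(&entry);
        }
    }
    return matches;
}

// Usage example: an unknown file yields an empty result, not an error.
inline void example() {
    CompilationDatabase db;
    db.entries.push_back(CompileEntry{"src/a.cpp", "/work", ""});
    assert(lookup(db, "/work/src/other.cpp").empty());
}

}  // namespace sketch
```

Because a relative query is resolved against each entry's own directory, the same relative path can match entries from different working directories; callers that need a single answer should pass an absolute path.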
clore::extract::normalize_argument_path
Declaration: extract/compiler.cppm:49
Definition: extract/compiler.cppm:188
Declaration: Namespace clore::extract
The function constructs a normalized filesystem path from a raw path and a directory. It first creates a std::filesystem::path from path, and if it is relative, prepends directory by concatenating std::filesystem::path(directory) / normalized. It then attempts to convert the result to an absolute path via std::filesystem::absolute, capturing any std::error_code; on success, it replaces normalized with the absolute path. After calling normalized.lexically_normal() to remove redundant elements (e.g., .. or .), it tries to produce a canonical, symlink‑resolved path using std::filesystem::weakly_canonical. If that operation does not fail, it returns the canonical path; otherwise, it falls back to the lexically‑normalized path. The function relies solely on the C++ standard library’s std::filesystem facilities and does not invoke any other clore::extract routines.
Side Effects
No observable side effects are evident from the extracted code.
Reads From
- path parameter
- directory parameter
- filesystem state (via std::filesystem::absolute and std::filesystem::weakly_canonical)
Usage Patterns
- Normalize compiler argument paths for consistent processing
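The normalization steps described above map directly onto standard library calls. The following is a minimal sketch of that sequence; the parameter types (`std::string_view`) and the `std::filesystem::path` return type are assumptions about the real signature.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <string_view>
#include <system_error>

namespace sketch {

namespace fs = std::filesystem;

inline fs::path normalize_argument_path(std::string_view path,
                                        std::string_view directory) {
    fs::path normalized(path);
    // Relative paths are interpreted relative to the given directory.
    if (normalized.is_relative()) {
        normalized = fs::path(directory) / normalized;
    }
    // Make the path absolute; keep the previous value if that fails.
    std::error_code ec;
    const fs::path absolute = fs::absolute(normalized, ec);
    if (!ec) {
        normalized = absolute;
    }
    // Remove "." and ".." components without touching the file system.
    normalized = normalized.lexically_normal();
    // Resolve symlinks where possible; fall back to the lexical form.
    ec.clear();
    const fs::path canonical = fs::weakly_canonical(normalized, ec);
    return ec ? normalized : canonical;
}

// Usage example: the result is absolute even for a non-existent file.
inline void example() {
    const fs::path result =
        normalize_argument_path("src/../main.cpp", "/nonexistent_dir_for_sketch");
    assert(result.is_absolute());
    assert(result.filename() == "main.cpp");
}

}  // namespace sketch
```

std::filesystem::weakly_canonical is used rather than std::filesystem::canonical because it tolerates path components that do not exist, which is common for generated sources listed in a compilation database.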
clore::extract::normalize_entry_file
Declaration: extract/compiler.cppm:56
Definition: extract/compiler.cppm:91
Declaration: Namespace clore::extract
The function clore::extract::normalize_entry_file takes a CompileEntry and returns a normalized file path as a std::string. It begins by constructing a std::filesystem::path from entry.file. If the path is relative, it is prefixed with entry.directory. After converting to an absolute path using fs::absolute (with error checking), the path is normalized via lexically_normal(). A subsequent call to fs::weakly_canonical attempts to resolve any symbolic links and produce a canonical path; if this fails (e.g., the file does not exist), the function falls back to the lexically-normalized path. The result is always returned as a generic string (forward slashes).
Internally, the function relies solely on the std::filesystem library and the CompileEntry fields file and directory. Error codes from weakly_canonical are captured but not propagated—the function degrades gracefully to the non-canonical form. This ensures robust behavior when the input file does not exist on disk or when filesystem resolution fails. The algorithm prioritizes accuracy (canonical form when possible) over strict failure reporting.
Side Effects
No observable side effects are evident from the extracted code.
Reads From
- entry.file
- entry.directory
Usage Patterns
- Called by build_compile_signature to derive a unique signature for a compile entry.
- Called by ensure_cache_key_impl to produce a normalized file path for cache key computation.
clore::extract::query_toolchain_cached
Declaration: extract/compiler.cppm:62
Definition: extract/compiler.cppm:233
Declaration: Namespace clore::extract
The function first checks if entry.arguments is empty and returns an empty vector if so. It then attempts to obtain a cache key from entry.cache_key; if that field is empty, it creates a copy of the entry and calls ensure_cache_key on the copy to populate its cache_key field, then uses that key for lookup. Using the key, it searches db.toolchain_cache; if a cached value exists, it is returned immediately. Otherwise, the function calls sanitize_tool_arguments on the original entry to produce the toolchain arguments, inserts the result into the cache under the key, and returns the computed vector. This design ensures that each unique compilation entry (as identified by its cache key) triggers the potentially expensive sanitization only once per database session.
Side Effects
- Inserts or updates entries in db.toolchain_cache
Reads From
- entry.arguments
- entry.cache_key
- db.toolchain_cache
Writes To
- db.toolchain_cache
Usage Patterns
- Used to obtain sanitized tool arguments with memoization
- Callers rely on caching to avoid redundant calls to sanitize_tool_arguments
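The memoization flow of query_toolchain_cached can be sketched as follows. The parameter order, the cache's container type, and the by-value return are assumptions. `sanitize` and `ensure_key` are hypothetical stand-ins for sanitize_tool_arguments and ensure_cache_key, and the `sanitize_calls` counter exists only to make the caching observable.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

namespace sketch {

struct CompileEntry {
    std::vector<std::string> arguments;
    std::string cache_key;
};

struct CompilationDatabase {
    // Assumed container type for the toolchain cache.
    std::unordered_map<std::string, std::vector<std::string>> toolchain_cache;
};

// Counts how often the expensive step runs (illustration only).
inline int sanitize_calls = 0;

// Stand-in for sanitize_tool_arguments: drops the compiler path.
inline std::vector<std::string> sanitize(const CompileEntry& entry) {
    ++sanitize_calls;
    if (entry.arguments.size() <= 1) {
        return {};
    }
    return std::vector<std::string>(entry.arguments.begin() + 1,
                                    entry.arguments.end());
}

// Stand-in for ensure_cache_key: derives a key from the arguments.
inline void ensure_key(CompileEntry& entry) {
    std::string key;
    for (const auto& arg : entry.arguments) {
        key += arg;
        key += '\n';
    }
    entry.cache_key = key;
}

inline std::vector<std::string> query_toolchain_cached(const CompileEntry& entry,
                                                       CompilationDatabase& db) {
    if (entry.arguments.empty()) {
        return {};
    }
    // Use the stored key, or compute one on a copy so `entry` stays untouched.
    std::string key = entry.cache_key;
    if (key.empty()) {
        CompileEntry copy = entry;
        ensure_key(copy);
        key = copy.cache_key;
    }
    if (auto it = db.toolchain_cache.find(key); it != db.toolchain_cache.end()) {
        return it->second;  // cache hit
    }
    std::vector<std::string> args = sanitize(entry);
    db.toolchain_cache.emplace(key, args);
    return args;
}

// Usage example: an entry without arguments never touches the cache.
inline void example() {
    CompilationDatabase db;
    const CompileEntry empty_entry;
    assert(query_toolchain_cached(empty_entry, db).empty());
    assert(db.toolchain_cache.empty());
}

}  // namespace sketch
```

Computing the key on a copy keeps the entry parameter const, at the cost of recomputing the key on every call for entries that were never keyed.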
clore::extract::sanitize_driver_arguments
Declaration: extract/compiler.cppm:52
Definition: extract/compiler.cppm:207
Declaration: Namespace clore::extract
The function clore::extract::sanitize_driver_arguments begins by obtaining a copy of the CompileEntry’s arguments into adjusted. It then normalizes the original input file path by calling normalize_argument_path with entry.file and entry.directory. Using std::erase_if, it scans every argument in adjusted; any non‑empty argument that does not start with a dash (-) is further checked by normalizing it relative to entry.directory. If the normalized result matches the normalized source path, that argument is removed from the vector. The resulting adjusted list, which no longer contains the explicit input file argument, is returned.
The algorithm relies on normalize_argument_path to produce a canonical absolute path from a given path and a working directory. It assumes all relevant arguments are already present in the CompileEntry::arguments list and that the input file appears there as a non-flag argument. No other data structures or global state are modified; the function works purely on local copies and the provided entry.
Side Effects
No observable side effects are evident from the extracted code.
Reads From
- entry.arguments
- entry.file
- entry.directory
- clore::extract::normalize_argument_path
Usage Patterns
- used to remove the source file from compiler arguments
- called before building compile signature or invoking compiler
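A minimal sketch of the filtering step follows. It uses the erase-remove idiom instead of the std::erase_if call mentioned above so that it also builds as C++17, and its `normalize` helper is a lexical-only stand-in for normalize_argument_path; both choices are simplifications made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <string>
#include <vector>

namespace sketch {

namespace fs = std::filesystem;

struct CompileEntry {
    std::string file;
    std::string directory;
    std::vector<std::string> arguments;
};

// Lexical-only stand-in for normalize_argument_path.
inline fs::path normalize(const std::string& path, const std::string& directory) {
    fs::path p(path);
    if (p.is_relative()) {
        p = fs::path(directory) / p;
    }
    return p.lexically_normal();
}

inline std::vector<std::string> sanitize_driver_arguments(const CompileEntry& entry) {
    std::vector<std::string> adjusted = entry.arguments;  // work on a copy
    const fs::path source = normalize(entry.file, entry.directory);
    adjusted.erase(
        std::remove_if(adjusted.begin(), adjusted.end(),
                       [&](const std::string& arg) {
                           // Flags and empty tokens are always kept.
                           if (arg.empty() || arg.front() == '-') {
                               return false;
                           }
                           // Drop the token if it names the source file.
                           return normalize(arg, entry.directory) == source;
                       }),
        adjusted.end());
    return adjusted;
}

// Usage example: only the source-file token disappears.
inline void example() {
    CompileEntry entry;
    entry.file = "src/a.cpp";
    entry.directory = "/work";
    entry.arguments = {"clang++", "-c", "./src/a.cpp"};
    assert(sanitize_driver_arguments(entry).size() == 2);
}

}  // namespace sketch
```

Comparing normalized paths rather than raw strings lets the filter catch spellings such as `./src/a.cpp` or an absolute path that refer to the same file as entry.file.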
clore::extract::sanitize_tool_arguments
Declaration: extract/compiler.cppm:54
Definition: extract/compiler.cppm:221
Declaration: Namespace clore::extract
The implementation of clore::extract::sanitize_tool_arguments is a direct two-stage pipeline. It first calls sanitize_driver_arguments, passing the const CompileEntry& entry. That function copies the entry's arguments and removes the explicit input file argument from the copy. The result from sanitize_driver_arguments is then immediately forwarded to strip_compiler_path, which removes the compiler executable path (the first argument of the invocation) from the argument vector. The final returned value is a std::vector<std::string> containing the remaining, sanitized tool arguments. No branching or further transformation occurs inside sanitize_tool_arguments itself; all logic is delegated to these two dependency functions.
Side Effects
No observable side effects are evident from the extracted code.
Reads From
- entry
Usage Patterns
- Used when normalizing compile arguments
- Called during extraction pipeline
clore::extract::strip_compiler_path
Declaration: extract/compiler.cppm:47
Definition: extract/compiler.cppm:181
Declaration: Namespace clore::extract
The implementation of clore::extract::strip_compiler_path applies a simple truncation algorithm: when the input std::vector<std::string> of arguments contains one or fewer elements, it returns an empty vector. Otherwise, it constructs a new vector containing every element except the first by copying from args.begin() + 1 to args.end(). This effectively removes the compiler executable path from a command-line argument list.
The sole control flow is an early return on args.size() <= 1. The function depends only on the standard library’s std::vector, its size and begin / end iterators, and the iterator-range constructor used for the copy. No external project dependencies are required.
Side Effects
No observable side effects are evident from the extracted code.
Reads From
- args parameter
Writes To
- return value (new vector)
Usage Patterns
- used to obtain compiler flags without the program name
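The truncation described above is small enough to sketch in full. The parameter type (a const reference to the vector) is an assumption; the behaviour follows the text.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace sketch {

inline std::vector<std::string> strip_compiler_path(
    const std::vector<std::string>& args) {
    // Nothing remains after removing the program name.
    if (args.size() <= 1) {
        return {};
    }
    // Copy everything except the first element.
    return std::vector<std::string>(args.begin() + 1, args.end());
}

// Usage example: the compiler path is dropped, the flags are kept in order.
inline void example() {
    const std::vector<std::string> args = {"/usr/bin/clang++", "-O2", "-c"};
    const std::vector<std::string> flags = strip_compiler_path(args);
    assert(flags.size() == 2);
    assert(flags.front() == "-O2");
}

}  // namespace sketch
```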
Internal Structure
The extract:compiler module is the intermediary between raw compilation database entries and the actual Clang compiler invocations used during extraction. It imports only the std library and the local support module, which supplies foundational utilities for file I/O, path handling, logging, and cache key construction. The module decomposes into three layers: data structures (CompileEntry, CompilationDatabase, CompDbError), public API functions, and private helpers in an anonymous namespace. Public functions such as load_compdb, lookup, sanitize_driver_arguments, strip_compiler_path, and create_compiler_instance operate on the database and its entries, while private helpers such as parse_compiler_invocation, try_hash_source_file, and build_compile_signature_impl handle invocation parsing, source hashing, and signature computation. This layering keeps caching and normalization logic encapsulated, leaving the public API stable and focused on producing ready-to-use compiler invocations from concrete compile entries.