Code

[Figure: dgit structure (_images/dgit-structure.jpg)]

Repo Manager

dgit supports multiple ways to store datasets: git itself, or the local filesystem (possibly with an s3 backend). We expect to support Instabase in the future.

class dgitcore.plugins.repomanager.RepoManagerBase(name, version, description, supported=[])[source]

Bases: object

A repository manager handles the specifics of the version control system. Currently only the git manager is supported.

add(repo)[source]

Add repo to the internal lookup table...

add_files(repo, files)[source]

files is a list of simple dicts, each with relativepath and fullpath keys.
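
The shape of the files argument can be illustrated with a small sketch. The helper name build_file_entries and the sample directory layout are assumptions made for illustration; only the relativepath and fullpath keys come from the description above.

```python
import os
import tempfile


def build_file_entries(root):
    """Walk `root` and describe each file as a dict with both paths.

    Hypothetical helper: dgit may build this list differently.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            fullpath = os.path.join(dirpath, filename)
            entries.append({
                # Path relative to the repo working directory
                "relativepath": os.path.relpath(fullpath, root),
                # Path on the local filesystem
                "fullpath": fullpath,
            })
    return entries


# Demonstrate on a throwaway directory containing one file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "data"))
    with open(os.path.join(root, "data", "sales.csv"), "w") as fd:
        fd.write("id,amount\n1,10\n")
    files = build_file_entries(root)
    print(files[0]["relativepath"])
```

A list built this way could then be passed as the files argument of add_files, assuming the repo object has already been obtained from the manager.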

add_raw(repo, files)[source]
clone(repo, newusername, newreponame)[source]

Clone repo

commit(repo, message)[source]
config(what='get', params=None)[source]
drop(repo)[source]

Drop repository

enabled()[source]
get_repo_details(key)[source]
get_repo_list()[source]
init(username, reponame, force)[source]

Initialize a repo (may be fs/git/.. backed)

initialize()[source]
is_my_repo(username, reponame)[source]
key(username, reponame)[source]
lookup(username=None, reponame=None, key=None)[source]

Lookup all available repos

notes(repo, args)[source]
push(repo, args)[source]
repos(username)[source]
rootdir(username, reponame, create=True)[source]

Working directory for the repo

search(username, reponame)[source]
server_rootdir(username, reponame, create=True)[source]

Working directory for the repo

server_rootdir_from_repo(repo, create=True)[source]
show(repo, args)[source]
stash(repo, args)[source]
status(repo, args)[source]
users()[source]

Find users

class dgitcore.contrib.repomanagers.gitmanager.GitRepoManager[source]

Bases: dgitcore.plugins.repomanager.RepoManagerBase

Git-based versioning service. This implements the RepoManagerBase class.

add_files(repo, files)[source]

Add files to the repo

add_raw(repo, files)[source]
clone(url, backend=None)[source]

Clone a URL

Parameters:

url: URL of the repo. Supports s3://, git@, http://
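
As a rough illustration of the three URL forms listed for clone, the sketch below labels a URL by its prefix. The function classify_repo_url and the labels it returns are hypothetical; they are not part of dgit, and only the three documented prefixes are checked.

```python
def classify_repo_url(url):
    """Return a label for the URL form, or None if it is not recognized.

    Hypothetical helper covering only the prefixes named in the docs.
    """
    if url.startswith("s3://"):
        return "s3"
    if url.startswith("git@"):
        return "ssh"
    if url.startswith("http://"):
        return "http"
    return None


# Example URLs are made up for illustration.
for example in ("s3://bucket/datasets/demo.git",
                "git@example.com:user/demo.git",
                "http://example.com/user/demo.git",
                "ftp://example.com/demo.git"):
    print(example, "->", classify_repo_url(example))
```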
commit(repo, args=[])[source]

Commit the changes to the repo (pass thru git command)

Parameters:

repo: Repository object

args: git-specific args

config(what='get', params=None)[source]
delete(repo, args=[])[source]

Delete files from the repo

diff(repo, args=[])[source]

Diff two repo versions (pass thru git command)

Parameters:

repo: Repository object

args: git-specific args

drop(repo, args=[])[source]

Cleanup the repo

init(username, reponame, force, backend=None)[source]

Initialize a Git repo

Parameters:

username, reponame: the repo name is the tuple (username, reponame)

force: force initialization of the repo even if it exists

backend: backend that must be used for this (e.g. s3)

log(repo, args=[])[source]

Show the log (pass thru git command)

Parameters:

repo: Repository object

args: git-specific args

notes(repo, args=[])[source]

Add notes to the commit

Parameters:

repo: Repository object

args: notes-specific args

Get the permalink to the command that generated the dataset

pull(repo, args=[])[source]

Pull from origin/filesystem-based master

Parameters:

repo: Repository object

args: git-specific args

push(repo, args=[])[source]

Push to origin master

Parameters:

repo: Repository object

args: git-specific args

remote(repo, args=[])[source]

Check remote URL

Parameters:

repo: Repository object

args: git-specific args

show(repo, args=[])[source]

Show the content of the repo (pass thru git command)

Parameters:

repo: Repository object

args: git-specific args

stash(repo, args=[])[source]

Stash all the changes (pass thru git command)

Parameters:

repo: Repository object

args: git-specific args

status(repo, args=[])[source]

Show status of the repo (pass thru git command)

Parameters:

repo: Repository object

args: git-specific args
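
Several GitRepoManager methods above are described as passing their args through to a git command. A minimal sketch of what such a pass-through might look like is shown below; the helper build_git_command is hypothetical, and it only assembles the argument list rather than running git, so it says nothing about how dgit actually invokes the command.

```python
import shlex


def build_git_command(subcommand, args=None):
    """Assemble the argv for `git <subcommand> <args...>` without running it.

    Hypothetical helper: the extra args are appended unchanged.
    """
    argv = ["git", subcommand]
    argv.extend(args or [])
    return argv


cmd = build_git_command("log", ["--oneline", "-n", "5"])
# shlex.join renders the list as a shell-safe string for display.
print(shlex.join(cmd))
```

The sketch uses None rather than a list literal as the default for args, which avoids sharing one list object across calls.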

Backends

dgit is designed to support multiple backends. Initially, the local filesystem and s3 are supported. We plan to support more in the future.

class dgitcore.plugins.backend.BackendBase(name, version, description, supported=[])[source]

Bases: object

Backend object implements

clone_repo(url, gitdir)[source]

Clone a repo at specified URL

config(what='get', params=None)[source]
initialize()[source]

Called to initialize sessions, internal objects etc.

push(state, name)[source]

Push a data version to the server

Parameters:

state: Overall state object that has dataset details

name: name of the dataset

supported(url)[source]

Check if a URL is supported by the repo

url_is_valid(url)[source]

Check if a URL exists
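
To show how the supported and url_is_valid checks might fit together, here is a self-contained sketch. BackendSketch is a stand-in that mirrors only the constructor arguments of BackendBase so the example runs without dgit installed; DirectoryBackend, the file:// prefix, and the interpretation of the supported constructor argument as a list of URL prefixes are all assumptions.

```python
import os
import tempfile


class BackendSketch:
    """Stand-in for BackendBase that mirrors only its constructor arguments."""

    def __init__(self, name, version, description, supported=None):
        self.name = name
        self.version = version
        self.description = description
        # Assumption: `supported` holds URL prefixes this backend handles.
        self.supported_prefixes = list(supported or [])

    def supported(self, url):
        raise NotImplementedError

    def url_is_valid(self, url):
        raise NotImplementedError


class DirectoryBackend(BackendSketch):
    """Hypothetical backend that treats file:// URLs as directories on disk."""

    def __init__(self):
        super().__init__("directory", "0.1", "Directory-backed example",
                         supported=["file://"])

    def _path(self, url):
        # Strip the scheme to get the filesystem path.
        return url[len("file://"):]

    def supported(self, url):
        return any(url.startswith(p) for p in self.supported_prefixes)

    def url_is_valid(self, url):
        # A URL is valid only if it is supported and the directory exists.
        return self.supported(url) and os.path.isdir(self._path(url))


backend = DirectoryBackend()
with tempfile.TemporaryDirectory() as root:
    url = "file://" + root
    print(backend.supported(url), backend.url_is_valid(url))
```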

class dgitcore.contrib.backends.s3.S3Backend[source]

Bases: dgitcore.plugins.backend.BackendBase

S3 backend for the datasets.

Parameters: Configuration (s3 enable, access, secret, bucket, prefix)
clone_repo(url, gitdir)[source]
config(what='get', params=None)[source]
init_repo(gitdir)[source]

Insert hook into the repo

make_hook_executable(filename)[source]
run(cmd)[source]
url(username, reponame)[source]
url_is_valid(url)[source]
class dgitcore.contrib.backends.local.LocalBackend[source]

Bases: dgitcore.plugins.backend.BackendBase

Filesystem based backend

config(what='get', params=None)[source]
connect()[source]
pull()[source]
push()[source]
url_is_valid(url)[source]

Check if a URL exists

Instrumentation

Plugins that can be used to instrument any process that generates a dataset.

class dgitcore.plugins.instrumentation.InstrumentationBase(name, version, description, supported=[])[source]

Bases: object

Pre-computed patterns

config(what='get', params=None)[source]
initialize()[source]
class dgitcore.contrib.instrumentations.content.ContentInstrumentation[source]

Bases: dgitcore.plugins.instrumentation.InstrumentationBase

Instrumentation to extract content summaries including mimetypes, sha1 signature and schema where possible.

update(config)[source]
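
The kind of content summary described for ContentInstrumentation can be sketched with the standard library alone. The function summarize_content and the keys of the dict it returns are assumptions; schema extraction is left out, and this is not the plugin's actual implementation.

```python
import hashlib
import mimetypes
import os
import tempfile


def summarize_content(path):
    """Return a guessed mimetype and the sha1 hex digest for one file.

    Hypothetical helper; key names are chosen for illustration.
    """
    mimetype, _encoding = mimetypes.guess_type(path)
    sha1 = hashlib.sha1()
    with open(path, "rb") as fd:
        # Read in chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: fd.read(65536), b""):
            sha1.update(chunk)
    return {"mimetype": mimetype, "sha1": sha1.hexdigest()}


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "notes.txt")
    with open(path, "wb") as fd:
        fd.write(b"hello\n")
    summary = summarize_content(path)
    print(summary)
```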
class dgitcore.contrib.instrumentations.platform.PlatformInstrumentation[source]

Bases: dgitcore.plugins.instrumentation.InstrumentationBase

Instrumentation to extract platform-specific information

get_metadata()[source]
update(config)[source]
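
Platform-specific information of the sort PlatformInstrumentation refers to is available from Python's platform module. The sketch below is an assumption about what such metadata could contain; the function name and the chosen fields are illustrative and may not match what get_metadata returns.

```python
import platform


def collect_platform_metadata():
    """Gather a few platform facts into a dict (illustrative field names)."""
    return {
        "system": platform.system(),        # e.g. operating system name
        "release": platform.release(),      # OS release string
        "machine": platform.machine(),      # hardware architecture
        "node": platform.node(),            # network name of the host
        "python": platform.python_version(),
    }


metadata = collect_platform_metadata()
for key in sorted(metadata):
    print(key, "=", metadata[key])
```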
class dgitcore.contrib.instrumentations.executable.ExecutableInstrumentation[source]

Bases: dgitcore.plugins.instrumentation.InstrumentationBase

Instrumentation to extract executable related summaries such as the git commit, nature of executable, parameters etc.

update(config)[source]

Metadata

dgit supports posting metadata to simple API servers to enable search, lineage computation, and sharing. A minimal posting client is supported for now.

class dgitcore.plugins.metadata.MetadataBase(name, version, description, supported=[])[source]

Bases: object

This is the base class for all backends including

initialize()[source]

Called to initialize sessions, internal objects etc.

post(repo)[source]

Post to server

class dgitcore.contrib.metadata.default.BasicMetadata[source]

Bases: dgitcore.plugins.metadata.MetadataBase

Metadata backend for the datasets.

Parameters: Configuration (token)
config(what='get', params=None)[source]
post(repo)[source]

Post to the metadata server

Parameters: repo
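
A token-authenticated post to a metadata server could look roughly like the sketch below. Everything specific here is an assumption: the endpoint URL, the bearer-token header, the JSON payload fields, and the helper name build_metadata_request. The request is only constructed, not sent, so the example runs without a server.

```python
import json
import urllib.request


def build_metadata_request(url, token, payload):
    """Build (but do not send) a JSON POST request carrying a token.

    Hypothetical helper: the header scheme is assumed, not taken from dgit.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
    )


# Placeholder endpoint, token, and payload for illustration only.
payload = {"username": "alice", "reponame": "sales-data"}
request = build_metadata_request(
    "http://metadata.example.com/api/datasets", "example-token", payload)
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to urllib.request.urlopen once a real endpoint and token are configured.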

Validation

class dgitcore.plugins.validator.ValidatorBase(name, version, description, supported=[])[source]

Bases: object

This is the base class for all backends including

autooptions()[source]

Get default options

evaluate(repo, files, rules)[source]

Evaluate the repo

returns: A list of dictionaries with:

target: relative path of the file

rules: rules file used

validator: name of the validator

status: OK/Success/Error

Message: Any additional information
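
The result structure returned by evaluate can be illustrated with sample data and two small helpers. The records, the validator name, the lowercase spelling of the message key, and the helper functions are all made up for illustration; only the set of fields follows the description above.

```python
from collections import Counter

# Sample results in the documented shape (values are invented).
results = [
    {"target": "data/sales.csv", "rules": "rules/sales.json",
     "validator": "example-validator", "status": "OK", "message": ""},
    {"target": "data/costs.csv", "rules": "rules/costs.json",
     "validator": "example-validator", "status": "Error",
     "message": "missing column: amount"},
]


def count_by_status(results):
    """Count how many results fall under each status value."""
    return Counter(r["status"] for r in results)


def failed_targets(results):
    """List targets whose status is neither OK nor Success."""
    return [r["target"] for r in results
            if r["status"] not in ("OK", "Success")]


print(dict(count_by_status(results)))
print(failed_targets(results))
```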
initialize()[source]

Called to initialize sessions, internal objects etc.

class dgitcore.contrib.validators.metadata_validator.MetadataValidator[source]

Bases: dgitcore.plugins.validator.ValidatorBase

Validate repository metadata

autooptions()[source]
config(what='get', params=None)[source]
evaluate(repo, spec, args)[source]

Check the integrity of the datapackage.json

class dgitcore.contrib.validators.regression_quality.RegressionQualityValidator[source]

Bases: dgitcore.plugins.validator.ValidatorBase

Validate repository metadata

autooptions()[source]
config(what='get', params=None)[source]
evaluate(repo, spec, args)[source]

Evaluate the files identified for checksum.

Transformer

class dgitcore.plugins.transformer.TransformerBase(name, version, description, supported=[])[source]

Bases: object

This is the base class for all backends including

autooptions()[source]

Get default options

evaluate(repo, files, spec, force)[source]

Execute the generator on the files

initialize()[source]

Called to initialize sessions, internal objects etc.