Configuration
=============

There are two configuration files:

* ~/.dgit.ini : INI file that specifies generic parameters such as
  client authentication and workspace directories.
* dgit.json : Repository-specific configuration file that specifies
  user preferences for a given repository, such as the validation
  rules that must be executed.

Both can be set up easily and updated at any time. This document
gives a detailed list of options for completeness.

Generic Configuration (~/.dgit.ini)
-----------------------------------

Summary
~~~~~~~

dgit is composed of multiple modules, each of which requires
configuration if enabled. This file lists the parameters for each of
the modules.

1) Basic configuration

.. automodule:: dgitcore.config

2) Repository Manager (Git)

.. automodule:: dgitcore.contrib.repomanagers.gitmanager

3) Backend Manager (S3)

.. automodule:: dgitcore.contrib.backends.s3

4) Backend Manager (Local)

.. automodule:: dgitcore.contrib.backends.local

5) Metadata Server (Local)

.. automodule:: dgitcore.contrib.metadata.default

Execution
~~~~~~~~~

::

    # .dgit.ini in ~
    $ dgit config init
    General information
    ==================
    user.email (Email address) [pingali@gmail.com]:
    user.name (Full Name) [pingali]:
    user.fullname (Short Name) [Venkata Pingali]:

    Local Filesystem Backend
    ==================
    workspace (Local directory to store datasets) [/home/pingali/.dgit]:

    S3 backend
    ==================
    enable (Enable S3 backend?) [y]:
    client (Command line tool to use for repo backup (aws|s3cmd)) [aws]:
    s3cfg (s3cfg configuration file if s3cmd is chosen. Otherwise ignore) [/home/pingali/.s3cfg]:
    bucket (Bucket into which the datasets are stored) [appsloka]:
    prefix (Prefix within bucket to backup the repos) [git]:

    Git-based Repository Manager
    ==================
    Nothing to do. Enabled by default

    Basic metadata server
    ==================
    enable (Enable generic Metadata server?) [y]:
    token (Provide API token to be used for posting) [02ea2997272026303]:
    url (URL to which metadata should be posted) [http:///api/metdata/]:

    Validate integrity of the dataset metadata
    ==================
    enable (Enable repository metadata integrity check) [y]:

    Check R2 of regression model
    ==================
    enable (Enable repository regression-quality checker) [y]:

Dataset-specific Configuration File (dgit.json)
-----------------------------------------------

Summary
~~~~~~~

A dgit configuration file is automatically generated in the local
directory so that the user does not have to specify preferences each
time and can work in auto mode as much as possible.

- username : Name of the user (string)
- reponame : Name of the dataset (unique for a given user)
- title : One-line summary
- description : Detailed description of the repository
- remoteurl : Path to the archive of the repo

  - git@github.com:pingali/dgit.git
  - https://github.com:pingali/dgit.git
  - s3://mybucket/git/pingali/dgit.git

- dependencies: List of other repositories that this dataset depends on
- working-directory : Directory that must be searched for updated files
- tracking : Dictionary specifying files to include and exclude

  - includes : List of patterns that should be used to include files
  - excludes : List of patterns that should be used to exclude files

    - Example: .git

- pipeline : Data processing pipeline. This is a dictionary that maps
  each pipeline name to a details dictionary. Each details dictionary
  has:

  - files: Ordered list of files
  - description: Text summary of the pipeline

- import : Transformations that must be performed while importing
  files from the local directory into the dataset.

  - directory-mapping: Dictionary with local: repo directory mapping

- validate : List of validations that must be performed. This is a
  dictionary mapping each validator name to its parameters. Possible
  parameters include:

  - Files: List of patterns of source files on which the validation
    must be performed
  - Rules: List of patterns that specify rules files with validation
    parameters

- metadata-management: Specifies what should be shared with the
  metadata server.

  - servers: List of domain names to which the metadata is posted
  - code-history: Git commit information for specified files from the
    code repository
  - include-preview: List of files/patterns and the number of bytes
    that must be included
  - include-validation: Validate and share the results
  - include-dependencies: Include information on dependent
    repositories
  - include-schema: For CSV and TSV files, detect the schema and share
    it
  - include-tab-diffs: For CSV and TSV files, do an intelligent diff
    to figure out schema and record changes
  - include-platform: Include the OS/system information

Execution
~~~~~~~~~

::

    $ dgit auto
    Let us know a few details about your data repository
    Please specify username [pingali]
    Please specify repo name [simple-regression]
    Please specify remote URL [s3://mybucket/git/pingali/simple-regression.git]
    One line summary of your repo: Simple regression model
    Add any more details:
    Updated dataset specific config file: dgit.json
    Please edit it and rerun dgit auto.
    Tip: Consider committing dgit.json to the code repository.

    $ cat dgit.json
    {
        "username": "pingali",
        "reponame": "simple-regression",
        "remoteurl": "s3://appsloka/git/pingali/simple-regression.git",
        "title": " S",
        "description": " S",
        "working-directory": ".",
        "track": {
            "includes": [
                "*.csv",
                "*.tsv",
                "*.txt",
                "*.json",
                "*.xlsx",
                "*.sql",
                "*.hql"
            ],
            "excludes": [
                ".git",
                ".svn",
                "dgit.json"
            ]
        },
        "auto-push": false,
        "pipeline": {},
        "import": {
            "directory-mapping": {
                ".": ""
            }
        },
        "dependencies": {},
        "validator": {
            "regression-quality-validator": {
                "files": [
                    "*.txt"
                ],
                "rules": {
                    "min-r2": 0.25
                },
                "rules-files": [
                    "rules.json"
                ]
            },
            "metadata-validator": {
                "files": [
                    "*"
                ]
            }
        },
        "transformer": {},
        "metadata-management": {
            "servers": [
                "localhost:8000"
            ],
            "include-code-history": [
                "regression.py",
                "regression2.py"
            ],
            "include-preview": {
                "length": 512,
                "files": [
                    "*.txt",
                    "*.csv",
                    "*.tsv"
                ]
            },
            "include-data-history": true,
            "include-validation": true,
            "include-dependencies": true,
            "include-schema": [
                "*.csv",
                "*.tsv"
            ],
            "include-tab-diffs": [
                "*.csv",
                "*.tsv"
            ],
            "include-platform": true
        }
    }
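
To make the include/exclude patterns in the ``track`` section concrete, here is a minimal Python sketch of how such patterns could select files. It is an illustration only: ``is_tracked`` is a hypothetical helper, not part of dgit, and the matching rules (shell-style globs via ``fnmatch``, excludes taking precedence, excludes matching any path component) are assumptions rather than dgit's documented behavior.

```python
import fnmatch
import json


def is_tracked(path, track):
    """Hypothetical helper (not part of dgit): decide whether a relative,
    '/'-separated path would be tracked under a dgit.json-style "track"
    section. Assumes shell-style glob patterns and that excludes win."""
    parts = path.split("/")
    # Assumption: an exclude pattern applies to the whole path or to any
    # single component, so ".git" also excludes ".git/config".
    for pattern in track.get("excludes", []):
        if fnmatch.fnmatchcase(path, pattern):
            return False
        if any(fnmatch.fnmatchcase(part, pattern) for part in parts):
            return False
    # Assumption: include patterns are matched against the file name only.
    name = parts[-1]
    return any(
        fnmatch.fnmatchcase(name, pattern)
        for pattern in track.get("includes", [])
    )


# A reduced version of the "track" section from the example dgit.json.
config = json.loads("""
{
    "track": {
        "includes": ["*.csv", "*.tsv", "*.txt"],
        "excludes": [".git", ".svn", "dgit.json"]
    }
}
""")

candidates = ["data/model.csv", ".git/config", "dgit.json", "notes.txt", "plot.png"]
tracked = [p for p in candidates if is_tracked(p, config["track"])]
print(tracked)
```

Under these assumptions, only ``data/model.csv`` and ``notes.txt`` are selected: ``.git/config`` and ``dgit.json`` are excluded, and ``plot.png`` matches no include pattern.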