
Cyral Data Classifier and Analysis Utilities

Cyral policies allow you to provide the right level of protection for every table, collection, or other data location in your repositories. The first step in creating policies is understanding where your sensitive data lives. To help you assess this, Cyral provides the following utilities:

  • Schema dump extracts a YAML representation of the schemas in a repository.

  • Schema diff compares one schema dump output file to past output files for that database, allowing you to track schema changes.

  • Data classifier scans databases for potentially sensitive data locations (for example table and column names), allowing you to protect those locations with your policies.

When to use the classifier and utilities

Here are some common things you can do with the classifier and schema utilities:

  • Identify tables and columns that match a given pattern and generate a corresponding data map that can be used to instruct Cyral to monitor and protect those locations

  • Discover the schema of a database

  • Commit the discovered schema to GitHub

  • Compare the discovered schema to a previously committed schema on GitHub

  • Perform data classification against a given database by sampling data and looking for predefined patterns


Summary: What does each utility do?

Schema Dump

Connects to the database server and exports the schema of the database. This will generate a YAML representation of the databases discovered via the INFORMATION_SCHEMA views in the database. An example output is below, showing the output for a schema with two tables, brands and products.


production:
  brands:
    brand_id: int, Primary Key
    brand_name: varchar
  products:
    brand_id: int, Foreign Key
    category_id: int, Foreign Key
    list_price: decimal
    model_year: smallint
    product_id: int, Primary Key
    product_name: varchar

By default, the script attempts to commit this YAML to a GitHub repo but can be configured not to integrate with GitHub and simply dump the output to stdout.
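To illustrate the shape of the dump, the sketch below groups flat column metadata rows into the nested schema, table, column layout shown in the example. The rows are hard-coded stand-ins for what a query against the INFORMATION_SCHEMA views might return; the names ROWS, build_schema_tree, and render are hypothetical and are not part of the Cyral utility.

```python
from collections import defaultdict

# Hypothetical sample rows: (schema, table, column, data_type, key_label).
ROWS = [
    ("production", "brands", "brand_id", "int", "Primary Key"),
    ("production", "brands", "brand_name", "varchar", None),
    ("production", "products", "product_id", "int", "Primary Key"),
    ("production", "products", "brand_id", "int", "Foreign Key"),
    ("production", "products", "list_price", "decimal", None),
]


def build_schema_tree(rows):
    """Group flat column rows into {schema: {table: {column: description}}}."""
    tree = defaultdict(lambda: defaultdict(dict))
    for schema, table, column, data_type, key_label in rows:
        description = data_type if key_label is None else f"{data_type}, {key_label}"
        tree[schema][table][column] = description
    return {schema: dict(tables) for schema, tables in tree.items()}


def render(tree, indent=0):
    """Render the nested dict as indented 'key: value' lines, sorted by key."""
    lines = []
    for key in sorted(tree):
        value = tree[key]
        if isinstance(value, dict):
            lines.append(" " * indent + f"{key}:")
            lines.extend(render(value, indent + 2))
        else:
            lines.append(" " * indent + f"{key}: {value}")
    return lines


if __name__ == "__main__":
    print("\n".join(render(build_schema_tree(ROWS))))
```

Sorting the keys keeps the output stable between runs, which matters when the result is committed and later compared against an earlier dump.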

Schema Diff

Integrates with GitHub to compare the schema dump to previous dumps and determine whether anything has changed. The utility uses the DeepDiff Python library to compare the YAML stored in GitHub against the newly generated YAML.
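The idea behind the comparison can be sketched with the standard library alone. The function below is not DeepDiff and not the utility's code; it is a simplified stand-in that walks two nested dictionaries (as they would look after the YAML is parsed) and reports added, removed, and changed entries. The name diff_schemas and the sample data are hypothetical.

```python
def diff_schemas(old, new, path=""):
    """Return a list of readable differences between two nested dicts, in key order."""
    changes = []
    for key in sorted(set(old) | set(new)):
        location = f"{path}.{key}" if path else key
        if key not in new:
            changes.append(f"removed: {location}")
        elif key not in old:
            changes.append(f"added: {location}")
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            # Both sides are nested mappings, so compare them recursively.
            changes.extend(diff_schemas(old[key], new[key], location))
        elif old[key] != new[key]:
            changes.append(f"changed: {location} ({old[key]!r} -> {new[key]!r})")
    return changes


if __name__ == "__main__":
    previous = {"production": {"brands": {"brand_id": "int, Primary Key",
                                          "brand_name": "varchar"}}}
    current = {"production": {"brands": {"brand_id": "bigint, Primary Key",
                                         "brand_name": "varchar",
                                         "country": "varchar"}}}
    for line in diff_schemas(previous, current):
        print(line)
```

An empty result means the two dumps describe the same schema.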

Data Classifier

The Cyral data classifier helps find data locations that might contain sensitive data. The classifier samples records and checks for the following pattern types:

  • Email Addresses (EMAIL)

  • US-Based Phone Numbers (PHONE)

  • Social Security Numbers (SSN)

  • Credit Card Numbers (CCN)

  • IP Addresses (IP_ADDRESS)

  • Columns that contain an age (AGE)

  • Postal addresses (ADDRESS)

The Cyral classifier utility connects to the specified database, samples rows, and checks whether the retrieved data values match a set of regular expressions (regexes) for the data types listed above. Matches are added to a Cyral-formatted data map that you can use to apply policies to the sensitive data locations.
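As a rough illustration of regex-based classification, the sketch below labels a column when every sampled value matches one pattern. The patterns are deliberately simplified and are assumptions made for this example, not the rules the Cyral classifier ships with. Requiring all samples to match is also a choice made for the sketch, and the function names are hypothetical.

```python
import re

# Simplified, illustrative patterns; the classifier's real rules are not shown here.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "IP_ADDRESS": re.compile(r"(?:\d{1,3}\.){3}\d{1,3}"),
}


def classify_column(values):
    """Return the label whose pattern fully matches every non-empty sample, else None."""
    samples = [str(v) for v in values if v not in (None, "")]
    if not samples:
        return None
    for label, pattern in PATTERNS.items():
        if all(pattern.fullmatch(s) for s in samples):
            return label
    return None


def classify_table(rows):
    """Map column name -> label for sampled rows given as dicts keyed by column name."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    labels = {name: classify_column(vals) for name, vals in columns.items()}
    # Keep only the columns that received a label.
    return {name: label for name, label in labels.items() if label}


if __name__ == "__main__":
    sampled = [
        {"id": 1, "contact": "ana@example.com", "tax_id": "123-45-6789"},
        {"id": 2, "contact": "raj@example.org", "tax_id": "987-65-4321"},
    ]
    print(classify_table(sampled))
```

The resulting mapping of column names to labels is the kind of information a data map entry would be built from.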

Deploy the Cyral Data Classifier container

Prerequisites

  • GitHub integration: The Offline Data Classifier assumes that the Cyral control plane has been configured with a GitHub integration and that the integration’s data map directory is called "datamaps" in the repo.

  • GitHub API credentials: To let the classifier commit discovered data to GitHub, create an API key that has the repo and notifications permissions on the same repo that is used for the GitHub integration above.

  • Cyral control plane service account: A service account needs to be created in your Cyral control plane with the following roles assigned to it: "Modify Integrations" and "Modify Policies".

Deploy the Classifier container

  1. Get the offline-data-classifier Dockerfile from your Cyral support person.

  2. Create or find your private Amazon ECR repository. We recommend giving the repository a name like offline-data-classifier.

  3. Build the Classifier Docker image and deploy it to your ECR repository.

Run the Cyral Data Classifier container

  1. Save an AWS Secret that contains the Cyral Classifier's JSON-formatted configuration. Name the secret in the form cyral-offline-data-classifier/<your_chosen_identifier>. The configuration must contain these blocks:

    1. github: The URL of the GitHub repo that will hold your Cyral data map

    2. cyral: The URL of your Cyral control plane, and the username and password of the Cyral service account the classifier will use to log in. 

    3. database: The database login and configuration information for the database that the classifier will analyze. Make sure you've created this account in your database service with sufficient permissions.

For a complete list of parameters, see "Classifier JSON Configuration", later in this document. 

For an example, see below:


  {
    "github": {
      "repo_url": "https://www.github.com/<your_org_name>/<your_repo_name>",
      "api_key": "ghp_abc123xyz789"
    },
    "cyral": {
      "api_url": "https://somecustomer.cyral.com:8000/v1",
      "client_id": "<Cyral CP Service Account Client ID>",
      "client_secret": "<Cyral CP Service Account Secret Key>"
    },
    "database": {
      "type": "bigquery",
      "project": "my_project",
      "service_account": {
        "type": "service_account",
        "project_id": "my_project",
        "private_key_id": "2c25f4750a511564eaa48096a32109",
        "private_key": "-----BEGIN PRIVATE KEY-----\ndFP3y...",
        "client_email": "mydbuser@example.com",
        "client_id": "819657379062294586",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/bigquery/certs"
      },
      "repo_name": "my_repo"
    }
  }


In the Secret details window of your AWS console, find the Secret ARN of the AWS Secret you just saved. You'll pass it as CONNECTION_SECRETS_ARN in the next step.

  2. Run the Classifier container using the docker run command. Pass the needed environment variables with -e options, as described in the section "Classifier environment variables", later in this document. For example:

docker run -it \
-e CONNECTION_SECRETS_ARN='your_secrets_arn' \
-e CLASSIFIER_SAMPLE_OFFSET_LIMIT=1 \
-e GITHUB_AUTOMERGE_PULL_REQUESTS=True \
-e LOGGING_LEVEL=DEBUG \
-e DISABLE_TAG_SUPPORT=1 \
offline-data-classifier /bin/bash

  3. The container runs the environment and helper utilities, like the OPA runtime, that enable the Classifier to run.

Run the Cyral Classifier and/or Cyral Schema Dump on a database

  1. Get your AWS credentials. You can find these in your AWS Console by finding your user account, expanding your user record and clicking the Command line or programmatic access button. You can choose Option 1: Set AWS environment variables to download the credentials.

  2. Open a command prompt in the Classifier container environment, and paste the credentials environment settings to apply the credentials.

export AWS_ACCESS_KEY_ID="<your access key ID>"
export AWS_SECRET_ACCESS_KEY="<your access key>"
export AWS_SESSION_TOKEN="<your session token>"

  3. Run the classifier to analyze a database. Use the Classifier startup script, startup.sh, to do this. You'll find this script in the root folder of the container:

./startup.sh

The script lists the discovered schemas and the attributes that the Classifier identified as likely to contain sensitive data. The script also commits this information to your datamaps folder in GitHub, saving it in a file called datamap.yaml.

  4. If you wish to retrieve the database schema, run the Cyral Schema Dump utility as well:

python opt/app/run_explorer.py -ps -i



Optional: Deploy the Classifier as a Lambda function

AWS Lambda: 

  1. Create Lambda Function

    1. Create the function from the Classifier container image

    2. Provide an identifiable name for the function

    3. Click the Browse images button, and then locate the ECR that you created earlier in this procedure. Choose the image version that you just loaded into that ECR.

  2. Set up the Lambda function:

    1. Edit the execution role and create an inline policy that allows GetSecretValue access to cyral-offline-data-classifier/* secrets

    2. Configure the required environment variable, CONNECTION_SECRETS_ARN. Set it to the ARN of the cyral-offline-data-classifier/<name> created above

    3. Optionally, configure the suggested environment variables:

      1. GITHUB_AUTOMERGE_PULL_REQUESTS : True - So that PRs are created and automatically merged

      2. LOGGING_LEVEL : DEBUG - Setting this to DEBUG shows details in the CloudWatch logs that are helpful in testing. In production, this should not be set.

    4. Adjust the timeout: The default timeout for Lambda functions is 3 seconds. We suggest starting with a timeout of 2 minutes and adjusting from there.
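The inline policy for the execution role can be sketched as a policy document like the one built below. This is a minimal example under stated assumptions: the region and account portions of the resource ARN are wildcarded for brevity, and the function name secrets_read_policy is hypothetical. Narrow the resource to your own region and account before using anything like this.

```python
import json


def secrets_read_policy(secret_prefix="cyral-offline-data-classifier/"):
    """Build an IAM policy document granting read access to secrets under a name prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                # Region and account are wildcarded here; narrow them for real use.
                "Resource": f"arn:aws:secretsmanager:*:*:secret:{secret_prefix}*",
            }
        ],
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the inline policy editor.
    print(json.dumps(secrets_read_policy(), indent=2))
```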

AWS Lambda - Scheduling Execution

Since this is a Lambda function, any supported trigger can be used to execute the function. If you don't have an existing event or trigger to use, the instructions below explain how to use AWS CloudWatch to schedule execution.

  1. Open CloudWatch

  2. Go to Events → Rules

  3. Click the Create rule button

  4. Select the Schedule radio button

  5. Configure the frequency, using either a fixed rate or a cron expression

  6. In the Targets pane, click the Add target button

  7. From the Function dropdown, select the Lambda Function created in the previous section

  8. Click the Configure details button.

  9. On the resulting page, provide a Name for the rule and, optionally, a description. Click the Create rule button.
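For reference, the two schedule styles in step 5 correspond to the rate(...) and cron(...) schedule expression syntax that AWS uses for scheduled rules. The helpers below are a hypothetical sketch that formats such strings; in the console you enter the rate or the cron fields directly, so this is mainly useful if you later script the rule.

```python
def rate_expression(value, unit):
    """Format a fixed-rate schedule expression such as 'rate(1 day)'."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # The unit is singular for a value of 1 and plural otherwise.
    return f"rate({value} {unit if value == 1 else unit + 's'})"


def daily_cron_expression(hour_utc, minute=0):
    """Format a cron schedule expression that fires once a day at the given UTC time."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Fields: minutes, hours, day-of-month, month, day-of-week, year.
    return f"cron({minute} {hour_utc} * * ? *)"


if __name__ == "__main__":
    print(rate_expression(12, "hour"))
    print(daily_cron_expression(2, 30))
```

A daily or twice-daily schedule is usually a reasonable starting point for a classifier that samples a small number of rows per run.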


Configure the Cyral Classifier

The classifier makes use of two sets of configurations:

  • Classifier JSON configuration: The JSON configuration is used to configure the classifier for interaction with the Cyral control plane, GitHub, and the target database/filesystem.

  • Classifier environment variables: Environment variables can be supplied to the classifier in order to change some of the default behavior during classifier execution.

Classifier JSON Configuration

Below we explain the settings you need to make in the Cyral classifier's JSON configuration file:

GitHub-related parameters

  • git_api_key (required): The GitHub API Key with the needed permissions. (See GitHub Integration, above.) For example: ghp_abc123xyz789

  • git_repo_name (required): Name of the GitHub repo to be used by the script. For example: databaseSchemas

  • git_repo_owner (required): Name of the owner of the GitHub repo. For example, cyralinc

  • git_api_url: The URL of the GitHub API where your data map repo can be managed. Defaults to https://api.github.com

  • git_primary_branch: The name of the GitHub branch that serves as the production policy branch or primary source of truth. Defaults to main

Cyral integration parameters

  • cyral.api_url (required): The URL of your Cyral control plane API. For example, https://example.cyral.com:8000/v1 

  • cyral.client_id (required): Cyral control plane service account client ID. For example, "sa/sfff/sfadf"

  • cyral.client_secret (required): Cyral control plane service account secret key. For example, "asdfgs_sd3463d"

Database parameters

  • database.type (required): The name of a supported database type. For example, "postgres"

  • database.repo_name (required): The name of a repository within the Cyral control plane that corresponds to the database being classified. For example, "postgres-prod"

Vendor Specific Attributes

The database block of the JSON expects different values depending on the database type that the classifier runs against. Check with Cyral support for the settings needed for your database type.
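Before saving the configuration, it can help to check that the required Cyral and database parameters listed above are present. The sketch below is a hypothetical helper, not part of the classifier; it treats the dotted names as paths into the parsed JSON. GitHub parameters can be added to the list in the same way, using the key names your configuration uses.

```python
import json

# Required dotted keys, taken from the parameter lists above.
REQUIRED_KEYS = [
    "cyral.api_url",
    "cyral.client_id",
    "cyral.client_secret",
    "database.type",
    "database.repo_name",
]


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the dotted keys that are absent or empty in a parsed configuration dict."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            # Walk one level down; stop as soon as a level is missing.
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if node in (None, ""):
            missing.append(dotted)
    return missing


if __name__ == "__main__":
    sample = json.loads("""
    {
      "cyral": {"api_url": "https://example.cyral.com:8000/v1", "client_id": "sa/sfff/sfadf"},
      "database": {"type": "postgres", "repo_name": "postgres-prod"}
    }
    """)
    print(missing_keys(sample))  # prints ['cyral.client_secret']
```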

Classifier environment variables

The Cyral classifier draws its configuration from the following environment variables:

  • CONNECTION_SECRETS_ARN (required): The ARN of the AWS Secret that contains the JSON configuration described above. For example, "arn:aws:secretsmanager:us-east-1:<account Id>:secret:<secret Identifier>"

  • GITHUB_AUTOMERGE_PULL_REQUESTS: Tells the classifier whether it should automatically merge the resulting data map PR in GitHub. Defaults to False.

  • GITHUB_DATAMAP_PATH: This should be the datamap directory specified when setting up the Cyral GitHub integration. Defaults to "datamaps/"

  • CLASSIFIER_SAMPLE_SIZE: Tells the classifier how many rows to sample from each table for classifying data. Defaults to 5.

  • LOGGING_LEVEL: Sets the logging level within the offline classifier. Defaults to "INFO".

  • CLASSIFIER_SAMPLE_OFFSET_LIMIT: In order to ensure that the classifier does not execute against the same samples from the database with each execution, the classifier uses an OFFSET for its sample queries. This offset value is randomly generated from 0 - 1000 by default. This variable changes the max range of available random numbers for the resulting offset value. Defaults to 1000.

  • DISABLE_TAG_SUPPORT: In order to maintain compatibility with data map and policy features within the Cyral platform, the support of tags can be enabled/disabled depending upon whether your Cyral control plane supports the use of tags. Defaults to False.

  • CLASSIFIER_FILE_PATH: If the classifier is not configured to access the Cyral control plane, this is the local path where the classifier Rego is located. For example, "/opt/classifier/"

  • CLASSIFIER_FILE_VERSION: The version of the Rego classifier. For example, "v1.0.4"

  • LOCAL_JSON_CONFIG: If AWS Secrets Manager is disabled, then this should be the local path to the Configuration JSON file. For example, "/etc/offline_classifier_config.json"

  • AWS_USE_SECRETS_MANAGER: Controls whether the classifier should look for the Configuration JSON in AWS or locally. Defaults to True.
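The defaults above, and the random sampling offset in particular, can be sketched as follows. This is an illustration only: how the classifier parses boolean values, and the exact query it issues, are assumptions here, and LIMIT/OFFSET syntax varies between database types. The function names are hypothetical.

```python
import os
import random


def parse_bool(text):
    """Interpret common truthy strings ('1', 'true', 'yes') as True."""
    return str(text).strip().lower() in ("1", "true", "yes")


def load_settings(env=os.environ):
    """Read classifier settings from an environment mapping, applying the documented defaults."""
    return {
        "automerge": parse_bool(env.get("GITHUB_AUTOMERGE_PULL_REQUESTS", "False")),
        "datamap_path": env.get("GITHUB_DATAMAP_PATH", "datamaps/"),
        "sample_size": int(env.get("CLASSIFIER_SAMPLE_SIZE", "5")),
        "offset_limit": int(env.get("CLASSIFIER_SAMPLE_OFFSET_LIMIT", "1000")),
        "logging_level": env.get("LOGGING_LEVEL", "INFO"),
        "use_secrets_manager": parse_bool(env.get("AWS_USE_SECRETS_MANAGER", "True")),
    }


def sample_query(table, settings, rng=random):
    """Build a sampling query with a random OFFSET between 0 and the configured limit."""
    offset = rng.randint(0, settings["offset_limit"])
    # Table names cannot be bound as parameters; validate them before use in real code.
    return f"SELECT * FROM {table} LIMIT {settings['sample_size']} OFFSET {offset}"


if __name__ == "__main__":
    settings = load_settings()
    print(sample_query("brands", settings))
```

Setting CLASSIFIER_SAMPLE_OFFSET_LIMIT to a small value keeps the offset within the row count of small tables, where a large offset would return no rows.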


Deploy the Cyral classifier on AWS

Get the offline-data-classifier Docker image from your Cyral support person.

  1. Create a private ECR repository called offline-data-classifier

  2. Create an AWS Secret (Classifier JSON Configuration). Name it using the prefix cyral-offline-data-classifier/, followed by a useful identifier of your choice. Use the plaintext example below as a reference when creating the secret:

  {
    "github": {
      "repo_url": "https://www.github.com/<your_org_name>/<your_repo_name>",
      "api_key": "ghp_abc123xyz789"
    },
    "cyral": {
      "api_url": "https://example.cyral.com:8000/v1",
      "client_id": "<Cyral CP Service Account Client ID>",
      "client_secret": "<Cyral CP Service Account Secret Key>"
    },
    "database": {
      "type": "snowflake",
      "account": "<your account identifier>",
      "database": "<DATABASE>",
      "username": "<USERNAME>",
      "password": "<PASSWORD>",
      "warehouse": "<WAREHOUSE>",
      "repo_name": "<Name of repo in Cyral CP that this configuration should map to>"
    }
  }


  3. Create the Lambda function:

    1. Create the function from Container image

    2. Provide an identifiable name for the function

    3. Click the Browse images button, and then locate the ECR that you created earlier in this procedure. Choose the image version that you just loaded into that ECR.

  4. Set up the Lambda function:

    1. Edit the execution role and create an inline policy that allows GetSecretValue access to cyral-offline-data-classifier/* secrets

    2. Configure the required environment variable, CONNECTION_SECRETS_ARN. Set it to the ARN of the cyral-offline-data-classifier/<name> created above

    3. Optionally, configure the suggested environment variables:

      1. GITHUB_AUTOMERGE_PULL_REQUESTS : True - So that PRs are created and automatically merged

      2. LOGGING_LEVEL : DEBUG - Setting this to DEBUG shows details in the CloudWatch logs that are helpful in testing. In production, this should not be set.

    4. Adjust the timeout: The default timeout for Lambda functions is 3 seconds. We suggest an initial timeout of 2 minutes.

AWS Lambda - Scheduling Execution

Since this is a Lambda function, any supported trigger can be used to execute the function. If you don't have an existing event or trigger to use, the instructions below explain how to use AWS CloudWatch to schedule execution.

  1. Open CloudWatch

  2. Go to Events → Rules

  3. Click the Create rule button

  4. Select the Schedule radio button

  5. Configure the frequency, using either a fixed rate or a cron expression

  6. In the Targets pane, click the Add target button

  7. From the Function dropdown, select the Lambda function you created in the previous section

  8. Click the Configure details button.

  9. On the resulting page, provide a Name for the rule and, optionally, a description. Click the Create rule button.
