Using the Cyral data classifier

Automatic data classification

In environments where tables can be created dynamically, determining which attributes of which tables are sensitive is often a challenge. Cyral's data classification feature helps flag sensitive data by classifying the data your users access in queries. This allows your team to detect and locate sensitive data, wherever it resides in your repositories.


With this feature enabled, Cyral inspects frequently accessed data values, checks whether these values match the regular expressions (regexes) you've configured, and flags those matches in the Cyral analysis logs, noting the database, table, and column where Cyral observed each match, and the classification of that match.


For example, if you want to be aware of tables and columns containing credit card numbers, you will specify a regex that matches most credit card numbers. Once you enable Cyral data classification with this regex, the Cyral analysis logs will list the table and column names that were observed to hold likely credit card numbers. These tables and columns can then be marked as sensitive, so that Cyral monitors them, and you can apply policies to them to ensure access control and monitoring are enforced.


How it works

The data classification mechanism samples queries and examines the first row of data returned by each sampled query. Cyral compares the contents of each result column with the regexes you've configured.


Limitations

  • Classification works only on data stored in PostgreSQL repositories.

  • Classification evaluates only string-based results and ignores all numerical results. For example, a telephone number encoded as an 11-digit number will not be picked up by the classification mechanism.
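
The behavior described above can be sketched in a few lines of Python. This is an illustration only, not Cyral's implementation: the function name classify_first_row and the sample row are hypothetical, and Python's re module stands in for RE2 (it spells the end-of-text anchor \Z where RE2 uses \z). The two patterns are the custom regexes used later in this article.

```python
import re

# Stand-ins for the customRegexes used later in this article.
# Assumption: Python's `re` is not RE2; \Z here plays the role of RE2's \z.
PATTERNS = {
    "music_notes": re.compile(r"\A *[a-g]+ *\Z"),
    "uk_mobile_phone": re.compile(r"\A *((\+44( )?7)|(07))\d{3}( )?\d{6} *\Z"),
}


def classify_first_row(columns, first_row):
    """Return {column: [tags]} for string values in the first row that match."""
    found = {}
    for column, value in zip(columns, first_row):
        if not isinstance(value, str):
            continue  # numeric (and other non-string) results are ignored
        tags = [tag for tag, pattern in PATTERNS.items() if pattern.search(value)]
        if tags:
            found[column] = tags
    return found


# Hypothetical first row of a sampled query's result set.
columns = ["id", "mobile", "mobile_as_number", "riff"]
first_row = [1, "07123 456789", 447123456789, "cabbage"]
result = classify_first_row(columns, first_row)
print(result)
```

Note that the phone number stored as an integer is skipped, while the same kind of value stored as a string is tagged.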



Configure your classifier

By default, the data classification feature is disabled. To enable it, use the Cyral API to set your classificationConfig rules as shown below.


Get your JWT token

Your JWT token is your authentication credential for the Cyral API, and it indicates to the API what actions you're allowed to perform. Follow these steps to get the token:


  1. Log in to the Cyral control plane UI. Make sure you log in as a Cyral user with a role that has rights to both View Sidecars and Repositories and Modify Sidecars and Repositories. See Manage Cyral roles for details.

  2. How you get the token depends on which Cyral version you're using:

    1. In Cyral version 2.15 and later, click on your user avatar in the upper right corner, choose Profile, scroll down to API Access, and copy the JWT from there.

    2. In Cyral version 2.14 or earlier, open the UI in your browser and use your browser's developer tools to find the token. For example, in Chrome, choose View > Developer > Developer Tools.

  3. Store the JWT in an environment variable. In this example, we'll name the variable current_jwt. You can set this in your shell by typing:


export current_jwt="(place your long JWT string here)"


When you connect to the API, you'll supply the JWT token in the header of the API call, in the Authorization:Bearer field. You must supply this token as your credential for each API call. For example: 


curl https://mycyral.cyral.com:8000/v1/ping -H "Authorization:Bearer $current_jwt"
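
If you script your API calls rather than typing curl commands, the same header can be built from the environment variable. The following is a minimal sketch using only the Python standard library; build_request is a hypothetical helper name, and the host is the example host used in this article. It constructs the request without sending it.

```python
import os
import urllib.request


def build_request(path, base_url="https://mycyral.cyral.com:8000"):
    """Build (but do not send) a request that carries the JWT as a bearer token."""
    # Strip stray whitespace: a trailing space or newline would corrupt the credential.
    token = os.environ.get("current_jwt", "").strip()
    if not token:
        raise RuntimeError("current_jwt is not set")
    return urllib.request.Request(
        base_url + path,
        headers={"Authorization": "Bearer " + token},
    )


# Usage (requires network access and a valid token):
# with urllib.request.urlopen(build_request("/v1/ping")) as response:
#     print(response.status)
```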


Get the repo's unique identifier

Run a GET on the repos API endpoint to find the repoID of the repository that you plan to monitor with data classification. Here we pipe our results to the jq JSON processor.


curl https://mycyral.cyral.com:8000/v1/repos -H "Authorization:Bearer $current_jwt" | jq



The output from the above API call will list your repos and their repoIDs. Get the repoID of the repo to which you'll apply the classifier.


Once you have the repo's repoID, you can retrieve its repo config, which contains the settings directing how Cyral handles the repository. Run this GET by typing the following, replacing {my-repo-id} with your repoID:


curl https://mycyral.cyral.com:8000/v1/repos/{my-repo-id}/conf/analysis -H "Authorization:Bearer $current_jwt" 



Update the repo's config to add your classifier preferences

Run a PUT on the analysis API endpoint to add classifier preferences (classificationConfig) to your repo config. You can copy the example below and modify it to suit your needs. For readability, we've split the command across several lines. The backslashes that end the first three lines are shell line continuations; the JSON body inside the single quotes can span lines as written, so it needs none.

curl https://mycyral.cyral.com:8000/v1/repos/{my-repo-id}/conf/analysis \
  -X PUT -H "Authorization:Bearer $current_jwt" \
  -H "content-type:application/json" \
  -d '{"redact":"none","tagSensitiveData":false,"ignoreIdentifierCase":false,
       "analyzeWhereClause":false,"loggerConfig":null,"alertOnViolation":true,
       "disablePreConfiguredAlerts":false,"blockOnViolation":true,
       "disableFilterAnalysis":false,"rewriteOnViolation":false,
       "classificationConfig":{"samplingPeriod":1000,
         "forceSingleClassification":false,
         "customRegexes":[
            {"tag":"music_notes",
             "regex":"\\A *[a-g]+ *\\z"},
            {"tag":"uk_mobile_phone",
             "regex":"\\A *((\\+44( )?7)|(07))\\d{3}( )?\\d{6} *\\z"}],
         "builtins":[
            {"type":"cyral_us_phone","stripWhitespace":false,"matchAnywhere":false},
            {"type":"cyral_email"},
            {"type":"cyral_ssn"}]},
       "logGroups":["everything"]}'



In the above example, we set the following in our classificationConfig:

  • A sampling period of 1000;

  • A custom regex called "music_notes" that finds letters that happen to denote musical notes;

  • A custom regex called "uk_mobile_phone" that finds a subset of U.K. cell phone numbers; and

  • Activation of the built-in regexes cyral_us_phone, cyral_email, and cyral_ssn.


See "Data classifier parameters," below, to understand the classifier settings.
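
As an alternative to hand-editing a long JSON string, you could assemble the request body programmatically. The sketch below is one way to do that with the Python standard library. It assumes, as the example PUT suggests, that the PUT replaces the whole analysis config, so it starts from the settings you retrieved earlier and adds classificationConfig to them. The names classification_config and build_put are hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

# The classifier settings from the example above, written as Python data.
# Raw strings hold each regex exactly as RE2 should see it.
classification_config = {
    "samplingPeriod": 1000,
    "forceSingleClassification": False,
    "customRegexes": [
        {"tag": "music_notes", "regex": r"\A *[a-g]+ *\z"},
        {"tag": "uk_mobile_phone",
         "regex": r"\A *((\+44( )?7)|(07))\d{3}( )?\d{6} *\z"},
    ],
    "builtins": [
        {"type": "cyral_us_phone", "stripWhitespace": False, "matchAnywhere": False},
        {"type": "cyral_email"},
        {"type": "cyral_ssn"},
    ],
}


def build_put(repo_id, current_conf, jwt, base_url="https://mycyral.cyral.com:8000"):
    """Return an unsent PUT request whose body is current_conf plus the classifier."""
    body = dict(current_conf, classificationConfig=classification_config)
    return urllib.request.Request(
        f"{base_url}/v1/repos/{repo_id}/conf/analysis",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Authorization": "Bearer " + jwt,
                 "Content-Type": "application/json"},
    )


# Illustrative values only; current_conf would normally be the result of the GET.
request = build_put("my-repo-id",
                    {"redact": "none", "logGroups": ["everything"]},
                    "example-token")
print(request.get_method(), request.full_url)
```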


After you set your classification config, it's a good idea to check that the settings were applied. To do this, run a GET on the same API endpoint:


curl https://mycyral.cyral.com:8000/v1/repos/{my-repo-id}/conf/analysis -H "Authorization:Bearer $current_jwt"


The results returned should look just like what you set in the preceding PUT.
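
Comparing two JSON documents by eye is error-prone, since key order and whitespace may differ. A small script can compare the parsed values instead. This is a sketch under the assumption that both the body you sent and the body returned are available as JSON text; config_mismatches is a hypothetical helper name, and the sample strings are trimmed-down illustrations rather than real API output.

```python
import json


def config_mismatches(sent_json, returned_json):
    """Return the classificationConfig keys whose values differ between the two."""
    sent = json.loads(sent_json).get("classificationConfig") or {}
    returned = json.loads(returned_json).get("classificationConfig") or {}
    return sorted(key for key in sent.keys() | returned.keys()
                  if sent.get(key) != returned.get(key))


sent = '{"classificationConfig": {"samplingPeriod": 1000, "builtins": [{"type": "cyral_ssn"}]}}'
same = '{"classificationConfig":{"builtins":[{"type":"cyral_ssn"}],"samplingPeriod":1000}}'
stale = '{"classificationConfig": {"samplingPeriod": 500, "builtins": [{"type": "cyral_ssn"}]}}'

print(config_mismatches(sent, same))   # no differences despite different key order
print(config_mismatches(sent, stale))  # samplingPeriod differs
```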


Data classifier parameters

Below, we explain the parameters in the classificationConfig that govern the matching process.


Sampling period

In samplingPeriod, specify the sampling frequency as an integer value that indicates how many queries we will ignore before we examine the data contents of a query.
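
To get a feel for how sparse the sampling is, the sketch below models one possible counting rule: one query out of every samplingPeriod is examined. That rule is an assumption chosen to agree with the testing advice later in this article (a value of 1 makes every query eligible); the sidecar's actual counter may behave differently. The function name sampled_query_numbers is hypothetical.

```python
def sampled_query_numbers(total_queries, sampling_period):
    """List the 1-based query numbers examined under the assumed rule."""
    if sampling_period < 1:
        raise ValueError("sampling_period must be a positive integer")
    # Assumed rule: every sampling_period-th query is examined.
    return [n for n in range(1, total_queries + 1) if n % sampling_period == 0]


print(sampled_query_numbers(5000, 1000))
print(sampled_query_numbers(3, 1))
```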


Force a single classification

Some strings will fit multiple formats. For example, a bank EIN (employer identification number; the U.S. social security number has this format) might be mistaken for a telephone number. If it's important to you to identify each value as only a single type, then list those potential types in priority order in the forceSingleClassification parameter. Otherwise, Cyral will treat each value as all of the types that it matches.


Built-in regex patterns

To use pre-built regex patterns, create a builtins section and add one block per pattern, each with a type field that names the built-in, as in the example above. The available built-ins are:

  • cyral_us_phone (matches US and Canadian telephone numbers), 

  • cyral_email (matches email addresses), and 

  • cyral_ssn (matches US social security numbers).


When you add a built-in regex to your classifier configuration, you have the option of adding two other flags, described here:


  • Match anywhere: To flag a match only when the whole string matches the regex, set matchAnywhere to false. To flag entries that contain a matching string even when the attribute holds other text, set matchAnywhere to true. For faster performance, set matchAnywhere to false. Note: The matchAnywhere option is available only in built-in regexes. In a custom regex, to get the behavior of matchAnywhere set to false, make sure that your regex includes \\A at the beginning and \\z at the end (as done for the two custom regexes in the sample).

  • Strip whitespace: To strip any leading and trailing whitespace from each value before searching for a match, set stripWhitespace to true. This is relevant only when matchAnywhere is set to false, because a match found anywhere in the value is unaffected by surrounding whitespace. Note: This option is available only in built-in regexes. In a custom regex, to get the behavior of stripWhitespace set to true, write your regular expression so that it tolerates leading and trailing whitespace (as the " *" at each end of the two custom regexes in the sample does).
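
The interaction of the two flags can be illustrated with a short Python sketch. The pattern below is a hypothetical SSN-like regex written for this example, not the regex behind Cyral's cyral_ssn built-in, and the function only mimics the semantics described above.

```python
import re

# Hypothetical SSN-like pattern for illustration; not Cyral's built-in regex.
SSN_LIKE = r"\d{3}-\d{2}-\d{4}"


def matches(value, match_anywhere=False, strip_whitespace=False):
    """Mimic the described matchAnywhere / stripWhitespace semantics."""
    if strip_whitespace:
        value = value.strip()
    if match_anywhere:
        return re.search(SSN_LIKE, value) is not None  # pattern may sit inside other text
    return re.fullmatch(SSN_LIKE, value) is not None   # the whole value must match


print(matches("555-12-1212"))                             # whole value matches
print(matches(" 555-12-1212 "))                           # padding blocks a whole-value match
print(matches(" 555-12-1212 ", strip_whitespace=True))    # padding removed first
print(matches("ssn: 555-12-1212", match_anywhere=True))   # found inside other text
```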

Custom regex patterns

To add your own regular expression, add a customRegexes block and add each custom regex there with a tag block containing its name and a regex block containing the regular expression. 


When specifying a custom regular expression, be aware that it uses RE2 syntax (see https://github.com/google/re2/wiki/Syntax) rather than Perl or Java regular expression syntax. Because each regex is stored as a JSON string, you must escape every backslash by doubling it; for example, \d is written as \\d.
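
If you are unsure how a regex should look once escaped, a JSON serializer can do the escaping for you. The following sketch uses Python's json module to show the escaped form of the music_notes regex from the example configuration and confirms that parsing the JSON gives back the original pattern.

```python
import json

# The regex as RE2 should see it, held in a raw string so Python leaves it untouched.
regex = r"\A *[a-g]+ *\z"

# json.dumps doubles each backslash, producing the form to place in the request body.
entry = json.dumps({"tag": "music_notes", "regex": regex})
print(entry)

# Parsing the JSON text recovers the original, single-backslash pattern.
round_tripped = json.loads(entry)["regex"]
print(round_tripped == regex)
```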


Test your classifier

Run queries that retrieve data matching your configured regexes, then check the Cyral analysis logs to verify that the sensitive data was flagged properly. When you update your classifier configuration, watch the logged output for this process; if one of your custom regexes contains an error, an error is reported during validation.


When you view the Cyral analysis logs, a match should look something like the following (notice the classifications section inside the response block):


{"policyViolated":false,

 "sensitiveQuery":false,

 "autoScalingGroupInstance":"172.18.0.4",

 "queryId":"172.18.0.4:5439:1604432401301721200:1",

 "endUser":"sbtest","dbUser":"sbtest","dbRole":"sbtest",

 "Repo":{"id":"1Xli2wvhNiU6QN1dJHc2s3c7ET0",

         "Name":"bp-postgres",

         "Type":"postgresql",

         "host":"bp-postgres",

         "port":5432},

 "client":{"host":"172.18.0.4",

           "port":5439,

           "applicationName":"psql",

           "connectionId":"172.18.0.4:5439:1604432401301721200",

           "connectionTime":"2020-11-03 19:40:01.3017212 +0000 UTC"},

 "request":{"timestamp":"2020-11-03 19:40:15.456595 +0000 UTC",

            "timestampMillis":1604432415456,

            "searchPath":["sbtest","public"],

            "statement":"SELECT ' 555-12-1212 ' AS some, a, b, c FROM classiftest WHERE id \u003c 3",

            "statementType":"SELECT",

            "tablesReferenced":["classiftest"]},

  "response":{"status":"Ok",

              "rowsAffected":2,

              "bytesAffected":161,

              "executionTime":"58.7124ms",

              "executionTimeMicros":58712,

              "classifications":[{"attribute":"some","tags":["cyral_ssn"]},

                                 {"attribute":"a","tags":["cyral_us_phone"]},

                                 {"attribute":"b","tags":["cyral_email"]}

                                ]},

 "activityType":"query"}
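
When you are checking many log entries, it can help to pull out just the classification results. The sketch below parses a trimmed-down record that keeps only the fields it needs from the sample above; tagged_attributes is a hypothetical helper name, and the sketch assumes each log entry is available as a valid JSON document.

```python
import json

# Trimmed-down version of the sample log entry, keeping only the fields used here.
log_line = """{"request": {"tablesReferenced": ["classiftest"]},
 "response": {"classifications": [
   {"attribute": "some", "tags": ["cyral_ssn"]},
   {"attribute": "a", "tags": ["cyral_us_phone"]},
   {"attribute": "b", "tags": ["cyral_email"]}]}}"""


def tagged_attributes(record):
    """Return (attribute, tags) pairs from a parsed log record, or [] if none."""
    hits = record.get("response", {}).get("classifications") or []
    return [(hit["attribute"], hit["tags"]) for hit in hits]


record = json.loads(log_line)
tables = record["request"]["tablesReferenced"]
for attribute, tags in tagged_attributes(record):
    print(tables, attribute, tags)
```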





If you are not in a production environment and you wish to test a particular regular expression, we recommend temporarily setting samplingPeriod to 1 so that every query is potentially classified, and then invoking queries through the sidecar. Once you have validated your custom regular expressions, set samplingPeriod back to a reasonable value, such as 1000.

