Parsing rules

Parsing rules apply to custom feeds (feeds that are specified using the Path element). These rules specify how Feed Utility must parse each feed.

Parsing rules are defined in the Parsing element of feed rules for a custom feed.

The following is an example of parsing rules for a custom feed. These rules specify that the input feed is in JSON format. An MD5 parsing rule is defined for the files/md5 field in the input feed. Values in this field will be parsed as MD5 hashes.

<Feed>

...

<Parsing type="json">

<MD5 type="MD5">files/md5</MD5>

</Parsing>

...

</Feed>

Parsing element

The parent element, Parsing, contains all nested parsing rules. Its attributes define the input format.

This element has the following attributes:

type
Specifies the input feed type.

This attribute can have the following values: json, csv, xml, misp, stix, stix2, pdf, messageBody, messageAttach.

Feed Utility supports STIX versions 1.0, 1.1, 2.0, and 2.1. The exact version of STIX is determined automatically.

A file added to the directory of a pdf feed is not processed if it was created earlier than, or at the same time as, the most recently created file that has already been processed.
delimiter
Specifies the delimiter for CSV input feeds. By default, this value is ';'.
rootElement
Specifies the root element path for XML and JSON input feeds.
- XML input feeds
  You can use the '*' and '?' wildcard characters as substitutes for any other character or group of characters. The '*' wildcard character can be used for a group of characters. The '?' wildcard character can be used for a single character.
  
  A part of the rootElement path cannot consist of wildcard characters only. For example, "Feeds/*/Contents" is invalid.
- JSON input feeds
  You can specify a root element at any nesting level. Separate the nesting levels with the "/" character.
  
  The rootElement value can be empty. If it is not empty, it must not contain empty nesting levels (the substring "//") and must not start or end with the "/" character.
  
  You cannot use wildcards in the root element for JSON feeds.

The following example demonstrates how to use the Parsing element for an XML input feed. In this case, parsing rules will be applied to elements nested inside the Feeds > Example > Contents element.

<Feed>

...

<Parsing type="xml" rootElement="Feeds/Example/Contents">

...

</Parsing>

...

</Feed>
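
For comparison, the same attributes might be used with JSON and CSV input feeds as sketched below. This is an illustration only: the data/items path and the "," delimiter are hypothetical values, and the nested parsing rules are elided.

```xml
<!-- JSON input feed: rules apply to fields under data > items (hypothetical path).
     The value has no wildcards, no empty levels, and no leading or trailing "/". -->
<Parsing type="json" rootElement="data/items">
  ...
</Parsing>

<!-- CSV input feed: replaces the default ';' delimiter with ',' (assumed input layout). -->
<Parsing type="csv" delimiter=",">
  ...
</Parsing>
```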

Individual parsing rules

Parsing rules for individual fields of an input feed must be nested inside the Parsing element. When Feed Utility processes the input feed, it creates the fields of the output feed according to these rules.

Each rule has the following format:

<%OUTPUT_NAME% type="%VALUE_TYPE%">%INPUT_NAME%</%OUTPUT_NAME%>

This format uses the following placeholders:

%OUTPUT_NAME% defines the name of the field in the output feed. For example, if %OUTPUT_NAME% is MD5, the field with this value will also be named MD5 in the output feed.
%OUTPUT_NAME% preserves nested fields. If a field specified in the %INPUT_NAME% is nested, the field in the output feed will also be nested. For example, if %OUTPUT_NAME% is MD5_HASH and %INPUT_NAME% is files/md5, the field in the output feed will be files/MD5_HASH.

For JSON input feeds, %OUTPUT_NAME% must always use the Field value. Feed Utility uses the field names from the original feed.
%VALUE_TYPE% is the type of the values stored in this field.
These values will be handled by Feed Utility according to the specified type. For example, if the output feed contains domain names and URLs, then it will be compiled to the binary format.

The following value types are possible:
- url—This value type is used for URLs.
- ip—This value type is used for IP addresses.
- md5—This value type is used for MD5 hashes.
- sha1—This value type is used for SHA1 hashes.
- sha256—This value type is used for SHA256 hashes.
- domain—This value type is used for domain names.
- context—This value type is used for context information.
%INPUT_NAME% is the name of the field in the input feed. It must be defined according to the input feed format:
- For JSON input feeds, %INPUT_NAME% must contain the name of the field from the input feed. Nested fields must be delimited by '/'.
- For CSV input feeds, %INPUT_NAME% must contain the column number.
- For XML input feeds, %INPUT_NAME% must contain a path to one of the nested elements of the root element. The root element is defined in the rootElement attribute of the Parsing element. The path is case sensitive.
- For STIX and MISP input feeds, the Parsing element must contain no parsing rules.

The following example demonstrates parsing rule syntax for JSON input format:

<Feed>

...

<Parsing type="json">

<Field type="md5">files/md5</Field>

</Parsing>

...

</Feed>

The following example demonstrates parsing rule syntax for CSV input format:

<Feed>

...

<Parsing type="csv">

...

</Parsing>

...

</Feed>
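
As a sketch of what CSV parsing rules might look like, the fragment below maps two columns to output fields. The field names URL and IP and the column numbers are hypothetical, and this section does not state whether column numbering starts at 0 or 1, so check that against your feed before reusing the values.

```xml
<Parsing type="csv" delimiter=";">
  <!-- %INPUT_NAME% is a column number for CSV input feeds (numbers are illustrative). -->
  <URL type="url">1</URL>
  <IP type="ip">2</IP>
</Parsing>
```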

The following example demonstrates parsing rule syntax for XML input format:

<Feed>

...

<Parsing type="xml" rootElement="...">

<GEO type="context">context</GEO>

</Parsing>

...

</Feed>

Parsing rules for feeds of email type

To set the parsing rules for a third-party feed, specify one of the following values for the type attribute:

messageBody—Parsing rules for an email body.
This value is applicable if POP3 or IMAP is enabled in the Path element.
messageAttach—Parsing rules for an email attachment.
This value is applicable if POP3 or IMAP is enabled in the Path element.

Parsing the message body (for feeds of the email type)

Feed Utility parses the body of an email loaded from a mail server if the messageBody value is set in the type attribute of the Parsing element.

The message body is parsed using the regular expressions specified in the Parsing element.

You can set one or more rules with regular expressions for message body parsing.

Each rule has the following form:

<%FIELD_NAME% type="%FIELD_TYPE%">%REG_EXP%</%FIELD_NAME%>

Where:

%FIELD_NAME% defines the name of the field in the output feed. For example, if %FIELD_NAME% is MD5, the field with this value will also be named MD5 in the output feed.

%FIELD_TYPE% is an indicator type.

%REG_EXP% is a regular expression.

Each regular expression applies to the whole message body.
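
A minimal sketch of message body rules follows. The field names MD5 and Campaign and both regular expressions are hypothetical, and the regular expression dialect is not stated in this section, so treat the patterns as placeholders to adapt.

```xml
<Parsing type="messageBody">
  <!-- Indicator rule: any 32-character hexadecimal string in the message body. -->
  <MD5 type="md5">(\b[0-9a-fA-F]{32}\b)</MD5>
  <!-- Context rule: a line fragment such as "Campaign: example_name" (assumed email layout). -->
  <Campaign type="context">(Campaign: [A-Za-z0-9_]+)</Campaign>
</Parsing>
```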

The feed is formed from the content of the loaded emails, subject to the following conditions:

The values from all loaded emails are placed in one resulting feed.
Feed Utility stores the date of the feed's latest update. When Feed Utility contacts the mail server, it parses only the emails received after the previous feed update.
Each entry of the resulting feed contains the following fields:
- message_from—Email address of the message sender.
- message_subject—Subject of the email.
- message_date—Date on which the mail server receives the email.
Each entry of the resulting feed has only one indicator, obtained by one regular expression whose type attribute has a value other than CONTEXT.
Each value obtained by a regular expression whose type attribute has the CONTEXT value is added to every entry of the resulting feed that contains an indicator (IP/HASH/URL) from the same email.

If one regular expression (with the type attribute having the CONTEXT value) produces more than one value, these values are specified in one entry of the resulting feed. The values are separated by a sequence of ";" characters.
The feed will not contain values that match the rules specified in the Excluded section (see the "Excluded element for PDF feeds and email feeds" section below).

Parsing message attachments (for feeds of the email type)

Feed Utility parses email attachments loaded from a mail server if the messageAttach value is set in the type attribute of the Parsing element.

You can set one or more rules that specify the types of attached files.

Each rule has the following form:

<Attach type="%ATTACH_TYPE%"></Attach>

Where:

%ATTACH_TYPE% is an attachment type.

%ATTACH_TYPE% can have the following values:

csv
json
xml
stix
stix2
pdf

The Attach element has at least one value.

You can set one or more rules with regular expressions.

Each rule has the following form:

<%FIELD_NAME% type="%FIELD_TYPE%">%REG_EXP%</%FIELD_NAME%>

Where:

%FIELD_NAME% defines the name of the field in the output feed. For example, if %FIELD_NAME% is MD5, the field with this value will also be named MD5 in the output feed.

%REG_EXP% is a regular expression.

%FIELD_TYPE% is an indicator type. For the %FIELD_TYPE% element, specify the type attribute by using the following values:

ip
md5
sha256
sha1
url
context

The following example demonstrates the parsing rule for a message attachment:

<Attach type="...">

...

</Attach>

Feed Utility parses files with the following extensions:

Value in the type attribute of the Parsing element	File extensions
csv	csv and txt
json	json
xml	xml
stix1	xml
stix2	json
pdf	pdf

If parsing rules are set simultaneously for stix1 and xml (or stix2 and json), Feed Utility performs the following:

Attempts to parse the attached file as stix (with the xml/json extension).
If no errors occurred while parsing, and the file is a valid stix feed, this file is not parsed according to the rules for parsing xml/json attachments specified in the feed's settings.
If an error occurred while parsing (the file is not a valid stix feed), this file is parsed according to the rules for parsing xml/json attachments specified in the feed's settings.

If an email has more than one attachment, the information from each attachment will be in one resulting feed.

The feed is formed from the content of the loaded emails, subject to the following conditions:

The values from all loaded email attachments are placed in one resulting feed.
Feed Utility stores the date of the feed's latest update. When Feed Utility contacts the mail server, it parses only the emails received after the previous feed update.
Each entry of the resulting feed contains the following fields:
- message_from—Email address of the message sender.
- message_subject—Subject of email.
- message_date—Date on which the mail server receives the email.
- attach_name—Name of attachment.
Each entry of the resulting feed has only one indicator, obtained by one regular expression whose type attribute has a value other than CONTEXT.
Each value obtained by a regular expression whose type attribute has the CONTEXT value is added to every entry of the resulting feed that contains an indicator (IP/HASH/URL) from the same attachment.

If one regular expression (with the type attribute having the CONTEXT value) produces more than one value, these values are specified in one entry of the resulting feed. The values are separated by a sequence of ";" characters.
The feed will not contain values that match the rules specified in the Excluded section (see the "Excluded element for PDF feeds and email feeds" section below).

Excluded element for PDF feeds and email feeds

If the pdf, messageBody, or messageAttach value is specified in the type attribute of the Parsing element, the Feed element can contain the Excluded section and have one or more nested <Item/> elements with indicator exclusion rules for the resulting feed.

The Excluded section has the following form:

<Excluded>

<Item>{RegExp}</Item>

...

</Excluded>

Where {RegExp} is a regular expression.

The Excluded section and the Item elements are optional.

The following example demonstrates exclusion rules:

<Excluded>

<Item>(https:\/\/badurl\.com)</Item>

</Excluded>
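
To show where the section might sit, the sketch below places Excluded inside the Feed element next to a messageBody Parsing element. The URL rule and its regular expression are hypothetical, and the order of Parsing and Excluded within Feed is an assumption that this section does not confirm.

```xml
<Feed>
  ...
  <Parsing type="messageBody">
    <!-- Hypothetical indicator rule: collect URLs from the message body. -->
    <URL type="url">(https?:\/\/[^\s]+)</URL>
  </Parsing>
  <!-- Values matching any Item expression are left out of the resulting feed. -->
  <Excluded>
    <Item>(https:\/\/badurl\.com)</Item>
  </Excluded>
  ...
</Feed>
```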
