Filters

ContentGems' primary feature is to recommend relevant Articles that you can either consume yourself, or share with others. In order to do so, ContentGems needs to know about your interests. You use Filters to specify your interests using keywords and other filter settings.

Each Filter extracts the most relevant Articles from hundreds of thousands of Articles indexed each day.

Filter Recommendations

How a Filter works

  1. Filter Articles: The Filter finds all Articles in the Articles Index that match the filter criteria. Examples of filter criteria are Must query conditions, Must not query conditions, Feed Bundles to include, Feed Bundles to exclude, Minimum popularity, Must have image, Minimum word count, and Content category (Must be, Must NOT be). This step can produce a very large number of Articles.
  2. Rank Articles: The Filter ranks the Articles based on the selected criterion, either Relevancy or Popularity. Relevancy is based on the number of Must query terms that match: the more matching terms an Article contains, the higher it ranks. Popularity is based on how many times an Article was shared on social media. Please note that ranking is different from sorting! Ranking determines which Articles make it into the trimmed list, and sorting determines the order in which the trimmed Articles are displayed.
  3. Deduplicate Articles: The Filter analyzes each Article's URL, title and entire body to remove Articles that are slightly similar, very similar, or identical, depending on the setting. This is useful if an Article was syndicated and appeared in multiple publications or Feeds with slightly different URLs or titles.
  4. Sort Articles: The Filter sorts the Articles chronologically by the time each Article was indexed. In most cases this is the publishing date. However, in some situations ContentGems encounters an older Article that was recently shared on social media and treats it as fresh.

At the end of this process, the Filter has extracted the top matching Articles from the Articles Index and passes them to the requesting Workflow.
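
The four steps can be sketched in a few lines of Python. This is a simplified illustration, not ContentGems' actual implementation: the `Article` fields, the `run_filter` function, the title-only deduplication, trimming after deduplication, and the newest-first sort direction are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Article:
    title: str
    body: str
    shares: int            # hypothetical popularity signal
    indexed_at: datetime   # time the Article was indexed


def run_filter(index, must_terms, limit):
    """Filter, rank, deduplicate, trim, and sort a list of Articles."""
    # 1. Filter: keep Articles whose text contains every Must term.
    matched = [
        a for a in index
        if all(t.lower() in f"{a.title} {a.body}".lower() for t in must_terms)
    ]
    # 2. Rank: most shared first (the "Popularity" criterion).
    ranked = sorted(matched, key=lambda a: a.shares, reverse=True)
    # 3. Deduplicate: simplified stand-in that only compares titles.
    seen, unique = set(), []
    for a in ranked:
        key = a.title.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(a)
    # Trim to the top-ranked Articles, then
    # 4. Sort: newest first by indexing time (the direction is an assumption).
    return sorted(unique[:limit], key=lambda a: a.indexed_at, reverse=True)


index = [
    Article("ClojureScript 1.0", "clojurescript release", 50, datetime(2024, 1, 1)),
    Article("clojurescript 1.0", "clojurescript release, syndicated", 10, datetime(2024, 1, 2)),
    Article("Learning ClojureScript", "a clojurescript tutorial", 30, datetime(2024, 1, 3)),
    Article("Java news", "nothing relevant here", 99, datetime(2024, 1, 4)),
    Article("ClojureScript tips", "clojurescript tips", 5, datetime(2024, 1, 5)),
]
top = run_filter(index, ["clojurescript"], limit=2)
```

In this sample data, "ClojureScript tips" is the most recently indexed Article, yet it does not appear in `top` because it ranks lowest. Ranking decides which Articles survive trimming, and sorting only orders the survivors.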

Keywords

The first tab when configuring your Filters is the "Keywords" tab. Here you provide terms that define your topic of interest.

On this tab you specify filter rules for search terms in the content. The filter builder starts with a single rule, but you can add more rules and combine them with `AND` or `OR` boolean logic.

Filter: Keywords

"MUST" and "MUST NOT"

The first input determines how the rule is applied in one of two ways:

MUST

A rule with this setting requires that an Article meets the conditions expressed in the rule. Rules with this setting primarily affect a Filter's filtering stage, since every recommended Article meets these conditions. Articles that match all rules with this setting make it into the list of filtered Articles. They are then ranked in a later step based on how many keyword matches they contain.

This group of terms is very useful for Filters around a topic that can be described with one or more unique search terms. If you expect that all your Articles will contain a specific phrase, you can use this group to filter out all the noise before you rank the Articles. An example is a Filter around a technical topic like "ClojureScript". This unique technical term will only occur in Articles related to this software technology, so you would enter "clojurescript" into a "Must" field.

Other possible applications for `Must` terms are unique location names, personal names, company names, product names, TV show titles, etc. Basically any string that is unique to the topic of interest.

MUST NOT

Rules with this setting are used for filtering only; they do not affect ranking. Use them to filter out Articles that contain certain terms when your topic of interest uses ambiguous terms.

A good example is the topic "Apple". You may be interested in the fruit, or the company.

If you are interested in the fruit, then you would enter "apple" as a "MUST contain" rule, and you would enter names of Apple Inc. products and related names under "MUST NOT contain" rules: "iphone", "macbook", "ipad", "apple watch", "Tim Cook", etc.
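
As a rough illustration of how MUST and MUST NOT rules combine during filtering, the following Python sketch applies the "Apple" example to plain text. The function name `passes_rules` and the simple substring matching are assumptions made for this example, not ContentGems' actual matching logic.

```python
def passes_rules(text, must, must_not):
    """Return True if text contains every MUST term and no MUST NOT term."""
    lowered = text.lower()  # this sketch ignores case
    has_all_must = all(term.lower() in lowered for term in must)
    has_any_must_not = any(term.lower() in lowered for term in must_not)
    return has_all_must and not has_any_must_not


must = ["apple"]
must_not = ["iphone", "macbook", "ipad", "apple watch", "Tim Cook"]

fruit = passes_rules("How to bake an apple pie", must, must_not)
company = passes_rules("Apple unveils a new iPhone", must, must_not)
```

Here `fruit` is `True` and `company` is `False`: both texts contain "apple", but the second also contains a MUST NOT term.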

Rule kinds

There are a number of kinds of rules to help you filter Articles based on their content.

Filter: Match Types

  • "contain the exact phrase"
    • Rules of this kind specify exact terms or phrases found in an Article. Depending on the rule application, the Filter will recommend Articles that contain or do NOT contain the given terms.
    • Example: The exact term "water" matches "Water down the bridge", however it doesn't match "Watergate scandal".
    • Matching is not case sensitive.
  • "contain words starting with"
    • Rules of this kind specify word prefixes found in an Article. Depending on the rule application, the Filter will recommend Articles that contain or do NOT contain words with the given prefix.
    • Example: The word prefix "water" matches both "Water down the bridge", and "Watergate scandal".
  • "contain text similar to phrase"
    • Rules of this kind specify fuzzy search terms. Depending on the rule application, the Filter will recommend Articles that contain or do NOT contain words similar to the given phrase.
    • Example: The fuzzy search term "color" matches both "colour" and "color".
  • "be shared with Hashtag"
    • Rules of this kind specify under which hashtag an Article was shared on Twitter. Depending on the rule application, the Filter will recommend Articles that were or were NOT shared with the given hashtag. Hashtags have to match exactly, however they are not case sensitive.
  • "be from Web Domain ending with"
    • Rules of this kind specify the domain suffix under which an Article is hosted. This rule is useful to specify from which kind of websites you want to get recommendations. You can, e.g., limit search to Canadian websites by entering ".ca" in this rule with a "MUST" application. Or you can exclude Articles from a specific website by entering the Web Domain in this rule with a "MUST NOT" application.
  • "match advanced query"
    • Rules of this kind let you specify advanced rules for matching content. Please see below under "Advanced query syntax" for more information.
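
The difference between the exact-phrase and word-prefix rule kinds can be approximated with regular expressions. The sketch below is only an approximation under the assumption that terms match on word boundaries; the function names `contains_exact` and `contains_prefix` are hypothetical and are not part of ContentGems.

```python
import re


def contains_exact(phrase, text):
    """True if the phrase appears as whole words, ignoring case."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def contains_prefix(prefix, text):
    """True if any word in the text starts with the prefix, ignoring case."""
    pattern = r"\b" + re.escape(prefix)
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


exact_hit = contains_exact("water", "Water down the bridge")
exact_miss = contains_exact("water", "Watergate scandal")
prefix_hit = contains_prefix("water", "Watergate scandal")
```

With these definitions, `exact_hit` and `prefix_hit` are `True`, while `exact_miss` is `False`, which mirrors the "water" examples in the list above.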

Field specifier

Some rule kinds let you choose which of an Article's fields you want to apply the rule to. The default setting is to search both the Article's title and body text ("in the title, first paragraph, or body text"). That works well in most situations, however there may be cases where you want to narrow down which fields are queried.

One reason to narrow down the fields is to make sure that a given term or phrase is important in the Article. Important words tend to appear in the title or near the beginning of the Article. In that case you can choose "in the title or first paragraph" or "in the title".

Sources

The second tab when configuring your Filters is the "Sources" tab. Here you specify which Feed Bundles you want to use for the Filter, and which Web Domains you want to exclude from results.

Filters: Sources Tab

Blocked Websites

You can block Articles from certain Web Domains if you consider them unsuitable for your recommendations. For example, you may consider a Web Domain to be of low quality, off topic, or run by a competitor.

In order to block a Web Domain, just click on the "Block" icon on an Article from that domain. Once blocked, the Filter will never include Articles from that Web Domain again. 

Article: Showing Block Icon

You can remove blocked Web Domains under Filter settings > Sources, by clicking on the "X" icon next to the domain.

Filters: Remove Blocked Website

Feed Bundles

You can limit the Feeds considered for a given Filter using the "Only include articles found in these Feed Bundles" setting. Once you have organized your trusted Feeds into Feed Bundles, you can include them here so that only Articles from your included Feed Bundles are being searched.

This is very useful for broad topics, or topics that use ambiguous terms. If you are having a hard time getting good results using keywords, then you can improve things by limiting the search to Feeds that are relevant to your topic of interest.

Sometimes you may only want to exclude a few Feeds, e.g., a competitor's, or a Feed you consider to be of low quality. In that case, you use the "Exclude these Feed Bundles" setting. If you add any Feed Bundles here, then Articles contained in them will be excluded from the search results.

Please Note: If you do not specify a custom Feed Bundle, your Filter will search the entire CG Firehose.
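
The include and exclude settings can be thought of as simple set logic. The following Python sketch is an assumption-laden illustration: the function `feed_in_scope`, the representation of a Feed Bundle as a set of Feed names, and the rule that exclusion wins over inclusion are all hypothetical.

```python
def feed_in_scope(feed, include_bundles, exclude_bundles):
    """Decide whether a Feed is searched, given sets of Feed names per bundle."""
    if any(feed in bundle for bundle in exclude_bundles):
        return False  # assumption: exclusion wins over inclusion
    if not include_bundles:
        return True   # no custom Feed Bundle: search the entire firehose
    return any(feed in bundle for bundle in include_bundles)


trusted = {"techblog.example/feed", "devnews.example/feed"}
competitors = {"rival.example/feed"}

searched = feed_in_scope("techblog.example/feed", [trusted], [competitors])
skipped = feed_in_scope("random.example/feed", [trusted], [competitors])
```

With an include list in place, only Feeds in `trusted` are searched. With an empty include list, every Feed except those in `competitors` is searched.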

Settings

The third tab is Settings, which contains additional options that let you further narrow down Articles.

Filters: Settings Tab

The following settings are available:

  • Minimum popularity - Set it to `None` to find Articles that aren't popular in social media (yet). Set it to a higher value to only consider Articles that have been vetted in social media already. The `None` setting will likely include noise, and you should manually curate the recommended Articles. A higher setting is well suited for automated sharing without manual curation.
  • Media: must have image - Check this checkbox to include only Articles that have a primary image. In some sharing scenarios it is advantageous to add visual interest with images.
  • Minimum word count (body) - If you are looking for longer Articles, then set this parameter to a higher value.
  • Minimum word count (title) - The number of words in the title can be used as a quality metric. Longer titles may indicate higher quality.
  • Content must be… - Limit Articles to any of the pre-configured content categories.
  • Content must not be… - Exclude Articles from any of the pre-configured content categories.

Other settings

  • Rank results by
    • This setting determines how Articles are ranked before they are trimmed and sorted. Please note that this is different from sorting! Ranking determines which Articles will make it into the final list (only the top ranked Articles). Once we have the set of trimmed Articles, they will be sorted chronologically.
  • Remove duplicates that are
    • Use this setting to determine how aggressively duplicate Articles are removed. ContentGems looks at the Article's title and entire body when deciding if two Articles are duplicates. It computes the Jaccard index for every pair of Articles, using the Article's bag of words as the set elements. The similarity settings range from a Jaccard index of 70% (slightly similar) to 95% (identical).
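
For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The Python sketch below shows the computation on two short texts. The whitespace tokenization, the function names, and the "greater than or equal to the threshold" comparison are assumptions made for this example; ContentGems' own tokenization may differ.

```python
def bag_of_words(text):
    """Lowercase the text and return its distinct words as a set."""
    return set(text.lower().split())


def jaccard(text_a, text_b):
    """Jaccard index: |A intersect B| / |A union B| over the word sets."""
    words_a, words_b = bag_of_words(text_a), bag_of_words(text_b)
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def is_duplicate(text_a, text_b, threshold):
    """Treat two texts as duplicates once their similarity reaches the threshold."""
    return jaccard(text_a, text_b) >= threshold


a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
similarity = jaccard(a, b)  # 7 shared words out of 9 distinct words
```

The two sample texts share 7 of 9 distinct words, a similarity of about 78%. They would be treated as duplicates at the 70% setting but not at the 95% setting.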

 

Advanced query syntax

Keywords and phrases

You can provide single keywords like apple, or you can specify phrases by wrapping them in double quotes. Example: "content marketing".

Wildcards

To perform a wildcard search, use the * symbol. For instance, to search for content that contains "smartphone" or "smartphones", use the query: smartphone*

Groupings

Parentheses allow you to create queries with nested logic. For instance, to search for content that must contain either "information" or "technology", include the following term: (information technology).

Field specifiers

Field specifiers allow you to query a particular field in an Article. If you don't specify a field, the term will be matched against the Article's title and body text fields.

The following fields are available for searching:

  • body searches in the Article body only. Example: To find Articles that have the term "apple" in their body text, enter body:apple as one of your query terms.
  • domain matches the domain suffix in the Article's URL. Use this to find Articles from a given Web Domain, e.g., for geographic filtering. Domains are interpreted from right to left, which may be unexpected. To match any ".uk" domain, enter domain:uk.
    • Example 1: To match Articles from Web Domain ending in ".com.au", enter domain:com.au.
    • Example 2: To match Articles from a specific Web Domain, enter domain:contentgems.com.
  • excerpt searches the first 300 characters in the Article's body text only. Sometimes searching this field instead of the entire body will eliminate noisy results since the most important terms are typically found at the beginning of an Article. Example: To search for Articles that contain the term "content marketing" at the beginning of the body text, enter excerpt:"content marketing".
  • hashtag finds Articles that were shared on Twitter with this hashtag. Example: To find Articles that were shared on Twitter with the "#cdnpoli" hashtag, enter the following: hashtag:cdnpoli.
  • title searches in the Article title only. Example: To find Articles that contain the term "green tea" in their title, enter the search term title:"green tea".
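
The right-to-left interpretation of the domain field can be pictured as comparing domain labels starting from the end of the host name. The Python sketch below illustrates this idea. The function `domain_matches` is hypothetical, and matching on whole labels is an assumption about the behavior, not a documented guarantee.

```python
from urllib.parse import urlparse


def domain_matches(url, suffix):
    """True if the URL's host ends with the given domain suffix, label by label."""
    host = (urlparse(url).hostname or "").lower()
    host_labels = host.split(".")
    suffix_labels = suffix.lower().strip(".").split(".")
    # Compare the rightmost labels of the host with the suffix labels.
    return host_labels[-len(suffix_labels):] == suffix_labels


uk = domain_matches("https://www.bbc.co.uk/news", "uk")
au = domain_matches("https://news.example.com.au/story", "com.au")
site = domain_matches("https://blog.contentgems.com/post", "contentgems.com")
```

All three sample calls return `True`. Because the comparison works on whole labels, a host such as "notcontentgems.com" would not match the suffix "contentgems.com" in this sketch.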

Boosting

Boosting allows you to control the importance of a term in a search. To boost a term use the ^ symbol with a boost factor (a number) at the end of the term. For instance, if you have a search that includes the keyword "AdWords" and want to boost this keyword then use the query AdWords^2. To boost a phrase, append the boost modifier after the closing quote: "content marketing"^10.

Boosting only makes sense in rules that are applied as "MUST" since it is used for ranking.

Any term without a field or boost specified is searched in the title and body text fields by default, and the title gets a boost of ^25. You could reproduce the default behavior with the following term: (title:water^25 body:water). This is a boolean OR query that searches for the term "water" in the Article's title field with a boost factor of 25, and in the body field with no boost. This approach ranks Articles with the term in the title higher than those that contain the term only in the body.
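
To see why a title boost pushes title matches up the ranking, consider the toy scoring function below. It is only a sketch: the real relevancy score is computed by the search engine and is more involved, and the function `boosted_score` with its additive weights is an assumption made for illustration.

```python
def boosted_score(term, title, body, title_boost=25.0, body_boost=1.0):
    """Toy score: add the field's boost for each field that contains the term."""
    term = term.lower()
    total = 0.0
    if term in title.lower():
        total += title_boost  # a title match counts 25 times as much by default
    if term in body.lower():
        total += body_boost   # a body match has no extra boost
    return total


title_match = boosted_score("water", "Water shortage looms", "Reservoirs are low.")
body_match = boosted_score("water", "Drought update", "Water levels keep dropping.")
```

In this toy model `title_match` is 25.0 and `body_match` is 1.0, so the Article with the term in its title ranks first.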

Fuzzy matching

To match similar spellings, you can make a term fuzzy by appending a tilde and a fuzzy factor. E.g., color~0.3 will match both "color" and "colour". The higher the fuzzy factor, the fuzzier the matches are.