Created by Rishabh Srivastava, Founder of Loki.ai
This summary was largely done for my own note-taking; I'm sharing it in case it adds value to other people.
I have no affiliation whatsoever with anyone in this note, and the summary may contain errors :)
Context
Source URL:
Why is it important: It's an interesting look at news personalization. It's not the best approach by any means, but it's worth checking out.
Keywords
Personalization
Summary
...
Highlights
Overview
Data captured: data about users, data about articles, and data about interactions between users and articles
- articles they have read
- articles they have seen but not clicked
Training data: clicks on articles from the past
Prediction data: clicks on articles from tomorrow
Cold-start problem: since articles are new every day, recommendations must rely on article content and user profiles. Collaborative filtering can't really be used.
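As an illustration of the daily setup above, a toy sketch of the temporal split (the table layout and column names are assumptions, not from the source):

```python
import pandas as pd

# Toy stand-in for the click log: one row per (user, article) impression.
interactions = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3],
    "article_id": [10, 11, 10, 12, 13],
    "clicked":    [1, 0, 0, 1, 0],
    "date":       pd.to_datetime(["2020-06-12", "2020-06-13", "2020-06-13",
                                  "2020-06-14", "2020-06-15"]),
})

today = pd.Timestamp("2020-06-15")

# Train on clicks from the past few days, predict clicks on today's (new) articles.
train = interactions[interactions["date"] < today]
predict = interactions[interactions["date"] == today]
```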
Data preprocessing
The data is significantly imbalanced, which can be a problem for the model: users are much more likely not to click on an article than to click on it.
To solve this, randomly sample from the negatives at the user level (a minimal sketch follows below). This leads to a more balanced dataset. Do feature extraction from here.
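A minimal sketch of the per-user negative downsampling, assuming the same hypothetical interactions table with a binary `clicked` column (the 1:1 ratio is also an assumption):

```python
import pandas as pd

def balance_per_user(interactions, neg_per_pos=1, seed=42):
    """Downsample negatives (non-clicks) per user so classes are roughly balanced."""
    pos = interactions[interactions["clicked"] == 1]
    neg = interactions[interactions["clicked"] == 0]

    # How many positives each user has; 0 for users with no clicks.
    pos_counts = pos.groupby("user_id").size()

    def sample_negatives(group):
        n_pos = int(pos_counts.get(group.name, 0))
        n_keep = min(len(group), max(1, n_pos * neg_per_pos))
        return group.sample(n=n_keep, random_state=seed)

    neg_sampled = neg.groupby("user_id", group_keys=False).apply(sample_negatives)
    return pd.concat([pos, neg_sampled]).sample(frac=1, random_state=seed)
```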
User Features + Content Features + User-content features = Prediction
3 kinds of features:
- user features
- article features
- user-article features
A. Article features
i) Metadata
- authors
- content
- section
- publication date
- long story?
ii) Enrichments
- #paragraphs, #sentences, #words
- tags
- article length
- article complexity
- hapax legomena (count of words that occur only once)
- sentiment
- word embeddings
Tools used: textpipe, spacy, fasttext, nltk
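The source only lists the tools; as an illustration, here is a rough sketch of how a few of these enrichments might be computed with spaCy and NLTK (the model name, the VADER sentiment choice, and the exact feature definitions are my assumptions):

```python
from collections import Counter

import spacy
from nltk.sentiment import SentimentIntensityAnalyzer  # requires nltk.download("vader_lexicon")

nlp = spacy.load("en_core_web_md")  # md/lg models ship with static word vectors
sia = SentimentIntensityAnalyzer()

def article_features(text):
    doc = nlp(text)
    words = [t.text.lower() for t in doc if t.is_alpha]
    freqs = Counter(words)
    return {
        "n_paragraphs": text.count("\n\n") + 1,               # crude paragraph count
        "n_sentences": sum(1 for _ in doc.sents),
        "n_words": len(words),
        "article_length": len(text),                          # characters, as a length proxy
        "n_hapax": sum(1 for c in freqs.values() if c == 1),  # hapax legomena
        "sentiment": sia.polarity_scores(text)["compound"],
        "embedding": doc.vector,                               # mean word vector for the doc
    }
```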
B. User features
i) Demographic
- gender
- age-range
- ...
ii) User reading behavior
- tags followed
- most read tags/authors
- average #words, #sentences, #paragraphs read per story
- average article length
- average sentiment
- average word embeddings
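The reading-behavior features above are averages over the articles a user has read; a minimal sketch of that aggregation, assuming a hypothetical `clicks` table of (user_id, article_id) pairs and an `article_features` table indexed by article_id (both names are assumptions):

```python
import pandas as pd

def build_user_profiles(clicks, article_features):
    """Average the features of the articles each user has read.

    clicks:            DataFrame with user_id, article_id (one row per read article)
    article_features:  DataFrame indexed by article_id with numeric columns such as
                       n_words, n_sentences, n_paragraphs, article_length, sentiment,
                       and embedding dimensions emb_0 ... emb_k
    """
    read = clicks.join(article_features, on="article_id")
    # Mean of every numeric article feature per user: avg words read per story,
    # avg sentiment, avg word embedding, etc.
    return read.groupby("user_id").mean(numeric_only=True)
```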
C. User-article features
i) Article & avg user set overlap
- tags
- authors
- article length
ii) Article vs. average user reading habits comparison
- # words, #sentences, #paragraphs
- Word embeddings similarity
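A sketch of how these cross features might be computed: Jaccard overlap for tags/authors and cosine similarity between the article embedding and the user's average embedding (all field names here are assumptions, not from the source):

```python
import numpy as np

def jaccard(a, b):
    """Overlap between two sets, e.g. article tags vs. tags the user usually reads."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def cosine(u, v):
    """Similarity between the article embedding and the user's average embedding."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def user_article_features(user_profile, article):
    return {
        "tag_overlap": jaccard(article["tags"], user_profile["top_tags"]),
        "author_overlap": jaccard(article["authors"], user_profile["top_authors"]),
        "length_diff": abs(article["article_length"] - user_profile["avg_article_length"]),
        "words_diff": abs(article["n_words"] - user_profile["avg_n_words"]),
        "embedding_sim": cosine(article["embedding"], user_profile["avg_embedding"]),
    }
```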
Model used and approach
- The resulting feature vector has roughly 14k dimensions on average
- Research questions:
- what model to use?
- what data to use?
- what features to keep?
Models:
- Gradient Boosted Decision Trees (GBDT) + Logistic Regression
- Training methods:
  a) fit: train a new model every day
  b) partial fit: re-train the previous day's model with today's data, without adding new trees
  c) re-train the previous day's model with today's data, with new trees
The model was trained every day on the past 3-7 days' worth of data to capture emerging trends well.
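The note doesn't say how the GBDT and logistic regression are combined; a common pattern (and my assumption here) is to one-hot encode the GBDT's leaf indices and fit the logistic regression on top of them. A minimal scikit-learn sketch on dummy data, including the warm-started re-training from option c):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Stand-in for the real ~14k-dim user/article/cross feature matrix.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_train, y_train, X_tomorrow = X[:1500], y[:1500], X[1500:]

# GBDT fit on past clicks; warm_start lets us keep adding trees later (option c).
gbdt = GradientBoostingClassifier(n_estimators=100, warm_start=True, random_state=0)
gbdt.fit(X_train, y_train)

# One-hot encode the leaf index each sample falls into, per tree, and fit LR on top.
encoder = OneHotEncoder(handle_unknown="ignore")
leaves_train = gbdt.apply(X_train).reshape(len(X_train), -1)
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.fit_transform(leaves_train), y_train)

# Score tomorrow's articles: probability that the user clicks.
leaves_new = gbdt.apply(X_tomorrow).reshape(len(X_tomorrow), -1)
click_prob = lr.predict_proba(encoder.transform(leaves_new))[:, 1]

# Training option (c): re-train the previous day's GBDT with extra trees.
gbdt.n_estimators += 20
gbdt.fit(X_train, y_train)  # would be the next day's data in practice
```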