Recommender System for Financial News
Created
Oct 27, 2020 07:43 AM
Media Type
Videos
Lesson Type
Technology
Personalization
Project
Personalization
Property
Created by Rishabh Srivastava, Founder of Loki.ai
This summary was largely done for my own note-taking, sharing it just in case it adds more value to other people.
I have no affiliation whatsoever with anyone in this note. This is a summary largely taken for my own reference, and may contain errors :)

Context

Source URL:
Why is it important: An interesting look at news personalization. Not the best approach by any means, but worth checking out.

Keywords

Personalization

Summary

...

Highlights

Overview

Data captured: data about users, data about articles, and data about interactions between users and articles
  • articles they have read
  • articles they have seen but not clicked
 
Training data: clicks on articles from the past
Prediction data: clicks on articles from tomorrow
 
Cold-start problem: since articles are new every day, must rely on the content of the article and the user profiles for recommendations. Cannot really use collaborative filtering

Data preprocessing

The data is significantly imbalanced, which can be a problem for the model: users are far more likely not to click on an article than to click on it. To solve this, conduct random sampling from the negatives at the user level. This leads to a more balanced dataset; do feature extraction from there.
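The per-user negative sampling step might look like this in pandas (a toy sketch; the column names `user_id`, `article_id`, `clicked` are my assumptions, not from the talk):

```python
import pandas as pd

def downsample_negatives(df, ratio=1.0, seed=42):
    """Per-user negative downsampling: for each user, keep all positive
    (clicked) rows and sample negatives at `ratio` x the positive count."""
    parts = []
    for uid, g in df.groupby("user_id"):
        pos = g[g["clicked"] == 1]
        neg = g[g["clicked"] == 0]
        n = min(len(neg), int(len(pos) * ratio))
        parts.append(pos)
        parts.append(neg.sample(n=n, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)

# Toy interaction log: one user with 2 clicks and 6 non-clicks
log = pd.DataFrame({
    "user_id": ["u1"] * 8,
    "article_id": list("abcdefgh"),
    "clicked": [1, 1, 0, 0, 0, 0, 0, 0],
})
balanced = downsample_negatives(log)
# balanced now holds the 2 positives plus 2 sampled negatives for u1
```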
 
User features + Article features + User-article features = Prediction
 
3 kinds of features:
  • user features
  • article features
  • user-article features
 
A. Article features
i) Metadata
  • authors
  • content
  • section
  • publication date
  • long story?
ii) Enrichments
  • #paragraphs, #sentences, #words
  • tags
  • article length
  • article complexity
  • hapax legomena (words occurring only once)
  • sentiment
  • word embeddings
Tools used: textpipe, spacy, fasttext, nltk
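A stdlib-only sketch of a few of these enrichments (counts, length, hapax legomena, and a crude complexity proxy). The actual pipeline used textpipe, spacy, fasttext and nltk, which are not reproduced here:

```python
from collections import Counter
import re

def article_features(text: str) -> dict:
    """Toy enrichment features for one article (standard library only)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    counts = Counter(words)
    hapax = [w for w, c in counts.items() if c == 1]
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "n_paragraphs": len(paragraphs),
        "article_length": len(text),
        # crude complexity proxy: vocabulary richness (type/token ratio)
        "complexity": len(counts) / max(len(words), 1),
        "n_hapax_legomena": len(hapax),
    }

feats = article_features("Markets rallied today. Markets closed higher.")
```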
 
B. User features
i) Demographic
  • gender
  • age-range
  • ...
ii) User reading behavior
  • tags followed
  • most read tags/authors
  • average #words, #sentences, #paragraphs read per story
  • average article length
  • average sentiment
  • average word embeddings
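The "average" user features above suggest aggregating the feature vectors of the articles a user has read into one profile vector; a minimal sketch (the aggregation the talk actually used may differ):

```python
import numpy as np

def user_profile(read_article_vectors):
    """Average the feature vectors (e.g. word embeddings or count
    features) of the articles a user has read into a profile vector.
    `read_article_vectors` is a list of equal-length numpy arrays."""
    return np.mean(np.stack(read_article_vectors), axis=0)

# Hypothetical 3-d article vectors for two articles a user read
profile = user_profile([np.array([1.0, 0.0, 2.0]),
                        np.array([3.0, 2.0, 0.0])])
```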
C. User-article features
i) Article & avg user set overlap
  • tags
  • authors
  • article length
ii) Articles and average user reading habits comparison
  • # words, #sentences, #paragraphs
  • Word embeddings similarity
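One plausible way to compute the overlap and similarity features above is Jaccard overlap on tag/author sets and cosine similarity on embeddings; the exact measures used in the talk aren't specified:

```python
import numpy as np

def jaccard(a: set, b: set) -> float:
    """Overlap between an article's tags/authors and the user's set."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Similarity between the article embedding and the user's average."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Hypothetical tag sets
user_tags = {"markets", "fed", "bonds"}
article_tags = {"fed", "bonds", "inflation"}
tag_overlap = jaccard(user_tags, article_tags)  # 2 shared / 4 total

emb_sim = cosine(np.array([0.1, 0.9]), np.array([0.2, 0.8]))
```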
 

Model used and approach

  • Feature vectors are roughly 14k-dimensional on average
  • Research questions:
    • what model to use?
    • what data to use?
    • what features to keep?
 
Models:
  • Gradient Boosted Decision Trees (GBDT) + Logistic Regression
  • Training methods:
    • fit: train a new model every day
    • partial fit: re-train the previous day's model with today's data, without adding new trees
    • re-train the previous day's model with today's data, adding new trees
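A common way to combine GBDT with logistic regression, which may be close to what's described here (the talk doesn't specify the exact wiring), is to one-hot encode the leaf each tree assigns to a sample and fit the LR on those indicators. A sketch with scikit-learn on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the click dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 1) GBDT learns feature interactions
gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=3,
                                  random_state=0)
gbdt.fit(X, y)

# 2) Encode each sample by the leaf it lands in for every tree
#    (apply() returns (n_samples, n_estimators, 1) for binary tasks)
leaves = gbdt.apply(X)[:, :, 0]
enc = OneHotEncoder(handle_unknown="ignore")
leaf_onehot = enc.fit_transform(leaves)

# 3) Fit a logistic regression on the one-hot leaf indicators
lr = LogisticRegression(max_iter=1000)
lr.fit(leaf_onehot, y)
click_prob = lr.predict_proba(leaf_onehot)[:, 1]
```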
 
Re-trained the model every day, on the past 3-7 days' worth of data, to capture emerging trends well
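The sliding training window can be sketched as follows (dates are illustrative, not from the talk):

```python
from datetime import date, timedelta

def sliding_window_days(today: date, window: int = 7):
    """Dates whose click logs feed today's re-training: the trailing
    `window` days (3-7 in the talk), excluding today itself."""
    return [today - timedelta(days=d) for d in range(window, 0, -1)]

train_days = sliding_window_days(date(2020, 10, 27), window=3)
# the three days preceding 2020-10-27, oldest first
```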