Market Voice Analytics - Oussama Messaoudi

Overview

Market Voice Analytics is an end-to-end data pipeline that scrapes speeches from the European Central Bank (ECB) and the Federal Reserve, scores their financial sentiment with FinBERT, and correlates the results with price movements in EUR/USD, the S&P 500, gold, US 10-year Treasuries, and the Euro Stoxx 50.

The project sits at the intersection of natural language processing, financial analysis, and data engineering, and examines how central bank communications relate to market behavior.

Key Features

Automated Data Ingestion: Scrapers for ECB and Federal Reserve speeches, press releases, and statements, with URL-based deduplication
Historical Backfill: Archive scrapers that fetch thousands of historical speeches from the ECB foedb JSON database and the Fed yearly archives
Financial NLP: Sentiment analysis using FinBERT, with sentence-based chunking for long documents
Market Data Integration: Automatic fetching of price data around speech dates using the Yahoo Finance API
Correlation Analysis: Measures how speech sentiment correlates with market movements over 1-day and 1-week periods
Interactive Dashboard: Three-page Streamlit app with sentiment distribution, speaker analysis, and market impact visualizations
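
The 1-day and 1-week windows mentioned under Correlation Analysis can be illustrated with a small sketch. It assumes daily closing prices keyed by date and measures the change from the last close on or before the speech date to the close a given number of trading days later. The function name `forward_return`, the sample prices, and the choice of five trading days for one week are illustrative assumptions, not the project's actual code.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional


def forward_return(closes: dict, speech_date: date, horizon: int) -> Optional[float]:
    """Fractional price change over `horizon` trading days after a speech.

    The base price is the last close on or before `speech_date`, so a speech
    given on a weekend is anchored to the preceding trading day.
    Returns None when the price series does not cover the full window.
    """
    days = sorted(closes)
    base_idx = bisect_right(days, speech_date) - 1  # last trading day <= speech_date
    end_idx = base_idx + horizon
    if base_idx < 0 or end_idx >= len(days):
        return None
    base, end = closes[days[base_idx]], closes[days[end_idx]]
    return (end - base) / base


# Hypothetical closing prices for six consecutive trading days.
closes = {
    date(2024, 1, 8): 100.0,
    date(2024, 1, 9): 101.0,
    date(2024, 1, 10): 102.0,
    date(2024, 1, 11): 103.0,
    date(2024, 1, 12): 104.0,
    date(2024, 1, 15): 105.0,
}

one_day = forward_return(closes, date(2024, 1, 8), horizon=1)   # next trading day
one_week = forward_return(closes, date(2024, 1, 8), horizon=5)  # five trading days
```

Counting in trading days rather than calendar days keeps the windows comparable across weekends and holidays, at the cost of needing the price series to extend past the speech date.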

System Architecture

The pipeline follows a modular architecture with clear separation of concerns:

  ECB RSS Feed           Fed JSON / RSS
       |                       |
       v                       v
+--------------+       +--------------+
| ECB Scraper  |       | Fed Scraper  |
+--------------+       +--------------+
       |                       |
       +----------+ +----------+
                  | |
                  v v
          +----------------+
          |   PostgreSQL   |
          |    (Docker)    |
          +----------------+
                  |
     +------------+-------------+
     |                          |
     v                          v
+---------------+      +-----------------+
|  FinBERT NLP  |      |   Market Data   |
|  (Sentiment)  |      |   (yfinance)    |
+---------------+      +-----------------+
     |                          |
     +------------+-------------+
                  |
                  v
         +------------------+
         |    Streamlit     |
         |    Dashboard     |
         +------------------+

Technology Stack

Language: Python 3.11+
Orchestration: Apache Airflow 2.9
Database: PostgreSQL 16, SQLAlchemy 2.0
NLP: HuggingFace Transformers, FinBERT
Market Data: yfinance
Dashboard: Streamlit, Plotly
Infrastructure: Docker, Docker Compose
Tools: Poetry, Ruff, pytest

Technical Challenges & Solutions

Challenge: BERT Token Limit

Central bank speeches typically run 2,000-5,000 words, but BERT-based models such as FinBERT accept at most 512 tokens per input.

Solution: Implemented intelligent sentence-based chunking that splits text on sentence boundaries, verifies exact token count for each chunk, handles edge cases like abbreviations and long sentences, and aggregates sentiment scores across all chunks.
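
A minimal sketch of this kind of chunking is shown below. It is an illustration under stated assumptions rather than the project's implementation: the sentence splitter is a naive regular expression that does not protect abbreviations, the token counter is passed in as a function (a whitespace word counter stands in here so the example runs without downloading a model; in practice it would wrap the FinBERT tokenizer), and the length-weighted average used for aggregation is one plausible choice that the text above does not confirm. All function names are hypothetical.

```python
import re
from typing import Callable

MAX_TOKENS = 510  # BERT's 512-token limit minus the [CLS] and [SEP] special tokens


def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    A production splitter would also need to protect abbreviations.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str, count_tokens: Callable[[str], int],
               max_tokens: int = MAX_TOKENS) -> list:
    """Greedily pack whole sentences into chunks that fit the token budget."""
    units = []
    for sentence in split_sentences(text):
        if count_tokens(sentence) <= max_tokens:
            units.append(sentence)
        else:
            # Fallback for an over-long sentence: pack it word by word.
            units.extend(sentence.split())

    chunks, current = [], ""
    for unit in units:
        candidate = f"{current} {unit}".strip()
        # Count tokens on the actual candidate chunk, not a running estimate.
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def aggregate_scores(chunk_scores: list, weights: list) -> dict:
    """Average per-chunk label probabilities, weighted by chunk length."""
    total = sum(weights)
    return {
        label: sum(s[label] * w for s, w in zip(chunk_scores, weights)) / total
        for label in chunk_scores[0]
    }


def count_words(s: str) -> int:
    """Stand-in token counter; a real pipeline would use the model tokenizer."""
    return len(s.split())


text = ("Inflation remains elevated. We will act as needed! "
        "Growth is slowing. Rates may stay higher for longer.")
chunks = chunk_text(text, count_words, max_tokens=8)

# Made-up per-chunk probabilities, purely to show the aggregation step.
scores = [
    {"positive": 0.8, "negative": 0.1, "neutral": 0.1},
    {"positive": 0.2, "negative": 0.6, "neutral": 0.2},
]
overall = aggregate_scores(scores, weights=[3, 1])
```

Passing the counter in as a function keeps the packing logic independent of any particular tokenizer. A single word that alone exceeds the budget would still produce an oversized chunk in this sketch, so a real implementation would need truncation as a last resort.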

Challenge: Data Deduplication

Ensuring no duplicate speeches are stored when running daily ingestion and historical backfills.

Solution: URL-based deduplication with database constraints and validation checks before insertion.
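
The idea can be sketched with a unique constraint on the URL column plus a check before insertion. The example below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs anywhere; the table layout, the function names, and the example URL are hypothetical and not taken from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE speeches (
        id    INTEGER PRIMARY KEY,
        url   TEXT NOT NULL UNIQUE,  -- the constraint that enforces deduplication
        title TEXT NOT NULL
    )
    """
)


def url_exists(conn: sqlite3.Connection, url: str) -> bool:
    """Validation check: is a speech with this URL already stored?"""
    row = conn.execute("SELECT 1 FROM speeches WHERE url = ?", (url,)).fetchone()
    return row is not None


def insert_speech(conn: sqlite3.Connection, url: str, title: str) -> bool:
    """Insert a speech unless its URL is already stored.

    Returns True if a row was added. The UNIQUE constraint remains the
    safety net if two ingestion runs race past the pre-check.
    """
    if url_exists(conn, url):
        return False
    cur = conn.execute(
        "INSERT OR IGNORE INTO speeches (url, title) VALUES (?, ?)", (url, title)
    )
    conn.commit()
    return cur.rowcount == 1


# The same speech arriving from the daily run and again from a backfill.
first = insert_speech(conn, "https://example.org/speeches/2024-01-15", "Policy outlook")
second = insert_speech(conn, "https://example.org/speeches/2024-01-15", "Policy outlook")
stored = conn.execute("SELECT COUNT(*) FROM speeches").fetchone()[0]
```

In PostgreSQL the equivalent of `INSERT OR IGNORE` is `INSERT ... ON CONFLICT (url) DO NOTHING`.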

Challenge: Orchestration & Scheduling

Coordinating multiple data sources and processing steps in a reliable, scheduled manner.

Solution: Apache Airflow DAGs for orchestration with proper dependency management and error handling.

Dashboard Pages

Overview: KPI cards showing total speeches analyzed, sentiment distribution charts, sentiment trends over time, and filterable speech table
Speaker Analysis: Per-speaker drill down with sentiment breakdown, speaker comparison heatmap, and individual speaker statistics
Market Impact: Sentiment vs market changes correlation, scatter plots with trendlines, box plots showing sentiment distribution, and speaker impact ranking showing which speakers move markets the most
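
The sentiment-versus-market correlation shown on the Market Impact page can be illustrated with a plain Pearson coefficient. This is a sketch under assumptions: the text does not say which correlation measure the project uses, the sentiment scores are assumed to be single numbers per speech, and the sample values below are made up purely for demonstration.

```python
from math import sqrt


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-speech sentiment scores and 1-day returns of one asset.
sentiment = [-0.6, -0.2, 0.0, 0.3, 0.7]
returns_1d = [-0.012, -0.004, 0.001, 0.005, 0.011]

r = pearson(sentiment, returns_1d)
```

A coefficient near +1 or -1 indicates a strong linear relationship in the sample, and it says nothing by itself about whether speeches cause the price moves.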

Data Sources

European Central Bank: Speeches and press releases via RSS feed and foedb JSON archive
Federal Reserve: Speeches and statements via JSON endpoint and yearly HTML archives
Yahoo Finance: Market price data for EUR/USD, S&P 500, US 10Y Treasury, Gold, and Euro Stoxx 50

Results & Impact

The project successfully demonstrates:

End-to-end data engineering pipeline design and implementation
Integration of modern NLP techniques with financial data analysis
Scalable architecture using industry-standard tools (Airflow, PostgreSQL, Docker)
Interactive data visualization for exploratory analysis
Measurement of the correlation between central bank speech sentiment and market movements