For a small project I run together with a couple of guys I decided to ramp up our data collection. The previous solution used Google Sheets to fetch tables from public websites (it's surprising what you can do with Sheets). But we suffered from constantly changing website structures, broken pages and tables that changed order all the time. It was a pain to maintain, so let's call the glorious LLMs to the rescue!
Reading tables
To keep costs to a minimum I didn't want to push the whole website through OpenAI's GPT for each crawl. So for the first step I just used the fantastic Python pandas library (beforehand I crawled the page content with Scrappey):
from io import StringIO
import pandas as pd

stock_dfs = pd.read_html(StringIO(page_html), decimal=",", thousands=".")
This will give us each table of the page nicely packaged in a data frame. But the important question is: which table and column contain the information we desire, e.g. the PER (price-earnings ratio) of a stock?
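A quick sketch to see what we got (purely for inspection, not part of the final pipeline):

for idx, df in enumerate(stock_dfs):
    # print shape and headers of every parsed table to spot the interesting ones
    print(f"table {idx}: {df.shape[0]} rows, columns: {list(df.columns)}")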
Dead End
First try: ask the GPT model from OpenAI. But even after fine-tuning the model with some sample data, prompt engineering and examples, the results were always plagued by hallucinations, false information or simply swapped columns. The data mapping was just not good enough and I could not really justify the use of bigger models (e.g. GPT-4) for this use case, so let's do it the old way!
A simple data mapper with Python
Could I solve the issue in around 30 minutes? In this case yes. First I assigned a couple of sample labels used on the websites (all German) to each required category:
stock_data_key_map = {
    StockDataKey.SALES.value: ["Umsatzerlöse in Mio.", "Umsatz", "Umsatzerlöse"],
    StockDataKey.EBIT.value: ["EBIT", "EBIT in Mio.", "Ergebnis vor Steuer (EBT)"],
    StockDataKey.SALES_PER_SHARE.value: [
        "Umsatz/Aktie",
        "Umsatz pro Aktie",
        "Umsatz je Aktie",
    ],
    ...
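For context: StockDataKey is just an enum of the categories I care about. A minimal sketch (the string values here are illustrative, the real enum has more members):

from enum import Enum

class StockDataKey(Enum):
    # illustrative values, the actual ones in the project may differ
    SALES = "sales"
    EBIT = "ebit"
    SALES_PER_SHARE = "sales_per_share"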
After that you just iterate over the columns in the data frames, compare them with the labels and pick the column with the highest similarity for each category. Python's SequenceMatcher will do the trick:
from difflib import SequenceMatcher

def calculate_similarity(val1, val2) -> float:
    try:
        return SequenceMatcher(None, val1, val2).ratio()
    except Exception:
        # probably a type error, doesn't match with anything
        return 0
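With that in place, the matching itself is a nested loop over categories, tables and column headers; roughly like this (a sketch, the function and variable names are just illustrative):

def map_columns(stock_dfs, key_map):
    # for each category, find the (table index, column) with the most similar header
    mapping = {}
    for category, labels in key_map.items():
        best = (0.0, None, None)  # (similarity, table index, column name)
        for table_idx, df in enumerate(stock_dfs):
            for column in df.columns:
                score = max(calculate_similarity(str(column), label) for label in labels)
                if score > best[0]:
                    best = (score, table_idx, column)
        mapping[category] = best
    return mapping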
Usually the table headers are quite stable and only change by a couple of characters (e.g. a different currency), so this approach works well enough!
See the mapped table metadata in CSV format: the first column marks the table index, then the table column header and finally the category:
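An illustrative snippet of that format (the values here are just examples, using the labels from above):

table_index,column_header,category
0,Umsatzerlöse in Mio.,sales
0,EBIT in Mio.,ebit
2,Umsatz je Aktie,sales_per_share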
That's it
All that was left to do now is to read the data frames at the specified locations and save the data into a database of your choice. In my case I hosted the whole setup on AWS with serverless components (Lambda, DynamoDB and CloudWatch to trigger the scraping).
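Reading a mapped value and pushing it to DynamoDB is then only a few lines; a rough sketch with boto3 (table name, key schema and the row lookup are assumptions, not the exact code from the project):

import boto3

dynamodb = boto3.resource("dynamodb")
stock_table = dynamodb.Table("stock_data")  # assumed table name

def save_value(symbol: str, category: str, table_idx: int, column) -> None:
    # sketch: take the first row of the mapped column and persist it
    value = stock_dfs[table_idx][column].iloc[0]
    stock_table.put_item(Item={"symbol": symbol, "category": category, "value": str(value)})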
After a whole month of runtime with moderate load it incurred around 25 cents. The biggest part of that was Secrets Manager, which I could easily replace, so it will be even less in the future.
You can find the complete code for this project on my GitHub.