Comparing records in the same dataset and finding closest peers
Recently, I was trying to solve this problem:
You have a dataset of companies that looks like this:
company_id | AUM (assets under management) | fiscal_year_end |
---|---|---|
1 | 200,000 | 2020 |
2 | 300,000 | 2020 |
We need to add a column with a list of 8 peers that are similar to that particular company.
We define "similarity" as follows:
- needs to have the same fiscal_year_end
- needs to have an AUM that is greater than 50% or less than 50% of the current company
- then filter down the list so that you get 2 peers whose AUM is less than the current company's AUM and 2 peers whose AUM is greater than the current company's AUM
A neat technique I recently found out was that when you are comparing records in a dataset to other records in that same dataset, you can join the dataset to itself and specify the conditions of the join based on your criteria.
df1 = companies.alias("df1").select(
col("company_id"),
col("aum"),
col("fiscal_year_end"))
df2 = companies.alias("df2").select(
col("company_id").alias("company_id_other"),
col("aum").alias("aum_other"),
col("fiscal_year_end").alias("fiscal_year_end_other"))
join_criteria = (df1["fiscal_year_end"] == df2["fiscal_year_end_other"]) & ((df2["aum_other"] >= 0.5*df1["aum"]) | (df2["aum_other"] <= 1.5*df1["aum"]))
df1.join(df2, join_criteria, "left").show()
This will give you a dataset where for each company, you would get similar companies:
company_id | company_id_other | aum | aum_other | fiscal_year_end | fiscal_year_end_other |
---|---|---|---|---|---|
1 | 1 | 200,000 | 200,000 | 2020 | 2020 |
1 | 2 | 200,000 | 90,000 | 2020 | 2020 |
1 | 3 | 200,000 | 100,000 | 2020 | 2020 |
1 | 4 | 200,000 | 300,000 | 2020 | 2020 |
1 | 5 | 200,000 | 350,000 | 2020 | 2020 |
1 | 6 | 200,000 | 400,000 | 2020 | 2020 |
Now, to find which companies are similar to my company in question--company_id=1, you would create another column that gives you the difference between the company_id in question and the other companies.
similar_companies = similar_companies.withColumn(
"aum_diff",
abs(df1["aum"] - df2["aum_other"]))
similar_companies.orderBy(col("aum_diff").asc()).show()
This would give you a table like this:
company_id | company_id_other | aum | aum_other | fiscal_year_end | fiscal_year_end_other | aum_diff |
---|---|---|---|---|---|---|
1 | 1 | 200,000 | 200,000 | 2020 | 2020 | 0 |
1 | 3 | 200,000 | 100,000 | 2020 | 2020 | 100,000 |
1 | 4 | 200,000 | 300,000 | 2020 | 2020 | 100,000 |
1 | 2 | 200,000 | 90,000 | 2020 | 2020 | 110,000 |
1 | 5 | 200,000 | 350,000 | 2020 | 2020 | 150,000 |
1 | 6 | 200,000 | 400,000 | 2020 | 2020 | 200,000 |
The four closest companies are the four companies whose aum_diff is the closest to the company in question. So, how would you get these? Using a window function.
similar_companies = similar_companies.withColumn(
"similar_company_ids",
collect_list(df2["company_id_other"]).over(Window.partitionBy("company_id").orderBy(col("aum_diff").asc()).rowsBetween(0, 4)))
)
company_id | company_id_other | aum | aum_other | fiscal_year_end | fiscal_year_end_other | aum_diff | similar_company_ids |
---|---|---|---|---|---|---|---|
1 | 1 | 200,000 | 200,000 | 2020 | 2020 | 0 | [3, 4, 2, 5, 6] |
1 | 3 | 200,000 | 100,000 | 2020 | 2020 | 100,000 | [4, 2, 5, 6] |
1 | 4 | 200,000 | 300,000 | 2020 | 2020 | 100,000 | [2, 5, 6] |
1 | 2 | 200,000 | 90,000 | 2020 | 2020 | 110,000 | [5, 6] |
1 | 5 | 200,000 | 350,000 | 2020 | 2020 | 150,000 | [6] |
1 | 6 | 200,000 | 400,000 | 2020 | 2020 | 200,000 | [] |
Now, just filter the data so that you only have the company in question and select the columns you care about.
similar_companies = similar_companies.filter(
col("company_id") == col("company_id_other")).select(
"company_id", "aum", "fiscal_year_end", "similar_company_ids")
company_id | aum | fiscal_year_end | similar_company_ids |
---|---|---|---|
1 | 200,000 | 2020 | [3, 4, 2, 5, 6] |
Takeaways
- When you want to compare records in a dataset to other records in that dataset, join the dataframe on itself with the comparison criteria as the join criteria
- Reframe the question "find 2 peers whose AUM is less than the current company's AUM and 2 peers whose AUM is greater than the current company's AUM" as "find 4 peers whose AUM is closest to the current company's AUM". Closest to = absolute value of the difference