Frames White 2024-12-08 04:55:10

By: Frames White

Re-posted from: https://www.oxinabox.net/2024/12/08/2024-12-08-bsky_api.html

I have been on bluesky since it was invite only.
Recently a lot of the JuliaLang community showed up there.
Which is interesting times for me.
Since it results in me having a very split following in demographics.
I wanted to analyse that, and I thought I would take the opertunity to write a bit about how to access the bsky API using Julia.

This is going to be fairly familar to anyone used to dealing with REST APIs.
The atproto API used by bluesky is not immensely unusual.
It’s only surprising how much you can get without being authenticated – basically anything as long as it is public.
Which is all we are interested in today.
Other similar APIs don’t let you get anything without an API key, even if it is public information
This means I get to be spared dealing with OAuth.

Lets start by installing packages

For this kind of thing I turn to the very well establishedHTTP.jl.
One can instead use the LibCURL based Downloads.jl standard library, but it is a bit more complex since it doesn’t take care of url encoding query parameters etc.

We also will need a JSON parser, so I am going with the well-trusted JSON3.jl

 pkg> add JSON3@1 HTTP@1

So we will load them up:

julia> using JSON3, HTTP

Let’s analyse my followers

Getting my profile

We will get the profile via the app.bsky.actor.getProfile end-point.

julia> uri = HTTP.URI(
           scheme="https", host="public.api.bsky.app", path="/xrpc/app.bsky.actor.getProfile", query=["actor"=>"oxinabox.net"]
       )
URI("https://public.api.bsky.app/xrpc/app.bsky.actor.getProfile?actor=oxinabox.net")

julia> resp = HTTP.get(uri)
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Date: Sun, 08 Dec 2024 07:49:15 GMT
...
Content-Encoding: gzip

{"did":"did:plc:5iszgpfyko6qsk26ytkfawuh","handle":"oxinabox.net","displayName":"Frames","avatar":"https://cdn.bsky.app/img/avatar/plain/did:plc:5iszgpfyko6qsk26ytkfawuh/bafkreidb26akzyz2223ym6tgduutitekrggdhijjhhgthu3h55kc5u425q@jpeg",...,"pinnedPost":{"cid":"bafyreiahb2jus4k73fw5ylc7g4j326fbsllsvydwo6s5qgagcjjfunw

1085-byte body
"""

julia> JSON3.read(resp.body)
JSON3.Object{Vector{UInt8}, Vector{UInt64}} with 14 entries:
  :did            => "did:plc:5iszgpfyko6qsk26ytkfawuh"
  :handle         => "oxinabox.net"
  :displayName    => "Frames"
  :avatar         => "https://cdn.bsky.app/img/avatar/plain/did:plc:5iszgpfyko6qsk26ytkfawuh/bafkreidb26akzyz2223ym6tgduu…
  :associated     => {
  :labels         => Union{}[]
  :createdAt      => "2023-07-06T15:04:41.936Z"
  :description    => "Talking about programming, science, and being queer.\nMinors DNI\n\nIf you know me IRL and see me h…
  :indexedAt      => "2024-11-29T16:29:51.945Z"
  :banner         => "https://cdn.bsky.app/img/banner/plain/did:plc:5iszgpfyko6qsk26ytkfawuh/bafkreidpk5h5yw6ptn5kcgr7k47…
  :followersCount => 2195
  :followsCount   => 553
  :postsCount     => 8526
  :pinnedPost     => {

This pattern of building a URI, calling HTTP.get on it then calling JSON3.read on the response is going to become very repedative.
The only thing that is a bit usual about this first request is the end point – for the rest of this tutorial we will be using the AppView endpoint which is at public.api.bsky.app (rather than the bsky.social host)

Lets make a helper function to cut down on repeating.

function run_query(endpoint, host="public.api.bsky.app"; query_params...)
    query = [string(key)=>string(val) for (key, val) in query_params]
    uri = HTTP.URI(; scheme="https", host, path=endpoint, query);
    resp = HTTP.get(uri)
    return JSON3.read(resp.body)
end

You may notice in the url we are taking advantage of not needing to say host=host and query=query in keyword arguments. A modern Julia feature I have missed when writing Python lately.

Often when writing such helpers I will also need to take care of passing in OAuth tokens, but as mentioned bluesky doesn’t use them for public info.

So we can test that to see whe get the same result:

julia> run_query("/xrpc/app.bsky.actor.getProfile"; actor="oxinabox.net")
JSON3.Object{Vector{UInt8}, Vector{UInt64}} with 14 entries:
  :did            => "did:plc:5iszgpfyko6qsk26ytkfawuh"
  :handle         => "oxinabox.net"
  :displayName    => "Frames"
  :avatar         => "https://cdn.bsky.app/img/avatar/plain/did:plc:5iszgpfyko6qsk26ytkfawuh/bafkreidb26akzyz2223ym6tgduu…
  :associated     => {…
  :labels         => Union{}[]
  :createdAt      => "2023-07-06T15:04:41.936Z"
  :description    => "Talking about programming, science, and being queer.\nMinors DNI\n\nIf you know me IRL and see me h…
  :indexedAt      => "2024-11-29T16:29:51.945Z"
  :banner         => "https://cdn.bsky.app/img/banner/plain/did:plc:5iszgpfyko6qsk26ytkfawuh/bafkreidpk5h5yw6ptn5kcgr7k47…
  :followersCount => 2195
  :followsCount   => 553
  :postsCount     => 8526
  :pinnedPost     => {…

Great. This worked.

Getting my followers.

This is done using the app.bsky.graph.getFollowers endpoint.

julia> obj = run_query("/xrpc/app.bsky.graph.getFollowers"; actor="oxinabox.net")
JSON3.Object{Vector{UInt8}, Vector{UInt64}} with 3 entries:
  :followers => Object[{
  :subject   => {
  :cursor    => "3lcaoii4bxs2y"

julia> obj.followers
47-element JSON3.Array{JSON3.Object, Vector{UInt8}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}:
 {
           "did": "did:plc:7egjxzliluyeon2pd4txabm7",
        "handle": 
    ...

Hang on! 47 elements?
but I have like several thousand followers.
What’s going on?

What is happening here of course, is that the results have been pagenated.
That cursor field returned can be passed to the next call to get rest,
and so forth.

julia> obj2 = run_query("/xrpc/app.bsky.graph.getFollowers"; actor="oxinabox.net", cursor=obj.cursor)
JSON3.Object{Vector{UInt8}, Vector{UInt64}} with 3 entries:
  :followers => Object[{
  :subject   => {
  :cursor    => "3lbuwshgfei2g"

julia> obj2.followers
48-element JSON3.Array{JSON3.Object, Vector{UInt8}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}:
 {
           "did": "did:plc:l2xlpoj2k5yvs4a4f6pq76iy",
        "handle":

This is called pagenation.
Its kinda annoying, but it does prevent getting an unmanagable amount of data all at once.

So we will make a function to help with that:

function run_query_pagenated(endpoint, field, host="public.api.bsky.app"; query_params...)
    obj = run_query(endpoint, host="public.api.bsky.app"; query_params...)
    initial_vals = getproperty(obj, field)
    T = eltype(initial_vals)
    return Channel{T}(50, spawn=true) do ch  # get at most 50 at a time
        for x in getproperty(obj, field)
            put!(ch, x)
        end
        while haskey(obj, :cursor)
            new_query_params = Dict(query_params)
            new_query_params[:cursor] = obj.cursor
            obj = run_query(endpoint, host="public.api.bsky.app"; new_query_params...)
            for x in getproperty(obj, field)
                put!(ch, x)
            end
        end
    end
end

So this is some moderately sophisticated code.
All the sophistication though is in the Channel.
I have actually blogged about Channel’s back in 2017 but it was a pretty different language then.
In short, a Channel is a data-structure for asynchonous communication.
It is like a queue, that blocks (and yeilds control to another task/thread) when it is full and you try and add something, or when it is empty and you try and read something.
The do-block form (really the argument taking form) creates a new asynchonous task to run the contained code-on.
Which should fill the channel.
This task runs until control is yeilded (e.g. by being blocked by the Channel being full).
Once control is yeilded back to the task that created it the function returns the Channel object,
and later code can start consuming it.
Which it will do until it is closed (by the code-block exiting), or until it runs empty in which case this code will block, yeilding control back to the task thanywat was filling it, and so forth.
The spawn=true argument allows the task to run on a thread that didn’t create it, which means it could actually be being filled while it is being consumed, actual concurrency!
But even without that, this would achieve some parallelism as IO operations yield while network buffers are filled.

Anyway, lets see it work:

julia> first(run_query_pagenated("/xrpc/app.bsky.graph.getFollowers", :followers; actor="oxinabox.net"))
JSON3.Object{Vector{UInt8}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}} with 7 entries:
  :did         => "did:plc:mj3tgpkknd7asmxvbom4putl"
  :handle ...

(Very fast)

then lets get a list together:

julia> followers = collect(run_query_pagenated("/xrpc/app.bsky.graph.getFollowers", :followers; actor="oxinabox.net"))
1977-element Vector{JSON3.Object}:
 {
           "did":...

(like 10 seconds)

Interesting it is still not quiet at full numbers, I am guessing this is to do with deleted or deactivated accounts.

Some basic analysis:

function profile_mentions(checker_fun, obj)
    for field in (:handle, :displayName, :description)
        haskey(obj, field) || continue
        if checker_fun(obj[field])
            return true
        end
    end
    return false
end

So let’s see how many profiles mention trans:

julia> trans_marker(str::AbstractString) = occursin(r"(\btrans\b)|⚧️"i, str)
trans_marker (generic function with 1 method)

julia> count(x->profile_mentions(trans_marker, x), followers)
533

Lets see how many mention JuliaLang

julia> julia_marker(str::AbstractString) = occursin(r"(\bjulialang\b)"i, str)
julia_marker (generic function with 1 method)

julia> count(x->profile_mentions(julia_marker, x), followers)
30

I can say with some confidence that both are under-estimates.
Especially the latter, perhaps for that I should compare to subscribers to the Julia feed or folks in the Julia starter pack.
For the former… building ways to reliably identify trans accounts on mass is a bad idea.

Conclusion

This has been an introduction to using the bluesky API from JuliaLang.
We have made a few helpers that let us more easily query it.
There are many many more API end points we didn’t touch.
I would like to go and use this to do post analystics.
E.g. workout what times I am posting, what times things are getting reacts, etc.
But it is late.