Panel 5: Copyrightable Subject Matter and the Special Problem of Software

29th Annual BTLJ-BCLT Spring Symposium: Origins, Evolution, and Possible Futures of the 1976 Copyright Act

Pamela Samuelson, UC Berkeley Law (Moderator and Speaker):
discusses history (in which she was intimately involved as an intellectual powerhouse).
From uncertainty over whether software was protectable to Whelan which gave
very broad protection; took 6 years for the Second Circuit to respond and start
with Baker v. Selden to keep functional elements out of © protection. Merger,
scènes à faire, 102(b), fair use—doctrinal cocktails, in the words of Molly van
Houweling.

Samuelson initially thought sui generis protection for
software would be better, but admits error: © did a really good job and gave an
international standard that’s enabled some stability.

Jule Sigall, former Microsoft: CONTU was doing its work as
Microsoft was just getting started. Trade secrets, patents, and copyrights do
different work at different eras of software. 1980: PC era—rapid rise of
copyright’s relevance. Business model: product licenses. Practical control:
EULA, shrinkwrap, key disc/dongle. Copyright’s salience for executives was high
for how they were going to recover fixed cost investment. This was the model
CONTU had in mind when it decided to embrace software ©: you make a product
& send it out through distribution channels not unlike books.

1990s: WWW. Easier to send software as bits. Business model
if people won’t necessarily pay for copies: hardware bundling (Apple; PC with independent
OEMs); ad supported. Practical control: B2B contracts. Copyright salience:
medium.

2000s: cloud and OSS: business model: subscription/SaaS/consulting.
Practical control: server access control/OSS license; not much a pirate copy
will do for you. Copyright salience: medium. Antipiracy efforts shifted to
antifraud—scammers would purport to sell subscriptions. Open source was a
different path—add consulting services to OS or build services using OS. That
does depend on © but the most prevalent ©-based model was making software as
accessible as possible and using © to ensure it was only used/redistributed in
certain ways.

2010s: mobile era/app ecosystem. Business model: app store
sales/subscriptions—you can, as in the 80s, get paid for a copy. Practical
control: platform control/cloud services. © salience: low.

2020s: AI. Business model: ?? Practical control: ??
Copyright salience: None? [Real underpants gnomes vibes.] More software will be
developed by more people than ever before. The tools allow people of all kinds
to make software, and they allow software to make software. Maybe we are back
where we started before CONTU with unclear © coverage.

Clark Asay, BYU Law: reasons for concern, but countervailing
forces/reasons for optimism. FOSS licenses presuppose copyrightable code:
copyleft, attribution, etc. W/o © the governance architecture becomes much less
reliable. In the context of other developments that threaten open source—MongoDB
and Elasticsearch have abandoned OS; monetization has always been a question for
companies that can’t directly monetize software. AI agents: those agents are creating
tons of software and making pull requests/contributions to OS products w/o
human review, which are being overwhelmed in some cases. Some projects are
closing off in response. Open collaboration norms may be eroding from multiple
directions simultaneously.

Might push us more in direction of trade secrecy and
possibly patents. A more closed, fragmented software ecosystem and possibly AI
system. But developers desire to influence the AI stack, which is likely to
keep the ecosystem at least partially open.

A. Feder Cooper, Yale University (co-author Mark Lemley,
Stanford Law School): Model weights that give a possibility but not a certainty
of generating infringing output: is that a “copy”? Relates to memorization
debate. It’s common to describe models as learning statistical correlations or
patterns: that’s not wrong but it oversimplifies how info is represented.
Another important part: how the LLM is used. Some methods of selecting outputs
are deterministic—same input, same output; many are stochastic. Variability in
outputs doesn’t derive from model but how the model is used in decoding.
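Cooper's distinction between deterministic and stochastic decoding can be sketched in a few lines: over the same toy next-token distribution (hypothetical numbers, not from any real model), greedy decoding always returns the same token, while sampling returns different tokens on different draws.

```python
import random

# Toy next-token distribution (hypothetical numbers, not from any real model).
probs = {"the": 0.5, "a": 0.3, "an": 0.2}

def greedy(dist):
    """Deterministic decoding: always pick the highest-probability token."""
    return max(dist, key=dist.get)

def sample(dist, rng):
    """Stochastic decoding: draw a token according to its probability."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

rng = random.Random(0)
print(greedy(probs))  # same input -> same output, every time: "the"
print({sample(probs, rng) for _ in range(100)})  # multiple tokens appear
```

Same model, two usage modes: the variability in the second line comes from the decoding strategy, not from the weights.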

Memorization is when, based on training, the model produces a really high concentration of probability on particular sequences. The model is
still probabilistic, but the distribution is so sharply peaked that one
sequence (or small number of sequences) dominates. This is related to
compression: memorization means that Ted Chiang’s “blurry jpg of the web” is sometimes
not blurry at all for certain chunks. Memorization is pretty mysterious still—keeps
giving new insights about LLM behavior. Not a bug; it’s far too interesting and
complicated.
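The "sharply peaked" point can be made concrete with toy arithmetic (assumed numbers, not measurements from any model): if each next token of a memorized passage gets per-token probability 0.99 under sampling, a 100-token span comes out verbatim about a third of the time; at 0.5 per token, essentially never.

```python
# Toy arithmetic: probability of reproducing an n-token passage verbatim
# when each next token independently gets probability p under sampling.
def verbatim_prob(p, n):
    return p ** n

# Sharply peaked ("memorized") vs. flatter distribution, for a 100-token span.
print(verbatim_prob(0.99, 100))  # ~0.366: one sequence dominates outputs
print(verbatim_prob(0.50, 100))  # ~7.9e-31: essentially never verbatim
```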

What is a copy? The statute’s answer is pretty incoherent: copies
are material objects in which a work is fixed. (The “by or under the authority
of the © owner” can’t be taken seriously for infringement by copying. We used
the same definitions for protectability and infringement, so courts just ignore
that part for infringement.) In litigation, parties take extreme positions—no memorization,
or models are just a collage. Neither of these is right, and sometimes they're not even partially right.

We can extract a near reproduction of Harry Potter from a
short prompt from Meta’s Llama: that extraction is deterministic. That’s an extreme
result—extraction is possible from some models for some works and not others.
Most of our experiments measure whether verbatim memorization is occurring; we
can get more if we accept small changes like extra spaces or commas in place of
semicolons. Sometimes we needed adversarial strategies but sometimes not. None
of that work changes model weights, but you can also do that to extract more
works.

Jane Ginsburg et al. have shown that fine-tuning on public
domain works can reveal memorization from previously-trained-on © works.

So is a model a copy fixed in a tangible medium of
expression? That’s still complicated! You can make a copy by storing parts in
ones & zeros. But you can’t say that Microsoft Word encodes War &
Peace. Models aren’t like either of those things. Some of the memorization isn’t
deterministic—you might only get a memorized copy one in 1000 times. Are the
other 999 “stored” in the model? That would involve more copies stored than
there are atoms in the universe.
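The atoms-in-the-universe comparison is simple combinatorics: assuming a vocabulary of ~50,000 tokens (a typical LLM vocabulary size, used here as an assumption), the number of distinct 50-token outputs already dwarfs the usual ~10^80 estimate for atoms in the observable universe.

```python
from math import log10

vocab = 50_000   # assumed typical LLM vocabulary size
length = 50      # a short output, measured in tokens

# Number of distinct length-50 token sequences is vocab ** length;
# compute its order of magnitude via logarithms.
digits = length * log10(vocab)
print(f"~10^{digits:.0f} possible 50-token outputs")  # ~10^235
print("vs ~10^80 atoms in the observable universe")
```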

Closest examples in existing law: Kelly v. Chicago Park
District—garden isn’t fixed b/c it isn’t deterministic; video games where
content is generated from a number of fixed options. Micro Star: the new levels
aren’t really “in the game.” Nor would we say that all the possibilities
currently exist. So maybe the answer is predictability: if the model weights
can easily generate the work, functionally there’s a copy in the model. If it’s
merely possible to extract the work through effort, it’s not a copy. Why it
matters: if there’s a copy in the model, then copying the model is making a
copy of the work. Maybe that’s fair use (via intermediate use) but we’d have to
figure it out.

Doesn’t love the conclusion, but this is where the empirical
evidence leads.

Samuelson for Sigall: you didn’t say much about patents—Whelan
might be affected by the idea that patents weren’t available; then patents
started becoming available, making thick © less attractive.

Sigall: late 90s was a marriage of two historical trends: if
you want to go the IP route for software, patents might be more efficient/useful
b/c there’s also a risk with seeking ©. Patents and © come with embedded
strategic choices about your business. Book: Capitalism w/o Capital: many of the most successful companies today have intangible
assets, not tangible assets—a lot of the benefit is taking advantage of
synergies and spillovers in intangible assets. IP can interrupt and interfere
w/those synergies & spillovers so it might not be optimal—businesses can
capitalize on other aspects instead of IP.

Samuelson for Asay: what do you do w/the Office’s policy
requiring you to ID the parts that are AI-generated and disclaim authorship?
Will people do that or just pretend that they authored the whole thing?

Asay: Unworkable! Possible that developers will just
continue as usual and ignore © complications, slapping license on even if code
is AI-generated; that’s somebody else’s problem.

Sag: how do you deal with misuse of your work as evidence
that LLMs don’t learn, they copy?

Cooper: Not great feeling! The research I do is careful and the
papers are long; that’s not an accurate gloss of what models are doing. But it’s
important to do the work to show information about model behavior that we didn’t
know before.

Q: is Harry Potter an outlier given how many copies there
are online?

A: It’s astonishing still to get a book from a fragmentary
prompt; not all models do this and certainly not all the time, but other books
can be derived; it’s hard to connect the dots from training data. Tried to do
it with Coates’ “The Case for Reparations”—also got that from the same model—it’s
very famous but not HP famous.

Cathy Gellis: isn’t © a background assumption for these
business models even if you aren’t “relying” on it? If © didn’t exist, would
these business models work?

Sigall: it’s a behavioral Q—what behavior is © shaping and
it’s certainly possible that affects what businesses do with particular
software. It’s there, but the Q is how do you use that fact as a business in
your strategic choices? Microsoft housed its antipiracy department in the
marketing department, not legal, because the goal wasn’t really to stop piracy
but to get them to use Microsoft software. Other industries put antipiracy
efforts in legal. Trying to understand actual behavior of users of their works
and adapt to that. [This may also be relevant to the shift to streaming
video/music!]

Brauneis: suggests that Office’s disclosure form isn’t
onerous; doesn’t require you to ID which lines are AI-generated, so you should
disclose and figure it out later.

Asay: may be true, but issue in the industry is
norms/perceptions about copyrightability—that’s more important to behavior than
technicalities of registration. [So what he’s saying is that coders have …
always gone on vibes?]

Samuelson: A bit of an old problem with SaaS. Oracle started
with a PD work and then made a derivative work from it; trying to sort which
parts were protected from which weren’t was already a task.

Bracha: you said that you were wrong about sui generis
protection for software because after that didn’t happen, courts rolled up
their sleeves and did their job of developing relevant principles. Do you think
that courts would do the same thing today?

Samuelson: good point—we sort of got sui generis protection
w/in copyright.

Nimmer: works that
incorporate works from the USG should in theory disclose that, even if it’s a
paragraph quote; they don’t and it’s been a nonissue. So it could also work for
AI.  

from Blogger https://tushnet.blogspot.com/2026/04/panel-5-copyrightable-subject-matter.html
